MapR-DB Ingestion
Infoworks DataFoundry allows you to create Hive tables that project MapR-DB table data. The metadata for these tables will also be stored in the Infoworks DataFoundry metastore so that the Hive tables can be used in data transformations.
MapR-DB is an HBase-like, case-sensitive document store which stores nested key-value structures in a document. Each row in a MapR-DB table can have a different schema.
MapR-DB tables can be queried via Hive by creating a corresponding Hive table and specifying a storage handler and an _id column. The _id column uniquely identifies each row and is used by Hive to query the tables.
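For example, a MapR-DB JSON table could be projected into Hive with DDL along these lines. This is a minimal sketch: the table path /apps/employees and the column names are hypothetical, and the storage handler class shown is the one shipped with MapR's Hive connector. Infoworks DataFoundry generates such tables for you during the crawl; the sketch only illustrates the underlying mechanism.

```sql
-- Sketch: project the MapR-DB JSON table at /apps/employees into Hive.
-- The "_id" Hive column is bound to the MapR-DB row identifier
-- through the maprdb.column.id table property.
CREATE EXTERNAL TABLE employees (
  `_id`  STRING,
  name   STRING,
  salary DOUBLE
)
STORED BY 'org.apache.hadoop.hive.maprdb.json.MapRDBJsonStorageHandler'
TBLPROPERTIES (
  "maprdb.table.name" = "/apps/employees",
  "maprdb.column.id"  = "_id"
);
```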
MapR-DB Datatypes
The following table lists the data types supported by Infoworks DataFoundry in MapR-DB.
MapR-DB Data Types | Supported | Mapping | Infoworks Data Type | Hive Data Type |
---|---|---|---|---|
BOOLEAN | Y | Y | BOOLEAN | BOOLEAN |
BINARY | Y | Y | BINARY | BINARY |
BYTE | Y | Y | TINYINT | TINYINT |
DATE | Y | Y | DATE | DATE |
DOUBLE | Y | Y | DOUBLE | DOUBLE |
FLOAT | Y | Y | FLOAT | FLOAT |
INT | Y | Y | INT | INT |
BIGINT | Y | Y | LONG | BIGINT |
SHORT | Y | Y | SMALLINT | SMALLINT |
STRING | Y | Y | STRING | STRING |
TIMESTAMP | Y | Y | TIMESTAMP | TIMESTAMP |
Creating MapR-DB Source
For creating a MapR-DB source, see Creating Source. Ensure that the Source Type selected is MapR-DB. Enter the Hive schema name and the HDFS location. The target Hive schema is where the Hive tables corresponding to each MapR-DB table will be created.
Configuring MapR-DB Source
For configuring a MapR-DB source, see Configuring Source.
- In the Settings page, enter the MapR-DB Base Path.
- Click Save and Fetch New Tables. A list of tables fetched from MapR-DB will be displayed.
- Select the MapR-DB tables for which Hive tables must be created. Modify the Hive table name if required. You can also modify the Id Column value, which is _id by default; the modified value will be used as the ID column name of the Hive table.
Crawling MapR-DB Metadata
- In the Settings page, after fetching the MapR-DB tables, click Crawl Tables. The table metadata will be stored in the Infoworks metastore and the corresponding tables will be created in Hive.
- Click the Build icon to track the job. The job summary with the details of tables crawled and skipped will be displayed.
- After a successful crawl, click the Source Configuration icon. The crawled tables will be displayed. You can also verify the crawled tables in Hive.
- Click the View button for the required table. The table data viewer will be displayed.
- Click Recrawl Metadata to recrawl the metadata for the tables that were already crawled. This will drop and recreate the corresponding Hive tables.
NOTE: The table schema is detected from the first n documents, where n is 10 by default. This value can be set at the admin level using the MAPRDB_TYPE_DETECTION_ROW_COUNT configuration.
Limitations
- If MapR-DB table data includes key names with uppercase letters, the columns are displayed as null when querying the Hive table. This is because MapR-DB is case-sensitive while Hive converts all column names to lowercase.
- MapR-DB tables can have a different schema for each row, while a Hive table must have a fixed schema. Hence, it is assumed that every MapR-DB row adheres to a superset of the detected schema.
- MapR-DB ingestion supports only MapR-DB JSON tables. Binary tables are not supported.
- The subpaths under the MapR-DB base path will not be crawled. To create Hive tables for subpaths, you must create a separate source with the subpath as its base path.
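The first two limitations above can be illustrated with a pair of hypothetical documents in the same MapR-DB JSON table (the table name, keys, and values are assumptions for the sake of the example):

```sql
-- Two MapR-DB JSON documents in the same table:
--   {"_id": "1", "FirstName": "Alice", "salary": 100.0}
--   {"_id": "2", "firstname": "Bob"}
-- Hive lowercases column names, so the projected column is "firstname".
SELECT firstname, salary FROM employees;
-- Row 1: firstname is NULL, because the case-sensitive key "FirstName"
--        does not match the lowercased lookup "firstname".
-- Row 2: salary is NULL, because the key is absent from that document;
--        rows missing a detected column simply project as NULL.
```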