MapR-DB Ingestion
Infoworks allows you to create Hive tables to project MapR-DB table data. The metadata for these tables will also be stored in the Infoworks metastore so that the Hive tables can be used in data transformations.
MapR-DB is an HBase-like, case-sensitive document store that stores nested key-value structures in documents. Each row in a MapR-DB table can have a different schema.
MapR-DB tables can be queried via Hive by creating a corresponding Hive table and specifying a storage handler and an _id column. The _id column uniquely identifies a row and is used by Hive to query the tables.
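As an illustration of what such a Hive table definition can look like, the following minimal sketch builds a CREATE EXTERNAL TABLE statement as a string. The function name `build_hive_ddl` and the example schema, table, and path are hypothetical. The storage handler class and the `maprdb.table.name` / `maprdb.column.id` properties are assumed to be those of the MapR Hive connector for MapR-DB JSON tables; verify them against your installed version. The DDL that Infoworks actually generates is not shown here and may differ.

```python
def build_hive_ddl(hive_schema, hive_table, maprdb_path, columns, id_column="_id"):
    """Return a CREATE EXTERNAL TABLE statement projecting a MapR-DB JSON table.

    columns is a list of (column_name, hive_type) pairs and must include id_column.
    """
    if id_column not in dict(columns):
        raise ValueError(f"id column {id_column!r} must be one of the declared columns")
    # Backticks let Hive accept names that start with an underscore, such as _id.
    column_lines = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE `{hive_schema}`.`{hive_table}` (\n"
        f"{column_lines}\n"
        ")\n"
        # Assumed storage handler class for MapR-DB JSON tables.
        "STORED BY 'org.apache.hadoop.hive.maprdb.json.MapRDBJsonStorageHandler'\n"
        "TBLPROPERTIES (\n"
        f"  'maprdb.table.name' = '{maprdb_path}',\n"
        f"  'maprdb.column.id' = '{id_column}'\n"
        ")"
    )


if __name__ == "__main__":
    # Hypothetical schema, table, and path, used only for illustration.
    print(build_hive_ddl("sales", "customers", "/apps/customers",
                         [("_id", "STRING"), ("age", "INT")]))
```

The column names are written in lowercase because Hive lowercases column names.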
MapR-DB Data Types
The following table lists the data types supported by Infoworks in MapR-DB.
MapR-DB Data Types | Supported | Mapping | Infoworks Data Type | Hive Data Type |
---|---|---|---|---|
BOOLEAN | Y | Y | BOOLEAN | BOOLEAN |
BINARY | Y | Y | BINARY | BINARY |
BYTE | Y | Y | TINYINT | TINYINT |
DATE | Y | Y | DATE | DATE |
DOUBLE | Y | Y | DOUBLE | DOUBLE |
FLOAT | Y | Y | FLOAT | FLOAT |
INT | Y | Y | INT | INT |
LONG | Y | Y | BIGINT | BIGINT |
SHORT | Y | Y | SMALLINT | SMALLINT |
STRING | Y | Y | STRING | STRING |
TIMESTAMP | Y | Y | TIMESTAMP | TIMESTAMP |
Creating Source
In the Admin section (Admin > Source > New Source), create a source and select the source type as MapRDB. Enter the Hive schema name and HDFS location. The Hive tables corresponding to each MapR-DB table will be created in the target Hive schema.
Configuring Source
- Click the Sources menu and select the MapR-DB source you created.
- In the Source Configuration page, click the Click here to enter them link.
- In the Settings page, enter the MapR-DB Base Path.
- Click Save and Fetch New Tables. A list of tables fetched from MapR-DB is displayed.
- Select the MapR-DB tables for which the Hive tables must be created. Modify the Hive table name if required. You can also modify the Id column value, which is _id by default. The modified value will be the ID column name for the Hive table.
Schema Crawl
- In the Settings page, after fetching the MapR-DB tables, click Crawl Tables. The table metadata will be stored in the Infoworks metastore and the corresponding tables will be created in Hive.
- Click the Build icon to track the job. The job summary with the details of tables crawled and skipped will be displayed.
- After a successful crawl, click the Source Configuration icon. The crawled tables will be displayed. You can also verify the crawled tables in Hive.
- Click the View button for the required table. The table data viewer will be displayed.
- Click Recrawl Metadata to recrawl the metadata for the tables already crawled. This will drop and recreate the corresponding Hive tables.
The table schema is detected from the first n documents in the table, where n is 10 by default. This value can be set at the admin level using the MAPRDB_TYPE_DETECTION_ROW_COUNT configuration.
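The effect of sampling only the first n documents can be illustrated with a minimal sketch. This is not the Infoworks implementation, which is not described here. The function `detect_schema`, the Python-to-Hive type lookup, and the rule that conflicting types fall back to STRING are all assumptions made for illustration. The `row_count` parameter plays the role of MAPRDB_TYPE_DETECTION_ROW_COUNT.

```python
from itertools import islice

# Hypothetical lookup from Python value types to Hive type names.
_HIVE_TYPES = {bool: "BOOLEAN", int: "INT", float: "DOUBLE", str: "STRING", bytes: "BINARY"}


def detect_schema(documents, row_count=10):
    """Union the top-level fields of the first row_count documents."""
    schema = {}
    for doc in islice(documents, row_count):
        for key, value in doc.items():
            # Unknown or nested values are treated as STRING in this sketch.
            hive_type = _HIVE_TYPES.get(type(value), "STRING")
            # Assumption: a field seen with conflicting types falls back to STRING.
            if schema.setdefault(key, hive_type) != hive_type:
                schema[key] = "STRING"
    return schema


if __name__ == "__main__":
    docs = [{"_id": "a", "age": 30}, {"_id": "b", "city": "Pune"}, {"_id": "c", "zip": 1}]
    # With row_count=2 the third document is never sampled, so "zip" is not detected.
    print(detect_schema(docs, row_count=2))
```

A field that first appears after the sampled documents is absent from the detected schema, which is why increasing the row count can help for tables whose rows vary widely.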
Limitations
- If MapR-DB table data includes key names with uppercase letters, the corresponding columns return null when the Hive table is queried. This is because MapR-DB is case sensitive and Hive converts all column names to lowercase.
- MapR-DB tables can have different schemas for each row while Hive can only have a fixed schema. Hence, it is assumed that MapR-DB rows will adhere to a superset of the detected schema.
- MapR-DB ingestion supports only MapR-DB JSON tables. Binary tables are not supported.
- The subpaths in the MapR-DB base path will not be crawled. To create Hive tables for subpaths, you must create a different source with the subpath as the base path.
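The uppercase key limitation above can be checked before a crawl with a small helper. This is a minimal sketch that assumes a sample of documents has already been read into Python dictionaries; the function name `find_uppercase_keys` is hypothetical, and only top-level keys are inspected.

```python
def find_uppercase_keys(documents):
    """Return the sorted top-level keys that contain at least one uppercase letter."""
    return sorted({key for doc in documents for key in doc if key != key.lower()})


if __name__ == "__main__":
    sample = [
        {"_id": "1", "firstName": "Asha"},
        {"_id": "2", "city": "Pune", "ZIP": "411001"},
    ]
    # Keys reported here would appear as null columns when queried through Hive.
    print(find_uppercase_keys(sample))
```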