MapR-DB Ingestion
Infoworks DataFoundry allows you to create Hive tables that project MapR-DB table data. The metadata for these tables will also be stored in the Infoworks DataFoundry metastore so that the Hive tables can be used in data transformations.
MapR-DB is an HBase-like, case-sensitive document store which stores nested key-value structures in a document. Each row in a MapR-DB table can have a different schema.
MapR-DB tables can be queried via Hive by creating a corresponding Hive table and specifying a storage handler and an _id column. The _id column uniquely identifies each row and is used by Hive to query the tables.
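For example, a MapR-DB JSON table could be projected into Hive with DDL along these lines. This is a minimal sketch: the table path /apps/employees and the column names are hypothetical, and the storage handler class shown is the one shipped with MapR's Hive connector. Infoworks DataFoundry generates such tables for you during the crawl; the sketch only illustrates the underlying mechanism.

```sql
-- Sketch: project the MapR-DB JSON table at /apps/employees into Hive.
-- The "_id" Hive column is bound to the MapR-DB row identifier
-- through the maprdb.column.id table property.
CREATE EXTERNAL TABLE employees (
  `_id`  STRING,
  name   STRING,
  salary DOUBLE
)
STORED BY 'org.apache.hadoop.hive.maprdb.json.MapRDBJsonStorageHandler'
TBLPROPERTIES (
  "maprdb.table.name" = "/apps/employees",
  "maprdb.column.id"  = "_id"
);
```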
MapR-DB Datatypes
The following table lists the data types supported by Infoworks DataFoundry in MapR-DB.
MapR-DB Data Types | Supported | Mapping | Infoworks Data Type | Hive Data Type |
---|---|---|---|---|
BOOLEAN | Y | Y | BOOLEAN | BOOLEAN |
BINARY | Y | Y | BINARY | BINARY |
BYTE | Y | Y | TINYINT | TINYINT |
DATE | Y | Y | DATE | DATE |
DOUBLE | Y | Y | DOUBLE | DOUBLE |
FLOAT | Y | Y | FLOAT | FLOAT |
INT | Y | Y | INT | INT |
BIGINT | Y | Y | LONG | BIGINT |
SHORT | Y | Y | SMALLINT | SMALLINT |
STRING | Y | Y | STRING | STRING |
TIMESTAMP | Y | Y | TIMESTAMP | TIMESTAMP |
Creating MapR-DB Source
For creating a MapR-DB source, see Creating Source. Ensure that the Source Type selected is MapR-DB. Enter the Hive schema name and the HDFS location. The target Hive schema is where the Hive tables corresponding to each MapR-DB table will be created.
Configuring MapR-DB Source
For configuring a MapR-DB source, see Configuring Source.
- In the Settings page, enter the MapR-DB Base Path.
- Click Save and Fetch New Tables. A list of tables fetched from MapR-DB will be displayed.
- Select the MapR-DB tables for which Hive tables must be created. Modify the Hive table name if required. You can also modify the Id Column value, which is _id by default; the modified value will be used as the ID column name of the Hive table.
Crawling MapR-DB Metadata
- In the Settings page, after fetching the MapR-DB tables, click Crawl Tables. The table metadata will be stored in the Infoworks metastore and the corresponding tables will be created in Hive.
- Click the Build icon to track the job. The job summary with the details of tables crawled and skipped will be displayed.
- After a successful crawl, click the Source Configuration icon. The crawled tables will be displayed. You can also verify the crawled tables in Hive.
- Click the View button for the required table. The table data viewer will be displayed.
- Click Recrawl Metadata to recrawl the metadata for the tables that were already crawled. This will drop and recreate the corresponding Hive tables.
NOTE: The table schema is detected from the first n documents, where n is 10 by default. This value can be set at the admin level using the MAPRDB_TYPE_DETECTION_ROW_COUNT configuration.
Limitations
- If MapR-DB table data includes key names with uppercase letters, the columns are displayed as null when querying the Hive table. This is because MapR-DB is case-sensitive while Hive converts all column names to lowercase.
- MapR-DB tables can have a different schema for each row, while a Hive table must have a fixed schema. Hence, it is assumed that every MapR-DB row adheres to a superset of the detected schema.
- MapR-DB ingestion supports only MapR-DB JSON tables. Binary tables are not supported.
- The subpaths under the MapR-DB base path will not be crawled. To create Hive tables for subpaths, you must create a separate source with the subpath as its base path.
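The first two limitations above can be illustrated with a pair of hypothetical documents in the same MapR-DB JSON table (the table name, keys, and values are assumptions for the sake of the example):

```sql
-- Two MapR-DB JSON documents in the same table:
--   {"_id": "1", "FirstName": "Alice", "salary": 100.0}
--   {"_id": "2", "firstname": "Bob"}
-- Hive lowercases column names, so the projected column is "firstname".
SELECT firstname, salary FROM employees;
-- Row 1: firstname is NULL, because the case-sensitive key "FirstName"
--        does not match the lowercased lookup "firstname".
-- Row 2: salary is NULL, because the key is absent from that document;
--        rows missing a detected column simply project as NULL.
```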