MapR-DB Ingestion

Infoworks DataFoundry allows creating Hive tables to project MapR-DB table data. The metadata for these tables will also be stored in the Infoworks DataFoundry metastore so that the Hive tables can be used in data transformations.

MapR-DB is an HBase-like, case-sensitive document store that stores nested key-value structures in documents. Each row in a MapR-DB table can have a different schema.

MapR-DB tables can be queried via Hive by creating a corresponding Hive table that specifies a storage handler and an _id column. The _id column uniquely identifies a row and is used by Hive to query the table.
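As a sketch of what such a projection looks like, the following HiveQL creates an external Hive table over a hypothetical MapR-DB JSON table at /data/orders. The storage handler class and the maprdb.* table properties are those documented for MapR-DB JSON tables; the table path and column names are illustrative assumptions only.

```sql
-- Project a MapR-DB JSON table into Hive (hypothetical table and columns).
CREATE EXTERNAL TABLE orders (
  _id      STRING,   -- unique row identifier used by the storage handler
  customer STRING,
  amount   DOUBLE
)
STORED BY 'org.apache.hadoop.hive.maprdb.json.MapRDBJsonStorageHandler'
TBLPROPERTIES (
  "maprdb.table.name" = "/data/orders",  -- path to the MapR-DB JSON table
  "maprdb.column.id"  = "_id"            -- column mapped to the row identifier
);
```

Infoworks DataFoundry generates tables of this shape automatically during the crawl; the DDL above is only meant to show the role of the storage handler and the _id column.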

MapR-DB Datatypes

The following table lists the data types supported by Infoworks DataFoundry in MapR-DB.

| MapR-DB Data Type | Supported | Mapping | Infoworks Data Type | Hive Data Type |
|-------------------|-----------|---------|---------------------|----------------|
| BOOLEAN           | Y         | Y       | BOOLEAN             | BOOLEAN        |
| BINARY            | Y         | Y       | BINARY              | BINARY         |
| BYTE              | Y         | Y       | TINYINT             | TINYINT        |
| DATE              | Y         | Y       | DATE                | DATE           |
| DOUBLE            | Y         | Y       | DOUBLE              | DOUBLE         |
| FLOAT             | Y         | Y       | FLOAT               | FLOAT          |
| INT               | Y         | Y       | INT                 | INT            |
| BIGINT            | Y         | Y       | LONG                | LONG           |
| SHORT             | Y         | Y       | SMALLINT            | SMALLINT       |
| STRING            | Y         | Y       | STRING              | STRING         |
| TIMESTAMP         | Y         | Y       | TIMESTAMP           | TIMESTAMP      |
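To make the mapping concrete, here is a hypothetical MapR-DB JSON document and the Hive column types it would be projected to, per the table above (the document and column names are illustrative assumptions, not from the product):

```sql
-- Hypothetical MapR-DB JSON row:
--   {"_id": "r1", "active": true, "age": 42, "score": 3.14}
-- Projected Hive columns per the mapping table above:
--   _id    STRING    (MapR-DB STRING)
--   active BOOLEAN   (MapR-DB BOOLEAN)
--   age    INT       (MapR-DB INT)
--   score  DOUBLE    (MapR-DB DOUBLE)
```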

Creating MapR-DB Source

To create a MapR-DB source, see Creating Source. Ensure that the Source Type selected is MapR-DB. Enter the Hive schema name and the HDFS location. The target Hive schema is where a Hive table will be created for each MapR-DB table.

Configuring MapR-DB Source

For configuring a MapR-DB source, see Configuring Source.

  • In the Settings page, enter the MapR-DB Base Path.
  • Click Save and Fetch New Tables. A list of tables fetched from MapR-DB will be displayed.
  • Select the MapR-DB tables for which Hive tables must be created. Modify the Hive table name if required. You can also modify the ID column value, which is _id by default; the modified value becomes the ID column name of the Hive table.

Crawling MapR-DB Metadata

  • In the Settings page, after fetching the MapR-DB tables, click Crawl Tables. The table metadata will be stored in the Infoworks metastore and the corresponding tables will be created in Hive.
  • Click the Build icon to track the job. The job summary with the details of tables crawled and skipped will be displayed.
  • After a successful crawl, click the Source Configuration icon. The crawled tables will be displayed. You can also verify the crawled tables in Hive.
  • Click the View button for the required table. The table data viewer will be displayed.
  • Click Recrawl Metadata to recrawl the metadata for the tables that were already crawled. This will drop and recreate the corresponding Hive tables.

NOTE: The table schema is detected from the top n documents, where n is 10 by default. This value can be set at the admin level using the MAPRDB_TYPE_DETECTION_ROW_COUNT configuration.
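For example, to detect the schema from the first 50 documents instead of the default 10, the admin-level configuration entry would look like the following (a sketch of the key-value form; where exactly this is set in the admin interface may vary by deployment):

```
MAPRDB_TYPE_DETECTION_ROW_COUNT=50
```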

Limitations

  • If MapR-DB table data includes key names with uppercase letters, the columns are displayed as null while querying the Hive table. This is because MapR-DB is case-sensitive and Hive converts all the column names to lowercase.
  • MapR-DB tables can have different schemas for each row while Hive can only have a fixed schema. Hence, it is assumed that MapR-DB rows will adhere to a superset of the detected schema.
  • MapR-DB ingestion supports only MapR-DB JSON tables. Binary tables are not supported.
  • The subpaths in the MapR base path will not be crawled. To create Hive tables for subpaths, you must create a different source with the subpath as the base path.
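The case-sensitivity limitation above can be illustrated with a hypothetical table (the table and key names are assumptions for illustration):

```sql
-- MapR-DB row: {"_id": "r1", "userName": "alice"}
-- Hive lowercases all column names, so the projected column is `username`.
-- MapR-DB, being case-sensitive, has no key "username", so the lookup
-- finds nothing and the column is returned as NULL.
SELECT username FROM users;
```

To avoid this, keep key names in MapR-DB documents lowercase when the data is intended to be queried through Hive.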