File Ingestion Process

The supported file sources are CSV, Fixed-Width, Mainframe Data, JSON, XML, and Unstructured files.

The ingestion process includes the following steps:

Creating Source

NOTE: Only an Admin can create a source.

  • Log in to Infoworks DataFoundry.
  • Click Admin > Sources > New Source.
  • Enter the source details.
Source Name: The name of the source that will be displayed in the Infoworks DataFoundry UI.
Source Type: The type of file from which the data must be ingested.
Target Hive Schema: The Hive schema name created by the Hadoop admin.
Target HDFS Location: The HDFS path on the Hadoop cluster, created by the Hadoop admin. A sketch of how the schema and path might be created follows these steps.
Make Publicly Available: The option to make the source publicly available for anyone to use. If unchecked, the source will be visible only to the current user and users who have access to the domains where the source has been added.
Enable ECB Agent: The option to enable the ECB agent.
  • Click Save Settings.
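
The Target Hive Schema and Target HDFS Location must already exist before the source is created. The following Python sketch shows one way a Hadoop admin might create them from an edge node; the schema name, HDFS path, and use of the Hive CLI are assumptions for illustration, not steps mandated by Infoworks DataFoundry.

```python
import subprocess

# Assumed example names; the actual schema and path come from your Hadoop admin.
TARGET_HIVE_SCHEMA = "sales_raw"
TARGET_HDFS_LOCATION = "/data/infoworks/sales_raw"

def create_target_locations(schema, hdfs_path):
    """Create the HDFS directory and Hive schema that the source will point to."""
    # Create the HDFS path (including parents) on the Hadoop cluster.
    subprocess.run(["hadoop", "fs", "-mkdir", "-p", hdfs_path], check=True)
    # Create the Hive schema using the Hive CLI; beeline could be used instead.
    subprocess.run(
        ["hive", "-e", f"CREATE DATABASE IF NOT EXISTS {schema}"],
        check=True,
    )

# Example usage (run as a user with the required HDFS and Hive permissions):
# create_target_locations(TARGET_HIVE_SCHEMA, TARGET_HDFS_LOCATION)
```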

Configuring Source

  • Click the Sources menu and select the source you created.
  • In the Source Configuration page, click the Click here to enter them link to configure the source.
  • Enter the Source Configuration details. See specific ingestion sections for their respective settings.
  • Click Save Settings. Once the settings are saved, you can test the connection or navigate to the Source Configuration page to create tables and crawl the metadata.

Testing Connection

  • Click the Test Connection button. The test connection job will be run.
  • Click the Ingestion Logs icon to track the job progress.
  • Click the build record to view the Summary and MR Jobs. You can also View and Download the logs from the Logs section.

Creating Tables

  • Click the Source Configuration icon.
  • Create tables. The procedure for creating tables varies according to the database type. See specific ingestion sections for their respective settings.

Crawling Metadata

The metadata crawl process crawls the schema of tables and views from the source and stores it in the metadata store of the Infoworks DataFoundry platform.

  • Click the Crawl Metadata button.
  • In the pop-up window, click Yes, Crawl Metadata. The metadata crawl will be initiated.
  • Click the Ingestion Logs icon to track the job progress.

Once the metadata fetch is complete, the list of tables will be displayed in the Tables page.

Configuring Source for Ingestion

  • Click the Configure button for the table to be ingested.
  • Enter the ingestion configuration details.
  • Click Save Settings.

Ingestion Configurations

Ingest Type: The types of synchronization that can be performed on the table. The options include Full Load, Timestamp-Based Incremental Ingestion, and Batch ID Based Incremental Ingestion.
Incremental Append Mode: The option to append the incremental data to Hive, instead of merging the data.
Source Configuration
Natural Keys: The key to identify the row uniquely. This key is used to identify and merge incremental data with the existing data on the target. Distribution of data into secondary partitions for a table on the target will be computed based on the hashcode value of the natural key. This field is mandatory for incremental ingestion tables. NOTE: The value of a natural key column cannot be updated when updating a row on the source; all the components of the natural key are immutable.
Enable Schema Synchronization

This option enables column addition in delimited and fixed-width ingestion. After enabling this option, if a new column is added to the files being crawled, the ingestion job fails with a message that a new column has been added to the schema. The new column name and datatype can then be added to the schema definition in the Edit Schema page in the table configuration.

After the new column has been defined, the ingestion process can be restarted, which adds the new column data to the data lake. Unlike RDBMS ingestion, the data is not backfilled for delimited and fixed-width files.
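
For illustration, the following minimal Python sketch shows the kind of check described above for delimited files: the header of an incoming CSV is compared against the previously defined schema, and any column not yet in the schema definition is reported so it can be added before the job is restarted. The file path, column names, and failure behaviour are assumptions for the example, not the DataFoundry implementation.

```python
import csv

# Schema as currently defined for the table (assumed example columns).
defined_schema = ["order_id", "customer_id", "order_ts", "amount"]

def check_for_new_columns(csv_path, schema):
    """Compare the CSV header with the defined schema and report new columns."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    new_columns = [col for col in header if col not in schema]
    if new_columns:
        # Mirrors the documented behaviour: the job stops, and the new
        # columns must be added on the Edit Schema page before restarting.
        raise RuntimeError(f"New column(s) added to the schema: {new_columns}")
    return header

# Example usage with a hypothetical file path:
# check_for_new_columns("/data/incoming/orders_2020_01.csv", defined_schema)
```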

Timestamp-Based Incremental Load
Use column from Data: The option to use a timestamp column from the data. If enabled, you can select the timestamp column from the Timestamp Column drop-down list.
Use file timestamp: The option to use the file timestamp.
Timestamp Column: The column based on which the incremental data is sorted.
BatchID-Based Incremental Load
Batch-ID Column: The column used for fetching the delta. This must be a numeric column, and the source datatype and target datatype must be equal.
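
The sketch below illustrates, with assumed column names and in-memory records, how timestamp-based and batch-ID-based incremental selection can be reasoned about: rows whose timestamp (or batch ID) is greater than the last ingested value form the delta, which is then merged into the target by natural key. This is a conceptual illustration, not DataFoundry's internal logic.

```python
from datetime import datetime

# Illustrative records; "order_id" plays the role of the natural key.
existing_target = {
    1: {"order_id": 1, "amount": 10.0, "order_ts": datetime(2020, 1, 1)},
    2: {"order_id": 2, "amount": 25.0, "order_ts": datetime(2020, 1, 2)},
}
source_rows = [
    {"order_id": 2, "amount": 30.0, "order_ts": datetime(2020, 1, 3)},  # update
    {"order_id": 3, "amount": 15.0, "order_ts": datetime(2020, 1, 3)},  # insert
]

def delta_by_timestamp(rows, ts_column, last_ingested_ts):
    """Timestamp-based incremental load: keep rows newer than the last run."""
    return [r for r in rows if r[ts_column] > last_ingested_ts]

def delta_by_batch_id(rows, batch_column, last_batch_id):
    """Batch-ID-based incremental load: keep rows with a higher numeric batch ID."""
    return [r for r in rows if r[batch_column] > last_batch_id]

def merge_by_natural_key(target, delta, natural_key):
    """Merge the delta into the target, matching rows on the natural key."""
    for row in delta:
        target[row[natural_key]] = row  # insert, or overwrite the matching row
    return target

delta = delta_by_timestamp(source_rows, "order_ts", datetime(2020, 1, 2))
merge_by_natural_key(existing_target, delta, "order_id")
print(sorted(existing_target))  # -> [1, 2, 3]
```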

Target Configuration

Hive Table Name: The name of the table in Hive which will be used to access the ingested data.
Storage Format: The format of the data file to be stored in HDFS. The options include ORC and Parquet.
Partition Hive Table

The option to partition the data in the target. The partition column can also be derived from a date, datetime, or timestamp column, which further partitions the data. A hierarchy of partitions is supported, with both normal partitions and derived partitions.

NOTE: Ensure that the partition column data is immutable. You can also provide a combination of normal and derived partitions in the hierarchy.

Number of Secondary Partitions: The number of secondary partitions used to run the MR jobs in parallel. The target data can be distributed among various partitions, and each partition in turn can have various secondary partitions. A table with no primary partition can also have secondary partitions. Secondary partitions help in parallelising the ingestion process.
Number of Reducers: The number of reducers for the ingestion MapReduce job. Increasing the number of reducers helps reduce the ingestion duration and is most effective in combination with the partition key and the number of secondary partitions, since the parallelism of the MR job depends on how the data is distributed across primary and secondary partitions on Hadoop.
Generate History View: The option to create a history view table (along with the current view table) in Hive, which contains the versions of incremental updates and deletes.
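
The sketch below illustrates, with assumed column names and an assumed partition count, how a derived partition (for example, the date part of a timestamp column) and a hashcode-based secondary partition computed from the natural key can together determine where a row lands. It is a conceptual illustration of the settings above, not DataFoundry's partitioning code.

```python
from datetime import datetime
import zlib

NUM_SECONDARY_PARTITIONS = 4  # assumed value of "Number of Secondary Partitions"

def derived_partition(row, ts_column):
    """Derive a date partition value from a timestamp column."""
    return row[ts_column].strftime("%Y-%m-%d")

def secondary_partition(row, natural_key_columns, num_partitions):
    """Assign a secondary partition from the hashcode of the natural key."""
    key = "|".join(str(row[c]) for c in natural_key_columns)
    return zlib.crc32(key.encode()) % num_partitions

row = {"order_id": 42, "customer_id": 7, "order_ts": datetime(2020, 1, 3, 9, 30)}
print(derived_partition(row, "order_ts"))                                # 2020-01-03
print(secondary_partition(row, ["order_id"], NUM_SECONDARY_PARTITIONS))  # value in 0..3
```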

Creating Table Group

  • Click the Table Groups tab in the Source Configuration page.
  • Click the Add Table Group button.
  • Enter the table group configuration details.
Table Group Name: The name of the table group.
Max. Connections to Source: The maximum number of source database connections allocated to the table group.
Max. Parallel Tables: The maximum number of tables that can be crawled at a given instance.
Yarn Queue Name (optional): The name of the YARN queue for ingestion and export jobs.
Add Tables: The option to add tables to the table group. Select the required tables and click Add Tables. The tables will be added to the table group.
% Connection Quota: The percentage of Max. Connections to Source allocated to a table, as illustrated in the sketch after these steps.
  • Click Save Configuration. The table group will be created and displayed in the Table Groups page.
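
As a rough illustration of how these settings interact, the sketch below computes the number of connections a table would receive from Max. Connections to Source and its % Connection Quota. The numbers are made-up examples, and the rounding behaviour is an assumption rather than documented DataFoundry behaviour.

```python
def connections_for_table(max_connections_to_source, connection_quota_percent):
    """Approximate connections allocated to a table from its quota percentage."""
    # Assumed rounding: at least one connection, never more than the pool.
    allocated = int(max_connections_to_source * connection_quota_percent / 100)
    return max(1, min(allocated, max_connections_to_source))

# Example: a table group with 10 source connections and a table with a 30% quota.
print(connections_for_table(10, 30))  # -> 3
```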

Running Ingestion

  • Click the View Table Group icon for the required table group.
  • Click Initialize And Ingest Now when performing ingestion on the tables for the first time. To append new data to the crawled source, click Ingest Now.
  • In the pop-up window, click Yes, Initialize And Ingest. The ingestion process will be initiated.
  • Click the Ingestion Logs icon to track the job progress.

NOTE

When ingesting a CSV file from S3 cloud storage, ensure the following:

  • Run the hadoop fs -ls s3a://<path> command against the S3 bucket from which the data is to be ingested, on all nodes of the cluster including the edge node. If this command succeeds, the required binaries are already part of the Hadoop classpath, and the libraries do not need to be copied to a specific location for DataFoundry to use.
  • Contact the Hadoop vendor (CDH or Google) to get the version details and install the required libraries on the cluster accordingly.
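
The check in the first bullet can also be scripted. The sketch below simply wraps the same hadoop fs -ls command and reports whether the s3a path is reachable from the node it runs on; the bucket path is a placeholder, and running the check on every node (including the edge node) is left to whatever orchestration you already use.

```python
import subprocess

def s3a_path_is_readable(s3a_path):
    """Run `hadoop fs -ls` against an s3a:// path and report success or failure."""
    result = subprocess.run(
        ["hadoop", "fs", "-ls", s3a_path],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        print(f"{s3a_path} is readable; the s3a libraries are on the Hadoop classpath.")
        return True
    print(f"Listing {s3a_path} failed:\n{result.stderr}")
    return False

# Example with a placeholder bucket path:
# s3a_path_is_readable("s3a://<bucket>/<path>")
```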