File Ingestion Process

The supported file sources are CSV, Fixed-Width, Mainframe Data, JSON, XML, and Unstructured files.

The ingestion process includes the following steps:

Creating Source

NOTE: Only an Admin can create a source.

  • Log in to Infoworks DataFoundry.
  • Click Admin > Sources > New Source.
  • Enter the source details.
Source Name: The name of the source that will be displayed in the Infoworks DataFoundry UI.
Source Type: The type of file from which the data must be ingested.
Target Hive Schema: The Hive schema name created by the Hadoop admin.
Target HDFS Location: The HDFS path on the Hadoop cluster, created by the Hadoop admin. A sketch of how the schema and path might be created follows these steps.
Make Publicly Available: The option to make the source publicly available for anyone to use. If unchecked, the source will be visible only to the current user and users who have access to the domains where the source has been added.
Enable ECB Agent: The option to enable the ECB agent.
  • Click Save Settings.
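
The Target Hive Schema and Target HDFS Location must already exist before the source is created. The following Python sketch shows one way a Hadoop admin might create them from an edge node; the schema name, HDFS path, and use of the Hive CLI are assumptions for illustration, not steps mandated by Infoworks DataFoundry.

```python
import subprocess

# Assumed example names; the actual schema and path come from your Hadoop admin.
TARGET_HIVE_SCHEMA = "sales_raw"
TARGET_HDFS_LOCATION = "/data/infoworks/sales_raw"

def create_target_locations(schema, hdfs_path):
    """Create the HDFS directory and Hive schema that the source will point to."""
    # Create the HDFS path (including parents) on the Hadoop cluster.
    subprocess.run(["hadoop", "fs", "-mkdir", "-p", hdfs_path], check=True)
    # Create the Hive schema using the Hive CLI; beeline could be used instead.
    subprocess.run(
        ["hive", "-e", f"CREATE DATABASE IF NOT EXISTS {schema}"],
        check=True,
    )

# Example usage (run as a user with the required HDFS and Hive permissions):
# create_target_locations(TARGET_HIVE_SCHEMA, TARGET_HDFS_LOCATION)
```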

Configuring Source

  • Click the Sources menu and select the source you created.
  • In the Source Configuration page, click the Click here to enter them link to configure the source.
  • Enter the Source Configuration details. See specific ingestion sections for their respective settings.
  • Click Save Settings. Once the settings are saved, you can test the connection or navigate to the Source Configuration page to create tables and crawl the metadata.

Testing Connection

  • Click the Test Connection button. The test connection job will be run.
  • Click the Ingestion Logs icon to track the job progress.
  • Click the build record to view the Summary and MR Jobs. You can also View and Download the logs from the Logs section.

Creating Tables

  • Click the Source Configuration icon.
  • Create tables. The procedure for creating tables varies according to the database type. See specific ingestion sections for their respective settings.

Crawling Metadata

The metadata crawl process crawls the schema of tables and views from the source and stores it in the metadata store of the Infoworks DataFoundry platform.

  • Click the Crawl Metadata button.
  • In the pop-up window, click Yes, Crawl Metadata. The metadata crawl will be initiated.
  • Click the Ingestion Logs icon to track the job progress.

Once the metadata fetch is complete, the list of tables will be displayed in the Tables page.

Configuring Source for Ingestion

  • Click the Configure button for the table to be ingested.
  • Enter the ingestion configuration details.
  • Click Save Settings.

Ingestion Configurations

Ingest Type: The types of synchronization that can be performed on the table. The options include Full Load, Timestamp-Based Incremental Ingestion, and Batch ID Based Incremental Ingestion.
Incremental Append Mode: The option to append the incremental data to Hive, instead of merging the data.
Source Configuration
Natural Keys: The key to identify the row uniquely. This key is used to identify and merge incremental data with the existing data on the target. Distribution of data into secondary partitions for a table on the target will be computed based on the hashcode value of the natural key. This field is mandatory for incremental ingestion tables. NOTE: The value of a natural key column cannot be updated when updating a row on the source; all the components of the natural key are immutable.
Enable Schema Synchronization

This option enables column addition in delimited and fixed-width ingestion. After enabling this option, if a new column is added to the files being crawled, the ingestion job fails with a message that a new column has been added to the schema. The new column name and datatype can then be added to the schema definition in the Edit Schema page in the table configuration.

After the new column has been defined, the ingestion process can be restarted, which adds the new column data to the data lake. Unlike RDBMS ingestion, the data is not backfilled for delimited and fixed-width files.
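
For illustration, the following minimal Python sketch shows the kind of check described above for delimited files: the header of an incoming CSV is compared against the previously defined schema, and any column not yet in the schema definition is reported so it can be added before the job is restarted. The file path, column names, and failure behaviour are assumptions for the example, not the DataFoundry implementation.

```python
import csv

# Schema as currently defined for the table (assumed example columns).
defined_schema = ["order_id", "customer_id", "order_ts", "amount"]

def check_for_new_columns(csv_path, schema):
    """Compare the CSV header with the defined schema and report new columns."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    new_columns = [col for col in header if col not in schema]
    if new_columns:
        # Mirrors the documented behaviour: the job stops, and the new
        # columns must be added on the Edit Schema page before restarting.
        raise RuntimeError(f"New column(s) added to the schema: {new_columns}")
    return header

# Example usage with a hypothetical file path:
# check_for_new_columns("/data/incoming/orders_2020_01.csv", defined_schema)
```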

Timestamp-Based Incremental Load
Use column from Data: The option to use a timestamp column from the data. If enabled, you can select the timestamp column from the Timestamp Column drop-down list.
Use file timestamp: The option to use the file timestamp.
Timestamp Column: The column based on which the incremental data is sorted.
BatchID-Based Incremental Load
Batch-ID Column: The column used for fetching the delta. This must be a numeric column, and the source datatype and target datatype must be equal.
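
The sketch below illustrates, with assumed column names and in-memory records, how timestamp-based and batch-ID-based incremental selection can be reasoned about: rows whose timestamp (or batch ID) is greater than the last ingested value form the delta, which is then merged into the target by natural key. This is a conceptual illustration, not DataFoundry's internal logic.

```python
from datetime import datetime

# Illustrative records; "order_id" plays the role of the natural key.
existing_target = {
    1: {"order_id": 1, "amount": 10.0, "order_ts": datetime(2020, 1, 1)},
    2: {"order_id": 2, "amount": 25.0, "order_ts": datetime(2020, 1, 2)},
}
source_rows = [
    {"order_id": 2, "amount": 30.0, "order_ts": datetime(2020, 1, 3)},  # update
    {"order_id": 3, "amount": 15.0, "order_ts": datetime(2020, 1, 3)},  # insert
]

def delta_by_timestamp(rows, ts_column, last_ingested_ts):
    """Timestamp-based incremental load: keep rows newer than the last run."""
    return [r for r in rows if r[ts_column] > last_ingested_ts]

def delta_by_batch_id(rows, batch_column, last_batch_id):
    """Batch-ID-based incremental load: keep rows with a higher numeric batch ID."""
    return [r for r in rows if r[batch_column] > last_batch_id]

def merge_by_natural_key(target, delta, natural_key):
    """Merge the delta into the target, matching rows on the natural key."""
    for row in delta:
        target[row[natural_key]] = row  # insert, or overwrite the matching row
    return target

delta = delta_by_timestamp(source_rows, "order_ts", datetime(2020, 1, 2))
merge_by_natural_key(existing_target, delta, "order_id")
print(sorted(existing_target))  # -> [1, 2, 3]
```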

Target Configuration

Hive Table Name: The name of the table in Hive which will be used to access the ingested data.
Storage Format: The format of the data file to be stored in HDFS. The options include ORC and Parquet.
Partition Hive Table

The option to partition the data in the target. The partition column can also be derived from a date, datetime, or timestamp column, which further partitions the data. A hierarchy of partitions is supported, with both normal partitions and derived partitions.

NOTE: Ensure that the partition column data is immutable. You can also provide a combination of normal and derived partitions in the hierarchy.

Number of Secondary Partitions: The number of secondary partitions used to run the MR jobs in parallel. The target data can be distributed among various partitions, and each partition in turn can have various secondary partitions. A table with no primary partition can also have secondary partitions. Secondary partitions help in parallelising the ingestion process.
Number of Reducers: The number of reducers for the ingestion MapReduce job. Increasing the number of reducers helps reduce the ingestion duration and is most effective in combination with the partition key and the number of secondary partitions, since the parallelism of the MR job depends on how the data is distributed across primary and secondary partitions on Hadoop.
Generate History View: The option to create a history view table (along with the current view table) in Hive, which contains the versions of incremental updates and deletes.
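
The sketch below illustrates, with assumed column names and an assumed partition count, how a derived partition (for example, the date part of a timestamp column) and a hashcode-based secondary partition computed from the natural key can together determine where a row lands. It is a conceptual illustration of the settings above, not DataFoundry's partitioning code.

```python
from datetime import datetime
import zlib

NUM_SECONDARY_PARTITIONS = 4  # assumed value of "Number of Secondary Partitions"

def derived_partition(row, ts_column):
    """Derive a date partition value from a timestamp column."""
    return row[ts_column].strftime("%Y-%m-%d")

def secondary_partition(row, natural_key_columns, num_partitions):
    """Assign a secondary partition from the hashcode of the natural key."""
    key = "|".join(str(row[c]) for c in natural_key_columns)
    return zlib.crc32(key.encode()) % num_partitions

row = {"order_id": 42, "customer_id": 7, "order_ts": datetime(2020, 1, 3, 9, 30)}
print(derived_partition(row, "order_ts"))                                # 2020-01-03
print(secondary_partition(row, ["order_id"], NUM_SECONDARY_PARTITIONS))  # value in 0..3
```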

Creating Table Group

  • Click the Table Groups tab in the Source Configuration page.
  • Click the Add Table Group button.
  • Enter the table group configuration details.
Table Group Name: The name of the table group.
Max. Connections to Source: The maximum number of source database connections allocated to the table group.
Max. Parallel Tables: The maximum number of tables that can be crawled at a given instance.
Yarn Queue Name (optional): The name of the YARN queue for ingestion and export jobs.
Add Tables: The option to add tables to the table group. Select the required tables and click Add Tables. The tables will be added to the table group.
% Connection Quota: The percentage of Max. Connections to Source allocated to a table, as illustrated in the sketch after these steps.
  • Click Save Configuration. The table group will be created and displayed in the Table Groups page.
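
As a rough illustration of how these settings interact, the sketch below computes the number of connections a table would receive from Max. Connections to Source and its % Connection Quota. The numbers are made-up examples, and the rounding behaviour is an assumption rather than documented DataFoundry behaviour.

```python
def connections_for_table(max_connections_to_source, connection_quota_percent):
    """Approximate connections allocated to a table from its quota percentage."""
    # Assumed rounding: at least one connection, never more than the pool.
    allocated = int(max_connections_to_source * connection_quota_percent / 100)
    return max(1, min(allocated, max_connections_to_source))

# Example: a table group with 10 source connections and a table with a 30% quota.
print(connections_for_table(10, 30))  # -> 3
```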

Running Ingestion

  • Click the View Table Group icon for the required table group.
  • Click Initialize And Ingest Now when performing ingestion on the tables for the first time. To append new data to the crawled source, click Ingest Now.
  • In the pop-up window, click Yes, Initialize And Ingest. The ingestion process will be initiated.
  • Click the Ingestion Logs icon to track the job progress.

NOTE

When ingesting a CSV file from S3 cloud storage, ensure the following:

  • Run the hadoop fs -ls s3a://<path> command against the S3 bucket from which the data is to be ingested, on all nodes of the cluster including the edge node. If this command succeeds, the required binaries are already part of the Hadoop classpath, and the libraries do not need to be copied to a specific location for DataFoundry to use.
  • Contact the Hadoop vendor (CDH or Google) to get the version details and install the required libraries on the cluster accordingly.
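
The check in the first bullet can also be scripted. The sketch below simply wraps the same hadoop fs -ls command and reports whether the s3a path is reachable from the node it runs on; the bucket path is a placeholder, and running the check on every node (including the edge node) is left to whatever orchestration you already use.

```python
import subprocess

def s3a_path_is_readable(s3a_path):
    """Run `hadoop fs -ls` against an s3a:// path and report success or failure."""
    result = subprocess.run(
        ["hadoop", "fs", "-ls", s3a_path],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        print(f"{s3a_path} is readable; the s3a libraries are on the Hadoop classpath.")
        return True
    print(f"Listing {s3a_path} failed:\n{result.stderr}")
    return False

# Example with a placeholder bucket path:
# s3a_path_is_readable("s3a://<bucket>/<path>")
```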