Structured File Ingestion
Infoworks supports the following types of structured file ingestion:
Delimited File Ingestion
Delimited File Ingestion (DFI) supports crawling delimited text files in Append and CDC modes.
It can be performed in the following ways:
Record-Level Processing
DFI provides the following features for record-level processing:
- Schema Crawl
- Data Crawl
- Append Mode
- CDC and Merge
Creating Source
In the Admin section (Admin > Source > New Source), create a source and select the source type as Structured Files. Enter the Hive schema name and HDFS location. Select the Enable ECB Agent option to enable the ECB agent.

Configuring Source
- Click the Sources menu and select the structured file source you created.
- In the Source Configuration page, click the Click here to enter them link.
- In the Settings page, perform the following:
- Select one of the following options, depending on where the files are stored: From FileSystem, From Hadoop Cluster, From Remote Server (using SFTP), From Cloud Storage.
NOTE: If you select From Remote Server, specify the SFTP User Name, Password, Host, and Port.
- Enter the source base path.
- Select the ECB agent.

- Click Save Settings.
Schema Crawl
NOTE: For a folder that contains files with similar structure, the system can detect the types of all columns.
Following are the steps to perform a schema crawl:
- Click the Sources menu and select the source you created.
- Click the Source Settings icon.
- In the File Mapping section, click Add Entry to add a folder as a table.
- Configure the following table details:
- Table: Table name.
- Source Path: Folder path of the table. This is relative to the source base path.
- Target HDFS Path: Target HDFS path. This is relative to the target base path.
- Include/Exclude Files From Directory: Regex pattern to include or skip files.
- Ingest sub-directories: Specifies whether to recursively crawl files in the sub-directories of the specified source path.
- Archive source files: Archives the files for which the data has been crawled and inserted into Hive. The files will be archived on the edge node.
NOTE: Truncate and Truncate Reload of the table will not affect archived files. For every data file, the corresponding control file is also archived.
- Processing Level: Record-level processing or file-level processing.
- Number of Header Rows: Can be zero or greater. If it is greater than zero (say, n), the first line of the file is used to derive the column names and the next n-1 lines are skipped.
- Column Separator: Column/Field Separator.
- Column Enclosed By: Column/Field Encapsulator.
- Escape Character: Character to escape delimiter, encapsulator, and new lines in data.
- Character Encoding: Character encoding.

- Control Files: A control file (a file with data file metadata) against which you can validate the data file. The regular expression fields, Data Files pattern and Extract format, allow you to specify the corresponding control file for every data file as a function of the data file path. In the example below, every file with a .csv extension is treated as a data file and the same file path ending with a .ctl extension is treated as its control file.
Ensure that the data directory, after the include and exclude filters are applied, returns only data files and control files. During processing, the system first applies the include and exclude filters to every file. Any file path that matches the Data Files pattern regex is assumed to be a data file, and its corresponding control file is derived using the configured regex patterns. Also ensure that control files do not match the Data Files pattern regex.
Currently, control files must be in the Java properties file format; the supported validations are checksum, count, and file size.
NOTE: Control file features are not supported for compressed files in DFI ingestion.
Following is an example:
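The snippet below is a minimal, hypothetical sketch of this setup, not Infoworks' internal implementation: it assumes a Data Files pattern of (.*)\.csv, an Extract format that replaces the .csv suffix with .ctl, and a Java-properties control file whose keys (checksum, record_count, file_size) and checksum algorithm (MD5) are illustrative placeholders rather than the exact names Infoworks expects.

```python
import hashlib
import os
import re

# Hypothetical convention: data file "orders/2021-01-01.csv" maps to control file
# "orders/2021-01-01.ctl" (Data Files pattern "(.*)\.csv", Extract format "$1.ctl").
DATA_FILE_PATTERN = re.compile(r"(.*)\.csv$")


def control_file_for(data_path):
    """Return the control file path for a data file path, or None if it is not a data file."""
    match = DATA_FILE_PATTERN.match(data_path)
    return match.group(1) + ".ctl" if match else None


def read_control_file(ctl_path):
    """Parse a Java-properties-style control file such as:
         checksum=5d41402abc4b2a76b9719d911017c592
         record_count=1200
         file_size=104857600
       The key names above are illustrative, not the exact keys Infoworks expects."""
    props = {}
    with open(ctl_path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                props[key.strip()] = value.strip()
    return props


def validate(data_path):
    """Validate the data file's checksum, record count, and file size against its control file."""
    props = read_control_file(control_file_for(data_path))
    with open(data_path, "rb") as handle:
        payload = handle.read()
    observed = {
        "checksum": hashlib.md5(payload).hexdigest(),   # assumed algorithm
        "record_count": str(len(payload.splitlines())),
        "file_size": str(os.path.getsize(data_path)),
    }
    return all(observed[key] == value for key, value in props.items() if key in observed)
```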

The validation logic and the control file reading logic are pluggable. Hence, additional validation variables and different control file parsers can be plugged in to extend this feature.
NOTE: For the validations to function, the calc_file_level_ing_metrics configuration must be set to true.
- Click Save and Crawl Schema. You will be redirected to the Edit Schema page where you can add or edit columns.
- Click Save Schema.

- To recrawl metadata, navigate to the Source Configuration page and click Recrawl Metadata.
NOTE: If the table is not mapped, the Crawl Metadata button is displayed in the Source Configuration page.
Data Crawl Full Load
Following are the steps to perform a full load data crawl:
- Click the Sources menu and click the DFI source. The Tables page is displayed with the list of tables in the source.
- Click the Configure button for the table that requires a full load data crawl.
- Select the Ingest Type as Full Load, enter the required values and click Save Configuration. For descriptions of fields, see the Source Table Configuration Field Descriptions section.

- In the Tables page, click the Table Group tab.
- Click the View Table Group icon for the required table group.
- Navigate to table group and click Initialize and Ingest Now.
- For first-time ingestion, or if you need a clean crawl, click Initialize and Ingest Now.
NOTE: This mode should be used for cases where new inserts come in new files and there are no updates.
- To append new data to the crawled source, click Ingest Now from the second crawl onwards. The new data can be placed in the same location; only new and changed files are picked.
NOTE: This mode should be used for cases where new inserts and new updates are present for every crawl.
Data Crawl Incremental Load
Following are the steps to perform incremental load data crawl on the DFI files:
- Click the Sources menu and click the DFI source. The Tables page is displayed with the list of tables in the source.
- Click the Configure button for the table that requires an incremental load data crawl.
- Select the required incremental load Ingest Type, enter the required values and click Save Configuration.
- For descriptions of fields, see the Source Table Configuration Field Descriptions section.

NOTE: For timestamp-based incremental ingestion, only columns of timestamp datatype can be selected as the Timestamp Column.
- In the Tables page, click the Table Group tab.
- Click the View Table Group icon for the required table group.
- Navigate to table group and click Initialize and Ingest Now.
- For first-time ingestion, or if you need a clean crawl, click Initialize and Ingest Now.
NOTE: This mode should be used for cases where new inserts come in new files and there are no updates.
- To get the new CDC data and merge it into the crawled source, click Ingest Now from the second crawl onwards. The new data can be placed in the same location; only new and changed files are picked.
NOTE: This mode should be used for cases where new inserts and new updates are present for every crawl.
Limitations
- If multiple versions of the same record (same natural key) are present in the same file, one of the records will be picked randomly and the others will be moved to the history table.
- Editing previously ingested files is not supported since the current version of the record might be affected.
File-Level Processing
DFI provides the following features for file-level processing:
- Schema specification
- Data crawl
- CDC append mode
Creating Source
In the Admin section (Admin > Source > New Source), create a source and select the source type as Structured Files. Enter the Hive schema name and HDFS location. Select the Enable ECB Agent option to enable the ECB agent.
Configuring Source
- Click the Sources menu and select the source you created.
- In the Source Configuration page, click the Click here to enter them link.
- Configure the source by selecting one of the following options, depending on where the files are currently stored: From FileSystem, From Hadoop Cluster, From Remote Server, From Cloud Storage.
- If you choose From Cloud Storage, select the Cloud Type:
- Google Cloud Service: Select the Service Account Type, copy the Service Account JSON Credentials File to the edge node, and provide the path.
- S3: Select the Access Type. If you choose Use IAM, ensure that the edge node runs with the same IAM role and has access to the S3 bucket. If you choose Use Access Credentials, provide the access credentials of the S3 bucket.
- Enter the source base path.
- Scroll down and click Add Entry to add a folder as a table.
- Configure the table as shown in the example below.
The following table lists and describes the fields.
Field | Description |
---|---|
Table | Table name. |
Source Path | Folder path of the table. This is relative to the source base path. |
Target HDFS Path | Target HDFS path. This is relative to the target base path. |
Include/Exclude Files From Directory | A regex pattern to include or skip files. |
Ingest sub-directories | Specifies whether to recursively crawl files in the sub-directories of the specified source path. |
Processing Level | Option to select record-level processing or file-level processing. |
Number of Header Rows | Can be zero or greater. If it is greater than zero (say, n), the first line of the file is used to derive the column names and the next n-1 lines are skipped. |
Column Separator | Column/Field Separator |
Escape Character | Character to escape delimiter, encapsulator, and new lines in the data. |
Character Encoding | Character encoding. |
File Compression Format | The compression format of the file, if the file is compressed. |
Schema Specification
For a given folder which contains files with similar structure, you can specify the table schema.
To specify the table schema, follow these steps:
- Create tables as described in the Prerequisites section.
- Navigate to source configuration section and click Recrawl Metadata.
- Click the View button for the table after the metadata crawl is complete. If a header is specified, datatypes are assigned to the column names; otherwise, the columns are given names such as Col1 and Col2.
Data Crawl Full Load
To perform a full load data crawl, follow these steps:
- Navigate to table configuration of the table.
- Enter the required configurations for full load.
- Ingest Type: Full load
- Hive Table Name: Hive table name.
- Infoworks Managed Table: This option specifies whether data is loaded into an existing Hive table that is not maintained by Infoworks, or into a Hive table that is created and maintained by Infoworks.
- Click Save Configuration.
- Navigate to table group and click Initialize and Ingest Now.
Incremental Load Append Only
- To append new data to the crawled source, click Ingest Now from the second crawl onwards. The new data can be added in the same location; only new and changed files will be picked. If you need a clean crawl, click Initialize and Ingest Now.
- This mode should be used for cases where new inserts come in new files and there are no updates.
Configurations
- CSV_ERROR_THRESHHOLD: If the number of error records exceeds this threshold, the MR job fails. The default value is 100.
- CSV_KEEP_FILES: If the host type is local, the CSV files are copied to the tableId/csv directory before the MR job runs. If this configuration is true, the files are not deleted after the crawl. The default value is true.
- CSV_TYPE_DETECTION_ROW_COUNT: Number of rows to be read for type detection/metacrawl. Default is 100.
- CSV_PARSER_LIB: The underlying CSV parser library used by Infoworks. The default value is UNIVOCITY. It is recommended to use the default parser. In case of any issue, you can try setting the COMMONS, JACKSON, or OPENCSV parser. This configuration can be set at the source or table level.
- CSV_SPLIT_SIZE_MB: Split size (in MB) used by the MR job for every file. The default value is 128.
- dfi_job_map_mem: Mapper memory for the crawl MapReduce job. The default is the value of iw_jobs_default_mr_map_mem_mb in properties.
- dfi_job_red_mem: Reducer memory for the crawl MapReduce job. The default is the value of iw_jobs_default_mr_red_mem_mb in properties.
- calc_file_level_ing_metrics: If this is set to true, the file level ingestion metrics are calculated at the end of the job. Default is true.
- modified_time_as_cksum: If this is true, the modified time is used to determine if the file has been changed or not. If it is set to false, the actual checksum is calculated. Default is false.
- delete_table_query_enabled: By default, the Delete Query feature is available at the table level. Set the IW constant delete_table_query_enabled to false from the UI to hide the Delete Query feature.
- multiline.mode: Set this to true if the file has multiline records and all the columns are quoted/unquoted.
- CSV_NULL_STRING: The configuration to set the NULL string. This configuration is available at the table, source, and global levels. The default value is NULL.
- USE_GTE_FOR_CDC: This configuration controls how CDC records are fetched, based on the use case. It is applicable to tables with the Timestamp and BatchID sync types; see the sketch after this list.
- true: CDC records are fetched using the >= comparator. This is the default behaviour and should be used for merge use cases, to make sure that the data from the last batch is brought again. This covers scenarios where some data for the last batch or timestamp was still being populated in the source system when the ingestion job finished.
- false: CDC records are fetched using the > comparator. This behaviour should be used for append-mode scenarios, where the data for the last batch or timestamp in the source system is fully populated and you do not want the old data again.
NOTE: Data might be lost when the > comparator is used. If records with the same batch ID are being inserted into the source system while the ingestion job is running, all records inserted just after the job runs with that same batch ID will be missed in the next CDC job.
- IGNORE_EXTRA_COLUMNS_IF_ANY: Set this value to true to ignore the newly added columns (if any). The value must be set to true at the table or source level before running the job. The default value is false.
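The snippet below is a simplified sketch of the difference between the two USE_GTE_FOR_CDC comparators for a BatchID sync column; it is not Infoworks' internal CDC logic.

```python
# Simplified illustration of USE_GTE_FOR_CDC; not Infoworks' internal CDC logic.
records = [
    {"id": 1, "batch_id": 7},
    {"id": 2, "batch_id": 8},  # landed in the source after the previous job finished
    {"id": 3, "batch_id": 9},
]
last_synced_batch = 8

# USE_GTE_FOR_CDC=true (default): >= re-fetches batch 8, so late arrivals in the
# last batch are picked up and handled by the merge.
fetched_gte = [r for r in records if r["batch_id"] >= last_synced_batch]  # ids 2 and 3

# USE_GTE_FOR_CDC=false: > skips batch 8 entirely; suitable for append-only loads
# where the last batch was fully populated before the previous job ran.
fetched_gt = [r for r in records if r["batch_id"] > last_synced_batch]    # id 3 only
```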
Limitations
- Column order must be maintained; a new column always gets added at the end.
- If the data is not enclosed and contains the delimiter character, the delimiter must be escaped with the escape character (for example, with delimiter , and escape character \, the value a,b must appear in the file as a\,b).
- If the data is enclosed and contains the enclosing character, the enclosing character must be escaped.
Known Issues
- Infoworks-managed tables are not currently supported; that is, data can only be added to an existing Hive table.
- Since file-level processing adds data to an existing table, the Initialize and Ingest Now button should not be used.
Azure Blob Storage Ingestion
Following are the Azure Blob Storage configurations:
- In the Source Settings page Source Configuration section, select the host type as From Hadoop Cluster.
- Prefix the Source Base Path with the wasb protocol and container address as follows:
wasb://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/
The table path can then be specified as the path of the table folder in that container.
For example, consider the following:
- A Resource Group, R1, has a storage account, ASC, which in turn has a container, C.
- Inside the container, two files are available in the following folder structure: /a/b/x/file1.csv and /a/b/y/file2.csv.
- Because file1.csv and file2.csv have different schemas, two tables must be created.
You can provide the following as the source base path: wasb://C@asc.blob.core.windows.net/a/b
- The table path for table 1 will be x or x/file1.csv.
- The table path for table 2 will be y or y/file2.csv.

If the source container has the access type set to Private in Azure, the following configuration must be added in the custom core-site.xml file:
- Key: fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net
- Value: Container access key (available in the Access keys section of the Azure storage account).
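For instance, with the example storage account asc used above, the entry in the custom core-site.xml would look like the following sketch; the value shown is only a placeholder for the real container access key.

```xml
<property>
  <name>fs.azure.account.key.asc.blob.core.windows.net</name>
  <value>PLACEHOLDER_STORAGE_ACCOUNT_ACCESS_KEY</value>
</property>
```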
Fixed Width Ingestion
Fixed width ingestion is supported with structured files. Data in a fixed-width text file is arranged in rows and columns, with one entry per row. Each column has a fixed width, specified in terms of number of characters, which determines the maximum amount of data it can contain. No delimiters are used to separate the fields in the file. Instead, smaller quantities of data are padded with spaces to fill the allotted space.
Fixed width ingestion provides the following features:
- Schema Input
- Data Crawl
- Append Mode
- CDC and Merge
Prerequisites
To create a source table and map it to the fixed-width folder location, follow these steps:
- Create a new source.
- Navigate to the new source that you just created and click Click here to enter them.
- On the Settings page, select From FileSystem, From Hadoop Cluster, or From Remote Server, depending on where the files are currently stored.
NOTE: If you select From Remote Server, specify the SFTP User Name, Password, Host, and Port.
- Enter the source base path.
- Scroll down and click Add Entry to add a folder as a table.
- Configure the following fields:
Field | Description |
---|---|
Table | Table name. |
Source Path | Folder path of the table. This is relative to the source base path. |
Target HDFS Path | Target HDFS path. This is relative to the target base path. |
Include/Exclude Files From Directory | A regex pattern to include or skip files. |
Ingest sub-directories | Specifies whether to recursively crawl files in the sub-directories of the specified source path. |
Archive Source Files | This option is used to archive files for which the data has been crawled and inserted into Hive. The files will be archived on the edge node. NOTE: Truncate and Truncate Reload of the table will not affect archived files. For every data file, the corresponding control file is also archived. |
Processing Level | For fixed width ingestion, select Record. |
File Type | For fixed width ingestion, select Fixed-width. |
Number of Header Rows | Can be zero or greater. If it is greater than zero (say, n), the first line of the file is used to derive the column names and the next n-1 lines are skipped. |
Pad Character | The padding character used in the fixed-width file to fill smaller data values. |
Character Encoding | Character encoding. |
Control Files | A control file (a file with data file metadata) against which you can validate the data file. The regular expression fields Data Files pattern and Extract format let you specify the corresponding control file for every data file as a function of the data file path; for example, every file with a .csv extension can be treated as a data file and the same file path ending with a .ctl extension as its control file. Ensure that the data directory, after the include and exclude filters are applied, returns only data files and control files. During processing, the system first applies the include and exclude filters to every file; any file path that matches the Data Files pattern regex is assumed to be a data file, and its corresponding control file is derived using the configured regex patterns. Also ensure that control files do not match the Data Files pattern regex. Currently, the supported format for control files is the Java properties file format, which validates checksum, count, and file size. The validation logic and control file reading logic are pluggable, so additional validation variables and different control file parsers can be plugged in to extend this feature. |

- Click Save and Enter Schema.
Schema Crawl
For a given folder which contains files with similar structure, the system can detect the types of all columns.
To perform a schema input, follow these steps:
- Create tables as described in Prerequisites.
- Navigate to table configuration page.
- Click Edit Schema.
NOTE: Ensure that the columns you specify are in the same order as they appear in the files.
- Enter the number of columns in the Add field.
- Enter the details for each added column, including the column name, Start Position (any non-negative integer), and width; see the sketch after these steps for how the start position and width map onto a record.
Limitation: The Start Position of a new column must be greater than the Start Position of the previous column.

- Click Detect and Save Schema; the recommended datatypes are displayed. You can reconfigure the datatypes if required.
- Click Save Schema.
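The sketch below is only an illustration of how a start position and width describe a fixed-width record; Infoworks performs this parsing internally, and the snippet assumes 0-based start positions and a space pad character.

```python
# Illustration only: Infoworks parses fixed-width records internally.
# Assumes 0-based start positions and a space pad character.
schema = [
    ("customer_id", 0, 6),     # (column name, start position, width)
    ("customer_name", 6, 20),
    ("balance", 26, 10),
]

record = (
    "000042"                   # customer_id, width 6
    + "John Smith".ljust(20)   # customer_name, padded with spaces to width 20
    + "123.45".ljust(10)       # balance, padded with spaces to width 10
)

row = {
    name: record[start:start + width].strip()   # strip the pad character
    for name, start, width in schema
}
# {'customer_id': '000042', 'customer_name': 'John Smith', 'balance': '123.45'}
```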
Data Crawl Full Load
To perform a full load data crawl, follow these steps:
- Navigate to table configuration of the table.
- Enter the required configurations for full load.
- Click Save Configuration.
- Navigate to table group and click Initialize and Ingest Now.
NOTE: To append new data to the crawled source, click Ingest Now from the second crawl onwards. The new data can be placed in the same location; only new and changed files will be picked. If you need a clean crawl, click Initialize and Ingest Now. This mode should be used for cases where new inserts come in new files and there are no updates.
Data Crawl Incremental Load
To perform the initial full load of a CDC table, follow these steps:
- Navigate to the table configuration page.
- Enter the required configurations for incremental load.
- Click Save Configuration.
- Navigate to table group and click Initialize and Ingest Now.
NOTE: To get the new CDC data and merge it into the crawled source, click Ingest Now from the second crawl onwards. The new data can be added in the same location; only new and changed files will be picked. If you need a clean crawl, click Initialize and Ingest Now. This mode should be used for cases where new inserts and new updates are present for every crawl.
Configurations
- FIXED_WIDTH_ERROR_THRESHHOLD: If the number of error records exceeds this threshold, the MR job fails. The default value is 100.
- FIXED_WIDTH_KEEP_FILES: If the host type is local, the files are copied to the tableId/csv directory before the MR job runs. If this configuration is true, the files are not deleted after the crawl. The default value is true.
- FIXED_WIDTH_SPLIT_SIZE_MB: Split size to be used for MR for every file. Default is 128.
- fixed_width_job_map_mem: Mapper memory for the crawl MapReduce job. The default is the value of iw_jobs_default_mr_map_mem_mb in properties.
- fixed_width_job_red_mem: Reducer memory for the crawl MapReduce job. The default is the value of iw_jobs_default_mr_red_mem_mb in properties.
- calc_file_level_ing_metrics: If this is set to true, the file-level ingestion metrics are calculated at the end of the job. Default is true. This holds good for both CSV and Fixed Width.
- modified_time_as_cksum: If this is true, the modified time is used to determine whether the file has been changed. If it is set to false, the actual checksum is calculated. Default is false. This holds good for both CSV and Fixed Width.
- delete_table_query_enabled: By default, the Delete Query feature is available at the table level. To hide this feature, set the IW constant delete_table_query_enabled to false.
Extract, Transform, and Load Features
This section describes the Extract, Transform, and Load (ETL) features that include:
- File-level ingestion metrics
- Column extraction
- Delete functionality
- Archive Directory
- Control Files
File-Level Ingestion Metrics
This feature provides information on how many correct and error records were contributed by each file. A mandatory string column, named ZIW_FILENAME by default (the name can be changed), is appended to every row in the data. This column is used to run "group by" queries on the crawled data for file-level ingestion metrics; for example, a query such as SELECT ziw_filename, count(*) FROM <hive_table> GROUP BY ziw_filename returns the per-file record counts.
Column Extraction
This feature appends columns to each row, with values extracted by applying a "match regex" and an "extract regex" to the filename.
Follow these steps to make use of this feature:
- Click the Edit Schema tab in the Configuration page.
- Add columns and check Extract from Filename.
You can specify the regex to match and the format to extract. For example, for the file name Infoworks_001_2004-09-28, the columns will be populated with the following values (see the sketch after these steps):
- Extracted_Date: 2004-09-28
- Extracted_Int: 1
- Extracted_Str: Infoworks
- Extracted_without_format: 2004-09-28 (If a format is not specified, the text matching the regex becomes the value.)
- Click Save Schema.
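The snippet below is a rough sketch of the match-and-extract behaviour described above; the regular expressions and the date format are illustrative and are not the exact syntax Infoworks uses.

```python
import re
from datetime import datetime

filename = "Infoworks_001_2004-09-28"

# Illustrative match/extract patterns; not the exact syntax Infoworks uses.
extracted_str = re.search(r"^([A-Za-z]+)_", filename).group(1)           # 'Infoworks'
extracted_int = int(re.search(r"_(\d+)_", filename).group(1))            # 1 ('001' read as an integer)
matched_date = re.search(r"(\d{4}-\d{2}-\d{2})$", filename).group(1)     # '2004-09-28'

# With a format such as yyyy-MM-dd specified, the matched text is parsed as a date;
# without a format, the matched text itself becomes the column value.
extracted_date = datetime.strptime(matched_date, "%Y-%m-%d").date()      # 2004-09-28
extracted_without_format = matched_date                                  # '2004-09-28'
```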
Delete Functionality
The delete functionality enables you to delete all the rows from the current Hive table.
Follow these steps to make use of this functionality:
- Navigate to the Source Configuration.
- Click the Actions button for the table from which you want to delete the rows.
- Click Delete Records. The records will be deleted from the current table data.
Archive Directory
Refer to Archive source files for more details.
Control Files
Refer to Control Files for more details.
Mainframe Data File Ingestion
Mainframe data file ingestion provides the following features:
- Schema Crawl
- Data Crawl
- Append Mode
- CDC and Merge
Creating Source
In the Admin section (Admin > Source > New Source), create a source and select the source type as Structured Files. Enter the Hive schema name and HDFS location. Select the Enable ECB Agent option to enable the ECB agent.

Configuring Source
- Click the Sources menu and select the structured file source you created.
- In the Source Configuration page, click the Click here to enter them link.
- In the Settings page, perform the following:
- Select one of the following options, depending on where the data files are stored: From FileSystem, From Hadoop Cluster, From Remote Server (using SFTP), From Cloud Storage.

- Click Save Settings.
Schema Crawl
Following are the steps to perform a schema crawl:
- Click the Sources menu and select the source you created.
- Click the Source Settings icon.
- In the File Mapping section, click Add Entry to add a folder as a table.
- Configure the following table details:
- Table: Table name.
- Hive Table Name: Name of the Hive table that holds the crawled data.
- Source Path: Folder path of the table. This is relative to the source base path.
- Relative Target HDFS Path: Target HDFS path. This is relative to the target base path.
- Include/Exclude Files From Directory: Regex pattern to include or skip files.
- Ingest sub-directories: Specifies whether to recursively crawl files in the sub-directories of the specified source path.

- File Type: Type of structured file. Select Copybook.
- Path to Copybook Layout: Location of the Copybook layout file which defines schema for the table.
- File Dialect: The COBOL dialect used. The default value is Mainframe.
- File ORG: The format with which the records are organized in the files.
- Font of Layout: Font or character set.
- Cobol Splits: Option to split records when the data includes hierarchy.
- Click Save and Crawl Schema. The Edit Schema page is displayed.

- Edit the schema and click Save Schema.
Data Crawl Full Load
Following are the steps to perform a full load data crawl:
- Click the Sources menu and click the copybook source. The Tables page is displayed with the list of tables in the source.
- Click the Configure button for the table that requires a full load data crawl.
- Select the Ingest Type as Full Load, enter the required values and click Save Configuration. For descriptions of fields, see the Source Table Configuration Field Descriptions section.

- In the Tables page, click the Table Group tab.
- Click the View Table Group icon for the required table group.
- Navigate to table group and click Initialize and Ingest Now.
- For first-time ingestion, or if you need a clean crawl, click Initialize and Ingest Now.

- To append new data to the crawled source, click Ingest Now from the second crawl onwards. The new data can be placed in the same location; only new and changed files are picked.
Data Crawl Incremental Load
Following are the steps to perform incremental load data crawl on the copybook files:
- Click the Sources menu and click the copybook source. The Tables page is displayed with the list of tables in the source.
- Click the Configure button for the table that requires an incremental load data crawl.
- Select the required incremental load Ingest Type, enter the required values and click Save Configuration. For descriptions of fields, see the Source Table Configuration Field Descriptions section.

- In the Tables page, click the Table Group tab.
- Click the View Table Group icon for the required table group.
- Navigate to table group and click Initialize and Ingest Now.
- For first-time ingestion, or if you need a clean crawl, click Initialize and Ingest Now.
NOTE: This mode should be used for cases where new inserts come in new files and there are no updates.
- To get the new CDC data and merge it into the crawled source, click Ingest Now from the second crawl onwards. The new data can be placed in the same location; only new and changed files are picked.
NOTE: This mode should be used for cases where new inserts and new updates are present for every crawl.