Unstructured File Ingestion
Unstructured File Ingestion (UFI) is an archival system for files in HDFS. You specify a location on the local filesystem or on an SFTP host, and the files are copied from that location to the required HDFS location.
Prerequisites
Follow these steps to create a source of the unstructured file type:
- In the Manage Sources page, click New Source.
- Enter the field values as follows:
- Source Name: The name for the source.
- Source Type: Select Unstructured Files as the source type.
- Hive Schema Name: The Hive schema is not currently used by the Infoworks product, but it will be used for cataloging later. Enter the value to use for the cataloging table.
- HDFS Location: The base location in HDFS to which the files are copied.
- Navigate to the Source you just created.
- Click the Settings icon.
- Enter the table configurations.
NOTE: To copy files from the local filesystem, provide the base path. To copy files from a remote filesystem over SFTP, configure the SFTP settings.
- Click Save Settings and scroll down to the File Mappings.
- Click Add Entry.
The fields and their configurations are as follows:
- Table: The table name to use for cataloging.
- Source Path: The relative path of the folder to copy from. The absolute path of the folder is the source base path (given earlier in the source configuration) + the source path (in this case: /var/log).
- Target HDFS Path: The relative path to copy the files to. The absolute path in HDFS is the target base path + the target HDFS path (in this case: /ufisource/logic).
- Exclude filename containing pattern: Files whose names match this regex pattern are skipped. Note: this is a Java-specific regular expression.
- Ingest Sub directories: Check this box to copy the subdirectories as well.
- Click Save Entry.
- To add more tables, click the Source Configuration icon. The table is added and appears on the Source Configuration page.
- Click Table Groups and click Add Table Group.
- Enter the required values and click Save Configuration.
- Click the View Table Group icon.
- If this is the first-time ingestion, click Initialize and Ingest Now.
- Initialize and Ingest Now performs a full load: the files are deleted from the destination folder and everything is copied again.
- Ingest Now ingests only the files that were modified after the last ingestion. If a file has been deleted on the source, the corresponding file is NOT deleted on the target system.
- Click the Ingestion Logs icon to see the ingestion job progress.
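The path and filter rules in the steps above can be sketched as follows. This is only an illustration under stated assumptions, not the product's implementation: the FileMapping class and the plan_copies and join_paths functions are hypothetical names, the split of /var/log into base /var and relative path log is assumed for the example, and Python's re module stands in for the Java regular expressions that the product expects (the two dialects differ in places).

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Iterable, List, Optional, Tuple


@dataclass
class FileMapping:
    """Hypothetical stand-in for one File Mappings entry."""
    source_base_path: str            # base path from the source settings
    source_path: str                 # relative to source_base_path
    target_base_path: str            # HDFS Location of the source
    target_hdfs_path: str            # relative to target_base_path
    exclude_pattern: Optional[str] = None
    ingest_subdirectories: bool = False


def join_paths(base: str, relative: str) -> PurePosixPath:
    # Strip a leading slash so the relative part is appended to the base
    # instead of replacing it.
    return PurePosixPath(base) / relative.lstrip("/")


def plan_copies(
    mapping: FileMapping,
    files: Iterable[Tuple[str, float]],
    last_ingested_at: Optional[float] = None,
) -> List[Tuple[str, str]]:
    """Return (source, target) pairs for the files that would be copied.

    files holds (path relative to the source folder, modification time).
    last_ingested_at=None models Initialize and Ingest Now (copy everything);
    a timestamp models Ingest Now (only files modified after that time).
    """
    source_root = join_paths(mapping.source_base_path, mapping.source_path)
    target_root = join_paths(mapping.target_base_path, mapping.target_hdfs_path)
    exclude = re.compile(mapping.exclude_pattern) if mapping.exclude_pattern else None

    plan = []
    for relative, modified_at in files:
        rel_path = PurePosixPath(relative)
        # Files below the top-level folder are copied only when
        # Ingest Sub directories is checked.
        if len(rel_path.parts) > 1 and not mapping.ingest_subdirectories:
            continue
        # Assumption: the pattern is tested against the file name only.
        if exclude and exclude.search(rel_path.name):
            continue
        if last_ingested_at is not None and modified_at <= last_ingested_at:
            continue
        plan.append((str(source_root / rel_path), str(target_root / rel_path)))
    return plan


mapping = FileMapping("/var", "log", "/ufisource", "logic", exclude_pattern=r"\.tmp$")
for source, target in plan_copies(mapping, [("app.log", 100.0), ("scratch.tmp", 100.0)]):
    print(source, "->", target)
```

Note that the sketch never deletes anything on the target, which mirrors the Ingest Now behavior described above for files removed from the source.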
UFI Feature Scope and Known Issues
Files with ":" in the name are currently not supported.
Configurations
Ufi_max_failure_percentage_per_table: The percentage of files for which ingestion must fail before the job is shown as failed.
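As a rough illustration of how such a threshold could be evaluated, the sketch below computes the failure percentage for one table and compares it with the configured value. The function name job_failed is hypothetical, and the source does not say whether the comparison is strict or inclusive; a strict "greater than" is assumed here.

```python
def job_failed(failed_files: int, total_files: int, max_failure_percentage: float) -> bool:
    """Return True when the share of failed files exceeds the threshold."""
    if total_files == 0:
        # Nothing was attempted, so nothing can have failed.
        return False
    failure_percentage = 100.0 * failed_files / total_files
    # Assumption: the job is marked failed only when the threshold is exceeded.
    return failure_percentage > max_failure_percentage
```

With a threshold of 5, for example, a table with 3 failed files out of 100 would still count as successful under this assumption, while 6 failed files out of 100 would not.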
Error Handling for File Ingestion
For DFI, JSON and XML ingestion, a folder named error_files is created in the // folder to store the error records for the job. These records are saved in the same format as the initial ingestion.
The files are compressed by default and the file name will be in the timestamp_jobId_filename_records.gz format.
Incremental Append Mode
For full-load ingestion of files, an incremental append mode is provided. It fetches only the files that are new in the source, each in its entirety, and not CDC data from each file. To enable this feature, select the Incremental Append Mode checkbox under the Full Load ingest type.
NOTE: Switching the incremental ingestion from Append to Merge mode might result in some missing records that were previously ingested. It is therefore strongly recommended that users run Initialize and Ingest to perform a full load immediately after switching the modes.
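The file-level behavior of Incremental Append Mode can be pictured with the small sketch below. It assumes that ingested files are tracked by name, which the source does not confirm, and select_new_files is a hypothetical helper rather than a product API.

```python
from typing import Iterable, List


def select_new_files(source_files: Iterable[str], ingested_files: Iterable[str]) -> List[str]:
    """Return the source files that have not been ingested yet, in source order."""
    already_ingested = set(ingested_files)
    # Whole files are selected; no per-record change detection takes place.
    return [name for name in source_files if name not in already_ingested]
```

Under this assumption, a file that was already ingested is not read again, even if its contents have changed since.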
