Introduction

Data ingestion is the process of obtaining data from sources in various formats and moving it onto Hadoop/Hive, where it can be stored and analyzed further. Ingestion is the first step in performing data preparation and analytics with Infoworks.

Data can be streamed in real time or ingested in batches. Infoworks supports loading an entire large source data set at once and then loading the incremental changes to that source data.

Ingestion Data Source Types

Ingestion is classified based on the data source type as follows:

RDBMS Ingestion

Teradata
Oracle
MySQL
MariaDB
SQL Server
DB2
Netezza
SAP HANA
Hive
Sybase IQ
Apache Ignite
Vertica

NoSQL Ingestion

MapR-DB Ingestion

CRM Ingestion

Salesforce Ingestion

File Ingestion

Structured File Ingestion - Delimited File Ingestion, Fixed Width Ingestion, Mainframe Data File Ingestion
JSON Ingestion
XML Ingestion
Unstructured File Ingestion

Ingestion Sync Types

Ingestion is classified based on the sync type as follows:

Full Ingestion - fetches the complete data every time the ingestion job is run.
Incremental Load Ingestion - fetches the complete data only in the first run, and in subsequent runs, fetches only the changed data.

Segmented Ingestion loads data in segments defined by the values of a column. It can be used with both full load and incremental load.
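As an illustration of the idea only, the following minimal Python sketch groups rows into segments by the value of one column. The function name split_into_segments, the sample rows and the region column are hypothetical and are not part of Infoworks.

```python
from collections import defaultdict


def split_into_segments(rows, segment_column):
    """Group rows by the value of segment_column.

    Hypothetical helper: each resulting group stands for one segment
    that could be loaded on its own.
    """
    segments = defaultdict(list)
    for row in rows:
        segments[row[segment_column]].append(row)
    return dict(segments)


# Made-up sample data; "region" plays the role of the segment column.
rows = [
    {"id": 1, "region": "EU"},
    {"id": 2, "region": "US"},
    {"id": 3, "region": "EU"},
]
segments = split_into_segments(rows, "region")
for value, segment_rows in segments.items():
    print(value, len(segment_rows))
```

In this sketch each distinct value of the column produces one segment, so a load job can process the segments one at a time instead of the whole table at once.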

Full Ingestion

NOTE: Tables or sources that are fully ingested are always truncated and reloaded on the target.
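The truncate-and-reload behavior can be pictured with a short Python sketch. This is a simplified model that assumes the target is an in-memory list; full_load and the sample rows are hypothetical and do not represent the Infoworks implementation.

```python
def full_load(target, source_rows):
    """Replace the target contents with a fresh copy of the source.

    Hypothetical model of a full load: the target is emptied (truncate)
    and then refilled with every source row (reload).
    """
    target.clear()
    target.extend(dict(row) for row in source_rows)
    return target


# Made-up data: the target holds a stale row that no longer exists in the source.
target = [{"id": 1, "name": "stale"}, {"id": 99, "name": "deleted at source"}]
source = [{"id": 1, "name": "current"}, {"id": 2, "name": "new"}]

full_load(target, source)
print(target)
```

Because the target is emptied first, rows that were deleted at the source do not survive a full load.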

Following are the steps to perform full ingestion for a table:

Click the Sources menu and click the required source.
Click the Configure button for the required table.
On the Configuration page, set the Ingest Type to Full Load, and enter the required values.

Incremental Ingestion

When the incremental ingestion sync type is selected and ingestion is performed for the first time, the entire data set is crawled. In subsequent ingestions, only the records that have been inserted or updated are crawled.
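The first-run and later-run behavior can be sketched in Python for the timestamp-based case. This is a minimal illustration that assumes each record carries a last-modified timestamp; incremental_fetch, the updated_at column and the sample rows are hypothetical names, not Infoworks APIs.

```python
from datetime import datetime


def incremental_fetch(source_rows, last_watermark, ts_column="updated_at"):
    """Return the rows changed since last_watermark and the new watermark.

    Hypothetical helper: when last_watermark is None (first run),
    every row is returned.
    """
    if last_watermark is None:
        changed = list(source_rows)
    else:
        changed = [r for r in source_rows if r[ts_column] > last_watermark]
    # Keep the old watermark when nothing changed.
    new_watermark = max((r[ts_column] for r in changed), default=last_watermark)
    return changed, new_watermark


source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
]

# First run: no watermark yet, so the entire data set is fetched.
first_batch, watermark = incremental_fetch(source, None)

# A record is inserted at the source, then the job runs again.
source.append({"id": 3, "updated_at": datetime(2024, 1, 3)})
second_batch, watermark = incremental_fetch(source, watermark)

print(len(first_batch), len(second_batch))
```

The watermark stored after each run is what limits the next run to new or updated records.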

Incremental load ingestion includes the following:

Timestamp-Based Incremental Ingestion
Query-Based Incremental Ingestion
Batch ID Based Incremental Ingestion
The Oracle, SQL Server, DB2 and Sybase databases additionally support Log-Based Incremental Ingestion.
The Oracle database additionally supports OGG-Based Incremental Ingestion.

Following are the steps to perform incremental ingestion for a table:

Click the Sources menu and click the required source.
Click the Configure button for the required table.
On the Configuration page, set the Ingest Type to one of the following, and enter the required values: Timestamp-Based Incremental Load, Query Based SCD1 Incremental Load, Query Based SCD2 Incremental Load, Batch ID Based Incremental Load, or Log-Based Incremental Load.

WARNING: Switching incremental ingestion from Append mode to Merge mode might cause some previously ingested records to go missing. It is therefore strongly recommended to perform Initialize and Ingest, followed by a full load, immediately after switching modes.
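For readers unfamiliar with the two modes, the Python sketch below contrasts append and merge semantics in general terms. It assumes that append adds every incoming record as a new row and that merge replaces existing rows that share a key; the function names, the id key and the data are hypothetical, and the sketch does not model how Infoworks stores data or why records can go missing after a switch.

```python
def apply_append(target, batch):
    """Append semantics (assumed): every incoming record becomes a new row."""
    return target + list(batch)


def apply_merge(target, batch, key="id"):
    """Merge semantics (assumed): an incoming record replaces the row with the same key."""
    by_key = {row[key]: row for row in target}
    by_key.update({row[key]: row for row in batch})
    return list(by_key.values())


# Made-up data: one existing row, then a batch with one update and one insert.
target = [{"id": 1, "name": "old"}]
batch = [{"id": 1, "name": "new"}, {"id": 2, "name": "added"}]

appended = apply_append(target, batch)
merged = apply_merge(target, batch)
print(len(appended), len(merged))
```

Under these assumptions the appended result keeps both versions of the record with id 1, while the merged result keeps only the latest one, which is why the two modes leave the target in different states.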
