Continuous Merge
Infoworks DataFoundry supports continuous merging of delta records with the base tables in the Hadoop or Cloud cluster. This allows you to continually ingest changed data at low latencies from sources, while still making the base and refreshed data available for downstream access.
Infoworks maintains an additional view for every synchronized Hive table created by Infoworks ingestion. For every such Hive table in the data lake, for example TableA, a view called realtime_TableA is created in the same Hive schema and contains the freshest data from the incremental ingestion process. This realtime view combines the CDC delta records with the merged data, so you can access the merged result before the actual merge occurs. The view query merges the CDC data and the full data at read time.
TableA, therefore, contains the last merged data and is suited to downstream applications that have a high read-performance requirement or do not have a stringent SLA on data freshness.
realtime_TableA contains the latest data from the CDC and is suitable for applications that have a stringent SLA on data freshness or need to access changed data without waiting for a merge process to complete.
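The read-time merge described above can be illustrated with a small in-memory sketch. This is not the view query Infoworks generates; the record layout (an `id` key, an `op` code of I/U/D, and a `seq` ordering number) and the function name `realtime_view` are assumptions made only for illustration.

```python
from typing import Dict, List


def realtime_view(base: List[dict], deltas: List[dict], key: str = "id") -> List[dict]:
    """Return the rows a reader would see if the deltas were already merged.

    Hypothetical layout: each delta has the key column, an "op" code
    ("I" insert, "U" update, "D" delete), a "seq" number, and the new "row".
    """
    merged: Dict[object, dict] = {row[key]: row for row in base}
    # Apply deltas in sequence order so the latest change per key wins.
    for change in sorted(deltas, key=lambda c: c["seq"]):
        if change["op"] == "D":
            merged.pop(change[key], None)
        else:
            merged[change[key]] = change["row"]
    return sorted(merged.values(), key=lambda r: r[key])


# Base table contents (last merged data) and pending CDC records.
base_rows = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Oslo"}]
delta_rows = [
    {"id": 2, "seq": 1, "op": "U", "row": {"id": 2, "city": "Bergen"}},
    {"id": 3, "seq": 2, "op": "I", "row": {"id": 3, "city": "Lima"}},
    {"id": 1, "seq": 3, "op": "D", "row": None},
]
view_rows = realtime_view(base_rows, delta_rows)
```

In this sketch the base rows are never modified, which mirrors the idea that the base table keeps serving the last merged data while the view resolves pending changes for each read.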
Locking
NOTE: The merged data is not reflected in TableA until a switch is performed. A switch is a process that replaces the old secondary partition folders with new secondary partition folders containing the merged and updated CDC data.
For a table, the switch is performed as follows:
- A lock is taken on the primary partition of the table (if the table is not primary partitioned, the lock is taken on the whole table).
- All the secondary partitions for this primary partition are switched.
- The lock is released.
- The corresponding CDC data is deleted from the CDC folder.
After the switch, the table X and the view X_realtime must give the same results for all queries. If you click the Ingest Now button, a CDC merge occurs, followed by a switch.
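The switch sequence listed above can be sketched on a local file system. This is only a minimal illustration under stated assumptions: the directory layout, the function name `switch_partition`, and the use of an in-process `threading.Lock` in place of a Hive lock are all hypothetical and do not reflect the product's internals.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def switch_partition(table_dir: Path, merged_dir: Path, cdc_dir: Path,
                     lock: threading.Lock) -> None:
    """Swap merged secondary partition folders into one primary partition."""
    with lock:  # take the lock; it is released when the block exits
        # Replace every old secondary partition folder with its merged copy.
        for new_part in sorted(merged_dir.iterdir()):
            old_part = table_dir / new_part.name
            if old_part.exists():
                shutil.rmtree(old_part)
            shutil.move(str(new_part), str(old_part))
    # After the lock is released, delete the CDC data that was merged.
    for leftover in sorted(cdc_dir.iterdir()):
        if leftover.is_dir():
            shutil.rmtree(leftover)
        else:
            leftover.unlink()


# Demo on a throwaway directory tree (all names are made up).
root = Path(tempfile.mkdtemp())
table_dir = root / "table" / "primary=2024"
merged_dir = root / "merged" / "primary=2024"
cdc_dir = root / "cdc"
for d in (table_dir / "bucket=0", merged_dir / "bucket=0", cdc_dir):
    d.mkdir(parents=True)
(table_dir / "bucket=0" / "data.txt").write_text("old")
(merged_dir / "bucket=0" / "data.txt").write_text("merged")
(cdc_dir / "delta.txt").write_text("change")

switch_partition(table_dir, merged_dir, cdc_dir, threading.Lock())
```

Holding the lock only around the folder swap keeps the window in which readers are blocked as short as possible, which is the motivation for scoping the lock to a primary partition.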
Near-Realtime Use Cases
If the use case is to always access the data in near-realtime, use the following approach:
Near-realtime use cases require the table to be updated with the latest data within minutes of the source update and to be available for querying most of the time. While the table is being switched after the merge, it is either locked (when locking is ON) or unsuitable for querying (when locking is OFF), so user queries might be blocked or fail. For such cases, you must query the realtime view, which is available with the latest data while CDC and merge are running on the table. The actual table is also available during CDC and merges, but its data is updated only after a switch.
For example, if merge and switch are scheduled for every midnight and CDC occurs every 15 minutes, then the realtime view will have new data every 15 minutes, while the actual table will be updated only at midnight. This strategy is useful when the merge takes more than a few minutes and the table needs to be available for continuous querying throughout the day.
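The arithmetic behind this example can be worked through in a few lines. The sketch below assumes that each job completes instantly at its scheduled time and that both schedules start at midnight; the dates and the helper name `last_run_before` are hypothetical.

```python
from datetime import datetime, timedelta


def last_run_before(now: datetime, first_run: datetime, interval: timedelta) -> datetime:
    """Return the most recent scheduled run at or before `now`."""
    completed = (now - first_run) // interval  # number of whole intervals elapsed
    return first_run + completed * interval


midnight = datetime(2024, 1, 10, 0, 0)
now = datetime(2024, 1, 10, 13, 50)

# CDC every 15 minutes, merge and switch once a day at midnight.
last_cdc = last_run_before(now, midnight, timedelta(minutes=15))
last_switch = last_run_before(now, midnight, timedelta(days=1))

view_lag = now - last_cdc      # how stale the realtime view is
table_lag = now - last_switch  # how stale the actual table is
```

At 13:50 the view is 5 minutes behind the last CDC run, while the actual table still reflects the state as of midnight, 13 hours and 50 minutes earlier.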
Hence, you can either run or schedule the Ingest Now job, which handles CDC, merge, and switch consecutively, or schedule the CDC, merge, and switch jobs independently of each other.
NOTE: The actual table does not show the new data until a switch has been executed on the table.
By default, locking is OFF. To turn locking ON for a table, ensure that locking is enabled in Hive and set the MERGE_LOCKING_ON configuration key to true (default is false). This configuration is available at the table, source, and admin levels.
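Because the key can be set at three levels, it may help to picture how an effective value could be resolved. The sketch below assumes that the most specific level wins (table, then source, then admin); that precedence is an assumption for illustration and is not stated in this document, and the function name `resolve_flag` is hypothetical.

```python
from typing import Mapping


def resolve_flag(key: str,
                 table_conf: Mapping[str, str],
                 source_conf: Mapping[str, str],
                 admin_conf: Mapping[str, str],
                 default: bool = False) -> bool:
    """Return the first value found, checking table, then source, then admin.

    ASSUMPTION: the most specific level overrides the others.
    """
    for level in (table_conf, source_conf, admin_conf):
        if key in level:
            return str(level[key]).strip().lower() == "true"
    return default


# No level sets the key, so the documented default (false) applies.
locking_default = resolve_flag("MERGE_LOCKING_ON", {}, {}, {})

# The table-level value takes effect even though admin says false.
locking_table = resolve_flag(
    "MERGE_LOCKING_ON",
    {"MERGE_LOCKING_ON": "true"},
    {},
    {"MERGE_LOCKING_ON": "false"},
)
```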
REST API Call for Switch Job
[POST] :/v1.1/source/table_group/ingest.json?table_group_id=xxxxx&ingestion_type=switch&auth_token=xxxx
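As a rough sketch, the same request can be assembled with the Python standard library. The base URL `http://infoworks-host.example` is a placeholder for your own host and port, and the helper name `build_switch_request` is hypothetical; the path and query parameters are taken from the call above. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_switch_request(base_url: str, table_group_id: str, auth_token: str) -> Request:
    """Build (but do not send) the POST request that triggers a switch job."""
    query = urlencode({
        "table_group_id": table_group_id,
        "ingestion_type": "switch",
        "auth_token": auth_token,
    })
    url = f"{base_url.rstrip('/')}/v1.1/source/table_group/ingest.json?{query}"
    return Request(url, method="POST")


# Placeholder host and the same placeholder values as in the call above.
req = build_switch_request("http://infoworks-host.example", "xxxxx", "xxxx")
```

To submit the job, the request object could be passed to `urllib.request.urlopen(req)` once real values for the host, table group ID, and token are supplied.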
Infoworks CLI command for Switch Job
/path/to/cli-script ingest --source --group --ingest-type switch --username --password
NOTE: Continuous Merge and realtime views are only applicable for incremental tables.