DB2 Log-based Incremental Ingestion Architecture

This document describes the design and implementation of DB2 log-based incremental ingestion.

Infoworks DataFoundry ingests data into Hive and keeps it synchronized with the database using change data tables provided by the IBM SQL Replication process. This design supports DB2 incremental ingestion on both, stand alone server and mainframe z/os.

For details on enabling SQL replication for DB2 tables, see How to Enable SQL Replication for DB2.

Once SQL replication is set for a table in DB2, the following two tables are created: change data table and target table.

Both the tables use different table space in the database. Only change data table is required for Infoworks DataFoundry to capture the CDC records. Target table is used to maintain the full view of the table where all versions of the record is maintained. If target table is not required, it can be dropped once the SQL replication is set.

DB2 Log-based Incremental Ingestion Architecture

  • Source Table: The table for which the user configures the DB2 logging.
  • DB2 log: The log file where all the DML transaction statements are stored. These log files can either be read by the built-in DB2 capture program (DB2 replication service) or by third party tools (Infosphere or Attunity CDC agent). This design is limited to reading logs only through capture program.
  • Capture Program: Once the capture program starts on the DB2 server (stand alone or mainframe z/os), the DB2 logs are read, the change data is added in the Change data table and the commit information is added in the UoW (Unit of Work) table.
  • Change data table: This table contains all the changes (inserts, updates and deletes) of the table. This table includes the offset associated for each commit of the DML (Data Manipulation Language) statement along with the original DML statement with all the columns.

Sample Table

  • UoW table: This table contains information about each commit on the source table. This information includes commit timestamp and offset associated with each commit transaction.

Sample Table

Reading CDC from Change Data Tables

By default, the change data table (table delta) is created for each table in <change_schema>.<change_tableName>.

Change data is read by joining the UoW table and Change data table to get the DML along with the commit timestamp. This timestamp is used to get the next CDC start time.

Sample CDC Query

SQL
Copy

Sample Output

Type to search, ESC to discard
Type to search, ESC to discard
Type to search, ESC to discard