Data Reconciliation

Infoworks Data Reconciliation is a process to validate ingested data into Hadoop/Hive, against the data in source systems. Data reconciliation will be supported only for RDBMS databases. This solution is applicable for all types of data synchronization mechanisms (Timestamp based, Query based, Log Based, BatchID based) in RDBMS ingestion.

Prerequisite

Ensure that the table includes a date/timestamp column on which data will be filtered for reconciliation.

NOTE: You can configure reconciliation for tables in RDBMS Sources. Reconciliation configuration can be set at the table configuration page.

Procedure

To perform a data reconciliation, follow these steps:

  • Navigate to the Source which contains the table to be reconciled.
  • Locate the table and click Configurations icon next to the table.
  • Click Modify Configuration. The following page opens. Figure: Source Configuration Page
  • Click Data Reconciliation Configuration. The following page opens. Figure: Data Reconciliation Page
  • Click Add Query. The following page displays.

The Type drop-down has two options:

  • Type: Used to define type of query (Row Count or Aggregation).
  • Reconcile On: Used to define the date/timestamp column which will be used for filtering data for reconciliation.
  • Reconciliation Interval: Interval for which reconciliation needs to be done. The start and end timestamps of the intervals will be compared against Reconcile On column, which you configured. The supported Reconciliation Intervals are:
  • Daily: Reconciliation happens daily. The exact interval of daily can be computed from the configuration of Compute Interval From
  • If you select Server Timestamp, then the interval will be computed from the current server timestamp and the reconciliation will be done for the previous day's ingested data.
  • If you select Reconciliation Column, then the interval will be computed from the MAX (Reconciliation Column) and the reconciliation will be done for the previous day of MAX ( Reconciliation Column) ingested data.
  • Last Month: Reconciliation for all records for the last calendar month.
  • Last Week: Reconciliation for all records in the last completed week. Week starts on Sunday and ends on Saturday.
  • This Month: Reconciliation for all records in current month. The exact interval of this month can be computed from the configuration of Compute Interval From
  • If you select Server Timestamp, then the reconciliation will be done from the starting of the current calendar month till the current server timestamp.
  • If you select Reconciliation Column, the reconciliation will be done for the starting of the current calendar month till MAX ( Reconciliation Column) ingested data.
  • This Week: Reconciliation for all records in current week. The exact interval of this week can be computed from the configuration of Compute Interval From
  • If you select Server Timestamp, then the reconciliation will be done from the starting of the current week till the current server timestamp.
  • If you select Reconciliation Column, the reconciliation will be done for the starting of the current week till MAX (Reconciliation Column) ingested data.
  • Fixed: Reconciliation for all records in time interval. You must define From Date and To Date
  • All Dates: Reconciliation will be done for all the ingested data. In this case, end timestamp will be computed using max (Reconciliation Column) on hive and perform the reconciliation with this value on source and target.
  • Aggregation Function: Computes aggregation of data on source and target by configuring the aggregate column and group by
  • Aggregation Column: Column on which aggregation function can be applied.
  • Group By Column: Columns on which group by clause can be applied in the aggregation query.
  • Enable Where: Enable this check box to define a where condition to filter the data on source and target while performing data reconciliation. A sample where condition is displayed in the UI.

WARNING: You must not enter the where construct but only the actual condition.

NOTE: You can add or remove the queries for reconciliation.

  • Click Save to save the configuration.

WARNING: Table must have date/timestamp column to configure reconciliations. Else, the error message, Date/Timestamp column is mandatory for table data reconciliation, **is displayed.

If there is a change in table columns on which you have configured reconciliation, you must manually update the configuration from Data Reconciliation Configuration page.

Querying Data Reconciliation Status

For every reconciliation configured table, after the data reconciliation is done, the reconciliation status is stored in hive table level. It is created in the same schema wherever the actual data table is created. Below is a screenshot for one such table. The table name for the reconciliation status table is <table_name>_reconciliation.

The fields of the table are as described below:

  • job_id: The Infoworks job ID which runs the reconciliation. It is a string type column.
  • table_name: Name of the table on which reconciliation is done. It is a string type column.
  • reconcile_on: Column name on which the data reconciliation is done for this table. It is a string type column.
  • query_index: Serial number of the query that you have defined in the reconciliation configuration. It is an int type column.
  • query_executed_on: At what time of the server time that the query is executed. It is a timestamp type column.
  • low_water_mark: The start timestamp for the reconciliation query. It is a timestamp type column.
  • high_water_mark: The end timestamp for the reconciliation query. It is a timestamp type column.
  • recon_status: Whether the reconciliation status is successful or not. It is a boolean type column.

Triggering Reconciliation Job

Reconciliation job can be triggered in the following two ways:

  • Using UI Hook: This hook is available in the maintenance options of a table group, as shown below.
  • Using a scheduler: You can schedule a reconciliation job. This feature is very useful for periodic reconciliation, which reconcile for last day, last 7 days and last 30 days You can schedule reconcile job at a table group level, as shown below.

Running Reconciliation Job

Once the reconciliation job is triggered, the queries and corresponding results can be seen in the summary logs of the ingestion logs page, as shown below. Figure: Running Reconciliation Job

Generating Reconciliation Metrics

There are two types of reconciliation metrics that you can generate:

  • Table Reconciliation Metrics
  • Table Group Reconciliation Metrics

Table Reconciliation Metrics

Following are the steps to, generate the table reconciliation metrics of a given source:

  • Click the Sources menu and select the required source.
  • Click Ingestion Reports icon.
  • In the Report Type drop-down, select Table Reconciliation Metrics.
  • From the Table drop-down, select the required table.
  • Choose the Submitted After and Submitted Before and click Apply Filters. The Table Reconciliation Report is displayed.

The following table lists the columns in the Table Reconciliation Report and their descriptions.

ColumnDescription
Job IDUnique identifier for a job.
Table GroupThe table group name in which the required table is present.
Query TypeIdentifies whether it is a Row Count or Aggregation query type.
Query IndexOrder of execution of query in a table.
Source QueryQuery that runs on RDBMS or Source system.
Target QueryQuery that runs on Hive.
Query Recon StatusThe status of each reconciliation query configured by the user.
Source Query DurationDuration of source query execution.
Target Query DurationDuration of target query execution.
Source CountsNumber of rows in the source table.
Target CountsNumber of rows in the target table.
Submitted AtThe time and date when the job was submitted.
Low WatermarkThe time of the first build.
High WatermarkThe time at which the last incremental load completed.

The Last Reconciliation Status of the table is also maintained which includes the reconciliation of all user-configured queries.

Table Group Reconciliation Metrics

The table group reconciliation metrics shows the details for a selected table group. The columns in the report are same as those of table reconciliation metrics.

The Last Reconciliation Status of the tablegroup is also maintained which includes the reconciliation of all user-configured queries.

Type to search, ESC to discard
Type to search, ESC to discard
Type to search, ESC to discard