SQL Server Log-Based Incremental Ingestion

Through MSSQL Server Log Based method, data is ingested into Hive and continuously synchronized with the database using captured CDC tables provided by MSSQL Server. But this method works only with the enterprise editions of MSSQL Server. In this method, change data capture must be enabled for the incremental tables. Before enabling for a table, it must be enabled for the schema in which the table is present.

Infoworks DataFoundry uses SQL Server transactional replication to capture change data from SQL server distribution databases. For CDC to work, you must enable SQL Server Replication on the system from which change data is to be captured. If the database has a high volume of change activity, a distributed server must be used as the host of the distribution database.

To configure CDC in Infoworks DataFoundry, you must define a capture registration for each source table. In the capture registration, you can select a subset of columns for which the data is to be captured.

Infoworks DataFoundry supports ActiveDirectoryPassword with ms-sql.jar.

Following is a sample connection string with AD login:

jdbc:sqlserver://testingclsfss.database.windows.net:1433;database=testingcluster;encrypt=true;trustServerCertificate=false;hostNameInCertificate=.database.windows.net;loginTimeout=30;authentication=ActiveDirectoryPassword.

Planning SQL Server CDC

Before configuring SQL server CDC, ensure that the following prerequisites and user authority requirements are available. Also, review the restrictions to configure CDC correctly.

Prerequisites

Infoworks CDC has the following SQL Server prerequisites:

  • Infoworks CDC requires an edition of Microsoft SQL Server 2008 or later that supports transactional replication. You must configure and enable transactional replication on the source system to perform CDC.
  • Infoworks uses SQL Server 2008 Server Management Objects (SMO) to manage SQL Server publications. SMO is a .NET Framework object model used by applications to manage and administer SQL Server objects and services. You must install SQL Server 2008 SMO and related packages before you create or edit capture registrations. For your convenience, Infoworks provides the installation packages for these SQL server objects.
  • The Microsoft SQL Server Agent and Log Reader Agent must be running on the Windows machine from which change data is extracted. Usually, the SQL Server Agent remains running after it is initially started.
  • Each source table in the distribution database must have a primary key.
  • If the Infoworks product is installed on a Windows machine that is different from the one where the Microsoft SQL Server source runs, you must install the SQL server client components on the Infoworks machine.
  • Ensure that the Infoworks product runs on the Windows system that contains the Microsoft SQL Server source or on another Windows system.

User Authority for SQL Server CDC

Ensure that proper authority is available to complete registration and SQL server configuration tasks.

Following are the authorities required:

  • System Administrator authority or SA, to enable transactional replication on the publication database.
  • DB_OWNER authority to create registration groups and capture registrations from Infoworks.
  • Read access to the database to run change data extractions against the SQL Server distribution database.

SQL Server Datatypes Supported for CDC

Infoworks supports most SQL server datatypes for CDC, with some exceptions.

The following table indicates the SQL Server datatypes supported for CDC by Infoworks:

SQL DatatypesSupportedComments
bigintY-
binaryY-
bitY-
charY-
dateY-
datetimeY-
datetime2Y-
datetimeoffsetYInfoworks treats this datatype as varchar.
decimalY-
floatY-
geographyN-
geometryN-
hierarchyidN-
image*NUse varbinary(MAX) instead.
intY-
moneyY-
ncharY-
numericYUse nvarchar(MAX) instead.
nvarcharY-
realY-
smalldatetimeY-
smallintY-
smallmoneyY-
sql_variantNInfoworks does not capture change data for sql_variant columns but does capture change data for other columns in the same table.
text*YUse varchar(MAX) instead.
timeY-
timestampY-
tinyintY-
uniqueidentifierYInfoworks imports the uniqueidentifier datatype as a varchar datatype of 38 characters.
user-defined datatypes (UDTs)YInfoworks treats a UDT in the same way as the datatype on which the UDT is based.
varbinaryY-
varcharY-
xmlYInfoworks treats this datatype as varchar(MAX).

*Infoworks might not capture change data for columns that includes image, ntext, or text datatype, due to SQL server transactional replication restrictions on these column types. Instead, use the alternative datatypes that Microsoft recommends, as indicated in the Comments column.

SQL Server CDC Operational Considerations

Following are the operational considerations for SQL server CDC:

  • Infoworks can capture change data from SQL Server distribution databases for which Transparent Data Encryption (TDE) is enabled. No special configuration tasks are required.
  • Infoworks does not capture change data for SQL Server system tables.
  • The maximum length of a row for which Infoworks can capture and process change data is 128,000 bytes.
  • Infoworks does not capture the user ID that is associated with the original transaction that updated the database.
  • The timestamp that Infoworks records for each captured change indicates when the change was captured, not when the original transaction occurred.
  • Infoworks does not capture change data for derived columns that are not persisted. SQL Server computes values for these columns at run-time based on an expression but does not store the values in a table.
  • SQL server publishes deferred updates to SQL Server tables as DELETEs followed by INSERTs rather than as UPDATEs. Consequently, Infoworks propagates deferred updates as DELETEs followed by INSERTs, even if you select AI for the Image Type attribute in the CDC connection. Infoworks does not include Before Image (BI) and Change Indicator (CI) information in DELETE and INSERT records. To capture a deferred update as an UPDATE for business reasons, set the SQL Server 8207 trace flag. This flag causes the SQL Server Replication Log Reader to combine the DELETE and INSERT pair into a single UPDATE. For more details on SQL Server processing of deferred updates and the SQL Server 8207 trace flag, see the SQL Server documentation.
  • Infoworks does not support the use of local aliases when connecting to SQL server and creating publications at registration creation.
  • If you need to switch the status of multiple SQL Server capture registrations from active to inactive or from inactive to active, use the DTLUCBRG utility with the MSSOPTS UPDATESTATUS parameter. This optional parameter enables you to switch the status of many registrations in one operation and regenerate the associated SQL server publications.
  • You can run Infoworks on a Windows system other than the Windows system where the SQL Server source distribution database runs. On the Infoworks system, you must install the Microsoft SQL Server Native Client and define an MSQL CAPI CONNECTION statement that provides connection information for the SQL Server source distribution database in the dbmover configuration file. Also, in the dbmover configuration files on the Infoworks Service machine, define a NODE statement that points to the Infoworks system.
  • During SQL Server log-based incremental ingestion, if records have same LSN (Last Sequence Number) value in the SQL Server CDC table, the order of transactions within the same commit will be stored in the ZIW_SEQVAL audit column. This column values include the ROW_NUMBER order by SEQVAL column from the SQL Server CDC table.

Configuring SQL Server for CDC

You must perform a few configuration tasks to prepare SQL Server for Infoworks CDC.

If the SQL Server tables have a high level of update activity, use a distributed server as the host of the distribution database from which to capture change data. This practice prevents competition between Infoworks CDC and your production database for CPU use and disk storage.

Infoworks DataFoundry uses SQL Server Management Objects (SMO) to manage SQL Server publications. SMO is a .NET Framework object model that applications can use to manage and administer SQL Server objects and services. You must install SMO and some related packages before you create or edit capture registrations.

Infoworks DataFoundry provides the 64-bit installation packages for the required SMO objects and packages for your convenience.

  • On the system where you run the Infoworks utilities, install the following SQL Server 2008 SMO objects and packages by using the installer .msi files.
  • Microsoft SQL Server 2008 Management Objects
  • Microsoft SQL Server 2008 Replication Management Objects
  • Microsoft SQL Server 2008 Native Client
  • Microsoft SQL Server System CLR Types

NOTE: You must install the Microsoft SQL Server 2008 Replication Management Objects even if you use SQL Server 2012 or SQL Server 2014.

  • Start the SQL Server Agent and Log Reader Agent if they are not running. For more details, see your Microsoft SQL Server documentation.
  • Configure and enable SQL Server transactional replication on the publication database. For more details, see your Microsoft SQL Server documentation.
  • Verify that each source table in the distribution database has a primary key.

Enabling Logs in MSSQL Server

Logs in MSSQL server is enabled using the CDC program of the SQL Server. This program is bundled with Enterprise, Developer, and Evaluation editions of MSSQL Server installations. Once logs are enabled for a table, the corresponding table changes will be stored in a capture table instance in a default database called CDC. The capture table name can be the same as source table name (step 2 will be executed in enabling logging) or user can configure this while enabling logging for the table (step 3 will be executed while enabling logging).

Following is a two-step process {[1,2] or [1,3]} to enable logging. In these scripts, to execute the commands, we assume that there is a database called automation_db and a table timestamps_inserts_updates in that database.

  • Enable CDC for a database automation_db. This is basically executing CDC program on the database:
Copy
  • Enable CDC for table timestamp_inserts_updates with default capture instance name.

NOTE: If you do not specify the name of the capture instance, the capture instance name will be same as source table name in the CDC schema.

Copy
  • Enable CDC for table with user-given captured instance name.

NOTE: If you specify the name of the capture instance while enabling logging for the table, the source table name is timestamp_inserts_updates logging the CDC table name as name_of_capture_instance.

Copy

Once the logging is enabled for a table, the metadata of the corresponding captured table will be stored in the cdc.change_tables table. You can use the following query to verify list of captured tables:

SQL
Copy

Following figure shows the sample output of the CDC-enabled tables:

Type to search, ESC to discard
Type to search, ESC to discard
Type to search, ESC to discard