BigQuery Ingestion

NOTE: Ensure that the guava-18.0.jar file is copied from the $IW_HOME/lib/df folder to the /opt/infoworks/lib/extras/ingestion folder for the HDP2 environment.

Creating BigQuery Source

For creating a BigQuery source, see Creating Source. Ensure that the Source Type selected is BigQuery.

Configuring BigQuery Source

For configuring a BigQuery source, see Configuring Source.

BigQuery Configurations

Project ID: The ID of the project in the source BigQuery.

Service Account Type: Default Service Account can be used for Dataproc and GCP machines, while Custom Service Account can be used for other environments.

OAuth Service Account Email: The email ID of the account used to authenticate to BigQuery. This field is available only if the Service Account Type is Custom.

Service Account JSON Credentials File: The location of the JSON credentials file on the Infoworks edge node. This field is available only if the Service Account Type is Custom. NOTE: The service account JSON file must be available on the edge node and on all data nodes.

Data Set: The BigQuery dataset from which the data must be ingested.

Dataset Location: The location/geographic area of the source BigQuery.
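Before saving the settings, it can be useful to sanity-check the service account JSON credentials file referenced above. The sketch below validates the keys found in a standard GCP service account key file; the helper name is illustrative and not part of Infoworks.

```python
import json

# Keys present in a standard GCP service account key file; a quick sanity
# check before pointing the source configuration at the credentials path.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}

def validate_service_account_json(text):
    """Return a list of problems found in a service account JSON document."""
    data = json.loads(text)
    problems = sorted(REQUIRED_KEYS - data.keys())
    if data.get("type") != "service_account":
        problems.append('type != "service_account"')
    return problems

# Hypothetical sample key file content for illustration.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...",
    "client_email": "ingest@my-project.iam.gserviceaccount.com",
})
print(validate_service_account_json(sample))  # []
```

An empty list means the file has the expected shape; actual authentication is still verified by the Test Connection step.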

Once the settings are saved, you can test the connection or navigate to the Source Configuration page to crawl the metadata.

Crawling BigQuery Source Metadata

For crawling a BigQuery source, see Crawling Metadata.

Configuring BigQuery Source for Ingestion

For configuring a BigQuery source for ingestion, see Configuring Source for Ingestion.

Ingest Type: The type of synchronization to be performed on the table. The options include Full Load and Timestamp-Based Incremental Ingestion.

Segmented Load Status: The option to crawl large tables by breaking them into smaller chunks that can be crawled in parallel. Select the column to Perform Segmented Load On. To derive a column, check the Use substring from field option and select the Extract function. After the segmented load completes, ensure that this value is set to Completed before submitting Incremental Load or Full Load jobs.

NOTE: For date, datetime, and timestamp column types, you can extract new columns. The available extraction options are year, month, year-month, month-day, and day-num-in-month.
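The extraction options listed above can be sketched as simple functions over a timestamp value. This is an illustrative approximation, not Infoworks' internal implementation; the output formats shown are assumptions.

```python
from datetime import datetime

# Sketch of the derived-column extraction options applied to a timestamp
# value (function and format choices here are illustrative assumptions).
def extract(ts, option):
    return {
        "year": f"{ts.year:04d}",
        "month": f"{ts.month:02d}",
        "year-month": f"{ts.year:04d}-{ts.month:02d}",
        "month-day": f"{ts.month:02d}-{ts.day:02d}",
        "day-num-in-month": str(ts.day),
    }[option]

ts = datetime(2018, 7, 4, 12, 30)
print(extract(ts, "year-month"))         # 2018-07
print(extract(ts, "day-num-in-month"))   # 4
```

Deriving a coarser column such as year-month lets many distinct timestamps fall into one segment, which is what makes segmented (chunked) crawling practical.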

Source Configuration
Natural Keys: The key that identifies a row uniquely. This key is used to identify and merge incremental data with the existing data on the target. Distribution of data into secondary partitions for a table on the target is computed from the hashcode value of the natural key.

This field is mandatory for incremental ingestion tables.

NOTE: The value of a natural key column cannot be updated when a row is updated on the source; all components of the natural key are immutable.
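The hash-based distribution into secondary partitions described above can be sketched as follows. The hash function and key-joining scheme here are illustrative assumptions; Infoworks' actual implementation may differ.

```python
import zlib

# Illustrative sketch: assigning a row to one of N secondary partitions
# from the hash of its natural key (hash choice here is an assumption).
def secondary_partition(natural_key_values, num_partitions):
    key = "|".join(str(v) for v in natural_key_values)
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical row with a two-column natural key.
row = {"customer_id": 42, "order_id": 1001}
part = secondary_partition([row["customer_id"], row["order_id"]], 8)
print(part)  # a stable value in 0..7
```

Because the natural key is immutable, the same row always hashes to the same partition, so an incremental update lands in the partition that already holds the row it must merge with.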

Number of Split Partitions: The option to crawl the table in parallel using multiple connections to the database.

Synchronization Configuration

Timestamp-Based Incremental Load
Timestamp Column for Update: The source column to be used as the update watermark. To change the column, a truncate reload must be performed.

NOTE: This column can also be used as the insert watermark.

Target Configuration

Hive Table Name: The name of the Hive table that will be used to access the ingested data.

Storage Format: The format of the data files stored in HDFS. The options include ORC and Parquet.
Partition Hive Table On: The option to partition the data on the target. The partition column can also be derived from a date, datetime, or timestamp column, which further partitions the data. A hierarchy of partitions is supported, with both normal partitions and derived partitions.

NOTE: Ensure that the partition column data is immutable. You can also provide a combination of normal and derived partitions in the hierarchy.
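A partition hierarchy that mixes a normal partition with a derived one can be pictured as a Hive-style partition path. The column names and path layout below are illustrative assumptions:

```python
from datetime import datetime

# Illustrative sketch of a Hive-style partition path combining a normal
# partition (country) with a derived one (year-month from a timestamp).
def partition_path(row):
    ym = f"{row['created_at'].year:04d}-{row['created_at'].month:02d}"
    return f"country={row['country']}/year_month={ym}"

row = {"country": "US", "created_at": datetime(2018, 7, 4)}
print(partition_path(row))  # country=US/year_month=2018-07
```

Because the path is computed from the row's column values, mutable partition columns would move a row between directories on update, which is why the partition column data must be immutable.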

Extract a New Column: Check this option for supported datatypes to derive a partition from an existing column. NOTE: Fill in the corresponding field values based on the derived type.

Generate History View: The option to create a history view table (along with the current view table) in Hive, which contains the versions of incremental updates and deletes.

Number of Partitions to Merge in Parallel: The number of partitions that can be merged in parallel for a table.

NOTE: If the following error occurs:

NoSuchMethodError: com.google.api.client.json.jackson2.JacksonFactory.getDefaultInstance()Lcom/google/api/client/json/jackson2/JacksonFactory;

perform the following:

  • Copy the /opt/infoworks/lib/cloud/google-http-client-jackson2-1.23.0.jar file to the /opt/infoworks/lib/extras/ingestion/ folder.
  • Restart the query service using the following commands:
    $IW_HOME/bin/stop.sh queryservice
    $IW_HOME/bin/start.sh queryservice