BigQuery Ingestion

NOTE: Ensure that the guava-18.0.jar file is copied from the $IW_HOME/lib/df folder to the /opt/infoworks/lib/extras/ingestion folder for the HDP2 environment.

Creating BigQuery Source

For creating a BigQuery source, see Creating Source. Ensure that the Source Type selected is BigQuery.

Configuring BigQuery Source

For configuring a BigQuery source, see Configuring Source.

BigQuery Configurations

Project ID: The ID of the project in the source BigQuery.

Service Account Type: Default Service Account can be used for Dataproc and GCP machines, while Custom Service Account can be used for other environments.

OAuth Service Account Email: The email ID of the account used to authenticate to BigQuery. This field is available only if the Service Account Type is Custom.

Service Account JSON Credentials File: The location of the JSON credentials file on the Infoworks edge node. This field is available only if the Service Account Type is Custom. NOTE: The service account JSON file must be available on the edge node and on all data nodes.

Data Set: The BigQuery dataset from which the data must be ingested.

Dataset Location: The location/geographic area of the source BigQuery.
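Before saving the settings, it can be useful to sanity-check the service account JSON credentials file referenced above. The sketch below validates the keys found in a standard GCP service account key file; the helper name is illustrative and not part of Infoworks.

```python
import json

# Keys present in a standard GCP service account key file; a quick sanity
# check before pointing the source configuration at the credentials path.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}

def validate_service_account_json(text):
    """Return a list of problems found in a service account JSON document."""
    data = json.loads(text)
    problems = sorted(REQUIRED_KEYS - data.keys())
    if data.get("type") != "service_account":
        problems.append('type != "service_account"')
    return problems

# Hypothetical sample key file content for illustration.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...",
    "client_email": "ingest@my-project.iam.gserviceaccount.com",
})
print(validate_service_account_json(sample))  # []
```

An empty list means the file has the expected shape; actual authentication is still verified by the Test Connection step.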

Once the settings are saved, you can test the connection or navigate to the Source Configuration page to crawl the metadata.

Crawling BigQuery Source Metadata

For crawling a BigQuery source, see Crawling Metadata.

Configuring BigQuery Source for Ingestion

For configuring a BigQuery source for ingestion, see Configuring Source for Ingestion.

Ingest Type: The type of synchronization to be performed on the table. The options include Full Load and Timestamp-Based Incremental Ingestion.

Segmented Load Status: The option to crawl large tables by breaking them into smaller chunks that can be crawled in parallel. Select the column to Perform Segmented Load On. To derive a column, check the Use substring from field option and select the Extract function. After the segmented load completes, ensure that this value is set to Completed before submitting Incremental Load or Full Load jobs.

NOTE: For date, datetime, and timestamp column types, you can extract new columns. The available extraction options are year, month, year-month, month-day, and day-num-in-month.
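The extraction options listed above can be sketched as simple functions over a timestamp value. This is an illustrative approximation, not Infoworks' internal implementation; the output formats shown are assumptions.

```python
from datetime import datetime

# Sketch of the derived-column extraction options applied to a timestamp
# value (function and format choices here are illustrative assumptions).
def extract(ts, option):
    return {
        "year": f"{ts.year:04d}",
        "month": f"{ts.month:02d}",
        "year-month": f"{ts.year:04d}-{ts.month:02d}",
        "month-day": f"{ts.month:02d}-{ts.day:02d}",
        "day-num-in-month": str(ts.day),
    }[option]

ts = datetime(2018, 7, 4, 12, 30)
print(extract(ts, "year-month"))         # 2018-07
print(extract(ts, "day-num-in-month"))   # 4
```

Deriving a coarser column such as year-month lets many distinct timestamps fall into one segment, which is what makes segmented (chunked) crawling practical.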

Source Configuration
Natural Keys: The key that identifies a row uniquely. This key is used to identify and merge incremental data with the existing data on the target. Distribution of data into secondary partitions for a table on the target is computed from the hashcode value of the natural key.

This field is mandatory for incremental ingestion tables.

NOTE: The value of a natural key column cannot be updated when a row is updated on the source; all components of the natural key are immutable.
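The hash-based distribution into secondary partitions described above can be sketched as follows. The hash function and key-joining scheme here are illustrative assumptions; Infoworks' actual implementation may differ.

```python
import zlib

# Illustrative sketch: assigning a row to one of N secondary partitions
# from the hash of its natural key (hash choice here is an assumption).
def secondary_partition(natural_key_values, num_partitions):
    key = "|".join(str(v) for v in natural_key_values)
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical row with a two-column natural key.
row = {"customer_id": 42, "order_id": 1001}
part = secondary_partition([row["customer_id"], row["order_id"]], 8)
print(part)  # a stable value in 0..7
```

Because the natural key is immutable, the same row always hashes to the same partition, so an incremental update lands in the partition that already holds the row it must merge with.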

Number of Split Partitions: The option to crawl the table in parallel using multiple connections to the database.

Synchronization Configuration

Timestamp-Based Incremental Load
Timestamp Column for Update: The source column to be used as the update watermark. To change the column, a truncate reload must be performed.

NOTE: This column can also be used as the insert watermark.

Target Configuration

Hive Table Name: The name of the Hive table that will be used to access the ingested data.

Storage Format: The format of the data files stored in HDFS. The options include ORC and Parquet.
Partition Hive Table On: The option to partition the data on the target. The partition column can also be derived from a date, datetime, or timestamp column, which further partitions the data. A hierarchy of partitions is supported, with both normal partitions and derived partitions.

NOTE: Ensure that the partition column data is immutable. You can also provide a combination of normal and derived partitions in the hierarchy.
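A partition hierarchy that mixes a normal partition with a derived one can be pictured as a Hive-style partition path. The column names and path layout below are illustrative assumptions:

```python
from datetime import datetime

# Illustrative sketch of a Hive-style partition path combining a normal
# partition (country) with a derived one (year-month from a timestamp).
def partition_path(row):
    ym = f"{row['created_at'].year:04d}-{row['created_at'].month:02d}"
    return f"country={row['country']}/year_month={ym}"

row = {"country": "US", "created_at": datetime(2018, 7, 4)}
print(partition_path(row))  # country=US/year_month=2018-07
```

Because the path is computed from the row's column values, mutable partition columns would move a row between directories on update, which is why the partition column data must be immutable.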

Extract a New Column: Check this option for supported datatypes to derive a partition from an existing column. NOTE: Fill in the corresponding field values based on the derived type.

Generate History View: The option to create a history view table (along with the current view table) in Hive, which contains the versions of incremental updates and deletes.

Number of Partitions to Merge in Parallel: The number of partitions that can be merged in parallel for a table.

NOTE: If the following error occurs:

NoSuchMethodError: com.google.api.client.json.jackson2.JacksonFactory.getDefaultInstance()Lcom/google/api/client/json/jackson2/JacksonFactory;

perform the following:

  • Copy the /opt/infoworks/lib/cloud/google-http-client-jackson2-1.23.0.jar file to the /opt/infoworks/lib/extras/ingestion/ folder.
  • Restart the query service using the following commands:
    $IW_HOME/bin/stop.sh queryservice
    $IW_HOME/bin/start.sh queryservice