BigQuery Ingestion
NOTE: Ensure that the guava-18.0.jar file is copied from the $IW_HOME/lib/df folder to the /opt/infoworks/lib/extras/ingestion folder for the HDP2 environment.
Creating BigQuery Source
For creating a BigQuery source, see Creating Source. Ensure that the Source Type selected is BigQuery.
Configuring BigQuery Source
For configuring a BigQuery source, see Configuring Source.
BigQuery Configurations
Field | Description |
---|---|
Project ID | The ID of the project in the source BigQuery. |
Service Account Type | Default Service Account can be used for Dataproc and GCP machines, while Custom Service Account can be used for other environments. |
OAuth Service Account Email | The email ID of the account to be used for authentication to BigQuery. This field is available if the Service Account Type is Custom. |
Service Account JSON Credentials File | The location of the JSON credentials file on the Infoworks edge node. This field is available if the Service Account Type is Custom. NOTE: The service account JSON file must be available on the edge node and on all data nodes. |
Data Set | The BigQuery dataset from which the data must be ingested. |
Dataset Location | The location/geographic area of the source BigQuery. |
Once the settings are saved, you can test the connection or navigate to the Source Configuration page to crawl the metadata.
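Before saving a source that uses a Custom Service Account, it can help to sanity-check the JSON credentials file on the edge node. The following stdlib-only sketch is illustrative and not part of Infoworks; it assumes the standard GCP service-account key format (fields such as `type`, `project_id`, `private_key`, and `client_email`).

```python
import json

# Fields present in a standard GCP service-account key file (assumption:
# the credentials follow the standard key format, not an Infoworks-specific one).
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}

def validate_service_account_json(path):
    """Return the parsed credentials if the key file looks valid, else raise."""
    with open(path) as f:
        creds = json.load(f)
    missing = REQUIRED_FIELDS - creds.keys()
    if missing:
        raise ValueError(f"credentials file missing fields: {sorted(missing)}")
    if creds["type"] != "service_account":
        raise ValueError(f"unexpected credential type: {creds['type']}")
    return creds
```

Remember that the same file must exist at the same path on the edge node and on every data node, as noted above.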
Crawling BigQuery Source Metadata
For crawling a BigQuery source, see Crawling Metadata.
Configuring BigQuery Source for Ingestion
For configuring a BigQuery source for ingestion, see Configuring Source for Ingestion.
Field | Description |
---|---|
Ingest Type | The type of synchronization to be performed on the table. The options include Full Load and Timestamp-Based Incremental Ingestion. |
Segmented Load Status | The option to crawl large tables by breaking them into smaller chunks, which can be crawled in parallel. Select the column to Perform Segmented Load On. To derive a column, check the Use substring from field option and select the Extract function. After completing the segmented load, ensure that this value is set to Completed before submitting Incremental Load or Full Load jobs. NOTE: For date, datetime, and timestamp column types, you can extract new columns. The possible extraction options are year, month, year-month, month-day, and day-num-in-month. |
Source Configuration | |
Natural Keys | The key that uniquely identifies a row. This key is used to identify and merge incremental data with the existing data on the target. Distribution of data into secondary partitions for a table on the target is computed based on the hash code of the natural key. This field is mandatory for incremental ingestion tables. NOTE: The value of a natural key column cannot be updated when updating a row on the source; all the components of the natural key are immutable. |
Number of Split Partitions | The option to crawl the table in parallel using multiple connections to the database. |
Synchronization Configuration | |
Timestamp-Based Incremental Load | |
Timestamp Column for Update | The source column to be used as the update watermark. To change this column, a truncate reload must be performed. NOTE: The same column can also serve as the insert watermark. |
Target Configuration | |
Hive Table Name | The name of the Hive table that will be used to access the ingested data. |
Storage Format | The format of the data file to be stored in HDFS. The options include ORC and Parquet. |
Partition Hive Table on | The option to partition the data on the target. The partition column can also be derived for date, datetime, and timestamp columns; this further partitions the data. A hierarchy of partitions is supported with both normal partitions and derived partitions. NOTES: Ensure that the partition column data is immutable. You can also provide a combination of normal and derived partitions in the hierarchy. |
Extract a New Column | Check this option for supported datatypes to derive a partition from an existing column. NOTE: Fill in the corresponding field values based on the derived type. |
Generate History View | The option to create history view table (along with the current view table) in Hive, which contains the versions of incremental updates and deletes. |
Number of Partitions to Merge in Parallel | The number of partitions that can be merged in parallel for a table. |
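The derived-column options listed above for Segmented Load and derived Hive partitions (year, month, year-month, month-day, and day-num-in-month) can be illustrated with a short, stdlib-only Python sketch. The function names and output formats below are assumptions for illustration, not Infoworks internals.

```python
from datetime import datetime

# Illustrative extraction functions mirroring the documented options.
# The exact string formats Infoworks writes may differ; these are assumptions.
EXTRACTORS = {
    "year":             lambda ts: f"{ts.year:04d}",
    "month":            lambda ts: f"{ts.month:02d}",
    "year-month":       lambda ts: f"{ts.year:04d}-{ts.month:02d}",
    "month-day":        lambda ts: f"{ts.month:02d}-{ts.day:02d}",
    "day-num-in-month": lambda ts: f"{ts.day:02d}",
}

def derive_column(ts: datetime, extract: str) -> str:
    """Derive a segment/partition value from a date, datetime, or timestamp column."""
    return EXTRACTORS[extract](ts)
```

For example, a `year-month` derivation on a timestamp column yields one partition per calendar month, which keeps partition counts manageable while still allowing incremental merges to touch only recent partitions.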
NOTE:
NoSuchMethodError: com.google.api.client.json.jackson2.JacksonFactory.getDefaultInstance()Lcom/google/api/client/json/jackson2/JacksonFactory.
If the above error occurs, perform the following:
- Copy the /opt/infoworks/lib/cloud/google-http-client-jackson2-1.23.0.jar file to the /opt/infoworks/lib/extras/ingestion/ folder.
- Restart the query service using the following commands:
- $IW_HOME/bin/stop.sh queryservice
- $IW_HOME/bin/start.sh queryservice