Issue
When creating Spark-based pipelines, some ORC tables are not displayed.
Cause
The main reason ORC tables are not visible in Spark-based pipelines is a known compatibility issue between ORC and Spark pipelines.
Since version 2.5.4, Infoworks supports access to ORC tables in Spark pipelines.
Solution
This fix requires no changes for newly ingested tables, but older tables that were already crawled (that is, ingested in the past) must be migrated.
A script, support_spark_pipelines.sh, is available in the $IW_HOME/bin/migration_support_spark_pipelines folder. The script accepts two or more arguments and converts the specified tables to the new Hive table structure so that Spark pipelines can be built on them.
Following are the steps to run the script:
- Log in to the Infoworks edge node.
- Navigate to the $IW_HOME/bin folder and source the env.sh file using the following commands:
cd $IW_HOME/bin
source env.sh
- Run the migration. The script accepts two or more parameters, depending on how many tables need to be migrated (see the batch sketch after this list):
./support_spark_pipelines.sh <auth_token> <table_id> <table_id>
For example:
./support_spark_pipelines.sh jhasjdbjagxabjsnxh=sBc3ZDHn7T8YyqBOw= yshs7hd82bd92vdjbdhxsd 73bdhw4hcbswhbd6nbwc
where,
- jhasjdbjagxabjsnxh=sBc3ZDHn7T8YyqBOw= is the auth token of the user performing the migration to the new Hive table structure. The auth token must be URL-encodable.
- yshs7hd82bd92vdjbdhxsd is the ID of the first table to be migrated.
- 73bdhw4hcbswhbd6nbwc is the ID of the second table to be migrated.
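If a large number of tables must be migrated, the table IDs can be passed to the script in a single batch. The following is a minimal sketch, assuming the IDs have been collected in a plain-text file; the file name tables_to_migrate.txt and the AUTH_TOKEN placeholder are illustrative, not part of the product:
#!/bin/bash
# Hypothetical batch wrapper around the documented migration script.
# Assumes $IW_HOME is set and env.sh has been sourced as described above.
AUTH_TOKEN="<auth_token>"                 # replace with your URL-encodable auth token
TABLE_IDS=$(cat tables_to_migrate.txt)    # one table ID per line
cd $IW_HOME/bin
# Pass the auth token followed by all table IDs in one invocation;
# the unquoted expansion makes each ID a separate argument.
./support_spark_pipelines.sh "$AUTH_TOKEN" $TABLE_IDS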
After completing the above steps, it is recommended that you restart the DF services. You can then access the ORC tables in Spark pipelines.
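The exact commands to restart the DF services depend on your installation. The following sketch assumes the conventional stop.sh/start.sh service-control scripts in $IW_HOME/bin and a service name of df; both are assumptions and may differ in your environment, so verify them against your Infoworks administration documentation:
cd $IW_HOME/bin
./stop.sh df     # assumed service-control script and service name
./start.sh df    # verify both for your installation before running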
Applicable Versions
- 2.5 and earlier versions
Fixed Version: This issue has been resolved in version 2.5.4.