Troubleshooting ORC Tables with Spark Pipelines

Issue

When creating Spark-based pipelines, some ORC tables are not displayed.

Cause

ORC tables are not visible in Spark-based pipelines because of a known compatibility issue between ORC tables and Spark pipelines.

Since version 2.5.4, Infoworks supports access to ORC tables in Spark pipelines.

Solution

This fix requires no change for newly ingested tables, but tables that were already crawled (that is, ingested in the past) must be migrated.

A script, support_spark_pipelines.sh, is available in the $IW_HOME/bin/migration_support_spark_pipelines folder. The script accepts two or more arguments and converts the specified tables into the new Hive table structure required to build Spark pipelines.

Following are the steps to run the script:

  • Log in to the Infoworks edge node.
  • Navigate to the $IW_HOME/bin folder and source the env.sh file using the following commands:

cd $IW_HOME/bin

source env.sh
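
Before running the migration, you can confirm that the environment was sourced correctly and that the script is in place. The following is a minimal sketch; env.sh may export additional variables depending on your installation:

# Confirm the Infoworks environment variables are available
echo "IW_HOME is set to: $IW_HOME"

# Confirm the migration script is present
ls -l $IW_HOME/bin/migration_support_spark_pipelines/support_spark_pipelines.sh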

  • Run the migration. The script accepts two or more parameters, depending on the number of tables to be migrated:

./support_spark_pipelines.sh <auth_token> <table_id> [<table_id> ...]

For example:

./support_spark_pipelines.sh jhasjdbjagxabjsnxh=sBc3ZDHn7T8YyqBOw= yshs7hd82bd92vdjbdhxsd 73bdhw4hcbswhbd6nbwc

where,

  • jhasjdbjagxabjsnxh=sBc3ZDHn7T8YyqBOw= is the auth token of the user performing the migration to the new Hive table structure. This auth token must be URL-encodable (see the encoding sketch after this list).
  • yshs7hd82bd92vdjbdhxsd is the ID of the first table to be migrated.
  • 73bdhw4hcbswhbd6nbwc is the ID of the second table to be migrated.
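
If the auth token contains characters such as = (as in the example above) and your workflow requires passing a URL-encoded value, the following is a minimal sketch for encoding it on the edge node, assuming python3 is installed; whether the script expects a pre-encoded token depends on your Infoworks version:

# URL-encode the auth token before passing it to the migration script
TOKEN='jhasjdbjagxabjsnxh=sBc3ZDHn7T8YyqBOw='
ENCODED_TOKEN=$(python3 -c 'import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1], safe=""))' "$TOKEN")
echo "Encoded token: $ENCODED_TOKEN"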

After completing the above steps, it is recommended that you restart the DF services. You can then access the ORC tables in Spark pipelines.
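
To verify that a migrated table exposes the expected storage metadata, you can inspect it in Hive. This is a minimal sketch, assuming the hive CLI is available on the edge node; my_database.my_table is a hypothetical name, so substitute your own:

# Inspect the table's storage metadata; for ORC tables the InputFormat
# line should show org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
hive -e "DESCRIBE FORMATTED my_database.my_table" | grep -i InputFormat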

Applicable Versions

  • 2.5 and earlier versions

Fixed Version: This issue has been resolved in version 2.5.4.
