Creating a Pipeline

Infoworks Data Transformation is used to transform data ingested by Infoworks DWA for various downstream purposes, such as consumption by analytics tools, DF pipelines, and the Infoworks Cube builder, or export to other systems.

For details on creating a domain, see Domain Management.

Following are the steps to add a new pipeline to the domain:

  • Click the Domains menu and click the required domain from the list. You can also search for the required domain.
  • In the Summary page, click the Pipelines icon.
  • In the Pipelines page, click the New Pipeline button.
  • In the New Pipeline page, select Create new pipeline.

Alternatively, to duplicate an existing pipeline, select the pipeline from the Duplicate Existing Pipeline drop-down list and enable the checkbox to retain properties such as target table name, target schema name, target HDFS location, analytics model name, analytics model HDFS location, and MapR-DB table path in the imported pipeline.

  • Enter the Name and Description.
  • Select the Execution Engine type. The supported execution engines are Hive, Spark, and Impala.
  • Click Save. The new pipeline will be added to the list of pipelines in the Pipelines page.

Using Spark: Currently, Spark v2.0 and higher versions are supported. The Spark execution engine uses the Hive metastore to store table metadata. All the nodes supported by Hive and Impala are also supported by the Spark engine.

Known Limitations of Spark

  • Parquet has issues with the decimal type, which affects pipeline targets that include decimal columns. It is recommended to cast any decimal column to double when it is used in a pipeline target (see the sketch after this list).
  • The number of tasks in the reduce phase can be tuned using the spark.sql.shuffle.partitions setting. This setting also controls the number of output files and can be tuned per pipeline with df_batch_sparkapp_settings in the UI (see the sketch after this list).
  • Column names with spaces are not supported in Spark v2.2, though they are supported in v2.0. For example, a column named ID Number is not supported in Spark v2.2.
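The following is a minimal sketch of both workarounds. The column name total_amount and the partition count 200 are placeholders, and the value format shown for df_batch_sparkapp_settings is an assumption; verify the exact syntax against your Infoworks version.

  -- Hypothetical expression in a pipeline node, casting a decimal column to double
  CAST(total_amount AS DOUBLE) AS total_amount

  # Illustrative per-pipeline advanced configuration entry, assuming Spark settings
  # are passed to df_batch_sparkapp_settings as key=value pairs
  df_batch_sparkapp_settings=spark.sql.shuffle.partitions=200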

Submitting Spark Pipelines

Spark pipelines can be configured to run in client mode on the edge node or can be submitted via Apache Livy. By default, if no configuration is specified, Spark pipelines run in client mode on the edge node.

Configurations

Configuring Spark to Run in Client Mode on Edge Node

  • Add the following key-value pair in the pipeline advanced configuration: job.dispatcher.type=native

Submitting Spark Pipeline through Livy

  • Add the following key-value pair in the pipeline advanced configuration: job.dispatcher.type=livy
  • Create a livy.properties file.
  • In the $IW_HOME/conf/conf.properties file, set the absolute path of the livy.properties file in the df_livy_configfile configuration. For example, if the file is located at /opt/infoworks/livy/conf/livy.properties, then set df_livy_configfile=/opt/infoworks/livy/conf/livy.properties in $IW_HOME/conf/conf.properties.
  • Add the following mandatory property in the livy.properties file: livy.url=https://<livy host>:<livy port>
  • You can also add other optional Livy client configurations; refer to the Livy client configuration documentation for details.
  • In the $IW_HOME/conf/conf.properties file, add ${LIVY_HOME}/rsc-jars/* to the df_batch_classpath configuration.
  • Copy the $IW_HOME/conf folder from the local file system to hdfs://${IW_HOME}/conf.
  • Set spark.driver.extraJavaOptions=-DIW_HOME=hdfs:// in the df_spark_defaults.conf file.
  • By default, Spark pipelines are submitted in the existing Livy session. If a Livy session is not available, a new session is created and pipelines are submitted in that session.
  • To submit the application in a new session, add the following property in the pipeline advanced configuration: livy_use_existing_session=false. A consolidated example of these settings follows this list.
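Putting the Livy-related settings together, a minimal sketch of the relevant entries is shown below. The Livy host, port, and livy.properties path are placeholders based on the example above; the <existing entries> placeholder and the use of : as the classpath separator are assumptions to be checked against your existing conf.properties.

  # $IW_HOME/conf/conf.properties (relevant entries only)
  df_livy_configfile=/opt/infoworks/livy/conf/livy.properties
  # assumption: ':' separates classpath entries; keep any existing entries in place
  df_batch_classpath=<existing entries>:${LIVY_HOME}/rsc-jars/*

  # /opt/infoworks/livy/conf/livy.properties
  livy.url=https://<livy host>:<livy port>

  # Pipeline advanced configuration (set per pipeline in the UI)
  job.dispatcher.type=livy
  livy_use_existing_session=false   # only if each job should get a new Livy session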

NOTE: Infoworks DF is compatible with livy-0.5.0-incubating and other Livy 0.5-compatible versions.

Best Practices

For best practices, see General Guidelines on Data Pipelines.
