Introduction

Infoworks Data Transformation is primarily used to prepare, transform, and augment data.

Feature Highlights

The following are some of the data transformation features:

  • User-friendly GUI-driven method to transform data.
  • SQL syntax support for transforming data.
  • ETL workload import support for importing the required SQL files to automatically build data pipelines.
  • Immediate feedback on syntax and semantic errors.
  • Enterprise-friendly features, including domain-based source access and audits.
  • Multiple pipelines to create workflows.
  • Interactive data preparation for large data sets.
  • Intelligent data sampling to view data changes as transformations are applied.
  • Visual debugging using interactive data.
  • Advanced analytics nodes.

Optimizations

Data transformation provides the following optimizations:

Design-Time Optimizations

  • Instant feedback on data, syntax, and semantic errors: During transformation, data errors such as incorrect data formats in columns or problematic regular expressions applied to columns can be verified in the sample data.
  • Visualization of the flow of data, which supports designing better flows.
  • Auto-materialization of transformation nodes for faster responses.
  • Automatic dependency management: When a transformation node is modified, the system automatically computes the dependent nodes. The platform uses a Mark and Sweep algorithm to perform this efficiently (see the illustrative sketch after this list).
  • Safe handling of column include/exclude/rename refactoring, even in user-defined expressions.
  • Automatic rename of duplicate column names.
  • Reuse of Hive and Impala connections to support interactive viewing of data while designing data transformation pipelines.
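
The automatic dependency management described above can be pictured with a small sketch. The following is not Infoworks code; it is a minimal illustration, assuming a pipeline is modeled as a directed graph of nodes with downstream edges, of how a mark-and-sweep pass identifies the nodes that must be recomputed after a node is modified:

    # Illustrative only: minimal mark-and-sweep over a hypothetical pipeline dependency graph.
    from collections import deque

    def nodes_to_recompute(downstream, modified):
        """Mark every node reachable from the modified node, then sweep the graph
        and return only the marked (stale) nodes."""
        marked = set()
        queue = deque([modified])
        while queue:                      # mark phase: walk downstream edges
            node = queue.popleft()
            if node in marked:
                continue
            marked.add(node)
            queue.extend(downstream.get(node, []))
        # sweep phase: keep only the nodes flagged as stale
        return [node for node in downstream if node in marked]

    # Hypothetical pipeline: source -> filter -> join -> target, with a lookup feeding the join.
    pipeline = {
        "source": ["filter"],
        "lookup": ["join"],
        "filter": ["join"],
        "join":   ["target"],
        "target": [],
    }
    print(nodes_to_recompute(pipeline, "filter"))   # ['filter', 'join', 'target']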

Execution Optimizations

  • Automatically detects pipeline-specific intermediate storage based on the backend engine.
  • Automatically parallelizes population of multiple targets.
  • Reuses expressions: computes once, uses multiple times, reducing CPU/IO.
  • Provides optimal merge process for handling updates.
  • Provides the ability to use pipeline-specific environment settings for MapReduce, Hive, memory, and compression during the build (see the illustrative settings after this list).
  • Automatically selects only required columns based on the pipeline, and thus reduces CPU/IO usage.
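
As an illustration of the pipeline-specific build settings mentioned above, the snippet below uses standard Hive and MapReduce properties; the property names are generic Hadoop settings rather than Infoworks-specific keys, and the values are assumptions to be tuned per workload:

    -- Illustrative Hive/MapReduce settings for a pipeline build; values are assumptions.
    SET mapreduce.map.memory.mb=4096;
    SET mapreduce.reduce.memory.mb=8192;
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;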

Query-Time Optimizations

  • Support for primary partitions on tables created through pipeline targets.
  • Support for different storage formats (ORC and Parquet) on pipeline targets.
  • Table statistics are updated after every build for the Cost-Based Optimizer on pipeline targets (see the illustrative commands after this list).
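
For context, the behavior described above corresponds to standard Hive DDL and statistics commands on a pipeline target, roughly as sketched below (the table, column, and partition names are hypothetical):

    -- Hypothetical pipeline target: primary-partitioned and stored as ORC (Parquet is also supported).
    CREATE TABLE sales_target (
      order_id BIGINT,
      amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC;

    -- Statistics refresh after a build, so the Cost-Based Optimizer works with current numbers.
    ANALYZE TABLE sales_target PARTITION (order_date) COMPUTE STATISTICS;
    ANALYZE TABLE sales_target PARTITION (order_date) COMPUTE STATISTICS FOR COLUMNS;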

Prerequisites

To achieve the required data extraction using Data Transformation, you must either create the required sources and domains in Infoworks DataFoundry or use existing sources and domains.

NOTE: Only a user with admin access can perform the following. If you do not have admin access, contact the administrator.

  • Create a source.
  • Create a domain.
  • Add the required source to the domain.

NOTE: Creating a pipeline requires at least one source to be added to the domain.

Spark Configurations

  • dt_spark_configfile must point to the full path of the spark-defaults.conf file on the edge node for interactive pipelines. The default value is /etc/spark2/conf/spark-defaults.conf.
  • spark-defaults.conf must have all the properties required to connect to YARN.
  • dt_spark_configfile_batch must point to the full path of the spark-defaults.conf file on the edge node for batch pipelines. The default value is /etc/spark2/conf/spark-defaults.conf.
  • A hive-site.xml with data similar to the following must be present in the spark conf directory:
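    The exact contents are cluster specific; as a minimal illustrative sketch (the metastore host and port below are placeholders, not values taken from this document), the file points Spark at the Hive metastore:

      <configuration>
        <!-- Illustrative only: replace METASTORE_HOST and the port with your Hive metastore endpoint. -->
        <property>
          <name>hive.metastore.uris</name>
          <value>thrift://METASTORE_HOST:9083</value>
        </property>
      </configuration>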
  • export SPARK_DIST_CLASSPATH=/usr/hdp/2.5.5.0-157/spark2/jars/ must be set in env.sh of Infoworks DataFoundry (the path is the location of the Spark jars).
  • The recommended settings in spark-defaults.conf might differ between interactive and batch modes (a consolidated illustrative fragment follows this list).
  • spark.sql.warehouse.dir must be set to /apps/hive/warehouse (or any other location equivalent to the Hive warehouse location).
  • spark.sql.hive.convertMetastoreParquet false // for parquet ingested tables to be read
  • spark.mapreduce.input.fileinputformat.input.dir.recursive true // for parquet ingested tables to be read
  • spark.hive.mapred.supports.subdirectories true // for parquet ingested tables to be read
  • spark.mapred.input.dir.recursive true // for parquet ingested tables to be read
  • spark.sql.shuffle.partitions // to control number of tasks for reduce phase
  • spark.dynamicAllocation.enabled true // if dynamic allocation is needed
  • spark.shuffle.service.enabled true // if dynamic allocation is needed
  • spark.executor.memory // according to workload
  • spark.executor.cores // according to workload
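
Put together, an illustrative spark-defaults.conf fragment based on the settings above might look like the following; the memory, core, and shuffle-partition values are assumptions to be tuned per workload:

    spark.sql.warehouse.dir                                     /apps/hive/warehouse
    spark.sql.hive.convertMetastoreParquet                      false
    spark.mapreduce.input.fileinputformat.input.dir.recursive   true
    spark.hive.mapred.supports.subdirectories                   true
    spark.mapred.input.dir.recursive                            true
    spark.sql.shuffle.partitions                                200
    spark.dynamicAllocation.enabled                             true
    spark.shuffle.service.enabled                               true
    spark.executor.memory                                       4g
    spark.executor.cores                                        2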

NOTE: For encoded partitions, the spark.sql.hive.convertMetastoreParquet value must be set to true.

Some pipelines might require additional configurations apart from the above, such as:

  • dt_batch_sparkapp_settings: Any setting that must be changed at the application master level (such as spark.dynamicAllocation.enabled, spark.executor.memory, etc).
  • dt_batch_spark_settings: Any setting that must be changed at sparksession level (such as spark.sql.crossJoin.enabled, etc).