Data Transformation
Data Transformation, an important feature of the Infoworks Autonomous Data Engine (ADE), is primarily used to prepare, transform, and augment data.
Feature Highlights
Following are some data transformation features:
- User-friendly GUI-driven method to transform data.
- SQL syntax support for transforming data.
- ETL workload import support for importing the required SQL files to automatically build data pipelines.
- Immediate feedback on syntax and semantic errors.
- Enterprise-friendly features, including domain-based source access and audits.
- Multiple pipelines to create workflows.
- Interactive data preparation for large data sets.
- Intelligent data sampling to view data changes as transformations are applied.
- Visual debugging using interactive data.
- Advanced analytics nodes.
Optimizations
Data transformation provides the following optimizations:
Design-Time Optimizations
- Instant feedback on data, syntax, and semantic errors: during transformation, data errors (for example, an unexpected data format in a column, or a regular expression that fails on column values) can be verified against the sample data.
- Data flow visualization, to support designing better flows.
- Auto-materialization of transformation nodes for faster responses.
- Automatic dependency management: When a transformation node is modified, the system automatically computes the dependent nodes. The platform uses a Mark and Sweep algorithm to perform this efficiently.
- Safe handling of column include/exclude/rename refactoring, even within user-defined expressions.
- Automatic rename of duplicate column names.
- Reuse of Hive and Impala connections to support interactive viewing of data while designing data transformation pipelines.
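The automatic dependency management described above can be illustrated with a short sketch. This is not Infoworks' implementation; it is a minimal mark-and-sweep example over a hypothetical pipeline DAG, in which modifying a node marks all of its downstream nodes stale and the sweep then recomputes only the marked nodes.

```python
# Illustrative mark-and-sweep dependency recomputation over a pipeline
# DAG (a hypothetical sketch, not Infoworks' actual algorithm).

def downstream(edges, start):
    """Return every node reachable from `start` via directed edges."""
    stale, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in edges.get(node, []):
            if child not in stale:
                stale.add(child)
                stack.append(child)
    return stale

def rebuild(edges, changed):
    # Mark phase: the changed node and everything downstream are stale.
    stale = downstream(edges, changed) | {changed}
    # Sweep phase: recompute only the stale nodes (here, just report them).
    return sorted(stale)

# A small pipeline: source -> filter -> join -> target, plus an
# independent branch source2 -> target2 that must NOT be rebuilt.
edges = {
    "source": ["filter"], "filter": ["join"], "join": ["target"],
    "source2": ["target2"],
}
print(rebuild(edges, "filter"))  # → ['filter', 'join', 'target']
```

Because the sweep touches only nodes downstream of the modification, unrelated branches of the pipeline keep their materialized results.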
Execution Optimizations
- Automatically selects pipeline-specific intermediate storage based on the backend engine.
- Automatically parallelizes population of multiple targets.
- Reuses expressions: computes once, uses multiple times, reducing CPU/IO.
- Provides optimal merge process for handling updates.
- Provides ability to use pipeline-specific environment settings for MapReduce, Hive, memory, and compression for build.
- Automatically selects only required columns based on the pipeline, and thus reduces CPU/IO usage.
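The expression-reuse optimization above (compute once, use multiple times) can be sketched as simple memoization. The names below (`expensive_expr`, `evaluate`) are hypothetical; a real engine achieves this through common-subexpression elimination in the query plan.

```python
# Illustrative sketch of expression reuse: compute each distinct
# expression once, cache the result, and serve repeated references
# from the cache. A hypothetical example, not Infoworks' optimizer.

calls = []

def expensive_expr(column):
    calls.append(column)       # track how often we actually compute
    return column.upper()      # stand-in for a costly transformation

cache = {}

def evaluate(expr, arg):
    key = (expr.__name__, arg)
    if key not in cache:       # compute once...
        cache[key] = expr(arg)
    return cache[key]          # ...use many times

# The same expression appears in three places in the pipeline,
# but is computed only once.
results = [evaluate(expensive_expr, "name") for _ in range(3)]
print(results, len(calls))    # → ['NAME', 'NAME', 'NAME'] 1
```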
Query-Time Optimizations
- Support for primary partitions on tables created through pipeline targets.
- Support for different storage formats (ORC and Parquet) on pipeline targets.
- Table statistics updates after every build, for the Cost-Based Optimizer, on pipeline targets.
Prerequisites
To achieve the required data extraction using Data Transformation, you must either create the required sources and domains in the Infoworks ADE or use the existing sources and domains.
Only a user with admin access can perform the following tasks. If you do not have admin access, contact the administrator to perform these tasks.
If the required sources and domains are not available, follow these steps:
- Create a source.
- Create a corresponding domain.
- Add the required source to the domain.
Creating a pipeline requires at least one source to be added to a domain; there is no upper limit on the number of sources.
Spark Configurations
- df_spark_configfile, used in interactive mode, must point to the full path of the spark-defaults.conf file on the edge node for interactive pipelines; it defaults to /etc/spark2/conf/spark-defaults.conf.
- spark-defaults.conf must contain all the properties required to connect to YARN.
- df_spark_configfile_batch, used in batch mode, must point to the full path of the spark-defaults.conf file on the edge node for batch pipelines; it defaults to /etc/spark2/conf/spark-defaults.conf.
- A hive-site.xml with data similar to below must be present in the Spark conf directory.
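The document does not reproduce the referenced snippet; as an illustration only, a minimal hive-site.xml typically points Spark at the Hive metastore. The host name below is a placeholder, not a value from this document.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder host: replace with your metastore server -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>
```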
- export SPARK_DIST_CLASSPATH=/usr/hdp/2.5.5.0-157/spark2/jars/ must be set in the env.sh of Infoworks (the path is the location of the Spark jars).
- Recommended settings in spark-defaults.conf might differ between interactive and batch mode.
- spark.sql.warehouse.dir must be set to /apps/hive/warehouse (or any other location equivalent to the hive warehouse location).
spark.sql.hive.convertMetastoreParquet false    # so that Parquet ingested tables can be read
spark.mapreduce.input.fileinputformat.input.dir.recursive true    # so that Parquet ingested tables can be read
spark.hive.mapred.supports.subdirectories true    # so that Parquet ingested tables can be read
spark.mapred.input.dir.recursive true    # so that Parquet ingested tables can be read
spark.sql.shuffle.partitions    # to control the number of tasks in the reduce phase
spark.dynamicAllocation.enabled true    # if dynamic allocation is needed
spark.shuffle.service.enabled true    # if dynamic allocation is needed
spark.executor.memory    # according to workload
spark.executor.cores    # according to workload
For encoded partitions, the spark.sql.hive.convertMetastoreParquet value must be set to true.
Some pipelines might need additional configurations beyond the above, such as:
- df_batch_sparkapp_settings: any settings that must be changed at the application-master level (such as spark.dynamicAllocation.enabled and spark.executor.memory).
- df_batch_spark_settings: any settings that must be changed at the Spark session level (such as spark.sql.crossJoin.enabled).