Data Transformation
Data Transformation, an important feature of the Infoworks Autonomous Data Engine (ADE), is primarily used to prepare, transform, and augment data.
Feature Highlights
Following are some data transformation features:
- User-friendly GUI-driven method to transform data.
- SQL syntax support for transforming data.
- ETL workload import support for importing the required SQL files to automatically build data pipelines.
- Immediate feedback on syntax and semantic errors.
- Enterprise-friendly features, including domain-based source access and audits.
- Multiple pipelines to create workflows.
- Interactive data preparation for large data sets.
- Intelligent data sampling to view data changes as transformations are applied.
- Visual debugging using interactive data.
- Advanced analytics nodes.
Optimizations
Data transformation provides the following optimizations:
Design-Time Optimizations
- Instant feedback on data, syntax, and semantic errors: during transformation, data errors (for example, an unexpected data format in a column, or a regular expression that fails on column values) can be verified against the sample data.
- Data flow visualization, to support designing better flows.
- Auto-materialization of transformation nodes for faster responses.
- Automatic dependency management: When a transformation node is modified, the system automatically computes the dependent nodes. The platform uses a Mark and Sweep algorithm to perform this efficiently.
- Safe handling of column include/exclude/rename refactoring, even within user-defined expressions.
- Automatic rename of duplicate column names.
- Reuse of Hive and Impala connections to support interactive viewing of data while designing data transformation pipelines.
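The automatic dependency management described above can be illustrated with a short sketch. This is not Infoworks' implementation; it is a minimal mark-and-sweep example over a hypothetical pipeline DAG, in which modifying a node marks all of its downstream nodes stale and the sweep then recomputes only the marked nodes.

```python
# Illustrative mark-and-sweep dependency recomputation over a pipeline
# DAG (a hypothetical sketch, not Infoworks' actual algorithm).

def downstream(edges, start):
    """Return every node reachable from `start` via directed edges."""
    stale, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in edges.get(node, []):
            if child not in stale:
                stale.add(child)
                stack.append(child)
    return stale

def rebuild(edges, changed):
    # Mark phase: the changed node and everything downstream are stale.
    stale = downstream(edges, changed) | {changed}
    # Sweep phase: recompute only the stale nodes (here, just report them).
    return sorted(stale)

# A small pipeline: source -> filter -> join -> target, plus an
# independent branch source2 -> target2 that must NOT be rebuilt.
edges = {
    "source": ["filter"], "filter": ["join"], "join": ["target"],
    "source2": ["target2"],
}
print(rebuild(edges, "filter"))  # → ['filter', 'join', 'target']
```

Because the sweep touches only nodes downstream of the modification, unrelated branches of the pipeline keep their materialized results.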
Execution Optimizations
- Automatically selects pipeline-specific intermediate storage based on the backend engine.
- Automatically parallelizes population of multiple targets.
- Reuses expressions: computes once, uses multiple times, reducing CPU/IO.
- Provides optimal merge process for handling updates.
- Provides ability to use pipeline-specific environment settings for MapReduce, Hive, memory, and compression for build.
- Automatically selects only required columns based on the pipeline, and thus reduces CPU/IO usage.
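The expression-reuse optimization above (compute once, use multiple times) can be sketched as simple memoization. The names below (`expensive_expr`, `evaluate`) are hypothetical; a real engine achieves this through common-subexpression elimination in the query plan.

```python
# Illustrative sketch of expression reuse: compute each distinct
# expression once, cache the result, and serve repeated references
# from the cache. A hypothetical example, not Infoworks' optimizer.

calls = []

def expensive_expr(column):
    calls.append(column)       # track how often we actually compute
    return column.upper()      # stand-in for a costly transformation

cache = {}

def evaluate(expr, arg):
    key = (expr.__name__, arg)
    if key not in cache:       # compute once...
        cache[key] = expr(arg)
    return cache[key]          # ...use many times

# The same expression appears in three places in the pipeline,
# but is computed only once.
results = [evaluate(expensive_expr, "name") for _ in range(3)]
print(results, len(calls))    # → ['NAME', 'NAME', 'NAME'] 1
```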
Query-Time Optimizations
- Support for primary partitions on tables created through pipeline targets.
- Support for different storage formats (ORC and Parquet) on pipeline targets.
- Table statistics updates after every build, for the Cost-Based Optimizer, on pipeline targets.
Prerequisites
To achieve the required data extraction using Data Transformation, you must either create the required sources and domains in the Infoworks ADE or use the existing sources and domains.
Only a user with admin access can perform the following tasks. If you do not have admin access, contact the administrator to perform these tasks.
If the required sources and domains are not available, follow these steps:
- Create a source.
- Create a corresponding domain.
- Add the required source to the domain.
Creating a pipeline requires at least one source to be added to a domain; there is no upper limit on the number of sources.
Spark Configurations
- df_spark_configfile, used in interactive mode, must point to the full path of the spark-defaults.conf file on the edge node for interactive pipelines; it defaults to /etc/spark2/conf/spark-defaults.conf.
- spark-defaults.conf must contain all the properties required to connect to YARN.
- df_spark_configfile_batch, used in batch mode, must point to the full path of the spark-defaults.conf file on the edge node for batch pipelines; it defaults to /etc/spark2/conf/spark-defaults.conf.
- A hive-site.xml with data similar to below must be present in the Spark conf directory.
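The document does not reproduce the referenced snippet; as an illustration only, a minimal hive-site.xml typically points Spark at the Hive metastore. The host name below is a placeholder, not a value from this document.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder host: replace with your metastore server -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>
```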
- export SPARK_DIST_CLASSPATH=/usr/hdp/2.5.5.0-157/spark2/jars/ must be set in the env.sh of Infoworks (the path is the location of the Spark jars).
- Recommended settings in spark-defaults.conf might differ between interactive and batch mode.
- spark.sql.warehouse.dir must be set to /apps/hive/warehouse (or any other location equivalent to the hive warehouse location).
spark.sql.hive.convertMetastoreParquet false    # so that Parquet ingested tables can be read
spark.mapreduce.input.fileinputformat.input.dir.recursive true    # so that Parquet ingested tables can be read
spark.hive.mapred.supports.subdirectories true    # so that Parquet ingested tables can be read
spark.mapred.input.dir.recursive true    # so that Parquet ingested tables can be read
spark.sql.shuffle.partitions    # to control the number of tasks in the reduce phase
spark.dynamicAllocation.enabled true    # if dynamic allocation is needed
spark.shuffle.service.enabled true    # if dynamic allocation is needed
spark.executor.memory    # according to workload
spark.executor.cores    # according to workload
For encoded partitions, the spark.sql.hive.convertMetastoreParquet value must be set to true.
Some pipelines might need additional configurations beyond the above, such as:
- df_batch_sparkapp_settings: any settings that must be changed at the application-master level (such as spark.dynamicAllocation.enabled and spark.executor.memory).
- df_batch_spark_settings: any settings that must be changed at the Spark session level (such as spark.sql.crossJoin.enabled).