Data Transformation Target Configurations
This document describes best practices for designing data transformation pipelines and their target tables.
A data transformation pipeline cleans, transforms, or denormalizes data from the data lake and stores it in data transformation target tables.
NOTE: The target tables must be carefully designed according to the queries that use these tables.
The Target node provides several configuration options to define the structure of the final data transformation target table.
This section describes the significance and usage of each option available under Target Configuration.
Sync Type
- Overwrite: Truncates the target table and inserts data when the pipeline is built.
- Append: Inserts only new data (the delta) into the existing table when the pipeline is built.
- Merge: Updates existing records and/or inserts new data when the pipeline is built.
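The three sync types can be sketched on in-memory records keyed by a natural key. This is illustrative only; the product applies these modes to Hive tables, not Python lists, and the `sync` function and `id` key are hypothetical.

```python
def sync(target, delta, mode, key="id"):
    """Return the new target contents after applying `delta`."""
    if mode == "overwrite":            # truncate and insert
        return list(delta)
    if mode == "append":               # insert the delta as-is
        return target + list(delta)
    if mode == "merge":                # update matches, insert the rest
        merged = {row[key]: row for row in target}
        for row in delta:
            merged[row[key]] = row     # upsert on the natural key
        return list(merged.values())
    raise ValueError(f"unknown sync type: {mode}")

target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
delta = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]

# merge keeps id 1, updates id 2 to "B", and inserts id 3
print(sync(target, delta, "merge"))
```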
Column-Mapping between Target Node and Merge/Append Reference Table
- In the Mapping section, all columns from the reference table are listed on the left, and all columns from the target node are listed on the right.
- The number of target columns must be greater than or equal to the number of columns in the reference table.
- If the order/mapping of the target columns does not match the columns of the reference table, you can update it using the edit feature.
- All unmapped columns from the target table are excluded automatically when the mapping is saved.
- Data types of mapped columns must match.
- All audit columns must be mapped to their respective audit columns.
- If a column name in the target differs from the corresponding mapped column of the reference table, the user receives a notification to rename the column. On clicking OK, the column in the target node is renamed.
- On saving the Mapping section, the schema of the target table is changed to match the schema of the reference table.
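The mapping rules above can be sketched as a validation function: the target must have at least as many columns as the reference table, every reference column must map to a target column of the same type, and unmapped target columns are dropped on save. The function and all column/type names are illustrative, not the product's internal API.

```python
def check_mapping(reference, target, mapping):
    """reference/target: {column: dtype}; mapping: {ref_col: target_col}."""
    errors = []
    if len(target) < len(reference):
        errors.append("target has fewer columns than the reference table")
    for ref_col, ref_type in reference.items():
        tgt_col = mapping.get(ref_col)
        if tgt_col is None:
            errors.append(f"reference column {ref_col!r} is unmapped")
        elif ref_type != target.get(tgt_col):
            errors.append(f"type mismatch for {ref_col!r} -> {tgt_col!r}")
    # unmapped target columns are excluded automatically on save
    dropped = set(target) - set(mapping.values())
    return errors, dropped

reference = {"id": "int", "name": "string"}
target = {"cust_id": "int", "cust_name": "string", "extra": "string"}
errors, dropped = check_mapping(
    reference, target, {"id": "cust_id", "name": "cust_name"}
)
print(errors, dropped)  # no errors; "extra" would be dropped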
Schema Name
- This is the Hive metastore schema name where this target table will be created.
- The name must not contain spaces or special characters (except underscore).
- Multiple pipelines/target nodes can use the same schema name, as long as table names are unique across all domains.
- Data transformation targets cannot be created in schemas containing sources or cubes. Ensure that this schema name is not used by any source or cube.
Table Name
- This is the name of the actual Hive metastore table created corresponding to this target node.
- The name must not contain spaces or special characters (except underscore).
- Combination of schemaName.tableName must be unique across all domains.
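The naming rule for both schema and table names (no spaces or special characters except underscore) can be expressed as a small validator. This is a sketch; the regex assumes non-empty ASCII names, which is stricter than anything the document states.

```python
import re

# Letters, digits, and underscores only -- no spaces or other
# special characters (assumption: names must be non-empty ASCII).
NAME_RE = re.compile(r"^\w+$", re.ASCII)

def is_valid_name(name):
    return bool(NAME_RE.match(name))

print(is_valid_name("sales_2022"))   # valid
print(is_valid_name("sales data"))   # invalid: contains a space
print(is_valid_name("sales-data"))   # invalid: contains a hyphen
```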
HDFS Path
- This is the HDFS path where target table data files will be located.
- It must be unique for all artefacts (source, cube, or pipeline) across all domains. Ensure no other target node, source or cube is configured to use this path.
- Conflicting paths are not allowed. For example, if a built target-table uses path "/a/b/c", you cannot specify the following paths in a new target node: "/a/b/c/", "/a/b/c", "/a/b/c/d", "/a/b"
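The conflict rule above means two paths clash when they are equal (modulo a trailing slash) or one is an ancestor of the other. A sketch using only path semantics, with the `paths_conflict` helper being hypothetical:

```python
from pathlib import PurePosixPath

def paths_conflict(existing, candidate):
    """True if one path equals or contains the other."""
    a = PurePosixPath(existing)   # normalizes trailing slashes
    b = PurePosixPath(candidate)
    return a == b or a in b.parents or b in a.parents

existing = "/a/b/c"
# every path from the example above conflicts with "/a/b/c"
for p in ["/a/b/c/", "/a/b/c", "/a/b/c/d", "/a/b"]:
    print(p, paths_conflict(existing, p))   # all True
print("/a/x", paths_conflict(existing, "/a/x"))  # False: unrelated path
```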
Best Practice
Always include the target schema name and table name at the end of the HDFS path to minimise path conflicts and improve readability.
Storage Format
This is the format of data files used to save target-table records.
Best Practices
- Choose storage format based on the query engine. Use ORC for best query performance on Hive. Use Parquet to query target tables using Impala.
- Use compression whenever possible. Although compression and decompression add CPU overhead, they improve query performance because querying big data is I/O-intensive. Compressing data on HDFS also means less data is moved across nodes, which ultimately reduces query time.
Natural Keys
- This is a set of columns which uniquely identify a record in the target table.
- These columns are useful for Join and Merge optimisation because in Hive target tables, bucketing is performed using these natural keys by default.
Best Practice
If no column can uniquely identify a record, use a derive node (before target node) with ‘_iwuuid()_’ as the expression to derive a unique ID per record.
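The surrogate-key idea can be sketched with Python's `uuid4` standing in for the `_iwuuid()_` expression; the `add_row_id` helper and the `row_id` column name are illustrative assumptions, not the product's behaviour.

```python
import uuid

def add_row_id(record):
    """Attach a unique surrogate key when no natural key exists."""
    return {**record, "row_id": str(uuid.uuid4())}

row = add_row_id({"amount": 100, "region": "west"})
print(row)  # original fields plus a unique row_id
```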
Primary Partition Columns
- Partitioning creates a hierarchy of directories (one subdirectory for each partition column value) to divide records into multiple data files. This allows your filter query to scan only relevant data files when filtered using partitioning column.
- Selecting partitioning column: An immutable column which has low/medium cardinality is recommended for primary partitions.
- It is critical to choose low/medium-cardinality partitioning columns such as country, city, or region, as HDFS data manipulation is more efficient with a small set of large files than with a large set of small files.
- If there are multiple partitioning columns, their order is significant since it translates to the directory structure in HDFS.
- Decimal/Float types are not allowed as partition columns. If you intend to use a decimal type column as partition key, typecast it to int or string using a derive node before target node.
Best Practices
- Partitioning target table on columns frequently used for filtering or joining can speed up all such queries as it significantly reduces the IO cost of scanning the entire table.
- To query only some parts of a data that accumulates over time, such as time series data or logs, use partitioning to store the data in a tree such as year/month/day, etc.
- If queries read most of the data, partitioning is useful only when the partition columns have low cardinality and the number of secondary partitions is small.
- Partitions with a uniform distribution of records parallelize the merge and reduce target build time.
- When using in-memory execution engines like Impala, partition using numeric columns instead of strings whenever possible. This improves performance and scalability as the in-memory storage for the partition key columns is more compact, and computations are faster. Use the smallest integer type that holds the appropriate range of values, typically TINYINT for MONTH and DAY, and SMALLINT for YEAR.
Caution:
- Avoid over-partitioning. For each partition column, extra metadata is maintained to route queries to the relevant partitions. Too many small partitions in a table can degrade query performance.
- It is critical to select partitioning columns only after fully understanding the kinds of queries that will run on the data. A partition column that benefits some queries can degrade the performance of others.
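Because the partition columns, in order, become a directory hierarchy under the table location, the year/month/day layout mentioned above can be sketched as follows; the `partition_path` helper and column names are illustrative.

```python
def partition_path(base, partition_cols, record):
    """Build the HDFS-style directory path for one record's partition."""
    parts = [f"{col}={record[col]}" for col in partition_cols]
    return "/".join([base] + parts)

rec = {"year": 2022, "month": 3, "day": 14, "value": 42}
path = partition_path("/data/events", ["year", "month", "day"], rec)
print(path)  # /data/events/year=2022/month=3/day=14
```

Note that reordering the partition columns produces a different directory tree, which is why their order is significant.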
Number of Secondary Partitions
- This field defines the number of buckets created under target table location of an unpartitioned table, or under each partition-specification of a partitioned table.
- Secondary partitions (buckets) group records into data files for faster join and search performance.
- Number of secondary partitions also determines the parallelism for building the pipeline.
- Default number of secondary partitions is one for DF targets.
Best Practices
- Optimum secondary partition count: Limit the number of secondary partitions so that each bucket holds about 250 MB of compressed data or one million rows, whichever is lower.
- If the table has many primary partitions, use fewer secondary partitions. Determine the count by estimating the total number of small files created across all partitions; HDFS efficiency degrades with too many small files.
- Use the relevant Hive configurations for Bucket Map join or SMB Map join when joining target tables with more than one secondary partition. DF tries to bucket records using natural keys whenever possible; if one of the natural keys is a partition key, records are grouped on the unique ziw_row_id instead.
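The sizing rule above (each bucket capped at ~250 MB compressed or one million rows, whichever limit binds first) can be sketched as a calculation; the `suggested_buckets` helper is an illustrative assumption, not a product formula.

```python
import math

MAX_BUCKET_BYTES = 250 * 1024 * 1024   # ~250 MB compressed per bucket
MAX_BUCKET_ROWS = 1_000_000            # ~1M rows per bucket

def suggested_buckets(total_bytes, total_rows):
    """Smallest bucket count that respects both per-bucket limits."""
    by_size = math.ceil(total_bytes / MAX_BUCKET_BYTES)
    by_rows = math.ceil(total_rows / MAX_BUCKET_ROWS)
    return max(by_size, by_rows, 1)

# 10 GB of compressed data but only 3M rows: the size limit binds,
# so more buckets are needed than the row count alone would suggest.
print(suggested_buckets(10 * 1024**3, 3_000_000))  # 41
```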
Index Columns
- Indexing uses bloom filters in ORC files to index string columns. When a target is created in ORC format, the system processes these columns to create indexes within ORC files.
- Only string-type columns can be indexed. To index a numeric or date column, use a derive node before the target node to cast it to string.
Best Practices
- Indexing is useful when partitioning is not possible. It improves search and filter performance.
- Also sort on the index columns; sorting ensures that only a few distinct values are present in each ORC stripe, which makes the bloom filter index effective.
Sort Columns
- The records in each reducer will be sorted according to the order specified by the user.
- The order of column selection matters.
Best Practices
- Columns used in the WHERE clause should be selected in this section. Select columns in order of cardinality, with lower-cardinality columns before higher-cardinality ones.
- If indexing is used, sort columns must also include all index columns to utilize indexing effectively.
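The cardinality-ordering rule above can be sketched by estimating each column's cardinality from the data itself; the `sort_columns_by_cardinality` helper and the sample columns are illustrative.

```python
def sort_columns_by_cardinality(records, columns):
    """Order columns by ascending number of distinct values."""
    card = {c: len({r[c] for r in records}) for c in columns}
    return sorted(columns, key=lambda c: card[c])

records = [
    {"country": "US", "city": "NYC"},
    {"country": "US", "city": "SF"},
    {"country": "IN", "city": "Pune"},
]
order = sort_columns_by_cardinality(records, ["city", "country"])
print(order)  # ['country', 'city'] -- country has fewer distinct values
```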
For more details, refer to our Knowledge Base and Best Practices.
For help, contact our support team.
(C) 2015-2022 Infoworks.io, Inc. Confidential.