General Guidelines on Data Pipelines

This article explains the general guidelines to be followed in Data Transformation.

Denormalization

Store the denormalized target tables whenever possible for faster querying.
Data storage cost is cheaper compared to query processing cost.
By maintaining denormalized data, multiple I/O operations for joining using foreign keys are avoided which helps in performance benefits.

The settings of execution engine impacts query performance significantly. Ensure the execution engine is set according to the cluster capacity and the type of queries to be executed.
For example, the default value of the Hive configuration variable hive.mapjoin.smalltable.filesize is 25MB. This parameter helps determine the small-table to fit in memory for map-join. You must set this according to the available resources on the cluster.
Similarly, other optimizations supported by the execution engine must be used for best query performance. For example, Hive supports SMB map join optimization to join bucketed and sorted tables. Set the relevant Hive configurations before executing queries involving such joins for optimum performance.

Gathering statistics in advance for all tables used in performance-critical or high-volume join queries provides significant cost-based join optimizations.
Stats are computed for data transformation targets during pipeline build in most of the cases. However, if queries involve joining with Infoworks sources or external data lake tables, precompute statistics for these tables.
Column statistics are expensive to calculate and provide insignificant benefit while querying denormalized tables.

Last updated on