Session 2: Data Pipeline

Concept

After the data is loaded into the data lake, it needs to be transformed in preparation for downstream use. Challenges in this area tend to be a collection of smaller issues that add up over time to create ongoing maintenance and management challenges. Some of the issues that need to be considered include:

  • How do you create pipelines for incrementally loading data?
  • How do you migrate legacy SQL workloads into the big data environment?
  • How do you optimize the running of pipelines and ensure reliability and availability?
  • How do you deal with changes in the underlying technology?
  • How does the system handle propagation of upstream changes?

Infoworks automates and simplifies the creation of production-ready data pipelines that consist of data transformations and machine learning algorithms.

Objectives

In this session, you will:

  • Create a pipeline with Orders-related data
  • Derive columns from Weather and Orders to join the tables
  • Connect Weather with Orders in the pipeline
  • Verify data using Oracle SQL Developer or a hive query tool of your choice

The pipeline combines data from Orders and OrderDetails from SalesDB with weather data from the Weather source we ingested in the previous session. Since the date format in these tables is different, you will need to standardize the date formats before joining them. You will be creating 2 targets as part of this pipeline: SalesTarget (denormalized view of the order details) and OrdersDim (distinct orders). In the next session, we will use these tables as the fact and dimension tables in the cube.

Step 1: Create Pipeline

  • Click the Domains tab to view the domain assigned to you.
  • Search for SalesForecasting_your username.
  • Click the SalesForecasting domain created for you.
  • Click the Pipelines icon on the left.
  • Click the New Pipeline button.
  • Create a new pipeline with the Name specified in your “Welcome to Infoworks” email.

Set Execution Engine to Hive and click Save.

Step 2: Design Pipeline

  • Click the SalesPipeline pipeline that you created.
  • Click the Open Version Editor button or anywhere in the design canvas.

Step 3: Add Source Tables To Pipeline

  • Search the table Orders in the Sources section.
  • Drag and drop the Orders table to the canvas.
  • Similarly, drag and drop the Weather and Order_Details tables to the canvas.

Step 4: Add Derive Nodes To The Pipeline

We need to standardize the date columns in the orders and weather tables. In order to do so, we need to derive new columns from the existing date columns.

  • Drag and drop the two Derive transformation nodes to the canvas.
  • Connect the Orders and Weather tables to their respective Derive node by clicking and dragging from the grey dot in the source nodes.

Step 5: Transform And Format Order Date

  • Double-click the Derive node (that is connected to ORDERS)
  • Click the edit icon next to the node name to rename it.
  • Click the Add Derivation button.
  • Enter the Column Name as FmtOrderDate and Expression as follows:
SQL
Copy
  • Click Save.
  • Click the Preview Data button and scroll to the right to view the transformed data in the FmtOrderDate derived column.
  • Edit the Derive node name and rename it to StandardizedOrderDate.
  • Click the X on the top right to return to the canvas.

Step 6: Transform And Format Weather Date

  • Double-click the Derive_2 node (that is connected to Weather), and then click the edit icon next to the node name to rename it.
  • Click the Add Derivation button, enter the Column Name as FmtWeatherDate and Expression as follows:
SQL
Copy
  • Click Save
  • Edit the Derive_2 node name and rename it to StandardizedWeatherDat e.
  • Click the Preview Data button and scroll to the right to view the transformed data in the FmtWeatherDate derived column.
  • Click the X on the top right to return to the canvas.

Step 7: Join Orders and Weather

  • Drag and drop a Join node to the canvas.
  • Connect the StandardizedOrderDate and StandardizedWeatherDate nodes to the Join node.

(First, connect StandardizedOrderDate and then StandardizedWeatherDate to the Join node.)

  • Double-click the Join node
  • Rename it to be called JoinOrdersWeather
  • Click the Edit Join button
  • Edit the properties as follows:

Join type: LEFT OUTER

Join columns:

StandardizedOrderDate: FmtOrderDate

StandardizedWeatherDate: FmtWeatherDate

  • Click Add to join on one more column

StandardizedOrderDate: shipcity

StandardizedWeatherDate: city

  • Click Save.
  • Click the Preview Data button to view the tables being joined.
  • Click the X to return to the canvas.

In the sample data, you may see a few rows showing up with NULL values in the weather columns. This is expected to happen as we defined the join as left join, and there may be cities for which we may not have weather information.

Step 8: Join OrderDetails and JoinOrdersWeather

  • Drag and drop the two Join nodes to the canvas.
  • Connect the OrderDetails table and JoinOrdersWeather node to the second Join node.

(First, connect OrderDetails and then JoinOrdersWeather to the Join node.)

  • Double-click the Join node
  • Rename it to be called JoinOrderDetailsOrders
  • Click the Edit Join button and edit the properties as follows:

Join type: INNER

Join columns:

ORDER_DETAILS: ORDERID

JoinOrdersWeather: orderid_1

  • Click Save.
  • Click the Preview Data button to view the tables being joined.
  • Click the X on the top right to return to the canvas.

Step 9: Add Derive Node To Derive Additional Metrics To Pipeline

  • Drag and drop a Derive node to the canvas.
  • Connect the JoinOrderDetailsOrders node to the Derive node and rename the node to DeriveMetrics.
  • Click the Add Derivation button and configure the Column Name as TotalAmt and Expression as follows:

(To compute the total amount by multiplying the unit price with the quantity)

SQL
Copy
  • Click Save.
  • Click the Preview Data button to view the data and scroll to the right to view the values populated in the new derived metric TotalAmt.

Step 10: Add Target Table Node To Pipeline

Pipeline targets have several configurations that can be adjusted to optimize queries and build duration of the target. For the scope of this tutorial, we will not be explaining the details. Please ask your instructor for a detailed explanation should be you interested.

  • Drag and drop a Target node to the canvas. (You will find it under the Transformations list in the left bottom.)
  • Connect the DeriveMetrics node to the Target Node.
  • Double-click to open the Target node
  • Rename the node to SalesTarget
  • Click the Edit Properties button and enter the following information:

Sync Type: OVERWRITE

Schema Name: salesforcasting (replace salesforecasting with your domain name or use salesforecasting_enter-your-name-here)

Table Name: sales_target

HDFS Path:/iw/enter-schema-name-here/sales_target

Natural Keys: ORDERID

Primary Partition Columns: SHIPREGION

No. Of Buckets: 1

  • Click Save.
  • Click the Preview Data button to view the sample data. (Typically you would want to ensure that all the columns that you want in your data model are available here.)
  • Click the X on the top right to return to the canvas.

Step 11: Add Aggregate Node To The Pipeline

You will now generate a second target table which will compute the number of orders placed by ship city. You can choose to skip this step and jump to step to Build the pipeline if you want to.

  • Drag and drop an Aggregate node to the canvas.
  • Connect the DeriveMetrics node to the Aggregate node.
  • Double-click the Aggregate node
  • Rename it to be called NumOrdersByShipCity
  • Click the Add Group By button and choose the following:

Group By Column: SHIPCITY

Column Name: SHIPCITY (leave default)

  • Click the Add Aggregate button and choose the following:

Column Name: NumOrders

From Column: ORDERID

Function: Distinct Count

  • Click Save for each Input Column.
  • Click the Preview Data button to view the sample data.
  • Click the X on the top right to return to the canvas.

Step 12: Add Target Table Node To Pipeline

  • Drag and drop a Target node to the canvas.
  • Connect NumOrdersByShipCity node to the target node.
  • Double-click the Target node
  • Rename the node OrdersByShipCity.
  • Click the Edit Properties Configuration button and enter the following information:

Sync Type: OVERWRITE

Schema Name: salesforcasting (replace salesforecasting with your domain name or use salesforecasting_enter-your-name-here)

Table Name: orders_by_shipcity

HDFS Path:/iw/enter-schema-name-here/orders_by_shipcity

Natural Keys: shipcity

No. Of Buckets: 1

  • Click Save.
  • Click the Preview Data button to view the sample data. (Typically you would want to ensure that all the columns that you want in your data model are available here.)
  • Click the X on the top right to return to the canvas.
  • Click the X on the top right to close the Pipeline Editor.

Congratulations! You have now completed the design of the Sales Pipeline.

Step 13: Build Pipeline

  • Click the Build icon.
  • Click the Build Now button.
  • The build is initiated and the status of the build is displayed.
  • On completion of the build, the status will be displayed as completed with a green pop-up alerting the build is successful. The build job will also spawn an additional job to generate sample data.

Congratulations! You have designed and built a data model that combines sales data coming from a database with weather data coming from files.

Session Summary

  • Identified how Orders can be connected to the Weather data.
  • Created a pipeline with Orders related data.
  • Derived columns from Weather and Orders to join the tables.
  • Derived metrics and aggregates from Weather and Orders.
  • Created 2 data models joining Weather with Orders in the pipeline.
VariableType to search · ESC to discard
GlossaryType to search · ESC to discard
InsertType to search · ESC to discard
No matches