Introduction
Infoworks Orchestrator is an add-on to Infoworks that aims to reduce the complexity of ETL workloads. It is designed especially for ETL Developers and Production Administrators.
Scope and Purpose
The majority of Enterprise Data Warehouse (EDW) processing power is spent on Extract, Transform, Load (ETL) tasks. Migrating ETL workloads from an EDW to Hadoop is complex, manually intensive, and expensive.
Infoworks Orchestrator, a complete solution for workload automation and management, is the fastest way to offload ETL use cases and manage ETL workloads on Hadoop. It lets you design dependencies between tasks so that they execute in a defined order to produce the desired results, and it enables reuse and automation of workload processing.
It offers an easy-to-use visual editor for authoring and editing workflows. The ETL developer can drag tasks from the left palette onto the canvas and connect them to define dependencies.
This user guide covers all the capabilities of Infoworks Orchestrator.
Feature Highlights
Following are some of the features of Infoworks Orchestrator:
- Provides a user-friendly GUI-driven way to transform data.
- Provides fault tolerance of production workloads.
- Provides execution metrics for performance tuning.
- Enables dynamic control of the production workloads.
- Fetches immediate feedback on failed tasks.
- Provides enterprise-friendly features, including domain-based access to sources and pipelines.
- Ensures automatic dependency management.
User Roles
Production Administrators use Infoworks Orchestrator to monitor, control, debug, and performance-tune workloads.
Data Engineers or Analysts use Orchestrator to design end-to-end orchestrations spanning data ingestion, synchronization, and the building of data models.
Advantages
Running complex tasks in production manually requires a large number of production engineers. It is also difficult to optimize workload balancing across all available resources.
Infoworks Orchestrator solves the challenges faced during manual ETL workload management in the following ways:
- Workload Management: Orchestrator provides fine-tuned control over execution logic, such as parameters, automated dependency management, and the ability to pause and resume or restart a workflow. It takes only a couple of minutes to debug job failures.
- Efficient Resource Management: Automatically and simultaneously executes various tasks.
- Easy to Design and Maintain: Complex ETL pipelines can be authored using a drag and drop tool providing an audit trail of changes made.
- Fault Tolerance: Provides control over retry and restart logic.
- Future Proof ETL Pipeline: Eases onboarding of new developers because workflows have minimal dependency on the underlying tools or programming languages.
- Self Documentation: The workflows and pipelines are user-friendly and provide a visual representation of the process flow and task descriptions.
Pre-requisites
To achieve the required ETL workload management using Infoworks Orchestrator, you must either create the required Domains in Infoworks or use applicable Domains that already exist in the system. Workflows define the order in which tasks are performed to achieve specific results, and are created within a domain. They can be made reusable across Infoworks. If the system has no pre-existing Domains, complete the following prerequisite steps before working with the Orchestrator.
NOTE: Only a user with admin access can create the Domain, and add Sources to a Domain. If you do not have admin access, contact the administrator to perform these tasks.
- Create a Domain.
- Add the required Source/Sources to the Domain.
NOTE: To process data, a minimum of one Source must be added to a Domain. There is no limit on the maximum number of Sources.
- If pipeline builds are also part of the workflow, ensure that those pipelines are created in the same domain or in an accessible domain (a domain that has access to the domain where the pipelines reside).
Concepts
Workflow
In the Orchestrator, a Workflow is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
For example, a simple Workflow could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say that the workflow will run every night at 10 pm, but shouldn't start until a certain date.
In this way, a workflow describes how you want to carry out your workload. For example, A, B, and C could be any task. Suppose that A prepares data for B to analyze, while C sends an email. The important thing is that the Workflow is not concerned with what the constituent tasks do. It ensures that whatever they do occurs at the right time, in the right order, or with the right handling of any unexpected issues.
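The rules described above can be sketched in plain Python. This is an illustrative model only, not the Orchestrator API: the `Task` class, its fields, and the `run_order` helper are hypothetical names invented for this example, while the timeout, retry, and dependency values match the A/B/C scenario above.

```python
# Hypothetical sketch of the A/B/C workflow described above.
# The Task class and run_order helper are illustrative only;
# in Orchestrator this is configured visually on the canvas.

class Task:
    def __init__(self, name, timeout_minutes=None, max_retries=0):
        self.name = name
        self.timeout_minutes = timeout_minutes  # e.g. A times out after 5 min
        self.max_retries = max_retries          # e.g. B may be retried 5 times
        self.upstream = []                      # tasks that must succeed first

    def after(self, other):
        """Declare that this task runs only after `other` succeeds."""
        self.upstream.append(other)
        return self

a = Task("A", timeout_minutes=5)       # prepares data
b = Task("B", max_retries=5).after(a)  # analyzes data; depends on A
c = Task("C")                          # sends an email; can run anytime

def run_order(tasks):
    """Return one valid execution order that respects dependencies."""
    done, order = set(), []
    def visit(task):
        if task.name in done:
            return
        for upstream in task.upstream:
            visit(upstream)   # run dependencies first
        done.add(task.name)
        order.append(task.name)
    for task in tasks:
        visit(task)
    return order

print(run_order([b, c]))  # A is scheduled before B; C is independent
```

The key point mirrors the text: the workflow model knows nothing about what A, B, and C actually do; it only encodes when each may run.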
Task
A task represents a single unit of work in a workflow. Tasks are usually (but not always) atomic, meaning they can stand on their own and don't need to share resources with any other tasks. The Workflow makes sure that tasks run in the correct order; beyond those dependencies, tasks generally run independently. In fact, they may run on two completely different machines.
When a task is added to a workflow (by dragging it onto the canvas), the task properties can be configured. For example, for a Pipeline build task, the user will need to specify the pipeline that should be built.
Task Dependency
Task dependencies define the order in which tasks in the workflow must be executed. Infoworks Orchestrator supports the following types of dependencies:
- Sequential Execution - Task B runs after task A has successfully completed
- Parallel Execution - Tasks B and C can run in parallel
- Conditional Execution - After Task A has run, based on some condition, either Task B or C will be executed (the other task will be skipped)
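The three dependency types above can be sketched in plain Python. This is a minimal illustration, not how Orchestrator executes tasks internally; the task functions and their return values are hypothetical.

```python
# Illustrative sketch of sequential, parallel, and conditional
# dependencies using plain Python. Function names and data are
# hypothetical; Orchestrator defines the same logic visually.
from concurrent.futures import ThreadPoolExecutor

def task_a():
    return {"rows": 120}               # upstream task producing a result

def task_b(result):
    return f"B saw {result['rows']} rows"

def task_c(result):
    return "C archived the batch"

# Sequential execution: downstream tasks start only after A succeeds.
result = task_a()

# Parallel execution: B and C have no dependency on each other,
# so once A is done they can run at the same time.
with ThreadPoolExecutor() as pool:
    b_out, c_out = pool.map(lambda task: task(result), [task_b, task_c])

# Conditional execution: pick one downstream branch based on A's
# outcome; the branch that is not chosen is skipped.
chosen = task_b if result["rows"] > 0 else task_c
print(chosen(result))
```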