Prerequisites
On-premise
- A high-configuration machine must be available as the edge node so that jobs run faster.
- Hadoop, Hive, Spark2 and HBase must be installed and running in the cluster.
- A minimum of 50 GB of disk space must be available.
- Java 8 must be installed on the cluster.
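The disk space and Java requirements above can be checked with a short script before installation. The following is a minimal sketch, not part of the Infoworks tooling: the function names are hypothetical, and it assumes the check is run on the machine in question and that `java -version` reports Java 8 as a `1.8.x` version string.

```python
import re
import shutil
import subprocess

MIN_FREE_GB = 50  # minimum disk space stated in the prerequisites


def free_space_gb(path="/"):
    """Return the free space, in GB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024 ** 3


def is_java_8(version_output):
    """Return True if the text printed by `java -version` reports Java 8."""
    match = re.search(r'version "([^"]+)"', version_output)
    if not match:
        return False
    version = match.group(1)
    # Java 8 normally identifies itself as 1.8.x; a bare "8" is accepted too.
    return version.startswith("1.8.") or version.split(".")[0] == "8"


def check_prerequisites(path="/"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if free_space_gb(path) < MIN_FREE_GB:
        problems.append(f"less than {MIN_FREE_GB} GB free on {path}")
    try:
        # `java -version` writes its banner to stderr, not stdout.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
        if not is_java_8(result.stderr):
            problems.append("Java 8 was not detected")
    except FileNotFoundError:
        problems.append("java executable not found on PATH")
    return problems
```

Calling `check_prerequisites()` returns an empty list when both checks pass.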
Hortonworks
If security such as Ranger and Kerberos is enabled, the following prerequisites apply:
- Kerberos: A principal and keytab must be created for <Infoworks_User>, and the keytab must be available on the edge node.
- Ranger: <Infoworks_User> must have permissions in the Hadoop policy for the /user/<Infoworks_User> directory.
Supported version: HDP-2.6.4.0
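One way to confirm that the keytab on the edge node works for the principal is to request a ticket with the standard MIT Kerberos `kinit -kt` command. The sketch below wraps that call; the helper names are hypothetical, and the principal and keytab path in the usage note are placeholders rather than values defined by Infoworks.

```python
import subprocess


def kinit_command(principal, keytab_path):
    """Build the kinit invocation that obtains a ticket from a keytab."""
    return ["kinit", "-kt", keytab_path, principal]


def keytab_login(principal, keytab_path):
    """Run kinit with the keytab; return True if a ticket was obtained."""
    try:
        result = subprocess.run(
            kinit_command(principal, keytab_path),
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # kinit is not installed or not on PATH on this machine.
        return False
    return result.returncode == 0
```

For example, `keytab_login("infoworks@EXAMPLE.COM", "/path/to/infoworks.keytab")` returns `True` only if `kinit` exits successfully for that placeholder principal and path.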
Cloudera
If security such as Sentry and Kerberos is enabled, the following prerequisites apply:
- Kerberos: A principal and keytab must be created for <Infoworks_User>, and the keytab must be available on the edge node.
- Sentry: <Infoworks_User> must have permissions in the Hive policy.
Supported Version: CDH 5.13.0
MapR
If security is enabled, perform the following steps from the <Infoworks_User> terminal on the edge node to generate a ticket:
- Run the maprlogin password command.
- Enter the password of <Infoworks_User> when prompted.
Supported Version: MapR 6.0.1.20180404222005.GA
Cloud
GCP
- The quota limit for CPU cores must be greater than 72 in the region where Infoworks is spun up.
- The APIs for the Dataproc, Compute Engine, Deployment Manager and Runtime Configuration services must be enabled.
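The four APIs can be enabled in one step with the `gcloud services enable` command. The sketch below only builds that command line. The service identifiers are the `googleapis.com` names commonly used for these products and should be confirmed against the project's API library; the function name and the project ID in the usage note are hypothetical.

```python
# Assumed service identifiers for the APIs named in the prerequisites.
REQUIRED_SERVICES = [
    "dataproc.googleapis.com",           # Dataproc
    "compute.googleapis.com",            # Compute Engine
    "deploymentmanager.googleapis.com",  # Deployment Manager
    "runtimeconfig.googleapis.com",      # Runtime Configuration
]


def enable_services_command(project_id):
    """Build a gcloud command that enables all required APIs for a project."""
    return [
        "gcloud", "services", "enable",
        *REQUIRED_SERVICES,
        "--project", project_id,
    ]
```

Joining the result of `enable_services_command("my-project")` with spaces gives a command that can be pasted into a shell where the gcloud CLI is installed and authenticated.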
Microsoft Azure
- The quota limit for CPU cores must be greater than 60 in the region where Infoworks is spun up.
- ADL storage, if used, must be created in a resource group before Infoworks is spun up.
- A VNet must be created in a resource group before Infoworks is spun up.
EMR
EMR Version: 5.17
Components
The following components (for EMR 5.17) must be selected when spinning up the cluster:
- Hadoop 2.8.4
- HBase 1.4.6
- Hive and HCatalog 2.3.3
- Spark 2.3.1
- Tez 0.8.4
- Zookeeper 3.4.12
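The component list above can be kept as data and compared against the applications selected for a cluster. This is a minimal sketch with hypothetical names; it assumes the selected applications are available as a name-to-version mapping, and how that mapping is obtained (console, CLI or API) is left open.

```python
# Components and versions required for EMR 5.17, as listed above.
REQUIRED_COMPONENTS = {
    "Hadoop": "2.8.4",
    "HBase": "1.4.6",
    "Hive": "2.3.3",
    "HCatalog": "2.3.3",
    "Spark": "2.3.1",
    "Tez": "0.8.4",
    "Zookeeper": "3.4.12",
}


def component_problems(selected):
    """Compare a {name: version} mapping against the required components.

    Returns a list of messages; an empty list means every required
    component is selected at the expected version.
    """
    problems = []
    for name, version in REQUIRED_COMPONENTS.items():
        found = selected.get(name)
        if found is None:
            problems.append(f"{name} {version} is not selected")
        elif found != version:
            problems.append(f"{name} is {found}, expected {version}")
    return problems
```

Extra applications in the mapping are ignored, so the check only flags components that are missing or at a different version.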
Node Services
Ensure that the following node services are running on the respective nodes:
- Master Node: Name Node, Resource Manager, Hive Servers, Application Timeline Server, Spark History Server and Zookeeper Server.
- Core Nodes: Data Nodes, Node Managers and Region Servers.
- Edge Node: All Clients and IWX.
Edge Node
Infoworks requires a compute instance to be set up as the edge node.
- This instance must be in the same subnet as the EMR cluster.
- This instance must have an IAM role associated with EMR and S3.
- If security is enabled, Infoworks requires a principal and keytab for the user that installs and runs Infoworks.
Additional Information
- The user must provide the AWS account ID so that Infoworks can grant access to the edge node image.
- The user must provide the private DNS name of the master node of the EMR cluster.