Deploying Infoworks for EMR Using Infoworks AMI

Prerequisites

  • EMR Version: 5.28.1
  • The customer's AWS Account ID, which must be whitelisted for access to the Infoworks edge node AMI.

Infoworks provides an Amazon Machine Image (AMI) of the edge node and Infoworks server software in a private marketplace library.

To obtain access to this AMI before proceeding, email the Infoworks support team the AWS Account ID of the account that will be used to access the Infoworks edge node image.

(Your Account ID is displayed in the My Account section of the AWS console.)

Infoworks support will enable access to the AMI for the provided AWS Account ID. Once this is done, you can proceed with the steps below.

Procedure

  • Log in to the AWS Console.
  • Search for EC2 under Find Services on the AWS Console dashboard.

NOTE: The Infoworks secured AMI works only on a Kerberized EMR cluster. Before starting the installation procedure, set up the Kerberos configuration.

To create the security configurations on EMR, see Secured EMR Cluster Deployment.

Choose AMI

  • Select Launch Instance from the EC2 dashboard, then select the image from the My AMIs section.

NOTE: The AMI ID for the secured edge node is ami-0a867d67c3d1f9f2e.

If the AMI is not listed on that screen, use the following alternative to launch it:

  • Open the EC2 dashboard.
  • Navigate to AMIs > Private Images.
  • Select Infoworks EMR AMI.
  • Click the Actions option and select Launch.

Choose Instance Type

  • Select the machine type for the Infoworks edge node. The minimum and recommended instance type is m4.4xlarge.

Configure Instance

  • Set Number of instances to 1.
  • Select the same VPC and subnet as the EMR cluster.

Add Storage

  • Specify the root volume size in GB (for example, 300 GB).

Add Tags

  • Add naming convention or environment tags for the resource.

Configure Security Group

  • Create a new security group that allows the Infoworks ports and SSH.

Review

  • Review the configuration, select an existing key pair or create a new one, and proceed with creating the instance.

Pre-Installation Procedure

  • Log in to the Infoworks DataFoundry edge node as hadoop or ec2-user.
  • Copy the ssl-client.xml and ssl-server.xml files from the master node to S3 using the following commands:

aws s3 cp /etc/hadoop/conf/ssl-client.xml s3://bucket_name/ssl-xmls/ssl-client.xml

aws s3 cp /etc/hadoop/conf/ssl-server.xml s3://bucket_name/ssl-xmls/ssl-server.xml

  • Download the ssl-client.xml and ssl-server.xml files to the edge node using the following commands:

wget https://bucket_name.s3.amazonaws.com/ssl-xmls/ssl-client.xml -O /etc/hadoop/conf/ssl-client.xml

wget https://bucket_name.s3.amazonaws.com/ssl-xmls/ssl-server.xml -O /etc/hadoop/conf/ssl-server.xml

  • Copy the JKS certificates from the master node to S3 using the following command:

aws s3 cp /usr/share/aws/emr/security/conf/ s3://bucket_name/jks_certs/ --recursive

  • Download the certificate files to the edge node using the following commands:

wget https://bucket_name.s3.amazonaws.com/jks_certs/keystore.jks -O /usr/share/aws/emr/security/conf/keystore.jks

wget https://bucket_name.s3.amazonaws.com/jks_certs/truststore.jks -O /usr/share/aws/emr/security/conf/truststore.jks
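The copy-and-download steps above can be spot-checked with a small script before continuing; the paths below are the ones used in this guide:

```shell
# A verification sketch: confirm that every file downloaded in the steps
# above actually landed on the edge node and is non-empty.
require_file() {
  # -s is true when the file exists and has a size greater than zero
  if [ -s "$1" ]; then
    echo "present: $1"
  else
    echo "MISSING or empty: $1"
  fi
}

for f in /etc/hadoop/conf/ssl-client.xml \
         /etc/hadoop/conf/ssl-server.xml \
         /usr/share/aws/emr/security/conf/keystore.jks \
         /usr/share/aws/emr/security/conf/truststore.jks; do
  require_file "$f"
done
```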

Installation Procedure

The default user is ec2-user.

Perform the following steps:

NOTE: Kerberos tickets are renewed before each Infoworks DataFoundry job runs. The Infoworks DataFoundry platform supports a single Kerberos principal for a Kerberized cluster; all Infoworks DataFoundry jobs therefore run as the same principal, which must have access to all the artifacts in Hive, Spark, and HDFS.
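For illustration, obtaining a ticket for that principal manually might look like the following; the principal name here is an assumption (the keytab path matches the one copied later in this guide), so substitute the values configured for your cluster:

```
kinit -kt /etc/infoworks-user.keytab infoworks-user
klist
```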

Post Installation

  • Copy the keystore passwords from /etc/hadoop/conf/ssl-client.xml to the $IW_HOME/conf/dt_spark_defaults.conf file.

Set the password mapping as follows:

ssl.client.keystore.keypassword ⇒ spark.ssl.keyPassword

ssl.client.truststore.password ⇒ spark.ssl.trustStorePassword

ssl.client.keystore.password ⇒ spark.ssl.keyStorePassword
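The mapping above can be scripted. This sketch assumes the standard Hadoop XML layout, where each <value> element sits on the line after its <name>; it prints the mapped Spark properties for review before you append them to $IW_HOME/conf/dt_spark_defaults.conf:

```shell
# Path to the Hadoop SSL client config copied to the edge node earlier
SSL_CLIENT="${SSL_CLIENT:-/etc/hadoop/conf/ssl-client.xml}"

# get_prop NAME -> prints the <value> for the given <name> in the XML,
# assuming the value appears on the line immediately after the name
get_prop() {
  grep -A1 "<name>$1</name>" "$SSL_CLIENT" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Print the mapped Spark properties; append this output to
# $IW_HOME/conf/dt_spark_defaults.conf once the values look right.
echo "spark.ssl.keyPassword $(get_prop ssl.client.keystore.keypassword)"
echo "spark.ssl.trustStorePassword $(get_prop ssl.client.truststore.password)"
echo "spark.ssl.keyStorePassword $(get_prop ssl.client.keystore.password)"
```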

  • If spark.dynamicAllocation.enabled is true, change the spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.initialExecutors property values from 50 to 1.
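After that change, the relevant entries in dt_spark_defaults.conf would read as follows (a sketch; keep any other values your file already sets):

```
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.initialExecutors 1
```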
  • Copy the userData.json file from the master node to S3 using the following command:

aws s3 cp /var/aws/emr/userData.json s3://bucket_name/userdata/userData.json

  • Download the userData.json file to the edge node using the following command:

wget https://bucket_name.s3.amazonaws.com/userdata/userData.json -O /var/aws/emr/userData.json

  • Change the ownership of the file to infoworks-user by running the following command:

sudo chown infoworks-user:infoworks-user /var/aws/emr/userData.json

  • Copy the keytabs to S3 using the following commands:

aws s3 cp /etc/hdfs.keytab s3://bucket_name/keytabs/hdfs.keytab

aws s3 cp /etc/infoworks-user.keytab s3://bucket_name/keytabs/infoworks-user.keytab

  • In the Infoworks DataFoundry landing page, navigate to Admin > Configuration > Add Configuration Entry. Add the property modified_time_as_cksum, set it to True, and save it.
  • In the Infoworks DataFoundry landing page, navigate to Admin > Configuration. Change the value of the CSV_PARSER_LIB property from COMMONS to UNIVOCITY.

Perform a sanity check by running HDFS commands and the Hive shell on the edge node.
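The sanity check can be wrapped in a small helper so that each command reports pass or fail without stopping the remaining checks; the hdfs and hive invocations are the standard CLI entry points:

```shell
# check CMD ARGS... -> run the command and report OK/FAIL without aborting
check() {
  if "$@" >/dev/null 2>&1; then
    echo "OK:   $*"
  else
    echo "FAIL: $*"
  fi
}

# Basic reachability checks from the edge node; run these after kinit so the
# single Kerberos principal has a valid ticket.
check hdfs dfs -ls /user
check hive -e 'SHOW DATABASES;'
```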

IMPORTANT: Ensure that you add the edge node's security group ID as an allowed source in the EMR security group's inbound rules, so that all traffic from the edge node is permitted.
