Specifications
Warning: Ensure that all pending CDCs are merged before installing the package.
Infoworks Server
- Web server (primary and secondary for availability)
- Metadata server (primary and secondary for availability)

The secondary servers are not mandatory; they provide service recovery and are recommended for the Metadata server.
Product Requirements
Infoworks DF Path: The path where the Infoworks DF must be installed, for example, /opt/infoworks.
Ports: Infoworks services use ports 2999-3009, 5672, 7070, 7071, 7080, 7001, 7005, 7006, 8005, 27017, 3011, and 3012 on the Infoworks server, plus 23011 and 23012 if platform HA is enabled.
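As a pre-install sanity check, the port list above can be scanned for conflicts. The script below is a sketch: it probes localhost using bash's /dev/tcp redirection and reports any port that already accepts connections (23011 and 23012 are included, as if platform HA were enabled).

```shell
#!/usr/bin/env bash
# Sketch: report Infoworks ports that are already bound on this host.
# The port list is taken from the requirements above.
PORTS="$(seq 2999 3009) 5672 7070 7071 7080 7001 7005 7006 8005 27017 3011 3012 23011 23012"
for port in $PORTS; do
  # A successful connect means something is already listening on the port.
  if (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; then
    echo "port $port is already in use"
  fi
done
echo "port scan complete"
```

Any port reported as in use must be freed (or its service relocated) before installation.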
Access to Database: To ingest data from RDBMS, all the nodes in the cluster must have access to the database.
Hadoop Stack: Any one of the supported Hadoop distributions mentioned in the Product Availability Matrix.
Software: The Infoworks DF requires the software, services, and libraries listed below:
- Java: JDK 64-bit (v8)
- Hadoop Clients: Hadoop v2.7.x, Hive v1.2.x, and HBase v1.x clients
External Libraries: The following libraries must be downloaded and installed by the customers:
- Download Java Cryptography Extension (JCE) from http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html.
- Copy the jar files from this package to $JAVA_HOME/jre/lib/security/ folder on the Infoworks edge node.
- For external libraries based on your use-case, see External Client Drivers.
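The JCE copy step above can be sketched as a small helper. The jar names local_policy.jar and US_export_policy.jar are the two policy jars shipped in the JCE 8 package; the source directory shown in the example call is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: copy the unlimited-strength JCE policy jars into the JDK used by
# Infoworks on the edge node. Paths are illustrative.
install_jce_jars() {
  local src_dir="$1" java_home="$2"
  cp "$src_dir/local_policy.jar" "$src_dir/US_export_policy.jar" \
     "$java_home/jre/lib/security/"
}

# Example (assumes the JCE zip was extracted to /tmp/UnlimitedJCEPolicyJDK8):
# install_jce_jars /tmp/UnlimitedJCEPolicyJDK8 "$JAVA_HOME"
```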
Storage Requirements
- Disk Storage: The disk on which the Infoworks DF is installed (typically the /opt/infoworks directory) must have at least 10 GB of free space reserved exclusively for Infoworks DF usage; 50 GB is recommended. This does not include data storage on HDFS.
- HDFS Storage: This depends on the size of the source data being ingested.
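The 10 GB disk minimum above can be verified before installation. This sketch runs df against the parent directory of the typical install path; the path is illustrative and should be adjusted to your environment.

```shell
#!/usr/bin/env bash
# Sketch: check that the install disk meets the 10 GB minimum stated above.
required_kb=$((10 * 1024 * 1024))   # 10 GB minimum, expressed in kilobytes
install_path="/opt"                 # typical parent of /opt/infoworks
[ -d "$install_path" ] || install_path="/"
avail_kb=$(df -Pk "$install_path" | awk 'NR==2 {print $4}')
if [ "$avail_kb" -ge "$required_kb" ]; then
  echo "OK: ${avail_kb} KB free on ${install_path}"
else
  echo "INSUFFICIENT: only ${avail_kb} KB free on ${install_path}"
fi
```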
User Privileges
Ensure that the Infoworks users have the privileges to perform the following operations:
HBase Privileges
- Create table and namespace
HDFS Privileges
- Write permissions for the iw_df_workspace directory on HDFS
Hive Privileges
- Describe table (for schema crawl)
- Create/Drop Database in Hive
- Create/Drop/Alter external and internal tables
- Create partitions
- Create/Drop views
- Create/Drop temporary function
- Add jar files
Privileges for Oracle Log-Based Ingestion
- Create/Drop temp table in Oracle database
- Create view in Oracle database
Configurations
The Infoworks DF must be configured before installation to access all services it is expected to use. Each service consumed by the Infoworks DF must, in turn, listen on an interface accessible from all nodes of the Hadoop cluster. The Infoworks DF installer configures MongoDB accordingly. For the following services, the host addresses must be configured before installation, and the corresponding clients must be installed on the node where the Infoworks DF will run:
- Hadoop NameNode
- HiveServer2
- Spark Master
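The client requirement above can be verified on the node with a short check. This is a sketch: hdfs, hive, and spark-submit are the usual client entry points for these services, but the exact command names can differ by distribution.

```shell
#!/usr/bin/env bash
# Sketch: confirm the Hadoop, Hive, and Spark clients are on PATH on the
# node where the Infoworks DF will run.
missing=0
for cmd in hdfs hive spark-submit; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
    missing=$((missing + 1))
  fi
done
echo "$missing client(s) missing"
```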
Folder Structure
The Infoworks product is installed on the local filesystem in a pre-defined directory (typically /opt/infoworks). On the Infoworks server node, the following structure is created.
Main folder: /opt/infoworks
Subfolder | Content |
---|---|
apricot-meteor/ | #Infoworks UI, Job Executors and State Machine |
bin/ | #Infoworks binaries and shell scripts |
conf/ | #Configuration folder |
cube-engine/ | #Cube Engine |
df/ | #DataTransformation |
logs/ | #Logs for Infoworks Services |
lib/ | #Dependencies |
orchestrator-engine/ | #Orchestrator |
resources/ | #Third-party tools used by Infoworks (Python, Ant, Node.js) |
RestAPI/ | #Ad-Hoc Query and Scheduler servers |
temp/ | #Temporary generated files |
File and Folder Permissions on HDFS
The Infoworks DF runs under a separate user. The Infoworks DF user must meet one of the following conditions:
Target folder permissions
- All entities created using the Infoworks DF require a target location. The dedicated Infoworks DF user (or, if impersonation is enabled, the user performing the job) must have write permissions to this location.
- The Infoworks DF also requires a configurable temporary location, which likewise requires write permissions.
Ownership of the top-level directories
The top-level directories must be created by an authorized user and the ownership must be assigned to the Infoworks DF user, or Infoworks group, or the user performing the job, if impersonation is enabled.
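The ownership step above can be sketched as follows. The workspace path /iw_df_workspace comes from the HDFS privileges section; the infoworks user and group names are assumptions that vary per installation.

```shell
#!/usr/bin/env bash
# Sketch: create the top-level HDFS directory and assign ownership to the
# Infoworks DF user (run as an authorized HDFS user, e.g. via the superuser).
WORKSPACE="/iw_df_workspace"   # from the HDFS privileges section
IW_USER="infoworks"            # assumed Infoworks service user/group
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p "$WORKSPACE"
  hdfs dfs -chown "$IW_USER:$IW_USER" "$WORKSPACE"
  hdfs dfs -chmod 775 "$WORKSPACE"
else
  echo "hdfs client not found on this node" >&2
fi
```

If impersonation is enabled, assign ownership to the user performing the job instead.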
Time Synchronization
Hadoop, HBase, and by extension the Infoworks DF require all nodes in the cluster to have the Network Time Protocol (NTP) daemon installed and synchronized.
To install NTP and perform a one-time synchronization on RHEL (refer to the ntpd man pages for more details), use:
# yum install ntp
# ntpd -qg
# chkconfig ntpd on
# service ntpd start #(or restart if already running)
To perform synchronization at any time, use:
# service ntpd stop
# ntpdate -s time.nist.gov
# service ntpd start
Hardware Configurations
Hardware requirements are estimated from the expected data size and the load on the cluster. In general, the Master nodes (primary and secondary NameNode, HiveServer2, HBase Master, and Spark Master) and the Slave nodes (DataNode, HBase RegionServers, and Spark Workers) have different configuration requirements. Contact the Infoworks team to determine your precise requirements.
The recommended configurations are as follows:
Server | Configuration |
---|---|
For Masters | 16 vCPU, 64 GB RAM, 1 TB Storage Disk |
For Slaves (Datanodes) | 32 vCPU, 128 GB RAM, 4-8 TB Storage Disk |
For Infoworks Servers | 32 vCPU, 256 GB RAM, 1 TB Storage Disk |