Specifications

Product Requirements

Infoworks Path: The path where Infoworks must be installed, for example, /opt/infoworks.

Ports: Infoworks requires port 80 to be open for interaction with the Hadoop cluster from outside the Virtual Private Cloud (VPC) network. Infoworks services use the proxy ports 2999-3009, 5672, 7070, 7071, 7080, 7001, 7005, 7006, 8005, 27017, 3030, 3011, 3012, 23011, and 23012 (if platform HA is enabled) on the Infoworks server. These ports communicate only within the VPC and do not need to be open outside the internal network.
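
As a quick sanity check, the firewall rule for port 80 and the listening state of the internal service ports can be verified on the Infoworks server. This is a minimal sketch assuming a RHEL-style host with firewalld; port 3030 is used only as an illustrative internal port:

    # Open port 80 for access from outside the VPC (firewalld example)
    sudo firewall-cmd --permanent --add-port=80/tcp
    sudo firewall-cmd --reload
    # Confirm an internal Infoworks service port (for example, 3030) is listening
    sudo ss -tlnp | grep 3030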

Access to Database: To ingest data from an RDBMS, all the nodes in the cluster must have access to the database.
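
A simple reachability check can be run from each node; the database hostname and port below are placeholders for your environment:

    # Verify that the database host and port are reachable from this node
    nc -zv db-host.example.com 1521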

Software: Infoworks requires the software, services, and libraries listed below (a verification sketch follows the list):

  • Java: JDK 64-bit (v8)
  • Hadoop and Hive Clients associated with the cluster
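
The prerequisites can be confirmed on the Infoworks node with commands such as the following; the exact output varies by distribution:

    # Confirm the JDK is a 64-bit Java 8 build
    java -version
    # Confirm the Hadoop and Hive clients are installed and on the PATH
    hadoop version
    hive --version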

External Libraries: The following libraries must be downloaded and installed by the customer:

Storage Requirements

  • Disk Storage: The disk on which Infoworks is installed (typically the /opt/infoworks directory) must have at least 10 GB of free space reserved exclusively for Infoworks. The recommended disk space for Infoworks is 50 GB. This does not include data storage on HDFS. Free space can be verified as shown after this list.
  • HDFS Storage: This depends on the size of the source data being ingested.
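
Free local disk space and available HDFS capacity can be checked as follows; /opt/infoworks is the typical installation path and may differ in your environment:

    # Free space on the disk hosting the Infoworks installation directory
    df -h /opt/infoworks
    # Available capacity on HDFS
    hdfs dfs -df -h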

User Privileges

Ensure that Infoworks Hive users have the privileges to perform the following (a verification sketch follows each list):

HBase Privileges

  • Create table and namespace
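
A quick way to confirm these HBase privileges is a create-and-drop round trip in the HBase shell; the namespace and table names below are throwaway placeholders:

    # Create and drop a temporary namespace and table to confirm privileges
    echo "create_namespace 'iw_check'" | hbase shell
    echo "create 'iw_check:t1', 'cf'" | hbase shell
    echo "disable 'iw_check:t1'" | hbase shell
    echo "drop 'iw_check:t1'" | hbase shell
    echo "drop_namespace 'iw_check'" | hbase shell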

HDFS Privileges

  • Write permissions for the iw_df_workspace directory on HDFS
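
Write access can be confirmed with a touch-and-remove test run as the Infoworks user; the path below assumes iw_df_workspace sits at the HDFS root and should be adjusted to your layout:

    # Create and remove a marker file to confirm write permission
    hdfs dfs -touchz /iw_df_workspace/_permission_check
    hdfs dfs -rm /iw_df_workspace/_permission_check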

Hive Privileges

  • Describe table (for schema crawl)
  • Create/Drop Database in Hive
  • Create/Drop/Alter external and internal tables
  • Create partitions
  • Create/Drop views
  • Create/Drop temporary function
  • Add jar files
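
A rough smoke test of a subset of these privileges can be run through beeline; the JDBC URL and user name below are placeholders for your HiveServer2 endpoint:

    # Exercise database, table, and view privileges with throwaway objects
    beeline -u "jdbc:hive2://hiveserver2-host:10000" -n infoworks -e "
      CREATE DATABASE IF NOT EXISTS iw_priv_check;
      CREATE TABLE iw_priv_check.t1 (id INT);
      CREATE VIEW iw_priv_check.v1 AS SELECT id FROM iw_priv_check.t1;
      DROP VIEW iw_priv_check.v1;
      DROP TABLE iw_priv_check.t1;
      DROP DATABASE iw_priv_check;"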

Configurations

Infoworks must be configured before installation to access all services that it is expected to use. In turn, each service consumed by Infoworks must listen on an interface that is accessible from all nodes of the Hadoop cluster. The Infoworks installer configures MongoDB for this. For the following services, the host addresses must be configured before installation, and the corresponding clients must be installed on the node where Infoworks will run (a connectivity sketch follows the list):

  • Hadoop NameNode
  • HiveServer2
  • Spark Master
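
Reachability of these services from the Infoworks node can be checked as follows; the hostnames are placeholders and the ports are common defaults (NameNode RPC 8020, HiveServer2 10000, Spark master 7077) that may differ in your cluster:

    # Confirm the Infoworks node can reach each service endpoint
    nc -zv namenode-host 8020
    nc -zv hiveserver2-host 10000
    nc -zv spark-master-host 7077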

Folder Structure

The Infoworks product is installed on the local filesystem in a pre-defined directory (typically /opt/infoworks). On the Infoworks Server Node, the following structure is created.

Main folder: /opt/infoworks

Subfolder              Content
apricot-meteor/        Infoworks UI, Job Executors, and State Machine
bin/                   Infoworks binaries and shell scripts
conf/                  Configuration folder
dt/                    Data Transformation
logs/                  Logs for Infoworks services
lib/                   Dependencies
orchestrator-engine/   Orchestrator
resources/             Third-party tools used by Infoworks (Python, Ant, Node.js)
RestAPI/               Ad-hoc Query and Scheduler servers
temp/                  Temporary generated files

File and Folder Permissions on HDFS

Infoworks runs under a separate user. The Infoworks user must meet one of the following conditions:

Target folder permissions

  • All entities created using Infoworks require a target location. The dedicated Infoworks user (or, if impersonation is enabled, the user performing the job) must have write permissions to this location.
  • Infoworks also requires a configurable temporary location, which likewise requires write permissions.

Ownership of the top-level directories

The top-level directories must be created by an authorized user, and their ownership must be assigned to the Infoworks user, the Infoworks group, or, if impersonation is enabled, the user performing the job.
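
A minimal sketch of preparing a target directory, assuming a hypothetical path /iw/target and an infoworks user and group (substitute your own values):

    # Create the target directory as an authorized account (for example, the HDFS superuser)
    sudo -u hdfs hdfs dfs -mkdir -p /iw/target
    # Assign ownership to the Infoworks user and group, and grant write access
    sudo -u hdfs hdfs dfs -chown infoworks:infoworks /iw/target
    sudo -u hdfs hdfs dfs -chmod 775 /iw/target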

Time Synchronization

Hadoop, HBase, and by extension Infoworks, require all nodes in the cluster to have the Network Time Protocol (NTP) installed and synchronized.

To install NTP and perform a one-time synchronization on RHEL (refer to the ntpd man pages for more details), use:

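For example (the commands below assume a RHEL 6/7-style system and use pool.ntp.org as an illustrative server):

    # Install the NTP package and perform a one-time synchronization
    sudo yum install -y ntp
    sudo ntpdate pool.ntp.org
    # Start the NTP daemon and enable it at boot
    sudo service ntpd start
    sudo chkconfig ntpd on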

To perform synchronization at any time, use:

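For example (again using pool.ntp.org as a placeholder server):

    # Force an immediate synchronization against an NTP server
    sudo ntpdate -u pool.ntp.org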

Hardware Configurations

Hardware requirements are estimated from the expected data size and the load on the cluster. In general, the Master nodes (Primary and Secondary NameNode, HiveServer2, HBase Master, and Spark Master) and the Slave nodes (Datanodes, HBase Region Servers, and Spark Workers) have different configuration requirements. Contact the Infoworks team to determine the precise requirements.

The recommended configurations are as follows:

Server                    Configuration
Masters                   16 vCPU, 64 GB RAM, 1 TB storage disk
Slaves (Datanodes)        32 vCPU, 128 GB RAM, 4-8 TB storage disk
Infoworks Servers         32 vCPU, 256 GB RAM, 1 TB storage disk