Specifications
Warning: Ensure that all pending CDCs are merged before installing the package.
Infoworks Server
- Web server (primary and secondary for availability)
- Metadata server (primary and secondary for availability)

The secondary servers are not mandatory; they provide service recovery and are recommended for the Metadata server.
Product Requirements
Infoworks DF Path: The path where the Infoworks DF must be installed, for example, /opt/infoworks.
Ports: Infoworks services use ports 2999-3009, 5672, 7070, 7071, 7080, 7001, 7005, 7006, 8005, 27017, 3011, and 3012 on the Infoworks server, plus 23011 and 23012 if platform HA is enabled.
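As a pre-install sanity check, the port list above can be scanned for conflicts. The script below is a sketch: it probes localhost using bash's /dev/tcp redirection and reports any port that already accepts connections (23011 and 23012 are included, as if platform HA were enabled).

```shell
#!/usr/bin/env bash
# Sketch: report Infoworks ports that are already bound on this host.
# The port list is taken from the requirements above.
PORTS="$(seq 2999 3009) 5672 7070 7071 7080 7001 7005 7006 8005 27017 3011 3012 23011 23012"
for port in $PORTS; do
  # A successful connect means something is already listening on the port.
  if (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; then
    echo "port $port is already in use"
  fi
done
echo "port scan complete"
```

Any port reported as in use must be freed (or its service relocated) before installation.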
Access to Database: To ingest data from RDBMS, all the nodes in the cluster must have access to the database.
Hadoop Stack: Any one of the supported Hadoop distributions mentioned in the Product Availability Matrix.
Software: The Infoworks DF requires the software, services, and libraries listed below:
- Java: JDK 64-bit (v8)
- Hadoop Clients: Hadoop v2.7.x, Hive v1.2.x, and HBase v1.x clients
External Libraries: The following libraries must be downloaded and installed by the customers:
- Download Java Cryptography Extension (JCE) from http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html.
- Copy the jar files from this package to $JAVA_HOME/jre/lib/security/ folder on the Infoworks edge node.
- For external libraries based on your use-case, see External Client Drivers.
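The JCE copy step above can be sketched as a small helper. The jar names local_policy.jar and US_export_policy.jar are the two policy jars shipped in the JCE 8 package; the source directory shown in the example call is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: copy the unlimited-strength JCE policy jars into the JDK used by
# Infoworks on the edge node. Paths are illustrative.
install_jce_jars() {
  local src_dir="$1" java_home="$2"
  cp "$src_dir/local_policy.jar" "$src_dir/US_export_policy.jar" \
     "$java_home/jre/lib/security/"
}

# Example (assumes the JCE zip was extracted to /tmp/UnlimitedJCEPolicyJDK8):
# install_jce_jars /tmp/UnlimitedJCEPolicyJDK8 "$JAVA_HOME"
```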
Storage Requirements
- Disk Storage: The disk on which the Infoworks DF is installed (typically the /opt/infoworks directory) must have at least 10 GB of free space reserved exclusively for Infoworks DF usage; 50 GB is recommended. This does not include data storage on HDFS.
- HDFS Storage: This depends on the size of the source data being ingested.
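The 10 GB disk minimum above can be verified before installation. This sketch runs df against the parent directory of the typical install path; the path is illustrative and should be adjusted to your environment.

```shell
#!/usr/bin/env bash
# Sketch: check that the install disk meets the 10 GB minimum stated above.
required_kb=$((10 * 1024 * 1024))   # 10 GB minimum, expressed in kilobytes
install_path="/opt"                 # typical parent of /opt/infoworks
[ -d "$install_path" ] || install_path="/"
avail_kb=$(df -Pk "$install_path" | awk 'NR==2 {print $4}')
if [ "$avail_kb" -ge "$required_kb" ]; then
  echo "OK: ${avail_kb} KB free on ${install_path}"
else
  echo "INSUFFICIENT: only ${avail_kb} KB free on ${install_path}"
fi
```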
User Privileges
Ensure that the Infoworks users have the privileges to perform the following operations:
HBase Privileges
- Create table and namespace
HDFS Privileges
- Write permissions for the iw_df_workspace directory on HDFS
Hive Privileges
- Describe table (for schema crawl)
- Create/Drop Database in Hive
- Create/Drop/Alter external and internal tables
- Create partitions
- Create/Drop views
- Create/Drop temporary function
- Add jar files
Privileges for Oracle Log-Based Ingestion
- Create/Drop temp table in Oracle database
- Create view in Oracle database
Configurations
The Infoworks DF must be configured before installation to access all services it is expected to use. Each service consumed by the Infoworks DF must, in turn, listen on an interface accessible from all nodes of the Hadoop cluster. The Infoworks DF installer configures MongoDB accordingly. For the following services, the host addresses must be configured before installation, and the corresponding clients must be installed on the node where the Infoworks DF will run:
- Hadoop NameNode
- HiveServer2
- Spark Master
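The client requirement above can be verified on the node with a short check. This is a sketch: hdfs, hive, and spark-submit are the usual client entry points for these services, but the exact command names can differ by distribution.

```shell
#!/usr/bin/env bash
# Sketch: confirm the Hadoop, Hive, and Spark clients are on PATH on the
# node where the Infoworks DF will run.
missing=0
for cmd in hdfs hive spark-submit; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
    missing=$((missing + 1))
  fi
done
echo "$missing client(s) missing"
```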
Folder Structure
The Infoworks product is installed on the local filesystem in a pre-defined directory (typically /opt/infoworks). On the Infoworks server node, the following structure is created.
Main folder: /opt/infoworks
Subfolder | Content |
---|---|
apricot-meteor/ | #Infoworks UI, Job Executors and State Machine |
bin/ | #Infoworks binaries and shell scripts |
conf/ | #Configuration folder |
cube-engine/ | #Cube Engine |
df/ | #DataTransformation |
logs/ | #Logs for Infoworks Services |
lib/ | #Dependencies |
orchestrator-engine/ | #Orchestrator |
resources/ | #Third-party tools used by Infoworks (Python, Ant, Node.js) |
RestAPI/ | #Ad-Hoc Query and Scheduler servers |
temp/ | #Temporary generated files |
File and Folder Permissions on HDFS
The Infoworks DF runs under a separate user. The Infoworks DF user must meet one of the following conditions:
Target folder permissions
- All entities created using the Infoworks DF require a target location. The dedicated Infoworks DF user (or, if impersonation is enabled, the user performing the job) must have write permissions to this location.
- The Infoworks DF also requires a configurable temporary location, which likewise requires write permissions.
Ownership of the top-level directories
The top-level directories must be created by an authorized user and the ownership must be assigned to the Infoworks DF user, or Infoworks group, or the user performing the job, if impersonation is enabled.
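The ownership step above can be sketched as follows. The workspace path /iw_df_workspace comes from the HDFS privileges section; the infoworks user and group names are assumptions that vary per installation.

```shell
#!/usr/bin/env bash
# Sketch: create the top-level HDFS directory and assign ownership to the
# Infoworks DF user (run as an authorized HDFS user, e.g. via the superuser).
WORKSPACE="/iw_df_workspace"   # from the HDFS privileges section
IW_USER="infoworks"            # assumed Infoworks service user/group
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p "$WORKSPACE"
  hdfs dfs -chown "$IW_USER:$IW_USER" "$WORKSPACE"
  hdfs dfs -chmod 775 "$WORKSPACE"
else
  echo "hdfs client not found on this node" >&2
fi
```

If impersonation is enabled, assign ownership to the user performing the job instead.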
Time Synchronization
Hadoop, HBase, and by extension the Infoworks DF require all nodes in the cluster to have the Network Time Protocol (NTP) daemon installed and synchronized.
To install NTP and perform a one-time synchronization on RHEL (refer to the ntpd man pages for more details), use:
# yum install ntp
# ntpd -qg
# chkconfig ntpd on
# service ntpd start #(or restart if already running)
To perform synchronization at any time, use:
# service ntpd stop
# ntpdate -s time.nist.gov
# service ntpd start
Hardware Configurations
Hardware requirements are estimated from the expected data size and the load on the cluster. In general, the Master nodes (primary and secondary NameNode, HiveServer2, HBase Master, and Spark Master) and the Slave nodes (DataNode, HBase RegionServers, and Spark Workers) have different configuration requirements. Contact the Infoworks team to determine your precise requirements.
The recommended configurations are as follows:
Server | Configuration |
---|---|
For Masters | 16 vCPU, 64 GB RAM, 1 TB Storage Disk |
For Slaves (Datanodes) | 32 vCPU, 128 GB RAM, 4-8 TB Storage Disk |
For Infoworks Servers | 32 vCPU, 256 GB RAM, 1 TB Storage Disk |