Infoworks Service Recovery Tool

The Infoworks Service Recovery (SR) tool provides an active-passive setup for service recovery of the Infoworks product. With Infoworks installed on two different nodes and the installation details declared in configuration, the tool continuously monitors the active installation and keeps its metadata in continuous sync with the passive installation. If any service on the active node stops for any reason, the tool first tries to restart the service directly on the active node. If multiple restart attempts fail, the tool initiates failover to the passive setup, switching the roles of the two installations.

IMPORTANT

Failover implies a complete transition of operation from the active node to the passive node, not just of the failed service. Since the passive node is in continuous sync with the active node, all metadata from the active node exists on the passive node with a latency of a few seconds. However, any job that was running on the active node at the time of failover is abandoned. Such jobs can be restarted manually on the new active node.

The service recovery tool can be installed on the active node, the passive node, or on a completely different node.

Prerequisites

  • A cluster of at least three nodes must be set up.
  • All nodes must be identical.
  • The primary edge node must be set up with Infoworks DataFoundry.

For more details on setting up a three node cluster for SR compatibility, see https://docs.infoworks.io/knowledge-base/knowledge-base/how-to-setup-ha-with-three-nodes-for-infoworks

After the first-time setup of Infoworks DataFoundry on the primary edge node, perform the following:

  • Run the Filesync script.
  • Set up MongoDB service recovery.
  • Run the service recovery scripts to set up Postgres, RabbitMQ, and Platform.

Running the Filesync script

The Infoworks File Synchronisation utility performs manual synchronization of files between the active and passive hosts in the service recovery configuration.

Prerequisites

  • IW_USER must be present on remote machines (secondary nodes).
  • SSH_USER must have su permissions to IW_USER.
  • Infoworks DataFoundry must be set up on the primary edge node.
  • Infoworks home directory must be created and write permissions must be provided for IW_USER on secondary edge nodes.
  • The owner and group of the IW_HOME directory must be the same across all Infoworks DataFoundry nodes in the cluster.

Perform the steps below for file synchronisation:

  • Navigate to the <IW_HOME>/bin/infoworks-filesync-ansible folder.
  • Run the following command: ./setup-iw-filesync.sh. This copies the required files to the remote hosts using the Infoworks DataFoundry tool.
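The actual copying is done by the Infoworks tool; conceptually, the synchronization resembles an rsync from the active IW_HOME to the passive node's IW_HOME. A minimal local sketch, using hypothetical temporary directories as stand-ins for the two hosts (a real setup would target IW_USER@<passive-host> over SSH):

```shell
#!/bin/sh
# Illustrative sketch only: SRC and DST stand in for the active and
# passive IW_HOME directories (normally reached over SSH).
SRC=$(mktemp -d)
DST=$(mktemp -d)

echo "config-v1" > "$SRC/conf.properties"

# -a preserves permissions and timestamps; --delete keeps the passive
# copy identical by removing files that no longer exist on the active side.
rsync -a --delete "$SRC/" "$DST/"

cat "$DST/conf.properties"
rm -rf "$SRC" "$DST"
```

The trailing slash on the source path makes rsync copy the directory contents rather than the directory itself.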

Infoworks Metadata (MongoDB) Service Recovery

Infoworks achieves service recovery for metadata using a MongoDB replica set, which provides metadata redundancy and data availability.

Replication provides redundancy and increases data availability. With multiple copies of data on different database servers, replication provides a level of fault tolerance against the loss of a single database server.

When a primary does not communicate with the other members of the set for more than the configured electionTimeoutMillis period (10 seconds by default), an eligible secondary calls for an election to nominate itself as the new primary. The cluster attempts to complete the election of a new primary and resume normal operations.

A replica set can have up to 50 members, but only 7 voting members. If the replica set already has 7 voting members, additional members must be non-voting members.

Fault Tolerance

  No. of Members    Majority Required to Elect New Primary    Fault Tolerance
  3                 2                                         1
  4                 3                                         1
  5                 3                                         2
  6                 4                                         2
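The table values follow from the majority rule: electing a new primary requires floor(n/2)+1 votes, and fault tolerance is the member count minus that majority. A quick shell sketch of the arithmetic:

```shell
#!/bin/sh
# Majority and fault tolerance for an n-member replica set:
#   majority        = floor(n/2) + 1
#   fault_tolerance = n - majority
for n in 3 4 5 6; do
  majority=$(( n / 2 + 1 ))
  tolerance=$(( n - majority ))
  echo "$n members: majority=$majority fault_tolerance=$tolerance"
done
# prints: 3 members: majority=2 fault_tolerance=1, and so on per the table
```

Note that going from 3 to 4 members raises the required majority without improving fault tolerance, which is why odd-sized replica sets are preferred.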

Setting Up Mongo Service Recovery

Prerequisites

  • IW_USER must be present on remote machines (secondary nodes).
  • SSH_USER must have su permissions to IW_USER.
  • Infoworks DataFoundry must be set up on the primary edge node.
  • Infoworks home directory must be created and write permissions must be provided for IW_USER on secondary edge nodes.
  • Secondary edge nodes must be identical to the primary edge node and must be a part of the same cluster.
  • If the <IW_HOME>/resources/mongodb folder already exists on the secondary edge nodes, ensure the following:

All the edge nodes must have the same mongoha credentials.

The replSet key must exist in the /resources/mongodb/conf/mongodb.conf file, and the values must be identical.

The file specified as the value for the keyFile key in the mongodb.conf file must exist, and must be identical on all the nodes.

  • If the <IW_HOME>/resources/mongodb folder does not exist on the secondary edge nodes, it will be created during the Mongo service recovery setup.

NOTE: Jobs running while setting up service recovery will fail. Thus, ensure that no Infoworks DF jobs are running during the service recovery setup process.

Procedure

  • Navigate to the <IW_HOME>/bin folder.
  • Run the following command: mongo-ha-setup.sh

This performs the Pre-execution, User Creation, and Replica-Set Creation steps described below:

Pre-execution

  • A backup of the conf.properties file will be created in the <IW_HOME>/temp/conf.properties.YYYY-MM-DD-HH-MM file.
  • All Infoworks DataFoundry services will be stopped.
  • A tar file for the Mongo directory will be created in the <IW_HOME>/temp/<DATE>-pre directory. (Optional)
  • A prompt will be displayed to enter the path to the pem file and the Mongo port, which defaults to 27017.

User Creation

  • An oplogger user will be created with read access to local database.
  • An iw-ha admin user will be created, if it does not exist.

Replica-Set Creation

  • A backup tar of Mongo directory will be created in the <IW_HOME>/temp/<DATE> directory.
  • A prompt will be displayed to enter the IPs/hostnames of the other two nodes.
  • Backup tar will be copied from the primary node to the secondary nodes and Mongo will be started on the secondary nodes.
  • When Mongo is up and running on the remote nodes, the replica set is initiated and the remote nodes are added to it.

Post Execution:

  • Remove the backup created in the <IW_HOME>/temp/<DATE> folder.

Validation:

On the primary edge node, navigate to the <IW_HOME>/bin location and run the following command: ./status.sh all. The MongoDB Replica entry displays the status of the Mongo servers running in the cluster.

On successful installation, the replica set will be online. Follow the steps below to view the replica set information:

  • Navigate to the <IW_HOME>/resources/mongodb/bin folder.
  • If you do not know the MongoDB password, open the /conf/conf.properties file and find the encrypted Mongo password in the value of the metadbPassword parameter.

Execute the following command to decrypt your MongoDB password: <IW_HOME>/apricot-meteor/infoworks_python/infoworks/bin/infoworks_security.sh -decrypt -p <MONGO PASSWORD>

  • Execute the following command with the relevant parameters: ./mongo admin --host $host -u iw-ha -p $password, where $host is the primary node hostname or IP address, and $password is the MongoDB password.
  • In the Mongo prompt, execute the following command: rs.status(). This displays the replica set information.
  • Exit from the prompt using the exit command.
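The password lookup in the steps above can be scripted. A sketch, assuming a key=value style conf.properties; the file contents and the encrypted value here are illustrative stand-ins:

```shell
#!/bin/sh
# Extract the encrypted Mongo password from a key=value properties file.
# The properties file and the value below are made-up examples.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
metadbHost=localhost
metadbPassword=ENCRYPTED_SAMPLE_VALUE
metadbPort=27017
EOF

# Take everything after the first '=' on the metadbPassword line.
enc=$(grep '^metadbPassword=' "$CONF" | cut -d'=' -f2-)
echo "$enc"
# The real flow would then pass $enc to:
#   <IW_HOME>/apricot-meteor/infoworks_python/infoworks/bin/infoworks_security.sh -decrypt -p "$enc"
rm -f "$CONF"
```

Using `cut -d'=' -f2-` rather than `-f2` keeps the full value intact even if the encrypted string itself contains '=' characters.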

NOTE: If the Mongo service recovery node is in an unrecoverable state, perform the procedure mentioned in Resync Member of Replica Set.

Starting/Stopping/Monitoring Mongo Service Recovery

Ensure that the following files are available in the <IW_HOME>/bin directory before setting up Mongo HA:

NOTE: Ensure that the latest version of mongoha_start.sh and mongoha-start-stop.sh files are downloaded.

  • mongo-ha-reset.sh – used to reset the Mongo HA node to a non-SR (non-service recovery) node. The script does not impact the other remote machines.
  • mongo-ha-setup.sh, mongoha_start.sh – used to set up Mongo HA.
  • mongo-ha-start-stop.sh – used to start/stop Mongo remotely from the edge node.
  • Usage: mongo-ha-start-stop.sh {host} {stop/start}
  • mongo_start.sh, mongo_stop.sh – used to start/stop Mongo locally. Available in the <IW_HOME>/resources/mongodb/bin directory.
  • status.sh – used to monitor the status of all Infoworks services including Mongo replica. Check for the MongoDB Replica parameter.

Setting Up RabbitMQ, Platform Services, and Postgres Service Recovery

Infoworks RabbitMQ Service Recovery

RabbitMQ can now be deployed in active-active clustering mode. This ensures continuous accessibility of the Infoworks services that use RabbitMQ.

Infoworks Platform Service Recovery

A load balancer, Nginx, has been introduced in setting up service recovery for platform services. Earlier, communication with the platform services was performed directly, without a load balancer. Nginx now routes requests for the platform services across multiple instances. This ensures that the platform services are independent across multiple edge nodes.

Infoworks Postgres Service Recovery

Postgres can now be deployed in hot-standby mode with asynchronous replication. This ensures that the postgres data is continuously available on the standby host.

Prerequisites

Ensure the following:

  • IW_USER is present on remote machines (secondary nodes).
  • The remote machine is identical to the existing edge node (must be part of the cluster).
  • SSH_USER has su permissions to IW_USER.
  • Infoworks DataFoundry is set up on all edge nodes.

Procedure

  • Navigate to the <IW_HOME>/bin/infoworks-ha-ansible folder.
  • Run the following command: ./setup-iw-ha.sh
  • Set service recovery to true for RabbitMQ, Postgres, and Platform services, as required.
  • Enter the host details for the selected service recovery services.

NOTE: In the hosts file, enter the primary node IP address under the [postgres-master] section, and the IP addresses of the secondary nodes under the [postgres-standby] section.

  • Enter the ssh details for the hosts.

All Infoworks services will be stopped.

  • Enter the su password. If a password is not set, press Enter.

The required files will be copied to the remote hosts using the Infoworks tool, and the installation procedure starts. After successful installation, service recovery is set up for the selected services and the Infoworks services start.

Infoworks Service Recovery Failover

If MongoDB, RabbitMQ, or Platform services fail on the primary edge node, the secondary edge node is promoted automatically; no action is required. For Postgres service recovery failover, follow the steps below:

Prerequisites

Ensure that the setup-iw-ha.sh command has already been run on your system, and that the following values are configured:

In the conf.properties file for all the nodes, ensure the following:

  • postgresha=y
  • postgres_host=<IP1,IP2,IP3>. For example, postgres_host=172.30.1.5,172.30.1.6,172.30.1.7
  • postgres_port=<Postgres Port1,Postgres Port2,Postgres Port3>. For example, postgres_port=3008,3008,3008
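These comma-separated lists can be checked programmatically; as noted in the validation steps further below, the first IP in postgres_host must be the current Postgres master. A sketch that extracts the first host from the example line above:

```shell
#!/bin/sh
# Parse the comma-separated postgres_host list from conf.properties
# and print the first entry, which must be the Postgres master.
line='postgres_host=172.30.1.5,172.30.1.6,172.30.1.7'

# Strip the key prefix, then take the first comma-separated field.
hosts=${line#postgres_host=}
master=$(echo "$hosts" | cut -d',' -f1)
echo "master: $master"   # -> master: 172.30.1.5
```

The same pattern applies to postgres_port; the nth port pairs with the nth host.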

On the primary node, which runs as the Postgres master, perform the following:

  • Navigate to the <IW_HOME>/bin folder.
  • Execute the following command: ./status.sh
  • Ensure that you get the following output: Postgres [master] RUNNING

On the secondary nodes, which run as Postgres standby nodes, perform the following:

  • Navigate to the <IW_HOME>/bin folder.
  • Execute the following command: ./status.sh
  • Ensure that you get the following output: Postgres RUNNING

Procedure

Perform the following steps on one of the cluster nodes:

  • Navigate to the <IW_HOME>/bin/infoworks-ha-ansible folder.
  • Run the following command: ./postgres-failover.sh. This runs Postgres in standalone mode.
  • Enter the ssh details for the hosts. You must specify a working standby node as the master node in the hosts file.
  • Enter the su password. If a password is not set, press Enter.
  • To specify the new Postgres master, enter y in the command prompt when prompted. Specify the IP address of the required node. If you do not want to specify a new node, the most synced node will be selected as the master node.

This processes the following:

  • Stops the Postgres server on the existing Postgres master.
  • Retrieves the replication lag from all the standby nodes.
  • Identifies the most synced standby host, if a new Postgres master is not specified by the user.
  • Promotes Postgres on the most synced standby as the master node.
  • Stops Postgres on all the nodes except the current master node.
  • Updates the conf.properties file on all the hosts.
  • Restarts the orchestrator services on all the nodes.

To validate the procedure, ensure the following:

  • Postgres runs on the new master node, and the service is displayed as postgres [master].
  • Postgres stops on all other nodes.
  • The new Postgres master must be the first IP in the conf.properties file.

Now, perform the following steps on the newly promoted master node:

  • Navigate to the <IW_HOME>/bin/infoworks-ha-ansible folder.
  • Run the following command: ./setup-iw-ha.sh
  • Set service recovery to true only for Postgres, and to false for RabbitMQ and Platform.
  • Enter the host details for the Postgres service recovery.

NOTE: Ensure that while entering host details, the newly promoted master is given under [postgres-master] section, and the old master along with the arbiter node are given under the [postgres-standby] section.

  • Enter the ssh details for the hosts.

The required files will be automatically copied to the remote hosts using the Infoworks tool, and all Infoworks services will be stopped.

  • Enter the su password. If a password is not set, press Enter.

This initiates the installation procedure. After successful installation, service recovery is set up for the selected services and the Infoworks services start.

NOTE: Ensure that while running workflows, the orchestrator services run only on the Postgres master node.

Infoworks Services

The manual steps for failover can be used when one of the following Infoworks services fails on the primary node, or during a node failure:

  • User Interface
  • Hangman
  • Governor
  • REST API
  • Query Service
  • Orchestrator
  • Cube Engine
  • Data Transformation

Prerequisites

Ensure the following:

  • IW_USER is present on remote machines (secondary nodes).
  • The remote machine is identical to the existing edge node (must be part of the cluster).
  • SSH_USER has su permissions to IW_USER.
  • The Infoworks home directory is created and is identical on all nodes of the cluster.
  • All Infoworks and Infra services are stopped on the secondary nodes.

Procedure

For all Infoworks DataFoundry versions 2.7 and above, perform the following steps:

  • Navigate to the <IW_HOME>/bin/infoworks-filesync-ansible folder.
  • Run the following command: setup-iw-filesync.sh. The required files are automatically copied to remote hosts using the Infoworks tool.
  • Restart all the Infoworks services as described below:

Stop all the Infoworks services on the primary node, such as UI, Governor, Hangman, REST API, Query Service, Scheduler, DF, Orchestrator, and Cube Engine. Then, restart all the services on the secondary node.

  • Restart Postgres, Platform, and RabbitMQ as described below:

For Postgres, if service recovery has already been set up, you must bring the service up on the secondary node. Navigate to the <IW_HOME>/bin/infoworks-ha-ansible location, and run ./postgres-failover.sh followed by ./setup-iw-ha.sh.

For Platform and RabbitMQ, if service recovery has already been set up, you must bring the service up on the secondary node. Navigate to the <IW_HOME>/bin/infoworks-ha-ansible location, and run the following command: ./setup-iw-ha.sh

If service recovery has not been set up for Postgres, Platform, and RabbitMQ, restart the services normally on the secondary node.

NOTE: Ensure that the primary node IP/hostname is provided as the topmost node in the hosts file. Also, ensure that at any point of time, only one instance of Hangman is running across all nodes.

Service Specific Instructions

REST API

Ensure that the REST API host is updated in calls:

  • All external calls to the Infoworks REST API from sources outside Infoworks must change the REST API host to {{secondary_node_ip}}.
  • Internal calls to the Infoworks REST API from sources within Infoworks will automatically point to {{secondary_node_ip}}, and hence require no action.

Validation:

Execute the following command in the secondary node: $sudo netstat -tulpn | grep 2999

Ensure that the port is listening on 0.0.0.0, and the following output appears:

tcp 0 0 0.0.0.0:2999 0.0.0.0:* LISTEN 25920/gunicorn: mas
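The bind-address check can be automated. A sketch that parses a netstat line like the expected output above (the sample line is fed in directly here; a real check would pipe from netstat):

```shell
#!/bin/sh
# Verify the REST API port is bound to all interfaces (0.0.0.0).
# The sample line mirrors the expected netstat output in the doc.
line='tcp 0 0 0.0.0.0:2999 0.0.0.0:* LISTEN 25920/gunicorn: mas'

# Field 4 of netstat -tulpn output is the local address:port.
addr=$(echo "$line" | awk '{print $4}')
case "$addr" in
  0.0.0.0:2999) echo "REST API listening on all interfaces" ;;
  *)            echo "unexpected bind address: $addr" ;;
esac
```

A bind address of 127.0.0.1:2999 instead would mean the REST API is reachable only locally and external calls to the secondary node would fail.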

DF and QueryServices

  • Ensure that JAVA_HOME is set correctly in the secondary node by running the following command: echo $JAVA_HOME.
  • If JAVA_HOME is not set, set it as an environment variable. For example, export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-0.el7_5.x86_64
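The two steps above can be combined into a small guard script. A sketch; the fallback path is the illustrative value from the example above and will likely differ on your nodes:

```shell
#!/bin/sh
# Export JAVA_HOME if it is not already set. The fallback path below
# is an illustrative example; substitute the JDK path on your node.
if [ -z "$JAVA_HOME" ]; then
  export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-0.el7_5.x86_64
fi
echo "JAVA_HOME=$JAVA_HOME"
```

Placing this in the service user's profile ensures the Tomcat-based services always start with a valid JAVA_HOME.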

Validate the Tomcat set up by performing the following steps:

  • Run the start.sh {{service_name}} command on the secondary node.
  • Run the following on the secondary node: ps auwwx | grep catalina.startup.Bootstrap

A correct setup displays several Java processes.
