Release Notes 2.8.0
Date: 23 AUG 2019
New Features and Enhancements
Component: Data Ingestion and Synchronization
- IPD-6793 - HDP 3.1 Support: Infoworks DataFoundry now supports Hortonworks Data Platform (HDP) 3.1.
NOTE: The SUPPORT_RESERVED_KEYWORDS configuration must be set to false to enable Hive connections on HDP 3.1.
- IPD-7868 - REST API Ingestion Enhancement: The REST API Ingestion process flow has been enhanced to improve the user experience. Also, support for configuration migration has been added.
- IPD-8065 - Multiline Support for SQL Server BCP Source: Multiline values are now supported in string columns for SQL Server BCP sources.
- IPD-7865 - Salesforce Ingestion Schema Synchronization: An option to enable schema synchronization has been added for Salesforce REST API ingestion.
Component: Data Transformation
- IPD-7905 - Target Data Connection Support: Snowflake support for multiple data warehouse endpoints has been added using target data connections. This feature allows users to configure target connections that can be used in any pipeline target within the domain.
NOTE: The Snowflake configuration file is no longer supported from this release. Users must now create a target connection for any Snowflake targets that are already configured.
- IPD-7976 - H2O Machine Learning Engine Support: Infoworks supports H2O as a machine learning engine for all the analytics nodes. H2O is an open-source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows users to build machine learning models on big data. For more details, see here.
- IPD-7558 - Spark Cluster Mode Support: Spark cluster mode is now supported for Spark 2.1.0 and higher versions.
Component: Cube
- IPD-7968 - Support to Configure Execution Engine for Cube: Users can now configure the execution engine for cubes using the Advanced Configuration option.
Component: Platform
- IPD-7973 - Python 3 Support: Python 2.7 has now been upgraded to Python 3.6 across the Infoworks DataFoundry platform. Users must upgrade all Python environments used internally in Infoworks DataFoundry to Python 3.6.
NOTE: Ensure that post-hook scripts, bash scripts, and any custom Python scripts used on Infoworks edge nodes are compatible with Python 3.6. A quick compatibility check is sketched after this list.
- IPD-4491 - Multiple Edge Nodes Support for Increased Job Concurrency: Infoworks DataFoundry now supports running concurrent jobs on multiple edge nodes. This allows the number of edge nodes to be scaled horizontally to support the maximum number of jobs required by the workload. For more details, see here.
- IPD-7667 - Kerberos Ticket Renewal before All Jobs: Kerberos tickets are now renewed before running all Infoworks DataFoundry jobs. The Infoworks DataFoundry platform now supports a single Kerberos principal for a Kerberized cluster. Hence, all Infoworks DataFoundry jobs run using the same Kerberos principal, which must have access to all the artifacts in Hive, Spark, and HDFS.
- IPD-7210 - Service Recovery for Infoworks DataFoundry: Infoworks DataFoundry now supports service recovery for MongoDB, Postgres, RabbitMQ, and Platform services. This eliminates single points of failure for these services on Infoworks DataFoundry. For more details, see here.
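A minimal Python 3.6 compatibility check on an edge node, assuming a custom script at an illustrative path, is to confirm the interpreter version and byte-compile the script with Python 3. This catches syntax-level issues only, not runtime differences:
python3 --version
python3 -m py_compile /path/to/custom_post_hook.py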
Component: Admin and Operations
- IPD-7864 - Support to Disable Invalid Users: Infoworks DataFoundry now supports disabling user records for which user details or valid roles are not available during user synchronization. The jobs scheduled by the disabled users will also be removed from Infoworks DataFoundry.
- IPD-7847 - Support for Case-Insensitive Username: The login username is now case-insensitive.
Bug Fixes
- IPD-8040 - Ctrl-A Column Delimiter Issue Fix: When delimited file ingestion was performed, the Ctrl-A column delimiter was incorrectly read as a string. This issue has now been fixed.
- IPD-8524 - Load Time Read from Incorrect Table: During full load for SQL Server log-based ingestion, the current database timestamp was used as the load time instead of the maximum value from the CDC table. This issue has now been fixed.
- IPD-8525 - Missing Records in SQL Server Log-based Ingestion: Records added with the same load time as the previous ingestion were not being ingested. This issue has now been fixed by including the ingest operation in the CDC query.
- IPD-8526 - Log-based Merge Issue for Golden Gate: During Oracle log-based ingestion, if a record was inserted and updated in a single transaction, either of the two records could be ingested. This issue has been fixed, and the updated record is now ingested. Also, if a record is updated multiple times in a single transaction, the record is ingested based on the priority value in the column set using the SEQUENCE_NO_COL configuration.
- IPD-8551 - Issue in Fetching Incremental Source Table Data: While fetching incremental source table data when processing pipelines, data that had already been processed in previous runs was also fetched and reprocessed when building pipelines. This led to increased build time and cluster load for merge pipeline targets, and to duplicate data for append pipeline targets. This issue has now been fixed.
Limitations
- Cubes are not supported in HDP 3.1.
- For BigQuery export in HDP 3.1, the classpath must be modified, and the following Hive configuration must be added in the Admin Configuration section: key bq_hive_conf_vars, value hive.exec.dynamic.partition.mode=nonstrict. For more details, see here.
- Netezza and Teradata exports are not supported in HDP 3.1.
Installation
Refer to the Installation documentation to install Infoworks DataFoundry 2.8.0.
Upgrading to This Release
To upgrade your current Infoworks DataFoundry version, execute the following commands on the edge node:
NOTE: Before starting the upgrade, ensure that all Infoworks services are running and no Infoworks jobs are running.
- Run the following command:
source $IW_HOME/bin/env.sh
- Navigate to the scripts directory using the following command:
cd $IW_HOME/scripts
where $IW_HOME is the directory in which Infoworks DataFoundry is installed. If the scripts folder is not available (2.4.x, 2.5.x, and 2.6.x base versions), create the scripts folder in $IW_HOME.
- Download the update script using the following command:
wget <link-to-download>
Reach out to your Infoworks support representative to get the download link, and replace <link-to-download> with that link.
- Upgrade the Infoworks DataFoundry version using the following command:
./update.sh -v <version_number>
NOTES:
- For CentOS/RHEL6, replace <version_number> with 2.8.0
- For CentOS/RHEL7, replace <version_number> with 2.8.0-rhel7
- For Azure, replace <version_number> with 2.8.0-azure
- For GCP, replace <version_number> with 2.8.0-gcp
- For EMR, replace <version_number> with 2.8.0-emr
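For example, assuming a CentOS/RHEL7 edge node, the complete upgrade sequence is as follows; <link-to-download> remains a placeholder for the link provided by Infoworks support:
source $IW_HOME/bin/env.sh
cd $IW_HOME/scripts
wget <link-to-download>
./update.sh -v 2.8.0-rhel7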
If the base version is below 2.8.0, the upgrade procedure upgrades the metadata database (MongoDB) from version 3.6 to 4.0. The metadata DB upgrade includes the following:
- updating the metadata DB binaries
- setting the feature compatibility version
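The update script performs both of these steps automatically. For reference only, setting the feature compatibility version corresponds to a mongo shell command of the following form (connection details vary per installation; do not run this manually unless directed by Infoworks support):
mongo admin --eval 'db.adminCommand({ setFeatureCompatibilityVersion: "4.0" })'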
Post-upgrade Procedure
Infoworks DataFoundry now supports HDP 3.1, in addition to HDP 2.5.5 and HDP 2.6.4, and the Python version has been upgraded to Python 3.6.9. These changes require modifications to the $IW_HOME/conf/conf.properties and $IW_HOME/conf/dispatcher.properties files.
These properties must be modified when upgrading Infoworks DataFoundry from any previous version to 2.8.0, irrespective of whether the HDP distribution is 3.1 or not.
NOTE: New installations of Infoworks DataFoundry 2.8.0 work without these modifications.
Environment: HDP
Change Request 1
- Navigate to the $IW_HOME/conf/conf.properties file.
- Remove $IW_HOME//lib/parquet-support/* from the iw_jobs_classpath key value.
- Ensure that the additional : at the end of the value is removed.
Change Request 2
- Navigate to the $IW_HOME/conf/conf.properties file.
- Replace the Hive client libraries (like /usr/hdp/current/hive-client/lib/…) in the iw_jobs_classpath key value with /usr/hdp/current/hive-client/lib/*.
- Ensure that the additional : at the end of the value is removed.
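The exact classpath differs per installation; the following hypothetical before/after lines only illustrate the combined effect of Change Requests 1 and 2 on the iw_jobs_classpath value, and the other entries shown are placeholders:
Before:
iw_jobs_classpath=$IW_HOME/lib/extras/*:$IW_HOME//lib/parquet-support/*:/usr/hdp/current/hive-client/lib/hive-exec.jar:/usr/hdp/current/hive-client/lib/hive-metastore.jar
After:
iw_jobs_classpath=$IW_HOME/lib/extras/*:/usr/hdp/current/hive-client/lib/*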
Environments: HDP, MapR, CDH, Azure, GCP, EMR
Change Request 3
- Navigate to the $IW_HOME/conf/conf.properties file.
- Remove $IW_HOME/lib/shared/* from the df_batch_classpath key value.
- Ensure that the additional : at the end of the value is removed.
Change Request 4
- Navigate to the $IW_HOME/conf/conf.properties file.
- Remove $IW_HOME/lib/shared/* from the df_tomcat_classpath key value.
- Ensure that the additional : at the end of the value is removed.
- Stop and start the transformation service using the following commands:
source $IW_HOME/bin/env.sh; $IW_HOME/bin/stop.sh df; $IW_HOME/bin/start.sh df
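As in the earlier example, the following hypothetical before/after lines only illustrate the effect of Change Requests 3 and 4; the remaining entries are placeholders:
Before:
df_batch_classpath=$IW_HOME/df/lib/*:$IW_HOME/lib/shared/*
df_tomcat_classpath=$IW_HOME/df/lib/*:$IW_HOME/lib/shared/*
After:
df_batch_classpath=$IW_HOME/df/lib/*
df_tomcat_classpath=$IW_HOME/df/lib/*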
Change Request 5
- Navigate to the $IW_HOME/conf/dispatcher.properties file.
- Replace the content with the following properties:
source.iw.job.impl.class.name=io.infoworks.awb.dispatcher.job.impl.PipelineJob
pipeline.iw.job.impl.class.name=io.infoworks.awb.dispatcher.job.impl.PipelineJob
For SQL Server log-based ingestion and OGG-based ingestion, perform Initialize and Ingest (full load ingestion) for all tables.
Release Notes 2.8.0.1
Date: 03 OCT 2019
Enhancement
Component: Data Ingestion and Synchronization
- IPD-8647 - Support to Filter Committed Transactions: A new configuration, FILTER_ORACLE_COMMITTED_TRANSACTIONS, has been added at the source level for Oracle log-based CDC ingestion. This configuration allows reading only the committed transactions from Oracle LogMiner. Set this configuration to true if the logs contain uncommitted transactions that must be filtered out. The default value is false.
NOTE: This configuration might have a slight impact on the job time and might result in out-of-memory issues in Oracle while filtering committed transactions.
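Assuming the configuration is supplied as a key/value pair like other source-level settings, enabling the filter looks like this:
FILTER_ORACLE_COMMITTED_TRANSACTIONS=true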
Bug Fixes
Component: Data Ingestion and Synchronization
- IPD-8644 - Issue in SQL Server Log-based Incremental Ingestion: During SQL Server log-based incremental ingestion, if records had the same LSN (Log Sequence Number) value in the SQL Server CDC table, only one of those records was written to the data lake, chosen arbitrarily. This issue has been fixed by adding a new audit column, ZIW_SEQVAL, which stores the order of transactions within the same commit. The column values are derived from ROW_NUMBER ordered by the SEQVAL column of the SQL Server CDC table.
For the ingested tables impacted by this issue, perform the following:
- Truncate the table
- Recrawl the metadata (the newly added ZIW_SEQVAL column will be displayed in the table metadata)
- Perform full load ingestion on the table
- Resume incremental ingestion
Component: User Interface
- IPD-8636 - Cube Selection Issue in Workflow: In a workflow, the cube selected in the Build Cube task properties was not saved, and hence the cube build job was not initiated. This issue has now been fixed.
- IPD-8690 - Issue in Segment Load: After the completion of segmented loading, the Enable Complete Loading button did not work as expected. This issue has now been fixed.
Component: Platform
- IPD-8684 - Incorrect API URL: The scheduler submit API URL was incorrect in the log rotate utility script. This issue has now been fixed.
- IPD-8676 - Platform Server Restart Issue: When platform server was restarted, the existing scheduled jobs were not being run. This issue has now been fixed.
Release Notes 2.8.0.2
Date: 05 NOV 2019
Enhancement
Component: Data Ingestion and Synchronization
- IPD-8926 - SQL Server Source Support on GCP Platform: SQL Server ingestion via JDBC is now supported on GCP Dataproc Platform.
Bug Fix
Component: Data Ingestion and Synchronization
- IPD-8920 - Segmented Load Ingestion Issue Fix: During segmented load ingestion, if the Select All option was used and if any of the segments was already loaded, the unloaded segments were skipped. This issue has now been fixed.