Generic REST API Ingestion

Many sources expose their data through REST APIs. To ingest such data, both the metadata and the data are obtained from the REST API response. Currently, only JSON responses from the REST API are supported.

Feature List

Generic REST API ingestion supports the following features:

Schema crawl
Data crawl
CDC, Append and Merge

Reference Video

The demo video of REST API Ingestion is available here.

In the Admin section (Admin > Source > New Source), create a source and select the source type as Rest-Generic. Select the Source Data Format as XML or JSON. Enter the Hive schema name and HDFS location.

Configuring Source

Click the Sources menu and select the Generic REST API source you created.
In the Source Configuration page, click the Click here to enter them link.

In the Settings page, enter the following:

Authentication Mechanism: The authentication mechanism used to connect to the REST API auth server. The options include OAuth, OAuth 2.0, Basic Authentication, and None.
Request Type: The HTTP request type method. The options include GET and POST.
OAuth URL: URL for the OAuth server for the specified client.
OAuth Token JSON Path: Path of the OAuth token in the JSON response of the OAuth URL. For details on JSON path, see JSON Path Syntax.
Secret Key: The secret key is a static value assigned to the client and is sent as part of the Authorization header. This field is displayed if the Authentication Mechanism selected is Basic Authentication.
Test Connection URL: URL used to verify the basic authentication mechanism. The test connection succeeds if the response to a request sent with the secret key and other headers is OK.
Request Content: Authentication data to be sent to the OAuth server if the Request Type selected is POST.
Request Headers: HTTP request header key-value pair.
Request Params: HTTP request parameter key-value pair.

Click Save Settings and perform a Test Connection.
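
As a rough illustration of how the OAuth Token JSON Path setting relates to the auth server response, the following sketch extracts a token from a sample JSON body and places it in an Authorization header. The response body, the access_token field name, the Bearer scheme and the helper name are all assumptions made for illustration. The dotted-path lookup covers only a small subset of JSON path syntax and is not the product's implementation.

```python
import json


def extract_by_json_path(document, json_path):
    """Resolve a simple dotted JSON path such as "$.data.access_token".

    Only plain object keys are handled here; array indexes, wildcards and
    filters from full JSON path syntax are deliberately left out.
    """
    keys = [key for key in json_path.lstrip("$").split(".") if key]
    node = document
    for key in keys:
        node = node[key]  # raises KeyError if the path does not match
    return node


# Hypothetical body returned by an OAuth server (not taken from a real API).
sample_auth_response = json.loads(
    '{"data": {"access_token": "abc123", "expires_in": 3600}}'
)

token = extract_by_json_path(sample_auth_response, "$.data.access_token")

# Assumed header format; the scheme a given API expects may differ.
request_headers = {"Authorization": "Bearer " + token}
print(request_headers)
```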

Schema Crawl

Click the Source Configuration icon.

Click Add New Table and enter the following details:

Table Name: Hive table name
Target HDFS Path: Target HDFS path relative to the source base path.
Meta URL: URL used to retrieve a metadata response. The response must have the same JSON schema as the responses from the base or CDC URL groups.
Request Headers: HTTP request header key-value pair.

To send the auth token in the request, name the header Authorization. No other auth headers are supported.

Request Params: HTTP request parameter key-value pair.
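
The Request Headers and Request Params fields are key-value pairs attached to the Meta URL request. A minimal sketch of how such a request could be assembled with the Python standard library follows. The http:// scheme, the token and the expand parameter are placeholder values, the helper name is hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, headers, params):
    """Attach query parameters and headers to a GET request (not sent)."""
    full_url = url + "?" + urlencode(params) if params else url
    return Request(full_url, headers=headers, method="GET")


# Placeholder values for illustration only.
meta_request = build_request(
    "http://localhost:8080/confluence/rest/api/space/ds/content",
    headers={"Authorization": "Bearer abc123"},
    params={"expand": "body"},
)

print(meta_request.full_url)
print(meta_request.get_header("Authorization"))
```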

Click Save and Crawl Schema.

A tree is displayed that represents the schema created by crawling the response from one of the URLs in the base URL group.

Select a path from the tree and click Extract Schema.

Enter the following details:

Table Name: Hive table name.
Target HDFS Path: Target HDFS path relative to the source base path.

Click Save. You can click Recrawl Source for Schema to recrawl the source.
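
To give a feel for the kind of schema tree a crawl produces, the sketch below walks a sample JSON response and lists every path together with the type found there. The sample document, the path notation and the function name are assumptions; the tree and type mapping produced by the actual crawler may differ.

```python
def list_schema_paths(node, path="$"):
    """Return (path, type name) pairs for every node in a JSON document.

    Arrays are described by their first element, on the assumption that
    all elements share one schema.
    """
    if isinstance(node, dict):
        paths = [(path, "object")]
        for key, value in node.items():
            paths.extend(list_schema_paths(value, path + "." + key))
        return paths
    if isinstance(node, list):
        paths = [(path, "array")]
        if node:
            paths.extend(list_schema_paths(node[0], path + "[*]"))
        return paths
    return [(path, type(node).__name__)]


# Hypothetical response body used only for illustration.
sample_response = {
    "results": [{"id": 1, "title": "Home", "version": {"number": 3}}],
    "size": 1,
}

for json_path, type_name in list_schema_paths(sample_response):
    print(json_path, type_name)
```

Selecting a path such as $.results[*] in this listing corresponds loosely to picking a node in the tree before clicking Extract Schema.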

Data Crawl Full Load

Following are the steps to perform a full load data crawl:

Click the Configure button for the table that requires a full load data crawl.
Select the Ingest Type as Full Load.
Use Meta Url as Base Url: Check this box to use the meta URL as the base URL.
Base URLs: Enter one or more URLs. Every URL must return the same JSON schema and use the same request headers and request params. Base URLs are used only for full load ingestion. For example, localhost:8080/confluence/rest/api/space/ds/content.
Pagination Mechanism: The options include Request Params, URI Path, Next URI in Response, Request Params Limit and Offset, and None.

Pagination splits a large dataset into smaller responses to keep the response time of each request low and improve the user experience. The pagination parameter must be specified in the request parameters. For example, in localhost:8080/confluence/rest/api/space/ds/content?page=1&size=10, page=1 is the pagination key and value.

Pagination Mechanism: Request Params

Page Param Key: The key that indicates the page parameter in pagination. For example, in localhost:8080/confluence/rest/api/space/ds/content?page=1&size=10, the page param key is page.
Param Initial Value: The value at which the pagination result starts. For example, in localhost:8080/confluence/rest/api/space/ds/content?page=1&size=10, the param initial value is 1.
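
The Request Params mechanism described above can be sketched as a loop that increments the page parameter from its initial value. The function name and the fixed page_count are assumptions; the condition under which a real crawl stops paging is not covered here.

```python
from urllib.parse import urlencode


def page_urls(base_url, page_param_key, initial_value, page_count, other_params=None):
    """Build one URL per page, counting up from the param initial value."""
    urls = []
    for page in range(initial_value, initial_value + page_count):
        params = {page_param_key: page}
        params.update(other_params or {})
        urls.append(base_url + "?" + urlencode(params))
    return urls


for url in page_urls(
    "localhost:8080/confluence/rest/api/space/ds/content",
    page_param_key="page",
    initial_value=1,
    page_count=3,
    other_params={"size": 10},
):
    print(url)
```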

Pagination Mechanism: URI Path

Path Param Key: The key that indicates the page parameter in pagination. For example, in localhost:8080/confluence/rest/api/space/ds/content/page/1, the path param key is page.
Param Initial Value: The value at which the pagination result starts. For example, in localhost:8080/confluence/rest/api/space/ds/content/page/1, the param initial value is 1.

NOTE: Currently only numeric values are supported for page values.
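
For the URI Path mechanism, the page number is part of the path instead of the query string. A minimal sketch, assuming the path param key and the page value are appended as the last two path segments as in the example above; the function name and page_count are hypothetical.

```python
def path_page_urls(base_url, path_param_key, initial_value, page_count):
    """Build one URL per page with the page number as a path segment."""
    if not isinstance(initial_value, int):
        # Mirrors the note above: only numeric page values are supported.
        raise ValueError("page values must be numeric")
    prefix = base_url.rstrip("/") + "/" + path_param_key + "/"
    return [prefix + str(page) for page in range(initial_value, initial_value + page_count)]


for url in path_page_urls(
    "localhost:8080/confluence/rest/api/space/ds/content", "page", 1, 3
):
    print(url)
```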

Pagination Mechanism: Next URI in Response

Next URL JSON Path: JSON path for the next URL in the current URL response. This field is displayed when the pagination mechanism selected is Next URI in Response.
Base Group URL Prefix: The next URL in the current URL response can be full or partial. For partial URL, enter the static prefix. This field is displayed when the pagination mechanism selected is Next URI in Response.
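
The Next URI in Response mechanism can be pictured as following a link from each response to the next one. The sketch below uses canned responses instead of a real API, a top-level key in place of a full Next URL JSON Path, and a hypothetical function name. Treating any value that starts with http as a full URL is also an assumption.

```python
def crawl_with_next_uri(first_url, fetch_json, next_url_key, url_prefix=""):
    """Follow next-page links until a response has none.

    fetch_json is any callable that returns the parsed JSON body for a URL.
    next_url_key is a top-level key standing in for the Next URL JSON Path.
    """
    pages = []
    seen = set()
    url = first_url
    while url and url not in seen:  # the seen set guards against link loops
        seen.add(url)
        body = fetch_json(url)
        pages.append(body)
        next_url = body.get(next_url_key)
        if not next_url:
            break
        # A partial next URL is completed with the static prefix.
        url = next_url if next_url.startswith("http") else url_prefix + next_url
    return pages


# Canned responses standing in for a real API (hypothetical data).
canned_responses = {
    "http://localhost:8080/rest/api/content": {
        "items": [1, 2],
        "next": "/rest/api/content?start=2",
    },
    "http://localhost:8080/rest/api/content?start=2": {"items": [3]},
}

pages = crawl_with_next_uri(
    "http://localhost:8080/rest/api/content",
    fetch_json=canned_responses.__getitem__,
    next_url_key="next",
    url_prefix="http://localhost:8080",
)
print([page["items"] for page in pages])
```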

Pagination Mechanism: Request Params Limit and Offset

Limit Param Key: The key that indicates the limit parameter. For example, in …/defects?offset=25&limit=25, the limit param key is limit.
Limit Param Initial Value: The limit value for the number of records (starting from the offset value) to be displayed in the response. For example, in …/defects?offset=25&limit=25, the limit param initial value is 25.
Offset Param Key: The key that indicates the offset parameter. For example, in …/defects?offset=25&limit=25, the offset param key is offset.
Offset Param Value: The offset value from which the records will be displayed in the response. For example, in …/defects?offset=25&limit=25, the offset param initial value is 25.

Enter the other required values and click Save Configuration. For descriptions of fields, see Source Table Configuration Field Descriptions.
Click the Table Groups tab and add a table group.
Click the View Table Group icon for the table group.
For first-time ingestion or for a clean crawl, click Initialize and Ingest Now.
To append new data to the crawled source, click Ingest Now from the second crawl onwards. Only the new and changed data will be picked.
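
The Request Params Limit and Offset mechanism described earlier in this section can be sketched as repeated requests in which the offset advances by the limit. The stop condition (a page shorter than the limit), the function names and the stand-in data are assumptions for illustration, not documented behavior.

```python
from urllib.parse import urlencode


def crawl_with_limit_offset(fetch_records, limit, offset=0):
    """Request pages of `limit` records until a short or empty page arrives.

    fetch_records(offset, limit) is any callable returning a list of records.
    """
    records = []
    while True:
        page = fetch_records(offset, limit)
        records.extend(page)
        if len(page) < limit:
            break
        offset += limit
    return records


# Query string for one request, using the keys from the field examples.
print(urlencode({"offset": 25, "limit": 25}))

# Stand-in data source with 60 hypothetical records.
all_defects = list(range(60))


def fake_fetch(offset, limit):
    return all_defects[offset:offset + limit]


print(len(crawl_with_limit_offset(fake_fetch, limit=25, offset=0)))
print(len(crawl_with_limit_offset(fake_fetch, limit=25, offset=25)))
```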

Data Crawl Incremental Load

Following are the steps to perform an incremental load data crawl:

Click the Configure button for the table that requires an incremental load data crawl.
Select the required incremental load Ingest Type.
Check the Incremental Append Mode option to perform incremental ingestion.

Use Meta Url as Base Url: Check this box to use the meta URL as the base URL.
Base URLs: Enter one or more URLs. Every URL must return the same JSON schema and use the same request headers and request params. Base URLs are used only for full load ingestion. For example, localhost:8080/confluence/rest/api/space/ds/content.
Pagination Mechanism: The options include Request Params, URI Path, Next URI in Response, Request Params Limit and Offset, and None.

Pagination Mechanism: Request Params

Page Param Key: The key that indicates the page parameter in pagination. For example, in localhost:8080/confluence/rest/api/space/ds/content?page=1&size=10, the page param key is page.
Param Initial Value: The value at which the pagination result starts. For example, in localhost:8080/confluence/rest/api/space/ds/content?page=1&size=10, the param initial value is 1.

Pagination Mechanism: URI Path

Path Param Key: The key that indicates the page parameter in pagination. For example, in localhost:8080/confluence/rest/api/space/ds/content/page/1, the path param key is page.
Param Initial Value: The value at which the pagination result starts. For example, in localhost:8080/confluence/rest/api/space/ds/content/page/1, the param initial value is 1.

NOTE: Currently only numeric values are supported for page values.

Pagination Mechanism: Next URI in Response

Next URL JSON Path: JSON path for the next URL in the current URL response. This field is displayed when the pagination mechanism selected is Next URI in Response.
Base Group URL Prefix: The next URL in the current URL response can be full or partial. For partial URL, enter the static prefix. This field is displayed when the pagination mechanism selected is Next URI in Response.

Pagination Mechanism: None

URL Groups for CDCs: Collection of URLs having the same JSON schema, request headers and request params. Response will be obtained from each URL and will be ingested. Click Add URL group to enter the URL groups. This option is displayed only when the Incremental Append Mode option is enabled.
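
A URL group can be thought of as a list of URLs whose responses are read one after another and combined. The following sketch assumes that each response keeps its records under a results key and uses canned responses in place of real requests; the URLs, the key and the function name are hypothetical.

```python
def ingest_url_group(urls, fetch_json, records_key):
    """Fetch every URL in the group and collect the records from each response."""
    records = []
    for url in urls:
        body = fetch_json(url)
        records.extend(body[records_key])
    return records


# Canned responses with the same JSON schema (hypothetical data).
canned_group_responses = {
    "http://localhost:8080/rest/api/space/ds/content": {"results": [{"id": 1}]},
    "http://localhost:8080/rest/api/space/dev/content": {"results": [{"id": 2}, {"id": 3}]},
}

cdc_records = ingest_url_group(
    list(canned_group_responses),
    fetch_json=canned_group_responses.__getitem__,
    records_key="results",
)
print(cdc_records)
```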

Pagination Mechanism: Request Params Limit and Offset

Limit Param Key: The key that indicates the limit parameter. For example, in …/defects?offset=25&limit=25, the limit param key is limit.
Limit Param Initial Value: The limit value for the number of records (starting from the offset value) to be displayed in the response. For example, in …/defects?offset=25&limit=25, the limit param initial value is 25.
Offset Param Key: The key that indicates the offset parameter. For example, in …/defects?offset=25&limit=25, the offset param key is offset.
Offset Param Value: The offset value from which the records will be displayed in the response. For example, in …/defects?offset=25&limit=25, the offset param initial value is 25.

Enter the other required values and click Save Configuration. For descriptions of fields, see Source Table Configuration Field Descriptions.
Click the Table Groups tab and add a table group.
Click the View Table Group icon for the table group.
For first-time ingestion or for a clean crawl, click Initialize and Ingest Now.
To append new data to the crawled source, click Ingest Now from the second crawl onwards. Only the new and changed data will be picked.
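
The difference between appending and merging incremental data can be illustrated with plain Python lists. This is a conceptual sketch only: the id key column, the record layout and the function names are assumptions, and the product's own append and merge logic is not shown here.

```python
def append_records(existing, incoming):
    """Append: incoming rows are added after the existing rows."""
    return existing + incoming


def merge_records(existing, incoming, key):
    """Merge: incoming rows replace existing rows that share the same key."""
    merged = {row[key]: row for row in existing}
    for row in incoming:
        merged[row[key]] = row
    return list(merged.values())


# Hypothetical rows from a first crawl and a later incremental crawl.
first_crawl = [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}]
second_crawl = [{"id": 2, "title": "B"}, {"id": 3, "title": "c"}]

print(append_records(first_crawl, second_crawl))
print(merge_records(first_crawl, second_crawl, key="id"))
```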

Configuration Migration

The following occurs when a Generic REST API source is imported:

For existing tables, only the Pagination Mechanism parameters are migrated.
For new tables, the following parameters are migrated: Meta URL, Request Headers, Request Params, Base URL, Pagination Mechanism, and URL Groups for CDC.
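
The two migration rules above can be expressed as a small filter over a table's configuration. The dictionary keys below are hypothetical names chosen for this sketch and do not reflect the actual configuration file format.

```python
# Hypothetical configuration keys; the real export format may differ.
PAGINATION_KEYS = {"pagination_mechanism", "page_param_key", "param_initial_value"}
NEW_TABLE_KEYS = PAGINATION_KEYS | {
    "meta_url",
    "request_headers",
    "request_params",
    "base_url",
    "url_groups_for_cdc",
}


def migrate_table(imported, existing=None):
    """Apply the import rules to one table's configuration."""
    if existing is None:
        # New table: take every parameter that is eligible for migration.
        return {k: v for k, v in imported.items() if k in NEW_TABLE_KEYS}
    # Existing table: keep its settings and overwrite only pagination ones.
    migrated = dict(existing)
    migrated.update({k: v for k, v in imported.items() if k in PAGINATION_KEYS})
    return migrated


imported_config = {
    "meta_url": "http://localhost:8080/meta",
    "base_url": "http://localhost:8080/data",
    "pagination_mechanism": "URI Path",
    "table_name": "content",
}
existing_config = {"meta_url": "http://localhost:8080/old", "pagination_mechanism": "None"}

print(migrate_table(imported_config))
print(migrate_table(imported_config, existing_config))
```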

Limitation

Generic REST API sources are currently not supported on CDH.
