Hive OCF Connector: Install and Configure

Alation Cloud Service Applies to Alation Cloud Service instances of Alation

Customer Managed Applies to customer-managed instances of Alation

Network Connectivity

Open the ports that apply to your Hive configuration:

  • Port 10000 to the Hive server

  • Port 9083 to the Hive metastore server

  • Port 50070 or 9870 to the WebHDFS server

    • The default WebHDFS port depends on the Hadoop version and distribution. For Hadoop 2.x, the default WebHDFS port is 50070. For Hadoop 3.x, in some of the distributions, the default WebHDFS port may be 9870. For example, on CDH, the default WebHDFS port may be 9870.

    • Port 14000 to the HttpFS server, if using HttpFS.

    • Port 1006 on datanodes. WebHDFS requests are redirected to this port.

If using Kerberos authentication:

  • TCP/UDP port 88 to all Kerberos servers involved in the authentication chain.

If using Apache Knox:

  • Port 8443 for Knox connections.

Service Account

We recommend creating a Hive service account for Alation before adding the Hive data source to the catalog. The service account requires permissions to access schemas and tables in Hive for metadata extraction (MDE), data sampling and profiling, and access to the query log directory on HDFS for query log ingestion (QLI).

Permissions for Metadata Extraction

Basic authentication

No requirements

Kerberized Hive instance

SELECT on schemas, tables, and columns to be extracted.

We recommend creating the service account in the same realm as the one used by Hive users.

Permission for Sampling and Profiling

Basic authentication

No requirements

Kerberized Hive instance

Depending on your Hive authorization configuration, grant read or SELECT:

  • Storage Based Access Control Lists—Service account needs read access to all DB folders/files that need to be sampled.

  • SQL Based Access Control Lists—Grant SELECT on DBs and tables that need to be sampled.

Permissions for Query Log Ingestion

The service account needs the read permission for the job history logs directory on HDFS.

Hive Client Configuration Files

When configuring your OCF Hive data source in Alation, you will need to upload Hive configuration files as a .tgz archive. Hive client configuration files enable metadata extraction and query log ingestion. Alation will use these files to connect to Hive over the APIs and perform metadata extraction (MDE) and query log ingestion (QLI).

This section describes how to get client configuration files for Hive from several Hive management systems. In case of kerberized Hive and connection over SSL, you will need to place additional files into the configuration archive.

Adding Kerberos Configuration File

If your Hive is kerberized, you will need to add the krb5.conf file to the Hive client configurations archive. The default_realm should be set to the realm that has the user principal which you are planning to use as the service account in Alation.

Example of the krb5.conf file

[libdefaults]
default_realm = TEST.ALATIONDATA.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
forwardable = true

[realms]
TEST.ALATIONDATA.COM = {
kdc = 10.11.21.205
admin_server = 10.11.21.205
}

[domain_realm]
.alationdata.com = TEST.ALATIONDATA.COM
alationdata.com = TEST.ALATIONDATA.COM

Adding ssl-client.xml and Keystore Files to Client Configurations Archive

If connecting over SSL, you will need to add the .jks keystore file specified in the ssl.client.truststore.location property of the ssl-client.xml file to your client configurations archive.

The ssl-client.xml is usually found in the /etc/hadoop/conf directory on HDFS. The location of the .jks file is usually specified in the property ssl.client.truststore.location.

  • The ssl.client.truststore.location property often contains an absolute path on the filesystem; however, Alation isolates execution of its Hive engine: we only pay attention to the filename of the truststore.

    Example

    Given a location of /etc/security/awesomeCorp.jks for the ssl.client.truststore.location property, Alation will look for, and use, the first file named awesomeCorp.jks that was uploaded with the client configuration archive.

  • Ensure that the password specified for the ssl.client.truststore.password property is the correct password for the file listed in the ssl.client.truststore.location property.

Example of ssl-client.xml

<configuration>
  <property>
      <name>ssl.client.truststore.location</name>
      <value>truststore.jks</value>
  </property>
  <property>
      <name>ssl.client.truststore.password</name>
      <value>changeit</value>
  </property>
  <property>
      <name>ssl.client.truststore.type</name>
      <value>jks</value>
  </property>
</configuration>

Knox SSL

The ssl-client.xml is usually found in the /etc/hadoop/conf directory. The location of the .jks file usually is /var/lib/knox/data-2.6.5.0-292/security/keystores in the Hadoop cluster.

Example of ssl-client.xml for Knox SSL

<configuration>
  <property>
      <name>ssl.client.truststore.location</name>
      <value>gateway.jks</value>
  </property>
  <property>
      <name>ssl.client.truststore.password</name>
      <value>admin</value>
  </property>
  <property>
      <name>ssl.client.truststore.type</name>
      <value>jks</value>
  </property>
</configuration>

Adding the aha-conf.xml file

This step applies only if Alation supports QLI for your Hive distribution.

Create and save an additional file named aha-conf.xml.

By default, the OCF Hive connector returns URL-encoded queries. To get the query format suitable for ingestion, you will need to include the aha-conf.xml file into the configuration tarball for your data source.

Use the following example to create the file:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

    <name>alation.qli.check-url-decode</name>

    <description>Some distributions of hive will arbitrarily url-encode queries. This will break QLI.
    Setting this to true will add an extra preprocessing step to check each query before ingesting it,
    decoding as needed. Enabling this may increase QLI time.
    </description>

    <value>true</value>

</property>

</configuration>

Prepare the Client Configurations Archive

CDH, HDP, and EMR

To prepare the Hive client configurations archive:

  1. SSH to your cluster and form the tarball file with the Hive client configurations.

    Example

    mkdir hive_configs
    cp -r /etc/hive/conf/* hive_configs
    cp -r /etc/hadoop/conf/* hive_configs
    
  2. If applicable to your Hive deployment, add:

    • The krb5.conf file if using Kerberos authentication

    • The keytab file for the service account you are going to use in Alation

    • The .jks certificate file if using SSL

    • The aha-conf.xml file if your Hive deployment supports QLI (see Adding the aha-conf.xml file).

    Example

    cp /etc/krb5.conf hive_configs
    
  3. Form the archive.

    tar -czf hive_configs.tgz hive_configs
    
  4. Copy the client configuration archive to your machine. This file will need to be uploaded to Alation on your Hive data source Settings page.

HDInsight (Apache Ambari)

  1. On your Hadoop manager dashboard, in the left-hand menu, click Actions and Download All Client Configs:

    ../../../_images/DS_Hive01.png
  2. Un-package your download to a directory.

  3. If applicable to your Hive configuration, add the keytab file to the directory.

    Note

    You do not need to add the krb5.conf file because it is already in the archive you downloaded.

  4. If using SSL, make sure the ssl-client.xml file is present and add the .jks keystore file.

  5. Open the core-site.xml for editing and make the following changes:

    5.1 Replace the encrypted account key present in the fs.azure.account.key.<YOUR_STORAGE_ACCOUNT_NAME>.blob.core.windows.net property with the unencrypted key value for your storage account <YOUR_STORAGE_ACCOUNT_NAME>.

    5.2 Comment out the following properties to disable encryption and decryption of the account key:

    • fs.azure.account.keyprovider.YOUR_STORAGE_ACCOUNT_NAME.blob.core.windows.net

    • fs.azure.shellkeyprovider.script

      Example:

    ../../../_images/OCF_Hive_core_site.png
  6. Create a .tgz archive from this directory.

  7. Copy the client configuration archive to your machine. This file will need to be uploaded to Alation on your Hive data source Settings page.

MapR

MapR configurations are spread out and there is no way to download them using a Hive manager UI.

To get Hive client configuration files from MapR:

  1. SSH to your manager instance and form the configuration tarball file.

    Example

    mkdir client_configs
    cp -r /opt/mapr/conf/ client_configs
    cp -r /opt/mapr/hive/hive-2.3/conf/ client_configs
    cp -r /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/ client_configs
    
  2. If applicable, add:

    • The krb5.conf file for Kerberos authentication

    • The keytab file for the service account

    • The ssl-client.xml file and the .jks file

    • The aha-conf.xml if your Hive deployment supports QLI (see Adding the aha-conf.xml file).

  3. Copy the client configuration tarball (client_configs.tgz in this example) to your machine. This file will need to be uploaded to Alation on your Hive data source Settings page.

Prepare Kerberos Configuration File and Certificate

Apart from the configuration tarball, make sure you have the krb5.conf file, the keytab file, and the SSL certificate on your machine. You will need to upload them separately and in addition to the client configurations tarball. For sampling and profiling, Alation uses the Hive JDBC driver and passes the Kerberos configuration file, keytab, and the certificate in the JDBC connection parameters.

JDBC URI

Basic Authentication

Format

hive2://<hostname_or_ip>:<port>

Example

hive2://10.13.80.155:10000

Basic Authentication + SSL

hive2://10.13.80.155:10000/default;ssl=true;

Kerberos Authentication + SSL

Format

hive2://<hostname_or_ip>:<port>/default;principal=<principal>;ssl=true;

Example

hive2://10.13.80.155:10000/default;principal=hive/[email protected];ssl=true;

Using ZooKeeper

Format

hive2://<hostname_or_ip>:<port>/;principal=<principal>;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=<namespace>;ssl=true;

Example

hive2://ip-10-13-10-125.alation-test.com:2181/;principal=hive/[email protected];serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;ssl=true;

Using Knox

Knox is supported, replacing the Hive server URI with the Knox Server URI in the data source settings. If using Apache Knox, you will need to provide the Knox WebHDFS URI in the JDBC URI field. Documentation on this URI, as well as on how Knox routes gateways, topologies, and clusters using this URI can be found in the Knox book.

Configuration in Alation

Step 1: Install the Connector

Alation On-Premise

Important

Installation of OCF connectors requires Alation Connector Manager to be installed as a prerequisite.

To install an OCF connector:

  1. If this has not been done on your instance, install the Alation Connector Manager: Install Alation Connector Manager.

  2. Ensure that the OCF connector Zip file is available on your local machine.

  3. Install the connector on the Connectors Dashboard page using the steps in Manage Connectors.

Alation Cloud Service

Note

On Alation Cloud Service instances, Alation Connector Manager is available by default.

Depending on your network configuration, you may need to use Alation Agent to connect to databases.

Connection via Alation Agent
  1. Ensure that Alation Agent is enabled on your Alation instance. If necessary, create a Support ticket with Alation for an Alation representative to enable the Alation Agent feature on your instance and to receive the Alation Agent installer.

  2. Install the Alation Agent.

  3. Install the connector on the Connectors Dashboard page using the steps in Manage Connectors.

Connection Without Agent

To install an OCF connector:

  1. Ensure that the OCF connector Zip file is available on your local machine.

  2. Install the connector on the Connectors Dashboard page using the steps in Manage Connectors.

STEP 2: Create and Configure a New Data Source

In Alation, add a new data source:

  1. Log in to Alation as a Server Admin.

  2. Expand the Apps menu on the right of the main toolbar and select Sources.

  3. On the Sources page, click +Add on the top right of the page and in the list that opens, click Data Source. This will open the Add a Data Source wizard.

  4. On the first screen of the wizard, specify a name for your data source, assign additional Data Source Admins, if necessary, and click the Continue Setup button on the bottom. The Add a Data Source screen will open.

  5. On the Add a Data Source screen, the only field you should populate is Database Type. From the Database Type dropdown, select the connector name. After that you will be navigated to the Settings page of your new data source.

    Note

    Agent-based connectors will have the Agent name appended to the connector name.

The name of this connector is Hive OCF connector.

Access

On the Access tab, set the data source visibility using these options:

  • Public Data Source—The data source will be visible to all users of the catalog.

  • Private Data Source—The data source will be visible to the users allowed access to the data source by Data Source Admins.

You can add new Data Source Admin users in the Data Source Admins section.

General Settings

Note

This section describes configuring settings for credentials and connection information stored in the Alation database. If your organization has configured Azure KeyVault or AWS Secrets Manager to hold such information, the user interface for the General Settings page will change to include the following icons to the right of most options:

../../../_images/VaultOrDB.png

By default, the database icon is selected, as shown. In the vault case, instead of the actual credential information, you enter the ID of the secret. See Configure Secrets for OCF Connector Settings for details.

Perform the configuration on the General Settings tab.

Application Settings

Specify Application Settings if applicable. Click Save to save the changes after providing the information.

Parameter

Description

BI Connection Info

This parameter is used to generate lineage between the current data source and another source in the catalog, for example a BI source that retrieves data from the underlying database. The parameter accepts host and port information of the corresponding BI data source connection.

Use the following format: host:port

You can provide multiple values as a comma-separated list:

10.13.71.216:1541,sever.com:1542

Find more details in BI Connection Info.

Disable Automatic Lineage Generation

Select this checkbox to disable automatic lineage generation from QLI, MDE, and Compose queries. By default, automatic lineage generation is enabled.

Connector Settings

Data Source Connection

Populate the data source connection information and save the values by clicking Save in this section.

Parameter

Description

JDBC URI

Specify the JDBC URI in the required format.

Username

Specify the service account username.

Password

Specify the service account password.

Hive version

Select your Hive version.

Hadoop version

Select your Hadoop version.

Hive Configuration

Use the upload link under Hive Configuration to upload your Hive client configurations archive.

Kerberos Configuration

Provide the Kerberos information and save the values by clicking Save in this section.

Parameter

Description

Enable Kerberos Authentication

Enable or disable Kerberos authentication by selecting or clearing the Enable Kerberos Authentication checkbox. If the Kerberos Authentication checkbox is enabled upload the krb5.conf file using the upload link under the checkbox.

Use keytab

To enable Kerberos with keytab authentication, select the Use Keytab checkbox and upload the keytab file for the service account using the upload link under the checkbox.

SSL Configuration

Provide the SSL information and save the values by clicking Save in this section.

Note

Ensure that you are uploading the certificate in the correct format. See Adding ssl-client.xml and Keystore Files to Client Configurations Archive for details.

Parameter

Description

Enable SSL

Enable SSL authentication by selecting the Enable SSL checkbox. If the Enable SSL checkbox is enabled, upload the SSL certificate using the upload link below.

Truststore password

Specify the password for the SSL certificate.

The password will be deleted if the data source connection is deleted.

Performance Configuration

Use these settings to enhance the performance of MDE and QLI. Save the values you specify by clicking Save.

Parameter

Description

Thread Count

By default, QLI runs on a single thread. When the number of QLI files is large, ingestion may take a long time as it reads each file. In such cases, you can increase the thread count, so that the connector can process several files in parallel, reducing the QLI time.

When the value is set to a specific number, Alation will create this number of threads for QLI.

Timeout

The OCF connector calls the standalone executable (AHA) for QLI and MDE. Communication between the connector and AHA happens through gRPC. If AHA does not respond within the timeout configured using this parameter, in seconds, the connector will mark the MDE or QLI job as failed. The default timeout is 120 seconds.

We recommend using this parameter for debugging only. You should not have the need to change it in a normal case.

Logging Configuration

Select the logging level for the connector logs and save the values by clicking Save in this section. The available log levels are based on the Log4j framework.

Parameter

Description

Log level

Select the log level to generate logs. The available options are INFO, DEBUG, WARN, TRACE, ERROR, FATAL, ALL.

You can view the connector logs in Admin Settings > Manage Connectors > Hive OCF connector.

Obfuscate Literals

Obfuscate Literals—Enable this toggle to hide actual values in the query statements that are ingested during query log ingestion or executed in Compose. This toggle is disabled by default.

Test Connection

Under Test Connection, click Test to validate network connectivity.

If the connection test fails, a detailed error message will be available in connector logs. Below are some known errors that could be encountered during Test Connection:

Error Message

Solution

No subject alternative names matching IP address <IP_ADDRESS> found

Make sure the IP address added in JDBC URI matches with the SSL certificate. If not, change the JDBC URI IP address.

UnknownHostException

This error is specific to Azure HDInsight. Change the primary hostname to the IP address in the hive-site.xml file or make the host accessible to the Alation instance.

Return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Permission denied. javax.security.auth.login.LoginException: Message stream modified (41)

Change the hostname to ip_address in krb5.conf. Comment out the timeout if it exists in krb5.conf.

Could not connect to hostname on port 10000

Hostname is incorrect. Check the hostname of the Hive instance and replace it in the JDBC URI.

Bad URL format. Hostname not found in the authority part of the url: 10.13.54.57:10000;principal=mapr. Are you missing a ‘/’ after the hostname?

Make sure the JDBC URI is the correct format. See JDBC URI.

return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

Using the Alation service account, run a query in Hive Beeline. It will give you the actual error. If it is a permissions error, grant sufficient permissions. If the connection fails even after permissions are granted, add the following JDBC parameter to the URI: hive.execution.engine=tez;

Example: hive2://ip-10-13-54-231.alation-test.com:8443/;hive.execution.engine=tez;ssl=true;transportMode=http;httpPath=gateway/default/hive

Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask

Using the Alation service account, run a query in Hive Beeline. It will give you the actual error. If it is a permissions error, grant sufficient permissions. If the connection fails even after permissions are granted, add the following JDBC parameter to the URI: hive.execution.engine=tez;

Example: hive2://ip-10-13-54-231.alation-test.com:8443/;hive.execution.engine=tez;ssl=true;transportMode=http;httpPath=gateway/default/hive

Metadata Extraction

You can configure metadata extraction (MDE) for an OCF data source on the Metadata Extraction tab of the Settings page. Refer to Configure Metadata Extraction for OCF Data Sources for information about the available configuration options.

Sampling and Profiling

Sampling and profiling is supported. For details, see Configure Sampling and Profiling for OCF Data Sources.

Compose

For details about configuring the Compose tab of the Settings, refer to Configure Compose for OCF Data Sources.

Query Log Ingestion

On the Query Log Ingestion tab of the Settings page, you can perform QLI manually on demand or schedule it. The QLI pipeline goes through the Hive API and uses the Hive client configuration files uploaded on the General Settings tab. No additional configuration is required for QLI on the Query Log Ingestion tab.

Automated and Manual QLI

You can either perform QLI manually on demand or enable automated QLI:

  1. To perform manual QLI, under the Automated and Manual Query Log Ingestion section of the Query Log Ingestion tab, ensure that the Enable Automated Query Log Ingestion toggle is disabled.

    Note

    Metadata extraction must be completed first before running QLI.

  2. Click Preview to get a sample of the query history data to be ingested.

  3. Click the Import button to perform QLI on demand.

  4. To schedule QLI, enable the Enable Automated Query Log Ingestion toggle.

  5. Set a schedule under Automated Query Log Ingestion Time by specifying values in the week, day, and time fields. The next QLI job will run on the schedule you have specified.

    Note

    The hourly schedule for automated QLI is not supported.

Troubleshooting

Refer to Troubleshooting for information about logs.