
Thales Data Platform

INSTALLATION GUIDE
Document Information

Product Version: 3.1.5

Document Number: 007-000726-001, Rev C

Release Date: 17 February 2021

Trademarks
All intellectual property is protected by copyright. All trademarks and product names used or referred to are the
copyright of their respective owners. No part of this document may be reproduced, stored in a retrieval system
or transmitted in any form or by any means, electronic, mechanical, chemical, photocopy, recording or
otherwise without the prior written permission of Thales.

Disclaimer
Thales Data Platform (TDP) is based on Hadoop, an open source software component.
Hadoop is packaged with TDP for your convenience but is provided on an "AS-IS" basis. Thales will only
provide support for TDP covering TDP cluster installation, configuration and set up of DDC. Thales
expressly excludes any support for, including but not limited to, optimization, network configuration, hardware
configuration or any other Hadoop administration not related to DDC initial setup and such shall be the sole and
exclusive responsibility of Licensee.
Each free software or open source component provided with TDP is the copyright of its respective copyright
owner. The incorporated open source projects are released primarily under the Apache Software License 2.0
(“ASLv2”, full description at https://apache.org/licenses/LICENSE-2.0) whereas other software included may
be released under the terms of alternative ASLv2 compatible open source licenses. Please review the license
and notice files accompanying the software for additional licensing information.
Thales makes no representations or warranties with respect to the contents of this document and specifically
disclaims any implied warranties of merchantability or fitness for any particular purpose. Furthermore, Thales
reserves the right to revise this publication and to make changes from time to time in the content hereof without
the obligation upon Thales to notify any person or organization of any such revisions or changes.
We have attempted to make these documents complete, accurate, and useful, but we cannot guarantee them
to be perfect. When we discover errors or omissions, or they are brought to our attention, we endeavor to
correct them in succeeding releases of the product.
Thales invites constructive comments on the contents of this document. Send your comments, together with
your personal and/or company details to the address below.

Mail: Thales
4690 Millennium Drive
Belcamp, Maryland 21017
USA

Email: technical.support.DIS@thalesgroup.com

CONTENTS

Document Information

Setting Up the Ambari Server

1. The Image
2. Initial Login Credentials
3. Set the Hostname
3.1 Optional - Configure Networking
4. Configure Ambari
5. Steps to Set up Additional Nodes in the Hadoop Cluster
6. Browse the Ambari Server
7. Hadoop Services Installation Through Ambari
Name the Cluster
Select Version
Install Options
Choose Services
Assign Masters
Assign Slaves and Clients
Customize Services
8. Namenode HA Configuration
9. Configuring Knox
Modifying Embedded LDAP
Knox Authentication Information in Ambari
10. Configuring Phoenix Query Server (HBase)
11. Creating DDC Directory Under HDFS
Creating the DDC Directory Using the Command Line
Managing the DDC Directory Using the User Interface
12. Creating the PQS Schema
13. Updating and Exporting the Knox Server Certificate
14. Changing the Knox Log Level
15. Configuring CipherTrust Manager

Upgrading the Thales Data Platform

Preparing for the Upgrade
Registering and Installing the Target Version on the Ambari-Server Node
Performing the Upgrade
References

Launching the Amazon Instance

Installing the Instance
Logging in to the Instance and Additional Configuration

Securing Hadoop Configuration

Hadoop Backup
Preparing for the Backup
Backup Options
Create HDFS Backup
Restore HDFS Backup
Create and Restore HBase Backup
Snapshots
How to use the HBase snapshot utility
Creating a Backup
Restoring the Backup
Export/Import the Tables
How to use Export
How to use Import

Setting Up the Ambari Server
Thales Data Platform 3.1.5 (TDP) is a Big Data platform based on Hadoop technology. It can be used for POCs
and production in conjunction with DDC. TDP 3.1.5 is available as an OVA image and as an AMI.

NOTE
> When installing the TDP OVA on ESXi 6.0 you may see an error regarding the vSphere
Client not supporting the SHA256 hashing algorithm. If you see this error, please refer to
the following KB article:
https://kb.portsys.com/help/the-ovf-package-is-invalid-and-cannot-be-deployed-sha256-error
> On ESXi 6.0 you may see a warning that the configured guest (CentOS 4/5 or later) for this
virtual machine does not match the guest that is currently running (CentOS 7). This is only
a warning message. You may click the X button to dismiss it.

1. The Image
The virtual machine must have the following specifications:
> 64-bit Intel/AMD.
> OVA: VMware vSphere ESXi 6.0 or later. The OVA can be downloaded from the Thales portal.
> AMI: You will need to instantiate the AMI in your AWS. For instructions on instantiating the AMI, refer to
"Launching the Amazon Instance".
> OS type: Linux, CentOS 7 (64-bit).
> Minimum of 150 GB disk.
> At least 4 CPUs.
> At least 16 GB RAM. At least 64 GB RAM if deployed as a single node.

2. Initial Login Credentials


You should perform this step through SSH/PuTTY because a private key will be generated and you will need to
copy and paste it from the terminal to a file. You need the public hostname of the TDP instance (in VMware
or AWS) to be able to SSH to it.
The initial root credentials for the image are:
> Login: root
> Password: thales123

NOTE The user will be prompted to change the root password after the initial login.


3. Set the Hostname


To set a hostname for your Ambari server, run the command:
hostnamectl set-hostname <hostname>
In the above command, replace <hostname> with the actual hostname of your Ambari server. For
example:
hostnamectl set-hostname hadoop-server.company.com

NOTE This hostname must be configured for both forward and reverse DNS. The Ambari
and Hadoop applications must be able to translate the hostname to its IP address and back in
order to work properly.
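As a quick check that this requirement is met (a minimal sketch using the getent utility that ships with CentOS 7; replace the placeholders with your own values), run:
getent hosts <hostname>
getent hosts <ip-address-of-this-host>
The first command should return the host's IP address and the second should map the IP address back to the hostname. If either lookup fails, correct your DNS (or /etc/hosts) entries before installing Ambari.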

3.1 Optional - Configure Networking


The Hadoop VM is set to use DHCP by default. If a static IP is desired, you may choose either of these options:
a. Use this utility:
/root/setup/network_setup.sh
b. Update this file:
/etc/sysconfig/network-scripts/ifcfg-eth0
Below is an example of a static IP setting in ifcfg-eth0:
BOOTPROTO=static
IPADDR=10.3.16.227
NETMASK=255.255.0.0
GATEWAY=10.3.30.254
DNS1=10.3.110.224
DNS2=10.3.110.104
NM_CONTROLLED=no
Restart the network service or reboot the server for the changes to take effect. Command to restart the
network:
$ sudo systemctl restart network
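To confirm that the new settings took effect (a quick check that assumes the interface is named eth0, as in the example above), run:
$ ip addr show eth0
$ ip route show
The output should show the static IP address and the configured default gateway.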

4. Configure Ambari
Run this script to configure the host as an Ambari Server:
# /root/setup/ambari_setup.sh
This will prompt you to set the admin password, which will be the password of the Ambari admin user. You can
use these credentials to access the Ambari UI.
When this script finishes, it shows you the private key that has been set up for the node. Please save the key.
You will need it later, when registering hosts in the cluster install wizard (see "Install Options").


You can always view the private key later by issuing the cat .ssh/id_rsa command on the Ambari server
node.


5. Steps to Set up Additional Nodes in the Hadoop Cluster


NOTE For a Demo installation this step is optional because a one-node Hadoop deployment
is sufficient for demo purposes.

The OVA image can be used to create additional nodes for the cluster.
Perform the steps "2. Initial Login Credentials" and "3. Set the Hostname" on each additional node.
Then, instead of running the ambari_setup.sh script, run the following script:
> /root/setup/node_setup.sh
This will prompt you for the public key of the Ambari server. Go to the Ambari server to retrieve the public key.
This can be either of the two files:
> /root/.ssh/authorized_keys
> /root/.ssh/id_rsa.pub
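For example, to display the public key on the Ambari server so you can copy it into the prompt (assuming the default key location created by ambari_setup.sh):
# cat /root/.ssh/id_rsa.pub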
Once the node is set up, log in to the Ambari server and add the node to the cluster with the following steps:
1. Go to the Ambari UI.
2. Click the Hosts button in the menu on the left. You should see the list of nodes already present on the
cluster.
3. Click Actions -> Add New Hosts.


NOTE
> Because the host names field accepts more than one entry, you can add several nodes
at the same time. However, the previous steps are required on every node.
> The Ambari server must be able to resolve the hostname of each node, and each node must be
accessible from it.

4. In the SSH private key field, paste the private key of the Ambari Server node (typically, the first
node of your cluster).
5. Click Register and Confirm.
6. Now, you just have to follow the instructions and add the desired services on the new node.

6. Browse the Ambari Server


The Ambari server is configured to use TLS (HTTPS) on port 443 with a self-signed certificate.
1. Log in to the Ambari GUI by using the hostname configured in the step "3. Set the Hostname".
2. Log in as admin with the password that you set in the step "4. Configure Ambari".
3. Click LAUNCH INSTALL WIZARD.

TIP Ambari UI Not responding


If you cannot access the Ambari UI, try restarting the Ambari service. Open an SSH session on
the machine and run the following command:
ambari-server restart
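Before restarting, you can also check whether the server process is running by using the status sub-command:
ambari-server status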

7. Hadoop Services Installation Through Ambari

Name the Cluster


Choose a name that is appropriate and descriptive for your cluster.


Select Version
At this point, the Select Version screen is displayed.

Here, the repositories are pre-filled, so you do not need to enter them. The local repository will point to
file:///var/repo/hdp.
The only OS needed in the Ambari configuration is Red Hat 7, with the string file:///var/repo/hdp as the value for
the three paths. If you see any other OS listed, you can remove it.

Install Options
In this screen the hosts are configured. Enter the hostnames and the private key of all nodes that will be in your
cluster. The private key is the one that you saved in the step "4. Configure Ambari".

Click REGISTER AND CONFIRM. After that, the installation will start on all nodes. This can take a few
minutes.


NOTE When this step finishes you may see the message "Some warnings were
encountered while performing checks against the registered hosts above". This is common
behavior and usually there is no reason to worry, so you can skip it.

Choose Services
Here, we are going to select the services that will be available on our virtual machine. These are the required
services for DDC:
> HDFS
> Yarn + MapReduce2
> HBase
> Zookeeper
> Ambari_metrics
> Knox


In the list shown in the GUI, we only need to select those. In the next step, we'll add PQS to the cluster.

TIP If any of the required services fails to start automatically, you may need to start it
manually. In case of repeated problems, please check out the service log for additional
troubleshooting information. Refer to the article "Viewing Service Logs" on the Hortonworks
documentation pages for details.

Assign Masters
In our example, we are only using one node, so it will be the master for all the services. In production
environments, you can select other nodes as the service master.


Assign Slaves and Clients

NOTE This step is really important. Make sure that you select "Phoenix Query Server"
because PQS is a hard requirement for DDC.

Customize Services
At this point, you only need to provide credentials for different services and review the installation.


1. Create the Knox master secret password and store it somewhere. Then, keep clicking NEXT until you reach
the screen that shows DEPLOY. Click it.
2. When the installation completes, click NEXT and then click COMPLETE.
For more detailed information about these steps, please refer to the official Hortonworks
documentation.

8. Namenode HA Configuration
This section is only required if you are deploying TDP on more than one node.

NOTE It is recommended to perform this step at setup, as any changes to the Namenode HA
configuration will result in downtime.

1. Go to the Ambari UI and select the HDFS services (Services > HDFS in the Ambari menu on the left).
2. Click ACTIONS and then the Enable NameNode HA section.
3. Follow the wizard.


9. Configuring Knox
The Apache Knox Gateway is a system that provides a single point of authentication and access for Hadoop
services in a cluster. While Knox supports many authentication/federation providers, this document only
describes configuring LDAP. Please check the Apache Knox Gateway online documentation for details on the other
providers:
http://knox.apache.org/books/knox-1-0-0/user-guide.html#Authentication

WARNING It is your responsibility as the system owner to properly configure
the authentication to secure the production environment. For demo purposes you
can opt for the built-in LDAP server but for everything else you should configure
your company's authentication/federation provider.

A change in the Knox configuration is needed to provide access from the DDC client. There are two ways a
user could configure Knox for DDC. One is to modify the default topology and the other is to create a new
topology. If you want to create a new topology, refer to the Knox documentation.

Modifying Embedded LDAP


To set up the embedded LDAP for demo usage, follow these steps:
1. Go to Knox configuration in the Ambari server UI.
2. (applicable for demo purposes only) Go to the ACTIONS menu and click Start Demo LDAP. Without
starting the demo LDAP you cannot use the Knox logins.

3. Expand the Advanced topology section.


4. Add an entry for the PQS server.


• Single node configuration:
<service>
<role>AVATICA</role>
<url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER>:8765</url>
</service>

• (optional) If the cluster is Namenode HA enabled, you have to list all PQS nodes:
<service>
<role>AVATICA</role>
<url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER1>:8765</url>
<url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER2>:8765</url>
<url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER3>:8765</url>
...
</service>

5. (optional) If the cluster is Namenode HA enabled, add the following section:
<provider>
<role>ha</role>
<name>HaProvider</name>
<enabled>true</enabled>
<param>
<name>WEBHDFS</name>

<value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
</param>
</provider>

6. (applicable for non-demo purposes) Example for LDAP:


<provider>
<role>authentication</role>
<name>ShiroProvider</name>
<enabled>true</enabled>
<param>
<name>main.ldapRealm</name>
<value>org.apache.shiro.realm.ldap.JndiLdapRealm</value>
</param>
<param>
<name>main.ldapRealm.userDnTemplate</name>
<value>uid={0},ou=people,dc=hadoop,dc=apache,dc=org</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.url</name>
<value>ldap://localhost:33389</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.authenticationMechanism</name>
<value>simple</value>
</param>
<param>
<name>urls./**</name>
<value>authcBasic</value>
</param>
</provider>

7. Click SAVE then restart all the affected components.


At the top of the screen it will tell you that a restart is required and there is an orange RESTART button.
Click that button and select Restart All Affected.
8. With this configuration, use '/gateway/default/avatica' for the PQS URI and '/gateway/default/webhdfs/v1'
for the HDFS URI in "Configuring HDFS" and "Configuring HBase" in the "Thales CipherTrust Data
Discovery and Classification Deployment Guide".
For instructions on configuring an external LDAP or Active Directory refer to the Apache Knox online
documentation:
> LDAP Configuration
> Active Directory Configuration

Knox Authentication Information in Ambari


When you are using the demo LDAP embedded with Knox, there is a file with the user credentials. To view
these default credentials, select:
Knox > CONFIGS > Advanced-users-ldif


Scroll down to the ‘#entry for sample user admin’ to find the ‘userPassword’, which by default is ‘admin-password’.
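To confirm that the gateway and the demo credentials work, you can issue a test request through Knox (a sketch only; it assumes the default topology, the admin/admin-password demo user shown above, and that Knox is listening on port 8443):
$ curl -k -u admin:admin-password 'https://<HOSTNAME_OF_KNOX_SERVER>:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS'
A successful call returns a JSON listing of the HDFS root directory.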


NOTE
In production environments, when you are using a real LDAP or Active Directory, you have to
use a user with admin permissions on the service.

10. Configuring Phoenix Query Server (HBase)


Apache Phoenix requires additional modification before you can use it with DDC. You need to enable
namespace mapping and map the system tables to the namespace. For this procedure we are using the
Ambari UI.
1. In the advanced HBase configuration, scroll down to Custom hbase-site (HBase > CONFIGS >
ADVANCED > Custom hbase-site).


2. Use the Add Property... link to add these two additional properties:
phoenix.schema.isNamespaceMappingEnabled=true
phoenix.schema.mapSystemTablesToNamespace=true

NOTE These properties will be added to the hbase-site.xml configuration file.

3. Click SAVE then restart all the affected components.


At the top of the screen it will tell you that a restart is required and there is an orange RESTART button.
Click that button and select Restart All Affected.
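If you want to verify that the properties were pushed to the nodes after the restart, one option (assuming the standard HDP client configuration path /etc/hbase/conf) is to check the generated file on an HBase node:
# grep -A1 'phoenix.schema' /etc/hbase/conf/hbase-site.xml
Both properties should appear with a value of true.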


11. Creating DDC Directory Under HDFS


For DDC to utilize the Hadoop Distributed File System (HDFS), you need to create a directory under HDFS. DDC
will use this space for storing scan results and reports. You can create it through the command line or by using
the Ambari UI. You need to do this only on the primary node. This directory can have any name, but for this
example and throughout the following sections of this document, we will use /ciphertrust_ddc.

TIP You will need this path later to configure CipherTrust. See "Configuring HDFS" in the
"Thales CipherTrust Data Discovery and Classification Deployment Guide".

Creating the DDC Directory Using the Command Line


1. SSH to the TDP instance and log in as root.
2. Switch to the "hdfs" user, who has permissions to create and destroy folders:
su - hdfs
3. Create the DDC directory in HDFS by issuing this command:
hdfs dfs -mkdir /ciphertrust_ddc
4. Because the default permissions will not allow the DDC scans to write to this directory, you have to
change the access permissions.
Issue this command to grant full access rights to the directory for all users:
hdfs dfs -chmod 777 /ciphertrust_ddc
NOTE: After entering the commands above, enter 'exit' to return to the root prompt.
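To confirm the directory and its permissions before exiting the hdfs user session, you can list the HDFS root:
hdfs dfs -ls /
The /ciphertrust_ddc entry should be shown with drwxrwxrwx permissions.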

Managing the DDC Directory Using the User Interface

NOTE This section is just for your information and optional. You do not have to perform the
steps below.

Once you have created the HDFS directory using the CLI, you can use the user interface to create
sub-directories, download the files, or simply check the HDFS status.
1. Open the Namenode UI.


2. Click to expand the Utilities menu and then click the Browse the file system link.

From here, you can see all the HDFS directories. In this particular case, you cannot create a directory under the
root because the user used by the UI does not have permission to do so.

12. Creating the PQS Schema


DDC uses HBase to store scan results and reports, and uses Phoenix Query Server (PQS) to store and access this
data. To that end, you have to create a PQS schema in the cluster. We will use ciphertrust_ddc as the name
of the schema.


TIP There is a known issue with the PQS library that can cause an error when the user tries to
configure the PQS service on DDC for the first time. It can be avoided by creating the PQS
schema before configuring DDC.

1. Open an SSH session on a Hadoop node which has PQS installed.


If PQS is installed on multiple nodes, choose any one of them. If you don't remember which node has PQS
installed, you can go to the Hosts tab and see the services installed on each node.
2. Run the sqlline-thin.py script to open the SQL interpreter. For example:
/usr/hdp/3.1.5.0-316/phoenix/bin/sqlline-thin.py http://localhost:8765
(Typically, you will find the script in /usr/hdp/3.1.5.0-316/phoenix/bin/).
3. Within the SQL interpreter, run the following command to create the DDC schema:
CREATE SCHEMA IF NOT EXISTS ciphertrust_ddc;
4. Use Ctrl-D to exit.
Use ciphertrust_ddc as the PQS schema in the CipherTrust configuration in the "Configuring HBase" section
of the "Thales CipherTrust Data Discovery and Classification Deployment Guide".
You can find more details on creating the PQS schema in the official Phoenix documentation (section Usage =>
Client).
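If you want to confirm that the schema exists before exiting the interpreter, one option (a sketch that assumes the default Phoenix system tables are in place) is to query the Phoenix catalog:
SELECT DISTINCT TABLE_SCHEM FROM SYSTEM.CATALOG;
The result should include a CIPHERTRUST_DDC row (Phoenix stores unquoted identifiers in upper case).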

13. Updating and Exporting the Knox Server Certificate


When the Knox service is installed by Ambari, a self-signed certificate is created internally. However, the
created certificate uses SHA1 and a key size of 1024 bits. The update_knox_cert.sh script below will
replace the certificate with a new one using SHA256 and a key size of 2048 bits.
1. Run this script in an SSH session on the TDP instance:
# /root/setup/update_knox_cert.sh
2. When this script is run, it will prompt for the Knox master secret.
This secret is the one that you set in the step "7. Hadoop Services Installation Through Ambari > Customize
Services" (CREDENTIALS > Knox Master Secret).
Finally, you need to export the SSL certificate of the Knox server and configure DDC to talk to Hadoop. You
need to obtain the certificate from the node where Knox is installed.
1. Put the Knox server certificate in a file, by using this command:
$ echo -n | openssl s_client -connect <IP_OF_KNOX_SERVER>:8443 | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > /tmp/hadoop.cert
2. Copy the certificate to the system where you will connect to the CipherTrust/DDC GUI.
Issue this command to display the certificate in the terminal window:
# cat /tmp/hadoop.cert
Now, you can copy the certificate from the terminal window and paste it in its own file on your machine. You
will need this file when configuring DDC to talk to Hadoop. Refer to "Configuring CipherTrust Manager" in
the "Thales CipherTrust Data Discovery and Classification Deployment Guide".


You can find the comprehensive procedure on the Hortonworks documentation pages.

14. Changing the Knox Log Level


Knox uses Log4j to keep track of its log messages. The default log level is INFO, which may quickly cause the
log file to use up the available space and DDC scans to fail. Unless you need this level of detail in the logs, you
should change it to ERROR by editing its configuration file.
1. SSH to the TDP instance and log in as root.
2. Stop Knox by issuing this command:
su -l knox -c '$gateway_home/bin/gateway.sh stop'
3. Switch to the "hdfs" user:
su - hdfs
4. Open the /etc/knox/conf/gateway-log4j.properties file for editing.
• Change the log4j.logger.audit parameter to:
log4j.logger.audit=ERROR, auditfile
• Add these two additional parameters at the end:
log4j.appender.auditfile.MaxFileSize=10MB
log4j.appender.auditfile.MaxBackupIndex=10
5. Start Knox:
su -l knox -c '$gateway_home/bin/gateway.sh start'
If after applying these changes your Knox log grows too quickly, you may have to purge the log directory. First,
however, you need to check if this is the case.
1. Check if Ambari is displaying a "NameNode Directory Status" error. This error indicates a failed directory
(that is, one or more directories are reporting as not healthy.)
2. Check Ambari for a "Failed directory count" message to find out which directories are reporting problems. If
the error message is showing "Failed directory count: 1" it may be the logs directory.
3. In the terminal, check the free disk space for /var/log by issuing the command:
df -h
If the output of the command is showing no free disk space for the /var/log directory, remove all the log
files.
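For example, a minimal cleanup could look like the following (assuming the default HDP Knox log location /var/log/knox; list the directory first and adjust the file names to what is actually present):
# ls -lh /var/log/knox/
# rm -f /var/log/knox/gateway.log.*
# rm -f /var/log/knox/gateway-audit.log.*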

15. Configuring CipherTrust Manager


Refer to the "DDC Deployment Guide" to complete the CipherTrust Manager configuration.

Upgrading the Thales Data Platform
This section describes how to upgrade TDP from version 3.1.4 to version 3.1.5.

Preparing for the Upgrade


The first stage is to download the upgrade script and the upgrade tarball and prepare the script for upgrade.
Please follow the steps below:
1. Download the TDP 3.1.5 upgrade script (ambari_upgrade.sh) from Thales support and copy it to every
node in your TDP cluster.
2. Download the tar file (rpms_315.tar.gz) from Thales support. This can be placed on an internal server
that is external to the TDP cluster nodes.
3. Place the downloaded tar file (rpms_315.tar.gz) into /var/tmp/. This way, the path for the tar file is
/var/tmp/rpms_315.tar.gz.
4. Run the ambari_upgrade.sh script on each node in the cluster:
• On the ambari-server select option 1 (ambari-server and ambari-agent)
• On all other nodes select option 2 (ambari-agent)

NOTE After the above steps, some of the cluster's components might be in a stopped/stale
status, as indicated by a red alert on the left panel or upper right of the page. It is strongly
recommended to restart all services to prepare for the upgrade. This can be done using the
Ambari server UI by clicking the 3 dots ("...") to the right of "Services" > "Stop All" then
"Start All". As an alternative, for each service (HDFS, HBase, Zookeeper and Knox), go to
"Actions" then click "Restart All".

Registering and Installing the Target Version on the Ambari-Server Node


The next step is to register and install the target version on the ambari-server node only. The steps are as
follows:
1. Log in to the ambari-server UI.
2. Browse to Cluster Admin > Stack and Versions.
3. Click the Versions tab. You see the version currently running, marked as Current.

NOTE The full version depends on the HDP version you are actually running. For example, if
you are currently running the HDP 3.1.4.0 release, you would see something like
HDP-3.1.4.0-315 as the full version number.

4. Click “Manage Versions”.


5. Proceed to register a new version by clicking “Register Version”.
6. Enter 5.0 as the Version Number so that the Name will look like HDP-3.1.5.0.
7. Check “Use local repository” and put the Base URL as file:///var/repo/hdp-315 in all 3 fields.
Click Save.


8. Click the Dashboard.


9. Browse to “Cluster Admin” > “Stack and Versions”.
10. Click the “Versions” tab.
11. On a registered target version, click “Install Packages” and click OK to confirm.
The Install version operation starts. This installs the target version on all hosts in the cluster. You can
monitor the progress of the install by clicking the “Installing” link. When the installation completes, the
“Upgrade” button will replace the “Install Packages” button.

Performing the Upgrade


At this point, you are ready to perform the upgrade, so follow all of the steps below on the ambari-server node
only:
1. Log in to the ambari-server UI.
2. Before starting the upgrade you have to disable the Auto Start Settings in the Cluster:
Admin > Service Auto Start
3. Run “Service Check” on HDFS, HBASE and KNOX in “Actions” > “Run Service Check”.
4. Browse to Cluster Admin > Stack and Versions.
5. Click the Versions tab. The registered and installed target HDP version displays an “Upgrade” button.
6. Click “Upgrade” on the target version.
Based on your current HDP version and the target HDP version, Ambari performs a set of prerequisite
checks to determine if you can perform a rolling or an express upgrade. Note that if any required checks
pop up, you will need to perform the recommended actions until all checks pass. A dialog displays the
options available.
7. Select “Express Upgrade” or “Rolling Upgrade” method.
• A "Rolling Upgrade" orchestrates the TDP upgrade in an order that is meant to preserve cluster
operation and minimize service impact during upgrade. This process has more stringent prerequisites
(particularly regarding cluster high availability configuration) and can take longer to complete than an
"Express Upgrade".
• An "Express Upgrade" orchestrates the TDP upgrade in an order that will incur cluster downtime but with
less stringent prerequisites.

TIP If there are mission critical applications running, a “Rolling Upgrade” is recommended.

8. Click “Proceed”.
9. There may be some manual steps needed at this point. Please follow the instructions given and check the
box that says “I have performed the manual steps above” before proceeding.
10. When the upgrade stages complete, click “Finalize” to complete the upgrade process.


TIP After the upgrade is completed, you can confirm the upgrade has succeeded by going to
‘Stack and Versions’ -> ‘Versions’. You will see that the current version is HDP-3.1.5.0
(3.1.5.0-316). On the ‘Stack and Versions’ -> ’Upgrade History’ tab, when you click 'Upgrade'
and see the drop-down list, you will also see details of each service’s versions before and
after upgrade, together with other information about the upgrade.

References
> https://docs.cloudera.com/HDPDocuments/Ambari-2.7.5.0/bk_ambari-upgrade/content/upgrading_HDP_register_and_install_target_version.html
> https://docs.cloudera.com/HDPDocuments/Ambari-2.7.5.0/bk_ambari-upgrade/content/upgrading_HDP_perform_the_upgrade.html

Launching the Amazon Instance
NOTE Please contact Thales support to get the Amazon Machine Image (AMI) ID that you
should use in your Amazon Web Services (AWS) region.

Installing the Instance


1. Select EC2 from your Amazon Management Console. Select a region, for example “Canada (Central)
ca-central-1”.
2. Next, create the instance by clicking Images > AMIs.
3. Search for the AMI by “AMI Name” or “AMI ID”, then select the desired AMI.
4. Follow the wizard to launch the instance:
a. Choose AMI.
Select the desired AMI from the search list and click “Launch Instance”.
b. Choose Instance Type.
i. Select “t2.xlarge” if you are installing a node that will be part of a 5-node cluster; for a single node,
select "m4.4xlarge".
ii. Click Next [Next: Configure Instance Details].
c. Configure Instance.
i. Key in the number of instances required in the “Number of instances” field. For example, 5 for a
5-node cluster and 1 for a single node.
ii. For the rest of the values go with the default options, if you are not customizing the instance, and click
Next [Next: Add Storage].
d. Add Storage.
Go with the default values and click Next [Next: Add Tags].
e. Add Tags.
i. Click “Add Tag”.
ii. Add a “Key” and a “Value”. For example: "Owner demo-user".
iii. Click Next [Next: Configure Security Group].
f. Configure Security Group.
i. Select “Create a new security group”.
ii. Change “Type” to “All traffic”.
iii. If a pre-configured group that allows all traffic is available, choose “Select an existing security group”
and select that group instead.
iv. Click [Review and Launch].
g. Review.
i. Review all the properties of the instance that you have selected in the wizard. If needed, you can edit
them in this step.


ii. When you are done, click [Launch].


h. Select an existing key pair or create a new key pair.
i. Create a new key by choosing “Create a new key pair” from the drop-down menu, name the key
pair, and download it for connecting to the instance.
ii. If the key pair was created prior to launching the AMI, you can reuse it. In this case, select “Choose an
existing key pair” and select the key by name.
iii. Click the acknowledge check box and click [Launch Instance].
5. Once your instances are launched, you will see their details on your EC2 Dashboard. You can
rename your instances; the names are only labels and will not be the hostnames of your instances.

TIP Note down the Private DNS and Public IP of your instances as these will be required
later.
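If you prefer to script the launch instead of using the console wizard, an equivalent AWS CLI call is sketched below; the AMI ID, key pair name and security group ID are placeholders that you must replace with your own values (the instance type, count and tag follow the single-node example above):
$ aws ec2 run-instances --image-id <AMI_ID> --instance-type m4.4xlarge --count 1 --key-name <KEY_PAIR_NAME> --security-group-ids <SECURITY_GROUP_ID> --tag-specifications 'ResourceType=instance,Tags=[{Key=Owner,Value=demo-user}]'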

Logging in to the Instance and Additional Configuration


The initial login for the image is:
> login: root
> password: thales123
You will be prompted to change the password after initial login.
1. Change the hostname.
Set a hostname for your Ambari server and nodes of the Hadoop cluster. Change the hostname to the
Private DNS. Run this command, for example, as root to set a new hostname:
# hostnamectl set-hostname ip-172-31-3-54.ca-central-1.compute.internal
Repeat this on all hosts!
2. Continue with the configuration by following the steps in "Configure Ambari" under "Setting Up the Ambari
Server" up to step 13. Then configure CipherTrust Manager to complete the installation.


Securing Hadoop Configuration


Thales strongly recommends appropriately configuring and securing your Hadoop environment according to
industry best practices and your organization’s security policies.
The following references are not comprehensive, but are intended to provide a starting point for background,
tools, and best practices that may be applied to your Hadoop environment:
1. “HDP Security Overview”:
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/security-overview/content/hdp_security_overview.html
2. “Introduction to Hadoop Security”:
https://www.bmc.com/blogs/hadoop-security/
3. “Securing Hadoop: Security Recommendations for Hadoop" (Paper):
https://securosis.com/blog/securing-hadoop-security-recommendations-for-hadoop-new-paper
4. "Hadoop Security" (O’Reilly Book):
https://www.oreilly.com/library/view/hadoop-security/9781491900970/


Hadoop Backup
At some point, you may want to make a backup of everything in Hadoop that is related to DDC. Such a
backup will include the HBase tables and all the generated files (.tar) with the information of the scans (HDFS).
In order to save the DDC data and create a backup you have to perform these two steps (separately):
1. Back up HDFS
2. Back up HBase

Preparing for the Backup


> Execute the df -h command in the hdfs directory to be copied and make sure that there is enough space in
the destination location.

NOTE
> It is not necessary to make the backup on each node, but it is good practice to make the
backup on the name node.
> You need to have root privileges to make a backup and restore it.
> To run Export/Import commands, you need to switch to the 'hdfs' user.

Backup Options
There are several options to make and restore a backup:
> HBASE - Full Shutdown Backup (with stopping of the service)
> HBASE - Make and restore a snapshot (recommended option)
> HBASE - Export and Import a table. WARNING: This option is not a Clone of the table and inconsistencies
could appear
> HDFS - By using the distcp command
For more information about these options, refer to the official Cloudera documentation.

Create HDFS Backup


The best way to create an HDFS backup on a different cluster is to use DistCp. The most common use of DistCp
is an inter-cluster copy:
hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination
Where:
> nn1 is the name node where your data is located
> the source is the folder where the .tar files are (that is, the folder indicated in the path field when HDFS is
configured in DDC)
> nn2 is the name node where you want to save your data (nn2 can be the same as nn1)


> the destination is the folder to which the .tar files will be copied and saved
Example command:
hadoop distcp hdfs://sys146116.i.vormetric.com:8020/ddc_demo_83 hdfs://sys146116.i.vormetric.com:8020/ddc_backup/hdfs_dir/ddc_demo_83
You can find more information on using DistCp on the official webpage or on the Cloudera blog.

NOTE
> The destination folder should be created before executing the distcp command.
> Note that DistCp requires absolute paths.
> These actions are performed on the ACTIVE NameNode and as the 'hdfs' user.
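For example, to create the destination folder used in the example above before running distcp (run as the 'hdfs' user):
hdfs dfs -mkdir -p /ddc_backup/hdfs_dir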

Restore HDFS Backup


To restore a backup you also have to use the DistCP command. Again, the most common use of DistCp is an
inter-cluster copy:
hadoop distcp hdfs://nn2:8020/destination hdfs://nn1:8020/source
Where:
> nn1 is the name node where your data is located
> the source is the folder where the .tar files are (that is, the folder indicated in the path field when HDFS is
configured in DDC)
> nn2 is the name node where you want to save your data ( nn2 can be the same as nn1 )
> the destination is the folder to which the .tar files will be copied and saved
Example command:
hadoop distcp hdfs://sys146116.i.vormetric.com:8020/ddc_backup/hdfs_dir/ddc_demo_83 hdfs://sys146116.i.vormetric.com:8020/ddc_demo_83

NOTE
> If the file already exists, it will be skipped. New files (not yet backed up) will still be there,
and deleted files will be restored.
> If you want to completely restore the folder (and only keep the files that were there when
the copy was made), you have to execute a command to delete the files. This action is left
to your discretion.
> These actions must be made on the NameNode and as 'hdfs' user.

Create and Restore HBase Backup


There are different approaches to creating HBase backups. All of them are described at this link:
https://blog.cloudera.com/approaches-to-backup-and-disaster-recovery-in-hbase/#export
Here, we describe the most common scenario - a full backup.


NOTE It is possible to make a complete backup by stopping the services and using the distcp
command. For more details on a full shutdown backup see the HBase documentation.

Snapshots
If you do not want to stop the services, you can use HBase snapshots by executing the steps as follows:
> Take a snapshot for each table (hbase doc: Take a Snapshot)
> Restore the snapshot for the tables that you want (hbase doc: Restore a snapshot)

How to use the HBase snapshot utility


To run these commands, open an SSH session on the HBase Master node. Start by listing all the
tables of the original HBase. Note that "DDC_SCHEMA1" is the schema name defined in the PQS
configuration (PQS tab of Hadoop Services in Settings).
$ hbase shell
$ hbase(main):001:0> list
TABLE
DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT
DDC_SCHEMA1:DATA_OBJECT_REPORT
DDC_SCHEMA1:SCAN_EXECUTION_REPORT
SYSTEM:CATALOG
SYSTEM:FUNCTION
SYSTEM:LOG
SYSTEM:MUTEX
SYSTEM:SEQUENCE
SYSTEM:STATS
TEST:DATASTORE_SUMMARY_REPORT
TEST:DATA_OBJECT_REPORT
TEST:SCAN_EXECUTION_REPORT
12 row(s)

Creating a Backup
Take the snapshot with the command
snapshot 'myTable', 'myTableSnapshot-122112'
and list the snapshots:
$ hbase(main):002:0> snapshot 'DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT', 'myTableSnapshot-datastore_summary_report'
$ hbase(main):003:0> snapshot 'DDC_SCHEMA1:DATA_OBJECT_REPORT', 'myTableSnapshot-data_object_report'
$ hbase(main):004:0> snapshot 'DDC_SCHEMA1:SCAN_EXECUTION_REPORT', 'myTableSnapshot-scan_execution_report'
$ hbase(main):005:0> list_snapshots
SNAPSHOT TABLE + CREATION TIME
myTableSnapshot-datastore_summary_report DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT (2020-11-24 06:36:13 -0800)
myTableSnapshot-data_object_report DDC_SCHEMA1:DATA_OBJECT_REPORT (2020-11-24 06:37:05 -0800)
myTableSnapshot-scan_execution_report DDC_SCHEMA1:SCAN_EXECUTION_REPORT (2020-11-24 06:37:12 -0800)
3 row(s)


Restoring the Backup


Restore the backup by executing the commands as follows:
$ hbase(main):006:0> disable 'DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT'
$ hbase(main):007:0> restore_snapshot 'myTableSnapshot-datastore_summary_report'
$ hbase(main):008:0> enable 'DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT'
$ hbase(main):009:0> disable 'DDC_SCHEMA1:DATA_OBJECT_REPORT'
$ hbase(main):010:0> restore_snapshot 'myTableSnapshot-data_object_report'
$ hbase(main):011:0> enable 'DDC_SCHEMA1:DATA_OBJECT_REPORT'
$ hbase(main):012:0> disable 'DDC_SCHEMA1:SCAN_EXECUTION_REPORT'
$ hbase(main):013:0> restore_snapshot 'myTableSnapshot-scan_execution_report'
$ hbase(main):014:0> enable 'DDC_SCHEMA1:SCAN_EXECUTION_REPORT'

Export/Import the Tables


The HBase tables can be exported to HDFS using the "Export" command. After that, we can use DistCp to store
the data somewhere else.
To export the tables related to DDC, it is necessary to know the schema that we are using in DDC. The tables
are the same for each schema:
DATASTORE_SUMMARY_REPORT, DATA_OBJECT_REPORT and SCAN_EXECUTION_REPORT

How to use Export


First of all, let's list all the tables of our origin HBase:
$ hbase shell
$ hbase(main):001:0> list
TABLE
DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT
DDC_SCHEMA1:DATA_OBJECT_REPORT
DDC_SCHEMA1:SCAN_EXECUTION_REPORT
SYSTEM:CATALOG
SYSTEM:FUNCTION
SYSTEM:LOG
SYSTEM:MUTEX
SYSTEM:SEQUENCE
SYSTEM:STATS
TEST:DATASTORE_SUMMARY_REPORT
TEST:DATA_OBJECT_REPORT
TEST:SCAN_EXECUTION_REPORT
12 row(s)
$ hbase(main):001:0> quit
Export the tables to an HDFS directory by executing the hbase export command:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir>
$ hbase org.apache.hadoop.hbase.mapreduce.Export DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT hdfs:///ddc_backup/hbase/ddc_schema1_datastore_summary_report
$ hbase org.apache.hadoop.hbase.mapreduce.Export DDC_SCHEMA1:DATA_OBJECT_REPORT hdfs:///ddc_backup/hbase/ddc_schema1_data_object_report
$ hbase org.apache.hadoop.hbase.mapreduce.Export DDC_SCHEMA1:SCAN_EXECUTION_REPORT hdfs:///ddc_backup/hbase/ddc_schema1_scan_execution_report

NOTE The output directory must not exist.
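As mentioned above, you can then use DistCp to move the exported data off the cluster. A sketch (backup-nn is a placeholder for the name node of your backup cluster):
hadoop distcp hdfs:///ddc_backup/hbase hdfs://backup-nn:8020/ddc_backup/hbase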


How to use Import


If you want to restore the same tables in the same schema, import the tables from the HDFS directory where
the previous export is stored. To do that, execute the hbase import command:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
$ hbase org.apache.hadoop.hbase.mapreduce.Import DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT hdfs:///ddc_backup/hbase/ddc_schema1_datastore_summary_report
$ hbase org.apache.hadoop.hbase.mapreduce.Import DDC_SCHEMA1:DATA_OBJECT_REPORT hdfs:///ddc_backup/hbase/ddc_schema1_data_object_report
$ hbase org.apache.hadoop.hbase.mapreduce.Import DDC_SCHEMA1:SCAN_EXECUTION_REPORT hdfs:///ddc_backup/hbase/ddc_schema1_scan_execution_report

NOTE
> The import utility replaces the existing rows, but it does not clone the table; rows added after the
export are kept.
> There may be inconsistencies in the data, especially in the latest reports.
