INSTALLATION GUIDE
Document Information
Trademarks
All intellectual property is protected by copyright. All trademarks and product names used or referred to are the
copyright of their respective owners. No part of this document may be reproduced, stored in a retrieval system
or transmitted in any form or by any means, electronic, mechanical, chemical, photocopy, recording or
otherwise without the prior written permission of Thales.
Disclaimer
Thales TDP (Thales Data Platform) is based on Hadoop, an open source software or freeware component.
Hadoop is packaged with TDP for your convenience but is provided on an "AS-IS" basis. Thales will only
provide support for TDP covering TDP cluster installation, configuration, and the setup of DDC. Thales
expressly excludes any support for, including but not limited to, optimization, network configuration, hardware
configuration or any other Hadoop administration not related to the initial DDC setup; such tasks shall be the
sole and exclusive responsibility of Licensee.
Each free software or open source component provided with TDP is the copyright of its respective copyright
owner. The incorporated open source projects are released primarily under the Apache Software License 2.0
(“ASLv2”, full description at https://apache.org/licenses/LICENSE-2.0) whereas other software included may
be released under the terms of alternative ASLv2 compatible open source licenses. Please review the license
and notice files accompanying the software for additional licensing information.
Thales makes no representations or warranties with respect to the contents of this document and specifically
disclaims any implied warranties of merchantability or fitness for any particular purpose. Furthermore, Thales
reserves the right to revise this publication and to make changes from time to time in the content hereof without
the obligation upon Thales to notify any person or organization of any such revisions or changes.
We have attempted to make these documents complete, accurate, and useful, but we cannot guarantee them
to be perfect. When we discover errors or omissions, or they are brought to our attention, we endeavor to
correct them in succeeding releases of the product.
Thales invites constructive comments on the contents of this document. Send your comments, together with
your personal and/or company details to the address below.
Mail Thales
4690 Millennium Drive
Belcamp, Maryland 21017
USA
Email technical.support.DIS@thalesgroup.com
Hadoop Backup 32
Preparing for the Backup 32
Backup Options 32
Create HDFS Backup 32
Restore HDFS Backup 33
Create and Restore HBase Backup 33
Snapshots 34
How to use the HBase snapshot utility 34
Creating a Backup 34
Restoring the Backup 35
Export/Import the Tables 35
How to use Export 35
How to use Import 36
NOTE
> When installing the TDP OVA on ESXi 6.0 you may see an error regarding the vSphere Client not supporting the SHA256 hashing algorithm. If you see this error, please refer to the following KB article: https://kb.portsys.com/help/the-ovf-package-is-invalid-and-cannot-be-deployed-sha256-error
> On ESXi 6.0 you may see a warning that the configured guest (CentOS 4/5 or later) for this
virtual machine does not match the guest that is currently running (CentOS 7). This is only
a warning message. You may click the X button to dismiss it.
1. The Image
The virtual machine must have the following specifications:
> 64-bit Intel/AMD CPU.
> OVA: VMware vSphere ESXi 6.0 or later. The OVA can be downloaded from the Thales portal.
> AMI: You will need to instantiate the AMI in your AWS. For instructions on instantiating the AMI, refer to
"Launching the Amazon Instance" on page 29.
> OS type: Linux, CentOS 7 (64-bit).
> Minimum of 150 GB disk.
> At least 4 CPUs.
> At least 16 GB RAM. At least 64 GB RAM if deployed as a single node.
NOTE The user will be prompted to change the root password after the initial login.
NOTE This hostname must be configured for both forward and reverse DNS. The Ambari
and Hadoop applications must be able to translate the hostname to its IP address and back in
order to work properly.
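For example, you can verify both directions of the resolution from the node with a minimal sketch like the following (assuming the nslookup tool from bind-utils is installed; the hostname and IP address are placeholders for your own values):
# nslookup tdp-node1.example.com
# nslookup 10.0.0.15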
4. Configure Ambari
Run this script to configure the host as an Ambari Server:
# /root/setup/ambari_setup.sh
The script prompts you to set the password for the Ambari admin user. You can use these credentials to
access the Ambari UI.
When this script finishes, it shows you the private key that has been set up for the node. Please save the key.
You will need it for setting up the Ambari server private key.
You can always view the private key later by issuing the cat /root/.ssh/id_rsa command.
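The output is similar to the following sketch (the key material is unique to your node and is truncated here; depending on the image, the header may read OPENSSH instead of RSA):
# cat /root/.ssh/id_rsa
-----BEGIN RSA PRIVATE KEY-----
MIIEogIBAAKCAQEA...
-----END RSA PRIVATE KEY-----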
The OVA image can be used to create additional nodes for the cluster.
Perform the steps "2. Initial Login Credentials" on page 6 and "3. Set the Hostname" on page 7.
Instead of running the ambari_setup.sh script from "4. Configure Ambari", run the following script:
# /root/setup/node_setup.sh
This will prompt you for the public key of the Ambari server. Go to the Ambari server to retrieve the public key.
The key can be found in either of these two files:
> /root/.ssh/authorized_keys
> /root/.ssh/id_rsa.pub
Once the node is set up, log in to the Ambari server and add the node to the cluster with the following steps:
1. Go to the Ambari UI.
2. Click the Hosts button in the menu on the left. You should see the list of nodes already present on the
cluster.
3. Click Actions -> Add New Hosts.
NOTE
> Since the host names field accepts multiple entries, you can add more than one node at the same
time. However, the previous steps are required on every node.
> The Ambari server must be able to resolve the hostname of each node, and each node must be
accessible from it.
4. In the SSH private key field, paste the private key of the Ambari Server node (typically, the first
node of your cluster).
5. Click Register and Confirm.
6. Now, you just have to follow the instructions and add the desired services on the new node.
Select Version
At this point, the Select Version screen is displayed.
Here, the repositories are pre-filled, so you do not need to enter them manually. The local repository points to
file:///var/repo/hdp.
The only OS needed in the Ambari configuration is Red Hat 7, with the string file:///var/repo/hdp as the value
for the three repository paths. If any other OS is listed, you can remove it.
Install Options
In this screen the hosts are configured. Enter the hostname of every node that will be in your cluster, together
with the private key. The private key is the one that you saved in the step "4. Configure Ambari" on page 7.
Click REGISTER AND CONFIRM. After that, the installation will start for all nodes. This can take a few
minutes.
NOTE When this step finishes you may see the message "Some warnings were
encountered while performing checks against the registered hosts above". This is common
behavior and usually no cause for concern, so you can skip it and continue.
Choose Services
Here, we are going to select the services that will be available on our virtual machine. These are the required
services for DDC:
> HDFS
> Yarn + MapReduce2
> HBase
> Zookeeper
> Ambari Metrics
> Knox
In the list shown in the GUI, select only these services. In the next step, we'll add PQS (Phoenix Query Server) to the cluster.
TIP If any of the required services fails to start automatically, you may need to start it
manually. In case of repeated problems, please check out the service log for additional
troubleshooting information. Refer to the article "Viewing Service Logs" on the Hortonworks
documentation pages for details.
Assign Masters
In our example, we're only using one node, so it will be the master for all the services. In production
environments, you can select other nodes as the service master.
NOTE This step is really important. Make sure that you select "Phoenix Query Server"
because PQS is a hard requirement for DDC.
Customize Services
At this point, you only need to provide credentials for different services and review the installation.
1. Create the Knox master secret password and store it somewhere safe. Then, keep clicking NEXT until
you reach the screen that shows DEPLOY. Click it.
2. When the installation completes, click NEXT and then click COMPLETE.
For more detailed information about these steps, refer to the official Hortonworks documentation.
8. Namenode HA Configuration
This section is only required if you are deploying TDP on more than one node.
NOTE It is recommended to perform this step at setup, as any changes to the Namenode HA
configuration will result in downtime.
1. Go to the Ambari UI and select the HDFS services (Services > HDFS in the Ambari menu on the left).
2. Click ACTIONS and then select Enable NameNode HA.
3. Follow the wizard.
9. Configuring Knox
The Apache Knox Gateway is a system that provides a single point of authentication and access for Hadoop
services in a cluster. While Knox supports many authentication/federation providers, this document only
describes configuring LDAP. Please check the Apache Knox Gateway online documentation for details on the
other providers:
http://knox.apache.org/books/knox-1-0-0/user-guide.html#Authentication
A change in the Knox configuration is needed to provide access from the DDC client. There are two ways a
user could configure Knox for DDC. One is to modify the default topology and the other is to create a new
topology. If you want to create a new topology, refer to the Knox documentation.
> (optional) If the cluster is NameNode HA enabled, you have to list all PQS nodes:
<service>
<role>AVATICA</role>
<url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER1>:8765</url>
<url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER2>:8765</url>
<url>http://<HOSTNAME_OF_PHOENIX_QUERY_SERVER3>:8765</url>
...
</service>
5. (optional for creating a cluster) If the cluster is Namenode HA enabled, add the following section:
<provider>
<role>ha</role>
<name>HaProvider</name>
<enabled>true</enabled>
<param>
<name>WEBHDFS</name>
<value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
</param>
</provider>
Scroll down to the ‘#entry for sample user admin’ to find the ‘userPassword’, which by default is ‘admin-password’.
NOTE
In production environments, when you are using a real LDAP or Active Directory, you have to
use a user with admin permissions on the service.
2. Use the Add Property... link to add these two additional properties:
phoenix.schema.isNamespaceMappingEnabled=true
phoenix.schema.mapSystemTablesToNamespace=true
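For reference, if you manage configuration files directly rather than through Ambari, the equivalent hbase-site.xml entries would look like the following sketch (in an Ambari-managed cluster, make the change in the UI so it is distributed to every node):
<property>
  <name>phoenix.schema.isNamespaceMappingEnabled</name>
  <value>true</value>
</property>
<property>
  <name>phoenix.schema.mapSystemTablesToNamespace</name>
  <value>true</value>
</property>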
TIP You will need this path later to configure CipherTrust. See "Configuring HDFS" in the
"Thales CipherTrust Data Discovery and Classification Deployment Guide".
NOTE This section is just for your information and optional. You do not have to perform the
steps below.
Once you have created the HDFS directory using the CLI, you can use the user interface to create
subdirectories, download the files, or simply check the HDFS status.
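For reference, creating an HDFS directory from the CLI looks like this minimal sketch (the path /ddc and its owner are illustrative assumptions; use the values from your DDC configuration):
# su - hdfs
$ hdfs dfs -mkdir -p /ddc
$ hdfs dfs -chown admin /ddc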
1. Open the NameNode UI in your browser.
2. Click to expand the Utilities menu, and then click the Browse the file system link.
From here, you can see all the HDFS directories. In this particular case, you cannot create a directory on the
root because the user used by the UI has no permissions to do it.
TIP This is a known issue with the PQS library. It can cause an error when the user tries to
configure the PQS service on DDC for the first time. It can be avoided by creating the PQS
schema before configuring DDC.
You can find the comprehensive procedure on the Hortonworks documentation pages.
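As an illustration only, creating the schema ahead of time from the Phoenix client could look like this sketch (the schema name DDC_SCHEMA1 matches the tables shown later in this guide, but verify it against your DDC configuration; the sqlline.py path is the usual HDP location and the ZooKeeper znode is the default unsecured one; CREATE SCHEMA also requires the namespace-mapping properties set earlier):
# /usr/hdp/current/phoenix-client/bin/sqlline.py <ZOOKEEPER_HOST>:2181:/hbase-unsecure
0: jdbc:phoenix:...> CREATE SCHEMA IF NOT EXISTS DDC_SCHEMA1;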
NOTE After the above steps, some of the cluster's components might be in a stopped/stale
status, as indicated by a red alert in the left panel or at the upper right of the page. It is strongly
recommended to restart all services to prepare for the upgrade. This can be done in the
Ambari server UI by clicking the 3 dots ("...") to the right of "Services", then clicking
"Stop All" followed by "Start All". As an alternative, for each service (HDFS, HBase,
Zookeeper and Knox), go to "Actions" and then click "Restart All".
NOTE The full version depends on the HDP version you are actually running. For example, if
you are currently running the HDP 3.1.4.0 release, you would see something like
HDP-3.1.4.0-315 as the full version number.
TIP If there are mission critical applications running, a “Rolling Upgrade” is recommended.
8. Click “Proceed”.
9. There may be some manual steps needed at this point. Please follow the instructions given and check the
box that says “I have performed the manual steps above” before proceeding.
10. When the upgrade stages complete, click “Finalize” to complete the upgrade process.
TIP After the upgrade is completed, you can confirm the upgrade has succeeded by going to
‘Stack and Versions’ -> ‘Versions’. You will see that the current version is HDP-3.1.5.0
(3.1.5.0-316). On the ‘Stack and Versions’ -> ’Upgrade History’ tab, when you click 'Upgrade'
and see the drop-down list, you will also see details of each service’s versions before and
after upgrade, together with other information about the upgrade.
References
> https://docs.cloudera.com/HDPDocuments/Ambari-2.7.5.0/bk_ambari-upgrade/content/upgrading_HDP_register_and_install_target_version.html
> https://docs.cloudera.com/HDPDocuments/Ambari-2.7.5.0/bk_ambari-upgrade/content/upgrading_HDP_perform_the_upgrade.html
TIP Note down the Private DNS and Public IP of your instances as these will be required
later.
Hadoop Backup
At some point, you may want to make a backup of everything in Hadoop that is related to DDC. Such a
backup includes the HBase tables and all the generated files (.tar) with the information of the scans (stored in HDFS).
To save the DDC data and create a backup, you have to perform these two steps (separately):
1. Back up HDFS
2. Back up HBase
NOTE
> It is not necessary to make the backup on each node, but it is good practice to make the
backup on the NameNode.
> You need to have root privileges to make a backup and restore it.
> To run Export/Import commands, you need to switch to the 'hdfs' user.
Backup Options
There are several options to make and restore a backup:
> HBASE - Full Shutdown Backup (with stopping of the service)
> HBASE - Make and restore a snapshot (recommended option)
> HBASE - Export and Import a table. WARNING: This option does not clone the table, and inconsistencies
could appear
> HDFS - By using the distcp command
For more information about these options, refer to the official Cloudera documentation.
> the destination is the folder to which the .tar files will be copied and saved
Example command:
hadoop distcp hdfs://sys146116.i.vormetric.com:8020/ddc_demo_83 hdfs://sys146116.i.vormetric.com:8020/ddc_backup/hdfs_dir/ddc_demo_83
You can find more information on using DistCp on the official webpage or on the Cloudera blog.
NOTE
> The destination folder should be created before executing the distcp command.
> Note that DistCp requires absolute paths.
> These actions are performed on the ACTIVE NameNode as the 'hdfs' user.
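For example, creating the destination folder from the example above could look like this sketch:
# su - hdfs
$ hdfs dfs -mkdir -p /ddc_backup/hdfs_dir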
NOTE
> If the file already exists, it will be skipped. New files (not yet backed up) will still be there,
and deleted files will be restored.
> If you want to completely restore the folder (and only keep the files that were there when
the copy was made), you have to execute a command to delete the files. This action is left
to your discretion.
> These actions must be performed on the NameNode as the 'hdfs' user.
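A restore is the same copy in the opposite direction. A sketch, assuming the paths from the example above (by default, DistCp skips files that already exist at the destination, matching the behavior described in the note):
hadoop distcp hdfs://sys146116.i.vormetric.com:8020/ddc_backup/hdfs_dir/ddc_demo_83 hdfs://sys146116.i.vormetric.com:8020/ddc_demo_83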
NOTE It is possible to make a complete backup by stopping the services and using the distcp
command. For more details on a full shutdown backup see the HBase documentation.
Snapshots
If you do not want to stop the services, you can use HBase snapshots by executing the following steps:
> Take a snapshot for each table (hbase doc: Take a Snapshot)
> Restore the snapshot for the tables that you want (hbase doc: Restore a snapshot)
Creating a Backup
Take the snapshot with the command
snapshot 'myTable', 'myTableSnapshot-122112'
and list the snapshots:
hbase(main):002:0> snapshot 'DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT', 'myTableSnapshot-datastore_summary_report'
hbase(main):003:0> snapshot 'DDC_SCHEMA1:DATA_OBJECT_REPORT', 'myTableSnapshot-data_object_report'
hbase(main):004:0> snapshot 'DDC_SCHEMA1:SCAN_EXECUTION_REPORT', 'myTableSnapshot-scan_execution_report'
hbase(main):005:0> list_snapshots
SNAPSHOT                                   TABLE + CREATION TIME
myTableSnapshot-datastore_summary_report   DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT (2020-11-24 06:36:13 -0800)
myTableSnapshot-data_object_report         DDC_SCHEMA1:DATA_OBJECT_REPORT (2020-11-24 06:37:05 -0800)
myTableSnapshot-scan_execution_report      DDC_SCHEMA1:SCAN_EXECUTION_REPORT (2020-11-24 06:37:12 -0800)
3 row(s)
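To restore the backup, restore the snapshot of each table you want from the HBase shell, as in this sketch (restore_snapshot requires the table to be disabled first; the snapshot name matches the ones created above):
hbase(main):001:0> disable 'DDC_SCHEMA1:DATA_OBJECT_REPORT'
hbase(main):002:0> restore_snapshot 'myTableSnapshot-data_object_report'
hbase(main):003:0> enable 'DDC_SCHEMA1:DATA_OBJECT_REPORT'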
NOTE
> The Import utility replaces existing rows, but it does not clone the table: rows that were added
after the export are kept.
> There may be inconsistencies in the data, especially in the latest reports.
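As an illustration, exporting a table to HDFS and importing it back could look like the following sketch (run as the 'hdfs' user; the HDFS destination path is an assumption):
$ hbase org.apache.hadoop.hbase.mapreduce.Export 'DDC_SCHEMA1:DATA_OBJECT_REPORT' /ddc_backup/export/data_object_report
$ hbase org.apache.hadoop.hbase.mapreduce.Import 'DDC_SCHEMA1:DATA_OBJECT_REPORT' /ddc_backup/export/data_object_report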