Upgrading IOP 4.1 and Ambari 2.1 from the IBM Spectrum Scale Hadoop connector to IBM HDFS Transparency
Version 1.1
Contents
1. Background
2. Upgrade Guide
2.1 Preparation
2.2 Checklist
2.3 Update steps
3. Revision History
1. Background
IBM Spectrum Scale provides integration with the Hadoop framework through a Hadoop
connector.
IBM Spectrum Scale has released two types of Hadoop connectors: the first generation Hadoop connector, and the second generation HDFS transparency connector, which has an improved architecture that leverages the native HDFS client for better compatibility, performance, and support for third-party tools. The HDFS transparency connector is the strategic direction for Hadoop support on Spectrum Scale.
The HDFS transparency rpm package name is gpfs.hdfs-protocol-<version>.<arch>.rpm.
HDFS transparency is integrated with IBM Spectrum Scale as an Ambari service in IBM BigInsights Ambari IOP 4.1, released in July 2016.
The Ambari integration package is called gpfs.hdfs-transparency.ambari-iop_4.1-<version>.noarch.rpm.
This document describes how to upgrade from IOP 4.1 and IBM Spectrum Scale with the first
generation Hadoop connector environment to an IOP 4.1 and IBM Spectrum Scale with the
second generation HDFS transparency cluster. This manual upgrade process for moving from
the first generation connector to the new HDFS Transparency connector is a one-time process.
Future upgrades will be handled through the Ambari dashboard.
For an existing cluster with IOP 4.1, Ambari 2.1, IBM Spectrum Scale, and the Hadoop connector, the following packages are deployed in your environment:
gpfs.ambari-iop_4.1-<version>.noarch.rpm
gpfs.hadoop-connector-2.7.0-<version>.<arch>.rpm
To upgrade to the second generation HDFS transparency, the following packages are required:
gpfs.hdfs-transparency.ambari-iop_4.1-0.noarch.rpm
gpfs.hdfs-protocol-<version>.x86_64.rpm
The packages above can be downloaded from the IBM DeveloperWorks - IOP with Apache
Hadoop 2nd generation HDFS Transparency webpage.
To determine the connector that the cluster is currently using, run the following commands:
To see if the first generation Hadoop connector is running, run the following command
on all the nodes where the connector is installed:
rpm -qa | grep gpfs.hadoop
To see if HDFS transparency is running, run the following command on all the nodes
where the connector is installed:
rpm -qa | grep gpfs.hdfs
If the command returns the corresponding package, then the cluster is using that connector.
To determine the Ambari integration package that the cluster is currently using, run the
following commands on the Ambari server:
To check if the first generation Hadoop connector Ambari integration package is being
used, run the following command:
rpm -qa | grep gpfs.ambari
To check if the HDFS transparency Ambari integration package is being used, run the
following command:
rpm -qa | grep gpfs.hdfs-transparency
If the command returns the corresponding package, the cluster is using that Ambari integration
package.
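The two package checks above can be combined into one small helper. The following is a sketch, not from the original document: detect_connector is a hypothetical name, and the function reads an rpm -qa listing on stdin so it can be piped live output on each node.

```shell
# Sketch: classify the installed connector generation from an `rpm -qa`
# listing. Reads package names on stdin, so it can be fed live output
# (rpm -qa | detect_connector) on each node, or canned text for testing.
detect_connector() {
    local pkgs
    pkgs=$(cat)
    if printf '%s\n' "$pkgs" | grep -q '^gpfs\.hdfs-protocol'; then
        echo "second generation (HDFS transparency)"
    elif printf '%s\n' "$pkgs" | grep -q '^gpfs\.hadoop'; then
        echo "first generation (Hadoop connector)"
    else
        echo "no Spectrum Scale connector found"
    fi
}
```

On a live node, run `rpm -qa | detect_connector` and record the result for each node before starting the upgrade.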
2. Upgrade Guide
2.1 Preparation
IMPORTANT NOTE
The current environment has information that must be captured before starting the upgrade process to HDFS transparency.
Use a separate document to write down and maintain all of the information mentioned in Step1 through Step5.
Step1) Write your Ambari server hostname. This document will refer to it as
ambari_server_host.
Step2) Write the user name and password for the Ambari database.
By default, the user name is admin and the password is admin, as set during the ambari-server setup phase. If the username and password were changed during the installation, ensure that you have the new username and password.
Step3) Write the zookeeper server hostname. If you have more than one zookeeper server, write down only one of them. This document will refer to it as zookeeper_server_host.
Step4) Write the MySQL server hostname, username, password, and database values.
From the Ambari GUI, click Hive > Configs > Advanced > Hive Metastore to get the values.
Write the following information from the Hive panel as shown in the following screenshot:
Step5) If PostgreSQL is used for Hive's metadata:
Write the PostgreSQL Database Host, Database Name, Database Username, and Database Password from Ambari GUI > Hive > Configs > Hive Metastore, shown in the following screenshot:
Database Password: This document will refer to it as Hive_PostgreSQL_Password.
Step6) Check a sample of the current data in the HBase, Hive, and BigInsights Value-Add databases.
This is a sanity check to verify that everything is functioning correctly after the upgrade is complete.
HBase
On any HBase node, run the following command to check the data:
# su - hbase
$ /usr/iop/4.1.0.0/hbase/bin/hbase shell
hbase(main):001:0> list
TABLE
ambarismoketest
moviedb
2 row(s) in 0.2050 seconds
Note: The moviedb is an example database value. Replace moviedb with the name of a
database that exists in your cluster.
Hive
On any Hive node, run the following commands to check the data:
su - hive
$ /usr/iop/4.1.0.0/hive/bin/hive
hive> show databases;
OK
bigdata
default
Time taken: 1.86 seconds, Fetched: 2 row(s)
hive> use default;
hive> show tables;
OK
hivetbl1
hivetbl10_part
Time taken: 0.077 seconds, Fetched: 2 row(s)
Note: default is an example database value. Replace default with the name of a database that
exists in your cluster.
BigInsights BigSQL
On the BigSQL head node, run the following commands to check the data:
su - bigsql
$ db2 "select schemaname from syscat.schemata"
SCHEMANAME
-----------------------------------------------------------------------------------------------
BIGSQL
DEFAULT
SYSFUN
SYSHADOOP
SYSIBM
SYSIBMADM
SYSIBMINTERNAL
SYSIBMTS
SYSPROC
SYSPUBLIC
SQLJ
SYSCAT
SYSSTAT
SYSTOOLS
GOSALESDW
NULLID
16 record(s) selected.
From the select schemaname from syscat.schemata command output, pick a schema that holds application data and use it to run the list tables command (for example, db2 list tables for schema <user-app-schema1>). Save the current table list output, which will be used later for a sanity check after the upgrade process. This example uses the <user-app-schema1> schema.
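To make the post-upgrade comparison easier, each pre-upgrade listing can be saved to a file and diffed later. A minimal sketch; save_listing and the file naming convention are illustrative, not from the original document.

```shell
# Sketch: capture a pre-upgrade listing to a file so the same command can
# be re-run after the upgrade and compared with diff.
# save_listing <label> <command...> writes the command's output to
# <label>.pre-upgrade.out in the current directory.
save_listing() {
    local label="$1"; shift
    "$@" > "${label}.pre-upgrade.out"
    echo "saved ${label}.pre-upgrade.out"
}
# example (as the bigsql user):
#   save_listing bigsql-schemata db2 "select schemaname from syscat.schemata"
```

After the upgrade, re-run the same command into a second file and compare the two with diff.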
Step7) Write the data replica value of your Spectrum Scale file system:
# mmlsfs <your-file-system> -r
flag value description
------------------- ------------------------ -----------------------------------
-r 3 Default number of data replicas
Note: Step 8 and Step 9 require the Ambari server and the Hive service to be up. If you are unable to stop all the application jobs, an alternative is to stop all Ambari services (such as YARN, Hive, and HBase) from the Ambari GUI. After all the services are stopped, start only the Hive service.
# run the following command to list all the databases in your MySQL environment:
mysql -u <Hive_MySQL_Username> -p
# enter your Hive_MySQL_Password at the prompt
MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hive |
| mysql |
| performance_schema |
+--------------------+
4 rows in set (0.01 sec)
MariaDB [(none)]>
For each database listed above, run the following command from the bash console to perform
the backup:
# for the above listed databases, run the following commands to back them up
mysqldump -u hive -p <Hive_mySQL_db> > hive.backup
The planned upgrade modifies the <Hive_mySQL_db>. However, to avoid any potential issues,
perform a backup of all the databases.
If PostgreSQL is used for the Hive metastore, back it up with pg_dump. For example:
su - <Hive_PostgreSQL_Username>
pg_dump <Hive_PostgreSQL_db> > <Hive_PostgreSQL_db>.backup
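The advice to back up all the MySQL databases can be scripted as a loop over the show databases output. A sketch; backup_all is a hypothetical helper, and DUMP_CMD is an overridable hook whose default credentials are placeholders to adapt to your environment.

```shell
# Sketch: back up every database, not only the Hive metastore. Reads
# database names on stdin (as printed by `show databases`) and dumps each
# one. DUMP_CMD defaults to a plain mysqldump invocation (placeholder
# credentials); it can be overridden for testing.
DUMP_CMD="${DUMP_CMD:-mysqldump -u <Hive_MySQL_Username> -p}"
backup_all() {
    while read -r db; do
        case "$db" in
            information_schema|performance_schema) continue ;;  # internal schemas, skip
        esac
        $DUMP_CMD "$db" > "${db}.backup"
    done
}
# live usage: mysql -u <Hive_MySQL_Username> -p -N -e 'show databases' | backup_all
```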
After performing these steps, you can perform the update to HDFS transparency.
2.2 Checklist
Review the following checklist table to ensure that all the tasks are completed before proceeding with the upgrade.
2 Downloaded the new Ambari integration for HDFS transparency
package?
3 Write the following Ambari information:
Ambari Server node hostname
Ambari username
Ambari password
4 Write the following MySQL information:
MySQL server node hostname
MySQL database username
MySQL database password
5 Performed the sanity check for Hive data?
6 Performed the sanity check for BigInsights Value-Add, such as
BigSQL?
7 Performed a backup of the Ambari database?
8 Performed a backup of the MySQL database?
2.3 Update steps
Before proceeding, ensure that you have performed all the steps in Section 2.1 Preparation.
IMPORTANT NOTE
Review the sample commands in steps 6 and 16. If you can perform the steps, then
proceed. Otherwise, contact scale@us.ibm.com for guidance.
Step1) Check whether all the services (except Hive service) on the Ambari GUI are stopped.
Note: Hive service should be active for data update in the following steps.
Step2) Remove the GPFS service with the REST API from the Ambari server by using the Bash
console as root.
curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE
http://localhost:8080/api/v1/clusters/<your-IOP-cluster-name>/services/GPFS
Note: Replace <your-IOP-cluster-name> in the above link with the cluster name. The cluster
name will be displayed in the top-left panel after logging in to the Ambari GUI. Replace
admin:admin with the Ambari username and password.
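To confirm that the DELETE took effect, the same REST endpoint can be polled with a GET; it should return HTTP 404 once the service is gone. A sketch with a hypothetical gpfs_service_status helper, reusing the placeholder credentials and cluster name from the DELETE call above.

```shell
# Sketch: report the HTTP status code of a GET on the GPFS service
# resource. 404 means the service has been removed from the cluster.
gpfs_service_status() {
    curl -s -o /dev/null -w '%{http_code}' -u admin:admin \
        -H "X-Requested-By: ambari" \
        "http://localhost:8080/api/v1/clusters/<your-IOP-cluster-name>/services/GPFS"
}
# live usage: [ "$(gpfs_service_status)" = "404" ] && echo "GPFS service removed"
```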
In the following screenshot, iop420 is the cluster name:
Refresh the Ambari GUI and check that the Spectrum Scale menu has been removed from the left panel.
Step4) Check the stack version listed on the postgres console (see Step3):
ambari=# select * from ambari.stack ;
stack_id | stack_name | stack_version
----------+-------------+-----------------------
1 | BigInsights | 4.1
2 | BigInsights | 4.0
51 | BigInsights | 4.1.SpectrumScale
(3 rows)
Write the stack_id values corresponding to the stack_version column for 4.1 and
4.1.SpectrumScale.
In the output above, 1 and 51 are the stack_ids for the corresponding stack versions 4.1 and 4.1.SpectrumScale. In later steps, database records must be changed from the stack version 4.1.SpectrumScale to 4.1. The 4.1.SpectrumScale stack version belongs to the Ambari GPFS integration package for the older Hadoop connector. The new HDFS transparency connector does not require a separate Ambari stack because it integrates as a service in the default stack.
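Rather than copying the stack_id values by hand, they can be read out of the ambari.stack table. A sketch; stack_id_for is a hypothetical helper, and the default psql connection options are assumptions about a typical Ambari postgres setup.

```shell
# Sketch: print the stack_id whose stack_version matches the argument.
# PSQL is an overridable hook; by default it runs the query against the
# ambari database (adjust user/database names for your installation).
PSQL="${PSQL:-psql -U ambari -d ambari -At -c}"
stack_id_for() {
    $PSQL "select stack_id from ambari.stack where stack_version = '$1';"
}
# live usage:
#   NEW_ID=$(stack_id_for 4.1)
#   OLD_ID=$(stack_id_for 4.1.SpectrumScale)
```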
Step5) Log in to the Hive MySQL server node and dump all the records from the MySQL database.
# first create mysql_migrate.sh
# vim mysql_migrate.sh
# cat mysql_migrate.sh
#!/bin/bash
database="$1"
username="$2"
password="$3"
Log in to the Hive PostgreSQL server node and dump all the records from the PostgreSQL database.
# first create postgresql_migrate.sh
# vim postgresql_migrate.sh
# cat postgresql_migrate.sh
#!/bin/bash
database="$1"
username="$2"
NOTE: To avoid database crashes that might occur because of using the wrong stack_id entries
for Step6 and Step16, you can send the output from Step4 and the file mysqlData.output to
scale@us.ibm.com before proceeding. The IBM Support team will return a list of commands for
your environment for performing Step6 and Step16.
If you have carefully reviewed your commands and changes for Step6 and Step16 and confirmed that they are correct, continue with the following steps.
Step6) Update the Ambari database to switch the stack version from 4.1.SpectrumScale to 4.1.
Note: The commands with the stack_id values of 1 and 51 are derived from the output of Step
4. You must change the values according to the output of Step 4.
update ambari.clusterconfig set stack_id = '1' where stack_id = '51';
update ambari.clusters set desired_stack_id = '1' where desired_stack_id = '51';
update ambari.clusterstate set current_stack_id = '1' where current_stack_id = '51';
update ambari.servicedesiredstate set desired_stack_id = '1' where desired_stack_id = '51';
update ambari.serviceconfig set stack_id = '1' where stack_id = '51';
update ambari.servicecomponentdesiredstate set desired_stack_id = '1' where desired_stack_id = '51';
update ambari.hostcomponentdesiredstate set desired_stack_id = '1' where desired_stack_id = '51';
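To avoid hand-editing the ids into each statement, the Step6 statements can be generated from the two stack_ids recorded in Step4. A sketch; stack_update_sql is a hypothetical helper, and the table and column names are exactly those listed in Step6.

```shell
# Sketch: emit the Step6 update statements for the given new/old stack_ids,
# wrapped in BEGIN/COMMIT so the whole switch is applied atomically.
stack_update_sql() {
    local new="$1" old="$2"
    cat <<SQL
BEGIN;
update ambari.clusterconfig set stack_id = '$new' where stack_id = '$old';
update ambari.clusters set desired_stack_id = '$new' where desired_stack_id = '$old';
update ambari.clusterstate set current_stack_id = '$new' where current_stack_id = '$old';
update ambari.servicedesiredstate set desired_stack_id = '$new' where desired_stack_id = '$old';
update ambari.serviceconfig set stack_id = '$new' where stack_id = '$old';
update ambari.servicecomponentdesiredstate set desired_stack_id = '$new' where desired_stack_id = '$old';
update ambari.hostcomponentdesiredstate set desired_stack_id = '$new' where desired_stack_id = '$old';
COMMIT;
SQL
}
# live usage: stack_update_sql 1 51 | psql -U ambari -d ambari
```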
Step7) Restart the Ambari server and stop/start all the Ambari agents.
On the Ambari server node, run the following commands to stop and start the Ambari server:
ambari-server stop
ambari-server start
On all the Ambari agent nodes, run the following commands to stop and start the Ambari
agents:
ambari-agent stop
ambari-agent start
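If all the Ambari agent nodes are part of the Spectrum Scale cluster, the agent restart can be fanned out with mmdsh instead of logging in to each node. A sketch; restart_agents is a hypothetical helper, and RUN is an overridable hook that defaults to mmdsh across all Spectrum Scale nodes.

```shell
# Sketch: stop and start the Ambari agent on every node in one pass.
# RUN defaults to mmdsh (runs a command on all Spectrum Scale nodes);
# it can be overridden for testing.
RUN="${RUN:-/usr/lpp/mmfs/bin/mmdsh -N all}"
restart_agents() {
    $RUN "ambari-agent stop"
    $RUN "ambari-agent start"
}
```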
Step8) Uninstall the IBM Spectrum Scale Ambari integration package for the Hadoop connector.
On the Ambari server, uninstall the old integration package by running the following command:
rpm -e gpfs.ambari-iop_4.1*
Follow the commands in Step7 to restart the Ambari server and all the agents.
Note: The HDFS NameNode in this step will be the HDFS transparency NameNode in step 15
and it should be one of the nodes of the IBM Spectrum Scale cluster.
MapReduce2
For MapReduce2, on the Ambari dashboard, open the MapReduce2 > Configs > Advanced panel:
mapreduce.client.submit.file.replication
If the value is 0, change it to the data replica value that was written down in Step7 of section 2.1 Preparation.
HBase
For HBase, remove the following configuration entries from the HBase > Configs > Advanced > Custom hbase-site panel:
gpfs.sync.queue=true
gpfs.sync.range=true
hbase.fsutil.hdfs.impl=org.apache.hadoop.hbase.gpfs.util.FSGPFSUtils
hbase.regionserver.hlog.writer.impl=org.apache.hadoop.hbase.gpfs.regionserver.wal.PreallocatedProtobufLogWriter
hbase.regionserver.hlog.reader.impl=org.apache.hadoop.hbase.gpfs.regionserver.wal.PreallocatedProtobufLogReader
You can click the Remove button to remove each configuration entry from Custom hbase-site.
Step11) Restart all the services and run the service check for HDFS.
There is no need to run service checks for the other services.
Step13) Manually uninstall the old connector from all the nodes.
/usr/lpp/mmfs/bin/mmdsh -N all "/usr/lpp/mmfs/bin/mmhadoopctl connector stop"
/usr/lpp/mmfs/bin/mmdsh -N all "/usr/lpp/mmfs/bin/mmhadoopctl connector detach --distribution BigInsights"
/usr/lpp/mmfs/bin/mmdsh -N all "rpm -e gpfs.hadoop-connector"
Note: The IBM Spectrum Scale HDFS transparency installation will report errors if the above steps are not performed. The first command, mmhadoopctl connector stop, will report an error if the Spectrum Scale Hadoop connector was already stopped in Step1; in that case the error messages only mean that the connector is not up. You can use the mmhadoopctl connector getstate command to check the connector state and run mmhadoopctl connector stop only if the connector is still up.
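The check-then-stop flow described in the note can be sketched as follows. stop_if_running is a hypothetical helper, and matching "running" in the getstate output is an assumption about its format; GETSTATE and STOP are overridable hooks that default to the mmhadoopctl calls on a live node.

```shell
# Sketch: only issue the stop when getstate reports the connector is up,
# so no spurious error messages are produced on already-stopped nodes.
GETSTATE="${GETSTATE:-/usr/lpp/mmfs/bin/mmhadoopctl connector getstate}"
STOP="${STOP:-/usr/lpp/mmfs/bin/mmhadoopctl connector stop}"
stop_if_running() {
    if $GETSTATE | grep -qi "running"; then
        $STOP
    else
        echo "connector already down"
    fi
}
```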
Step14) Install the new GPFS Ambari integration module for HDFS Transparency on the Ambari
server node.
Download the packages from the IBM Open Platform with Apache Hadoop - 2nd generation HDFS Transparency - Download Releases section.
Download the Deploying BigInsights 4.1 IBM Spectrum Scale HDFS Transparency with Ambari 2.1 document from the IBM DeveloperWorks Spectrum Scale Wiki.
o Follow section 5.4.2 Setting up the IBM Spectrum Scale repository in the Deploying BigInsights 4.1 IBM Spectrum Scale HDFS Transparency with Ambari 2.1 document to set up the IBM Spectrum Scale HDFS transparency repository.
o Follow section 4.2.1.3 Add Spectrum Scale service to an existing Ambari IOP and an HDFS Transparency cluster - Install the GPFS integration module into Ambari in the Deploying BigInsights 4.1 IBM Spectrum Scale HDFS Transparency with Ambari 2.1 document.
Step15) Add the IBM Spectrum Scale service to Ambari and integrate the existing IOP with the
existing IBM Spectrum Scale cluster.
Follow section 4.2.1.3 Add Spectrum Scale service to an existing Ambari IOP and an HDFS Transparency cluster - Adding the IBM Spectrum Scale service to Ambari in the Deploying BigInsights 4.1 IBM Spectrum Scale HDFS Transparency with Ambari 2.1 document on the IBM DeveloperWorks Spectrum Scale Wiki.
Step16) Update the metadata database records from gpfs:// to hdfs://.
The old data ingested through the Hadoop connector uses gpfs:// as the URI scheme in the metadata database. This scheme is not supported by HDFS transparency, which uses the native HDFS scheme. Therefore, all the records in the metadata database must be changed from the gpfs:// value to the hdfs:// value.
NOTE: If this modification is not implemented, you will be unable to view the old data in Hive.
For example, if in the mysqlData.output of Step5 the records for table DBS use gpfs://c8f2n13.gpfs.net:8020 (where the correct scheme is hdfs://c8f2n13.gpfs.net:8020), then table DBS must be put in the to-be-updated table list:
table name DBS
============>
The tables DBS and SDS are two tables that must be changed. Check the other tables in your cluster to see whether they also need to be changed.
If you want to update only one specific record (e.g. only update the record whose DB_ID is 1),
use a command similar to the following:
update DBS set DB_LOCATION_URI=(REPLACE(DB_LOCATION_URI, 'gpfs://', 'hdfs://')) where
DB_ID =1;
update "DBS" set "DB_LOCATION_URI"=(REPLACE("DB_LOCATION_URI", 'gpfs://', 'hdfs://'));
If you want to update only one specific record (e.g. only update the record whose DB_ID is 1),
use a command similar to the following:
update "DBS" set "DB_LOCATION_URI"=(REPLACE("DB_LOCATION_URI", 'gpfs://', 'hdfs://')) where
"DB_ID"='1';
Step17) Start all the services from Ambari and run service checks for all the services.
Step18) Follow Step6 - Check a sample of the current data in the HBase, Hive, and BigInsights Value-Add databases in section 2.1 Preparation to sanity check the configuration by comparing the data output after the upgrade with the previously saved outputs.
3. Revision History