
Upgrading IBM Open Platform with Apache Hadoop 4.1 and Ambari 2.1
from IBM Spectrum Scale Hadoop connector
to IBM HDFS Transparency

Version 1.1

IBM Spectrum Scale BDA Team

2016-8-15
Contents
1. Background
2. Upgrade Guide
2.1 Preparation
2.2 Checklist
2.3 Update steps
3. Revision History

1. Background

IBM Spectrum Scale provides integration with the Hadoop framework through a Hadoop
connector.
IBM Spectrum Scale has released two types of Hadoop connectors:

First generation connector: Hadoop connector


This first generation connector enables Hadoop over IBM Spectrum Scale by using the Hadoop
file system APIs. This connector is in Support Mode only; no new functions will be delivered
to it.
The Hadoop connector rpm package name is gpfs.hadoop-connector-<version>.<arch>.rpm.
The Hadoop connector is integrated with IBM Spectrum Scale as an Ambari stack in IBM
BigInsights Ambari IOP 4.1 released in November, 2015.
The Ambari integration package is called gpfs.ambari-iop_4.1-<version>.noarch.rpm.

Second generation connector: HDFS transparency


The second generation IBM Spectrum Scale HDFS Transparency, also known as HDFS
Protocol, offers a set of interfaces that allows applications to use the native HDFS client to
access IBM Spectrum Scale through HDFS RPC requests. This new connector has an improved
architecture that leverages the native HDFS client for better compatibility,
performance and support for third party tools. The HDFS transparency connector is the
strategic direction for Hadoop support on Spectrum Scale.
The HDFS transparency rpm package name is gpfs.hdfs-protocol-<version>.<arch>.rpm.
HDFS transparency is integrated with IBM Spectrum Scale as an Ambari service in IBM
BigInsights Ambari IOP 4.1 released in July, 2016.
The Ambari integration package is called gpfs.hdfs-transparency.ambari-iop_4.1-
<version>.noarch.rpm.

This document describes how to upgrade from IOP 4.1 and IBM Spectrum Scale with the first
generation Hadoop connector environment to an IOP 4.1 and IBM Spectrum Scale with the
second generation HDFS transparency cluster. This manual upgrade process for moving from
the first generation connector to the new HDFS Transparency connector is a one-time process.
Future upgrades will be handled through the Ambari dashboard.

For an existing cluster that has IOP 4.1 and Ambari 2.1 and IBM Spectrum Scale and Hadoop
connector, the following packages must be deployed in your environment:

gpfs.ambari-iop_4.1-<version>.noarch.rpm
gpfs.hadoop-connector-2.7.0-<version>.<arch>.rpm

To upgrade to the second generation HDFS transparency, the following packages are required:

gpfs.hdfs-transparency.ambari-iop_4.1-0.noarch.rpm
gpfs.hdfs-protocol-<version>.x86_64.rpm

The packages above can be downloaded from the IBM DeveloperWorks - IOP with Apache
Hadoop 2nd generation HDFS Transparency webpage.

To determine the connector that the cluster is currently using, run the following commands:

To see if the first generation Hadoop connector is running, run the following command
on all the nodes where the connector is installed:
rpm -qa | grep gpfs.hadoop

To see if HDFS transparency is running, run the following command on all the nodes
where the connector is installed:
rpm -qa | grep gpfs.hdfs
If the command returns the corresponding package, then the cluster is using that connector.
To determine the Ambari integration package that the cluster is currently using, run the
following commands on the Ambari server:

To check if the first generation Hadoop connector Ambari integration package is being
used, run the following command:
rpm -qa | grep gpfs.ambari

To check if the HDFS transparency Ambari integration package is being used, run the
following command:
rpm -qa | grep gpfs.hdfs-transparency
If the command returns the corresponding package, the cluster is using that Ambari integration
package.
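If the IBM Spectrum Scale mmdsh utility is configured for your cluster, you can optionally run
the same checks on every node from a single console instead of logging in to each node. This
is a minimal sketch and assumes that mmdsh can reach all the nodes; nodes without the package
simply return no output:

/usr/lpp/mmfs/bin/mmdsh -N all "rpm -qa | grep gpfs.hadoop"
/usr/lpp/mmfs/bin/mmdsh -N all "rpm -qa | grep gpfs.hdfs"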

2. Upgrade Guide

2.1 Preparation

IMPORTANT NOTE
The current environment has information that must be captured before starting the
upgrade process to HDFS transparency.
Record all the information mentioned in Step1 through Step5 in a separate document and
keep it available throughout the upgrade.

Step1) Write your Ambari server hostname. This document will refer to it as
ambari_server_host.

Step2) Write the user name and password for the Ambari database.

By default, the user name is admin and the password is admin; these credentials were set
during the ambari-server setup phase. If the username or password was changed during the
installation, ensure that you have the current values.

Step3) Write the ZooKeeper server hostname. If you have more than one ZooKeeper server,
record only one of them. This document will refer to it as zookeeper_server_host.

Step4) Write the value of the HBase configuration key hbase.rootdir.

From the Ambari GUI, click HBase > Configs > Advanced to get the hbase.rootdir value.
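If you prefer the command line, the same value can also be read through the Ambari REST API.
This is a minimal sketch; it assumes the default admin:admin credentials and port 8080, and
<your-IOP-cluster-name> and <tag> are placeholders that you must replace (the tag is reported
by the first command under desired_configs for hbase-site):

# find the current tag of the hbase-site configuration type
curl -u admin:admin "http://ambari_server_host:8080/api/v1/clusters/<your-IOP-cluster-name>?fields=Clusters/desired_configs"

# read the hbase-site configuration, including hbase.rootdir, for that tag
curl -u admin:admin "http://ambari_server_host:8080/api/v1/clusters/<your-IOP-cluster-name>/configurations?type=hbase-site&tag=<tag>"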

Step5) Write the metadata information for Hive:

If MySQL server is used for Hive's metadata:
Write the MySQL server hostname if you have installed the Hive service. From the Ambari GUI,
click Hive > Summary to get the MySQL server value. This document will refer to it as
hive_mysql_server_host.

Write the MySQL server hostname, username, password, and database values.
From the Ambari GUI, click Hive > Configs > Advanced > Hive Metastore to get the values.
Write the following information from the Hive panel as shown in the following screenshot:

Database Host: The MySQL server for Hive.
Database Name: The database name. This document will refer to it as Hive_mySQL_db.
Database Username: This document will refer to it as Hive_MySQL_Username.
Database Password: This document will refer to it as Hive_MySQL_Password.

If PostgreSQL is used for Hive's metadata:
Write the PostgreSQL Database Host, Database Name, Database Username, and Database
Password from the Ambari GUI > Hive > Configs > Hive Metastore panel, shown in the following
screenshot:

Database Host: This document will refer to it as Hive_PostgreSQL_Server.
Database Name: The database name. This document will refer to it as Hive_PostgreSQL_db.
Database Username: This document will refer to it as Hive_PostgreSQL_Username.
Database Password: This document will refer to it as Hive_PostgreSQL_Password.

Step6) Check a sample of the current data in the HBase, Hive, and BigInsights Value-Add
databases. This is a sanity check to verify that everything is functioning correctly after the
upgrade is complete.

HBase
On any HBase node, run the following command to check the data:

# su - hbase
$ /usr/iop/4.1.0.0/hbase/bin/hbase shell
hbase(main):001:0> list
TABLE
ambarismoketest
moviedb
2 row(s) in 0.2050 seconds

hbase(main):003:0> scan 'moviedb'


ROW COLUMN+CELL
21jumpstreet-2012 column=director:, timestamp=1470840069133, value=phil lord
21jumpstreet-2012 column=genre:, timestamp=1470840069146, value=comedy
21jumpstreet-2012 column=title:, timestamp=1470840069126, value=21 jump street

Note: moviedb is an example table. Replace moviedb with the name of a table that exists in
your cluster.

Hive
On any Hive node, run the following commands to check the data:

su - hive
$ /usr/iop/4.1.0.0/hive/bin/hive
hive> show databases;
OK
bigdata
default
Time taken: 1.86 seconds, Fetched: 2 row(s)

hive> use default;
hive> show tables;
OK
hivetbl1
hivetbl10_part
Time taken: 0.077 seconds, Fetched: 2 row(s)

hive> select * from hivetbl1;

Note: default is an example database value. Replace default with the name of a database that
exists in your cluster.

BigInsights BigSQL

On the BigSQL head node, run the following commands to check the data:

su - bigsql

db2 => LIST ACTIVE DATABASES

Active Databases

Database name = BIGSQL
Applications connected currently = 3
Database path = /var/ibm/bigsql/database/bigdb/bigsql/NODE0000/SQL00001/MEMBER0000/

db2 => connect to bigsql

Database Connection Information

Database server = DB2/LINUXX8664 10.6.3


SQL authorization ID = BIGSQL
Local database alias = BIGSQL

db2 => select schemaname from syscat.schemata

SCHEMANAME
-----------------------------------------------------------------------------------------------

BIGSQL
DEFAULT
SYSFUN
SYSHADOOP
SYSIBM
SYSIBMADM
SYSIBMINTERNAL
SYSIBMTS
SYSPROC
SYSPUBLIC
SQLJ
SYSCAT
SYSSTAT
SYSTOOLS
GOSALESDW
NULLID

16 record(s) selected.

From the select schemaname from syscat.schemata command output, select a schema that is used
for your application data and use it to run the list tables command below. Save the current
table list output; it will be used later for a sanity check after the upgrade process.
This example uses the <user-app-schema1> schema.

db2 => list tables for schema <user-app-schema1>

Table/View Schema Type Creation time


------------------------------- --------------- ----- --------------------------

Step7) Write the default data replica value of your IBM Spectrum Scale file system:

# mmlsfs <your-file-system> -r
flag value description
------------------- ------------------------ -----------------------------------
-r 3 Default number of data replicas

In this example output, 3 is the default data replica value.


Step8) Stop all application jobs, including Hive and BigInsights Value-Add jobs.

Note: Step 8 and Step 9 require the Ambari server and the Hive service to be up. If you are
unable to stop all the application jobs, an alternative is to stop all Ambari services (such as
YARN, Hive, and HBase) from the Ambari GUI. After all the services are stopped, start only the
Hive service.

Step9) Perform a backup of the Ambari database.


Ensure that the Ambari server is up.
Log in to the ambari_server_host as root and run the following commands to back up the
Ambari database:
su - postgres
pg_dump ambari>ambari.backup
pg_dump ambarirca>ambarirca.backup

Note: Do not remove the backup files.

Step10) Perform a backup of the Hive metadata database.

Ensure that the Hive services are up.

If MySQL server is used as Hive's metadata database:
Log in to the Hive MySQL server node as root and run the following commands to list all the
databases in the MySQL server:

## run the following command to list all the databases in your MySQL environment:
mysql -u <Hive_MySQL_Username> -p
## input your Hive_MySQL_Password here
MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hive |
| mysql |
| performance_schema |
+--------------------+
4 rows in set (0.01 sec)

MariaDB [(none)]>

For each database listed above, run the following command from the bash console to perform
the backup:

# for the above listed databases, run the following commands to back them up
mysqldump -u hive -p <Hive_mySQL_db> > hive.backup

The planned upgrade modifies the <Hive_mySQL_db>. However, to avoid any potential issues,
perform a backup of all the databases.
For example:

mysqldump -u hive -p hive > hive.backup


mysqldump -u hive -p information_schema >information_schema.backup
mysqldump -u hive -p mysql >mysql.backup
mysqldump -u hive -p performance_schema >performance_schema.backup
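If there are many databases, the individual mysqldump commands above can be wrapped in a
loop. This is a minimal sketch and assumes that Hive_MySQL_Username has read privileges on
all of the listed databases; depending on the MySQL or MariaDB version, dumping
information_schema or performance_schema may also require the --skip-lock-tables option:

# back up every database reported by "show databases" into <database>.backup
for db in $(mysql -u <Hive_MySQL_Username> --password=<Hive_MySQL_Password> -N -e 'show databases'); do
    mysqldump -u <Hive_MySQL_Username> --password=<Hive_MySQL_Password> "$db" > "$db.backup"
done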

If PostgreSQL server is used as Hive's metadata database:

Log in to the Hive_PostgreSQL_Server as root and run the following commands to back up
Hive's metadata:

su - <Hive_PostgreSQL_Username>
pg_dump <Hive_PostgreSQL_db> > <Hive_PostgreSQL_db>.backup

Note: Input the Hive_PostgreSQL_Password when prompted. Replace the
<Hive_PostgreSQL_Username> and <Hive_PostgreSQL_db> values according to the information in
Step 5.

After performing these steps, you can perform the update to HDFS transparency.

2.2 Checklist

Review the following checklist table to ensure that all the tasks are completed before
proceeding with the upgrade.

Checklist# Description Completed?

1 Downloaded the HDFS transparency package?
2 Downloaded the new Ambari integration package for HDFS transparency?
3 Wrote down the following Ambari information:
  Ambari Server node hostname
  Ambari username
  Ambari password
4 Wrote down the following Hive metadata database (MySQL or PostgreSQL) information:
  Database server node hostname
  Database username
  Database password
5 Performed the sanity check for Hive data?
6 Performed the sanity check for BigInsights Value-Add services, such as BigSQL?
7 Performed a backup of the Ambari database?
8 Performed a backup of the Hive metadata database?

2.3 Update steps

Before proceeding, ensure that you have performed all the steps in Section 2.1 Preparation.

IMPORTANT NOTE
Review the sample commands in steps 6 and 16. If you can perform the steps, then
proceed. Otherwise, contact scale@us.ibm.com for guidance.

To upgrade to HDFS transparency, perform the following steps:

Step1) Check that all the services except the Hive service are stopped on the Ambari GUI.
Note: The Hive service must remain active for the metadata updates in the following steps.

Step2) Remove the GPFS service through the Ambari REST API. On the Ambari server, run the
following command from the Bash console as root:
curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE \
  http://localhost:8080/api/v1/clusters/<your-IOP-cluster-name>/services/GPFS

Note: Replace <your-IOP-cluster-name> in the above link with the cluster name. The cluster
name will be displayed in the top-left panel after logging in to the Ambari GUI. Replace
admin:admin with the Ambari username and password.
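After the DELETE call returns, you can optionally confirm that the GPFS service is gone by
querying the same REST endpoint. An HTTP 404 (service not found) response indicates that the
removal succeeded. This sketch uses the same placeholder credentials and cluster name:

curl -u admin:admin -H "X-Requested-By: ambari" \
  http://localhost:8080/api/v1/clusters/<your-IOP-cluster-name>/services/GPFS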

In the following screenshot, iop420 is the cluster name:

Refresh the Ambari GUI and check that the Spectrum Scale menu from the left panel is
removed.

Step3) Log in to the Ambari postgres database console.

Log in to the Ambari server node as root:


# su - postgres
# psql
postgres=# \connect ambari

Step4) Check the stack version listed on the postgres console (see Step3):
ambari=# select * from ambari.stack ;
stack_id | stack_name | stack_version
----------+-------------+-----------------------
1 | BigInsights | 4.1
2 | BigInsights | 4.0
51 | BigInsights | 4.1.SpectrumScale
(3 rows)

Write the stack_id values corresponding to the stack_version column for 4.1 and
4.1.SpectrumScale.
For the above output, the numbers 1 and 51 are the stack_ids for the corresponding stack
versions 4.1 and 4.1.SpectrumScale. In later steps, we need to change database records from
the stack version "4.1.SpectrumScale" to 4.1. The 4.1.SpectrumScale stack_version is the
Ambari GPFS integration package version for the older Hadoop connector. For the new HDFS
transparency connector, a different Ambari stack is not required because it integrates as a
service in the default stack.

Step5) Dump all the Hive metadata records.

If MySQL server is used as Hive's metadata database:
Log in to the Hive MySQL server node and dump all the records from the MySQL database.
# first create mysql_migrate.sh

# vim mysql_migrate.sh
# cat mysql_migrate.sh
#!/bin/bash

database="$1"
username="$2"
password="$3"

if [[ "$database" == "" || "$password" == "" ]];


then
echo "$0 <database-name><username><password>"
exit
fi
echo "Begin to query all tables under database $1..."
index=1
for table in BUCKETING_COLS CDS COLUMNS_V2 COMPACTION_QUEUE COMPLETED_TXN_COMPONENTS
DATABASE_PARAMS DBS DB_PRIVS DELEGATION_TOKENS FUNCS FUNC_RU GLOBAL_PRIVS HIVE_LOCKS
IDXS INDEX_PARAMS MASTER_KEYS NEXT_COMPACTION_QUEUE_ID NEXT_LOCK_ID NEXT_TXN_ID
NOTIFICATION_LOG NOTIFICATION_SEQUENCE NUCLEUS_TABLES PARTITIONS PARTITION_EVENTS
PARTITION_KEYS PARTITION_KEY_VALS PARTITION_PARAMS PART_COL_PRIVS PART_COL_STATS
PART_PRIVS ROLES ROLE_MAP SDS SD_PARAMS SEQUENCE_TABLE SERDES SERDE_PARAMS
SKEWED_COL_NAMES SKEWED_COL_VALUE_LOC_MAP SKEWED_STRING_LIST
SKEWED_STRING_LIST_VALUES SKEWED_VALUES SORT_COLS TABLE_PARAMS TAB_COL_STATS TBLS
TBL_COL_PRIVS TBL_PRIVS TXNS TXN_COMPONENTS TYPES TYPE_FIELDS VERSION
do
echo "${index} table name $table"
echo "============>"
echo "use ${database};" > /tmp/iop41_mig.sql
echo "select * from ${table};" >> /tmp/iop41_mig.sql
mysql -u ${username} --password=${password} < /tmp/iop41_mig.sql
echo "<============"
echo
((index++))
done

# chmod a+rx mysql_migrate.sh

# run the script to dump the records
# ./mysql_migrate.sh <Hive_mySQL_db> <Hive_MySQL_Username> <Hive_MySQL_Password> > mysqlData.output

The <Hive_mySQL_db>, <Hive_MySQL_Username> and <Hive_MySQL_Password> values are
derived from Step 5 in section 2.1 Preparation.

If PostgreSQL server is used as Hive's metadata database:

Log in to the Hive_PostgreSQL_Server node and dump all the records from the PostgreSQL
database.
# first create postgresql_migrate.sh
# vim postgresql_migrate.sh
# cat postgresql_migrate.sh
#!/bin/bash

database="$1"
username="$2"

if [[ "$database" == "" ]];


then
echo "$0 <database-name> <username>"
exit
fi
echo "Begin to query all tables under database $1..."
echo > /tmp/iop41_mig.sql
echo "\c ${database};" > /tmp/iop41_mig.sql
index=1
for table in BUCKETING_COLS CDS COLUMNS_V2 compaction_queue completed_txn_components
DATABASE_PARAMS DBS DB_PRIVS DELEGATION_TOKENS FUNCS FUNC_RU GLOBAL_PRIVS hive_locks
IDXS INDEX_PARAMS MASTER_KEYS next_compaction_queue_id next_lock_id next_txn_id
NOTIFICATION_LOG NOTIFICATION_SEQUENCE NUCLEUS_TABLES PARTITIONS PARTITION_EVENTS
PARTITION_KEYS PARTITION_KEY_VALS PARTITION_PARAMS PART_COL_PRIVS PART_COL_STATS
PART_PRIVS ROLES ROLE_MAP SDS SD_PARAMS SEQUENCE_TABLE SERDES SERDE_PARAMS
SKEWED_COL_NAMES SKEWED_COL_VALUE_LOC_MAP SKEWED_STRING_LIST
SKEWED_STRING_LIST_VALUES SKEWED_VALUES SORT_COLS TABLE_PARAMS TAB_COL_STATS TBLS
TBL_COL_PRIVS TBL_PRIVS txns txn_components TYPES TYPE_FIELDS VERSION
do
echo "select * from \"${table}\";" >> /tmp/iop41_mig.sql
((index++))
done

echo "\q" >> /tmp/iop41_mig.sql


psql -U ${username} < /tmp/iop41_mig.sql

# chmod a+rx postgresql_migrate.sh

# run the script to dump the records
# ./postgresql_migrate.sh <Hive_PostgreSQL_db> <Hive_PostgreSQL_Username> > mysqlData.output

The <Hive_PostgreSQL_db>, <Hive_PostgreSQL_Username> and <Hive_PostgreSQL_Password>
values are derived from Step 5 in section 2.1 Preparation.

NOTE: To avoid database crashes that might occur because of using the wrong stack_id entries
for Step6 and Step16, you can send the output from Step4 and the file mysqlData.output to
scale@us.ibm.com before proceeding. The IBM Support team will return a list of commands for
your environment for performing Step6 and Step16.
If you have carefully reviewed your commands and changes for Step6 and Step16 and confirmed
that they are correct, continue with the following steps.

Step6) Update the Ambari database to switch the stack version from 4.1.SpectrumScale to 4.1.
Note: The commands with the stack_id values of 1 and 51 are derived from the output of Step
4. You must change the values according to the output of Step 4.
update ambari.clusterconfig set stack_id = '1' where stack_id = '51';
update ambari.clusters set desired_stack_id = '1' where desired_stack_id = '51';
update ambari.clusterstate set current_stack_id = '1' where current_stack_id = '51';
update ambari.servicedesiredstate set desired_stack_id = '1' where desired_stack_id = '51';
update ambari.serviceconfig set stack_id = '1' where stack_id = '51';
update ambari.servicecomponentdesiredstate set desired_stack_id = '1' where desired_stack_id =
'51';
update ambari.hostcomponentdesiredstate set desired_stack_id = '1' where desired_stack_id = '51';
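Optionally, before and after running the update statements, you can count the rows that still
reference the old stack_id to confirm that every record was switched. This is a sketch that
uses the example stack_id value 51 from Step 4; substitute your own value. After the updates,
each count should be 0:

select count(*) from ambari.clusterconfig where stack_id = '51';
select count(*) from ambari.clusters where desired_stack_id = '51';
select count(*) from ambari.serviceconfig where stack_id = '51';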

Step7) Restart the Ambari server and stop/start all the Ambari agents.
On the Ambari server node, run the following commands to stop and start the Ambari server:
ambari-server stop
ambari-server start
On all the Ambari agent nodes, run the following commands to stop and start the Ambari
agents:
ambari-agent stop
ambari-agent start
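If the mmdsh utility is configured and every Ambari agent node is part of the IBM Spectrum
Scale cluster, the agent restart can optionally be run on all nodes from one console; this is
a sketch rather than a required step:

/usr/lpp/mmfs/bin/mmdsh -N all "ambari-agent stop; ambari-agent start"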

Log in to the Ambari GUI.


NOTE: If you cannot log in to the Ambari GUI, contact scale@us.ibm.com immediately.

Step8) Uninstall the IBM Spectrum Scale Ambari integration package for the Hadoop connector.

On the Ambari server, uninstall the old integration package by running the following command:
rpm -e gpfs.ambari-iop_4.1*
Follow the commands in Step7 to restart the Ambari server and all the agents.

Step9) Add the native HDFS service into Ambari.


Follow the Ambari wizard from the Ambari dashboard. Click Actions > Add Service.

Note: The HDFS NameNode in this step will be the HDFS transparency NameNode in step 15
and it should be one of the nodes of the IBM Spectrum Scale cluster.

Step10) Check the configuration for some of the services in Ambari.


HDFS
Check the fs.defaultFS configuration under Ambari dashboard > HDFS > Configs > Advanced
core-site and ensure that it is hdfs://<hdfs-namenode-hostname>:8020. The <hdfs-namenode-
hostname> must be the HDFS NameNode from Step 9.
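As an optional command-line cross-check, the effective value can also be read with the hdfs
getconf tool on any node where the HDFS client configuration is deployed. This sketch assumes
the IOP 4.1 installation path used earlier in this document:

/usr/iop/4.1.0.0/hadoop/bin/hdfs getconf -confKey fs.defaultFS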

MapReduce2
For MapReduce2, on the Ambari dashboard, click MapReduce2 > Configs > Advanced and check
the following configuration:
mapreduce.client.submit.file.replication

If the value is 0, change it to the data replica value that was written down in Step7 of
section 2.1 Preparation.

HBase
For HBase, remove the following configuration entries from the
HBase > Configs > Advanced > Custom hbase-site panel:
gpfs.sync.queue=true
gpfs.sync.range=true
hbase.fsutil.hdfs.impl=org.apache.hadoop.hbase.gpfs.util.FSGPFSUtils
hbase.regionserver.hlog.writer.impl=
org.apache.hadoop.hbase.gpfs.regionserver.wal.PreallocatedProtobufLogWriter
hbase.regionserver.hlog.reader.impl=
org.apache.hadoop.hbase.gpfs.regionserver.wal.PreallocatedProtobufLogReader

You can click the Remove button to remove the configuration from Custom hbase-site:

Check the hbase.rootdir field under HBase > Configs > Advanced > Advanced hbase-site. Ensure
that the hostname specified in the field is the HDFS NameNode hostname, for example
hdfs://c16f1n06.gpfs.net:8020/apps/hbase/data. If the value in this field does not reflect
the correct NameNode (c16f1n06.gpfs.net in this example), modify it accordingly.

Step11) Restart all the services and run the service check for HDFS.
There is no need to run service checks for the other services.

Step12) Stop all services on the Ambari GUI.

Step13) Manually uninstall the old connector from all the nodes.
/usr/lpp/mmfs/bin/mmdsh -N all " /usr/lpp/mmfs/bin/mmhadoopctl connector stop"
/usr/lpp/mmfs/bin/mmdsh -N all " /usr/lpp/mmfs/bin/mmhadoopctl connector detach --distribution
BigInsights"
/usr/lpp/mmfs/bin/mmdsh -N all "rpm -e gpfs.hadoop-connector"

Note: IBM Spectrum Scale will give installation errors if the above steps were not performed.
The first command, mmhadoopctl connector stop, will report an error if the Spectrum Scale
Hadoop connector was already stopped in Step 1. The error messages in this case simply mean
that the Spectrum Scale Hadoop connector is not running. You can use the mmhadoopctl
connector getstate command to check the connector state and run mmhadoopctl connector stop
only if the connector is still up, as shown in the sketch below.
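The following sketch uses mmdsh to check the connector state on every node before stopping it:

/usr/lpp/mmfs/bin/mmdsh -N all "/usr/lpp/mmfs/bin/mmhadoopctl connector getstate"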

Step14) Install the new GPFS Ambari integration module for HDFS Transparency on the Ambari
server node.

Download the GPFS Ambari integration module (gpfs.hdfs-transparency.ambari-iop_4.1-
<version>.noarch.bin) from the IBM DeveloperWorks Spectrum Scale Wiki - IBM Open
Platform with Apache Hadoop - 2nd generation HDFS Transparency - Download Releases
section.
Download the Deploying BigInsights 4.1 IBM Spectrum Scale HDFS Transparency with
Ambari 2.1 document from the IBM DeveloperWorks Spectrum Scale Wiki.
o Follow section 5.4.2 Setting up the IBM Spectrum Scale repository in the Deploying
BigInsights 4.1 IBM Spectrum Scale HDFS Transparency with Ambari 2.1 document to
set up the IBM Spectrum Scale HDFS transparency repository.
o Follow section 4.2.1.3 Add Spectrum Scale service to an existing Ambari IOP and an
HDFS Transparency cluster - Install the GPFS integration module into Ambari in the
Deploying BigInsights 4.1 IBM Spectrum Scale HDFS Transparency with Ambari 2.1
document.

Step15) Add the IBM Spectrum Scale service to Ambari and integrate the existing IOP with the
existing IBM Spectrum Scale cluster.
Follow section 4.2.1.3 Add Spectrum Scale service to an existing Ambari IOP and an HDFS
Transparency cluster- Adding the IBM Spectrum Scale service to Ambari in the Deploying
BigInsights 4.1 IBM Spectrum Scale HDFS Transparency with Ambari 2.1 document on the IBM
DeveloperWorks Spectrum Scale Wiki.

Step16) Update the data in the Hive metadata server.

The old data ingested through the Hadoop connector uses the gpfs:// scheme in the metadata
database. This scheme is not supported by HDFS transparency, which uses the native HDFS
scheme, so the correct scheme is hdfs://. Therefore, all the records in the metadata database
must be modified from the gpfs:// value to the hdfs:// value.
NOTE: If this modification is not made, you will be unable to view the old data in Hive.

Assuming that your HDFS transparency NameNode from Step15 is HDFS-Transparency-host, the
correct value after the upgrade is hdfs://HDFS-Transparency-host:8020. Check the
mysqlData.output file from Step5 and determine the to-be-updated table list, that is, the
tables that have records with incorrect gpfs:// or hdfs:// values. For example, the following
values are all invalid: gpfs:///, gpfs://HDFS-Transparency-host:8020,
gpfs://not-HDFS-Transparency-host:8020 and hdfs://not-HDFS-Transparency-host:8020. If a table
has records that use any of the above four invalid formats, it must be put in the
to-be-updated table list.

For example, in the mysqlData.output of Step5, for table DBS, the records are using
gpfs://c8f2n13.gpfs.net:8020 (assuming that the correct schema is
hdfs://c8f2n13.gpfs.net:8020), then the table DBS must be put in the to-be-updated table list:
table name DBS

============>

DB_ID DESC DB_LOCATION_URI NAME OWNER_NAME OWNER_TYPE

1 Default Hive database gpfs://c8f2n13.gpfs.net:8020/apps/hive/warehouse default public ROLE

6 Hive test database gpfs://c8f2n13.gpfs.net:8020/apps/hive/warehouse/bigdata.db bigdata hive USER

11 NULL gpfs://c8f2n13.gpfs.net:8020/apps/hive/warehouse/bigsql.db bigsql bigsql USER

16 NULL gpfs://c8f2n13.gpfs.net:8020/apps/hive/warehouse/gosalesdw.db gosalesdw bigsql USER

The tables DBS and SDS are two tables that must be changed. Check the other tables in your
cluster to see whether they also need to be changed; the grep sketch below can help.
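A quick way to build the to-be-updated table list is to search the dump file from Step5 for
the invalid values. This is a sketch and assumes that mysqlData.output is in the current
directory and that HDFS-Transparency-host is your HDFS transparency NameNode hostname:

# records that still use the gpfs:// scheme
grep -n 'gpfs://' mysqlData.output

# records that use the hdfs:// scheme but point at the wrong host
grep -n 'hdfs://' mysqlData.output | grep -v 'hdfs://HDFS-Transparency-host:8020'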

If using MySQL for Hive's metadata:

For each table in the to-be-updated table list, issue commands similar to the following to
correct the values for all records:
update DBS set DB_LOCATION_URI=(REPLACE(DB_LOCATION_URI, 'gpfs://', 'hdfs://'));

If you want to update only one specific record (e.g. only update the record whose DB_ID is 1),
use a command similar to the following:
update DBS set DB_LOCATION_URI=(REPLACE(DB_LOCATION_URI, 'gpfs://', 'hdfs://')) where
DB_ID =1;

If the value in the table DBS is gpfs://not-HDFS-Transparency-host:<port-number>, use a
command similar to the following:

update DBS set DB_LOCATION_URI=(REPLACE(DB_LOCATION_URI,
'gpfs://not-HDFS-Transparency-host:<port-number>', 'hdfs://HDFS-Transparency-host:8020'));
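After running the updates, you can optionally confirm that no gpfs:// or wrong-host values
remain by re-querying the updated tables; this is a sketch, and the SDS column names are
assumed from the standard Hive metastore schema:

select DB_ID, DB_LOCATION_URI from DBS;
select SD_ID, LOCATION from SDS;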

If using PostgreSQL for Hive's metadata:

For each table in the to-be-updated table list, issue commands similar to the following to
correct the values for all the records:

update "DBS" set "DB_LOCATION_URI"=(REPLACE("DB_LOCATION_URI", 'gpfs://', 'hdfs://'));

If you want to update only one specific record (e.g. only update the record whose DB_ID is 1),
use a command similar to the following:
update "DBS" set "DB_LOCATION_URI"=(REPLACE("DB_LOCATION_URI", 'gpfs://', 'hdfs://')) where
"DB_ID"='1';

If the value in the table DBS uses an incorrect hostname, for example hdfs://localhost:8020
instead of hdfs://HDFS-Transparency-host:8020, use a command similar to the following:

update "DBS" set "DB_LOCATION_URI"=(REPLACE("DB_LOCATION_URI", 'hdfs://localhost:8020',
'hdfs://c16f1n06.gpfs.net:8020'));
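The same optional verification can be run on PostgreSQL; note that the table and column names
must be quoted, and the SDS column names are assumed from the standard Hive metastore schema:

select "DB_ID", "DB_LOCATION_URI" from "DBS";
select "SD_ID", "LOCATION" from "SDS";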

Step17) Start all the services from Ambari and run service checks for all the services.

Step18) Follow Step 6 (Check a sample of the current data in the HBase, Hive, and BigInsights
Value-Add databases) in section 2.1 Preparation to sanity check the configuration by comparing
the data output after the upgrade with the previously saved outputs.

3. Revision History

Version  Change Date  Owner  Change Logs
0.1      2016-8-5     Yong   Yong initialized the draft
0.2      2016-8-16    Yong   Merged some comments from Wen Qi
0.3      2016-8-17    Yong   Merged comments from Linda
0.4      2016-8-17    Yong   Merged comments from PC
0.5      2016-8-23    Yong   Merged comments from Linda and PC
0.6      2016-8-23    Yong   Merged comments from ID team member Lata
0.7      2016-8-24    Yong   Merged comments from PC
0.8      2016-8-26    Yong   Merged comments from customer
0.9      2016-8-30    Yong   Updated the guide with PostgreSQL for Hive's metadata
1.0      2016-8-31    Yong   Merged comments from Linda
1.1      2016-9-1     Yong   Merged comments from PC

