
Huawei Certification Big Data Training Courses

HCIA - Big Data V2.0


Lab Guide for Big Data Engineers
ISSUE: 2.0

HUAWEI TECHNOLOGIES CO., LTD.


Copyright © Huawei Technologies Co., Ltd. 2018. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any means without prior written
consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

HUAWEI and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice
The purchased products, services and features are stipulated by the contract made between Huawei and the
customer. All or part of the products, services and features described in this document may not be within the
purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and
recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any
kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made in the
preparation of this document to ensure accuracy of the contents, but all statements, information, and
recommendations in this document do not constitute a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.


Address: Huawei Industrial Base
Bantian, Longgang
Shenzhen 518129
People's Republic of China

Website: http://e.huawei.com

About This Document

Overview
This guide instructs trainees to perform all the experiment tasks required by the HCIA-Big Data
course on the Huawei FusionInsight HD Big Data platform. It aims to help trainees master the use
of the Big Data components of the FusionInsight HD platform.

Content Description
This document contains eight experiments: FusionInsight client installation, HBase database practice,
HDFS file system practice, Loader data import and export practice, Flume data collection practice,
Kafka message subscription practice, Hive data warehouse practice, and a comprehensive cluster
experiment.

Precautions
During an experiment, trainees must not delete files arbitrarily.
When naming a directory, topic, or file, a trainee must include the trainee's account stuXX or userXX,
for example, stu06_data and user01_socker.
The trainer manages and allocates all the user names and passwords for logging in to the
environment. If you have any questions about a user name or password, please ask the trainer.

References
FusionInsight HD product documentation

Experiment Environment
Table 1-1 Experimental Hardware and Software

1.1 2288H V5

1.1.1 Basic Configuration
02312BTK  H22H-05-S26AFC  25 x 2.5-inch hard disk chassis, onboard 2*GE + 2*10GE optical ports (excluding optical modules), 2 x 1500W AC power supplies  Quantity: 1

1.1.2 SKYLAKE CPU
02311XFF  BC4M04CPU  Intel Xeon Platinum 8176 (2.1 GHz/28-core/38.5 MB/165W) processor (with heat sink)  Quantity: 2

1.1.3 Memory
06200241  N26DDR402  DDR4 RDIMM memory, 32 GB, 2666 MT/s, 2Rank (2G*4bit), 1.2 V, ECC  Quantity: 12

1.1.4 Hard Disk (with 2.5" handle bar), SAS
02311HAP  N600S1210W2  Hard disk, 600 GB, SAS 12 Gb/s, 10K rpm, 128 MB, 2.5 inch (2.5-inch bracket)  Quantity: 2
02311FMR  N1800S10W2  Hard disk, 1800 GB, SAS 12 Gb/s, 10K rpm, 128 MB, 2.5 inch (2.5-inch bracket)  Quantity: 4

1.1.5 Hard Disk (with 2.5" handle bar), SSD
02312FRL  ES3600S800GW2  ES3600S V5 solid state disk, 800 GB, SAS 12 Gb/s, read/write mixed, 2.5 inch (2.5-inch tray)  Quantity: 2

1.1.6 RAID controller card
02311SMF  BC1M05ESMLB  SR530C-M 1G (LSI3108) SAS/SATA RAID card, RAID 0/1/5/6/10/50/60, 1 GB cache, supports supercapacitor and out-of-band management  Quantity: 1
02311YPU  BC1M08TFM  LSI3108 1 GB cache RAID card supercapacitor (4 GB, including cables and mechanical parts), applicable to rack servers/X6800  Quantity: 1

1.1.7 Riser card
02311TWR  BC1M31RISE  3 x x8 (x16 slot) RISER1 module  Quantity: 1

1.1.8 PCIe card, NIC
02311EUX  CN2ITGAA20  Ethernet adapter, 10 Gb optical ports (Intel 82599), dual-port, SFP+ (including two multi-mode optical modules), PCIe 2.0 x8  Quantity: 1

1.1.9 Cables and optical modules
02318169  OMXD30000  Optical module, SFP+, 10G, multi-mode (850 nm, 0.3 km, LC)  Quantity: 2

1.1.10 Guide rail and cable tray
21240434  EGUIDER01  2U static slide rail kit  Quantity: 1

1.1.11 Operating system
05200723  GOSSLES33  SLES for SAP Applications, English version, Enterprise Edition, 12.x, 2 sockets or 2 VMs, x86 64-bit, physical goods (paper), no documentation, three-year 7*24 service (operating system manufacturer service), Greater China region  Quantity: 1

To download the FusionCompute software, visit the following website:


http://support.huawei.com/enterprise/en/cloud-computing/fusioncompute-pid-
8576912/software
To download the FusionInsight C70 software, visit the following website:
https://support.huawei.com/enterprise/en/cloud-computing/fusioninsight-hd-pid-
21110924/software/23949194?idAbsPath=fixnode01%7C7919749%7C7941815%7C19942
925%7C250430185%7C21110924

Other hardware

Switch: The minimum configuration is 1Gb Ethernet switches. It is recommended that all switches be
10Gb Ethernet switches.

OS partition requirements for each VM:

Target node: Management/Control/Data node

Partition Directory    Partition Size
/                      10G
/tmp                   10G
/var                   10G
/var/log               ≥200G
/srv/BigData           ≥60G
/opt                   ≥300G

VM port group configuration and interconnection switch configuration:

VM Name    Network Port Name    Port Group Name    VLAN ID
VM01       eth0                 PortgroupX         vlanX
VM01       eth1                 PortgroupY         vlanY
VM02       eth0                 PortgroupX         vlanX
VM02       eth1                 PortgroupY         vlanY
VM03       eth0                 PortgroupX         vlanX
VM03       eth1                 PortgroupY         vlanY

Physical switch: A trunk interface is used to connect the physical switch to the hypervisor of the
virtualization platform, so that the VLANs of the internal port groups of the virtual switch can be
separated.

Experiment Topology
Three server nodes are used.

Figure 1-2 Non-redundant cluster topology



Trainee Accounts and Software Access


Each trainee is assigned two accounts. The FusionInsight HD cluster account starts with stu and is
used to log in to the FusionInsight Manager management interface, for communication between big
data components, and for access to the big data components. The account starting with user is the
OS account of the cluster nodes. It is used to log in to the operating system of a cluster node and
perform the big data component experiment operations.
The cluster client software and files used during the experiments are saved in the
/FusionInsight-Client directory of each cluster node. Trainees can obtain the software and files from
this directory.
The SSH and file upload tools used during the experiments are saved in the "07 other tool"
directory under ftp://10.175.199.8/. The FTP user name and password are admin1 and admin1,
respectively. Trainees can obtain the tools by themselves.

Contents

About This Document ................................................................................................................. 3


Overview ............................................................................................................................................................................. 3
Content Description ............................................................................................................................................................ 3
Precautions ......................................................................................................................................................................... 3
References .......................................................................................................................................................................... 3
Experiment Environment .................................................................................................................................................... 3
Experiment Topology .......................................................................................................................................................... 6
Trainee Accounts and Software Access ............................................................................................................................... 7
1 FusionInsight HD Client Installation........................................................................................ 10
1.1 Background ................................................................................................................................................................. 10
1.2 Objective ..................................................................................................................................................................... 10
1.3 Experiment Tasks ........................................................................................................................................................ 10
1.3.1 Installing a Client ...................................................................................................................................................... 10
1.4 Summary ..................................................................................................................................................................... 12
2 HDFS File System Practice ...................................................................................................... 13
2.1 Background ................................................................................................................................................................. 13
2.2 Objectives ................................................................................................................................................................... 13
2.3 Experiment Tasks ........................................................................................................................................................ 13
2.3.1 Common HDFS Operations ...................................................................................................................................... 13
2.3.2 HDFS Management Operations ............................................................................................................................... 20
2.4 Summary ..................................................................................................................................................................... 30
3 HBase Database Practice........................................................................................................ 31
3.1 Background ................................................................................................................................................................. 31
3.2 Objective ..................................................................................................................................................................... 31
3.3 Experiment Tasks ........................................................................................................................................................ 31
3.3.1 Common HBase Operations ..................................................................................................................................... 31
3.3.2 Using Filter ............................................................................................................................................................... 37
3.3.3 Creating a Table with Pre-Distributed Regions......................................................................................................... 38
3.3.4 HBase Load Balancing .............................................................................................................................................. 43
3.4 Summary ..................................................................................................................................................................... 45
4 Hive Data Warehouse Practice ............................................................................................... 46
4.1 Background ................................................................................................................................................................. 46
4.2 Objectives ................................................................................................................................................................... 46
4.3 Experiment Tasks ........................................................................................................................................................ 46
4.3.1 Common Functions of Hive ...................................................................................................................................... 46

4.3.2 Creating a Table........................................................................................................................................................ 49


4.3.3 Querying .................................................................................................................................................................. 53
4.3.4 Hive Join Operations ................................................................................................................................................ 57
4.3.5 Hive on Spark Operation .......................................................................................................................................... 60
4.3.6 Associating a Hive Table with an HBase Table .......................................................................................................... 61
4.3.7 Merging Small Hive Files .......................................................................................................................................... 62
4.3.8 Hive Column Encryption .......................................................................................................................................... 63
4.3.9 Using Hue to Execute HQL ....................................................................................................................................... 64
4.4 Summary ..................................................................................................................................................................... 67
5 Data Import and Export Using Loader .................................................................................... 68
5.1 Background ................................................................................................................................................................. 68
5.2 Objective ..................................................................................................................................................................... 68
5.3 Experiment Tasks ........................................................................................................................................................ 68
5.3.1 Importing HBase Data to HDFS ................................................................................................................................ 68
5.3.2 Loading HDFS Data to HBase.................................................................................................................................... 75
5.3.3 Importing HDFS Data to MySQL ............................................................................................................................... 81
5.3.4 Importing MySQL Data to HDFS ............................................................................................................................... 88
5.3.5 Importing MySQL Data to HBase ............................................................................................................................. 92
5.3.6 Importing HBase Data to MySQL ............................................................................................................................. 96
5.3.7 Importing MySQL Data to Hive .............................................................................................................................. 100
5.4 Summary ................................................................................................................................................................... 104
6 Flume Data Collection Practice............................................................................................. 105
6.1 Background ............................................................................................................................................................... 105
6.2 Objective ................................................................................................................................................................... 105
6.3 Experiment Tasks ...................................................................................................................................................... 105
6.3.1 Collecting spooldir Data to the HDFS ..................................................................................................................... 105
6.3.2 Collecting avro Data to the HDFS ........................................................................................................................... 112
6.4 Summary ................................................................................................................................................................... 115
7 Comprehensive Cluster Experiment ..................................................................................... 116
7.1 Background ............................................................................................................................................................... 116
7.2 Objective ................................................................................................................................................................... 116
7.3 Experiment Tasks ...................................................................................................................................................... 116
7.3.1 Offline Data Collection and Analysis and Real-Time Query Involving MySQL, Loader, Hive, and HBase ............... 116
7.4 Summary ................................................................................................................................................................... 129
8 Appendix ............................................................................................................................. 130
8.1 Common Linux Commands ....................................................................................................................................... 130
8.2 Other HDFS Commands ............................................................................................................................................ 130
8.3 Methods of Creating a new Flume Job ..................................................................................................................... 131

1 FusionInsight HD Client Installation

1.1 Background
The FusionInsight HD client is the interface for the communication between users and the cluster as
well as the foundation of subsequent experiments. After a client is installed, it requires security
authentication to communicate with the cluster if the cluster is deployed in secure mode.

1.2 Objective
⚫ To understand how to download and install a client.

1.3 Experiment Tasks

1.3.1 Installing a Client


Step 1 Log in to a cluster node.
Use PuTTY and the trainee account (such as userXX) to log in to a cluster node, for example,
192.168.224.45. (The IP address of a specific node must be assigned by the trainer.)

Copy the FusionInsight HD client to the home directory of userXX (for example, user01). The client
files are saved in the /FusionInsight-Client directory of each cluster node.

> cd /FusionInsight-Client
> cp FusionInsight_Cluster_1_Services_ClientConfig.tar /home/userXX

Decompress the client software.


> cd /home/userXX
> tar -xvf FusionInsight_Cluster_1_Services_ClientConfig.tar

Step 2 Install the client.


Go to the FusionInsight_Cluster_1_Services_ClientConfig directory and run the installation
command to install the software in the /home/userXX/hadoopclient directory of the current user.

> cd /home/userXX/FusionInsight_Cluster_1_Services_ClientConfig/
>./install.sh /home/userXX/hadoopclient

If the message "Components client installation is complete" is displayed, the installation is complete.


Note: After the installation, delete FusionInsight_Cluster_1_Services_ClientConfig.tar.

Step 3 Configure environment variables and perform the authentication.


Go to /home/userXX/hadoopclient and run the following commands to set environment variables:

> cd /home/userXX/hadoopclient
> source bigdata_env
> kinit stuXX

Password for stuXX@HADOOP.COM:



Note: The initial password is Huawei@123 (or consult the trainer). If the system prompts you to
change the password during the first authentication, change the password to Huawei12#$.
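If the Kerberos client tools are available in the environment, an optional check (not part of the original steps) is to list the cached tickets and confirm that authentication succeeded:

> klist

The output should contain a ticket for stuXX@HADOOP.COM.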

Step 4 Test the client.


Run the hdfs command to test the client.

> hdfs dfs -ls /


drwxr-x---+ - flume hadoop 0 2017-07-15 00:39 /flume
drwx------+ - hbase supergroup 0 2018-03-31 10:28 /hbase
drwxrwxr-x+ - admin supergroup 0 2018-01-28 15:52 /mapreduceInput
drwxrwxrwx+ - mapred hadoop 0 2017-07-15 00:39 /mr-history

If the test is successful, it indicates that the client is installed successfully.

----End

1.4 Summary
This experiment demonstrates how to install a FusionInsight HD client. During the installation, you
need to decompress the client software twice. Note that the directory where the client is installed
must not contain any files or folders; otherwise, the installation fails.

2 HDFS File System Practice

2.1 Background
HDFS is a distributed file system on the Hadoop Big Data platform and provides data storage for
upper-layer applications and other Big Data components, such as Hive, MapReduce, Spark, and HBase.
On the HDFS shell client, you can operate and manage the distributed file system. Using HDFS helps
us better understand and master Big Data.

2.2 Objectives
⚫ To have a good command of common HDFS operations
⚫ To master HDFS file system management operations

2.3 Experiment Tasks

2.3.1 Common HDFS Operations


2.3.1.1 Commands
First, run the following commands:

> cd /home/userXX/hadoopclient
> source bigdata_env
> kinit stuXX

⚫ -help: Displays the usage instructions of a command.

> hdfs dfs -help


Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] <path> ...]

[-cp [-f] [-p | -p[topax]] <src> ... <dst>]


[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]

⚫ -ls: Displays the directory information.

> hdfs dfs -ls /


-rw-r--r--+ 3 wkj supergroup 13 2018-04-02 16:42 /HDFS
drwxrwxr-x+ - hive supergroup 0 2017-07-15 00:43 /apps
drwxr-xr-x+ - admin supergroup 0 2018-03-13 19:44 /bigdata
drwxr-x---+ - flume hadoop 0 2017-07-15 00:39 /flume
drwx------+ - hbase supergroup 0 2018-03-31 10:28 /hbase
drwxrwxr-x+ - admin supergroup 0 2018-01-28 15:52 /mapreduceInput
drwxrwxrwx+ - mapred Hadoop 0 2017-07-15 00:39 /mr-history

⚫ -mkdir: Creates a directory in the HDFS.

> hdfs dfs -mkdir /user/app_stuXX

> hdfs dfs -ls /user


drwxr-xr-x+ - wkj supergroup 0 2018-04-02 17:20 /0402
drwxr-xr-x+ - wkj supergroup 0 2018-04-02 16:57 /0810
-rw-r--r--+ 3 wkj supergroup 13 2018-04-02 16:42 /HDFS
drwxr-xr-x+ - user01 supergroup 0 2018-04-04 15:04 /app_stu01

⚫ -put: Uploads a local file to the specified directory in the HDFS.

> hdfs dfs -put /FusionInsight-Labs/test01.txt /user/app_stuXX


> hdfs dfs -ls -h /user/app_stuXX

-h: Format file sizes in a human-readable fashion (e.g., 64.0m instead of 67108864).

-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-13 16:29 /user/app_stu01/test01.txt

⚫ -get: Downloads a file from HDFS to the local host; it is equivalent to copyToLocal.

Copy /user/app_stuXX/test01.txt to the local host:

> hdfs dfs -get /user/app_stuXX/test01.txt /home/userXX


> cd /home/userXX
> ll

total 2881728
drwxr-xr-x 15 user01 hadoop 4096 Apr 4 10:58 1001_hadoopclient
-rw-r--r-- 1 user01 hadoop 63 Apr 4 16:30 appendtext.txt
drwxr-xr-x 2 user01 hadoop 4096 Apr 4 10:03 bin
-rw-r--r-- 1 user01 hadoop 0 Apr 4 15:28 hdfs
-rwxr-xr-x 1 user01 hadoop 2947983360 Apr 4 10:05 Service_Client.tar
-rw-r--r-- 1 user01 hadoop 38 Apr 4 16:27 stu01.txt
-rw-r--r-- 1 user01 hadoop 38 Apr 4 17:54 test01.txt

⚫ -moveFromLocal: Cuts a local file and pastes it into HDFS.

Place the abcd.txt file in the home directory of userXX:

> cp /FusionInsight-Labs/abcd.txt /home/userXX


> cd /home/userXX

> ll

total 2881716
drwxr-xr-x 15 user01 hadoop 4096 Apr 4 10:58 1001_hadoopclient
drwxr-xr-x 2 user01 hadoop 4096 Apr 4 10:03 bin
-rw-r--r-- 1 user01 hadoop 0 Apr 4 15:28 abcd
-rwxr-xr-x 1 user01 hadoop 2947983360 Apr 4 10:05 Service_Client.tar

Execute the moveFromLocal command to move the abcd.txt file to the /user/app_stuXX directory in
the HDFS.

> hdfs dfs -moveFromLocal /home/userXX/abcd.txt /user/app_stuXX

After the execution is complete, check that the file does not exist anymore in the home directory of
userXX.

> ll
total 2881716
drwxr-xr-x 15 stu01 hadoop 4096 Apr 4 10:58 1001_hadoopclient
drwxr-xr-x 2 stu01 hadoop 4096 Apr 4 10:03 bin
-rwxr-xr-x 1 stu01 hadoop 2947983360 Apr 4 10:05 Service_Client.tar

The file has been moved to the HDFS.

> hdfs dfs -ls -h /user/app_stuXX

Found 3 items
-rw-rw-rw-+ 3 root hadoop 1.3 G 2020-07-13 16:21
/user/app_stu20/FusionInsight_Cluster_1_Services_Client.tar
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-13 16:45
/user/app_stu01/abcd.txt
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-13 16:29
/user/app_stu01/test01.txt

⚫ -cat: Displays the file content.

> hdfs dfs -cat /user/app_stuXX/test01.txt

01,HDFS
02,Zookeeper
03,HBase
04,Hive

⚫ -appendToFile: Appends data to the end of a file.

There is a local file appendtext.txt; copy it to the home directory and view its content:

> cp /FusionInsight-Labs/appendtext.txt /home/userXX


> cat appendtext.txt

10,Spark
11,Storm
12,Kafka
13,Flink
14,ELK
15,FusionInsight HD

Add the content in appendtext.txt to the end of test01.txt.

> hdfs dfs -appendToFile /home/userXX/appendtext.txt /user/app_stuXX/test01.txt

Check whether the content has been added successfully.

> hdfs dfs -cat /user/app_stuXX/test01.txt

01,HDFS
02,Zookeeper
03,HBase
04,Hive
10,Spark
11,Storm
12,Kafka
13,Flink
14,ELK
15,FusionInsight HD

⚫ -chmod: Modifies the file permission.

> hdfs dfs -ls /user/app_stuXX

Found 3 items
-rw-rw-rw-+ 3 root hadoop 1352929792 2020-07-13 16:21
/user/app_stu01/FusionInsight_Cluster_1_Services_Client.tar
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-13 16:45
/user/app_stu01/abcd.txt
-rw-rw-rw-+ 3 user01 hadoop 101 2020-07-13 16:57
/user/app_stu01/test01.txt

Modify the permission of /user/app_stuXX/test01.txt to 755:

> hdfs dfs -chmod 755 /user/app_stuXX/test01.txt

> hdfs dfs -ls /user/app_stuXX/test01.txt

-rwxr-xr-x+ 3 user01 hadoop 101 2020-07-13 16:57 /user/app_stu01/test01.txt

To use chown, you must have the superuser permission.

⚫ -cp: Copies a file.

Copy /user/app_stuXX/test01.txt to the /tmp/stuXX directory with the name file01.txt:

> hdfs dfs -cp /user/app_stuXX/test01.txt /tmp/stuXX/file01.txt

> hdfs dfs -ls /tmp/stuXX

Found 1 items

-rw-rw-rw-+ 3 user01 hadoop 101 2020-07-13 17:40 /tmp/stu01/file01.txt

⚫ -mv: Moves a file.

Move /tmp/stuXX/file01.txt to the /user/app_stuXX directory:

> hdfs dfs -mv /tmp/stuXX/file01.txt /user/app_stuXX

> hdfs dfs -ls /tmp/stuXX

Found 2 items
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-13 16:45 /tmp/stu01/abcd.txt
-rw-rw-rw-+ 3 user01 hadoop 101 2020-07-13 17:40
/tmp/stu01/test01.txt

⚫ -getmerge: Merges multiple files and downloads the result.

There are two files in the /user/app_stuXX directory: file01.txt and test01.txt.

> hdfs dfs -put /FusionInsight-Labs/file01.txt /user/app_stuXX

> hdfs dfs -ls /user/app_stuXX/

-rw-rw-rw-+ 3 user01 hadoop 120 2020-07-16 11:53 /user/app_stu01/file01.txt
-rwxr-xr-x+ 3 user01 hadoop 101 2020-07-13 16:57 /user/app_stu01/test01.txt

The contents of the two files are as follows:

> hdfs dfs -cat /user/app_stuXX/file01.txt

001 FusionInsight HD
002 FusionInsight Miner
003 FusionInsight LibrA
004 FusionInsight Farmer
005 FusionInsight Manager

> hdfs dfs -cat /user/app_stuXX/test01.txt



01,HDFS
02,Zookeeper
03,HBase
04,Hive
10,Spark
11,Storm
12,Kafka
13,Flink
14,ELK
15,FusionInsight HD

Combine the files and copy them to a local directory:

> hdfs dfs -getmerge /user/app_stuXX/file01.txt /user/app_stuXX/test01.txt /home/userXX/merge_file.txt

> cat /home/userXX/merge_file.txt

001 FusionInsight HD
002 FusionInsight Miner
003 FusionInsight LibrA
004 FusionInsight Farmer
005 FusionInsight Manager
01,HDFS
02,Zookeeper
03,HBase
04,Hive
10,Spark
11,Storm
12,Kafka
13,Flink
14,ELK
15,FusionInsight HD

⚫ -rm: Deletes a file or folder.

Delete the /user/app_stuXX/file01.txt file:

> hdfs dfs -rm -f /user/app_stuXX/file01.txt


INFO fs.Trash: Moved: 'hdfs://hacluster/app_stu01/file01' to trash at:
hdfs://hacluster/user/stu01/.Trash/Current

⚫ -df: Checks the available space of the file system.

> hdfs dfs -df -h /


Filesystem Size Used Available Use%
hdfs://hacluster 1.7 T 11.9 G 1.7 T 1%

⚫ -du: Checks the folder size.

> hdfs dfs -du -h /user/app_stuXX


213.1 M /user/admin
0 /user/hdfs
75 /user/hdfs-examples

213.1 M /user/hive
4.3 K /user/loader
493 /user/mapred

⚫ -count: Checks the number of files in a specific directory.

> hdfs dfs -count -h /user/app_stuXX

344 494 3.2 G /user

In the output, 344 in the first column indicates the number of folders under the /user/ directory, 494
in the second column indicates the number of files under /user/, and 3.2 G indicates the disk space
occupied by all files under /user/ (excluding replicas).
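As an optional check (not part of the original steps), the -q option of -count, listed in the -help output earlier, also displays the quota information of a directory; this is useful in the quota experiment later in this chapter:

> hdfs dfs -count -q -h /user/app_stuXX

The output prepends four quota-related columns: the namespace quota, the remaining namespace quota, the space quota, and the remaining space quota.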

2.3.1.2 Recycle Bin Usage


Files may be deleted by mistake during work. If this happens, you can retrieve the deleted files from
the HDFS recycle bin. By default, the recycle bin keeps deleted files for seven days. For example, in
the preceding experiment, the -rm command is used to delete the file01.txt file. After the file is
deleted, the system prompts that it has been moved to the trash: fs.Trash: Moved:
'hdfs://hacluster/user/app_stuXX/file01.txt' to trash at:
hdfs://hacluster/user/userXX/.Trash/Current/user/app_stuXX. In other words, HDFS archives the
deleted file in a separate directory instead of removing it immediately.

> hdfs dfs -ls /user/userXX/.Trash

drwx------+ - user01 hadoop 0 2020-06-08 13:13 /user/user01/.Trash/Current

> hdfs dfs -ls /user/userXX/.Trash/Current

drwx------+ - user01 hadoop 0 2020-06-08 13:13 /user/user01/.Trash/Current/user

> hdfs dfs -ls /user/userXX/.Trash/Current/user

drwx------+ - user01 hadoop 0 2020-06-08 13:13 /user/user01/.Trash/Current/user/app_stu01

View the /user/userXX/.Trash/Current/user/app_stuXX directory.

> hdfs dfs -ls -h /user/userXX/.Trash/Current/user/app_stuXX

-rw-rw-rw-+ 3 user01 hadoop 120 2020-07-16 11:53 /user/user01/.Trash/Current/user/app_stu01/file01.txt

Then, use the -mv command to move the file back to the specified directory. For details, see the
description of -mv in this section.

> hdfs dfs -mv /user/userXX/.Trash/Current/user/app_stuXX/file01.txt /user/app_stuXX

2.3.2 HDFS Management Operations


2.3.2.1 HDFS Quota Management
When multiple tenants use the HDFS, the HDFS space available for each tenant should be limited.
HDFS quota management is designed for this matter.
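For reference only: this lab sets the quotas through the tenant GUI described below. With HDFS administrator rights, the same limits could in principle be set from the command line; a hedged sketch (do not run it in this lab) would look like this:

> hdfs dfsadmin -setQuota 3 /user/app_stuXX/myquota
> hdfs dfsadmin -setSpaceQuota 1000m /user/app_stuXX/myquota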

2.3.2.1.1 Creating Quota

Step 1 On the FusionInsight Manager interface, click Tenant Management (in the new interface:
Tenant Resources -> Tenant Resources Management).

In the tenant list on the left, click the tenant queue_stuXX whose HDFS storage directory needs to be
modified.

Step 2 Click the Resource tab.


Step 3 In the HDFS Storage table, click Create Directory.

Step 4 Add a directory.

Path: Enter the directory assigned to the tenant. If the path does not exist, the system automatically
creates it. (/user/app_stuXX/myquota)
Quota: Enter the upper limit of the total number of stored files and directories. Quota: 3
SpaceQuota: Enter the storage space quota of the directory. SpaceQuota: 1000 MB

The path must be unique.


After filling in all the values, click OK.

Step 5 Check the result of adding a directory.


Run the HDFS file uploading command:

> hdfs dfs -put /home/userXX/test01.txt /user/app_stuXX/myquota

Run the following command to check whether the file has been uploaded:

> hdfs dfs -ls /user/app_stuXX/myquota

Found 1 items
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-16 20:09
/user/app_stu01/myquota/test01.txt

If the preceding information is displayed, the /myquota directory is created successfully and the
current user has the permission to upload files.

Step 6 Test SpaceQuota.

Pre-applied disk space = Number of blocks corresponding to the file x Block size x 3. The default block
size is 128 MB, so the minimum pre-applied disk space (one data block) is 128 MB x 3 = 384 MB. The
SpaceQuota was set to 1000 MB in step 4, so the maximum file size is 2 x 128 MB = 256 MB. When
the file size is greater than 256 MB, at least three data blocks are required (3 x 128 MB x 3 > 1000
MB), the SpaceQuota cannot be met, and the upload fails. (Number of blocks corresponding to a file
= File size/128 MB, rounded up if it is indivisible.)
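A minimal sketch of this calculation in shell arithmetic (illustration only; the 451 MB value assumes the Flume client package used later in this step):

> FILE_MB=451; BLOCK_MB=128; REPLICAS=3
> BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))   # round up: ceil(451/128) = 4
> echo "$(( BLOCKS * BLOCK_MB * REPLICAS )) MB pre-applied"

1536 MB pre-applied

1536 MB exceeds the 1000 MB SpaceQuota, so such an upload fails.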
The following is an example of uploading a file larger than 256 MB when the SpaceQuota is 1000
MB.

> cd /FusionInsight-Client/Flume

Check the size of FusionInsight_Cluster_1_Flume_Client.tar:

> ll -h
total 451M
-rwxr-xr-x 1 root root 451M Jul 23 13:22
FusionInsight_Cluster_1_Flume_Client.tar

Run the following command to upload the file to the HDFS:



> hdfs dfs -put /FusionInsight-Client/Flume/FusionInsight_Cluster_1_Flume_Client.tar /user/app_stuXX/myquota

put: The DiskSpace quota of /user/app_stu01/myquota is exceeded: quota = 1048576000 B = 1000 MB but diskspace consumed = 1207959666 B = 1.13 GB

The file fails to be uploaded because SpaceQuota is set to 1000 MB and the file size is greater than
256 MB.

Step 7 Test Quota.

As configured in step 4, the Quota value of 3 includes the directory itself, so the directory can hold at
most two files; uploading a third file fails. Run the following commands to perform the test:

> hdfs dfs -put /home/userXX/hadoopclient/switchuser.py /user/app_stuXX/myquota

> hdfs dfs -put /home/userXX/hadoopclient/install.ini /user/app_stuXX/myquota

put: The NameSpace quota (directories and files) of directory /user/app_stu01/myquota is exceeded: quota=3 file count=4

Run the following command to view the file list in the specified HDFS directory:

> hdfs dfs -ls /user/app_stuXX/myquota

Found 2 items
-rw-rw-rw-+ 3 user01 hadoop 1774 2020-07-23 18:35
/user/app_stu01/myquota/switchuser.py
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-23 18:31
/user/app_stu01/myquota/test01.txt
The command output does not contain the install.ini file, which confirms that the upload failed.

----End

2.3.2.1.2 Modifying Quota Configuration

Step 1 Change Quota to 4 and SpaceQuota to 1700 MB.



Step 2 Upload the data again and then view the file list in the directory:

> hdfs dfs -put /FusionInsight-Client/Flume/FusionInsight_Cluster_1_Flume_Client.tar /user/app_stuXX/myquota

> hdfs dfs -ls /user/app_stuXX/myquota

Found 3 items
-rw-rw-rw-+ 3 user01 hadoop 472237056 2020-07-23 18:42
/user/app_stu01/myquota/FusionInsight_Cluster_1_Flume_Client.tar
-rw-rw-rw-+ 3 user01 hadoop 1774 2020-07-23 18:35
/user/app_stu01/myquota/switchuser.py
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-23 18:31
/user/app_stu01/myquota/test01.txt

The preceding command output indicates that, after the configuration is modified, the large file
(about 451 MB) can be uploaded and the directory can hold multiple (three) files.
----End

2.3.2.1.3 Deleting Quota

Step 1 Log in to FusionInsight Manager and choose Tenant > Tenant Management > queue_stuXX >
Resource (in the new interface: Tenant Resources -> Tenant Resources Management ->
queue_stuXX > Resource).

Step 2 Click the cross icon (x) in the Operation column of the specified directory in HDFS Storage area
to delete the storage resource.

(In the other interface: select the directory and click Delete in the Operation column.)

Step 3 In the Delete Directory dialog box that is displayed, select the check box and click OK.

----End

2.3.2.2 HDFS Metadata Backup and Recovery


To ensure HDFS metadata security or when the system administrator needs to perform major
operations (such as upgrade or migration) on the HDFS cluster, you need to back up HDFS metadata
so that HDFS metadata can be restored in a timely manner when faults occur in the system, making
HDFS cluster data secure and reliable.
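For reference only (this lab backs up the metadata through FusionInsight Manager, as described below), HDFS itself can also export the latest NameNode fsimage to a local directory when you have HDFS administrator rights, for example:

> hdfs dfsadmin -fetchImage /tmp/fsimage_backup

The /tmp/fsimage_backup path here is only an illustrative example.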

2.3.2.2.1 Data Backup


The data backup procedure is as follows:

Step 1 Choose System > Backup Management (in the new interface: O&M -> Backup and Restoration).

Step 2 Click Create Backup Task (in the new interface: the Create button).



Step 3 Select the check box next to NameNode and configure parameters of the NameNode metadata
backup task, including Task name, Path type, Maximum number of backup copies, and Instance
name, and click OK.

Task name: stuXX_NameNodeBackup


Backup Object: Cluster Demo
Configuration: NameNode
Path Type: LocalDir
Maximum Number of Backup Copies: 3
Instance Name: hacluster

Step 4 Click the start icon in the Operation column to execute the metadata backup task. (In the new
interface: select the task from the list, and in the Operation column choose More > Back Up Now.)

Step 5 When the task progress is 100%, the task is complete and HDFS metadata is backed up
successfully.

----End

2.3.2.2.2 Data Recovery


Data recovery is performed based on the data backup result. The data recovery procedure is as
follows:

Step 1 Choose System > Backup Management (in the new interface: O&M -> Backup and Restoration
-> Backup Management).

Step 2 Click the button for viewing historical operations of the NameNodeBackup task (in the
Operation column, choose More -> View History).

Step 3 Check the data backup log and click View in the Details column.

Step 4 Find the path for saving the backup data file from the log file, as shown in the following figure.

(In the other interface: click View in the Backup Path column.) The file name is displayed, for example:
/srv/BigData/LocalBackup/1/stu01_NameNodeBackup_20200723185056/NameNode_20200723185104/6.5.1_HDFS-hacluster-fsimage_20200723185211.tar.gz

Step 5 Copy the path and click Recovery Management to create a recovery task (O&M -> Backup and
Restoration -> Restoration Management).

Step 6 On the page that is displayed, click Create Recovery Task (in the new interface: the Create button).

Step 7 Configure parameters for the task, including Task name, Path type, Source path, and Instance
name. The source path indicates the file path obtained in step 4. After configuring all the
parameters, click OK.

Task Name: stuXX_recovery

Recovery Configuration: NameNode
Path Type: LocalDir
Source Path: select from the list the file name obtained from the local path, for example
/srv/BigData/LocalBackup/1/stu01_NameNodeBackup_20200723185056/NameNode_20200723185104/6.5.1_HDFS-hacluster-fsimage_20200723185211.tar.gz
Instance Name: hacluster

Step 8 Click the start icon corresponding to the task to start data recovery.

The preceding figure shows that NameNode data is successfully recovered.


----End

NOTE: The recovery task will fail because the NameNode must be stopped for it to run. Do NOT stop
the NameNode!

2.4 Summary
This experiment describes common HDFS operations and HDFS management. After this experiment,
trainees should have known how to perform common operations in the HDFS.

3 HBase Database Practice

3.1 Background
HBase is a highly reliable, high-performance, column-oriented, and scalable distributed storage
system. It is the most commonly used NoSQL database in the industry. The knowledge about how to
use HBase can deepen trainees' understanding of HBase and lay a solid foundation for
comprehensively using Big Data.

3.2 Objective
⚫ To have a good command of common HBase operations, region operations, and filter
usage.

3.3 Experiment Tasks

3.3.1 Common HBase Operations


3.3.1.1 Logging In to an HBase Shell Client
Step 1 Log in to an HBase shell client.

> cd /home/userXX/hadoopclient
> source bigdata_env
> kinit stuXX
Password for stuXX@HADOOP.COM:
> hbase shell
……
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.0.2, rUnknown, Thu May 12 17:02:55 CST 2016
hbase(main):001:0>

The preceding information indicates that you have logged in to the HBase shell client.

3.3.1.2 Creating a Common Table


Step 1 The syntax for creating a common table is as follows: create 'table name', 'column family name'

Entering the command:

> create 'stuXX_cga_info','info'


0 row(s) in 0.3620 seconds
=> Hbase::Table - cga_info

The stuXX_cga_info table is successfully created.

Step 2 Run the list command to check the number of common tables in the system.

> list

TABLE
stu01_cga_info
Socker
t1
3 row(s) in 0.2300 seconds
=> ["stu01_cga_info", "socker", "t1"]

The command output shows that there are three common tables in the system.
----End

3.3.1.3 Creating a Namespace


The syntax for creating a namespace is as follows: create_namespace 'namespace name'.

> create_namespace 'nnstuXX'


0 row(s) in 0.1280 seconds

3.3.1.4 Creating a Table in a Specific Namespace


Create a table in the specified namespace: create 'namespace name:table name','column family'

> create 'nnstuXX:studentXX','info'


0 row(s) in 0.2680 seconds
=> Hbase::Table – nnstu01:student

3.3.1.5 Viewing Tables in a Specified Namespace


Run the list_namespace_tables 'namespace name' command to view the tables in the namespace.

> list_namespace_tables 'nnstuXX'


TABLE
student
1 row(s) in 0.0220 seconds

3.3.1.6 Adding Data


Add data: put 'table name', 'RowKey', 'column name', 'specific value'
For example, enter the information about a 40-year-old man named Kobe who lives in Los Angeles
into the stuXX_cga_info table:

> put 'stuXX_cga_info','123001','info:name','Kobe'


0 row(s) in 0.1580 seconds
> put 'stuXX_cga_info','123001','info:gender','male'
0 row(s) in 0.0390 seconds
> put 'stuXX_cga_info','123001','info:age','40'
0 row(s) in 0.0250 seconds
> put 'stuXX_cga_info','123001','info:address','Los Angeles'
0 row(s) in 0.0170 seconds

3.3.1.7 Querying Data Using get


Step 1 get: exact query
Query the content stored in a RowKey precisely: get 'table name', 'RowKey'

> get 'stuXX_cga_info','123001'


COLUMN CELL
info:address timestamp=1523350574004, value=Los Angeles
info:age timestamp=1523350540131, value=40
info:gender timestamp=1523350499780, value=male
info:name timestamp=1523350443121, value=Kobe
4 row(s) in 0.0540 seconds

Step 2 Query the content stored in a cell in a RowKey precisely.

Syntax: get 'table name', 'RowKey', 'column name'


> get 'stuXX_cga_info','123001','info:name'
COLUMN CELL
info:name timestamp=1523350443121, value=Kobe
1 row(s) in 0.0310 seconds

----End

3.3.1.8 Querying Data Using scan


Step 1 Enter multiple data records into the table as instructed in section 3.3.1.6.
Step 2 scan: Queries data in a certain range.
Query the information in all columns of a column family in the table: scan 'table name',
{COLUMNS => 'column family name'}

> scan 'stuXX_cga_info',{COLUMNS=>'info'}


ROW COLUMN+CELL
123001 column=info:address, timestamp=1523350574004, value=Los
Angeles
123001 column=info:age, timestamp=1523350540131, value=40
123001 column=info:gender, timestamp=1523350499780, value=male
123001 column=info:name, timestamp=1523350443121, value=Kobe
123002 column=info:address, timestamp=1523351932415, value=London
123002 column=info:age, timestamp=1523351887009, value=40
123002 column=info:gender, timestamp=1523351993106, value=female
123002 column=info:name, timestamp=1523351965188, value=Victoria

123003 column=info:address, timestamp=1523352194766, value=Redding


123003 column=info:age, timestamp=1523352108282, value=30
123003 column=info:gender, timestamp=1523352060912, value=female
123003 column=info:name, timestamp=1523352091677, value=Taylor
123004 column=info:address, timestamp=1523352217267, value=Cleveland
123004 column=info:age, timestamp=1523352229436, value=33
123004 column=info:gender, timestamp=1523352267416, value=male
123004 column=info:name, timestamp=1523352251926, value=LeBron
4 row(s) in 0.0480 seconds

Step 3 Query the information stored in a specific column in the table.

Syntax: scan 'table name', {COLUMNS => 'column family:column name'}

> scan 'stuXX_cga_info',{COLUMNS=>'info:name'}


ROW COLUMN+CELL
123001 column=info:name, timestamp=1523350443121, value=Kobe
123002 column=info:name, timestamp=1523351965188,
value=Victoria
123003 column=info:name, timestamp=1523352091677,
value=Taylor
123004 column=info:name, timestamp=1523352251926,
value=LeBron
4 row(s) in 0.0300 seconds

----End

3.3.1.9 Querying Data that Matches Specific Conditions


Step 1 Query the data whose RowKey is 123002 or 123003.

> scan 'stuXX_cga_info',{STARTROW=>'123002','LIMIT'=>2}


ROW COLUMN+CELL
123002 column=info:address, timestamp=1523351932415,
value=London
123002 column=info:age, timestamp=1523351887009, value=40
123002 column=info:gender, timestamp=1523351993106,
value=female
123002 column=info:name, timestamp=1523351965188,
value=Victoria
123003 column=info:address,
timestamp=1523352194766,value=Redding
123003 column=info:age, timestamp=1523352108282, value=30
123003 column=info:gender, timestamp=1523352060912,
value=female
123003 column=info:name, timestamp=1523352091677,
value=Taylor
2 row(s) in 0.0170 seconds

Step 2 Query the information stored in the cell whose Rowkey is 123001 or 123002 and column name
is name.

> scan 'stuXX_cga_info',{STARTROW=>'123001','LIMIT'=>2,COLUMNS=>'info:name'}



ROW COLUMN+CELL
123001 column=info:name, timestamp=1523350443121, value=Kobe
123002 column=info:name, timestamp=1523351965188,
value=Victoria
2 row(s) in 0.0500 seconds

In addition to COLUMNS, HBase also supports LIMIT (limits the number of rows in the query results),
STARTROW (the start RowKey; the region is located based on STARTROW and the scan proceeds
forward from it), STOPROW (the end RowKey, exclusive), TIMERANGE (range of the timestamp),
VERSIONS (version number), and FILTER (filters rows by condition).
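As an additional hedged example based on the same table, STARTROW and STOPROW can be combined; because the stop row is exclusive, the following returns only rows 123002 and 123003:

> scan 'stuXX_cga_info',{STARTROW=>'123002',STOPROW=>'123004',COLUMNS=>'info:name'}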
----End

3.3.1.10 Updating Data


Step 1 Query the age information whose Rowkey is 123001 in the table.

> get 'stuXX_cga_info','123001','info:age'


COLUMN CELL
info:age timestamp=1523350540131, value=40
1 row(s) in 0.0260 seconds

Step 2 Change the age information whose Rowkey is 123001 in the table.

> put 'stuXX_cga_info','123001','info:age','18'


0 row(s) in 0.0340 seconds

Step 3 Query the age information whose Rowkey is 123001 in the table again.

> get 'stuXX_cga_info','123001','info:age'


COLUMN CELL
info:age timestamp=1523353910053, value=18
1 row(s) in 0.0040 seconds

Compare the results of step 1 and step 3. It can be seen that the age information has been updated.
----End

3.3.1.11 Deleting Data


3.3.1.11.1 Deleting Data in a Column Using delete

Step 1 Query the information whose Rowkey is 123001 in the table.

> get 'stuXX_cga_info','123001'


COLUMN CELL
info:address timestamp=1523350574004, value=Los Angeles
info:age timestamp=1523353910053, value=18

info:gender timestamp=1523350499780, value=male


info:name timestamp=1523350443121, value=Kobe
4 row(s) in 0.0380 seconds

Step 2 Run the delete command to delete the data stored in the age column in 123001.

> delete 'stuXX_cga_info','123001','info:age'


0 row(s) in 0.0300 seconds

Step 3 Query the information whose Rowkey is 123001 in the table again.

> get 'stuXX_cga_info','123001'


COLUMN CELL
info:address timestamp=1523350574004, value=Los Angeles
info:gender timestamp=1523350499780, value=male
info:name timestamp=1523350443121, value=Kobe

Compare the results of step 1 and step 3. It can be seen that the age information has been deleted.
----End

3.3.1.11.2 Deleting All Data in a Line Using deleteall

Step 1 Run the deleteall command to delete the entire row 123001 from the stuXX_cga_info table.

> deleteall 'stuXX_cga_info','123001'


0 row(s) in 0.0320 seconds

Step 2 Query the information whose Rowkey is 123001 in the table again.

> get 'stuXX_cga_info','123001'


COLUMN CELL
0 row(s) in 0.0190 seconds

No information whose RowKey is 123001 can be found, indicating that all the data in the line has
been deleted.
----End

3.3.1.11.3 Deleting Table Using drop

Step 1 Create a table named stuXX_cga_info1.

> create 'stuXX_cga_info1','info'


0 row(s) in 0.3920 seconds
=> Hbase::Table - cga_info1

Step 2 Run disable 'table name' first and then drop 'table name' to delete the table.

> disable 'stuXX_cga_info1'


0 row(s) in 1.2270 seconds
> drop 'stuXX_cga_info1'

2018-04-10 18:12:23,566 INFO [main] client.HBaseAdmin: Deleted cga_info1


0 row(s) in 0.3940 seconds

Step 3 Query tables in the current namespace.

> list
TABLE
cga_info
Socker
t1
3 row(s) in 0.2300 seconds
=> ["cga_info", "socker", "t1"]

The result shows that the stuXX_cga_info1 table has been deleted.
----End

3.3.2 Using Filter


Filter allows you to set certain filtering conditions in the scan process. Only the user data that meets
the filtering conditions is returned. All filters take effect on the server to ensure that the filtered data
is not transmitted to the client.
Example 1: Query people whose age is 40.

> scan 'stuXX_cga_info',{FILTER=>"ValueFilter(=,'binary:40')"}


ROW COLUMN+CELL
123002 column=info:age, timestamp=1523351887009, value=40
1 row(s) in 0.1230 seconds

Example 2: Query the people named LeBron.

> scan 'stuXX_cga_info',{FILTER=>"ValueFilter(=,'binary:LeBron')"}


ROW COLUMN+CELL
123004 column=info:name, timestamp=1523352251926,
value=LeBron
1 row(s) in 0.2240 seconds

Example 3: Query the gender information of all users in the table.

> scan 'stuXX_cga_info',FILTER=>"ColumnPrefixFilter('gender')"


ROW COLUMN+CELL
123002 column=info:gender, timestamp=1523351993106,
value=female
123003 column=info:gender, timestamp=1523352060912,
value=female
123004 column=info:gender, timestamp=1523352267416,
value=male
3 row(s) in 0.0570 seconds

Example 4: Query the address information of all the people in the table and find out the people who
live in London.

> scan 'stuXX_cga_info',{FILTER=>"ColumnPrefixFilter('address') AND ValueFilter(=, 'binary:London')"}

ROW COLUMN+CELL
123002 column=info:address, timestamp=1523351932415, value=London
1 row(s) in 0.0100 seconds

Filter filters data based on the column family, column, version, and so on. Only four filtering methods
are demonstrated here. RPC query requests with filter criteria will be distributed to each
RegionServer. In this way, the network transmission pressure is reduced.
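As one more hedged example beyond the four above, a RowFilter compares on the RowKey itself. Assuming the same stuXX_cga_info table, the following should return only the rows whose RowKey is greater than or equal to 123003 (that is, 123003 and 123004):

> scan 'stuXX_cga_info',{FILTER=>"RowFilter(>=,'binary:123003')"}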

3.3.3 Creating a Table with Pre-Distributed Regions


3.3.3.1 Dividing a Table into Four Random Regions by Rowkey
Step 1 Create a new table stuXX_cga_info2 and divide the table into four regions.

create 'table name', 'column family name', {NUMREGIONS => 4, SPLITALGO =>
'UniformSplit'}
> create 'stuXX_cga_info2','info',{NUMREGIONS=>4,SPLITALGO=>'UniformSplit'}
0 row(s) in 0.3720 seconds
=> Hbase::Table - cga_info2

Step 2 On FusionInsight Manager, choose Services > HBase.

Step 3 Click HMaster(Active).



Step 4 Click "Table Details".

Step 5 Find the new table stuXX_cga_info2.



Step 6 Query the region division result. The stuXX_cga_info2 table is divided into four regions. Name
contains the table name, StartKey (the first region does not have StartKey), timestamp, and
region ID.

----End

3.3.3.2 Specifying the StartKeys of Regions


Step 1 When creating a table, specify the StartKeys of the regions.

> create 'table name', 'column family name', SPLITS => ['first StartKey',
'second StartKey', 'third StartKey']

Example: Create a table named stuXX_cga_info3 and specify three StartKeys which are 10000,
20000, and 30000 respectively.

> create 'stuXX_cga_info3','info',SPLITS => ['10000', '20000', '30000']


0 row(s) in 0.6820 seconds
=> Hbase::Table - cga_info3

Step 2 Go to the Table Regions page as instructed in section 3.3.3.1.

The result shows that the stuXX_cga_info3 table is divided into four regions based on Start Keys
10000, 20000, and 30000.
----End

3.3.3.3 Pre-Dividing Regions Using a File


Step 1 Press Ctrl+C to exit the HBase shell.

user01@fi01host01:~>

Step 2 Create the splitFile.dat file in the /home/userXX/ directory.

> touch /home/userXX/splitFile.dat

Step 3 Go to the /home/userXX directory.



> cd /home/userXX

Step 4 Enter 10000, 20000, and 30000 in splitFile.dat.

> vim splitFile.dat

On the editing interface, press i, then enter 10000, 20000, and 30000, pressing Enter after each value.

Step 5 After entering all the values, press Esc and type :wq to save the file and exit.
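As an optional shortcut (not part of the original steps), the same file can be created non-interactively with a here-document instead of vim:

> cat > /home/userXX/splitFile.dat << 'EOF'
10000
20000
30000
EOF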
Step 6 Go to the HBase shell again.

> cd /home/userXX/hadoopclient
> source bigdata_env
> kinit stuXX
Password for stuXX@HADOOP.COM:
> hbase shell

Step 7 Create a table named stuXX_cga_info4 and pre-divide it using the splitFile.dat file created earlier.

> create 'stuXX_cga_info4','info',SPLITS_FILE => '/home/userXX/splitFile.dat'


0 row(s) in 0.4650 seconds
=> Hbase::Table - cga_info4

Step 8 Go to the Table Regions page as instructed in section 3.3.3.1.

The result shows that the stuXX_cga_info4 table is divided into four regions based on Start Keys
10000, 20000, and 30000 specified in splitFile.dat.
Note: For a table with regions pre-divided using start keys and end keys, the rowkey range of a region
is [start_key, end_key). For example, with split points 10000, 20000, and 30000, the rowkey 15000
falls into the region [10000, 20000).
----End

3.3.4 HBase Load Balancing


3.3.4.1 Viewing the HBase Web UI
Step 1 Go to the Region Servers page of the HBase by performing the first four steps of section 3.3.3.1.
Step 2 Click Requests.

Step 3 Click Base Stats.

The previous figure shows a serious problem of load imbalance. The fi01host02 host is overloaded.
You can manually move hot regions to the fi01host01 host.
----End

3.3.4.2 Moving Regions


Step 1 Click fi01host02.

Step 2 Check which regions are taken over by the fi01host02 host.

As shown in the preceding figure, the load is unbalanced due to the meta table. However, you are
not advised to move the meta table. In this experiment, move the stuXX_cga_info table.

Step 3 Move region 67aee3318a626ec0b1265e26fd46c151 to RegionServer
fi01host01,21302,1522806777697.

> echo "move '67aee3318a626ec0b1265e26fd46c151','fi01host01,21302,1522806777697'" | hbase shell

move '67aee3318a626ec0b1265e26fd46c151','fi01host01,21302,1522806777697'

0 row(s) in 0.4200 seconds

Step 4 Check the HBase Web UI.

On the web UI, you can see that the region has been moved to fi01host01.
----End

3.4 Summary
This experiment demonstrates how to create and delete an HBase table, how to add, delete, modify,
and query data, how to pre-divide regions, and how to manually achieve load balancing. Through the
experiment, trainees can master the methods of using HBase and deepen their understanding of
HBase.

4 Hive Data Warehouse Practice

4.1 Background
Hive is a data warehouse tool that plays an important role in data mining, data aggregation, and
statistical analysis. In particular, Hive plays an important role in telecom services. It can be used to
collect traffic, call fee, and tariff information of users, and establish users' consumption models to
help carriers better plan package content.

4.2 Objectives
⚫ To have a good command of common Hive operations.
⚫ To master how to run HQL on Hue.

4.3 Experiment Tasks

4.3.1 Common Functions of Hive


Enter the Hive client beeline:
> source /home/userXX/hadoopclient/bigdata_env
> /home/userXX/hadoopclient/Hive/Beeline/bin/beeline
...
Connected to: Apache Hive (version 1.3.0)
Driver: Hive JDBC (version 1.3.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.3.0 by Apache Hive
0: jdbc:hive2://192.168.225.11:21066/>
⚫ Character string length function: length


Syntax: length(string A)
Returned value: int
Note: Return the length of character string A.

hive> select length('abcedfg');


7

⚫ Character string reverse function

Syntax: reverse(string A)
Returned value: string
Note: Return the reversion of character string A.

hive> select reverse('abcedfg');


gfdecba

⚫ Character string connection function

Syntax: concat(string A, string B…)


Returned value: string
Note: Return the result of character string connection. You can enter any number of character
strings.

hive> select concat('abc', 'def', 'gh');


abcdefgh

⚫ The function of connecting character strings with delimiters

Syntax: concat_ws(string SEP, string A, string B…)


Returned value: string
Note: Returns the result of character string connection. SEP indicates the delimiter between
character strings.

hive> select concat_ws('-', 'abc', 'def', 'gh');


abc-def-gh

⚫ Character string truncation function

Syntax: substr(string A, int start, int len),substring(string A, int start, int len)
Returned value: string
Note: Return character string A from the start point with a length of len.

hive> select substr('abcde',3,2);


cd

hive> select substr ('abcde',-2,2);


de

⚫ The function of converting a character string to uppercase

Syntax: upper(string A) ucase(string A)


Returned value: string
Note: Return character string A in the uppercase format.

hive> select upper('abC');


ABC

hive> select ucase('abC');


ABC

⚫ The function of converting a character string to lowercase

Syntax: lower(string A) lcase(string A)


Returned value: string
Note: Return character string A in the lowercase format.

hive> select lower('abC');


abc

hive> select lcase('abC');


abc

⚫ The function of removing spaces

Syntax: trim(string A)
Returned value: string
Note: Remove the spaces on both sides of the character string.

hive> select trim(' abc ');


abc

⚫ The function of splitting a character string

Syntax: split(string str, string pat)


Returned value: array
Note: Split the string based on the specified string pattern. The string array after splitting is
returned.

hive> select split('abtcdtef','t');


["ab","cd","ef"]

⚫ Time functions

Function of obtaining the current UNIX timestamp: unix_timestamp


Syntax: unix_timestamp ()
Returned value: bigint
Note: Obtain the UNIX timestamp of the current time zone.

hive> select unix_timestamp();


1521511607

⚫ Function of converting the UNIX timestamp to date: from_unixtime

Syntax: from_unixtime(bigint unixtime[, string format])


Returned value: string
Note: Convert the UNIX timestamp (seconds from 1970-01-01 00:00:00 UTC to the specified
time) to the time format in the current time zone.

hive> select from_unixtime(1521511607,'yyyyMMdd');


20180320
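⚫ Combined example of nesting functions

The string and time functions above can be nested in a single statement. The following is only a small sketch; the date part of the output depends on when you run it.

hive> select concat_ws('-', upper('abc'), substr('abcde',1,3), from_unixtime(unix_timestamp(),'yyyyMMdd'));

ABC-abc-20180320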

4.3.2 Creating a Table


4.3.2.1 Syntax for Creating a Table

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name


[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

4.3.2.2 Creating an Internal Table


Create your own database and switch to it; all subsequent operations are performed in this database.
> create database stuXX_db;
> use stuXX_db;
Create an internal table cga_info1, which contains name, gender, and time columns.

> create table cga_info1(name string,gender string,timest int) row format
delimited fields terminated by ',' stored as textfile;
No rows affected (0.293 seconds)

In the preceding statement, row format delimited fields terminated by ',' indicates that the field
delimiter is ','. If this clause is not set, the default delimiter is used. A Hive HQL statement ends
with a semicolon (;).
View the cga_info1 table.

> show tables like 'cga_info1';


+------------+
| tab_name |
+------------+
| cga_info1 |
+------------+
1 row selected (0.07 seconds)

4.3.2.3 Creating an External Table


Specify the external keyword when you create an external table.

> create external table cga_info2 (name string,gender string,timest int) row
format delimited fields terminated by ',' stored as textfile;
No rows affected (0.343 seconds)

View the cga_info2 table.


> show tables like 'cga_info2';

+------------+
| tab_name |
+------------+
| cga_info2 |
+------------+
1 row selected (0.078 seconds)
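To confirm whether a table is internal (managed) or external, you can also check its metadata; the Table Type field in the output shows MANAGED_TABLE for cga_info1 and EXTERNAL_TABLE for cga_info2.

> desc formatted cga_info2;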

4.3.2.4 Loading Local Data


Step 1 Create a file on the local host.

> cd /home/userXX
> touch 'cga111.dat'

Step 2 Run the vim command to edit the cga111.dat file. Enter several lines of data in the sequence of
name, gender, and time. The field delimiter is a comma (,). To start a new line, press Enter. After
the input is complete, press ESC and enter :wq to save the modification and exit to the Linux
interface.

> vim 'cga111.dat'

Xiaozhao,female,20
Xiaoqian,male,21
Xiaosun,male,25
Xiaoli,female,40
Xiaozhou,male,33

Step 3 Enter Hive again.

> beeline

Step 4 Create a table named cga_info3.

> use stuXX_db;


> create table cga_info3(name string,gender string,timest int) row format
delimited fields terminated by ',' stored as textfile;
No rows affected (0.408 seconds)

Step 5 Load local data cga111.dat to the cga_info3 table.

> load data local inpath '/home/userXX/cga111.dat' into table cga_info3;


INFO : Loading data to table stu01_db.cga_info3 from
file:/home/user01/cga111.dat
No rows affected (0.516 seconds)

Step 6 Query the content in cga_info3.

> select * from cga_info3;



+-----------------+-------------------+-----------------+
| cga_info3.name | cga_info3.gender | cga_info3.time |
+-----------------+-------------------+-----------------+
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaoli | female | 40 |
| xiaozhou | male | 33 |
+-----------------+-------------------+-----------------+
5 rows selected (0.287 seconds)

The result shows that the content in the local file cga111.dat has been loaded to the Hive table
cga_info3.
----End

4.3.2.5 Loading HDFS Data


Step 1 Create the /user/app_stuXX/cga/cg directory in the HDFS.

> hdfs dfs -mkdir /user/app_stuXX/cga

18/04/12 19:43:54 INFO hdfs.PeerCache: SocketCache disabled.

> hdfs dfs -mkdir /user/app_stuXX/cga/cg

18/04/12 19:44:24 INFO hdfs.PeerCache: SocketCache disabled.

Step 2 Upload the local file cga111.dat from the /home/userXX directory to the /user/app_stuXX/cga/cg directory of the HDFS.

> hdfs dfs -put /home/userXX/cga111.dat /user/app_stuXX/cga/cg

18/04/12 14:19:39 INFO hdfs.PeerCache: SocketCache disabled.

Step 3 Enter Hive again.

> beeline
> use stuXX_db;

Step 4 Create a table named cga_info4.

> create table cga_info4(name string,gender string,timest int) row format
delimited fields terminated by ',' stored as textfile;

No rows affected (0.404 seconds)

Step 5 Load the HDFS file cga111.dat to the cga_info4 table.

> load data inpath '/user/app_stuXX/cga/cg/cga111.dat' into table cga_info4;

INFO : Loading data to table default.cga_info4 from


hdfs://hacluster/app_stu01/cga111.dat
No rows affected (0.341 seconds)

Note: Slightly different commands are used to load local data and HDFS data. Loading a local file copies it into the table directory, whereas loading an HDFS file moves it there.
Loading a local file: load data local inpath 'local_inpath' into table hive_table;
Loading an HDFS file: load data inpath 'HDFS_inpath' into table hive_table.
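As a reference only (not part of this experiment), adding the overwrite keyword replaces the existing content of the table instead of appending to it; the path shown here assumes the data file has been uploaded to the HDFS again:

> load data inpath '/user/app_stuXX/cga/cg/cga111.dat' overwrite into table cga_info4;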

Step 6 Query the content in cga_info4.

> select * from cga_info4;


+-----------------+-------------------+-----------------+
| cga_info4.name | cga_info4.gender | cga_info4.time |
+-----------------+-------------------+-----------------+
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaoli | female | 40 |
| xiaozhou | male | 33 |
+-----------------+-------------------+-----------------+
5 rows selected (0.303 seconds)

The result shows that the content of the cga111.dat file in the HDFS has been loaded to the Hive
table cga_info4.
----End

4.3.2.6 Loading Data When Creating a Table


Step 1 Create table cga_info5 and load the cga111.dat data in the HDFS during table creation. Because the load in section 4.3.2.5 moved cga111.dat out of /user/app_stuXX/cga/cg, upload the file to that directory again first.

> hdfs dfs -put /home/userXX/cga111.dat /user/app_stuXX/cga/cg


> beeline
> use stuXX_db;

> create external table cga_info5 (name string,gender string,timest int) row
format delimited fields terminated by ',' stored as textfile location
'/user/app_stuXX/cga/cg';
No rows affected (0.317 seconds)

Step 2 Query the content in cga_info5.

> select * from cga_info5;


+-----------------+-------------------+-----------------+
| cga_info5.name | cga_info5.gender | cga_info5.time |
+-----------------+-------------------+-----------------+
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaoli | female | 40 |
| xiaozhou | male | 33 |
+-----------------+-------------------+-----------------+
5 rows selected (0.268 seconds)

It can be seen that the cga_info5 table has been created successfully with cga111.dat data in the
HDFS loaded.
When data is loaded into an external table, the source files are not deleted. When new files are added
to the table's directory, the corresponding records automatically appear in the table.
-----------------
CONCLUSIONS:
1. When data is loaded into a regular (internal) table, the data file is removed from its original HDFS location (it is moved into the table's warehouse directory).

2. When data is loaded into an EXTERNAL table, the data file in the HDFS is not deleted.

When new files are added to the directory specified during creation of the external
table, the corresponding records are automatically added to the table.
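You can verify these conclusions yourself by listing the source directory after each load; what the listing shows depends on which steps you have already performed:

> hdfs dfs -ls /user/app_stuXX/cga/cg

After the load in section 4.3.2.5, cga111.dat is no longer listed there; after the upload in this section, it stays in place because cga_info5 only references the directory.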

4.3.2.7 Copying an Empty Table


Step 1 Create table cga_info6 by copying the structure of table cga_info1 (the schema is copied, not the data).

> create table cga_info6 like cga_info1;

No rows affected (0.244 seconds)

Step 2 Query the content in cga_info6.

> select * from cga_info6;


+-----------------+-------------------+-----------------+
| cga_info6.name | cga_info6.gender | cga_info6.time |
+-----------------+-------------------+-----------------+
+-----------------+-------------------+-----------------+
No rows selected (0.243 seconds)

The output shows that the empty table has been copied successfully.
----End

4.3.3 Querying
4.3.3.1 Fuzzy Query of Tables
Query tables whose names contain cga.

> show tables like '*cga*';


+--------------------+
| tab_name |
+--------------------+
| cga_hive_hbase |
| cga_info1 |
| cga_info2 |
| cga_info3 |
| cga_info4 |
| cga_info5 |

| cga_info6 |
+--------------------+
7 rows selected (0.072 seconds)

4.3.3.2 Querying by Criterion


Example 1: Use limit to query data in the first two lines in the cga_info3 table.

> select * from cga_info3 limit 2;

+-----------------+-------------------+-------------------+
| cga_info3.name | cga_info3.gender | cga_info3.timest |
+-----------------+-------------------+-------------------+
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
+-----------------+-------------------+-------------------+
2 rows selected (0.295 seconds)

Example 2: Use where to query the information about all women in the cga_info3 table.

> select * from cga_info3 where gender='female';


+-----------------+-------------------+--------------------+
| cga_info3.name | cga_info3.gender | cga_info3.timest |
+-----------------+-------------------+--------------------+
| xiaozhao | female | 20 |
| xiaoli | female | 40 |
+-----------------+-------------------+--------------------+
2 rows selected (0.286 seconds)

Example 3: Use order to query the information about all women in cga_info3 by time in descending
order.

> select * from cga_info3 where gender='female' order by timest desc ;


+-----------------+-------------------+-------------------+
| cga_info3.name | cga_info3.gender | cga_info3.timest |
+-----------------+-------------------+-------------------+
| xiaoli | female | 40 |
| xiaozhao | female | 20 |
+-----------------+-------------------+-------------------+
2 rows selected (24.129 seconds)

The result shows that the information of xiaozhao is ranked second in the output because the query
result is sorted in descending order of time although data about xiaozhao is entered first.

4.3.3.3 Querying by Multiple Criteria


Example 1: Query the cga_info3 table grouped by name, and find the persons whose total value of
time is greater than or equal to 30.

> select name,sum(timest) all_time from cga_info3 group by name having all_time >= 30;

+-----------+--------------+
| name | all_time |
+-----------+--------------+
| xiaoli | 40 |
| xiaozhou | 33 |
+-----------+--------------+
2 rows selected (24.683 seconds)

Example 2: Query the cga_info3 table grouped by gender, and find the greatest time value for each
gender.

> select gender,max(timest) from cga_info3 group by gender;

+---------+---------+
| gender | _c1 |
+---------+---------+
| female | 40 |
| male | 33 |
+---------+---------+
2 rows selected (24.35 seconds)

Example 3: Check the numbers of women and men respectively in the cga_info3 table.

> select gender,count(1) num from cga_info3 group by gender;


+---------+---------+
| gender | num |
+---------+---------+
| female | 2 |
| male | 3 |
+---------+---------+
2 rows selected (23.828 seconds)

Example 4: Insert women information in the cga_info7 table into the cga_info3 table.

Step 1 Create internal table cga_info7.

> create table cga_info7(name string,gender string,timest int) row format
delimited fields terminated by ',' stored as textfile;

No rows affected (0.282 seconds)

Step 2 Create local file cga222.dat in /home/userXX and enter content.

> cd /home/userXX
> touch cga222.dat
> vim cga222.dat

xiaozhao,female,20
xiaochen,female,28

Step 3 Load local data to the cga_info7 table.

> beeline
> use stuXX_db;

> load data local inpath '/home/userXX/cga222.dat' into table cga_info7;


INFO : Loading data to table stu01_db.cga_info7 from
file:/home/user01/cga222.dat
No rows affected (0.423 seconds)

Step 4 Load women information in the cga_info7 table to the cga_info3 table.

> insert into cga_info3 select * from cga_info7 where gender='female';


No rows affected (20.232 seconds)

Step 5 View the content in cga_info3.

> select * from cga_info3;


+------------------+-------------------+-------------------+
| cga_info3.name | cga_info3.gender | cga_info3.timest |
+------------------+-------------------+-------------------+
| xiaozhao | female | 20 |
| xiaochen | female | 28 |
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaoli | female | 40 |
| xiaozhou | male | 33 |
+------------------+-------------------+-------------------+
7 rows selected (0.224 seconds)

The output shows that two pieces of women information in the cga_info7 table have been added to
the cga_info3 table.
Example 5: Query the sum of time values of the people in the cga_info3 table based on the name
and gender.

> select name,gender,sum(timest) timest from cga_info3 group by name,gender;

+-----------+---------+----------+
| name | gender | timest |
+-----------+---------+----------+
| xiaochen | female | 28 |
| xiaoli | female | 40 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaozhao | female | 40 |
| xiaozhou | male | 33 |
+-----------+---------+----------+
6 rows selected (23.554 seconds)

The output shows that two pieces of xiaozhao information are merged in the cga_info3 table.
Example 6: Calculate the sum of the time values of each person in the cga_info3 table, and then rank
the records by time in descending order within each gender.

> select *,row_number() over(partition by gender order by timest desc) rank
from (select name,gender,sum(timest) timest from cga_info3 group by
name,gender) b;

+-----------+-----------+----------+--------+
| b.name | b.gender | b.timest | rank |
+-----------+-----------+----------+--------+
| xiaozhao | female | 40 | 1 |
| xiaoli | female | 40 | 2 |
| xiaochen | female | 28 | 3 |
| xiaozhou | male | 33 | 1 |
| xiaosun | male | 25 | 2 |
| xiaoqian | male | 21 | 3 |
+-----------+-----------+----------+--------+
6 rows selected (52.762 seconds)
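As an optional extension of Example 6, the ranked subquery can be wrapped once more to keep only the record with the largest time value for each gender (a sketch based on the same tables):

> select c.name,c.gender,c.timest from (select *,row_number() over(partition by gender order by timest desc) rank from (select name,gender,sum(timest) timest from cga_info3 group by name,gender) b) c where c.rank=1;

The expected result contains one row per gender.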

----End

4.3.4 Hive Join Operations


Create two tables cga_info8 and cga_info9 as instructed in section 4.3.2.4.

Create table cga_info8:

> beeline
> use stuXX_db;

> create table cga_info8(name string,age int) row format delimited fields
terminated by ',' stored as textfile;

Create a text file, for example cga8.dat, with the following content:


> cd /home/userXX
> touch cga8.dat
> vim cga8.dat

GuoYijun,5
YuanJing,10
Liyuan,20

Load the data into table cga_info8:

> beeline
> use stuXX_db;
> load data local inpath '/home/userXX/cga8.dat' into table cga_info8;

Create table cga_info9:

> create table cga_info9(name string,gender string) row format delimited fields
terminated by ',' stored as textfile;

Create a text file, for example cga9.dat, with the following content:


> cd /home/userXX
> touch cga9.dat
> vim cga9.dat

YuanJing,male
Liyuan,male
LiuYang,female
Lilei,male

Load the data into table cga_info9:

> beeline
> use stuXX_db;
> load data local inpath '/home/userXX/cga9.dat' into table cga_info9;

Query the content in cga_info8.

> select * from cga_info8;

+-----------------+-------------------+
| cga_info8.name | cga_info8.age |
+-----------------+-------------------+
| GuoYijun | 5 |
| YuanJing | 10 |
| Liyuan | 20 |
+-----------------+-------------------+
3 rows selected (0.212 seconds)

Query the content in cga_info9.

> select * from cga_info9;


+-----------------+-------------------+
| cga_info9.name | cga_info9.gender |
+-----------------+-------------------+
| YuanJing | male |
| Liyuan | male |
| LiuYang | female |
| Lilei | male |
+-----------------+-------------------+
4 rows selected (0.227 seconds)

4.3.4.1 join/inner join


The join and inner join operations are equivalent. They associate the records of two tables and
return only the rows that match in both tables.
The following statement uses join to associate information about the same person in the cga_info8
and cga_info9 tables.

> select * from cga_info9 a join cga_info8 b on a.name=b.name;


+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| YuanJing | male | YuanJing | 10 |
| Liyuan | male | Liyuan | 20 |
+-----------+-----------+-----------+--------+
2 rows selected (24.954 seconds)

The following statement uses inner join to associate information about the same person in the
cga_info8 and cga_info9 tables.

> select * from cga_info9 a inner join cga_info8 b on a.name=b.name;


+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| YuanJing | male | YuanJing | 10 |
| Liyuan | male | Liyuan | 20 |
+-----------+-----------+-----------+--------+
2 rows selected (25.07 seconds)

4.3.4.2 left join


left join: Indicates left external association. The table before left join is used as the primary table to
associate with the other table. The returned number of records is the same as that in the primary
table. The fields that cannot be associated are set to NULL.
Use left join to associate information about the same person in the cga_info8 and cga_info9 tables.

select * from cga_info9 a left join cga_info8 b on a.name=b.name;


+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| YuanJing | male | YuanJing | 10 |
| Liyuan | male | Liyuan | 20 |
| LiuYang | female | NULL | NULL |
| Lilei | male | NULL | NULL |
+-----------+-----------+-----------+--------+
4 rows selected (24.324 seconds)

4.3.4.3 right join


right join: Indicates right external association. The table after right join is used as the primary table to
associate with the table before right join. The returned number of records is the same as that in the
primary table. The fields that cannot be associated are set to NULL.
Use right join to associate information about the same person in the cga_info8 and cga_info9 tables.

> select * from cga_info9 a right join cga_info8 b on a.name=b.name;


+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| NULL | NULL | GuoYijun | 5 |
| YuanJing | male | YuanJing | 10 |
| Liyuan | male | Liyuan | 20 |
+-----------+-----------+-----------+--------+
3 rows selected (23.225 seconds)

4.3.4.4 full join


full join: Indicates full external association. The records of both tables are used as the benchmark.
All records of the two tables are returned and matched where possible. The fields that cannot be associated are
NULL.

Use full join to associate information about the same person in the cga_info8 and cga_info9 tables.

> select * from cga_info9 a full join cga_info8 b on a.name=b.name;


+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| NULL | NULL | GuoYijun | 5 |
| Lilei | male | NULL | NULL |
| LiuYang | female | NULL | NULL |
| Liyuan | male | Liyuan | 20 |
| YuanJing | male | YuanJing | 10 |
+-----------+-----------+-----------+--------+
5 rows selected (26.763 seconds)

4.3.4.5 left semi join


left semi join: The table before left semi join is used as the primary table. Only the rows of the primary
table whose join key also exists in the associated table are returned, and only the primary table's columns appear in the output.
Use left semi join to associate information about the same person in the cga_info8 and cga_info9
tables.

> select * from cga_info9 a left semi join cga_info8 b on a.name=b.name;


+-----------+-----------+
| a.name | a.gender |
+-----------+-----------+
| YuanJing | male |
| Liyuan | male |
+-----------+-----------+
2 rows selected (24.96 seconds)

4.3.4.6 map join


map join is an optimization function of Hive and applies to joining a small table to a large
table. The join is performed in the Map phase in memory; therefore there is no need to start the
Reduce task or go through the shuffle phase, which saves resources and improves join efficiency.
Use map join to associate information about the same person in the cga_info8 and cga_info9 tables.

> select /*+ mapjoin(b)*/ * from cga_info9 a join cga_info8 b on a.name=b.name;
+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| YuanJing | male | YuanJing | 10 |
| Liyuan | male | Liyuan | 20 |
+-----------+-----------+-----------+--------+
2 rows selected (25.129 seconds)
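Note: Besides the /*+ mapjoin */ hint, Hive can also convert a join to a map join automatically when one of the tables is small enough. The following parameters control this behavior; the values shown are only an illustration, and the actual defaults depend on the cluster configuration.

> set hive.auto.convert.join=true;
> set hive.mapjoin.smalltable.filesize=25000000;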

4.3.5 Hive on Spark Operation


On the beeline client, set the computing engine to Spark.

> set hive.execution.engine=spark;


No rows affected (0.004 seconds)

Query the sum of time values of people in the cga_info3 table based on the name and gender.

> select name,gender,sum(timest) timest from cga_info3 group by name,gender;

If this query does not work because of insufficient memory or resources, you can run a simple
query without grouping:
> select name,gender from cga_info3;

+-----------+---------+---------+
| name | gender | timest |
+-----------+---------+---------+
| xiaochen | female | 28 |
| xiaoli | female | 40 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaozhao | female | 40 |
| xiaozhou | male | 33 |
+-----------+---------+---------+
6 rows selected (1.213 seconds)

Compared with the result of Example 5 in section 4.3.3, the query on Hive on Spark takes about 1
second, which is much faster than on Hive on MapReduce. (The time will differ in your environment;
it depends on the YARN settings.)
To switch execution back to MapReduce:

> set hive.execution.engine=mr;
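You can check which engine is currently in use at any time; running set with the parameter name but no value prints the current setting:

> set hive.execution.engine;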

4.3.6 Associating a Hive Table with an HBase Table


Step 1 Enter the HBase shell.

> hbase shell

Step 2 Create HBase table stuXX_student.

> create 'stuXX_student','info'

0 row(s) in 0.4750 seconds


=> Hbase::Table - stuXX_student

Step 3 Enter information in the HBase table.

> put 'stuXX_student','001','info:name','lilei'


0 row(s) in 0.1310 seconds
> put 'stuXX_student','002','info:name','tom'
0 row(s) in 0.0210 seconds

Step 4 View information in the table.

> scan 'stuXX_student'


ROW COLUMN+CELL
001 column=info:name, timestamp=1523544015712, value=lilei

002 column=info:name, timestamp=1523544040443, value=tom


2 row(s) in 0.0370 seconds

Step 5 Create a Hive external table cga_hbase_hive and associate it with the student table.

> beeline
> use stuXX_db;
> create external table cga_hbase_hive (key int,gid map<string,string>)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with
SERDEPROPERTIES ("hbase.columns.mapping" ="info:") TBLPROPERTIES
("hbase.table.name" ="stuXX_student");

No rows affected (0.433 seconds)

Step 6 Query the content in the cga_hbase_hive table.

> select * from cga_hbase_hive;


+---------------------+---------------------+
| cga_hbase_hive.key | cga_hbase_hive.gid |
+---------------------+---------------------+
| 1 | {"name":"lilei"} |
| 2 | {"name":"tom"} |
+---------------------+---------------------+
2 rows selected (0.477 seconds)

Step 7 Query name information in the cga_hbase_hive table.

> select gid['name'] from cga_hbase_hive;


+--------+
| _c0 |
+--------+
| lilei |
| tom |
+--------+
2 rows selected (0.733 seconds)

The experiment result shows that the Hive table is associated with the HBase table.
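Because the Hive table only references the HBase table, a row that is newly written to HBase is visible in Hive immediately. As an optional check, run the put in the HBase shell and the select in beeline:

> put 'stuXX_student','003','info:name','jerry'
> select * from cga_hbase_hive;

The query result now contains a third row with key 3.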
----End

4.3.7 Merging Small Hive Files


Step 1 Upload two small files to the /user/hive/warehouse/stuXX_db.db/cga_info1 folder of the HDFS
and check the folder content.

> hdfs dfs -put /home/userXX/cga8.dat /user/hive/warehouse/stuXX_db.db/cga_info1

> hdfs dfs -put /home/userXX/cga9.dat /user/hive/warehouse/stuXX_db.db/cga_info1

> hdfs dfs -ls -h /user/hive/warehouse/stuXX_db.db/cga_info1

Found 2 items

…… stu01 hive 17 2018-04-13 15:32


/user/hive/warehouse/stu01_db.db/cga_info1/cga8.dat
…… stu01 hive 15 2018-04-13 15:32
/user/hive/warehouse/stu01_db.db/cga_info1/cga9.dat

The /user/hive/warehouse/stuXX_db.db/cga_info1 folder contains two files.

Step 2 On the Hive client, set the parameter of whether to merge Reduce output files to true.

> beeline
> use stuXX_db;
> set hive.merge.mapredfiles= true;

No rows affected (0.037 seconds)

Step 3 Create table cga_info10 and load the content of table cga_info1 to the new table.

> create table cga_info10 as select * from cga_info1;


No rows affected (20.93 seconds)

Step 4 View the files of table cga_info10 in the HDFS.

> hdfs dfs -ls -h /user/hive/warehouse/stuXX_db.db/cga_info10

18/04/13 15:38:23 INFO hdfs.PeerCache: SocketCache disabled.


Found 1 items
-rw-------+ 3 stu01 hive 110 2018-04-13 15:34
/user/hive/warehouse/cga_info10/000000_0

The result shows that the two small files that should be output in the Reduce phase have been
merged into one because the parameter setting has been modified in step 2.
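Note: hive.merge.mapredfiles only switches merging on for reduce-side output. The related parameters below also affect merging of small files; the values shown are listed only for reference, and the actual defaults depend on the cluster.

> set hive.merge.mapfiles=true;
> set hive.merge.size.per.task=256000000;
> set hive.merge.smallfiles.avgsize=16000000;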
----End

4.3.8 Hive Column Encryption


Currently, Hive columns are encrypted using the AES algorithm.
The encryption must be specified during table creation. The encryption class name corresponding to
AES is org.apache.hadoop.hive.serde2.AESRewriter.

Step 1 Create table cga_info11 and encrypt the name column.

> use stuXX_db;


> create table cga_info11(name string,gender string,timest int) row format
serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' with
serdeproperties
('column.encode.columns'='name','column.encode.classname'='org.apache.hadoop.
hive.serde2.AESRewriter') stored as textfile;

No rows affected (1.097 seconds)

Step 2 Load data in table cga_info3 to table cga_info11.



> insert into cga_info11 select * from cga_info3;


No rows affected (21.994 seconds)

Step 3 Query the content in cga_info11.

> select * from cga_info11;


+------------------+--------------------+---------------------+
| cga_info11.name | cga_info11.gender | cga_info11.timest |
+------------------+--------------------+---------------------+
| xiaozhao | female | 20 |
| xiaochen | female | 28 |
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaoli | female | 40 |
| xiaozhou | male | 33 |
+------------------+--------------------+---------------------+
7 rows selected (0.346 seconds)

Step 4 Check the encryption effect.

> hdfs dfs -cat /user/hive/warehouse/stuXX_db.db/cga_info11/000000_0


18/04/13 15:21:52 INFO hdfs.PeerCache: SocketCache disabled.
jR091mQ/LIKY0XBCJi8dsw==female20
BRaQqw7O46X/L1YH1ujKEA==female28
jR091mQ/LIKY0XBCJi8dsw==female20
t84/+Zo8Pxiidltw8rAyTA==male21
J3y40cz4TMGs2uKJfHHaEA==male25
pz64eOp896fiocKrV0IpoA==female40
g/sTgzi4MYs9Uotztgg+BQ==male33

The result shows that the names of all the people in the table are encrypted.
----End

4.3.9 Using Hue to Execute HQL


Step 1 Choose Services > Hue.

Step 2 Click Hue(Active).

Step 3 Move the pointer to Query Editors and choose Hive from the drop-down menu.

Step 4 Write the HQL program in the blank area.


Note: In the panel on the left, select your own database: stuXX_db.

Step 5 After compiling the HQL program, select the computing engine and then click Execute.
Enter a query, for example: select * from cga_info1

ALTERNATIVE INTERFACE: Click the blue triangle (the Execute button) to the left of the panel in
which the query was entered.

Step 6 View results.

The results can also be displayed in charts.


ALTERNATIVE INTERFACE: The chart-building button is displayed as an icon to the left of the table
with the query results, which are shown on the Results tab.

----End

4.4 Summary
This experiment describes how to add, delete, modify, and query data in Hive data warehouses, Hive
on Spark, and how to operate HBase using Hive. In Hive join operations, multiple join methods are
introduced to enable trainees to have a more intuitive understanding of join types and their
differences. This experiment helps trainees to reinforce their comprehension about Hive. Note that
stored as textfile must be specified during table creation when loading data. Otherwise, data cannot
be loaded.

5 Data Import and Export Using Loader

5.1 Background
Data migration operations are frequently involved in Big Data services, especially data migration
between relational databases and Big Data components, for example, data migration between
MySQL and HDFS/HBase. The graphical operations of Loader makes data migration more convenient.

5.2 Objective
⚫ Have a good command of using Loader to perform data migration in service scenarios.

5.3 Experiment Tasks

5.3.1 Importing HBase Data to HDFS


Step 1 Choose Services > Loader.

Step 2 Click LoaderServer(Active).

Step 3 Click New Job.



Step 4 Configure the task name and select the type (select Export to export data from HBase to HDFS).
Job Name: stuXX_cg_hbasetohdfs
Connection: stuXX_hdfs_conn
Queue: DEFAULT

Step 5 Click Add, as shown in the preceding figure. Select hdfs-connector and set Name to a unique
value.

Step 6 Click Test. If Test Success is displayed, the system is available.

Step 7 Configure basic information as shown in the following figure, and then click Next.

Step 8 Select HBASE for Source type. Set Number to the number of Map tasks. Fill in 1 here. Then, click
Next.

Step 9 Click Input on the left, select HBase Input, and drag the HBase Input button to the right area.

Step 10 Click output on the left, select File Output, and drag the File Output button to the right area.

Step 11 Query the content in the cga_info table first in order to configure input and output.

> scan 'stuXX_cga_info'


ROW COLUMN+CELL
123002 column=info:address, timestamp=1523351932415,
value=London
123002 column=info:age, timestamp=1523351887009, value=40
123002 column=info:gender, timestamp=1523351993106,
value=female
123002 column=info:name, timestamp=1523351965188,
value=Victoria
123003 column=info:address, timestamp=1523352194766,
value=Redding
123003 column=info:age, timestamp=1523352108282, value=30
123003 column=info:gender, timestamp=1523352060912,
value=female
123003 column=info:name, timestamp=1523352091677, value=Taylor
123004 column=info:address, timestamp=1523352217267,
value=Cleveland
123004 column=info:age, timestamp=1523352229436, value=33
123004 column=info:gender, timestamp=1523352267416, value=male
123004 column=info:name, timestamp=1523352251926, value=LeBron

3 row(s) in 0.0560 seconds

Step 12 Configure the HBase input. Double-click the HBase Input button on the web UI. Enter table
name stuXX_cga_info, click Add, enter the family name, column name, field name, and type in
sequence, select is rowkey, and click OK.

HBase table name: stuXX_cga_info

Step 13 Double-click the File Output button on the web UI to configure the HDFS output. Specify the
output delimiter, click associate, enter serial numbers in the position column, and then click
OK.

Output delimiter: ,

Step 14 Connect HBase Input and File Output.

Step 15 Click Next to configure To.

Step 16 Enter the output path, select the file format, and click Save and run.
Output path: /tmp/stuXX/cg_hbasetohdfs

Step 17 Check the running result.

Step 18 View the HDFS output.

Find the export file in the /tmp/stuXX/cg_hbasetohdfs folder:

> hdfs dfs -ls /tmp/stuXX/cg_hbasetohdfs

-rw-rw-rw-+ 3 loader hadoop 32 2021-09-17 18:43


/tmp/stu19/cg_hbasetohdfs/export_part_1631277441726_0002_0000000

Then substitute the file name into the command that displays its content:


> hdfs dfs -cat
/tmp/stuXX/cg_hbasetohdfs/export_part_1631277441726_0002_0000000

123002,Victoria,female,40,London
123003,Taylor,female,30,Redding
123004,LeBron,male,33,Cleveland

The output shows that the content of the stuXX_cga_info table is successfully moved to the
export_part_1631277441726_0002_0000000 file in the /tmp/stuXX/cg_hbasetohdfs directory.
----End

5.3.2 Loading HDFS Data to HBase


Step 1 Create a table named stuXX_cg_hdfstohbase.

> hbase shell


> create 'stuXX_cg_hdfstohbase','info'
0 row(s) in 0.4350 seconds
=> Hbase::Table - cg_hdfstohbase

Step 2 Perform the first three steps of section 5.3.1 to go to the page for configuring Loader and
configure basic information.

Name: stuXX_cg_hdfstohbase
Connection: stuXX_hdfs_conn (this connection was created in the previous task)

Queue: DEFAULT

Step 3 Configure From.


Enter /tmp/stuXX/cg_hbasetohdfs/export_part_1631277441726_0002_0000000 in Input path and *
in File filter. Select UTF-8 for Encode type. Then, click Next.

Step 4 Click Input on the left, select CSV File Input, and drag the CSV File Input button to the right
area.

Step 5 Click output on the left, select HBase Output, and drag the HBase Output button to the right
area.

Step 6 Configure the CSV file input. Double-click the CSV File Input button on the web UI. Enter the
delimiter of the table and click Add. Enter the position serial number, field name, and type in
sequence, and then click OK.

Delimiter: ,

Step 7 Configure HBase output. Double-click the HBase Output button on the web UI and click
associate.

Step 8 Select the check boxes in the Name column and click OK.

Step 9 Enter the table name, select rowkey as the primary key, and click OK.
Table Name: stuXX_cg_hdfstohbase

Step 10 Connect CSV File Input and HBase Output.

Step 11 Click Next to configure To. Set Storage type to HBASE_PUTLIST, set Number to 1, and click Save
and run.

Step 12 View results.

Step 13 Query the content in the stuXX_cg_hdfstohbase table in HBase.

> scan 'stuXX_cg_hdfstohbase'


ROW COLUMN+CELL
123002 column=info:address, timestamp=1523623659052, value=London
123002 column=info:age, timestamp=1523623659052, value=40
123002 column=info:gender, timestamp=1523623659052, value=female
123002 column=info:name, timestamp=1523623659052, value=Victoria
123003 column=info:address, timestamp=1523623659052, value=Redding
123003 column=info:age, timestamp=1523623659052, value=30
123003 column=info:gender, timestamp=1523623659052, value=female
123003 column=info:name, timestamp=1523623659052, value=Taylor
123004 column=info:address, timestamp=1523623659052,
value=Cleveland
123004 column=info:age, timestamp=1523623659052, value=33
123004 column=info:gender, timestamp=1523623659052, value=male
123004 column=info:name, timestamp=1523623659052, value=LeBron
3 row(s) in 0.0480 seconds

The result shows that the content of HDFS file
/tmp/stuXX/cg_hbasetohdfs/export_part_1631277441726_0002_0000000 has been loaded to the
stuXX_cg_hdfstohbase table.
----End

5.3.3 Importing HDFS Data to MySQL


Step 1 Create a file named test_mysql.txt in the /home/userXX folder, and write data into the file.

> touch test_mysql.txt


> vim test_mysql.txt

1,Tom,male,8
2,Lily,female,24
3,Lucy,female,50

Step 2 Upload local file test_mysql to the /user/app_stuXX/loader_test directory of the HDFS.

> hdfs dfs -mkdir /user/app_stuXX/loader_test


> hdfs dfs -put /home/userXX/test_mysql.txt /user/app_stuXX/loader_test
> hdfs dfs -ls /user/app_stuXX/loader_test

Found 1 items
-rw-r--r--+ 3 user01 supergroup 47 2018-04-15
13:09/user/app_stu01/loader_test/test_mysql.txt

Step 3 View the content of the test_mysql.txt file in the HDFS.



> hdfs dfs -cat /user/app_stuXX/loader_test/test_mysql.txt


1,tom,male,8
2,lily,female,24
3,lucy,female,50

Step 4 On a Linux node, enter MySQL.

> mysql -uroot -pHuawei@010203

Step 5 Create a database named stuXX_loadertest.

mysql> create database stuXX_loadertest;


Query OK, 1 row affected (0.00 sec)

mysql> set names utf8;


Query OK, 0 rows affected (0.00 sec)

mysql> use stuXX_loadertest;


Database changed

Step 6 Create a table named cga_mysql.

mysql> create table cga_mysql(id int(4) not null primary key auto_increment,
name varchar(255) not null, gender varchar(255) not null, time int(4));

Note: A MySQL table used with Loader must have a primary key.

Step 7 View the structure of the cga_mysql table.

mysql> desc cga_mysql;


+--------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+----------------+
| id | int(4) | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | | NULL | |
| gender | varchar(255) | NO | | NULL | |
| time | int(4) | YES | | NULL | |
+--------+--------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)

Step 8 Copy the MySQL link JAR package to the specified directory of the active and standby Loader.

> cp /FusionInsight-Client/mysql-connector-java-5.1.21.jar
/opt/huawei/Bigdata/FusionInsight_Porter_6.5.1/install/FusionInsight-Sqoop-
1.99.3/FusionInsight-Sqoop-1.99.3/server/webapps/loader/WEB-INF/ext-lib
(to the active node)

> scp /FusionInsight-Client/mysql-connector-java-5.1.21.jar


root@192.168.130.26:/opt/huawei/Bigdata/FusionInsight_Porter_6.5.1/install/Fu
sionInsight-Sqoop-1.99.3/FusionInsight-Sqoop-
1.99.3/server/webapps/loader/WEB-INF/ext-lib
(to the standby node)

NOTE: The files have already been copied. You do not need to run these commands.



Step 9 Check the content in /opt/huawei/Bigdata/FusionInsight_Porter_6.5.1/install/FusionInsight-
Sqoop-1.99.3/FusionInsight-Sqoop-1.99.3/server/webapps/loader/WEB-INF/ext-lib/ on the
active and standby nodes.

> ll /opt/huawei/Bigdata/FusionInsight_Porter_6.5.1/install/FusionInsight-
Sqoop-1.99.3/FusionInsight-Sqoop-1.99.3/server/webapps/loader/WEB-INF/ext-lib

total 940
-rwxr-xr-x 1 root root 118057 Jan 23 11:36 hive-jdbc-1.3.0.jar
-rwxr-xr-x 1 omm wheel 827942 Feb 8 10:36 mysql-connector-java-5.1.21.jar
-rwxr-xr-x 1 omm wheel 18 Nov 23 2015 readme.properties

The result shows that the MySQL link JAR package has been copied to the specified directory of the
active and standby Loader.

Step 10 Restart the Loader.

If mysql-connector-java-x.x.x.jar is already available in the directory of the active and


standby Loader, you do not need to restart the Loader. You need to copy the .jar file on
both the active and standby Loader nodes.

Step 11 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Name: stuXX_cg_hdfstomysql
Queue: DEFAULT

Step 12 Click Edit to start the MySQL connection configuration. The MySQL password is
Huawei@010203. After filling in the information, click Test. After the test is complete, click OK.

Enter the JDBC connection string of your own database in JDBC Connection String, as listed below.


Name: stuXX_mysql
Connector: generic-jdbc-connector
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://192.168.130.24:3306/stuXX_loadertest
or jdbc:mysql://192.168.130.25:3306/stuXX_loadertest (depending on which machine
you are working on)
UserName: root
Password: Huawei@010203

Enter /user/app_stuXX/loader_test in Input directory.

Step 13 Click input on the left, select CSV File Input, and drag the CSV File Input button to the right
area.

Step 14 Click output on the left, select Table Output, and drag the Table Output button to the right
area.

Step 15 Configure the CSV file input. Double-click the CSV File Input button on the web UI. Enter the
delimiter ‘,’ and click Add. Enter the position serial number, field name, and type, and then click
OK.

Note: In the position column, specify the positions starting from 2, that is, 2, 3, 4 (the source file
contains four attributes per line; the first one is the identifier, which is generated automatically as
the primary key in the MySQL table).

Step 16 Double-click the Table Output button on the web UI. Click associate, select the check boxes in
the Name column, and click OK.

Step 17 Enter the field name, table column name, and type, and then click OK.

Step 18 Connect CSV File Input and Table Output.

Step 19 Click Next to start output configuration. Enter the table name and click Save and run.

Step 20 Run Loader jobs and view the result.



Step 21 View the content in the cga_mysql table in MySQL.

mysql> use stuXX_loadertest;


Database changed
mysql> select * from cga_mysql;
+----+------+--------+------+
| id | name | gender | time |
+----+------+--------+------+
| 1 | tom | male | 8 |
| 2 | lily | female | 24 |
| 3 | lucy | female | 50 |
+----+------+--------+------+
3 rows in set (0.00 sec)

The result shows that the content of the test_mysql file in the HDFS has been loaded to the
cga_mysql table of the MySQL database.
----End

5.3.4 Importing MySQL Data to HDFS


Step 1 Prepare a MySQL table using data in the cga_mysql table created in section 5.3.3.

mysql> use stuXX_loadertest;


Database changed

mysql> select * from cga_mysql;


+----+------+--------+------+
| id | name | gender | time |
+----+------+--------+------+
| 1 | tom | male | 8 |
| 2 | lily | female | 24 |
| 3 | lucy | female | 50 |
+----+------+--------+------+
3 rows in set (0.00 sec)

Step 2 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Name: stuXX_cg_mysqltohdfs
Queue: DEFAULT
Connection: stuXX_mysql

Step 3 Click Next to configure From. Enter the table name.


Set the Need partition column parameter to False.

The name of a non-default database table is in the format of "database name.table


name".

Step 4 Click input on the left, select Table Input, and drag the Table Input button to the right area.

Step 5 Click output on the left, select File Output, and drag the File Output button to the right area.

Step 6 Double-click the Table Input button on the web UI. Click Add, enter the position serial number,
field name, and type, and click OK.

Step 7 Double-click the File Output button on the web UI. Configure Output delimiter and click
associate. Then select the check boxes in the Name column and click OK.

Output delimiter: ,

Step 8 Connect Table Input and File Output, and click Next.

Step 9 Configure To. Enter /user/app_stuXX/loader_test for Output directory.

Step 10 Run Loader jobs and view the result.

Step 11 View the result in the HDFS.

> hdfs dfs -ls /user/app_stuXX/loader_test


Found 3 items
…… 2018-04-15 14:41 /user/app_stu01/loader_test/_SUCCESS
…… 2018-04-15 14:41 /
user/app_stu01/loader_test/import_part_1522461215526_0114_0000000
…… 2018-04-15 13:09 /user/app_stu01/loader_test/test_mysql.txt

> hdfs dfs -cat /user/app_stuXX/loader_test/<file_name> (substitute the name of the
corresponding import file)
> hdfs dfs -cat
/user/app_stuXX/loader_test/import_part_1522461215526_0114_0000000

1,tom,male,8
2,lily,female,24
3,lucy,female,50

The result shows that MySQL table cga_mysql has been imported to the
/user/app_stuXX/loader_test directory of the HDFS.
----End

5.3.5 Importing MySQL Data to HBase


Step 1 Prepare a MySQL table using data in the cga_mysql table created in section 5.3.3.

mysql> use stuXX_loadertest;


mysql> select * from cga_mysql;
+----+------+--------+------+
| id | name | gender | time |
+----+------+--------+------+
| 1 | tom | male | 8 |
| 2 | lily | female | 24 |
| 3 | lucy | female | 50 |
+----+------+--------+------+
3 rows in set (0.00 sec)

Step 2 Create an HBase table named stuXX_cg_mysqltohbase.

> hbase shell


> create 'stuXX_cg_mysqltohbase','info'
0 row(s) in 0.5300 seconds

Step 3 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Name: stuXX_cg_mysqltohbase
Connection: stuXX_mysql
Queue: DEFAULT

Step 4 Click Next to configure From. Enter the table name.


Set the Need partition column parameter to False.

Step 5 Click input on the left, select Table Input, and drag the Table Input button to the right area.

Step 6 Click output on the left, select HBase Output, and drag the HBase Output button to the right
area.

Step 7 Double-click the Table Input button on the web UI. Click Add, enter the position serial number,
field name, and type, and click OK.

Step 8 Configure the HBase output. Double-click the HBase Output button on the web UI. Click
associate, select the check boxes in the Name column, and click OK.

Step 9 Enter the HBase table name, column family name, column name, and type, select id as the
primary rowkey, and click OK.

Table Name: stuXX_cg_mysqltohbase

Step 10 Connect Table Input and HBase Output, and click Next.

Step 11 Set Storage type to HBASE_PUTLIST, HBase instance to HBase, and Number to 1, and then click
Save and run.

Step 12 Run Loader jobs and view the result.

Step 13 View data in the HBase table.

> scan 'stuXX_cg_mysqltohbase'


ROW COLUMN+CELL
2018-04-15 15:21:33,777 INFO [hconnection-0xaa61e4e-shared--pool4-t1]
ipc.AbstractRpcClient: RPC Server Kerberos principal name for
service=ClientService is hbase/hadoop.hadoop.com@HADOOP.COM
1 column=info:gender, timestamp=1523776665511, value=male
1 column=info:name, timestamp=1523776665511, value=tom
1 column=info:time, timestamp=1523776665511, value=8
2 column=info:gender, timestamp=1523776665511,
value=female
2 column=info:name, timestamp=1523776665511, value=lily
2 column=info:time, timestamp=1523776665511, value=24
3 column=info:gender, timestamp=1523776665511,
value=female
3 column=info:name, timestamp=1523776665511, value=lucy
3 column=info:time, timestamp=1523776665511, value=50
3 row(s) in 0.0700 seconds

The result shows that MySQL table cga_mysql has been successfully loaded to HBase table
stuXX_cg_mysqltohbase.
----End

5.3.6 Importing HBase Data to MySQL


Step 1 Use HBase table stuXX_cg_mysqltohbase created in section 5.3.5. Create table
cga_hbasetomysql in the MySQL database.

mysql> use stuXX_loadertest;


mysql> create table cga_hbasetomysql(id int(4) not null primary key
auto_increment, name varchar(255) not null, gender varchar(255) not null,
time int(4));

Query OK, 0 rows affected (0.09 sec)

Step 2 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Name: stuXX_cg_hbasetomysql
Connection: stuXX_mysql
Queue: DEFAULT

Step 3 Click Next to configure From. Set Source type to HBASE and Number to 1.

Step 4 Click input on the left, select HBase Input, and drag the HBase Input button to the right area.

Step 5 Click output on the left, select Table Output, and drag the Table Output button to the right
area.

Step 6 Configure the HBase input. Double-click the HBase Input button on the web UI. Enter the HBase
table name, click Add, enter the family name, column name, field name, and type in sequence,
select id as the rowkey, and click OK.

HBase table name: stuXX_cg_mysqltohbase

Step 7 Double-click the Table Output button on the web UI. Click associate, select the check boxes in
the Name column, and click OK.

Step 8 Connect HBase Input and Table Output, and click Next.

Step 9 Configure To. Set Table name to cga_hbasetomysql, and click Save and run.

Step 10 Run Loader jobs and view the result.

Step 11 View the content in MySQL table cga_hbasetomysql.

mysql> use stuXX_loadertest;


mysql> select * from cga_hbasetomysql;
+----+------+--------+------+
| id | name | gender | time |
+----+------+--------+------+
| 1 | tom | male | 8 |
| 2 | lily | female | 24 |
| 3 | lucy | female | 50 |
+----+------+--------+------+

3 rows in set (0.00 sec)

The result shows that the content of HBase table cg_mysqltohbase has been successfully loaded to
MySQL table cga_hbasetomysql.
----End

5.3.7 Importing MySQL Data to Hive


Step 1 Use data in MySQL table cga_mysql in 5.3.3 and create Hive table stuXX_db.cg_mysqltohive.
Create the /user/hive/warehouse/stuXX_db.db/cg_mysqltohive directory in the HDFS.

> hdfs dfs -mkdir /user/hive/warehouse/stuXX_db.db/cg_mysqltohive

Create table stuXX_db.cg_mysqltohive in Hive.

> use stuXX_db;


> create table cg_mysqltohive(id int,name string,gender string,timest int)
row format delimited fields terminated by ',' stored as textfile location
'/user/hive/warehouse/stuXX_db.db/cg_mysqltohive';
No rows affected (0.372 seconds)

Step 2 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Name: stuXX_cg_mysqltohive
Connection: stuXX_mysql
Queue: DEFAULT

Step 3 Click Next to configure From. Set the table name to cga_mysql.
Need partition column: false

Step 4 Click Next to start transform configuration. Click input on the left, select Table Input, and drag
the Table Input button to the right area.

Step 5 Click output on the left, select Hive Output, and drag the Hive Output button to the right area.

Step 6 Configure table input. Double-click the Table Input button on the web UI. Click Add, enter the
position serial number, field name, and type, and click OK.

Step 7 Configure the Hive output. Double-click the Hive Output button on the web UI. Click associate,
select the check boxes in the Name column, and click OK.

Hive File Storage Format: CSV

Step 8 Complete the position information and click OK.


id: INTEGER
name: STRING
gender: STRING
time: INTEGER

Step 9 Connect Table Input and Hive Output, and click Next.

Step 10 Configure To. Set Storage type to HIVE, Output directory to


/user/hive/warehouse/stuXX_db.db/cg_mysqltohive, and Number to 1. Then click Save and
run.

Step 11 Run Loader jobs and view the result.



Step 12 View Hive table cg_mysqltohive.

> use stuXX_db;


> select * from cg_mysqltohive;
+---------------------+-----------------------+-------------------------+-------------------------+
| cg_mysqltohive.id | cg_mysqltohive.name | cg_mysqltohive.gender | cg_mysqltohive.timest |
+---------------------+-----------------------+-------------------------+-------------------------+
| 1 | tom | male | 8 |
| 2 | lily | female | 24 |
| 3 | lucy | female | 50 |
+---------------------+-----------------------+-------------------------+-------------------------+
3 rows selected (0.369 seconds)

The result shows that the content of MySQL table cga_mysql has been successfully loaded to Hive
table stuXX_cg_mysqltohive.
----End

5.4 Summary
This experiment describes how to use Loader in various service scenarios. After the experiment,
trainees are expected to be able to solve problems that occur during data migration. Note that you
need to create tables before migrating table data between MySQL, HBase, and Hive. When an
experiment is performed on the MySQL database using Loader, the MySQL table must have a primary
key.

6 Flume Data Collection Practice

6.1 Background
Flume is an important data collection tool among Big Data components. Flume is often used to
collect data from various data sources for other components to analyze. In the log analysis service,
you need to collect server logs to check whether the server is running properly. In real-time services,
data is often collected into Kafka for analysis and processing by real-time components such as
Streaming or Spark. Flume plays an important role in Big Data services.

6.2 Objective
⚫ Understand how to configure Flume and use it to collect data.

6.3 Experiment Tasks

6.3.1 Collecting spooldir Data to the HDFS


6.3.1.1 Using the Configuration Planning Tool to Generate File
properties.properties
Step 1 Download the configuration tool for Flume.
Download the tool at:
http://support.huawei.com/enterprise/docinforeader.action?contentId=DOC1000104118&idPath=79
19749|7919788|19942925|21110924|21112790|21112791|21624194|21830200

Flume is used here to monitor a file directory. Data is saved to the HDFS. The
channel type is memory.

Step 2 Configure the source.


In the Flume Configuration table of the configuration planning tool, click Add Source.
Set SourceName to a1, spoolDir to /home/userXX/spooldir (create spooldir in /home/userXX and
change the permission to 755), set channels to ch1, and retain the default values for other
parameters.
> mkdir /home/userXX/spooldir
> chmod -R 755 /home/userXX/spooldir

The path varies depending on the account. Here, user01 is used as an example.

Step 3 Configure channel information.


Click Add Channel. Set ChannelName to ch1, type to memory, and retain the default values of other parameters.
Step 4 Configure the sink.


Click Add Sink. Set SinkName to s1, type to hdfs, hdfs.path to /user/app_stuXX/flume, and use
default values for other parameters.
> hdfs dfs -mkdir /user/app_stuXX/flume

Set hdfs.kerberosPrincipal to a cluster user in FusionInsight Manager stuXX, for example, stu01. The
path of hdfs.kerberosKeytab is the path where the file is stored in Linux, for example,
/home/userXX/flumetest. Set the permission of the file to 755.

Configure channel to ch1:

> mkdir /home/userXX/flumetest


> chmod -R 755 /home/userXX/flumetest

Step 5 Generate a configuration file.


Click Generate a configuration file; a file named properties.properties is generated automatically.
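For reference, the generated file usually resembles the sketch below. It uses the standard Flume properties format; the agent name client and the exact keys and defaults written by the planning tool are assumptions and may differ in your version, and userXX/stuXX are placeholders:

client.sources = a1
client.channels = ch1
client.sinks = s1

client.sources.a1.type = spooldir
client.sources.a1.spoolDir = /home/userXX/spooldir
client.sources.a1.channels = ch1

client.channels.ch1.type = memory

client.sinks.s1.type = hdfs
client.sinks.s1.hdfs.path = /user/app_stuXX/flume
client.sinks.s1.hdfs.kerberosPrincipal = stuXX
client.sinks.s1.hdfs.kerberosKeytab = /home/userXX/flumetest/user.keytab
client.sinks.s1.channel = ch1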

Step 6 Upload the properties.properties file to the cluster node directory.


Open WinSCP, enter the host name (192.168.130.24 or 192.168.130.25, depending on which machine you are working on), the user name userXX, and the password, and click Login.

Upload the file to the /home/userXX/flumetest directory.

Step 7 Check the Flume data collection result.

> hdfs dfs -ls /user/app_stuXX/flume


> ls /home/userXX/flumetest

----End

6.3.1.2 Installing a Flume Client


Step 1 Decompress the Flume client.

> cp /FusionInsight-Client/Flume/FusionInsight_Cluster_1_Flume_Client.tar
/home/userXX/

> tar -xvf FusionInsight_Cluster_1_Flume_Client.tar

Two files are generated after the decompression:


FusionInsight_Cluster_1_Flume_ClientConfig.tar and
FusionInsight_Cluster_1_Flume_ClientConfig.tar.sha256
Run the tar command to further decompress FusionInsight_Cluster_1_Flume_ClientConfig.tar.

> tar -xvf FusionInsight_Cluster_1_Flume_ClientConfig.tar

The FusionInsight_Cluster_1_Flume_ClientConfig directory is generated.


The directories aix, batch_install, flume, and upgrade, and the file install.sh are obtained.

The path where these files are located:

/home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient

> ls /home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient

aix batch_install flume install.sh upgrade

Step 2 Obtain krb5.conf and user.keytab files.

Log in to FusionInsight Manager using a FusionInsight Manager account stuXX, for example, stu01, and choose System > Rights Configuration > User Management. (In the newer interface, choose System > User > stuXX instead.)

In the Operation column of the corresponding account, click the Download icon to download the krb5.conf and user.keytab files.
Note for the newer interface: for user stuXX, click More > Download Authentication Credential and download a file in the format stu20_1595500715775_keytab.tar to your computer. Then unpack it to obtain the two files krb5.conf and user.keytab.
Use the WinScp tool to upload krb5.conf and user.keytab files to the /home/userXX/flumetest
directory.

Step 3 Create a file named jaas.conf.


Create a file named jaas.conf in the /home/userXX/flumetest directory decompressed in step 1.

> touch jaas.conf

Edit the jaas.conf file.

> vim jaas.conf

The content of the file is as follows:

Client {
com.sun.security.auth.module.Krb5LoginModule required
storeKey=true
principal="stuXX"
useTicketCache=false
keyTab="/home/userXX/flumetest/user.keytab"
debug=true
useKeyTab=true;
};

(principal indicates the user created on FusionInsight Manager; keyTab indicates the path of that user's authentication file user.keytab in Linux.)

Step 4 Modify the flume-env.sh file.


The flume-env.sh file exists in the flume/conf directory of the decompressed file on the Flume client.
Add the following content at the end of JAVA_OPTS:
The path where the file is located:

> cd /home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient/flume/conf

> vim flume-env.sh

In the file, add the following lines at the end of JAVA_OPTS:

-Djava.security.krb5.conf=/home/userXX/flumetest/krb5.conf
-Djava.security.auth.login.config=/home/userXX/flumetest/jaas.conf
-Dzookeeper.server.principal=zookeeper/hadoop.hadoop.com
-Dzookeeper.request.timeout=120000
(Note: The values of -Djava.security.auth.login.config and -Djava.security.krb5.conf must be the corresponding paths in the cluster.)

The final content of the JAVA_OPTS line in the file:

JAVA_OPTS="-Xms2G -Xmx4G -XX:CMSFullGCsBeforeCompaction=1 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=15 -XX:GCLogFileSize=1M -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:${FLUME_GC_LOG_DIR}/Flume-Client-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Djava.security.krb5.conf=/home/userXX/flumetest/krb5.conf -Djava.security.auth.login.config=/home/userXX/flumetest/jaas.conf -Dzookeeper.server.principal=zookeeper/hadoop.hadoop.com -Dzookeeper.request.timeout=120000"

On the installed HDFS client, copy hdfs-site.xml and core-site.xml from hdfs_client/HDFS/hadoop/etc/hadoop/ to /home/userXX/flumetest, and copy hbase-site.xml from hbase_client/HBase/hbase/conf on the HBase client to /home/userXX/flumetest.

> cd /home/userXX/hadoopclient/HDFS/hadoop/etc/hadoop
> cp hdfs-site.xml /home/userXX/flumetest
> cp core-site.xml /home/userXX/flumetest
> cd /home/userXX/hadoopclient/HBase/hbase/conf
> cp hbase-site.xml /home/userXX/flumetest

View the content in the /home/userXX/flumetest folder.

> cd /home/userXX/flumetest
> ll
-rw------- 1 user01 users 8563 Apr 16 22:49 core-site.xml
-rw------- 1 user01 users 9830 Apr 16 22:50 hbase-site.xml
-rw------- 1 user01 users 15277 Apr 16 22:48 hdfs-site.xml
-rw-r--r-- 1 user01 users 199 Apr 16 22:23 jaas.conf
-rw-r--r-- 1 user01 users 757 Apr 15 20:24 krb5.conf
-rw-r--r-- 1 user01 users 2119 Apr 16 21:12 properties.properties
-rw-r--r-- 1 user01 users 126 Apr 15 20:24 user.keytab

Step 5 Install the client. (If you use a non-root user, it is recommended that the installation directory
not contain too many levels; otherwise, the installation may fail.)

> cd /home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient

> ./install.sh -d /home/userXX/flume1 -c /home/userXX/flumetest/properties.properties -l /var/log/Bigdata/

If the message [flume-client install]: install flume client successfully is displayed, the client is installed successfully.

Parameter description:
-d: installation path of the Flume client
-f: service IP addresses of the two MonitorServer roles, separated by a comma (,). This parameter is optional. If it is not set, the Flume client does not send alarm information to the MonitorServer.
-c: configuration file, which is optional. After the installation, you can configure Flume role client parameters by modifying /opt/FlumeClient/fusioninsight-flume-1.6.0/conf/properties.properties.
-l: log directory. This parameter is optional. The default value is /var/log/Bigdata. (The user running the installation needs to have the write permission on the directory.)

Step 6 Check /home/userXX/spooldir. If .flumespool is displayed, the configuration is successful.

> ll /home/userXX/spooldir -a
total 408
drwxrwxrwx 3 root root 4096 Feb 9 13:44 .
drwxrwxrwx 81 root root 12288 Apr 16 23:26 ..
drwxrwxrwx 2 omm wheel 4096 Jan 26 23:46 .flumespool
-rwxrwxrwx 1 root root 389592 Jan 26 23:45 zypper.log.COMPLETED
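As an optional sanity check (not part of the original steps), you can drop a test file into the monitored directory and confirm that Flume uploads it to HDFS; the file name below is only an example:

> echo "hello flume" > /home/userXX/spooldir/test.txt
> hdfs dfs -ls /user/app_stuXX/flume

After Flume processes the file, it is renamed test.txt.COMPLETED in the spooldir directory, as with zypper.log.COMPLETED above.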

----End

6.3.2 Collecting avro Data to the HDFS


The collection of avro data sources using Flume is the collection of serialized data, which involves
port configuration.

6.3.2.1 Using the Configuration Planning Tool to Generate File properties.properties
Step 1 Set Flume name to client.

Step 2 Set the source type to avro, set the listening IP address and port number, set channels to ch2,
and click Add Source.

IP: 192.168.130.24 or 192.168.130.25 (depending on which machine you are working on)
Port Number: 8181
Step 3 Configure channel parameters.


Click Add Channel. Set ChannelName to ch2, type to memory, and retain the default values of other parameters.

Step 4 Configure Sink parameters.


Click Add Sink.
Set SinkName to s2.
Set Hdfs.path to /user/app_stuXX/flume_avro.
Set authentication information, including the authentication account and the address of the
authentication file. The authentication account and the authentication file can be the same as those
in 6.3.1.

Be sure to specify the channel name: channel: ch2


If the cluster is deployed in secure mode, you need to configure parameters hdfs.kerberosPrincipal
and hdfs.kerberosKeytab. If the cluster is in non-secure mode, you do not need to configure the two
parameters.
Create the /home/userXX/flumetest2 directory and set the permission to 755.
Also create the /user/app_stuXX/flume_avro directory in HDFS:
> hdfs dfs -mkdir /user/app_stuXX/flume_avro
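For example, mirroring the commands used for the flumetest directory in 6.3.1:

> mkdir /home/userXX/flumetest2
> chmod -R 755 /home/userXX/flumetest2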

Step 5 Generate a configuration file.


Click Generate a configuration file, and upload the properties.properties configuration file to the specified directory of the cluster, such as /home/userXX/flumetest2. The properties.properties file generated in 6.3.1 will be overwritten.
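For reference, a sketch of the corresponding avro-to-HDFS configuration is shown below. The source name a2 is a hypothetical name chosen here for illustration; the agent name, the exact keys and defaults produced by the planning tool, and the IP address may differ in your environment:

client.sources = a2
client.channels = ch2
client.sinks = s2

client.sources.a2.type = avro
client.sources.a2.bind = 192.168.130.24
client.sources.a2.port = 8181
client.sources.a2.channels = ch2

client.channels.ch2.type = memory

client.sinks.s2.type = hdfs
client.sinks.s2.hdfs.path = /user/app_stuXX/flume_avro
client.sinks.s2.hdfs.kerberosPrincipal = stuXX
client.sinks.s2.hdfs.kerberosKeytab = /home/userXX/flumetest2/user.keytab
client.sinks.s2.channel = ch2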

Step 6 Obtain the krb5.conf and user.keytab files.


For details, see step 2 in 6.3.1.2.

Step 7 Create a file named jaas.conf.


For details, see step 3 in 6.3.1.2.

In the jaas.conf file, change the keyTab line to: keyTab="/home/userXX/flumetest2/user.keytab"


Modify the flume-env.sh file.

The path where the file is located:

> cd /home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient/flume/conf
The flume-env.sh file exists in the flume/conf directory of the decompressed file on the Flume
client. Add the following content at the end of JAVA_OPTS:

> vim flume-env.sh


-Djava.security.krb5.conf=/home/userXX/flumetest2/krb5.conf
-Djava.security.auth.login.config=/home/userXX/flumetest2/jaas.conf
-Dzookeeper.server.principal=zookeeper/hadoop.hadoop.com
-Dzookeeper.request.timeout=120000

----End

6.3.2.2 Creating a Flume Job


Step 1 Reinstall the Flume instance to /home/userXX/flume2.

> cd /home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient

> ./install.sh -d /home/userXX/flume2 -c /home/userXX/flumetest2/properties.properties -l /var/log/Bigdata/

If message [flume-client install]: install flume client successfully is displayed, the client is installed
successfully.
Note: After the installation, you can run the ps -ef | grep flume | grep username command to check
the Flume service status.

Step 2 Submit data to the avro port.


Copy the flumeavroclient.jar file from the /FusionInsight-Labs directory to the /home/userXX directory. flumeavroclient.jar submits Hello and World to port 8181 100 times, and the collected data is saved in the over.tmp file.

> cd /home/userXX
> java -cp flumeavroclient.jar org.myorg.SSLAvroclient
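Before submitting data, you can optionally copy the jar with a command and confirm that the avro source is listening on the configured port (the paths and port come from the previous steps; netstat output format varies by operating system):

> cp /FusionInsight-Labs/flumeavroclient.jar /home/userXX/
> netstat -an | grep 8181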

Step 3 Check the data collection result in the HDFS.

> hdfs dfs -ls /user/app_stuXX/flume_avro

-rw-r--r-- 3 user01 supergroup 1300 2018-04-17 /user/app_stu01/flume_avro/over.tmp

----End

6.4 Summary
This experiment describes how to collect data from spooldir and avro data sources using Flume. Through this experiment, trainees are expected to master offline and real-time data collection and gain a better understanding of Flume.
7 Comprehensive Cluster Experiment

7.1 Background
In Big Data services, multiple components are usually built into a service system to meet the
requirements of upper-layer services.
This experiment combines the preceding components to build a Big Data analysis and real-time
query platform.
Loader periodically migrates MySQL database data to Hive first. As Hive data is stored in HDFS,
Loader is used to load data in HDFS to HBase. HBase is used to query data in real time, and the big
data processing capability of Hive is used to analyze related results.

7.2 Objective
⚫ Use Big Data components comprehensively to convert and query data in real time.

7.3 Experiment Tasks

7.3.1 Offline Data Collection and Analysis and Real-Time Query


Involving MySQL, Loader, Hive, and HBase
7.3.1.1 Preparing MySQL Data
Step 1 Log in to the MySQL server.
MySQL is installed on a FusionInsight HD cluster node.

> mysql -uroot -pHuawei@010203

Welcome to the MySQL monitor. Commands end with ; or \g.


Your MySQL connection id is 135
Server version: 5.5.48 MySQL Community Server (GPL)
Type 'help;' or '\h' for help. Type '\c' to clear the current input
statement.
mysql>
Step 2 Use database stuXX_loadertest. (The database was already created in section 5.3.3; there is no need to create a new one.)

mysql> use stuXX_loadertest;

Step 3 Create table socker. (No primary key is defined in the DDL below; a partition column is specified in the Loader job in 7.3.1.2 instead.)

mysql> DROP TABLE IF EXISTS socker;


mysql> CREATE TABLE socker (
time varchar(50) DEFAULT NULL,
open float DEFAULT NULL,
high float DEFAULT NULL,
low float DEFAULT NULL,
close float DEFAULT NULL,
volume varchar(50) DEFAULT NULL,
endprice float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Step 4 Load data to socker.


Copy the socker.csv file from the /FusionInsight-Labs directory to the home directory on a local host.

> cp /FusionInsight-Labs/socker.csv /home/userXX

Load data in socker.csv to the socker table using the tool on the MySQL client.

mysql> LOAD DATA INFILE "/home/userXX/socker.csv" INTO TABLE socker FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
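If MySQL rejects this statement because of the secure_file_priv setting, a common workaround (an assumption about your MySQL configuration, not part of the original lab) is to load the file from the client side instead, provided local_infile is enabled:

mysql> LOAD DATA LOCAL INFILE "/home/userXX/socker.csv" INTO TABLE socker FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';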

Step 5 View data in socker.

mysql> select * from socker limit 10;


+------------+-------+-------+-------+-------+----------+----------+
| time | open | high | low | close | volume | endprice |
+------------+-------+-------+-------+-------+----------+----------+
| 1970-01-02 | 92.06 | 93.54 | 91.79 | 93 | 8050000 | 93 |
| 1970-01-05 | 93 | 94.25 | 92.53 | 93.46 | 11490000 | 93.46 |
| 1970-01-06 | 93.46 | 93.81 | 92.13 | 92.82 | 11460000 | 92.82 |
| 1970-01-07 | 92.82 | 93.38 | 91.93 | 92.63 | 10010000 | 92.63 |
| 1970-01-08 | 92.63 | 93.47 | 91.99 | 92.68 | 10670000 | 92.68 |
| 1970-01-09 | 92.68 | 93.25 | 91.82 | 92.4 | 9380000 | 92.4 |
| 1970-01-12 | 92.4 | 92.67 | 91.2 | 91.7 | 8900000 | 91.7 |
| 1970-01-13 | 91.7 | 92.61 | 90.99 | 91.92 | 9870000 | 91.92 |
| 1970-01-14 | 91.92 | 92.4 | 90.88 | 91.65 | 10380000 | 91.65 |
| 1970-01-15 | 91.65 | 92.35 | 90.73 | 91.68 | 11120000 | 91.68 |
+------------+-------+-------+-------+-------+----------+----------+

----End
7.3.1.2 Loading MySQL Data to Hive


Step 1 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Click Edit to set JDBC Connection String to jdbc:mysql://192.168.224.41:3306/loadertest1.


The IP address of the server must be assigned by the trainer. If you have any questions, ask the
trainer.
Use the existing connection: stuXX_mysql
Name: stuXX_cg_mysqltohive2
Queue: DEFAULT

Step 2 Configure From.


Set the table name to socker and then click Next.
If no primary key is set in the table, specify Partition column name, for example, 1 or the column name time.

Step 3 Configure Transform.


Drag the Table Input button to the right area.

Step 4 Double-click Table Input and enter the attributes associated with MySQL. Field name indicates
the corresponding fields of MySQL.
Step 5 Configure Hive output.


Drag the Hive output button to the blank area on the right.

Step 6 Output table parameter settings.


Double-click the Hive Output button. Set parameters as prompted, set Output delimiter to a comma (,), and enter the output fields, as shown in the following figure.
Output delimiter: ','
Step 7 Connect Table Input and Hive Output.

Step 8 Create directory /user/app_stuXX/loader_test/socker2 in HDFS.

> hdfs dfs -mkdir /user/app_stuXX/loader_test/socker2

Step 9 Create socker2 in the Hive data warehouse.

> beeline
> use stuXX_db;
> create table socker2(timest string,open float,high float,low float,close
float,volume string,endprice float)row format delimited fields terminated by
',' stored as textfile location '/user/app_stuXX/loader_test/socker2';
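Optionally, before running the Loader job, you can confirm the table definition and its HDFS location (describe formatted is standard HiveQL; the exact output layout depends on the Hive version):

> describe formatted socker2;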

Step 10 Configure To. Set Storage type to HIVE, Output directory to /user/app_stuXX/loader_test/socker2, and Number to 2.
Step 11 Click Save and run. The following information is displayed.

Step 12 View the execution result.

> select * from socker2 limit 10;


+------------+--------------+------------+-----------+-------------+--------------+----------------+
|socker2.timest|socker2.open|socker2.high|socker2.low|socker2.close|socker2.volume|socker2.endprice|
+------------+--------------+------------+-----------+-------------+--------------+----------------+
| 1970-01-02 | 92.06 | 93.54 | 91.79 | 93.0 | 8050000 | 93.0 |
| 1970-01-05 | 93.0 | 94.25 | 92.53 | 93.46 | 11490000 | 93.46 |
| 1970-01-06 | 93.46 | 93.81 | 92.13 | 92.82 | 11460000 | 92.82 |
| 1970-01-07 | 92.82 | 93.38 | 91.93 | 92.63 | 10010000 | 92.63 |
| 1970-01-08 | 92.63 | 93.47 | 91.99 | 92.68 | 10670000 | 92.68 |
| 1970-01-09 | 92.68 | 93.25 | 91.82 | 92.4 | 9380000 | 92.4 |
| 1970-01-12 | 92.4 | 92.67 | 91.2 | 91.7 | 8900000 | 91.7 |
| 1970-01-13 | 91.7 | 92.61 | 90.99 | 91.92 | 9870000 | 91.92 |
| 1970-01-14 | 91.92 | 92.4 | 90.88 | 91.65 | 10380000 | 91.65 |
| 1970-01-15 | 91.65 | 92.35 | 90.73 | 91.68 | 11120000 | 91.68 |
+------------+--------------+------------+-----------+-------------+--------------+----------------+

> hdfs dfs -ls /user/app_stuXX/loader_test/socker2

18/04/15 22:23:45 INFO hdfs.PeerCache: SocketCache disabled.


Found 2 items
-rw-rwxrw-+ 3 loader hadoop 0 2020-06-26 18:12
/user/app_stu01/loader_test/socker2/_SUCCESS
-rw-rwxrw-+ 3 loader hadoop 548456 2020-06-26 18:12
/user/app_stu01/loader_test/socker2/part-m-00000

----End

7.3.1.3 Using Hive for Analysis and Query


Step 1 Obtain data of stocks with the biggest gain.
Obtain the data and save it to a new table.

> beeline
> use stuXX_db;
> select socker2.timest, socker2.open, socker2.endprice from socker2 where
socker2.endprice > socker2.open sort by socker2.endprice desc;

+-----------------+---------------+-------------------+
| socker2.timest | socker2.open | socker2.endprice |
+-----------------+---------------+-------------------+
| 1974-05-21 | 87.86 | 87.91 |
| 1978-03-09 | 87.84 | 87.89 |
| 1978-03-08 | 87.36 | 87.84 |
| 1975-12-04 | 87.6 | 87.84 |
| 1975-12-12 | 87.8 | 87.83 |
| 1970-02-19 | 87.44 | 87.76 |
| 1974-06-24 | 87.46 | 87.69 |
| 1978-02-23 | 87.56 | 87.64 |
| 1970-03-18 | 87.29 | 87.54 |
| 1970-12-01 | 87.2 | 87.47 |
| 1978-03-03 | 87.32 | 87.45 |
| 1970-02-18 | 86.37 | 87.44 |
| 1974-05-30 | 86.89 | 87.43 |
| 1978-03-07 | 86.9 | 87.36 |
| 1978-03-02 | 87.19 | 87.32 |
| 1975-12-09 | 87.07 | 87.3 |
……
+-----------------+---------------+-------------------+
5,228 rows selected (30.544 seconds)

Step 2 Obtain the latest data of stocks.

> select socker2.timest, socker2.open, socker2.endprice from socker2 where socker2.endprice > socker2.open sort by socker2.timest desc;

+--------------+--------------+------------------+
| socker2.time | socker2.open | socker2.endprice |
+--------------+--------------+------------------+
| 1970-04-09 | 88.49 | 88.53 |
| 1970-04-01 | 89.63 | 90.07 |
| 1970-03-26 | 89.77 | 89.92 |
| 1970-03-25 | 88.11 | 89.77 |
| 1970-03-24 | 86.99 | 87.98 |
| 1970-03-18 | 87.29 | 87.54 |
| 1970-03-17 | 86.91 | 87.29 |
| 1970-03-10 | 88.51 | 88.75 |
| 1970-03-03 | 89.71 | 90.23 |
| 1970-03-02 | 89.5 | 89.71 |
| 1970-02-27 | 88.9 | 89.5 |
| 1970-02-25 | 87.99 | 89.35 |
| 1970-02-20 | 87.76 | 88.03 |
| 1970-02-19 | 87.44 | 87.76 |
| 1970-02-18 | 86.37 | 87.44 |
……
+--------------+--------------+------------------+
5,228 rows selected (26.738 seconds)

Step 3 Obtain the number of stocks that increase.

> select count(*) from socker2 where socker2.endprice> socker2.open;


+-------+
| _c0 |
+-------+
| 5228 |
+-------+

Step 4 Create a Hive table to store the data of stocks that increase, and load the data into it. (This data is loaded from HDFS to HBase in the next section.)
Creating the table:

> use stuXX_db;


> create table upsocker like socker2;

Loading the data:

> insert into upsocker select * from socker2 where socker2.endprice >
socker2.open sort by socker2.endprice desc;
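As a quick sanity check (not part of the original steps), the number of rows in upsocker should match the count obtained in Step 3:

> select count(*) from upsocker;

The query should return 5228.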

----End

7.3.1.4 Loading HDFS Data to HBase


Step 1 Create a table named stuXX_cg_hdfstohbase2 in HBase.

hbase(main):002:0> create 'stuXX_cg_hdfstohbase2','info';


0 row(s) in 0.3900 seconds
=> Hbase::Table - stuXX_cg_hdfstohbase2

Step 2 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about Loader.

Set related parameters.


Name: stuXX_cg_hdfstohbase2
Connection: stuXX_hdfs_conn
Queue: DEFAULT
Click Next.

Step 3 Configure From.


Configure the input path of the HDFS file and the encoding type.
Input path: /user/app_stuXX/loader_test/socker2/part-m-00000

Step 4 Configure Transform.


Select CSV File Input and HBase Output and drag them to the blank area on the right, respectively.
Then connect them.
Step 5 Configure CSV file input parameters.


Set the input parameters based on the format of the data stored in the HDFS.
Set Delimiter to a comma (,), and add information in Input fields, as shown in the following figure.

Step 6 Configure HBase output parameters.


Configure the column name and family name.
Table name: stuXX_cg_hdfstohbase2
Step 7 Configure To.


Set Storage type to HBASE_PUTLIST, HBase instance to HBase, and Number to 1.

Step 8 Check the execution result.

Step 9 View the content in HBase table stuXX_cg_hdfstohbase2.

hbase(main):005:0> scan 'stuXX_cg_hdfstohbase2'


...
2009-09-15 column=info:high, timestamp=1523803747562, value=1056.04
2009-09-15 column=info:low, timestamp=1523803747562, value=1043.42


2009-09-15 column=info:open, timestamp=1523803747562, value=1049.03
2009-09-15 column=info:volume, timestamp=1523803747562,
value=6185620000
10022 row(s) in 8.5350 seconds

----End

7.3.1.5 Querying HBase Data in Real Time


Step 1 On the HBase Shell client, query information in the 2009-09-15 row of table stuXX_cg_hdfstohbase2.

> get 'stuXX_cg_hdfstohbase2','2009-09-15'

COLUMN CELL
info:close timestamp=1523803747562, value=1052.63
info:endprice timestamp=1523803747562, value=1052.63
info:high timestamp=1523803747562, value=1056.04
info:low timestamp=1523803747562, value=1043.42
info:open timestamp=1523803747562, value=1049.03
info:volume timestamp=1523803747562, value=6185620000
6 row(s) in 0.0420 seconds

Step 2 Query information in the period from August 15, 2009 to September 15, 2009.

> scan 'stuXX_cg_hdfstohbase2',{COLUMN=>'info:endprice', STARTROW=>'2009-08-15', STOPROW=>'2009-09-15'}

ROW COLUMN+CELL
2009-08-17 column=info:endprice, timestamp=1523803747562, value=979.73
2009-08-18 column=info:endprice, timestamp=1523803747562, value=989.67
2009-08-19 column=info:endprice, timestamp=1523803747562, value=996.46
2009-08-20 column=info:endprice, timestamp=1523803747562, value=1007.37
……
2009-09-09 column=info:endprice, timestamp=1523803747562, value=1033.37
2009-09-10 column=info:endprice, timestamp=1523803747562, value=1044.14
2009-09-11 column=info:endprice, timestamp=1523803747562, value=1042.73
2009-09-14 column=info:endprice, timestamp=1523803747562, value=1049.34
20 row(s) in 0.0380 seconds

Step 3 Query all the columns whose values are greater than a specific value. (The system compares
the values as strings.)

> scan 'stuXX_cg_hdfstohbase2',{FILTER => "ValueFilter(>,'binary:979')"}


...
2009-09-02 column=info:low, timestamp=1523803747562, value=991.97
2009-09-02 column=info:open, timestamp=1523803747562, value=996.07
2009-09-03 column=info:low, timestamp=1523803747562, value=992.25
2009-09-03 column=info:open, timestamp=1523803747562, value=996.12
661 row(s) in 0.2230 seconds

Step 4 Query all the information that ends with endprice and the string value is greater than 979.
hbase(main):011:0> scan
'stuXX_cg_hdfstohbase2',{FILTER=>"ValueFilter(>,'binary:979') AND
ColumnPrefixFilter('endprice')"}

2009-08-18 column=info:endprice, timestamp=1523803747562,


value=989.67
2009-08-19 column=info:endprice, timestamp=1523803747562,
value=996.46
2009-09-01 column=info:endprice, timestamp=1523803747562,
value=998.04
2009-09-02 column=info:endprice, timestamp=1523803747562,
value=994.75
327 row(s) in 0.1180 seconds

----End

7.4 Summary
This experiment uses multiple components to build a Big Data analysis and query platform. Through the experiment, trainees are expected to gain a better understanding of the theory behind Big Data components and their comprehensive application.
8 Appendix

8.1 Common Linux Commands


cd /path_dir: Enters the /path_dir directory you specified.
pwd: Displays the working path.
ls: Lists files in the directory.
ls -l: Displays detailed information about files and directories.
ls -a: Displays hidden files.
ls *[0-9]*: Displays file and directory names which contain digits.
tree: Displays the tree structure of files and directories from the root directory.
lstree: Displays the tree structure of files and directories from the root directory (alternative command).
mkdir dir1: Creates a directory named dir1.
mkdir dir1 dir2: Creates two directories at the same time.
mkdir -p /tmp/dir1/dir2: Creates a directory tree.
rm -f file1: Deletes a file named file1.
rmdir dir1: Deletes a directory named dir1.
rm -rf dir1: Deletes a directory named dir1 and its content.
rm -rf dir1 dir2: Deletes two directories and their contents.
mv dir1 new_dir: Renames/Moves a directory.
cp file1 file2: Copies file1 as file2.
cp dir/* .: Copies all the files in a directory to the current working directory.
cp -a /tmp/dir1 .: Copies a directory to the current working directory.
ln -s file1 lnk1: Creates a soft link pointing to the file or directory.

8.2 Other HDFS Commands


HDFS supports the fsck command to check file system consistency. fsck reports various file problems, such as missing blocks or under-replicated blocks.
Syntax of fsck:

hdfs fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
<path>: Indicates the start directory of the check.
-move: Moves damaged files to /lost+found.
-delete: Deletes damaged files.
-openforwrite: Prints files that are being written.
-files: Displays all the checked files.
-blocks: Prints a block report.
-locations: Prints the location of each block.
-racks: Prints the network topology of the DataNodes.
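For example, to check the lab output directory used earlier and print file, block, and location details (the path is just an example; any HDFS path you have permission to read works):

> hdfs fsck /user/app_stuXX/flume -files -blocks -locations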

8.3 Methods of Creating a New Flume Job

Flume jobs can be created using either of the following methods: update the properties.properties configuration file, or reinstall the client.
The first method applies when a client already exists. The second method applies when no client exists or when the first method fails to collect data.
First method:
Regenerate the properties.properties configuration file to replace the previously created configuration file, and restart the Flume service (see the sketch below).
The second method of creating a Flume job is used in section 6.3.2.2.
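A minimal sketch of the first method, assuming the client was installed to /home/userXX/flume1 as in 6.3.1.2 and that the installation contains a fusioninsight-flume-1.6.0 directory with a flume-manage.sh management script (verify these names in your installation; they may differ by client version):

> cp /home/userXX/flumetest/properties.properties /home/userXX/flume1/fusioninsight-flume-1.6.0/conf/
> /home/userXX/flume1/fusioninsight-flume-1.6.0/bin/flume-manage.sh restart
> ps -ef | grep flume | grep userXX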