
Huawei Certification Big Data Training Courses

HCIA - Big Data V2.0


Lab Guide for Big Data Engineers
ISSUE: 2.0

HUAWEI TECHNOLOGIES CO., LTD.


Copyright © Huawei Technologies Co., Ltd. 2018. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any means without prior written
consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

HUAWEI and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice
The purchased products, services and features are stipulated by the contract made between Huawei and the
customer. All or part of the products, services and features described in this document may not be within the
purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and
recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any
kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made in the
preparation of this document to ensure accuracy of the contents, but all statements, information, and
recommendations in this document do not constitute a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.


Address: Huawei Industrial Base
Bantian, Longgang
Shenzhen 518129
People's Republic of China

Website: http://e.huawei.com

About This Document

Overview
This guide instructs trainees to perform all the experiment tasks required by the HCIA-Big Data
course on the Huawei FusionInsight HD Big Data platform. It aims to help trainees master the use
of the Big Data components of the FusionInsight HD platform.

Content Description
This document contains eight experiments: FusionInsight client installation, HBase database practice,
HDFS file system practice, Loader data import and export practice, Flume data collection practice,
Kafka message subscription practice, Hive data warehouse practice, and a comprehensive cluster
experiment.

Precautions
During an experiment, trainees must not delete files arbitrarily.
When naming a directory, topic, or file, a trainee must include the trainee's account stuXX or userXX,
for example, stu06_data and user01_socker.
The trainer manages and allocates all the user names and passwords for logging in to the
environment. If you have any questions about a user name or password, please ask the trainer.

References
FusionInsight HD product documentation

Experiment Environment
Table 1-1 Experimental Hardware and Software

1.1 2288H V5

1.1.1 Basic Configuration
02312BTK  H22H-05-S26AFC  25 x 2.5-inch hard disk chassis, onboard 2*GE + 2*10GE optical ports (excluding optical modules), 2 x 1500W AC power supplies  Quantity: 1

1.1.2 SKYLAKE CPU
02311XFF  BC4M04CPU  Intel Xeon Platinum 8176 (2.1 GHz/28-core/38.5 MB/165W) processor (with heat sink)  Quantity: 2

1.1.3 Memory
06200241  N26DDR402  DDR4 RDIMM memory, 32 GB, 2666 MT/s, 2Rank (2G*4bit), 1.2 V, ECC  Quantity: 12

1.1.4 Hard Disk (with 2.5" handle bar), SAS
02311HAP  N600S1210W2  Hard disk, 600 GB, SAS 12 Gb/s, 10K rpm, 128 MB, 2.5 inch (2.5-inch bracket)  Quantity: 2
02311FMR  N1800S10W2  Hard disk, 1800 GB, SAS 12 Gb/s, 10K rpm, 128 MB, 2.5 inch (2.5-inch bracket)  Quantity: 4

1.1.5 Hard Disk (with 2.5" handle bar), SSD
02312FRL  ES3600S800GW2  ES3600S V5 solid state disk, 800 GB, SAS 12 Gb/s, read/write mixed, 2.5 inch (2.5-inch tray)  Quantity: 2

1.1.6 RAID controller card
02311SMF  BC1M05ESMLB  SR530C-M 1G (LSI3108) SAS/SATA RAID card, RAID 0/1/5/6/10/50/60, 1 GB cache, supports supercapacitor and out-of-band management  Quantity: 1
02311YPU  BC1M08TFM  LSI3108 1 GB cache RAID card supercapacitor (4 GB, including cables and mechanical parts), applicable to rack servers/X6800  Quantity: 1

1.1.7 Riser card
02311TWR  BC1M31RISE  3 x x8 (x16 slot) RISER1 module  Quantity: 1

1.1.8 PCIe card, NIC
02311EUX  CN2ITGAA20  Ethernet adapter, 10 Gb optical ports (Intel 82599), dual-port, SFP+ (including two multi-mode optical modules), PCIe 2.0 x8  Quantity: 1

1.1.9 Cables and optical modules
02318169  OMXD30000  Optical module, SFP+, 10G, multi-mode (850 nm, 0.3 km, LC)  Quantity: 2

1.1.10 Guide rail and cable tray
21240434  EGUIDER01  2U static slide rail kit  Quantity: 1

1.1.11 Operating system
05200723  GOSSLES33  SLES for SAP Applications, English version, Enterprise Edition, 12.x, 2 sockets or 2 VMs, x86 64-bit, physical goods (paper), no documentation, three-year 7*24 service (operating system manufacturer service), Greater China region  Quantity: 1

To download the FusionCompute software, visit the following website:


http://support.huawei.com/enterprise/en/cloud-computing/fusioncompute-pid-
8576912/software
To download the FusionInsight C70 software, visit the following website:
https://support.huawei.com/enterprise/en/cloud-computing/fusioninsight-hd-pid-
21110924/software/23949194?idAbsPath=fixnode01%7C7919749%7C7941815%7C19942
925%7C250430185%7C21110924

Other hardware

Switch: The minimum configuration is 1Gb Ethernet switches. It is recommended that all switches be
10Gb Ethernet switches.

OS partition requirements for each VM:

Target node: Management/Control/Data node

Partition Directory    Partition Size
/                      10G
/tmp                   10G
/var                   10G
/var/log               ≥200G
/srv/BigData           ≥60G
/opt                   ≥300G

VM port group configuration and interconnection switch configuration:

VM Name    Network Port Name    Port Group Name    VLAN ID
VM01       eth0                 PortgroupX         vlanX
VM01       eth1                 PortgroupY         vlanY
VM02       eth0                 PortgroupX         vlanX
VM02       eth1                 PortgroupY         vlanY
VM03       eth0                 PortgroupX         vlanX
VM03       eth1                 PortgroupY         vlanY

Physical switch: A trunk interface is used to connect the physical switch to the hypervisor of the
virtualization platform, so that the VLANs of the internal port groups of the virtual switch can be
separated.

Experiment Topology
Three server nodes are used.

Figure 1-2 Non-redundant cluster topology



Trainee Accounts and Software Access


Each trainee is assigned two accounts. The FusionInsight HD cluster account starts with stu and is
used to log in to the FusionInsight Manager management interface, for communication between big
data components, and for access to the big data components. The account starting with user is the
OS account of the cluster nodes. It is used to log in to the operating system of a cluster node and
perform the big data component experiment operations.
The cluster client software and files used during the experiments are saved in the
/FusionInsight-Client directory of each cluster node. Trainees can obtain the software and files from
this directory.
The SSH and file upload tools used during the experiments are saved in the "07 other tool"
directory under ftp://10.175.199.8/. The FTP user name and password are admin1 and admin1,
respectively. Trainees can obtain the tools by themselves.

Contents

About This Document ................................................................................................................. 3


Overview ............................................................................................................................................................................. 3
Content Description ............................................................................................................................................................ 3
Precautions ......................................................................................................................................................................... 3
References .......................................................................................................................................................................... 3
Experiment Environment .................................................................................................................................................... 3
Experiment Topology .......................................................................................................................................................... 6
Trainee Accounts and Software Access ............................................................................................................................... 7
1 FusionInsight HD Client Installation........................................................................................ 10
1.1 Background ................................................................................................................................................................. 10
1.2 Objective ..................................................................................................................................................................... 10
1.3 Experiment Tasks ........................................................................................................................................................ 10
1.3.1 Installing a Client ...................................................................................................................................................... 10
1.4 Summary ..................................................................................................................................................................... 12
2 HDFS File System Practice ...................................................................................................... 13
2.1 Background ................................................................................................................................................................. 13
2.2 Objectives ................................................................................................................................................................... 13
2.3 Experiment Tasks ........................................................................................................................................................ 13
2.3.1 Common HDFS Operations ...................................................................................................................................... 13
2.3.2 HDFS Management Operations ............................................................................................................................... 20
2.4 Summary ..................................................................................................................................................................... 30
3 HBase Database Practice........................................................................................................ 31
3.1 Background ................................................................................................................................................................. 31
3.2 Objective ..................................................................................................................................................................... 31
3.3 Experiment Tasks ........................................................................................................................................................ 31
3.3.1 Common HBase Operations ..................................................................................................................................... 31
3.3.2 Using Filter ............................................................................................................................................................... 37
3.3.3 Creating a Table with Pre-Distributed Regions......................................................................................................... 38
3.3.4 HBase Load Balancing .............................................................................................................................................. 43
3.4 Summary ..................................................................................................................................................................... 45
4 Hive Data Warehouse Practice ............................................................................................... 46
4.1 Background ................................................................................................................................................................. 46
4.2 Objectives ................................................................................................................................................................... 46
4.3 Experiment Tasks ........................................................................................................................................................ 46
4.3.1 Common Functions of Hive ...................................................................................................................................... 46

4.3.2 Creating a Table........................................................................................................................................................ 49


4.3.3 Querying .................................................................................................................................................................. 53
4.3.4 Hive Join Operations ................................................................................................................................................ 57
4.3.5 Hive on Spark Operation .......................................................................................................................................... 60
4.3.6 Associating a Hive Table with an HBase Table .......................................................................................................... 61
4.3.7 Merging Small Hive Files .......................................................................................................................................... 62
4.3.8 Hive Column Encryption .......................................................................................................................................... 63
4.3.9 Using Hue to Execute HQL ....................................................................................................................................... 64
4.4 Summary ..................................................................................................................................................................... 67
5 Data Import and Export Using Loader .................................................................................... 68
5.1 Background ................................................................................................................................................................. 68
5.2 Objective ..................................................................................................................................................................... 68
5.3 Experiment Tasks ........................................................................................................................................................ 68
5.3.1 Importing HBase Data to HDFS ................................................................................................................................ 68
5.3.2 Loading HDFS Data to HBase.................................................................................................................................... 75
5.3.3 Importing HDFS Data to MySQL ............................................................................................................................... 81
5.3.4 Importing MySQL Data to HDFS ............................................................................................................................... 88
5.3.5 Importing MySQL Data to HBase ............................................................................................................................. 92
5.3.6 Importing HBase Data to MySQL ............................................................................................................................. 96
5.3.7 Importing MySQL Data to Hive .............................................................................................................................. 100
5.4 Summary ................................................................................................................................................................... 104
6 Flume Data Collection Practice............................................................................................. 105
6.1 Background ............................................................................................................................................................... 105
6.2 Objective ................................................................................................................................................................... 105
6.3 Experiment Tasks ...................................................................................................................................................... 105
6.3.1 Collecting spooldir Data to the HDFS ..................................................................................................................... 105
6.3.2 Collecting avro Data to the HDFS ........................................................................................................................... 112
6.4 Summary ................................................................................................................................................................... 115
7 Comprehensive Cluster Experiment ..................................................................................... 116
7.1 Background ............................................................................................................................................................... 116
7.2 Objective ................................................................................................................................................................... 116
7.3 Experiment Tasks ...................................................................................................................................................... 116
7.3.1 Offline Data Collection and Analysis and Real-Time Query Involving MySQL, Loader, Hive, and HBase ............... 116
7.4 Summary ................................................................................................................................................................... 129
8 Appendix ............................................................................................................................. 130
8.1 Common Linux Commands ....................................................................................................................................... 130
8.2 Other HDFS Commands ............................................................................................................................................ 130
8.3 Methods of Creating a new Flume Job ..................................................................................................................... 131

1 FusionInsight HD Client Installation

1.1 Background
The FusionInsight HD client is the interface for the communication between users and the cluster as
well as the foundation of subsequent experiments. After a client is installed, it requires security
authentication to communicate with the cluster if the cluster is deployed in secure mode.

1.2 Objective
⚫ To understand how to download and install a client.

1.3 Experiment Tasks

1.3.1 Installing a Client


Step 1 Log in to a cluster node.
Use PuTTY and the trainee account (such as userXX) to log in to a cluster node, for example,
192.168.224.45. (The IP address of a specific node must be assigned by the trainer.)

Copy the FusionInsight HD client to the home directory of userXX (for example, user01). The client
files are saved in the /FusionInsight-Client directory of each cluster node.

> cd /FusionInsight-Client
> cp FusionInsight_Cluster_1_Services_ClientConfig.tar /home/userXX

Decompress the client software.


> cd /home/userXX
> tar -xvf FusionInsight_Cluster_1_Services_ClientConfig.tar

Step 2 Install the client.


Go to the FusionInsight_Cluster_1_Services_ClientConfig directory and run the installation
command to install the software in the /home/userXX/hadoopclient directory of the current user.

> cd /home/userXX/FusionInsight_Cluster_1_Services_ClientConfig/
>./install.sh /home/userXX/hadoopclient

If the message "Components client installation is complete" is displayed, the installation is complete.


Note: After the installation, delete FusionInsight_Cluster_1_Services_ClientConfig.tar.

Step 3 Configure environment variables and perform the authentication.


Go to /home/userXX/hadoopclient and run the following commands to set environment variables:

> cd /home/userXX/hadoopclient
> source bigdata_env
> kinit stuXX

Password for stuXX@HADOOP.COM:



Note: The initial password is Huawei@123 (or consult the trainer). If the system prompts you to
change the password during the first authentication, change the password to Huawei12#$.
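If the Kerberos client tools are available in the environment, an optional check (not part of the original steps) is to list the cached tickets and confirm that authentication succeeded:

> klist

The output should contain a ticket for stuXX@HADOOP.COM.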

Step 4 Test the client.


Run the hdfs command to test the client.

> hdfs dfs -ls /


drwxr-x---+ - flume hadoop 0 2017-07-15 00:39 /flume
drwx------+ - hbase supergroup 0 2018-03-31 10:28 /hbase
drwxrwxr-x+ - admin supergroup 0 2018-01-28 15:52 /mapreduceInput
drwxrwxrwx+ - mapred hadoop 0 2017-07-15 00:39 /mr-history

If the test is successful, it indicates that the client is installed successfully.

----End

1.4 Summary
This experiment demonstrates how to install a FusionInsight HD client. During the installation, you
need to decompress the client software twice. Note that the directory where the client is installed
must not contain any files or folders; otherwise, the installation fails.

2 HDFS File System Practice

2.1 Background
HDFS is a distributed file system on the Hadoop Big Data platform and provides data storage for
upper-layer applications and other Big Data components, such as Hive, MapReduce, Spark, and HBase.
On the HDFS shell client, you can operate and manage the distributed file system. Using HDFS helps
us better understand and master Big Data.

2.2 Objectives
⚫ To have a good command of common HDFS operations
⚫ To master HDFS file system management operations

2.3 Experiment Tasks

2.3.1 Common HDFS Operations


2.3.1.1 Commands
First, run the following commands:

> cd /home/userXX/hadoopclient
> source bigdata_env
> kinit stuXX

⚫ -help: Displays the usage instructions of a command.

> hdfs dfs -help


Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] <path> ...]

[-cp [-f] [-p | -p[topax]] <src> ... <dst>]


[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]

⚫ -ls: Displays the directory information.

> hdfs dfs -ls /


-rw-r--r--+ 3 wkj supergroup 13 2018-04-02 16:42 /HDFS
drwxrwxr-x+ - hive supergroup 0 2017-07-15 00:43 /apps
drwxr-xr-x+ - admin supergroup 0 2018-03-13 19:44 /bigdata
drwxr-x---+ - flume hadoop 0 2017-07-15 00:39 /flume
drwx------+ - hbase supergroup 0 2018-03-31 10:28 /hbase
drwxrwxr-x+ - admin supergroup 0 2018-01-28 15:52 /mapreduceInput
drwxrwxrwx+ - mapred Hadoop 0 2017-07-15 00:39 /mr-history

⚫ -mkdir: Creates a directory in the HDFS.

> hdfs dfs -mkdir /user/app_stuXX

> hdfs dfs -ls /user


drwxr-xr-x+ - wkj supergroup 0 2018-04-02 17:20 /0402
drwxr-xr-x+ - wkj supergroup 0 2018-04-02 16:57 /0810
-rw-r--r--+ 3 wkj supergroup 13 2018-04-02 16:42 /HDFS
drwxr-xr-x+ - user01 supergroup 0 2018-04-04 15:04 /app_stu01

⚫ -put: Uploads a local file to the specified directory in the HDFS.

> hdfs dfs -put /FusionInsight-Labs/test01.txt /user/app_stuXX


> hdfs dfs -ls -h /user/app_stuXX

-h: Format file sizes in a human-readable fashion (e.g., 64.0m instead of 67108864).

-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-13 16:29 /user/app_stu01/test01.txt

⚫ -get: Downloads a file from HDFS to the local host; it is equivalent to copyToLocal.

Copy /user/app_stuXX/test01.txt to the local host:

> hdfs dfs -get /user/app_stuXX/test01.txt /home/userXX


> cd /home/userXX
> ll

total 2881728
drwxr-xr-x 15 user01 hadoop 4096 Apr 4 10:58 1001_hadoopclient
-rw-r--r-- 1 user01 hadoop 63 Apr 4 16:30 appendtext.txt
drwxr-xr-x 2 user01 hadoop 4096 Apr 4 10:03 bin
-rw-r--r-- 1 user01 hadoop 0 Apr 4 15:28 hdfs
-rwxr-xr-x 1 user01 hadoop 2947983360 Apr 4 10:05 Service_Client.tar
-rw-r--r-- 1 user01 hadoop 38 Apr 4 16:27 stu01.txt
-rw-r--r-- 1 user01 hadoop 38 Apr 4 17:54 test01.txt

⚫ -moveFromLocal: Cuts a local file and pastes it into HDFS.

Place the abcd.txt file in the home directory of userXX:

> cp /FusionInsight-Labs/abcd.txt /home/userXX


> cd /home/userXX

> ll

total 2881716
drwxr-xr-x 15 user01 hadoop 4096 Apr 4 10:58 1001_hadoopclient
drwxr-xr-x 2 user01 hadoop 4096 Apr 4 10:03 bin
-rw-r--r-- 1 user01 hadoop 0 Apr 4 15:28 abcd
-rwxr-xr-x 1 user01 hadoop 2947983360 Apr 4 10:05 Service_Client.tar

Execute the moveFromLocal command to move the abcd.txt file to the /user/app_stuXX directory in
the HDFS.

> hdfs dfs -moveFromLocal /home/userXX/abcd.txt /user/app_stuXX

After the execution is complete, check that the file does not exist anymore in the home directory of
userXX.

> ll
total 2881716
drwxr-xr-x 15 stu01 hadoop 4096 Apr 4 10:58 1001_hadoopclient
drwxr-xr-x 2 stu01 hadoop 4096 Apr 4 10:03 bin
-rwxr-xr-x 1 stu01 hadoop 2947983360 Apr 4 10:05 Service_Client.tar

The file has been moved to the HDFS.

> hdfs dfs -ls -h /user/app_stuXX

Found 3 items
-rw-rw-rw-+ 3 root hadoop 1.3 G 2020-07-13 16:21
/user/app_stu20/FusionInsight_Cluster_1_Services_Client.tar
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-13 16:45
/user/app_stu01/abcd.txt
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-13 16:29
/user/app_stu01/test01.txt

⚫ -cat: Displays the file content.

> hdfs dfs -cat /user/app_stuXX/test01.txt

01,HDFS
02,Zookeeper
03,HBase
04,Hive

⚫ -appendToFile: Appends data to the end of a file.

There is a local file appendtext.txt; copy it to the home directory and view its content:

> cp /FusionInsight-Labs/appendtext.txt /home/userXX


> cat appendtext.txt

10,Spark
11,Storm
12,Kafka
13,Flink
14,ELK
15,FusionInsight HD

Add the content in appendtext.txt to the end of test01.txt.

> hdfs dfs -appendToFile /home/userXX/appendtext.txt /user/app_stuXX/test01.txt

Check whether the content has been added successfully.

> hdfs dfs -cat /user/app_stuXX/test01.txt

01,HDFS
02,Zookeeper
03,HBase
04,Hive
10,Spark
11,Storm
12,Kafka
13,Flink
14,ELK
15,FusionInsight HD

⚫ -chmod: Modifies the file permission.

> hdfs dfs -ls /user/app_stuXX

Found 3 items
-rw-rw-rw-+ 3 root hadoop 1352929792 2020-07-13 16:21
/user/app_stu01/FusionInsight_Cluster_1_Services_Client.tar
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-13 16:45
/user/app_stu01/abcd.txt
-rw-rw-rw-+ 3 user01 hadoop 101 2020-07-13 16:57
/user/app_stu01/test01.txt

Modify the permission of /user/app_stuXX/test01.txt to 755:

> hdfs dfs -chmod 755 /user/app_stuXX/test01.txt

> hdfs dfs -ls /user/app_stuXX/test01.txt

-rwxr-xr-x+ 3 user01 hadoop 101 2020-07-13 16:57 /user/app_stu01/test01.txt

To use chown, you must have the superuser permission.

⚫ -cp: Copies a file.

Copy /user/app_stuXX/test01.txt to the /tmp/stuXX directory with the name file01.txt:

> hdfs dfs -cp /user/app_stuXX/test01.txt /tmp/stuXX/file01.txt

> hdfs dfs -ls /tmp/stuXX

Found 1 items

-rw-rw-rw-+ 3 user01 hadoop 101 2020-07-13 17:40 /tmp/stu01/file01.txt

⚫ -mv: Moves a file.

Move /tmp/stuXX/file01.txt to the /user/app_stuXX directory:

> hdfs dfs -mv /tmp/stuXX/file01.txt /user/app_stuXX

> hdfs dfs -ls /tmp/stuXX

Found 2 items
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-13 16:45 /tmp/stu01/abcd.txt
-rw-rw-rw-+ 3 user01 hadoop 101 2020-07-13 17:40
/tmp/stu01/test01.txt

⚫ -getmerge: Merges multiple files and downloads the result.

There are two files in the /user/app_stuXX directory: file01.txt and test01.txt.

> hdfs dfs -put /FusionInsight-Labs/file01.txt /user/app_stuXX

> hdfs dfs -ls /user/app_stuXX/

-rw-rw-rw-+ 3 user01 hadoop 120 2020-07-16 11:53 /user/app_stu01/file01.txt
-rwxr-xr-x+ 3 user01 hadoop 101 2020-07-13 16:57 /user/app_stu01/test01.txt

The contents of the two files are as follows:

> hdfs dfs -cat /user/app_stuXX/file01.txt

001 FusionInsight HD
002 FusionInsight Miner
003 FusionInsight LibrA
004 FusionInsight Farmer
005 FusionInsight Manager

> hdfs dfs -cat /user/app_stuXX/test01.txt



01,HDFS
02,Zookeeper
03,HBase
04,Hive
10,Spark
11,Storm
12,Kafka
13,Flink
14,ELK
15,FusionInsight HD

Combine the files and copy them to a local directory:

> hdfs dfs -getmerge /user/app_stuXX/file01.txt /user/app_stuXX/test01.txt /home/userXX/merge_file.txt

> cat /home/userXX/merge_file.txt

001 FusionInsight HD
002 FusionInsight Miner
003 FusionInsight LibrA
004 FusionInsight Farmer
005 FusionInsight Manager
01,HDFS
02,Zookeeper
03,HBase
04,Hive
10,Spark
11,Storm
12,Kafka
13,Flink
14,ELK
15,FusionInsight HD

⚫ -rm: Deletes a file or folder.

Delete the /user/app_stuXX/file01.txt file:

> hdfs dfs -rm -f /user/app_stuXX/file01.txt


INFO fs.Trash: Moved: 'hdfs://hacluster/app_stu01/file01' to trash at:
hdfs://hacluster/user/stu01/.Trash/Current

⚫ -df: Checks the available space of the file system.

> hdfs dfs -df -h /


Filesystem Size Used Available Use%
hdfs://hacluster 1.7 T 11.9 G 1.7 T 1%

⚫ -du: Checks the folder size.

> hdfs dfs -du -h /user/app_stuXX


213.1 M /user/admin
0 /user/hdfs
75 /user/hdfs-examples

213.1 M /user/hive
4.3 K /user/loader
493 /user/mapred

⚫ -count: Checks the number of files in a specific directory.

> hdfs dfs -count -h /user/app_stuXX

344 494 3.2 G /user

In the output, 344 in the first column indicates the number of folders under the /user/ directory, 494
in the second column indicates the number of files under /user/, and 3.2 G indicates the disk space
occupied by all files under /user/ (excluding replicas).
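As an optional check (not part of the original steps), the -q option of -count, listed in the -help output earlier, also displays the quota information of a directory; this is useful in the quota experiment later in this chapter:

> hdfs dfs -count -q -h /user/app_stuXX

The output prepends four quota-related columns: the namespace quota, the remaining namespace quota, the space quota, and the remaining space quota.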

2.3.1.2 Recycle Bin Usage


Files may be deleted by mistake during work. If this happens, you can retrieve the deleted files from
the HDFS recycle bin. By default, the recycle bin keeps deleted files for seven days. For example, in
the preceding experiment, the -rm command is used to delete the file01.txt file. After the file is
deleted, the system prompts that it has been moved to the trash: fs.Trash: Moved:
'hdfs://hacluster/user/app_stuXX/file01.txt' to trash at:
hdfs://hacluster/user/userXX/.Trash/Current/user/app_stuXX. In other words, HDFS archives the
deleted file in a separate directory instead of removing it immediately.

> hdfs dfs -ls /user/userXX/.Trash

drwx------+ - user01 hadoop 0 2020-06-08 13:13 /user/user01/.Trash/Current

> hdfs dfs -ls /user/userXX/.Trash/Current

drwx------+ - user01 hadoop 0 2020-06-08 13:13 /user/user01/.Trash/Current/user

> hdfs dfs -ls /user/userXX/.Trash/Current/user

drwx------+ - user01 hadoop 0 2020-06-08 13:13 /user/user01/.Trash/Current/user/app_stu01

View the /user/userXX/.Trash/Current/user/app_stuXX directory.

> hdfs dfs -ls -h /user/userXX/.Trash/Current/user/app_stuXX

-rw-rw-rw-+ 3 user01 hadoop 120 2020-07-16 11:53 /user/user01/.Trash/Current/user/app_stu01/file01.txt

Then, use the -mv command to move the file back to the specified directory. For details, see the
description of -mv in this section.

> hdfs dfs -mv /user/userXX/.Trash/Current/user/app_stuXX/file01.txt /user/app_stuXX

2.3.2 HDFS Management Operations


2.3.2.1 HDFS Quota Management
When multiple tenants use the HDFS, the HDFS space available for each tenant should be limited.
HDFS quota management is designed for this matter.
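For reference only: this lab sets the quotas through the tenant GUI described below. With HDFS administrator rights, the same limits could in principle be set from the command line; a hedged sketch (do not run it in this lab) would look like this:

> hdfs dfsadmin -setQuota 3 /user/app_stuXX/myquota
> hdfs dfsadmin -setSpaceQuota 1000m /user/app_stuXX/myquota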

2.3.2.1.1 Creating Quota

Step 1 On the FusionInsight Manager interface, click Tenant Management (in the new interface:
Tenant Resources -> Tenant Resources Management).

In the tenant list on the left, click the tenant queue_stuXX whose HDFS storage directory needs to be
modified.

Step 2 Click the Resource tab.


Step 3 In the HDFS Storage table, click Create Directory.

Step 4 Add a directory.

Path: Enter the directory assigned to the tenant. If the path does not exist, the system automatically
creates it. (/user/app_stuXX/myquota)
Quota: Enter the upper limit of the total number of stored files and directories. Quota: 3
SpaceQuota: Enter the storage space quota of the directory. SpaceQuota: 1000 MB

The path must be unique.


After filling in all the values, click OK.

Step 5 Check the result of adding a directory.


Run the HDFS file uploading command:

> hdfs dfs -put /home/userXX/test01.txt /user/app_stuXX/myquota

Run the following command to check whether the file has been uploaded:

> hdfs dfs -ls /user/app_stuXX/myquota

Found 1 items
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-16 20:09
/user/app_stu01/myquota/test01.txt

If the preceding information is displayed, the /myquota directory is created successfully and the
current user has the permission to upload files.

Step 6 Test SpaceQuota.

Pre-applied disk space = Number of blocks corresponding to the file x Block size x 3. The default block
size is 128 MB, so the minimum pre-applied disk space (one data block) is 128 MB x 3 = 384 MB. The
SpaceQuota was set to 1000 MB in step 4, so the maximum file size is 2 x 128 MB = 256 MB. When
the file size is greater than 256 MB, at least three data blocks are required (3 x 128 MB x 3 > 1000
MB), the SpaceQuota cannot be met, and the upload fails. (Number of blocks corresponding to a file
= File size/128 MB, rounded up if it is indivisible.)
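A minimal sketch of this calculation in shell arithmetic (illustration only; the 451 MB value assumes the Flume client package used later in this step):

> FILE_MB=451; BLOCK_MB=128; REPLICAS=3
> BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))   # round up: ceil(451/128) = 4
> echo "$(( BLOCKS * BLOCK_MB * REPLICAS )) MB pre-applied"

1536 MB pre-applied

1536 MB exceeds the 1000 MB SpaceQuota, so such an upload fails.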
The following is an example of uploading a file larger than 256 MB when the SpaceQuota is 1000
MB.

> cd /FusionInsight-Client/Flume

Check the size of FusionInsight_Cluster_1_Flume_Client.tar:

> ll -h
total 451M
-rwxr-xr-x 1 root root 451M Jul 23 13:22
FusionInsight_Cluster_1_Flume_Client.tar

Run the following command to upload the file to the HDFS:



> hdfs dfs -put /FusionInsight-Client/Flume/FusionInsight_Cluster_1_Flume_Client.tar /user/app_stuXX/myquota

put: The DiskSpace quota of /user/app_stu01/myquota is exceeded: quota = 1048576000 B = 1000 MB but diskspace consumed = 1207959666 B = 1.13 GB

The file fails to be uploaded because SpaceQuota is set to 1000 MB and the file size is greater than
256 MB.

Step 7 Test Quota.

As configured in step 4, the Quota value of 3 includes the directory itself, so the directory can hold at
most two files; uploading a third file fails. Run the following commands to perform the test:

> hdfs dfs -put /home/userXX/hadoopclient/switchuser.py /user/app_stuXX/myquota

> hdfs dfs -put /home/userXX/hadoopclient/install.ini /user/app_stuXX/myquota

put: The NameSpace quota (directories and files) of directory /user/app_stu01/myquota is exceeded: quota=3 file count=4

Run the following command to view the file list in the specified HDFS directory:

> hdfs dfs -ls /user/app_stuXX/myquota

Found 2 items
-rw-rw-rw-+ 3 user01 hadoop 1774 2020-07-23 18:35
/user/app_stu01/myquota/switchuser.py
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-23 18:31
/user/app_stu01/myquota/test01.txt
The command output does not contain the install.ini file, which confirms that the upload failed.

----End

2.3.2.1.2 Modifying Quota Configuration

Step 1 Change Quota to 4 and SpaceQuota to 1700 MB.



Step 2 Upload the data again and then view the file list in the directory:

> hdfs dfs -put /FusionInsight-Client/Flume/FusionInsight_Cluster_1_Flume_Client.tar /user/app_stuXX/myquota

> hdfs dfs -ls /user/app_stuXX/myquota

Found 3 items
-rw-rw-rw-+ 3 user01 hadoop 472237056 2020-07-23 18:42
/user/app_stu01/myquota/FusionInsight_Cluster_1_Flume_Client.tar
-rw-rw-rw-+ 3 user01 hadoop 1774 2020-07-23 18:35
/user/app_stu01/myquota/switchuser.py
-rw-rw-rw-+ 3 user01 hadoop 38 2020-07-23 18:31
/user/app_stu01/myquota/test01.txt

The preceding command output indicates that, after the configuration is modified, the large file
(about 451 MB) can be uploaded and the directory can hold multiple (three) files.
----End

2.3.2.1.3 Deleting Quota

Step 1 Log in to FusionInsight Manager and choose Tenant > Tenant Management > queue_stuXX >
Resource (in the new interface: Tenant Resources -> Tenant Resources Management ->
queue_stuXX > Resource).

Step 2 Click the cross icon (x) in the Operation column of the specified directory in HDFS Storage area
to delete the storage resource.

(In the other interface: select the directory and click Delete in the Operation column.)

Step 3 In the Delete Directory dialog box that is displayed, select the check box and click OK.

----End

2.3.2.2 HDFS Metadata Backup and Recovery


To ensure HDFS metadata security or when the system administrator needs to perform major
operations (such as upgrade or migration) on the HDFS cluster, you need to back up HDFS metadata
so that HDFS metadata can be restored in a timely manner when faults occur in the system, making
HDFS cluster data secure and reliable.
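For reference only (this lab backs up the metadata through FusionInsight Manager, as described below), HDFS itself can also export the latest NameNode fsimage to a local directory when you have HDFS administrator rights, for example:

> hdfs dfsadmin -fetchImage /tmp/fsimage_backup

The /tmp/fsimage_backup path here is only an illustrative example.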

2.3.2.2.1 Data Backup


The data backup procedure is as follows:

Step 1 Choose System > Backup Management (in the new interface: O&M -> Backup and Restoration).

Step 2 Click Create Backup Task (in the new interface: the Create button).



Step 3 Select the check box next to NameNode and configure parameters of the NameNode metadata
backup task, including Task name, Path type, Maximum number of backup copies, and Instance
name, and click OK.

Task name: stuXX_NameNodeBackup


Backup Object: Cluster Demo
Configuration: NameNode
Path Type: LocalDir
Maximum Number of Backup Copies: 3
Instance Name: hacluster

Step 4 Click the start icon in the Operation column to execute the metadata backup task. (In the new
interface: select the task from the list, and in the Operation column choose More > Back Up Now.)

Step 5 When the task progress is 100%, the task is complete and HDFS metadata is backed up
successfully.

----End

2.3.2.2.2 Data Recovery


Data recovery is performed based on the data backup result. The data recovery procedure is as
follows:

Step 1 Choose System > Backup Management (in the new interface: O&M -> Backup and Restoration
-> Backup Management).

Step 2 Click the button for viewing historical operations of the NameNodeBackup task (in the
Operation column, choose More -> View History).

Step 3 Check the data backup log and click View in the Details column.

Step 4 Find the path for saving the backup data file from the log file, as shown in the following figure.

(In the other interface: click View in the Backup Path column.) The file name is displayed, for example:
/srv/BigData/LocalBackup/1/stu01_NameNodeBackup_20200723185056/NameNode_20200723185104/6.5.1_HDFS-hacluster-fsimage_20200723185211.tar.gz

Step 5 Copy the path and click Recovery Management to create a recovery task (O&M -> Backup and
Restoration -> Restoration Management).

Step 6 On the page that is displayed, click Create Recovery Task (in the new interface: the Create button).

Step 7 Configure parameters for the task, including Task name, Path type, Source path, and Instance
name. The source path indicates the file path obtained in step 4. After configuring all the
parameters, click OK.

Task Name: stuXX_recovery

Recovery Configuration: NameNode
Path Type: LocalDir
Source Path: select from the list the file name obtained from the local path, for example
/srv/BigData/LocalBackup/1/stu01_NameNodeBackup_20200723185056/NameNode_20200723185104/6.5.1_HDFS-hacluster-fsimage_20200723185211.tar.gz
Instance Name: hacluster

Step 8 Click the start icon corresponding to the task to start data recovery.

The preceding figure shows that NameNode data is successfully recovered.


----End

NOTE: The recovery task will fail because the NameNode must be stopped for it to run. Do NOT stop
the NameNode!

2.4 Summary
This experiment describes common HDFS operations and HDFS management. After this experiment,
trainees should have known how to perform common operations in the HDFS.

3 HBase Database Practice

3.1 Background
HBase is a highly reliable, high-performance, column-oriented, and scalable distributed storage
system. It is the most commonly used NoSQL database in the industry. The knowledge about how to
use HBase can deepen trainees' understanding of HBase and lay a solid foundation for
comprehensively using Big Data.

3.2 Objective
⚫ To have a good command of common HBase operations, region operations, and filter
usage.

3.3 Experiment Tasks

3.3.1 Common HBase Operations


3.3.1.1 Logging In to an HBase Shell Client
Step 1 Log in to an HBase shell client.

> cd /home/userXX/hadoopclient
> source bigdata_env
> kinit stuXX
Password for stuXX@HADOOP.COM:
> hbase shell
……
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.0.2, rUnknown, Thu May 12 17:02:55 CST 2016
hbase(main):001:0>

The preceding information indicates that you have logged in to the HBase shell client.

3.3.1.2 Creating a Common Table


Step 1 The syntax for creating a common table is as follows: create 'table name', 'column family name'

Entering the command:

> create 'stuXX_cga_info','info'


0 row(s) in 0.3620 seconds
=> Hbase::Table - cga_info

The stuXX_cga_info table is successfully created.

Step 2 Run the list command to check the number of common tables in the system.

> list

TABLE
stu01_cga_info
Socker
t1
3 row(s) in 0.2300 seconds
=> ["stu01_cga_info", "socker", "t1"]

The command output shows that there are three common tables in the system.
----End

3.3.1.3 Creating a Namespace


The syntax for creating a namespace is as follows: create_namespace 'namespace name'.

> create_namespace 'nnstuXX'


0 row(s) in 0.1280 seconds

3.3.1.4 Creating a Table in a Specific Namespace


Create a table in the specified namespace: create 'namespace name:table name','column family'

> create 'nnstuXX:studentXX','info'


0 row(s) in 0.2680 seconds
=> Hbase::Table – nnstu01:student

3.3.1.5 Viewing Tables in a Specified Namespace


Run the list_namespace_tables 'namespace name' command to view the tables in the namespace.

> list_namespace_tables 'nnstuXX'


TABLE
student
1 row(s) in 0.0220 seconds

3.3.1.6 Adding Data


Add data: put 'table name', 'RowKey', 'column name', 'specific value'
For example, enter the information about a 40-year-old man named Kobe who lives in Los Angeles
into the stuXX_cga_info table:

> put 'stuXX_cga_info','123001','info:name','Kobe'


0 row(s) in 0.1580 seconds
> put 'stuXX_cga_info','123001','info:gender','male'
0 row(s) in 0.0390 seconds
> put 'stuXX_cga_info','123001','info:age','40'
0 row(s) in 0.0250 seconds
> put 'stuXX_cga_info','123001','info:address','Los Angeles'
0 row(s) in 0.0170 seconds

3.3.1.7 Querying Data Using get


Step 1 get: exact query
Query the content stored in a RowKey precisely: get 'table name', 'RowKey'

> get 'stuXX_cga_info','123001'


COLUMN CELL
info:address timestamp=1523350574004, value=Los Angeles
info:age timestamp=1523350540131, value=40
info:gender timestamp=1523350499780, value=male
info:name timestamp=1523350443121, value=Kobe
4 row(s) in 0.0540 seconds

Step 2 Query the content stored in a cell in a RowKey precisely.

Syntax: get 'table name', 'RowKey', 'column name'


> get 'stuXX_cga_info','123001','info:name'
COLUMN CELL
info:name timestamp=1523350443121, value=Kobe
1 row(s) in 0.0310 seconds

----End

3.3.1.8 Querying Data Using scan


Step 1 Enter multiple data records into the table as instructed in section 3.3.1.6.
Step 2 scan: Queries data in a certain range.
Query the information in all columns of a column family in the table: scan 'table name',
{COLUMNS => 'column family name'}

> scan 'stuXX_cga_info',{COLUMNS=>'info'}


ROW COLUMN+CELL
123001 column=info:address, timestamp=1523350574004, value=Los
Angeles
123001 column=info:age, timestamp=1523350540131, value=40
123001 column=info:gender, timestamp=1523350499780, value=male
123001 column=info:name, timestamp=1523350443121, value=Kobe
123002 column=info:address, timestamp=1523351932415, value=London
123002 column=info:age, timestamp=1523351887009, value=40
123002 column=info:gender, timestamp=1523351993106, value=female
123002 column=info:name, timestamp=1523351965188, value=Victoria

123003 column=info:address, timestamp=1523352194766, value=Redding


123003 column=info:age, timestamp=1523352108282, value=30
123003 column=info:gender, timestamp=1523352060912, value=female
123003 column=info:name, timestamp=1523352091677, value=Taylor
123004 column=info:address, timestamp=1523352217267, value=Cleveland
123004 column=info:age, timestamp=1523352229436, value=33
123004 column=info:gender, timestamp=1523352267416, value=male
123004 column=info:name, timestamp=1523352251926, value=LeBron
4 row(s) in 0.0480 seconds

Step 3 Query the information stored in a specific column in the table.

Syntax: scan 'table name', {COLUMNS => 'column family:column name'}

> scan 'stuXX_cga_info',{COLUMNS=>'info:name'}


ROW COLUMN+CELL
123001 column=info:name, timestamp=1523350443121, value=Kobe
123002 column=info:name, timestamp=1523351965188,
value=Victoria
123003 column=info:name, timestamp=1523352091677,
value=Taylor
123004 column=info:name, timestamp=1523352251926,
value=LeBron
4 row(s) in 0.0300 seconds

----End

3.3.1.9 Querying Data that Matches Specific Conditions


Step 1 Query the data whose RowKey is 123002 or 123003.

> scan 'stuXX_cga_info',{STARTROW=>'123002','LIMIT'=>2}


ROW COLUMN+CELL
123002 column=info:address, timestamp=1523351932415,
value=London
123002 column=info:age, timestamp=1523351887009, value=40
123002 column=info:gender, timestamp=1523351993106,
value=female
123002 column=info:name, timestamp=1523351965188,
value=Victoria
123003 column=info:address,
timestamp=1523352194766,value=Redding
123003 column=info:age, timestamp=1523352108282, value=30
123003 column=info:gender, timestamp=1523352060912,
value=female
123003 column=info:name, timestamp=1523352091677,
value=Taylor
2 row(s) in 0.0170 seconds

Step 2 Query the information stored in the cell whose Rowkey is 123001 or 123002 and column name
is name.

> scan 'stuXX_cga_info',{STARTROW=>'123001','LIMIT'=>2,COLUMNS=>'info:name'}



ROW COLUMN+CELL
123001 column=info:name, timestamp=1523350443121, value=Kobe
123002 column=info:name, timestamp=1523351965188,
value=Victoria
2 row(s) in 0.0500 seconds

In addition to COLUMNS, HBase also supports LIMIT (limits the number of rows in the query results),
STARTROW (the start RowKey; the region is located based on STARTROW and the scan proceeds
forward from it), STOPROW (the end RowKey, exclusive), TIMERANGE (range of the timestamp),
VERSIONS (version number), and FILTER (filters rows by condition).
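As an additional hedged example based on the same table, STARTROW and STOPROW can be combined; because the stop row is exclusive, the following returns only rows 123002 and 123003:

> scan 'stuXX_cga_info',{STARTROW=>'123002',STOPROW=>'123004',COLUMNS=>'info:name'}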
----End

3.3.1.10 Updating Data


Step 1 Query the age information whose Rowkey is 123001 in the table.

> get 'stuXX_cga_info','123001','info:age'


COLUMN CELL
info:age timestamp=1523350540131, value=40
1 row(s) in 0.0260 seconds

Step 2 Change the age information whose Rowkey is 123001 in the table.

> put 'stuXX_cga_info','123001','info:age','18'


0 row(s) in 0.0340 seconds

Step 3 Query the age information whose Rowkey is 123001 in the table again.

> get 'stuXX_cga_info','123001','info:age'


COLUMN CELL
info:age timestamp=1523353910053, value=18
1 row(s) in 0.0040 seconds

Compare the results of step 1 and step 3. It can be seen that the age information has been updated.
----End

3.3.1.11 Deleting Data


3.3.1.11.1 Deleting Data in a Column Using delete

Step 1 Query the information whose Rowkey is 123001 in the table.

> get 'stuXX_cga_info','123001'


COLUMN CELL
info:address timestamp=1523350574004, value=Los Angeles
info:age timestamp=1523353910053, value=18

info:gender timestamp=1523350499780, value=male


info:name timestamp=1523350443121, value=Kobe
4 row(s) in 0.0380 seconds

Step 2 Run the delete command to delete the data stored in the age column in 123001.

> delete 'stuXX_cga_info','123001','info:age'


0 row(s) in 0.0300 seconds

Step 3 Query the information whose Rowkey is 123001 in the table again.

> get 'stuXX_cga_info','123001'


COLUMN CELL
info:address timestamp=1523350574004, value=Los Angeles
info:gender timestamp=1523350499780, value=male
info:name timestamp=1523350443121, value=Kobe

Compare the results of step 1 and step 3. It can be seen that the age information has been deleted.
----End

3.3.1.11.2 Deleting All Data in a Line Using deleteall

Step 1 Run the deleteall command to delete the entire row 123001 from the stuXX_cga_info table.

> deleteall 'stuXX_cga_info','123001'


0 row(s) in 0.0320 seconds

Step 2 Query the information whose Rowkey is 123001 in the table again.

> get 'stuXX_cga_info','123001'


COLUMN CELL
0 row(s) in 0.0190 seconds

No information whose RowKey is 123001 can be found, indicating that all the data in the line has
been deleted.
----End

3.3.1.11.3 Deleting Table Using drop

Step 1 Create a table named stuXX_cga_info1.

> create 'stuXX_cga_info1','info'


0 row(s) in 0.3920 seconds
=> Hbase::Table - cga_info1

Step 2 Run disable 'table name' first and then drop 'table name' to delete the table.

> disable 'stuXX_cga_info1'


0 row(s) in 1.2270 seconds
> drop 'stuXX_cga_info1'

2018-04-10 18:12:23,566 INFO [main] client.HBaseAdmin: Deleted cga_info1


0 row(s) in 0.3940 seconds

Step 3 Query tables in the current namespace.

> list
TABLE
cga_info
Socker
t1
3 row(s) in 0.2300 seconds
=> ["cga_info", "socker", "t1"]

The result shows that the stuXX_cga_info1 table has been deleted.
----End

3.3.2 Using Filter


Filter allows you to set certain filtering conditions in the scan process. Only the user data that meets
the filtering conditions is returned. All filters take effect on the server to ensure that the filtered data
is not transmitted to the client.
Example 1: Query people whose age is 40.

> scan 'stuXX_cga_info',{FILTER=>"ValueFilter(=,'binary:40')"}


ROW COLUMN+CELL
123002 column=info:age, timestamp=1523351887009, value=40
1 row(s) in 0.1230 seconds

Example 2: Query the people named LeBron.

> scan 'stuXX_cga_info',{FILTER=>"ValueFilter(=,'binary:LeBron')"}


ROW COLUMN+CELL
123004 column=info:name, timestamp=1523352251926,
value=LeBron
1 row(s) in 0.2240 seconds

Example 3: Query the gender information of all users in the table.

> scan 'stuXX_cga_info',FILTER=>"ColumnPrefixFilter('gender')"


ROW COLUMN+CELL
123002 column=info:gender, timestamp=1523351993106,
value=female
123003 column=info:gender, timestamp=1523352060912,
value=female
123004 column=info:gender, timestamp=1523352267416,
value=male
3 row(s) in 0.0570 seconds

Example 4: Query the address information of all the people in the table and find out the people who
live in London.

> scan 'stuXX_cga_info',{FILTER=>"ColumnPrefixFilter('address') AND ValueFilter(=, 'binary:London')"}

ROW COLUMN+CELL
123002 column=info:address, timestamp=1523351932415, value=London
1 row(s) in 0.0100 seconds

Filter filters data based on the column family, column, version, and so on. Only four filtering methods
are demonstrated here. RPC query requests with filter criteria will be distributed to each
RegionServer. In this way, the network transmission pressure is reduced.
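As one more hedged example beyond the four above, a RowFilter compares on the RowKey itself. Assuming the same stuXX_cga_info table, the following should return only the rows whose RowKey is greater than or equal to 123003 (that is, 123003 and 123004):

> scan 'stuXX_cga_info',{FILTER=>"RowFilter(>=,'binary:123003')"}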

3.3.3 Creating a Table with Pre-Distributed Regions


3.3.3.1 Dividing a Table into Four Random Regions by Rowkey
Step 1 Create a new table stuXX_cga_info2 and divide the table into four regions.

create 'table name', 'column family name', {NUMREGIONS => 4, SPLITALGO =>
'UniformSplit'}
> create 'stuXX_cga_info2','info',{NUMREGIONS=>4,SPLITALGO=>'UniformSplit'}
0 row(s) in 0.3720 seconds
=> Hbase::Table - cga_info2

Step 2 On FusionInsight Manager, choose Services > HBase.

Step 3 Click HMaster(Active).



Step 4 Click "Table Details".

Step 5 Find the new table stuXX_cga_info2.



Step 6 Query the region division result. The stuXX_cga_info2 table is divided into four regions. Name
contains the table name, StartKey (the first region does not have StartKey), timestamp, and
region ID.

----End

3.3.3.2 Specifying the StartKeys of Regions


Step 1 When creating a table, specify the StartKeys of the regions.

> create 'table name', 'column family name', SPLITS => ['first StartKey',
'second StartKey', 'third StartKey']

Example: Create a table named stuXX_cga_info3 and specify three StartKeys which are 10000,
20000, and 30000 respectively.

> create 'stuXX_cga_info3','info',SPLITS => ['10000', '20000', '30000']


0 row(s) in 0.6820 seconds
=> Hbase::Table - cga_info3

Step 2 Go to the Table Regions page as instructed in section 3.3.3.1.

The result shows that the stuXX_cga_info3 table is divided into four regions based on Start Keys
10000, 20000, and 30000.
----End

3.3.3.3 Pre-Dividing Regions Using a File


Step 1 Press Ctrl+C to exit the HBase shell.

user01@fi01host01:~>

Step 2 Create the splitFile.dat file in the /home/userXX/ directory.

> touch /home/userXX/splitFile.dat

Step 3 Go to the /home/userXX directory.



> cd /home/userXX

Step 4 Enter 10000, 20000, and 30000 in splitFile.dat.

> vim splitFile.dat

On the editing interface, press i, then enter 10000, 20000, and 30000, pressing Enter after each value.

Step 5 After entering all the values, press Esc and type :wq to save the file and exit.
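As an optional shortcut (not part of the original steps), the same file can be created non-interactively with a here-document instead of vim:

> cat > /home/userXX/splitFile.dat << 'EOF'
10000
20000
30000
EOF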
Step 6 Go to the HBase shell again.

> cd /home/userXX/hadoopclient
> source bigdata_env
> kinit stuXX
Password for stuXX@HADOOP.COM:
> hbase shell

Step 7 Create a table named stuXX_cga_info4 and pre-divide it using the splitFile.dat file created earlier.

> create 'stuXX_cga_info4','info',SPLITS_FILE => '/home/userXX/splitFile.dat'


0 row(s) in 0.4650 seconds
=> Hbase::Table - cga_info4

Step 8 Go to the Table Regions page as instructed in section 3.3.3.1.

The result shows that the stuXX_cga_info4 table is divided into four regions based on Start Keys
10000, 20000, and 30000 specified in splitFile.dat.
Note: For a table with regions pre-divided using start keys and end keys, the rowkey range of a region
is [start_key, end_key). For example, with split points 10000, 20000, and 30000, the rowkey 15000
falls into the region [10000, 20000).
----End

3.3.4 HBase Load Balancing


3.3.4.1 Viewing the HBase Web UI
Step 1 Go to the Region Servers page of the HBase by performing the first four steps of section 3.3.3.1.
Step 2 Click Requests.

Step 3 Click Base Stats.

The previous figure shows a serious problem of load imbalance. The fi01host02 host is overloaded.
You can manually move hot regions to the fi01host01 host.
----End

3.3.4.2 Moving Regions


Step 1 Click fi01host02.

Step 2 Check which regions are taken over by the fi01host02 host.

As shown in the preceding figure, the load is unbalanced due to the meta table. However, you are
not advised to move the meta table. In this experiment, move the stuXX_cga_info table.

Step 3 Move region 67aee3318a626ec0b1265e26fd46c151 to RegionServer
fi01host01,21302,1522806777697.

> echo "move '67aee3318a626ec0b1265e26fd46c151','fi01host01,21302,1522806777697'" | hbase shell

move '67aee3318a626ec0b1265e26fd46c151','fi01host01,21302,1522806777697'

0 row(s) in 0.4200 seconds

Step 4 Check the HBase Web UI.

On the web UI, you can see that the region has been moved to fi01host01.
----End

3.4 Summary
This experiment demonstrates how to create and delete an HBase table, how to add, delete, modify,
and query data, how to pre-divide regions, and how to manually achieve load balancing. Through the
experiment, trainees can master the methods of using HBase and deepen their understanding of
HBase.

4 Hive Data Warehouse Practice

4.1 Background
Hive is a data warehouse tool that plays an important role in data mining, data aggregation, and
statistical analysis. In particular, Hive plays an important role in telecom services. It can be used to
collect traffic, call fee, and tariff information of users, and establish users' consumption models to
help carriers better plan package content.

4.2 Objectives
⚫ To have a good command of common Hive operations.
⚫ To master how to run HQL on Hue.

4.3 Experiment Tasks

4.3.1 Common Functions of Hive


Enter the Hive client beeline:
> source /home/userXX/hadoopclient/bigdata_env
> /home/userXX/hadoopclient/Hive/Beeline/bin/beeline
...
Connected to: Apache Hive (version 1.3.0)
Driver: Hive JDBC (version 1.3.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.3.0 by Apache Hive
0: jdbc:hive2://192.168.225.11:21066/>
⚫ Character string length function: length


Syntax: length(string A)
Returned value: int
Note: Return the length of character string A.

hive> select length('abcedfg');


7

⚫ Character string reverse function

Syntax: reverse(string A)
Returned value: string
Note: Return the reversion of character string A.

hive> select reverse('abcedfg');


gfdecba

⚫ Character string connection function

Syntax: concat(string A, string B…)


Returned value: string
Note: Return the result of character string connection. You can enter any number of character
strings.

hive> select concat('abc', 'def', 'gh');


abcdefgh

⚫ The function of connecting character strings with delimiters

Syntax: concat_ws(string SEP, string A, string B…)


Returned value: string
Note: Returns the result of character string connection. SEP indicates the delimiter between
character strings.

hive> select concat_ws('-', 'abc', 'def', 'gh');


abc-def-gh

⚫ Character string truncation function

Syntax: substr(string A, int start, int len),substring(string A, int start, int len)
Returned value: string
Note: Return character string A from the start point with a length of len.

hive> select substr('abcde',3,2);


cd

hive> select substr ('abcde',-2,2);


de

⚫ The function of converting a character string to uppercase

Syntax: upper(string A) ucase(string A)


Returned value: string
Note: Return character string A in the uppercase format.

hive> select upper('abC');


ABC

hive> select ucase('abC');


ABC

⚫ The function of converting a character string to lowercase

Syntax: lower(string A) lcase(string A)


Returned value: string
Note: Return character string A in the lowercase format.

hive> select lower('abC');


abc

hive> select lcase('abC');


abc

⚫ The function of removing spaces

Syntax: trim(string A)
Returned value: string
Note: Remove the spaces on both sides of the character string.

hive> select trim(' abc ');


abc

⚫ The function of splitting a character string

Syntax: split(string str, string pat)


Returned value: array
Note: Split the string based on the specified string pattern. The string array after splitting is
returned.

hive> select split('abtcdtef','t');


["ab","cd","ef"]

⚫ Time functions

Function of obtaining the current UNIX timestamp: unix_timestamp


Syntax: unix_timestamp ()
Returned value: bigint
Note: Obtain the UNIX timestamp of the current time zone.

hive> select unix_timestamp();


1521511607

⚫ Function of converting the UNIX timestamp to date: from_unixtime

Syntax: from_unixtime(bigint unixtime[, string format])


Returned value: string
Note: Convert the UNIX timestamp (seconds from 1970-01-01 00:00:00 UTC to the specified
time) to the time format in the current time zone.

hive> select from_unixtime(1521511607,'yyyyMMdd');


20180320
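⚫ Combined example of nesting functions

The string and time functions above can be nested in a single statement. The following is only a small sketch; the date part of the output depends on when you run it.

hive> select concat_ws('-', upper('abc'), substr('abcde',1,3), from_unixtime(unix_timestamp(),'yyyyMMdd'));

ABC-abc-20180320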

4.3.2 Creating a Table


4.3.2.1 Syntax for Creating a Table

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name


[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

4.3.2.2 Creating an Internal Table


Create your own database and switch to it; all subsequent operations are performed in this database.
> create database stuXX_db;
> use stuXX_db;
Create an internal table cga_info1, which contains name, gender, and time columns.

> create table cga_info1(name string,gender string,timest int) row format
delimited fields terminated by ',' stored as textfile;
No rows affected (0.293 seconds)

In the preceding statement, row format delimited fields terminated by ',' indicates that the field
delimiter is ','. If this clause is not set, the default delimiter is used. A Hive HQL statement ends
with a semicolon (;).
View the cga_info1 table.

> show tables like 'cga_info1';


+------------+
| tab_name |
+------------+
| cga_info1 |
+------------+
1 row selected (0.07 seconds)

4.3.2.3 Creating an External Table


Specify the external keyword when you create an external table.

> create external table cga_info2 (name string,gender string,timest int) row
format delimited fields terminated by ',' stored as textfile;
No rows affected (0.343 seconds)

View the cga_info2 table.


> show tables like 'cga_info2';

+------------+
| tab_name |
+------------+
| cga_info2 |
+------------+
1 row selected (0.078 seconds)
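To confirm whether a table is internal (managed) or external, you can also check its metadata; the Table Type field in the output shows MANAGED_TABLE for cga_info1 and EXTERNAL_TABLE for cga_info2.

> desc formatted cga_info2;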

4.3.2.4 Loading Local Data


Step 1 Create a file on the local host.

> cd /home/userXX
> touch 'cga111.dat'

Step 2 Run the vim command to edit the cga111.dat file. Enter several lines of data in the sequence of
name, gender, and time. The field delimiter is a comma (,). To start a new line, press Enter. After
the input is complete, press ESC and enter :wq to save the modification and exit to the Linux
interface.

> vim 'cga111.dat'

Xiaozhao,female,20
Xiaoqian,male,21
Xiaosun,male,25
Xiaoli,female,40
Xiaozhou,male,33

Step 3 Enter Hive again.

> beeline

Step 4 Create a table named cga_info3.

> use stuXX_db;


> create table cga_info3(name string,gender string,timest int) row format
delimited fields terminated by ',' stored as textfile;
No rows affected (0.408 seconds)

Step 5 Load local data cga111.dat to the cga_info3 table.

> load data local inpath '/home/userXX/cga111.dat' into table cga_info3;


INFO : Loading data to table stu01_db.cga_info3 from
file:/home/user01/cga111.dat
No rows affected (0.516 seconds)

Step 6 Query the content in cga_info3.

> select * from cga_info3;



+-----------------+-------------------+-----------------+
| cga_info3.name | cga_info3.gender | cga_info3.time |
+-----------------+-------------------+-----------------+
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaoli | female | 40 |
| xiaozhou | male | 33 |
+-----------------+-------------------+-----------------+
5 rows selected (0.287 seconds)

The result shows that the content in the local file cga111.dat has been loaded to the Hive table
cga_info3.
----End

4.3.2.5 Loading HDFS Data


Step 1 Create the /user/app_stuXX/cga/cg directory in the HDFS.

> hdfs dfs -mkdir /user/app_stuXX/cga

18/04/12 19:43:54 INFO hdfs.PeerCache: SocketCache disabled.

> hdfs dfs -mkdir /user/app_stuXX/cga/cg

18/04/12 19:44:24 INFO hdfs.PeerCache: SocketCache disabled.

Step 2 Upload the local file cga111.dat from the /home/userXX directory to the /user/app_stuXX/cga/cg directory of the HDFS.

> hdfs dfs -put /home/userXX/cga111.dat /user/app_stuXX/cga/cg

18/04/12 14:19:39 INFO hdfs.PeerCache: SocketCache disabled.

Step 3 Enter Hive again.

> beeline
> use stuXX_db;

Step 4 Create a table named cga_info4.

> create table cga_info4(name string,gender string,timest int) row format
delimited fields terminated by ',' stored as textfile;

No rows affected (0.404 seconds)

Step 5 Load the HDFS file cga111.dat to the cga_info4 table.

> load data inpath '/user/app_stuXX/cga/cg/cga111.dat' into table cga_info4;

INFO : Loading data to table default.cga_info4 from


hdfs://hacluster/app_stu01/cga111.dat
No rows affected (0.341 seconds)

Note: Slightly different commands are used to load local data and HDFS data. Loading a local file copies it into the table directory, whereas loading an HDFS file moves it there.
Loading a local file: load data local inpath 'local_inpath' into table hive_table;
Loading an HDFS file: load data inpath 'HDFS_inpath' into table hive_table.
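As a reference only (not part of this experiment), adding the overwrite keyword replaces the existing content of the table instead of appending to it; the path shown here assumes the data file has been uploaded to the HDFS again:

> load data inpath '/user/app_stuXX/cga/cg/cga111.dat' overwrite into table cga_info4;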

Step 6 Query the content in cga_info4.

> select * from cga_info4;


+-----------------+-------------------+-----------------+
| cga_info4.name | cga_info4.gender | cga_info4.time |
+-----------------+-------------------+-----------------+
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaoli | female | 40 |
| xiaozhou | male | 33 |
+-----------------+-------------------+-----------------+
5 rows selected (0.303 seconds)

The result shows that the content of the cga111.dat file in the HDFS has been loaded to the Hive
table cga_info4.
----End

4.3.2.6 Loading Data When Creating a Table


Step 1 Create table cga_info5 and load the cga111.dat data in the HDFS during table creation. Because the load in section 4.3.2.5 moved cga111.dat out of /user/app_stuXX/cga/cg, upload the file to that directory again first.

> hdfs dfs -put /home/userXX/cga111.dat /user/app_stuXX/cga/cg


> beeline
> use stuXX_db;

> create external table cga_info5 (name string,gender string,timest int) row
format delimited fields terminated by ',' stored as textfile location
'/user/app_stuXX/cga/cg';
No rows affected (0.317 seconds)

Step 2 Query the content in cga_info5.

> select * from cga_info5;


+-----------------+-------------------+-----------------+
| cga_info5.name | cga_info5.gender | cga_info5.time |
+-----------------+-------------------+-----------------+
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaoli | female | 40 |
| xiaozhou | male | 33 |
+-----------------+-------------------+-----------------+
5 rows selected (0.268 seconds)

It can be seen that the cga_info5 table has been created successfully with cga111.dat data in the
HDFS loaded.
When data is loaded into an external table, the source files are not deleted. When new files are added
to the table's directory, the corresponding records automatically appear in the table.
-----------------
CONCLUSIONS:
1. When data is loaded into a regular (internal) table, the data file is removed from its original HDFS location (it is moved into the table's warehouse directory).

2. When data is loaded into an EXTERNAL table, the data file in the HDFS is not deleted.

When new files are added to the directory specified during creation of the external
table, the corresponding records are automatically added to the table.
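You can verify these conclusions yourself by listing the source directory after each load; what the listing shows depends on which steps you have already performed:

> hdfs dfs -ls /user/app_stuXX/cga/cg

After the load in section 4.3.2.5, cga111.dat is no longer listed there; after the upload in this section, it stays in place because cga_info5 only references the directory.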

4.3.2.7 Copying an Empty Table


Step 1 Create table cga_info6 by copying the structure of table cga_info1 (the schema is copied, not the data).

> create table cga_info6 like cga_info1;

No rows affected (0.244 seconds)

Step 2 Query the content in cga_info6.

> select * from cga_info6;


+-----------------+-------------------+-----------------+
| cga_info6.name | cga_info6.gender | cga_info6.time |
+-----------------+-------------------+-----------------+
+-----------------+-------------------+-----------------+
No rows selected (0.243 seconds)

The output shows that the empty table has been copied successfully.
----End

4.3.3 Querying
4.3.3.1 Fuzzy Query of Tables
Query tables whose names contain cga.

> show tables like '*cga*';


+--------------------+
| tab_name |
+--------------------+
| cga_hive_hbase |
| cga_info1 |
| cga_info2 |
| cga_info3 |
| cga_info4 |
| cga_info5 |

| cga_info6 |
+--------------------+
7 rows selected (0.072 seconds)

4.3.3.2 Querying by Criterion


Example 1: Use limit to query data in the first two lines in the cga_info3 table.

> select * from cga_info3 limit 2;

+-----------------+-------------------+-------------------+
| cga_info3.name | cga_info3.gender | cga_info3.timest |
+-----------------+-------------------+-------------------+
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
+-----------------+-------------------+-------------------+
2 rows selected (0.295 seconds)

Example 2: Use where to query the information about all women in the cga_info3 table.

> select * from cga_info3 where gender='female';


+-----------------+-------------------+--------------------+
| cga_info3.name | cga_info3.gender | cga_info3.timest |
+-----------------+-------------------+--------------------+
| xiaozhao | female | 20 |
| xiaoli | female | 40 |
+-----------------+-------------------+--------------------+
2 rows selected (0.286 seconds)

Example 3: Use order to query the information about all women in cga_info3 by time in descending
order.

> select * from cga_info3 where gender='female' order by timest desc ;


+-----------------+-------------------+-------------------+
| cga_info3.name | cga_info3.gender | cga_info3.timest |
+-----------------+-------------------+-------------------+
| xiaoli | female | 40 |
| xiaozhao | female | 20 |
+-----------------+-------------------+-------------------+
2 rows selected (24.129 seconds)

The result shows that the information of xiaozhao is ranked second in the output because the query
result is sorted in descending order of time although data about xiaozhao is entered first.

4.3.3.3 Querying by Multiple Criteria


Example 1: Query the cga_info3 table grouped by name, and find the persons whose total value of
time is greater than or equal to 30.

> select name,sum(timest) all_time from cga_info3 group by name having all_time >= 30;

+-----------+--------------+
| name | all_time |
+-----------+--------------+
| xiaoli | 40 |
| xiaozhou | 33 |
+-----------+--------------+
2 rows selected (24.683 seconds)

Example 2: Query the cga_info3 table grouped by gender, and find the greatest time value for each
gender.

> select gender,max(timest) from cga_info3 group by gender;

+---------+---------+
| gender | _c1 |
+---------+---------+
| female | 40 |
| male | 33 |
+---------+---------+
2 rows selected (24.35 seconds)

Example 3: Check the numbers of women and men respectively in the cga_info3 table.

> select gender,count(1) num from cga_info3 group by gender;


+---------+---------+
| gender | num |
+---------+---------+
| female | 2 |
| male | 3 |
+---------+---------+
2 rows selected (23.828 seconds)

Example 4: Insert women information in the cga_info7 table into the cga_info3 table.

Step 1 Create internal table cga_info7.

> create table cga_info7(name string,gender string,timest int) row format
delimited fields terminated by ',' stored as textfile;

No rows affected (0.282 seconds)

Step 2 Create local file cga222.dat in /home/userXX and enter content.

> cd /home/userXX
> touch cga222.dat
> vim cga222.dat

xiaozhao,female,20
xiaochen,female,28

Step 3 Load local data to the cga_info7 table.

> beeline
> use stuXX_db;

> load data local inpath '/home/userXX/cga222.dat' into table cga_info7;


INFO : Loading data to table stu01_db.cga_info7 from
file:/home/user01/cga222.dat
No rows affected (0.423 seconds)

Step 4 Load women information in the cga_info7 table to the cga_info3 table.

> insert into cga_info3 select * from cga_info7 where gender='female';


No rows affected (20.232 seconds)

Step 5 View the content in cga_info3.

> select * from cga_info3;


+------------------+-------------------+-------------------+
| cga_info3.name | cga_info3.gender | cga_info3.timest |
+------------------+-------------------+-------------------+
| xiaozhao | female | 20 |
| xiaochen | female | 28 |
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaoli | female | 40 |
| xiaozhou | male | 33 |
+------------------+-------------------+-------------------+
7 rows selected (0.224 seconds)

The output shows that two pieces of women information in the cga_info7 table have been added to
the cga_info3 table.
Example 5: Query the sum of time values of the people in the cga_info3 table based on the name
and gender.

> select name,gender,sum(timest) timest from cga_info3 group by name,gender;

+-----------+---------+----------+
| name | gender | timest |
+-----------+---------+----------+
| xiaochen | female | 28 |
| xiaoli | female | 40 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaozhao | female | 40 |
| xiaozhou | male | 33 |
+-----------+---------+----------+
6 rows selected (23.554 seconds)

The output shows that two pieces of xiaozhao information are merged in the cga_info3 table.
Example 6: Calculate the sum of the time values of each person in the cga_info3 table, and then rank
the records by time in descending order within each gender.

> select *,row_number() over(partition by gender order by timest desc) rank
from (select name,gender,sum(timest) timest from cga_info3 group by
name,gender) b;

+-----------+-----------+----------+--------+
| b.name | b.gender | b.timest | rank |
+-----------+-----------+----------+--------+
| xiaozhao | female | 40 | 1 |
| xiaoli | female | 40 | 2 |
| xiaochen | female | 28 | 3 |
| xiaozhou | male | 33 | 1 |
| xiaosun | male | 25 | 2 |
| xiaoqian | male | 21 | 3 |
+-----------+-----------+----------+--------+
6 rows selected (52.762 seconds)
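As an optional extension of Example 6, the ranked subquery can be wrapped once more to keep only the record with the largest time value for each gender (a sketch based on the same tables):

> select c.name,c.gender,c.timest from (select *,row_number() over(partition by gender order by timest desc) rank from (select name,gender,sum(timest) timest from cga_info3 group by name,gender) b) c where c.rank=1;

The expected result contains one row per gender.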

----End

4.3.4 Hive Join Operations


Create two tables cga_info8 and cga_info9 as instructed in section 4.3.2.4.

Create table cga_info8:

> beeline
> use stuXX_db;

> create table cga_info8(name string,age int) row format delimited fields
terminated by ',' stored as textfile;

Create a text file, for example cga8.dat, with the following content:


> cd /home/userXX
> touch cga8.dat
> vim cga8.dat

GuoYijun,5
YuanJing,10
Liyuan,20

Load the data into table cga_info8:

> beeline
> use stuXX_db;
> load data local inpath '/home/userXX/cga8.dat' into table cga_info8;

Create table cga_info9:

> create table cga_info9(name string,gender string) row format delimited fields
terminated by ',' stored as textfile;

Create a text file, for example cga9.dat, with the following content:


> cd /home/userXX
> touch cga9.dat
> vim cga9.dat

YuanJing,male
Liyuan,male
LiuYang,female
Lilei,male

Load the data into table cga_info9:

> beeline
> use stuXX_db;
> load data local inpath '/home/userXX/cga9.dat' into table cga_info9;

Query the content in cga_info8.

> select * from cga_info8;

+-----------------+-------------------+
| cga_info8.name | cga_info8.age |
+-----------------+-------------------+
| GuoYijun | 5 |
| YuanJing | 10 |
| Liyuan | 20 |
+-----------------+-------------------+
3 rows selected (0.212 seconds)

Query the content in cga_info9.

> select * from cga_info9;


+-----------------+-------------------+
| cga_info9.name | cga_info9.gender |
+-----------------+-------------------+
| YuanJing | male |
| Liyuan | male |
| LiuYang | female |
| Lilei | male |
+-----------------+-------------------+
4 rows selected (0.227 seconds)

4.3.4.1 join/inner join


The join and inner join operations are equivalent. They associate the records of two tables and
return only the rows that match in both tables.
The following statement uses join to associate information about the same person in the cga_info8
and cga_info9 tables.

> select * from cga_info9 a join cga_info8 b on a.name=b.name;


+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| YuanJing | male | YuanJing | 10 |
| Liyuan | male | Liyuan | 20 |
+-----------+-----------+-----------+--------+
2 rows selected (24.954 seconds)

The following statement uses inner join to associate information about the same person in the
cga_info8 and cga_info9 tables.

> select * from cga_info9 a inner join cga_info8 b on a.name=b.name;


+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| YuanJing | male | YuanJing | 10 |
| Liyuan | male | Liyuan | 20 |
+-----------+-----------+-----------+--------+
2 rows selected (25.07 seconds)

4.3.4.2 left join


left join: Indicates left external association. The table before left join is used as the primary table to
associate with the other table. The returned number of records is the same as that in the primary
table. The fields that cannot be associated are set to NULL.
Use left join to associate information about the same person in the cga_info8 and cga_info9 tables.

select * from cga_info9 a left join cga_info8 b on a.name=b.name;


+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| YuanJing | male | YuanJing | 10 |
| Liyuan | male | Liyuan | 20 |
| LiuYang | female | NULL | NULL |
| Lilei | male | NULL | NULL |
+-----------+-----------+-----------+--------+
4 rows selected (24.324 seconds)

4.3.4.3 right join


right join: Indicates right external association. The table after right join is used as the primary table to
associate with the table before right join. The returned number of records is the same as that in the
primary table. The fields that cannot be associated are set to NULL.
Use right join to associate information about the same person in the cga_info8 and cga_info9 tables.

> select * from cga_info9 a right join cga_info8 b on a.name=b.name;


+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| NULL | NULL | GuoYijun | 5 |
| YuanJing | male | YuanJing | 10 |
| Liyuan | male | Liyuan | 20 |
+-----------+-----------+-----------+--------+
3 rows selected (23.225 seconds)

4.3.4.4 full join


full join: Indicates full external association. The records of both tables are used as the benchmark.
All records of the two tables are returned and matched where possible. The fields that cannot be associated are
NULL.

Use full join to associate information about the same person in the cga_info8 and cga_info9 tables.

> select * from cga_info9 a full join cga_info8 b on a.name=b.name;


+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| NULL | NULL | GuoYijun | 5 |
| Lilei | male | NULL | NULL |
| LiuYang | female | NULL | NULL |
| Liyuan | male | Liyuan | 20 |
| YuanJing | male | YuanJing | 10 |
+-----------+-----------+-----------+--------+
5 rows selected (26.763 seconds)

4.3.4.5 left semi join


left semi join: The table before left semi join is used as the primary table. Only the rows of the primary
table whose join key also exists in the associated table are returned, and only the primary table's columns appear in the output.
Use left semi join to associate information about the same person in the cga_info8 and cga_info9
tables.

> select * from cga_info9 a left semi join cga_info8 b on a.name=b.name;


+-----------+-----------+
| a.name | a.gender |
+-----------+-----------+
| YuanJing | male |
| Liyuan | male |
+-----------+-----------+
2 rows selected (24.96 seconds)

4.3.4.6 map join


map join is an optimization function of Hive and applies to joining a small table to a large
table. The join is performed in the Map phase in memory; therefore there is no need to start the
Reduce task or go through the shuffle phase, which saves resources and improves join efficiency.
Use map join to associate information about the same person in the cga_info8 and cga_info9 tables.

> select /*+ mapjoin(b)*/ * from cga_info9 a join cga_info8 b on a.name=b.name;
+-----------+-----------+-----------+--------+
| a.name | a.gender | b.name | b.age |
+-----------+-----------+-----------+--------+
| YuanJing | male | YuanJing | 10 |
| Liyuan | male | Liyuan | 20 |
+-----------+-----------+-----------+--------+
2 rows selected (25.129 seconds)
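Note: Besides the /*+ mapjoin */ hint, Hive can also convert a join to a map join automatically when one of the tables is small enough. The following parameters control this behavior; the values shown are only an illustration, and the actual defaults depend on the cluster configuration.

> set hive.auto.convert.join=true;
> set hive.mapjoin.smalltable.filesize=25000000;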

4.3.5 Hive on Spark Operation


On the beeline client, set the computing engine to Spark.

> set hive.execution.engine=spark;


No rows affected (0.004 seconds)

Query the sum of time values of people in the cga_info3 table based on the name and gender.

> select name,gender,sum(timest) timest from cga_info3 group by name,gender;

If this query does not work because of insufficient memory or resources, you can run a simple
query without grouping:
> select name,gender from cga_info3;

+-----------+---------+---------+
| name | gender | timest |
+-----------+---------+---------+
| xiaochen | female | 28 |
| xiaoli | female | 40 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaozhao | female | 40 |
| xiaozhou | male | 33 |
+-----------+---------+---------+
6 rows selected (1.213 seconds)

Compared with the result of Example 5 in section 4.3.3, the query on Hive on Spark takes about 1
second, which is much faster than on Hive on MapReduce. (The time will differ in your environment;
it depends on the YARN settings.)
To switch execution back to MapReduce:

> set hive.execution.engine=mr;
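You can check which engine is currently in use at any time; running set with the parameter name but no value prints the current setting:

> set hive.execution.engine;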

4.3.6 Associating a Hive Table with an HBase Table


Step 1 Enter the HBase shell.

> hbase shell

Step 2 Create HBase table stuXX_student.

> create 'stuXX_student','info'

0 row(s) in 0.4750 seconds


=> Hbase::Table - stuXX_student

Step 3 Enter information in the HBase table.

> put 'stuXX_student','001','info:name','lilei'


0 row(s) in 0.1310 seconds
> put 'stuXX_student','002','info:name','tom'
0 row(s) in 0.0210 seconds

Step 4 View information in the table.

> scan 'stuXX_student'


ROW COLUMN+CELL
001 column=info:name, timestamp=1523544015712, value=lilei

002 column=info:name, timestamp=1523544040443, value=tom


2 row(s) in 0.0370 seconds

Step 5 Create a Hive external table cga_hbase_hive and associate it with the student table.

> beeline
> use stuXX_db;
> create external table cga_hbase_hive (key int,gid map<string,string>)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with
SERDEPROPERTIES ("hbase.columns.mapping" ="info:") TBLPROPERTIES
("hbase.table.name" ="stuXX_student");

No rows affected (0.433 seconds)

Step 6 Query the content in the cga_hbase_hive table.

> select * from cga_hbase_hive;


+---------------------+---------------------+
| cga_hbase_hive.key | cga_hbase_hive.gid |
+---------------------+---------------------+
| 1 | {"name":"lilei"} |
| 2 | {"name":"tom"} |
+---------------------+---------------------+
2 rows selected (0.477 seconds)

Step 7 Query name information in the cga_hbase_hive table.

> select gid['name'] from cga_hbase_hive;


+--------+
| _c0 |
+--------+
| lilei |
| tom |
+--------+
2 rows selected (0.733 seconds)

The experiment result shows that the Hive table is associated with the HBase table.
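Because the Hive table only references the HBase table, a row that is newly written to HBase is visible in Hive immediately. As an optional check, run the put in the HBase shell and the select in beeline:

> put 'stuXX_student','003','info:name','jerry'
> select * from cga_hbase_hive;

The query result now contains a third row with key 3.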
----End

4.3.7 Merging Small Hive Files


Step 1 Upload two small files to the /user/hive/warehouse/stuXX_db.db/cga_info1 folder of the HDFS
and check the folder content.

> hdfs dfs -put /home/userXX/cga8.dat /user/hive/warehouse/stuXX_db.db/cga_info1

> hdfs dfs -put /home/userXX/cga9.dat /user/hive/warehouse/stuXX_db.db/cga_info1

> hdfs dfs -ls -h /user/hive/warehouse/stuXX_db.db/cga_info1

Found 2 items

…… stu01 hive 17 2018-04-13 15:32


/user/hive/warehouse/stu01_db.db/cga_info1/cga8.dat
…… stu01 hive 15 2018-04-13 15:32
/user/hive/warehouse/stu01_db.db/cga_info1/cga9.dat

The /user/hive/warehouse/stuXX_db.db/cga_info1 folder contains two files.

Step 2 On the Hive client, set the parameter of whether to merge Reduce output files to true.

> beeline
> use stuXX_db;
> set hive.merge.mapredfiles= true;

No rows affected (0.037 seconds)

Step 3 Create table cga_info10 and load the content of table cga_info1 to the new table.

> create table cga_info10 as select * from cga_info1;


No rows affected (20.93 seconds)

Step 4 View the files of table cga_info10 in the HDFS.

> hdfs dfs -ls -h /user/hive/warehouse/stuXX_db.db/cga_info10

18/04/13 15:38:23 INFO hdfs.PeerCache: SocketCache disabled.


Found 1 items
-rw-------+ 3 stu01 hive 110 2018-04-13 15:34
/user/hive/warehouse/cga_info10/000000_0

The result shows that the two small files that should be output in the Reduce phase have been
merged into one because the parameter setting has been modified in step 2.
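Note: hive.merge.mapredfiles only switches merging on for reduce-side output. The related parameters below also affect merging of small files; the values shown are listed only for reference, and the actual defaults depend on the cluster.

> set hive.merge.mapfiles=true;
> set hive.merge.size.per.task=256000000;
> set hive.merge.smallfiles.avgsize=16000000;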
----End

4.3.8 Hive Column Encryption


Currently, Hive columns are encrypted using the AES algorithm.
The encryption must be specified during table creation. The encryption class name corresponding to
AES is org.apache.hadoop.hive.serde2.AESRewriter.

Step 1 Create table cga_info11 and encrypt the name column.

> use stuXX_db;


> create table cga_info11(name string,gender string,timest int) row format
serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' with
serdeproperties
('column.encode.columns'='name','column.encode.classname'='org.apache.hadoop.
hive.serde2.AESRewriter') stored as textfile;

No rows affected (1.097 seconds)

Step 2 Load data in table cga_info3 to table cga_info11.



> insert into cga_info11 select * from cga_info3;


No rows affected (21.994 seconds)

Step 3 Query the content in cga_info11.

> select * from cga_info11;


+------------------+--------------------+---------------------+
| cga_info11.name | cga_info11.gender | cga_info11.timest |
+------------------+--------------------+---------------------+
| xiaozhao | female | 20 |
| xiaochen | female | 28 |
| xiaozhao | female | 20 |
| xiaoqian | male | 21 |
| xiaosun | male | 25 |
| xiaoli | female | 40 |
| xiaozhou | male | 33 |
+------------------+--------------------+---------------------+
7 rows selected (0.346 seconds)

Step 4 Check the encryption effect.

> hdfs dfs -cat /user/hive/warehouse/stuXX_db.db/cga_info11/000000_0


18/04/13 15:21:52 INFO hdfs.PeerCache: SocketCache disabled.
jR091mQ/LIKY0XBCJi8dsw==female20
BRaQqw7O46X/L1YH1ujKEA==female28
jR091mQ/LIKY0XBCJi8dsw==female20
t84/+Zo8Pxiidltw8rAyTA==male21
J3y40cz4TMGs2uKJfHHaEA==male25
pz64eOp896fiocKrV0IpoA==female40
g/sTgzi4MYs9Uotztgg+BQ==male33

The result shows that the names of all the people in the table are encrypted.
----End

4.3.9 Using Hue to Execute HQL


Step 1 Choose Services > Hue.

Step 2 Click Hue(Active).

Step 3 Move the pointer to Query Editors and choose Hive from the drop-down menu.

Step 4 Write the HQL program in the blank area.


Note: In the panel on the left, select your own database: stuXX_db.

Step 5 After compiling the HQL program, select the computing engine and then click Execute.
Enter a query, for example: select * from cga_info1

ALTERNATIVE INTERFACE: Click the blue triangle (the Execute button) to the left of the panel in
which the query was entered.

Step 6 View results.

The results can also be displayed in charts.


ALTERNATIVE INTERFACE: The chart-building button is displayed as an icon to the left of the table
with the query results, which are shown on the Results tab.

----End

4.4 Summary
This experiment describes how to add, delete, modify, and query data in Hive data warehouses, Hive
on Spark, and how to operate HBase using Hive. In Hive join operations, multiple join methods are
introduced to enable trainees to have a more intuitive understanding of join types and their
differences. This experiment helps trainees to reinforce their comprehension about Hive. Note that
stored as textfile must be specified during table creation when loading data. Otherwise, data cannot
be loaded.

5 Data Import and Export Using Loader

5.1 Background
Data migration operations are frequently involved in Big Data services, especially data migration
between relational databases and Big Data components, for example, data migration between
MySQL and HDFS/HBase. The graphical operations of Loader makes data migration more convenient.

5.2 Objective
⚫ Have a good command of using Loader to perform data migration in service scenarios.

5.3 Experiment Tasks

5.3.1 Importing HBase Data to HDFS


Step 1 Choose Services > Loader.

Step 2 Click LoaderServer(Active).

Step 3 Click New Job.



Step 4 Configure the task name and select the type (select Export to export data from HBase to HDFS).
Job Name: stuXX_cg_hbasetohdfs
Connection: stuXX_hdfs_conn
Queue: DEFAULT

Step 5 Click Add, as shown in the preceding figure. Select hdfs-connector and set Name to a unique
value.

Step 6 Click Test. If Test Success is displayed, the system is available.

Step 7 Configure basic information as shown in the following figure, and then click Next.

Step 8 Select HBASE for Source type. Set Number to the number of Map tasks. Fill in 1 here. Then, click
Next.

Step 9 Click Input on the left, select HBase Input, and drag the HBase Input button to the right area.

Step 10 Click output on the left, select File Output, and drag the File Output button to the right area.

Step 11 Query the content in the cga_info table first in order to configure input and output.

> scan 'stuXX_cga_info'


ROW COLUMN+CELL
123002 column=info:address, timestamp=1523351932415,
value=London
123002 column=info:age, timestamp=1523351887009, value=40
123002 column=info:gender, timestamp=1523351993106,
value=female
123002 column=info:name, timestamp=1523351965188,
value=Victoria
123003 column=info:address, timestamp=1523352194766,
value=Redding
123003 column=info:age, timestamp=1523352108282, value=30
123003 column=info:gender, timestamp=1523352060912,
value=female
123003 column=info:name, timestamp=1523352091677, value=Taylor
123004 column=info:address, timestamp=1523352217267,
value=Cleveland
123004 column=info:age, timestamp=1523352229436, value=33
123004 column=info:gender, timestamp=1523352267416, value=male
123004 column=info:name, timestamp=1523352251926, value=LeBron

3 row(s) in 0.0560 seconds

Step 12 Configure the HBase input. Double-click the HBase Input button on the web UI. Enter table
name stuXX_cga_info, click Add, enter the family name, column name, field name, and type in
sequence, select is rowkey, and click OK.

HBase table name: stuXX_cga_info

Step 13 Double-click the File Output button on the web UI to configure the HDFS output. Specify the
output delimiter, click associate, enter serial numbers in the position column, and then click
OK.

Output delimiter: ,

Step 14 Connect HBase Input and File Output.

Step 15 Click Next to configure To.

Step 16 Enter the output path, select the file format, and click Save and run.
Output path: /tmp/stuXX/cg_hbasetohdfs

Step 17 Check the running result.

Step 18 View the HDFS output.

Find the export file in the /tmp/stuXX/cg_hbasetohdfs folder:

> hdfs dfs -ls /tmp/stuXX/cg_hbasetohdfs

-rw-rw-rw-+ 3 loader hadoop 32 2021-09-17 18:43


/tmp/stu19/cg_hbasetohdfs/export_part_1631277441726_0002_0000000

Then substitute the file name into the command that displays its content:


> hdfs dfs -cat
/tmp/stuXX/cg_hbasetohdfs/export_part_1631277441726_0002_0000000

123002,Victoria,female,40,London
123003,Taylor,female,30,Redding
123004,LeBron,male,33,Cleveland

The output shows that the content of the stuXX_cga_info table is successfully moved to the
export_part_1631277441726_0002_0000000 file in the /tmp/stuXX/cg_hbasetohdfs directory.
----End

5.3.2 Loading HDFS Data to HBase


Step 1 Create a table named stuXX_cg_hdfstohbase.

> hbase shell


> create 'stuXX_cg_hdfstohbase','info'
0 row(s) in 0.4350 seconds
=> Hbase::Table - cg_hdfstohbase

Step 2 Perform the first three steps of section 5.3.1 to go to the page for configuring Loader and
configure basic information.

Name: stuXX_cg_hdfstohbase
Connection: stuXX_hdfs_conn (this connection was created in the previous task)

Queue: DEFAULT

Step 3 Configure From.


Enter /tmp/stuXX/cg_hbasetohdfs/export_part_1631277441726_0002_0000000 in Input path and *
in File filter. Select UTF-8 for Encode type. Then, click Next.

Step 4 Click Input on the left, select CSV File Input, and drag the CSV File Input button to the right
area.

Step 5 Click output on the left, select HBase Output, and drag the HBase Output button to the right
area.

Step 6 Configure the CSV file input. Double-click the CSV File Input button on the web UI. Enter the
delimiter of the table and click Add. Enter the position serial number, field name, and type in
sequence, and then click OK.

Delimiter: ,

Step 7 Configure HBase output. Double-click the HBase Output button on the web UI and click
associate.

Step 8 Select the check boxes in the Name column and click OK.

Step 9 Enter the table name, select rowkey as the primary key, and click OK.
Table Name: stuXX_cg_hdfstohbase

Step 10 Connect CSV File Input and HBase Output.

Step 11 Click Next to configure To. Set Storage type to HBASE_PUTLIST, set Number to 1, and click Save
and run.

Step 12 View results.

Step 13 Query the content in the stuXX_cg_hdfstohbase table in HBase.

> scan 'stuXX_cg_hdfstohbase'


ROW COLUMN+CELL
123002 column=info:address, timestamp=1523623659052, value=London
123002 column=info:age, timestamp=1523623659052, value=40
123002 column=info:gender, timestamp=1523623659052, value=female
123002 column=info:name, timestamp=1523623659052, value=Victoria
123003 column=info:address, timestamp=1523623659052, value=Redding
123003 column=info:age, timestamp=1523623659052, value=30
123003 column=info:gender, timestamp=1523623659052, value=female
123003 column=info:name, timestamp=1523623659052, value=Taylor
123004 column=info:address, timestamp=1523623659052,
value=Cleveland
123004 column=info:age, timestamp=1523623659052, value=33
123004 column=info:gender, timestamp=1523623659052, value=male
123004 column=info:name, timestamp=1523623659052, value=LeBron
3 row(s) in 0.0480 seconds

The result shows that the content of HDFS file
/tmp/stuXX/cg_hbasetohdfs/export_part_1631277441726_0002_0000000 has been loaded to the
stuXX_cg_hdfstohbase table.
----End

5.3.3 Importing HDFS Data to MySQL


Step 1 Create a file named test_mysql.txt in the /home/userXX folder, and write data into the file.

> touch test_mysql.txt


> vim test_mysql.txt

1,Tom,male,8
2,Lily,female,24
3,Lucy,female,50

Step 2 Upload local file test_mysql to the /user/app_stuXX/loader_test directory of the HDFS.

> hdfs dfs -mkdir /user/app_stuXX/loader_test


> hdfs dfs -put /home/userXX/test_mysql.txt /user/app_stuXX/loader_test
> hdfs dfs -ls /user/app_stuXX/loader_test

Found 1 items
-rw-r--r--+ 3 user01 supergroup 47 2018-04-15
13:09/user/app_stu01/loader_test/test_mysql.txt

Step 3 View the content of the test_mysql.txt file in the HDFS.



> hdfs dfs -cat /user/app_stuXX/loader_test/test_mysql.txt


1,tom,male,8
2,lily,female,24
3,lucy,female,50

Step 4 On a Linux node, enter MySQL.

> mysql -uroot -pHuawei@010203

Step 5 Create a database named stuXX_loadertest.

mysql> create database stuXX_loadertest;


Query OK, 1 row affected (0.00 sec)

mysql> set names utf8;


Query OK, 0 rows affected (0.00 sec)

mysql> use stuXX_loadertest;


Database changed

Step 6 Create a table named cga_mysql.

mysql> create table cga_mysql(id int(4) not null primary key auto_increment,
name varchar(255) not null, gender varchar(255) not null, time int(4));

Note: A MySQL table used with Loader must have a primary key.

Step 7 View the structure of the cga_mysql table.

mysql> desc cga_mysql;


+--------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+----------------+
| id | int(4) | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | | NULL | |
| gender | varchar(255) | NO | | NULL | |
| time | int(4) | YES | | NULL | |
+--------+--------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)

Step 8 Copy the MySQL link JAR package to the specified directory of the active and standby Loader.

> cp /FusionInsight-Client/mysql-connector-java-5.1.21.jar
/opt/huawei/Bigdata/FusionInsight_Porter_6.5.1/install/FusionInsight-Sqoop-
1.99.3/FusionInsight-Sqoop-1.99.3/server/webapps/loader/WEB-INF/ext-lib
(to the active node)

> scp /FusionInsight-Client/mysql-connector-java-5.1.21.jar


root@192.168.130.26:/opt/huawei/Bigdata/FusionInsight_Porter_6.5.1/install/Fu
sionInsight-Sqoop-1.99.3/FusionInsight-Sqoop-
1.99.3/server/webapps/loader/WEB-INF/ext-lib
(to the standby node)

NOTE: The files have already been copied. You do not need to run these commands.



Step 9 Check the content in /opt/huawei/Bigdata/FusionInsight_Porter_6.5.1/install/FusionInsight-
Sqoop-1.99.3/FusionInsight-Sqoop-1.99.3/server/webapps/loader/WEB-INF/ext-lib/ on the
active and standby nodes.

> ll /opt/huawei/Bigdata/FusionInsight_Porter_6.5.1/install/FusionInsight-
Sqoop-1.99.3/FusionInsight-Sqoop-1.99.3/server/webapps/loader/WEB-INF/ext-lib

total 940
-rwxr-xr-x 1 root root 118057 Jan 23 11:36 hive-jdbc-1.3.0.jar
-rwxr-xr-x 1 omm wheel 827942 Feb 8 10:36 mysql-connector-java-5.1.21.jar
-rwxr-xr-x 1 omm wheel 18 Nov 23 2015 readme.properties

The result shows that the MySQL link JAR package has been copied to the specified directory of the
active and standby Loader.

Step 10 Restart the Loader.

If mysql-connector-java-x.x.x.jar is already available in the directory of the active and


standby Loader, you do not need to restart the Loader. You need to copy the .jar file on
both the active and standby Loader nodes.

Step 11 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Name: stuXX_cg_hdfstomysql
Queue: DEFAULT

Step 12 Click Edit to start the MySQL connection configuration. The MySQL password is
Huawei@010203. After filling in the information, click Test. After the test is complete, click OK.

Enter the JDBC connection string of your own database in JDBC Connection String, as listed below.


Name: stuXX_mysql
Connector: generic-jdbc-connector
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://192.168.130.24:3306/stuXX_loadertest
or jdbc:mysql://192.168.130.25:3306/stuXX_loadertest (depending on which machine
you are working on)
UserName: root
Password: Huawei@010203

Enter /user/app_stuXX/loader_test in Input directory.

Step 13 Click input on the left, select CSV File Input, and drag the CSV File Input button to the right
area.

Step 14 Click output on the left, select Table Output, and drag the Table Output button to the right
area.

Step 15 Configure the CSV file input. Double-click the CSV File Input button on the web UI. Enter the
delimiter ‘,’ and click Add. Enter the position serial number, field name, and type, and then click
OK.

Note: In the position column, specify the positions starting from 2, that is, 2, 3, 4 (the source file
contains four attributes per line; the first one is the identifier, which is generated automatically as
the primary key in the MySQL table).

Step 16 Double-click the Table Output button on the web UI. Click associate, select the check boxes in
the Name column, and click OK.

Step 17 Enter the field name, table column name, and type, and then click OK.

Step 18 Connect CSV File Input and Table Output.

Step 19 Click Next to start output configuration. Enter the table name and click Save and run.

Step 20 Run Loader jobs and view the result.



Step 21 View the content in the cga_mysql table in MySQL.

mysql> use stuXX_loadertest;


Database changed
mysql> select * from cga_mysql;
+----+------+--------+------+
| id | name | gender | time |
+----+------+--------+------+
| 1 | tom | male | 8 |
| 2 | lily | female | 24 |
| 3 | lucy | female | 50 |
+----+------+--------+------+
3 rows in set (0.00 sec)

The result shows that the content of the test_mysql file in the HDFS has been loaded to the
cga_mysql table of the MySQL database.
----End

5.3.4 Importing MySQL Data to HDFS


Step 1 Prepare a MySQL table using data in the cga_mysql table created in section 5.3.3.

mysql> use stuXX_loadertest;


Database changed

mysql> select * from cga_mysql;


+----+------+--------+------+
| id | name | gender | time |
+----+------+--------+------+
| 1 | tom | male | 8 |
| 2 | lily | female | 24 |
| 3 | lucy | female | 50 |
+----+------+--------+------+
3 rows in set (0.00 sec)

Step 2 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Name: stuXX_cg_mysqltohdfs
Queue: DEFAULT
Connection: stuXX_mysql

Step 3 Click Next to configure From. Enter the table name.


Set the Need partition column parameter to False.

The name of a non-default database table is in the format of "database name.table


name".

Step 4 Click input on the left, select Table Input, and drag the Table Input button to the right area.

Step 5 Click output on the left, select File Output, and drag the File Output button to the right area.

Step 6 Double-click the Table Input button on the web UI. Click Add, enter the position serial number,
field name, and type, and click OK.

Step 7 Double-click the File Output button on the web UI. Configure Output delimiter and click
associate. Then select the check boxes in the Name column and click OK.

Output delimiter: ,

Step 8 Connect Table Input and File Output, and click Next.

Step 9 Configure To. Enter /user/app_stuXX/loader_test for Output directory.

Step 10 Run Loader jobs and view the result.

Step 11 View the result in the HDFS.

> hdfs dfs -ls /user/app_stuXX/loader_test


Found 3 items
…… 2018-04-15 14:41 /user/app_stu01/loader_test/_SUCCESS
…… 2018-04-15 14:41 /
user/app_stu01/loader_test/import_part_1522461215526_0114_0000000
…… 2018-04-15 13:09 /user/app_stu01/loader_test/test_mysql.txt

> hdfs dfs -cat /user/app_stuXX/loader_test/<file_name> (substitute the name of the
corresponding import file)
> hdfs dfs -cat
/user/app_stuXX/loader_test/import_part_1522461215526_0114_0000000

1,tom,male,8
2,lily,female,24
3,lucy,female,50

The result shows that MySQL table cga_mysql has been imported to the
/user/app_stuXX/loader_test directory of the HDFS.
----End

5.3.5 Importing MySQL Data to HBase


Step 1 Prepare a MySQL table using data in the cga_mysql table created in section 5.3.3.

mysql> use stuXX_loadertest;


mysql> select * from cga_mysql;
+----+------+--------+------+
| id | name | gender | time |
+----+------+--------+------+
| 1 | tom | male | 8 |
| 2 | lily | female | 24 |
| 3 | lucy | female | 50 |
+----+------+--------+------+
3 rows in set (0.00 sec)

Step 2 Create an HBase table named stuXX_cg_mysqltohbase.

> hbase shell


> create 'stuXX_cg_mysqltohbase','info'
0 row(s) in 0.5300 seconds

Step 3 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Name: stuXX_cg_mysqltohbase
Connection: stuXX_mysql
Queue: DEFAULT

Step 4 Click Next to configure From. Enter the table name.


Set the Need partition column parameter to False.

Step 5 Click input on the left, select Table Input, and drag the Table Input button to the right area.

Step 6 Click output on the left, select HBase Output, and drag the HBase Output button to the right
area.

Step 7 Double-click the Table Input button on the web UI. Click Add, enter the position serial number,
field name, and type, and click OK.

Step 8 Configure the HBase output. Double-click the HBase Output button on the web UI. Click
associate, select the check boxes in the Name column, and click OK.

Step 9 Enter the HBase table name, column family name, column name, and type, select id as the
primary rowkey, and click OK.

Table Name: stuXX_cg_mysqltohbase

Step 10 Connect Table Input and HBase Output, and click Next.

Step 11 Set Storage type to HBASE_PUTLIST, HBase instance to HBase, and Number to 1, and then click
Save and run.

Step 12 Run Loader jobs and view the result.

Step 13 View data in the HBase table.

> scan 'stuXX_cg_mysqltohbase'


ROW COLUMN+CELL
2018-04-15 15:21:33,777 INFO [hconnection-0xaa61e4e-shared--pool4-t1]
ipc.AbstractRpcClient: RPC Server Kerberos principal name for
service=ClientService is hbase/hadoop.hadoop.com@HADOOP.COM
1 column=info:gender, timestamp=1523776665511, value=male
1 column=info:name, timestamp=1523776665511, value=tom
1 column=info:time, timestamp=1523776665511, value=8
2 column=info:gender, timestamp=1523776665511,
value=female
2 column=info:name, timestamp=1523776665511, value=lily
2 column=info:time, timestamp=1523776665511, value=24
3 column=info:gender, timestamp=1523776665511,
value=female
3 column=info:name, timestamp=1523776665511, value=lucy
3 column=info:time, timestamp=1523776665511, value=50
3 row(s) in 0.0700 seconds

The result shows that MySQL table cga_mysql has been successfully loaded to HBase table
stuXX_cg_mysqltohbase.
----End

5.3.6 Importing HBase Data to MySQL


Step 1 Use HBase table stuXX_cg_mysqltohbase created in section 5.3.5. Create table
cga_hbasetomysql in the MySQL database.

mysql> use stuXX_loadertest;


mysql> create table cga_hbasetomysql(id int(4) not null primary key
auto_increment, name varchar(255) not null, gender varchar(255) not null,
time int(4));

Query OK, 0 rows affected (0.09 sec)

Step 2 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Name: stuXX_cg_hbasetomysql
Connection: stuXX_mysql
Queue: DEFAULT

Step 3 Click Next to configure From. Set Source type to HBASE and Number to 1.

Step 4 Click input on the left, select HBase Input, and drag the HBase Input button to the right area.

Step 5 Click output on the left, select Table Output, and drag the Table Output button to the right
area.

Step 6 Configure the HBase input. Double-click the HBase Input button on the web UI. Enter the HBase
table name, click Add, enter the family name, column name, field name, and type in sequence,
select id as the rowkey, and click OK.

HBase table name: stuXX_cg_mysqltohbase

Step 7 Double-click the Table Output button on the web UI. Click associate, select the check boxes in
the Name column, and click OK.

Step 8 Connect HBase Input and Table Output, and click Next.

Step 9 Configure To. Set Table name to cga_hbasetomysql, and click Save and run.

Step 10 Run Loader jobs and view the result.

Step 11 View the content in MySQL table cga_hbasetomysql.

mysql> use stuXX_loadertest;


mysql> select * from cga_hbasetomysql;
+----+------+--------+------+
| id | name | gender | time |
+----+------+--------+------+
| 1 | tom | male | 8 |
| 2 | lily | female | 24 |
| 3 | lucy | female | 50 |
+----+------+--------+------+

3 rows in set (0.00 sec)

The result shows that the content of HBase table cg_mysqltohbase has been successfully loaded to
MySQL table cga_hbasetomysql.
----End

5.3.7 Importing MySQL Data to Hive


Step 1 Use data in MySQL table cga_mysql in 5.3.3 and create Hive table stuXX_db.cg_mysqltohive.
Create the /user/hive/warehouse/stuXX_db.db/cg_mysqltohive directory in the HDFS.

> hdfs dfs -mkdir /user/hive/warehouse/stuXX_db.db/cg_mysqltohive

Create table stuXX_db.cg_mysqltohive in Hive.

> use stuXX_db;


> create table cg_mysqltohive(id int,name string,gender string,timest int)
row format delimited fields terminated by ',' stored as textfile location
'/user/hive/warehouse/stuXX_db.db/cg_mysqltohive';
No rows affected (0.372 seconds)

Step 2 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Name: stuXX_cg_mysqltohive
Connection: stuXX_mysql
Queue: DEFAULT

Step 3 Click Next to configure From. Set the table name to cga_mysql.
Need partition column: false

Step 4 Click Next to start transform configuration. Click input on the left, select Table Input, and drag
the Table Input button to the right area.

Step 5 Click output on the left, select Hive Output, and drag the Hive Output button to the right area.

Step 6 Configure table input. Double-click the Table Input button on the web UI. Click Add, enter the
position serial number, field name, and type, and click OK.

Step 7 Configure the Hive output. Double-click the Hive Output button on the web UI. Click associate,
select the check boxes in the Name column, and click OK.

Hive File Storage Format: CSV

Step 8 Complete the position information and click OK.


id: INTEGER
name: STRING
gender: STRING
time: INTEGER

Step 9 Connect Table Input and Hive Output, and click Next.

Step 10 Configure To. Set Storage type to HIVE, Output directory to


/user/hive/warehouse/stuXX_db.db/cg_mysqltohive, and Number to 1. Then click Save and
run.

Step 11 Run Loader jobs and view the result.



Step 12 View Hive table cg_mysqltohive.

> use stuXX_db;


> select * from cg_mysqltohive;
+---------------------+-----------------------+-------------------------+-------------------------+
| cg_mysqltohive.id | cg_mysqltohive.name | cg_mysqltohive.gender | cg_mysqltohive.timest |
+---------------------+-----------------------+-------------------------+-------------------------+
| 1 | tom | male | 8 |
| 2 | lily | female | 24 |
| 3 | lucy | female | 50 |
+---------------------+-----------------------+-------------------------+-------------------------+
3 rows selected (0.369 seconds)

The result shows that the content of MySQL table cga_mysql has been successfully loaded to Hive
table stuXX_cg_mysqltohive.
----End

5.4 Summary
This experiment describes how to use Loader in various service scenarios. After the experiment,
trainees are expected to be able to solve problems that occur during data migration. Note that you
need to create tables before migrating table data between MySQL, HBase, and Hive. When an
experiment is performed on the MySQL database using Loader, the MySQL table must have a primary
key.

6 Flume Data Collection Practice

6.1 Background
Flume is an important data collection tool among Big Data components. Flume is often used to
collect data from various data sources for other components to analyze. In the log analysis service,
you need to collect server logs to check whether the server is running properly. In real-time services,
data is often collected into Kafka for analysis and processing by real-time components such as
Streaming or Spark. Flume plays an important role in Big Data services.

6.2 Objective
⚫ Understand how to configure Flume and use it to collect data.

6.3 Experiment Tasks

6.3.1 Collecting spooldir Data to the HDFS


6.3.1.1 Using the Configuration Planning Tool to Generate File
properties.properties
Step 1 Download the configuration tool for Flume.
Download the tool at:
http://support.huawei.com/enterprise/docinforeader.action?contentId=DOC1000104118&idPath=79
19749|7919788|19942925|21110924|21112790|21112791|21624194|21830200

Flume is used here to monitor a file directory. Data is saved to the HDFS. The
channel type is memory.

Step 2 Configure the source.


In the Flume Configuration table of the configuration planning tool, click Add Source.
Set SourceName to a1, spoolDir to /home/userXX/spooldir (create spooldir in /home/userXX and
change the permission to 755), set channels to ch1, and retain the default values for other
parameters.
> mkdir /home/userXX/spooldir
> chmod -R 755 /home/userXX/spooldir

The path varies depending on the account. Here, user01 is used as an example.

Step 3 Configure channel information.


Click Add Channel. Set ChannelName to ch1, type to memory, and retain the default values of other parameters.
Step 4 Configure the sink.


Click Add Sink. Set SinkName to s1, type to hdfs, hdfs.path to /user/app_stuXX/flume, and use
default values for other parameters.
> hdfs dfs -mkdir /user/app_stuXX/flume

Set hdfs.kerberosPrincipal to a cluster user in FusionInsight Manager stuXX, for example, stu01. The
path of hdfs.kerberosKeytab is the path where the file is stored in Linux, for example,
/home/userXX/flumetest. Set the permission of the file to 755.

Configure channel to ch1:

> mkdir /home/userXX/flumetest


> chmod -R 755 /home/userXX/flumetest

Step 5 Generate a configuration file.


Click Generate a configuration file; a file named properties.properties is generated automatically.
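For reference, the generated file usually resembles the sketch below. It uses the standard Flume properties format; the agent name client and the exact keys and defaults written by the planning tool are assumptions and may differ in your version, and userXX/stuXX are placeholders:

client.sources = a1
client.channels = ch1
client.sinks = s1

client.sources.a1.type = spooldir
client.sources.a1.spoolDir = /home/userXX/spooldir
client.sources.a1.channels = ch1

client.channels.ch1.type = memory

client.sinks.s1.type = hdfs
client.sinks.s1.hdfs.path = /user/app_stuXX/flume
client.sinks.s1.hdfs.kerberosPrincipal = stuXX
client.sinks.s1.hdfs.kerberosKeytab = /home/userXX/flumetest/user.keytab
client.sinks.s1.channel = ch1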

Step 6 Upload the properties.properties file to the cluster node directory.


Open WinSCP, enter the host name (192.168.130.24 or 192.168.130.25, depending on which machine you are working on), the user name userXX, and the password, and click Login.

Upload the file to the /home/userXX/flumetest directory.

Step 7 Check the Flume data collection result.

> hdfs dfs -ls /user/app_stuXX/flume


> ls /home/userXX/flumetest

----End

6.3.1.2 Installing a Flume Client


Step 1 Decompress the Flume client.

> cp /FusionInsight-Client/Flume/FusionInsight_Cluster_1_Flume_Client.tar
/home/userXX/

> tar -xvf FusionInsight_Cluster_1_Flume_Client.tar

Two files are generated after the decompression:


FusionInsight_Cluster_1_Flume_ClientConfig.tar and
FusionInsight_Cluster_1_Flume_ClientConfig.tar.sha256
Run the tar command to further decompress FusionInsight_Cluster_1_Flume_ClientConfig.tar.

> tar -xvf FusionInsight_Cluster_1_Flume_ClientConfig.tar

The FusionInsight_Cluster_1_Flume_ClientConfig directory is generated.


The directories aix, batch_install, flume, and upgrade, and the file install.sh are obtained.

The path where these files are located:

/home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient

> ls /home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient

aix batch_install flume install.sh upgrade

Step 2 Obtain krb5.conf and user.keytab files.

Log in to FusionInsight Manager using a FusionInsight Manager account stuXX, for example, stu01, and choose System > Rights Configuration > User Management. (In the newer interface, choose System > User > stuXX instead.)

In the Operation column of the corresponding account, click the Download icon to download the krb5.conf and user.keytab files.
Note for the newer interface: for user stuXX, click More > Download Authentication Credential and download a file in the format stu20_1595500715775_keytab.tar to your computer. Then unpack it to obtain the two files krb5.conf and user.keytab.
Use the WinScp tool to upload krb5.conf and user.keytab files to the /home/userXX/flumetest
directory.

Step 3 Create a file named jaas.conf.


Create a file named jaas.conf in the /home/userXX/flumetest directory decompressed in step 1.

> touch jaas.conf

Edit the jaas.conf file.

> vim jaas.conf

The content of the file is as follows:

Client {
com.sun.security.auth.module.Krb5LoginModule required
storeKey=true
principal="stuXX"
useTicketCache=false
keyTab="/home/userXX/flumetest/user.keytab"
debug=true
useKeyTab=true;
};

(principal indicates the user created on FusionInsight Manager; keyTab indicates the path of that user's authentication file user.keytab in Linux.)

Step 4 Modify the flume-env.sh file.


The flume-env.sh file exists in the flume/conf directory of the decompressed file on the Flume client.
Add the following content at the end of JAVA_OPTS:
The path where the file is located:

> cd /home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient/flume/conf

> vim flume-env.sh

In the file, add the following lines at the end of JAVA_OPTS:

-Djava.security.krb5.conf=/home/userXX/flumetest/krb5.conf
-Djava.security.auth.login.config=/home/userXX/flumetest/jaas.conf
-Dzookeeper.server.principal=zookeeper/hadoop.hadoop.com
-Dzookeeper.request.timeout=120000
(Note: The values of -Djava.security.auth.login.config and -Djava.security.krb5.conf must be the corresponding paths in the cluster.)

The final content of the JAVA_OPTS line in the file:

JAVA_OPTS="-Xms2G -Xmx4G -XX:CMSFullGCsBeforeCompaction=1 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=15 -XX:GCLogFileSize=1M -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:${FLUME_GC_LOG_DIR}/Flume-Client-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Djava.security.krb5.conf=/home/userXX/flumetest/krb5.conf -Djava.security.auth.login.config=/home/userXX/flumetest/jaas.conf -Dzookeeper.server.principal=zookeeper/hadoop.hadoop.com -Dzookeeper.request.timeout=120000"

On the installed HDFS client, copy hdfs-site.xml and core-site.xml from hdfs_client/HDFS/hadoop/etc/hadoop/ to /home/userXX/flumetest, and copy hbase-site.xml from hbase_client/HBase/hbase/conf on the HBase client to /home/userXX/flumetest.

> cd /home/userXX/hadoopclient/HDFS/hadoop/etc/hadoop
> cp hdfs-site.xml /home/userXX/flumetest
> cp core-site.xml /home/userXX/flumetest
> cd /home/userXX/hadoopclient/HBase/hbase/conf
> cp hbase-site.xml /home/userXX/flumetest

View the content in the /home/userXX/flumetest folder.

> cd /home/userXX/flumetest
> ll
-rw------- 1 user01 users 8563 Apr 16 22:49 core-site.xml
-rw------- 1 user01 users 9830 Apr 16 22:50 hbase-site.xml
-rw------- 1 user01 users 15277 Apr 16 22:48 hdfs-site.xml
-rw-r--r-- 1 user01 users 199 Apr 16 22:23 jaas.conf
-rw-r--r-- 1 user01 users 757 Apr 15 20:24 krb5.conf
-rw-r--r-- 1 user01 users 2119 Apr 16 21:12 properties.properties
-rw-r--r-- 1 user01 users 126 Apr 15 20:24 user.keytab

Step 5 Install the client. (If you use a non-root user, it is recommended that the installation directory
not contain too many levels; otherwise, the installation may fail.)

> cd /home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient

> ./install.sh -d /home/userXX/flume1 -c /home/userXX/flumetest/properties.properties -l /var/log/Bigdata/

If the message [flume-client install]: install flume client successfully is displayed, the client is installed successfully.

Parameter description:
-d: installation path of the Flume client
-f: service IP addresses of the two MonitorServer roles, separated by a comma (,). This parameter is optional. If it is not set, the Flume client does not send alarm information to the MonitorServer.
-c: configuration file, which is optional. After the installation, you can configure Flume role client parameters by modifying /opt/FlumeClient/fusioninsight-flume-1.6.0/conf/properties.properties.
-l: log directory. This parameter is optional. The default value is /var/log/Bigdata. (The user running the installation needs to have the write permission on the directory.)

Step 6 Check /home/userXX/spooldir. If .flumespool is displayed, the configuration is successful.

> ll /home/userXX/spooldir -a
total 408
drwxrwxrwx 3 root root 4096 Feb 9 13:44 .
drwxrwxrwx 81 root root 12288 Apr 16 23:26 ..
drwxrwxrwx 2 omm wheel 4096 Jan 26 23:46 .flumespool
-rwxrwxrwx 1 root root 389592 Jan 26 23:45 zypper.log.COMPLETED
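As an optional sanity check (not part of the original steps), you can drop a test file into the monitored directory and confirm that Flume uploads it to HDFS; the file name below is only an example:

> echo "hello flume" > /home/userXX/spooldir/test.txt
> hdfs dfs -ls /user/app_stuXX/flume

After Flume processes the file, it is renamed test.txt.COMPLETED in the spooldir directory, as with zypper.log.COMPLETED above.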

----End

6.3.2 Collecting avro Data to the HDFS


The collection of avro data sources using Flume is the collection of serialized data, which involves
port configuration.

6.3.2.1 Using the Configuration Planning Tool to Generate File properties.properties
Step 1 Set Flume name to client.

Step 2 Set the source type to avro, set the listening IP address and port number, set channels to ch2,
and click Add Source.

IP: 192.168.130.24 or 192.168.130.25 (depending on which machine you are working on)
Port Number: 8181
Step 3 Configure channel parameters.


Click Add Channel. Set ChannelName to ch2, type to memory, and retain the default values of other parameters.

Step 4 Configure Sink parameters.


Click Add Sink.
Set SinkName to s2.
Set Hdfs.path to /user/app_stuXX/flume_avro.
Set authentication information, including the authentication account and the address of the
authentication file. The authentication account and the authentication file can be the same as those
in 6.3.1.

Be sure to specify the channel name: channel: ch2


If the cluster is deployed in secure mode, you need to configure parameters hdfs.kerberosPrincipal
and hdfs.kerberosKeytab. If the cluster is in non-secure mode, you do not need to configure the two
parameters.
Create the /home/userXX/flumetest2 directory and set the permission to 755.
Also create the /user/app_stuXX/flume_avro directory in HDFS:
> hdfs dfs -mkdir /user/app_stuXX/flume_avro
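For example, mirroring the commands used for the flumetest directory in 6.3.1:

> mkdir /home/userXX/flumetest2
> chmod -R 755 /home/userXX/flumetest2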

Step 5 Generate a configuration file.


Click Generate a configuration file, and upload the properties.properties configuration file to the specified directory of the cluster, such as /home/userXX/flumetest2. The properties.properties file generated in 6.3.1 will be overwritten.
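For reference, a sketch of the corresponding avro-to-HDFS configuration is shown below. The source name a2 is a hypothetical name chosen here for illustration; the agent name, the exact keys and defaults produced by the planning tool, and the IP address may differ in your environment:

client.sources = a2
client.channels = ch2
client.sinks = s2

client.sources.a2.type = avro
client.sources.a2.bind = 192.168.130.24
client.sources.a2.port = 8181
client.sources.a2.channels = ch2

client.channels.ch2.type = memory

client.sinks.s2.type = hdfs
client.sinks.s2.hdfs.path = /user/app_stuXX/flume_avro
client.sinks.s2.hdfs.kerberosPrincipal = stuXX
client.sinks.s2.hdfs.kerberosKeytab = /home/userXX/flumetest2/user.keytab
client.sinks.s2.channel = ch2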

Step 6 Obtain the krb5.conf and user.keytab files.


For details, see step 2 in 6.3.1.2.

Step 7 Create a file named jaas.conf.


For details, see step 3 in 6.3.1.2.

In the jaas.conf file, change the keyTab line to: keyTab="/home/userXX/flumetest2/user.keytab"


Modify the flume-env.sh file.

The path where the file is located:

> cd /home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient/flume/conf
The flume-env.sh file exists in the flume/conf directory of the decompressed file on the Flume
client. Add the following content at the end of JAVA_OPTS:

> vim flume-env.sh


-Djava.security.krb5.conf=/home/userXX/flumetest2/krb5.conf
-Djava.security.auth.login.config=/home/userXX/flumetest2/jaas.conf
-Dzookeeper.server.principal=zookeeper/hadoop.hadoop.com
-Dzookeeper.request.timeout=120000

----End

6.3.2.2 Creating a Flume Job


Step 1 Reinstall the Flume instance to /home/userXX/flume2.

> cd /home/userXX/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient

> ./install.sh -d /home/userXX/flume2 -c /home/userXX/flumetest2/properties.properties -l /var/log/Bigdata/

If message [flume-client install]: install flume client successfully is displayed, the client is installed
successfully.
Note: After the installation, you can run the ps -ef | grep flume | grep username command to check
the Flume service status.

Step 2 Submit data to the avro port.


Copy the flumeavroclient.jar file from the /FusionInsight-Labs directory to the /home/userXX directory. flumeavroclient.jar submits Hello and World to port 8181 100 times, and the collected data is saved in the over.tmp file.

> cd /home/userXX
> java -cp flumeavroclient.jar org.myorg.SSLAvroclient
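Before submitting data, you can optionally copy the jar with a command and confirm that the avro source is listening on the configured port (the paths and port come from the previous steps; netstat output format varies by operating system):

> cp /FusionInsight-Labs/flumeavroclient.jar /home/userXX/
> netstat -an | grep 8181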

Step 3 Check the data collection result in the HDFS.

> hdfs dfs -ls /user/app_stuXX/flume_avro

-rw-r--r-- 3 user01 supergroup 1300 2018-04-17 /user/app_stu01/flume_avro/over.tmp

----End

6.4 Summary
This experiment describes how to collect data from spooldir and avro data sources using Flume. Through this experiment, trainees are expected to master offline and real-time data collection and gain a better understanding of Flume.
7 Comprehensive Cluster Experiment

7.1 Background
In Big Data services, multiple components are usually built into a service system to meet the
requirements of upper-layer services.
This experiment combines the preceding components to build a Big Data analysis and real-time
query platform.
Loader periodically migrates MySQL database data to Hive first. As Hive data is stored in HDFS,
Loader is used to load data in HDFS to HBase. HBase is used to query data in real time, and the big
data processing capability of Hive is used to analyze related results.

7.2 Objective
⚫ Use Big Data components comprehensively to convert and query data in real time.

7.3 Experiment Tasks

7.3.1 Offline Data Collection and Analysis and Real-Time Query


Involving MySQL, Loader, Hive, and HBase
7.3.1.1 Preparing MySQL Data
Step 1 Log in to the MySQL server.
MySQL is installed on a FusionInsight HD cluster node.

> mysql -uroot -pHuawei@010203

Welcome to the MySQL monitor. Commands end with ; or \g.


Your MySQL connection id is 135
Server version: 5.5.48 MySQL Community Server (GPL)
Type 'help;' or '\h' for help. Type '\c' to clear the current input
statement.
mysql>
Step 2 Use database stuXX_loadertest. (The database was already created in section 5.3.3; there is no need to create a new one.)

mysql> use stuXX_loadertest;

Step 3 Create table socker. (No primary key is defined in the DDL below; a partition column is specified in the Loader job in 7.3.1.2 instead.)

mysql> DROP TABLE IF EXISTS socker;


mysql> CREATE TABLE socker (
time varchar(50) DEFAULT NULL,
open float DEFAULT NULL,
high float DEFAULT NULL,
low float DEFAULT NULL,
close float DEFAULT NULL,
volume varchar(50) DEFAULT NULL,
endprice float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Step 4 Load data to socker.


Copy the socker.csv file from the /FusionInsight-Labs directory to the home directory on a local host.

> cp /FusionInsight-Labs/socker.csv /home/userXX

Load data in socker.csv to the socker table using the tool on the MySQL client.

mysql> LOAD DATA INFILE "/home/userXX/socker.csv" INTO TABLE socker FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
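If MySQL rejects this statement because of the secure_file_priv setting, a common workaround (an assumption about your MySQL configuration, not part of the original lab) is to load the file from the client side instead, provided local_infile is enabled:

mysql> LOAD DATA LOCAL INFILE "/home/userXX/socker.csv" INTO TABLE socker FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';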

Step 5 View data in socker.

mysql> select * from socker limit 10;


+------------+-------+-------+-------+-------+----------+----------+
| time | open | high | low | close | volume | endprice |
+------------+-------+-------+-------+-------+----------+----------+
| 1970-01-02 | 92.06 | 93.54 | 91.79 | 93 | 8050000 | 93 |
| 1970-01-05 | 93 | 94.25 | 92.53 | 93.46 | 11490000 | 93.46 |
| 1970-01-06 | 93.46 | 93.81 | 92.13 | 92.82 | 11460000 | 92.82 |
| 1970-01-07 | 92.82 | 93.38 | 91.93 | 92.63 | 10010000 | 92.63 |
| 1970-01-08 | 92.63 | 93.47 | 91.99 | 92.68 | 10670000 | 92.68 |
| 1970-01-09 | 92.68 | 93.25 | 91.82 | 92.4 | 9380000 | 92.4 |
| 1970-01-12 | 92.4 | 92.67 | 91.2 | 91.7 | 8900000 | 91.7 |
| 1970-01-13 | 91.7 | 92.61 | 90.99 | 91.92 | 9870000 | 91.92 |
| 1970-01-14 | 91.92 | 92.4 | 90.88 | 91.65 | 10380000 | 91.65 |
| 1970-01-15 | 91.65 | 92.35 | 90.73 | 91.68 | 11120000 | 91.68 |
+------------+-------+-------+-------+-------+----------+----------+

----End
7.3.1.2 Loading MySQL Data to Hive


Step 1 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about
Loader.

Click Edit to set JDBC Connection String to jdbc:mysql://192.168.224.41:3306/loadertest1.


The IP address of the server must be assigned by the trainer. If you have any questions, ask the
trainer.
Use the existing connection: stuXX_mysql
Name: stuXX_cg_mysqltohive2
Queue: DEFAULT

Step 2 Configure From.


Set the table name to socker and then click Next.
If no primary key is set in the table, specify Partition column name, for example, 1 or the column name time.

Step 3 Configure Transform.


Drag the Table Input button to the right area.

Step 4 Double-click Table Input and enter the attributes associated with MySQL. Field name indicates
the corresponding fields of MySQL.
Step 5 Configure Hive output.


Drag the Hive output button to the blank area on the right.

Step 6 Output table parameter settings.


Double-click the Hive Output button. Set parameters as prompted, set Output delimiter to a comma (,), and enter the output fields, as shown in the following figure.
Output delimiter: ','
Step 7 Connect Table Input and Hive Output.

Step 8 Create directory /user/app_stuXX/loader_test/socker2 in HDFS.

> hdfs dfs -mkdir /user/app_stuXX/loader_test/socker2

Step 9 Create socker2 in the Hive data warehouse.

> beeline
> use stuXX_db;
> create table socker2(timest string,open float,high float,low float,close
float,volume string,endprice float)row format delimited fields terminated by
',' stored as textfile location '/user/app_stuXX/loader_test/socker2';
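Optionally, before running the Loader job, you can confirm the table definition and its HDFS location (describe formatted is standard HiveQL; the exact output layout depends on the Hive version):

> describe formatted socker2;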

Step 10 Configure To. Set Storage type to HIVE, Output directory to /user/app_stuXX/loader_test/socker2, and Number to 2.
Step 11 Click Save and run. The following information is displayed.

Step 12 View the execution result.

> select * from socker2 limit 10;


+------------+--------------+------------+-----------+-------------+--------------+----------------+
|socker2.timest|socker2.open|socker2.high|socker2.low|socker2.close|socker2.volume|socker2.endprice|
+------------+--------------+------------+-----------+-------------+--------------+----------------+
| 1970-01-02 | 92.06 | 93.54 | 91.79 | 93.0 | 8050000 | 93.0 |
| 1970-01-05 | 93.0 | 94.25 | 92.53 | 93.46 | 11490000 | 93.46 |
| 1970-01-06 | 93.46 | 93.81 | 92.13 | 92.82 | 11460000 | 92.82 |
| 1970-01-07 | 92.82 | 93.38 | 91.93 | 92.63 | 10010000 | 92.63 |
| 1970-01-08 | 92.63 | 93.47 | 91.99 | 92.68 | 10670000 | 92.68 |
| 1970-01-09 | 92.68 | 93.25 | 91.82 | 92.4 | 9380000 | 92.4 |
| 1970-01-12 | 92.4 | 92.67 | 91.2 | 91.7 | 8900000 | 91.7 |
| 1970-01-13 | 91.7 | 92.61 | 90.99 | 91.92 | 9870000 | 91.92 |
| 1970-01-14 | 91.92 | 92.4 | 90.88 | 91.65 | 10380000 | 91.65 |
| 1970-01-15 | 91.65 | 92.35 | 90.73 | 91.68 | 11120000 | 91.68 |
+------------+--------------+------------+-----------+-------------+--------------+----------------+

> hdfs dfs -ls /user/app_stuXX/loader_test/socker2

18/04/15 22:23:45 INFO hdfs.PeerCache: SocketCache disabled.


Found 2 items
-rw-rwxrw-+ 3 loader hadoop 0 2020-06-26 18:12
/user/app_stu01/loader_test/socker2/_SUCCESS
-rw-rwxrw-+ 3 loader hadoop 548456 2020-06-26 18:12
/user/app_stu01/loader_test/socker2/part-m-00000

----End

7.3.1.3 Using Hive for Analysis and Query


Step 1 Obtain data of stocks with the biggest gain.
Obtain the data and save it to a new table.

> beeline
> use stuXX_db;
> select socker2.timest, socker2.open, socker2.endprice from socker2 where
socker2.endprice > socker2.open sort by socker2.endprice desc;

+-----------------+---------------+-------------------+
| socker2.timest | socker2.open | socker2.endprice |
+-----------------+---------------+-------------------+
| 1974-05-21 | 87.86 | 87.91 |
| 1978-03-09 | 87.84 | 87.89 |
| 1978-03-08 | 87.36 | 87.84 |
| 1975-12-04 | 87.6 | 87.84 |
| 1975-12-12 | 87.8 | 87.83 |
| 1970-02-19 | 87.44 | 87.76 |
| 1974-06-24 | 87.46 | 87.69 |
| 1978-02-23 | 87.56 | 87.64 |
| 1970-03-18 | 87.29 | 87.54 |
| 1970-12-01 | 87.2 | 87.47 |
| 1978-03-03 | 87.32 | 87.45 |
| 1970-02-18 | 86.37 | 87.44 |
| 1974-05-30 | 86.89 | 87.43 |
| 1978-03-07 | 86.9 | 87.36 |
| 1978-03-02 | 87.19 | 87.32 |
| 1975-12-09 | 87.07 | 87.3 |
……
+-----------------+---------------+-------------------+
5,228 rows selected (30.544 seconds)

Step 2 Obtain the latest data of stocks.

> select socker2.timest, socker2.open, socker2.endprice from socker2 where socker2.endprice > socker2.open sort by socker2.timest desc;

+--------------+--------------+------------------+
| socker2.time | socker2.open | socker2.endprice |
+--------------+--------------+------------------+
| 1970-04-09 | 88.49 | 88.53 |
| 1970-04-01 | 89.63 | 90.07 |
| 1970-03-26 | 89.77 | 89.92 |
| 1970-03-25 | 88.11 | 89.77 |
| 1970-03-24 | 86.99 | 87.98 |
| 1970-03-18 | 87.29 | 87.54 |
| 1970-03-17 | 86.91 | 87.29 |
| 1970-03-10 | 88.51 | 88.75 |
| 1970-03-03 | 89.71 | 90.23 |
| 1970-03-02 | 89.5 | 89.71 |
| 1970-02-27 | 88.9 | 89.5 |
| 1970-02-25 | 87.99 | 89.35 |
| 1970-02-20 | 87.76 | 88.03 |
| 1970-02-19 | 87.44 | 87.76 |
| 1970-02-18 | 86.37 | 87.44 |
……
+--------------+--------------+------------------+
5,228 rows selected (26.738 seconds)

Step 3 Obtain the number of stocks that increase.

> select count(*) from socker2 where socker2.endprice> socker2.open;


+-------+
| _c0 |
+-------+
| 5228 |
+-------+

Step 4 Create a Hive table to store the data of stocks that increase, and load the data into it. (This data is loaded from HDFS to HBase in the next section.)
Creating the table:

> use stuXX_db;


> create table upsocker like socker2;

Loading the data:

> insert into upsocker select * from socker2 where socker2.endprice >
socker2.open sort by socker2.endprice desc;
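As a quick sanity check (not part of the original steps), the number of rows in upsocker should match the count obtained in Step 3:

> select count(*) from upsocker;

The query should return 5228.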

----End

7.3.1.4 Loading HDFS Data to HBase


Step 1 Create a table named stuXX_cg_hdfstohbase2 in HBase.

hbase(main):002:0> create 'stuXX_cg_hdfstohbase2','info';


0 row(s) in 0.3900 seconds
=> Hbase::Table - stuXX_cg_hdfstohbase2

Step 2 Perform steps 1 to 3 in section 5.3.1 to enter the page for configuring basic information about Loader.

Set related parameters.


Name: stuXX_cg_hdfstohbase2
Connection: stuXX_hdfs_conn
Queue: DEFAULT
Click Next.

Step 3 Configure From.


Configure the input path of the HDFS file and the encoding type.
Input path: /user/app_stuXX/loader_test/socker2/part-m-00000

Step 4 Configure Transform.


Select CSV File Input and HBase Output and drag them to the blank area on the right, respectively.
Then connect them.
Step 5 Configure CSV file input parameters.


Set the input parameters based on the format of the data stored in the HDFS.
Set Delimiter to a comma (,), and add information in Input fields, as shown in the following figure.

Step 6 Configure HBase output parameters.


Configure the column name and family name.
Table name: stuXX_cg_hdfstohbase2
Step 7 Configure To.


Set Storage type to HBASE_PUTLIST, HBase instance to HBase, and Number to 1.

Step 8 Check the execution result.

Step 9 View the content in HBase table stuXX_cg_hdfstohbase2.

hbase(main):005:0> scan 'stuXX_cg_hdfstohbase2'


...
2009-09-15 column=info:high, timestamp=1523803747562, value=1056.04
2009-09-15 column=info:low, timestamp=1523803747562, value=1043.42


2009-09-15 column=info:open, timestamp=1523803747562, value=1049.03
2009-09-15 column=info:volume, timestamp=1523803747562,
value=6185620000
10022 row(s) in 8.5350 seconds

----End

7.3.1.5 Querying HBase Data in Real Time


Step 1 On the HBase Shell client, query information in the 2009-09-15 row of table stuXX_cg_hdfstohbase2.

> get 'stuXX_cg_hdfstohbase2','2009-09-15'

COLUMN CELL
info:close timestamp=1523803747562, value=1052.63
info:endprice timestamp=1523803747562, value=1052.63
info:high timestamp=1523803747562, value=1056.04
info:low timestamp=1523803747562, value=1043.42
info:open timestamp=1523803747562, value=1049.03
info:volume timestamp=1523803747562, value=6185620000
6 row(s) in 0.0420 seconds

Step 2 Query information in the period from August 15, 2009 to September 15, 2009.

> scan 'stuXX_cg_hdfstohbase2',{COLUMN=>'info:endprice', STARTROW=>'2009-08-15', STOPROW=>'2009-09-15'}

ROW COLUMN+CELL
2009-08-17 column=info:endprice, timestamp=1523803747562, value=979.73
2009-08-18 column=info:endprice, timestamp=1523803747562, value=989.67
2009-08-19 column=info:endprice, timestamp=1523803747562, value=996.46
2009-08-20 column=info:endprice, timestamp=1523803747562, value=1007.37
……
2009-09-09 column=info:endprice, timestamp=1523803747562, value=1033.37
2009-09-10 column=info:endprice, timestamp=1523803747562, value=1044.14
2009-09-11 column=info:endprice, timestamp=1523803747562, value=1042.73
2009-09-14 column=info:endprice, timestamp=1523803747562, value=1049.34
20 row(s) in 0.0380 seconds

Step 3 Query all the columns whose values are greater than a specific value. (The system compares
the values as strings.)

> scan 'stuXX_cg_hdfstohbase2',{FILTER => "ValueFilter(>,'binary:979')"}


...
2009-09-02 column=info:low, timestamp=1523803747562, value=991.97
2009-09-02 column=info:open, timestamp=1523803747562, value=996.07
2009-09-03 column=info:low, timestamp=1523803747562, value=992.25
2009-09-03 column=info:open, timestamp=1523803747562, value=996.12
661 row(s) in 0.2230 seconds

Step 4 Query all the information that ends with endprice and the string value is greater than 979.
hbase(main):011:0> scan
'stuXX_cg_hdfstohbase2',{FILTER=>"ValueFilter(>,'binary:979') AND
ColumnPrefixFilter('endprice')"}

2009-08-18 column=info:endprice, timestamp=1523803747562,


value=989.67
2009-08-19 column=info:endprice, timestamp=1523803747562,
value=996.46
2009-09-01 column=info:endprice, timestamp=1523803747562,
value=998.04
2009-09-02 column=info:endprice, timestamp=1523803747562,
value=994.75
327 row(s) in 0.1180 seconds

----End

7.4 Summary
This experiment uses multiple components to build a Big Data analysis and query platform. Through the experiment, trainees are expected to gain a better understanding of the theory behind Big Data components and their comprehensive application.
8 Appendix

8.1 Common Linux Commands


cd /path_dir: Enters the /path_dir directory you specified.
pwd: Displays the working path.
ls: Lists files in the directory.
ls -l: Displays detailed information about files and directories.
ls -a: Displays hidden files.
ls *[0-9]*: Displays file and directory names which contain digits.
tree: Displays the tree structure of files and directories from the root directory.
lstree: Displays the tree structure of files and directories from the root directory (alternative command).
mkdir dir1: Creates a directory named dir1.
mkdir dir1 dir2: Creates two directories at the same time.
mkdir -p /tmp/dir1/dir2: Creates a directory tree.
rm -f file1: Deletes a file named file1.
rmdir dir1: Deletes a directory named dir1.
rm -rf dir1: Deletes a directory named dir1 and its content.
rm -rf dir1 dir2: Deletes two directories and their contents.
mv dir1 new_dir: Renames/Moves a directory.
cp file1 file2: Copies file1 as file2.
cp dir/* .: Copies all the files in a directory to the current working directory.
cp -a /tmp/dir1 .: Copies a directory to the current working directory.
ln -s file1 lnk1: Creates a soft link pointing to the file or directory.

8.2 Other HDFS Commands


HDFS supports the fsck command to check file system consistency. fsck reports various file problems, such as missing blocks or under-replicated blocks.
Syntax of fsck:

hdfs fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
<path>: Indicates the start directory of the check.
-move: Moves damaged files to /lost+found.
-delete: Deletes damaged files.
-openforwrite: Prints files that are being written.
-files: Displays all the checked files.
-blocks: Prints a block report.
-locations: Prints the location of each block.
-racks: Prints the network topology of the DataNodes.
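For example, to check the lab output directory used earlier and print file, block, and location details (the path is just an example; any HDFS path you have permission to read works):

> hdfs fsck /user/app_stuXX/flume -files -blocks -locations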

8.3 Methods of Creating a New Flume Job

Flume jobs can be created using either of the following methods: update the properties.properties configuration file, or reinstall the client.
The first method applies when a client already exists. The second method applies when no client exists or when the first method fails to collect data.
First method:
Regenerate the properties.properties configuration file to replace the previously created configuration file, and restart the Flume service (see the sketch below).
The second method of creating a Flume job is used in section 6.3.2.2.
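A minimal sketch of the first method, assuming the client was installed to /home/userXX/flume1 as in 6.3.1.2 and that the installation contains a fusioninsight-flume-1.6.0 directory with a flume-manage.sh management script (verify these names in your installation; they may differ by client version):

> cp /home/userXX/flumetest/properties.properties /home/userXX/flume1/fusioninsight-flume-1.6.0/conf/
> /home/userXX/flume1/fusioninsight-flume-1.6.0/bin/flume-manage.sh restart
> ps -ef | grep flume | grep userXX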