
Big Data Huawei Course

HBase
NOTICE
This document was generated from Huawei study material. Treat the information in this document as supporting material.

Centro de Inovação EDGE - Big Data Course


Table of Contents
1. Introduction to HBase
2. Column-Oriented Explanation
3. Application Scenarios
4. Position of HBase in FusionInsight
5. KeyValue Storage Model
   5.1. KeyValue Storage Model (2)
   5.2. KeyValue Storage Model (3)
6. Functions and Architecture of HBase
   6.1. HBase Architecture (1)
   6.2. HBase Architecture (2)
   6.3. HMaster (1)
   6.4. HMaster (2)
   6.5. RegionServer
   6.6. Region (1)
   6.7. Region (2)
   6.8. Region (3)
   6.9. Column Family
   6.10. ZooKeeper
   6.11. MetaData Table
7. Key Processes of HBase
   7.1. Client Initiating a Data Writing Request
   7.2. Writing Process - Locating a Region
   7.3. Writing Process - Grouping Data (1)
   7.4. Writing Process - Grouping Data (2)
   7.5. Writing Process - Sending a Request to a RegionServer
   7.6. Writing Process - Process of Writing Data to a Region
   7.7. Writing Process - Flush
   7.8. Impacts of Multiple HFiles
   7.9. Compaction (1)
   7.10. Compaction (2)
   7.11. Region Split
   7.12. Client Initiating a Data Reading Request
   7.13. OpenScanner
   7.14. Filter
   7.15. BloomFilter
8. Improvements on FusionInsight
   8.1. Supporting Secondary Index
   8.2. HFS
   8.3. HBase MOB (1)
   8.4. HBase MOB (2)



HBase – Huawei Course
1. Introduction to HBase

• HBase is a column-oriented distributed database built on top of HDFS, which means HBase uses HDFS as its file storage system and inherits its high reliability, performance, and scalability.
• HBase is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. In the same way, we can use HBase to store and process "big table" data with billions of rows and millions of columns, and it provides real-time read/write access to data in HDFS.
• HBase uses ZooKeeper as its coordination service.
• HBase is the Hadoop database: a distributed, scalable big data store.
• Oracle, MySQL, and many other common databases are relational databases (RDBs for short). HBase is NoSQL ("not only SQL"), a term that usually refers to non-relational databases; the most common ones are MongoDB and Redis.
• HBase is distributed, which is easy to understand because it is built on top of HDFS, and it is column-oriented.

2. Column-Oriented Explanation

• Normally, in an Oracle or MySQL table, data is stored by row, and data is added, modified, or read by row. But when we need all the data in, say, the name column, we may have to read each entire row to get it, so we read unnecessary data. That is why another way to store data exists: by column.
• With column storage, data can be read and computed by column, although reading a full row then needs multiple I/O operations. In HBase, data is stored by column. This is what we call column-based, and it is different from relational databases.
• A relational database has a fixed data structure. For example, in a table of employee data whose columns are ID, name, phone, and address, the attributes of each column are all fixed, so we need to pre-define the data structure before we put actual values into the table. HBase does not need this; columns can be extended dynamically. In an HBase employee table, the first record might have ID and name; the second, ID, name, and phone; the third, ID, name, and address. Each record can have different columns.
• HBase supports common commodity hardware, which makes expansion cost less, while expanding an RDB costs much more.
• Relational databases are I/O-intensive.
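The dynamic-column idea above can be sketched in plain Python (this is an illustrative model, not the HBase API; the table and column names are invented):

```python
# Sketch: an HBase table modeled as {rowkey: {column: value}}.
# Each row may carry a different set of columns -- no schema is fixed
# up front, unlike a relational table.
employees = {
    "row1": {"info:id": "001", "info:name": "Alice"},
    "row2": {"info:id": "002", "info:name": "Bob", "info:phone": "555-0100"},
    "row3": {"info:id": "003", "info:name": "Carol", "info:address": "Shenzhen"},
}

# Adding a brand-new column needs no schema change -- just write it.
employees["row1"]["info:email"] = "alice@example.com"

for rowkey, columns in sorted(employees.items()):
    print(rowkey, sorted(columns))
```

Each row lists only the columns it actually has, which is exactly the sparse, dynamically extensible layout described above.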

3. Application Scenarios

• HBase can be used whenever we need fast random access to massive data, high throughput, or the ability to process structured and unstructured data (text files, images, videos, web pages) at the same time, or when we do not require the ACID (Atomicity, Consistency, Isolation, Durability) features of a traditional relational database. ACID is a set of properties that ensure reliable database transactions.
• Atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, the entire transaction fails.
• Consistency means, for example, that if a constraint determines A + B = 10 and a transaction changes the value of A, then B must also be changed.
• Isolation ensures that two or more transactions are not executed in an interleaved manner, because this may cause data inconsistency.
• Durability means that after a transaction executes successfully, the modifications made to the database are retained permanently; no rollback is performed unless necessary.
When we don't require any of these four features, we can use HBase.

4. Position of HBase in FusionInsight

• In FusionInsight Hadoop, HBase is a basic component for storing massive data. It is a column-based distributed storage system built on top of HDFS, and in some cases Hive and Spark depend on HBase for upper-layer analysis.



5. KeyValue Storage Model

• Take a look at the storage mechanism: how is data actually stored in HBase? The data model in HBase is KeyValue, which means data is stored in the form of key-value pairs.
• The key is used to quickly query a data record, and the value stores user data, but each key can correspond to many values. For example, at Huawei everyone has an ID number; we can take this as a key that corresponds to a name, age, phone, address, and other information. But when we use the ID number for a quick query, how do we make sure we get the name value and not the age? That is why each KeyValue must store some description of itself, such as a timestamp and a type, which requires some structured space.
• RDBs predefine the data structure of the database as a series of tables containing fields with well-defined types. In contrast, KeyValue systems treat the data as a single collection that may have different fields for every record. This offers considerable flexibility and more closely follows modeling concepts such as object-oriented programming, because optional values are not represented by placeholders as in most RDBs.
• KeyValue storage often uses far less memory to store the same database, which can lead to large performance gains in certain workloads.

5.1. KeyValue Storage Model (2)

• Keys are implemented as byte arrays and are sorted in byte-lexicographic order, with the lowest order appearing first in a table; this simply means the keys are sorted byte by byte from left to right. After the keys are ordered, data subregions are created based on RowKey ranges. Each subregion is a basic distributed storage unit, and the subregions are stored on different nodes.
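Byte-by-byte ordering can be demonstrated with a few lines of Python (the rowkeys are made up; note the order is not numeric):

```python
# Sketch: HBase compares rowkeys as raw bytes, left to right.
# b"Row10" sorts before b"Row2" because the byte '1' < '2' -- this is
# why rowkeys that encode numbers are usually zero-padded.
rowkeys = [b"Row2", b"Row10", b"Row1", b"Row001"]
print(sorted(rowkeys))  # → [b'Row001', b'Row1', b'Row10', b'Row2']
```

Because subregions are carved out of this sorted key space, a poorly designed rowkey scheme directly affects how evenly data spreads across Regions.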

5.2. KeyValue Storage Model (3)

• A KeyValue pair has a specific format that contains key information such as the timestamp and type, seven parts in total. The same key can be associated with multiple values, so these seven parts are used to distinguish the different values corresponding to the same key; for example, each KeyValue has a column qualifier. There can also be multiple values associated with the same RowKey and column qualifier.
• Data in HBase can be updated, so there may be multiple versions of the same data record. In this case, versions are distinguished using the timestamp. Remember that HBase returns the latest version when a query does not specify a timestamp.
• A ColumnFamily consists of one or more columns; a column is a label under the ColumnFamily.
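The versioning behavior can be sketched as follows (the field layout is illustrative, not HBase's real seven-part on-disk format; rowkeys and values are invented):

```python
# Sketch: each cell is addressed by (rowkey, family, qualifier) and
# carries a timestamp. A read with no explicit timestamp returns the
# newest version, as described above.
cells = []  # list of (rowkey, family, qualifier, timestamp, value)
cells.append(("emp001", "info", "phone", 100, "555-0100"))
cells.append(("emp001", "info", "phone", 200, "555-0199"))  # an update
cells.append(("emp001", "info", "name", 100, "Alice"))

def get_latest(cells, rowkey, family, qualifier):
    versions = [c for c in cells if c[:3] == (rowkey, family, qualifier)]
    return max(versions, key=lambda c: c[3])[4] if versions else None

print(get_latest(cells, "emp001", "info", "phone"))  # → 555-0199
```

Updates never overwrite in place; they simply add a newer version, which is why timestamps are part of the key information.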

6. Functions and Architecture of HBase

6.1. HBase Architecture (1)



HBase consists of the following parts:
• HMaster involves an active HMaster and a standby HMaster in HA (High Availability) mode. The active HMaster manages the RegionServers in HBase, including the creation, deletion, modification, and query of tables; balances the load of RegionServers; adjusts the distribution of Regions; splits Regions and distributes them after the split; and migrates Regions after a RegionServer fails. The standby HMaster takes over services when the active one fails, and the original active HMaster serves as the standby HMaster after the fault is rectified.
• RegionServer provides read and write services for table data, acting as the data processing and computing unit in HBase. A RegionServer is usually deployed together with a DataNode of the HDFS cluster to perform data storage.
• The client communicates with HMaster and RegionServers through the HBase RPC mechanism: with HMaster for management operations and with RegionServers for data operations.
• ZooKeeper provides distributed coordination services for the processes in the HBase cluster. Each RegionServer is registered with ZooKeeper so that the active HMaster can obtain the health status of each RegionServer.
• HDFS provides highly reliable file storage services for HBase; almost all HBase data is stored in HDFS.

6.2. HBase Architecture (2)



• Inside a RegionServer there are the following five components: Store, MemStore, StoreFile, HFile, and HLog.
• When the MemStore capacity reaches the upper limit, the RegionServer flushes the data in the MemStore to HDFS. As more data is inserted, multiple StoreFiles are generated in a Store. When the number of StoreFiles reaches the upper limit, the RegionServer merges multiple StoreFiles into one big StoreFile.
• HFile defines the storage format of StoreFiles in the file system; HFile is the underlying implementation of StoreFile in HBase.
• Multiple Regions in a RegionServer share the same HLog.

6.3. HMaster(1)

• HMaster manages the RegionServers in HBase, including the creation, deletion, modification, and query of tables; balances the load of RegionServers; adjusts the distribution of Regions; splits Regions and distributes them after the split; and migrates Regions after a RegionServer fails.



6.4. HMaster(2)

• HMaster works in active/standby mode. A cluster can have two HMaster processes. When the cluster starts, the processes compete to become the active HMaster. There is only one active HMaster; the standby HMaster process remains on standby and does not take part in cluster transactions while the cluster is running.

6.5. RegionServer

• RegionServer provides read and write services for table data as the data processing and computing unit in HBase. A RegionServer is usually deployed with a DataNode of the HDFS cluster to store data. All read and write requests for user data are handled through interaction with the Regions on the RegionServer, and Regions can be migrated between RegionServers.

6.6. Region(1)

• In HBase, a data table is divided into subtables based on RowKey ranges to implement distributed storage. A subtable is called a Region, which is the most basic distributed storage unit in HBase. Each Region is associated with a RowKey range, described by a StartKey and an EndKey.

6.7. Region (2)



• For example, Region-1 has the StartKey Row001 and the EndKey Row010. Each Region only needs to record its StartKey, because its EndKey serves as the StartKey of the next Region.
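Because Regions cover contiguous, sorted RowKey ranges, locating a Region is a binary search over the StartKeys. A minimal sketch (the Region names and boundaries are invented for illustration):

```python
import bisect

# Sketch: each Region covers [StartKey, next StartKey). Locating the
# Region for a rowkey means finding the last StartKey <= rowkey.
region_starts = [b"", b"Row010", b"Row020"]           # sorted StartKeys
region_names  = ["Region-1", "Region-2", "Region-3"]  # hypothetical names

def locate_region(rowkey):
    idx = bisect.bisect_right(region_starts, rowkey) - 1
    return region_names[idx]

print(locate_region(b"Row005"))  # → Region-1
```

The empty StartKey `b""` marks the first Region, which is why only StartKeys need to be recorded.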

6.8. Region (3)

• There are two types of Regions: Meta Region and User Region.
• The Meta Region records the routing information of every User Region.
• The address of the Meta Region is stored in the root table in ZooKeeper, and the address of each User Region is stored in the Meta table.

6.9. Column Family

• A ColumnFamily is a physical storage unit of a Region. A table consists of one or multiple ColumnFamilies horizontally.
• A ColumnFamily can consist of multiple arbitrary columns.
• A column is a label under the ColumnFamily and can be added as required when data is written.
• A ColumnFamily supports dynamic expansion, so the number and type of columns do not need to be predefined.
• Columns of an HBase table are sparsely distributed; the number and type of columns in different rows can differ.



6.10. ZooKeeper

• ZooKeeper provides distributed coordination services for the processes in the HBase cluster.
• Each RegionServer is registered with ZooKeeper so that the active HMaster can obtain the health status of each RegionServer.

6.11. MetaData Table

• The MetaData table is a special HBase table used by the client to locate a Region.
• The MetaData table records the Region information of user tables, such as Region locations and the StartRowKey and EndRowKey of each Region. Besides, the MetaData table can be split into multiple Regions, and the metadata information of its Regions is stored in ZooKeeper.

7. Key Processes of HBase

• HBase can be compared to a library: think of a RegionServer as a floor in the library, and a Region as the books of a certain type. Books of the same type are stored in the same Region.

7.1. Client Initiating a Data Writing Request



• When users write data into HBase, the client first initiates the request. This process is like a book supplier sending books to a library; but which floor should the books be sent to? In other words, which RegionServer and Region should the data be sent to?

7.2. Writing Process - Locating a Region

• To determine that, the HBase client connects to ZooKeeper to obtain information about the RegionServer where the HBase META table is located. The HBase META table records information about each User Region, including its RowKey range and the RegionServer where the Region resides. Based on the META table, we can locate the RegionServer of the Region into which the data will be written.

7.3. Writing Process - Grouping Data (1)

• After obtaining all this information, the data needs to be grouped. Data grouping includes two steps.



7.4. Writing Process - Grouping Data (2)

• First, get the Region and RegionServer information of the tables based on the META table, that is, find the location of each Region on its RegionServer. Then transfer the data to the specific Region according to the RowKey. Data destined for each RegionServer is sent at the same time.
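The two grouping steps can be sketched in Python (the routing rule and server names below are invented for illustration, not taken from a real cluster):

```python
# Sketch: puts are first routed to a Region by rowkey range, then
# Regions are bucketed by the RegionServer hosting them, so that one
# batched request per server can be sent.
region_of = lambda rowkey: "Region-1" if rowkey < "Row010" else "Region-2"
server_of = {"Region-1": "rs-a", "Region-2": "rs-b"}  # hypothetical hosts

puts = ["Row003", "Row015", "Row007", "Row020"]
batches = {}  # server -> region -> rowkeys
for rowkey in puts:
    region = region_of(rowkey)
    batches.setdefault(server_of[region], {}).setdefault(region, []).append(rowkey)

print(batches)
```

Grouping first, then sending one request per RegionServer, is what allows the writes to different servers to proceed in parallel.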



7.5. Writing Process - Sending a Request to a RegionServer

• Then the HBase client connects to the RegionServer where the Region of the user table is located and issues a data operation command to that RegionServer.

7.6. Writing Process - Process of Writing Data to a Region



• The RegionServer executes the command to write the data into the Region. To improve data processing efficiency, the HBase client caches the Region information of the HBase META table and the user table in memory. When an application initiates a data operation, the HBase client queries the Region information from memory first. If no match is found in memory, the HBase client performs the preceding operations to obtain the Region information of the HBase META table and the user table.

7.7. Writing Process - Flush

• Flush is the process of persisting data from memory to disk.
• When one MemStore is flushed, every MemStore of the whole Region is flushed; that is why the number of ColumnFamilies should not be too large.

7.8. Impacts of Multiple HFiles

• As time passes, the number of HFiles increases as service data flows into the HBase cluster. More files then need to be opened for the same query, so query latency increases.



7.9. Compaction (1)

• To solve this problem, HBase automatically picks some smaller HFiles and rewrites them into fewer, bigger ones. This process is called compaction; it reduces the number of HFiles in a ColumnFamily of a Region and improves read performance.
• There are two kinds of compaction: major and minor. Minor compaction reduces the number of StoreFiles by rewriting some smaller files into fewer but larger ones. Major compaction merges and rewrites all the HFiles in a Region into one HFile per ColumnFamily and, in the process, drops deleted or expired cells. This improves read performance. However, since major compaction rewrites all the files, a lot of disk I/O and network traffic may occur during the process; this is called write amplification.
• Major compaction can be scheduled to run automatically. Due to write amplification, major compactions are usually scheduled for weekends or evenings.
• A major compaction also makes any data files that became remote, due to a server failure or load balancing, local to the Region again.
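The core of a minor compaction, merging several small sorted files into one larger sorted file, can be sketched in a few lines (the files here are plain lists of invented (rowkey, value) pairs; dropping deleted or expired cells is omitted):

```python
import heapq

# Sketch: HFiles are internally sorted, so compaction is a k-way merge
# that produces one larger, still-sorted file in a single pass.
hfile1 = [("Row001", "a"), ("Row005", "b")]
hfile2 = [("Row002", "c"), ("Row009", "d")]
hfile3 = [("Row004", "e")]

merged = list(heapq.merge(hfile1, hfile2, hfile3))
print(merged)
```

Because the inputs are already sorted, the merge never needs to re-sort anything, which is what keeps compaction I/O-bound rather than CPU-bound.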

7.10. Compaction (2)

• Minor compaction involves some of the HFiles, while major compaction involves all of them.

7.11. Region Split

• Note that if the data size of a Region exceeds the predefined threshold, the Region splits into two sub-Regions.
• During this process, reading and writing services are suspended for a short time.
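Choosing a split point at roughly the middle rowkey can be sketched as follows (the rowkeys and threshold logic are illustrative; real HBase picks the midkey of the largest StoreFile):

```python
# Sketch: a Region that has grown too large splits into two daughter
# Regions at a middle rowkey; each daughter covers half the range.
region = {"start": b"Row001", "end": b"Row100",
          "rows": [b"Row0%02d" % i for i in range(1, 100)]}

split_point = region["rows"][len(region["rows"]) // 2]
daughter_a = {"start": region["start"], "end": split_point}
daughter_b = {"start": split_point, "end": region["end"]}
print(split_point)  # → b'Row050'
```

The two daughters' ranges meet exactly at the split point, preserving the invariant that one Region's EndKey is the next Region's StartKey.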



7.12. Client Initiating a Data Reading Request

• The reading process of HBase is similar to the writing process.
• A client first initiates a reading request, which is like querying books in the library. If the book code is specified, which is like a RowKey in HBase, this is a Get request: an exact search. If a code range is specified, this is a Scan request: a query by range.
• Then the Region is located based on the META table.



7.13. OpenScanner

• After locating the RegionServer and Region corresponding to the RowKeys, a scanner is opened to search for the data.
• Because a Region contains both MemStores and HFiles that hold data, there are two types of scanners to read them respectively: the scanner corresponding to HFiles is the StoreFileScanner, and the scanner corresponding to the MemStore is the MemStoreScanner.
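The way the two scanner types combine into one sorted result stream can be sketched as a merge (the data here is invented; real scanners also reconcile versions and deletes):

```python
import heapq

# Sketch: a Region-level scan merges the MemStoreScanner (in-memory,
# sorted) with one StoreFileScanner per HFile, yielding rowkeys in
# global sorted order.
memstore = [("Row003", "new")]
hfile_a  = [("Row001", "x"), ("Row004", "y")]
hfile_b  = [("Row002", "z")]

scan = list(heapq.merge(memstore, hfile_a, hfile_b))
print([rowkey for rowkey, _ in scan])
```

From the client's point of view there is a single ordered scan, even though the data lives in several places.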



7.14. Filter

• Users can set Filter conditions to return only the data that matches the specified conditions.
• Typical Filter types include RowFilter, SingleColumnValueFilter, KeyOnlyFilter, and FilterList.

7.15. BloomFilter

• A BloomFilter is used to quickly check whether a piece of user data exists in a large data set. Note that if the BloomFilter answers that the data does not exist, the result is absolutely accurate; if it answers that the data exists, there may be false positives.
• The BloomFilter-related data in HBase is stored in HFiles.
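A minimal Bloom filter sketch, assuming invented sizes and SHA-256-derived hash positions (real HBase uses its own hashing and sizing), shows why "absent" is always correct while "present" may not be:

```python
import hashlib

# Sketch: K hash functions set K bits per inserted key. A lookup only
# answers "present" if all K bits are set, so a missing bit proves
# absence, but K coincidentally-set bits can give a false positive.
SIZE, K = 1024, 3
bits = [False] * SIZE

def _positions(key):
    for i in range(K):
        h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
        yield int(h, 16) % SIZE

def add(key):
    for p in _positions(key):
        bits[p] = True

def might_contain(key):
    return all(bits[p] for p in _positions(key))

add("Row001")
print(might_contain("Row001"))  # → True: inserted keys are always found
```

In HBase this check lets a read skip entire HFiles that certainly do not contain the requested RowKey.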



8. Improvements on FusionInsight

8.1. Supporting Secondary Index

• HBase is a distributed storage database of the KeyValue type.
• The data of a table is sorted in lexicographic order based on RowKeys.
• If you query data based on a specific RowKey, or scan data within a specified RowKey range, HBase can quickly locate the data that needs to be read, which is efficient.
• However, in many actual scenarios you need to query the data whose column value equals a certain one. HBase provides the Filter feature to query data with a specific column value: all data is scanned in RowKey order and then matched against the specific column value until the required data is found. The Filter feature therefore scans a lot of unnecessary data to obtain the required data.
• Based on the preceding description, the Filter feature cannot meet the requirements of frequent queries with high performance standards.
• To solve this problem, the HBase secondary index was created. It enables HBase to query data based on a specific column value.
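The secondary-index idea can be sketched as a reverse mapping from column value to rowkeys (the table, column, and values are invented; a real HBase secondary index is itself maintained as table data):

```python
# Sketch: an index maps a column value back to the rowkeys holding it,
# so a query by value becomes a direct lookup instead of a full scan.
table = {
    "row1": {"info:city": "Shenzhen"},
    "row2": {"info:city": "Shanghai"},
    "row3": {"info:city": "Shenzhen"},
}

# Build the index; in practice it is updated as data is written.
index = {}
for rowkey, cols in table.items():
    index.setdefault(cols["info:city"], []).append(rowkey)

print(sorted(index["Shenzhen"]))  # → ['row1', 'row3']
```

The trade-off is extra storage and write-path work to keep the index consistent, in exchange for avoiding the Filter-style full scan.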



8.2. HFS

• HBase FileStream (HFS) is an independent HBase file storage module. It is used by FusionInsight HD upper-layer applications: by encapsulating the HBase and HDFS interfaces, it provides these upper-layer applications with functions such as file storage, read, and delete.
• In the Hadoop ecosystem, HDFS and HBase face tough problems with massive file storage in some scenarios. For example, if massive numbers of small files are stored in HDFS, the NameNode comes under great pressure, and some large files cannot be stored directly in HBase because of HBase interfaces and internal mechanisms.
• HFS was developed for the mixed storage of massive small files and some large files in Hadoop. Simply speaking, massive small files and some large files need to be stored in HBase tables.



8.3. HBase MOB (1)

• In actual application scenarios, data of varied sizes needs to be stored, for example image data and documents. Data smaller than 10 MB can be stored in HBase, but HBase delivers its best read/write performance for data smaller than 100 KB. If the size of data stored in HBase is greater than 100 KB, or even reaches 10 MB, then inserting the same number of data files produces a much larger total data volume, causing frequent compactions and splits, high CPU consumption, high disk I/O frequency, and low performance.
• MOB data is stored in the file system in HFile format, and the address and size information of these HFiles is saved in the HBase store as values. This greatly decreases the compaction and split frequency in HBase and improves performance.

8.4. HBase MOB (2)

• In this figure, MOB indicates the MOBStore that is stored on an HRegion. The MOBStore stores keys and values: when reading data, the MOBStore reads the KeyValue data objects in the standard way and uses the address and size information in the value to obtain the target data from the file system.



