HBase: Big Data Huawei Course
HBase
NOTICE
This document was generated from Huawei study material. Consider the information in this document as supporting material.
1. Introduction to HBase
• HBase is a column-based distributed database built on top of HDFS, which means that HBase uses HDFS as its file storage system, with high reliability, performance, and scalability.
• HBase is similar to Google's Bigtable and is designed to provide quick random access to huge amounts of structured data. In the same way, we can use HBase to store and process big-table data with billions of rows and millions of columns, and it also provides real-time read and write access to data in HDFS.
• HBase uses ZooKeeper as its collaboration service.
• HBase is the Hadoop database: a distributed, scalable big data store.
• Oracle, MySQL, and many other common databases are relational databases (RDBs for short). HBase is NoSQL ("not only SQL"), a term that usually refers to non-relational databases; the most common examples are MongoDB and Redis.
• HBase is distributed, which is easy to understand because it is built on top of HDFS, and it is column-oriented.
2. Column-oriented explanation:
• Normally, in a table in Oracle or MySQL, data is stored by row, and data can be added, modified, or read by row. But when we need all the data in the name column, we may have to read entire rows to get it, so we may read unnecessary data. That is why there is another way to store data: by column.
• With column storage, data can be read and computed by column, but reading a whole row then requires multiple I/O operations. In HBase, data is stored by column. This is what we call column-based storage, and it is different from relational databases.
• A relational database has a fixed data structure. For example, in a table with employee data whose columns are ID, name, phone, and address, the attributes of each column are all fixed, so we need to predefine the data structure before we put actual values into the table. HBase, however, does not need this: we can extend columns dynamically. For example, in an employee table in HBase, the first record may have only ID and name; the second record may have ID, name, and phone; and the third record may have ID, name, phone, and address, as the sketch below shows.
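A minimal sketch of this dynamic column extension with the HBase 2.x Java client. The table name "employee", the ColumnFamily "info", and all rowkeys and values are hypothetical names chosen for illustration:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicColumnsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            // First record: only the rowkey (ID) and a name column.
            Put p1 = new Put(Bytes.toBytes("emp001"));
            p1.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            // Second record: adds a phone column; no schema change is needed.
            Put p2 = new Put(Bytes.toBytes("emp002"));
            p2.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Bob"));
            p2.addColumn(Bytes.toBytes("info"), Bytes.toBytes("phone"), Bytes.toBytes("555-0102"));
            // Third record: adds an address column as well.
            Put p3 = new Put(Bytes.toBytes("emp003"));
            p3.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Carol"));
            p3.addColumn(Bytes.toBytes("info"), Bytes.toBytes("phone"), Bytes.toBytes("555-0103"));
            p3.addColumn(Bytes.toBytes("info"), Bytes.toBytes("address"), Bytes.toBytes("Shenzhen"));
            table.put(Arrays.asList(p1, p2, p3));
        }
    }
}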
3. Application Scenarios
• HBase can be used whenever we need fast random access to massive data, need high throughput, need to process structured and unstructured data (such as text files, images, videos, and web pages) at the same time, or do not require the ACID (Atomicity, Consistency, Isolation, Durability) features of a traditional relational database. ACID is a set of properties that ensure normal database transactions.
• Atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, the entire transaction fails.
• Consistency means that database invariants are preserved. For example, if consistency requires that A + B = 10, then a transaction that changes the value of A must also change B.
• Isolation ensures that two or more transactions will not be executed in an interleaved manner, because this may cause data inconsistency.
• Durability means that after a transaction is executed successfully, the modifications made to the database are retained permanently; no rollback is performed unless necessary.
When we do not require these four features, we can use HBase.
• In FusionInsight Hadoop, HBase is a basic component for storing massive data. It is a column-based distributed storage system built on top of HDFS, and in some cases Hive and Spark depend on HBase for upper-layer analysis.
• Take a look at the storage mechanism: how is data actually stored in HBase? The data model in HBase is KeyValue, which means data is stored in the form of KeyValue pairs.
• The key is used to quickly query a data record, and the value is used to store user data, but each key can correspond to many values. For example, in Huawei everyone has an ID number; we can take this as a key which can correspond to a name, age, phone, address, and other information. But when we use the ID number for a quick query, how do we make sure we get the name value and not the age? That is why a KeyValue must store some description of itself, such as a timestamp and the type of information, which requires some structured space.
• RDBs predefine the data structure in the database as a series of tables containing fields with well-defined types. In contrast, KeyValue systems treat the data as a single collection which may have different fields for every record. This offers considerable flexibility and more closely follows modeling concepts such as object-oriented programming, because optional values are not represented by placeholders as they are in most RDBs.
• KeyValue storage often uses far less memory to store the same database, which can lead to large performance gains in certain workloads.
• Here, keys are implemented as byte arrays and are sorted in byte-lexicographical order, with the lowest-order key appearing first in a table. This simply means that keys are compared as raw byte sequences.
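A small illustration of this byte ordering using the HBase Bytes utility (the rowkeys are hypothetical). Because keys compare as raw bytes, numeric rowkeys are usually zero-padded so that byte order matches numeric order:

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyOrderExample {
    public static void main(String[] args) {
        // Without padding, "row10" sorts BEFORE "row2" in byte order,
        // because '1' < '2' at the fourth byte.
        System.out.println(Bytes.compareTo(Bytes.toBytes("row10"), Bytes.toBytes("row2")));  // negative
        // Zero-padding restores the expected numeric order: "row02" < "row10".
        System.out.println(Bytes.compareTo(Bytes.toBytes("row02"), Bytes.toBytes("row10"))); // negative
    }
}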
• A KeyValue pair has a specific format that contains key information such as the timestamp and type, seven parts in total (row length, row, column family length, column family, column qualifier, timestamp, and key type). The same key can be associated with multiple values, so we can use these seven parts to distinguish the different values corresponding to the same key; for example, each KeyValue has a column qualifier. There can also be multiple values associated with the same RowKey and column qualifier.
• We know that data in HBase can be updated, so there may be multiple versions of the same data record. In this case, the versions are distinguished using the timestamp. Remember that HBase returns the latest version when a query does not specify a timestamp.
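A minimal sketch of version handling with the HBase 2.x Java client. The table, family, and rowkey names are hypothetical, and reading all versions assumes the ColumnFamily was created to retain more than one version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            // Default Get: returns only the latest version of each cell.
            Result latest = table.get(new Get(Bytes.toBytes("emp001")));
            System.out.println("latest name = " + Bytes.toString(
                    latest.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            // readAllVersions(): returns every retained version; each Cell
            // carries the timestamp that distinguishes it.
            Result all = table.get(new Get(Bytes.toBytes("emp001")).readAllVersions());
            for (Cell cell : all.rawCells()) {
                System.out.printf("qualifier=%s ts=%d value=%s%n",
                        Bytes.toString(CellUtil.cloneQualifier(cell)),
                        cell.getTimestamp(),
                        Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}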
• A ColumnFamily consists of one or more columns. A column is labeled under its ColumnFamily.
• When the MemStore capacity reaches the upper limit, the RegionServer flushes the data in the MemStore to HDFS. As more data is inserted, multiple StoreFiles are generated in a Store. When the number of StoreFiles reaches the upper limit, the RegionServer merges multiple StoreFiles into one big StoreFile.
• HFile defines the storage format of StoreFiles in the file system. HFile is actually the underlying implementation of a StoreFile in HBase.
• Multiple Regions in a RegionServer share the same HLog.
6.3. HMaster(1)
• HMaster manages the RegionServers in HBase, including the creation, deletion, modification, and query of tables; balances the load of RegionServers; adjusts the distribution of Regions; splits Regions and distributes them after the split; and migrates Regions after a RegionServer fails.
6.5. RegionServer
• RegionServer provides read and write services for table data, acting as a data processing and computing unit in HBase. A RegionServer is usually deployed together with a DataNode of the HDFS cluster to store data. All read and write requests for user data are handled by interacting with the Regions on the RegionServer, and Regions can be migrated between RegionServers.
6.6. Region(1)
• In HBase, a data table is divided into subtables based on KeyValue ranges to implement distributed storage. A subtable is called a Region, which is the most basic distributed storage unit in HBase. Each Region is associated with a KeyValue range, which is described using a StartKey and an EndKey.
• There are two types of Regions: Meta Region and User Region:
• Meta Region records the routing information of every User Region.
• The address of the Meta Region is stored in the root table in ZooKeeper, and the addresses of User Regions are stored in the Meta table.
6.11. MetaData Table
• The MetaData table is a special HBase table used by the client to locate a Region.
• The MetaData table is the hbase:meta table, which records the Region information of user tables, such as each Region's location and its StartRowKey and EndRowKey. Besides, the MetaData table can be split into multiple Regions, and the metadata information of its Regions is stored in ZooKeeper.
• HBase can be compared to a library: you can think of a RegionServer as a floor in the library, and a Region as the books of a certain type. Books of the same type are stored in the same Region.
• To locate the target Region, the HBase client connects to ZooKeeper to obtain information about the RegionServer where the HBase META table is located. The HBase META table records information about each User Region, including its RowKey range and the RegionServer where the Region resides. Based on the META table, we can locate the RegionServer of the Region into which data will be written.
• After obtaining all this information, the data needs to be grouped. Data grouping includes two steps.
• First, get the Region and RegionServer information of the tables based on the META table, that is, find the location of each Region on its RegionServer. Then assign the data to specific Regions according to the RowKey. Data for each RegionServer is sent at the same time.
• Then, the HBase client connects to the RegionServer where the Region of the user table is located and issues data operation commands to that RegionServer.
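A minimal sketch of this client write path with the HBase 2.x Java client, using a BufferedMutator, which buffers mutations client-side and groups them by target Region and RegionServer before sending; the table name and data are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // BufferedMutator buffers Puts and sends them to the
             // RegionServers in batches, mirroring the grouping step above.
             BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("employee"))) {
            for (int i = 0; i < 1000; i++) {
                Put put = new Put(Bytes.toBytes(String.format("emp%04d", i)));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("user-" + i));
                mutator.mutate(put); // buffered; flushed in batches
            }
            mutator.flush(); // push any remaining buffered mutations
        }
    }
}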
• As time passes, the number of HFiles increases because service data keeps flowing into the HBase cluster. More files need to be opened for the same query, so the query latency increases.
• To solve this problem, HBase automatically picks some smaller HFiles and rewrites them into fewer, bigger ones. This process is called Compaction; it reduces the number of HFiles in a ColumnFamily of a Region and improves read performance.
• There are two kinds of compaction: major and minor. Minor compaction reduces the number of StoreFiles by rewriting some smaller files into fewer but larger ones. Major compaction merges and rewrites all the HFiles in a Region into one HFile per ColumnFamily.
7.10. Compaction (2)
• Minor compaction involves partial HFiles, while major compaction involves all the HFiles.
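Both kinds of compaction can also be requested manually through the HBase 2.x Admin API. A minimal sketch (the table name is hypothetical; note that both calls are asynchronous requests):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("employee");
            // Request a minor compaction: merges some smaller HFiles.
            admin.compact(table);
            // Request a major compaction: rewrites all HFiles of each Store
            // into a single HFile.
            admin.majorCompact(table);
        }
    }
}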
7.11. Region Split
• Notice that if the data size of a Region exceeds the predefined threshold, the Region splits into two subRegions.
• During this process, the reading and writing services are suspended for a short time.
• After locating the corresponding RegionServer and Region for the RowKeys, the client opens a Scanner to search for data.
• Because a Region contains both MemStores and HFiles that hold data, there are two types of scanners to read data from MemStores and HFiles respectively: the scanner corresponding to HFiles is the StoreFileScanner, and the scanner corresponding to MemStores is the MemStoreScanner.
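From the client's point of view this merging is transparent: a single Scan sees the combined view of MemStores and HFiles. A minimal sketch with the HBase 2.x Java client (the table name and rowkey range are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            // One client-side Scan over a rowkey range; the RegionServer
            // internally merges MemStoreScanner and StoreFileScanner results.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("emp0000"))
                    .withStopRow(Bytes.toBytes("emp0100"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}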
• Users can set Filter conditions so that only the data matching the specified conditions is returned.
• There are some typical Filter types, such as RowFilter, SingleColumnValueFilter, KeyOnlyFilter, and FilterList.
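A minimal sketch combining two of these Filter types, SingleColumnValueFilter and FilterList, with the HBase 2.x Java client (table, family, and value names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            // Match rows whose info:name column equals "Alice".
            SingleColumnValueFilter nameFilter = new SingleColumnValueFilter(
                    Bytes.toBytes("info"), Bytes.toBytes("name"),
                    CompareOperator.EQUAL, Bytes.toBytes("Alice"));
            // FilterList combines filters (here just one) with AND semantics.
            FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL, nameFilter);
            Scan scan = new Scan().setFilter(filters);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}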
7.15. BloomFilter
• A BloomFilter is used to quickly check whether a piece of user data exists in a large data set. Notice that if the BloomFilter returns the result that the data does not exist, the result is absolutely accurate; if it returns that the data exists, there may be false positives.
• The data related to BloomFilters in HBase is actually stored in HFiles.
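BloomFilters are configured per ColumnFamily when a table is created. A minimal sketch with the HBase 2.x Admin API (the table and family names are hypothetical); a ROW-level BloomFilter lets a RegionServer skip HFiles that definitely do not contain the requested rowkey:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Enable a rowkey-level BloomFilter on the "info" family.
            ColumnFamilyDescriptor family = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("info"))
                    .setBloomFilterType(BloomType.ROW)
                    .build();
            admin.createTable(TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("employee_bloom"))
                    .setColumnFamily(family)
                    .build());
        }
    }
}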
8. Improvements on FusionInsight
• HBase FileStream (HFS) is an independent HBase file storage module. It is used by FusionInsight HD upper-layer applications, encapsulating HBase and HDFS interfaces to provide these applications with functions such as file storage, reading, and deletion.
• In the Hadoop ecosystem, HDFS and HBase face tough problems with massive file storage in some scenarios: if massive small files are stored in HDFS, the NameNode comes under great pressure, and some large files cannot be directly stored in HBase because of HBase interfaces and internal mechanisms.
• HFS was developed for the mixed storage of massive small files and some large files in Hadoop. Simply speaking, massive small files and some large files need to be stored in HBase tables.
• In actual application scenarios, data of varied sizes needs to be stored, for example image data and documents. Data smaller than 10MB can be stored in HBase, but HBase delivers its best read-write performance for data smaller than 100KB. If the data stored in HBase is larger than 100KB, or even close to 10MB, then inserting the same number of data files produces a much larger total data volume, causing frequent compactions and splits, high CPU consumption, high disk I/O frequency, and low performance.
• MOB data is stored in the file system in HFile format, and the address and size information of these HFiles is saved in the HBase store as values. This greatly decreases the compaction and split frequency in HBase and improves performance.
• In this figure, MOB indicates the MOBStore that is stored on an HRegion. The MOBStore stores keys and values. When reading data, the MOBStore first reads the KeyValue data objects in the usual way, and then uses the address and data size information in the value to obtain the target data from the file system.
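MOB is enabled per ColumnFamily. A minimal sketch with the HBase 2.x Admin API (the table and family names and the 100KB threshold are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class MobTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Values larger than the MOB threshold (100KB here) are written
            // to separate MOB HFiles; only a reference stays in the store.
            ColumnFamilyDescriptor family = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("data"))
                    .setMobEnabled(true)
                    .setMobThreshold(100 * 1024L)
                    .build();
            admin.createTable(TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("mob_files"))
                    .setColumnFamily(family)
                    .build());
        }
    }
}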