HBase Concepts
What is HBase: HBase is an essential part of the Hadoop ecosystem. It is an open-source,
multidimensional, distributed, scalable NoSQL database written in Java. HBase runs on top
of HDFS (Hadoop Distributed File System) and provides BigTable-like capabilities to Hadoop. It is
designed to provide a fault-tolerant way of storing large collections of sparse data sets.
Features of HBase:
Atomic read and write: HBase provides atomic reads and writes at the row level. A write to a
single row either succeeds completely or fails completely, so concurrent readers never see a
partially updated row.
Consistent reads and writes: HBase provides consistent reads and writes because of the row-level
atomicity described above.
Linear and modular scalability: Because data sets are distributed over HDFS, HBase scales linearly
as nodes are added, and it is modularly scalable because the data is divided across those nodes.
Automatic and configurable sharding of tables: HBase tables are split into regions, and these
regions are distributed across the region servers in the cluster. Regions split and are
redistributed as the data grows.
Easy to use Java API for client access: HBase provides an easy-to-use Java API for programmatic
access.
Thrift gateway and RESTful Web services: It also supports Thrift and REST APIs for non-Java
front-ends.
Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high-volume
query optimization.
Automatic failure support: HBase with HDFS provides a WAL (Write Ahead Log) across the cluster,
which provides automatic failure support.
Sorted rowkeys: Because searches are done over ranges of rows, HBase stores rowkeys in
lexicographical order. Using these sorted rowkeys and timestamps, we can build optimized
requests.
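To illustrate why sorted rowkeys help range searches, here is a minimal Python sketch. It is not the HBase API; the row keys and the range_scan helper are hypothetical, and it only assumes that keys are compared as bytes in lexicographical order.

```python
import bisect

# Hypothetical row keys, kept sorted in lexicographical (byte) order.
row_keys = sorted([b"user#0010", b"user#0002", b"user#0150",
                   b"admin#0001", b"user#0100"])


def range_scan(sorted_keys, start, stop):
    """Return all keys k with start <= k < stop, located by binary search."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


result = range_scan(row_keys, b"user#0002", b"user#0100")
print(result)
```

Because the comparison is lexicographical rather than numeric, a key such as row10 sorts before row2, which is why numeric parts of a key are often zero-padded.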
HBase is a column-oriented database in which data is stored in tables. The tables are sorted by RowId.
Each row, identified by its RowId, holds the collection of column families that are present in the
table.
The column families present in the schema hold key-value pairs. Looking in detail, each column
family can have any number of columns. The column values are stored on disk. Each cell of the
table has its own metadata, such as a timestamp and other information.
In HBase, the following key terms describe the table schema:
Tables: Data is stored in tables in HBase, but these tables are in a column-oriented format.
Row Key: Row keys are used to search for records, which makes searches fast. You may be curious
to know how; this is explained in the architecture part later in this blog.
Column Families: Various columns are combined into a column family. The columns of a family are
stored together, which makes searching faster because data belonging to the same column
family can be accessed together in a single seek.
Column Qualifiers: Each column’s name is known as its column qualifier.
Cell: Data is stored in cells. Each cell is uniquely identified by its row key, column family, and
column qualifier.
Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is stored
with its timestamp. This makes it easy to search for a particular version of the data.
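One way to picture how these terms fit together is as a nested map: row key, then column family, then column qualifier, then timestamp, then value. The sketch below is a toy in-memory model under that assumption; ToyTable and its methods are hypothetical names, not part of HBase.

```python
class ToyTable:
    """Toy model: row key -> column family -> qualifier -> {timestamp: value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, family, qualifier, value, timestamp):
        # Each write adds a new version of the cell, keyed by its timestamp.
        cell = (self._rows.setdefault(row_key, {})
                          .setdefault(family, {})
                          .setdefault(qualifier, {}))
        cell[timestamp] = value

    def get(self, row_key, family, qualifier, timestamp=None):
        # Without a timestamp, return the newest version of the cell.
        versions = self._rows.get(row_key, {}).get(family, {}).get(qualifier, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)


table = ToyTable()
table.put("row1", "info", "name", "Ann", timestamp=1)
table.put("row1", "info", "name", "Anna", timestamp=2)
print(table.get("row1", "info", "name"))
print(table.get("row1", "info", "name", timestamp=1))
```

The second lookup shows how the timestamp selects an older version of the same cell.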
Column-oriented and row-oriented storage differ in their storage mechanism. Traditional relational
models store data in a row-based format, that is, as rows of data. Column-oriented storage
stores data tables in terms of columns and column families.
The following table gives a key difference between these two kinds of storage:
Column-oriented database: It can store a very large amount of data, on the order of petabytes.
Row-oriented database: It is designed for a small number of rows and columns.
HBase Data Model
A set of tables
Each table has column families and rows
Each table must have an element defined as the primary key.
The row key acts as the primary key in HBase.
Any access to HBase tables uses this primary key
Each column present in HBase denotes an attribute of the corresponding object
The HBase architecture has the following components:
HMaster
HRegionServer
HRegions
ZooKeeper
HDFS
HMaster:
HMaster is the implementation of the Master server in the HBase architecture. It acts as a monitoring
agent for all Region Server instances in the cluster and as the interface for all metadata
changes. In a distributed cluster environment, the Master runs on the NameNode. The Master runs
several background threads.
HMaster plays a vital role in cluster performance and in maintaining the nodes in the cluster.
HMaster provides administrative operations and distributes services to the different region servers.
HMaster assigns regions to region servers.
HMaster controls load balancing and failover to handle the load across the nodes in the
cluster.
When a client wants to change the schema or perform any metadata operation, HMaster takes
responsibility for these operations.
The methods exposed by the HMaster interface are primarily metadata-oriented methods.
The client communicates bidirectionally with both HMaster and ZooKeeper. For read and write
operations, it contacts the HRegion servers directly. HMaster assigns regions to region servers and,
in turn, checks the health status of the region servers.
The architecture contains multiple region servers. Each region server has an HLog, which stores
all of its log files.
When a Region Server receives write and read requests from the client, it assigns each request to the
specific region where the relevant column family resides. The client can contact HRegion servers
directly; it does not need HMaster's permission to communicate with them. The client requires
HMaster's help only for operations related to metadata and schema changes.
HRegionServer is the Region Server implementation. It is responsible for serving and managing the
regions, or data, present in a distributed cluster. The region servers run on the DataNodes of the
Hadoop cluster.
HMaster can get into contact with multiple HRegion servers and performs the following functions.
HBase Regions:
HRegions are the basic building elements of an HBase cluster. They hold the distributed portions of
tables and are made up of column families. Each region contains multiple stores, one for each column
family. Each store consists mainly of two components, the MemStore and HFiles.
ZooKeeper:
In HBase, ZooKeeper is a centralized monitoring server that maintains configuration information and
provides distributed synchronization. Distributed synchronization means coordinating access to the
distributed applications running across the cluster, with ZooKeeper responsible for providing
coordination services between nodes. If a client wants to communicate with regions, it has to
approach ZooKeeper first.
When nodes in the HBase cluster fail, the ZooKeeper quorum triggers error messages and starts
to repair the failed nodes.
HDFS:
HDFS is the Hadoop Distributed File System. As the name implies, it provides a distributed environment
for storage, and it is a file system designed to run on commodity hardware. It stores each file in
multiple blocks and, to maintain fault tolerance, replicates the blocks across the Hadoop cluster.
HDFS provides a high degree of fault tolerance and runs on cheap commodity hardware. Adding nodes
to the cluster and performing processing and storage on cheap commodity hardware gives the client
better results than the existing setup.
The data stored in each block is replicated to 3 nodes, so when any node goes down there is no loss
of data; HDFS has a proper backup and recovery mechanism.
HDFS works with the HBase components and stores a large amount of data in a distributed
manner.
The read and write operations from the client into HFile are shown in the diagram below.
Step 1) The client wants to write data, so it first communicates with the region server and then with the regions.
Step 2) The region contacts the MemStore associated with the column family to store the data.
Step 3) The data is first stored in the MemStore, where it is sorted, and is then flushed into an HFile. The
main reason for using the MemStore is to store data in the distributed file system sorted by row key. The
MemStore is held in the region server's main memory, while HFiles are written into HDFS.
Step 6) The client approaches the HFiles to get the data. The data is fetched and retrieved by the client.
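The write path in the steps above can be sketched in a few lines of Python. This is a simplified toy model, not HBase code: ToyStore and flush_threshold are hypothetical, the write-ahead log is left out, and each "HFile" is just an immutable sorted list held in memory. The read path here checks the MemStore before the flushed files, which is an assumption of this sketch so that the newest value wins.

```python
class ToyStore:
    """Toy store for one column family: an in-memory MemStore plus flushed HFiles."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical flush trigger
        self.memstore = {}
        self.hfiles = []  # each entry is a list of (row_key, value) sorted by row key

    def put(self, row_key, value):
        # Writes land in the MemStore first.
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sort by row key, then persist the batch as a new immutable "HFile".
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, row_key):
        # Newest data first: MemStore, then HFiles from newest to oldest.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


store = ToyStore(flush_threshold=2)
store.put("b", 1)
store.put("a", 2)   # reaches the threshold, so the MemStore is flushed
store.put("a", 3)   # newer value stays in the MemStore for now
print(store.hfiles, store.memstore)
```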
The MemStore holds in-memory modifications to the store. The hierarchy of objects in HBase Regions,
from top to bottom, is shown in the table below.
Store: One store per column family for each region of the table.
MemStore: One MemStore per store for each region of the table. It sorts data before flushing it into
HFiles. Write and read performance increase because of the sorting.
StoreFile: StoreFiles for each store for each region of the table.
HBase runs on top of HDFS and Hadoop. Some key differences between HDFS and HBase are in terms of
data operations and processing.
HBase: Both storage and processing can be performed.
HDFS: It is only for storage.
Schema Design
An HBase table can scale to billions of rows and any number of columns, based on your requirements.
Such a table allows you to store terabytes of data. HBase tables support high read and write
throughput at low latency. A single value in each row is indexed; this value is known as the row key. In
this article, we will look at HBase table schema design and concepts.
HBase Table Schema Design General Concepts
HBase schema design is very different from relational database schema design. Below are
some general concepts that should be followed while designing a schema in HBase:
Row key: Each table in HBase is indexed on the row key. Data is sorted lexicographically by this
row key. There are no secondary indices available on an HBase table.
Atomicity: Avoid designing a table that requires atomicity across all rows. All operations on HBase
rows are atomic at the row level.
Even distribution: Reads and writes should be uniformly distributed across all nodes available in the
cluster. Design the row key in such a way that related entities are stored in adjacent rows to
increase read efficiency.
HBase Schema Row key, Column family, Column qualifier, individual and Row value Size Limit
When choosing a row key for HBase tables, design it in such a way that there is no hotspotting.
To get the best performance out of an HBase cluster, design a row key that allows the system to
write evenly across all the nodes.
A poorly designed row key can cause a full table scan when you request some data out of the table.
Reverse Domain Names
If you are storing data that is represented by domain names, consider using the reverse domain name
as the row key for your HBase tables. For example, com.company.name.
This technique works perfectly well when you have data spread across multiple reverse domains. If you
have very few reverse domains, you may end up storing data on a single node, causing hotspotting.
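A reverse domain row key can be built with a small helper like the one below. This is an illustrative sketch; reverse_domain_key is a hypothetical name, and lower-casing the domain is an assumption made here so that keys compare consistently.

```python
def reverse_domain_key(domain):
    """Turn 'name.company.com' into 'com.company.name' for use as a row key."""
    return ".".join(reversed(domain.lower().split(".")))


keys = sorted(reverse_domain_key(d) for d in
              ["mail.company.com", "example.org", "www.company.com"])
print(keys)
```

After sorting, the two company.com hosts end up next to each other, which is what keeps related rows adjacent.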
Hashing
When your data is identified by a string identifier, that identifier is a good basis for your
HBase table row key. Use a hash of the string identifier as the row key instead of the raw string. For
example, if you are storing user data that is identified by user IDs, then a hash of the user ID is a
better choice for your row key.
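A hashed row key might be derived as follows. The choice of SHA-256 and the hashed_row_key name are assumptions of this sketch; the text above does not prescribe a particular hash function.

```python
import hashlib


def hashed_row_key(user_id):
    """Return a hex digest of the user ID to use as the row key."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


key_a = hashed_row_key("user1")
key_b = hashed_row_key("user2")
print(key_a, key_b)
```

Similar IDs such as user1 and user2 map to unrelated keys, which spreads writes across the key space. The trade-off is that the original ordering of the IDs is lost, so range scans over consecutive user IDs are no longer possible.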
Timestamps
When you retrieve data based on the time it was stored, it is best to include the timestamp in your row
key. For example, if you are storing machine logs identified by machine number, append the
timestamp to the machine number when designing the row key: machine001#1435310751234.
You can combine multiple keys to design the row key for your HBase table based on your requirements.
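The machine-log key from the example can be composed like this. The machine_log_key helper is hypothetical, and zero-padding the timestamp to 13 digits is an assumption added here so that lexicographical order matches chronological order.

```python
def machine_log_key(machine_id, timestamp_ms):
    """Build '<machine id>#<timestamp>' with a fixed-width timestamp."""
    return f"{machine_id}#{timestamp_ms:013d}"


key = machine_log_key("machine001", 1435310751234)
print(key)
```

Without the padding, a timestamp of 1000 would sort before 999 because the comparison is done character by character.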
Column Families
In HBase, use at most 10 column families to get the best performance out of the cluster. If your row
contains multiple values that are related to each other, place them in the same column family.
Also, the names of your column families should be short, since they are included in the data that is
transferred for each request.
Column Qualifiers
You can create as many column qualifiers as you need in each row. The empty cells in a row do not
consume any space. The names of your column qualifiers should be short, since they are included in the
data that is transferred for each request.
You can create the schema using the Apache HBase shell or the Java API.
ZooKeeper Architecture
Apache ZooKeeper works on a client-server architecture in which clients are the machine nodes that
use the service and servers are the nodes that provide it.
The following figure shows the relationship between the servers and their clients. In it, we can see that
each client uses the client library to communicate with any of the ZooKeeper nodes.
The components of the ZooKeeper architecture are explained in the following table.
Part Description
As soon as the ensemble (a group of ZooKeeper servers) starts, it waits for
clients to connect to the servers.
After that, each client connects to one of the nodes in the ZooKeeper ensemble. That node can be
either a leader node or a follower node.
Once the client is connected to a particular node, the node assigns a session ID to the client and sends an
acknowledgement to that client.
If the client does not get an acknowledgement from the node, it resends the message to another node
in the ZooKeeper ensemble and tries to connect to it.
On receiving the acknowledgement, the client makes sure that the connection is not lost by sending
heartbeats to the node at regular intervals.
Finally, the client can perform functions such as reading, writing, or storing data as needed.
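The connection workflow above can be modelled with a toy simulation. Nothing here uses the real ZooKeeper client library; ToyZkNode and connect_to_ensemble are hypothetical names, and a node that is down is modelled simply as one that never acknowledges.

```python
import itertools


class ToyZkNode:
    """Toy ensemble member that hands out session IDs as acknowledgements."""

    _session_ids = itertools.count(1)

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.sessions = set()

    def connect(self):
        # Return a session ID as the acknowledgement, or None if unreachable.
        if not self.alive:
            return None
        session_id = next(ToyZkNode._session_ids)
        self.sessions.add(session_id)
        return session_id

    def heartbeat(self, session_id):
        # The session stays valid while the node is up and knows the session.
        return self.alive and session_id in self.sessions


def connect_to_ensemble(nodes):
    """Try each node in turn until one acknowledges the connection."""
    for node in nodes:
        session_id = node.connect()
        if session_id is not None:
            return node, session_id
    raise ConnectionError("no node in the ensemble acknowledged the connection")


ensemble = [ToyZkNode("zk1", alive=False), ToyZkNode("zk2"), ToyZkNode("zk3")]
node, session_id = connect_to_ensemble(ensemble)
print(node.name, node.heartbeat(session_id))
```

In this run the first node is down, so the client falls through to the second node and then keeps the session alive with heartbeats.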
Apache ZooKeeper provides a wide range of useful features. Let’s start exploring them.
Updating the Node’s Status: Apache ZooKeeper can update every node, which allows it to store
up-to-date information about each node across the cluster.
Managing the Cluster: ZooKeeper can manage the cluster in such a way that the status of each node
is maintained in real time, leaving less room for errors and ambiguity.
Naming Service: ZooKeeper attaches a unique identifier to every node, much as DNS does,
which helps identify it.
Automatic Failure Recovery: Apache ZooKeeper locks data while it is being modified, which helps the
cluster recover automatically if a failure occurs in the database.
ZooKeeper Use Cases: There are many use cases of ZooKeeper. Some of the most prominent of them are
as follows:
Managing the configuration
Naming services
Choosing the leader
Queuing the messages
Managing the notification system
Synchronization
ZooKeeper is used in monitoring a cluster:
Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate among
themselves and maintain shared data with robust synchronization techniques.
ZooKeeper is itself a distributed application that provides services for writing distributed applications.
ZooKeeper is a distributed coordination service for managing a large set of hosts.
Coordinating and managing a service in a distributed environment is a complicated process.
ZooKeeper solves this issue with its simple architecture and API.
Apache ZooKeeper provides a hierarchical file system (with ZNodes as the system files) that helps
with discovery, registration, configuration, locking, leader selection, queueing, etc. of services
running on different machines. The ZooKeeper server maintains configuration information and naming,
and provides distributed synchronization and group services used by distributed
applications.
HBase uses ZooKeeper as its coordination service.
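The hierarchical ZNode namespace can be pictured as a small tree of paths that each carry a little data. The sketch below is a toy in-memory model, not the ZooKeeper API; ToyZNodeTree and the example paths are hypothetical.

```python
class ToyZNodeTree:
    """Toy znode namespace: slash-separated paths, each holding a bytes payload."""

    def __init__(self):
        self._data = {"/": b""}

    def create(self, path, data=b""):
        # A znode can only be created under a parent that already exists.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._data:
            raise KeyError(f"parent znode {parent} does not exist")
        if path in self._data:
            raise KeyError(f"znode {path} already exists")
        self._data[path] = data

    def get(self, path):
        return self._data[path]

    def children(self, path):
        # Direct children only: names with no further '/' after the prefix.
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self._data
                      if p != "/" and p.startswith(prefix)
                      and "/" not in p[len(prefix):])


tree = ToyZNodeTree()
tree.create("/config")
tree.create("/config/db", b"host=a")
tree.create("/workers")
print(tree.children("/"), tree.children("/config"), tree.get("/config/db"))
```

Services on different machines could register themselves under a shared path such as /workers, and others could discover them by listing its children.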