
UNIT V

HBase Concepts

What is HBase: HBase is an essential part of the Hadoop ecosystem. It is an open-source,
multidimensional, distributed, scalable NoSQL database written in Java. HBase runs on top
of HDFS (Hadoop Distributed File System) and provides BigTable-like capabilities to Hadoop. It is
designed to provide a fault-tolerant way of storing large collections of sparse data sets.

Features of HBase:

The features of HBase include:

 Atomic read and write: HBase provides atomic reads and writes at the row level. A write to a single
row either applies completely or not at all, and a concurrent reader never sees a partially written row.
 Consistent reads and writes: HBase provides consistent reads and writes because of this row-level
atomicity.
 Linear and modular scalability: Because data sets are distributed over HDFS, HBase scales linearly as
nodes are added, and it scales modularly because the data is divided across those nodes.
 Automatic and configurable sharding of tables: HBase tables are split into regions, and these regions
are distributed across the region servers of the cluster. Regions split and are redistributed
as the data grows.
 Easy to use Java API for client access: It provides easy to use Java API for programmatic
access.
 Thrift gateway and RESTful web service: It also supports Thrift and REST APIs for non-Java
front-ends.
 Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high-volume
query optimization.
 Automatic failure support: HBase keeps a WAL (Write Ahead Log) on HDFS, which lets the cluster
recover automatically when a region server fails.
 Sorted rowkeys: Because searching is done on ranges of rows, HBase stores rowkeys in
lexicographical order. Using these sorted rowkeys and timestamps, we can build optimized
requests.
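
The effect of lexicographically sorted rowkeys on range requests can be sketched in plain Python. This is only an illustration under stated assumptions: the range_scan helper and the sample keys are hypothetical, and a sorted list stands in for HBase's storage. It does not use the HBase API.

```python
from bisect import bisect_left


def range_scan(sorted_keys, start_key, stop_key):
    """Return the keys in [start_key, stop_key) using binary search."""
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


# Rowkeys are compared byte by byte, so b"row10" sorts before b"row2".
row_keys = sorted([b"row1", b"row10", b"row2", b"row20", b"row3"])
print(row_keys)

# Because the keys are sorted, a range request touches one contiguous slice.
print(range_scan(row_keys, b"row1", b"row2"))
```

Note that the ordering is by bytes, not by numeric value, which is why numeric parts of a rowkey are usually zero-padded.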

Storage Mechanism in HBase

HBase is a column-oriented database in which data is stored in tables. The tables are sorted by RowId.
Each row is identified by its RowId and holds a collection of the column families that are present in the
table.

The column families defined in the schema hold key-value pairs. Looking in detail, each column family
can have any number of columns, and the column values are stored on disk. Each cell of the table has its
own metadata, such as a timestamp.

The following are the key terms that describe an HBase table schema:

 Tables: Data is stored in tables in HBase, but these tables are in a column-oriented format.
 Row Key: Row keys are used to look up records, which makes searches fast. How this works is explained
in the architecture section later in this unit.
 Column Families: Related columns are grouped into a column family. The data of a column family is
stored together, which makes searching faster because data belonging to the same column
family can be accessed together in a single seek.
 Column Qualifiers: Each column’s name is known as its column qualifier.
 Cell: Data is stored in cells. Each cell is uniquely identified by its row key, column family, and
column qualifier.
 Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is stored
with its timestamp, which makes it easy to search for a particular version of the data.
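
The key terms above can be pictured as nested maps. The following is a minimal sketch, assuming a toy in-memory SketchTable class (a hypothetical name, not part of HBase) in which a cell is addressed by row key, column family, and column qualifier, and each cell keeps several timestamped versions.

```python
class SketchTable:
    """Toy model of the HBase data model: nested maps, not the HBase API."""

    def __init__(self):
        # row key -> (column family, column qualifier) -> {timestamp: value}
        self.rows = {}

    def put(self, row_key, family, qualifier, value, timestamp):
        cells = self.rows.setdefault(row_key, {})
        versions = cells.setdefault((family, qualifier), {})
        versions[timestamp] = value

    def get(self, row_key, family, qualifier, timestamp=None):
        versions = self.rows.get(row_key, {}).get((family, qualifier), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # the newest version is the default
        return versions.get(timestamp)


table = SketchTable()
table.put("user1", "info", "email", "old@example.com", timestamp=100)
table.put("user1", "info", "email", "new@example.com", timestamp=200)
print(table.get("user1", "info", "email"))                 # newest version
print(table.get("user1", "info", "email", timestamp=100))  # a specific version
```

A row that never stores a value for some qualifier simply has no entry for it, which is how sparse data avoids taking up space in this model.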

Column-oriented vs Row-oriented storages

Column-oriented and row-oriented storages differ in their storage mechanism. Traditional relational
models store data in a row-based format, that is, as rows of data. Column-oriented storages
store data tables in terms of columns and column families.

The following table gives some key differences between these two storages:

Column-oriented database:
 Used for processing and analytics, such as Online Analytical Processing (OLAP) and its applications.
 Can store a very large amount of data, on the order of petabytes.

Row-oriented database:
 Used for online transactional processing (OLTP), such as in the banking and finance domains.
 Designed for a small number of rows and columns.

HBase Data Model

The HBase data model consists of the following elements:

 A set of tables
 Each table has column families and rows
 Each table must have an element defined as the primary key
 The row key acts as the primary key in HBase
 Any access to HBase tables uses this primary key
 Each column present in HBase denotes an attribute of the corresponding object

HBase Architecture and its Important Components

HBase Architecture Diagram

HBase architecture consists mainly of five components

 HMaster
 HRegionServer
 HRegions
 Zookeeper
 HDFS

HMaster:

HMaster is the implementation of the Master server in the HBase architecture. It acts as a monitoring
agent for all Region Server instances present in the cluster and as the interface for all metadata
changes. In a distributed cluster environment, the Master typically runs on the NameNode. The Master runs
several background threads.

The following are important roles performed by HMaster in HBase.

 Plays a vital role in terms of performance and maintaining nodes in the cluster.
 Provides administrative operations and distributes services to the different region servers.
 Assigns regions to region servers.
 Controls load balancing and failover to handle the load over the nodes present in the cluster.
 Takes responsibility for schema changes and other metadata operations requested by a client.

The methods exposed by the HMaster interface are primarily metadata-oriented methods:

 Table (createTable, removeTable, enable, disable)
 ColumnFamily (add Column, modify Column)
 Region (move, assign)

The client communicates in a bi-directional way with both HMaster and ZooKeeper. For read and write
operations, it contacts the HRegion servers directly. HMaster assigns regions to region servers and, in
turn, checks the health status of the region servers.

The architecture as a whole has multiple region servers. An HLog is present in each region server and
stores its log files.

HBase Regions Servers:

When a Region Server receives write and read requests from the client, it assigns each request to the
specific region where the actual column family resides. The client can contact HRegion servers directly;
it does not need permission from HMaster to communicate with them. The client requires HMaster's help
only when operations related to metadata and schema changes are required.

HRegionServer is the Region Server implementation. It is responsible for serving and managing the regions,
or data, present in a distributed cluster. The region servers run on the DataNodes of the Hadoop
cluster.

HMaster can be in contact with multiple HRegion servers, each of which performs the following functions:

 Hosting and managing regions
 Splitting regions automatically
 Handling read and write requests
 Communicating with the client directly
HBase Regions:

HRegions are the basic building elements of an HBase cluster. They hold the distributed portions of
tables and are made up of column families. Each region contains multiple stores, one for each column
family, and each store consists mainly of two components: the Memstore and HFiles.

ZooKeeper:

In HBase, ZooKeeper is a centralized monitoring server that maintains configuration information and
provides distributed synchronization. Distributed synchronization means coordinating access to the
distributed applications running across the cluster by providing coordination services between nodes. If
a client wants to communicate with regions, it has to approach ZooKeeper first.

ZooKeeper is an open-source project, and it provides many important services.

Services provided by ZooKeeper

 Maintains configuration information
 Provides distributed synchronization
 Establishes client communication with region servers
 Provides ephemeral nodes, which represent the different region servers
 Lets master servers use the ephemeral nodes to discover available servers in the cluster
 Tracks server failures and network partitions

The master and the HBase slave nodes (region servers) register themselves with ZooKeeper. The client needs
access to the ZK (ZooKeeper) quorum configuration to connect with the master and region servers.

When nodes in the HBase cluster fail, the ZK quorum triggers error messages and starts
to repair the failed nodes.
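
The idea behind ephemeral nodes can be sketched as a registry whose entries disappear when the owning session ends. This is a simplified illustration, not the ZooKeeper client API: the SketchEphemeralRegistry class, the paths, and the session numbers are all hypothetical.

```python
class SketchEphemeralRegistry:
    """Toy registry: each path is owned by a session and vanishes with it."""

    def __init__(self):
        self._owner_by_path = {}

    def register(self, path, session_id):
        # A region server announces itself by creating a node under its session.
        self._owner_by_path[path] = session_id

    def expire_session(self, session_id):
        # When a session ends (for example, the server crashed), its nodes go away.
        expired = [p for p, owner in self._owner_by_path.items() if owner == session_id]
        for path in expired:
            del self._owner_by_path[path]
        return sorted(expired)

    def live_servers(self):
        # The master discovers available servers by listing the remaining nodes.
        return sorted(self._owner_by_path)


registry = SketchEphemeralRegistry()
registry.register("/servers/rs1", session_id=1)
registry.register("/servers/rs2", session_id=2)
print(registry.live_servers())

registry.expire_session(2)  # rs2 stops responding and its session expires
print(registry.live_servers())
```

Listing the remaining nodes is enough for a master to tell which servers are still available, without contacting each server itself.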

HDFS:

HDFS is the Hadoop Distributed File System. As the name implies, it provides a distributed environment for
storage, and it is a file system designed to run on commodity hardware. It stores each file in
multiple blocks and, to maintain fault tolerance, replicates the blocks across the Hadoop cluster.

HDFS provides a high degree of fault tolerance and runs on cheap commodity hardware. Adding nodes
to the cluster, and using that cheap commodity hardware for processing and storage, gives
the client better results than the existing setup.

The data stored in each block is replicated to 3 nodes by default, so when any node goes down there
is no loss of data; the replicas provide a proper backup and recovery mechanism.

HDFS is in contact with the HBase components and stores a large amount of data in a distributed
manner.

HBase Read and Write Data Explained

The read and write operations between the client and HFiles can be shown as in the diagram below.

Step 1) The client wants to write data, so it first communicates with the region server and then with the
region.

Step 2) The region contacts the Memstore associated with the column family to store the data.

Step 3) The data is first stored in the Memstore, where it is sorted, and after that it is flushed into an
HFile. The main reason for using the Memstore is that data must be stored in the distributed file system
sorted by row key. The Memstore is placed in the region server's main memory, while HFiles are written
into HDFS.

Step 4) The client wants to read data from the regions.

Step 5) The read request first goes to the Memstore, which is checked for the requested data.

Step 6) If the data is not in the Memstore, the HFiles are consulted. The data is fetched and returned to
the client.
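
The steps above can be sketched with a toy store. This is a simplified model under stated assumptions: SketchStore is a hypothetical class, the flush is triggered by an entry count (real flushes depend on the Memstore's size), and Python lists stand in for HFiles on HDFS.

```python
class SketchStore:
    """Toy store for one column family: an in-memory Memstore plus flushed files."""

    def __init__(self, flush_threshold=3):
        self.memstore = {}   # unsorted in-memory writes
        self.hfiles = []     # each "HFile" is a list of (row_key, value) sorted by key
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        # Steps 1 to 3: the write lands in the Memstore first.
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Data is sorted by row key before it is written out as a file.
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, row_key):
        # Steps 4 to 6: check the Memstore first, then the files, newest first.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


store = SketchStore(flush_threshold=3)
store.put("row3", "c")
store.put("row1", "a")
store.put("row2", "b")      # third write reaches the threshold and triggers a flush
store.put("row1", "a2")     # newer value stays in the Memstore
print(store.hfiles)
print(store.get("row1"), store.get("row2"))
```

Reading the Memstore before the flushed files is what lets the newest value win even though an older copy of the same row already sits in a file.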

Memstore holds in-memory modifications to the store. The hierarchy of objects in HBase regions, from
top to bottom, is shown in the table below.

Table: An HBase table present in the HBase cluster
Region: The HRegions for the presented tables
Store: One store per ColumnFamily for each region of the table
Memstore: One Memstore for each store for each region of the table. It sorts data before flushing it
into HFiles; write and read performance increases because of this sorting.
StoreFile: The StoreFiles for each store for each region of the table
Block: The blocks present inside StoreFiles

HBase vs. HDFS

HBase runs on top of HDFS and Hadoop. Some key differences between HDFS and HBase are in terms of
data operations and processing.

HBase:
 Low-latency operations
 Random reads and writes
 Accessed through shell commands or a client API in Java, REST, Avro, or Thrift
 Both storage and processing can be performed

HDFS:
 High-latency operations
 Write once, read many times
 Primarily accessed through MR (MapReduce) jobs
 Used only for storage

Schema Design

An HBase table can scale to billions of rows and any number of columns, based on your requirements. Such
a table allows you to store terabytes of data. An HBase table supports high read and write
throughput at low latency. A single value in each row is indexed; this value is known as the row key. This
section covers HBase table schema design and its concepts.

HBase Table Schema Design General Concepts

HBase schema design is very different from relational database schema design. Below are
some general concepts that should be followed while designing a schema in HBase:

 Row key: Each table in HBase is indexed on the row key. Data is sorted lexicographically by this
row key. There are no secondary indices available on an HBase table.
 Atomicity: Avoid designing a table that requires atomicity across multiple rows. Operations on HBase
rows are atomic only at the row level.
 Even distribution: Reads and writes should be uniformly distributed across all nodes available in the
cluster. Design the row key so that related entities are stored in adjacent rows, which increases read
efficiency.

HBase Schema Size Limits for Row Keys, Column Families, Column Qualifiers, Individual Values, and Rows

Consider the following size limits when designing a schema in HBase:

 Row keys: 4 KB per key
 Column families: not more than 10 column families per table
 Column qualifiers: 16 KB per qualifier
 Individual values: less than 10 MB per cell
 All values in a single row: max 10 MB
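
The limits listed above can be turned into a small pre-write check. The sketch below is only an illustration: the constants copy the figures from the list, while the check_cell and check_families helpers are hypothetical names and not part of any HBase API.

```python
# Limits taken from the list above (design guidance, checked here by hand).
MAX_ROW_KEY_BYTES = 4 * 1024            # row keys: 4 KB per key
MAX_COLUMN_FAMILIES = 10                # not more than 10 families per table
MAX_QUALIFIER_BYTES = 16 * 1024         # column qualifiers: 16 KB per qualifier
MAX_VALUE_BYTES = 10 * 1024 * 1024      # individual values: less than 10 MB


def check_cell(row_key: bytes, qualifier: bytes, value: bytes) -> list:
    """Return a list of limit violations for one cell (empty if none)."""
    problems = []
    if len(row_key) > MAX_ROW_KEY_BYTES:
        problems.append("row key exceeds 4 KB")
    if len(qualifier) > MAX_QUALIFIER_BYTES:
        problems.append("column qualifier exceeds 16 KB")
    if len(value) >= MAX_VALUE_BYTES:
        problems.append("value is not less than 10 MB")
    return problems


def check_families(families) -> list:
    """Return a list of violations for a table's column families."""
    if len(families) > MAX_COLUMN_FAMILIES:
        return ["more than 10 column families"]
    return []


print(check_cell(b"user1", b"email", b"someone@example.com"))
print(check_cell(b"k" * (4 * 1024 + 1), b"email", b"x"))
print(check_families(["cf%d" % i for i in range(11)]))
```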

HBase Row Key Design

When choosing a row key for HBase tables, design it so that there is no
hotspotting. To get the best performance out of an HBase cluster, design a row key that
allows the system to write evenly across all the nodes.

A poorly designed row key can cause a full table scan when you request data from the table.

Types of HBase Row Keys

The following are some commonly used types of HBase row keys:

Reverse Domain Names

If you are storing data that is identified by domain names, consider using the reversed domain name
as the row key for your HBase tables. For example, com.company.name.

This technique works well when your data is spread across many reverse domains. If you
have very few reverse domains, you may end up storing the data on a single node, causing hotspotting.
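
A reversed domain name can be built with a few lines of code. This is a minimal sketch: reverse_domain is a hypothetical helper and the host names are made-up examples.

```python
def reverse_domain(domain: str) -> str:
    """Turn 'name.company.com' into 'com.company.name'."""
    return ".".join(reversed(domain.lower().split(".")))


print(reverse_domain("name.company.com"))

# After reversal, hosts of the same domain sort next to each other.
hosts = ["mail.example.com", "www.other.org", "www.example.com"]
keys = sorted(reverse_domain(h) for h in hosts)
print(keys)
```

Because row keys are sorted lexicographically, all rows of one domain end up adjacent, which suits scans over a single domain.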

Hashing

When your data is identified by a string identifier, use a hash of that identifier as the row key
instead of the raw string. For example, if you are storing user data that is identified by user IDs, a
hash of the user ID is a better choice for your row key.
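
A hashed row key might be derived as follows. This is a sketch under assumptions: SHA-256 from Python's standard library is used purely for illustration (the text does not prescribe a hash function), and hashed_row_key and the user IDs are hypothetical.

```python
import hashlib


def hashed_row_key(user_id: str) -> str:
    """Derive a fixed-length row key from a string identifier."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


# Sequential IDs no longer share a common prefix after hashing,
# so consecutive writes are not all directed at the same key range.
for uid in ["user0001", "user0002", "user0003"]:
    print(uid, "->", hashed_row_key(uid))
```

The trade-off is that hashed keys no longer sort in the order of the original identifiers, so range scans over consecutive user IDs are lost.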

Timestamps

When you retrieve data based on the time it was stored, it is best to include the timestamp in your row
key. For example, if you are storing machine logs identified by machine number, append the
timestamp to the machine number when designing the row key: machine001#1435310751234.
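
Such a key can be assembled as shown below. This is a minimal sketch: machine_log_key is a hypothetical helper, and zero-padding the timestamp to 13 digits is an assumption made so that lexicographic order matches chronological order.

```python
def machine_log_key(machine_id: str, timestamp_ms: int) -> str:
    """Build a row key of the form <machine id>#<timestamp in milliseconds>."""
    # Zero-padding keeps string order identical to numeric order.
    return f"{machine_id}#{timestamp_ms:013d}"


print(machine_log_key("machine001", 1435310751234))

# Entries for one machine sort by time, even when the timestamps differ in length.
keys = sorted([
    machine_log_key("machine001", 1000),
    machine_log_key("machine001", 999),
    machine_log_key("machine002", 5),
])
print(keys)
```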

Combined Row Key

You can combine multiple keys to design the row key for your HBase table, based on your requirements.

HBase Column Families and Column Qualifiers

Below is some guidance on column families and column qualifiers:

Column Families

In HBase, use at most 10 column families per table to get the best performance out of the HBase cluster.
If your row contains multiple values that are related to each other, place them in the same column family.
Also, keep the names of your column families short, since they are included in the data that is
transferred for each request.

Column Qualifiers

You can create as many column qualifiers as you need in each row. Empty cells in a row do not
consume any space. Keep the names of your column qualifiers short, since they are included in the
data that is transferred for each request.

Creating HBase Schema Design

You can create the schema using the Apache HBase shell or the Java API.

Below is an example of creating a table schema:

hbase(main):001:0> create 'test_table_schema', 'cf'


0 row(s) in 2.7740 seconds

=> Hbase::Table - test_table_schema



What is Apache Zookeeper?

Apache ZooKeeper is a software project of the Apache Software Foundation. It is an open-source technology
that maintains configuration information and provides distributed synchronization as well as group
services, which are deployed on a Hadoop cluster to administer the infrastructure.

Why Do We Need Apache Zookeeper?


Here are important reasons behind the popularity of ZooKeeper:

 It allows for mutual exclusion and cooperation between server processes.
 It ensures that your application runs consistently.
 A transaction is never completed partially. It is given the status of either success or
failure. The distributed state can be held up, but it is never wrong.
 Irrespective of the server that it connects to, a client sees the same view of the service.
 It helps you to encode the data as per a specific set of rules.
 It helps to maintain a standard hierarchical namespace similar to files and directories.
 It lets computers that are locally or geographically connected run as a single system.
 It allows nodes to join or leave a cluster and reports node status in real time.
 You can increase performance by deploying more machines.
 It allows you to elect a node as a leader for better coordination.
 ZooKeeper works fast with workloads where reads of the data are more common than writes.

ZooKeeper Architecture

Apache ZooKeeper works on a client-server architecture in which the clients are machine nodes and the
servers are nodes of the ZooKeeper ensemble.

The following figure shows the relationship between the servers and their clients. In it, we can see that
each client uses the client library to communicate with any of the ZooKeeper nodes.

The parts of the ZooKeeper architecture and their descriptions are given in the following table.

Client: A client node in our distributed application cluster, used to access information from
the server. It sends a message to the server to let the server know that the client is alive,
and if there is no response from the connected server, the client automatically
resends the message to another server.

Server: The server gives an acknowledgement to the client to inform it that the server is alive, and
it provides all services to clients.

Leader: If any of the server nodes fails, this server node performs automatic recovery.

Follower: A server node that follows the instructions given by the leader.

Working of Apache ZooKeeper

 As soon as the ensemble (a group of ZooKeeper servers) starts, it waits for
the clients to connect to the servers.
 After that, each client connects to one of the nodes in the ZooKeeper ensemble. That node can be
either a leader node or a follower node.
 Once the client is connected to a particular node, the node assigns a session ID to the client and sends an
acknowledgement to that particular client.
 If the client does not get any acknowledgement from the node, it resends the message to another node
in the ZooKeeper ensemble and tries to connect with it.
 On receiving the acknowledgement, the client makes sure that the connection is not lost by sending
heartbeats to the node at regular intervals.
 Finally, the client can perform functions like reading, writing, or storing data as needed.
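
The connection steps above can be sketched as a small simulation. This is not the ZooKeeper client library: SketchServer, connect_to_ensemble, the server names, and the session ID format are all hypothetical, and no network traffic is involved.

```python
import itertools


class SketchServer:
    """Toy ensemble member that hands out session IDs while it is alive."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self._counter = itertools.count(1)

    def connect(self):
        # A live server acknowledges the client with a new session ID.
        if not self.alive:
            return None  # no acknowledgement
        return f"{self.name}-session-{next(self._counter)}"

    def heartbeat(self, session_id):
        # The client pings at regular intervals to check the connection.
        return self.alive


def connect_to_ensemble(servers):
    """Try each server in turn until one acknowledges the connection."""
    for server in servers:
        session_id = server.connect()
        if session_id is not None:
            return server, session_id
    raise ConnectionError("no server in the ensemble acknowledged the client")


ensemble = [SketchServer("zk1", alive=False), SketchServer("zk2"), SketchServer("zk3")]
server, session_id = connect_to_ensemble(ensemble)
print(server.name, session_id)
print(server.heartbeat(session_id))
```

Here the first server never acknowledges, so the client moves on and obtains its session from the second one.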

Features of Apache ZooKeeper

Apache ZooKeeper provides a wide range of useful features to the user. Let’s start exploring them.

 Updating the Node’s Status: Apache ZooKeeper can update every node, which allows it to store
updated information about each node across the cluster.
 Managing the Cluster: ZooKeeper can manage the cluster in such a way that the status of each node
is maintained in real time, leaving less chance for errors and ambiguity.
 Naming Service: ZooKeeper attaches a unique identification to every node, much as
DNS does, which helps identify it.
 Automatic Failure Recovery: Apache ZooKeeper locks the data while it is being modified, which helps the
cluster recover automatically if a failure occurs in the database.

Benefits of Apache ZooKeeper

 Simplicity: Coordination is done with the help of a shared hierarchical namespace.
 Reliability: The system keeps performing even if more than one node fails.
 Order: It keeps track of updates by stamping each one with a number denoting its order.
 Speed: It is especially fast in workloads where reads are more common than writes, at ratios of around
10:1.
 Scalability: Performance can be enhanced by deploying more machines.

ZooKeeper Use Cases: There are many use cases of ZooKeeper. Some of the most prominent of them are
as follows:

 Managing the configuration
 Naming services
 Choosing the leader
 Queuing the messages
 Managing the notification system
 Synchronization

ZooKeeper is used in monitoring a cluster:

 Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate among
themselves and maintain shared data with robust synchronization techniques.
 ZooKeeper is itself a distributed application that provides services for writing distributed applications.
 ZooKeeper is a distributed coordination service for managing a large set of hosts.
 Coordinating and managing a service in a distributed environment is a complicated process.
ZooKeeper solves this issue with its simple architecture and API.
 Apache ZooKeeper provides a hierarchical file system (with ZNodes as the system files) that helps
with the discovery, registration, configuration, locking, leader selection, queueing, etc. of services
running on different machines. A ZooKeeper server maintains configuration information and naming, and
provides distributed synchronization and group services, for use by distributed
applications.

How HBase uses ZooKeeper:

 ZooKeeper is a centralized monitoring server that maintains configuration information and
provides distributed synchronization.
 HBase uses ZooKeeper for region assignments.
 Whenever a client wants to access any region, it has to approach ZooKeeper first.
 Every region server and the HMaster are registered with ZooKeeper.
 So, the client needs to access ZooKeeper in order to connect with the region servers and HMaster.
 HMaster also contacts ZooKeeper to get information related to region servers.
 ZooKeeper tracks information about the region servers and the DataNodes on which they
reside.
 ZooKeeper also helps in recovering from crashed region servers, by having their regions loaded onto other
region servers that are running.
 Thus, ZooKeeper is an important component of the HBase architecture: it provides
communication between the client and region servers, maintains configuration information, and
handles server failures.
