
Tutorial: HBase

Theory and Practice of a Distributed Data Store
Pietro Michiardi
Eurecom
Introduction
Why yet another storage architecture?
Relational Database Management Systems (RDBMS):

Around since the 1970s

Countless examples in which they actually do make sense
The dawn of Big Data:

Previously: data sources were ignored because there was no
cost-effective way to store everything

One option was to prune, by retaining only data for the last N days

Today: store everything!

Pruning fails to provide a basis on which to build useful mathematical models
Batch processing
Hadoop and MapReduce:

Excels at storing (semi- and/or un-) structured data

Data interpretation takes place at analysis-time

Flexibility in data classification
Batch processing: A complement to RDBMS:

A scalable sink for data; processing is launched when the time is right

Optimized for large file storage

Optimized for “streaming” access
Random Access:

Users need to “interact” with data, especially data “crunched” by a
MapReduce job

This is historically where RDBMS excel: random access for
structured data
Column-Oriented Databases
Data layout:

Column-oriented databases save their data grouped by columns

Subsequent column values are stored contiguously on disk

This is substantially different from traditional RDBMS, which store
data row by row
Specialized databases for specific workloads:

Reduced I/O

Better suited for compression → Efficient use of bandwidth

Indeed, column values are often very similar and differ little
row-by-row

Real-time access to data
Important NOTE:

HBase is not a column-oriented DB in the typical sense

HBase uses an on-disk column storage format

Provides key-based access to a specific cell of data, or a sequential
range of cells
Column-Oriented and Row-Oriented storage layouts
Figure: Example of Storage Layouts
The Problem with RDBMS
RDBMS are still relevant

Persistence layer for frontend applications

Store relational data

Work well for a limited number of records
Example: Hush

Used throughout this course

URL shortener service
Let’s see the “scalability story” of such a service

Assumption: the service must run on a reasonable budget
A few thousand users: use a LAMP stack

Normalize data

Use foreign keys

Use Indexes
Figure: The Hush Schema expressed as an ERD
Find all short URLs for a given user

JOIN user and shorturl tables
Stored Procedures

Consistently update data from multiple clients

Underlying DB system guarantees coherency
Transactions

Make sure you can update tables in an atomic fashion

RDBMS → Strong Consistency (ACID properties)

Referential Integrity
Scaling up to tens of thousands of users

Increasing pressure on the database server

Adding more application servers is easy: they share their state on
the same central DB

CPU and I/O start to be a problem on the DB
Master-Slave architecture

Add DB servers so that READS can be served in parallel

Master DB takes all the writes (which are fewer in the Hush
application)

Slave DBs replicate the Master DB and serve all reads (but you need a
load balancer)
Scaling up to hundreds of thousands of users

READS are still the bottleneck

Slave servers begin to fall short in serving client requests
Caching

Add a caching layer, e.g. Memcached or Redis

Offload READS to a fast in-memory system
→ You lose consistency guarantees
→ Cache invalidation is critical for keeping the DB and the caching layer
consistent
Scaling up more

WRITES are the bottleneck

The master DB is hit too hard by WRITE load

Vertical scalability: beef up your master server
→ This becomes costly, as you may also have to replace your RDBMS
SQL JOINs become a bottleneck

Schema de-normalization

Cease using stored procedures, as they become slow and eat up a
lot of server CPU

Materialized views (they speed up READS)

Drop secondary indexes as they slow down WRITES
What if your application needs to further scale up?

Vertical scalability vs. Horizontal scalability
Sharding

Partition your data across multiple databases

Essentially, you break your tables horizontally and ship them to
different servers

This is done using fixed boundaries
→ Re-sharding is needed to achieve load balancing
→ This is an operational nightmare

Re-sharding takes a huge toll on I/O resources
Non-Relational Databases
They originally do not support SQL

In practice, the line that makes this distinction is becoming thin

One difference is in the data model

Another difference is in the consistency model (ACID and
transactions are generally sacrificed)
Consistency models and the CAP Theorem

Strict: all changes to data are atomic

Sequential: changes to data are seen in the same order as they
were applied

Causal: causally related changes are seen in the same order

Eventual: updates propagate through the system, and replicas
converge when in steady state

Weak: no guarantee
Dimensions to classify NoSQL DBs
Data model

How the data is stored: key/value, semi-structured,
column-oriented, ...

How to access data?

Can the schema evolve over time?
Storage model

In-memory or persistent?

How does this affect your access pattern?
Consistency model

Strict or eventual?

This translates into how fast the system handles READS and WRITES
[2]
Physical Model

Distributed or single machine?

How does the system scale?
Read/Write performance

Top-down approach: understand your workload well!

Some systems are better for READS, others for WRITES
Secondary indexes

Does your workload require them?

Can your system emulate them?
Failure Handling

How does each data store handle server failures?

Is it able to continue operating in case of failures?

This is related to Consistency models and the CAP theorem

Does the system support “hot-swap”?
Compression

Is the compression method pluggable?

What type of compression?
Load Balancing

Can the storage system seamlessly balance load?
Atomic read-modify-write

Easy in a centralized system, difficult in a distributed one

Prevents race conditions in multi-threaded or shared-nothing designs

Can reduce client-side complexity
Locking, waits and deadlocks

Support for multiple clients accessing data simultaneously

Is locking available?

Is it wait-free, hence deadlock free?
Impedance Match
“One-size-fits-all” has long been dismissed: you need to find the perfect
match for your problem.
Database (De-)Normalization
Schema design at scale

A good methodology is to apply the DDI principle [8]

Denormalization

Duplication

Intelligent Key design
Denormalization

Duplicate data in more than one table such that at READ time no
further aggregation is required
Next: an example based on Hush

How to convert a classic relational data model to one that fits
HBase
Example: Hush - from RDBMS to HBase
Figure: The Hush Schema expressed as an ERD
shorturl table: contains the short URL
click table: contains click tracking, and other statistics,
aggregated on a daily basis (essentially, a counter)
user table: contains user information
URL table: contains a replica of the page linked to a short URL,
including META data and content (this is done for batch analysis
purposes)
user table is indexed on the username field, for fast user lookup
shorturl table is indexed on the short URL (shortId) field, for
fast short URL lookup
shorturl and user tables are related through a foreign key
relation on the userId
URL table is related to shorturl table with a foreign key on the
URL id
click table is related to shorturl table with a foreign key on the
short URL id
NOTE: a web page is stored only once (even if multiple users link
to it), but each user maintains separate statistics
Figure: The Hush Schema in HBase
shorturl table: stores each short URL and its usage statistics
(various time ranges in separate column families with distinct TTL
settings)

Note the dimensional postfix appended to the time information

url table: stores the downloaded page, and the extracted details

This table uses compression
user-shorturl table: this is a lookup table (basically an index) to
find all shortIDs for a given user

Note that this table is filled at insert time; it is not automatically
generated by HBase

user table: stores user details
Example: Hush - RDBMS vs HBase
Same number of tables

Their meaning is different

click table has been absorbed by the shorturl table

statistics are stored with the date as the key, so that they can be
accessed sequentially

The user-shorturl table is replacing the foreign key
relationship, making user-related lookups faster
Normalized vs. De-normalized data

Wide tables and column-oriented design eliminate JOINs

Compound keys are essential

Data partitioning is based on keys, so a proper understanding
thereof is essential
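To make this concrete, here is a minimal Java sketch of how the date-keyed counters described above could be updated from the client API. The table and family names ("shorturl", "daily") and the hard-coded date are illustrative assumptions, not Hush's actual code:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HushClickCounter {
    // Row key: the hashed short ID; column qualifier: the date plus an
    // optional dimensional postfix, e.g. "20110502.us".
    static void recordClick(Connection conn, String shortId, String country)
            throws IOException {
        try (Table shorturl = conn.getTable(TableName.valueOf("shorturl"))) {
            String qualifier = "20110502" + (country == null ? "" : "." + country);
            shorturl.incrementColumnValue(
                Bytes.toBytes(shortId),
                Bytes.toBytes("daily"),     // column family with its own TTL
                Bytes.toBytes(qualifier),
                1L);                        // atomic, server-side increment
        }
    }
}

Because the qualifiers sort by date, a scan over the family returns the counters sequentially by day, which is exactly the access pattern the denormalized design targets.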
HBase building blocks
The backdrop: BigTable

GFS, The Google FileSystem [6]

Google MapReduce [4]

BigTable [3]
What is BigTable?

BigTable is a distributed storage system for managing structured
data designed to scale to a very large size

BigTable is a sparse, distributed, persistent multi-dimensional
sorted map
What is HBase?

Essentially it’s an open-source version of BigTable

Differences listed in [5]
Tables, Rows, Columns, and Cells
The most basic unit in HBase is a column

Each column may have multiple versions, with each distinct value
contained in a separate cell

One or more columns form a row, which is addressed uniquely by a
row key
A table is a collection of rows

All rows are always sorted lexicographically by their row key
Example (hbase shell): the sorting of rows is done lexicographically by their key.

hbase(main):001:0> scan 'table1'
ROW                  COLUMN+CELL
row-1                column=cf1:, timestamp=1297073325971 ...
row-10               column=cf1:, timestamp=1297073337383 ...
row-11               column=cf1:, timestamp=1297073340493 ...
row-2                column=cf1:, timestamp=1297073329851 ...
row-22               column=cf1:, timestamp=1297073344482 ...
row-3                column=cf1:, timestamp=1297073333504 ...
row-abc              column=cf1:, timestamp=1297073349875 ...
7 row(s) in 0.1100 seconds

Note how the numbering is not in sequence as you may have expected: in lexicographical sorting each key is compared on a binary level, byte by byte, from left to right. Since row-1... is less than row-2..., no matter what follows, it is sorted first; you may have to pad keys to get the expected order. Having the row keys always sorted gives you something like the primary key index known from RDBMS, and each row key is always unique.
Lexicographical ordering of row keys

Keys are compared on a binary level, byte by byte, from left to right

This can be thought of as a primary index on the row key!

Row keys are always unique

Row keys can be any arbitrary array of bytes
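A small, self-contained Java illustration of the byte-by-byte comparison rule and of why padding keys matters (the key names are made up):

import java.util.TreeSet;

public class KeyPadding {
    public static void main(String[] args) {
        TreeSet<String> unpadded = new TreeSet<>(); // sorted lexicographically
        TreeSet<String> padded = new TreeSet<>();
        for (int i : new int[] {1, 2, 3, 10, 11, 22}) {
            unpadded.add("row-" + i);
            padded.add(String.format("row-%05d", i)); // zero-padded keys
        }
        // [row-1, row-10, row-11, row-2, row-22, row-3]
        System.out.println(unpadded);
        // [row-00001, row-00002, row-00003, row-00010, row-00011, row-00022]
        System.out.println(padded);
    }
}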
Columns

Rows are composed of columns

Can have millions of columns

Can be compressed or tagged to stay in memory
Pietro Michiardi (Eurecom) Tutorial: HBase 28 / 102
Introduction HBase Sketch
HBase building blocks
Tables, Rows, Columns, and Cells
Column Families

Columns are grouped into column families
→ Semantic boundaries between data

All columns of a column family are stored together in the same low-level
storage file, called an HFile

Defined when table is created

Should not be changed too often

The number of column families should be reasonable [WHY?]

Column family names must be composed of printable characters
References to columns

A column “name” is called the qualifier, and can be any arbitrary
array of bytes

Reference: family:qualifier
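For illustration, a minimal Java sketch of addressing a single cell through a family:qualifier reference (names such as cf1 and qual1 are placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnReference {
    // Reads the cell referenced as "cf1:qual1" from the given row.
    static byte[] readCell(Table table, String rowKey) throws IOException {
        Get get = new Get(Bytes.toBytes(rowKey));
        get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("qual1"));
        Result result = table.get(get);
        return result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("qual1"));
    }
}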
A note on the NULL value

In an RDBMS, NULL cells need to be set and occupy space

In HBase, NULL cells or columns are simply not stored
A cell

Every column value, or cell, is timestamped (implicitly or explicitly)

This can be used to save multiple versions of a value that changes
over time

Versions are stored in decreasing timestamp order, most recent first

Cell versions can be constrained by predicate deletions

e.g., keep only values from the last week
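Versioning and time-based retention are configured per column family; a sketch using the HColumnDescriptor API (family and table names are illustrative; newer HBase versions use ColumnFamilyDescriptorBuilder instead):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

public class FamilyConfig {
    static HTableDescriptor describeTable() {
        HColumnDescriptor cf = new HColumnDescriptor("daily");
        cf.setMaxVersions(3);              // keep at most 3 versions per cell
        cf.setTimeToLive(7 * 24 * 3600);   // drop values older than one week
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("shorturl"));
        table.addFamily(cf);
        return table;
    }
}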
Access to data

(Table, RowKey, Family, Column, Timestamp) → Value

SortedMap<RowKey, List<SortedMap<Column, List<Value, Timestamp>>>>

Row data access is atomic and includes any number of columns

There is no further guarantee or transactional feature spanning
multiple rows
→ HBase is strictly consistent
Which means:

The first SortedMap is the table, containing a List of column
families

The families contain another SortedMap, representing columns
and a List of value, timestamp tuples
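A toy rendering of that nested-map view in plain Java, modeling only the logical layout (types simplified to String and byte[]; this is not HBase's actual implementation):

import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

public class LogicalModel {
    // table: row key -> family -> qualifier -> (timestamp -> value),
    // with timestamps in reverse order so the newest version comes first.
    static final TreeMap<String, TreeMap<String, TreeMap<String,
            NavigableMap<Long, byte[]>>>> table = new TreeMap<>();

    static void put(String row, String family, String qual, long ts, byte[] value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .computeIfAbsent(qual,
                 q -> new TreeMap<Long, byte[]>(Collections.reverseOrder()))
             .put(ts, value);
    }

    // Newest version of a cell: the first entry of the reversed timestamp map.
    static byte[] getNewest(String row, String family, String qual) {
        return table.get(row).get(family).get(qual).firstEntry().getValue();
    }
}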
Automatic Sharding
Region

This is the basic unit of scalability and load balancing

Regions are contiguous ranges of rows stored together → they are
the equivalent of range partitions in a sharded RDBMS

Regions are dynamically split by the system when they become too
large

Regions can also be merged to reduce the number of storage files
Regions in practice

Initially, there is one region

System monitors region size: if a threshold is attained, SPLIT

Regions are split in two at the middle key

This creates two regions of roughly equal size
Region Servers

Each region is served by exactly one Region Server

Region servers can serve multiple regions

The number of regions per server, and their sizes, depend on the
capability of a single region server
Server failures

Regions allow for fast recovery upon failure

Fine-grained Load Balancing is also achieved using regions as they
can be easily moved across servers
Storage API
No support for SQL

CRUD operations using a standard API, available for many “clients”

Data access is not declarative but imperative
Scan API

Allows for fast iteration over ranges of rows

Allows limiting which columns, and how many, are returned

Allows controlling how many versions of each cell are retrieved (see the sketch below)
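A sketch of such a range scan with the Java client API (row bounds and names are illustrative; some of these setters were renamed in later HBase versions):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScan {
    static void scanRange(Table table) throws IOException {
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("row-100"));  // inclusive
        scan.setStopRow(Bytes.toBytes("row-200"));   // exclusive
        scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("qual1")); // restrict columns
        scan.setMaxVersions(1);                      // restrict cell versions
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                System.out.println(Bytes.toString(result.getRow()));
            }
        }
    }
}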
Read-modify-write API

HBase supports single-row transactions

Atomic read-modify-write on data stored in a single row key
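For instance, a compare-and-set style update on a single row can be expressed with checkAndPut; the check and the write happen atomically on the region server. The column and values here are a hypothetical workflow flag, and newer HBase versions replace this call with checkAndMutate:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AtomicUpdate {
    // Sets cf1:state to "done" only if it currently reads "pending".
    static boolean markDone(Table table, String rowKey) throws IOException {
        Put put = new Put(Bytes.toBytes(rowKey));
        put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("state"),
                      Bytes.toBytes("done"));
        return table.checkAndPut(
                Bytes.toBytes(rowKey),
                Bytes.toBytes("cf1"), Bytes.toBytes("state"),
                Bytes.toBytes("pending"),  // expected current value
                put);
    }
}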
Counters

Values can be interpreted as counters and updated atomically

Can be read and modified in one operation
→ Implement global, strictly consistent, sequential counters
Coprocessors

These are equivalent to stored procedures in RDBMS

Allow pushing user code into the address space of the server

Access to server-local data

Implement lightweight batch jobs, data pre-processing, data
summarization
HBase implementation
Data Storage

Store files are called HFiles

Persistent and ordered immutable maps from key to value

Internally implemented as sequences of blocks with an index at the
end

Index is loaded when the HFile is opened and kept in memory
Data lookups

Since HFiles have a block index, lookup can be done with a single
disk seek

First, the block possibly containing a given lookup key is determined
with a binary search in the in-memory index

Then a block read is performed to find the actual key
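A simplified model of this two-step lookup in plain Java, assuming the in-memory index is just the sorted list of each block's first key (illustrative, not HBase's actual code):

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BlockIndexLookup {
    // In-memory index: the first key of each on-disk block, in sorted order.
    static final List<String> firstKeys = Arrays.asList("a", "f", "m", "t");

    // Step 1: binary search the index for the candidate block.
    static int findBlock(String key) {
        int pos = Collections.binarySearch(firstKeys, key);
        if (pos >= 0) return pos;          // key is exactly a block's first key
        // Otherwise binarySearch returns -(insertionPoint) - 1; the candidate
        // block is the one that starts just before the insertion point.
        return Math.max(0, -pos - 2);
    }

    public static void main(String[] args) {
        // Step 2 (not shown): read block 1 from disk -- a single seek --
        // and scan it sequentially for the actual key "h".
        System.out.println(findBlock("h")); // prints 1 (block starting at "f")
    }
}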
Underlying file system

Many are supported; usually HBase is deployed on top of HDFS
WRITE operation

First, data is written to a commit log, called the WAL (write-ahead log)

Then data is moved into memory, in a structure called memstore

When the size of the memstore exceeds a given threshold, it is
flushed to an HFile on disk
How can HBase write, while serving READS and WRITES?

Rolling mechanism

new/empty slots in the memstore take the updates

old/full slots are flushed to disk

Note that data in memstore is sorted by keys, matching what
happens in the HFiles
Data Locality

Achieved by the system through lookups of server hostnames

Achieved through intelligent key design
Deleting data

Since HFiles are immutable, how can we delete data?

A delete marker (also known as a tombstone marker) is written to
indicate that a given key is deleted

During the read process, data marked as deleted is skipped

Compactions (see next slides) finalize the deletion process
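From the client API, a delete is just another mutation; a minimal sketch (placeholder names) whose effect is a tombstone marker, not an in-place removal:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteExample {
    static void deleteCell(Table table, String rowKey) throws IOException {
        Delete delete = new Delete(Bytes.toBytes(rowKey));
        // Restrict the marker to one column; omit this to mask the whole row.
        delete.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("qual1"));
        table.delete(delete); // writes a tombstone; compaction drops the data later
    }
}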
READ operation

Merge of what is stored in the memstores (data that is not on disk)
and in the HFiles

The WAL is never used in the READ operation

Several API calls exist to read and scan data
Compactions

Flushing data from memstores to disk implies the creation of new
HFiles each time
→ We end up with many (possibly small) files
→ We need to do housekeeping [WHY?]
Minor Compaction

Rewrites small HFiles into fewer, larger HFiles

This is done using an n-way merge (essentially the merge step of MergeSort)
Major Compaction

Rewrites all files within a column family or a region into a new one

Drop deleted data

Perform predicate deletion (e.g., delete old data)
HBase: a glance at the architecture
Every region server creates its own ephemeral node in ZooKeeper, which the master uses to discover available servers and to track server failures or network partitions. HBase also uses ZooKeeper to ensure that there is only one master running, and to store the bootstrap location for region discovery: ZooKeeper is a critical component, and without it HBase is not operational.

Figure: HBase using its own components while leveraging existing systems (HDFS, ZooKeeper)
Master node: HMaster

Assigns regions to region servers using ZooKeeper

Handles load balancing

Not part of the data path

Holds metadata and schema
Region Servers

Handle READs and WRITEs

Handle region splitting
Architecture
Seek vs. Transfer
Fundamental difference between RDBMS and alternatives

B+Trees

Log-Structured Merge Trees
Seek vs. Transfer

Random access to individual cells

Sequential access to data
B+ Trees
Dynamic, multi-level indexes

Efficient insertion, lookup and deletion

Q: What’s the difference between a B+ Tree and a Hash Table?

Frequent updates may imbalance the trees → tree optimization
and re-organization is required (a costly operation)
Bounds on page size

Number of keys in each branch

Larger fanout compared to binary trees

Lower number of I/O operations to find a specific key
Support for range scans

Leaves are linked and represent an in-order list of all keys

No costly tree-traversal algorithms required
LSM-Trees
Data flow

Incoming data is first stored in a logfile, sequentially

Once the log has the modification saved, data is pushed into memory

In-memory store holds most recent updates for fast lookup

When memory is “full”, data is flushed to a store file on disk, as a
sorted list of key → record pairs

At this point, the log file can be thrown away
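The flow above can be made concrete with a toy LSM-style store in plain Java (log append, sorted in-memory buffer, flush to a sorted file; recovery and merging are omitted, and all names are made up):

import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

public class ToyLsm {
    private static final int FLUSH_THRESHOLD = 1000;
    private final TreeMap<String, String> memstore = new TreeMap<>(); // sorted by key
    private final FileWriter log;
    private int storeFiles = 0;

    ToyLsm() throws IOException {
        log = new FileWriter("toy.log", true); // sequential write-ahead log
    }

    void put(String key, String value) throws IOException {
        log.write(key + "\t" + value + "\n"); // 1. append the edit to the log
        log.flush();
        memstore.put(key, value);             // 2. update the in-memory store
        if (memstore.size() >= FLUSH_THRESHOLD) flush();
    }

    private void flush() throws IOException { // 3. persist a sorted store file
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : memstore.entrySet())
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        Files.writeString(Path.of("store-" + (storeFiles++) + ".dat"), sb.toString());
        memstore.clear();                     // 4. the log could now be truncated
    }
}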
How store files are arranged

Similar in idea to a B+ Tree, but optimized for sequential disk access

All nodes of the tree try to be filled up completely

Updates are done in a rolling merge fashion

The system packs existing on-disk multi-page blocks with in-memory
data until the block reaches full capacity
Clean-up process

As flushes take place over time, a lot of store files are created

Background process aggregates files into larger ones to limit disk
seeks

All store files are always sorted by key → no re-ordering is required to
fit new keys in
Data Lookup

Lookups are done in a merging fashion

First, look up in the in-memory store

On a miss, look up in the on-disk store
Deleting data

Use a delete marker

When pages are re-written, delete markers and the keys they cover are
eventually dropped

Predicate deletion happens here
B+ Tree vs. LSM-Trees
B+ Tree [1]

Work well when there are not so many updates

The more, and the faster, you insert data at random locations, the
faster pages get fragmented

Updates and deletes are done at disk seek rates, rather than
transfer rates
LSM-Tree [7]

Work at disk transfer rate and scale better to huge amounts of data

Guarantee a consistent insert rate

They transform random writes into sequential writes

Reads are independent of writes

Optimized data layout which offers predictable boundaries on disk
seeks
Storage
Overview
Figure: Overview of how HBase handles files in the filesystem
HBase handles two kinds of files

One is used for the WAL

One is used for the actual data storage
Who does what

HMaster

Low-level operations

Assigns regions of the key space to region servers

Keep metadata

Talk to ZooKeeper

HRegionServer

Handles the WAL and HFiles

These files are divided into blocks and stored in HDFS

Block size is a parameter
General communication flow

A client contacts ZooKeeper when trying to access a particular row

Recovers from ZooKeeper the server name that hosts the -ROOT-
region

Using the -ROOT- information, the client retrieves the server name
that hosts the .META. table region

The .META. table region contains the row key in question

Contact the reported .META. server and retrieve the server name
hosting the region that contains the row key in question
Caching

Generally, lookup procedures involve caching row key locations for
faster subsequent lookups
Important Java Classes

HRegionServer handles one or more regions and creates the
corresponding HRegion objects

When an HRegion object is opened, it creates a Store instance for
each HColumnFamily

Each Store instance can have:

One or more StoreFile instances

A MemStore instance

HRegionServer has a shared HLog instance
Write Path
An external client inserts data in HBase

Issues an HTable.put(Put) request to HRegionServer

HRegionServer hands the request to the HRegion instance that
matches the request [Q: what is the matching criterion?]
How the system reacts to a write request

Write data to the WAL, represented by the HLog class

The WAL stores HLogKey instances in an HDFS SequenceFile

These keys contain a sequence number and the actual data

In case of failure, this data can be used to replay not-yet-persisted
data

Copy data into the MemStore

Check if MemStore size has reached a threshold

If yes, launch a flush request

Launch a thread in the HRegionServer and flush MemStore data to
an HFile
HBase Files
What and where are HBase files (including WAL, HFile,...)
stored?

HBase has a root directory set to “/hbase” in HDFS

Files can be divided into:

Those that reside under the HBase root directory

Those that are in the per-table directories
/hbase
    .logs
    .oldlogs
    hbase.id
    hbase.version
    /example-table
/example-table
    .tableinfo
    .tmp/
    “...Key1...”/ (region directory, named after a hash of the region name)
        .oldlogs/
        .regioninfo
        .tmp/
        colfam1/
            “....column-key1...” (the actual storage files, HFiles)
HBase: Root-level files
.logs directory

WAL files handled by HLog instances

Contains a subdir for each HRegionServer

Each subdir contains many HLog files

All regions from that HRegionServer share the same HLog files
.oldlogs directory

When data is persisted to disk (from Memstores) log files are
decommissioned to the .oldlogs dir
hbase.id and hbase.version

Represent the unique ID of the cluster and the file format version, respectively
HBase: Table-level files
Every table has its own directory

.tableinfo: stores the serialized HTableDescriptor

This includes the table and column family schema

.tmp directory

Contains temporary data
HBase: Region-level files
Inside each table dir, there is a separate dir for every region
in the table
The name of each of these dirs is the MD5 hash of the region name

Inside each region there is a directory for each column family

Each column family directory holds the actual data files, namely
HFiles

Their name is just an arbitrary random number
Each region directory also has a .regioninfo file

Contains the serialized information of the HRegionInfo instance
Split Files

Once the region needs to be split, a splits directory is created

This is used to stage two daughter regions

If the split is successful, the daughter regions are moved up to the table
directory
HBase: A note on region splits
Splits triggered by store file (region) size

Region is split in two

Region is closed to new requests

.META. is updated
Daughter regions initially reside on the same server

Both daughters are compacted

Parent is cleaned up

.META. is updated
Master schedules new regions to be moved off to other
servers
Pietro Michiardi (Eurecom) Tutorial: HBase 57 / 102
Architecture Storage
Storage
HBase: Compaction
Process that takes care of re-organizing store files

Essentially to conform to underlying filesystem requirements

A compaction check is performed when the memstore is flushed
Minor and Major compactions

Always from the oldest to the newest files

Avoid having all servers perform compactions concurrently
Minor compaction selection, in more detail:

Store files are monitored by a background thread; memstore flushes slowly build up an increasing number of on-disk files, and compactions combine them into a few, larger files until the largest exceeds the configured maximum store file size and triggers a region split. The number of files that triggers a minor compaction is set with hbase.hstore.compaction.min (previously hbase.hstore.compactionThreshold; default 3, minimum 2), and the maximum number of files per minor compaction with hbase.hstore.compaction.max (default 10). The list is further narrowed by hbase.hstore.compaction.min.size (set to the configured memstore flush size for the region) and hbase.hstore.compaction.max.size (default Long.MAX_VALUE): any file larger than the maximum compaction size is always excluded, while the minimum size acts as a threshold that includes all files under that limit, up to the allowed number of files per compaction. Finally, hbase.hstore.compaction.ratio (default 1.2, i.e., 120%) also selects files that are up to that size compared to the sum of the sizes of all newer files; files are always evaluated from oldest to newest, so older files are compacted first.
Figure: A set of store files showing the minimum compaction threshold
HFile format
Store files are implemented by the HFile class

Efficient data storage is the goal
HFiles consist of a variable number of blocks

Two fixed blocks: info and trailer

index block: records the offsets of the data and meta blocks

Block size: large → sequential access; small → random access
In contrast to minor compactions, major compactions compact all files into a single file. Which compaction type runs is determined automatically when the compaction check is executed: the check is triggered after a memstore flush, by the compact or major_compact shell commands (or the corresponding API calls), or by a background thread, the CompactionChecker (one instance per region server, paced by hbase.server.thread.wakefrequency times hbase.server.thread.wakefrequency.multiplier). A major compaction is due every hbase.hregion.majorcompaction (default 24 hours), spread out by hbase.hregion.majorcompaction.jitter (default 0.2, i.e., 20%) so that not all stores compact at the same time; otherwise a minor compaction is assumed. Minor compactions may be promoted to major compactions when they would include all store files and there are fewer than the configured maximum files per compaction.

The actual storage files are implemented by the HFile class, created specifically to store HBase's data efficiently. HFiles are based on Hadoop's TFile class and mimic the SSTable format used in Google's BigTable architecture; the previous use of Hadoop's MapFile class proved insufficient in terms of performance.
Figure: The HFile structure
HFile size and HDFS block size
HBase can use any underlying filesystem
In case HDFS is used

HDFS block size is generally 64MB

This is 1,024 times the default HFile block size (64 KB)
→ There is no correlation between HDFS block and HFile sizes
The KeyValue Format
Each KeyValue in the HFile is a low-level byte array

It allows for zero-copy access to the data
Format

A fixed-length preamble indicates the length of the key and the value

This is useful to offset into the array to get direct access to the value,
ignoring the key

Key format

Contains row key, column family name, column qualifier...

[TIP]: consider small keys to avoid overhead when storing small
data
Figure: The KeyValue Format
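A rough sketch of such a length-prefixed layout in Java, to make the "offset past the key" idea concrete. This is a simplification: the real KeyValue encoding carries more fields (row/family lengths, timestamp, type, and so on):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class KeyValueSketch {
    // Layout: [int keyLen][int valueLen][key bytes][value bytes]
    static byte[] encode(byte[] key, byte[] value) {
        ByteBuffer buf = ByteBuffer.allocate(8 + key.length + value.length);
        buf.putInt(key.length).putInt(value.length).put(key).put(value);
        return buf.array();
    }

    // The fixed-length preamble lets us jump straight to the value,
    // without parsing the key at all.
    static byte[] valueOf(byte[] kv) {
        ByteBuffer buf = ByteBuffer.wrap(kv);
        int keyLen = buf.getInt();
        byte[] value = new byte[buf.getInt()];
        buf.position(8 + keyLen); // skip the key entirely
        buf.get(value);
        return value;
    }

    public static void main(String[] args) {
        byte[] kv = encode("row-1/cf1:qual1".getBytes(StandardCharsets.UTF_8),
                           "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(valueOf(kv), StandardCharsets.UTF_8)); // hello
    }
}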
The Write-Ahead Log
Main tool to ensure resiliency to failures

Region servers keep data in-memory until enough is collected to
warrant a flush

What if the server crashes or power is lost?
WAL is a common approach to address fault-tolerance

Every data update is first written to a log

Log is persisted (and replicated, since it resides on HDFS)

Only once the log is written is the client notified of a successful
operation on the data
Figure: The write path of HBase
WAL records all changes to data

Can be replayed in case of server failure

If the write to the WAL fails, the whole operation has to fail
Write Path

Client modifies data (put(), delete(), increment())

Modifications are wrapped into a KeyValue object

Objects are batched to the corresponding HRegionServer

Objects are routed to the corresponding HRegion

Objects are written to the WAL and then into the MemStore
Read Path
HBase uses multiple store files per column family

These can be in memory and/or materialized on disk

Compactions and clean-up background processes take care of
store file maintenance

Store files are immutable, so deletion is handled in a special way
The anatomy of a get command

HBase uses a QueryMatcher in combination with a
ColumnTracker

First, an exclusion check is performed to skip store files (and,
eventually, data labelled with tombstones)

Scanning data is implemented by a RegionScanner class which
retrieves a StoreScanner

StoreScanner includes both the MemStore and HFiles

Reads and scans happen in the same order as the data is saved
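For illustration, a minimal get against a hypothetical table and column; the matching and scanning classes above do the work server-side:
Get get = new Get(Bytes.toBytes("row1"));
get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
Result result = table.get(get);   // served internally by a RegionScanner/StoreScanner
byte[] value = result.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));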
Pietro Michiardi (Eurecom) Tutorial: HBase 65 / 102
Architecture Region Lookups
Region Lookups
How does a client find the region server hosting a specific
row key range?

HBase uses two special catalog tables, -ROOT- and .META.

The -ROOT- table is used to refer to all regions in the .META. table
Three-level, B+ tree-like operation

Level 1: a node stored in ZooKeeper, containing the location
(region server) of the -ROOT- table

Level 2: Lookup in the -ROOT- table to find a matching meta region

Level 3: Retrieve the table region from the .META. table
Pietro Michiardi (Eurecom) Tutorial: HBase 66 / 102
Architecture Region Lookups
Region Lookups
Where to send requests when looking for a specific row key?

This information is cached, but the first time, when the cache is
stale, or when there is a miss (e.g., after a compaction), the following
procedure applies
Recursive discovery process

Ask the region server hosting the matching .META. region to retrieve
the row key address

If the information is invalid, it backs out: asks the -ROOT- table
where the relevant .META. region is

If this fails, ask ZooKeeper where the -ROOT- table is
Pietro Michiardi (Eurecom) Tutorial: HBase 67 / 102
Architecture Region Lookups
Region Lookups
Figure: Mapping of user table regions, starting with an empty cache and then performing three lookups (Figure 8-11)
Pietro Michiardi (Eurecom) Tutorial: HBase 68 / 102
Key Design
Key Design
Pietro Michiardi (Eurecom) Tutorial: HBase 69 / 102
Key Design Concepts
Concepts
HBase has two fundamental key structures

Row key

Column key
Both can be used to convey meaning

Because they store particularly meaningful data

Because their sorting order is important
Pietro Michiardi (Eurecom) Tutorial: HBase 70 / 102
Key Design Concepts
Concepts
Logical vs. on-disk layout of a table

Main unit of separation within a table is the column family

The actual columns (unlike in other column-oriented DBs) are
not used to separate data

Although cells are stored logically in a table format, rows are stored
as linear sets of the cells

Cells contain all the vital information inside them
HBase therefore has to also store the row key and column key with every cell so that it can retain this vital piece of information.
In addition, multiple versions of the same cell are stored as separate, consecutive cells, adding the required timestamp of when the cell was stored. The cells are sorted in descending order by that timestamp so that a reader of the data will see the newest value first, which is the canonical access pattern for the data.
The entire cell, with the added structural information, is called KeyValue in HBase terms. It has not just the column and actual value, but also the row key and timestamp, stored for every cell for which you have set a value. The KeyValues are sorted by row key first, and then by column key in case you have more than one cell per row in one column family.
The lower-right part of the figure shows the resultant layout of the logical table inside the physical storage files. The HBase API has various means of querying the stored data, with decreasing granularity from left to right: you can select rows by row keys and effectively reduce the amount of data that needs to be scanned when looking for a specific row, or a range of rows. Specifying the column family as part of the query can eliminate the need to search the separate storage files. If you only need the data of one family, it is highly recommended that you specify the family for your read operation.
Although the timestamp, or version, of a cell is farther to the right, it is another important selection criterion. The store files retain the timestamp range for all stored cells, so if you are asking for a cell that was changed in the past two hours, but a particular store file only has data that is four or more hours old, it can be skipped completely. See also "Read Path" on page 342 for details.
Figure 9-1. Rows stored as linear sets of actual cells, which contain all the vital information
Pietro Michiardi (Eurecom) Tutorial: HBase 71 / 102
Key Design Concepts
Concepts
Logical Layout (Top-Left)

Table consists of rows and columns

Columns are the combination of a column family name and a
column qualifier
→ <cf name: qualifier> is the column key

Rows have a row key to address all columns of a single logical row
Pietro Michiardi (Eurecom) Tutorial: HBase 72 / 102
Key Design Concepts
Concepts
Folding the Logical Layout (Top-Right)

The cells of each row are stored one after the other

Each column family is stored separately
→ On disk, all cells of one family reside in an individual StoreFile

HBase does not store unset cells
→ The row and column keys are required to address every cell
Pietro Michiardi (Eurecom) Tutorial: HBase 73 / 102
Key Design Concepts
Concepts
Versioning

Multiple versions of the same cell stored consecutively, together
with the timestamp

Cells are sorted in descending order of timestamp
→ Newest value first
KeyValue object

The entire cell, with all the structural information, is a KeyValue
object

Contains: row key, <column family: qualifier> →
column key, timestamp and value

Sorted by row key first, then by column key
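A small sketch of the version semantics (hypothetical table and column names): asking for several versions returns the matching KeyValues newest first:
Get get = new Get(Bytes.toBytes("row1"));
get.setMaxVersions(3);   // up to three versions per cell
Result result = table.get(get);
List<KeyValue> versions = result.getColumn(
    Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));   // sorted newest first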
Pietro Michiardi (Eurecom) Tutorial: HBase 74 / 102
Key Design Concepts
Concepts
Physical Layout (Lower-Right)

Select data by row key

This reduces the amount of data to scan for a row or a range of rows

Select data by row key and column key

This focuses the system on an individual storage file

Select data by column qualifier

Exact lookups, including filters to omit useless data
Pietro Michiardi (Eurecom) Tutorial: HBase 75 / 102
Key Design Concepts
Concepts
Summary of key lookup properties
The next level of query granularity is the column qualifier. You can employ exact column lookups when reading data, or define filters that can include or exclude the columns you need to access. But as you will have to look at each KeyValue to check if it should be included, there is only a minor performance gain.
The value remains the last, and broadest, selection criterion, equaling the column qualifier's effectiveness: you need to look at each cell to determine if it matches the read parameters. You can only use a filter to specify a matching rule, making it the least efficient query option. Figure 9-2 summarizes the effects of using the KeyValue fields.
Figure 9-2. Retrieval performance decreasing from left to right
The crucial part Figure 9-1 shows is the shift in the lower left-hand side. Since the effectiveness of selection criteria greatly diminishes from left to right for a KeyValue, you can move all, or partial, details of the value into a more significant place, without changing how much data is stored.
Tall-Narrow Versus Flat-Wide Tables
At this time, you may be asking yourself where and how you should store your data. The two choices are tall-narrow and flat-wide. The former is a table with few columns but many rows, while the latter has fewer rows but many columns. Given the explained query granularity of the KeyValue information, it seems to be advisable to store parts of the cell's data, especially the parts needed to query it, in the row key, as it has the highest cardinality.
In addition, HBase can only split at row boundaries, which also enforces the recommendation to go with tall-narrow tables. Imagine you have all emails of a user in a single row. This will work for the majority of users, but there will be outliers that will have magnitudes more emails in their inbox, so many, in fact, that a single row could outgrow the maximum file/region size and work against the region split facility.
Pietro Michiardi (Eurecom) Tutorial: HBase 76 / 102
Key Design Tall-Narrow vs. Flat-Wide
Tall-Narrow vs. Flat-Wide Tables
Tall-Narrow Tables

Few columns

Many rows
Flat-Wide Tables

Many columns

Few rows
Given the query granularity explained before
→ Store parts of the cell data in the row key

Furthermore, HBase splits at row boundaries
→ It is recommended to go for Tall-Narrow Tables
Pietro Michiardi (Eurecom) Tutorial: HBase 77 / 102
Key Design Tall-Narrow vs. Flat-Wide
Tall-Narrow vs. Flat-Wide Tables
Example: email data - version 1

You have all emails of a user in a single row (e.g. userID is the
row key)

There will be some outliers with orders of magnitude more emails
than others
→ A single row could outgrow the maximum file/region size and work
against the split facility
Example: email data - version 2

Each email of a user is stored in a separate row (e.g.
userID:messageID is the row key)

On disk this makes no difference (see the disk layout figure)

If the messageID is in the column qualifier or the row key, each cell
still contains a single email message
→ The table can be split easily and the query granularity is more
fine-grained
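A sketch of building the composite row key for version 2 (userId and messageId are hypothetical string values):
byte[] rowkey = Bytes.add(Bytes.toBytes(userId),
    Bytes.toBytes(":"), Bytes.toBytes(messageId));   // <userID>:<messageID>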
Pietro Michiardi (Eurecom) Tutorial: HBase 78 / 102
Key Design Partial Key Scans
Partial Key Scans
Partial Key Scans reinforce the concept of Tall-Narrow Tables

From the email example: assume you have a separate row per
message, across all users

If you don’t have an exact combination of user and message ID you
cannot access a particular message
Partial key scans solve the problem

Specify a start and end key

The start key is set to the exact userID only, with the end key set
at userID+1
→ This triggers the internal lexicographic comparison mechanism

Since the table does not have an exact match, it positions the scan at:
<userID>:<lowest-messageID>

The scan will then iterate over all the messages of an exact user,
parse the row key and get the messageID
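A minimal sketch of such a partial key scan, assuming string row keys of the form <userID>:<messageID> (the IDs are hypothetical):
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("user1234"));   // exact userID only
scan.setStopRow(Bytes.toBytes("user1235"));    // "userID + 1", exclusive
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
  String rowKey = Bytes.toString(res.getRow());
  String messageID = rowKey.substring(rowKey.indexOf(':') + 1);   // parse the messageID
  // process one message ...
}
scanner.close();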
Pietro Michiardi (Eurecom) Tutorial: HBase 79 / 102
Key Design Partial Key Scans
Partial Key Scans
Composite keys and atomicity

Following the email example: a single user inbox now spans many
rows

It is no longer possible to modify a single user inbox in one atomic
operation
Whether this is acceptable depends on the application at hand
Pietro Michiardi (Eurecom) Tutorial: HBase 80 / 102
Key Design Time Series Data
Time Series Data
Stream processing of events

E.g. data coming from a sensor, stock exchange, monitoring
system ...

Such data is a time series → The row key represents the event time
→ HBase will store all rows sorted in a distinct range, namely regions
with specific start and stop keys
Sequential, monotonically increasing nature of time series data

All incoming data is written to the same region (and hence the
same server)
→ Regions become HOT!

Performance of the whole cluster is bound to that of a single
machine
Pietro Michiardi (Eurecom) Tutorial: HBase 81 / 102
Key Design Time Series Data
Time Series Data
Solution to achieve load balancing: Salting

We want data to be spread over all region servers

This can be done, e.g., by prefixing the row key with a
non-sequential number
Salting example
byte prefix = (byte) (Long.valueOf(timestamp).hashCode() % numRegionServers);   // numRegionServers: placeholder for the number of region servers
byte[] rowkey = Bytes.add(new byte[] { prefix }, Bytes.toBytes(timestamp));
- Data access needs to be fanned out across many servers
+ Use multiple threads to read for I/O performance: e.g. use the
Map phase of MapReduce
Pietro Michiardi (Eurecom) Tutorial: HBase 82 / 102
Key Design Time Series Data
Time Series Data
Solution to achieve load balancing: Field swap/promotion

Move the timestamp field of the row key, or prefix it with another
field

If you already have a composite row key, simply swap elements

Otherwise if you only have the timestamp, you need to promote
another field

The sequential, monotonically increasing timestamp is moved to a
secondary position in the row key
- You can only access data (especially time ranges) for a given
swapped or promoted field (but this could be a feature)
+ You achieve load balancing
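A sketch of a promoted-field key, where a hypothetical sensorId is moved in front of the timestamp:
byte[] rowkey = Bytes.add(Bytes.toBytes(sensorId), Bytes.toBytes(timestamp));   // sensorId first, timestamp second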
Pietro Michiardi (Eurecom) Tutorial: HBase 83 / 102
Key Design Time Series Data
Time Series Data
Solution to achieve load balancing: Randomization

byte[] rowkey = MD5(timestamp)

This gives you a random distribution of the row key across all
available region servers
- Less than ideal for range scans
+ Since you can re-hash the timestamp, this solution is good for
random access
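A sketch of the randomized key using the JDK's MessageDigest (from java.security):
MessageDigest md5 = MessageDigest.getInstance("MD5");
byte[] rowkey = md5.digest(Bytes.toBytes(timestamp));   // random distribution; range scans are lost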
Pietro Michiardi (Eurecom) Tutorial: HBase 84 / 102
Key Design Time Series Data
Time Series Data
Summary
Using the salted or promoted field keys can strike a good balance of distribution for write performance, and sequential subsets of keys for read performance. If you are only doing random reads, it makes most sense to use random keys: this will avoid creating region hot-spots.
Time-Ordered Relations
In our preceding discussion, the time series data dealt with inserting new events as separate rows. However, you can also store related, time-ordered data using the columns of a table. Since all of the columns are sorted per column family, you can treat this sorting as a replacement for a secondary index, as available in RDBMSes. Multiple secondary indexes can be emulated by using multiple column families, although that is not the recommended way of designing a schema. But for a small number of indexes, this might be what you need.
Consider the earlier example of the user inbox, which stores all of the emails of a user in a single row. Since you want to display the emails in the order they were received, but, for example, also sorted by subject, you can make use of column-based sorting to achieve the different views of the user inbox.
Given the advice to keep the number of column families in a table low, especially when mixing large families with small ones (in terms of stored data), you could store the inbox inside one table, and the secondary indexes in another table. The drawback is that you cannot make use of the provided per-table row-level atomicity. Also see "Secondary Indexes" on page 370 for strategies to overcome this limitation.
The first decision to make concerns what the primary sorting order is, in other words, how the majority of users have set the view of their inbox. Assuming they have set the
Figure 9-3. Finding the right balance between sequential read and write performance
Pietro Michiardi (Eurecom) Tutorial: HBase 85 / 102
MapReduce Integration
MapReduce Integration
Pietro Michiardi (Eurecom) Tutorial: HBase 86 / 102
MapReduce Integration Recap
Introduction
In the following we review the main classes involved in
reading and writing data from/to an underlying data store
For MapReduce to work with HBase, some more practical
issues have to be addressed

E.g.: creating an appropriate JAR file inclusive of all required
libraries
Refer to [5], Chapter 7 for an in-depth treatment of this subject
Pietro Michiardi (Eurecom) Tutorial: HBase 87 / 102
MapReduce Integration Recap
Main classes involved in MapReduce
Classes
Figure 7-1 also shows you the classes that are involved in the Hadoop implementation of MapReduce. Let us look at them and also at the specific implementations that HBase provides on top of them.
Hadoop version 0.20.0 introduced a new MapReduce API. Its classes are located in the package named mapreduce, while the existing classes for the previous API are located in mapred. The older API was deprecated and should have been dropped in version 0.21.0, but that did not happen. In fact, the old API was undeprecated, since the adoption of the new one was hindered by its incompleteness.
HBase also has these two packages, which only differ slightly. The new API has more support by the community, and writing jobs against it is not impacted by the Hadoop changes. This chapter will only refer to the new API.
InputFormat
The first class to deal with is the InputFormat class (Figure 7-2). It is responsible for two things. First it splits the input data, and then it returns a RecordReader instance that defines the classes of the key and value objects, and provides a next() method that is used to iterate over each input record.
Figure 7-1. The MapReduce process
Figure: Main MapReduce Classes
Pietro Michiardi (Eurecom) Tutorial: HBase 88 / 102
MapReduce Integration Recap
Main classes involved in MapReduce
InputFormat
It is responsible for two things

Splits input data

Returns a RecordReader instance

Defines a key and a value object

Provides a next() method to iterate over input records
As far as HBase is concerned, there is a special implementation called TableInputFormatBase, whose subclass is TableInputFormat. The former implements the majority of the functionality but remains abstract. The subclass is a lightweight concrete version of TableInputFormat and is used by many supplied samples and real MapReduce classes.
These classes implement the full turnkey solution to scan an HBase table. You have to provide a Scan instance that you can prepare in any way you want: specify start and stop keys, add filters, specify the number of versions, and so on. The TableInputFormat splits the table into proper blocks for you and hands them over to the subsequent classes in the MapReduce process. See "Table Splits" on page 294 for details on how the table is split.
Mapper
The Mapper class(es) is for the next stage of the MapReduce process and one of its namesakes (Figure 7-3). In this step, each record read using the RecordReader is processed using the map() method. Figure 7-1 also shows that the Mapper reads a specific type of key/value pair, but emits possibly another type. This is handy for converting the raw data into something more useful for further processing.
Figure 7-3. The Mapper hierarchy
HBase provides the TableMapper class that enforces key class 1 to be an ImmutableBytesWritable, and value class 1 to be a Result type, since that is what the TableRecordReader is returning.
One specific implementation of the TableMapper is the IdentityTableMapper, which is also a good example of how to add your own functionality to the supplied classes. The TableMapper class itself does not implement anything but only adds the signatures of
Figure 7-2. The InputFormat hierarchy
Figure: InputFormat hierarchy
Pietro Michiardi (Eurecom) Tutorial: HBase 89 / 102
MapReduce Integration Recap
Main classes involved in MapReduce
InputFormat → TableInputFormatBase
Implements a full turnkey solution to scan an HBase table

Splits the table into proper blocks and hands them over to the
MapReduce process
Must supply a Scan instance to interact with a table

Specify start and stop keys for the scan

Add filters (optional)

Specify the number of versions
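A sketch of setting this up with the supplied TableMapReduceUtil helper (the table name, keys, and mapper class are hypothetical; conf is an HBase-aware Hadoop Configuration):
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("startKey"));
scan.setStopRow(Bytes.toBytes("stopKey"));
scan.setMaxVersions(1);
Job job = new Job(conf, "scan-testtable");
TableMapReduceUtil.initTableMapperJob("testtable", scan,
    MyMapper.class, ImmutableBytesWritable.class, Result.class, job);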
Pietro Michiardi (Eurecom) Tutorial: HBase 90 / 102
MapReduce Integration Recap
Main classes involved in MapReduce
Mapper
Each record read using the RecordReader is processed
using the map() method
The Mapper reads specific types of input key/value pairs, but
may emit another type
Figure: The Mapper hierarchy
Pietro Michiardi (Eurecom) Tutorial: HBase 91 / 102
MapReduce Integration Recap
Main classes involved in MapReduce
Mapper → TableMapper
TableMapper class enforces:

The input key to the mapper to be an ImmutableBytesWritable
type

The input value to be a Result type
A handy implementation is the IdentityTableMapper

This is the equivalent of an identity mapper
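A minimal custom TableMapper sketch (a hypothetical row-counting mapper); note the fixed input types:
static class RowCountMapper
    extends TableMapper<ImmutableBytesWritable, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  @Override
  protected void map(ImmutableBytesWritable rowKey, Result columns,
      Context context) throws IOException, InterruptedException {
    context.write(rowKey, ONE);   // input key/value types are enforced by TableMapper
  }
}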
Pietro Michiardi (Eurecom) Tutorial: HBase 92 / 102
MapReduce Integration Recap
Main classes involved in MapReduce
OutputFormat
Used to persist data

Output written to files

Output written to HBase tables

This is done using a TableRecordWriter
the actual key/value pair classes. The IdentityTableMapper is simply passing on the keys/values to the next stage of processing.
Reducer
The Reducer stage and class hierarchy (Figure 7-4) is very similar to the Mapper stage. This time we get the output of a Mapper class and process it after the data has been shuffled and sorted.
In the implicit shuffle between the Mapper and Reducer stages, the intermediate data is copied from different Map servers to the Reduce servers, and the sort combines the shuffled (copied) data so that the Reducer sees the intermediate data as a nicely sorted set, where each unique key is now associated with all of the possible values it was found with.
Figure 7-4. The Reducer hierarchy
OutputFormat
The final stage is the OutputFormat class (Figure 7-5), and its job is to persist the data in various locations. There are specific implementations that allow output to files, or to HBase tables in the case of the TableOutputFormat class. It uses a TableRecordWriter to write the data into the specific HBase output table.
Figure 7-5. The OutputFormat hierarchy
It is important to note the cardinality as well. Although many Mappers are handing records to many Reducers, only one OutputFormat takes each output record from its Reducer subsequently. It is the final class that handles the key/value pairs and writes them to their final destination, this being a file or a table.
Figure: The OutputFormat hierarchy
Pietro Michiardi (Eurecom) Tutorial: HBase 93 / 102
MapReduce Integration Recap
Main classes involved in MapReduce
OutputFormat → TableOutputFormat
This is the class that handles the key/value pairs and writes
them to their final destination

A single instance takes each output record from its Reducer in
turn
Details

Must specify the table name when the MR job is created

Handles buffer flushing implicitly (autoflush option is set to false)
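A sketch of directing the output to a hypothetical table; the helper sets TableOutputFormat and the table name on the job:
TableMapReduceUtil.initTableReducerJob("outputtable",
    IdentityTableReducer.class, job);   // identity reducer writes the mapper output as-is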
Pietro Michiardi (Eurecom) Tutorial: HBase 94 / 102
MapReduce Integration Recap
MapReduce Locality
How does the system make sure data is placed close to
where it is needed?

This is done implicitly by MapReduce when using HDFS

When MapReduce uses HBase things are a bit different
How HBase handles data locality

Shared vs. non-shared cluster

HBase stores its files on HDFS (HFiles and WAL)

HBase servers are not restarted frequently and they perform
compactions regularly
→ HDFS is smart enough to ensure data locality

There is a block placement policy that enforces local writes

The data node compares the server name of the writer with its own

If they match, the block is written to the local filesystem

Just be careful about region movements during load balancing or
server failures
Pietro Michiardi (Eurecom) Tutorial: HBase 95 / 102
MapReduce Integration Recap
Table Splits
When running a MapReduce job that reads from an HBase
table you use the TableInputFormat

Overrides getSplits() and createRecordReader()
Before a job is run, the framework calls getSplits() to
determine how the data is to be separated into chunks

TableInputFormat, given the Scan instance you define, divides
the table at region boundaries
→ The number of input splits is equal to all regions between the start
and stop keys
Pietro Michiardi (Eurecom) Tutorial: HBase 96 / 102
MapReduce Integration Recap
Table Splits
When a job starts, the framework calls
createRecordReader() for each input split

It iterates over the splits and creates a new TableRecordReader
with the current split

Each TableRecordReader handles exactly one region, reading
and mapping every row from the region’s start and end keys
Data locality

Each split contains the server name hosting the region

The framework checks the server name, and if a TaskTracker is
running on the same machine, it will run the task on that server

The RegionServer is colocated with the HDFS DataNode, hence
data is read from the local filesystem
TIP: Turn off speculative execution!
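One way to do so, using the classic Hadoop configuration keys (a sketch; set before submitting the job):
job.getConfiguration().setBoolean("mapred.map.tasks.speculative.execution", false);
job.getConfiguration().setBoolean("mapred.reduce.tasks.speculative.execution", false);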
Pietro Michiardi (Eurecom) Tutorial: HBase 97 / 102
References
References I
[1] B+ tree.
http://en.wikipedia.org/wiki/B%2B_tree.
[2] Eric Brewer.
Lessons from giant-scale services.
In IEEE Internet Computing, 2001.
[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh,
Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew
Fikes, and Robert E. Gruber.
Bigtable: A distributed storage system for structured data.
In Proc. of USENIX OSDI, 2006.
[4] Jeffrey Dean and Sanjay Ghemawat.
MapReduce: Simplified data processing on large clusters.
In Proc. of ACM OSDI, 2004.
Pietro Michiardi (Eurecom) Tutorial: HBase 98 / 102
References
References II
[5] Lars George.
HBase, The Definitive Guide.
O’Reilly, 2011.
[6] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung.
The Google file system.
In Proc. of ACM OSDI, 2003.
[7] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth
O’Neil.
The log-structured merge-tree (LSM-tree).
Acta Informatica, 1996.
Pietro Michiardi (Eurecom) Tutorial: HBase 99 / 102
References
References III
[8] D. Salmen.
Cloud data structure diagramming techniques and design patterns.
https://www.data-tactics-corp.com/index.php/component/jdownloads/finish/22-white-papers/68-cloud-data-structure-diagramming, 2009.
Pietro Michiardi (Eurecom) Tutorial: HBase 100 / 102