
Course Topics

Week 1 – Introduction to HDFS
Week 2 – Setting Up Hadoop Cluster
Week 3 – Map-Reduce Basics, Types and Formats
Week 4 – PIG
Week 5 – HIVE
Week 6 – HBASE
Week 7 – ZOOKEEPER
Week 8 – SQOOP
What Are We Going to Learn Today?

• Problems in the real world
• Traditional RDBMS shortcomings
• The advent of HBase
• HBase architecture
• Hands-on creation and updating of HBase tables in the shell
• Multiple ways of loading data into HBase (shell, Java client, MapReduce, Avro, Thrift, REST API)
Problems in the Real World

• LinkedIn
• Revolutionizing education
• Ad targeting
So, what is common?

• Huge data
• Fast random access
• Structured data
• Variable schema
• Need for compression
• Need for distribution (sharding)
How Would a Traditional RDBMS Solve It?

Users           Follower
-----           --------
Id              User_id
Name            Follower_id
Sex             Type
Age
Contd.

Users           Connections
-----           -----------
Id              User_id
Name            Connection_id
Sex             Type
Age

Each new relationship type needs its own join table, and every lookup fans out into joins across them.
Characteristics of a Probable Solution

• Distributed database
• Sorted data
• Sparse data store
• Automatic sharding
History of HBase

2006 – Google publishes the BigTable paper
2006 – HBase development starts
2008 – Microsoft buys Powerset
2010 – Facebook adopts HBase for its messaging system

Facebook Messaging System

• Facebook monitored their usage and figured out what they really needed.

• What they needed was a system that could handle two types of data patterns:
  – A short set of temporal data that tends to be volatile
  – An ever-growing set of data that rarely gets accessed

In short: real-time, distributed, linearly scalable, robust, Big Data scale, open-source, key-value, column-oriented.

HBase Definition

HBase is a key/value store. Specifically, it is a sparse, consistent, distributed, multidimensional, sorted map.
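
To make that definition concrete, here is a purely conceptual sketch (plain Java collections, not the HBase API) of a table as nested sorted maps; the sample row and column are borrowed from the webtable example used later in this deck.

import java.util.NavigableMap;
import java.util.TreeMap;

public class HBaseAsSortedMap {
    public static void main(String[] args) {
        // An HBase table behaves like: rowKey -> family -> qualifier -> timestamp -> value.
        // TreeMap keeps every level sorted ("sorted"); absent cells simply have no
        // entry ("sparse"); the nesting is the "multidimensional" part.
        NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> table = new TreeMap<>();

        table.computeIfAbsent("com.apache.www", k -> new TreeMap<>())
             .computeIfAbsent("anchor", k -> new TreeMap<>())
             .computeIfAbsent("apache.com", k -> new TreeMap<>())
             .put(10L, "APACHE");

        System.out.println(table); // {com.apache.www={anchor={apache.com={10=APACHE}}}}
    }
}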
More HBase Implementations

Quotes from the Apache "Powered By HBase" page:

• "…uses HBase to power their Messages infrastructure." (see http://sites.computer.org/debull/A12june/facebook.pdf)
• "A number of applications, including people search, rely on HBase internally for data generation."
• "…uses HBase to store document fingerprints for detecting near-duplications. We have a cluster of a few nodes that runs HDFS, MapReduce, and HBase."
• "We use HBase as a real-time data storage and analytics platform."
• "…uses an HBase cluster containing over a billion anonymized clinical records."
• "…uses HBase as a foundation for cloud-scale storage for a variety of applications."

Referred: http://wiki.apache.org/hadoop/Hbase/PoweredBy
Data Model
Versions of Data

Row key: the Persons ID. Column family "personal_data" holds Name and Address; column family "demographic" holds Birth Date and Gender.

Persons ID     Name     Address      Birth Date   Gender
1              Harry    BTM Layout   1988-10-31   M
2              Dhawan                1956-09-16   M
3              Sana     Whitefield   1989-12-03   F
…              …        …            …            …
500,000,000    Vineet   Delhi        1964-01-07   M

Note that row 2 has no Address cell at all: missing cells are simply not stored.
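
As a hands-on aside, a table like this could be created with the HBase Java client; a minimal sketch assuming the HBase 2.x API and a running cluster, with the table and family names taken from the slide.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreatePersonsTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Two column families, as in the slide's data model.
            admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("Persons"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal_data"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("demographic"))
                    .build());
        }
    }
}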
Physical Storage

[Figure: on disk, cells are grouped by column family rather than by row. For row 1, Family1 (personal data) holds Name -> "H. Houdini" and Address -> "Budapest", while Family2 (demographic) holds Birth Date -> "1926-10-31" and Gender -> "M". Rows 2 and 3 store only the cells that actually have values.]

What Does It Look Like, and What Does It Mean?

Row Key
• Unique for each row; identifies the row

Column Family and Column Qualifier
• Fewer families gives faster access
• Families are fixed at table creation; column qualifiers are not

Values
• Multiple versions of each value are maintained
• A scan shows only the most recent version by default
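
To see older versions you must ask for them explicitly. A minimal sketch with the HBase 2.x Java client (older clients use Get.setMaxVersions instead), reusing the hypothetical Persons table from above:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadVersions {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("Persons"))) {
            Get get = new Get(Bytes.toBytes("1"));
            get.readVersions(3); // ask for up to 3 versions; the default is newest only
            Result result = table.get(get);
            for (Cell cell : result.rawCells()) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}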
Three Major Components
Data Distribution

[Figure: the logical view is all rows of a table in sorted row-key order (A1, A2, A22, A3, …, K4, …, 090, …, Z30, Z55). The table is split into regions, each covering a contiguous key range: Null -> A3, A3 -> F34, F34 -> K80, K80 -> 095, and 095 -> Null. The regions are spread across the region servers.]
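
For illustration, a client can ask which region (key range) holds a given row and which server serves it; a minimal sketch assuming the HBase 2.x client, where the table name "test" and the row key "A22" are illustrative.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class LocateRow {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("test"))) {
            // Which region holds row "A22", and which region server serves it?
            HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("A22"));
            System.out.println("Region: " + loc.getRegion().getRegionNameAsString());
            System.out.println("Server: " + loc.getServerName());
        }
    }
}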
HBase Components

[Figure: ZooKeeper tracks coordination state such as /hbase/region1, /hbase/region2, …; the Master manages the RegionServers; each RegionServer holds an in-memory MemStore and persists HFiles and a write-ahead log (WAL) to HDFS.]

HBase Components

• A table is made of regions

• Region: a range of rows stored together
  – A single shard, used for scaling
  – Split dynamically when too big; merged when too small

• Region servers serve one or more regions
  – A region is served by only one region server

• Master server: the daemon responsible for managing the HBase cluster

• HBase stores its data in HDFS
  – It relies on HDFS for high availability and fault tolerance
HBase Storage Architecture
HBase Storage, Simplified

[Figure: three management nodes each run ZooKeeper, hosting the HBase Master, the HDFS NameNode, and the Secondary NameNode respectively. Worker machines scale horizontally to N nodes, each running an HBase RegionServer alongside an HDFS DataNode.]
Different Types of Regions
Root/Meta Table

Each row in the ROOT and META tables is approximately 1 KB. At the default region size of 256 MB, one META region can therefore point to about 256 MB / 1 KB ≈ 262,144 user regions, and the single ROOT region pointing to META regions can address about 262,144² ≈ 6.9 × 10¹⁰ regions in total.
Compactions

HBase flushes the MemStore into many small HFiles; compactions merge them into fewer, larger files (a major compaction rewrites a store into a single file and drops deleted or expired cells). The store below holds the webtable-style sample data:

Row Key            Time Stamp   Column "contents:"   Column "anchor:"
"com.apache.www"   t12          "<html>…"
                   t11          "<html>…"
                   t10                               "anchor:apache.com" -> "APACHE"
"com.cnn.www"      t9                                "anchor:cnnsi.com" -> "CNN"
                   t8                                "anchor:my.look.ca" -> "CNN.com"
                   t6           "<html>…"
                   t5           "<html>…"
                   t3           "<html>…"

HStore1
Region Splits

When a region grows beyond the configured maximum size, it is split in two at a middle row key, and the daughter regions can be reassigned to other region servers:

Row Key            Time Stamp   Column "contents:"   Column "anchor:"
"com.apache.www"   t12          "<html>…"
                   t11          "<html>…"
                   t10                               "anchor:apache.com" -> "APACHE"
"com.cnn.www"      t9                                "anchor:cnnsi.com" -> "CNN"
                   t8                                "anchor:my.look.ca" -> "CNN.com"
                   t6           "<html>…"
                   t5           "<html>…"
                   t3           "<html>…"

HStore1
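
Splits and major compactions normally happen automatically, but both can also be requested through the Admin API; a hedged sketch assuming the HBase 2.x client (the table name "test" is illustrative).

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SplitAndCompact {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("test");
            admin.split(table);        // ask the servers to split the table's regions
            admin.majorCompact(table); // rewrite each store into a single HFile
        }
    }
}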
HBase Client API
Scanner and Filters
Search

Get value from table where key = 'com.apache.www' AND label = 'anchor:apache.com'

Row Key            Time Stamp   Column "anchor:"
"com.apache.www"   t12
                   t11
                   t10          "anchor:apache.com" -> "APACHE"
"com.cnn.www"      t9           "anchor:cnnsi.com" -> "CNN"
                   t8           "anchor:my.look.ca" -> "CNN.com"
                   t6
                   t5
                   t3
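
The same lookup expressed with the Java client; a minimal sketch that assumes the table is named "webtable" (the slides do not name it).

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("webtable"))) {
            // Fetch one row, restricted to a single family:qualifier.
            Get get = new Get(Bytes.toBytes("com.apache.www"));
            get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
            System.out.println(Bytes.toString(value)); // "APACHE"
        }
    }
}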
Search

Scanner: select value from table where anchor = 'cnnsi.com'

Row Key            Time Stamp   Column "anchor:"
"com.apache.www"   t12
                   t11
                   t10          "anchor:apache.com" -> "APACHE"
"com.cnn.www"      t9           "anchor:cnnsi.com" -> "CNN"
                   t8           "anchor:my.look.ca" -> "CNN.com"
                   t6
                   t5
                   t3
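
And the scan, again with the Java client (same hypothetical "webtable"): restricting the scan to the anchor:cnnsi.com column returns the matching cell from every row that has one.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("webtable"))) {
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));
            // Richer predicates are expressed with Filter subclasses,
            // e.g. SingleColumnValueFilter, set via scan.setFilter(...).
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()) + " -> "
                            + Bytes.toString(r.getValue(Bytes.toBytes("anchor"),
                                                        Bytes.toBytes("cnnsi.com"))));
                }
            }
        }
    }
}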
HBase API

• get(row)
• put(row, Map<column, value>)
• scan(key range, filter)
• increment(row, columns)
• checkAndPut, delete, etc.
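
Get and scan were sketched above; the mutating calls look like this. A minimal sketch with the HBase 2.x Java client, using the 'test' table and 'cf' family from the shell demo that follows (the "hits" counter column is illustrative).

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MutationExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("test"))) {
            // put(row, Map<column, value>): write one cell.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("value1"));
            table.put(put);

            // increment(row, columns): atomic server-side counter update.
            long n = table.incrementColumnValue(
                    Bytes.toBytes("row1"), Bytes.toBytes("cf"), Bytes.toBytes("hits"), 1L);
            System.out.println("hits = " + n);
        }
    }
}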
HBase Shell

hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds

hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds

hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds

hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
HBase Shell, Contd.

hbase(main):007:0> scan 'test'
ROW     COLUMN+CELL
 row1   column=cf:a, timestamp=1288380727188, value=value1
 row2   column=cf:b, timestamp=1288380738440, value=value2
 row3   column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds
Thank You
See You in Class Next Week
