
Course Topics

Week 1 – Introduction to HDFS
Week 2 – Setting Up Hadoop Cluster
Week 3 – Map-Reduce Basics, Types and Formats
Week 4 – PIG
Week 5 – HIVE
Week 6 – HBASE
Week 7 – ZOOKEEPER
Week 8 – SQOOP
What Are We Going to Learn Today?

• Problems in the real world
• Traditional RDBMS shortcomings
• The advent of HBase
• HBase architecture
• Hands-on creation and updating of HBase tables in the shell
• Multiple ways of loading data into HBase (shell, Java client, MapReduce, Avro, Thrift, REST API)
Problems in the Real World

• LinkedIn
• Revolutionizing education
• Ad targeting
So, what is common?

• Huge data
• Fast random access
• Structured data
• Variable schema
• Need for compression
• Need for distribution (sharding)
How Would a Traditional RDBMS Solve It?

Users           Follower
-----           --------
Id              User_id
Name            Follower_id
Sex             Type
Age
Contd.

Users           Connections
-----           -----------
Id              User_id
Name            Connection_id
Sex             Type
Age

Each new relationship type needs its own join table, and every lookup fans out into joins across them.
Characteristics of a Probable Solution

• Distributed database
• Sorted data
• Sparse data store
• Automatic sharding
History of HBase

2006 – Google publishes the BigTable paper
2006 – HBase development starts
2008 – Microsoft buys Powerset
2010 – Facebook adopts HBase for its messaging system

Facebook Messaging System

• Facebook monitored their usage and figured out what they really needed.

• What they needed was a system that could handle two types of data patterns:
  – A short set of temporal data that tends to be volatile
  – An ever-growing set of data that rarely gets accessed

In short: real-time, distributed, linearly scalable, robust, Big Data scale, open-source, key-value, column-oriented.

HBase Definition

HBase is a key/value store. Specifically, it is a sparse, consistent, distributed, multidimensional, sorted map.
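
To make that definition concrete, here is a purely conceptual sketch (plain Java collections, not the HBase API) of a table as nested sorted maps; the sample row and column are borrowed from the webtable example used later in this deck.

import java.util.NavigableMap;
import java.util.TreeMap;

public class HBaseAsSortedMap {
    public static void main(String[] args) {
        // An HBase table behaves like: rowKey -> family -> qualifier -> timestamp -> value.
        // TreeMap keeps every level sorted ("sorted"); absent cells simply have no
        // entry ("sparse"); the nesting is the "multidimensional" part.
        NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> table = new TreeMap<>();

        table.computeIfAbsent("com.apache.www", k -> new TreeMap<>())
             .computeIfAbsent("anchor", k -> new TreeMap<>())
             .computeIfAbsent("apache.com", k -> new TreeMap<>())
             .put(10L, "APACHE");

        System.out.println(table); // {com.apache.www={anchor={apache.com={10=APACHE}}}}
    }
}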
More HBase Implementations

Quotes from the Apache "Powered By HBase" page:

• "…uses HBase to power their Messages infrastructure." (see http://sites.computer.org/debull/A12june/facebook.pdf)
• "A number of applications, including people search, rely on HBase internally for data generation."
• "…uses HBase to store document fingerprints for detecting near-duplications. We have a cluster of a few nodes that runs HDFS, MapReduce, and HBase."
• "We use HBase as a real-time data storage and analytics platform."
• "…uses an HBase cluster containing over a billion anonymized clinical records."
• "…uses HBase as a foundation for cloud-scale storage for a variety of applications."

Referred: http://wiki.apache.org/hadoop/Hbase/PoweredBy
Data Model
Versions of Data

Row key: the Persons ID. Column family "personal_data" holds Name and Address; column family "demographic" holds Birth Date and Gender.

Persons ID     Name     Address      Birth Date   Gender
1              Harry    BTM Layout   1988-10-31   M
2              Dhawan                1956-09-16   M
3              Sana     Whitefield   1989-12-03   F
…              …        …            …            …
500,000,000    Vineet   Delhi        1964-01-07   M

Note that row 2 has no Address cell at all: missing cells are simply not stored.
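
As a hands-on aside, a table like this could be created with the HBase Java client; a minimal sketch assuming the HBase 2.x API and a running cluster, with the table and family names taken from the slide.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreatePersonsTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Two column families, as in the slide's data model.
            admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("Persons"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal_data"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("demographic"))
                    .build());
        }
    }
}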
Physical Storage

[Figure: on disk, cells are grouped by column family rather than by row. For row 1, Family1 (personal data) holds Name -> "H. Houdini" and Address -> "Budapest", while Family2 (demographic) holds Birth Date -> "1926-10-31" and Gender -> "M". Rows 2 and 3 store only the cells that actually have values.]

What Does It Look Like, and What Does It Mean?

Row Key
• Unique for each row; identifies the row

Column Family and Column Qualifier
• Fewer families gives faster access
• Families are fixed at table creation; column qualifiers are not

Values
• Multiple versions of each value are maintained
• A scan shows only the most recent version by default
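
To see older versions you must ask for them explicitly. A minimal sketch with the HBase 2.x Java client (older clients use Get.setMaxVersions instead), reusing the hypothetical Persons table from above:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadVersions {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("Persons"))) {
            Get get = new Get(Bytes.toBytes("1"));
            get.readVersions(3); // ask for up to 3 versions; the default is newest only
            Result result = table.get(get);
            for (Cell cell : result.rawCells()) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}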
Three Major Components
Data Distribution

[Figure: the logical view is all rows of a table in sorted row-key order (A1, A2, A22, A3, …, K4, …, 090, …, Z30, Z55). The table is split into regions, each covering a contiguous key range: Null -> A3, A3 -> F34, F34 -> K80, K80 -> 095, and 095 -> Null. The regions are spread across the region servers.]
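
For illustration, a client can ask which region (key range) holds a given row and which server serves it; a minimal sketch assuming the HBase 2.x client, where the table name "test" and the row key "A22" are illustrative.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class LocateRow {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("test"))) {
            // Which region holds row "A22", and which region server serves it?
            HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("A22"));
            System.out.println("Region: " + loc.getRegion().getRegionNameAsString());
            System.out.println("Server: " + loc.getServerName());
        }
    }
}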
HBase Components

[Figure: ZooKeeper tracks coordination state such as /hbase/region1, /hbase/region2, …; the Master manages the RegionServers; each RegionServer holds an in-memory MemStore and persists HFiles and a write-ahead log (WAL) to HDFS.]

HBase Components

• A table is made of regions

• Region: a range of rows stored together
  – A single shard, used for scaling
  – Split dynamically when too big; merged when too small

• Region servers serve one or more regions
  – A region is served by only one region server

• Master server: the daemon responsible for managing the HBase cluster

• HBase stores its data in HDFS
  – It relies on HDFS for high availability and fault tolerance
HBase Storage Architecture
HBase Storage, Simplified

[Figure: three management nodes each run ZooKeeper, hosting the HBase Master, the HDFS NameNode, and the Secondary NameNode respectively. Worker machines scale horizontally to N nodes, each running an HBase RegionServer alongside an HDFS DataNode.]
Different Types of Regions
Root/Meta Table

Each row in the ROOT and META tables is approximately 1 KB. At the default region size of 256 MB, one META region can therefore point to about 256 MB / 1 KB ≈ 262,144 user regions, and the single ROOT region pointing to META regions can address about 262,144² ≈ 6.9 × 10¹⁰ regions in total.
Compactions

HBase flushes the MemStore into many small HFiles; compactions merge them into fewer, larger files (a major compaction rewrites a store into a single file and drops deleted or expired cells). The store below holds the webtable-style sample data:

Row Key            Time Stamp   Column "contents:"   Column "anchor:"
"com.apache.www"   t12          "<html>…"
                   t11          "<html>…"
                   t10                               "anchor:apache.com" -> "APACHE"
"com.cnn.www"      t9                                "anchor:cnnsi.com" -> "CNN"
                   t8                                "anchor:my.look.ca" -> "CNN.com"
                   t6           "<html>…"
                   t5           "<html>…"
                   t3           "<html>…"

HStore1
Region Splits

When a region grows beyond the configured maximum size, it is split in two at a middle row key, and the daughter regions can be reassigned to other region servers:

Row Key            Time Stamp   Column "contents:"   Column "anchor:"
"com.apache.www"   t12          "<html>…"
                   t11          "<html>…"
                   t10                               "anchor:apache.com" -> "APACHE"
"com.cnn.www"      t9                                "anchor:cnnsi.com" -> "CNN"
                   t8                                "anchor:my.look.ca" -> "CNN.com"
                   t6           "<html>…"
                   t5           "<html>…"
                   t3           "<html>…"

HStore1
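
Splits and major compactions normally happen automatically, but both can also be requested through the Admin API; a hedged sketch assuming the HBase 2.x client (the table name "test" is illustrative).

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SplitAndCompact {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("test");
            admin.split(table);        // ask the servers to split the table's regions
            admin.majorCompact(table); // rewrite each store into a single HFile
        }
    }
}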
HBase Client API
Scanner and Filters
Search

Get value from table where key = 'com.apache.www' AND label = 'anchor:apache.com'

Row Key            Time Stamp   Column "anchor:"
"com.apache.www"   t12
                   t11
                   t10          "anchor:apache.com" -> "APACHE"
"com.cnn.www"      t9           "anchor:cnnsi.com" -> "CNN"
                   t8           "anchor:my.look.ca" -> "CNN.com"
                   t6
                   t5
                   t3
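
The same lookup expressed with the Java client; a minimal sketch that assumes the table is named "webtable" (the slides do not name it).

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("webtable"))) {
            // Fetch one row, restricted to a single family:qualifier.
            Get get = new Get(Bytes.toBytes("com.apache.www"));
            get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
            System.out.println(Bytes.toString(value)); // "APACHE"
        }
    }
}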
Search

Scanner: select value from table where anchor = 'cnnsi.com'

Row Key            Time Stamp   Column "anchor:"
"com.apache.www"   t12
                   t11
                   t10          "anchor:apache.com" -> "APACHE"
"com.cnn.www"      t9           "anchor:cnnsi.com" -> "CNN"
                   t8           "anchor:my.look.ca" -> "CNN.com"
                   t6
                   t5
                   t3
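
And the scan, again with the Java client (same hypothetical "webtable"): restricting the scan to the anchor:cnnsi.com column returns the matching cell from every row that has one.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("webtable"))) {
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));
            // Richer predicates are expressed with Filter subclasses,
            // e.g. SingleColumnValueFilter, set via scan.setFilter(...).
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()) + " -> "
                            + Bytes.toString(r.getValue(Bytes.toBytes("anchor"),
                                                        Bytes.toBytes("cnnsi.com"))));
                }
            }
        }
    }
}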
HBase API

• get(row)
• put(row, Map<column, value>)
• scan(key range, filter)
• increment(row, columns)
• checkAndPut, delete, etc.
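
Get and scan were sketched above; the mutating calls look like this. A minimal sketch with the HBase 2.x Java client, using the 'test' table and 'cf' family from the shell demo that follows (the "hits" counter column is illustrative).

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MutationExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("test"))) {
            // put(row, Map<column, value>): write one cell.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("value1"));
            table.put(put);

            // increment(row, columns): atomic server-side counter update.
            long n = table.incrementColumnValue(
                    Bytes.toBytes("row1"), Bytes.toBytes("cf"), Bytes.toBytes("hits"), 1L);
            System.out.println("hits = " + n);
        }
    }
}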
HBase Shell

hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds

hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds

hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds

hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
HBase Shell, Contd.

hbase(main):007:0> scan 'test'
ROW     COLUMN+CELL
 row1   column=cf:a, timestamp=1288380727188, value=value1
 row2   column=cf:b, timestamp=1288380738440, value=value2
 row3   column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds
Thank You
See You in Class Next Week
