
hbase.md 2024-04-13

HBase Data Storage


measures proficiency in real-time data access using HBase and Hadoop.
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's
goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of
commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database
modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as
Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase
provides Bigtable-like capabilities on top of Hadoop and HDFS.
NoSQL
column-oriented key-value store
keys and values are byte arrays
values are stored in key order
no schema
at table creation time, only the column families need to be specified
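Because keys are plain byte arrays, "key order" means byte-wise lexicographic order rather than numeric order. The following is a minimal Python sketch of that ordering, not HBase code; the row keys and the zero-padding helper `padded_key` are illustrative assumptions.

```python
# Illustration only: Python bytes compare byte by byte from left to right,
# which is the same lexicographic ordering used for byte-array keys.
row_keys = [b"user2", b"user10", b"user1"]

# b"user10" sorts before b"user2" because the byte b"1" is smaller than b"2".
sorted_keys = sorted(row_keys)

# Hypothetical helper: zero-pad the numeric part so byte order matches numeric order.
def padded_key(prefix: str, n: int, width: int = 4) -> bytes:
    return f"{prefix}{n:0{width}d}".encode("utf-8")

padded = sorted(padded_key("user", n) for n in (2, 10, 1))

print(sorted_keys)
print(padded)
```

This is why row key design matters: keys that should be read together need to share a byte prefix, and numeric parts usually need a fixed width.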
Features
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
Extensible jruby-based (JIRB) shell
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
2 types of nodes:
1 master node (HA managed with ZooKeeper)
manages region servers
not a part of the read or write path
multiple region servers
host the regions of tables and perform data read/write operations
Partitioning
HBase tables are horizontally partitioned into regions
each region is managed by a region server
a region server may hold multiple regions
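One way to picture the partitioning above is a sorted list of region start keys with a lookup from row key to region. This is a simplified sketch, not how the HBase client is implemented; the start keys, the server names (`rs1`, `rs2`, `rs3`) and the `locate` function are hypothetical.

```python
import bisect

# Hypothetical layout: each region covers [start_key, next_start_key).
# The first region starts at the empty key, so every row key falls in some region.
region_start_keys = [b"", b"g", b"n", b"t"]

# One region server may hold several regions (rs1 holds two here).
region_to_server = {b"": "rs1", b"g": "rs2", b"n": "rs1", b"t": "rs3"}

def locate(row_key: bytes):
    """Return (region start key, region server) responsible for row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position contains the row.
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    start = region_start_keys[idx]
    return start, region_to_server[start]

print(locate(b"apple"))
print(locate(b"zebra"))
```

Because the table is sorted by row key, a single binary search over the region boundaries is enough to route a read or write.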

Persistence and data availability


HBase stores its data in HDFS; it does not replicate RegionServers and relies on HDFS replication for
data availability
Updates are written to the in-memory store (MemStore), which is periodically flushed to disk as
on-disk store files (HFiles); reads are served from both the MemStore and the HFiles
Row distribution
HBase tables are sorted by row key
Data storage
HFiles (also called StoreFiles) are the underlying storage units for HBase tables
An HFile is a key-value map
When data is added, it is first written to a log called the Write Ahead Log (WAL) and then to the MemStore
HFiles are immutable
HBase periodically merges HFiles into larger ones to improve read performance
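The write path described above (WAL, then MemStore, then immutable files that are later merged) can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions: the class `TinyStore` and its `flush_threshold` parameter are made up, nothing is written to disk, and the WAL is never truncated.

```python
class TinyStore:
    """Toy model of one store: WAL -> MemStore -> immutable files -> merge."""

    def __init__(self, flush_threshold=2):
        self.wal = []        # append-only log of every accepted write
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # immutable, key-sorted files (newest last)
        self.flush_threshold = flush_threshold  # hypothetical knob

    def put(self, key: bytes, value: bytes) -> None:
        self.wal.append((key, value))  # 1. log first, so the write can be replayed
        self.memstore[key] = value     # 2. then buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Write the buffer out as a new sorted, immutable file and clear it.
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key: bytes):
        # Check memory first, then files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None

    def compact(self) -> None:
        # Merge all files into one; later (newer) files win on duplicate keys.
        merged = {}
        for hfile in self.hfiles:
            merged.update(hfile)
        self.hfiles = [tuple(sorted(merged.items()))] if merged else []


store = TinyStore(flush_threshold=2)
store.put(b"a", b"1")
store.put(b"b", b"2")   # second write triggers a flush
store.put(b"a", b"3")
store.put(b"c", b"4")   # second flush; b"a" now exists in two files
files_before = len(store.hfiles)
store.compact()
files_after = len(store.hfiles)
```

Since files are never modified in place, an update simply lands in a newer file, and a read may have to consult several files until a merge reduces their number.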
Data Model
HBase is a column-oriented database
sorted by row key
each column family can have multiple columns
each row can have multiple versions of the same column
sparse data model
row key
column family
qualifier
timestamp version
value
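The coordinates listed above (row key, column family, qualifier, timestamp) together address one value. A rough way to picture this is a nested map, as in the following sketch; the `put` and `get` functions and the sample rows are assumptions for illustration and do not mirror the HBase client API.

```python
from collections import defaultdict

# Toy table: row key -> "family:qualifier" -> {timestamp: value}.
# A row stores only the cells it actually has, which is what makes the model sparse.
table = defaultdict(lambda: defaultdict(dict))

def put(row, family, qualifier, timestamp, value):
    table[row][f"{family}:{qualifier}"][timestamp] = value

def get(row, family, qualifier, max_versions=1):
    """Return up to max_versions (timestamp, value) pairs, newest first."""
    versions = table.get(row, {}).get(f"{family}:{qualifier}", {})
    return sorted(versions.items(), reverse=True)[:max_versions]

put(b"row1", "info", "email", 100, b"old@example.com")
put(b"row1", "info", "email", 200, b"new@example.com")
put(b"row2", "info", "name", 150, b"Ada")  # row2 has no email cell at all

latest = get(b"row1", "info", "email")
history = get(b"row1", "info", "email", max_versions=2)
absent = get(b"row2", "info", "email")
```

Writing the same column again with a newer timestamp adds a version rather than overwriting the old one, and a missing cell costs no storage.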
Use Cases
millions/billions of rows
commodity hardware
random selects
range scans by key
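Range scans by key are cheap because rows are kept sorted: a scan is a contiguous slice between two keys. The sketch below illustrates the idea on an in-memory list using a start-inclusive, stop-exclusive range; the row keys and the `scan` function are hypothetical, not the HBase API.

```python
import bisect

# Hypothetical sorted row keys, standing in for a table ordered by row key.
sorted_rows = [b"order#0001", b"order#0002", b"user#0001", b"user#0002", b"user#0010"]

def scan(start_row: bytes, stop_row: bytes):
    """Return keys in [start_row, stop_row): start inclusive, stop exclusive."""
    lo = bisect.bisect_left(sorted_rows, start_row)
    hi = bisect.bisect_left(sorted_rows, stop_row)
    return sorted_rows[lo:hi]

# Prefix scan for all "user#" keys: "$" is the byte right after "#",
# so b"user$" is the smallest key that no longer has the b"user#" prefix.
user_rows = scan(b"user#", b"user$")

# A narrower range inside the same prefix.
first_two_users = scan(b"user#0001", b"user#0003")
```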
HBase vs RDBMS
automatic partitioning
scales horizontally with new nodes
commodity hardware
fault tolerant
MapReduce distributed processing
Connecting
MapReduce, Hive/Pig/HCatalog/Hue
Java Applications

[...]
