measures proficiency in real-time data access using HBase and Hadoop.

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

- NoSQL, column-oriented key-value store
- keys and values are byte arrays
- values are stored in key order
- no schema: only the column families need to be specified when a table is created

Features
- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables.
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy to use Java API for client access.
- Block cache and Bloom Filters for real-time queries.
- Query predicate push down via server side Filters.
- Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
- Extensible JRuby-based (JIRB) shell.
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.

2 types of nodes:
- 1 master node (HA managed with ZooKeeper)
  - manages region servers
  - not a part of the read or write path
- multiple region servers
  - host tables and perform data read/write operations

Partitioning
- HBase tables are horizontally partitioned into regions
- each region is managed by a region server
- a region server may hold multiple regions

1/3 hbase.md 2024-04-13
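The partitioning notes above say that a table is split horizontally into regions, each served by one region server, and that a server may hold several regions. As a rough illustration only, the lookup from a row key to its region can be sketched with a sorted list of region start keys. The start keys, server names, and the `locate_region` helper below are hypothetical and are not part of the HBase API; a real client resolves regions through HBase's own metadata.

```python
from bisect import bisect_right
from typing import Tuple

# Hypothetical layout: each region covers [start_key, next_start_key).
# The first region starts at the empty key. All keys and names are made up.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]
REGION_TO_SERVER = {
    b"": "regionserver-1",
    b"g": "regionserver-2",
    b"n": "regionserver-1",  # one server may host several regions
    b"t": "regionserver-3",
}


def locate_region(row_key: bytes) -> Tuple[bytes, str]:
    """Return (region start key, hosting server) for a row key."""
    # Find the last start key that is <= row_key (byte-wise ordering).
    idx = bisect_right(REGION_START_KEYS, row_key) - 1
    start = REGION_START_KEYS[idx]
    return start, REGION_TO_SERVER[start]


if __name__ == "__main__":
    for key in (b"apple", b"grape", b"zebra"):
        print(key, locate_region(key))
```

Because row keys are compared as raw bytes, rows that sort next to each other land in the same region, which is what makes range scans by key cheap.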
Persistence and data availability
- HBase stores its data in HDFS; it does not replicate RegionServers and relies on HDFS replication for data availability
- Updates go to the in-memory store (MemStore), which is periodically flushed to disk as HFiles; reads are served from both the MemStore and the on-disk HFiles

Row distribution
- HBase tables are sorted by row key

Data storage
- HFiles (or StoreFiles) are the underlying storage units for HBase tables
- an HFile is a key-value map
- when data is added, it is written to a log called the Write Ahead Log (WAL) and then to the MemStore
- HFiles are immutable
- HBase periodically merges HFiles into larger ones (compaction) to improve read performance

Data Model
- HBase is a column-oriented database sorted by row key
- each column family can have multiple columns
- each row can have multiple versions of the same column
- sparse data model
- key terms: row key, column family, qualifier, timestamp, version, value

Use Cases
- millions/billions of rows
- commodity hardware
- random selects
- range scans by key

HBase vs RDBMS
- automatic partitioning
- scales horizontally with new nodes
- commodity hardware
- fault tolerant
- MapReduce distributed processing

Connecting
- MapReduce
- Hive/Pig/HCatalog/Hue
- Java Applications
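The write path from the data storage notes (write to the WAL, then to the MemStore, flush to immutable HFiles, merge HFiles later) and the versioned cells from the data model notes can be sketched as a toy in-memory model. This is a minimal sketch under stated assumptions, not how HBase is implemented: the `ToyStore` class, its methods, and the `flush_threshold` parameter are hypothetical, cells are keyed by (row key, "family:qualifier", timestamp), and the WAL is never trimmed.

```python
from typing import Dict, List, Optional, Tuple

# Cell key: (row key, b"family:qualifier", timestamp). Illustrative only.
CellKey = Tuple[bytes, bytes, int]
Cell = Tuple[CellKey, bytes]


class ToyStore:
    """Toy model of one region's store; not HBase's real implementation."""

    def __init__(self, flush_threshold: int = 3) -> None:
        self.wal: List[Cell] = []                  # append-only log
        self.memstore: Dict[CellKey, bytes] = {}   # in-memory writes
        self.hfiles: List[Tuple[Cell, ...]] = []   # immutable, sorted files
        self.flush_threshold = flush_threshold     # hypothetical knob

    def put(self, row: bytes, column: bytes, ts: int, value: bytes) -> None:
        key = (row, column, ts)
        self.wal.append((key, value))  # 1. log the write first
        self.memstore[key] = value     # 2. then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Write the MemStore out as one immutable, key-sorted file."""
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def compact(self) -> None:
        """Merge all files into a single larger sorted file."""
        merged: Dict[CellKey, bytes] = {}
        for hfile in self.hfiles:
            merged.update(hfile)
        self.hfiles = [tuple(sorted(merged.items()))] if merged else []

    def get(self, row: bytes, column: bytes) -> Optional[bytes]:
        """Return the newest version across the MemStore and all files."""
        best: Optional[Tuple[int, bytes]] = None
        sources = [tuple(self.memstore.items())] + self.hfiles
        for source in sources:
            for (r, c, ts), value in source:
                if r == row and c == column and (best is None or ts > best[0]):
                    best = (ts, value)
        return best[1] if best else None


if __name__ == "__main__":
    store = ToyStore(flush_threshold=2)
    store.put(b"row1", b"cf:name", 1, b"alice")
    store.put(b"row1", b"cf:name", 2, b"alicia")  # second version, triggers a flush
    store.put(b"row2", b"cf:name", 1, b"bob")     # stays in the MemStore
    print(store.get(b"row1", b"cf:name"), len(store.hfiles))
```

A read has to consult the MemStore and every file, so fewer, larger files mean less work per read; that is the motivation for merging HFiles.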