
MITB B.5 Big Data Technologies


Reference Notes

Chapter 3 – Data Storage


Agenda Today

• Data Management
‒ Storage formats
‒ Compression
‒ Schema Design
• Storage Systems – Hbase
• Lab Environment walk through

MITB B.5 Big Data Technologies


Data Management

MITB B.5 Big Data Technologies


Data Modelling in Hadoop
Key characteristics of Hadoop
- Distributed data store
- Platform for implementing powerful parallel processing frameworks

Reliability of the distributed data store + flexibility in processing = Data Hub
- Store massive amounts of data "as-is"
- No constraints on how the data is processed

MITB B.5 Big Data Technologies


Data Modelling in Hadoop

Key Choices
1. Data Storage Formats (what file format to store data in)
2. Compression
3. Schema Design (organization of the files in the filesystem)
4. Meta-Data Management
5. Security

MITB B.5 Big Data Technologies


1. Data Storage
Formats

MITB B.5 Big Data Technologies


Different Types of Files
Hadoop ecosystems in general have to store, manage and process different types of file formats and file
types

• Server log files on a central web server – typically key-value log files in binary format; lots of small files
• Lucene search indexes in HDFS – inverted index data structure optimized for full-text search
• Social media data – typically made available by APIs in XML or JSON format
• Transaction data – structured data residing in relational databases

MITB B.5 Big Data Technologies


File Formats – High Level View
Standard File Formats
• Text (CSV, TSV)
• Structured text (XML, JSON)
• Binary (e.g., images)

Hadoop-Specific File Formats
• File-based data structures: SequenceFile
• Serialization formats: Avro, Thrift, Protocol Buffers
• Columnar formats: RCFile, ORC, Parquet

MITB B.5 Big Data Technologies


Standard File Formats – Text
• Text = CSV, TSV, XML, JSON records
• Convenient format for exchange with applications and scripts
• Human readable and parseable
• Data store is bulky and not as efficient to query
‒ Text storage space > Integer storage space
‒ Type conversion overhead
• Hard to express complex, nested data structures
• Removing or adding fields is tricky

MITB B.5 Big Data Technologies


Hadoop File Formats – Sequence Files
- Persistent data structure for storing data as binary key-value pairs
- Row-based storage
- Commonly used to transfer data between MR jobs
- Can be used as an archive to pack small files in Hadoop
- Common header (Meta-data – compression codec, class names, user defined meta-data)
- Supports splittability via Sync Markers that are written into the body of the file to allow for seeking
to random points in the file
- Other File Based data structures like MapFiles, SetFiles, ArrayFiles, and BloomMapFiles also
available

MITB B.5 Big Data Technologies


Hadoop File Formats – Sequence Files

Record Compressed
- Compresses each record as it is added to the file

Block Compressed
- Waits until data reaches block size to compress, rather than compressing as each record is added
- Block compression provides better compression ratios
- Here, "block" refers to a group of records that are compressed together within a single HDFS block

MITB B.5 Big Data Technologies


Hadoop File Formats – Sequence Files
Record Compression
• Common header followed by one or more records
• Metadata in the header: compression codec, key and value class names, user-defined metadata
• Uncompressed record layout: record length, key length, key, value
• Compressed record layout: record length, key length, key, compressed value
• Randomly generated sync markers between records – key to facilitating splittability

Block Compression
• Compresses multiple records at once – more compact
• Block size defined in io.seqfile.compress.blocksize
• Each block: number of records, compressed key lengths, compressed keys, compressed value lengths, compressed values
• Sync marker before the start of every block

MITB B.5 Big Data Technologies


Hadoop File Formats – Sequence Files
• A common use case for SequenceFiles is as a container for smaller files.
• Storing a large number of small files in Hadoop can cause a couple of issues.
• One is excessive memory use for the NameNode, because metadata for each file stored in
HDFS is held in memory.
• Another potential issue is in processing data in these files—many small files can lead to many
processing tasks, causing excessive overhead in processing.
• Because Hadoop is optimized for large files, packing smaller files into a SequenceFile makes the
storage and processing of these files much more efficient.
• Limited support outside Hadoop and only supported in Java
• SequenceFiles are append only
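As a minimal sketch of this small-file packing pattern (not taken from the slides): a block-compressed SequenceFile whose keys are the original file names and whose values are the raw file bytes. The class name, paths and key/value type choices are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // directory of small files (assumed argument)
    Path output = new Path(args[1]);     // packed SequenceFile (assumed argument)

    // Block-compressed SequenceFile: key = original file name, value = file bytes
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(output),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(CompressionType.BLOCK))) {
      for (FileStatus status : fs.listStatus(inputDir)) {
        byte[] contents = new byte[(int) status.getLen()];
        try (FSDataInputStream in = fs.open(status.getPath())) {
          IOUtils.readFully(in, contents, 0, contents.length);
        }
        // Append one key-value pair per small file
        writer.append(new Text(status.getPath().getName()), new BytesWritable(contents));
      }
    }
  }
}
```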

MITB B.5 Big Data Technologies


Hadoop File Formats – Sequence Files
server_log_21-aug-2016 server_log_22-aug-2016
Key Value Key Value

IP 121.121.121.111 IP 121.122.122.122

Timestamp [21/Aug/2016:00:12:14 -0500] Timestamp [22/Aug/2016:00:20:48 -0300]

Request GET /download/windows/ Request GET /a12345f/ HTTP/1.0


a12345.zip HTTP/1.0
Status 200
Status 200
… …
… …

SequenceFile

Key Value
21-Aug-2016 server_log_21-aug-2016

22-Aug-2016 server_log_22-aug-2016

… …

… …

MITB B.5 Big Data Technologies


Hadoop File Formats – Serialization Formats
Serialization is the process of
- Turning structured objects into a byte stream
- For transmission over a network or
- For writing to persistent storage

Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

Core to Hadoop, since


- It allows data to be converted into a format that can be efficiently stored
- And efficiently transferred across a network connection.
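A small illustration of this round trip using Hadoop's built-in Writable serialization (an assumed example, not from the slides): an IntWritable is turned into a 4-byte stream and restored from it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class SerializationSketch {
  public static void main(String[] args) throws IOException {
    // Serialization: turn a structured object into a byte stream
    IntWritable value = new IntWritable(163);
    ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
    value.write(new DataOutputStream(bytesOut));
    byte[] bytes = bytesOut.toByteArray();   // 4 bytes for an int

    // Deserialization: turn the byte stream back into an object
    IntWritable restored = new IntWritable();
    restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
    System.out.println(bytes.length + " bytes -> " + restored.get());   // "4 bytes -> 163"
  }
}
```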

MITB B.5 Big Data Technologies


Hadoop File Formats – Serialization Formats
Requirements for a serialization format, from two perspectives:

• Compact – RPC: make best use of network bandwidth, the scarcest resource in a data center; Storage: make efficient use of storage space
• Fast – RPC: the backbone of a distributed system, so the SerDe process should add as little performance overhead as possible; Storage: minimal overhead when reading or writing terabytes of data
• Extensible – RPC: evolve the protocol in a controlled manner for clients and servers; Storage: transparently read data written in an older format
• Interoperable – RPC: support clients written in different languages from the server; Storage: read or write persistent data using different languages

MITB B.5 Big Data Technologies


Hadoop File Formats – Thrift and Protocol Buffers
• Developed at Facebook (Thrift) and Google (Protocol Buffers) as frameworks for implementing cross-language interfaces to services
• Uses an Interface Definition Language (IDL) to define interfaces
• Easy to express rich data structures
• Support schema evolution
• Although sometimes used for data serialization with Hadoop, Thrift has several drawbacks:
‒ Does not support internal compression of records
‒ Not splittable
‒ Lacks native MapReduce support
• There are externally available libraries such as the Elephant Bird project to address these drawbacks

MITB B.5 Big Data Technologies


Hadoop File Formats – Apache Avro
• Language-neutral data serialization system designed to address the major downside of Hadoop
Writables – lack of language portability
• Row-based, offers a compact and fast binary format
• Supports rich data structures
• Like Thrift and ProtocolBuffers, Avro data is described through a language independent schema
• Self-describing files – stores schema in the header of each file
‒ Avro files can be easily read later even from a different language than the one used to write the
file
• Native support in MapReduce – compressible and splittable Avro data files
• Integrates with many programming languages

MITB B.5 Big Data Technologies


Hadoop File Formats – Avro
• Avro schemas are usually written in JSON, but may also be written in Avro IDL, which is a C-like
language
‒ Describe the field types and (optionally) their default and allowed values
• Schema is stored together with the data
• The file header contains a unique sync marker.
• Just as with SequenceFiles, this sync marker is used to separate blocks in the file, allowing Avro
files to be splittable
• Blocks can optionally be compressed, and within those blocks, types are stored in their native
format, providing an additional boost to compression.
• Avro supports Snappy and Deflate compression
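A hedged sketch of writing an Avro data file in Java under these assumptions: a hypothetical LogEvent schema defined inline as JSON, Snappy block compression, and an output file name chosen purely for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
  // Hypothetical schema with a default value, written in JSON
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
      + "{\"name\":\"ip\",\"type\":\"string\"},"
      + "{\"name\":\"status\",\"type\":\"int\",\"default\":200}]}";

  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    GenericRecord event = new GenericData.Record(schema);
    event.put("ip", "121.121.121.111");
    event.put("status", 200);

    // The schema is written into the file header, so the file is self-describing
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.snappyCodec());   // block-level Snappy compression
      writer.create(schema, new File("events.avro"));
      writer.append(event);
    }
  }
}
```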

MITB B.5 Big Data Technologies


Row Oriented Vs. Column Oriented
Example table:
  Row 1: 1, 2, 3    Row 2: 4, 5, 6    Row 3: 7, 8, 9    Row 4: 10, 11, 12

Row-oriented layout (values stored row by row):
  1 2 3 | 4 5 6 | 7 8 9 | 10 11 12

Column-oriented layout (within each row split, values stored column by column):
  Split 1 (Rows 1–2): 1 4 | 2 5 | 3 6
  Split 2 (Rows 3–4): 7 10 | 8 11 | 9 12
MITB B.5 Big Data Technologies
Hadoop File Formats – Columnar Formats
Advantages of Columnar Formats
• Skips I/O and decompression (if applicable) on columns that are not a part of the query
• Works well for queries that only access a small subset of columns. If many columns are being
accessed, then row-oriented is generally preferable
• Generally very efficient in terms of compression on columns because entropy within a column is
lower than entropy within a block of rows
• Well suited for data-warehousing-type applications where users want to aggregate certain columns
over a large collection of records

MITB B.5 Big Data Technologies


Hadoop File Formats – Columnar Formats (RCFile)
• Developed to provide fast data loading, fast query processing and highly efficient storage space
utilization
• RCFile format breaks file into row splits, then within each split uses column oriented storage
• Advantages
- Query performance
- Compression performance
• Disadvantages
- Less query optimization than later columnar formats such as ORC and Parquet

MITB B.5 Big Data Technologies


Hadoop File Formats – Columnar Formats (ORC File)
• Created to address some of the shortcomings of the RCFile
• Breaks file into row splits, then within each split uses column oriented storage
• Advantage
- Lightweight, always on compression
- Supports type primitives like decimal and complex types
- Splittable storage format
• Disadvantages
- Designed specifically for Hive, so not a general-purpose storage format for non-Hive tools such as Pig or Impala

MITB B.5 Big Data Technologies


Hadoop File Formats – Columnar Formats (Parquet)
• General purpose storage format for Hadoop – suitable for MapReduce interfaces (Java, Hive, Pig)
and others (Impala, Spark)
Advantages
• Efficient compression – can be specified on a per-column basis
• Supports complex nested data structures
• Stores full metadata at the end of files, so Parquet files are self-documenting
• Splittable storage format
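As an illustrative sketch only (assuming the parquet-avro bindings are on the classpath), a Parquet file can be written through an Avro schema with per-column Snappy-compressed pages; the Order schema and file name are hypothetical.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical record schema for an order
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"amount\",\"type\":\"double\"}]}");

    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("orders.parquet"))
                 .withSchema(schema)
                 .withCompressionCodec(CompressionCodecName.SNAPPY)   // columnar pages compressed
                 .build()) {
      GenericRecord order = new GenericData.Record(schema);
      order.put("id", 1L);
      order.put("amount", 42.5d);
      writer.write(order);
    }
  }
}
```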

MITB B.5 Big Data Technologies


Design Considerations
• Code generation – libraries with code generation support (rich object creation, type safety)
• Transparent compression – internally compress and decompress data on reads and writes
• Schema evolution – support data model evolution: add or modify attributes while providing backward and forward compatibility
• Splittability – supporting multiple parallel readers; synchronization markers enable random seek and scan
• Language support – access data in more than one programming language
• Support in MapReduce and the Hadoop ecosystem – file format integration natively supported rather than requiring custom code

MITB B.5 Big Data Technologies


Design Considerations
How is the raw data structured?
• Text or CSV format, stored as such
- Easy to understand and troubleshoot
- System performance can be poor due to inefficient parsing
• JSON or XML format, stored as such
- Splittability could be an issue, which may impact processing times

MITB B.5 Big Data Technologies


Design Considerations
What does the processing pipeline consist of?
• Lot of MR jobs
- SequenceFile can be very efficient
• Lot of interactive querying
- Columnar formats can be more efficient (Parquet)
How many columns are stored and used for analysis
• Query a few data columns on a wide table (many stored, few analyzed) –
Parquet is a good choice
• Full scan - Avro

MITB B.5 Big Data Technologies


Design Considerations
Does the data change over time?
- Text files do not explicitly store a schema
- Parquet allows addition of new columns at the end of the column list (doesn't handle deletion of columns)
- Avro allows addition, deletion and renaming of multiple columns

MITB B.5 Big Data Technologies


Design Considerations
Speed Considerations
‒ Parquet and ORC usually need some additional parsing to format the data, which increases the overall time
‒ For adding large amounts of data to HDFS quickly, SequenceFiles are a good choice
‒ Compressing a file, regardless of format, increases query speed
‒ Parquet and ORC optimize read performance at the expense of write performance

MITB B.5 Big Data Technologies


2. Compression

MITB B.5 Big Data Technologies


Compression
• Compression helps reduce overall processing time as well as storage space
requirements
• Compression adds CPU load, but typically offset by savings in I/O
• All compression algorithms exhibit a space/time trade-off
• Different tools have very different compression characteristics
• Not all compression formats are splittable
‒ Impediment to efficient processing, since MR splits data to multiple tasks

MITB B.5 Big Data Technologies


Compression and Input Splits

Uncompressed file of 1 GB, HDFS block size 128 MB
- 8 blocks created and stored
- The MapReduce job creates 8 map tasks, one per input split

File of 1 GB compressed with gzip, HDFS block size 128 MB
- 8 blocks created and stored
- The MapReduce job creates a single map task for all input splits, because gzip is not splittable

MITB B.5 Big Data Technologies


Compression and Input Splits
Compression Format | Algorithm | Filename Extension | Splittable
DEFLATE            | DEFLATE   | .deflate           | No
gzip               | DEFLATE   | .gz                | No
bzip2              | bzip2     | .bz2               | Yes
LZO                | LZO       | .lzo               | No*
LZ4                | LZ4       | .lz4               | No
Snappy             | Snappy    | .snappy            | No

* LZO files are splittable if they have been indexed in a pre-processing step

MITB B.5 Big Data Technologies


Compression – Recommendations
- Enable compression of MR intermediate output – improve performance by
decreasing the amount of intermediate data
- Ordering of data can provide better compression levels
- A compact file format with support for splittable compression such as Avro is
often a good choice
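A minimal sketch of how these recommendations translate into MapReduce job configuration, assuming Snappy is available on the cluster; property names follow the Hadoop 2 (mapreduce.*) naming, and the output path is a placeholder argument.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress intermediate (map) output to cut shuffle I/O
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed-job");

    // Compress the final job output as well
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
  }
}
```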

MITB B.5 Big Data Technologies


3. Schema Design

MITB B.5 Big Data Technologies


Schema Design
• Hadoop imposes no schema requirements at ingest time, thanks to schema-on-read
• Given Hadoop’s role as an Enterprise Data Hub, a structured and organized
repository is desirable
• Benefits
• Easier to share data between teams
• Enforcing access and quota controls
• Code re-use possible
• Comply with standard tool assumptions on data availability

MITB B.5 Big Data Technologies


Schema Design Considerations
• Dependent on the use case
• In general,
• Develop standard practices and enforce them
• Design for compliance with the tools intended to process the data
• Keep usage patterns in mind
• Location of HDFS files
• /user/<username>: Data, JARs and configuration files
• /etl : Data in various stages of being processed by an ETL workflow
• /tmp: Temporary data generated by tools or shared between users
• /data: Data sets that are shared across organization
• /metadata: Metadata files
• /app: jar files, Oozie workflow definitions, HiveQL file etc.

MITB B.5 Big Data Technologies


Schema Design Considerations
PARTITIONING

• Technique to reduce amount of I/O required to process a data set

• Break up the data set into smaller sets, each subset being called a partition

• Each partition would be in a sub-directory of the directory containing the entire data set

• Example (see the sketch after this list)

• <data set name>/<partition_column_name=partition_column_value>/{files}

• This translates to: orders/date=20131101/{order1.csv, order2.csv}
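A small sketch (with assumed paths) of writing a file into such a partition directory with the HDFS FileSystem API; the /data/orders location and file contents are illustrative only.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionedWriteSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Partition column embedded in the directory name: orders/date=20131101/order1.csv
    String date = LocalDate.of(2013, 11, 1).format(DateTimeFormatter.BASIC_ISO_DATE);
    Path partitionDir = new Path("/data/orders/date=" + date);
    fs.mkdirs(partitionDir);

    try (FSDataOutputStream out = fs.create(new Path(partitionDir, "order1.csv"))) {
      out.writeBytes("1001,2013-11-01,42.50\n");   // one example order row
    }
  }
}
```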

MITB B.5 Big Data Technologies


Schema Design Considerations
BUCKETING

• Technique for decomposing large data sets into more manageable sub sets

• Further decomposing partitioned data sets by a particular field or column

• Partitioning by the wrong column can result in the small file problem

• Trade-off when choosing the number of files – small files are inefficient to manage, while large files cause long scan times and slow down queries
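A sketch of the underlying idea rather than Hive's exact hashing: rows are assigned to a fixed number of buckets by hashing the bucketing column, so all rows for a given value always land in the same bucket file. The column values and bucket count are assumptions.

```java
public class BucketingSketch {
  // Bucket assignment: hash the bucketing column and take the modulus of the bucket count
  static int bucketFor(String customerId, int numBuckets) {
    return (customerId.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    // A row for customer "C-1042" in a data set with 32 buckets always lands in the same bucket
    System.out.println("bucket = " + bucketFor("C-1042", 32));
  }
}
```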

MITB B.5 Big Data Technologies


CASE STUDY

File Format Benchmarking

Courtesy Hortonworks
http://www.slideshare.net/HadoopSummit/file-format-
benchmark-avro-json-orc-parquet

MITB B.5 Big Data Technologies


CASE STUDY 2

Benchmarking Apache Parquet

Courtesy Cloudera
http://blog.cloudera.com/blog/2016/04/benchmarki
ng-apache-parquet-the-allstate-experience/

MITB B.5 Big Data Technologies


Hbase

MITB B.5 Big Data Technologies


Row Oriented Vs. Column Oriented
Example table:
  Row 1: 1, 2, 3    Row 2: 4, 5, 6    Row 3: 7, 8, 9    Row 4: 10, 11, 12

Row-oriented layout (values stored row by row):
  1 2 3 | 4 5 6 | 7 8 9 | 10 11 12

Column-oriented layout (within each row split, values stored column by column):
  Split 1 (Rows 1–2): 1 4 | 2 5 | 3 6
  Split 2 (Rows 3–4): 7 10 | 8 11 | 9 12
MITB B.5 Big Data Technologies
Row Oriented Vs. Column Oriented

• OLAP workloads benefit more from column-oriented structures
- Faster access
- Better compression
- Faster parallel processing
• Trades disk I/O for CPU – aggressive encoding and compression reduce I/O at the cost of extra CPU cycles
• Sorting and cardinality determine the encoding
• Operates on encoded data, decoding as late as possible

MITB B.5 Big Data Technologies


Apache HBase

Apache HBase is an open source, horizontally scalable, low latency, random access data store built on top of Apache Hadoop.
MITB B.5 Big Data Technologies
Overview

• Provides real-time, random read and write access to tables meant to store billions
of rows and millions of columns
• Designed to run on commodity hardware and scale horizontally while retaining
performance
• Fault tolerant leveraging HDFS’s redundancy
‒ When servers fail, data is automatically re-balanced over the remaining
servers
• Strong consistency model – changes are visible to all other clients
• Modeled after Google’s BigTable

MITB B.5 Big Data Technologies


Hbase Data Model

MITB B.5 Big Data Technologies


Hbase Data Model
It's a sparse, distributed, persistent, multidimensional sorted map, which is indexed by
• a row key,
• a column key,
• and a timestamp

• It's a key-value store
• A column-family-oriented database
• A database storing a versioned map of maps

MITB B.5 Big Data Technologies


Hbase Data Model
• Tables
‒ Data is stored in tables
‒ Table names are strings, composed of characters that are safe for use in a file system path
• Rows
‒ Within a table, data is stored according to its row
‒ Rows are identified uniquely by their row key; row keys do not have a data type and are always treated as a byte array
‒ Row keys are the equivalent of primary keys in an RDBMS
‒ You cannot change a row key once the table has been set up

MITB B.5 Big Data Technologies


Hbase Data Model
• Column Families
‒ Data within a row is grouped by column family
‒ CFs impact the physical arrangement of data stored in HBase
‒ Defined upfront and not easily modified
‒ Every row has the same set of CFs, although a row need not store data in all of them
‒ Columns within a family are stored together
• Column Qualifier
• Data within a CF is addressed via its column qualifier (or column)
• Columns need not be specified in advance
• Columns need not be consistent between rows
• Like row keys, columns don't have a data type and are always treated as a byte array
MITB B.5 Big Data Technologies
Hbase Data Model
• Cell
‒ Data is stored in cells which are identified by row-key x column-family x column
‒ Cell’s content is also an array of bytes
• Timestamp
‒ Values within a cell are versioned
‒ Versions are identified by their version number, which by default is the
timestamp of when the cell was written
‒ The number of cell versions retained by HBase is configurable for each column family (default is 3)
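A minimal HBase client sketch of this model (the table name, column family and qualifier are hypothetical): a Put addresses a cell by row key, column family and qualifier, and a Get returns only the latest version by default.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write: row key -> column family -> qualifier -> (timestamped) value
      Put put = new Put(Bytes.toBytes("user#1001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Singapore"));
      table.put(put);

      // Read: by default only the latest version of the cell is returned
      Get get = new Get(Bytes.toBytes("user#1001"));
      Result result = table.get(get);
      byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
      System.out.println(Bytes.toString(city));
    }
  }
}
```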

MITB B.5 Big Data Technologies


Hbase Data Model

MITB B.5 Big Data Technologies


Hbase Data Model
[Diagram: Row Key 1 → Column Family 1 → Columns 1–3, each column holding values at Time Stamps 1–3]

• The row key maps to a list of column families, which map to a list of column
qualifiers, which map to a list of timestamps, each of which map to a value, i.e., the
cell itself

• Hbase returns only the latest version by default

MITB B.5 Big Data Technologies


HBase Data Model

• This representation of the HBase data model leads to it being called a key-value store
• The key is formed by [row key, column family, column qualifier, timestamp]
• The value is the contents of the cell
MITB B.5 Big Data Technologies
Hbase Architecture

MITB B.5 Big Data Technologies


Key Components

• Two Key Hbase Daemons

- Hbase Master

- Region Servers

• ZooKeeper

• HDFS

MITB B.5 Big Data Technologies


Region Servers
[Diagram: two Region Servers, each hosting regions bounded by a start key and an end key]

• HBase Tables are divided horizontally by row key range into “Regions”
• A region contains all rows in the table between the region’s start key and end key
• Regions are assigned to the nodes in the cluster, called “Region Servers”
• Region Servers are co-located with the HDFS DataNodes, which enables data locality
• Region servers are responsible for all read and write requests for all regions they serve
• Clients communicate directly with them to handle all data-related operations
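A hedged sketch using the HBase 2.x client API (table name, column family and row-key values are assumptions): a Scan over a contiguous row-key range, which the client routes to the region(s) whose [start key, end key) range covers those rows.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScanSketch {
  public static void main(String[] args) throws Exception {
    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Scan a contiguous row-key range, restricted to one column family
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes("user#1000"))
          .withStopRow(Bytes.toBytes("user#2000"))
          .addFamily(Bytes.toBytes("info"));

      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
    }
  }
}
```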
MITB B.5 Big Data Technologies
HBase Master

[Diagram: the HMaster coordinates the Region Servers; each Region Server hosts regions bounded by start and end keys]

• Coordinating the region servers


‒ Assigning regions on startup, re-assigning regions for recovery or load balancing
‒ Monitoring all RegionServer instances in the cluster (listens for notifications from ZooKeeper)
• Admin functions
‒ Interface for creating, deleting, updating tables

MITB B.5 Big Data Technologies


ZooKeeper

[Diagram: the Region Servers and the HMaster send heartbeats to ZooKeeper, which tracks which servers are alive]

• Distributed coordination service to maintain server state in the cluster

• Zookeeper maintains which servers are alive and available, and provides server failure notification

• Zookeeper uses consensus to guarantee common shared state


MITB B.5 Big Data Technologies
Region Servers
[Diagram: a Region Server with a BlockCache, a WAL, and per-region MemStores and HFiles on HDFS]

1. WAL (Write-Ahead Log)
   • Stored on HDFS
   • Used to store new data that hasn't yet been persisted to permanent storage
   • Used for recovery in the case of failure
2. BlockCache (read cache)
   • Stores frequently read data in memory
   • Least recently used data is evicted when the cache is full
3. MemStore (write cache)
   • Stores new data which has not yet been written to disk
   • Sorted before writing to disk
   • One MemStore per column family per region
4. HFiles
   • Store the rows as sorted KeyValues on disk

MITB B.5 Big Data Technologies


Meta Table
META table (row key → value): [region start key, region id] → Region Server

• An HBase table that keeps a list of all regions in the system
• META table structure
  • Key: region start key, region id
  • Value: the RegionServer hosting that region
• -ROOT- stores the location of the META table

MITB B.5 Big Data Technologies


Auto Sharding and Distribution
• Unit of scalability in Hbase is the Region

• Sorted, contiguous range of rows

• Spread "randomly" across RegionServers

• Moved around for load balancing and failover

• Split automatically or manually to scale with growing data

• Capacity is solely a factor of cluster nodes vs. regions per node

MITB B.5 Big Data Technologies


Hbase Read / Write

MITB B.5 Big Data Technologies


First Read / Write
ZooKeeper stores the location of the META table.

1. The client gets the Region Server that hosts the META table from ZooKeeper.
2. The client queries the .META. server to get the Region Server corresponding to the row key it wants to access.
3. The client caches this information along with the META table location.
4. The client gets the row from the corresponding Region Server; subsequent read/write operations go directly to that Region Server.

MITB B.5 Big Data Technologies


Write Steps
Step 1 – Client issues a PUT request
Step 2 – Data is written to the WAL
• Edits are appended to the end of the WAL file on disk
• Updates are then written to the MemStore, and the PUT acknowledgement is returned to the client

MITB B.5 Big Data Technologies


Write Steps
Step 3 – Data is written to the MemStore
• The MemStore stores data as key-value pairs, one MemStore per column family
• Updates are sorted per column family

Step 4 – The MemStore flushes to an HFile on HDFS
• The sorted set is written to a new HFile in HDFS
• When one MemStore is full, all MemStores for the region flush
• The last written sequence number is saved

MITB B.5 Big Data Technologies


HFile
[Diagram: HFile layout – data blocks, meta blocks, file info, data index, meta index, trailer; each data block holds a magic header followed by key-value pairs]

• Each HFile contains a variable number of data blocks, plus fixed blocks for the file info and trailer
• Index blocks record the offsets of the data and meta blocks
• Each data block contains a magic header and serialized KeyValue instances
• When an HFile is loaded, the index contained in the file is opened and kept in memory, which allows lookups to be performed with a single disk seek
MITB B.5 Big Data Technologies
HFile
KeyValue layout: Key Length | Value Length | Row Key Length | Row Key | CF Length | CF Name | Column Qualifier | Time Stamp | Key Type | Value

• Each KeyValue in the HFile is a low-level byte array
• Two fixed-length numbers at the start are used for offsetting into the array: the key length and the value length
• The key holds the row key, CF name, column qualifier, timestamp and key type

MITB B.5 Big Data Technologies


Hbase Read Merge
• Read Amplification: KeyValue cells corresponding to one row can be in multiple places,
‒ row cells already persisted are in Hfiles,
‒ recently updated cells are in the MemStore,
‒ and recently read cells are in the Block cache
• A read merges KeyValues from these three places as follows:
‒ First looks for the row cells in the BlockCache (the read cache)
‒ Then looks in the MemStore (the write cache)
‒ Then uses BlockCache indexes and bloom filters to load the relevant HFiles into memory

MITB B.5 Big Data Technologies


Overall View of Services and Daemons
Master Nodes (coordinate the cluster)
• HDFS: Active NameNode and Standby NameNode
• YARN: ResourceManager
• HBase: HBase Master
• ZooKeeper: ZooKeeper quorum

Slave Nodes (perform the work on the cluster)
• HDFS: DataNode on each node
• YARN: NodeManager on each node
• HBase: RegionServer on each node

MITB B.5 Big Data Technologies


Design Considerations

MITB B.5 Big Data Technologies


Key Table Properties
- Indexing is only done based on the key
- Tables are stored sorted based on row key – each region in the table is
responsible for a part of the row key space and is identified by the start and end
row key
- Everything in Hbase tables is stored as a byte array – there are no types
- Atomicity is guaranteed only at row level. There is no atomicity guarantee across
rows, which means that there are no multi-row transactions
- Column families have to be defined up front at table creation time
- Column qualifiers are dynamic and can be defined at write time
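A sketch of declaring column families at table-creation time with the HBase 2.x Admin API; the table name "orders", the short family name "d" and the version count are illustrative assumptions.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTableSketch {
  public static void main(String[] args) throws Exception {
    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = connection.getAdmin()) {

      // Column families must be declared at creation time; qualifiers are added at write time
      TableDescriptor table = TableDescriptorBuilder
          .newBuilder(TableName.valueOf("orders"))
          .setColumnFamily(ColumnFamilyDescriptorBuilder
              .newBuilder(Bytes.toBytes("d"))   // short CF name keeps the storage footprint small
              .setMaxVersions(3)                // number of cell versions to retain
              .build())
          .build();

      admin.createTable(table);
    }
  }
}
```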

MITB B.5 Big Data Technologies


Design Considerations
- What should be the row key structure and what should it contain
- How many column families should the table have
- What data goes into what column family
- How many columns are there in each column family
- What should the column names be
- What information should go into the cells
- How many versions should be stored for each cell

MITB B.5 Big Data Technologies


Recommendations
- Design for use case
- Read, Write, Both?
- Row keys are the single most important aspect of Hbase table design –
understand access patterns before designing
- Store everything with similar access patterns in the same column family
- Hashing allows for fixed-length keys and better distribution, but takes away the ordering implied by using strings as keys
- The length of column qualifiers impacts the storage footprint – be concise
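Two common row-key techniques mentioned above, sketched with assumed key formats: full hashing for fixed-length, well-distributed keys, and salting to spread sequential keys over a few buckets while keeping partial scannability.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeySketch {
  // Fixed-length hashed key: even distribution across regions,
  // but range scans over the natural key ordering are lost.
  static byte[] hashedKey(String naturalKey) throws NoSuchAlgorithmException {
    return MessageDigest.getInstance("MD5")
        .digest(naturalKey.getBytes(StandardCharsets.UTF_8));   // always 16 bytes
  }

  // Salted key: a small prefix derived from the key spreads sequential keys
  // (e.g. timestamps) over N buckets, mitigating region hotspotting.
  static String saltedKey(String naturalKey, int buckets) {
    int salt = (naturalKey.hashCode() & Integer.MAX_VALUE) % buckets;
    return String.format("%02d|%s", salt, naturalKey);
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    System.out.println(hashedKey("user#1001|2016-08-21").length);           // 16
    System.out.println(saltedKey("2016-08-21T00:12:14|user#1001", 8));      // e.g. "05|2016-08-21T..."
  }
}
```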

MITB B.5 Big Data Technologies


Areas for further study
• HBase design
  - Key design
  - Tall-narrow vs. flat-wide tables
• Performance optimization
  - Compaction (minor and major), splits
  - Hashing, sequential keys and salting
  - Avoiding hotspotting
  - BlockCache and MemStore sizes
  - Bloom filters
  - Write amplification
• HBase client APIs
  - CRUD operations
  - Batch operations
  - Scans
  - Filters, counters and coprocessors
MITB B.5 Big Data Technologies
Critique

MITB B.5 Big Data Technologies


Benefits
- Strong Consistency Model
- Scales automatically
o Regions split when data grows too large
o Uses HDFS to spread and replicate data
- Built-in recovery
o Using Write-ahead Log
- Integrated with Hadoop
o MR on Hbase is straightforward

MITB B.5 Big Data Technologies


What it is good for
• Large Data Sets (TB/PB)
• Sparse Data Sets
• Loosely coupled (denormalized) records
• Lots of concurrent clients
• High Throughput
• Great for variable schema
‒ Rows may drastically differ
‒ If your schema has many columns and most of them are null
• Random reads and Writes

MITB B.5 Big Data Technologies


When it may not be the best fit
• Small datasets (unless you have lots of them)
• Highly relational records
• Schema designs requiring transactions
• Mixed workloads need careful evaluation

MITB B.5 Big Data Technologies


Use Cases

MITB B.5 Big Data Technologies


Pinterest’s deployment of Hbase
• Pinterest – an online pinboard where you "curate" and "discover" things you love and go do them in real life
• A "follower" follows a "followee"
• The "following feed" for a user aggregates content from all of that user's followees and is presented each time the user accesses their page on pinterest.com

[Diagram: a new pin fanned out to Followers 1–3]

• Hundreds of millions of pins/repins per month
• High fanout – billions of writes per day (high throughput)
• Billions of requests per month (low latency and high availability)

MITB B.5 Big Data Technologies


Pinterest’s deployment of Hbase
Requirements
- Read volume comparatively lower than write volume, but low read latency is important
- High throughput
- High volume of random writes

Options considered
- B-tree based systems: MySQL
- In-memory databases: Redis
- Log-Structured Merge (LSM) trees: Google BigTable, HBase, Cassandra

HBase advantages
- High throughput through its column-oriented structure and WAL
- Ability to manage sharding and machine failures automatically
- Seamless integration with Hadoop and Hive
- High user adoption (Facebook already uses HBase for messages)
- Atomicity (transaction support at the row level)
- Locality (a user's data lies within the same region and on the same region server)

MITB B.5 Big Data Technologies


Pinterest’s deployment of Hbase
Two key operations
‒ Update: as users follow/unfollow other users and boards, and as new pins are written, asynchronous tasks are enqueued from the front end onto a message bus and applied to the follow store, pin store and feed store
‒ Serving / retrieval: the following feed is retrieved whenever an authenticated user accesses pinterest.com

• The feed store stores only the identifiers associated with the pins in the following feed
• The stores sit on HBase, accessed through a Thrift + Finagle layer

[Diagram: front end → async task enqueue → message bus → follow store / pin store / feed store, with HBase behind a Thrift + Finagle layer]

MITB B.5 Big Data Technologies


Pinterest’s deployment of Hbase
• A sharded MySQL database provides persistent storage of the core entities:
- Pins (image URL, title, description)
- Boards
- Users

• To serve a feed, the front end retrieves the pin IDs from the feed store (HBase, via the Thrift + Finagle layer) and then retrieves the pin metadata from MySQL

[Diagram: front end → feed store (retrieve pin IDs) and sharded MySQL (retrieve pin metadata)]

MITB B.5 Big Data Technologies


Hbase at AirBnb – Realtime Ingestion

MITB B.5 Big Data Technologies


Hbase at Visa – Mobile Offerings Application
• Scalable and real-time transaction history service
• Migrated prominent mobile wallet offerings to this service
• Learnings
  • Kerberos
  • Availability
  • Handling client exceptions
MITB B.5 Big Data Technologies
Hbase at Xiaomi – Offline Querying

• Input data is replicated from the online to the offline cluster
• Output data is written to the offline cluster and replicated to the online cluster
• Both batch and stream processing
MITB B.5 Big Data Technologies
Hbase at Alibaba Search

• Core storage in the Alibaba search system since 2010
• 3 clusters, each with 1,000+ nodes

MITB B.5 Big Data Technologies


References

MITB B.5 Big Data Technologies


References
Data Modeling
Text: Hadoop Application Architectures
Mark Grover, Ted Malaska, Jonathan Seidman, Gwen Shapira; O'Reilly Media Inc.
Paper: Dremel: Interactive Analysis of Web Scale Datasets
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf

Hbase
Hbase Reference Guide
https://hbase.apache.org/book.html

Paper: Introduction to Basic Schema Design by Amandeep Khurana


http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf

Article: An In-depth Look at Hbase Architecture


https://www.mapr.com/blog/in-depth-look-hbase-architecture

MITB B.5 Big Data Technologies


References
Case Studies
AirBnb
http://www.slideshare.net/HBaseCon/apache-hbase-at-airbnb

Facebook
http://www.slideshare.net/cloudera/h-base-in-production-at-facebook-jonthan-grayfacebookfinal

Xiaomi
http://www.slideshare.net/HBaseCon/apache-hbase-improvements-and-practices-at-xiaomi

Alibaba Search
http://www.slideshare.net/HBaseCon/improvements-to-apache-hbase-and-its-applications-in-alibaba-search

Pinterest
http://www.slideshare.net/AbhiKhune/scaling-deep-social-feeds-at-pinterest

Visa
http://www.slideshare.net/HBaseCon/rolling-out-apache-hbase-for-mobile-offerings-at-visa
MITB B.5 Big Data Technologies
