
MITB B.5 Big Data Technologies


Reference Notes

Chapter 3 – Data Storage


Agenda Today

• Data Management
‒ Storage formats
‒ Compression
‒ Schema Design
• Storage Systems – Hbase
• Lab Environment walk through

MITB B.5 Big Data Technologies


Data Management

MITB B.5 Big Data Technologies


Data Modelling in Hadoop
Key characteristics of Hadoop
- Distributed data store
- Platform for implementing powerful parallel processing frameworks

Reliability of the distributed data store + flexibility in processing = Data Hub
- Store massive amounts of data "as-is"
- No constraints on how the data is processed

MITB B.5 Big Data Technologies


Data Modelling in Hadoop

Key Choices
1. Data Storage Formats (what file format to store data in)
2. Compression
3. Schema Design (organization of the files in the filesystem)
4. Meta-Data Management
5. Security

MITB B.5 Big Data Technologies


1. Data Storage
Formats

MITB B.5 Big Data Technologies


Different Types of Files
Hadoop ecosystems in general have to store, manage and process different types of file formats and file
types

• Server log files on a central web server – typically key-value log files in binary format; lots of small files
• Lucene search indexes in HDFS – inverted index data structure optimized for full-text search
• Social media data – typically made available by APIs in XML or JSON format
• Transaction data – structured data residing in relational databases

MITB B.5 Big Data Technologies


File Formats – High Level View
Standard File Formats
• Text (CSV, TSV)
• Structured text (XML, JSON)
• Binary (e.g., images)

Hadoop-Specific File Formats
• File-based data structures: SequenceFile
• Serialization formats: Avro, Thrift, Protocol Buffers
• Columnar formats: RCFile, ORC, Parquet

MITB B.5 Big Data Technologies


Standard File Formats – Text
• Text = CSV, TSV, XML, JSON records
• Convenient format for exchange with applications and scripts
• Human readable and parseable
• Data store is bulky and not as efficient to query
‒ Text storage space > Integer storage space
‒ Type conversion overhead
• Hard to express complex, nested data structures
• Removing or adding fields is tricky

MITB B.5 Big Data Technologies


Hadoop File Formats – Sequence Files
- Persistent data structure for storing data as binary key-value pairs
- Row-based storage
- Commonly used to transfer data between MR jobs
- Can be used as an archive to pack small files in Hadoop
- Common header (Meta-data – compression codec, class names, user defined meta-data)
- Supports splittability via Sync Markers that are written into the body of the file to allow for seeking
to random points in the file
- Other File Based data structures like MapFiles, SetFiles, ArrayFiles, and BloomMapFiles also
available

MITB B.5 Big Data Technologies


Hadoop File Formats – Sequence Files

Record Compressed
- Compresses each record as it is added to the file

Block Compressed
- Waits until data reaches block size to compress, rather than compressing as each record is added
- Block compression provides better compression ratios
- Here, "block" refers to a group of records that are compressed together within a single HDFS block

MITB B.5 Big Data Technologies


Hadoop File Formats – Sequence Files
Record Compression
• Common header followed by one or more records
• Metadata in the header: compression codec, key and value class names, user-defined metadata
• Uncompressed record layout: record length, key length, key, value
• Compressed record layout: record length, key length, key, compressed value
• Randomly generated sync markers between records – key to facilitating splittability

Block Compression
• Compresses multiple records at once – more compact
• Block size defined in io.seqfile.compress.blocksize
• Each block: number of records, compressed key lengths, compressed keys, compressed value lengths, compressed values
• Sync marker before the start of every block

MITB B.5 Big Data Technologies


Hadoop File Formats – Sequence Files
• A common use case for SequenceFiles is as a container for smaller files.
• Storing a large number of small files in Hadoop can cause a couple of issues.
• One is excessive memory use for the NameNode, because metadata for each file stored in
HDFS is held in memory.
• Another potential issue is in processing data in these files—many small files can lead to many
processing tasks, causing excessive overhead in processing.
• Because Hadoop is optimized for large files, packing smaller files into a SequenceFile makes the
storage and processing of these files much more efficient.
• Limited support outside Hadoop and only supported in Java
• SequenceFiles are append only
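As a minimal sketch of this small-file packing pattern (not taken from the slides): a block-compressed SequenceFile whose keys are the original file names and whose values are the raw file bytes. The class name, paths and key/value type choices are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // directory of small files (assumed argument)
    Path output = new Path(args[1]);     // packed SequenceFile (assumed argument)

    // Block-compressed SequenceFile: key = original file name, value = file bytes
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(output),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(CompressionType.BLOCK))) {
      for (FileStatus status : fs.listStatus(inputDir)) {
        byte[] contents = new byte[(int) status.getLen()];
        try (FSDataInputStream in = fs.open(status.getPath())) {
          IOUtils.readFully(in, contents, 0, contents.length);
        }
        // Append one key-value pair per small file
        writer.append(new Text(status.getPath().getName()), new BytesWritable(contents));
      }
    }
  }
}
```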

MITB B.5 Big Data Technologies


Hadoop File Formats – Sequence Files
server_log_21-aug-2016 server_log_22-aug-2016
Key Value Key Value

IP 121.121.121.111 IP 121.122.122.122

Timestamp [21/Aug/2016:00:12:14 -0500] Timestamp [22/Aug/2016:00:20:48 -0300]

Request GET /download/windows/ Request GET /a12345f/ HTTP/1.0


a12345.zip HTTP/1.0
Status 200
Status 200
… …
… …

SequenceFile

Key Value
21-Aug-2016 server_log_21-aug-2016

22-Aug-2016 server_log_22-aug-2016

… …

… …

MITB B.5 Big Data Technologies


Hadoop File Formats – Serialization Formats
Serialization is the process of
- Turning structured objects into a byte stream
- For transmission over a network or
- For writing to persistent storage

Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

Core to Hadoop, since


- It allows data to be converted into a format that can be efficiently stored
- And efficiently transferred across a network connection.
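A small illustration of this round trip using Hadoop's built-in Writable serialization (an assumed example, not from the slides): an IntWritable is turned into a 4-byte stream and restored from it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class SerializationSketch {
  public static void main(String[] args) throws IOException {
    // Serialization: turn a structured object into a byte stream
    IntWritable value = new IntWritable(163);
    ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
    value.write(new DataOutputStream(bytesOut));
    byte[] bytes = bytesOut.toByteArray();   // 4 bytes for an int

    // Deserialization: turn the byte stream back into an object
    IntWritable restored = new IntWritable();
    restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
    System.out.println(bytes.length + " bytes -> " + restored.get());   // "4 bytes -> 163"
  }
}
```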

MITB B.5 Big Data Technologies


Hadoop File Formats – Serialization Formats
Requirements for a serialization format, from two perspectives:

• Compact – RPC: make best use of network bandwidth, the scarcest resource in a data center; Storage: make efficient use of storage space
• Fast – RPC: the backbone of a distributed system, so the SerDe process should add as little performance overhead as possible; Storage: minimal overhead when reading or writing terabytes of data
• Extensible – RPC: evolve the protocol in a controlled manner for clients and servers; Storage: transparently read data written in an older format
• Interoperable – RPC: support clients written in different languages from the server; Storage: read or write persistent data using different languages

MITB B.5 Big Data Technologies


Hadoop File Formats – Thrift and Protocol Buffers
• Developed at Facebook (Thrift) and Google (Protocol Buffers) as frameworks for implementing cross-language interfaces to services
• Uses an Interface Definition Language (IDL) to define interfaces
• Easy to express rich data structures
• Support schema evolution
• Although sometimes used for data serialization with Hadoop, Thrift has several drawbacks:
‒ Does not support internal compression of records
‒ Not splittable
‒ Lacks native MapReduce support
• There are externally available libraries such as the Elephant Bird project to address these drawbacks

MITB B.5 Big Data Technologies


Hadoop File Formats – Apache Avro
• Language-neutral data serialization system designed to address the major downside of Hadoop
Writables – lack of language portability
• Row-based, offers a compact and fast binary format
• Supports rich data structures
• Like Thrift and ProtocolBuffers, Avro data is described through a language independent schema
• Self-describing files – stores schema in the header of each file
‒ Avro files can be easily read later even from a different language than the one used to write the
file
• Native support in MapReduce – compressible and splittable Avro data files
• Integrates with many programming languages

MITB B.5 Big Data Technologies


Hadoop File Formats – Avro
• Avro schemas are usually written in JSON, but may also be written in Avro IDL, which is a C-like
language
‒ Describe the field types and (optionally) their default and allowed values
• Schema is stored together with the data
• The file header contains a unique sync marker.
• Just as with SequenceFiles, this sync marker is used to separate blocks in the file, allowing Avro
files to be splittable
• Blocks can optionally be compressed, and within those blocks, types are stored in their native
format, providing an additional boost to compression.
• Avro supports Snappy and Deflate compression
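A hedged sketch of writing an Avro data file in Java under these assumptions: a hypothetical LogEvent schema defined inline as JSON, Snappy block compression, and an output file name chosen purely for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
  // Hypothetical schema with a default value, written in JSON
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
      + "{\"name\":\"ip\",\"type\":\"string\"},"
      + "{\"name\":\"status\",\"type\":\"int\",\"default\":200}]}";

  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    GenericRecord event = new GenericData.Record(schema);
    event.put("ip", "121.121.121.111");
    event.put("status", 200);

    // The schema is written into the file header, so the file is self-describing
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.snappyCodec());   // block-level Snappy compression
      writer.create(schema, new File("events.avro"));
      writer.append(event);
    }
  }
}
```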

MITB B.5 Big Data Technologies


Row Oriented Vs. Column Oriented
Example table:
  Row 1: 1, 2, 3    Row 2: 4, 5, 6    Row 3: 7, 8, 9    Row 4: 10, 11, 12

Row-oriented layout (values stored row by row):
  1 2 3 | 4 5 6 | 7 8 9 | 10 11 12

Column-oriented layout (within each row split, values stored column by column):
  Split 1 (Rows 1–2): 1 4 | 2 5 | 3 6
  Split 2 (Rows 3–4): 7 10 | 8 11 | 9 12
MITB B.5 Big Data Technologies
Hadoop File Formats – Columnar Formats
Advantages of Columnar Formats
• Skips I/O and decompression (if applicable) on columns that are not a part of the query
• Works well for queries that only access a small subset of columns. If many columns are being
accessed, then row-oriented is generally preferable
• Generally very efficient in terms of compression on columns because entropy within a column is
lower than entropy within a block of rows
• Well suited for data-warehousing-type applications where users want to aggregate certain columns
over a large collection of records

MITB B.5 Big Data Technologies


Hadoop File Formats – Columnar Formats (RCFile)
• Developed to provide fast data loading, fast query processing and highly efficient storage space
utilization
• RCFile format breaks file into row splits, then within each split uses column oriented storage
• Advantages
- Query performance
- Compression performance
• Disadvantages
- Less query optimization than later columnar formats such as ORC and Parquet

MITB B.5 Big Data Technologies


Hadoop File Formats – Columnar Formats (ORC File)
• Created to address some of the shortcomings of the RCFile
• Breaks file into row splits, then within each split uses column oriented storage
• Advantage
- Lightweight, always on compression
- Supports type primitives like decimal and complex types
- Splittable storage format
• Disadvantages
- Designed specifically for Hive, so not a general-purpose storage format for non-Hive tools such as Pig or Impala

MITB B.5 Big Data Technologies


Hadoop File Formats – Columnar Formats (Parquet)
• General purpose storage format for Hadoop – suitable for MapReduce interfaces (Java, Hive, Pig)
and others (Impala, Spark)
Advantages
• Efficient compression – can be specified on a per-column basis
• Supports complex nested data structures
• Stores full metadata at the end of files, so Parquet files are self-documenting
• Splittable storage format
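As an illustrative sketch only (assuming the parquet-avro bindings are on the classpath), a Parquet file can be written through an Avro schema with per-column Snappy-compressed pages; the Order schema and file name are hypothetical.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical record schema for an order
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"amount\",\"type\":\"double\"}]}");

    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("orders.parquet"))
                 .withSchema(schema)
                 .withCompressionCodec(CompressionCodecName.SNAPPY)   // columnar pages compressed
                 .build()) {
      GenericRecord order = new GenericData.Record(schema);
      order.put("id", 1L);
      order.put("amount", 42.5d);
      writer.write(order);
    }
  }
}
```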

MITB B.5 Big Data Technologies


Design Considerations
• Code generation – libraries with code generation support (rich object creation, type safety)
• Transparent compression – internally compress and decompress data on reads and writes
• Schema evolution – support data model evolution: add or modify attributes while providing backward and forward compatibility
• Splittability – supporting multiple parallel readers; synchronization markers enable random seek and scan
• Language support – access data in more than one programming language
• Support in MapReduce and the Hadoop ecosystem – file format integration natively supported rather than requiring custom code

MITB B.5 Big Data Technologies


Design Considerations
How is the raw data structured?
• Text or CSV format, stored as such
- Easy to understand and troubleshoot
- System performance can be poor due to inefficient parsing
• JSON or XML format, stored as such
- Splittability could be an issue, which may impact processing times

MITB B.5 Big Data Technologies


Design Considerations
What does the processing pipeline consist of?
• Lot of MR jobs
- SequenceFile can be very efficient
• Lot of interactive querying
- Columnar formats can be more efficient (Parquet)
How many columns are stored and used for analysis
• Query a few data columns on a wide table (many stored, few analyzed) –
Parquet is a good choice
• Full scan - Avro

MITB B.5 Big Data Technologies


Design Considerations
Does the data change over time?
- Text files do not explicitly store a schema
- Parquet allows addition of new columns at the end of the column list (doesn't handle deletion of columns)
- Avro allows addition, deletion and renaming of multiple columns

MITB B.5 Big Data Technologies


Design Considerations
Speed Considerations
‒ Parquet and ORC usually need some additional parsing to format the data, which increases the overall time
‒ For adding large amounts of data to HDFS quickly, SequenceFiles are a good choice
‒ Compressing a file, regardless of format, increases query speed
‒ Parquet and ORC optimize read performance at the expense of write performance

MITB B.5 Big Data Technologies


2. Compression

MITB B.5 Big Data Technologies


Compression
• Compression helps reduce overall processing time as well as storage space
requirements
• Compression adds CPU load, but typically offset by savings in I/O
• All compression algorithms exhibit a space/time trade-off
• Different tools have very different compression characteristics
• Not all compression formats are splittable
‒ Impediment to efficient processing, since MR splits data to multiple tasks

MITB B.5 Big Data Technologies


Compression and Input Splits

Uncompressed file of 1 GB, HDFS block size 128 MB
- 8 blocks created and stored
- The MapReduce job creates 8 map tasks, one per input split

File of 1 GB compressed with gzip, HDFS block size 128 MB
- 8 blocks created and stored
- The MapReduce job creates a single map task for all input splits, because gzip is not splittable

MITB B.5 Big Data Technologies


Compression and Input Splits
Compression Format | Algorithm | Filename Extension | Splittable
DEFLATE            | DEFLATE   | .deflate           | No
gzip               | DEFLATE   | .gz                | No
bzip2              | bzip2     | .bz2               | Yes
LZO                | LZO       | .lzo               | No*
LZ4                | LZ4       | .lz4               | No
Snappy             | Snappy    | .snappy            | No

* LZO files are splittable if they have been indexed in a pre-processing step

MITB B.5 Big Data Technologies


Compression – Recommendations
- Enable compression of MR intermediate output – improve performance by
decreasing the amount of intermediate data
- Ordering of data can provide better compression levels
- A compact file format with support for splittable compression such as Avro is
often a good choice
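A minimal sketch of how these recommendations translate into MapReduce job configuration, assuming Snappy is available on the cluster; property names follow the Hadoop 2 (mapreduce.*) naming, and the output path is a placeholder argument.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress intermediate (map) output to cut shuffle I/O
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed-job");

    // Compress the final job output as well
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
  }
}
```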

MITB B.5 Big Data Technologies


3. Schema Design

MITB B.5 Big Data Technologies


Schema Design
• Hadoop imposes no schema requirements at ingest time, thanks to schema-on-read
• Given Hadoop’s role as an Enterprise Data Hub, a structured and organized
repository is desirable
• Benefits
• Easier to share data between teams
• Enforcing access and quota controls
• Code re-use possible
• Comply with standard tool assumptions on data availability

MITB B.5 Big Data Technologies


Schema Design Considerations
• Dependent on the use case
• In general,
• Develop standard practices and enforce them
• Design for compliance with the tools intended to process the data
• Keep usage patterns in mind
• Location of HDFS files
• /user/<username>: Data, JARs and configuration files
• /etl : Data in various stages of being processed by an ETL workflow
• /tmp: Temporary data generated by tools or shared between users
• /data: Data sets that are shared across organization
• /metadata: Metadata files
• /app: jar files, Oozie workflow definitions, HiveQL file etc.

MITB B.5 Big Data Technologies


Schema Design Considerations
PARTITIONING

• Technique to reduce amount of I/O required to process a data set

• Break up the data set into smaller sets, each subset being called a partition

• Each partition would be in a sub-directory of the directory containing the entire data set

• Example (see the sketch after this list)

• <data set name>/<partition_column_name=partition_column_value>/{files}

• This translates to: orders/date=20131101/{order1.csv, order2.csv}
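A small sketch (with assumed paths) of writing a file into such a partition directory with the HDFS FileSystem API; the /data/orders location and file contents are illustrative only.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionedWriteSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Partition column embedded in the directory name: orders/date=20131101/order1.csv
    String date = LocalDate.of(2013, 11, 1).format(DateTimeFormatter.BASIC_ISO_DATE);
    Path partitionDir = new Path("/data/orders/date=" + date);
    fs.mkdirs(partitionDir);

    try (FSDataOutputStream out = fs.create(new Path(partitionDir, "order1.csv"))) {
      out.writeBytes("1001,2013-11-01,42.50\n");   // one example order row
    }
  }
}
```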

MITB B.5 Big Data Technologies


Schema Design Considerations
BUCKETING

• Technique for decomposing large data sets into more manageable sub sets

• Further decomposing partitioned data sets by a particular field or column

• Partitioning by the wrong column can result in the small file problem

• Trade-off when choosing the number of files – small files are inefficient to manage, while large files cause long scan times and slow down queries
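A sketch of the underlying idea rather than Hive's exact hashing: rows are assigned to a fixed number of buckets by hashing the bucketing column, so all rows for a given value always land in the same bucket file. The column values and bucket count are assumptions.

```java
public class BucketingSketch {
  // Bucket assignment: hash the bucketing column and take the modulus of the bucket count
  static int bucketFor(String customerId, int numBuckets) {
    return (customerId.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    // A row for customer "C-1042" in a data set with 32 buckets always lands in the same bucket
    System.out.println("bucket = " + bucketFor("C-1042", 32));
  }
}
```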

MITB B.5 Big Data Technologies


CASE STUDY

File Format Benchmarking

Courtesy Hortonworks
http://www.slideshare.net/HadoopSummit/file-format-
benchmark-avro-json-orc-parquet

MITB B.5 Big Data Technologies


CASE STUDY 2

Benchmarking Apache Parquet

Courtesy Cloudera
http://blog.cloudera.com/blog/2016/04/benchmarki
ng-apache-parquet-the-allstate-experience/

MITB B.5 Big Data Technologies


Hbase

MITB B.5 Big Data Technologies


Row Oriented Vs. Column Oriented
Example table:
  Row 1: 1, 2, 3    Row 2: 4, 5, 6    Row 3: 7, 8, 9    Row 4: 10, 11, 12

Row-oriented layout (values stored row by row):
  1 2 3 | 4 5 6 | 7 8 9 | 10 11 12

Column-oriented layout (within each row split, values stored column by column):
  Split 1 (Rows 1–2): 1 4 | 2 5 | 3 6
  Split 2 (Rows 3–4): 7 10 | 8 11 | 9 12
MITB B.5 Big Data Technologies
Row Oriented Vs. Column Oriented

• OLAP workloads benefit more from column-oriented structures
- Faster access
- Better compression
- Faster parallel processing
• Trades disk I/O for CPU – aggressive encoding and compression reduce I/O at the cost of extra CPU cycles
• Sorting and cardinality determine the encoding
• Operates on encoded data, decoding as late as possible

MITB B.5 Big Data Technologies


Apache HBase

Apache HBase is an open source, horizontally scalable, low latency, random access data store built on top of Apache Hadoop.
MITB B.5 Big Data Technologies
Overview

• Provides real-time, random read and write access to tables meant to store billions
of rows and millions of columns
• Designed to run on commodity hardware and scale horizontally while retaining
performance
• Fault tolerant leveraging HDFS’s redundancy
‒ When servers fail, data is automatically re-balanced over the remaining
servers
• Strong consistency model – changes are visible to all other clients
• Modeled after Google’s BigTable

MITB B.5 Big Data Technologies


Hbase Data Model

MITB B.5 Big Data Technologies


Hbase Data Model
It's a sparse, distributed, persistent, multidimensional sorted map, which is indexed by
• a row key,
• a column key,
• and a timestamp

• It's a key-value store
• A column-family-oriented database
• A database storing a versioned map of maps

MITB B.5 Big Data Technologies


Hbase Data Model
• Tables
‒ Data is stored in tables
‒ Table names are strings, composed of characters that are safe for use in a file system path
• Rows
‒ Within a table, data is stored according to its row
‒ Rows are identified uniquely by their row key; row keys do not have a data type and are always treated as a byte array
‒ Row keys are the equivalent of primary keys in an RDBMS
‒ You cannot change a row key once the table has been set up

MITB B.5 Big Data Technologies


Hbase Data Model
• Column Families
‒ Data within a row is grouped by column family
‒ CFs impact the physical arrangement of data stored in HBase
‒ Defined upfront and not easily modified
‒ Every row has the same set of CFs, although a row need not store data in all of them
‒ Columns within a family are stored together
• Column Qualifier
• Data within a CF is addressed via its column qualifier (or column)
• Columns need not be specified in advance
• Columns need not be consistent between rows
• Like row keys, columns don't have a data type and are always treated as a byte array
MITB B.5 Big Data Technologies
Hbase Data Model
• Cell
‒ Data is stored in cells which are identified by row-key x column-family x column
‒ Cell’s content is also an array of bytes
• Timestamp
‒ Values within a cell are versioned
‒ Versions are identified by their version number, which by default is the
timestamp of when the cell was written
‒ The number of cell versions retained by HBase is configurable for each column family (default is 3)
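A minimal HBase client sketch of this model (the table name, column family and qualifier are hypothetical): a Put addresses a cell by row key, column family and qualifier, and a Get returns only the latest version by default.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write: row key -> column family -> qualifier -> (timestamped) value
      Put put = new Put(Bytes.toBytes("user#1001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Singapore"));
      table.put(put);

      // Read: by default only the latest version of the cell is returned
      Get get = new Get(Bytes.toBytes("user#1001"));
      Result result = table.get(get);
      byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
      System.out.println(Bytes.toString(city));
    }
  }
}
```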

MITB B.5 Big Data Technologies


Hbase Data Model

MITB B.5 Big Data Technologies


Hbase Data Model
[Diagram: Row Key 1 → Column Family 1 → Columns 1–3, each column holding values at Time Stamps 1–3]

• The row key maps to a list of column families, which map to a list of column
qualifiers, which map to a list of timestamps, each of which map to a value, i.e., the
cell itself

• Hbase returns only the latest version by default

MITB B.5 Big Data Technologies


HBase Data Model

• This representation of the HBase data model leads to it being called a key-value store
• The key is formed by [row key, column family, column qualifier, timestamp]
• The value is the contents of the cell
MITB B.5 Big Data Technologies
Hbase Architecture

MITB B.5 Big Data Technologies


Key Components

• Two Key Hbase Daemons

- Hbase Master

- Region Servers

• ZooKeeper

• HDFS

MITB B.5 Big Data Technologies


Region Servers
[Diagram: two Region Servers, each hosting regions bounded by a start key and an end key]

• HBase Tables are divided horizontally by row key range into “Regions”
• A region contains all rows in the table between the region’s start key and end key
• Regions are assigned to the nodes in the cluster, called “Region Servers”
• Region Servers are co-located with the HDFS DataNodes, which enables data locality
• Region servers are responsible for all read and write requests for all regions they serve
• Clients communicate directly with them to handle all data-related operations
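A hedged sketch using the HBase 2.x client API (table name, column family and row-key values are assumptions): a Scan over a contiguous row-key range, which the client routes to the region(s) whose [start key, end key) range covers those rows.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScanSketch {
  public static void main(String[] args) throws Exception {
    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Scan a contiguous row-key range, restricted to one column family
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes("user#1000"))
          .withStopRow(Bytes.toBytes("user#2000"))
          .addFamily(Bytes.toBytes("info"));

      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
    }
  }
}
```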
MITB B.5 Big Data Technologies
HBase Master

[Diagram: the HMaster coordinates the Region Servers; each Region Server hosts regions bounded by start and end keys]

• Coordinating the region servers


‒ Assigning regions on startup, re-assigning regions for recovery or load balancing
‒ Monitoring all RegionServer instances in the cluster (listens for notifications from ZooKeeper)
• Admin functions
‒ Interface for creating, deleting, updating tables

MITB B.5 Big Data Technologies


ZooKeeper

[Diagram: the Region Servers and the HMaster send heartbeats to ZooKeeper, which tracks which servers are alive]

• Distributed coordination service to maintain server state in the cluster

• Zookeeper maintains which servers are alive and available, and provides server failure notification

• Zookeeper uses consensus to guarantee common shared state


MITB B.5 Big Data Technologies
Region Servers
[Diagram: a Region Server with a BlockCache, a WAL, and per-region MemStores and HFiles on HDFS]

1. WAL (Write-Ahead Log)
   • Stored on HDFS
   • Used to store new data that hasn't yet been persisted to permanent storage
   • Used for recovery in the case of failure
2. BlockCache (read cache)
   • Stores frequently read data in memory
   • Least recently used data is evicted when the cache is full
3. MemStore (write cache)
   • Stores new data which has not yet been written to disk
   • Sorted before writing to disk
   • One MemStore per column family per region
4. HFiles
   • Store the rows as sorted KeyValues on disk

MITB B.5 Big Data Technologies


Meta Table
META table (row key → value): [region start key, region id] → Region Server

• An HBase table that keeps a list of all regions in the system
• META table structure
  • Key: region start key, region id
  • Value: the RegionServer hosting that region
• -ROOT- stores the location of the META table

MITB B.5 Big Data Technologies


Auto Sharding and Distribution
• Unit of scalability in Hbase is the Region

• Sorted, contiguous range of rows

• Spread "randomly" across RegionServers

• Moved around for load balancing and failover

• Split automatically or manually to scale with growing data

• Capacity is solely a factor of cluster nodes vs. regions per node

MITB B.5 Big Data Technologies


Hbase Read / Write

MITB B.5 Big Data Technologies


First Read / Write
ZooKeeper stores the location of the META table.

1. The client gets the Region Server that hosts the META table from ZooKeeper.
2. The client queries the .META. server to get the Region Server corresponding to the row key it wants to access.
3. The client caches this information along with the META table location.
4. The client gets the row from the corresponding Region Server; subsequent read/write operations go directly to that Region Server.

MITB B.5 Big Data Technologies


Write Steps
Step 1 – Client issues a PUT request
Step 2 – Data is written to the WAL
• Edits are appended to the end of the WAL file on disk
• Updates are then written to the MemStore, and the PUT acknowledgement is returned to the client

MITB B.5 Big Data Technologies


Write Steps
Step 3 – Data is written to the MemStore
• The MemStore stores data as key-value pairs, one MemStore per column family
• Updates are sorted per column family

Step 4 – The MemStore flushes to an HFile on HDFS
• The sorted set is written to a new HFile in HDFS
• When one MemStore is full, all MemStores for the region flush
• The last written sequence number is saved

MITB B.5 Big Data Technologies


HFile
[Diagram: HFile layout – data blocks, meta blocks, file info, data index, meta index, trailer; each data block holds a magic header followed by key-value pairs]

• Each HFile contains a variable number of data blocks, plus fixed blocks for the file info and trailer
• Index blocks record the offsets of the data and meta blocks
• Each data block contains a magic header and serialized KeyValue instances
• When an HFile is loaded, the index contained in the file is opened and kept in memory, which allows lookups to be performed with a single disk seek
MITB B.5 Big Data Technologies
HFile
KeyValue layout: Key Length | Value Length | Row Key Length | Row Key | CF Length | CF Name | Column Qualifier | Time Stamp | Key Type | Value

• Each KeyValue in the HFile is a low-level byte array
• Two fixed-length numbers at the start are used for offsetting into the array: the key length and the value length
• The key holds the row key, CF name, column qualifier, timestamp and key type

MITB B.5 Big Data Technologies


Hbase Read Merge
• Read Amplification: KeyValue cells corresponding to one row can be in multiple places,
‒ row cells already persisted are in Hfiles,
‒ recently updated cells are in the MemStore,
‒ and recently read cells are in the Block cache
• A read merges KeyValues from these three places as follows:
‒ First looks for the row cells in the BlockCache (the read cache)
‒ Then looks in the MemStore (the write cache)
‒ Then uses BlockCache indexes and bloom filters to load the relevant HFiles into memory

MITB B.5 Big Data Technologies


Overall View of Services and Daemons
Master Nodes (coordinate the cluster)
• HDFS: Active NameNode and Standby NameNode
• YARN: ResourceManager
• HBase: HBase Master
• ZooKeeper: ZooKeeper quorum

Slave Nodes (perform the work on the cluster)
• HDFS: DataNode on each node
• YARN: NodeManager on each node
• HBase: RegionServer on each node

MITB B.5 Big Data Technologies


Design Considerations

MITB B.5 Big Data Technologies


Key Table Properties
- Indexing is only done based on the key
- Tables are stored sorted based on row key – each region in the table is
responsible for a part of the row key space and is identified by the start and end
row key
- Everything in Hbase tables is stored as a byte array – there are no types
- Atomicity is guaranteed only at row level. There is no atomicity guarantee across
rows, which means that there are no multi-row transactions
- Column families have to be defined up front at table creation time
- Column qualifiers are dynamic and can be defined at write time
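A sketch of declaring column families at table-creation time with the HBase 2.x Admin API; the table name "orders", the short family name "d" and the version count are illustrative assumptions.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTableSketch {
  public static void main(String[] args) throws Exception {
    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = connection.getAdmin()) {

      // Column families must be declared at creation time; qualifiers are added at write time
      TableDescriptor table = TableDescriptorBuilder
          .newBuilder(TableName.valueOf("orders"))
          .setColumnFamily(ColumnFamilyDescriptorBuilder
              .newBuilder(Bytes.toBytes("d"))   // short CF name keeps the storage footprint small
              .setMaxVersions(3)                // number of cell versions to retain
              .build())
          .build();

      admin.createTable(table);
    }
  }
}
```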

MITB B.5 Big Data Technologies


Design Considerations
- What should be the row key structure and what should it contain
- How many column families should the table have
- What data goes into what column family
- How many columns are there in each column family
- What should the column names be
- What information should go into the cells
- How many versions should be stored for each cell

MITB B.5 Big Data Technologies


Recommendations
- Design for use case
- Read, Write, Both?
- Row keys are the single most important aspect of Hbase table design –
understand access patterns before designing
- Store everything with similar access patterns in the same column family
- Hashing allows for fixed-length keys and better distribution, but takes away the ordering implied by using strings as keys
- The length of column qualifiers impacts the storage footprint – be concise
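Two common row-key techniques mentioned above, sketched with assumed key formats: full hashing for fixed-length, well-distributed keys, and salting to spread sequential keys over a few buckets while keeping partial scannability.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeySketch {
  // Fixed-length hashed key: even distribution across regions,
  // but range scans over the natural key ordering are lost.
  static byte[] hashedKey(String naturalKey) throws NoSuchAlgorithmException {
    return MessageDigest.getInstance("MD5")
        .digest(naturalKey.getBytes(StandardCharsets.UTF_8));   // always 16 bytes
  }

  // Salted key: a small prefix derived from the key spreads sequential keys
  // (e.g. timestamps) over N buckets, mitigating region hotspotting.
  static String saltedKey(String naturalKey, int buckets) {
    int salt = (naturalKey.hashCode() & Integer.MAX_VALUE) % buckets;
    return String.format("%02d|%s", salt, naturalKey);
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    System.out.println(hashedKey("user#1001|2016-08-21").length);           // 16
    System.out.println(saltedKey("2016-08-21T00:12:14|user#1001", 8));      // e.g. "05|2016-08-21T..."
  }
}
```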

MITB B.5 Big Data Technologies


Areas for further study
• HBase design
  - Key design
  - Tall-narrow vs. flat-wide tables
• Performance optimization
  - Compaction (minor and major), splits
  - Hashing, sequential keys and salting
  - Avoiding hotspotting
  - BlockCache and MemStore sizes
  - Bloom filters
  - Write amplification
• HBase client APIs
  - CRUD operations
  - Batch operations
  - Scans
  - Filters, counters and coprocessors
MITB B.5 Big Data Technologies
Critique

MITB B.5 Big Data Technologies


Benefits
- Strong Consistency Model
- Scales automatically
o Regions split when data grows too large
o Uses HDFS to spread and replicate data
- Built-in recovery
o Using Write-ahead Log
- Integrated with Hadoop
o MR on Hbase is straightforward

MITB B.5 Big Data Technologies


What it is good for
• Large Data Sets (TB/PB)
• Sparse Data Sets
• Loosely coupled (denormalized) records
• Lots of concurrent clients
• High Throughput
• Great for variable schema
‒ Rows may drastically differ
‒ If your schema has many columns and most of them are null
• Random reads and Writes

MITB B.5 Big Data Technologies


When it may not be the best fit
• Small datasets (unless you have lots of them)
• Highly relational records
• Schema designs requiring transactions
• Mixed workloads need careful evaluation

MITB B.5 Big Data Technologies


Use Cases

MITB B.5 Big Data Technologies


Pinterest’s deployment of Hbase
• Pinterest – an online pinboard where you "curate" and "discover" things you love and go do them in real life
• A "follower" follows a "followee"
• The "following feed" for a user aggregates content from all of that user's followees and is presented each time the user accesses their page on pinterest.com

[Diagram: a new pin fanned out to Followers 1–3]

• Hundreds of millions of pins/repins per month
• High fanout – billions of writes per day (high throughput)
• Billions of requests per month (low latency and high availability)

MITB B.5 Big Data Technologies


Pinterest’s deployment of Hbase
Requirements
- Read volume comparatively lower than write volume, but low read latency is important
- High throughput
- High volume of random writes

Options considered
- B-tree based systems: MySQL
- In-memory databases: Redis
- Log-Structured Merge (LSM) trees: Google BigTable, HBase, Cassandra

HBase advantages
- High throughput through its column-oriented structure and WAL
- Ability to manage sharding and machine failures automatically
- Seamless integration with Hadoop and Hive
- High user adoption (Facebook already uses HBase for messages)
- Atomicity (transaction support at the row level)
- Locality (a user's data lies within the same region and on the same region server)

MITB B.5 Big Data Technologies


Pinterest’s deployment of Hbase
Two key operations
‒ Update: as users follow/unfollow other users and boards, and as new pins are written, asynchronous tasks are enqueued from the front end onto a message bus and applied to the follow store, pin store and feed store
‒ Serving / retrieval: the following feed is retrieved whenever an authenticated user accesses pinterest.com

• The feed store stores only the identifiers associated with the pins in the following feed
• The stores sit on HBase, accessed through a Thrift + Finagle layer

[Diagram: front end → async task enqueue → message bus → follow store / pin store / feed store, with HBase behind a Thrift + Finagle layer]

MITB B.5 Big Data Technologies


Pinterest’s deployment of Hbase
• A sharded MySQL database provides persistent storage of the core entities:
- Pins (image URL, title, description)
- Boards
- Users

• To serve a feed, the front end retrieves the pin IDs from the feed store (HBase, via the Thrift + Finagle layer) and then retrieves the pin metadata from MySQL

[Diagram: front end → feed store (retrieve pin IDs) and sharded MySQL (retrieve pin metadata)]

MITB B.5 Big Data Technologies


Hbase at AirBnb – Realtime Ingestion

MITB B.5 Big Data Technologies


Hbase at Visa – Mobile Offerings Application
• Scalable and real-time transaction history service
• Migrated prominent mobile wallet offerings to this service
• Learnings
  • Kerberos
  • Availability
  • Handling client exceptions
MITB B.5 Big Data Technologies
Hbase at Xiaomi – Offline Querying

• Input data is replicated from the online to the offline cluster
• Output data is written to the offline cluster and replicated to the online cluster
• Both batch and stream processing
MITB B.5 Big Data Technologies
Hbase at Alibaba Search

• Core storage in the Alibaba search system since 2010
• 3 clusters, each with 1,000+ nodes

MITB B.5 Big Data Technologies


References

MITB B.5 Big Data Technologies


References
Data Modeling
Text: Hadoop Application Architectures
Mark Grover, Ted Malaska, Jonathan Seidman, Gwen Shapira; O'Reilly Media Inc.
Paper: Dremel: Interactive Analysis of Web Scale Datasets
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf

Hbase
Hbase Reference Guide
https://hbase.apache.org/book.html

Paper: Introduction to Basic Schema Design by Amandeep Khurana


http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf

Article: An In-depth Look at Hbase Architecture


https://www.mapr.com/blog/in-depth-look-hbase-architecture

MITB B.5 Big Data Technologies


References
Case Studies
AirBnb
http://www.slideshare.net/HBaseCon/apache-hbase-at-airbnb

Facebook
http://www.slideshare.net/cloudera/h-base-in-production-at-facebook-jonthan-grayfacebookfinal

Xiaomi
http://www.slideshare.net/HBaseCon/apache-hbase-improvements-and-practices-at-xiaomi

Alibaba Search
http://www.slideshare.net/HBaseCon/improvements-to-apache-hbase-and-its-applications-in-alibaba-search

Pinterest
http://www.slideshare.net/AbhiKhune/scaling-deep-social-feeds-at-pinterest

Visa
http://www.slideshare.net/HBaseCon/rolling-out-apache-hbase-for-mobile-offerings-at-visa
MITB B.5 Big Data Technologies
