Apache Hadoop


Unit 17: HBase - Contd


HBase vs RDBMS

Column Families

Access HBase Data

HBase API

Runtime modes

Running HBase


HBase: Keys and Column Families

Keys and Column Families


Key

Byte array

Serves as the primary key for the table

Indexed for fast lookup

Column Family

Has a name (string)

Contains one or more related columns

Column

Belongs to one column family

Included inside the row

familyName:columnName
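
A cell is addressed by row key, column family, column qualifier, and (optionally) version. The following is a minimal sketch using the HBase 1.x Java client; the table name "users", family "info", and qualifier "email" are illustrative and not defined elsewhere in these notes:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table

            // Key, family, qualifier, and value are all byte arrays
            Put put = new Put(Bytes.toBytes("user#1001"));                // row key
            put.addColumn(Bytes.toBytes("info"),                         // column family
                          Bytes.toBytes("email"),                        // column qualifier
                          Bytes.toBytes("alice@example.com"));           // value (cell)
            table.put(put);

            // Read the same cell back, addressed as info:email
            Result row = table.get(new Get(Bytes.toBytes("user#1001")));
            byte[] email = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}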


Column Families

Version Number & Value


Version Number

Unique within each key

By default, the system's timestamp

Data type is Long

Value (Cell)

Byte array


Notes on Data Model


HBase schema consists of several Tables

Each table consists of a set of Column Families

Columns are not part of the schema

HBase has Dynamic Columns

Because column names are encoded inside the cells

Different cells can have different columns

Notes on Data Model (2)


The version number can be user-supplied

It does not even have to be inserted in increasing order

Version numbers are unique within each key

Table can be very sparse

Many cells are empty

Keys are indexed as the primary key


HBase Physical Model


Each column family is stored in a separate set of files (HFiles)

Key & Version numbers are replicated with each column family

Empty cells are not stored

HBase maintains a multi-level index on values: <key, column family, column name, timestamp>

Column Families
Different sets of columns may have different properties and access patterns

Configurable by column family:

Compression (none, gzip, LZO)

Version retention policies

Cache priority

CFs stored separately on disk: access one without wasting IO on the other
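
These per-family properties are declared when the table is created or altered. A minimal sketch, assuming the HBase 1.x admin API and illustrative table/family names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor users = new HTableDescriptor(TableName.valueOf("users"));  // hypothetical table
            HColumnDescriptor info = new HColumnDescriptor("info");
            info.setCompressionType(Compression.Algorithm.GZ);  // per-CF compression
            info.setMaxVersions(3);                             // per-CF version retention
            info.setInMemory(true);                             // per-CF cache priority
            users.addFamily(info);
            admin.createTable(users);
        }
    }
}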


HBase Deployment

HBase vs HDFS

                        Plain HDFS/MR                                   HBase
Write pattern           Append-only                                     Random write, bulk incremental
Read pattern            Full table scan, partition table scan           Random read, small range scan, or table scan
Hive (SQL) performance  Very good                                       4-5x slower
Structured storage      Do-it-yourself / TSV / SequenceFile / Avro / ?  Sparse column-family data model
Max data size           30 PB                                           ~1 PB


HBase vs RDBMS

                              RDBMS                            HBase
Data layout                   Row-oriented                     Column-family-oriented
Transactions                  Multi-row ACID                   Single row only
Query language                SQL                              get/put/scan/etc.
Security                      Authentication / Authorization   Work in progress
Indexes                       On arbitrary columns             Row-key only
Max data size                 TBs                              ~1 PB
Read/write throughput limits  1000s of queries/second          Millions of queries/second

When to use HBase


You need random write, random read or both (but not neither)

You need to do many thousands of operations per second on multiple TB of data

Your access patterns are well-known and simple


Conclusion
Summary

Unit 18: Working with Flume


Flume - How it works

Flume Complex Flow - Calculation/ Multiplexing


What is Flume?
Collection, Aggregation of streaming Event Data

Typically used for log data

Significant advantages over ad-hoc solutions

Reliable, Scalable, Manageable, Customizable and High Performance

Declarative, Dynamic Configuration

Contextual Routing

Feature rich

Fully extensible

Core Concepts: Event


An Event is the fundamental unit of data transported by Flume from its point of origination to its final
destination

Event is a byte array payload accompanied by optional headers

Payload is opaque to Flume

Headers are specified as an unordered collection of string key-value pairs, with keys being unique
across the collection

Headers can be used for contextual routing


Core Concepts: Client


An entity that generates events and sends them to one or more Agents

Example:

Flume log4j Appender

Custom Client using Client SDK (org.apache.flume.api)

Decouples Flume from the system where event data is consumed from

Not needed in all cases
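
As a rough sketch of the Client SDK path (the host, port, and header values below are illustrative; the agent is assumed to expose an Avro source):

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        // Connect to an agent's Avro source (hypothetical host/port)
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);
        try {
            Map<String, String> headers = new HashMap<>();
            headers.put("priority", "high");  // headers enable contextual routing downstream

            // The payload is an opaque byte array as far as Flume is concerned
            Event event = EventBuilder.withBody(
                    "app started".getBytes(StandardCharsets.UTF_8), headers);
            client.append(event);             // hand the event off to the agent
        } finally {
            client.close();
        }
    }
}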

Core Concepts: Agent


A container for hosting Sources, Channels, Sinks and other components that enable the
transportation of events from one place to another

Fundamental part of a Flume flow

Provides Configuration, Life-Cycle Management, and Monitoring Support for hosted components


Typical Aggregation Flow

Core Concepts: Source


An active component that receives events from a specialized location or mechanism and places them on one or more Channels

Different Source types:

Specialized sources for integrating with well-known systems. Example: Syslog, Netcat

Auto-Generating Sources: Exec, SEQ

IPC sources for Agent-to-Agent communication: Avro

Require at least one channel to function


Core Concepts: Channel


A passive component that buffers the incoming events until they are drained by Sinks

Different Channels offer different levels of persistence:

Memory Channel: volatile

File Channel: backed by a write-ahead log (WAL) implementation

JDBC Channel: backed by embedded Database

Channels are fully transactional

Provide weak ordering guarantees

Can work with any number of Sources and Sinks

Core Concepts: Sink


An active component that removes events from a Channel and transmits them to their next hop
destination

Different types of Sinks:

Terminal sinks that deposit events to their final destination. For example: HDFS, HBase

Auto-Consuming sinks. For example: Null Sink

IPC sink for Agent-to-Agent communication: Avro

Require exactly one channel to function


Flow Reliability

Reliability based on:

Transactional Exchange between Agents

Persistence Characteristics of Channels in the Flow

Also Available:

Built-in Load balancing Support

Built-in Failover Support

Flow Reliability (2)


Flow Handling
Channels decouple the impedance of upstream and downstream stages

Upstream burstiness is damped by channels

Downstream failures are transparently absorbed by channels

Sizing of channel capacity is key in realizing these benefits

Configuration
Java Properties File Format

# Comment line
key1 = value
key2 = multi-line \
value

Hierarchical, Name Based Configuration

agent1.channels.myChannel.type = FILE
agent1.channels.myChannel.capacity = 1000

Uses soft references for establishing associations

agent1.sources.mySource.type = HTTP
agent1.sources.mySource.channels = myChannel


Configuration (2)
Global List of Enabled Components

agent1.sources = mySource1 mySource2


agent1.sinks = mySink1 mySink2
agent1.channels = myChannel
...
agent1.sources.mySource3.type = Avro
...

Custom Components get their own namespace

agent1.sources.mySource1.type = org.example.source.AtomSource
agent1.sources.mySource1.feed = http://feedserver.com/news.xml
agent1.sources.mySource1.cache-duration = 24
...

Configuration (3)
agent1.properties: Active Agent Components (Sources, Channels, Sinks)

# Active components
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

Individual Component Configuration

# Define and configure src1


agent1.sources.src1.type = netcat
agent1.sources.src1.channels = ch1
agent1.sources.src1.bind = 127.0.0.1
agent1.sources.src1.port = 10112

# Define and configure sink1


agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1


# Define and configure ch1


agent1.channels.ch1.type = memory
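
Channel sizing (see Flow Handling above) is configured on the same component; the values below are illustrative:

agent1.channels.ch1.capacity = 10000
agent1.channels.ch1.transactionCapacity = 100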


Configuration (4)
A configuration file can contain configuration information for many Agents

Only the portion of configuration associated with the name of the Agent will be loaded

Components defined in the configuration but not in the active list will be ignored

Components that are misconfigured will be ignored

Agent automatically reloads configuration if it changes on disk

Configuration (5)


Contextual Routing
Achieved using Interceptors and Channel Selectors

Contextual Routing (2)


An Interceptor is a component applied to a Source, in a pre-specified order, to enable decorating and filtering of events where necessary

Built-in Interceptors allow adding headers such as timestamps, hostname, static markers, etc.

Custom interceptors can introspect event payload to create specific headers where necessary
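
As a sketch, built-in interceptors are attached to a source by name and applied in the listed order (the agent, source, and interceptor names below are illustrative):

agent1.sources.src1.interceptors = ts host
agent1.sources.src1.interceptors.ts.type = timestamp
agent1.sources.src1.interceptors.host.type = host
agent1.sources.src1.interceptors.host.hostHeader = hostname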


Contextual Routing (3)


Channel Selector allows a Source to select one or more Channels from all the Channels that the
Source is configured with based on preset criteria

Built-in Channel Selectors:

* Replicating: for duplicating the events


* Multiplexing: for routing based on headers

Channel Selector Configuration


Applied via Source configuration under namespace "selector"

# Active components
agent1.sources = src1
agent1.channels = ch1 ch2
agent1.sinks = sink1 sink2

# Configure src1
agent1.sources.src1.type = AVRO
agent1.sources.src1.channels = ch1 ch2
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = priority
agent1.sources.src1.selector.mapping.high = ch1
agent1.sources.src1.selector.mapping.low = ch2
agent1.sources.src1.selector.default = ch2
...


Contextual Routing
Terminal Sinks can directly use Headers to make destination selections

HDFS Sink can use header values to create a dynamic path for the files that events will be added to (see the sketch below)

Some headers such as timestamps can be used in a more sophisticated manner

Custom Channel Selector can be used for doing specialized routing where necessary
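
A sketch of an HDFS Sink whose path interpolates a header and the event timestamp (sink name, header name, and paths are illustrative; the timestamp escapes assume a timestamp header is present, e.g. from the timestamp interceptor):

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = ch1
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/%{priority}/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.fileType = DataStream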

Load Balancing and Failover


Sink Processor

A Sink Processor is responsible for invoking one sink from an assigned group of sinks

Built-in Sink Processors:

Load Balancing Sink Processor: uses RANDOM, ROUND_ROBIN, or a custom selection algorithm

Failover Sink Processor

Default Sink Processor


Sink Processor
Invoked by Sink Runner

Acts as a proxy for a Sink

Sink Processor Configuration


Applied via "Sink Groups"

Sink Groups declared at global level:

# Active components
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1 sink2 sink3
agent1.sinkgroups = foGroup

# Configure foGroup
agent1.sinkgroups.foGroup.sinks = sink1 sink3
agent1.sinkgroups.foGroup.processor.type = failover
agent1.sinkgroups.foGroup.processor.priority.sink1 = 5
agent1.sinkgroups.foGroup.processor.priority.sink3 = 10
agent1.sinkgroups.foGroup.processor.maxpenalty = 1000
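
For comparison, a sketch of a load-balancing group (the group name and property values are illustrative; it would also need to be listed in agent1.sinkgroups):

agent1.sinkgroups.lbGroup.sinks = sink1 sink2
agent1.sinkgroups.lbGroup.processor.type = load_balance
agent1.sinkgroups.lbGroup.processor.selector = round_robin
agent1.sinkgroups.lbGroup.processor.backoff = true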


Sink Processor Configuration (2)


A Sink can exist in at most one group

A Sink that is not in any group is handled via Default Sink Processor

Caution:

Removing a Sink Group does not make the sinks inactive!

Summary
Clients send Events to Agents

Agents host a number of Flume components: Sources, Interceptors, Channel Selectors, Channels, Sink Processors, and Sinks

Sources and Sinks are active components, whereas Channels are passive

Source accepts Events, passes them through Interceptor(s), and if not filtered, puts them on
channel(s) selected by the configured Channel Selector

A Sink Processor identifies a sink to invoke, which takes Events from a Channel and sends them to their next-hop destination

Channel operations are transactional to guarantee one-hop delivery semantics

Channel persistence allows for ensuring end-to-end reliability


Conclusion
Summary

Unit 19: Cassandra


What is Apache Cassandra?

How does Cassandra work

Installing Cassandra

Moving data to or from other databases

COPY command

Cassandra bulk loader (sstableloader)

OpsCenter

Querying Cassandra

Using DevCenter and cqlsh


What is Cassandra?
Open-source database management system (DBMS)

Several key features of Cassandra differentiate it from other similar systems

History of Cassandra
Cassandra was created to power the Facebook Inbox Search

Facebook open-sourced Cassandra in 2008, and it became an Apache Incubator project

In 2010, Cassandra graduated to a top-level project; regular updates and releases followed


Motivation and Function


Designed to handle large amounts of data across multiple servers

There is a lot of unorganized data out there

Easy to implement and deploy

Mimics traditional relational database systems, but with triggers and lightweight transactions

Raw, simple data structures

General Design Features


Emphasis on performance over analysis

Still has support for analysis tools such as Hadoop

Organization

Rows are organized into tables

First component of a table's primary key is the partition key

Rows are clustered by the remaining columns of the key

Columns may be indexed separately from the primary key

Tables may be created, dropped, altered at runtime without blocking queries

Language

CQL (Cassandra Query Language) introduced, similar to SQL (flattened learning curve)


Peer to Peer Cluster


Decentralized design

Each node has the same role

No single point of failure

Avoids issues of master-slave DBMS's

No bottlenecking

Fault Tolerance/Durability
Failures happen all the time with multiple nodes

Hardware Failure

Bugs

Operator error

Power Outage, etc.

Solution: Buy cheap, redundant hardware, replicate, maintain consistency


Fault Tolerance/Durability (2)


Replication

Data is automatically replicated to multiple nodes

Allows failed nodes to be immediately replaced

Distribution of data to multiple data centers

An entire data center can go down without data loss occurring

Performance
Core architectural designs allow Cassandra to outperform its competitors

Very good read and write throughputs

Consistently ranks as the fastest among comparable NoSQL DBMSs with large data sets


Scalability
Read and write throughput increase linearly as more machines are added

Comparisons
Apache Cassandra
  Storage Type: Column
  Best Use: Write often, read less
  Concurrency Control: MVCC
  Characteristics: High Availability, Partition Tolerance, Persistence

Google BigTable
  Storage Type: Column
  Best Use: Designed for large scalability
  Concurrency Control: Locks
  Characteristics: Consistency, High Availability, Partition Tolerance, Persistence

Amazon DynamoDB
  Storage Type: Key-Value
  Best Use: Large database solution
  Concurrency Control: ACID
  Characteristics: Consistency, High Availability

Key Point – Cassandra offers a healthy cross between BigTable and Dynamo


Key-Value Model
Cassandra is a column oriented NoSQL system

Column families: sets of key-value pairs

Think of a column family as a table and key-value pairs as a row (using a relational database analogy)

A row is a collection of columns labeled with a name

Cassandra Row


Cassandra Row (2)


The value of a row is itself a sequence of key-value pairs

Such nested key-value pairs are columns

key = column name

A row must contain at least 1 column

Example of Columns


Column names storing values


key: User ID

column names store tweet ID values

values of all column names are set to "-" (empty byte array) as they are not used

Key Space
A Key Space is a group of column families together

It is only a logical grouping of column families and provides an isolated scope for names


Comparing Cassandra (C*) and RDBMS


With RDBMS, a normalized data model is created without considering the exact queries

SQL can return almost anything through Joins

With C*, the data model is designed for specific queries

the schema is adjusted as new queries are introduced

C*: NO joins, relationships, or foreign keys

a separate table is leveraged per query

data required by multiple tables is denormalized across those tables

Cassandra Query Language - CQL


creating a keyspace - namespace of tables

CREATE KEYSPACE demo
WITH replication = {'class': 'SimpleStrategy',
                    'replication_factor': 3};

To use namespace:

USE demo;


Cassandra Query Language - CQL (2)


Creating tables:

CREATE TABLE users (
  email varchar,
  bio varchar,
  birthday timestamp,
  active boolean,
  PRIMARY KEY (email));

CREATE TABLE tweets (
  email varchar,
  time_posted timestamp,
  tweet varchar,
  PRIMARY KEY (email, time_posted));
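
Following the query-driven modeling described earlier, the tweets table above serves "tweets by user"; a different query would typically get its own denormalized table. A sketch with illustrative names:

CREATE TABLE tweets_by_tag (
  tag varchar,
  time_posted timestamp,
  email varchar,
  tweet varchar,
  PRIMARY KEY (tag, time_posted));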

Cassandra Query Language - CQL (3)


Inserting data:

INSERT INTO users (email, bio, birthday, active)
VALUES ('john.doe@example.com', 'Example Teammate', 516513600000, true);

timestamp fields are specified in milliseconds since epoch


Cassandra Query Language - CQL (4)


Querying tables

A SELECT expression reads one or more records from a Cassandra column family and returns a result set of rows

SELECT * FROM users;


SELECT email FROM users WHERE active = true;
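
Note that filtering on a non-key column such as active generally requires a secondary index (or ALLOW FILTERING); a sketch:

CREATE INDEX ON users (active);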

Cassandra Architecture Overview


Cassandra was designed with the understanding that system/hardware failures can and do occur

Peer-to-peer, distributed system

All nodes are the same

Data partitioned among all nodes in the cluster

Custom data replication to ensure fault tolerance

Read/Write-anywhere design


Google BigTable & Amazon Dynamo


Google BigTable - data model

Column Families

Memtables

SSTables

Amazon Dynamo - distributed systems technologies

Consistent hashing

Partitioning

Replication

One-hop routing

Transparent Elasticity
Nodes can be added and removed from Cassandra online, with no downtime being experienced


Transparent Scalability
Adding Cassandra nodes increases performance linearly, as well as the ability to manage TBs to PBs of data

High Availability
Cassandra, with its peer-to-peer architecture, has no single point of failure


Multi-Geography / Zone Aware


Cassandra allows a single logical database to span 1-N datacenters that are geographically
dispersed

Also supports a hybrid on-premise / Cloud implementation

Data Redundancy
Cassandra allows for customizable data redundancy so that data is completely protected

Also supports rack awareness (data can be replicated between different racks to guard against
machine/rack failures).

Uses ZooKeeper to choose a leader, which tells nodes the range they are replicas for


Partitioning
Nodes are logically structured in Ring Topology

Hashed value of key associated with data partition is used to assign it to a node in the ring

Hashing wraps around after a certain value to support the ring structure

Lightly loaded nodes move position to alleviate highly loaded nodes

Partitioning & Replication


Gossip Protocols
Used to discover location and state information about the other nodes participating in a Cassandra
cluster

Network communication protocol inspired by real-life rumor spreading

Periodic, Pairwise, inter-node communication

Low frequency communication ensures low cost

Random selection of peers

Gossip Protocols example


Example: Node A wishes to search for a pattern in the data

Round 1: Node A searches locally and then gossips with Node B

Round 2: Nodes A and B gossip with Nodes C and D

Round 3: Nodes A, B, C, and D gossip with 4 other nodes

Round by round doubling makes protocol very robust


Failure Detection
Gossip process tracks heartbeats from other nodes both directly and indirectly

Node Fail state is given by variable Φ

Φ indicates how likely it is that a node has failed (a suspicion level) instead of a simple binary value (up/down)

This type of system is known as Accrual Failure Detector

Takes into account network conditions, workload, or other conditions that might affect perceived
heartbeat rate

A threshold for Φ is used to decide if a node is dead

If the node is up, Φ will stay around a constant value set by the application

Generally, Φ(t) = 0

Write Operation Stages


Logging data in the commit log

Writing data to the memtable

Flushing data from the memtable

Storing data on disk in SSTables


Write Operations
Commit Log

First place a write is recorded

Crash recovery mechanism

Write not successful until recorded in commit log

Once recorded in commit log, data is written to Memtable

Memtable

Data structure in memory

Once memtable size reaches a threshold, it is flushed (appended) to SSTable

Several may exist at once (1 current, any others waiting to be flushed)

First place read operations look for data

SSTable

Kept on disk

Immutable once written

Periodically compacted for performance

Write Operations (2)


Consistency
Read Consistency

Number of nodes that must agree before read request returns

ONE to ALL

Write Consistency

Number of nodes that must be updated before a write is considered successful

ANY to ALL

At ANY, a hinted handoff is all that is needed to return

QUORUM

Commonly used middle-ground consistency level

Defined as (replication_factor / 2) + 1, rounded down (e.g. with replication_factor = 3, QUORUM = 2)

Write Consistency (ONE)

INSERT INTO table (column1, ...) VALUES (value1, ...) USING CONSISTENCY ONE


Write Consistency (QUORUM)

INSERT INTO table (column1, ...) VALUES (value1, ...) USING CONSISTENCY QUORUM

Write Operations: Hinted Handoff


Write intended for a node that's offline

An online node, processing the request, makes a note to carry out the write once the node comes
back online


Hinted Handoff

INSERT INTO table (column1, ...) VALUES (value1, ...) USING CONSISTENCY ANY

Note: A hinted handoff does not count toward the consistency level (except at ANY)

Delete Operations
Tombstones

On delete request, records are marked for deletion

Similar to 'Recycle Bin'

Data is actually deleted on major compaction or configurable timer


Compaction
Compaction runs periodically to merge multiple SSTables

Reclaims space

Creates new index

Merges keys

Combines columns

Discards tombstones

Improves performance by minimizing disk seeks

Two types

Major

Read-only

Compaction (2)


Compaction (3)

Anti-Entropy
Ensures synchronization of data across nodes

Compares data checksums against neighboring nodes

Uses Merkle trees (hash trees)

Snapshot of data sent to neighboring nodes

Created and broadcasted on every major compaction

If two nodes take snapshots within TREE_STORE_TIMEOUT of each other, snapshots are
compared and data is synced


Merkle Tree

Read Operations
Read Repair

On read, nodes are queried until the number of nodes responding with the most recent value meets the specified consistency level (from ONE to ALL)

If the consistency level is not met, nodes are updated with the most recent value which is then
returned

If the consistency level is met, the value is returned and any nodes that reported old values
are then updated


Read Repair

SELECT * FROM table USING CONSISTENCY ONE

Read Operations: Bloom Filters


Bloom filters provide a fast way of checking if a value is not in a set


Read

Cassandra Advantages
Perfect for time-series data

High performance

Decentralization

Nearly linear scalability

Replication support

No single points of failure

MapReduce support


Cassandra Weaknesses
No referential integrity

No concept of JOIN

Querying options for retrieving data are limited

Sorting data is a design decision

No GROUP BY

No support for atomic operations

If an operation fails, changes can still occur

First think about queries, then about data model

Cassandra: Points to Consider


Cassandra is designed as a distributed database management system

Use it when you have a lot of data spread across multiple servers

Cassandra write performance is always excellent, but read performance depends on write patterns

It is important to spend enough time to design proper schema around the query pattern

Having a high-level understanding of some internals is a plus

It helps ensure a strong design for applications built atop Cassandra


Conclusion
Summary
