Apache Hadoop
04/03/2015 01:55 PM 1 of 52
Column Families
HBase API
Runtime modes
Running HBase
Byte array
Column Family
Column
familyName:columnName
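The addressing scheme above can be sketched as a nested map: row key, then familyName:columnName, then timestamp. This is a hypothetical in-memory model for illustration, not the HBase client API:

```python
# Toy sketch of HBase's data model:
# row key -> {"family:qualifier" -> {timestamp -> value}}.
# In real HBase every key and value is an uninterpreted byte array.
table = {}

def put(row, family, qualifier, value, ts):
    col = f"{family}:{qualifier}"            # familyName:columnName
    table.setdefault(row, {}).setdefault(col, {})[ts] = value

def get(row, family, qualifier):
    """Return the value with the highest (most recent) timestamp."""
    versions = table[row][f"{family}:{qualifier}"]
    return versions[max(versions)]

put(b"row1", "info", "name", b"alice", ts=1)
put(b"row1", "info", "name", b"bob", ts=2)   # newer version of the same cell
print(get(b"row1", "info", "name"))          # the most recent version wins
```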
Column Families
Value (Cell)
Byte array
Key & Version numbers are replicated with each column family
Column Families
Different sets of columns may have different properties and access patterns
Cache priority
CFs stored separately on disk: access one without wasting IO on the other
HBase Deployment
HBase vs HDFS
HBase vs RDBMS
                         RDBMS                          HBase
Data layout              Row-oriented                   Column-family-oriented
Transactions             Multi-row ACID                 Single row only
Query language           SQL                            get/put/scan/etc.
Security                 Authentication/Authorization   Work in progress
Indexes                  On arbitrary columns           Row-key only
Max data size            TBs                            ~1 PB
Read/write throughput    1000s of queries/second        Millions of queries/second
Conclusion
Summary
What is Flume?
Collection, Aggregation of streaming Event Data
Contextual Routing
Feature rich
Fully extensible
Headers are specified as an unordered collection of string key-value pairs, with keys being unique across the collection
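A Flume event can be modeled as a byte-array body plus string key-value headers; a plain dict captures the key-uniqueness property, since dict keys cannot repeat. This is an illustrative sketch, not the actual org.apache.flume.Event API:

```python
# Minimal model of a Flume event: opaque byte-array body plus
# string headers with unique keys (a dict enforces uniqueness).
class Event:
    def __init__(self, body, headers=None):
        self.body = body                      # opaque payload bytes
        self.headers = dict(headers or {})    # unique string keys

e = Event(b"log line", {"host": "web01", "priority": "high"})
e.headers["priority"] = "low"   # unique keys: re-setting overwrites, never duplicates
print(e.headers["priority"])
```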
Example:
Decouples Flume from the system from which event data is consumed
Provides Configuration, Life-Cycle Management, and Monitoring Support for hosted components
Specialized sources for integrating with well-known systems. Example: Syslog, Netcat
Terminal sinks that deposit events to their final destination. For example: HDFS, HBase
Flow Reliability
Also Available:
Flow Handling
Channels buffer events, decoupling the rates (impedance) of upstream sources and downstream sinks
Configuration
Java Properties File Format
# Comment line
key1 = value
key2 = multi-line \
value
agent1.channels.myChannel.type = FILE
agent1.channels.myChannel.capacity = 1000
agent1.sources.mySource.type = HTTP
agent1.sources.mySource.channels = myChannel
Configuration (2)
Global List of Enabled Components
agent1.sources.mySource1.type = org.example.source.AtomSource
agent1.sources.mySource1.feed = http://feedserver.com/news.xml
agent1.sources.mySource1.cache-duration = 24
...
Configuration (3)
agent1.properties: Active Agent Components (Sources, Channels, Sinks)
# Active components
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
Configuration (4)
A configuration file can contain configuration information for many Agents
Only the portion of the configuration associated with the name of the Agent will be loaded
Components defined in the configuration but not in the active list will be ignored
Configuration (5)
Contextual Routing
Achieved using Interceptors and Channel Selectors
Built-in Interceptors allow adding headers such as timestamps, hostname, static markers, etc.
Custom interceptors can introspect event payload to create specific headers where necessary
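For example, the built-in timestamp and static interceptors could be attached to a source like this (agent and component names here are illustrative):

```properties
agent1.sources.src1.interceptors = ts env
agent1.sources.src1.interceptors.ts.type = timestamp
agent1.sources.src1.interceptors.env.type = static
agent1.sources.src1.interceptors.env.key = datacenter
agent1.sources.src1.interceptors.env.value = DC1
```

Downstream components, such as a Channel Selector or the HDFS Sink's path escapes, can then route on the headers these interceptors add.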
# Active components
agent1.sources = src1
agent1.channels = ch1 ch2
agent1.sinks = sink1 sink2
# Configure src1
agent1.sources.src1.type = AVRO
agent1.sources.src1.channels = ch1 ch2
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = priority
agent1.sources.src1.selector.mapping.high = ch1
agent1.sources.src1.selector.mapping.low = ch2
agent1.sources.src1.selector.default = ch2
...
Contextual Routing
Terminal Sinks can directly use Headers to make destination selections
HDFS Sink can use header values to create a dynamic path for the files that events will be added to
Custom Channel Selector can be used for doing specialized routing where necessary
A Sink Processor is responsible for invoking one sink from an assigned group of sinks
Sink Processor
Invoked by Sink Runner
# Active components
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1 sink2 sink3
agent1.sinkgroups = foGroup
# Configure foGroup
agent1.sinkgroups.foGroup.sinks = sink1 sink3
agent1.sinkgroups.foGroup.processor.type = failover
agent1.sinkgroups.foGroup.processor.priority.sink1 = 5
agent1.sinkgroups.foGroup.processor.priority.sink3 = 10
agent1.sinkgroups.foGroup.processor.maxpenalty = 1000
A Sink that is not in any group is handled via Default Sink Processor
Caution:
Summary
Clients send Events to Agents
Agents host a number of Flume components – Sources, Interceptors, Channel Selectors, Channels, Sink Processors, and Sinks
Sources and Sinks are active components, whereas Channels are passive
A Source accepts Events, passes them through Interceptor(s), and, if they are not filtered out, puts them on the channel(s) selected by the configured Channel Selector
The Sink Processor identifies a sink to invoke, which takes Events from a Channel and sends them to the next-hop destination
Conclusion
Summary
Installing Cassandra
COPY command
OpsCenter
Querying Cassandra
What is Cassandra?
Open-source database management system (DBMS)
History of Cassandra
Cassandra was created to power the Facebook Inbox Search
In 2010, Cassandra graduated to a top-level Apache project, and regular updates and releases followed
Mimics traditional relational database systems, but with triggers and lightweight transactions
Organization
Language
CQL (Cassandra Query Language) introduced, similar to SQL (flattened learning curve)
No bottlenecking
Fault Tolerance/Durability
Failures happen all the time with multiple nodes
Hardware Failure
Bugs
Operator error
Performance
Core architectural designs allow Cassandra to outperform its competitors
Consistently ranks among the fastest of comparable NoSQL DBMSs on large data sets
Scalability
Read and write throughput increase linearly as more machines are added
Comparisons
                      Apache Cassandra          Google BigTable                Amazon DynamoDB
Storage Type          Column                    Column                         Key-Value
Best Use              Write often, read less    Designed for large             Large database
                                                scalability                    solution
Concurrency Control   MVCC                      Locks                          ACID
Characteristics       High Availability,        Consistency, High              Consistency,
                      Partition Tolerance,      Availability, Partition        High Availability
                      Persistence               Tolerance, Persistence
Key Point – Cassandra offers a healthy cross between BigTable and Dynamo
Key-Value Model
Cassandra is a column-oriented NoSQL system
Think of a column family as a table and of key-value pairs as rows (using a relational database analogy)
Cassandra Row
Example of Columns
values of all column names are set to "-" (empty byte array) as they are not used
Key Space
A Key Space groups column families together
It is only a logical grouping of column families and provides an isolated scope for names
To use a keyspace:
USE demo;
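A keyspace must exist before it can be used; a minimal CQL sketch (the replication settings here are illustrative):

```sql
CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;
```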
The SELECT expression reads one or more records from a Cassandra column family and returns a result set of rows
Read/Write-anywhere design
Column Families
Memtables
SSTables
Consistent hashing
Partitioning
Replication
One-hop routing
Transparent Elasticity
Nodes can be added and removed from Cassandra online, with no downtime being experienced
Transparent Scalability
Addition of Cassandra nodes increases performance linearly and the ability to manage TBs to PBs of data
High Availability
Cassandra, with its peer-to-peer architecture, has no single point of failure
Data Redundancy
Cassandra allows for customizable data redundancy so that data is completely protected
Also supports rack awareness (data can be replicated between different racks to guard against machine/rack failures)
Uses ZooKeeper to elect a leader, which tells nodes the range they are replicas for
Partitioning
Nodes are logically structured in Ring Topology
The hashed value of the key associated with a data item is used to assign it to a node in the ring
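The ring assignment above can be sketched as follows: nodes take positions (tokens) on the ring, and a key belongs to the first node at or after its hash, wrapping around at the end. This is a simplified model without virtual nodes or replication:

```python
import hashlib
from bisect import bisect_left

# Simplified consistent-hash ring: each node gets a token (a position
# on the ring); a key is owned by the first node at or after its hash.
def token(name: str) -> int:
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

nodes = ["node1", "node2", "node3"]
ring = sorted((token(n), n) for n in nodes)

def owner(key: str) -> str:
    h = token(key)
    idx = bisect_left(ring, (h, ""))
    return ring[idx % len(ring)][1]   # wrap around past the last token

# The mapping is deterministic: the same key always lands on the same node.
print(owner("user:42"))
```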
Gossip Protocols
Used to discover location and state information about the other nodes participating in a Cassandra
cluster
Failure Detection
Gossip process tracks heartbeats from other nodes both directly and indirectly
Yields a suspicion level for how likely it is that a node has failed, instead of a simple binary up/down value
Takes into account network conditions, workload, or other conditions that might affect the perceived heartbeat rate
In the accrual failure detector, Φ(t) = -log10(P_later(t)), where P_later(t) is the probability that a heartbeat arrives more than t after the previous one; Φ rises as the time since the last heartbeat grows
Write Operations
Commit Log
Memtable
SSTable
Kept on disk
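The write path above (commit log first for durability, then the in-memory memtable, flushed to an immutable on-disk SSTable when full) can be sketched as a toy model; it is not Cassandra's storage engine:

```python
# Toy write path: append to the commit log, update the memtable,
# and flush the memtable to an immutable SSTable when it fills up.
commit_log = []      # sequential, append-only (kept on disk in Cassandra)
memtable = {}        # in-memory buffer of recent writes
sstables = []        # immutable, sorted on-disk tables

MEMTABLE_LIMIT = 2   # tiny threshold for illustration

def write(key, value):
    commit_log.append((key, value))          # 1. durability first
    memtable[key] = value                    # 2. fast in-memory write
    if len(memtable) >= MEMTABLE_LIMIT:      # 3. flush when full
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

write("k1", "v1")
write("k2", "v2")    # second write triggers a flush
write("k3", "v3")
print(len(sstables), memtable)
```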
Consistency
Read Consistency
ONE to ALL
Write Consistency
ANY to ALL
QUORUM
Defined as (replication_factor / 2) + 1 (integer division)
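The formula works out as follows (integer division, so the division rounds down before adding 1):

```python
# QUORUM = (replication_factor // 2) + 1, a strict majority of replicas.
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

for rf in (1, 2, 3, 4, 5):
    print(rf, quorum(rf))   # RF 3 -> 2, RF 4 -> 3, RF 5 -> 3
```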
An online node processing the request makes a note to carry out the write once the target node comes back online
Hinted Handoff
Delete Operations
Tombstones
Compaction
Compaction runs periodically to merge multiple SSTables
Reclaims space
Merges keys
Combines columns
Discards tombstones
Two types
Major
Read-only
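The merge step can be sketched as follows: SSTables are combined oldest-to-newest so newer values win for duplicate keys, and keys whose latest value is a tombstone are discarded. This is a simplified dict-based model; real compaction streams sorted files:

```python
TOMBSTONE = object()   # marker written by a delete

# Merge SSTables oldest-to-newest so newer values overwrite older
# ones, then drop any key whose latest value is a tombstone.
def compact(sstables):
    merged = {}
    for sstable in sstables:        # oldest first
        merged.update(sstable)      # newer values win
    return {k: v for k, v in merged.items() if v is not TOMBSTONE}

old = {"a": 1, "b": 2}
new = {"b": 20, "a": TOMBSTONE}     # update b, delete a
print(compact([old, new]))          # only b survives, with the newer value
```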
Compaction (2)
Compaction (3)
Anti-Entropy
Ensures synchronization of data across nodes
If two nodes take snapshots within TREE_STORE_TIMEOUT of each other, the snapshots are compared and data is synced
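A Merkle tree lets two replicas detect divergence cheaply: leaves hash data blocks, parents hash their children, and equal roots mean the data agrees. A minimal sketch over a fixed number of blocks (illustrative, not Cassandra's implementation):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Build the tree bottom-up: leaves hash the data blocks, each parent
# hashes the concatenation of its two children; return the root hash.
def merkle_root(blocks):
    level = [h(b) for b in blocks]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"r1", b"r2", b"r3", b"r4"]
replica_b = [b"r1", b"r2", b"r3-changed", b"r4"]
# Differing roots mean some subtree (key range) needs anti-entropy repair.
print(merkle_root(replica_a) == merkle_root(replica_b))
```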
Merkle Tree
Read Operations
Read Repair
On a read, nodes are queried until the number of nodes responding with the most recent value meets the specified consistency level, from ONE to ALL
If the consistency level is not met, nodes are updated with the most recent value, which is then returned
If the consistency level is met, the value is returned and any nodes that reported old values are then updated
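The repair behavior above can be sketched as: collect (timestamp, value) pairs from the replicas, pick the newest, and push it back to any replica that returned a stale version. A toy model, not Cassandra's read path:

```python
# Each replica stores a (timestamp, value) pair for the key.
replicas = {
    "node1": (2, "new"),
    "node2": (1, "old"),    # stale replica
    "node3": (2, "new"),
}

def read_with_repair(replicas):
    # Pick the version with the highest timestamp among the responses.
    latest = max(replicas.values())
    # Repair: push the winning version to any replica holding a stale one.
    for node, version in replicas.items():
        if version < latest:
            replicas[node] = latest
    return latest[1]

print(read_with_repair(replicas))   # returns the newest value
print(replicas["node2"])            # the stale replica has been repaired
```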
Read Repair
Read
Cassandra Advantages
Perfect for time-series data
High performance
Decentralization
Replication support
MapReduce support
Cassandra Weaknesses
No referential integrity
No concept of JOIN
No GROUP BY
Use it when you have a lot of data spread across multiple servers
Cassandra write performance is always excellent, but read performance depends on write patterns
It is important to spend enough time designing the schema around the query pattern
Conclusion
Summary