
Big Data Processing

Jiaul Paik
Lecture 3
Old Tools for Big Data Processing

• Programming models
• Shared memory (pthreads)
• Message passing (MPI)

• Design patterns
• Master-slaves
• Producer-consumer flows
• Shared work queues

[Diagrams: shared memory — processes P1–P5 attached to one shared memory — vs. message passing — processes P1–P5 exchanging messages; a master feeding a work queue consumed by slaves; a producer-consumer flow]
Some Difficulties
• Concurrency is difficult to reason about
• At the scale of datacenters and across datacenters
• In the presence of failures
• In terms of multiple interacting services
• Debugging: Even more difficult
• The reality:
• Lots of one-off solutions, custom code
• Write your own dedicated library, then program with it
• Burden on the programmer to explicitly manage everything
Source: MIT OpenCourseWare
The datacenter is the computer!

Source: Google
The datacenter is the computer
• It’s all about the right level of abstraction
• We need a new “instruction set” for the datacenter computer

• Hide system-level details from the developers


• No more race conditions, lock contention, etc.
• No need to explicitly worry about reliability, fault tolerance, etc.

• Separating the what from the how


• Developer specifies the computation that needs to be performed
• Execution framework (“runtime”) handles actual execution
Key Idea for Big Data Processing

Divide and Conquer

[Diagram: the “Work” is partitioned into chunks w1, w2, w3; one worker processes each chunk, producing partial results r1, r2, r3; the partial results are combined into the final “Result”]
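A minimal sketch of this pattern using Python’s standard multiprocessing module; the chunking scheme and the summing worker are illustrative assumptions, not part of any particular framework:

from multiprocessing import Pool

def worker(chunk):
    # Process one partition of the work; here, just sum it.
    return sum(chunk)

def divide_and_conquer(work, n_partitions=3):
    # Partition: split the work into roughly equal chunks.
    size = (len(work) + n_partitions - 1) // n_partitions
    chunks = [work[i:i + size] for i in range(0, len(work), size)]
    # Map: run one worker per chunk in parallel.
    with Pool(n_partitions) as pool:
        partial_results = pool.map(worker, chunks)
    # Combine: merge the partial results into the final result.
    return sum(partial_results)

if __name__ == "__main__":
    print(divide_and_conquer(list(range(1000))))   # 499500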
Building Blocks

Single server → rack of servers → cluster of racks

Source: Barroso and Hölzle (2009); images: Google, Facebook
Storage Hierarchy

[Figures: the storage hierarchy of a datacenter]

Source: Barroso and Hölzle (2013)
Anatomy of a Datacenter

Source: Barroso and Hölzle (2013)


Why a cluster computing framework for Big Data?
A Simple Problem: Word Counting

• You have 100 billion web pages in millions of files. Your goal is to
compute the count of each word appearing in the collection.
Standard Solution
• Use multiple interconnected machines

1. Split the data into small chunks

2. Send different chunks to different machines and process them in parallel

3. Collect the results from the different machines

Distributed Data Processing in a Cluster of Computers


Compute Cluster Components

[Diagram: a server, a rack of servers with its rack switch, and a data centre]

How to Organize a Cluster of Computers?
Cluster Architecture: Rack Servers

[Diagram: two racks of computers (Rack 1, Rack 2), each with its own switch; the rack switches connect to a backbone switch (typically 2-10 Gbps), giving about 1 Gbps of bandwidth between any pair of nodes]

Each rack typically contains commodity computers (nodes)


Main challenges in Cluster Computing
Challenge # 1

• Node failures

• Single server lifetime: ~1000 days, i.e., a failure rate of roughly 1/1000 per server per day

• 1000 servers in a cluster => ~1 failure/day

• 1M servers across clusters => ~1000 failures/day
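A quick back-of-the-envelope check of these numbers in Python; the 1/1000-per-day failure rate follows from the 1000-day lifetime assumption on the slide:

failure_rate = 1 / 1000   # failures per server per day (1000-day lifetime)

for n_servers in (1_000, 1_000_000):
    print(f"{n_servers:>9} servers -> ~{n_servers * failure_rate:.0f} expected failures/day")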


Consequences of Node Failure

• Data loss

• Node failure in the middle of a long and expensive computation

• Need to restart the computation from scratch


Challenge # 2

• Network bottleneck

• Computers in a cluster exchange data through the network

• Moving 10 TB of data over a 1 Gbps link takes about a day (see the arithmetic below)
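The arithmetic behind that claim, as a small Python sketch; 10 TB and 1 Gbps are the slide’s numbers, and decimal units are assumed:

data_bits = 10e12 * 8   # 10 TB expressed in bits
bandwidth = 1e9         # 1 Gbps link, in bits per second

seconds = data_bits / bandwidth
print(f"{seconds:.0f} s = {seconds / 3600:.1f} hours")   # 80000 s, about 22 hours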


Challenge # 3

• Distributed Programming is hard!!!

• Why?

• The programmer has to manage too many low-level concerns in addition to writing
the code for the task itself
Other Challenges

• How do we assign work units to workers?

• What if we have more work units than workers?

• What if workers need to share partial results?

• How do we aggregate partial results?

• How do we know all the workers have finished?

• What if workers die?


Managing Multiple Workers
• Difficult because
• We don’t know the order in which workers run
• We don’t know when workers interrupt each other
• We don’t know when workers need to communicate partial results
• We don’t know the order in which workers access shared data

• Thus, we need synchronization primitives:
• Semaphores / locks (acquire, release)
• Condition variables (wait, notify, broadcast)
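To make the burden concrete, here is a minimal producer-consumer sketch in Python using a condition variable; every detail (buffer size, loop counts) is an illustrative assumption the programmer would otherwise have to manage by hand:

import threading
from collections import deque

buffer = deque()
MAX_SIZE = 10
cond = threading.Condition()   # one lock plus wait/notify

def producer():
    for item in range(100):
        with cond:
            while len(buffer) >= MAX_SIZE:   # buffer full: wait for space
                cond.wait()
            buffer.append(item)
            cond.notify()                    # wake the waiting consumer

def consumer():
    for _ in range(100):
        with cond:
            while not buffer:                # buffer empty: wait for data
                cond.wait()
            buffer.popleft()
            cond.notify()                    # wake the waiting producer

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Getting even this tiny example right (waiting in a loop, notifying after every change) is exactly the kind of reasoning the execution framework is meant to take off the programmer’s hands.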
The datacenter is the computer

• Hide system-level details from the developers


• No more race conditions, lock contention, etc.
• No need to explicitly worry about reliability, fault tolerance, etc.

• Separating the what from the how


• Developer specifies the computation that needs to be performed
• Execution framework (“runtime”) handles actual execution
Bring computation to the data

[Diagram: traditional approach — data is moved from many nodes to a centralized processor where the code runs; modern approach (Hadoop, Spark) — the code is shipped to the nodes that already hold the data]

Key Idea Behind Parallel Processing: Divide and Conquer

[Diagram: the “Work” is partitioned into chunks w1, w2, w3; one worker processes each chunk, producing partial results r1, r2, r3; the partial results are combined into the final “Result”]
Programming Model
(Hadoop MapReduce)
Simple programming model: MapReduce
• Simple programming model
• Mainly two functions: Map and Reduce

Programmer’s responsibility:
define only the two functions, Map and Reduce, suited to your problem
Word Count Using MapReduce: Pseudocode
map(key, value)
// key: document name; value: text of the document
for each word w in value:
    emit(w, 1)

reduce(key, values)
// key: a word; values: list of counts for that word
result = 0
for each count v in values:
    result += v
emit(key, result)
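A runnable single-machine Python simulation of this pseudocode; the grouping step that the MapReduce framework normally performs between map and reduce is emulated here with a dictionary:

from collections import defaultdict

def map_fn(doc_name, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all partial counts for one word.
    return word, sum(counts)

def mapreduce_wordcount(documents):
    groups = defaultdict(list)
    for name, text in documents.items():              # map phase
        for word, one in map_fn(name, text):
            groups[word].append(one)                  # "shuffle": group values by key
    return dict(reduce_fn(w, c) for w, c in groups.items())   # reduce phase

docs = {"d1": "big data big ideas", "d2": "big clusters"}
print(mapreduce_wordcount(docs))   # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}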
Programming Model

[Diagram: the job is split into pieces, intermediate results are combined per node, and a final combining step produces the output]
Map-reduce Data Flow

[Diagram: where a map task runs relative to its HDFS data block]

• Data-local processing: the map task runs on the node that holds its data block

• Rack-local processing: the task runs on a different node in the same rack as the block

• Off-rack processing: the task must pull its block from another rack over the backbone

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White


MapReduce data flow with a single reduce task

[Diagram: all map outputs flow into one reduce task, whose output is written to HDFS with replication]

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White


MapReduce data flow with multiple reduce tasks

[Diagram: each map task’s output is partitioned among the reduce tasks]

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White


MapReduce with no reduce task

[Diagram: each input split (split 0, 1, 2) in HDFS is processed by a map task, and each map writes its output (part 0, 1, 2) directly back to HDFS with replication]

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White


Hadoop Distributed Filesystem
Why a distributed filesystem?

• When the data is very large, a single machine can’t store it

• The data needs to be partitioned and stored across many machines

• A distributed filesystem manages this storage across a network of machines

• Hadoop is built on HDFS (the Hadoop Distributed Filesystem)


Design Principles of HDFS
• Very large files
• Hundreds of terabytes or even petabytes

• Streaming data access
• Basic idea: write once, read many times
• Time to read the whole dataset (throughput) matters more than the latency of reading the first record

• Commodity hardware
• Runs on clusters of commodity hardware
• Thus the chance of node failure is high
• HDFS is designed to handle such failures without noticeable interruption
When HDFS does not work well
• Low-latency data access
• Applications that require access in the tens-of-milliseconds range
• HDFS provides high throughput, but at the expense of latency

• Lots of small files
• The namenode holds the filesystem metadata in memory
• Thus, the namenode may run out of memory if there are too many small files
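A rough sense of the scale, using the commonly cited rule of thumb of about 150 bytes of namenode memory per file, directory, or block; the exact figure varies by Hadoop version and is an assumption here:

bytes_per_object = 150          # assumed namenode memory per file or block object
n_small_files = 100_000_000     # 100 million small files, one block each

n_objects = n_small_files * 2   # one file entry plus one block entry per file
print(f"~{n_objects * bytes_per_object / 1e9:.0f} GB of namenode RAM")   # ~30 GB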
HDFS concepts
• Blocks
• A disk’s block size is the minimum amount of data it can read or write
• HDFS files are also stored in blocks, but much larger ones: typically 128 MB

• Why are HDFS blocks so large?
• The goal is to minimize the cost of seeks
• With a large block, the time to transfer the data from disk is significantly longer
than the time to seek to the start of the block, so seeks become negligible (see the
arithmetic below)
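The arithmetic, with illustrative numbers; the 10 ms seek time and 100 MB/s transfer rate are assumptions, not measurements:

seek_time = 0.010         # seconds; assumed average disk seek
transfer_rate = 100e6     # bytes/second; assumed sustained transfer rate

for block_mb in (1, 128):
    transfer_time = block_mb * 1e6 / transfer_rate
    seek_fraction = seek_time / (seek_time + transfer_time)
    print(f"{block_mb} MB block: seek is {seek_fraction:.0%} of total I/O time")
# 1 MB block: seek is 50% of the time; 128 MB block: about 1%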
HDFS Blocks
• A file is made up of a number of blocks

• Command to list the blocks that make up each file in HDFS:

hdfs fsck / -files -blocks


Node types in HDFS Cluster
• HDFS operates in a master-worker pattern

• Namenode: the master node

• Datanodes: the workers
Namenode
• The namenode manages the filesystem namespace.

• It maintains the filesystem tree and the metadata for all the files and
directories in the tree.
• This information is stored persistently on the local disk in two files:
• the namespace image and the edit log.

• The namenode also knows the datanodes on which all the blocks for a
given file are located

• But it does not store block locations persistently;

• this information is reconstructed from the datanodes when the system starts.
Datanode

• Datanodes are the workhorses of the filesystem.

• They store and retrieve blocks when they are told to (by clients or
the namenode)

• They report back to the namenode periodically with lists of the blocks
that they are storing.
Node failures
• Namenode failure
• All the files in the filesystem are effectively lost
• Without the metadata, the files cannot be reconstructed from the blocks

• Datanode failure
• Not a problem
• Data blocks are replicated across many machines
• A lost block can be recovered from another machine
HDFS (Hadoop) Architecture
namenode = master node

[Diagram: an application uses the HDFS client, which asks the namenode to map
(file name → block id) and (block id → block location); the namenode keeps the
file namespace, sends instructions to the datanodes, and receives datanode
state reports. The client then requests (block id, byte range) directly from
an HDFS datanode and receives the block data. Each datanode stores blocks in
its local Linux filesystem.]

(Ghemawat et al., SOSP 2003)
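A conceptual sketch of that read path in Python; every name here (NameNode, DataNode, hdfs_read, and their fields) is hypothetical pseudocode mirroring the diagram, not the real HDFS client API:

class NameNode:
    def __init__(self, namespace, locations):
        self.namespace = namespace   # file name -> list of block ids
        self.locations = locations   # block id -> datanode address

class DataNode:
    def __init__(self, blocks):
        self.blocks = blocks         # block id -> block bytes

    def read(self, block_id):
        return self.blocks[block_id]

def hdfs_read(path, namenode, datanodes):
    data = b""
    for block_id in namenode.namespace[path]:      # 1. ask the namenode for block ids
        address = namenode.locations[block_id]     # 2. ...and for each block's location
        data += datanodes[address].read(block_id)  # 3. fetch the bytes from the datanode
    return data

nn = NameNode({"/foo/bar": ["b1", "b2"]}, {"b1": "dn1", "b2": "dn2"})
dns = {"dn1": DataNode({"b1": b"hello "}), "dn2": DataNode({"b2": b"world"})}
print(hdfs_read("/foo/bar", nn, dns))   # b'hello world'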


HDFS

[Diagram: the namenode runs the namenode daemon, and a separate job submission
node runs the jobtracker; each slave node runs a tasktracker and a datanode
daemon on top of its local Linux filesystem]
Block Caching
• Generally, datanodes read blocks from disk

• Frequently accessed blocks can be cached in a datanode’s RAM

• By default, a block is cached in only one datanode’s memory

• Job schedulers try to run tasks on the datanode where the block is cached (see the example below)
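Caching is managed through HDFS centralized cache management; a sketch of the commands (the pool and path names are made up, and the exact flags may vary by Hadoop version):

hdfs cacheadmin -addPool hot-data
hdfs cacheadmin -addDirective -path /user/jiaul/lookup-table -pool hot-data
hdfs cacheadmin -listDirectives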
HDFS Federation

• The namenode keeps a reference to every file and block in the
filesystem in memory

• This means that, for a very large cluster, the namenode may run out of
memory to hold the metadata

• This problem is addressed by adding more namenodes to the cluster, each
managing a portion of the filesystem namespace
Tackling Namenode failure
• If the namenode fails, all metadata is lost
• We won’t be able to reconstruct the files from the blocks

• How to handle this?
• Maintain a replica of the metadata on another, passive machine
• When the active namenode fails, the admin can start the passive namenode
• It needs to load the namespace into memory before it can serve requests
Filesystem Operations
• Major Filesystem operations:
• reading files, creating directories, moving files, deleting data, and listing
directories.

• One can run Hadoop filesystem commands from the command line

• To see the details of every command:

hadoop fs -help
Filesystem Operations
• Copying a file from the local filesystem to HDFS

hadoop fs -copyFromLocal local-file hdfs-file

• Copying a file from HDFS to the local filesystem

hadoop fs -copyToLocal hdfs-file local-file
Filesystem Operations
• Creating a directory
hadoop fs -mkdir mydir

• Listing the files


hadoop fs -ls
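Putting these together, a typical first session might look like this; the file and directory names are just examples:

hadoop fs -mkdir books
hadoop fs -copyFromLocal war-and-peace.txt books/
hadoop fs -ls books
hadoop fs -copyToLocal books/war-and-peace.txt copy.txt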
