
Big Data Processing

Jiaul Paik
Lecture 3
Old Tools for Big Data Processing

• Programming models
• Shared memory (pthreads)
• Message passing (MPI)

• Design patterns
• Master-slaves
• Producer-consumer flows
• Shared work queues

[Diagrams: shared memory — processes P1–P5 attached to one shared memory — vs. message passing — processes P1–P5 exchanging messages; a master feeding a work queue consumed by slaves; a producer-consumer flow]
Some Difficulties
• Concurrency is difficult to reason about
• At the scale of datacenters and across datacenters
• In the presence of failures
• In terms of multiple interacting services
• Debugging: Even more difficult
• The reality:
• Lots of one-off solutions, custom code
• Write your own dedicated library, then program with it
• Burden on the programmer to explicitly manage everything
Source: MIT OpenCourseWare
The datacenter is the computer!

Source: Google
The datacenter is the computer
• It’s all about the right level of abstraction
• We need a new “instruction set” for the datacenter computer

• Hide system-level details from the developers


• No more race conditions, lock contention, etc.
• No need to explicitly worry about reliability, fault tolerance, etc.

• Separating the what from the how


• Developer specifies the computation that needs to be performed
• Execution framework (“runtime”) handles actual execution
Key Idea for Big Data Processing

Divide and Conquer

[Diagram: the “Work” is partitioned into chunks w1, w2, w3; one worker processes each chunk, producing partial results r1, r2, r3; the partial results are combined into the final “Result”]
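A minimal sketch of this pattern using Python’s standard multiprocessing module; the chunking scheme and the summing worker are illustrative assumptions, not part of any particular framework:

from multiprocessing import Pool

def worker(chunk):
    # Process one partition of the work; here, just sum it.
    return sum(chunk)

def divide_and_conquer(work, n_partitions=3):
    # Partition: split the work into roughly equal chunks.
    size = (len(work) + n_partitions - 1) // n_partitions
    chunks = [work[i:i + size] for i in range(0, len(work), size)]
    # Map: run one worker per chunk in parallel.
    with Pool(n_partitions) as pool:
        partial_results = pool.map(worker, chunks)
    # Combine: merge the partial results into the final result.
    return sum(partial_results)

if __name__ == "__main__":
    print(divide_and_conquer(list(range(1000))))   # 499500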
Building Blocks

Single server → rack of servers → cluster of racks

Source: Barroso and Hölzle (2009); images: Google, Facebook
Storage Hierarchy

[Figures: the storage hierarchy of a datacenter]

Source: Barroso and Hölzle (2013)
Anatomy of a Datacenter

Source: Barroso and Hölzle (2013)


Why a cluster computing framework for Big Data?
A Simple Problem: Word Counting

• You have 100 billion web pages in millions of files. Your goal is to
compute the count of each word appearing in the collection.
Standard Solution
• Use multiple interconnected machines

1. Split the data into small chunks

2. Send different chunks to different machines and process them in parallel

3. Collect the results from the different machines

Distributed Data Processing in a Cluster of Computers


Compute Cluster Components

[Diagram: a server, a rack of servers with its rack switch, and a data centre]

How to Organize a Cluster of Computers?
Cluster Architecture: Rack Servers

[Diagram: two racks of computers (Rack 1, Rack 2), each with its own switch; the rack switches connect to a backbone switch (typically 2-10 Gbps), giving about 1 Gbps of bandwidth between any pair of nodes]

Each rack typically contains commodity computers (nodes)


Main challenges in Cluster Computing
Challenge # 1

• Node failures

• Single server lifetime: ~1000 days, i.e., a failure rate of roughly 1/1000 per server per day

• 1000 servers in a cluster => ~1 failure/day

• 1M servers across clusters => ~1000 failures/day
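A quick back-of-the-envelope check of these numbers in Python; the 1/1000-per-day failure rate follows from the 1000-day lifetime assumption on the slide:

failure_rate = 1 / 1000   # failures per server per day (1000-day lifetime)

for n_servers in (1_000, 1_000_000):
    print(f"{n_servers:>9} servers -> ~{n_servers * failure_rate:.0f} expected failures/day")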


Consequences of Node Failure

• Data loss

• Node failure in the middle of a long and expensive computation

• Need to restart the computation from scratch


Challenge # 2

• Network bottleneck

• Computers in a cluster exchange data through the network

• Moving 10 TB of data over a 1 Gbps link takes about a day (see the arithmetic below)
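The arithmetic behind that claim, as a small Python sketch; 10 TB and 1 Gbps are the slide’s numbers, and decimal units are assumed:

data_bits = 10e12 * 8   # 10 TB expressed in bits
bandwidth = 1e9         # 1 Gbps link, in bits per second

seconds = data_bits / bandwidth
print(f"{seconds:.0f} s = {seconds / 3600:.1f} hours")   # 80000 s, about 22 hours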


Challenge # 3

• Distributed Programming is hard!!!

• Why?

• The programmer has to manage too many low-level concerns in addition to writing
the code for the task itself
Other Challenges

• How do we assign work units to workers?

• What if we have more work units than workers?

• What if workers need to share partial results?

• How do we aggregate partial results?

• How do we know all the workers have finished?

• What if workers die?


Managing Multiple Workers
• Difficult because
• We don’t know the order in which workers run
• We don’t know when workers interrupt each other
• We don’t know when workers need to communicate partial results
• We don’t know the order in which workers access shared data

• Thus, we need synchronization primitives:
• Semaphores / locks (acquire, release)
• Condition variables (wait, notify, broadcast)
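To make the burden concrete, here is a minimal producer-consumer sketch in Python using a condition variable; every detail (buffer size, loop counts) is an illustrative assumption the programmer would otherwise have to manage by hand:

import threading
from collections import deque

buffer = deque()
MAX_SIZE = 10
cond = threading.Condition()   # one lock plus wait/notify

def producer():
    for item in range(100):
        with cond:
            while len(buffer) >= MAX_SIZE:   # buffer full: wait for space
                cond.wait()
            buffer.append(item)
            cond.notify()                    # wake the waiting consumer

def consumer():
    for _ in range(100):
        with cond:
            while not buffer:                # buffer empty: wait for data
                cond.wait()
            buffer.popleft()
            cond.notify()                    # wake the waiting producer

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Getting even this tiny example right (waiting in a loop, notifying after every change) is exactly the kind of reasoning the execution framework is meant to take off the programmer’s hands.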
The datacenter is the computer

• Hide system-level details from the developers


• No more race conditions, lock contention, etc.
• No need to explicitly worry about reliability, fault tolerance, etc.

• Separating the what from the how


• Developer specifies the computation that needs to be performed
• Execution framework (“runtime”) handles actual execution
Bring computation to the data

[Diagram: traditional approach — data is moved from many nodes to a centralized processor where the code runs; modern approach (Hadoop, Spark) — the code is shipped to the nodes that already hold the data]

Key Idea Behind Parallel Processing: Divide and Conquer

[Diagram: the “Work” is partitioned into chunks w1, w2, w3; one worker processes each chunk, producing partial results r1, r2, r3; the partial results are combined into the final “Result”]
Programming Model
(Hadoop MapReduce)
Simple programming model: MapReduce
• Simple programming model
• Mainly two functions: Map and Reduce

Programmer’s responsibility:
define only the two functions, Map and Reduce, suited to your problem
Word Count Using MapReduce: Pseudocode
map(key, value)
// key: document name; value: text of the document
for each word w in value:
    emit(w, 1)

reduce(key, values)
// key: a word; values: list of counts for that word
result = 0
for each count v in values:
    result += v
emit(key, result)
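A runnable single-machine Python simulation of this pseudocode; the grouping step that the MapReduce framework normally performs between map and reduce is emulated here with a dictionary:

from collections import defaultdict

def map_fn(doc_name, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all partial counts for one word.
    return word, sum(counts)

def mapreduce_wordcount(documents):
    groups = defaultdict(list)
    for name, text in documents.items():              # map phase
        for word, one in map_fn(name, text):
            groups[word].append(one)                  # "shuffle": group values by key
    return dict(reduce_fn(w, c) for w, c in groups.items())   # reduce phase

docs = {"d1": "big data big ideas", "d2": "big clusters"}
print(mapreduce_wordcount(docs))   # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}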
Programming Model

[Diagram: the job is split into pieces, intermediate results are combined per node, and a final combining step produces the output]
Map-reduce Data Flow

[Diagram: where a map task runs relative to its HDFS data block]

• Data-local processing: the map task runs on the node that holds its data block

• Rack-local processing: the task runs on a different node in the same rack as the block

• Off-rack processing: the task must pull its block from another rack over the backbone

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White


MapReduce data flow with a single reduce task

[Diagram: all map outputs flow into one reduce task, whose output is written to HDFS with replication]

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White


MapReduce data flow with multiple reduce tasks

[Diagram: each map task’s output is partitioned among the reduce tasks]

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White


MapReduce with no reduce task

[Diagram: each input split (split 0, 1, 2) in HDFS is processed by a map task, and each map writes its output (part 0, 1, 2) directly back to HDFS with replication]

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White


Hadoop Distributed Filesystem
Why a distributed filesystem?

• When the data is very large, a single machine can’t store it

• The data needs to be partitioned and stored across many machines

• A distributed filesystem manages this storage across a network of machines

• Hadoop is built on HDFS (the Hadoop Distributed Filesystem)


Design Principles of HDFS
• Very large files
• Hundreds of terabytes or even petabytes

• Streaming data access
• Basic idea: write once, read many times
• Time to read the whole dataset (throughput) matters more than the latency of reading the first record

• Commodity hardware
• Runs on clusters of commodity hardware
• Thus the chance of node failure is high
• HDFS is designed to handle such failures without noticeable interruption
When HDFS does not work well
• Low-latency data access
• Applications that require access in the tens-of-milliseconds range
• HDFS provides high throughput, but at the expense of latency

• Lots of small files
• The namenode holds the filesystem metadata in memory
• Thus, the namenode may run out of memory if there are too many small files
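A rough sense of the scale, using the commonly cited rule of thumb of about 150 bytes of namenode memory per file, directory, or block; the exact figure varies by Hadoop version and is an assumption here:

bytes_per_object = 150          # assumed namenode memory per file or block object
n_small_files = 100_000_000     # 100 million small files, one block each

n_objects = n_small_files * 2   # one file entry plus one block entry per file
print(f"~{n_objects * bytes_per_object / 1e9:.0f} GB of namenode RAM")   # ~30 GB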
HDFS concepts
• Blocks
• A disk’s block size is the minimum amount of data it can read or write
• HDFS files are also stored in blocks, but much larger ones: typically 128 MB

• Why are HDFS blocks so large?
• The goal is to minimize the cost of seeks
• With a large block, the time to transfer the data from disk is significantly longer
than the time to seek to the start of the block, so seeks become negligible (see the
arithmetic below)
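The arithmetic, with illustrative numbers; the 10 ms seek time and 100 MB/s transfer rate are assumptions, not measurements:

seek_time = 0.010         # seconds; assumed average disk seek
transfer_rate = 100e6     # bytes/second; assumed sustained transfer rate

for block_mb in (1, 128):
    transfer_time = block_mb * 1e6 / transfer_rate
    seek_fraction = seek_time / (seek_time + transfer_time)
    print(f"{block_mb} MB block: seek is {seek_fraction:.0%} of total I/O time")
# 1 MB block: seek is 50% of the time; 128 MB block: about 1%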
HDFS Blocks
• A file is made up of a number of blocks

• Command to list the blocks that make up each file in HDFS:

hdfs fsck / -files -blocks


Node types in HDFS Cluster
• HDFS operates in a master-worker pattern

• Namenode: the master node

• Datanodes: the workers
Namenode
• The namenode manages the filesystem namespace.

• It maintains the filesystem tree and the metadata for all the files and
directories in the tree.
• This information is stored persistently on the local disk in two files:
• the namespace image and the edit log.

• The namenode also knows the datanodes on which all the blocks for a
given file are located

• But it does not store block locations persistently;

• this information is reconstructed from the datanodes when the system starts.
Datanode

• Datanodes are the workhorses of the filesystem.

• They store and retrieve blocks when they are told to (by clients or
the namenode)

• They report back to the namenode periodically with lists of the blocks
that they are storing.
Node failures
• Namenode failure
• All the files in the filesystem are effectively lost
• Without the metadata, the files cannot be reconstructed from the blocks

• Datanode failure
• Not a problem
• Data blocks are replicated across many machines
• A lost block can be recovered from another machine
HDFS (Hadoop) Architecture
namenode = master node

[Diagram: an application uses the HDFS client, which asks the namenode to map
(file name → block id) and (block id → block location); the namenode keeps the
file namespace, sends instructions to the datanodes, and receives datanode
state reports. The client then requests (block id, byte range) directly from
an HDFS datanode and receives the block data. Each datanode stores blocks in
its local Linux filesystem.]

(Ghemawat et al., SOSP 2003)
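A conceptual sketch of that read path in Python; every name here (NameNode, DataNode, hdfs_read, and their fields) is hypothetical pseudocode mirroring the diagram, not the real HDFS client API:

class NameNode:
    def __init__(self, namespace, locations):
        self.namespace = namespace   # file name -> list of block ids
        self.locations = locations   # block id -> datanode address

class DataNode:
    def __init__(self, blocks):
        self.blocks = blocks         # block id -> block bytes

    def read(self, block_id):
        return self.blocks[block_id]

def hdfs_read(path, namenode, datanodes):
    data = b""
    for block_id in namenode.namespace[path]:      # 1. ask the namenode for block ids
        address = namenode.locations[block_id]     # 2. ...and for each block's location
        data += datanodes[address].read(block_id)  # 3. fetch the bytes from the datanode
    return data

nn = NameNode({"/foo/bar": ["b1", "b2"]}, {"b1": "dn1", "b2": "dn2"})
dns = {"dn1": DataNode({"b1": b"hello "}), "dn2": DataNode({"b2": b"world"})}
print(hdfs_read("/foo/bar", nn, dns))   # b'hello world'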


HDFS

[Diagram: the namenode runs the namenode daemon, and a separate job submission
node runs the jobtracker; each slave node runs a tasktracker and a datanode
daemon on top of its local Linux filesystem]
Block Caching
• Generally, datanodes read blocks from disk

• Frequently accessed blocks can be cached in a datanode’s RAM

• By default, a block is cached in only one datanode’s memory

• Job schedulers try to run tasks on the datanode where the block is cached (see the example below)
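Caching is managed through HDFS centralized cache management; a sketch of the commands (the pool and path names are made up, and the exact flags may vary by Hadoop version):

hdfs cacheadmin -addPool hot-data
hdfs cacheadmin -addDirective -path /user/jiaul/lookup-table -pool hot-data
hdfs cacheadmin -listDirectives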
HDFS Federation

• The namenode keeps a reference to every file and block in the
filesystem in memory

• This means that, for a very large cluster, the namenode may run out of
memory to hold the metadata

• This problem is addressed by adding more namenodes to the cluster, each
managing a portion of the filesystem namespace
Tackling Namenode failure
• If the namenode fails, all metadata is lost
• We won’t be able to reconstruct the files from the blocks

• How to handle this?
• Maintain a replica of the metadata on another, passive machine
• When the active namenode fails, the admin can start the passive namenode
• It needs to load the namespace into memory before it can serve requests
Filesystem Operations
• Major Filesystem operations:
• reading files, creating directories, moving files, deleting data, and listing
directories.

• One can run Hadoop filesystem commands from the command line

• To see the details of every command:

hadoop fs -help
Filesystem Operations
• Copying a file from the local filesystem to HDFS

hadoop fs -copyFromLocal local-file hdfs-file

• Copying a file from HDFS to the local filesystem

hadoop fs -copyToLocal hdfs-file local-file
Filesystem Operations
• Creating a directory
hadoop fs -mkdir mydir

• Listing the files


hadoop fs -ls
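Putting these together, a typical first session might look like this; the file and directory names are just examples:

hadoop fs -mkdir books
hadoop fs -copyFromLocal war-and-peace.txt books/
hadoop fs -ls books
hadoop fs -copyToLocal books/war-and-peace.txt copy.txt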
