
DISTRIBUTED COMPUTING

Session # 9

BITS Pilani Dr. Sudheer Reddy


Pilani Campus
Consensus and Agreement
Algorithms
• Problem definition
• The Byzantine agreement and other consensus problems
• Overview of Results
• Agreement in failure-free system (synchronous or asynchronous)
• Agreement in (message-passing) synchronous systems with failures

BITS Pilani, Pilani Campus


Problem Definition

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Introduction

• Fundamental Requirement - Agreement among the


processes in a distributed system.
• Need to reach a common understanding or agreement
before taking application-specific actions.
e.g., – ‘commit’ decision in database systems, wherein the processes
collectively decide whether to ‘commit’ or ‘abort’ a transaction that they
participate in.

BITS Pilani, Pilani Campus


Agreement Protocols

• When distributed systems engage in cooperative efforts like


enforcing distributed mutual exclusion algorithms,
processor failure can become a critical factor.
• Processors may fail in various ways, and their failure
modes are central to the ability of healthy processors to
detect and respond to such failures.

BITS Pilani, Pilani Campus


The Byzantine agreement and other
consensus problems

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


What happened at Byzantium?

• May 29th, 1453


• The Ottoman Turks are besieging the city of Constantinople (Byzantium) with a coordinated attack.

• Goals
– Consensus among the loyal generals
– A small number of traitors cannot cause the loyal generals to adopt a bad plan
– The loyal generals do not have to identify the traitors

BITS Pilani, Pilani Campus


Byzantine general sending
confusing messages

BITS Pilani, Pilani Campus


Agreement Protocol: The System model

• There are n processors in the system and at most m of them can


be faulty
• The processors can directly communicate with other processors
via messages (fully connected system)
• A receiver always knows the identity of the sending processor
• The communication system is reliable

BITS Pilani, Pilani Campus


Communication Requirements

• Synchronous communication model is assumed in this


section:
– Healthy processors receive, process and reply to messages in a
lockstep manner
– The receive, process, reply sequence is called a round
– In the sync-comm model, processes know what messages they expect
to receive in a round
• The sync model is critical to agreement protocols, and
the agreement problem is not solvable in an
asynchronous system

BITS Pilani, Pilani Campus


Processor Failures

• Crash fault
– Abrupt halt, never resumes operation
• Omission fault
– Processor “omits” to send required messages to some other
processors
• Malicious fault
– Processor behaves randomly and arbitrarily
– Known as Byzantine faults

BITS Pilani, Pilani Campus


Message Types

• Authenticated messages (also called signed messages)


– assure the receiver of correct identification of the sender

• Non-authenticated messages (also called oral messages)


– are subject to intermediate manipulation

BITS Pilani, Pilani Campus


Agreement Problems –
Byzantine Agreement
• The Byzantine agreement problem requires a designated
process, called the source process, with an initial value, to
reach agreement with the other processes about its initial
value, subject to the following conditions:
– Agreement : All non-faulty processes must agree on the same value.
– Validity : If the source process is non-faulty, then the agreed upon value by
all the non-faulty processes must be the same as the initial value of the
source.
– Termination : Each non-faulty process must eventually decide on a value.
• Two popular flavors of the Byzantine agreement problem
– Consensus problem
– Interactive consistency problem

BITS Pilani, Pilani Campus


Consensus Problem

• The consensus problem differs from the Byzantine


agreement problem in that each process has an
initial value and all the correct processes must
agree on a single value.
– Agreement : All non-faulty processes must agree on the same
(single) value
– Validity : If all the non-faulty processes have the same initial value,
then the agreed upon value by all the non-faulty processes must be
that same value.
– Termination : Each non-faulty process must eventually decide on a
value.

BITS Pilani, Pilani Campus


Interactive Consistency
Problem
• The interactive consistency problem differs from
the Byzantine agreement problem in that each
process has an initial value, and all the correct
processes must agree upon a set of values, with
one value for each process.
– Agreement : All non-faulty processes must agree on the same array
of values A[v1….vn]
– Validity : If process i is non-faulty and its initial value is vi, then all
non-faulty processes agree on vi as the ith element of the array A. If
process j is faulty, then the non-faulty processes can agree on any
value for A[j].
– Termination : Each non-faulty process must eventually decide on
the array A.
BITS Pilani, Pilani Campus
Agreement Problems -
Summary

Problem                   Who initiates value   Final Agreement

Byzantine Agreement       One processor         Single value

Consensus                 All processors        Single value

Interactive Consistency   All processors        A vector of values

BITS Pilani, Pilani Campus


Overview of Results

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Overview of Results

BITS Pilani, Pilani Campus


Agreement in failure-free system
(synchronous or asynchronous)

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Agreement in failure-free system
(synchronous or asynchronous)
• In a failure-free system, consensus can be reached by
collecting information from the different processes,
arriving at a “decision” and distributing this decision in
the system.
• Each process broadcasts its value to the others.
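A minimal sketch of this idea for a failure-free synchronous round: every process broadcasts its value, and every process applies the same deterministic rule (here, the minimum) to the collected values, so all decide identically. The process names and input values are illustrative assumptions.

    # Failure-free consensus sketch: all-to-all broadcast, then a common decision rule.
    initial_values = {"p1": 7, "p2": 3, "p3": 9}      # each process's input value

    # Round 1: every process broadcasts; with no failures, everyone
    # ends up holding the same set of values.
    received = {p: dict(initial_values) for p in initial_values}

    # Decision: apply the same deterministic function everywhere.
    decisions = {p: min(vals.values()) for p, vals in received.items()}
    print(decisions)    # {'p1': 3, 'p2': 3, 'p3': 3} -> all agree on 3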

BITS Pilani, Pilani Campus


Agreement in (message-passing)
synchronous system with failures

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Byzantine Agreement

• Theorem: There is no algorithm to solve Byzantine agreement if only oral


messages are used, unless more than two-thirds of the
generals are loyal.

• In other words, it is impossible if n ≤ 3f for n processes, f of


which are faulty

• Oral messages are under control of the sender


– sender can alter a message that it received before forwarding it

• Let’s look at examples for case of n=3, f=1

BITS Pilani, Pilani Campus


Case 1

• Traitor lieutenant tries to foil consensus by refusing to participate

"white hats" == loyal or "good guys"
"black hats" == traitor or "bad guys"

Round 1: Commanding General 1 sends "Retreat" to both lieutenants
Round 2: L3 sends "Retreat" to L2, but L2 sends nothing
Decide: L3 decides "Retreat" (the loyal lieutenant obeys the commander - good)

Acknowledgement: Class lectures of A.D. Brown of UofT

BITS Pilani, Pilani Campus


Case 2a

• Traitor lieutenant tries to foil consensus by lying about the order sent by the general

Round 1: Commanding General 1 sends "Retreat" to both lieutenants
Round 2: L3 sends "Retreat" to L2; L2 sends "Attack" to L3
Decide: L3 decides "Retreat" (the loyal lieutenant obeys the commander - good)

Acknowledgement: Class lectures of A.D. Brown of UofT

BITS Pilani, Pilani Campus


Case 2b

• Traitor lieutenant tries to foil consensus by lying about the order sent by the general

Round 1: Commanding General 1 sends "Attack" to both lieutenants
Round 2: L3 sends "Attack" to L2; L2 sends "Retreat" to L3
Decide: L3 decides "Retreat" (the loyal lieutenant disobeys the commander - bad)

Acknowledgement: Class lectures of A.D. Brown of UofT

BITS Pilani, Pilani Campus


Case 3

• Traitor General tries to foil consensus by sending different orders to the loyal lieutenants

Round 1: General sends "Attack" to L2 and "Retreat" to L3
Round 2: L3 sends "Retreat" to L2; L2 sends "Attack" to L3
Decide: L2 decides "Attack" and L3 decides "Retreat" (the loyal lieutenants obey the commander - good; but they decide differently - bad)

Acknowledgement: Class lectures of A.D. Brown of UofT

BITS Pilani, Pilani Campus


Oral Message Algorithm
(Lamport – Shostak – Pease)

Oral Message algorithm, OM(m) consists of m+1 “phases”


Algorithm OM(0) is the “base case” (no faults)
1) Commander sends value to every lieutenant
2) Each lieutenant uses value received from commander, or default “retreat” if no value was
received
Recursive algorithm OM(m) handles up to m faults
1) Commander sends value to every lieutenant
2) For each lieutenant i, let vi be the value i received from commander, or “retreat” if no value
was received. Lieutenant i acts as commander and runs Alg OM(m-1) to send vi to each of
the n-2 other lieutenants
3) For each i, and each j not equal to i, let vj be the value Lieutenant i received from Lieutenant j
in step (2) (using Alg OM(m-1)), or else “retreat” if no such value was received. Lieutenant i
uses the value majority(v1, … , vn-1) to compute the agreed upon value.
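A minimal Python sketch of OM(m), under the simplifying assumption that a traitor flips any value it forwards (a real traitor can behave arbitrarily); the process ids, the traitor set and the printed example are illustrative only.

    from collections import Counter

    DEFAULT = "retreat"

    def majority(values):
        """Most common value; fall back to the default order on a tie."""
        counts = Counter(values).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            return DEFAULT
        return counts[0][0]

    def send(value, sender, traitors):
        """Oral message: a traitorous sender may alter what it forwards."""
        if sender in traitors:
            return "attack" if value == "retreat" else "retreat"
        return value

    def om(m, commander, value, lieutenants, traitors):
        """Return {lieutenant: value decided for this commander's order}."""
        received = {l: send(value, commander, traitors) for l in lieutenants}
        if m == 0:
            return received
        # Each lieutenant j acts as commander in OM(m-1) to relay its value.
        relayed = {j: om(m - 1, j, received[j],
                         [l for l in lieutenants if l != j], traitors)
                   for j in lieutenants}
        return {i: majority([received[i]] +
                            [relayed[j][i] for j in lieutenants if j != i])
                for i in lieutenants}

    # n = 4, m = 1, lieutenant 3 is the traitor: loyal L1 and L2 agree on "attack".
    print(om(1, commander=0, value="attack",
             lieutenants=[1, 2, 3], traitors={3}))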

BITS Pilani, Pilani Campus


Example OM with m=1, and n=4

Step 1: Commander 1 sends the same value, v, to all lieutenants
Step 2: Each of L2, L3, L4 executes OM(0) as commander, but the traitor L2 sends arbitrary values x and y
Step 3: Decide
  L3 has {v, v, x}
  L4 has {v, v, y}
  Both choose v.

BITS Pilani, Pilani Campus


Example OM continued: traitor Commander

Step 1: The traitor Commander 1 sends a different value (x, y, z) to each lieutenant
Step 2: Each of L2, L3, L4 executes OM(0) as commander, sending the value it received
Step 3: Decide
  L2 has {x, y, z}
  L3 has {x, y, z}
  L4 has {x, y, z}
  All loyal lieutenants get the same result.

BITS Pilani, Pilani Campus


BA: Stages in OM

[Figure: message exchanges in OM. Stage 0: the commander P1 sends (n-1) messages to P2..Pn. Stage 1: each lieutenant relays, giving (n-1)(n-2) messages. The relaying continues in the same way up to stage m.]

BITS Pilani, Pilani Campus


O(n^m) Complexity of OM

OM(m) triggers (n-1) invocations of OM(m-1)


OM(m-1) triggers (n-2) invocations of OM(m-2)

OM(m-k) is invoked (n-1)(n-2)…(n-k) times

…down to OM(0), so the algorithm is O(n^m) in messages
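A quick numeric sketch of that growth, counting one message per lieutenant contacted at each level of the recursion (a simplifying assumption):

    def om_messages(n, m):
        total, term = 0, 1
        for k in range(m + 1):
            term *= (n - 1 - k)    # stage k contributes (n-1)(n-2)...(n-1-k) messages
            total += term
        return total

    print(om_messages(4, 1))   # 3 + 3*2 = 9
    print(om_messages(7, 2))   # 6 + 6*5 + 6*5*4 = 156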

BITS Pilani, Pilani Campus


Interactive Consistency Model

BITS Pilani, Pilani Campus


Solution with Signed Messages

We can cope with any number of traitors


• Signing prevents traitors from lying about the commander's order
• Messages are signed by the commander
• The signature can be verified by all loyal lieutenants
• All loyal lieutenants eventually receive the same set of commands
• If the commander is loyal, the algorithm works
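A minimal sketch of the underlying idea, using an HMAC as a stand-in for the commander's signature (an assumption made for brevity; the signed-message algorithm actually requires signatures that others can verify but not forge, which a shared-key MAC does not fully provide):

    import hmac, hashlib

    COMMANDER_KEY = b"commander-secret"    # known to the commander (illustrative assumption)

    def sign(order: str) -> str:
        return hmac.new(COMMANDER_KEY, order.encode(), hashlib.sha256).hexdigest()

    def verify(order: str, signature: str) -> bool:
        return hmac.compare_digest(sign(order), signature)

    signed = ("attack", sign("attack"))    # the commander's signed order

    # A traitorous lieutenant relays a forged order with the old signature:
    forged = ("retreat", signed[1])

    print(verify(*signed))   # True  -> loyal lieutenants accept "attack"
    print(verify(*forged))   # False -> the forgery is detected and discarded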

BITS Pilani, Pilani Campus


Applications of BA

Building fault tolerant distributed services, e.g.,


• Hardware Clock Synchronization in presence of faulty
nodes
• Distributed commit in databases

BITS Pilani, Pilani Campus


Summary

• Consensus and Agreement


Algorithms
• Problem Definition
• The Byzantine agreement and other consensus problems
• Overview of Results
• Agreement in failure-free system (synchronous or asynchronous)
• Agreement in (message-passing) synchronous system with
failures

BITS Pilani, Pilani Campus


DISTRIBUTED COMPUTING
Session # 10

BITS Pilani Dr. Sudheer Reddy


Pilani Campus
Peer-to-Peer Computing and
Overlay Graphs
• Introduction
• Data indexing and Overlays
• Unstructured Overlays

BITS Pilani, Pilani Campus


Introduction

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


What is a Peer?

• Contrast with Client-Server


model
• Servers are centrally
maintained and
administered
• Client has fewer resources
than a server

BITS Pilani, Pilani Campus


What is a P2P Overlay?

[Figure: a P2P overlay layer of peers A-H built on top of the native IP layer, which spans several autonomous systems (AS1-AS6). Overlay links are chosen according to each peer's selfish objective, and the overlay provides improved services.]

BITS Pilani, Pilani Campus


Latency optimized Overlay: an
example

[Figure: peers A, B, C, D connected by overlay links with latencies of 20 ms and 50 ms.]

– Overlay routing is independent of native-layer (IP) routing


– Each overlay path comprises one or more overlay links, chosen according to a selfish
objective

BITS Pilani, Pilani Campus


Characteristics and Performance features

• Distributed Control
• Efficient use of resources
• Self-organizing
• Scalability
• Reliability
• Ease of deployment and administration

BITS Pilani, Pilani Campus


Generic P2P Architecture

[Figure: a generic P2P architecture, layered bottom-up: Operating System; Capability & Configuration and NAT/Firewall Traversal; Neighbor Discovery, Peer Role Selection, Join/Leave, Bootstrap, and Content Storage; Routing and Forwarding; Overlay Messaging API; Search API.]

BITS Pilani, Pilani Campus


Centralized P2P Networks

BITS Pilani, Pilani Campus


Decentralized P2P
Architectures

Purely Decentralized
Partially Decentralized

BITS Pilani, Pilani Campus


Applications over P2P
Application Purpose            Example systems (purely / partially / hybrid decentralized)
Content Publishing & Storage   Freenet, PAST, OceanStore, CFS [Dabek et al. 2001]
File Sharing                   Gnutella, Free Haven (purely); Gnutella 0.6 [Gnutella 2002], Kazaa (partially); Napster, BitTorrent (hybrid)
Anonymous Storage              Freenet
Communication & Messaging      Groove, SharedMind (purely); Skype (partially); Jabber (hybrid)
Backup operations              Pastiche
Infrastructure base            Chord [Stoica et al. 2001], Pastry [Rowstron and Druschel 2000], Tapestry [Zhao et al. 2003], CAN [Ratnasamy et al. 2001]
Routing                        RON [Andersen et al. 2001]
Search engines                 Minerva [Michel et al. 2005], ODISSEA [Suel et al. 2003]
Computation                    SETI@home
Web Caching                    Squirrel
Multicasting                   SplitStream [Castro et al. 2003], Scribe, Bayeux [Zhuang et al. 2001]
Spam filtering                 SpamWatch [Zhou et al. 2003]
Domain name lookup             DNS using Chord [Cox 2002]
Multimedia Streaming           PPLive, Coolstreaming
Intrusion Detection            Indra [Janakiraman et al. 2003]
BITS Pilani, Pilani Campus
Commercial Applications over P2P

Application     Purpose                                                             Company                   Since
Kontiki         Hybrid peer-to-peer content distribution targeting media content    M K Capital               2008
Skype           Voice over peer-to-peer                                             eBay                      2006
Groove          Collaboration without any central server; known customers include
                Dell, Department of Defense, GlaxoSmithKline and Veridian           Groove Networks           2001
Magi            P2P infrastructure platform for building secure, cross-platform,
                collaborative applications                                          Endeavours Technologies   2001
Zattoo          Streaming to TV viewers over the Internet                           Zattoo                    2006
CoolStreaming   Sharing television content using swarms                             Roxbeam Corp.             2005
PPTV            Live and on-demand TV service over the Internet                     PPLive                    2005
Joost           Streaming legal videos in agreement with publishers                 Adconion Media Group      2009

BITS Pilani, Pilani Campus


Characteristics

• P2P Network
• All nodes are equal
• Location of arbitrary objects
• No DNS servers
• Resources
• Dynamic participation

BITS Pilani, Pilani Campus


Data indexing and Overlays

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Indexing and overlays

• Data indexing
• Classification:
• Centralized
• Local
• Distributed

BITS Pilani, Pilani Campus


Centralized Indexing

• Centralized indexing
• One or few central servers to store references (indexes)

• e.g., Napster

BITS Pilani, Pilani Campus


Napster: Centralized architecture

Centralized Lookup
– Centralized directory services
– Steps
• Connect to Napster server.
• Upload list of files to server.
• Give server keywords to search the full list with.
• Select “best” of correct answers
– Performance Bottleneck
• Lookup is centralized, but files are copied in P2P manner
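A minimal sketch of such a centralized index, assuming a single in-memory map from keywords to the peers that uploaded matching file names (the peer ids and file names are made up):

    # Central server: keyword -> list of (peer, filename) entries uploaded by peers.
    index = {}

    def upload(peer, files):                 # step 2: a peer uploads its file list
        for name in files:
            for word in name.lower().split("_"):
                index.setdefault(word, []).append((peer, name))

    def search(keyword):                     # step 3: keyword search at the server
        return index.get(keyword.lower(), [])

    upload("peer-17", ["star_wars.mp3", "pop_hits.mp3"])
    upload("peer-42", ["star_trek.mp3"])
    print(search("star"))   # [('peer-17', 'star_wars.mp3'), ('peer-42', 'star_trek.mp3')]
    # Step 4: the client picks the "best" peer and downloads directly from it (P2P).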

BITS Pilani, Pilani Campus


Local Indexing

• Each peer indexes only its local data objects


• Remote objects need to be searched
• Used in unstructured overlays
– flooding search
– random walk search.
• e.g., Gnutella

BITS Pilani, Pilani Campus


Distributed Indexing

• Indexes to the objects are scattered across the various peers


• A structure is used to access the indexes
• Distributed Hash Table (DHT)

BITS Pilani, Pilani Campus


Classification of P2P overlay networks

• Structured overlay networks


– Distributed Hash Tables (DHT)
– The overlay network
• assigns keys to data items
• organizes its peers into a graph
– e.g., hyper cube, mesh
– Rigid organizational principles for object storage and object
search

• Unstructured overlay networks


– Peers in a random graph in flat or hierarchical manner
– Loose guidelines for object search and storage
– Ad-hoc search mechanisms, variants of flooding and random
walk

BITS Pilani, Pilani Campus


Unstructured Overlays

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Unstructured Overlays

• Loose rules
• Flooding or Random walks
• Flood query → list of all matching content

BITS Pilani, Pilani Campus


Unstructured Overlays:
Properties
• Semantic indexing possible
• High churn accommodation
• Efficient in case of replication
• Good if user satisfied with “best-effort” search
• Low Delays
• e.g., Gnutella, BitTorrent

BITS Pilani, Pilani Campus


Message types in Gnutella

• Ping
• Pong
• Query
• QueryHit
• Push

BITS Pilani, Pilani Campus


Gnutella query flooding

• Fully distributed lookup for files
  • No central server
  • Main representative of unstructured P2P
  • Flooding-based lookup: inefficient in terms of scalability and bandwidth

• Overlay network: graph
  • Edge between peer X and Y if there is a TCP connection between them
  • All active peers and edges form the overlay network
  • An edge is not a physical link
  • A given peer will typically be connected with < 10 overlay neighbors

BITS Pilani, Pilani Campus


Gnutella protocol

• Query message sent over existing TCP connections
• Peers forward the Query message
• QueryHit sent back over the reverse path

Scalability:
limited-scope flooding

BITS Pilani, Pilani Campus


Gnutella search

• Flooding based search


• A large (linear) part of the network is covered
• Enormous number of redundant messages
• All users do this in parallel: local load grows linearly with size
• Search protocols
• Controlling topology to allow for better search
• Random walk, Degree-biased Random Walk
• Controlling placement of objects
• Replication
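A minimal sketch of limited-scope flooding with a TTL and duplicate suppression over a toy overlay graph (the graph, object placement and TTL are illustrative assumptions):

    from collections import deque

    overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
               "D": ["B", "C", "E"], "E": ["D"]}
    holder = {"E": {"Object X"}}                 # which peer stores which objects

    def flood_search(start, obj, ttl=3):
        seen, hits = {start}, []
        frontier = deque([(start, ttl)])
        while frontier:
            peer, t = frontier.popleft()
            if obj in holder.get(peer, set()):
                hits.append(peer)                # a QueryHit would travel back on the reverse path
            if t == 0:
                continue
            for nbr in overlay[peer]:
                if nbr not in seen:              # suppress duplicate forwarding
                    seen.add(nbr)
                    frontier.append((nbr, t - 1))
        return hits

    print(flood_search("A", "Object X"))   # ['E'] with ttl=3; [] if the TTL is too small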

BITS Pilani, Pilani Campus


Gnutella search (Contd..)

[Figure: looking up 'Object X' in a Gnutella overlay of peers A-H. The Query floods from A across the overlay; G holds Object X, and the QueryHit travels back to A along the reverse path.]

BITS Pilani, Pilani Campus


BitTorrent

• .torrent
• Metadata
• Tracker

BITS Pilani, Pilani Campus


BitTorrent Language

Seeder = a peer that provides the complete file.


Initial seeder = a peer that provides the initial copy.
Leecher = a peer that is still downloading pieces of the file.

[Figure: a swarm with an initial seeder, a seeder, and two leechers.]

BITS Pilani, Pilani Campus


Simple example

[Figure: seeder A holds all pieces {1,2,3,4,5,6,7,8,9,10}. Downloaders B and C start with nothing and gradually accumulate pieces (e.g., B: {1,2,3} -> {1,2,3,4} -> {1,2,3,4,5}; C: {1,2,3} -> {1,2,3,5}), exchanging pieces with A and with each other.]

BITS Pilani, Pilani Campus


Download in progress

BITS Pilani, Pilani Campus


Download in progress (Contd..)

BITS Pilani, Pilani Campus


Summary

• Peer-to-Peer Computing and Overlay


Graphs
• Introduction
• Data indexing and Overlays
• Unstructured Overlays

BITS Pilani, Pilani Campus


DISTRIBUTED COMPUTING
Session # 11

BITS Pilani Dr. Sudheer Reddy


Pilani Campus
Peer-to-Peer Computing and
Overlay Graphs
• Structured Overlays: CHORD DHT
• Design issues of P2P overlays
• Graph structure of Complex networks, Internet Graphs
• Generalized Random graph networks, Small-world and Scale-free
networks

BITS Pilani, Pilani Campus


Structured Overlays: CHORD DHT

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classification of P2P overlay networks

• Structured overlay networks


– Distributed Hash Tables (DHT)
– The overlay network
• assigns keys to data items
• organizes its peers into a graph
e.g., hyper cube, mesh
– Rigid organizational principles for object storage and object
search

• Unstructured overlay networks


– Peers in a random graph in flat or hierarchical manner
– Loose guidelines for object search and storage
– Ad-hoc search mechanisms, variants of flooding and random
walk

BITS Pilani, Pilani Campus


DHT based P2P

• A Distributed Hash Table (DHT) - distributes data among


a set of nodes
• Each peer responsible for a range of data items
• Partial knowledge in each peer

BITS Pilani, Pilani Campus


Chord DHT

• The Chord protocol, proposed by Stoica et al


• Uses a flat key space
• Both node addresses and data objects/files/values are hashed into this key space

BITS Pilani, Pilani Campus


An Example

BITS Pilani, Pilani Campus


CHORD Finger table

https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf

BITS Pilani, Pilani Campus


An example CHORD ring
"Where is key 80?"

Finger table of N08:
  key   node
  9     N32
  10    N32
  12    N32
  16    N32
  33    N60
  40    N60

[Figure: a Chord ring with nodes N08, N32, N60, N90, N105, N120. N08 forwards the lookup for key 80 around the ring via its fingers until it reaches N90, the successor of K80: "N90 has K80".]
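A minimal sketch of Chord-style key placement on this example ring: a key is stored at its successor, the first node clockwise from the key. The 7-bit identifier space and node ids below simply mirror the figure; hashing of real node addresses and keys is omitted.

    import bisect

    RING_BITS = 7                        # identifier space 0..127 (assumption)
    NODES = sorted([8, 32, 60, 90, 105, 120])

    def successor(key):
        """Return the first node clockwise from `key` on the ring."""
        key %= 2 ** RING_BITS
        i = bisect.bisect_left(NODES, key)
        return NODES[i % len(NODES)]     # wrap around past the highest id

    print(successor(80))    # 90  -> "N90 has K80"
    print(successor(10))    # 32
    print(successor(121))   # 8   (wraps around the ring)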

BITS Pilani, Pilani Campus


Design issues of P2P overlays

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Design Issues of P2P Overlays

• Fairness in P2P Systems


• Selfish behavior
• free-riding/leeching

• Trust/reputation in P2P Systems


• Quantifying trust and using different metrics for trust
• How to maintain trust about other peers in the face of collusion
• How to minimize the cost of the trust management protocols

BITS Pilani, Pilani Campus


Fairness in P2P Systems

BITS Pilani, Pilani Campus


Trust / Reputation in P2P
Systems
• Incentive-based economic mechanisms
• Trust in the quality of data being downloaded
• Evaluate the trust in particular offerers by contacting other nodes
• May be susceptible to various forms of malicious attack, thereby
requiring strong security guarantees.

BITS Pilani, Pilani Campus


Characterizing Complex Networks

• P2P overlay graphs can have different structures.


• Computer Science: WWW, INTNET, AS graphs
• Social networks (SOC), the phonecall graph (PHON), the movie actor
collaboration graph (ACT), the author collaboration graph (AUTH), and citation
networks (CITE)
• Linguistics: the word co-occurrence graph (WORDOCC), and the word synonym
graph (WORDSYN)
• The power distribution grid (POWER)

• Random-Graph Model
• Small-world Networks
• Clustering

BITS Pilani, Pilani Campus


Internet Graphs - Basic Laws

BITS Pilani, Pilani Campus


Generalized Random Graph
Networks

• Random graphs are not scale-free


• The generalized random graph uses the degree
distribution as an input
• These semi-random graphs have a random distribution
of edges

BITS Pilani, Pilani Campus


Small World Networks

• Small diameter
• Ordered lattices have clustering coefficients independent
of the network size

BITS Pilani, Pilani Campus


Scale-free Networks

• Obey a power law for the degree distributions


• Constrained to have large clustering coefficients

BITS Pilani, Pilani Campus


Summary

• Peer-to-Peer Computing and Overlay


Graphs
• Structured Overlays: CHORD DHT
• Design issues of P2P overlays
• Graph structure of Complex networks, Internet Graphs
• Generalized Random graph networks, Small-world and Scale-
free networks

BITS Pilani, Pilani Campus


DISTRIBUTED COMPUTING
Session # 12

BITS Pilani Dr. Sudheer Reddy


Pilani Campus
Peer-to-Peer Computing and
Overlay Graphs
• Security concerns from P2P networks
• Mitigating security risks in P2P networks

BITS Pilani, Pilani Campus


Security concerns from P2P networks

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Security concerns in P2P

• Malicious attacks

BITS Pilani, Pilani Campus


Security gap in P2P

[Figure: peers A, B and X communicate over the Internet. A malicious peer C reaches peer X inside a protected network through a TCP port that the firewall leaves open for P2P traffic.]

BITS Pilani, Pilani Campus


Denial of Service

[Figure: a query-flooding denial-of-service attack. (1) Peers P1-P3 issue Gnutella queries such as "star" and "pop". (2) A malicious peer (192.168.100.40:4442) answers every query with forged QueryHits pointing at one target host, 192.168.100.220:80. (3) The peers then all issue "GET /index.html HTTP/1.0" requests to the target, overwhelming it.]

BITS Pilani, Pilani Campus


File pollution attack

[Figure: a pollution company injects polluted copies of the original content into circulation.]

BITS Pilani, Pilani Campus


File pollution attack (Contd..)

[Figure: the pollution company operates several pollution servers that push polluted copies into the file-sharing network.]

BITS Pilani, Pilani Campus


File pollution attack (Contd..)

[Figure: unsuspecting users such as Alice and Bob download the polluted copies and spread the pollution further through the file-sharing network.]

BITS Pilani, Pilani Campus


Index Poisoning

[Figure: the file-sharing network's index maps titles to locations, e.g. file1 -> 120.18.89.100, file2 -> 46.100.80.23, file3 -> 234.8.98.20.]

BITS Pilani, Pilani Campus


Index Poisoning (Contd..)

[Figure: the attacker poisons the index by inserting a bogus record, file4 -> 111.22.22.22, so peers searching for file4 are directed to an address injected by the attacker.]

BITS Pilani, Pilani Campus


Fake block attack

[Figure: a fake-block attack on a victim peer. (1) The attacker opens a TCP connection, (2) advertises a fake bitmap of pieces, (3) the victim requests a block, (4) the attacker returns a fake block, and (5) the block fails its hash check. Genuine blocks continue to arrive from honest peers, but the victim's download effort is wasted.]
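A minimal sketch of the defence implied by step 5 ("Hash Fail"): every received block is checked against the expected piece hash before it is accepted. The payloads below are made up; real BitTorrent clients take the SHA-1 piece hashes from the .torrent metadata.

    import hashlib

    def piece_ok(data: bytes, expected_sha1_hex: str) -> bool:
        return hashlib.sha1(data).hexdigest() == expected_sha1_hex

    genuine = b"genuine block payload"
    expected = hashlib.sha1(genuine).hexdigest()   # normally read from the .torrent file

    print(piece_ok(genuine, expected))          # True  -> keep the block
    print(piece_ok(b"fake block", expected))    # False -> discard it and re-request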

BITS Pilani, Pilani Campus


Free riders

[Figure: a free rider learns about seeders from the tracker and downloads from them without uploading anything in return.]

BITS Pilani, Pilani Campus


Sybil attack

[Figure: a Sybil attacker presents many fake identities that surround the victim in the overlay.]

BITS Pilani, Pilani Campus


Mitigating security risks in P2P networks

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


P2P Traffic Control

BITS Pilani, Pilani Campus


Trust Management

[Figure: trust management can be organized over a fully decentralized P2P architecture (ordinary peers only), a hybrid P2P architecture with super-peers and ordinary peers, or a centralized architecture.]

BITS Pilani, Pilani Campus


Identifying P2P traffic

• Payload signatures and port-based


• Well-known port numbers (FastTrack: 1214, BitTorrent: 6881-6889, DC:411-412,
Limewire: 6346/6347, Edonkey: 4662 etc)
• Payload signatures (GET /torrents/, …): Cisco’s PDML, Juniper’s netscreen IDP,
Microsoft’s common application signatures, NetScout etc.

• Traffic behavior based


• Source-destination IP pairs that concurrently use both TCP and UDP.
• Failed connections.
• Server and Client behavior.
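A small sketch of the port-based heuristic above; real classifiers combine it with payload signatures and behavioural features, and the flow tuples below are made up:

    P2P_PORTS = {1214,                      # FastTrack
                 6346, 6347,                # Limewire / Gnutella
                 4662,                      # eDonkey
                 411, 412,                  # Direct Connect
                 *range(6881, 6890)}        # BitTorrent 6881-6889

    def looks_like_p2p(src_port: int, dst_port: int) -> bool:
        return src_port in P2P_PORTS or dst_port in P2P_PORTS

    flows = [("10.0.0.5", 51234, "203.0.113.7", 6881),
             ("10.0.0.8", 44321, "198.51.100.2", 443)]
    for src, sp, dst, dp in flows:
        print(src, "->", dst, "P2P?", looks_like_p2p(sp, dp))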

BITS Pilani, Pilani Campus


Usage of heuristics

UDP using 80 and 443

BITS Pilani, Pilani Campus


Usage of heuristics (Contd..)

Multiple Unique IPs and ports

BITS Pilani, Pilani Campus


Machine learning Model for P2P detection

[Figure: a machine-learning pipeline for P2P detection. Packet traces are parsed and features extracted; filtered training data combined with machine-learning, statistical and expert knowledge yields an enhanced set of robust features; the resulting ML model classifies live network traffic into types (Type 1 .. Type n) and feeds a decision dashboard.]

BITS Pilani, Pilani Campus


Identifying Malicious P2P

[Figure: a pipeline for identifying malicious P2P traffic. A packet validation and filtering module discards invalid packets; a conversation creation module groups the remaining packets; a feature extraction module derives signal-processing and network-behaviour based features; classifiers such as k-NN, REP trees, ANNs and SVMs then separate malicious (P2P botnet) conversations from benign ones.]

BITS Pilani, Pilani Campus


Summary

• Peer-to-Peer Computing and Overlay


Graphs
• Security concerns from P2P networks
• Mitigating security risks in P2P networks

BITS Pilani, Pilani Campus


DISTRIBUTED COMPUTING
Session # 13

BITS Pilani Dr. Sudheer Reddy


Pilani Campus
Cluster Computing

• Cluster development trends


• Design objectives of computer clusters
• Cluster organization and resource sharing
• Node architecture and MPP packaging

BITS Pilani, Pilani Campus


Cluster development trends

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


What is a computing cluster?

A computing cluster consists of a collection of interconnected


stand-alone/complete computers, which can cooperatively
work together as a single, integrated computing resource.
Cluster explores parallelism at job level (load balancing) and
distributed computing with higher availability (fail over).

BITS Pilani, Pilani Campus


Operational Benefits of
Clustering
• High Availability (HA)
• OS and Application Reliability
• Scalability
• High Performance

BITS Pilani, Pilani Campus


Common Cluster Modes

• High Availability Cluster (HAC)


• High Performance Cluster (HPC)
• High Throughput Cluster (HTC)

BITS Pilani, Pilani Campus


High Availability Clusters

BITS Pilani, Pilani Campus


High Performance Cluster

BITS Pilani, Pilani Campus


High Throughput Cluster

[Figure: a shared pool of computing resources (processors, memory, disks) joined by an interconnect. A high-throughput cluster can either guarantee at least one workstation to many individuals (when active) or deliver a large percentage of the collective resources to a few individuals at any one time.]
BITS Pilani, Pilani Campus
HA and HP in the same Cluster

Best of both worlds

BITS Pilani, Pilani Campus


Cluster Applications

• Numerous Scientific & Engineering Applications


• Business Applications
• Internet Applications
• Mission Critical Applications

BITS Pilani, Pilani Campus


Cluster Advantages

• Price/Performance ratio
• Incremental growth
• The provision of a multipurpose system
• Scientific, commercial, Internet applications

• Mainstream Enterprise Computing

BITS Pilani, Pilani Campus


Cluster Development Trends
• From interconnecting high-end mainframe computers to clusters with massive numbers
of x86 engines
• Computer clustering started with the linking of large mainframe computers
• Demand for cooperative group computing
• Provide higher availability in critical enterprise applications
• Trend moved toward the networking of many minicomputers
• Tandem’s Himalaya : fault-tolerant online transaction processing (OLTP) applications
• 1990s - Berkeley NOW (Network of Workstations), IBM SP2 AIX-based server cluster
• 2000 - Clustering of RISC or x86 PC engines.
• The trend has been toward integrated systems, software tools, availability infrastructure, and operating
system extensions.

BITS Pilani, Pilani Campus


Milestone Research (or) Commercial
Cluster Computer Systems

BITS Pilani, Pilani Campus


Design objectives of computer clusters

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Cluster Classification

• Scalability
• Packaging
• Control
• Homogeneity
• Security
• Dedicated Vs Enterprise Clusters
Source: Wiki
• Programmability

BITS Pilani, Pilani Campus


Cluster Design Issues

• Scalable Performance
• Single-System Image (SSI)
• Availability Support
• Cluster Job Management
• Internode Communication
• Fault Tolerance and Recovery
• Cluster Family Classification:
 Compute clusters
 High Availability clusters
 Load-balancing clusters

BITS Pilani, Pilani Campus


Design Principles of Computer
Clusters

• Single System
• Single Control
• Symmetry
• Location Transparent

BITS Pilani, Pilani Campus


Middleware Support for SSI
Clustering

BITS Pilani, Pilani Campus


Realizing a single entry point in a cluster

BITS Pilani, Pilani Campus


Single File Hierarchy

• Single file hierarchy - the illusion of a single, huge file


system image that transparently integrates local and
global disks and other file devices (e.g., tapes).
large RAID disk

BITS Pilani, Pilani Campus


Single point of control,
Networking, and Memory space

BITS Pilani, Pilani Campus


Example single I/O space

4 node Linux PC cluster

BITS Pilani, Pilani Campus


Availability

The operate-repair cycle of a computer system.

Availability = MTTF / (MTTF + MTTR)
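A quick numeric sketch of this formula (the MTTF/MTTR values are made-up examples):

    def availability(mttf_hours: float, mttr_hours: float) -> float:
        return mttf_hours / (mttf_hours + mttr_hours)

    print(availability(1000, 2))      # ~0.998  -> up about 99.8% of the time
    print(availability(8760, 0.5))    # ~0.99994 -> roughly "four nines"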

BITS Pilani, Pilani Campus


Single Points of Failure in SMP and
Clusters

BITS Pilani, Pilani Campus


Fault Tolerant Cluster Configurations

• Hot Standby
• Mutual Takeover
• Fault-Tolerance

BITS Pilani, Pilani Campus


Failover

• Failure diagnosis
• Failure notification
• Failure recovery

BITS Pilani, Pilani Campus


Cluster organization and resource sharing

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


A Basic Cluster Architecture

BITS Pilani, Pilani Campus


HA and Performance: Resource sharing in Clusters

Connected through the


I/O bus

Check-pointing,
rollback recovery etc.

Scalable coherence
interface ring

BITS Pilani, Pilani Campus


Node architecture and MPP packaging

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Nodes

• Compute Nodes
• Service Nodes

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


MPP Packaging Architecture

IBM BlueGene/L Supercomputer: the world's fastest message-passing


MPP when built in 2005

BITS Pilani, Pilani Campus


Summary

• Cluster Computing
• Cluster development trends
• Design objectives of Computer clusters
• Cluster organization and resource sharing
• Node architecture and MPP packaging

BITS Pilani, Pilani Campus


DISTRIBUTED COMPUTING
Session # 14

BITS Pilani Dr. Sudheer Reddy


Pilani Campus
Cluster Computing

• Cluster System Interconnects


• Hardware, Software and Middleware support
• GPU Clusters for Massive Parallelism
• Cluster Job and Resource Management

BITS Pilani, Pilani Campus


Cluster System Interconnects

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


High-Bandwidth Interconnects

Comparison of Four Cluster Interconnect Technologies Reported in 2007

BITS Pilani, Pilani Campus


Google Search Engine Cluster

A Rack contains 80
PCs.

Cross bar switch Interconnect

BITS Pilani, Pilani Campus


InfiniBand Architecture

HCA – Host Channel Adapter. An HCA


is the point at which an InfiniBand end
node, such as a server or storage
device, connects to the InfiniBand
network.
TCA – Target Channel Adapter. This is
a specialized version of a channel
adapter intended for use in an
embedded environment such as a
storage appliance.

BITS Pilani, Pilani Campus


Hardware, Software and Middleware
support

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Hardware, Software, and
Middleware Support

BITS Pilani, Pilani Campus


GPU Clusters for Massive Parallelism

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


GPU Clusters for Massive
Parallelism
• High-performance accelerators
• Hundreds of processor cores per chip
• Up to 4 GB of on-board memory
• Sustaining memory bandwidths exceeding 100 GB/s.
• Homogeneous GPUs
• GPU cluster software includes the OS, GPU drivers, and
clustering API such as an MPI.

BITS Pilani, Pilani Campus


The Echelon GPU Chip Design

GPU Chip design for 20 Tflops performance and 1.6 TB/s memory
bandwidth in the Echelon system

BITS Pilani, Pilani Campus


GPU Cluster Components

• Components:
• The CPU host nodes
• The GPU nodes
• The cluster interconnect between CPU and GPU nodes

• High-bandwidth network and memory.

BITS Pilani, Pilani Campus


Echelon GPU Cluster
Architecture

The architecture of NVIDIA Echelon system built with a hierarchical


network of GPUs that can deliver 2.6 Pflops per cabinet

BITS Pilani, Pilani Campus


Cluster Job and Resource Management

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Cluster Job Scheduling and
Management

Job Management System (JMS):


• User Server
• Job Scheduler
• Resource Manager

BITS Pilani, Pilani Campus


Multi-Job Scheduling
Schemes
• Calendar Scheduling
• Event Scheduling
• Priority
• Static Priority
• Dynamic Priority

BITS Pilani, Pilani Campus


Job Scheduling Issues

BITS Pilani, Pilani Campus


Scheduling Modes

• Dedicated Mode
• Space Sharing
• Time Sharing
• Independent Scheduling
• Gang Scheduling
• Competition with Foreign Jobs

BITS Pilani, Pilani Campus


Job Scheduling by Tiling over
Cluster Nodes

BITS Pilani, Pilani Campus


Cluster Job Management
Systems
• Job management ~ workload management ~ load
sharing/management
• Job Management System (JMS) :
• User Server
• Job Scheduler
• Resource Manager

BITS Pilani, Pilani Campus


JMS Administration

• Functional Distribution
e.g., a user server on each host node
• Administration is centralized
• Centralized configuration and log files
• Single user interface
• Dynamic Reconfiguration
• Suspend or Kill any job

BITS Pilani, Pilani Campus


Cluster Job Types

• Serial jobs
• Parallel jobs
• Interactive jobs
• Batch jobs

BITS Pilani, Pilani Campus


Migration Schemes

• Node Availability
• Migration overhead
• Recruitment threshold

BITS Pilani, Pilani Campus


Desired Features in a JMS

• Heterogeneous
• Support parallel and batch jobs
• Load-balancing
• Dynamic migration
• Dynamic suspension and resumption
• Dynamic addition/deletion of resources
• Command-line interface and a graphics user interface

BITS Pilani, Pilani Campus


Some Available JMSs

HPC Software Vendor Supported Platforms


LSF IBM Windows, Linux
Slurm Open Source Linux
PBS Pro Altair Windows, Linux
Open PBS Open Source Windows, Linux

BITS Pilani, Pilani Campus


Load Sharing Facility (LSF)

• Job management and load sharing


• Checkpointing
• Availability
• Load migration
• SSI
• Cluster of thousands of nodes
• UNIX and Windows/NT platforms
• Grids and Clouds

BITS Pilani, Pilani Campus


LSF (Contd..)

• Interactive, batch, sequential and parallel jobs


• Server node
• Client node
• Set of tools

BITS Pilani, Pilani Campus


High Level LSF Jobs
Submission

BITS Pilani, Pilani Campus


LSF Jobs Life Cycle & Job
States

BITS Pilani, Pilani Campus


Some LSF Commands

BITS Pilani, Pilani Campus


Submit Jobs to LSF

BITS Pilani, Pilani Campus


Summary

• Cluster Computing
• Cluster system interconnects
• Hardware, software and Middleware support
• GPU Clusters for massive parallelism
• Cluster job and resource management

BITS Pilani, Pilani Campus


DISTRIBUTED COMPUTING
Session # 15

BITS Pilani Dr. Sudheer Reddy


Pilani Campus
Grid Computing

• Grid Architecture and Service Modeling


• Grid Resource Management and Brokering

BITS Pilani, Pilani Campus


Grid Architecture and Service Modeling

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Computer Systems: Single -> Global

[Figure: a taxonomy of computer systems. Single systems (PC/workstation, SMP/NUMA, vector machine, mainframe) versus distributed systems of multiple machines (clusters, client-server, grids, peer-to-peer); control and management ranges from centralised to decentralised.]

BITS Pilani, Pilani Campus


What is Grid?

 Grid is a metacomputing architecture that brings


together PCs, workstations, server clusters, supercomputers,
PDAs, etc. to form a large collection of compute power,
storage and network resources that can be used
to solve large-scale computing problems.

BITS Pilani, Pilani Campus


What Grid Computing can do?

• Exploiting underutilized resources.


• Parallel CPU Capacity.
• Applications.
• Virtual resources and virtual organizations for
collaboration.

BITS Pilani, Pilani Campus


What Grid Computing can do?
(Contd..)

The Grid virtualizes heterogeneous, geographically dispersed resources

BITS Pilani, Pilani Campus


Virtual Organizations

BITS Pilani, Pilani Campus


Scalable Computing

[Figure: scalability of computing. Performance and QoS increase as administrative barriers are crossed: from a personal device, to SMPs/supercomputers, to a local cluster, to an enterprise cluster/grid, to a global grid, and ultimately an inter-planet grid, spanning the individual, group, department, campus, state, national and global scales.]

BITS Pilani, Pilani Campus


Some Characteristics of Grids
• Numerous resources
• Owned by multiple organizations and individuals
• Connected by heterogeneous, multi-level networks
• Different security requirements and policies
• Different resource management policies
• Geographically distributed
• Unreliable resources and environments
• Resources are heterogeneous

BITS Pilani, Pilani Campus


Grid Service Families

• Computational Grids
• Data Grids
• Information or Knowledge Grids
• Business Grids

BITS Pilani, Pilani Campus


Grid Service Protocol Stack

Network links virtual private


channels

BITS Pilani, Pilani Campus


Layered Grid Architecture

[Figure: layered grid architecture. APPLICATIONS: applications and portals (scientific, engineering, collaboration, problem-solving environments, web-enabled apps). USER-LEVEL MIDDLEWARE: application development and deployment environments (languages/compilers, libraries, debuggers, monitors, web tools) with autonomic/adaptive management and grid economy. CORE MIDDLEWARE: resource management and scheduling, distributed resource coupling services (security, information, data, process, trading, QoS) above a SECURITY LAYER. Local resource managers (operating systems, queuing systems, libraries and application kernels, Internet protocols) sit over networked resources across organizations (computers, networks, storage systems, data sources, scientific instruments).]

BITS Pilani, Pilani Campus


Grid Resources

BITS Pilani, Pilani Campus


CPU Scavenging and Virtual
Supercomputer

• CPU Scavenging
• Virtual Supercomputers
• Grid Resource Aggregation

BITS Pilani, Pilani Campus


Open Grid Services Architecture(OGSA)
 VO Management Service
 Resource discovery and management service
 Job management service
 Security services
 Data management services

BITS Pilani, Pilani Campus


Four architectural models for building a data grid

BITS Pilani, Pilani Campus


Grid GARUDA

The GARUDA grid initiative is a collaboration of the scientific, engineering and academic community to carry out
research and experimentation on a nationwide grid of computational nodes and mass storage, aiming to provide
distributed data- and compute-intensive high-performance computing solutions for the 21st century.

BITS Pilani, Pilani Campus


Stateless and Stateful Web
Service

BITS Pilani, Pilani Campus


OGSA, WSRF, and Web
services

WSRF specifies how to make


Web Services stateful

BITS Pilani, Pilani Campus


Software for Grid computing

BITS Pilani, Pilani Campus


Grid Challenges

• Security
• Computational economy
• Uniform access
• System management
• Resource discovery
• Data locality
• Resource allocation and scheduling
• Application construction
• Network management
BITS Pilani, Pilani Campus
Grid Resource Management and Brokering

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


What does a Grid look like?
A Bird's-Eye View of a Global Grid

[Figure: applications submit work to grid resource brokers, which consult the Grid Information Service and dispatch jobs to geographically distributed resources R1..RN and application databases.]

BITS Pilani, Pilani Campus


Resource Management

• Autonomous resources
• Own resource management policies
• Resource Management System (RMS)

BITS Pilani, Pilani Campus


Resource Management
System

BITS Pilani, Pilani Campus


Job Scheduling

BITS Pilani, Pilani Campus


Job Scheduling Methods

• Hierarchical classification, which classifies scheduling


methods along a multi-level hierarchy of attributes
• Flat classification, based on a single attribute

BITS Pilani, Pilani Campus


Resource Brokering

An economy model for assessing grid computing services, developed at the University
of Melbourne.
BITS Pilani, Pilani Campus
A grid resource broker managing
job scheduling and execution

BITS Pilani, Pilani Campus


Architecture of the Gridbus
Resource Broker

BITS Pilani, Pilani Campus


Grid Middleware Technologies

Globus – Argonne National Lab and ISI


Gridbus – University of Melbourne
Unicore – European Project (Germany…)
Legion – University of Virginia

[Figure: a layered view of grid middleware: middleware clients and portals, user-level middleware together with 3rd-party solutions, low-level middleware, and fabric access management, realized by systems such as UNICORE, Globus, Legion and Gridbus.]

BITS Pilani, Pilani Campus


Globus Toolkit

[Figure: the Globus Toolkit stack: applications and third-party user-level middleware sit on top of the Globus services, namely grid resource management (GRAM, GASS), grid information services (MDS), and grid data management (GridFTP, Replica Catalog), all running over the GSI security layer and the underlying grid resources and local services.]

BITS Pilani, Pilani Campus


Globus Toolkit Services

• Security (GSI)
• Job submission and management (GRAM)
• Information services (MDS)
• Remote file management (GASS)
• Remote Data Catalogue and Management Tools
• WSRF

BITS Pilani, Pilani Campus


Summary

• Grid Computing
• Grid Architecture and Service Modeling
• Grid Resource Management and Brokering

BITS Pilani, Pilani Campus


DISTRIBUTED COMPUTING
Session # 16

BITS Pilani Dr. Sudheer Reddy


Pilani Campus
Internet of Things

• IoT for Ubiquitous Computing


• RFID
• Sensors and ZigBee Technologies
• Applications of IoT
• Graph Theoretic Analysis of Social Networks
• Facebook, and Twitter case studies

BITS Pilani, Pilani Campus


IoT for Ubiquitous Computing

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Introduction

• Ubiquitous computing
• Network of sensor or radio connected devices
• Internet of Things (IoT)
• RFID with IPv6
• IP-Identifiable objects

BITS Pilani, Pilani Campus


The Internet of Things for
Ubiquitous Computing
• Natural extension of the Internet
• RFID
• Discovery of tagged objects and mobile devices
• Everyday objects

BITS Pilani, Pilani Campus


Technology Road map of IoT

BITS Pilani, Pilani Campus


IoT Architecture

[Figure: a three-layer IoT architecture. Application layer: merchandise tracking, environment protection, intelligent search, tele-medicine, intelligent traffic and smart home applications on a cloud computing platform. Network layer: mobile networks, the Internet, and telecom/information networks. Sensing layer: RFID labels, sensor-network nodes, and GPS/road mappers.]

BITS Pilani, Pilani Campus


RFID

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


RFID : Radio Frequency
IDentification

An RFID tag used for


electronic toll collection.

BITS Pilani, Pilani Campus


RFID Tags

• Active RFID
• Passive RFID
• Battery-assisted passive RFID

BITS Pilani, Pilani Campus


RFID Device Components

• RFID tag
• Reader antenna
• Reader

BITS Pilani, Pilani Campus


How RFID Works?

BITS Pilani, Pilani Campus


RFID Technology

BITS Pilani, Pilani Campus


RFID Merchandize Tracking in
Distribution Center

BITS Pilani, Pilani Campus


Sensors and ZigBee technologies

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


What is Wireless Sensor
Network?
• Self-organizing networks formed by many autonomous sensor nodes
– Each node comprises its own power supply, processing unit, radio and sensors
– Typically peer-to-peer communication (no central server)
– Many (100 to 10,000) sensor nodes per net

• Various Applications: Industrial Automation, Building Control, Health Care,


Military, Farming, Traffic Control, Home Automation, ...

BITS Pilani, Pilani Campus


Sensor Networks

• Wireless sensor networks (WSNs)


• Spatially distributed autonomous sensors
• Motivated by military applications

BITS Pilani, Pilani Campus


Wireless Networks for Supporting
Ubiquitous Computing

BITS Pilani, Pilani Campus


Structure of typical ZigBee
Network

BITS Pilani, Pilani Campus


Use Case: Cinema Example
Step 1: The environment (a cinema) is equipped with a ZigBee node.
Step 2: A user carrying a Z-SIM phone is recognized by the environment and invited to buy a cinema ticket ("Do you confirm the purchase of 'La fabbrica di cioccolato'?").
Step 3: The user buys the content through a SIM-toolkit application; the operator manages the billing over GPRS/UMTS and downloads a token to the SIM (as an encrypted SMS) that enables access to the cinema.
Step 4: The token is received and sent to the printer via ZigBee, and the ticket is printed automatically.

BITS Pilani, Pilani Campus


Applications of IoT

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


IoT Telemedicine Applications

Patient Data Transferred Using a Wireless Sensor Network

BITS Pilani, Pilani Campus


Supply chain management

BITS Pilani, Pilani Campus


Smart Power Grid

BITS Pilani, Pilani Campus


Graph theoretic analysis of social networks

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


ONLINE SOCIAL AND
PROFESSIONAL NETWORKING

• Enables many social and professional functions over the


Internet
• Online social networks (OSNs)
• Representing the social relationships of individuals
• Nodes represent the individuals
• Ties between the nodes represent the relationships

BITS Pilani, Pilani Campus


Graph-Theoretic Analysis of
Social Networks
• Mapped into directed or undirected graphs -
acquaintance graphs or simply social connection graphs
• Nodes correspond to the users or actors
• Graph edges or links refer to the ties or relationships
• Complex and hierarchically structured to reflect
relationships at all levels.
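A minimal sketch of this kind of analysis on a toy acquaintance graph, computing each actor's degree and local clustering coefficient (the people and ties are invented):

    from itertools import combinations

    ties = [("Asha", "Bala"), ("Asha", "Chitra"), ("Bala", "Chitra"),
            ("Chitra", "Dev"), ("Dev", "Esha")]

    adj = {}
    for u, v in ties:                       # build an undirected adjacency map
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def clustering(node):
        """Fraction of a node's neighbour pairs that are themselves connected."""
        nbrs = adj[node]
        if len(nbrs) < 2:
            return 0.0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        return links / (len(nbrs) * (len(nbrs) - 1) / 2)

    for person in adj:
        print(person, "degree:", len(adj[person]),
              "clustering:", round(clustering(person), 2))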

BITS Pilani, Pilani Campus


A Graph Representation of an
Example of a Social Network

BITS Pilani, Pilani Campus


Facebook

BITS Pilani, Pilani Campus


Facebook Architecture

BITS Pilani, Pilani Campus


Twitter

BITS Pilani, Pilani Campus


Twitter Architecture and
Access Protocol Sequence

BITS Pilani, Pilani Campus


Summary

• Internet of Things
• IoT for Ubiquitous Computing
• RFID
• Sensors and ZigBee Technologies
• Applications of IoT
• Graph theoretic analysis of social networks
• Facebook, and Twitter case studies

BITS Pilani, Pilani Campus


DISTRIBUTED COMPUTING
Session # 17

BITS Pilani Dr. Sudheer Reddy


Pilani Campus
Course Objective

• To learn hardware architectures for building distributed


systems, and their communication models.
• To learn the design aspects of various software applications
that can be deployed on various distributed systems.
• To provide an understanding of the complexities and resource
management issues that are critical in a large distributed
system.
• To provide algorithmic aspects of building/designing distributed
systems in domains like IoT, P2P, Cluster, Grid computing etc.

BITS Pilani, Pilani Campus


Books
• Ajay D. Kshemkalyani, and Mukesh Singhal
“Distributed Computing: Principles, Algorithms,
and Systems”, Cambridge University Press, 2008
(Reprint 2013).

• John F. Buford, Heather Yu, and Eng K. Lua, “P2P


Networking and Applications”, Morgan
Kaufmann, 2009 Elsevier Inc.

• Kai Hwang, Geoffrey C. Fox, and Jack J.


Dongarra, “Distributed and Cloud Computing:
From Parallel processing to the Internet of
Things”, Morgan Kaufmann, 2012 Elsevier Inc.

BITS Pilani, Pilani Campus


Content Structure
Module Module Title Objectives
No
1 Introduction to This module will give an introduction to Distributed computing
Distributed Computing in terms of various hardware and software models that can be
used to build a distributed computing system. It will also cover
communication models appropriate for building these systems.
Design issues and Challenges for building such systems will
also be discussed.
2 Logical Clocks & Vector To explore alternate ways of managing time (clock) in a
Clocks distributed computing system as there is no centralized
hardware clock available in this type of system. To study ways
of building systems that can provide a global notion of time.
3 Global state and snapshot Recording of a global state for the purpose of carrying out
recording algorithms distributed computations is a complex task because of non-
availability of Single physical memory and single physical
clock in a distributed system. So, designers have to explore
ways or algorithms to record or collect global state of a
distributed computation.
BITS Pilani, Pilani Campus
Content Structure (Contd..)
Module No Module Title Objectives
4 Terminology and Basic To study a framework in which distributed algorithms can be classified
algorithms and analysed. Basic distributed graph algorithms, synchronizers, and
practical graph problem are studied and analysed to give a better
understanding of distributed algorithm design.
5 Message ordering and In a distributed system, message ordering plays an important role in
Termination detection making the system consistent or stable. The order in which messages are sent
should match (happen before) the order of their receipt in a group
communication application implemented in a distributed system. This
module will discuss ways to achieve various message ordering schemes.
Also, the approaches for detecting termination of a distributed
computation will be discussed.
6 Distributed Mutual Exclusion In a distributed system when multiple entities compete for a shared
& Deadlock detection resource, the access to this shared resource has to be serialized (or
coordinated). This module will cover different assertion based, and tree
based distributed algorithms to implement DME. Deadlocks are very
common in distributed computing systems, and this module will discuss
ways to detect this deadlock and resolve them.

BITS Pilani, Pilani Campus


Content Structure (Contd..)
Module No Module Title Objectives
7 Consensus and When multiple entities cooperate with each other in solving a complex function or task
Agreement in a distributed system, there are instances where a majority of these entities must agree
Algorithms on certain decisions without which the task cannot be solved. This module will discuss
ways to achieve these.
8 Peer-to-Peer Overlay networks have been playing a dominant role in providing better services since
computing and past two decades. Peer-to-Peer computing has emerged as an important distributed
Overlay graphs application (network) in ensuring a wide variety of improved services be it in voice
(Skype), file sharing (BitTorrent), digital currency (BitCoin), or Anonymous surfing
(Tor) etc. This module will cover different aspects of P2P Overlay application
development.
9 Cluster Cluster computing is also a model of carrying out distributed computation using a set of
Computing & homogeneous systems or machines. This module will deal with the design aspects of
Grid Computing Cluster computing. Grid computing extends the design of a cluster computing system by
provisioning centralized resources for specific services and the design issues open up
another dimension to the distributed system design.
10 Internet of Things Today, IoT is playing a big role in connecting devices, objects, etc using RFID, sensors
and other communication models. The way these devices can communicate and build
intelligent applications for various purposes is posing new challenges to build these
systems. This module will discuss these design challenges.

BITS Pilani, Pilani Campus


Learning Outcomes

• Students will be able to describe middleware platforms like


RPC(Sun RPC, Java RMI, etc) for implementing
communication models over distributed systems.
• Students will be able to appraise/justify the need for logical
clocks and their uses in building distributed systems and their
components.
• Students will be able to explain Mutual exclusion primitives,
Agreement protocols, and deadlock handling scenarios in
distributed systems.
• Students will be able to describe search, storage,
communication, efficiency and other related issues in
paradigms like P2P, Cluster, Grid, and IoT.
BITS Pilani, Pilani Campus
Problem # 1 - Apply Lamport’s Logical Clock
Algorithm on the following processes
R1: Before executing an event
(send, receive, or internal),
process pi executes the
following:
Ci := Ci + d (d > 0)
In general, every time R1 is
executed, d can have a
different value; however,
typically d is kept at 1.

R2: Each message piggybacks


the clock value of its sender at
sending time. When a process
pi receives a message with
timestamp Cmsg, it executes
the following actions:
Ci := max (Ci, Cmsg)
Execute R1
Deliver the message

BITS Pilani, Pilani Campus


Problem # 2 – Identify which one is
Consistent/In-Consistent State/Cut?

In-Consistent Consistent

BITS Pilani, Pilani Campus


Problem # 3

• Two banking transactions T and U are defined as:
  T: aBranch.create(Z)
  U: z.deposit(10000); z.deposit(20000)
  They are executed in the interleaved order: z.deposit(10000); aBranch.create(Z); z.deposit(20000).

1. State the balance of account Z after their execution in this order.
2. Are these consistent with serially equivalent executions of T and U?

• A1 : 20000
• A2 : The serial executions of T and U are:
  o T then U: aBranch.create(Z); z.deposit(10000); z.deposit(20000). Z's final balance is 30000.
  o U then T: z.deposit(10000); z.deposit(20000); aBranch.create(Z). Z's final balance is 0.
  o Neither serial execution yields 20000, so the example is not a serially equivalent execution of T and U.

BITS Pilani, Pilani Campus


Problem # 4

• Give a use case in which multicasts sent by different clients are delivered in
different orders at two group members. Assume that some form of message
retransmissions are in use, but that messages that are not dropped arrive in
sender ordering. Provide a solution how recipients may remedy this situation?

 Client c1 sends request r1 to members m1 and m2 but the message to m2 is dropped


 Client c2 sends request r2 to members m1 and m2 (both arrive safely)
 c1 re-transmits request r1 to member m2 (it arrives safely).
 m1 receives the messages in the order r1;r2. However, m2 receives them in the order
r2;r1.

To remedy the situation, each recipient delivers messages to its application in sender order.
When it receives a message that is ahead of the next one expected, it hold it back until it has
received and delivered the earlier re-transmitted messages.

BITS Pilani, Pilani Campus


Problem # 5

• An NTP server B receives server A’s message at


16:34:23.480 bearing a timestamp 16:34:13.430 and
replies to it. A receives the message at 16:34:15.725,
bearing B’s timestamp 16:34:25.7. Estimate the offset
between B and A and the accuracy of the estimate.

Let T1 = Ti-2 – Ti-3 = 23.480 - 13.430 = 10.05;
T2 = Ti-1 – Ti = 25.7 - 15.725 = 9.975.

Then the estimated offset = (T1 + T2)/2 = 10.013 s,


with estimated accuracy = ± (T1 - T2)/2 = ± 0.038 s
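A quick check of the estimate, using the Ti-3..Ti naming from NTP's symmetric mode (times are seconds past 16:34:00):

    t_i3, t_i2 = 13.430, 23.480     # A sends, B receives
    t_i1, t_i  = 25.700, 15.725     # B replies, A receives

    offset   = ((t_i2 - t_i3) + (t_i1 - t_i)) / 2
    accuracy = ((t_i2 - t_i3) - (t_i1 - t_i)) / 2

    print("offset:", offset, "accuracy:", accuracy)   # ~10.0125 s and ~0.0375 s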

BITS Pilani, Pilani Campus


Problem # 6

• Identify which of the following is Sequentially Consistent


data store?

Inconsistent Consistent

BITS Pilani, Pilani Campus


Problem # 7

• A file server uses caching with a hit rate of


80%. If the requested block is found in the cache, the server
takes 5 milliseconds of CPU time; otherwise it takes an
additional 15 milliseconds of disk I/O time. Determine the
file server's throughput capacity in terms of average
requests/sec if it is a single-threaded process.

80% of accesses cost 5 milliseconds


20% of accesses cost 5 + 15 = 20 milliseconds
Average request time is 0.8*5 + 0.2*20 = 4 + 4 = 8 milliseconds.
Throughput is 1000/8 = 125 requests/sec
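A small sketch of the same calculation, parameterised so the hit rate and service times can be varied:

    def throughput(hit_rate=0.8, hit_ms=5.0, miss_extra_ms=15.0):
        avg_ms = hit_rate * hit_ms + (1 - hit_rate) * (hit_ms + miss_extra_ms)
        return 1000.0 / avg_ms            # single-threaded: one request at a time

    print(throughput())                   # 125.0 requests/sec
    print(throughput(hit_rate=0.9))       # a higher hit rate gives higher throughput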

BITS Pilani, Pilani Campus


Problem # 8

• Consider the behavior of two machines in a distributed


system. Both have clocks that are supposed to tick 1000
times per millisecond. One of them does, but the other
ticks only 990 times per millisecond. If UTC updates come
in once a minute, what is the maximum clock skew that will
occur?

The second clock ticks 990,000 times per second instead of 1,000,000, giving an error of 10
msec per second. In a minute this error grows to 600 msec.

Another way to look at it: the second clock is 1% slow, so after a minute it is off by
0.01 x 60 sec, or 600 msec.

BITS Pilani, Pilani Campus


Problem # 9

• There is a group of five replicated processes producing the same results. Assume


that these processes can only crash silently; what is the maximum number of
these processes that may crash while it is still possible to get the correct
result? What is the maximum number of Byzantine processes in this group for
which the correct result can still be obtained?

Crashes: k + 1 = 5, k = 4, therefore 4 can crash

Byzantine: 2k + 1 = 5; k = 2 -> 2 can produce Byzantine failure -> need at least


3 correct processes in the system.

BITS Pilani, Pilani Campus


Thank You
