
Intelligent Network Infrastructures for

Global Grid Computing


Luca Valcarenghi, Piero Castoldi, and Lorenzo Rossi
Scuola Superiore Sant’Anna for University Study and Research
Pisa, Italy
{valcarenghi, castoldi, rossi}@sssup.it
GGF 12
GHPN-RG Session II, 22 October, 2004
National Research Program FIRB
Strategic Project on Enabling Technologies for the Information Society
http://www.grid.it
Enabling Technologies for High-performance Computational Grids Oriented to Scalable Virtual Organizations



National Program Coordination

GRID.IT: 3-year project, start-up November 2002
• Cost: 11.1 M€; funding: 8.1 M€ (1.1 M€ for young researchers)
• Partners and topics:
  – CNR & University: HPC, parallel programming, grid computing, scientific libraries, data bases and knowledge discovery, Earth observation, computational chemistry, image processing, …
  – INFN & University: Grid (INFN-Grid, DataGrid, DataTag); e-science applications: astrophysics, bioinformatics, geophysics, …
  – ASI: applications of Earth observation
  – CNIT: technologies and HW-SW infrastructures for high-performance communication; optical technologies, …
[Map: participating sites across Italy — Torino, Milano, Pavia, Padova, Genova, Bologna, Pisa, Perugia, Roma, Napoli, Bari, Matera, Lecce, Cosenza, Cagliari, Palermo]
Interactions among WPs

• Applications and Demonstrators (WP10, WP11, WP12, WP13, WP14)
• Component-based Programming Environment (WP8), e-science Components (WP9)
• Knowledge Services (WP6), Grid Portals (WP7), Security (WP4)
• Basic Grid Infrastructure and Data Core Services (WP3, WP5)
• Photonic, high-bandwidth networks (WP1, WP2)
WP1 breakdown

WP1 – Grid oriented optical switching paradigms (Resp. P. Castoldi)
• Activity 1 – Connections, topologies and network service models (Resp. R. Battiti/F. Granelli): Years 1–2
• Activity 2 – Grid computing on state-of-the-art optical networks (Resp. P. Castoldi): Years 1–2
• Activity 3 – Migration scenarios to intelligent flexible optical networks (Resp. F. Callegati): Years 1–2
• Activity 4 – Control plane and network emulation for optical packet switching networks (Resp. A. Fumagalli): Years 2–3
• Activity 5 – Enabling technologies for optical switching networks (Resp. G. Cancellieri): Years 1–3
Grid Network Aware Programming Environment

• The user runs an application through the User Interface (UI), stating requirements such as max_exec_time, reliability, etc.
• The programming environment (general purpose grid services and grid network services) translates the application requests into a function f(max_exec_time, reliable, …) of those requirements
• Network and computational resource information is kept in a resource DB, kept current through update notifications
• The middleware (Grid Abstract Machine) issues allocation requests and receives resource allocation notifications; elaboration takes place on the basic HW+SW platform
Grid Network Service and Network Management and Control Plane Interaction

[Figure: General Purpose Grid Services invoke Grid Network Services, which in turn interact with the network management and control plane]


Resilience

• A network that provides some ability to recover ongoing connections disrupted by the catastrophic failure of a network component, such as a line interruption or a node failure, is said to be
  – Resilient → resilience (resiliency)
  – Reliable → reliability
  – Survivable → survivability
• Resilience QoS parameters
  – Restoration Blocking Probability (Pb)
    • Ratio between the number of recovered connections and the number of failed connections
  – Recovery Time (RT)
    • Time elapsed between failure notification and transmission restart
  – Restoration Blocking Probability ↔ inter-service communication bandwidth and inter-service connectivity
  – Recovery Time ↔ inter-service communication latency
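The two QoS parameters above can be computed from a log of failure events. A minimal sketch following the slide's definitions (the event record fields are invented for illustration):

```python
# Computing the two resilience QoS parameters from a list of failure
# events. Follows the slide's definitions: restoration blocking
# probability as the ratio of recovered to failed connections, and
# recovery time as the gap between failure notification and
# transmission restart. Event record fields are illustrative.

def resilience_qos(events):
    failed = len(events)
    recovered = [e for e in events if e["restart_t"] is not None]
    pb = len(recovered) / failed if failed else 0.0
    # Mean recovery time, over the recovered connections only.
    rts = [e["restart_t"] - e["notify_t"] for e in recovered]
    mean_rt = sum(rts) / len(rts) if rts else None
    return pb, mean_rt

events = [
    {"notify_t": 0.00, "restart_t": 0.05},   # recovered in 50 ms
    {"notify_t": 0.00, "restart_t": 0.15},   # recovered in 150 ms
    {"notify_t": 0.00, "restart_t": None},   # never recovered
]
pb, rt = resilience_qos(events)
print(pb, rt)   # 2 of 3 failed connections recovered; mean RT ~0.1 s
```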


Approaches for Grid Networking Resilience

Layered Grid Architecture ↔ failover schemes ↔ TCP/IP stack:
• Application (end-user applications) — application-specific fault-tolerant schemes based on middleware fault detection [TCP/IP: Application]
• Collective (collective resource control) — tasks and data replicas; Condor-G DAGMan; application checkpointing and migration
• Resource (resource management) — GT2/GT3: GridFTP, Reliable File Transfer (RFT), Replica Location Service (RLS)
• Connectivity (inter-process communication) — Fault Tolerant TCP (FT-TCP); protection delegated to WAN, MAN, and LAN resilience schemes [TCP/IP: Transport, Internet/Network]
• Fabric (basic hardware and software) — delegated to HW, SW, and farm failover schemes [TCP/IP: Link]
Network Infrastructure Resilient Schemes

• Protection: network spare resources are computed and reserved upon connection set-up
  – Dedicated: each connection is assigned dedicated spare resources
  – Shared: connections that are not simultaneously involved in the same failure event share spare resources
• Restoration: network spare resources are found and reserved upon failure occurrence
  – Dynamic: backup routes are computed and spare resources are reserved upon failure occurrence
  – Pre-planned: backup routes are pre-computed, but spare resources are reserved upon failure occurrence
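The dedicated/shared distinction above has a direct capacity cost: dedicated protection reserves spare capacity for every backup path, while shared protection lets backups whose working paths cannot fail together reuse the same spare capacity. A minimal single-link-failure sketch (topology and demand set are invented for illustration):

```python
# Spare-capacity comparison for dedicated vs shared protection under
# a single-link-failure model. Each connection has a working path and
# a link-disjoint backup path, both given as sets of links. Topology
# and demands are invented for illustration.

from collections import defaultdict

connections = [
    # (working path links, backup path links)
    ({"AB", "BC"}, {"AD", "DC"}),
    ({"AE", "EC"}, {"AD", "DC"}),   # backup reuses links AD and DC
]

def dedicated_spare(conns):
    # Every backup link carries one reserved unit per connection.
    spare = defaultdict(int)
    for _, backup in conns:
        for link in backup:
            spare[link] += 1
    return dict(spare)

def shared_spare(conns):
    # On each backup link, reserve only enough units for the worst
    # single-link failure: connections whose working paths share no
    # link can never need the spare capacity at the same time.
    spare = defaultdict(int)
    all_links = {l for w, _ in conns for l in w}
    for failed in all_links:
        need = defaultdict(int)
        for working, backup in conns:
            if failed in working:
                for link in backup:
                    need[link] += 1
        for link, n in need.items():
            spare[link] = max(spare[link], n)
    return dict(spare)

print(dedicated_spare(connections))  # AD and DC each reserve 2 units
print(shared_spare(connections))     # AD and DC each reserve 1 unit
```

Because the two working paths are link-disjoint, no single failure hits both connections, so shared protection halves the spare capacity here.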


Protection and Restoration

• Protection/Restoration
  – Path level
    • End-to-end connections are independently recovered by finding a new route from source to destination
  – Link level
    • All the connections disrupted by the failure (in this case a link failure) are rerouted along the same recovery path, bypassing the failed link
• Restoration schemes are commonly available at higher layers (e.g., the IP layer)
• Protection schemes are commonly used at the physical transport layer (e.g., WDM)
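The path-level/link-level distinction can be made concrete on a toy topology. The graph, source/destination pair, and failed link below are invented for illustration:

```python
# Path-level vs link-level recovery on a toy topology. Path level
# recomputes an end-to-end route avoiding the failed link; link
# level computes a detour around the failed link and splices it into
# the original path. Topology is invented for illustration.

from collections import deque

GRAPH = {
    "A": ["B", "D"], "B": ["A", "C", "D"],
    "C": ["B", "E"], "D": ["A", "B", "E"], "E": ["C", "D"],
}

def shortest_path(src, dst, banned=frozenset()):
    """BFS shortest path that never crosses a banned link."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in GRAPH[path[-1]]:
            link = frozenset((path[-1], nxt))
            if nxt not in seen and link not in banned:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

working = shortest_path("A", "C")        # A-B-C
failed = frozenset(("B", "C"))           # link B-C goes down

# Path level: brand-new end-to-end route avoiding the failed link.
path_level = shortest_path("A", "C", {failed})

# Link level: detour from B to C around the failed link, spliced
# into the surviving part of the original working path.
detour = shortest_path("B", "C", {failed})
link_level = working[:working.index("B")] + detour

print(working, path_level, link_level)
```

With several connections crossing the failed link, the link-level scheme would splice the same B-to-C detour into all of them, which is exactly the property the slide describes.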


Recovery Times

• BGP-4: 15–30 minutes
• OSPF: 1 s to minutes
• MPLS fast (link) rerouting: 50–100 ms
• MPLS edge-to-edge rerouting: 1–100 s
• Spanning Tree: 50 s
• RSTP: 10 ms
• FRP-FAST: < 2 s
• EtheReal: ~250 ms
• SDH/SONET/DWDM: 50 ms
• OCh and OMS restoration: ≥ 50 ms
• Dedicated OL protection: 10 µs–10 ms
• Shared OL protection: 1–100 ms
Integrated Grid Computing Resilience

• Motivations
  – Guaranteeing the successful calculation of a function f(x,d) in spite of grid infrastructure failures
  – Nowadays grid failover is provided by middleware and by network resilience schemes independently
  – Grid network failover can be improved by integrating grid and network infrastructure resilience schemes
[Figure: a Virtual Processor (VP) computes f(x1,d), …, f(xn,dn) over guaranteed connections from an Emitter (E)/Collector (C), with data (d1, …, dn) held in a Distributed Virtual Shared Memory (DVSM)]
Motivations for Grid Services and Network Services Resilience Integration

• Fabric fault-tolerant schemes are already implemented in the LAN
• Utilize fault-tolerant schemes in the global Internet to implement a reliable fabric
• Integrate with higher-layer (e.g., collective, job scheduling) fault-tolerance schemes
• Try to move most of the resilience burden to the fabric, making for lighter services at the collective and higher layers


Proposed Approach

• Integrating network-layer connection rerouting with task/data replication and migration
• Integrated scheme modeled by a MILP problem formulation
  – Objective: maximizing the number of connections restored after a failure
[Figure: example topology with nodes labeled A, B, D, G, H and 0–5, showing the original task/data and a task/data replica before and after rerouting]
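The restoration objective can be sketched without a MILP solver on a toy instance: an exhaustive search over candidate backup-path assignments under link spare-capacity constraints. This is an illustration of the model, not the project's actual formulation; capacities and candidate paths are invented:

```python
# Brute-force version of the restoration objective above: choose, for
# each failed connection, at most one candidate backup path so that
# link spare capacities are respected and the number of restored
# connections is maximized. A MILP solver would do this at scale; the
# exhaustive search below just illustrates the model.

from itertools import product
from collections import Counter

spare_capacity = {"AD": 1, "DC": 1, "AE": 1, "EC": 1}

# Candidate backup paths (as link tuples) per failed connection;
# None means "leave the connection unrestored".
candidates = [
    [("AD", "DC"), ("AE", "EC"), None],
    [("AD", "DC"), None],
    [("AE", "EC"), None],
]

def best_restoration(candidates, capacity):
    best, best_count = None, -1
    for choice in product(*candidates):
        used = Counter(l for path in choice if path for l in path)
        if any(used[l] > capacity.get(l, 0) for l in used):
            continue
        count = sum(path is not None for path in choice)
        if count > best_count:
            best, best_count = choice, count
    return best_count, best

count, assignment = best_restoration(candidates, spare_capacity)
print(count, assignment)   # 2 of the 3 failed connections restored
```

The two single-capacity backup routes cannot carry all three connections at once, so the optimum restores two of them; a MILP formulation encodes exactly this choice with binary assignment variables and per-link capacity constraints.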
Integrated Restoration Performance

• Integrated restoration outperforms OSPF dynamic rerouting resilience
• Integrated restoration performs as well as service migration resilience but, by utilizing path restoration, decreases the need for service synchronization and restart
MPLS Grid (MGRID) Testbed

[Figure: testbed topology — two Juniper M10 routers, R1 (ingress) and R2 (egress), interconnected through two Linux routers running OSPFD and RSVPD: R3 (AMD Athlon 1 GHz, 512 MB RAM, Linux Red Hat, kernel 2.4.20) on the LSP "NORD" and R4 (AMD Athlon 1 GHz, 512 MB RAM, Linux Red Hat, kernel 2.4.26-1-386) on the LSP "SUD". End hosts PC1 (AMD Athlon XP 2800+, 512 MB RAM, Linux Debian, kernel 2.4.25-1-386) and PC2 (AMD Athlon 1.3 GHz, 512 MB RAM, Linux Debian, kernel 2.4.26-1-386) run Globus Toolkit 3 and attach through the 10.0.100.0/24, 10.0.200.0/24, 10.0.20.0/24, 10.0.30.0/24, 10.0.40.0/24, and 10.0.80.0/24 subnets]


The Metrocore/VESPER Testbed

[Figure: metropolitan testbed in Pisa — CNR-IIT, CNR-ISTI, UniPi Engineering, Centro Ser.R.A., and Comune di Pisa sites connected at 100 Mbit/s to a Gigabit Ethernet/WDM ring between the AGESCOM nodes at Via Battisti (1, 3) and Centro Forum (2, 4); distance 1–2: 11 km, distance 3–4: 12.5 km]
Service Replication/Migration

[Figure: a VideoLAN client connects to a VideoLAN primary server hosting Service A; a replica of Service A is kept on a VideoLAN secondary server]
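The failover behavior demonstrated in the following slides can be sketched as a priority-ordered connection attempt: the client tries the primary server and migrates to the secondary replica when the primary becomes unreachable. Server endpoints are modeled as plain callables so the sketch stays self-contained; a real client would open network connections instead:

```python
# Primary/secondary failover in the spirit of the VideoLAN demo:
# try servers in priority order and fall back to the replica when
# the primary is unreachable. The simulated servers are invented
# stand-ins for the testbed's VideoLAN primary/secondary servers.

def connect_with_failover(servers):
    """Try each (name, connect) pair in order; return (name, stream)
    for the first server that answers, or raise if all fail."""
    errors = []
    for name, connect in servers:
        try:
            return name, connect()
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    raise ConnectionError("; ".join(errors))

def primary():
    # Simulate the failed link toward the primary server.
    raise ConnectionError("link down")

def secondary():
    return "service A stream (replica)"

name, stream = connect_with_failover([("primary", primary),
                                      ("secondary", secondary)])
print(name, stream)   # failover selects the secondary replica
```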


Connection Between Client and Primary Server Working

[Demo screenshot]

Primary Server and Secondary Server Link Failure

[Demo screenshot]

Migration to Secondary Server

[Demo screenshot]


Open Issues

• Which layer must respond to failures?
• Which layer more efficiently overcomes which failures?
• Inter-layer coordination between application-layer, middleware, and grid network service resilience schemes
• Can recovery times typical of clusters be achieved in global grid computing?
• What are the costs of making a global grid cluster perform as well as a local area network cluster?


Thanks to … the people (in alphabetical order)

• Filippo Cugini, CNIT
• Luca Foschini, SSSUP
• Domenico Laforenza, CNR
• Francesco Paolucci, CNIT
• Marco Vanneschi, UNIPI
