Different motivations
Clusters may be generally defined as "a parallel or distributed system that consists of a
collection of interconnected whole computers, that is utilized as a single, unified computing
resource".
While the parallel computing community sees clusters from the high-performance and
scalability perspective and refers to them as WSCs (Workstation Clusters) or NOWs (Networks
Of Workstations), business/mission-critical computing sees clusters from a different
perspective, high availability, and refers to them as HA clusters.
In theory, the clustering model can provide both high availability and high performance, as
well as manageability, scalability, and affordability.
In the parallel computing field, a cluster is primarily a means to provide scalability and
increased processing power by enabling a parallel application to run across all the nodes of
the cluster.
The main advantage of a cluster as compared with traditional MPCs (Massively Parallel
Computers) is its low cost.
In fact, during the nineties, clusters appeared as the poor man's answer to the quest for
parallel high-performance computing, in opposition to the expensive, proprietary, massively
parallel architectures of the eighties. This trend can be clearly observed in the well-known
Gordon Bell Prize, which has a special award class for the best ratio of flops per million
dollars; that award usually goes to applications running on workstation clusters.
Clusters inherit much of the knowledge acquired from research in distributed systems. Of
course, there are some typical differences between a cluster and a distributed system that
make life easier for clusters.
Clusters are typically homogeneous (a distributed system is a computer zoo, with many
different kinds of computers), tightly coupled, and, most importantly, their nodes trust each
other (distributed systems have to deal with untrusted nodes).
But using clusters is not a panacea: As the number of hardware components in a system
rises, so does the probability of failure. A cluster can grow to hundreds or thousands of
nodes, but the likelihood that a fault will occur while an application is running on the cluster
increases proportionally.
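The scaling of failure probability with node count can be sketched with a quick back-of-the-envelope calculation. The sketch below assumes independent, exponentially distributed node failures, in which case the cluster's MTBF is roughly the node MTBF divided by the number of nodes; the 10-year node MTBF matches the ASCI Red figures quoted later in these notes.

```python
# Back-of-the-envelope: with independent, exponentially distributed node
# failures, cluster MTBF ~ node MTBF / number of nodes.

HOURS_PER_YEAR = 8766  # 365.25 days

def cluster_mtbf_hours(node_mtbf_years: float, num_nodes: int) -> float:
    """Approximate cluster MTBF in hours for num_nodes independent nodes."""
    return node_mtbf_years * HOURS_PER_YEAR / num_nodes

# A node with a 10-year MTBF, in a 9000-node machine:
print(cluster_mtbf_hours(10, 9000))   # ~9.7 hours between failures
```

This simple model already reproduces the order of magnitude of the permanent-fault MTBF reported for ASCI Red.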
This fact is clearly illustrated by a statement from Leslie Lamport (one of the fathers of
distributed computing): "A distributed system is one in which the failure of a computer you
didn't even know existed can render your own computer unusable."
In clusters the increasing probability of fault occurrence can be particularly harmful for long-
running applications, such as scientific number crunching. In these cases, the need for some
kind of mechanism to guarantee the continuation of applications in the presence of faults
becomes a must.
So much for the parallel computing perspective. From the mission- and business-critical
systems' perspective, a cluster is a configuration of a group of independent servers that can
be managed as a single system, shares a common namespace, and is designed specifically to
tolerate component failures and to support the addition or removal of components in a
way that is transparent to users. The focus is not on performance, but rather on high
availability.
Mission-critical clusters were originally built, in most cases, with only two machines
(the so-called primary and secondary nodes) and typically supported OLTP (Online
Transaction Processing) applications for banking and insurance, among other similarly
critical applications. The idea is simple in concept: in case of a failure of a component
(hardware or software) in the primary system, the critical applications can be transferred
to the secondary server, thus avoiding downtime and guaranteeing the application's
availability. This process is termed failover, and it inevitably implies a temporary
degradation of the service, but without total permanent loss. This is quite different from
component-level redundancy with voting, which masks faults completely. The so-called
fault-tolerant servers pioneered by Tandem [20] use component redundancy to provide
continuous service in case of failures. Clusters also use redundancy, but at the system level.
HA clusters have evolved from the two-node configuration: with increasing performance
needs, they came to support more than two nodes and to provide additional mechanisms
that exploit the aggregate performance of the participating nodes through load balancing.
The two worlds thus became closer and closer.
The convergence between HA and HP is not a coincidence. In fact, high performance parallel
computing and high-availability can be seen as enabling technologies of each other. Parallel
processing uses the joint power of many processing units to achieve high performance.
However, only fault tolerance can enable effective parallel computing because, as we have
shown, fault tolerance allows application continuity.
Failover
Fault-tolerant servers
o Those faults that statistically don’t cause node failures, but have other insidious
effects
Examples
o ASCI Red : 9000 Intel Pentium Pro (a 263 Gbytes memory machine)
MTBF : 10 hours for permanent faults, 1-2 hours for transient faults (a
node’s MTBF is 10 years)
MTBF is 20 hours
MTBF is 14 hours
o Cold-backup
o Hot-backup
Protect data, not just at a certain time of the day, but on a continuous basis
using a mirrored storage system where all data is replicated
Combine highly reliable & highly scalable computing solutions with open & affordable
components
o The ability to scale to meet the demands of service growth became a requirement
clusterware
Dependability Concepts
The time between the first occurrence of an error and the moment when
it is detected
The longer this latency, the more affected the internal state of the system
may become
o Fault
Transient
Permanent
o Bohrbugs
– deterministic bugs that are always there again when the program is
re-executed
o Heisenbugs
– in a re-execution the timing will usually be slightly different, so the bug
does not happen again
2.Dependability Attributes
o Reliability
o Availability
o MTBF
o MTTR
The Mean Time To Repair a failure, & bring the system back to correct
service
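These two attributes combine into availability via the standard steady-state formula A = MTBF / (MTBF + MTTR); the same arithmetic underlies the downtime-per-year figures quoted later for seven-nines availability. A minimal sketch:

```python
# Steady-state availability: A = MTBF / (MTBF + MTTR),
# and the downtime per year implied by a given availability level.

SECONDS_PER_YEAR = 365 * 24 * 3600

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_seconds_per_year(avail: float) -> float:
    return (1 - avail) * SECONDS_PER_YEAR

# "Seven nines" (99.99999%) corresponds to roughly 3 seconds of downtime a year:
print(downtime_seconds_per_year(0.9999999))  # ~3.15 s
```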
3.Dependability Means
o Two ways
Fault prevention
Fault tolerance
o Disaster Tolerance
Each node has its own memory & own storage resources
o Shared-Storage
Both servers share the same storage resource, with the consequent need
for synchronization of disk accesses to keep data integrity
o Active/Standby
o Active/Active
All servers are active & none sits idle waiting for a fault to occur in order
to take over
Providing bidirectional-failover
o N-Way
3.Interconnects
3 main kinds of interconnects – network, storage, monitoring
Network interconnect
Storage
Provide the I/O connection between the cluster nodes & disk storage
Monitoring
NCR Lifekeeper
also have an additional serial link between the servers for that
purpose
VIA
Aim at defining the interface for high performance interconnection of servers and
storage devices within a cluster (SAN)
Define a thin, fast interface that connects software applications directly to the
networking hardware while retaining the security & protection of the OS
VI-SAN
SAN
FC-AL
The most widely endorsed open standard for the SAN environment
Modular scalability
Error coverage
Detection latency
Diagnosis layer
Collect as much info as possible about the location, source & affected parts
Run diagnosis over the failed system to determine whether the fault is
transient or permanent
1.Self-Testing
Consists of executing specific programs that exercise different parts of the system,
& validating the output to the expected known results
inside of microprocessors
Fails to detect problems that occur at a higher level or affect data in subtle ways
If the timer is not reset before it expires, the process has probably failed in some
way
Coverage is limited
Implemented in SW or HW
Software watchdog
(simplest) Watches its application until it eventually crashes; the only
action it takes is to relaunch the application
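This simplest watch-and-relaunch watchdog can be sketched as follows; the command line, restart limit, and back-off delay are illustrative assumptions, not from the source.

```python
# Minimal "watch and relaunch" software watchdog: start the application as a
# child process, wait for it to exit, and restart it if it crashed.
import subprocess
import time

def watchdog(cmd, max_restarts=5):
    restarts = 0
    while restarts <= max_restarts:
        proc = subprocess.Popen(cmd)
        proc.wait()                  # block until the application exits
        if proc.returncode == 0:
            return "clean exit"      # normal termination: nothing to do
        restarts += 1                # crash: relaunch the application
        time.sleep(1)                # brief back-off before restarting
    return "giving up"
```

As the notes point out, this style of watchdog can only restart the application; it is useless if the operating system itself crashes.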
1.Heartbeats
Useless if OS crashes
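The detection logic behind heartbeats can be sketched as a timeout test: a node is presumed failed when no heartbeat has arrived within the timeout window. The timeout value below is an assumed parameter, and timestamps are passed in explicitly to keep the sketch testable; a real monitor would read the clock and receive heartbeats over the network or serial interconnect.

```python
# Heartbeat liveness check: a node is presumed alive only if its last
# heartbeat arrived within the timeout window.

HEARTBEAT_TIMEOUT = 3.0   # seconds; assumed value for illustration

def is_alive(last_heartbeat: float, now: float,
             timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    return (now - last_heartbeat) <= timeout

print(is_alive(last_heartbeat=10.0, now=12.0))  # True: heartbeat 2 s ago
print(is_alive(last_heartbeat=10.0, now=14.5))  # False: silent for 4.5 s
```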
2.Idle notification
Inform the watchdog about periods when the application is idle or not doing
any useful work, so that the watchdog can perform preventive actions
(e.g. restarting the application)
Software Rejuvenation
3.Error notification
An application that is encountering errors which it can’t overcome can signal the
problem to the watchdog & request recovery actions
Consistency checking
Range check
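A range check, the consistency check named above, validates a computed value against its plausible interval before it is used. A sketch (the bounds and names are illustrative):

```python
# Range check: a cheap consistency check that rejects values outside the
# physically or logically possible interval before they are used.

def check_range(value: float, low: float, high: float, name: str = "value"):
    if not (low <= value <= high):
        raise ValueError(f"{name}={value} outside expected range [{low}, {high}]")
    return value

check_range(37.2, 30.0, 45.0, "body_temperature_C")   # passes
# check_range(120.0, 30.0, 45.0)                      # would raise ValueError
```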
Checkpointing
Checkpointing classifications
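A minimal application-level checkpointing sketch is shown below: state is periodically saved to stable storage and restored after a restart, so the computation resumes from the last checkpoint instead of from the beginning. The file name and state layout are illustrative; real checkpointing systems handle much more, e.g. coordinated checkpoints across cluster nodes.

```python
# Application-level checkpointing: save state after each unit of work and
# resume from the last saved state after a restart.
import os
import pickle

CHECKPOINT = "app.ckpt"   # illustrative file name

def save_checkpoint(state: dict, path: str = CHECKPOINT) -> None:
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)          # atomic rename: never leave a torn file

def load_checkpoint(path: str = CHECKPOINT):
    if not os.path.exists(path):
        return None                # no checkpoint yet: start from scratch
    with open(path, "rb") as f:
        return pickle.load(f)

# Resume-or-start pattern:
state = load_checkpoint() or {"iteration": 0, "partial_sum": 0.0}
while state["iteration"] < 5:
    state["partial_sum"] += state["iteration"]
    state["iteration"] += 1
    save_checkpoint(state)         # checkpoint after every unit of work
```

Writing to a temporary file and renaming it keeps the checkpoint itself consistent even if a fault strikes mid-save.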
2.Transactions
ACID properties
Atomicity
Consistency
Isolation
Durability
Failover process
Takeover
Failback
4.Reconfiguration
Wolfpack
A phased approach
Not lock-step/fault-tolerant
Not 99.99999% available (max of 3-4 sec of downtime per year), but 20-30
sec of downtime per year
3 states
2.NCR LifeKeeper
Heartbeat via SCSI, LAN, & RS-232; event monitoring & log monitoring
3 different configurations
Oracle Failsafe
Based on MSCS
Targeted for enterprise level systems, seeking not only HA but also
scalability
Oracle servers
Support hot-backups, replication, & mirrored solutions