
Linux High Availability Cluster Selection


Tim Burke
tburke@redhat.com
Which cluster product is right for me?

There is no one-size-fits-all winner

Rapidly evolving marketplace

The good news: There is a lot to choose from

The bad news: There is a lot to choose from

Strategy - be an informed consumer


Selection Process / Presentation Outline

Identify target applications - usage model

Identify required cluster feature set

Open source vs proprietary, product vs project

Cost factors

Vendor evaluation

OEM & ISV endorsements


Identify Target Applications

Clustering Categories

High Availability Clusters

Database

Fileservers

Off the shelf applications

Load Balancing Clusters

Dispatching web traffic

High Performance Computing

Large computational problems


High Performance Computing

HPC, HPTC cluster attributes


1. Large number of systems working together to solve a common problem - scalability
2. Performance, not reliability, is of utmost importance
3. Requires custom parallelized applications
4. Tends to be bleeding edge, early adopters
5. Example deployments: genetics, pharmaceutical, weather, seismic analysis, modeling

Load Balancing Clusters

Front-end dispatching node (or two for redundancy)

Pool of inexpensive back-end servers

Redirects transactions so no one system is overloaded

Balancing algorithms: round robin, weighted, load based

Typically used for web server traffic (Apache front end)

Useful for static content

Not applicable for dynamic content
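The round-robin and weighted policies above can be sketched in a few lines; the server names and weights here are hypothetical, and real dispatchers (e.g. LVS) add load-based scheduling on top of this:

```python
from itertools import cycle

# Hypothetical back-end pool; server names and weights are illustrative only.
BACKENDS = {"web1": 3, "web2": 1, "web3": 1}

def weighted_round_robin(backends):
    """Yield back-end names in proportion to their weights.

    A server with weight 3 receives three requests for every one sent
    to a weight-1 server. Plain round robin is the all-weights-equal case.
    """
    expanded = [name for name, weight in backends.items() for _ in range(weight)]
    return cycle(expanded)

dispatcher = weighted_round_robin(BACKENDS)
first_five = [next(dispatcher) for _ in range(5)]
# web1 receives three of the first five requests, web2 and web3 one each
```

This static expansion is the simplest possible policy; a load-based algorithm would instead consult live connection counts before each dispatch.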


High Availability Clusters

The need for high availability (HA)

Overview of high availability features


Reliability, Availability, Serviceability
(RAS)

Users & businesses have high expectations


1. Reliability - high degree of protection for corporate data. Information is a crucial business asset.
2. Availability - near-continuous data access
3. Serviceability - procedures to correct problems with minimal business impact

Sources of Downtime
The Standish Group - 2001

[Pie chart of downtime sources: application bug or error, main-system hardware failure, database error, main-server system bug, network, operator error, other server's hardware failure, other server's system bug, environmental conditions, planned outage, and other.]

Downtime Costs - The Standish Group

[Bar chart: cost per minute of downtime in dollars, ranging up to roughly $13,000, for enterprise resource planning (ERP), supply chain management, e-commerce, Internet banking, customer service centers, and messaging.]

No Single Point of Failure (NSPF)

Hardware redundancy - increased overall reliability and availability

1. Multiple paths between systems
2. Storage - mirrored, RAID5
3. Multiple power sources
4. Multiple external networks

High Availability Clusters

Redundancy for fault tolerance

Failover - if one node shuts down or fails, another node takes over its application load

Facilitates planned maintenance

Failover

Involves selecting a target node & moving resources - failover policies

Example resource types


1. Physical disk ownership
2. Filesystems
3. Applications
4. Databases
5. IP addresses
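The select-a-target-and-move-resources step can be sketched as follows; the node names, preference list, and resource ordering are illustrative assumptions, not any specific product's policy:

```python
# Resources are brought up on the target in dependency order: storage
# first, the client-visible IP address last (illustrative ordering).
RESOURCE_ORDER = ["disk", "filesystem", "database", "application", "ip_address"]

def select_target(survivors, preferred):
    """Failover policy: honor an ordered preference list, else any survivor."""
    for node in preferred:
        if node in survivors:
            return node
    return sorted(survivors)[0] if survivors else None

def fail_over(failed_node, members, preferred):
    """Return the chosen target and the ordered resource-start plan."""
    survivors = members - {failed_node}
    target = select_target(survivors, preferred)
    if target is None:
        raise RuntimeError("no surviving node available for takeover")
    plan = [(target, resource) for resource in RESOURCE_ORDER]
    return target, plan

target, plan = fail_over("nodeA", {"nodeA", "nodeB", "nodeC"}, ["nodeB", "nodeC"])
```

Bringing the IP address up last means clients reconnect only once the storage and applications behind it are ready.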
Failover Configurations

Active / Passive

One node runs the application(s)

Other node on standby for takeover

Idle node can take over with no performance degradation

Active / Active

All nodes actively run application(s)

Workload moves to a survivor on failure

Effectively utilizes capacity (TCO)


Data Integrity Provisions

Crucial for safe failover of data-centric services (filesystem / database)

In failure scenarios (e.g. a hung node), ensure the failed node cannot access storage - I/O barriers, I/O fencing

Lack of I/O Fencing can result in

Loss of data (backups ?)

System crashes

Common mechanisms

Power switches

SCSI reservations

Watchdog timers
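The fence-before-takeover rule these mechanisms enforce can be sketched as follows; the `PowerSwitch` class is a hypothetical stand-in for a network power switch, not a real device interface:

```python
# Sketch of I/O fencing via a power switch; the switch interface is
# a hypothetical stand-in, not a real product's API.
class PowerSwitch:
    """Stand-in for a network power switch controlling cluster nodes."""
    def __init__(self):
        self.powered_off = set()

    def power_off(self, node):
        self.powered_off.add(node)
        return True  # a real switch would confirm over its own protocol

def safe_takeover(failed_node, switch, start_service):
    # Never touch shared storage until the failed node is provably off:
    # a merely hung node may wake up later and issue stale writes.
    if not switch.power_off(failed_node):
        raise RuntimeError(f"could not fence {failed_node}; refusing takeover")
    return start_service()

switch = PowerSwitch()
result = safe_takeover("nodeA", switch, lambda: "service started on nodeB")
```

The key design point is the ordering: if fencing fails, the takeover is refused, because a crash is preferable to corrupting shared data.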
Application Monitoring

All HA clusters monitor node state

Most monitor key cluster resources - network, disk

Many monitor application health

Process existence

Application check scripts

HTTP get on web server

Record retrieval on database

Filesystem directory listing
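Two of the checks listed above can be sketched as simple probe functions; the URL, timeout, and mount point are assumptions for illustration, not any product's built-in checks:

```python
# Hedged sketch of application health checks; a cluster monitor would
# run these periodically and trigger failover on repeated failures.
import os
import urllib.request

def check_http(url, timeout=5):
    """HTTP GET against the web server; healthy if it answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def check_filesystem(mount_point):
    """Directory listing on the service's filesystem; healthy if readable."""
    try:
        os.listdir(mount_point)
        return True
    except OSError:
        return False

healthy = check_filesystem(os.getcwd())
```

A record retrieval against a database would follow the same pattern: issue a trivial real transaction and treat any error or timeout as a failed check.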


Failover Times

Don't get too hung up on this

Remember that data integrity is paramount

Quoted failover times include only cluster overhead, not application recovery

Application startup time

Filesystem consistency checks

Database recovery - transaction replay

Example

Product literature cites a 5-second failover time

Actual recovery can take several minutes for the database (size & activity dependent)

Open Source vs Proprietary
Project vs Product

Open source facilitates self-support & customization

Support is a key determinant

Products are generally well tested

Some products are also open source

If you care enough about high availability & solution stacks, you're likely to go the product route

Heterogeneous HA Products

Proprietary offerings that run on Linux, W2K, UNIX

Unifies user training

May compromise flexibility, adaptability or data integrity (ouch!)

Some are Linux products with GUIs that run on other platforms

Virtually none allow heterogeneous platforms within the same cluster

Cost Factors

Beware of hidden charges

Product base fee

Application specific charges (Oracle, DB2, NFS, etc)

Support

Some only come with bundled service offerings

Hardware requirements

Proprietary UNIX offerings typically cost several times more

Vendor Evaluation

Company vision - do their cluster offerings complement or distract? Futures roadmap.

Financial stability

Ability to impact the marketplace

Responsiveness - ability to provide ongoing feature enhancements

Proprietary vs open source

Product integration - fit with distribution, kernel patches, compatibility & support implications

New Linux technology vs large monolithic legacy ports

How long it has been on the market

Open Source Projects

FailSafe - from SGI & SuSE

Optional data integrity provisions (power switch)

Supports 16 nodes

Good set of application kits

Red Hat Cluster Manager

Also offered as a product

Described later in presentation


HA Cluster Product Comparisons

The ground rules

Trying to remain objective

Highlight product strengths

Listed in alphabetical order

Based on web site content as of 10/2002


HP - MC/Serviceguard

Proprietary - Ported from HP/UX

Only supported on HP hardware

Dynamic online addition/removal of members

Worldwide support services

Quorum voting membership

Up to 8 nodes using FibreChannel storage, 2 nodes using SCSI

Compaq Alpha line targeted at HPC clusters

Legato - Availability Manager

Proprietary

Heterogeneous (Linux, W2K, Solaris, HP-UX)

Strong data centric services

Well integrated with SAN environments

Replication

Storage management, volume management, backup

Application monitoring

Extensive set of application specific modules


PolyServe - Application Manager

Proprietary

Application monitoring

Up to 16 nodes

Multiple platforms - Linux, W2K, Solaris

Doesn't require shared storage

Dynamic member addition/removal

Centralized management
PolyServe - Matrix Server

Tailored for Oracle 9i Real Application Clusters

Concurrent read + write access to data on a shared storage SAN

Cluster filesystem with lock manager + distributed cache

Allows incremental growth by adding servers + storage

Proprietary

Red Hat - Cluster Manager

Bundled with RHL Advanced Server 2.1

Both open source & product

Data integrity provisions

Power switches (optional)

Watchdog timer software

Application monitoring

Heterogeneous fileserving via NFS + Samba

Web monitoring GUI

Also integrated Piranha load balancing cluster


Steeleye - LifeKeeper

Proprietary - UNIX port

Multi-platform - Linux, W2K

Wide set of application kits (separately purchased)

Established OEM relationships

Data integrity provisions - via SCSI reservations, requiring kernel patches

Application monitoring

IBM

Focusing on HPC

Rackmounted Intel servers

Custom solutions

(older) xCAT software for management, parallel operations, and installation

(newer) Cluster Systems Mgt (CSM) for Linux

Remote monitoring, resets, BIOS console

Parallel shell

Requires IBM hardware for embedded service processor

High availability via partnering

Veritas Cluster Server

Recent Linux port

16 nodes, wide range of supported apps

Also runs on Windows, AIX, UNIX, Solaris

Integrates with their storage offerings (volume management, backup, data replication)

Proprietary

Other Vendors

Dell

Strategic partnering for HA software

Penguin Computing

HPC offering via partnership with Scyld Beowulf


Consolidated Solutions

Egenera

BladeFrame hardware, backplane eliminates cabling

Management software, HA, provisioning

Linux NetworX

Turnkey solution, preintegrated hardware + management tools

Custom hardware, dense racks


Summary

Know what category of cluster is right for you

Be knowledgeable of required cluster features

Weigh your cost criteria

Choose a vendor you can trust to safeguard your corporate assets

Be wary of marketing collateral