This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and
decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization
of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.
Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered
trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd. For Netscape Communicator™, the
following notice applies: Copyright 1995 Netscape Communications Corporation. All rights reserved.
Sun, Sun Microsystems, the Sun logo, Sun Up, Sun Cluster, Netra ft, Sun Enterprise, Sun Trunking, Sun Enterprise Volume Manager,
Solstice Disk Suite, Sun Management Center, SunSpectrum Gold, SunSpectrum Platinum, SunService, Sun StorEdge, SunScreen,
Trusted Solaris, OpenBoot, JumpStart, Sun BluePrints AnswerBook2, docs.sun.com, SunLabs group, Sun BluePrints, Sun Enterprise
3000, Sun Enterprise 4500, Sun Enterprise 6500, Sun Enterprise 10000, Sun Cluster 2.2, Sun Cluster 3.0, Sun Enterprise Volume Manager,
Sun Management Center, SunScreen, JumpStart, Sun Ultra 2, Sun StorEdge 5100, Sun StorEdge 5200, Sun Prestoserve, Sun
Microsystems Ultra-1, Sun Microsystems Ultra-2, SunVTS, Sun Ultra Enterprise servers (models 150, 220R, 250, 420R, and 450),
SunSwift, SunLink, SunATM, Sun Internet Mail Server, and Solaris are trademarks, registered trademarks, or service marks of Sun
Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered
trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an
architecture developed by Sun Microsystems, Inc.
The OPEN LOOK and Sun™ Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun
acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the
computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun’s
licensees who implement OPEN LOOK GUIs and otherwise comply with Sun’s written license agreements.
RESTRICTED RIGHTS: Use, duplication, or disclosure by the U.S. Government is subject to restrictions of FAR 52.227-14(g)(2)(6/
87) and FAR 52.227-19(6/87), or DFAR 252.227-7015(b)(6/95) and DFAR 227.7202-3(a).
DOCUMENTATION IS PROVIDED “AS IS” AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND
WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE
OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE
LEGALLY INVALID.
Copyright 2000 Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, CA 94303-4900, U.S.A. All rights reserved.
Table of Contents
Preface xi
Chapter 1: High Availability Fundamentals
Basic System Outage Principles 2
System Types—Availability Requirements 3
Reliability Fundamentals 5
Mean Time Between Failures (MTBF) 5
Failure Rate 6
Common MTBF Misconceptions 8
Availability Fundamentals 8
Reliability vs. Availability 11
Serviceability Fundamentals 12
Dynamic Reconfiguration 12
Alternate Pathing 13
Single System Availability Tuning 13
Configuring to Reduce System Failures 14
Reducing System Interruption Impact 15
Reducing Maintenance Reboots 15
Configuring Highly Available Subsystems 15
Datacenter Best Practices 18
Systems Management Principles 19
Hardware Platform Stability 19
Consolidating Servers in a Common Rack 20
System Component Identification 20
AC/DC Power 22
System Cooling 23
Network Infrastructure 23
Security 24
Systems Installation and Configuration Documentation 25
Change Control 26
Maintenance and Patch Strategy 27
Component Spares 27
Software Release Upgrade Process 28
Support Agreement and Associated Response Time 28
Backup and Restore 28
Sun Cluster Recovery Procedures 29
Campus Cluster Recovery Procedures 29
Summary 30
Architectural Limitations 74
Summary 75
Chapter 4: Sun Cluster 2.2 Administration
Chapter Roadmap 112
Monitoring Sun Cluster Status and Configuration 113
Monitoring Tools 113
Cluster Logs 119
Managing Time in a Clustered Environment 120
Cluster Membership Administration 121
Starting a Cluster or Cluster Node 121
Stopping a Cluster or Cluster Node 126
Managing Cluster Partitions 128
Setting and Administering CMM Timeouts 131
Managing Cluster Configuration 133
Cluster Configuration Databases 133
Managing Cluster Configuration Changes 137
CCD Quorum 141
Shared CCD 143
Shared CCD Setup 144
Cluster Configuration Database Daemon 149
Resolving Cluster Configuration Database Problems 150
Administering NAFO Groups 151
PNM Daemon 151
Setting up PNM 152
Changing NAFO Group Configurations Using pnmset 156
Overview of the pnmset Options 157
Logical Host and Data Service Administration 158
Logical Hosts and Data Services 158
Managing Logical Hosts 160
Administering Logical Hosts 167
Cluster Topology and Hardware Changes 170
Administering Hardware and Topology Changes 170
Changing the Node Hardware 172
Changing Private Network Configuration (for Ethernet) 173
Changing Private Network Configuration (for SCI) 175
Changing Terminal Concentrator/SSP Connectivity 178
Changing the Quorum Device 179
Changing Shared Data Disks 180
Changing Node Topology 181
Summary 185
Chapter 5: Highly Available Databases
High Availability for Business 187
Parallel Databases 188
Parallel Databases Using Sun Cluster 2.2 Software 190
Parallel Database/Highly Available Database Comparison 192
Failover Latency 193
Configuration Issues 195
Highly Available Databases 196
Minimum Configurations for Highly Available Databases 197
Cluster Configuration Issues Prior to Installation 197
Database Software Package Placement 198
Configuring Highly Available Databases 203
Creating Logical Hosts 205
Logical Host Syntax 207
HA Status Volumes 209
Setting Up the Logical Host Mount Point 210
Database Monitoring Setup 212
Fault Probe Account Setup 212
Database Monitoring 218
Fine-Tuning Fault Probe Cycle Times 224
Database Failover 225
Fault Probe Debugging 227
Handling Client Failover 228
Summary 229
Appendix A: SCSI-Initiator ID
SCSI Issues in Clusters 319
Changing the SCSI-initiator ID Overview 322
Changing the SCSI-initiator ID 330
Using the Open Boot PROM (OBP) Prompt Method 334
Maintaining Changes to NVRAM 337
Summary 338
Preface
This book is one of an on-going series of books developed by the engineering staff of
the Sun BluePrints™ Program. The Sun BluePrints Program is managed by the
Enterprise Engineering group, and provides a framework to identify, develop, and
distribute best practices information for Sun products.
If you are new to clustering, we discuss how a cluster is built and configured, plus
more (a lot more)—we recommend you read the book from cover to cover. For those
with cluster experience, the book has been set out in modular fashion—each section
covers specific areas that can be quickly identified and referenced.
This book builds a solid foundation by detailing the architecture and configuration of
the Sun Cluster software. The information provided in this book extends beyond the
Sun Cluster infrastructure and introduces Sun Cluster 2.2 (SC2.2) applications,
maintenance, and datacenter best practices to enhance datacenter efficiency and
application availability.
The Sun Cluster 2.2 technology has evolved to a high level of maturity and robustness,
due in part to the involvement and commitment of its customers. Sun
Microsystems’ participation in the cluster software arena began in the spring of 1995
with the introduction of the SPARCcluster™ 1 product. Table P-1 is a timetable of
cluster products released by Sun Microsystems (as of the time of this printing).
Table P-1 Sun Cluster Product Evolution

Year          Product
Spring 1995   SPARCcluster 1 to sustain highly available applications
Summer 1995   SPARCcluster PDB 1.0 support for Oracle Parallel Server (OPS), Informix Extended Parallel Server (XPS), and Sybase Navigation Server
Winter 1995   SPARCcluster HA 1.0
Spring 1996   SPARCcluster PDB 1.1
Summer 1996   SPARCcluster HA 1.1
Fall 1996     SPARCcluster PDB 1.2
Fall 1996     SPARCcluster HA 1.2
Spring 1997   SPARCcluster HA 1.3
Fall 1997     Sun Cluster 2.0 - Initial merge of SPARCcluster PDB and SPARCcluster HA product lines
Spring 1998   Sun Cluster 2.1
Summer 1999   Sun Cluster 2.2 - Final merge of SPARCcluster HA into Sun Cluster
In the summer of 1995, the SPARCcluster PDB (parallel database) 1.0 software was
introduced to enable parallel node scalability—this product also supported a highly
available application environment. The SPARCcluster PDB software enabled parallel
database products such as Oracle Parallel Server (OPS), Informix Extended Parallel
Server (XPS), and Sybase Navigation Server to be used in a two-node cluster.
After the release of the SPARCcluster PDB software, a separate development group
within Sun introduced the SPARCcluster HA (high availability) product in the winter
of 1995. The SPARCcluster HA software enabled a highly available environment for
mission-critical applications such as databases, NFS, and DNS Servers.
http://www.sun.com/blueprints
The mission of the Sun BluePrints Program is to empower Sun customers with the
technical knowledge required to implement an expandable, highly available, and
secure information system within a datacenter using Sun products. The Sun
BluePrints Program is managed by the Enterprise Engineering group. This group
provides a framework to identify, develop, and distribute best practice information
that can be applied across all Sun products. Subject matter experts in the areas of
performance, resource management, security, high-performance computing,
networking, and storage write articles that explain the best methods of integrating
Sun products into a datacenter.
The Enterprise Engineering group is the primary contributor of technical content for
the Sun BluePrints Program, which includes books, guides, and online articles.
Through these vehicles, Sun provides down-to-earth guidance, installation tips, real-
life implementation experiences, and late-breaking technical information.
Part I, Infrastructure
Describes basic availability concepts and Sun Cluster 2.2 architecture and its
components. It contains the following chapters:
The Sun Cluster 2.2 framework is a collection of integrated software modules that
provides a highly available environment with horizontal scalability. Topics such as
cluster topologies, cluster membership, quorum algorithms, failure fencing, and
network monitoring are discussed.
There are many component choices to be made when assembling a cluster. This
chapter discusses capacity, performance, and availability features of supported Sun
Cluster 2.2 components—servers, disk storage, interconnect, and public networks.
This information can help you assemble a cluster to match your requirements.
This chapter provides the framework for implementing a cluster configuration from
scratch. The techniques and issues involved in configuring a low-end, distributed NFS
server are discussed.
This chapter provides an introduction to the pinnacle of the Sun Cluster technology
(Sun Cluster 3.0). Topics discussed include global file services, scalable services,
availability, and supported cluster configurations.
This appendix presents an in-depth tutorial for resolving SCSI controller issues when
using multiple hosts that share SCSI storage.
This appendix is a collection of Bourne shell templates that support all data service
agent routines.
Related Books
The books in Table P-2 provide additional useful information:
Table P-2 Related Books
Typographic Conventions

Typeface or Symbol   Meaning                          Example
AaBbCc123            Names of commands, files, and    Edit your .login file.
                     directories                      Use ls -a to list all files.
                                                      machine_name% You have mail.
AaBbCc123 (bold)     Text entered into the system     machine_name% su
                                                      Password:

Shell                                           Prompt
C shell prompt                                  machine_name%
C shell superuser prompt                        machine_name#
Bourne shell and Korn shell prompt              $
Bourne shell and Korn shell superuser prompt    #
Contributors
Special thanks go to Peter Lees, a Sun Microsystems engineer based in Australia, for
providing most of the content for the “Sun Cluster 2.2 Data Services” chapter. Peter
also wrote the scripts in Appendix B, “Sun Cluster 2.2 Data Service Templates.”
Special thanks also to Ji Zhu and Roy Andrada for providing the underlying core of
the “High Availability Fundamentals” chapter—both are members of the Sun
Microsystems RAS group.
Thanks to David Gee, our dedicated in-house technical writer/editor for massaging
the engineering content into the finished product, and for his constructive writing
tips. We also thank his wife, Elizabeth, for her understanding in the midst of the
wedding preparation rush.
Thanks to Terry Williams for the edits and proofs, and for building the index and
glossary.
General
Writing a book requires much more than just raw material and technical writing. We
would also like to thank the following, who helped review, provide technical clarity,
and all manner of personal or professional support:
We would like to acknowledge the following cluster experts who helped shape the
technical content of this book: Peter Dennis, Tim Read, Geoff Carrier, Paul Mitchell,
Walter Tsai, Bela Amade, Angel Camacho, Jim Mauro, Hal Stern, Michael Geldner,
and Tom Atkins (and many other experts who contributed through internal Sun Cluster
e-mail aliases).
Additionally, we thank the Sun Cluster marketing and engineering groups for
sanitizing our technical claims. We would especially like to thank Yousef Khalidi,
Midland Joshi, Sandeep Bhalerao, and Paul Strong.
Thanks to Don Kawahigashi and Don Roach; both of these SUN University cluster
instructors helped with the review process. We appreciate the contribution of their
own material which enriched the content of the book.
Also, many thanks go to Barbara Jugo for effectively managing our book’s publication
while caring for an adorable baby daughter.
This book was made possible by the support provided by the Enterprise Engineering
team, which generated a synergy that enabled creative and innovative ideas to flow
freely in a fun environment.
Finally, thanks to the Sun customers for being our main supporters. We are confident
this book will fulfill their expectations and they will benefit from the material
presented.
Chapter 3: Sun Cluster 2.2 Components
Figure 3-1: SCSI Host Connectors 96
Chapter 5: Highly Available Databases
Table 5-1: Sun-Supplied Database Agents for HA on Sun Cluster 2.2 Software 188
Table 5-2: Factors Affecting Failover Time 193
Table 5-3: HA-Oracle/OPS Differences 195
Table 5-4: Database Software Placement 202
Table 5-5: Support File Location 203
Table 5-6: Issues Affecting Recovery Time (database-independent) 223
Alomari, A., (1999) Oracle & Unix Performance Tuning 2nd Edition. Prentice Hall, Upper Saddle
River, NJ.
A Scalable Server Architecture From Department to Enterprise — The SPARCserver 1000E and
SPARCcenter 2000E; Sun Microsystems Whitepaper.
Availability Primer: Availability Tuning and Configuration Series for Solaris Servers; Ji Zhu; Sun
Microsystems Whitepaper.
Buyya, R., (1999) High Performance Cluster Computing Volume 1 Architectures and Systems. Prentice
Hall, Upper Saddle River, NJ.
Campus Clustering with the Sun Enterprise Cluster; Peter Chow, Joe Sanzio, Linda Foglia; Sun
Microsystems Whitepaper.
Ensor, D., Stevenson, I., (1997) Oracle Design. O’Reilly & Associates Inc., Sebastopol, CA.
Loney, K., Koch, G., (1997) Oracle 8: The Complete Reference. Osborne McGraw-Hill, New York, NY.
Mahapatra, T., Mishra, S., (2000) Oracle Parallel Processing. O’Reilly & Associates Inc., Sebastopol,
CA.
Mauro, J., McDougall, R., (2001) Solaris Internals: Core Kernel Architecture. Prentice Hall, Upper
Saddle River, NJ.
Membership, Quorum and Failure Fencing in Sun Cluster 2.x; Hossein Moiin, Tim Read; Sun
Microsystems Whitepaper.
Netra st A1000 and D1000 Storage Arrays Just the Facts; Sun Microsystems Whitepaper.
Netra t 1120 and 1125 Servers Just the Facts; Sun Microsystems Whitepaper.
Netra t 1400 and 1405 Servers Just the Facts; Sun Microsystems Whitepaper.
Oracle 8 Parallel Server Concepts and Administration Manual. June 1998.
Pfister, G., (1998) In Search of Clusters. Prentice Hall, Upper Saddle River, NJ.
Reliability, Availability, and Serviceability in the SPARCcenter 2000E and the SPARCserver 1000E;
Sun Microsystems Whitepaper.
Ryan, N., Smith, D., (1995) Database Systems Engineering. International Thomson Computer Press,
Albany, NY.
Stern, H., Marcus, E., (2000) Blueprints for High Availability. John Wiley & Sons, Inc., New York, NY.
SunATM Adapters Just the Facts; Sun Microsystems Whitepaper.
Sun Enterprise 220R Server Just the Facts; Sun Microsystems Whitepaper.
Sun Enterprise 250 Server Just the Facts; Sun Microsystems Whitepaper.
Sun Enterprise 3500-6500 Server Just the Facts; Sun Microsystems Whitepaper.
Sun Enterprise 420R Server Just the Facts; Sun Microsystems Whitepaper.
Sun Enterprise 450 Server Just the Facts; Sun Microsystems Whitepaper.
Sun Enterprise 10000 System Just the Facts; Sun Microsystems Whitepaper.
Sun Fast Ethernet PCI Bus Adapter Just the Facts; Sun Microsystems Whitepaper.
Sun FDDI Adapters Just the Facts; Sun Microsystems Whitepaper.
Sun Gigabit Ethernet PCI Adapter 2.0 Just the Facts; Sun Microsystems Whitepaper.
Sun Quad Fast Ethernet Just the Facts; Sun Microsystems Whitepaper.
Solaris Resource Manager on Sun Cluster 2.2; Sun Microsystems Whitepaper.
Sun StorEdge A1000/D1000 Storage Arrays Just the Facts; Sun Microsystems Whitepaper.
Sun StorEdge A3500 Array Just the Facts; Sun Microsystems Whitepaper.
Sun StorEdge A5x00 Array Family Just the Facts; Sun Microsystems Whitepaper.
Sun StorEdge A7000 Array, Remote Dual Copy, and Sun StorEdge DataShare Just the Facts; Sun
Microsystems Whitepaper.
Sun StorEdge MultiPack, FlexiPack, and UniPack Systems Just the Facts; Sun Microsystems
Whitepaper.
Sun Ultra Enterprise 150 Just the Facts; Sun Microsystems Whitepaper.
Sun Ultra 2 Workstation Just the Facts; Sun Microsystems Whitepaper.
System Availability Report for Ultra Enterprise X000 Server Family; Sun Microsystems Whitepaper.
The Ultra Enterprise 1 and 2 Server Architecture; Sun Microsystems Whitepaper.
Ultra Enterprise 3000-6000 Just the Facts; Sun Microsystems Whitepaper.
Understanding the Sun Cluster HA API: A Field Guide; Peter Lees; Sun Microsystems Whitepaper.
Virtually every aspect of our lives is touched in some way by the computer
revolution. We have become highly dependent on computer systems—a good
example of this dependence was illustrated by the frenzied preparation for the year
2000 computer bug, Y2K. Apart from the monetary cost, the potential loss of our
computer systems caused widespread fears that the world would face various
disasters—ranging in severity from failure of the ATM network to planes falling from
the sky like ripe plums—or the ultimate bad news story, nuclear Armageddon.
Because our trading systems represent the backbone of our financial stability, users
increasingly demand 100 percent availability. Any disruption to service is measured,
not only in dollars and cents, but perhaps more importantly, in reputation. We rely on
our computer systems for everyday life—e-mail, pager services, pumping gas, and a
myriad of other functions.
Availability has long been a critical component for online systems because business
processes can quickly come to a halt when a computer system is down. For example,
in the world of e-commerce, availability is paramount because of client demand for
instant access to a site—if a site is unavailable (for whatever reason), the competitor’s
site is only a mouse click away.
Sun Microsystems places system availability at the top of its business priority list. Sun
has created two organizations for researching high availability in Sun computer
systems:
■ The RAS (reliability, availability, and serviceability) Engineering group
concentrates on statistical analysis of system components to extract availability
metrics. The RAS Engineering group develops tools that can assist hardware
designers to achieve higher levels of availability in new and existing hardware
designs.
■ The SunUP organization is part of the SunLabs group, and is responsible for
working with customers and third-party partners to develop products and
services that enhance availability in Sun computer systems. The Sun BluePrints
Program works in alliance with the SunUP program to produce best practice
documentation.
Availability is directly affected by computer hardware and software; however, people
and processes can also have a significant impact. This chapter introduces different
system categories that can assist in the decision-making process when choosing a
server type best suited to your specific business availability requirements. The
fundamental concepts of reliability, availability, and serviceability (RAS) relating to the
hardware, operating system, and application elements will be discussed. Although
the RAS mathematics may appear tedious to the reader, this information can assist
administrators in making decisions when attempting to balance cost against
availability.
Note: The statistical concepts and formulas presented in this chapter have been kept
basic for the sake of simplicity.
This chapter introduces best practices that can assist in minimizing the impact of
people and processes in the datacenter—which may help achieve higher availability
goals. The availability model for Sun servers has become increasingly complex due to
the increased number of high availability features provided at the hardware,
operating environment, and application levels.
Note: Unplanned outages are the result of uncontrollable, random system failures
associated with faults affecting hardware or software components. Unplanned
outages are the most costly, and can be minimized through component redundancy,
enhanced RAS features, and a Sun Cluster environment. Unplanned outages include
external influences (for example, network services, power infrastructure, security,
naming services, and human error).
Reliability Fundamentals
The academic definition of reliability, R(t), is the probability of a system performing
its intended function over a given time period, t. For example, a system with a 0.9999
reliability over one year has a 99.99 percent probability of functioning without failures
over the time period of a year.
Reliability is only one of many factors that influences availability. For example, 99.99
percent reliability does not mean 99.99 percent availability. Reliability measures the
ability of a system to function without interruptions, while availability measures the
ability of a system to provide a specified application service level to clients. Therefore,
reliability provides a metric of how often a component is likely to fail, while
availability includes the effects of any downtime that failures produce.
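To make the 99.99 percent example concrete, the short sketch below assumes a constant failure rate (the exponential model), which is consistent with the useful-life behavior described later in this chapter; the model choice and the MTBF figure are illustrative assumptions rather than values given in the text.

```python
import math

def reliability(mtbf_hours, period_hours):
    """R(t) = exp(-t / MTBF): probability of running the full period without a failure.
    Assumes a constant failure rate (exponential model) -- an illustrative assumption."""
    return math.exp(-period_hours / mtbf_hours)

one_year = 365 * 24                       # 8,760 hours
# An MTBF of roughly 87.6 million hours yields about 99.99 percent reliability over a year.
print(reliability(mtbf_hours=87_600_000, period_hours=one_year))   # ~0.9999
```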
The following are typical variations of the MTBF:
Hardware MTBF
This metric refers to the mean time between hardware component failures.
Component redundancy and error correction features of the Solaris Operating
Environment prevent hardware component failures from becoming system
interruptions.
System MTBF
This metric refers to the mean time between system failures. A system failure
indicates that users experience an application service level outage that can only be
restored by repair. Hardware component redundancy increases the overall system
MTBF (although the MTBFs of individual components remain the same). Because a
failure of a redundant part does not bring a system down, the overall system MTBF is
increased. However, because individual components fail at the same rate, a higher
component count increases the probability of component failure. Hardware designers
rely on statistical analysis to quantify the reliability gains from introducing
component redundancy into a system.
Failure Rate
Another metric for measuring reliability is the failure rate—defined as the inverse of
either the hardware or system MTBF (failure rate = 1 / MTBF). If the failure rate is
high, the MTBF value is small.
Figure 1-1 Component Failure Rate Over Time (early life, useful life, and wear-out periods)
Figure 1-1 illustrates hardware component failure over time and demonstrates a
phenomenon known as infant mortality—this is where a component exhibits a high
failure rate during the early stage of its life. To help ensure component failures are
detected early, manufacturers often subject components to a burn-in process whereby
systems are exposed to voltage, temperature, and clock frequency extremes.
The failure rate gradually reduces over time until it approaches a constant value
(which is maintained during its useful life). Eventually, the component enters the wear-
out stage of its life—in this stage, failures increase exponentially.
Note: New technologies may also exhibit a similar pattern of failure rate during the
early and useful life time periods.
Common MTBF Misconceptions
MTBF is commonly confused with a component’s useful life, even though the two
concepts are not related in any way. For example, a battery may have a useful life of
four hours and an MTBF of 100,000 hours. These figures indicate that in a population
of 100,000 batteries, there will be approximately one battery failure every hour during
a single battery’s four-hour life span. By contrast, a 64-processor server platform may
last 50,000 hours before it enters its wear-out period, while its MTBF value may only
be 2,000 hours.
Another common misconception is to assume the MTBF value is higher in a system
that implements component redundancy. Although component redundancy can
increase the system MTBF, it does quite the opposite for the hardware
MTBF—generally, the greater the number of components in a system, the lower the
system’s hardware MTBF.
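The battery example above can be restated numerically. The sketch below only reworks the figures already quoted (a 100,000-hour MTBF and a four-hour useful life); it adds no new data.

```python
# MTBF and useful life are unrelated quantities, as the battery example shows.
fleet_size = 100_000          # batteries in service
battery_mtbf_hours = 100_000  # quoted MTBF, in hours
battery_useful_life = 4       # hours a single battery actually lasts

# Failure rate is the inverse of MTBF; across the whole fleet this predicts roughly
# one failure per hour, even though no individual battery runs past four hours.
failure_rate_per_hour = 1 / battery_mtbf_hours
print(fleet_size * failure_rate_per_hour)   # 1.0 expected failure per hour
```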
Availability Fundamentals
Steady-state availability is a commonly quoted availability metric for most computer
systems. Availability is often measured by the uptime ratio—this is a close
approximation of the steady-state availability value and represents the percentage of
time a computer system is available throughout its useful life. Another frequently
used availability metric is downtime per year, which is obtained by multiplying the
unavailability (1 - availability) by the total number of hours in a year (8,760). This
availability model does not include enhanced RAS features typical of the
Solaris Operating Environment. This basic model is illustrated in Figure 1-2 and
makes the assumption the system is either available or experiencing a repair cycle. It
is important to note a failure must be detected and the system administrator notified
before repairs can be scheduled. The time elapsed between a failure occurring and its
detection can be reduced (to enhance availability) using sophisticated failure detection
systems.
Figure 1-2 Basic Availability Model (states: system available, system unavailable)
The Solaris Operating Environment has enhanced RAS features, including online
maintenance and ASR. Because some of the failure cycles do not require service
intervention, ASR invalidates the old computer systems availability model. The ASR
feature enables faulty components to be automatically fenced off after a component-
failure reboot—component-failure reboots are invoked by the kernel to prevent data
corruption. Hardware component redundancy is critical for system availability
because ASR will not grant a system reboot if the remainder of the components
cannot guarantee a functioning system.
Note: To enable the ASR feature, the OpenBoot PROM (OBP) diag-level
environment variable must be set to max. The diag-trigger OBP environment
variable must be set to error-reset or soft-reset to ensure that diagnostics are
executed after a reboot.
Figure 1-3 Solaris Operating Environment Availability Model with Enhanced RAS Features
(states: system available; system unavailable, reboot after repair; system unavailable,
ASR reboot after failure)
Availability = (t - (1 - R(t)) x dt) / t

where:
  R(t) = system reliability over a time period, t
  dt   = average outage in hours
The reliability, R(t), of a system represents the probability of the system not
encountering any failures over a time period, t. The value of 1 - R(t) represents the
probability of a system encountering failures over a time period t. Multiplying 1 - R(t)
by dt (the average outage time) approximates the total outage time over the same time
period (assuming R(t) is close to 1).
The value of t - (1 - R(t)) x dt provides the total time a system is working during time
period t (uptime). Dividing t - (1 - R(t)) x dt over the total time period, t, yields the
total availability figure.
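The formula can be evaluated directly. In the sketch below the reliability and outage figures are illustrative, not values quoted in the chapter; the downtime-per-year line simply multiplies the unavailability by the hours in a year.

```python
def availability(r_t, dt_hours, period_hours):
    """Availability = (t - (1 - R(t)) * dt) / t, as defined above."""
    expected_outage = (1.0 - r_t) * dt_hours     # probability of failure x average outage
    return (period_hours - expected_outage) / period_hours

t = 365 * 24                                     # one year, in hours
a = availability(r_t=0.999, dt_hours=4, period_hours=t)   # illustrative inputs
print(a)                                         # ~0.9999995
print((1 - a) * t * 60)                          # downtime per year, in minutes (~0.24)
```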
Serviceability Fundamentals
Serviceability, S(t), is the probability of a service action being completed during a
given time period, t. If a system is said to have a serviceability of 0.95 for three hours,
then the probability of a service action being completed within the specified time is 95
percent. Higher reliability reduces the frequency of system failures, while higher
serviceability reduces the duration of each repair incident.
A commonly used serviceability metric is the mean time to repair (MTTR), which is the
average length of time needed to complete a repair action.
Customers do not normally allow for the impact of the logistics time in the MTTR
calculation. The logistics time is sometimes referred to as the service response time (time
required for service personnel with the necessary spare parts to arrive at a customer’s
site). Availability calculations made without including logistics time are referred to as
intrinsic availability measurements.
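One common way to quantify the difference is the steady-state approximation Availability = MTBF / (MTBF + MTTR). That expression is standard RAS practice rather than a formula quoted in this chapter, and the figures below are hypothetical; the point is only that excluding logistics time always yields the higher, intrinsic number.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Standard steady-state approximation: uptime fraction = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = 10_000           # hours between system failures (hypothetical)
hands_on_repair = 2     # hours of actual repair work
logistics = 6           # service response time: getting parts and personnel on site

intrinsic = steady_state_availability(mtbf, hands_on_repair)              # logistics excluded
observed = steady_state_availability(mtbf, hands_on_repair + logistics)   # logistics included
print(intrinsic, observed)
```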
Dynamic Reconfiguration
The Solaris Operating Environment supports Dynamic Reconfiguration (DR) software
on Sun Enterprise™ 3000 to 6500 and 10000 platforms. The DR software enables the
attachment or detachment of hardware resources without incurring system downtime.
For further details on dynamic reconfiguration, see the Sun BluePrints book
Alternate Pathing
The Solaris Operating Environment supports alternate pathing (AP) software on the
Sun Enterprise 3000 to 6500 and 10000 platforms. The AP software enables a manual
failover of an active network with an automatic failover of disk I/O controllers to
standby resources after a failure is detected. In the future, AP will provide fully
automated failover capabilities.
The SC2.2 infrastructure does not support the use of AP because it conflicts with the
SC2.2 bundled network adapter failover (NAFO) software. NAFO provides automatic
failover to alternate network controllers when a network failure is detected.
Figure: Computer system subsystems, including static constraints (system configuration,
software choices), AC and DC power supplies, cooling fan modules (1 through n),
processors, and memory.
The availability of the overall system is only optimized after the individual
subsystems are optimized. The following sections highlight subsystem RAS-related
issues.
CPU Subsystem
Due to its complexity and high density, the processor module is a component that
may have a high failure rate. In most Sun servers, a failed processor will cause the
system to invoke a component-failure reboot. However, if the ASR feature is enabled,
the failed component will be fenced off after rebooting. There are few configuration
choices for processor modules—however, the general rule of reliability applies—the
fewer components there are, the less the chance of a failure.
Memory Subsystem
Although memory errors can be single-bit or multiple-bit, the majority of errors are
single-bit. A computer system is unaffected by single-bit errors; however, multiple-bit
errors cause the system to invoke a component-failure reboot. The general rule of
reliability also applies here—the number of memory modules should be reduced.
Note: Neither AP nor Sun Trunking are currently supported by the SC2.2
infrastructure.
The Sun Remote Services (SRS) service option can enhance system availability
through system event monitoring and service management of critical production
systems (via a modem link). Remote monitoring of disk storage subsystems has been
adopted by many high-end customers because they appreciate the value of replacing
defective hardware and upgrading obsolete firmware before they become a critical
issue in the datacenter.
The SRS option is delivered as part of the standardized SunSpectrum GoldSM and
SunSpectrum PlatinumSM service bundles that enable Sun Enterprise Service
engineers to monitor production systems on a 24 x 7 basis to assist in reducing service
latency after a problem is detected. The SRS feature detects error conditions based on
a set of pre-defined thresholds. When errors are detected, a corrective approach is
taken in partnership with the customer.
Figure: Grid-based system identification, showing system 'xyz' at grid position (C,3)
(grid columns numbered 1 through 8).
AC/DC Power
The AC power supply must conform to the voltage settings established for a specific
hardware platform. If the DC power supply subsystem fails, the whole
system will fail. To ensure voltages remain constant, they should be continually
monitored.
Mission-critical datacenter equipment is commonly equipped with multiple power
cords to access alternate sources of AC power. Systems with high availability
requirements could have datacenter power that originates from different power grids
to remove the potential of a single point of failure.
Power outages are generally brief in nature; therefore, short-term backup power
requirements can be met using an uninterruptible power supply (UPS). The UPS
battery should provide sufficient power for the system to be kept running for a
specific time period. Because a UPS converts a battery source to AC, it has the added
advantage of smoothing an erratic power supply (power conditioning).
Systems with higher availability requirements could use a diesel generator to supply
power for extended power failures. Because a diesel generator requires time to be
fully functional after starting, a UPS is still required to deliver power immediately
following a power failure.
Power cords should not be positioned where they are vulnerable to damage. It is also
critical that power cords and circuit breakers are secured so that they are not
accidentally tripped. Availability can sometimes be enhanced with simple actions
such as fitting a cover over any exposed circuit breakers to prevent accidental
tripping.
Note: Any covers fitted to circuit breakers should enable clearance for the breaker to
trip.
Network Infrastructure
The network infrastructure, which comprises switches, hubs, routers, and bridges, is
considered external to the system and is commonly overlooked as a key component of
system availability. The network infrastructure enables communications
between peer and client systems at the local and wide area network (LAN/WAN)
levels; therefore, network infrastructure scalability and availability directly affect the
availability of the total system.
Use the NAFO software provided with the SC2.2 infrastructure to improve
availability of the internal network controllers. This is performed by switching all
network I/O to an alternate controller when a failure is detected. Most network
product vendors are aware of the high availability requirements of network
infrastructure, and therefore provide redundancy at the switch and router levels.
To remove the potential of a single point of failure, WAN links for client connections
should be duplicated using independent telephone switching stations.
Customers with higher availability requirements can subscribe to the Internet using
different ISPs, or at least using separate access points with the same provider, thereby
removing another potential single point of failure.
Sun Microsystems Professional Services organization has developed a highly
available, scalable, and secure network backbone solution that uses Alteon
WebSystems switches in a redundant configuration at the Internet access point to
provide scalability and availability.
Security
In the past, security was mostly required by the government and financial sectors;
however, the e-commerce revolution is exposing more businesses to the potential of
unauthorized access to confidential data and attacks by hackers and viruses.
Malicious hackers attempt to modify business data to satisfy their own ego, whereas
sophisticated hackers steal business data in an attempt to profit financially. Generally,
security breaches cause partial outages—however, they have the potential to result in
the total demise of a business.
Denial of service (DoS) attacks involve malicious, repetitive requests to a system or
application with the intention of depleting resources. DoS attacks are less severe from
a security standpoint because they do not interfere with data integrity or privacy.
However, because an authorized end-user is not able to access a service during a DoS
attack, there is an immediate impact on business costs.
Customers requiring Internet access should secure their external connections with the
following:
■ A secure application architecture with built-in functionality to resolve DoS
attacks
■ A secure, hardened operating environment with minimized system builds
■ Up-to-date security patches
■ Chokepoints and proxies (where chokepoints are filtering routers, flow control
devices, and packet-filtering firewalls)
For customers requiring higher levels of security, Sun Microsystems offers the Trusted
Solaris™ product, which supplies security features and assurances supported by the
National Computer Security Center (NCSC) and Defense Intelligence Agency (DIA).
Businesses can use the Trusted Solaris system with their own security levels and
options, or they can enable full security to enable users to perform administrative
functions in their own workspaces.
Change Control
After a production system platform is stable and all applications are fully functional—
it is essential that any proposed system changes undergo peer review to identify any
impact and provide a strategy for implementation.
Systems should be rebooted soon after changes are implemented—otherwise, it could be
difficult to associate a failed reboot with modifications made six months earlier.
Datacenters with higher availability requirements should have a production system
mirrored by a test or staging system (the mirror should be a scaled-down version of
the production system) to implement and evaluate any changes prior to adoption.
Care should be taken to isolate test operations on the development system from the
real production environment. As an example of poor test process, a customer who
mirrored an SAP manufacturing environment inadvertently generated a test
production order that resulted in the building of a $1.5 million piece of equipment
(oops!).
In a clustered environment, system modifications should be implemented using
guidelines discussed in Chapter 3, “Sun Cluster 2.2 Components.”
Production systems should be backed up to tape prior to implementing any
changes—this enables the replication of the original system if the proposed
modifications prove detrimental.
http://sunsolve.sun.com
The patch read-me notes should be reviewed to identify which patch is relevant to a
specific production installation. For example, some patches might involve I/O
interfaces not used by the system, or could involve a locality that does not apply.
All relevant patches should be collected, stored, and applied on a three- or six-month
schedule to keep systems current and minimize outage impact. Critical patches that
affect the health of a production system should be applied immediately.
Immediately install any required or recommended patches for:
■ Solaris kernel updates to Sun Enterprise 10000 System Service Processor (SSP)
■ Sun Cluster software
■ Volume manager software (Solstice Disk Suite, Veritas Volume Manager, or Raid
Manager [rm6])
■ Disk controller/disk storage array/disk drive software
Component Spares
Having spare components available in the datacenter can reduce the time to perform
repairs. Sun Microsystems emphasizes the importance of producing interchangeable
components for use in different system platforms. This strategy enables reduced
component inventory.
Previously in this chapter, we discussed the probability of system components having
a higher failure rate in their early and late life periods. The strategy used by car rental
companies can be applied to this situation—customers with higher availability
requirements can take a proactive approach of replacing components prior to them
reaching their wear-out stage, see “Failure Rate” on page 6.
This chapter discusses the architecture of the Sun Cluster 2.2 (SC2.2) software. To
make full use of a Sun Cluster environment, it will be helpful to understand the
underlying framework.
We begin by defining the concepts of clustered computing and how they are
implemented within the SC2.2 software. The SC2.2 environment has been designed to
provide a flexible architecture for high availability and scalability. However,
regardless of configuration or cluster topology, the primary function of the product is
to provide a highly available computing environment without compromising data
integrity.
The SC2.2 software can be configured in many different ways on varying server
configurations—from Ultra-2 up to the Sun Enterprise 10000 server. Likewise, it is
also possible to use SC2.2 software as a high-performance database computing
environment that presents a single database image to clients. A correctly configured
SC2.2 cluster can withstand a (single) major failure such as losing a node, disk array,
or network connection, or several smaller, less severe errors such as losing a single
disk or single interconnect and still continue to provide application services. If a fault
is detected, the SC2.2 framework protects all shared data and begins cluster recovery.
After recovery is completed, the SC2.2 framework will continue providing access to
services. Although rare, multiple failures can occur simultaneously; the cluster
framework cannot deal with this situation. If multiple failures do occur, the data
service may not be available; however, all shared storage will be protected (because of
failure fencing).
What Is a Cluster?
A cluster is a collection of computers (heterogeneous or homogeneous) that is
connected by a private, high-speed network that enables the cluster to be used as a
unified computing resource. Clusters can be designed in a myriad of ways—some
cluster architectures are hardware-only, while others are a combination of hardware
and software. Sun Cluster technology falls into the hardware/software category. Any
approved Sun server, Sun storage device, and supporting hardware coupled with
SC2.2 software can be used to create a cluster. Applications such as electronic mail
systems, database systems, Web servers, and proprietary systems can take advantage
of a clustered environment. One of the economic advantages of clusters is the ability
to add resources to the cluster. Therefore, increasing scalability or system availability
can provide a greater return on capital investment.
Hardware SSI
At this level, nodes within a cluster share a common address or memory space. For
additional information, refer to literature on non-coherent, non-uniform memory
access (NC-NUMA) and cache-coherent, non-uniform memory access (CC-NUMA)
architectures. SC2.2 software does not offer hardware SSI.
Note: With Sun Cluster 3.0 software, an SSI is available without using parallel
databases, see Chapter 8, “Beyond Sun Cluster 2.2.”
Speed up
The speed up concept states that increasing the number of nodes in a cluster
decreases the workload time. For example, if it takes one hour to perform a complex
query, ideally adding a second node will decrease the time to 30 minutes.
Scale up
The scale up concept states that adding more nodes enables a greater number of
users to perform work. For example, if one node can handle 100 user connections for
order entry transactions, ideally adding a second node will enable another 100 users
to enter transactions—thereby potentially increasing productivity two-fold.
Application scalability using SC2.2 software can only be achieved using cluster-aware
parallel databases such as Oracle OPS, Informix XPS, or IBM DB2 EEE. When using
these database products, applications can scale as the number of nodes in the cluster
increases. It is also important to note that scalability is not a linear function. It would
be unreasonable to assume that by adding two extra nodes to an existing two-node
cluster, database performance will increase by 100 percent; however, it is not
uncommon for a correctly designed database that has been tuned to achieve 70 to 80
percent scalability.
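As a worked illustration of that 70 to 80 percent figure (the throughput numbers below are invented for the example, not measurements):

```python
# Hypothetical throughput numbers illustrating non-linear scale up.
two_node_throughput = 200                        # transactions per minute on two nodes
ideal_four_node = two_node_throughput * 2.0      # perfect (linear) scaling: +100 percent
tuned_four_node = two_node_throughput * 1.75     # 70-80 percent scalability: about +75 percent
print(ideal_four_node, tuned_four_node)          # 400.0 350.0
```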
Note: Two domains clustered within a single Sun Enterprise 10000 server is a fully
viable and supported configuration. However, the trade-off between a SPOF of a
single machine and the potential outage that could result becomes a business
consideration and not a technical limitation.
Creating a highly available computing environment has a significant dollar cost. The
cost is balanced against the value of a data resource, or in keeping a revenue-
generating application online. Therefore, any SPOF should be viewed as having the
potential of reducing (or stopping) a revenue stream, see Chapter 1, “High
Availability Fundamentals” for specific information on SPOFs.
Fault Monitors
Cluster Interconnect
Because all cluster communication is performed via the interconnect, it is the focal
point of a cluster; therefore, it is essential the interconnect infrastructure is always
available. The interconnect supports cluster application data (shared nothing
databases), and/or locking semantics (shared disk databases) between cluster nodes.
The interconnect network should not be used for any other purpose except cluster
operations. Do not use the interconnect for transporting application data, copying
files, or backing up data. Availability of the interconnect is crucial for cluster
operation, hence, its redundancy. When a cluster is started, each private network
adaptor is configured with a pre-assigned IP address (TCP/IP). The IP addresses used
for the interconnect are non-routed and are registered by Sun Microsystems, Inc.
Although redundant links are used for availability, only one link is active at any given
time. The interconnect is the only component in the cluster that could be considered
fault- tolerant—if one link fails, the other is immediately (and transparently) available
without affecting cluster operation.
Cluster Membership Monitor
The Cluster Membership Monitor (CMM) is also an important function of the SC2.2
framework—it will ensure valid cluster membership under varying conditions and
assist in preserving data integrity.
Cluster computing is a form of distributed processing. An essential component of any
distributed environment is the ability to determine membership of the participants.
Similarly, with clustered computing, it is important to know which nodes are able to
join or leave the cluster because resources have to be made available whenever a
change in membership occurs. After a cluster is formed, new membership is
determined using a consensus of active nodes within the cluster. For a cluster to
continue after a membership change (node addition or failure), the CMM will ensure
a valid cluster membership. In general, in a three-node cluster, the majority vote is
two; therefore, in the event of a failure, two votes must be “available” for the cluster
to continue. This concept is discussed in further detail in the failure fencing section,
“SC2.2 Failure Fencing/Cluster Membership,” on page 58.
Note: “Available” in the preceding paragraph refers to nodes that can communicate
with one another via the private interconnect. However, in a multi-node cluster, it is
possible to lose the entire interconnect structure and therefore not reach consensus.
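The majority arithmetic behind the three-node example is simple; the sketch below is only an illustration of the consensus rule, not SC2.2 source code (two-node clusters also involve a quorum device, which this simple rule does not capture).

```python
def votes_needed(configured_nodes):
    """Simple majority consensus: more than half the configured nodes must agree."""
    return configured_nodes // 2 + 1

for nodes in (3, 4, 5):
    print(nodes, votes_needed(nodes))   # a three-node cluster needs 2 votes, as above
```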
CCD Architecture
Not all cluster configuration data is stored and maintained within the CCD. For
example, the CCD stores status information on public network management (PNM);
however, PNM configuration information is stored in /etc/pnmconfig file on each
node.
The CCD consists of two components—the initial and dynamic:
■ The initial (init) component of the CCD remains static, and is created during the
initial cluster configuration by using the scinstall command.
■ The dynamic component of the CCD is updated by the CCD daemon (ccdd)
using remote procedure calls (RPCs). Cluster wide ccdd communications are
performed using TCP/IP (via the cluster interconnect). Valid cluster membership
must be reached before a CCD quorum can be achieved. For cluster membership
to initiate, each cluster member must have access to the cluster configuration
data (stored in the
/etc/opt/SUNWcluster/conf/cluster_name.cdb file).
Although this file is technically not part of the database structure, it is still considered
part of the CCD architecture. For further information on CCD components, see
Chapter 4, “Sun Cluster 2.2 Administration.”
CCD Components
The CMM configuration data is stored in the
/etc/opt/SUNWcluster/conf/cluster_name.cdb file (where cluster_name is
the name of the cluster).
The /etc/opt/SUNWcluster/conf/ccd.database.init file is used to store the
init component of the CCD.
The /etc/opt/SUNWcluster/conf/ccd.database file is used to store the
dynamic component of the CCD.
CCD Updates
Updates to the CCD are performed using a two-phase commit protocol. The ccdd is
automatically started on each node as it joins the cluster. Before updates to the CCD
can be performed, three conditions must be satisfied:
■ A valid cluster membership must be determined via the CMM.
■ The interconnect must be up and running via SMA.
■ A valid CCD must be determined.
To ensure consistent replication across cluster nodes, the CCD framework uses two
levels of consistency. Each copy of the CCD has a self-contained consistency
record—this stores the checksum and length (excluding consistency record) of the
CCD, which is used to timestamp the last CCD update. Cluster-wide consistency
checks are made through the ccdd by the exchange and comparison of consistency
records.
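To make the idea of a consistency record concrete, the sketch below builds one for an arbitrary block of CCD text. The field names and the use of an MD5 checksum are assumptions for illustration; the chapter does not document the actual record layout used by ccdd.

```python
import hashlib
import time

def consistency_record(ccd_text: str) -> dict:
    """Illustrative consistency record: checksum and length of the CCD contents,
    plus the time of the last update (field names are assumptions)."""
    return {
        "checksum": hashlib.md5(ccd_text.encode()).hexdigest(),
        "length": len(ccd_text),
        "last_update": time.time(),
    }

# Nodes exchanging and comparing such records can spot a stale or corrupted CCD copy.
print(consistency_record("logical_host:hahost1:nfs\n"))
```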
The following are the semantics of the CCD update process:
■ The node creates an RPC to its own copy of the ccdd requesting an update.
■ The ccdd confirms the CCD is not performing an update. The request is sent to
the master node (the master node is determined when the cluster software is
installed [has the lowest ID]).
■ The master node executes the pre-synchronization phase of the synchronization
command.
■ If the synchronization command fails, the update is aborted—if the command
succeeds, the master CCD server sends an update broadcast to all CCD daemons.
■ The CCD daemons on each node make a shadow copy of their CCD to allow a
rollback if needed. Each node performs the update and acknowledges to the master
node that the update is completed.
■ The master node manages execution of the post-synchronization phase of the
synchronization command. If the update fails, it is rolled back; if the update
succeeds, it is committed to the CCD.
■ The master node acknowledges the post-synchronization operation. The status of
the update operation is forwarded to the node that originated the update request.
Caution: The CCD is a distributed database. ASCII-based (not binary) files are used
to store CCD data. To ensure CCD integrity, modifications should only be performed
using the SC2.2 CCD administration tools. Direct modifications of the CCD may
cause data corruption.
Note: Latent NAFO interfaces are not automatically tested by the cluster framework.
PNM Fault Detection
After the PNM daemon receives a broadcast, each node writes its results to the CCD.
Each node then compares the results with those from other nodes. If nodes are
experiencing a problem with network connectivity (hub, switch, cable, or routers), the
cluster framework will take no action because all nodes within the cluster are
experiencing the same condition. Therefore, a failover of data services to another node
will not alleviate this problem. In a situation where the primary interface cannot
respond to network requests, its IP addresses (both physical and logical) are migrated
to a redundant interface within the assigned group. The IP addresses from the failed
adaptor are brought online for the new interface by using the ifconfig utility.
Although the ifconfig utility generates an address resolution protocol (ARP)
Ethernet update, the PNM will do several gratuitous ARP updates to ensure the MAC
address and IP address are correctly mapped in all ARP tables. If there are no
redundant network interfaces defined in the NAFO group, a failover of the data
services to another node will be performed.
The following is the basic algorithm used for network adapter testing:
1) In and out packets are gathered for the primary adapter of each NAFO group.
2) The PNM is inactive for a default time of five seconds. This parameter is tunable,
see Chapter 4, “Sun Cluster 2.2 Administration.”
3) Statistics are gathered again.
4) The current query is compared to the previous query. If the old and new statistics
for in and out packets are different, the adapter is considered working. If the statistics
indicate no changes, the following is automatically performed:
■ Ping the multicast at 224.0.0.2.
■ If the ping fails, ping the other multicast at 224.0.0.1.
■ If there is no response to either multicast, perform a general broadcast ping at
X.X.X.255.
5) Statistics are gathered again—if they have changed, the adapter is considered
working. If the statistics have not changed, a failure has occurred with the adapter or
network segment. The PNM daemon will query the status of the CCD to determine
adapter status of the other nodes. If the other nodes indicate no problems, a
redundant interface will be used. If no redundant interface is present, a failover of the
data service will be performed.
Note: The failover of a logical host will only be performed if a fault monitor is
associated with the logical host.
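To illustrate the algorithm, the following is a minimal Bourne shell sketch of the same test sequence. It is an illustration only, not the actual pnmd implementation; the adapter name, broadcast address, and quiet interval are assumptions chosen for the example, and Solaris OE netstat and ping semantics are assumed.

#!/bin/sh
# Sketch of the adapter test described above (illustration only).
ADAPTER=hme0                  # primary adapter of the NAFO group (assumption)
BROADCAST=192.168.1.255       # general broadcast address X.X.X.255 (assumption)
QUIET=5                       # default quiet time in seconds (tunable)

packet_count() {
    # Sum the in and out packet counters reported for the adapter.
    netstat -in | grep "^$ADAPTER " | awk '{ print $5 + $7 }'
}

before=`packet_count`
sleep $QUIET
after=`packet_count`

if [ "$before" = "$after" ]; then
    # No traffic was observed, so try to generate some, as the algorithm describes.
    ping 224.0.0.2 1 >/dev/null 2>&1 || \
        ping 224.0.0.1 1 >/dev/null 2>&1 || \
        ping $BROADCAST 1 >/dev/null 2>&1
    before=$after
    after=`packet_count`
    if [ "$before" = "$after" ]; then
        echo "$ADAPTER: no traffic seen; adapter or segment failure suspected"
        exit 1
    fi
fi
echo "$ADAPTER: adapter considered working"
exit 0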
(Figure 2-2 depicts the cluster reconfiguration sequence, including the return, stop, and abort transitions.)
The following is a description of the cluster reconfiguration steps (Figure 2-2):
Return Step—Each component prepares for the reconfiguration sequence. All new I/O
is disabled and pending I/O is finished. Although the SMA monitors the
interconnects, it will not intervene in the event of a failure. At this time, the CCD will
drop any network connections.
Begin Step—New cluster membership is established, and faulty nodes are removed
from the cluster.
Step 1—The logical interconnect is brought online by the SMA. If a reservation of a
quorum device is necessary, it will be acquired at this time.
Step 2—The node lock is acquired.
Step 3—The ccdd creates IPC connections (via the interconnect) to the CCD servers. If
Oracle OPS is used, the UDLM begins recovery.
Step 4—The CCD elects a valid database. Any running HA fault monitors are stopped.
Step 5—Monitoring of the terminal concentrators begins.
Step 6—Reserved disk groups are released. The CMM updates the Cluster Volume Manager (CVM) with regard to cluster membership. At this time, the CVM creates a list of nodes joining or leaving the cluster.
Step 7—The CVM determines which node is master (Step 6). Blocked I/O proceeds,
and faulty network interfaces are invalidated via PNM.
Step 8—The CMM will decide which node(s) will become the master or slave. The
master node(s) imports the disk groups. The recovery of any faulty node(s) begins (if
required).
Step 9—Disk groups (imported in previous step) undergo recovery (if required).
Step 10—Any required shared disk groups are reserved. The logical host(s) is
migrated to the standby node as required (tunable parameter).
Step 11—The logical host(s) is mastered (tunable parameter).
Step 12—HA fault monitors are started.
The actions performed in the previous 12 steps appear to be lengthy and time-
consuming; however, the most time-consuming part of the process is importing the
disk groups and disk recovery (Step 9). Reducing the number of disks per disk group
or disk set will decrease the time required to complete this step. If file system recovery
(Step 9) is required using the fsck command, it will take longer before the disks are
ready. Using a journaling file system can greatly reduce the time required to perform
a file system check (if required).
Volume Management
Disks and disk failures are managed using a disk manager. Disk failures are transparent to the SC2.2 framework (provided a disk mirror is available) and do not affect cluster functionality or membership.
Note: A complete I/O failure is considered a dual failure—this type of failure cannot
be tolerated by the SC2.2 architecture.
Logging file systems should be used to decrease recovery time and for file system
consistency checks (using the fsck command). For journal file logging, SC2.2
supports Solaris OE versions 8.x and 7.x logging, Solstice DiskSuite log devices, and
Veritas file systems (VxFS).
The SC2.2 product can be used with the following disk managers:
■ Solstice DiskSuite (SDS)
■ Veritas Volume Manager (VxVM)
■ Cluster Volume Manager (CVM)
Note: Software products change often to keep up with technology; therefore, for
specific information on volume managers or logging file systems, consult your Sun
Microsystems sales office.
The choice of using the SDS or VxVM volume manager is often based on preference
and/or feature sets. Both volume managers can be used with the SC2.2 framework in
HA mode, and with shared nothing database systems such as Informix XPS or IBM
DB2, see Table 2-2.
Table 2-2 Supported Volume Managers for Various Applications
The CVM software is based on the VxVM source code (with the added functionality of shared disk groups). Prior to VxVM version 3.0.4, disk groups could be imported by only one host at a time; however, the CVM software enables disk groups to be imported by multiple hosts. The CVM must be used for shared disk databases such as Oracle OPS (and is used in conjunction with the shared disk databases to enable HA configurations).
After a volume manager has been installed, it can be used for shared and non-shared storage. If VxVM is required for cluster storage management, it must also be used to mirror all disks.
Note: The use of different volume managers within the same SC2.2 environment is not supported.
Fault Probes
Fault probes are data service methods (DSMs) assigned to a data service which probe
the application to check its health (or the health of the existing environment). Fault
probes cannot exist without a data service; however, a data service does not
necessarily require a fault probe, see Chapter 7, “Sun Cluster 2.2 Data Services.”
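As an illustration only (not a Sun-supplied data service method), a fault probe can be as simple as a script that checks whether the application answers on its service port. The logical host name and port below are hypothetical.

#!/bin/sh
# Hypothetical fault probe sketch: check that an application accepts TCP
# connections on its service port.  A real probe would also enforce a timeout
# and apply the data service's restart or failover policy.
LOGICALHOST=lhost1        # hypothetical logical host name
PORT=8080                 # hypothetical application port

if echo quit | telnet $LOGICALHOST $PORT 2>/dev/null | grep "Connected" >/dev/null
then
    exit 0    # application appears healthy
else
    exit 1    # probe failed; the framework can restart or fail over the service
fi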
Parallel Databases
Sun Cluster 2.2 software supports two database topologies:
■ High availability (HA)
■ Parallel database (PDB)
For additional information on both topics, see Chapter 5, “Highly Available
Databases.“
Note: Logical hosts are mastered by one node and must have at least one secondary
node assigned to them.
(Figure 2-5 depicts a symmetric configuration: two nodes, each mastering its own logical host (Logical Host 1 and Logical Host 2), with shared disk, data services, and application monitoring.)
Each node is the master of its own logical host, and also acts as a standby for the other
node. Therefore, if node A fails, node B continues to support its own services and also
takes over the database and services of node A. In this example (Figure 2-5), node B
continues hosting the human resources database; however, a second database is
started for accounting.
It is important to note that multiple logical hosts can be configured for each
node—that is, each node is not limited to one database instance per node. Applying
this principle to our example, if five separate databases were used for the human
resources department on node A, and three databases were used for the accounting
department on node B, in a failure scenario, each node must have the resources and
capacity to run all eight databases.
Configuring a symmetric HA configuration can be more challenging than configuring
an asymmetric configuration. Application behavior must be fully understood because
any node could be required to host multiple applications (thereby pressuring
resources). The capacity of each node within a symmetric configuration should be
such that it can handle the entire cluster workload—if this is not possible, users will
have to settle for reduced levels of service. Decisions of this nature should be driven by business requirements, not technical ones.
Note: The servers depicted in Figure 2-6 could be two Sun Ultra 2 servers, two Sun
Enterprise 10000 servers, or two domains within a single Sun Enterprise 10000 server.
The nodes do not have to be identical machine types—it is acceptable to cluster a Sun
Enterprise 6500 server with a Sun Enterprise 4500 server.
Each node has its own local boot disk(s) that is not part of the cluster environment,
that is, it is not shared among cluster nodes. A boot disk holds the operating system,
cluster software, and application software. A boot disk should always be mirrored
(not shown in diagram). Failure of a boot disk will force a node to panic (because the
OS will no longer be available), and all cluster operations will migrate to the other
node.
Note: Although it is a good idea to mirror the boot disk for availability, it is not a
technical requirement. Loss of a boot disk that is not mirrored will cause a failover of
the data services on that node.
Shared storage is available to both nodes and should be mirrored. A two-node cluster
is able to run high availability services in symmetric and asymmetric configurations
(and shared disk and shared nothing parallel databases). For parallel databases,
Oracle Parallel Server, Informix XPS, or IBM DB2/EEE can be used.
Note: The System Service Processor (SSP) is used for cluster configurations on Sun
Enterprise 10000 servers.
Figure 2-6 depicts the use of redundant network interfaces that are connected to
different network subnets. Although redundant network interfaces are not a technical
requirement, a failure of the primary network interface will lead to cluster failover.
Network interface redundancy is managed by the cluster framework, see section,
“Public Network Management,” on page 41.
Multiple two-node clusters can share the same administrative workstation and
terminal concentrator—in this arrangement, the clusters would be independent of one another. Sharing the management framework can make administration easier
and reduce costs.
(Figure 2-6 depicts a two-node cluster attached to the public network through switches/hubs, with a management console.)
N+1 Topology (Three- and Four-Node Clusters)
Figure 2-7 depicts a four-node N+1 cluster. One difference between an N+1
configuration and a two-node cluster is the interconnects do not use point-to-point
connections; with N+1, each node connects to redundant Ethernet or SCI hubs.
In the N+1 configuration, the first three nodes connect to their own storage device(s).
The remaining node (node D) connects to all storage devices. The first three nodes are
configured to run highly available applications. Node D is assigned to be the standby.
If node A fails, node D will take over the services of node A.
A four-node N+1 configuration is ideal for environments where the workload cannot
be shared. For example, if the first three nodes are Sun Enterprise 4500 servers hosting
Web and database applications, the standby node could be a Sun Enterprise 6500. If
the three Sun Enterprise 4500s are running at full capacity, a Sun Enterprise 6500
cannot manage the combined workload of the other three servers. However, when
configured correctly, the Sun Enterprise 6500 could handle the workload of two Sun
Enterprise 4500s (with performance degradation).
The N+1 configuration provides flexibility for cluster configurations where each node
requires specific resources for applications (before and after a failure scenario).
Additionally, the N+1 configuration has the ability to run highly available services in
asymmetric configuration and shared nothing parallel databases. For parallel
databases, Informix XPS and IBM DB2/EEE can be used.
Ring Topology (Three- and Four-Node Clusters)
Figure 2-8 depicts a four-node ring configuration cluster. A ring topology is a
variation of the N+1 configuration; however, there are no dedicated standby nodes.
Each node in the cluster is configured to run an application or provide a service.
With a ring configuration, each node is a standby node for its adjacent node. Each
node connects to the same shared storage as its adjacent node (including mirrors). For
example, if node A fails, node B takes over its services. Node C is the standby for node B, and node A is the standby for node D. This configuration would be a good
choice for homogenous workloads, such as Web or e-mail servers, which are subjected
to consistent usage patterns (such as intranet environments). Additionally, the ring configuration has the ability to run high availability services in both symmetric and asymmetric configurations (and shared nothing parallel databases). For parallel databases, Informix XPS and IBM DB2/EEE can be used.
Note: The configuration depicted in Figure 2-9 requires an administrative
workstation and network terminal concentrator (or an SSP when using a Sun
Enterprise 10000 server), which are not shown in the diagram.
Note: This section explains failure fencing scenarios for node failures; however, the
same mechanisms are used during routine SC2.2 operations (such as when a node
joins or leaves the cluster).
Failure fencing should not be confused with cluster membership because they are
different concepts. Because both concepts come into play during cluster
reconfiguration, sometimes it can be difficult to distinguish between the two.
The following is an explanation of terms used in this section:
Cluster membership—algorithms to ensure valid cluster membership.
Failure fencing—techniques and algorithms to protect shared data from non-cluster
members.
Partition—any node(s) with the potential to form a cluster. If more than two nodes are in a partition, they must be able to communicate exclusively among themselves.
Quorum devices—assigned disk drives (located in shared storage). For a cluster to form, a partition must have access to at least one quorum device. These devices are used as tie-breakers in the event the cluster degrades into two partitions (of two nodes each). Quorum devices are also used as the basis of failure fencing in two-node, N+1, and ring topologies.
SCSI-2 reservation—a component of the SCSI-2 protocol that enables a host to reserve
the disk. Reservations are atomic—that is, only one node can perform and hold the
reservation; other nodes attempting to access the device will be denied.
Membership Rules
1. a) If a non-scalable partition has less than half of cur_nodes, it will abort out of the cluster.
b) If a scalable partition has less than half of cur_nodes, it waits for ETO+STO, then attempts to reserve the node lock and quorum device. If the partition is successful in reserving both the node lock and the quorum device, it becomes the cluster; otherwise, it aborts.
2. b) If the topology is scalable and a partition holds the majority of cur_nodes, a node in that majority partition will acquire the node lock. If the node fails to gain the node lock, the partition will abort. The node that acquires the node lock will issue a reset function to terminate all other nodes outside the partition.
3. When a partition has exactly half of cur_nodes, there are three possible outcomes:
a) In a situation where cur_nodes equals two (when both nodes have access to a common quorum device), if communication is lost between the nodes, both nodes race for the quorum device. Whichever node acquires the quorum device stays in the cluster; the node that does not reserve the quorum device aborts out of the cluster.
b) In a scalable topology, each partition races for the node lock and quorum device after ETO. The successful partition will terminate all other nodes, as in Rule 2(b). If the partition fails to acquire the node lock, it will implement the ask policy.
c) In all other cases (single-node partitions not sharing a quorum device, or partitions greater than two nodes), the ask or deterministic policy will be implemented.
Note: Nodes with shared data disks must share a quorum device; however, quorum
devices can also be used in a configuration where data is not shared.
2. After the first node in a scalable cluster starts, it acquires the node lock. If this node leaves the cluster, the node lock is released and is acquired by another node. If any node in a scalable topology cluster cannot acquire the node lock, it will not start the cluster.
Note: Solstice Disk Suite ignores Steps 1 and 2 because in these scenarios it uses its
own failure fencing algorithms using database replicas.
The following sections describe two failure scenarios for the following cluster
topologies:
■ Node failure
■ Split brain—an interconnect failure resulting in a cluster that divides into two or more partitions
(Figure 2-10 depicts nodes A and B connected by the cluster interconnect, with quorum device Q1 in the shared storage arrays; the failed element is marked in the figure.)
Cluster Membership
By applying Membership Rule 3(a), it can be seen in Figure 2-10 that node A is half of
cur_nodes (and shares quorum device Q1 with node B). After node B fails, node A
will acquire the quorum device and will continue as the active partition. If node A
was unable to acquire the quorum device, the partition would abort.
Failure Fencing
The failure fencing process begins as soon as node A acquires the quorum device. The
scenario depicted in Figure 2-10 uses Rule 1, which states that node A will acquire a
SCSI-2 reservation on the quorum device (and all shared disks). The SCSI-2
reservation prevents node B from starting its own cluster (or accessing shared
storage), unless it is a cluster member.
(Figure 2-11 depicts the two-node cluster with quorum device Q1 in the shared storage arrays.)
Cluster Membership
In this example (Figure 2-11), both interconnects failed—the SMA daemon will
immediately notify the CMM. The CMM on each node will begin the cluster
reconfiguration process. This scenario is covered by Rule 3(a), which states that both
nodes will attempt to acquire the quorum device (Q1). The node that succeeds in
acquiring the quorum device will be the partition that continues (the remaining node
aborts from the cluster). To avoid possible data corruption (pending I/O), the CMM
on the node that is not able to reserve the quorum will invoke the fail-fast driver. The
fail-fast driver was designed as a safety valve in the event of dual interconnect failure
in a two-node cluster. When invoked, the fail-fast driver’s function is to bring the
node to the boot prompt. After the interconnect is repaired, the node is able to rejoin the cluster.
(The accompanying figure shows quorum devices Q1, Q2, and Q3, where Qx denotes the quorum device number; the failed node is marked.)
Failure Fencing
In this example, failure fencing is covered by Rule 1. A SCSI-2 reservation will be implemented on Q3 only, because a reservation is placed on a shared quorum device only when one of the two nodes that share it is not present. SCSI-2
reservations will not be implemented on Q1 or Q2 because the nodes that share those
quorum devices are present; therefore, failure fencing is not required. Node D will
hold the SCSI-2 reservation for Q3 until node B can rejoin the cluster.
Note: To simplify this concept, the configuration depicted in Figure 2-13 is basic. In
practice, two hubs are used for the interconnects—each node is connected to both
hubs—making the failure scenario highly unlikely.
Interconnect Failure
(Figure 2-13 shows quorum devices Q1, Q2, and Q3, where Qx denotes the quorum device number.)
Failure Fencing
Failure fencing is performed using SCSI-2 reservations. If partition 1 (nodes A and B)
is the partition that continues, a SCSI-2 reservation will be implemented on Q1 and
Q2; however, if partition 2 (nodes C and D) is to remain active, then a SCSI-2 reservation will be implemented on Q3.
Scalable Topology
Scalable topology offers configuration flexibility, which enables it to be used for
shared disk databases or HA services. Because of this flexibility, an extra
configuration step is required for cluster membership and failure fencing.
For example, if node A starts a scalable cluster, the first function performed during
cluster formation is for node A to telnet to an unused port (node lock) on the
terminal concentrator (the port number is determined at installation). Node A will
retain the node lock unless it leaves the cluster. If node A leaves the cluster, the node
lock is passed to another node in the partition. If node A is unable to acquire the node
lock, it will not start the cluster (failure fencing Rule 2).
Failure fencing and membership rules for scalable topologies are slightly different
from other topologies; however, the goals are the same—to protect shared data and (if
possible) continue to provide services. Unlike N+1 and ring topologies, it is possible
for a multi-node, scalable topology to degrade into a single node and still continue to
provide services. Also, with a scalable topology, only a single quorum device is
required to be configured (regardless of the number of logical hosts).
With respect to a shared disk database such as OPS, access to shared disks is governed
through a Distributed Lock Manager (DLM). The DLM processes require the cluster
framework to be functional; therefore, access to shared disks without the presence of
a DLM is not possible. For instance, Oracle will not start on a failed node without the
DLM being part of the cluster. An additional step performed in an OPS configuration
is lock recovery on the failed nodes—this function is performed by the DLM.
(The accompanying figure shows the scalable topology with quorum device Q1, mirrors not shown; the failed node is marked.)
Membership
In this example, node A acquires and holds the node lock. Because this is the majority
partition, all I/O stops, cluster reconfiguration begins, and the isolated node is
removed from the cluster membership.
Failure Fencing
Although the failure fencing mechanism used in a two-node cluster is efficient, it is
not used with clusters configured in a scalable topology. Because a SCSI-2 reservation
is an atomic operation—that is, one node issues and holds a reservation—if
performed in a scalable topology, all other nodes will be locked out. Therefore, SCSI-2
is not used. Failure fencing is performed using the node lock on the terminal
concentrator (SSP with the Sun Enterprise 10000 server). No other cluster can form
without acquiring the node lock—in this example, node A holds the lock for as long
as it is a cluster member, thereby ensuring no other cluster can form.
Note: To simplify this concept, the configuration depicted in Figure 2-15 is basic. In
practice, two hubs are used for the interconnects—each node is connected to both
hubs—making the failure scenario highly unlikely.
Interconnect Failure
Membership
The example in Figure 2-15 is covered by Membership Rule 2(b)—the deterministic or
ask policy will be used. The partition containing the single node is covered by
Membership Rule 1(c).
If the deterministic policy is chosen, the partition which has the highest or lowest ID
will be the partition that continues, while the other partition will implement the ask
policy. Assuming the lowest IDs were used, nodes A, B, and C will continue.
Failure Fencing
After the reset function is performed, node A will verify that node D is at the boot
prompt—if so, the majority partition will continue; otherwise, the ask policy will be
used for both partitions. If the ask policy is used, the SC2.2 framework requires user
input to determine which partition continues or aborts. Failure fencing also consists of
acquiring the node lock and stopping all I/O until cluster reconfiguration is complete.
Note: To simplify this concept, the configuration depicted in Figure 2-16 is basic. In
practice, two hubs are used for the interconnects—each node is connected to both
hubs—making the failure scenario highly unlikely.
Interconnect Failure
(Figure 2-16 shows quorum device Q1; mirrors are not shown.)
Failure Fencing
After the reset function is performed, node A will verify that nodes C and D are at the
boot prompt. If so, the partition will continue; otherwise, the ask policy will be used
for both partitions. If the ask policy is used, the SC2.2 framework requires user input
to determine which partition continues or aborts. Failure fencing also consists of
acquiring the node lock and stopping all I/O until cluster reconfiguration is complete.
Note: To simplify the concept, the configuration depicted in Figure 2-17 is basic. In
practice, two hubs are used for the interconnects—each node is connected to both
hubs—so both hubs would have to fail for this situation to occur.
(Figure 2-17 shows quorum device Q1; mirrors are not shown.)
Membership
In this example (Figure 2-17), each partition has less than half of cur_nodes—Membership Rule 1(b). In this scenario, each node is a valid partition. Because each node does not know
the status of the other nodes, each one will try to become the new cluster. However,
every partition will have to wait ETO+STO before proceeding. When these time-outs
elapse, each partition will race for the node lock. After the node lock is acquired, the
node will then have to acquire the quorum device. The node that is able to reserve
both the node lock and quorum device will continue as the sole cluster member.
Failure Fencing
After both the node lock and quorum device have been acquired, the cluster member will use the reset function to terminate the other three nodes. Once it has verified that all other nodes are at the boot prompt, the partition will continue;
otherwise, the ask policy will be used for all partitions. Failure fencing in this case
consists of acquiring the node lock and quorum device and stopping all I/O until
cluster reconfiguration is complete.
Note: Again, it is important to stress, SC2.2 software was designed to handle a single
failure without a prolonged disruption of services.
It is impressive how the SC2.2 framework is able to protect shared data. In the majority of multiple hardware failure scenarios, SC2.2 will continue to provide data services (albeit with degradation), regardless of the failure condition.
Tables 2-3 and 2-4 provide an overview of different scenarios for multi-node clusters
configured in N+1, ring, and scalable topologies.
Summary
The SC2.2 architecture is designed to provide high levels of availability for hardware,
operating system, and applications without compromising data integrity. The
architecture is a flexible, robust computing platform. The SC2.2 framework can be
customized using the SC2.2 API to make custom applications highly available.
CHAPTER 3
Sun Cluster 2.2 Components
The task of choosing cluster components from the myriad of available hardware and software options can be difficult because administrators must balance supportability, availability, performance, capacity, and cost requirements.
This chapter addresses supported Sun Cluster 2.2 (SC2.2) hardware and software
options to highlight the large selection of available choices. This information can assist
customers in choosing a cluster configuration that best matches their datacenter
capacity, performance, and availability requirements. The hardware components are
described to highlight the RAS, performance, and capacity features.
Specific combinations of components supported by the SC2.2 software undergo
extensive quality assurance testing using supported cluster topologies to avoid
incompatibility issues at customer sites. Contact your local Sun sales office or service
representative to obtain the current SC2.2 hardware and software support matrix.
This chapter does not address the impact of component cost. Although component cost
may be a prime factor in determining a customer’s cluster component selection, we
assume any cost decisions are business driven and will be made at the managerial
level.
In this section, the term server platform refers to the set that comprises the following
components:
System Cage
A metal structure that includes a main system board and backplane (or centerplane).
The system cage also contains the power supply, power cord(s), and cooling fan(s).
Note: All power supplies used in Sun servers are universal (i.e., AC input ranges between 90 and 250 V), which helps make them resilient to AC supply variations.
Cooling Fans
Provide forced air to cool components. Some servers have a hot-swap cooling fan
option.
Note: The ASR feature automatically isolates faulty hardware components to make a
system functional after a component-failure reboot.
■ JTAG (Joint Test Action Group, IEEE Std. 1149.1) ASIC diagnostics
■ Dynamic RAM error correction code (ECC)
■ System board LED diagnosis
■ Redundant packet-switched data/address bus (second XDBus available on the
SC2000E only)
■ XDBus and CPU cache parity protection
Note: For ease of reading, Sun Microsystems <model number> <product type> will
be referred to as model xxxx in the remainder of this chapter (where applicable).
Where ambiguity is possible, the model number and product type will be specified.
Sun Microsystems Ultra™-1 and Ultra-2 workgroup servers are based on the
64-bit, SPARC™ Version 9-compliant, UltraSPARC™-I processor technology.
Note: SC2.2 support for the Ultra-1 is restricted to the 170 model. The
Ultra-1/170 has the 167 MHz UltraSPARC processor option.
The Ultra-1 and Ultra-2 servers employ an Ultra Port Architecture (UPA)
interconnect to support internal and external 64-bit SBus (IEEE Std. 1496-1993)
peripherals. UPA supports data transfer speeds of 1.3 Gbyte/second on a 167 MHz
system and 1.6 Gbyte/second on a 200 MHz system.
On-Board Devices
■ One Ethernet IEEE 802.3 RJ45 port (10 Base-T for the Ultra-1 and 100 Base-T for
the Ultra-2)
■ Two RS-232 or RS-423 serial ports
■ One Centronics-compatible parallel port
■ One single-ended SCSI-2 connector (8-bit for the Ultra-1 and 16-bit for the
Ultra-2)
■ One 16-bit, 48 KHz audio device with options for: line-out, line-in, microphone-
in, headphone-out, and internal speaker
■ Optional drives: SCSI CD-ROM and 3.5-inch 1.44-Mbyte floppy drives
■ Three SBus slots (Ultra-1) or four SBus slots (Ultra-2)
■ Additional UPA connector (Ultra-2)
Note: The SC2.2 software supports the on-board Ethernet standard (IEEE 802.3) for
both Ultra-1 and Ultra-2 when configured for private interconnects or public
networks.
System Resources
■ Up to four processors (maximum 440 MHz) with on-chip data and instruction
cache (32 Kbytes each), and up to 4 Mbytes of external cache available for each
processor
■ Memory capacity of 4 Gbyte (16 x 256 Mbyte DIMMs)
On-Board Devices
■ One Ethernet IEEE 802.3 10/100 Base-T RJ45 port
■ Two RS-232 or RS-423 serial ports
■ One Centronics-compatible parallel port
■ One single-ended fast-wide Ultra SCSI-2 connector
■ One 16-bit, 48 KHz audio device with options for: line-out, line-in, microphone-
in, headphone-out, and internal speaker
■ Four PCI slots (2 at 33 MHz with 5V signalling; 1 at 33 MHz with 5V signalling;
and 1 at 33–66 MHz with 3.3V signalling)
■ Optional SCSI CD-ROM and DAT drives
RAS Features
■ Extensive power-on self test (POST)
■ ECC bus protection (memory and UPA data)
■ Cache RAM parity
■ Variable-speed fans with thermal sensors—a thermal fault results in a warning
alert and/or shutdown to avoid component damage.
■ The ability to run SunVTS diagnostics to validate system functionality.
■ Minimal internal cabling, jumpers, and common fasteners to make servicing
easier.
■ Dual redundant (n+1) hot-swap power supplies (48/60V DC, 110/240V AC
options).
■ Internal hot-pluggable Ultra-SCSI disks.
■ Lights-out management (LOM): telecom RAS management (alarm card with
relay output, LED panel, and remote access ports to monitor power inputs and
fan failures and control remote reset, programmable watchdog, and user-
programmable alarms).
■ Ruggedized, rack-mountable chassis capable of operating in hostile
environmental conditions (NEBS Level 3-compliant).
Model 150
■ Supports one processor (maximum 167 MHz) with on-chip data and instruction
cache (32 Kbytes each), and up to 512 Kbytes of external cache.
■ Memory capacity of 1 Gbyte (8 x 128 Mbyte SIMMs).
On-Board Devices
■ One Ethernet IEEE 802.3 10/100 Base-T RJ45 port.
■ Two RS-232 or RS-423 serial ports.
■ One Centronics-compatible parallel port.
■ One internal and one external single-ended fast-wide Ultra SCSI-2 connectors
(except model 150, which allocates one of the three internal SBus slots for a
SunSwift™ card to provide 100 Base-T and fast-wide SCSI connectivity).
■ One 16-bit, 48 KHz audio device with options for: line-out, line-in, microphone-
in, headphone-out, and internal speaker.
■ Model 150 supports two 25 MHz, 64-bit SBus slots; models 220R, 250, and 420R
support four PCI slots (3 at 33 MHz with 5V signalling and 1 at 33–66 MHz with
3.3V signalling); model 450 supports ten PCI slots (7 at 33 MHz with 5V
signalling and 3 at 33–66 MHz with 3.3V signalling).
■ Models 220R and 420R support two internal hot-pluggable Ultra-SCSI disks;
model 250 supports six internal hot-pluggable Ultra-SCSI disks; model 150
supports 12 fast-wide SCSI disks; model 450 supports 20 internal hot-pluggable
Ultra-SCSI disks.
■ Optional 1.44 Mbyte drive and 4mm DDS-3 SCSI tape.
RAS Features
■ Extensive power-on self test (POST).
■ ECC bus protection (memory and UPA data).
■ Cache RAM parity.
■ Variable-speed fans with thermal sensors—a thermal fault results in a warning
alert and/or shutdown to avoid component damage.
■ The ability to run SunVTS diagnostics to validate system functionality.
■ Minimal internal cabling, jumpers, and common fasteners to make servicing
easier.
■ Internal hot-pluggable Ultra-SCSI disks (model 150 supports fast-wide drives
only).
■ Dual redundant (n+1) hot-swap power supplies with separate power cords
(model 150 has a single power supply; model 450 has three).
Note: The Sun Enterprise x500 servers have a faster (clock speed) centerplane than
the x000 servers.
Each server in the xx00 family can accommodate five types of plug-in boards:
■ CPU/memory board
■ SBus I/O board
■ Graphics I/O board
■ PCI I/O board
■ Disk board (not available for Sun Enterprise xx00 servers)
Note: The SC2.2 software supports the on-board Ethernet standard (IEEE 802.3) for
model xx00 servers when configured for private interconnects or public networks.
RAS Features
■ Extensive power-on self test (POST).
■ Variable-speed fans with thermal sensors—a thermal fault results in a warning
alert and/or shutdown to avoid component damage.
■ ECC bus protection (memory and UPA data).
■ Parity protection (cache RAM, interconnect address, and control signals).
■ Passive component centerplane.
■ N+1 redundancy option available for hot-swap power supplies and cooling fans.
■ Remote booting and power cycling.
■ Automatic System Recovery.
■ Dynamic Reconfiguration enables online maintenance (system boards can be hot-
plugged). See, “Dynamic Reconfiguration,” on page 12 for additional details.
Note: Alternate Pathing and Dynamic Reconfiguration are not currently supported
by the SC2.2 software.
System Resources
■ 16 centerplane slots—up to 64 processors (maximum 400 MHz), with on-chip
data and instruction cache (32 Kbytes each), and up to 8 Mbytes of external cache
available for each processor.
■ Memory capacity of 64 Gbytes (512 x 128 Mbyte DIMMs).
Note: Alternate Pathing and Dynamic Reconfiguration are not currently supported
by the SC2.2 software.
■ Redundant control boards provide JTAG, clock, fan, power, serial interface, and
SSP Ethernet interface functionality.
■ Extensive power-on self-test for all system boards prior to their integration into a
domain.
■ Variable-speed fans with thermal sensors—a thermal fault results in a warning
alert and/or shutdown to avoid component damage.
■ ECC bus protection (memory and UPA data).
■ Environmental monitoring.
■ Parity protection (cache RAM, interconnect address, and control signals).
■ Parallel data and address paths provide redundancy for availability and enable load balancing.
■ N+1 redundancy available for hot-swap power supplies and cooling fans.
■ Redundant AC power cords and circuit breakers.
■ Remote booting and system management using SSP.
■ Automatic System Recovery (ASR).
Note: Sun Cluster 3.0 software only runs on Solaris OE version 8 (and higher), see
Chapter 8, “Beyond Sun Cluster 2.2” for additional information.
Note: A failure of this controller creates a larger availability impact because it brings
down the network and disk access functions simultaneously.
Note: Failure of this controller carries a higher availability impact because of the
potential of losing four network links simultaneously.
Gigabit Ethernet
The IEEE Gigabit Ethernet 802.3z standard, an extension of the 10- and 100-Mbps IEEE
802.3 Ethernet standard, accommodates larger packet sizes and operates on fiber
transport (up to 1 Gbps). Gigabit Ethernet supports full-duplex transmission
(aggregate of 2 Gbps) for switch-to-switch and switch-to-end-station connections.
Half-duplex transmission for shared connections uses repeaters that implement the
carrier sense multiple access with collision detection (CSMA/CD) access method.
Token Ring
Token ring is a common transport in the mainframe world. Although token ring
supports the TCP/IP protocol, it is generally used to interface with mainframes using
IBM’s System Network Architecture (SNA) protocol.
Note: The SC2.2 software does not support the SNA protocol.
Token ring supports transport speed options of 4 Mbps or 16 Mbps and adheres to a
token-passing bus scheme using IBM type 1 or type 2 STP cabling to make bandwidth
allocation predictable.
The Sun Microsystems TRI/S adapter is used in conjunction with the SunLink TRI/S
software to provide transparent access to token ring networks without modifying
existing applications. The SunLink TRI/S adapter provides IP routing between
Ethernet, FDDI, and token ring networks and is compatible with IBM source routing
bridges.
Fast Ethernet
Fast Ethernet is a good choice for a private interconnect when used to move only
heartbeat information, see section, “Fast Ethernet” on page 91.
Distance
FC devices remove the necessity of having disk storage in close proximity to the
server—enabling disk storage to be in a physically separated room or building.
Bus Termination
FC devices do not require electrical signal termination and are not affected by
electromagnetic interference (EMI).
Multiple Hosts
The FC protocol enables multiple hosts to be connected in the same environment
without conflict. By contrast, the SCSI protocol requires multiple hosts sharing the
same environment to use unique SCSI-initiator IDs, see “Appendix A: SCSI-Initiator
ID.”
Note: The SC2.2 software enables SCSI storage to be shared between two nodes. FC
storage can be shared between the maximum number of nodes (current maximum is
four).
Bandwidth
The FC transport provides a higher aggregate bandwidth than SCSI because the data is transmitted serially (maximum 200 Mbyte/sec), whereas SCSI data is transmitted in parallel (maximum 40 Mbyte/sec).
Some disk subsystems supported by the SC2.2 software feature a battery-backed
cache, which is essential for performance in write-intensive (i.e., OLTP) environments.
Either Veritas VxVM or Solstice DiskSuite™ (SDS) can be used to manage volumes
across all disk subsystems, including those with hardware RAID capability. It is
essential that disk controllers have redundant paths to access mirrored
data—alternate controllers should be located on a separate system board or data bus
(if possible).
Note: Shared SCSI devices require a unique SCSI-initiator ID on both cluster nodes
and limit access to two-node clusters, see “Appendix A: SCSI-Initiator ID.”
SCSI storage subsystems have SCSI-in and SCSI-out ports, see Figure 3-1. In a single-
host environment, the SCSI-in connector attaches to the host and the SCSI-out
connector is either capped with a terminator (that matches the differential or single-
ended bus technology) or connected to the next storage device. In a multi-host
environment, the SCSI-out connector is attached to the next host sharing the SCSI
device.
(Figure 3-1 shows the SCSI-in and SCSI-out ports of a SCSI storage subsystem.)
Models 100/110
Hold 30 drives and have a 4-Mbyte battery-backed NVRAM cache option (Model 110 runs at a higher clock speed).
Models 200/210
Hold 36 drives and have larger disk capacities per drive than the 100/110 models. The battery-backed NVRAM cache option for both models is 16 Mbytes.
Hot Spares
This feature enhances recovery from a disk failure by automatically replacing a failed
mirror or RAID-5 partition with an alternate disk. Hot spares automatically migrate
RAID-5 and mirrored data. During this time, users can access surviving partitions.
Disk Sets
SDS includes host ownership assignment to a group of disks to support the multiple
host synchronization required by clusters.
Note: To enable the journaling file system, the Solstice DiskSuite 3.x product should
be used in conjunction with Solaris OE (version 7 or higher).
Storage Checkpoint
The VxFS checkpoint feature uses snapshot technology to create a clone of a currently
mounted VxFS. A VxFS checkpoint provides a consistent, point-in-time view of the
file system by identifying and tracking modified file system blocks. A VxFS
checkpoint is required by the block level incremental backup and storage rollback features.
Note: In Table 3-1 the term “Solaris” is used in place of “Solaris Operating
Environment.”
CHAPTER 4
Sun Cluster 2.2 Administration
Chapter Roadmap
This chapter describes cluster administration, including monitoring, managing
changes, moving nodes in and out of a cluster, and replacing non-functional
components. This chapter has been designed (in part) to be a reference
guide—however, an understanding of the cluster components does not in itself
demonstrate the larger cluster picture. This section provides a roadmap of the chapter
and introduces related chapters.
■ Underlying all aspects of cluster administration is monitoring. Monitoring is
essential for identifying problems and consequent troubleshooting. Additionally,
monitoring provides the current state of the configuration—essential for
performing modifications. Although monitoring is covered to some degree in
each section, it is important to understand and compare the monitoring tools.
Because monitoring is important to all other aspects of cluster administration, it
is covered first.
■ The commands that output monitoring information are discussed, along with
locations of the cluster activity logs. Additionally, the synchronization of time in
a clustered environment is discussed because timestamps are essential for
comparing cluster logs on different nodes.
An essential part of a highly available cluster is providing a data service; therefore,
the core component of this chapter leads into the creation and administration of an
HA data service. Chapter 7, “Sun Cluster 2.2 Data Services“ delves deeper into HA
services and provides insight into the creation of custom HA services. Chapter 5, "Highly Available Databases," broadens the discussion with an explanation of the underlying concepts and implementation of parallel, highly available databases.
The basic aspects of cluster administration are:
Configuring a Cluster
The section, “Managing Cluster Configuration” on page 133 describes cluster
configuration databases and how to perform changes.
Monitoring Tools
There are several commands used for monitoring the status and configuration of a
Sun Cluster. Some commands display information on the broader aspects of cluster
configuration and current status—others display information about a specific portion
of the cluster. A GUI-based monitoring tool is also supplied with the distribution.
get_node_status
This command displays the node’s status as seen by the Cluster Membership Monitor
(CMM). The CMM is responsible for monitoring node status and determining cluster
membership, see Chapter 2, “Sun Cluster 2.2 Architecture.“ The output of the
get_node_status command includes only information relating to the CMM.
Information is listed on whether the node is in the cluster, the status of the
interconnects, the volume manager (which affects failure fencing), and the database
status (because the database lock manager is related to the CMM).
scconf -p
This command displays the configuration parameters set by the scconf command.
The scconf -p command provides information about cluster configuration (but not
the current cluster state). The scconf -p command is commonly used to verify
configuration changes made when using the other scconf command options, and for
obtaining data service and logical host configuration information.
pnmstat -l
This command displays status information about PNM groups. It lists the configured
backup groups, the status of each group (with the currently active adapter in each
group), and the time since the last failover. Although the pnmstat command can be
run without the -l option, it may be difficult to interpret the output. The pnmstat -l command does not list any configuration information beyond the currently active
adapter for each group. To find other adapters in a group, use the pnmset -pntv
command.
pnmset -pntv
This command complements the pnmstat -l command because it displays the
configuration of each NAFO group. Running this command will list all adapters
configured in each NAFO group.
pnmptor
This command takes a NAFO group as an argument and outputs the primary (active)
interface for the group. The command name indicates a conversion from a pseudo
adapter name (a NAFO group) to a real adapter name. The abbreviation ptor comes
from the expression pseudo to real.
pnmrtop
This command takes a real adapter name on the command line and outputs the name
of the pseudo adapter (NAFO group) it belongs to.
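For example, assuming a hypothetical NAFO group nafo0 whose active adapter is hme0, the two commands are inverses of one another (the output shown is illustrative only):
# pnmptor nafo0
hme0
# pnmrtop hme0
nafo0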
finddevices
This command displays the serial numbers for the attached disks. The finddevices
disks command lists all disks attached to the node and associated serial numbers.
Use the grep command on this output to determine the serial number of a controller
instance (or vice versa) — this can be especially useful for verifying quorum device
validity against the configuration displayed using scconf -p.
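For example, to find the serial number of a hypothetical controller instance before comparing it against the quorum device information reported by scconf -p:
# finddevices disks | grep c2t3d0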
hareg -q
The hareg -q servicename command lists all command line options used when
registering a service.
Note: Most errors described in this chapter display the verbose error messages as
displayed to users on the console. The errors displayed to users administering a
cluster via a telnet or rlogin window will only display the terse message
“scadmin: errors encountered”.
Set up the syslog facility to send a copy of all messages to a remote host (such as the
administrative workstation) in addition to the copies stored on the local node.
Sending the cluster node logs to a remote host enables an administrator to effectively
troubleshoot problems (even if the cluster node is down).
Sending the messages to a remote host is performed by appending the following lines
in the /etc/syslog.conf file on both nodes:
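For example, an entry that forwards the same message classes the Solaris OE logs to /var/adm/messages by default would look similar to the following (note the <TAB> separator):
*.err;kern.debug;daemon.notice;mail.crit<TAB>@remotehostname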
Note: The remotehostname entry is the name of the remote host that will receive
the logs. A <TAB> character (not a space) is used as a separator before the
remotehostname entry. The hostname should be entered into the /etc/hosts file
(if not already there).
After the /etc/syslog.conf file has been modified, the syslogd (1M) daemon
should be restarted by entering the following commands:
# /etc/init.d/syslog stop
# /etc/init.d/syslog start
Note: Confirm that no crontab(1) entries use the ntpdate, date, or rdate
commands.
Using the network time protocol (NTP) enables synchronized time to be maintained
between cluster nodes without administrator intervention. The NTP function corrects
time drift by continuously making small changes. Although cluster nodes cannot be
configured as NTP servers, they can be NTP clients. To enable cluster nodes to be
synchronized (in time), an NTP server using the NTP daemon (xntpd) should be set
up. In many large datacenters, an NTP server may already be configured; if not, an
easy method of setting up an NTP server is to use the administrative workstation.
Although this method does not necessarily set the cluster nodes to the actual time, the
important thing is to keep the nodes consistent with each other.
Note: Keeping the nodes aligned with a time standard, for example, Coordinated
Universal Time, could help in situations where cluster problems are caused by
external clients or servers.
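On each cluster node, entries similar to the following are placed in the NTP configuration file (/etc/inet/ntp.conf in the Solaris OE):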
server ntp_server
peer cluster_node
In this example, ntp_server is the name of the NTP server and cluster_node is
the name of the other cluster node. Both names should be stored in the /etc/hosts
file. For clusters with more than two nodes, peer entries should be included for all
other nodes. Changes will be effected after the nodes have been rebooted (or the
xntpd daemon has been restarted).
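For example, the daemon can be restarted with the standard Solaris OE run control script:
# /etc/init.d/xntpd stop
# /etc/init.d/xntpd start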
Note: Only cluster membership related options of the scadmin command are
discussed in this section. The scadmin command also has options to ensure disks
can be effectively failure fenced and the logical host can be switched over.
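A new cluster is started on the first node using the startcluster option of the scadmin command. A sketch of the command, with localnode and clustername standing in for the local node name and the cluster name:
# scadmin startcluster localnode clustername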
In this example, localnode will become the first node in the clustername cluster.
This node (localnode) becomes the only cluster member seen by the new instance of
the cluster. The time required to start a cluster is dependent on the time taken to start
the data services and import the logical host disk groups. During startup, a large
number of status messages are displayed on the console.
Note: Although it may be tempting to use the -a option with the startcluster,
startnode, and stopnode variations of the scadmin command, this option should
not be used.
The -a option was designed to perform startup asynchronously and return a prompt
immediately to the administrator. Two issues make using the -a option a bad idea. The
first issue is the fact that problems could occur if other commands are run during
startup. The second issue is that important warnings or error messages may be missed
because the startup messages are not displayed.
After the cluster is running on the initial node, additional nodes can be added to the
cluster using the scadmin command.
Contrary to the Sun Cluster 2.2 Administration Guide, cluster nodes should be added
singularly, as noted in the Sun Cluster 2.2 Release Notes Addendum, section 1.4.5. When
the following command is run on a single node, it will cause the node to join the
cluster:
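# scadmin startnode
The form shown is a sketch; depending on the configuration, the cluster name may also need to be supplied as an argument.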
In this example, the problem mounting the administrative file system resulted in the
cluster starting without the affected logical host. The problem illustrated here was
caused by a busy mount directory. In this case, a local user was using the directory. As
discussed in the Sun Cluster 2.2 Release Notes, there should be no local access to
directories exported by NFS, and users should not access the administrative file
system.
While the exact cause of an error may be difficult to ascertain (as in this example),
knowing which operation is failing can assist in determining the cause of a problem.
An understanding of the reconfiguration steps also assists in identifying the cause of
any problems.
Note: The reconfiguration steps for the CMM are discussed in Chapter 2, “Sun
Cluster 2.2 Architecture.”
A problem can occur if shared disks cannot be imported because, for example, the
disks were missing or were not correctly deported.
In this example (as before), the cluster will start normally—although the logical host
will not be mastered until the problem is fixed. Although the logical host is left
unmastered at the end of the reconfiguration, it can be re-mastered using the
haswitch command after the problem is resolved.
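For example (the destination node and logical host names below are hypothetical):
# haswitch phys-node1 lhost1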
Nodes should be stopped one at a time—no cluster changes should be made while a
node is stopping. There is no special command to stop a cluster; therefore, use the
scadmin stopnode command on the last node in the cluster. It is not necessary to
stop the entire cluster to stop the logical hosts—this can be accomplished by putting
the logical hosts into maintenance mode (as described in “Managing Logical Hosts”
on page 160).
If a cluster node fails during reconfiguration, a situation could occur where nodes are
uncertain as to what their reconfiguration state should be—specifically, whether the
node is in or out of the cluster. It is important to prevent the condition whereby a
node believes it is still part of the cluster, when in fact it is not. This could result in a
situation where the confused node attempts to illegally access cluster data. Sun
Cluster infrastructure goes to great lengths to ensure this does not happen. When a
node is leaving the cluster and an error occurs, the node will either refuse to leave the
cluster or be forced to panic using the fail-fast driver. When a node refuses to leave
the cluster, it maintains contact with other cluster nodes, thereby coordinating data
access (and preventing data corruption). Although panicking may seem extreme, it
brings down the OS immediately—thereby preventing access to shared data.
Another example is when an administrator changes the interfaces configured for use
as the private interconnect on a live node. Here, the affected node will panic when it
attempts to leave the cluster, see section, “Changing Private Network Configuration
(for Ethernet).”
In this case, a change is made to the configuration files on the node, as opposed to the
active configuration. The new interfaces are not plumbed until the node goes through
a cluster reconfiguration. The node continues to run as part of the cluster using hme0
and qfe1 as the private interconnect until it leaves the cluster and then rejoins it. An
error would be induced if an administrator attempted to remove the node pie from
the cluster by running the following command (on node pie):
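# scadmin stopnode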
This command starts the reconfiguration. One of the steps in the process uses the
ifconfig command to unplumb the private network interfaces. However, the
interfaces in the configuration files (hme0 and qfe5) are the ones that are unplumbed.
The unplumbing of hme0 will proceed normally because it is a legitimate private
interconnect; however, when qfe5 is unplumbed, it will return an error because qfe5
was never plumbed.
A failed attempt to unplumb an interface in a non-clustered environment would
return a trivial error; however, the Sun Cluster software takes no chances. When the
software detects an error of any sort during reconfiguration, it will panic the node (in
this case, pie). After the node is re-started, if it joins the cluster, it will function
normally and use interfaces hme0 and qfe5 for the private interconnect.
There are other conditions that cause problems when stopping a cluster node. One
such condition occurs if a logical host disk group cannot be deported from the node
during a manual failover. To prevent the cluster from entering an inconsistent state,
the CMM prevents the node from leaving the cluster or alternatively forces it out of
the cluster.
Note: The following error is only displayed on the console. Because of this, it is
possible that a serious error could be overlooked by an administrator. To be aware of
possible partition situations, it is important to monitor the cluster interconnects. All
partitioning events result in failed interconnect warnings being logged. These events
can be viewed using the Sun Cluster Manager GUI or a log monitor.
If the previous error occurs, the scadmin command can be used to nominate a
partition. It is important to track which nodes are in which partition.
The first step after a partitioning event occurs is to check if all nodes are up. If only
one partition has live nodes, select this partition to continue. A good method of
determining which nodes are in which partition is to look at the appropriate error
messages (the cluster status commands may be incorrect or not updated
immediately). Nodes listed as having the same Proposed cluster partition can
communicate with each other and form a new cluster together. By mapping which
logical hosts are present in each partition, it is possible to determine which partition
would be the most advantageous to have up. It can be useful to have a runbook that
details which partitions should be continued or aborted for all possible partition
situations, see section, “Systems Installation and Configuration Documentation“ in
Chapter 1, “High Availability Fundamentals.”
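The partition that should not continue is aborted with the scadmin command. The form below is a sketch and should be verified against the scadmin man page; localnode and clustername are placeholders for the local node name and the cluster name:
# scadmin abortpartition localnode clustername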
Verify that all nodes (supposed to be in the partition) aborted successfully. If any
nodes did not abort, run the command on the nodes separately.
After all nodes that should not be cluster members have aborted, entering the
following command on a node enables the partition containing that node to become
the cluster:
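# scadmin continuepartition localnode clustername
The continuepartition option and argument order shown are a sketch and should be verified against the scadmin man page; localnode and clustername are placeholders.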
This action will cause the partition to become the new cluster. It is important to verify
that all nodes that should be in this partition started correctly. Any nodes that did not
should be shut down manually and restarted. The nodes that did not start and any
aborted nodes should be checked for problems. After the interconnect is repaired, the
nodes should rejoin the cluster in the normal fashion using the scadmin startnode
command.
Note: The Step 10 and 11 timeout values must always be greater than or equal to the
logical host timeout.
Note: The term cluster configuration databases (plural) is used as a generic term for all
configuration database files involved in tracking cluster configuration and state. This
includes the cluster configuration database (CCD [singular]) and also several related
configuration repositories, including the initial CCD, the pnmconfig file, and the
CDB (as described in the following section). To avoid confusion, the cluster
configuration database is referred to by its acronym, CCD.
The CCD
As discussed in Chapter 2, “Sun Cluster 2.2 Architecture,” some of the cluster state
and configuration information is stored in the CCD.
The CCD is divided into two parts: the initial CCD (also called the init CCD) and the
dynamic CCD (or simply the CCD). The dynamic CCD is a distributed database that is
kept consistent on all nodes.
The init CCD configuration file is stored in the /etc/opt/SUNWcluster/conf/ccd.database.init file.
The dynamic CCD configuration file is stored in the
/etc/opt/SUNWcluster/conf/ccd.database file. If a cluster is configured for a
shared CCD, the CCD configuration file will be located in the
/etc/opt/SUNWcluster/conf/ccdssa/ccd.database.ssa file during the time
the shared CCD is in use (when one node is out of the cluster).
The CDB
The CCD framework requires stable cluster membership before determining the
validity of the CCD database. Configuration information is required to determine
cluster membership—the CCD cannot store this information because the CCD has not
been elected prior to the time the cluster membership is initially formed. The CDB
stores configuration information required by the Cluster Membership Monitor
(CMM), and for other related components; for example, Oracle’s Distributed Lock
Manager (DLM), the Cluster Configuration Monitor (CCM), and for quorum and
failure fencing mechanisms (which determine cluster membership).
A non-replicated CDB file exists on each node in the
/etc/opt/SUNWcluster/conf/clustername.cdb file, where clustername is
the name of the cluster. Because the CDB is not distributed (like the CCD), any
commands that modify the CDB only affect the CDB of the node they run on.
Therefore, it is important to synchronize the CDBs on all nodes by ensuring that
commands to modify the CDB are executed on all nodes.
Modifying the CDB on all nodes requires running commands on each node
individually with the cluster control panel, or entering commands directly on each
node singularly. Even nodes not currently in the cluster require their CDBs to be
updated appropriately. Because it is possible to have different copies of the CDB on
different nodes, it is crucial to keep records of cluster changes.
(shared CCD location, when configured)
  /etc/opt/SUNWcluster/conf/ccdssa/ccd.database.ssa

CDB
  Location: /etc/opt/SUNWcluster/conf/clustername.cdb
  Contents:
  ■ CMM configuration
  ■ CCM and UDLM configuration
  ■ Private network addresses and interfaces
  ■ Cluster name
  ■ Number of potential nodes (from installation)
  ■ Node names and IDs
  ■ Local parts of logical host configuration
  ■ Quorum devices
  ■ TC/SSP monitoring, IP addresses, port #s, type
  ■ PNM tunables

pnmconfig
  Location: /etc/pnmconfig
  Contents:
  ■ NAFO group configuration
The following list summarizes the commands that modify the cluster configuration databases.
For each command, the purpose, the database(s) updated, and the applicable node and cluster
state are given:

scconf -h: Changes hostnames. Updates: CDB. Do not use.
scconf -i: Changes internal network interfaces. Updates: CDB. See "Changing Private Network
Configuration (for Ethernet)" on page 173.
scconf -F: Configures administrative file system. Updates: none. All nodes, or all nodes
capable of mastering the logical host.
scconf -L: Configures logical host. Updates: CCD. One node.
scconf -s: Associates a data service with a logical host. Updates: CCD. One node.
scconf -H: Changes terminal concentrator to host connection (architecture or port #).
Updates: CCD and CDB. Cluster down.
scconf -t: Changes terminal concentrator public network information. Updates: CCD and CDB.
Cluster down.
scconf -p: Displays cluster parameters (informational command). Updates: none. One node.
scconf -D/+D: Determines if cluster uses direct-attached disks. Updates: CDB. All nodes.
scconf -q: Changes quorum device. Updates: CDB. All nodes.
scconf -U: Sets configuration path for DLM. Updates: CDB. All nodes.
scconf -N: Sets node Ethernet address. Updates: CDB. All nodes.
scconf -A: Changes active hosts. Updates: init CCD. All nodes, cluster down.
scconf -S: Switches between shared CCD and unshared CCD. Updates: init CCD. All nodes,
see "Shared CCD Setup."
scconf -T: Sets Step 10 and 11 timeouts. Updates: CDB. All nodes.
scconf -l: Sets logical host timeout value. Updates: CDB. All nodes.
scconf -R: Removes data service. Updates: CCD. One node.
pnmset: Sets PNM groups. Updates: pnmconfig and CCD. Only the (single) node being modified
(which must not be a cluster member when the command is run).
CCD Quorum
The validity of the CCD is important because most of the logical host infrastructure
relies on the CCD, see Chapter 2, “Sun Cluster 2.2 Architecture.” The CCD must
always be consistent and should be backed up often.
Efficient management of the CCD requires:
■ Implementing management procedures to ensure consistency of the CCD
■ Understanding how to back up and restore the CCD
It is essential to understand how to deal with problems involving the CCD and other
cluster configuration databases, see section, “Resolving Cluster Configuration
Database Problems,” on page 150.
There are several ways to keep the CCD consistent—the CCD is a replicated database;
therefore, it ensures all nodes in the cluster are in agreement on the cluster
configuration. However, a problem could occur if there are inconsistent CCDs on
different nodes because one group of nodes was present in the cluster during a CCD
update while the other group was not.
If updated nodes leave the cluster and nodes (which are not updated) restart the
cluster, the new cluster would use the CCD that is not updated—this will result in the
cluster not using an up-to-date configuration. If the new and old configurations
attempt to access the same data, it could lead to data corruption. There are several
ways to avoid this problem—these require an understanding of the CCD quorum.
Note: The CCD quorum is not related to the CMM quorum which uses node majority
and quorum devices to determine cluster membership.
Shared CCD
Problems with having a CCD quorum and inconsistent CCDs can be largely overcome
in two-node clusters by using a shared CCD. A shared CCD enables two nodes to
share a consistent CCD source, even if one of them is unavailable.
Note: Because the CCD is updated during the data service package installation, if a
shared CCD is configured, there may be an error if the CCD is updated while the
shared CCD volume is in use (when one or more nodes is out of the cluster). This
situation can be resolved by disabling the shared CCD before installing the package.
If a cluster environment will require adding many new data services after it is put in
production, a shared CCD may not make sense.
Shared CCDs can only be used on two-node clusters using Veritas Volume Manager
(VxVM) or Cluster Volume Manager (CVM) software as the volume manager. If a
shared CCD is used, the CCD functions as a non-shared CCD when both nodes are in
the cluster. When a cluster is configured for a shared CCD and a node leaves the
cluster, the CCD is copied to the shared CCD volume (which is a dual-ported,
mirrored set of two dedicated disks).
Note: A shared CCD cannot be used in a configuration that uses SDS as its volume
manager.
Any changes made to the CCD are also made to the shared CCD (but not to the CCDs
of the individual nodes). Because the shared CCD volumes are accessible to both
cluster nodes, after the second node has left the cluster, the shared CCD can be used
by whichever node restarts the cluster. The node that restarts the cluster will use the
shared CCD rather than its own CCD copy. When the remaining node joins the
cluster, the shared CCD on the disk is copied to both cluster nodes and CCD updates
are again performed using the non-shared CCD mechanisms.
A shared CCD requires exclusive use of two disks—the CCD volume will be mirrored
across these two disks. Any size disks should be sufficient because the CCD file is
relatively small. Each disk should be connected to both cluster nodes (a shared CCD
only works on a two-node cluster). Additionally, the disks should have separate
controllers (as with any mirrored shared disks). To achieve a higher level of
redundancy, the shared CCD must use two disks, each on different arrays.
In this example, disk1 and disk2 are the instance numbers of two disks that use
different controllers, for example, c1t3d0 and c2t3d0.
After this command has executed, stop both cluster nodes using the scadmin
stopnode command. Enter the following command on both nodes:
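A sketch of the command, assuming the -S option takes the keyword ccdssa to enable the
shared CCD (the disable form, scconf clustername -S none, appears later in this section):
# scconf clustername -S ccdssa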
To configure the shared CCD volume, enter the following command on one node only:
# confccdssa clustername
The disk group sc_dg was found and has been imported
on this node. This disk group will be used for the
shared ccd. The ccdvol volume will be recreated to
make certain that no previous shared ccd database
exists.
Do you want to continue (yes/no) ? yes <CR>
Then answer yes when asked to verify if a new file system should be constructed.
# scadmin stopnode
This command causes the remaining node to import the shared CCD disk group (and
use the shared CCD volume). This is necessary because access to shared CCD
information is required even when the shared CCD is turned off. If the cluster is not
running on either node, the shared CCD volume can be manually imported. If this is
the case, the paths to the shared CCD could differ from those listed in this section.
The paths vary depending on the mount point used; however, the rest of the process
is similar.
To safeguard against administrative mishaps, back up the shared CCD before
proceeding. The shared CCD file is located in the
/etc/opt/SUNWcluster/conf/ccdssa/ccd.database.ssa file.
After the shared CCD has been backed up, it can be turned off by entering the
following command on both nodes (nodes in and out of the cluster):
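As noted later in this section, the command takes the following form:
# scconf clustername -S none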
Note: If the command is not run on both nodes, the second node will panic when it
tries to rejoin the cluster.
Copy the shared CCD back to the standard CCD location in both nodes. The file
should be copied to the correct location on the node that is running the cluster by
using the following command:
# cp /etc/opt/SUNWcluster/conf/ccdssa/ccd.database.ssa \
/etc/opt/SUNWcluster/conf/ccd.database
After both nodes have the correct CCD, bring down the cluster on the remaining
node. Both nodes should now be out of the cluster with correct CCDs. The shared
CCD volume is no longer required and can be unmounted using the following
command:
# umount /etc/opt/SUNWcluster/conf/ccdssa
The shared CCD disk group can now be deported using the following command:
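Because the shared CCD disk group is a VxVM disk group named sc_dg, the standard VxVM
deport command can be used; for example:
# vxdg deport sc_dg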
Note: The cluster can now be brought up. If the second node panics when entering
the cluster, it probably means the “scconf clustername -S none” command was
not run correctly on both nodes.
If the following message is displayed during the start of reconfiguration Step 4, the
CCD was not successfully copied to the first node:
Note that the sc_dg disk group still contains CCD information. This enables an
administrator to access the data if something has gone wrong with the process. If both
nodes join the cluster correctly, files on the shared CCD volume can be deleted safely.
However, doing so will require remounting the disk.
1) Determine the current master of each logical host:
# hastat | more
2) Set the logical hosts to maintenance mode using the following command:
# haswitch -m logicalhostname
Note: The logical hosts must be in maintenance mode prior to restoring the CCD;
therefore, the best time to back up the CCD is during cluster installation or
maintenance. Backing up the CCD should always be part of a cluster installation.
Before switching the logical hosts into maintenance mode, determine the current
physical host for each logical host so that they can be restored to the correct node.
This information can be displayed using the hastat command. Use the following
command to put one or more logical hosts into maintenance mode:
After the logical hosts are in maintenance mode, restore the CCD using the following
command:
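A sketch of the restore command, assuming the ccdadm -r (restore) option:
# ccdadm clustername -r restore_file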
In this case, clustername is the name of the cluster and restore_file is the
filename of the CCD backup.
The following command can be used to turn the CCD quorum back on:
# ccdadm clustername -q on
After this command is performed, the logical host(s) can be brought online using the
haswitch command. For each logical host, use the following command to host
logicalhost on machine phys-host:
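Based on the haswitch syntax shown later in this chapter, the command takes the physical
host followed by the logical host:
# haswitch phys-host logicalhost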
Idle
In Idle mode, the ccdd is waiting for a trigger to enter the start state.
Start
In the Start mode, the daemon reads and checks the init CCD for the dynamic CCD
configuration information. The daemon on each node establishes TCP/IP sockets to
communicate with other nodes.
Step 1
The CCD on each node will attempt to communicate with the daemons on all other
nodes during the Step 1 state. Each node will parse its local CCD file and verify the
internal consistency of its own copy (discarded if inconsistent). The daemons on each
node communicate with each other to elect a valid CCD (which requires a CCD
quorum—as described previously). With SC2.2 software, the first node in the cluster
will have its CCD elected as the valid CCD unless the file has errors or an incorrect
checksum.
Return
In Return mode, the daemon closes any connections, discards queued messages,
releases resources, and closes any open files.
Step 1 of the CCD reconfiguration begins with Step 3 of the CMM reconfiguration.
Any CCD issues are resolved in Steps 3 and 4 of the CMM reconfiguration.
If the CDB of a joining node is syntactically incorrect, a message similar to the following
is displayed when the node attempts to join:
Starting Sun Cluster software - joining the tree cluster
Jun 14 17:49:24 beech ID[SUNWcluster.clustd.conf.4004]: fatal:
cdbmatch(cmm.osversion,...): syntax error
scadmin: errors encountered.
If the CDB (of a joining node) is syntactically correct but differs from other nodes in
the cluster (in any way), the following error message will be displayed after the start
of the reconfiguration:
In either case, the CDB can be restored from a backup or copied from another node
into the node having problems (via FTP). This will ensure the database is syntactically
correct and consistent with the other nodes.
Note: For CDBs to be consistent, they must be identical. Adding a blank space or any
comments to one file not reflected in another results in an “inconsistent CDB”
message.
Note: If the init CCD has an invalid checksum (or differs from a joining node), the
joining node will panic.
PNM Daemon
The PNM daemon (pnmd) is started at boot unless:
■ An error occurs when attempting to start.
■ No /etc/pnmconfig file exists.
■ The /etc/pnmconfig file is incorrectly formatted.
The pnmd process is important for cluster operation; therefore, if not started at boot
time, the problem should be rectified to enable the daemon to be restarted with the
pnmset command. When the pnm daemon starts, its configuration information is read
from the /etc/pnmconfig file.
Changes can be made to the /etc/pnmconfig file by using the pnmset command.
The pnmset command will restart pnmd (unless the -v flag is used). The pnmd
daemon is responsible for testing active interfaces in the NAFO group (and performs
a failover if the interfaces are not working correctly). The pnmd daemon performs
fault probing on the active interface within each NAFO group. The frequency of the
probes can be set in the cluster database (CDB) file, see section, “PNM Tunables,” on
page 154.
Setting Up PNM
All NAFO groups are checked during installation; however, after the groups are
installed, only active groups are monitored. Because the backup interfaces cannot be
monitored while a node is in the cluster, it is a good idea to check the interfaces
whenever a node leaves the cluster for maintenance or other reasons. This is
performed using the pnmset -nv command when the node is out of the cluster.
Although the pnmset -nv command is the preferred method of checking NAFO
interfaces, the ifconfig command may be an alternate method of checking
connectivity of the backup interfaces if an error cannot be easily determined from the
pnmset -nv output. However, the ifconfig command may lead to unpredictable
behavior in a running cluster; therefore, this command should only be run when a
node is out of the cluster, see Chapter 6, “Sun Cluster 2.2 Application Notes.”
Prior to configuring the PNM, each NAFO group requires (exactly) one interface
plumbed with an IP address. To automate the plumbing and ensure it is performed
correctly, create a hostname.if file for one interface in each group. This interface
becomes the default interface for the group—that is, the interface to be the active
interface in a NAFO group when the PNM daemon starts.
Each NAFO group is associated with an IP address. This address is initially hosted on
the primary interface, but will fail over to a backup interface if the primary fails. For
each NAFO group, the primary interface should have an entry in the /etc/hosts
file to associate the NAFO group’s IP address with the relevant name.
Using the Sun Cluster naming convention is generally a good idea; however, in this
case, it might not be the best method. According to the naming convention, the
hostname used should be appended with the primary interface (for example,
phys-hosta-hme0). Because the active interface within a NAFO group can change after a
failover, naming the address after the NAFO group itself, as in the following
/etc/hosts example, is less ambiguous:
129.146.75.200 ha-nfs
129.146.75.201 ha-web
129.146.75.202 phys-hosta-nafo0
129.146.75.203 phys-hostb-nafo0
129.146.75.204 phys-hosta-nafo1
129.146.75.205 phys-hostb-nafo1
As can be seen, each physical host has at least one IP address—these are the IP
addresses for each NAFO group. Additionally, each logical host needs one or more IP
addresses.
The logical host IP addresses are hosted on whichever physical host currently masters
the logical host. The clients will connect to the logical host IP addresses without being
aware of the physical host IP addresses.
After the hostnames are correctly configured, the interface database files
(hostname.if) can be set up. Each file configures an IP address onto a given
interface at boot time. Each NAFO group requires an interface with an active IP
address; therefore, one interface in each NAFO group should have an appropriate
interface database file.
For the primary interface in each NAFO group, create a file named hostname.if,
where hostname is the string “hostname” (not a variable as stated in the Sun Cluster
2.2 Administration Manual) and if is the interface instance (hme0, hme1, qfe3, etc.).
The file (hostname.if) should consist of one line containing the hostname for the
interface (as reflected in the /etc/hosts file). In the previous configuration example,
the node hosta had two interface database files. If hme0 was the primary interface in
NAFO group nafo0 and qfe3 was the primary interface in NAFO group nafo1, the
files should be as follows:
The file hostname.hme0 should contain the following line:
phys-hosta-nafo0
The file hostname.qfe3 should contain the following line:
phys-hosta-nafo1
Note: For security reasons, write permissions should be turned off after file creation.
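If a static route is required for the NAFO network, it can be added with the Solaris
route command; a rough sketch using the placeholders described below (the exact
invocation is an assumption):
# route add net address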
Where net is the IP address of the network (or host to route to) and address is the
IP address or hostname to route through (the IP address or hostname should be in the
/etc/hosts file).
PNM Tunables
After the PNM setup is complete, NAFO groups can be configured by running the
pnmset command. In some cases, it may be advantageous to change the pnmd probe
values. The default values will be sufficient for most situations; however, some
systems may have requirements such as faster detection of a failure or less
susceptibility to slow network problems—in these situations, it may be advantageous
to modify the PNM tunables.
Modifying PNM tunables requires directly editing the CDB, see section, “Cluster
Configuration Databases” on page 133. However, editing the CDB could result in a
misformatted CDB that could lead to problems with starting a cluster. Change only
the relevant variables—keep the format the same (variable name, spaces, colon, space,
and an integer representing the setting).
Note: Although this file is plain-text, users should not edit it directly.
Note: All NAFO groups should be entered in their default order. This prevents a
situation where an administrator accidentally enters a configuration twice. Entering
the same group number twice while editing groups could result in a faulty
configuration that will not be detected by pnmset.
Command and result:
pnmset -ntvp: Prints the current PNM configuration (without doing anything else).
pnmset -pf /backups/pnmconfig.bak: Provides the current configuration (performs the
standard interactive pnmset, then saves a backup copy).
pnmset -n: Restarts the daemon.
pnmset -nt: Fast restarts the daemon (with no testing).
pnmset -nv: Tests the PNM configuration in /etc/pnmconfig.
Any given data service runs normally, except if a fault is detected in the service or the
cluster node the service is hosted on; in this situation, the service will fail over to an
alternate node (by means of the logical host).
The logical host controls the data service’s interface to:
■ The outside world (logical host IP address)
■ The data (shared data disks are under the logical host’s control)
When the data service restarts on a new node, the configuration will be identical
(because all functions are performed through the logical host).
A useful metaphor in this situation could be that the logical host is like a box, and a
data service is like an object that can be put into the box.
Note: Recall that multiple data services can coexist in a logical host, and a simple
dependency model can control the order in which the services start or shut down.
A single data service can be registered with multiple logical hosts. In these situations,
each logical host would have its own instance of the data service. However, instances
cannot be controlled independently (registering, unregistering, turning on, or turning
off the service affects all instances of the service). It is not possible to unconfigure or
stop a single instance of a data service—this introduces a dependency between
instances of the same data service on different logical hosts.
Because of this dependency, there are some considerations to be aware of. Removing
or performing changes involving the relationship of the logical host to its IP
address(es) or shared disk(s) requires unconfiguring any currently configured data
services on the logical host. If other logical hosts are running instances of the same
data service, they may also have to be brought down during the change—this could
lead to unexpected downtime during maintenance.
For example, if logical host A has data services for HA-NFS and logical host B has
data services for HA-NFS plus a custom service, if a major change needs to be made
to logical host A, HA-NFS needs to be turned off and unconfigured. However,
performing this action will also turn off and unconfigure the HA-NFS instance on
logical host B, making the service unavailable, and may impact the custom service (if
dependent on the HA-NFS instance).
Note: It is acceptable to set up the appropriate disk groups or NAFO groups before
or after the logical host is set up.
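Based on the scconf -L syntax described later in this book, the command that creates the
logical host described next would look roughly as follows (the ordering of the -i arguments
and the grouping of interfaces per hostname are assumptions):
# scconf trees -L treeweb -n ash,beech,cyprus -g TREEWEBDISK \
  -i hme0,qfe0,hme1,website1 -i hme1,qfe3,hme2,website2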
The logical host named treeweb is configured on the trees cluster. It uses the disk
group TREEWEBDISK, which can be mastered by the nodes ash, beech, and cyprus.
Because the -m flag is not present, an automatic failback will occur. Therefore,
whenever the host ash (primary host) enters the cluster, the logical host will fail over
to be hosted on ash. The logical host has two network hostnames. The network
hostname website1 can be hosted by the NAFO groups containing the interface
hme0 on the host ash, qfe0 on the host beech, and hme1 on the host cyprus. The
network hostname website2 can be hosted by the NAFO groups containing the
interface hme1 on the host ash, qfe3 on the host beech, and hme2 on the host
cyprus.
As previously discussed, after a logical host is added, an administrative file system
should be created.
Where DG_NAME is the name of the disk group; DISK01 and DISK02 are the Veritas
disk names for two disks on different controllers. The volume name should be the
name of the disk group appended with the string -stat. Therefore, if the disk group
is named NFSDISK, the volume name for the administrative volume would be
NFSDISK-stat.
After the volume is created, use the scconf -F command to create the appropriate
file system and files.
The scconf -F command performs the following:
■ Creates the DG_NAME-stat volume (if not already existing).
■ Creates a UFS file system on the volume. This occurs only on the host with the
volume currently imported.
■ Creates a directory under the root with an identical name as the logical host. This
directory will be used to mount the file system (in the previous step).
■ Creates the /etc/opt/SUNWcluster/conf/hanfs directory (if not already existing).
The files vfstab.logicalhost and dfstab.logicalhost are created under
the directory (where logicalhost is the logical hostname). These files are
associated with the HA-NFS data service; however, logical hosts require these
files even if not configured with the HA-NFS data service. When the logical host
is brought up, the cluster framework uses the vfstab.logicalhost file to mount
the logical host's file systems.
Note: The previous line for mounting the administrative file system must appear
before any other volumes to be mounted.
If the configuration uses the SDS product, all steps should be performed manually.
The scconf command should be run using the following syntax:
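A rough sketch of this syntax (the argument order is an assumption based on the parameter
descriptions that follow):
# scconf clustname -F logicalhost DG_NAME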
Here, clustname is the name of the cluster, logicalhost is the name of the logical host
for which the administrative file system is being created, and DG_NAME is the name of
the disk group on which to create the administrative file system.
Note: If the scconf -F command has been run with the wrong logical hostname,
manually remove the files created.
Using the -h flag to specify logical hosts is optional. If not used, the data service will
be configured on all logical hosts. It is a good idea to only configure data services on
the logical hosts that will use them. Data services can be associated with new logical
hosts later using the following command:
Where DS_name is the name of the data service to be associated with the logical host
LH_name in the cluster clustername.
Similarly, a service can be disassociated from a logical host using the following
command:
This command disassociates a data service from a logical host, but does not unregister
it. Unregistering a service will automatically disassociate it from all logical hosts.
Changing the data service to a logical host association works the same for Sun-
supplied services and custom services.
# hareg -y service1,service2
The scadmin switch command, its equivalent haswitch command, and the effect:
scadmin switch clustname dest_host logical_hosts* (equivalent: haswitch dest_host
logical_hosts*): switches logical_hosts to be mastered on dest_host.
scadmin switch clustname -r (equivalent: haswitch -r): performs a cluster reconfiguration.
scadmin switch clustname -m logical_hosts* (equivalent: haswitch -m logical_hosts*): puts
the specified logical hosts into maintenance mode.
Note: It is important to ensure that the node a logical host is moving to has enough
resources to host the data services associated with the logical host. Appropriate
sizing and capacity monitoring of backup nodes is necessary to ensure a failover
can be performed as required.
Because the coherency of the logical host data is important, the cluster will not permit
the logical host to be in an indeterminate state. If a logical host move fails and the
logical host is able to remain hosted on its original host, the cluster framework will
keep the logical host where it is (to ensure the state remains determinate). If neither
node is able to master the logical host, it will configure the logical host down.
Switchovers are performed using the haswitch command:
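The general form, using the placeholders described below:
# haswitch dest_host logical_host1 logical_host2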
This command will switch logical hosts logical_host1 and logical_host2 over
to physical host dest_host.
In some situations, it may be desirable not to have the logical host mastered by any
physical hosts. This situation is called maintenance mode and results in the logical
host deporting its disks and stopping data services (on all nodes). This is the first part
of the switchover process. The host leaves its current master and is not remastered.
Maintenance mode can be useful when making some types of logical host changes
(and for backing up or restoring the CCD). A logical host can be switched into
maintenance mode using the following command:
# haswitch -m logical_hosts
This command will switch one or more logical hosts into maintenance mode
(logical_hosts can be a single logical host name or a space-separated list of logical
hostnames). Logical hosts can be switched out of maintenance mode using the
haswitch command to host them to a new physical host.
# hareg -n dataservice
After the data service has been stopped, it can be unregistered by using the following
command:
# hareg -u dataservice
This command unregisters all instances of the data service. If the hareg -u command
fails, it will leave the cluster in an inconsistent state.
If all data services present in a logical host have been stopped, the logical host can be
removed using the scinstall command (or the following command):
The available options for removing a logical host using the scinstall command are:
change (main menu), logical hosts (change menu), and remove (logical hosts
configuration menu).
After the logical host is removed, delete the
/etc/opt/SUNWcluster/conf/hanfs/vfstab.logicalhost file. If you are not
planning to use the disks again in a logical host, you could remove the stat volume.
Alternatively, it may be a good idea to keep the stat volume because it takes up little
space and provides the flexibility to re-introduce shared data into a cluster in the
future.
Because all instances of the affected data services were unregistered and turned off
previously, it may be necessary to re-register the services and turn them on to enable
other logical hosts to use them again.
Note: This will only be required if the services were configured into other logical
hosts in addition to the removed one.
Note: One meaning of the term cluster topology refers only to the number of nodes
and the topological relationships among them. However, in this discussion, the term
topology denotes both the topological relationships among nodes and their
connections (including networking and shared storage).
There are many types of changes that can be performed on a cluster. The fundamental
goal of a cluster is to keep the application available. Although this goal can be
attained when implementing some topology changes, it is not possible in others.
Interruptions to a service caused by topology changes can vary depending on the type
of change. Interruption can be classified according to the amount of downtime (no
downtime, the time to perform a switchover, or even the time required to reinstall the
cluster).
Non-invasive changes may cause no interruption at the application level. However,
complex changes may require failing over a service. If a node does not master any
logical hosts, the node can be taken out of service, modified, then re-introduced into
the cluster without interrupting service. If a node does master a logical host, some
downtime occurs during the time it takes the logical host to fail over to a backup node
(in preparation for bringing the node down).
Note: Some types of highly invasive changes to cluster topology may involve
bringing the entire cluster down. This includes changes that affect the init CCD.
Any topology changes should be applied carefully because they are potentially
disruptive. Disruptions can occur as a result of misconfiguration problems or reduced
redundancy while a backup component is out of service.
During the time a backup node is down, availability of the primary node is reduced to
its base availability rating. Because the backup node is not available during this time,
if the primary node fails, the failure results in an extended downtime. Additionally,
while running in degraded mode, the failure of a node will generally have a longer
downtime because errors that usually result in a failover may result in a reboot.
The basic method of a topology change involves putting the cluster in the correct
state, implementing the change, then making the necessary changes to the cluster
configuration so that the change is recognized. Implementation of (almost) all changes
is the same as for a non-clustered environment—in other words, it involves the same
basic changes to the hardware, OS, or volume manager. However, there are additional
steps required to enable the cluster to recognize changes.
Note: Because a node may not be able to respond to heartbeats during Dynamic
Reconfiguration (DR), the DR features of Sun servers are not supported in an SC2.2
environment.
If a change to a node affects the network or disk interfaces, it may affect the logical
hosts or other cluster functionality. As a result, additional steps may be required to
ensure the cluster correctly deals with the change. This is because the cluster will be
unaware of changes to the underlying hardware (with the exception of components it
is actively monitoring).
Network or storage interfaces that are not used by the logical host framework or the
cluster itself do not require additional configuration. In these cases, the node can
simply be taken down and the appropriate interface can be added or replaced. If the
logical host or cluster framework are affected, the card can be added in the standard
manner; however, additional actions may be required before the node can rejoin the
cluster with the new component. The following sections detail the procedures
involved in updating the cluster configuration appropriately when different
components are added or changed.
It is important to keep in mind that the standard Solaris OE actions for adding the
appropriate interface still apply. Cluster configuration changes are in addition to any
changes normally required to use the interface. For example, when adding a network
interface, an administrator needs to perform a reconfiguration reboot, create an
/etc/hostname.if file (if required), run ifconfig, etc. After the changes are
accomplished, the configuration steps outlined in the following sections should be
performed to make the cluster aware of the change.
Note: When interfaces are changed, run the scconf -i command on all nodes,
including any node being upgraded (even though not currently in the cluster
membership)—a good method of doing this is to use the cluster console.
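The general form of the command is roughly as follows (the argument order is assumed from
the parameter descriptions below):
# scconf clustername -i hostname interface1 interface2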
This command updates the configuration for host hostname of cluster clustername
to enable the private interconnect used by hostname to connect to cluster
clustername. Interfaces are interface1 and interface2.
The interface parameters for interface1 and interface2, should be interface
instance names, for example, qfe0, qfe3, or hme1. The two private network
interfaces should be located on different cards.
The first interface listed should correspond to the interconnect on the first private
network; the second interface listed should correspond to the interconnect on the
second private network. For example, if interface2 has not been upgraded, it will
still be the second entry listed. If the interface order is reversed, the node will be
unable to communicate over the private network and will be prevented from rejoining
the cluster.
Switching to a new internal network interface only causes downtime in the logical
hosts that are hosted on the node to be upgraded. These logical hosts should be failed
over while the node is out of the cluster.
After the scrubbers are set, reinstall the cards and connect the cables. The
corresponding SCI cards on each node should connect to each other; for example,
SCI0 on Node0 attaches to SCI0 on Node1, and SCI1 on Node0 attaches to SCI1 on
Node1.
Note: SCI cables are heavy—ensure they have adequate support to prevent
damaging connectors.
# /opt/SUNWsci/bin/sm_config -f sci.config
Where sci.config is the configuration file that has been modified from the template
as described above.
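For SCI interconnects, the same scconf -i form applies, with SCI device instances in place
of Ethernet instances (a sketch; the exact syntax is assumed):
# scconf clustername -i hostname scid0 scid2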
Where hostname is the name of the host having its interfaces updated in the
configuration of the cluster clustername. The interface interface1 will be used
for the first private network, and the interface interface2 will be used for the
second private network. Interface parameters interface1 and interface2 should
be interface instance names, for example, scid0 and scid2.
#/opt/SUNWsci/bin/sciinfo -a
sciinfo $Revision: 2.10 $
There are 2 ASbus adapters found in this machine
--------------------------------------------------
Adapter number : 0
Node id : 4
Serial Number : 5035
Scrubber : ON
--------------------------------------------------
Adapter number : 1
Node id : 8
Serial Number : 5027
Scrubber : OFF
Note: If the SCI cable is connected incorrectly, the scrubber status cannot be
determined.
Note: In clusters where the terminal concentrator plays a role in failure fencing, any
changes to the terminal concentrator configuration should only be performed when
the cluster is stopped.
Quorum devices are essential for failure fencing when using VxVM configurations;
therefore, manually selecting a device is generally not a good practice. Using the
interactive selection verifies both nodes can see the quorum device (and the device
name is correct).
Physically mark the disks that are used as quorum devices to clearly identify them.
Because the cluster expects a certain disk ID for the quorum devices, if the quorum
disk is replaced with a new disk, it cannot perform as a quorum device unless the
cluster configuration is updated. It is important that data center staff never replace a
quorum device without notifying cluster administrators to update the configuration.
In cases where a quorum device already exists between two nodes, the scconf
clustername -q command can be run while the cluster is up (so there will be no
downtime)—this enables switching quorum devices for maintenance. The
scconf -q command can also be used to add a new quorum device between two
nodes that did not previously share a quorum device (changing the quorum
topology). When such quorum topology changes are implemented, the cluster should
be down. Quorum topology changes may be necessary if the maximum number of
nodes permitted in the cluster is increased, or the topology is changed so that two
nodes are configured to share data they did not previously share.
# scadmin resdisks
Where instance_num is the instance number of the disk (in the form cXtYdZ).
Note: Some of the following directions use scconf command variations.
These commands should be run on all nodes in the cluster, including old nodes and
any new nodes to be added to the cluster.
When adding a new node to the cluster, the installation should be performed on the
new node using the scinstall program. When this program is run on the new
node, the requested information about the existing nodes should reflect the existing
nodes’ current configuration. The information requested for the new node should
match the desired end state of the cluster (regardless of the answers provided during
the initial cluster installation). Because the cluster does not permit a node to join if
there is an inconsistency between its configuration files and the configuration files of
the remaining nodes, a problem is relatively easy to spot (although not necessarily
easy to diagnose).
Cluster configurations on the original cluster nodes will not contain the same
configuration information as the new nodes (their configuration information will not
include the new node). To synchronize configurations, it is necessary to perform
additional administrative tasks.
As previously discussed, it is unlikely that the correct Ethernet address of the new
node will match the Ethernet address configured during the initial installation.
The Ethernet address of the node can be changed using the following command:
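A sketch of this command, with the argument order assumed from the parameter descriptions
below:
# scconf clustname -N nodenum Ethernet_address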
Where clustname is the name of the cluster, nodenum is the node number of the
node to change the Ethernet address of, and Ethernet_address is the Ethernet
address of the node. The node number can be obtained using the get_node_status
command.
After the configuration has been performed, the new node(s) can be added to the
cluster using the scconf -A command to modify the number of permitted nodes. The
argument is the new number of nodes in the cluster. Nodes can only be added to the
cluster in node ID order; a cluster with three nodes can only contain node IDs 0, 1,
and 2 (ID 3 is not permitted—node ID 3 can only be used in a four-node cluster).
For example, the following command configures the first three nodes so they can join
the cluster—if there were a fourth potential node, it would not be able to join:
# scconf clustername -A 3
Note: A node can also be removed using the scconf -A command; however, the nodes
with the highest node numbers will always be removed from the cluster first.
CHAPTER 5
To enable high availability for applications, SC2.2 software supplies fault probes for
some of the most popular databases available today: Oracle, Sybase, and Informix. See
Table 5-1 for a list of the database versions for which Sun Microsystems currently
provides high availability. Other database modules (for example, IBM DB2 EEE) are
available from their respective vendors.
Table 5-1 Sun-Supplied Database Agents for High Availability on Sun Cluster 2.2
Software
Parallel Databases
Although SC2.2 software is capable of running parallel databases (PDBs), the subject
is only discussed briefly here. PDB functionality within the SC2.2 software provides
database vendors a computing platform which enables all nodes within the cluster to
act as a single system—thereby offering scalability and high availability. Vendors
currently using SC2.2 software with PDB functionality are: Oracle (Oracle Parallel
Server), Informix (XPS), and IBM (DB2 EEE).
Although the underlying SC2.2 software for PDB uses the same hardware and
software as an HA solution, PDBs use a fundamentally different approach to system
redundancy and concurrency.
The basic goal of a clustered database is to use multiple servers with a single data
store. One approach is called shared nothing—this is used by Informix XPS and IBM
DB2/EEE. Each node within the shared nothing cluster only has access to its own
array. If data on another array is required, it must be accessed through the node that
owns the storage; the data is transported between nodes via the interconnect.
The benefit of a shared nothing database system is its excellent application scalability
(for specific classes of workload); for example, where there is little data sharing—as
with OLTP workloads and large reporting-centric decision support applications. A
noticeable characteristic of these application workloads is that data access is
predictable and highly controlled.
By contrast, Oracle Parallel Server (OPS) software uses shared disk topology—that is,
every node within the cluster has access to the same set of disks (with a single shared
data store). Unlike the shared nothing approach, shared disk databases access data
directly from storage; therefore, data transport speed is only constrained by the disk
I/O system (not the interconnect). This approach is extremely flexible and enables the
interconnect to be used to communicate instance locking activities between nodes.
The shared disk environment offers excellent robustness, and because each server
connects directly to the shared disks, administration becomes easier. However,
performance could become vulnerable because excessive lock contention could
significantly slow a system.
The best means of scalability (availability notwithstanding) may not be clustering
additional servers through Oracle Parallel Server. The simplest way to solve a scaling
problem is by upgrading CPUs, I/O, and memory on the same server. The Sun
Enterprise Server line makes this process straightforward because a single-
architecture (SPARC/Solaris) server scales from 1 CPU to 64 CPUs (Enterprise 10000
server).
System capacity may be increased by adding a new server; however, creating a larger
single instance does not guarantee improved availability. Moreover, beyond machines
with 64 CPUs, clustering is the only option for increased scalability.
Note: At the time of this writing, there are few commercial workloads that can keep
so large a population of CPUs continually busy within a single database instance.
Figure 5-1 Typical Two-Node Oracle OPS Cluster Using Sun Cluster 2.2 Software (two nodes
running instances ORA-A and ORA-B, connected by the cluster interconnect and attached to a
shared disk store)
Configuring SC2.2 software for OPS uses a different setup procedure than that of HA-
Oracle. HA-Oracle and OPS also differ in their use of volume managers and
configuration of the cluster software (given the shared disk architecture of the OPS
system).
The Sun Enterprise™ Volume Manager (SEVM) controls storage devices such as disk
arrays, which adds the capabilities of disk mirroring and striping. After the disks are
controlled by the volume manager, they are grouped together in diskgroups.
Diskgroups are extremely flexible, and can be assigned to one or multiple servers.
Disks within diskgroups are created into logical partitions called volumes. These
volumes can be parts of stripes or mirrors and have the flexibility to be raw devices or
file system-based volumes. Applications access data by means of volumes—because
volumes are logical units, they offer excellent flexibility for I/O layouts, disk, and
recovery management.
1. Node A gets data from the database and brings it into its system global area
(SGA).
2. Depending on the request type for the data (update, read, or write), Oracle will
allocate the appropriate lock on all blocks that correspond to the data.
Enterprise 10000 domains. HA-Oracle: no issues; no support for IDN, AP/DR. OPS: no issues;
no support for IDN, AP/DR.
System resources. HA-Oracle: no issues outside of standard OS, database, or cluster
requirements. OPS: uses slightly more memory resources due to DLM activity.
Note: The CVM product is required for shared disk database architectures such as
OPS.
Note: Manual failover of a logical host after node recovery is recommended (to avoid
unscheduled downtime).
Figure 5-2 Asymmetric Configuration (Node A: active, hosting ORASID, with a local disk;
Node B: standby, with a local disk; both nodes attached to the shared disk)
In this configuration, the data and database programs reside on the logical host. There
is only one database instance residing on node A. Node B is always on
standby—waiting to take over the logical host. Because database programs reside on
the logical host, they always migrate with the logical host, thereby ensuring the
database programs are available in the event of a logical host failover.
An asymmetric configuration offers the best potential performance characteristics
because the master node can be tuned specifically for one or more database instances.
Figure 5-3 Asymmetric Configuration with Multiple Database Instances (Node A: active,
hosting ORASID 1 through ORASID N; Node B: standby; logical host file systems:
/logicalhost/u01/orahome/bin, /logicalhost/u02/data1, /logicalhost/u03/data2,
/logicalhost/u04/dataN)
The configuration illustrated in Figure 5-3 is physically identical to Figure 5-2 with the
exception of multiple database instances running on node A. If node A had a physical
failure, all instances would migrate to node B (along with the logical host). However,
a problem could arise from a logical database failure that is not a physical failure.
For example, assume ORASID 1 had a problem that froze the database. If after an
unsuccessful attempt to correct the situation, the cluster framework determined it
should send the logical host over to node B—the switch over would be performed.
However, if an instance ORASID N was still active on node A, and because ORASID
N is bound to the logical host, it must also fail over (even though there was no
problem with the instance). All uncommitted transactions on ORASID N would be
lost and clients would have to reconnect to ORASID N after the failover was
completed.
A logical host could be created for each database instance. In this situation, ORASID 1
would switch over without affecting ORASID N; however, because the software is on
the first logical host, ORASID N would not have access to those binaries. If ORASID
N had to be stopped, an administrator would not be able to do so because the Oracle
software would only be accessible to node B.
Figure 5-4 Symmetric Configuration with Database Software on the Logical Host (Node A and
Node B both active, each hosting one ORASID with a local disk and attached to the shared
disk; logical host file systems: /logicalhost1/u01/orahome/bin, /logicalhost1/u02/data1,
/logicalhost1/u03/data2 and /logicalhost2/u01/orahome/bin, /logicalhost2/u02/data1,
/logicalhost2/u03/data2)
The configuration in Figure 5-4 shows that each node is hosting one instance bound to
one logical host—in this situation, having the binaries on the logical hosts is
acceptable. One copy of the database binaries will not be sufficient because a logical
host can only be mastered by one node at a time. In this configuration, putting
binaries on the logical or local disk would not make a difference.
When using multiple logical hosts in a symmetric configuration (Figure 5-5), one set
of shared binaries per logical host only works for a physical failover (similar to the
asymmetric configuration). For logical failures or maintenance functions, the database
software has to be placed on the node’s local disk to ensure the software is available
for all instances residing on the node.
Figure 5-5 Symmetric Configuration with Multiple Instances and Database Software on Local
Disks (both nodes active, each hosting ORASID 1 through ORASID N; binaries in
/opt/oracle/bin on each node's local disk; data on /logicalhost1/u01/data1,
/logicalhost1/u02/data2, /logicalhost1/u03/data3 and /logicalhost2/u01/data1,
/logicalhost2/u02/data2, /logicalhost2/u03/data3)
HA Mode      1 Logical Host, 1 DB Instance   1 Logical Host, Multiple DB Instances   Multiple Logical Hosts, Multiple Instances
Asymmetric   Logical Host                    Local Disk                              Local Disk
Symmetric    Logical Host                    Local Disk                              Local Disk
Note: Data service modules are sold separately. Ensure any required modules have
been acquired and licensed.
The support file indicates to the cluster framework which versions of the vendor
product are supported. In the following example, the file
/etc/opt/SUNWsor/haoracle_support indicates only Oracle versions 7.3.x and
8.0.x are supported—no other version will work correctly. This file should be queried
to ensure the database version to be made highly available is supported.
The support file is ASCII-based, making it easy to view and edit. It is strongly advised
that you not edit this file to add a database version that is not listed as supported.
Support files change from release to release and sometimes between patch level
upgrades. Modifying a support file could cause the database agent to monitor the
database incorrectly.
The database probe checks the support and configuration files during cluster startup
(and during the probe cycle sequence). Database error conditions may change
between patch level revisions; therefore, an error condition could be encountered
without an appropriate action for the probe to take—potentially leaving the database
probe in an unknown state.
The support file should be verified each time a software or patch level upgrade is
applied to the cluster.
Caution: After the database version has been confirmed and the issue of binary
placement has been addressed, creation of the logical host can begin. Due to its
inherent complexity, configuring clusters is a time-consuming process that can be
error-prone. Use the following checklist as a guide.
Note: When using a volume manager (SEVM, CVM, or VxVM), you cannot create
diskgroups without a root dg. Likewise, with SDS, you cannot create meta sets
without database replicas, see Chapter 4, “Sun Cluster 2.2 Administration.”
A good practice to follow for data placement on logical hosts is to ensure all nodes
within the cluster can see each array (with identical controller numbers). This helps
ensure consistency when using storage for data placement.
For example, each node might see one disk array as controller c5 (c5tXdX) and the other
array as controller c6 (c6tXdX). Data should only reside on disk array c5 and
be mirrored to c6. Using this standard helps ensure each disk on array c5 is mirrored
to each disk on array c6.
This practice makes administration easier and can speed disk replacement. For
additional details, see Chapter 4, “Sun Cluster 2.2 Administration.”
Edit the /etc/hosts file and define a name and IP address for the logical host. Next,
create the logical host(s) necessary for the number of instances requiring HA services.
The following box contains the general syntax for creating a logical host(s):
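A sketch of the syntax using the example values described below (the ordering of the -i
arguments is an assumption):
# scconf solarflare -L lhost1 -n clnode0,clnode1 -g oradg -i hme0,hme0,lhost1 -m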
Where:
scconf is the binary used to create or modify cluster configuration.
solarflare is the name of the cluster.
-L is the logical host (in this case, lhost1).
-n is a list of nodes that monitor the logical host (clnode0 is the default master).
-g is the diskgroup associated with the logical host (in this case, oradg).
-i is a list of the network interfaces that will be used to service the logical IP address
(in this case, hme0 for both nodes).
-m is the automatic failback switch (in this case, automatic failback has been
disabled).
The previous syntax creates a logical host called lhost1 with clnode0 being the
default master, with a diskgroup called oradg assigned to it. The public network
interface that services the logical IP address is hme0 for both nodes.
Note: Creation of the logical host should be performed on one node only.
Using the -m argument ensures the logical host does not automatically migrate back to
the default master node when that node rejoins the cluster after a failure. If the -m
option is not used, the
logical host automatically switches back to clnode0 as soon as clnode0 rejoins the
cluster. For database systems, this is not normally a desirable action because most
connections to database systems are not stateless. Therefore, if the application or
database does not handle client connectivity, a logical host switch will cause the
database to lose uncommitted transactions—any connected clients would need to re-
establish their connections.
Note: After creating the stat volume, ensure a corresponding mirror exists on a
different array.
The first time the scconf -F command is invoked, it creates the stat volume
(oradg-stat) and a corresponding mirror in the oradg diskgroup, along with a mount
point for the logical host. In addition to creating the stat volume, the -F option
creates two other important files:
■ /etc/opt/SUNWcluster/conf/hanfs/vfstab.lhost1
■ /etc/opt/SUNWcluster/conf/hanfs/dfstab.lhost1
The scconf solarflare -F oradg command must be executed on at least one
node; however, it is useful to run the command on every node monitoring the logical
host—this ensures all slave nodes have the above files and a logical host mount point.
If the command is not run on each node, the files have to be manually copied to each
node and the logical host mount points created.
In the previous example, the logical host is servicing a database; therefore, the
dfstab.logicalhost file is not necessary and can be deleted. The
vfstab.logicalhost file holds information necessary for mounting the volumes or
meta devices for the database.
The semantics of the two files are similar to those of the vfstab file found in the UNIX /etc
directory. Although these files look and work similarly, there are differences. The
main difference is that mount points are not automatically mounted any time during
or after the boot phase. Mounting is performed by means of the cluster framework
(exclusively).
/lhost1 <---- Logical Host mount point on each node’s root disk
/lhost1/u01
/lhost1/u02
/lhost1/u03 <---- U01,U02,U03 are mounted under the logical host
The only system administration consideration is to ensure the logical host(s) mount
point exists on the root disk of all nodes monitoring the logical host(s). Creating
individual mount points on a local disk can lead to confusion, mistakes, and the
potential of accidentally deleting or unmounting required mount points.
After the devices have been configured for data (volumes or meta devices), the
vfstab.logicalhost file should be updated to indicate the required mount points
for the logical host. Ensure all nodes configured to master the logical host have
identical vfstab.logicalhost files.
After all steps for defining a logical host are completed, the logical host should be
tested prior to continuing. Use the scadmin command to force a switch of the logical
host.
To test the logical host, use the following command:
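For example, to switch lhost1 to the backup node clnode1 (the scadmin switch form is
equivalent to haswitch, as shown earlier in this book):
# scadmin switch solarflare clnode1 lhost1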
Confirm the logical host mount points to ensure all mount points identified in the
vfstab.logicalhost file have been mounted:
clnode1> df -lk
/dev/vx/dsk/oradg/oradg-stat
1055 12 938 2% /lhost1
/dev/vx/dsk/oradg/u01
191983 9 172776 1% /lhost1/u01
/dev/vx/dsk/oradg/u02
191983 9 172776 1% /lhost1/u02
/dev/vx/dsk/oradg/u03
95983 9 86376 1% /lhost1/u03
Note: This example highlights simplicity. Note the lack of other storage parameters
which could be associated when creating tablespaces.
Note: Ensure the user ID set up for database monitoring is created in a part of the
database (such as its own tablespace) that can be taken offline without affecting
database operations.
The following steps are required to administer the dbmonitor tablespace (if
required):
■ Stop database monitoring
■ Drop the tablespace
■ Create the tablespace
■ Start database monitoring
Then, create a user ID with a password.
Note: For security reasons, do not use the default Oracle user ID example as used in
the Software Installation Guide.
When creating a user ID, choose a meaningful name that indicates its function. Within
the Oracle software, the fault probe user must have access to the V$SYSSTAT view.
Use the Oracle Server Manager (or equivalent) to create a user ID and assign appropriate
permissions. See the following box for an example of creating a user ID.
Use the following commands (in the Server Manager):
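A minimal sketch of such a session; the user name hamon and its password are hypothetical,
and the grant on v_$sysstat reflects the V$SYSSTAT requirement noted above:
# svrmgrl
SVRMGR> connect internal
SVRMGR> create user hamon identified by hamonpasswd
     2> default tablespace dbmonitor quota 1m on dbmonitor;
SVRMGR> grant connect to hamon;
SVRMGR> grant select on v_$sysstat to hamon;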
LISTENER-ora =
(ADDRESS_LIST =
(PROTOCOL = TCP)
(HOST = lhost1) <---Logical Host
(PORT = 1530)))
SID_LIST_LISTENER-ora
(SID_LIST =
(SID_DESC =
(SID_NAME = sunha)))
solarflare =
(DESCRIPTION =
(ADDRESS =
(PROTOCOL = TCP)
(HOST = lhost1) <----- Logical Host
(PORT = 1530)))
(CONNECT_DATA =
(SID = sunha))
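After the SQL*Net files are in place, register the Sun-supplied Oracle data service and
switch it on with the hareg(1M) command. The following is a minimal sketch, assuming
the module is registered for the logical host lhost1 (verify the exact options against the
hareg(1M) man page):
# hareg -s -r oracle -h lhost1
# hareg -y oracle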
In the preceding box, note the use of the lowercase y. Using a lowercase y will turn on
the Oracle module only—using an uppercase Y will turn on all data services.
Use the following command to display the data services that are registered and
switched on:
clnode0> hareg
oracle on
nfs on
clnode1> hareg
oracle on
nfs on
After the Oracle module has been registered and switched on, instance monitoring
can be set up. The cluster framework will be made aware of which instances require
monitoring by using the haoracle command.
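For example, an instance might be registered with a command similar to the following.
The values 60, 10, 120, and 300 are the fault probe parameters described later in this
section, dbmon/dbmonpw is the example fault probe account, and the init.ora path is
illustrative only; verify the exact argument order against the haoracle(1M) man page:
# haoracle insert sunha lhost1 60 10 120 300 \
dbmon/dbmonpw /lhost1/u01/oracle/dbs/initsunha.ora LISTENER-ora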
After the Oracle data service module has been configured, bring the database online
by using the haoracle start command as follows:
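For example, assuming the instance was registered as sunha:
# haoracle start sunha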
Note: The stop command does not stop or shut down the database—it is the data
service monitoring that is stopped. Any instance will have to be shut down manually
by logging onto the database.
After monitoring of the database has started, use the UNIX ps command to confirm
the monitoring processes are running.
Enter the following commands:
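For example, on each node:
clnode0> ps -ef | grep haoracle_fmon
clnode1> ps -ef | grep haoracle_fmon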
The output is the same for the backup node; however, there is an -r indicator
associated with the process—this indicates the node is running the fault probe
remotely.
In this example, by viewing the process output of node A, it can be seen that two
haoracle_fmon (Oracle fault monitoring) processes are running. The child process
connects to the database and runs the probe sequence. The parent checks the status of
the child process and determines if any action is necessary. When the child process
reports the results of the last probe to the parent, the results are evaluated. If the
probe results are inconsistent, they will be compared to an action file to determine if
action is required (such as a logical host switch or database restart).
To identify which node is being monitored, use the haoracle list command on
both nodes as follows:
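For example (the output lines shown are illustrative only; the fields mirror the values
supplied when the instance was registered for monitoring):
clnode0> haoracle list
on:sunha:lhost1:60:10:120:300:dbmon/dbmonpw:/lhost1/u01/oracle/dbs/initsunha.ora:LISTENER-ora
clnode1> haoracle list
on:sunha:lhost1:60:10:120:300:dbmon/dbmonpw:/lhost1/u01/oracle/dbs/initsunha.ora:LISTENER-ora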
The output of the haoracle list command displays the number of instances
registered for monitoring with the SC2.2 framework, how they are configured, and
the current monitoring status.
The first parameter identifies monitoring on/off status. When queried, all nodes
should return an identical monitoring status.
The SC2.2 GUI (if used) displays the status of the database. The scope of this function
is limited; however, a system operator can quickly determine if a database is up or
down depending on the color status of the database indicator.
This is a convenient method of confirming database availability without using either
the db_check or ps commands (or logging into the database by means of an SQL
console).
Database Monitoring
A powerful attribute of a highly available database fault probe is its ability to
distinguish between a failed and inactive database. If a failure is detected on a node,
the database probe restarts the database or starts a listener process.
Restarting a database instance on a functioning node is a faster recovery process than
a full logical host failover. However (if necessary), the database probe can begin a
logical host failover process.
After the monitoring agent is activated, the database data service module runs two
background processes for monitoring a database(s) on each node. The following is the
ordered sequence used by the HA-Oracle monitoring agent:
1. The HA-Database module logs into the database by means of an assigned user
ID/password.
3. Database statistics are checked by means of a query against the current system. If
the statistics indicate database activity, the probe takes no action. The process
repeats each probe cycle time (as set when the logical host was created).
Note: Ensure the fault probe user ID does not use the system tablespace as a default
tablespace.
When viewing the Oracle action file, it can be seen how the fault probe behaves when
faced with specific errors—each type of error condition has a pre-defined option. The
options include switching the logical host, restarting the database, stopping the
database, or doing nothing at all.
Note: Issues other than those listed (Table 5-6) are database related.
Where:
■ Probe cycle time (value set to 60). This parameter determines the start time
between two database probes. In this example, the fault monitor will probe the
database every 60 seconds unless a probe is already running.
■ Connectivity probe cycle count (value set to 10). This parameter determines if a
database connection will be left open for the next probe cycle. In this example
where the value is set to 10, database connectivity is checked after every ten
probe cycles. If the tenth probe is successful, the connection is left open for the
next probe. The eleventh probe will disconnect from the database and reconnect
again to ensure connections are still possible. Setting the value to 0 will disable
the function.
■ Probe timeout (value set to 120). This is the time after which a database probe is
aborted by the fault monitor—in this example, after 120 seconds.
■ Restart delay (value set to 300). This is the minimum time that must pass between
database restart operations. If a restart must be performed before 300 seconds,
the fault probe initiates a switchover.
The defaults for these values are 60, 10, 120, and 300. Parameter changes are
performed using the modify option.
Database Failover
What happens during failover?
It is important to understand that failures with computer systems will happen. Even a
fault-tolerant class of machine can fail; however, the recovery times with this type of
machine can be so short it appears they never go down. Knowing what to expect
during a failure is critical for two reasons:
■ Did my environment recover?
■ Is the environment able to provide services at a level that can be considered
acceptable?
To answer the first question, it is helpful to understand the overall computing
environment, see Chapter 2, “Sun Cluster 2.2 Architecture.”
The following is the general sequence of events that can be expected. Assume a two-
node asymmetric cluster; single Oracle database instance; one node completely fails:
3. The standby node does not need to start a listener because one is already running.
Note: When using a symmetric configuration, the amount of RAM per node should
be doubled.
local.debug; /var/adm/haoracle_fmon.debug
Note: After debugging is finished, remove the variables and clear the log file.
#include <sql.h>
main()
{
    /* submit transaction */
    transaction mytrack;
    mytrack = insert into table abc values(1,2,3);
    if (mytrack != 0) {
        check connection;
        if (connection != 0) {
            wait 30 seconds and try again;
        } else {
            connect to alternate server;
        }
        submit transaction again;
    }
}
Note: Some vendors provide the tools necessary to manage client connections; for
example, Oracle provides this functionality through its OCI libraries.
CHAPTER 6
This chapter enables the reader to integrate the concepts presented in the previous
chapters into a working cluster. Because the SC2.2 software is complex, it is
fundamental to understand the function of the individual components and their
relationship with other components.
The solution presented in this chapter details a lower end implementation of the
SC2.2 software using a Sun Enterprise™ 220R server. This solution details the
implementation of a highly available NFS Web server application that uses a second
cluster node as a standby failover node.
The application notes detail installation and configuration for the following elements:
■ Solaris OE
■ Hardware and software volume managers
■ File systems
■ Applications
■ SC2.2 software
Each configuration step is explained and includes detailed comments. Where
applicable, the best practices are highlighted. Administrators with SC2.2 software
installation and configuration experience will find it worthwhile to review this
chapter to identify the associated best practices.
Hardware Connectivity
The hardware arrangement includes two Sun Enterprise 220R servers, each containing
two 450 MHz UltraSPARC™-II processors, 512 Mbytes of memory, and two 9.1
Gbyte/7200 rpm internal SCSI drives. Each server has two quad fast Ethernet cards to
meet the private interconnect and public network requirements, and two Ultra-SCSI
differential controllers to provide access to an external Sun StorEdge™ D1000 array.
Review the hardware option information in Table 6-1 and controller placement details
in Figure 6-1.
Internal PCI 10/100 BaseT Quad Fast Ethernet Controller   X1034A   2   qfe0, qfe1, qfe2, qfe3
Internal PCI Ultra-SCSI Differential Controller           X6541A   3   c1
Internal PCI Ultra-SCSI Differential Controller           X6541A   4   c2
The Sun StorEdge D1000 array contains two separate SCSI buses, each supporting
four 9.1 Gbyte/7200 rpm SCSI drives. The production data resides on the first half of
the array and is mirrored using the second half of the array. Each half of the D1000 is
accessed by different controllers on each host (c1, c2) to increase availability.
Use separate disk trays to house the production and mirror data to avoid the disk tray
becoming a single point of failure (SPOF).
Note: For the sake of expediency, we elected to use a single Sun StorEdge D1000
array—even though the single tray will be a SPOF.
In Figure 6-2 each node (nickel and pennies) in the two-node cluster has console access
through a terminal concentrator (coins) and is managed by the cluster administrative
workstation (clusterclient00).
Note: In our example, the boot disk and its mirror are connected to the same
controller, making the controller a SPOF. Although it is a best practice to avoid a
SPOF, it may be impractical in some lower end solutions due to the reduced number
of I/O slots.
Figure 6-2 shows the clusterclient00 administrative workstation (Ultra 2, 2 x 200 MHz
UltraSPARC, 512 Mbytes of memory, hme0), the coins terminal concentrator (connected
through ttyb), the coins-hub00 and coins-hub01 100-Mbyte FDX hubs for the HA switched
backbone, and the internal disks (boot and boot mirror) of each cluster node.
Note: Each software layer should be installed and configured in the presented order.
However, in this solution, the highly available application (in this case, NFS) is
integrated into Solaris OE and does not need to be installed separately.
Second Layer: Hardware/Software Volume Managers
Note: Ensure the hardwire entry in the /etc/remote file includes the
/dev/term/b device (associated with the ttyb port of the administrative
workstation).
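On a stock Solaris OE installation, the hardwire entry is similar to the following:
hardwire:\
:dv=/dev/term/b:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D: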
monitor:: addr
Enter Internet address [192.40.85.60]:: 129.153.49.79
Internet address: 129.153.49.79
Enter Subnet mask [255.255.0.0]:: 255.255.255.0
Subnet mask: 255.255.255.0
Enter Preferred load host Internet address [<any host>]:: 0.0.0.0
Preferred load host address: <any host>0.0.0.0
Enter Broadcast address [0.0.0.0]:: 129.153.49.255
Broadcast address: 129.153.49.255
Enter Preferred dump address [192.168.96.255]:: 0.0.0.0
Preferred dump address: 0.0.0.0
Select type of IP packet encapsulation (ieee802/ethernet) [<ethernet>]:: ethernet
Type of IP packet encapsulation: [<ethernet>]:: ethernet
Load Broadcast Y/N [Y]:: N
Load Broadcast: N
monitor::
Note: Provide the appropriate IP address, subnet mask, and broadcast address to suit
your specific network environment.
Power cycle the TC to enable the IP address changes to be effected. Confirm that the
TC responds to the new IP address by issuing the ping coins command from the
administrative workstation.
# telnet coins
Trying 129.153.49.79...
Connected to coins.
Escape character is ’^]’.
cli
Annex Command Line Interpreter * Copyright 1991 Xylogics, Inc.
annex: su
Password: 129.153.49.79 (password defaults to assigned IP address)
annex# edit config.annex
Ctrl-W: save and exit Ctrl-X: exit Ctrl-F: page down Ctrl-B: page up
# The following are definitions of gateway entries:
#
%gateway
net default gateway 129.153.49.254 metric 1 active
#
# The following are definitions of macro entries:
#
%macros
%include macros
#
# The following are definitions of rotary entries:
#
%rotary
%include rotaries
Because we are using a dtterm emulation to display the previous box, substitute your
gateway IP address by using the appropriate editing keys (up-arrow, down-arrow,
Backspace, and Delete). The keyboard should be set to insert mode—use the
Backspace key to remove the old IP address. After the old address is removed, enter
the new address. To end the session, use the <ctrl>w command to save changes to
the config.annex file.
# telnet coins
Trying 129.153.49.79...
Connected to coins.
Escape character is ’^]’ <CR>
Rotaries Defined:
cli
After accessing the nickel server, extract the Ethernet address by entering the
following command:
{0} ok banner
Sun Enterprise 220R (2 X UltraSPARC-II 450MHz), No Keyboard
OpenBoot 3.23, 512 MB memory installed, Serial #12567953.
Ethernet address 8:0:20:bf:c5:91, Host ID: 80bfc591.
Gain access to the pennies server by entering the following command sequence:
# telnet coins
Trying 129.153.49.79...
Connected to coins.
Escape character is ’^]’ <CR>
Rotaries Defined:
cli
After accessing the pennies server, extract the Ethernet address by entering the
following command:
{2} ok banner
Sun Enterprise 220R (2 X UltraSPARC-II 450MHz), No Keyboard
OpenBoot 3.23, 512 MB memory installed, Serial #12565971.
Ethernet address 8:0:20:bf:bd:d3, Host ID: 80bfbdd3.
Note: All commands discussed in this section should be entered into the
administrative workstation unless otherwise directed.
To minimize operator error, use the Solaris JumpStart™ software to fully automate
installation of the Solaris OE and application packages.
Note: Although JumpStart techniques are a best practice, they are not implemented
in this chapter because it is important to identify the best practices associated with a
manual installation. For additional information on JumpStart techniques, refer to
other Sun BluePrints publications at http://www.sun.com/blueprints.
# cd /cdrom/Sun_Cluster_2_2/Sol_2.6/Tools
# ./scinstall
...
============ Main Menu =================
1) Install / Upgrade - Install or Upgrade Server
Packages or Install Client Packages.
2) Remove - Remove Server or Client Packages.
3) Verify - Verify installed package sets.
4) List - List installed package sets.
5) Quit - Quit this program.
6) Help - The help screen for this menu.
129.153.49.91 nickel
129.153.49.92 pennies
To access man(1) pages and binaries for the SC2.2 client packages, update the
MANPATH and PATH variables. The following sample code lines illustrate a Bourne or
Korn shell environment:
PATH=$PATH:/opt/SUNWcluster/bin;export PATH
MANPATH=$MANPATH:/opt/SUNWcluster/man;export MANPATH
# ccp&
Note: When using the ccp(1) command remotely, ensure the DISPLAY shell
environment variable is set to the local hostname.
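For example, if the Cluster Control Panel is to display on the clusterclient00 workstation:
# DISPLAY=clusterclient00:0.0;export DISPLAY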
On the Cluster Control Panel window (Figure 6-4), double-click the Cluster Console
(console mode) icon to display the Cluster Console window, see Figure 6-5.
On the Cluster Console window’s menu (Figure 6-7), click Hosts, then choose the Select
Hosts option to display the Cluster Console: Select Hosts window. For the Hostname
option within the Cluster Console: Select Hosts window, enter the name
“low-end-cluster” (as registered in the /etc/clusters file)—see Figure 6-5.
Click the Insert button to display the console windows for both cluster nodes (nickel
and pennies)—see Figure 6-6.
To enter text simultaneously into both node windows, click in the Cluster Console
window (to activate the window) and enter the required text. The text is not
displayed in the Cluster Console window, but is displayed in both node windows. For
example, the /etc/nsswitch.conf file can be edited on both cluster nodes
simultaneously using the Cluster Console window. This ensures both nodes end up
with identical file modifications.
Note: Before using an editor within the Cluster Console window, ensure the TERM
shell environment value is set and exported to a value of vt220 (which is the
terminal emulation being used).
Enter the banner command in the Cluster Console window to display system
information for both nodes.
To issue a Stop-A command to the cluster nodes (to access the OBP prompt), position
the cursor in the Cluster Console window and enter the <ctrl>] character to force
access to the telnet prompt.
Enter the Stop-A command as follows:
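At the telnet prompt, send a break signal (the console equivalent of the Stop-A
keystroke); for example:
telnet> send brk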
Note: The console windows for both cluster nodes are grouped; that is, the three
windows move in unison (Figure 6-6). To ungroup the Cluster Console window from
the cluster node console windows, select Options from the Hosts menu (Figure 6-7)
and uncheck the Group Term Windows checkbox.
# mkdir -p /local/OS/Solaris2.6Hardware3
# cd /cdrom/sol_2_6_sparc_smcc_svr/s0/Solaris_2.6/Tools
# ./setup_install_server /local/OS/Solaris2.6Hardware3
Verifying target directory...
Calculating the required disk space for the Solaris_2.6 product
Copying the CD image to disk...
Copying of the Solaris image to disk is complete
The name of a created directory should indicate its contents. For example,
Solaris2.6Hardware3 denotes the OS version.
Ensure the OS image is available to install clients through NFS by inserting the
following line in the /etc/dfs/dfstab file:
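For example, assuming the image directory created above:
share -F nfs -o ro,anon=0 /local/OS/Solaris2.6Hardware3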
Note: The ro,anon=0 option provides install clients with root privileges and access
to the installation directory. After the /etc/dfs/dfstab file is modified, make the
installation directory available by using the shareall(1M) command.
# cd /local/OS/Solaris2.6Hardware3/Solaris_2.6/Tools
# ./add_install_client -i 129.153.49.91 -e 8:0:20:bf:c5:91 nickel sun4u
# ./add_install_client -i 129.153.49.92 -e 8:0:20:bf:bd:d3 pennies sun4u
{0}ok show-devs
/SUNW,UltraSPARC-II@2,0 (System CPU 1)
/SUNW,UltraSPARC-II@0,0 (System CPU 0)
...
/pci@1f,2000/pci@1/SUNW,qfe@3,1 (qfe port 7)
...
/pci@1f,2000/pci@1/SUNW,qfe@2,1 (qfe port 6)
...
/pci@1f,2000/pci@1/SUNW,qfe@1,1 (qfe port 5)
...
/pci@1f,2000/pci@1/SUNW,qfe@0,1 (qfe port 4)
...
/pci@1f,4000/scsi@5,1 (PCI SCSI controller instance 2)
...
/pci@1f,4000/scsi@4,1 (PCI SCSI controller instance 1)
...
/pci@1f,4000/scsi@3,1 (mother board SCSI controller instance 0)
...
/pci@1f,4000/network@1,1 (motherboard hme instance 0)
...
/pci@1f,4000/pci@2/SUNW,qfe@3,1 (qfe port 3)
...
/pci@1f,4000/pci@2/SUNW,qfe@2,1 (qfe port 2)
...
/pci@1f,4000/pci@2/SUNW,qfe@1,1 (qfe port 1)
...
/pci@1f,4000/pci@2/SUNW,qfe@0,1 (qfe port 0)
...
Disable auto-booting on both servers at the OBP level until installation is complete.
This prevents booting without the correct SCSI-initiator ID.
Disable auto-booting by entering the following command:
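For example, at the OBP prompt on each node:
{0} ok setenv auto-boot? false
auto-boot? = false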
{0} ok nvedit
0: probe-all
1: cd /pci@1f,4000/scsi@3
2: 7 " scsi-initiator-id" integer-property
3: device-end
4: install-console
5: banner
6: <CTRL-C>
{0} ok nvstore
{0} ok setenv use-nvramrc? true
use-nvramrc? = true
{0} ok printenv nvramrc
nvramrc = probe-all
cd /pci@1f,4000/scsi@3
7 " scsi-initiator-id" integer-property
device-end
install-console
banner
{0} ok
Note: There is a space after the first quotation mark in the “scsi-initiator-id”
string.
Keystroke Function
<ctrl> b Moves backward one character.
<ctrl> c Exits the nvramrc editor and returns to the OBP command interpreter. The
temporary buffer is preserved, but is not written back to the nvramrc (use
nvstore to save changes).
Delete Deletes previous character.
<ctrl> f Moves forward one character.
<ctrl> k Joins the next line to the current line (when the cursor is at the end of a line).
Deletes all characters in a line when the cursor is at the beginning of the line.
<ctrl> l Lists all lines.
<ctrl> n Moves to the next line of the nvramrc editing buffer.
<ctrl> o Inserts a new line at the cursor position (the cursor stays on the current line).
<ctrl> p Moves to the previous line of the nvramrc editing buffer.
<CR> Inserts a new line at the cursor position (the cursor advances to the next line).
Note: In the previous example, the warning message unused disk space should be
ignored.
To enable VxVM root encapsulation, ensure partitions 3 and 4 are not used. There
should be at least 16 MBytes of unallocated disk space at the end of the disk to
implement VxVM’s private region.
After installation is complete, review the install_log and sysidtool.log files in
the /a/var/sadm/system/logs directory to confirm that all software packages
were correctly installed.
When using the /usr/bin/vi command to view the log files, set up terminal
emulation using the following command:
# TERM=vt220;export TERM
After reviewing all installation log files, reboot the system by entering the following
command:
# reboot
Note: The password used in the above box is an example only. Select your own
password to avoid security exposure.
Note: Use the ccp(1M) Cluster Console window to enter commands into both nodes
simultaneously unless otherwise directed.
A Sun Microsystems server with more than one network interface automatically
becomes a router (and may be selected over an efficient hardware router). To
circumvent this situation, establish a default route by entering the following
commands:
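For example, using the default router defined earlier for the terminal concentrator
(substitute the router address for your network); creating the /etc/notrouter file
additionally keeps IP forwarding disabled:
# echo "129.153.49.254" > /etc/defaultrouter
# touch /etc/notrouter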
The SC2.2 infrastructure does not support Solaris IP interface groups—they must be
permanently disabled to prevent network anomalies during a logical host migration.
Solaris IP interface groups can be disabled by entering the following commands:
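A common approach is to append the following line to the /etc/system file on both
nodes (the setting takes effect at the next reboot; verify the parameter name against the
SC2.2 release notes for your Solaris OE release):
set ip:ip_enable_group_ifs=0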
129.212.0.0 255.255.255.0
Note: Obtain the appropriate netmask value from your network administrator.
nickel
pennies
ha-nfs1
Test the remote shell capability by entering the following command on the pennies
node:
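For example:
# rsh nickel /usr/bin/hostname
nickel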
Note: If the remote shell test fails, debugging should be performed. (Consultation
with your service provider may be required.)
Verify network connectivity for the first private interconnect by entering the following
commands on the nickel node:
Verify network connectivity for the NAFO failover of the main network interface
(hme0 on both nodes) by entering the following commands:
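For example, a simple ping between the public hostnames confirms the hme0 interfaces
are reachable:
nickel# ping pennies
pennies is alive
pennies# ping nickel
nickel is alive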
# mkdir -p /PATCHES/Solaris2.6Cluster
Click the Download FTP button. Click the GO button, and when prompted, store the
downloaded 2.6_Recommended.zip file in the /PATCHES/Solaris2.6Cluster
directory.
Unzip and install the patch files by entering the following commands:
# cd /PATCHES/Solaris2.6Cluster
# unzip 2.6_Recommended.zip
# cd /PATCHES/Solaris2.6Cluster/2.6_Recommended
# ./install_cluster
Are you ready to continue with install? [y/n]: y
Installing 106960-01...
Installing 107038-01...
...
Installing 108721-01...
...
For more installation messages refer to the installation logfile:
/var/sadm/install_data/Solaris_2.6_Recommended_log
Use the /usr/bin/showrev -p command to verify installed patch-IDs.
Refer to individual patch README files for additional patch detail.
Rebooting the system is usually necessary after installation.
Note: Some patches included in the patch cluster will fail installation with a
“Return code 2” error message because they were already included in the Solaris
OE base software installation. Review the
/var/sadm/install_data/Solaris_2.6_Recommended_log file to verify patch
installation errors.
Reboot the system to activate the patches. Enter the following command:
# reboot
# cd /cdrom/vm_3_0_2/pkgs
# pkgadd -d .
The following packages are available:
1 VRTSvmdev VERITAS Volume Manager, Header and Library Files
(sparc) 3.0.2,REV=08.30.1999.15.56
2 VRTSvmdoc VERITAS Volume Manager (user documentation)
(sparc) 3.0.2,REV=08.26.1999.22.51
3 VRTSvmman VERITAS Volume Manager, Manual Pages
(sparc) 3.0.2,REV=08.30.1999.15.55
4 VRTSvmsa VERITAS Volume Manager Storage Administrator
(sparc) 3.0.3,REV=08.27.1999.13.55
5 VRTSvxvm VERITAS Volume Manager, Binaries
(sparc) 3.0.2,REV=08.30.1999.15.56
Select package(s) you wish to process (or ’all’ to process all packages). (default: all)
[?,??,q]: <CR>
...
Installation of <VRTSvmdev> was successful.
...
VERITAS Volume Manager (user documentation)
...
Select the document formats to be installed (default: all) [?,??,q]: 1 (Postscript)
...
Installation of <VRTSvmman> was successful.
# reboot
# mkdir -p /PATCHES/VxVM3.0.2
Unzip the file and install the patches by entering the following commands:
# cd /PATCHES/VxVM3.0.2
# unzip 106541-*.zip
# cd 106541-*
# patchadd .
# reboot
To start the VxVM 3.0.2 configuration, enter the following command on both
nodes—answer the prompts to suit your specific environment:
# /usr/sbin/vxinstall
...
No valid licenses found.
Are you prepared to enter a license key [y,n,q,?] (default: y) y
Please enter your key: XXXX XXXX XXXX XXXX XXXX XXXX
vrts:vxlicense: INFO: Feature name: CURRSET [95]
vrts:vxlicense: INFO: Number of licenses: 1 (non-floating)
vrts:vxlicense: INFO: Expiration date: Sun Apr 30 01:00:00 2000 (30.5 days from now)
vrts:vxlicense: INFO: Release Level: 20
vrts:vxlicense: INFO: Machine Class: All
vrts:vxlicense: INFO: Key successfully installed in /etc/vx/elm/95.
Do you wish to enter another license key [y,n,q,?] (default: n) y
Please enter your key: XXXX XXXX XXXX XXXX XXXX XXX
vrts:vxlicense: INFO: Feature name: RAID [96]
vrts:vxlicense: INFO: Number of licenses: 1 (non-floating)
vrts:vxlicense: INFO: Expiration date: Sun Apr 30 01:00:00 2000 (30.5 days from now)
Use the vxprint(1M) command after the system reboots to confirm that the boot disk,
boot mirror, and swap volumes are active.
# /etc/vx/bin/vxrootmir bootmirror
# /usr/sbin/vxassist -g rootdg mirror swapvol layout=nostripe bootmirror
Note: These VxVM commands produce a bit-by-bit copy of the root and swap
partitions into their respective mirrors (mirror synchronization). The performance of
the machine will be degraded until the copy operation is completed.
# /etc/vx/bin/vxdisksetup -i c1t8d0
# /etc/vx/bin/vxdisksetup -i c1t9d0
# /etc/vx/bin/vxdisksetup -i c1t10d0
# /etc/vx/bin/vxdisksetup -i c1t11d0
# /etc/vx/bin/vxdisksetup -i c2t0d0
# /etc/vx/bin/vxdisksetup -i c2t1d0
# /etc/vx/bin/vxdisksetup -i c2t2d0
# /etc/vx/bin/vxdisksetup -i c2t3d0
Populate the NFS1DG VxVM disk group with the rest of the available disks from coins-
d1k00 by entering the following commands on the nickel host:
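The following is a minimal sketch, assuming the disk group was initialized on SCSI
target 8 (c1t8d0) and that targets 9 through 11 remain to be added; the d1k00tNN disk
names follow the naming convention explained below:
# /usr/sbin/vxdg -g NFS1DG adddisk d1k00t9=c1t9d0
# /usr/sbin/vxdg -g NFS1DG adddisk d1k00t10=c1t10d0
# /usr/sbin/vxdg -g NFS1DG adddisk d1k00t11=c1t11d0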
The VxVM disk name should indicate the name of the physical disk array it is
associated with. This helps identify a physical location if an error occurs. For example,
the d1k00t10 VxVM disk name (from the previous box) indicates SCSI target 10 of
the first Sun Enterprise D1000 array.
Import the NFS1DG disk group by entering the following commands on the pennies
host:
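For example, assuming the disk group is first deported from the nickel host:
nickel# /usr/sbin/vxdg deport NFS1DG
pennies# /usr/sbin/vxdg import NFS1DG
Migrating the disk group back, as described below, uses the same two commands with
the hosts reversed.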
Note: After the NFS1DG disk group is imported into the pennies host, verify its
presence using the vxprint(1M) command.
After it has been determined that the NFS1DG disk group can be migrated to the pennies
host, export the disk group back to the nickel host by entering the following command
on the pennies host:
Import the NFS1DG disk group by entering the following commands on the nickel
host:
# /usr/sbin/newfs /dev/vx/rdsk/NFS1DG/NFS1DG-stat
newfs: construct a new file system /dev/vx/rdsk/NFS1DG/NFS1DG-stat: (y/n)? y
Although the SC2.2 software creates the administrative volume and file system with a
single command, there is no control over disk selection. Therefore, build the
administrative volume and file system ahead of the SC2.2 software to enable manual
selection of the disks.
Note: The first command used in the preceding box returns the maximum possible
size for the proposed configuration to be used in the following VxVM command.
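The following is a minimal sketch of this approach; the 8-Gbyte size is an example only
(use the size reported by the first command), and the options should be verified against
the vxassist(1M) man page:
# /usr/sbin/vxassist -g NFS1DG maxsize layout=mirror
# /usr/sbin/vxassist -g NFS1DG make HANFS1 8g
# /usr/sbin/vxassist -g NFS1DG mirror HANFS1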
After the HANFS1 volume is mirrored, create a file system by entering the following
command on the nickel host:
# /usr/sbin/newfs /dev/vx/rdsk/NFS1DG/HANFS1
newfs: construct a new file system /dev/vx/rdsk/NFS1DG/HANFS1: (y/n)? y
After the HANFS1 file system has been created, establish a directory mount point by
entering the following command on both the nickel and pennies hosts:
# mkdir /nfs-services
# cd /cdrom/Sun_Cluster_2_2/Sol_2.6/Tools
# ./scinstall
...
============ Main Menu =================
1) Install/Upgrade - Install or Upgrade Server
Packages or Install Client Packages.
2) Remove - Remove Server or Client Packages.
3) Verify - Verify installed package sets.
4) List - List installed package sets.
5) Quit - Quit this program.
6) Help - The help screen for this menu.
Please choose one of the menu items: [6]: 1
==== Install/Upgrade Software Selection Menu =======================
Upgrade to the latest Sun Cluster Server packages or select package
sets for installation. The list of package sets depends on the Sun
Cluster packages that are currently installed.
Choose one:
1) Upgrade Upgrade to Sun Cluster 2.2 Server packages
2) Server Install the Sun Cluster packages needed on a server
3) Client Install the admin tools needed on an admin workstation
4) Server and Client Install both Client and Server packages
5) Close Exit this Menu
6) Quit Quit the Program
Enter the number of the package set [6]: 2
# /opt/SUNWcluster/bin/scadmin startnode
/opt/SUNWpnm/bin/pnmset
do you want to continue ... [y/n]: y
How many NAFO backup groups on the host [1]: 1
Enter backup group number [0]: 0
Please enter all network adapters under nafo0
hme0 qfe1
Note: Using the pnmset command updates the /etc/pnmconfig file. Manually
updating this file is not recommended.
Note: Because the administrative volume and file system were previously created
using VxVM commands, the SC2.2 software will not re-create them, see section,
“Creating the Administrative Volume and File System” on page 270.
share -F nfs /nfs-services
# hastat
NAFO Verification
Confirm that a failover from the main network interface to the standby NAFO
interface is possible by executing the following steps on the nickel and pennies hosts:
1. Confirm that hme0 is currently the main network interface by entering the
following command:
# ifconfig -a
2. Disconnect the hme0 interface by unplugging the RJ45 Ethernet cable from the
hme0 host port (this action simulates a failed interface).
3. Confirm that qfe1 has replaced hme0 as the main network interface by entering
the following command:
# ifconfig -a
1. Position the cursor in the Cluster Console window (ccp(1M)). Enter the <ctrl>]
character to access the telnet prompt.
>
3. Confirm the NFS data services have been successfully failed over to the pennies
node by verifying NFS client access to files.
4. After confirming that the pennies node has ownership of the NFS data services,
force the nickel node back into the Solaris OE by entering the following
command:
{0} ok go
Note: After relinquishing ownership of the NFS data services, the SC2.2 software
may invoke a shutdown on the nickel node. If this occurs, cluster membership should
be re-established by running the scadmin startnode command after rebooting.
Summary
A good method of assimilating technical concepts is by reviewing the implementation
details of the solution. This chapter provided a detailed description of the highly
available NFS services solution. Reviewing this information can help the reader better
understand the cluster concepts, the environment (software and hardware
components), and associated best practices.
Data services sit at the core of your business application’s availability because they
sustain the highly available environment by means of high availability agents (HA-
agents). The Sun Cluster 2.2 (SC2.2) HA-agents continually monitor the status of the
application. If errors are detected, the application attempts to restart on the same
node, or is restarted on the standby node (following a brief outage).
Note: A data service combines the functionality of a standard application (not highly
available) and the monitoring functionality of an HA-agent.
Note: Contact your local Sun service provider to license the HA-agents developed by
Sun Microsystems or third-party vendors.
Highly Available Data Services
The SC2.2 software was developed to enhance data and application availability for
business-critical or mission-critical environments. The basic SC2.2 premise is to
connect two or more servers by means of a private network, and have all servers
access the same application data using multi-ported disk storage. Because each cluster
node is allocated the same network resources and disk data, it is capable of inheriting
an application after its primary server is unable to provide services.
Note: The HA-agent for Sybase version 12 and higher is developed and supported by Sybase. Contact
your Sun sales representative to obtain the latest information on supported SC2.2
HA-agents.
Note: Although the SC2.2 HA-agent API provides a data service library of functions
that supports C language software development, it is not covered in this book.
Read the Sun Cluster 2.2 API Programmer’s Guide and SC2.2 online man pages for
additional information.
Crash Tolerance
Because the SC2.2 HA model simulates a fast reboot of a failed system, the highly
available application must access disk storage rather than system memory (temporary
storage) to create a permanent record of its last known state (to be used during
startup). Depending on the application, it may be necessary to perform an automatic
rollback or data consistency check each time the application is started—this ensures
any partially completed transactions are reconciled.
Note: SC2.2 software enables all data services to start and stop regardless of their
dependencies on other data services. The START and STOP DSMs must include
routines which verify availability of dependent data services.
The accompanying figures show two cluster nodes (Node X and Node Y) attached to
shared disk storage, with one logical host (IP = 192.0.2.1) hosting Data Service 1 and
Data Service 2 and a second logical host (IP = 192.0.2.2) hosting Data Service 3, before
and after a failover.
After the ownership of a logical host is determined, a set of routines (DSMs) are
performed against each data service to ensure the highly available application is
correctly stopped on the failed node and correctly started on the alternate node.
All DSMs supported by the SC2.2 framework address the specific needs of different
applications. Some DSMs may not be applicable to specific applications; therefore,
registration of these DSMs is not required. However, the START (or START_NET) and
STOP (or STOP_NET) DSMs are mandatory, see section, “Data Service Method
Registration,” on page 292.
The core DSMs are as follows:
■ START — Starts a data service before the logical host’s network interfaces are
configured (plumbed and activated).
■ START_NET — Starts a data service after the logical host’s network interfaces are
configured (plumbed and activated).
■ STOP — Stops a data service and removes unwanted processes (and their child
processes) after the logical host network interface(s) has been brought down.
■ STOP_NET — Stops a data service and removes unwanted processes (and their
child processes) before the logical host network interface(s) has been brought
down.
■ ABORT — Used when a cluster node terminates abnormally. Any open files or
processes will be cleaned after the logical host network interface(s) has been
brought down.
■ ABORT_NET — Used when a cluster node terminates abnormally. Any open files
or processes will be cleaned before the logical host network interface(s) has been
brought down.
Note: At least one START (or START_NET) and one STOP (or STOP_NET) DSM must
be defined, configured, and registered with a data service before it can be activated.
# hareg -r myDataService \
-m START=/opt/myDataService/bin/myStartScript.sh \
-h myLogicalHost
This example registers a newly developed data service called myDataService that is
managed by the logical host designated myLogicalHost. Note how the START DSM
is associated with the /opt/myDataService/bin/myStartScript.sh script.
Note: Each DSM only needs to be registered on a single node because the ccdd(1M)
daemon will distribute the configuration information to all nodes. It is essential that
the DSM’s executable files are located in the same local disk location on each node;
for example, /opt/myDataService/bin/myStartScript.sh.
Note: For ease of readability, the term “SC2.2” will be used in place of “SC2.2
software” in the following lists.
1. FM_STOP
2. STOP_NET
3. SC2.2 brings down all network interfaces associated with the logical host on the
active node.
4. STOP
5. SC2.2 unmounts all file systems associated with the logical host on the active node.
6. SC2.2 deports all disk sets (SDS) or disk groups (VxVM) associated with the logical
host on the active node.
8. SC2.2 imports all disk sets (SDS) or disk groups (VxVM) associated with the
logical host on the alternate node.
9. SC2.2 mounts all file systems associated with the logical host on the alternate
node.
10. START
11. SC2.2 brings up the network interfaces associated with the logical host on the
alternate node.
12. START_NET
13. FM_INIT
14. FM_START
1. ABORT_NET
2. SC2.2 brings down all network interfaces associated with the logical host.
3. ABORT
The content of the ABORT and ABORT_NET DSM routines should be similar to that of
the STOP and STOP_NET DSMs. For many highly available applications, it may be
possible to use slightly modified versions of the STOP and STOP_NET DSMs in place
of the ABORT and ABORT_NET DSMs.
The ABORT and ABORT_NET DSMs are asynchronous routines that can be called at any
time (including when the START or STOP DSMs are running). The code routines
associated with the ABORT DSMs should be designed to check for and kill any existing
START or STOP process.
Note: The ABORT DSMs are cleanup programs that should be designed efficiently
because they may not have much time to run.
Table 7-2 shows the second state of the cluster after the use of the scadmin(1M)
command (scadmin switch mycluster physical-B logical-A). The
scadmin command is used to switch logical host logical-A from the node
physical-A to physical-B.
Table 7-3 shows the third state of the cluster following the use of the STOP and
STOP_NET DSMs on both nodes. Notice how the logical-A file system resources
have been unmounted from the physical-A node and mounted on the
physical-B node.
Table 7-3 Third State: START and START_NET DSMs Are Invoked on All Cluster
Nodes
Table 7-4 Fourth State: START and START_NET DSMs Have Completed on Both
Cluster Nodes
The tables presented in this section demonstrate that when the same DSMs are
executed on different nodes, they are assigned unique arguments during a cluster
reconfiguration. The tasks required to be performed by a DSM are based on the
arguments passed to it. The state of the data services and disk sets or disk groups
should be determined to establish which additional tasks are to be performed by each
DSM. For example, a physical host must not attempt to start a highly available
application if the node is not the master of the logical host managing the data service.
There is little point in having a physical host stop a data service if the data service is
not running on the node.
Examples of how to implement code that tests the current state of a node, logical host,
and data services are provided in Appendix B, “SC2.2 Data Service Templates.”
Developing HA-Agents
As discussed previously, the development of an HA-agent for an application involves
the creation of DSMs (and their registration with the SC2.2 infrastructure) to
automatically start, stop, and monitor a highly available application. The following
sections discuss some of the tools provided with the SC2.2 software which can be useful
for developing a DSM. Appendix B contains shell script templates with sample code
for all supported DSMs. These templates can be used to reinforce the concepts
presented in this chapter, and to simplify the development of an HA-agent.
Note: All C functions supported by the libhads.so library are listed in the
ha_get_calls(3HA) man pages.
# pmfadm -c myproc -n 10 -t 60 \
-a /opt/ds/bin/failureprog /opt/ds/bin/myprog
The preceding example will start a monitoring process associated with the string
identifier myproc; if the string identifier already exists, the pmfadm process will exit.
The monitoring process will execute the /opt/ds/bin/myprog routine, which will
be restarted if the program fails. The invocation of the routine is restricted to 10
restarts within the same 60-minute window.
Summary
High availability of business applications is important to guarantee expected service
levels to external and internal customers and prevent losses in productivity, business,
and market share. The SC2.2 data services, or HA-agents, manage business
applications to provide highly available services. This chapter introduced
qualification requirements for making existing applications highly available, see
section, “Qualifying a Highly Available Application” on page 283. The elements
involved in an application failover using the SC2.2 data services architecture were
explained in detail.
Data service methods (DSMs) are the routines that manage a business application to
make it highly available. All DSMs supported by the SC2.2 software, their suggested
content, and their invocation sequence within SC2.2 reconfiguration events were also
discussed in detail. Appendix B enhances the concepts presented in this chapter by
providing sample code templates to assist in the development of an HA-agent.
This chapter describes the architecture of the next product release in Sun
Microsystems’ cluster strategy—Sun Cluster 3.0 (SC3.0) software. SC3.0 software was
designed as a general-purpose cluster solution. In the evolution of the cluster product
line, SC3.0 software represents a major milestone. Sun Microsystems views clusters as
a strategically critical technology and has made, and will continue to make, a
commitment to this technology in terms of resources and research. As this book went
to print, SC3.0 was released for general availability.
As with the SC2.2 software, the fundamental design criteria for SC3.0 software are
data integrity and application availability. Although the design parameters are the
same, the method of implementing the technology is completely different. The SC2.2
software provides a platform for availability and affords limited scalability;1 however,
SC3.0 is designed for continuous availability and unlimited scalability.
During the early development of the cluster products, a Sun Labs engineering team
led by Dr. Yousef A. Khalidi began a research initiative to investigate important areas
of future cluster technology. The research included (but was not limited to) global file
systems, high-performance I/O, high availability, and networking. With the valuable
experience gained by this project, the cluster engineering team developed a powerful
cluster solution in conjunction with the SC2.2 development. Many of the technologies
and techniques (but not all) developed and tested in this project eventually made their
way into the SC3.0 product line.
The SC3.0 software is an expandable, robust, scalable, and highly available computing
environment built around a new object-oriented framework designed for mission-
critical applications operating in the Internet, intranet, and datacenter environments.
SC3.0 software provides a unified image of file systems, global networking, and
global device access, ensuring an environment that is scalable and easier to
administer.
Not all features of the SC3.0 software are covered in this chapter. A forthcoming Sun
BluePrints book will detail the SC3.x product. This chapter presents a high-level
overview of some of the main areas of this important and exciting technology.
Note: Although the Solaris OE currently supports clustering, SC3.0 is not a standard
component of the OS and requires its own user license.
After a node becomes a member of a cluster, it is not permitted to drop out of the
cluster environment (as it can with the SC2.2 software). Cluster nodes can boot in non-
cluster mode; however, in a standard boot sequence (simple boot command at the ok
prompt), nodes will either join or form a cluster and respond to cluster messages and
heartbeats.1 No further actions or processes need to be invoked for nodes to
participate in or form a cluster.
Not requiring a system administrator to start or bring nodes into a cluster has
advantages when a cluster is used in a lights-out datacenter. Additionally, this feature
can be advantageous when used in combination with hardware that is able to blacklist
or isolate failed components (failure fencing) by means of built-in RAS capabilities.
1 A node can form a new cluster only if a cluster did not exist, or if the node was the last
node to leave a current cluster.
Note: Some older architectures may not have RAS capabilities. Also, RAS features
have limitations, see Chapter 4, “Sun Cluster 2.2 Administration” for further
discussion of RAS and its capabilities.
After rebooting, the node automatically joins the cluster, thereby increasing
application availability.
Topology
The physical architecture of a cluster implemented with SC3.0 software is similar to a
cluster implemented with SC2.2 software, in that all nodes must have the same
version of the Solaris OE and all nodes must be connected to a private network of
redundant high-speed interconnects. However, unlike SC2.2, not every cluster node
needs to have physical access to shared storage.
Cluster Interconnects
As with SC2.2 software, the interconnects pass cluster membership information,
cluster heartbeats, and in some configurations, application data between nodes. SC3.0
software also uses the interconnects to pass new data types for global file services
traffic, global devices traffic, metadata, and redirected IP traffic. Also, as with SC2.2
software, the interconnects must be made redundant using Ethernet, SCI, or Gigabit
Ethernet. Each cluster node must be configured with a minimum of two interconnect
interfaces (with a maximum of six). However, unlike SC2.2 software, all private
network interfaces are used in parallel for data transport and cluster messages.
As discussed in Chapter 2, “Sun Cluster 2.2 Architecture,” SC2.2 software uses only
one interconnect for data and heartbeats, while the other is used as a fault-tolerant
redundant link (managed by the switch management agent [SMA]). However, with
SC3.0 software, all interconnect interfaces are used simultaneously. Data on the
interconnect is striped across all interconnect interfaces for increased throughput.
Individual I/O is transported via a single interface—these interfaces are used in a
round-robin process. Another performance benefit developed for the SC3.0
interconnect is a true zero copy optimization algorithm. Data is moved from the
network adapters directly to the session layers of the protocol stack, thereby speeding
packet transfers dramatically.
Adding adapters to the interconnect topology increases the available bandwidth used
for message passing, scalable services, and disk I/O—at the same time, increasing
availability of the interconnect. Losing one or more of the interconnects does not affect
cluster operation (assuming enough redundancy) because the cluster will continue
using the remaining interconnects (at a lower aggregate bandwidth). For example, if a
cluster is configured with five adapters for the interconnect, all five adapters will be
used simultaneously; if one adapter fails, the remaining four will be used.
Although SC3.0 software will continue to work with a failure of an interconnect
adapter, interconnect performance will decrease. As with SC2.2 software, if a cluster
contains more than two nodes, switches will have to be used for the cluster
interconnects; however, in a two-node cluster, a direct point-to-point connection is
sufficient.
Note: Although not required, boot disks should be mirrored for availability. Shared
boot devices are not supported by SC3.0 software (including the /usr and /etc/
directories).
Shared Storage
Unless dual-attached devices (such as a Sun StorEdge™ A3500) are being used,
shared storage must be dual-ported and mirrored for availability. A minimum of two
nodes should be connected to shared storage devices. If one node leaves the cluster,
all I/O is routed to another node that has access to the storage device (see Figure 8-1).
This behavior is important, because with SC3.0 software, only one path at a time is
permitted to be the active path to a disk.
Figure 8-1 shows five cluster nodes (A through E) on the public network attached to a
mirrored shared storage device; before the failure, I/O flows through one attached node,
and after the failure, I/O is rerouted through the other node attached to the disk and its
mirror.
Global Devices
A major difference between SC2.2 and SC3.0 software is the capability of SC3.0
software to make all storage hardware resources available across a cluster without
regard to where a device is (physically). A storage device attached to any node can be
made into a global device. These devices include:
■ Local disks
■ Storage arrays
■ Tape drives
■ CD-ROM drives
For example, a CD-ROM drive on node A can be made into a global device that is
accessible to all nodes in the cluster. Storage devices are referenced with a global disk
ID (DID) that is supplied to node hardware by means of a DID driver. If a failure
occurs, SC3.0 will automatically redirect device access to another I/O path (only
applicable for dual- and multi-ported storage devices).
Access to devices (not directly attached to the node) is gained by means of the cluster
interconnect. The DID driver automatically probes all nodes within a cluster and
builds a list of available devices. Each device is assigned a unique ID which includes
major and minor numbers. Global devices are accessed using the DID device number
rather than the standard c(x)t(y)d(z) notation (for example, d10s7 translates to
device d10, slice 7). Each node maintains its own copy of the global namespace
located in the /dev/global directory. Device files are stored in the
/dev/global/rdsk directory for raw (character) disk devices, the /dev/global/dsk
directory for block disk devices, and the /dev/global/rmt directory for tape devices. Assigning a global access
naming convention ensures device consistency across the cluster.
Global device access enables the use of dual-ported devices in a four-node cluster.
Additionally, system administration is simplified when the storage capacity of the
cluster is increased. For example, when a new array is added to the cluster, all nodes
connected to the array should be reconfigured using the Solaris drvconfig, disks,
and devlinks commands. The DID driver will update the /dev/global directory
on the local node and will then issue updates to all nodes in the cluster. The new
array will now be visible to all nodes. Global device access enables the creation of a
cluster storage environment that is easy to scale and manage.
Object-based
SC3.0 software is object based (unlike the SC2.2 framework). The interface definition
language (IDL) for SC3.0 software is C++; however, new IDLs can be created for other
object-oriented languages such as Java™ and Enterprise Java Beans™ (EJB™). Object-
oriented programming enables the expansion of existing code without the need to
rewrite the code. Although an inheritance model for the RGM will not be available,
the software structure is in place; as market conditions warrant, the inheritance model
can be made readily available.
One figure shows a client proxy and a server proxy communicating across the cluster
interconnect, with a cache on the client side and a provider on the server side. A second
figure shows five nodes (A through E) on the public network with primary and secondary
paths to shared disks: Node A masters /dev/did/dsk/data1 (all I/O to this file system
goes through A), and Node E masters /dev/did/dsk/data2 (all I/O to this file system
goes through E).
Types of Scalability
As previously discussed in Chapter 2, “Sun Cluster 2.2 Architecture,” there are two
types of scalability: vertical (adding more CPUs to a node) and horizontal (adding more
nodes to a cluster).
Vertical Scalability
Vertical scalability is limited within the domain of a single machine. Adding CPU
capacity increases the scalability potential. Vertical scalability is useful for long-
running database queries or complex computing-intensive jobs (such as finite element
analysis) requiring large amounts of CPU power. Using twice the number of CPUs
should theoretically decrease processing time by half.
Horizontal Scalability
Horizontal scalability is not limited to a single machine. Data or processes are spread
across several machines in a horizontal configuration (such as a cluster environment).
An example of application horizontal scalability is numerous small jobs benefiting
from multiple machines accessing the same data store. If a single node can manage
100 users, then two nodes should (theoretically) be able to manage 200 users, thereby
doubling application scalability.
Note: Even if an application is CPU-bound, adding more CPUs does not guarantee
linear scalability. Eventually, other limitations will be revealed.
A related figure shows Web requests arriving at a global IP interface and being routed
internally to nodes A through E on the public network; outgoing responses can come
from any node.
Communications Framework
The SC3.0 communications framework was written using an object request broker
(ORB) framework. Security is not an issue because the ORB implementation is internal
to the cluster framework. Cluster messages are not passed on the public network, only
on the interconnect. Similarly, only cluster members receive ORB messages. Using a
message-passing model creates a cluster with a framework that is more resilient to
fault conditions because, through object isolation, cluster recovery is confined to the location at
which a fault occurred. With object orientation, the ORB is easily expandable, thereby
opening the framework to technologies such as EJB components and servlets.
Summary
The SC3.0 framework has undergone extensive testing during its development. The
features discussed in this chapter are accurate at the time of publication.
The SC3.0 software offers extensive capability and potential; therefore, a forthcoming
Sun BluePrints book will be devoted to the SC3.0 environment.
Because of the benefits clustering offers, clustering technology will continue to evolve.
Sun Microsystems is dedicated and committed to the future of cluster technology.
The SCSI protocol supports two classes of devices:
Initiator
An initiator is generally a host device that can initiate a SCSI command. The default
SCSI-initiator ID is 7.
Target
A target is a device that will respond to SCSI commands.
Because target devices generally require an initiator to function, they should not use
ID 7. Other available IDs fall within the ranges of 0-6 and 8-15. Narrow SCSI target
devices have IDs available in the 0-6 range, with wide SCSI devices having IDs
available in the range of 0-15.
Note: If only one system is connected to a SCSI chain, the SCSI-initiator ID (the SCSI
ID of the system) will not generally require changing.
Each SCSI chain attached to a system is associated with a single SCSI controller. Each
controller is the connection point of the system to a specific SCSI chain. The host can
have a different address on each SCSI chain; therefore, it can be useful to think of the
SCSI address as belonging to a specific controller (the host’s connection to a specific
chain) itself, rather than to the host. The SCSI-initiator ID represents the system’s
address on a specific chain and can be set environment-wide for all controllers on the
system (or individually for each SCSI controller). An environment-wide ID sets the
default value for all controllers not having an ID explicitly set.
Clustered configurations are more complex to configure than a single machine
because there could be multiple initiators on the SCSI chain. If only one initiator is on
a chain, all targets will be local to the initiator—that is, they can only receive requests
from that initiator. In a single non-clustered system configuration, there can be several
SCSI chains—each with its own set of SCSI IDs. However, each chain will still be local
to the single initiator.
A clustered configuration could be made up of a combination of local and global
chains. Any local chains will have connections from a specific host to target devices
only, whereas any global chains will contain connections to target devices and other
initiators.
The first figure shows Node A and Node B on a shared SCSI chain with targets 0 through
3; both hosts use the default SCSI-initiator ID of 7, creating a conflict (the same ID
appears twice on the chain), while each node's local chain is unaffected.
On local chains, the SCSI-initiator ID should be kept at the default value of 7—thereby
preventing SCSI ID conflicts with devices such as the boot disk or CD-ROM (and any
internal devices to be installed in the future).
On shared chains, one of the hosts should have its SCSI ID set to 6—this will prevent
conflict problems, see Figure A-2. The SCSI ID for the other host should be left at the
default value.
Note: Setting the SCSI ID to 6 prevents a drive with a higher priority from hijacking
the bus. Although changing the global SCSI-initiator ID of a device to 8 (or another
unique ID on the chain) could solve a SCSI-initiator conflict and avoids the necessity
of writing an nvram script, there could be a precedence problem because ID 8 has a
precedence below that of other targets. A SCSI ID of 7 has the highest
precedence—followed by 6, decreasing down to 0. IDs 8 to 15 have a precedence
below ID 0.
Figure A-2 shows the same configuration with the SCSI-initiator ID of one host set to 6
on the shared chain, giving the two hosts different IDs; the local chains remain at the
default ID of 7.
Note: Ensure the host SCSI IDs do not interfere with each other, and also that target
IDs do not interfere with the host IDs. Each SCSI ID on the chain must be unique,
regardless of whether used by a target or initiator device. This could mean leaving an
empty slot in a storage enclosure if the SCSI ID for the slot conflicts with the
modified SCSI-initiator ID.
Note: Setting the SCSI ID for individual controllers is more complex than setting an
environment-wide default. Both require using the OBP prompt or the Solaris OE
eeprom command. Veritas Volume Manager (and other programs) store information
in the NVRAM; therefore, it is important to check the NVRAM contents before making
changes. This is performed using the printenv nvramrc OBP command, or the
eeprom nvramrc command at the Solaris OE shell prompt. In most cases, existing
NVRAM contents can be appended to the nvram script that sets the initiator ID. Some
firmware upgrades modify the OBP variables; see the section “Maintaining
Changes to NVRAM” on page 337.
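For reference, the check can be performed with either of the following commands (the
first at the OBP prompt, the second at the Solaris OE shell prompt):
ok printenv nvramrc
# eeprom nvramrc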
* Only runs if the banner command is not called during the nvramrc script evaluation.
The workaround script includes the banner command (which displays system
configuration information) to signal that the portion of the standard system startup
that runs the probe-all, install-console, and banner commands is being
performed by the nvram script itself. If the banner command appears anywhere in
the nvram script, the banner command does not run outside the script (neither do
the install-console or probe-all commands).
The device node properties (including the SCSI ID) can be changed after the
probe-all command has created the device node tree. The Open Boot directory
structure lists devices in a manner similar to the UNIX environment—each device
node represents a level in the hardware hierarchy.
The peripheral bus (sbus or PCI) is a primary node under the root. All peripheral
devices branch off this node—each device has a unique name which comprises the
following parts:
Device Identifier
The first part of the name identifies the device; however, the naming convention may
vary for each type of device node. The device identifier is followed by the @ symbol.
Note: The device identifier for a peripheral card generally consists of the
manufacturer's symbol, followed by a comma, then the device type abbreviation (for
example, SUNW,qfe@2,8c00000 identifies a quad fast Ethernet device manufactured
by Sun Microsystems).
Figure A-3 Example of sbus OBP Device Tree (incomplete). The tree shows device
nodes such as SUNW,fas@3,8800000 (with sd and st children), SUNW,socal@d,10000
(with sf@0,0 and sf@1,0 nodes and their ssd children), SUNW,hme@3,8c00000, and
several SUNW,qfe nodes branching off the sbus node.
Figure A-4 Example of PCI OBP Device Tree—Sun Enterprise 250 (incomplete). The
tree shows pci@4 and pci@5 nodes under the root, with scsi@1 and scsi@1,1 nodes
(each with Disk and Tape children), several SUNW,qfe nodes, and pci108e,1000 nodes
beneath them.
Figure A-5 I/O Board Interfaces in the Sun Enterprise xx00 Series (shows the MII and
TP network interfaces on the I/O board).
Table A-2 Sun Enterprise xx00: Relationships of Physical Locations to OBP Nodes
For example, the .properties command displays the properties of the current device
node (here, a SCSI controller):
ok pwd
/sbus@1f,0/SUNW,fas@e,8800000
ok .properties
scsi-initiator-id 00000007
hm-rev 00 00 00 22
device_type scsi
clock-frequency 02625a00
intr 00000020 00000000
interrupts 00000020
reg 0000000e 08800000 00000010
0000000e 08810000 00000040
name SUNW,fas
In this example, the current SCSI-initiator ID for the controller is listed in the
scsi-initiator-id field. Unless the value of the scsi-initiator-id field is set
explicitly, it will default to the value of the scsi-initiator-id OBP environment
variable.
Changing a SCSI-initiator ID is discussed in the following sections.
To change the environment-wide SCSI-initiator ID to 6 from the Solaris OE shell
prompt, enter the following command:
# eeprom scsi-initiator-id=6
The eeprom command enables changes to the eeprom (which is visible from the OBP
level). Because the preceding command made a change to the EEPROM, the change
becomes permanent (even if the OS is reinstalled). To ensure a change does not
interfere with local SCSI chains (including the CD-ROM and boot devices), write a
script that resets the local chain controller IDs back to 7.
An easy way to create an nvram script at the shell prompt is to make a file that
contains a script of OBP commands (not shell commands). Then, store the script in
NVRAM using the eeprom command. Create this file using a text editor and store it
where it can be retrieved at a later date. The following is a step-by-step description of
the commands required in the script.
The first command required is the probe-all command. This command is the first
action taken by the nvram script—it will probe the devices and create a device tree.
The file should look as follows:
probe-all
Next, change the SCSI-initiator ID back to 7 on the local chains (which are not
connected to other initiators). For each controller connected to a local chain, the script
should enter the device node that represents the controller (using the cd command)
and change the SCSI-initiator ID value. For example, if the internal/local controller
represented by the device node SUNW,fas@e,8800000 is connected to the sbus card
at address 1f (as with Ultra 1 and Ultra 2 systems), the following command would be
added to the file:
cd /sbus@1f,0/SUNW,fas@e,8800000
Note: The device node path used is only an example. In a situation where there are
multiple local chains, the script is required to execute commands to enter each device
node representing a local controller and explicitly change its initiator ID.
After entering the device node, set the controller's SCSI-initiator ID back to 7. With
this line added, the file should contain:
probe-all
cd /sbus@1f,0/SUNW,fas@e,8800000
7 " scsi-initiator-id" integer-property
Change the SCSI-initiator ID for any other local chains to 7 (using the method
described previously).
After the local chain SCSI-initiator IDs have been changed, insert the device-end
command into the file. Inserting this command will enable the script to exit the
device tree. The file should now look similar to the following:
probe-all
cd /sbus@1f,0/SUNW,fas@e,8800000
7 " scsi-initiator-id" integer-property
device-end
Finally, append the install-console and banner commands so that the remainder of
the standard system startup is performed by the script. The completed file should look
as follows:
probe-all
cd /sbus@1f,0/SUNW,fas@e,8800000
7 " scsi-initiator-id" integer-property
device-end
install-console
banner
To store the script in NVRAM, load the contents of the file into the nvramrc variable
with the eeprom command, for example:
# eeprom nvramrc="`cat /var/adm/scripts/nvram_file`"
In this command, the backticks (`) signal the shell to interpret the cat command. If
the quotation marks are not included, only the first line will make it into
the nvramrc variable. If the backticks are not included, the nvramrc variable will
contain the unevaluated string cat /var/adm/scripts/nvram_file. Because
different shells treat backticks and quotes differently, check the values of the variables
when finished.
The nvramrc environment variable should now be configured; however, the nvram
script will not be run at initialization unless the use-nvramrc? environment variable
is set to true using the following command:
# eeprom use-nvramrc?=true
Confirm all modified NVRAM values by using the eeprom command. The eeprom
command displays the value corresponding to a specific variable if run with that
variable as an argument (without specifying a new value with the equals sign). Enter
the following commands to confirm the appropriate OBP variables:
# eeprom scsi-initiator-id
scsi-initiator-id=6
# eeprom nvramrc
nvramrc=probe-all
cd /sbus@1f,0/SUNW,fas@e,8800000
7 " scsi-initiator-id" integer-property
device-end
install-console
banner
# eeprom use-nvramrc?
use-nvramrc?=true
Note: The steps discussed in this section are for an OBP prompt only and are not
suitable for running at the shell command prompt.
To set the environment-wide SCSI-initiator ID to 6 from the OBP prompt, enter the
following command:
ok setenv scsi-initiator-id 6
Note: It is easy to unintentionally press Return and move existing lines down
without realizing it; therefore, it is a good idea to review the finished file using the
<ctrl> p or <ctrl> n keystrokes.
To start the nvedit editor, enter the following command at the OBP prompt:
ok nvedit
0:
Keystrokes Action
<ctrl> c Exits the nvramrc editor and returns to the OBP command
interpreter. The temporary buffer is preserved, but is not written
back to the nvramrc variable. (Use nvstore to write back.)
<ctrl> l List all lines.
<ctrl> p Move to previous line.
<ctrl> n Move to next line.
<ctrl> b Move back one character.
<ctrl> f Move forward one character.
Delete Delete a character.
Enter/Return or <ctrl> o Start a new line.
<ctrl> k Delete (kill) all characters from the cursor to the end of the line
(including the next carriage return). If used at the end of a line, it
will join the current line with the next line. If used at the beginning
of a line, it will delete the entire line.
On line zero, enter the probe-all command (followed by pressing Return). This will
cause the nvram script to probe the devices and create the device tree. The nvram
script should look as follows:
0: probe-all
After pressing Return, line 0 will move up the page and the cursor will be on line 1.
For each controller connected to a local chain, use the cd command to enter the device
node (that represents the controller) and reset the SCSI-initiator ID value to 7. For
example, if the internal/local controller is represented by the device node
SUNW,fas@e (connected to the sbus card at address 1f), enter the device node by
inserting the following command:
cd /sbus@1f,0/SUNW,fas@e,8800000
After entering the device node, set the property back to 7 by entering the
7 " scsi-initiator-id" integer-property command on the next line. With these two
commands added to lines 1 and 2 of the nvram script, the SCSI-initiator ID for the
controller is set in NVRAM, as shown:
1: cd /sbus@1f,0/SUNW,fas@e,8800000
2: 7 " scsi-initiator-id" integer-property
The previous lines should be repeated for each local chain. After the IDs have been
changed for all local devices, add the following commands to the end of the script
(the line numbers used in the following codebox are examples only):
3: device-end
4: install-console
5: banner
6:
The Return following the banner command (line 5) will create a new line (6). Because
the script is now completed, press <ctrl> c to exit the nvedit editor. Although the
script is completed, it is not yet committed to NVRAM. To save the changes, use the
nvstore command to move the nvedit buffer into NVRAM. Enter the following
command at the OBP prompt:
ok nvstore
The nvramrc script should now be configured; however, it will not be run unless the
use-nvramrc? environment variable is set to true by using the setenv command.
Enter the following command at the OBP prompt:
ok setenv use-nvramrc? true
To verify the script, display the nvramrc variable with the printenv command (note
the output):
ok printenv nvramrc
nvramrc = probe-all
cd /sbus@1f,0/SUNW,fas@e,8800000
7 " scsi-initiator-id" integer-property
device-end
install-console
banner
If there are errors, use the nvedit editor again and make the required changes. Move
to the line(s) to be edited using the <ctrl> p or <ctrl> n keystrokes—edit the line,
or delete it using the <ctrl> k keystrokes.
After making changes, run the nvstore command again. When the nvramrc script
is correct, restart the machine for changes to take effect by entering the following
command:
ok reset-all
After the system restarts, the modified values can be confirmed from the Solaris OE
shell prompt with the following commands:
# eeprom use-nvramrc?
# eeprom nvramrc
# eeprom scsi-initiator-id
Summary
Changing the SCSI-initiator ID on a cluster node enables two nodes to share SCSI
storage devices on a single SCSI chain. This procedure is essential for correct
operation of a shared SCSI chain. Changing SCSI IDs requires either direct or indirect
modification of the nvramrc script. Changes are permanent (even after re-installation
of the OS).
Because some firmware upgrades can modify OBP variables, it is critical to check the
variables after any firmware changes. Regular backups of the EEPROM settings are
also crucial.
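As a simple, hypothetical example (the destination path is arbitrary), all OBP variables
can be captured to a file from the Solaris OE shell prompt:
# eeprom > /var/adm/eeprom.backup.`date +%Y%m%d`
Running the eeprom command without arguments prints every OBP variable, so the
resulting file records the complete set of settings in effect at that time.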
The following Bourne shell script templates are provided as examples of data service
methods (DSMs) required by the SC2.2 software to convert standard applications into
highly available applications.
Each template is targeted to a basic (non-complex) data service. The templates include
a flow diagram and supporting documentation. If these templates are to be used for
complex data services, the error checking component of the scripts must be enhanced.
The DSM code presented in this appendix provides building blocks for developing
HA-agents (those not currently provided by Sun Microsystems).
Sun Microsystems, Inc. is not responsible for the use of this code. The DSM script
templates presented here are for illustrative purposes only and are not supported by
Sun Microsystems, Inc. or any of its subsidiaries.
The following DSM script templates have been designed using the following
assumptions:
■ The data service has a unique name.
■ The configuration file and DSM scripts are located in the same directory.
■ The data service name is used as a prefix for the configuration file and DSM
scripts. For example, the START_NET DSM script for the ha_webserver data
service should be named ha_webserver_start_net, with its corresponding
configuration file named ha_webserver.config.
■ The process monitoring facility (PMF) is required for use inside the template
scripts for starting and monitoring the data service—the PMF will use the data
service name as its identifier.
Each DSM template script uses a configuration file. This file is a text file containing
attributes and values that can be imported into the Bourne shell (for example,
ATTRIBUTE=VALUE).
An important reason for having a configuration file is to keep the variable settings
required by a new application in one place, rather than modifying the DSM template
script itself. (A sample configuration file is shown later in this section.)
It is recommended that the STOP_NET DSM script include additional code to ensure a
data service or highly available application is correctly stopped.
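For example, a STOP_NET script might poll for the application's processes after the
stop program has run. The following is only a sketch, assuming the Solaris OE release
provides pgrep(1); APPLICATION_NAME is a hypothetical variable (not part of the
templates in this appendix) that would hold the application's process name:
i=0
while pgrep -x "${APPLICATION_NAME}" > /dev/null && [ ${i} -lt 30 ] ; do
sleep 1
i=`expr ${i} + 1`
done
if pgrep -x "${APPLICATION_NAME}" > /dev/null ; then
doerror "data service did not stop"
exit 1
fi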
The following two variables must be included in the global configuration file:
1. SERVICE_NAME
The name of the data service registered with the SC2.2 infrastructure. The
SERVICE_NAME is used as the prefix for the configuration file, for example,
SERVICE_NAME.config.
Note: To locate the global configuration file, the DSM script templates derive its
name from their own program name by stripping off the DSM suffix; for example,
the _stop_net suffix is stripped from the ha_webserver_stop_net string to yield
ha_webserver. The SERVICE_NAME variable must still be set in the configuration
file (to accurately reflect the data service name); otherwise, the DSM script templates
will fail. (A short sketch of this file-name derivation follows the list of required
variables below.)
2. START_PROGRAM
An executable file invoked by the pmfadm(1m) command within the START_NET
DSM script, which is responsible for starting an application hosted by the data
service. For example, the START_PROGRAM executable starts a Web server application
(with appropriate arguments) associated with the ha_webserver data service.
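For illustration, the file-name derivation works along the following lines, mirroring the
FM_STOP template later in this appendix (the _start_net suffix is only an example):
PROG=`basename $0`                # for example, ha_webserver_start_net
BASEDIR=`dirname $0`
CONFIG_FILE=`echo ${PROG} | sed s/_start_net/.config/`   # ha_webserver.config
. ${BASEDIR}/${CONFIG_FILE}       # imports SERVICE_NAME, START_PROGRAM, and so on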
The following variables can be (optionally) included in the global configuration file:
■ REQUIRED_FILESYSTEMS
A shell list of additional file system mount points required by the data service. If this
variable exists, the DSM script templates will check for availability of these file
systems, and will fail if the file systems are not mounted.
Note: The SC2.2 infrastructure manages disk groups (VxVM) and disk sets (SDS),
administrative file systems, data service file system ownership, and availability of
each logical host.
Note: DSM script debugging can be turned on by setting the variable kDEBUG to 1,
and turned off by setting the variable to 0.
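As an illustration only, a configuration file for the ha_webserver data service
described earlier might contain the following. The program paths and values shown
here are hypothetical; the STOP_PROGRAM and PMF_* variables are ones used by the
template scripts later in this appendix.
# ha_webserver.config
SERVICE_NAME=ha_webserver
START_PROGRAM="/opt/ha_web/bin/start_web"
STOP_PROGRAM="/opt/ha_web/bin/stop_web"
REQUIRED_FILESYSTEMS="/ha_web/docs /ha_web/logs"
PMF_RETRIES=5
PMF_PERIOD=60
PMF_SIGNAL=TERM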
[Flow diagram of the START_NET DSM logic: extract the configuration file name;
verify that the data service is registered with a logical host, that all required file
systems are mounted, and that the start program name is set and executable, exiting
with an error if any check fails; if the data service is not already running, launch the
start program; exit without error.]
#!/bin/sh
# generic START_NET script for HA data services Version 1.3
# Copyright 1999 Sun Microsystems, Inc. All rights reserved.
# This script can be used as a template for developing a START_NET
# DSM for SunCluster 2.2 data services. The intention is to help reduce the
# amount of “boilerplate” work before a production script is developed.
# START_NET is called by clustd as part of the cluster reconfiguration
# process. This script can be invoked several times during a cluster
# reconfiguration process by the same node, and it should be written to
# manage a wide range of cluster states.
# When the START_NET script is launched, it is called with the
# following arguments:
# $1 = comma-separated list of the logical hosts which should be
# mastered by this physical node
# $2 = comma-separated list of the logical hosts which should NOT
# be mastered by this physical node
# $3 = time in seconds until clustd will time out the process.
# This script makes the following assumptions:
# - the script is named <dataservice>_start_net, where <dataservice>
# is the unique data service name
# - the file <dataservice>.config exists in the same directory as this
# script and it defines at the very least the variables SERVICE_NAME
# and START_PROGRAM
# - the data service program can be started, stopped, and queried
# using pmfadm
# If your data service does not support the above assumptions, then this script
# should be modified appropriately.
#########################################################################
# debug function
debug()
{
if [ "$kDEBUG" = "1" ] ; then
logger -p ${SYSLOG}.debug "${PROG} [$$]: $1"
fi
}
#########################################################################
# error function
doerror()
{
debug $1
logger -p ${SYSLOG}.err "${PROG} [$$]: $1"
}
# Check that the administrative file system of each required logical host
# is mounted correctly. Yield failure if a required file system is not mounted.
CURRENT_MOUNTS=`mount | awk -e '{print $1}' | xargs echo `
for required_host in ${SERVICE_HOSTS} ; do
neededMount=`haget -f pathprefix -h ${required_host}`
foundMount=0
debug "searching for filesystem on ${neededMount}"
for mountPoint in ${CURRENT_MOUNTS} ; do
debug " - considering ${mountPoint}"
if [ "${mountPoint}" = "${neededMount}" ] ; then
debug " - found ${mountPoint}"
foundMount=1
fi
done
# Fail if no mounted file system matches a required file system
if [ ${foundMount} -ne 1 ] ; then
doerror "administrative filesystem ${neededMount} for logical \
host ${required_host} is not mounted"
exit 1
fi
done
# Check that additional file systems, if defined, are mounted and available
# to this physical node. Required file systems are defined as a word list in the
# config file
if [ -n "${REQUIRED_FILESYSTEMS}" ] ; then
for neededMount in ${REQUIRED_FILESYSTEMS} ; do
foundMount=0
debug "searching for filesystem on ${neededMount}"
for mountPoint in ${CURRENT_MOUNTS} ; do
debug " - considering ${mountPoint}"
if [ "${mountPoint}" = "${neededMount}" ] ; then
debug " - found ${mountPoint}"
foundMount=1
fi
done
# Yield failure if no mounted file system matches requirements
if [ ${foundMount} -ne 1 ] ; then
doerror "required extra filesystem ${neededMount} is not mounted"
exit 1
fi
done
fi
# Any extra checking, e.g., testing to see if an additional config file is accessible,
# should be placed in this section. After verifying that additional testing is
# successful we can proceed to verify if a data service is already running.
debug “additional checking passed OK”
# Check and exit if data service has already been started.
if `pmfadm -q ${SERVICE_NAME}` ; then
debug "${SERVICE_NAME} is already running on this node"
exit 0
fi
########################################################################
# At this point we can ensure the data service is not already running and
# it can be started.
debug "data service ${SERVICE_NAME} is not already running"
# Ensure that the START program can actually be executed; otherwise,
# complain and exit with an error.
if [ -n "${START_PROGRAM}" ] ; then
executable=`echo ${START_PROGRAM} | awk -e '{print $1}'`
if [ ! -x "${executable}" ] ; then
doerror "cannot execute ${executable} "
exit 1
fi
fi
# Parse any arguments that were set by the config file. Exit if data service
# cannot be started.
retries=""
period=""
action=""
if [ -n "${PMF_RETRIES}" ] ; then
retries="-n ${PMF_RETRIES}"
if [ -n "${PMF_PERIOD}" ] ; then
period="-t ${PMF_PERIOD}"
[Flow diagram of the STOP_NET DSM logic: extract the configuration file name; exit
without error if this node has no unmastered logical hosts; verify that all of the data
service's logical hosts appear in the list of hosts this node should give up and that all
required file systems are mounted, exiting with an error otherwise; set the signals for
pmfadm and stop PMF monitoring; launch the stop program if one is defined; exit with
an error if the data service did not stop (a check that must be implemented for each
data service), otherwise exit without error.]
#!/bin/sh
# generic STOP_NET script for HA data services Version 1.3
# Copyright 1999 Sun Microsystems, Inc. All rights reserved.
# This script can be used as a template for developing a STOP_NET
# DSM for SunCluster 2.2 data services. The intention is to help reduce the
# amount of “boilerplate” work before a production script is developed.
# STOP_NET is called by clustd as part of the cluster reconfiguration
# process. This script can be invoked several times during a cluster
# reconfiguration process by the same node, and it should be written to
# manage a wide range of cluster states.
# When the STOP_NET script is launched, it is called with the
# following arguments:
# $1 = comma-separated list of the logical hosts which should be
# mastered by this physical node
# $2 = comma-separated list of the logical hosts which should NOT
# be mastered by this physical node
# $3 = time in seconds until clustd will time out the process.
# This script makes the following assumptions:
# - the script is named <dataservice>_stop_net, where <dataservice>
# is the unique data service name
# - the file <dataservice>.config exists in the same directory as this
# script and it defines at the very least the variables SERVICE_NAME
# and START_PROGRAM
# - the data service program can be started, stopped, and queried
# using pmfadm
# If your data service does not support the above assumptions, then this script
# should be modified appropriately.
#########################################################################
# debug function
debug()
{
if [ "$kDEBUG" = "1" ] ; then
logger -p ${SYSLOG}.debug "${PROG} [$$]: $1"
fi
}
#########################################################################
# error function
doerror()
{
debug $1
logger -p ${SYSLOG}.err "${PROG} [$$]: $1"
}
#########################################################################
# The following variable should be set to 0 to turn debugging off
kDEBUG=1
# Now loop through all logical hosts required by this data service,
# and compare against the list of hosts not mastered by this node. Yield
# a failure if no logical hosts are found in the list of host not mastered.
NOT_MASTERED_HOSTS=`echo ${NOT_MASTERED} | tr ',' ' '`
for required_host in ${SERVICE_HOSTS} ; do
foundHost=0
debug "searching for ${required_host}"
for not_mastered_host in ${NOT_MASTERED_HOSTS} ; do
debug " - considering ${not_mastered_host}"
if [ "${not_mastered_host}" = "${required_host}" ] ; then
debug " - found ${required_host} = ${not_mastered_host}"
foundHost=1
fi
done
# fail if no NOT mastered host matched up with the required host
if [ ${foundHost} -ne 1 ] ; then
doerror "node does not deny mastering required logical host \
${required_host}"
exit 1
fi
done
########################################################################
# At this point we know the set of NOT mastered hosts contains the set
# of required hosts.
debug "logical hosts mastered OK"
# Check that the administrative file system of each required logical
# host is mounted. Yield failure if a file system is not mounted.
# This section is required since to STOP a data service, access to
# the data service’s shared files is probably required.
CURRENT_MOUNTS=`mount | awk -e '{print $1}' | xargs echo `
for required_host in ${SERVICE_HOSTS} ; do
neededMount=`haget -f pathprefix -h ${required_host}`
foundMount=0
debug "searching for filesystem on ${neededMount}"
for mountPoint in ${CURRENT_MOUNTS} ; do
debug " - considering ${mountPoint}"
if [ "${mountPoint}" = "${neededMount}" ] ; then
debug " - found ${mountPoint}"
foundMount=1
fi
done
# Fail if no mounted file system matches target file system
if [ ${foundMount} -ne 1 ] ; then
doerror "administrative filesystem ${neededMount} for logical host \
${required_host} is not mounted"
if [ -n "${PMF_SIGNAL}" ] ; then
signal=${PMF_SIGNAL}
else
signal="TERM"
fi
fi
# Stop monitoring the data service and ensure it did work.
pmfadm -s ${SERVICE_NAME} ${wait_time} ${signal}
if [ $? -ne 0 ] ; then
doerror "couldn't stop monitoring \"${START_PROGRAM}\". \
error was $? continuing anyway"
fi
if [ -n "${STOP_PROGRAM}" ] ; then
${STOP_PROGRAM}
fi
########################################################################
# At this point the data service has been instructed to stop.
debug "data service has been instructed to stop through the ${STOP_PROGRAM}"
[Flow diagram of the ABORT_NET DSM logic: verify that the service name is set
correctly, that all required logical hosts are mastered by this physical host, and that all
required file systems are mounted, exiting with an error if any check fails; if the abort
program is not executable, clear the ABORT_PROGRAM variable; stop PMF monitoring
and launch the abort program; exit with an error if the data service did not stop.]
#!/bin/sh
# generic ABORT_NET script for HA data services Version 1.3
# Copyright 1999 Sun Microsystems, Inc. All rights reserved.
# This script can be used as a template for developing an ABORT_NET
# DSM for SunCluster 2.2 data services. The intention is to help reduce the
# amount of “boilerplate” work before a production script is developed.
# ABORT_NET is called by clustd when the node encounters certain problems.
# This script can be invoked at any time, including midway processing for the
# START and STOP DSM processing. ABORT_NET is only called on the physical
# node invoking the routine.
# When the ABORT_NET script is launched, it is called with the
# following arguments:
# $1 = comma-separated list of the logical hosts which should be
# mastered by this physical node
# $2 = comma-separated list of the logical hosts which should NOT
# be mastered by this physical node
# $3 = time in seconds until clustd will time out the process.
# This script makes the following assumptions:
# - the script is named <dataservice>_abort_net, where <dataservice>
# is the unique data service name
# - the file <dataservice>.config exists in the same directory as this
# script and it defines at the very least the variables SERVICE_NAME
# and START_PROGRAM
# - the data service program can be started, stopped, and queried
# using pmfadm
# If your data service does not support the above assumptions, then this script
# should be modified appropriately.
########################################################################
# debug function
debug()
{
if [ "$kDEBUG" = "1" ] ; then
logger -p ${SYSLOG}.debug "${PROG} [$$]: $1"
fi
}
########################################################################
# error function
doerror()
{
debug $1
logger -p ${SYSLOG}.err "${PROG} [$$]: $1"
}
#########################################################################
# The following variable should be set to 0 to turn debugging off
kDEBUG=1
PATH=/usr/sbin:/usr/bin:/opt/SUNWcluster/bin
if [ -n "${PMF_SIGNAL}" ] ; then
signal=${PMF_SIGNAL}
else
signal="KILL"
fi
fi
# Stop monitoring the data service and verify it worked
pmfadm -s ${SERVICE_NAME} ${wait_time} ${signal}
if [ $? -ne 0 ] ; then
doerror "couldn't stop monitoring \"${START_PROGRAM}\". \
error was $? continuing anyway"
fi
if [ -n "${ABORT_PROGRAM}" ] ; then
${ABORT_PROGRAM}
fi
########################################################################
# At this point the data service has been instructed to stop.
debug "data service has been instructed to stop"
# This section should be extended with code to ensure that the data
# service has come to a proper stop and that any leftover files have
# been cleaned up. There is no simple way to generalize this and you need
# to resort to application expertise to make sure an appropriate cleanup
# has taken place.
########################################################################
# at this point the data service has stopped
debug "data service has stopped"
exit 0
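The cleanup mentioned in the comments above must be written for each application.
As a hypothetical sketch only (assuming the application records its process ID in a
PID_FILE variable defined in the configuration file; this variable is not part of the
templates), the cleanup might remove a leftover PID file:
if [ -n "${PID_FILE}" -a -f "${PID_FILE}" ] ; then
debug "removing leftover pid file ${PID_FILE}"
rm -f "${PID_FILE}"
fi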
Fault Monitoring
The START_NET, STOP_NET, and ABORT_NET DSM template scripts previously
discussed provide the basic functions required to start, stop, and abort applications
associated with a data service. The following section provides a basic DSM script used
for application monitoring and management using the Sun Cluster process
monitoring facility (PMF).
Note: If a running application (or any of its child processes) dies, the PMF
framework can restart it.
Note: This section assumes the fault monitor process is executed on the same
physical node as the data service.
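As background, the following commands illustrate typical pmfadm(1m) usage; the
nametag, program path, and option values are examples only, and the option forms
shown follow the pmfadm usage in the templates in this appendix:
# pmfadm -c ha_webserver -n 5 -t 60 /opt/ha_web/bin/start_web
# pmfadm -q ha_webserver
# pmfadm -s ha_webserver TERM
The first command starts the (hypothetical) start_web program under the nametag
ha_webserver and restarts it a limited number of times within the retry period if it
dies; the second returns success while the nametag is still being monitored; the third
stops monitoring and sends SIGTERM to any remaining processes.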
[Flow diagram of the FM_START DSM logic: calculate the configuration file name; exit
without error if this node does not master any logical hosts; if the fault probe is not
already running, launch the probe program; exit without error.]
#!/bin/sh
# generic FM_START script for HA data services Version 1.3
# Copyright 1999 Sun Microsystems, Inc. All rights reserved.
# This script can be used as a template for developing an FM_START
# DSM for SunCluster 2.2 data services. The intention is to help reduce the
# amount of “boilerplate” work before a production script is developed.
# FM_START is called by clustd as part of the cluster reconfiguration
# process. This script can be invoked several times during a cluster
# reconfiguration process by the same node, and it should be written to
# manage a wide range of cluster states.
# When the FM_START script is launched, it is called with the
# following arguments:
# $1 = comma-separated list of the logical hosts which should be
# mastered by this physical node
# $2 = comma-separated list of the logical hosts which should NOT
# be mastered by this physical node
# $3 = time in seconds until clustd will timeout the process.
# This script makes the following assumptions:
# - the script is named <dataservice>_fm_start, where <dataservice>
# is the unique data service name
# - the file <dataservice>.config exists in the same directory as this
# script and it defines at the very least the variables SERVICE_NAME
# and START_PROGRAM
# - the data service program can be started, stopped and queried
# using pmfadm
# - this FM_START DSM will execute on the same physical system where the
# data service is running.
# If your data service does not support the above assumptions, then this script
# should be modified appropriately. For this script to perform any useful action, a
# fault probe program must be developed, and registered in the configuration file as
# FAULT_PROBE=fault_probe_program
########################################################################
# debug function
debug()
{
if [ "$kDEBUG" = "1" ] ; then
logger -p ${SYSLOG}.debug "${PROG} [$$]: $1"
fi
}
########################################################################
# error function
doerror()
{
debug $1
logger -p ${SYSLOG}.err "${PROG} [$$]: $1"
}
[Flow diagram of the FM_STOP DSM logic: extract the configuration file name; verify
that the service name is set correctly and that the data service is registered with a
logical host, exiting with an error otherwise; if the fault probe is running, stop fault
monitoring; otherwise exit without error.]
#!/bin/sh
# generic FM_STOP script for HA data services Version 1.0
# Copyright 1999 Sun Microsystems, Inc. All rights reserved.
#
# This script can be used as a template for developing an FM_STOP
# DSM for SunCluster 2.2 data services. The intention is to help reduce the
# amount of “boilerplate” work before a production script is developed.
#
# FM_STOP is called by clustd as part of the cluster reconfiguration
# process. This script can be invoked several times during a cluster
# reconfiguration process by the same node, and it should be written to
# manage a wide range of cluster states.
# When the FM_STOP script is launched, it is called with the
# following arguments:
# $1 = comma-separated list of the logical hosts which should be
# mastered by this physical node
# $2 = comma-separated list of the logical hosts which should NOT
# be mastered by this physical node
# $3 = time in seconds until clustd will time out the process.
# This script makes the following assumptions:
# - the script is named <dataservice>_fm_stop, where <dataservice>
# is the unique data service name
# - the file <dataservice>.config exists in the same directory as this
# script and it defines at the very least the variables SERVICE_NAME
# and START_PROGRAM
# - the data service program can be started, stopped and queried
# using pmfadm
# If your data service does not support the above assumptions, then this script
# should be modified appropriately.
########################################################################
# debug function
debug()
{
if [ "$kDEBUG" = "1" ] ; then
logger -p ${SYSLOG}.debug "${PROG} [$$]: $1"
fi
}
########################################################################
# error function
doerror()
{
debug $1
logger -p ${SYSLOG}.err "${PROG} [$$]: $1"
}
########################################################################
# The following variable should be set to 0 to turn debugging off
kDEBUG=1
PATH=/usr/sbin:/usr/bin:/opt/SUNWcluster/bin
PROG=`basename $0`
BASEDIR=`dirname $0`
SYSLOG=`haget -f syslog_facility`
MASTERED=$1
NOT_MASTERED=$2
TIMEOUT=$3
########################################################################
debug "called as ${PROG} \"${MASTERED}\" \"${NOT_MASTERED}\" ${TIMEOUT}"
## Strip the DSM suffix from the program name to build the config file name,
# then extract its contents to build required variables.
CONFIG_FILE=`echo ${PROG} | sed s/_fm_stop/.config/`
if [ ! -r "${BASEDIR}/${CONFIG_FILE}" ] ; then
doerror "cannot read config file ${BASEDIR}/${CONFIG_FILE}"
exit 1
fi
debug "reading config file ${BASEDIR}/${CONFIG_FILE}"
. ${BASEDIR}/${CONFIG_FILE}
# fail if service name does not exist
if [ -z "${SERVICE_NAME}" ] ; then
doerror "cannot determine service name"
exit 1
fi
FM_NAME="${SERVICE_NAME}_FM"
# exit peacefully if all logical hosts are mastered by this node
if [ -z "${NOT_MASTERED}" ] ; then
debug "all logical hosts are mastered by this node"
exit 0
fi
# fail if there are no logical hosts assigned to this service
SERVICE_HOSTS=`hareg -q ${SERVICE_NAME} -H | xargs echo `
if [ -z "${SERVICE_HOSTS}" ] ; then
doerror "no logical hosts assigned to ${SERVICE_NAME}"
exit 1
fi
########################################################################
# At this point the trivial cases of failure have been managed and
# we can start verifying appropriate node resource allocation, e.g.,
# logical host mastering
debug "trivial failure cases passed OK"
# Now loop through all logical hosts required by this data service,
Summary
If an HA-agent for a specific application is not available from Sun Microsystems or
third-party partners, the Sun Cluster 2.2 HA API enables any application (after being
qualified) to be managed within the highly available SC2.2 framework. This
functionality is a useful resource when business-critical and mission-critical
applications are bound to HA requirements.
The approach of associating a data service with a particular logical host (which in
turn is bound to a pre-defined set of IP addresses and disk resources) can become
difficult to control if the underlying DSMs managing the application are complex.
The purpose of this appendix and the “Sun Cluster 2.2 Data Services” chapter is to
familiarize the reader with the processes involved in managing a data service and to
assist in planning a customized data service infrastructure.
bus (1) A circuit over which data or power is transmitted; often acts as a
common connection among a number of locations. (2) A set of parallel
communication lines that connects major components of a computer
system including CPU, memory, and device controllers.
export To write data from a database into a transportable operating system file.
An import instruction reads data from this file back into the database.
failure rate The inverse of either the hardware or system mean time between
failures.
fibre channel
arbitrated loop
(FCAL) Also known as an FC-AL. A high bandwidth, serial communication
technology that supports 100 Mbyte/sec per loop. Most systems
support dual-loop connections, which enable a total bandwidth of 200
Mbyte/sec.
Gbyte One billion bytes. In reference to computers, bytes are often expressed
in multiples of powers of two. Therefore, a Gbyte can also be 1024
Mbytes, where a Mbyte is considered to be 2^20 (or 1,048,576) bytes.
import To read data from an operating system file into a database. The
operating system file is originally created by performing an export on
the database.
infant mortality The tendency of a component to exhibit a high failure rate during the
early stages of its life.
instantaneous
availability The probability of a system performing its intended function at any
instant in time, t.
kernel The core of the operating system software. The kernel manages the
hardware (for example, processor cycles and memory) and enables
fundamental services such as filing.
local area network Also known as LAN. A group of computer systems in close proximity
that can communicate with one another via connecting hardware and
software.
local host The CPU or computer on which a software application is running; the
workstation.
mean time between
interruptions The average time between interruptions, that is, temporary system
outages where repairs are not required.
Network file system (NFS) A technology developed by Sun Microsystems designed to give
users high-performance, transparent access to server file systems on
global networks.
Point-to-point
protocol Also known as PPP. The successor to SLIP, PPP provides router-to-
router and host-to-network connections over both synchronous and
asynchronous circuits.
remote procedure
call Also known as RPC. A common paradigm for implementing the client/
server model of distributed computing. A request sent to a remote
system to execute a designated procedure using supplied arguments,
with the results being returned to the caller. There are many variations
of RPC, which have resulted in a variety of different protocols.
service level
agreement Also known as an SLA. A formalized agreement between an IT
department and one of its customers that defines services provided,
such as level of availability of computing resources.
Solaris Operating
Environment Also known as Solaris OE. An operating environment from Sun
Microsystems, Inc.
Starfire The Sun Enterprise 10000, a high-end SMP server that scales to 64
processors and supports dynamic system domains for system
partitioning.
storage area
network Also known as a SAN. A high-speed dedicated network with a direct
connection between storage devices and servers. This approach enables
storage and tape subsystems to be connected remotely from a server.
Tape SANs enable efficient sharing of tape resources. Both the backup
and restore tool and the tape library require SAN support to make this
sharing possible.
Sun Enterprise
10000 Sun Enterprise 10000 server, also known as the Starfire.
Sun-Netscape
Alliance A business alliance between Sun Microsystems, Inc. and the Netscape
division of AOL. See http://www.iplanet.com.
swap To write the active pages of a job to external storage (swap space) and
to read pages of another job from external page storage into real
storage.
symmetric
multiprocessing Also known as SMP. A method of using multiple processors in
computer systems to deliver performance levels more predictable than
MPP systems. SMP is efficient, and does not require specialized
techniques for implementation.
Wide Area Network Also known as WAN. A network consisting of many systems providing
file transfer services. This type of network can cover a large physical
area, sometimes spanning the globe.
Index
/var/opt/SUNWcluster/ccd/ccd.log 120 C
/var/opt/SUNWcluster/oracle 221
C language 282
/var/opt/SUNWcluster/scadmin.log 120
cache-coherent, non uniform memory 32
A cached quick I/O 107
campus cluster recovery 29
ac system power 22
carrier sense multiple access with collision
address bus 79
detection (CSMA/CD) 91
address resolution protocol (ARP) 42
CCD
administrative files 277
communication 134
administrative volume and file system creation
configuration locking 288
270
configurator 196
administrative workstation 55
framework 38
Alteon switches 23
logs 120
alternate pathing 13
ccdd(1M) command 288
architecture.com 23
ccdd(1M) daemon 292
ARP (address resolution protocol) 42
CC-NUMA (cache-coherent, non uniform
ask policy 59, 60
memory access) 32
ASR (automatic system recovery) 6, 10, 79
ccp(1) 244
asymmetric cluster configurations 199
ccp(1M) command 248
asymmetric HA configuration 50
CDB 133–155
asynchronous transfer mode (ATM) 92
inconsistencies 135
ATM (asynchronous transfer mode) 92
CDE console terminal emulation 239
automatic SCSI ID slot assignment 97
centerplane 78
automatic SCSI termination 97
change control practice 26
Automatic System Recovery (ASR) 6, 10, 79
changing SCSI initiator ID 322
availability 1
changing system parameters 26
five nine system 8
clustd (1M) daemon 295
intrinsic 12
clustd(1M) command 288
mathematical formula 9
cluster API (application programming interface)
model 2
309
uptime ratio 8
cluster application data
B shared-nothing databases 36
backplane 78 cluster daemons 288
backup and restore testing 28 cluster database (CDB) 133–155
basic data service 339 cluster dfstab file 278
battery status 99 cluster interconnect (SC3.0) 305
Bellcore NEBS 98 cluster membership 38, 58
block level incremental backup 106 cluster membership monitor (CMM) 38, 114–
board LED diagnosis 79 136, 149–150
boot drive 17 cluster node configuration 18
Bourne shell script templates 339 cluster partition 130
brownout 22 cluster reconfiguration 288
cluster recovery procedure 29
cluster topologies 304
cluster topology 31 pmfadm(1M) 299
cluster volume manager (CVM) 45, 197 pnmptor nafoname 118
clustered pairs 305 pnmrtop 115
clustered system is its scalability 34 pnmrtop instname 118
CMM (cluster membership monitor) 38, 114– pnmset 156
136, 149–150 pnmset -f 157
commands pnmset -n 157
# haswitch 280 pnmset -p 157
ccdd(1M) 288 pnmset -pntv 115, 118
ccp(1) 244 pnmset -v 157
ccp(1M) 248 pnmstat 115
clustd(1M) 288 pnmstat -l 115, 117
crontab(1) 120 probe-all 324
date 120 rcp(1) 258
devlinks 308 rdate 120
disk 308 rename(2) 284
drvconfig 308 rsh(1) 258
dtterm 239 scadmin 122, 130, 211
edit config.annex 239 scadmin abortpartition nodename
eeprom 323 clustername 131
find_devices 183 scadmin continuepartition node-
finddevices 115 name clustername 131
finddevices | grep 115 scadmin startcluster 122
finddevices disks 118 scadmin startcluster -a 122
fsck 132 scadmin startnode 128, 131
get_node_status 114,117,122 scadmin startnode clustname 122
haget(1M) 299 scadmin stopnode 126
haoracle DB login 216 scadmin stopnode pecan 127
haoracle list 217 scconf 114, 126, 208
hareg 114, 118, 215 scconf -A 183
hareg -n 169 scconf clustername -L 206
hareg -q 116, 118 scconf clustername -l
hareg -s -r oracle -h lhost1 215 new_timeout_value 132
hareg -u 170 scconf clustername -p 114
hareg(1M) 292 scconf clustname -F logical_host
hastat 114, 115, 117, 209 DG_NAME 164
haswitch 125 scconf -F 163
ifconfig 42, 127, 172 scconf -H 183
install-console 325 scconf -i 127
mv(1) 284 scconf -L 160
ntpdate 120 scconf -p 114, 117, 183
ping coins 238 scconf pecan -i pie hme0 qfe5 127
pkginfo -l SUNWscora 203 scconf -q 123, 179, 183
pmfadm -s 299 scconf -R 170
qfe1 127 S
quad fast Ethernet 232 SAP 108
quorum device 58, 122, 123 Sbus 325
problems 123 SBus I/O board 85
R SC 2.2 architecture 31
SC2.2 API 47
RAID 0+1 103 SC2.2 hardware 32
RAID Manager 6 (rm6) 97 SC2.2 See Sun Cluster 2.2 77
RAID5 VxVM 265 SC2.2 software 31, 37
RAS 1, 77 SC2000E 79
RAS engineering group 1 SC2000E (SPARCcenter™ 2000E) 79
RAS organization 29 SCA (single connector attachment) 97
RAS service level agreement 4 SCA-2 (single connector) 99
rcp(1) 258
rdate 120