VERITAS Cluster Server for UNIX, Fundamentals (Lessons)
HA-VCS-410-101A-2-10-SRT (100-002149-A)
COURSE DEVELOPERS
Bilge Gerrits
Siobhan Seeger
Dawn Walker

LEAD SUBJECT MATTER EXPERTS
Geoff Bergren
Connie Economou
Paul Johnston
Dave Rogers
Jim Senicka
Pete Toemmes

TECHNICAL CONTRIBUTORS AND REVIEWERS
Billie Bachra
Barbara Ceran
Bob Lucas
Gene Henriksen
Margy Cassidy

Disclaimer
The information contained in this publication is subject to change without notice. VERITAS Software Corporation makes no warranty of any kind with regard to this guide, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. VERITAS Software Corporation shall not be liable for errors contained herein or for incidental or consequential damages in connection with the furnishing, performance, or use of this manual.

Copyright
Copyright 2005 VERITAS Software Corporation. All rights reserved. No part of the contents of this training material may be reproduced in any form or by any means or be used for the purposes of training or education without the written permission of VERITAS Software Corporation.

Trademark Notice
VERITAS, the VERITAS logo, and VERITAS FirstWatch, VERITAS Cluster Server, VERITAS File System, VERITAS Volume Manager, VERITAS NetBackup, and VERITAS HSM are registered trademarks of VERITAS Software Corporation. Other product names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

VERITAS Cluster Server for UNIX, Fundamentals
Participant Guide
April 2005 Release

VERITAS Software Corporation
350 Ellis Street
Mountain View, CA 94043
Phone 650-527-8000
www.veritas.com
Table of Contents
Course Introduction
VERITAS Cluster Server Curriculum ................................................................ Intro-2
Course Prerequisites......................................................................................... Intro-3
Course Objectives............................................................................................. Intro-4
Certification Exam Objectives........................................................................... Intro-5
Cluster Design Input .......................................................................................... Intro-6
Sample Design Input.......................................................................................... Intro-7
Sample Design Worksheet................................................................................. Intro-8
Lab Design for the Course ................................................................................ Intro-9
Lab Naming Conventions ................................................................................ Intro-10
Classroom Values for Labs............................................................................... Intro-11
Course Overview............................................................................................. Intro-12
Legend ............................................................................................................ Intro-15
Lesson 2: Preparing a Site for VCS
Planning for Implementation ................................................................................... 2-4
Implementation Needs .............................................................................................. 2-4
The Implementation Plan .......................................................................................... 2-5
Using the Design Worksheet..................................................................................... 2-6
Hardware Requirements and Recommendations ................................................... 2-7
SCSI Controller Configuration for Shared Storage .................................................. 2-9
Hardware Verification............................................................................................ 2-12
Software Requirements and Recommendations................................................... 2-13
Software Verification ............................................................................................. 2-15
Preparing Cluster Information ............................................................................... 2-16
VERITAS Security Services .................................................................................. 2-17
Lab 2: Validating Site Preparation ........................................................................ 2-19
Lesson 11: Configuring VCS Response to Resource Faults
Introduction ........................................................................................................... 11-2
VCS Response to Resource Faults ...................................................................... 11-4
Failover Decisions and Critical Resources ............................................................. 11-4
How VCS Responds to Resource Faults by Default............................................... 11-5
The Impact of Service Group Attributes on Failover.............................................. 11-7
Practice: How VCS Responds to a Fault............................................................... 11-10
Determining Failover Duration ............................................................................. 11-11
Failover Duration on a Resource Fault ................................................................. 11-11
Adjusting Monitoring............................................................................................ 11-13
Adjusting Timeout Values .................................................................................... 11-14
Controlling Fault Behavior................................................................................... 11-15
Type Attributes Related to Resource Faults.......................................................... 11-15
Modifying Resource Type Attributes.................................................................... 11-18
Overriding Resource Type Attributes ................................................................... 11-19
Recovering from Resource Faults....................................................................... 11-20
Recovering a Resource from a FAULTED State .................................................. 11-20
Recovering a Resource from an ADMIN_WAIT State ........................................ 11-22
Fault Notification and Event Handling ................................................................. 11-24
Fault Notification .................................................................................................. 11-24
Extended Event Handling Using Triggers ............................................................ 11-25
The Role of Triggers in Resource Faults .............................................................. 11-25
Lab 11: Configuring Resource Fault Behavior .................................................... 11-28
Index
VERITAS Cluster Server Curriculum:
VERITAS Cluster Server, Fundamentals
VERITAS Cluster Server, Implementing Local Clusters
High Availability Design Using VERITAS Cluster Server
VERITAS Cluster Server Agent Development
Disaster Recovery Using VVR 4.0 and Global Cluster Option
Course Prerequisites
This course assumes that you have an administrator-level understanding of one or
more UNIX platforms. You should understand how to configure systems, storage
devices, and networking in multiserver environments.
Course Objectives
In the VERITAS Cluster Server for UNIX, Fundamentals course, you are given a
high availability design to implement in the classroom environment using
VERITAS Cluster Server.
The course simulates the job tasks you perform to configure a cluster, starting with
preparing the site and application services that will be made highly available.
Lessons build upon each other, exhibiting the processes and recommended best
practices you can apply to implementing any cluster design.
The core material focuses on the most common cluster implementations. Other
cluster designs emphasizing additional VCS capabilities are provided to illustrate
the power and flexibility of VERITAS Cluster Server.
Web Service design requirements:
Start up on system S1.
Restart the Web server process 3 times before faulting it.
Fail over to S2 if any resource faults.
Notify patg@company.com if any resource faults.
Components required to provide the Web service: IP address 192.168.3.132, Mount /web, NIC eri0, Volume WebVol, and Disk Group WebDG.
Example: main.cf
group WebSG (
SystemList = { S1 = 0, S2 = 1 }
AutoStartList = { S1 }
)
IP WebIP (
Device = eri0
Address = 192.168.3.132
Netmask = 255.255.255.0
)
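The remaining resources from the Web service design can be sketched in the same format. This is only a sketch based on the design values shown above; the resource names other than WebIP are illustrative, and the full attribute lists for each type are covered later in the course.
Mount WebMount (
MountPoint = "/web"
BlockDevice = "/dev/vx/dsk/WebDG/WebVol"
FSType = vxfs
FsckOpt = "-y"
)
NIC WebNIC (
Device = eri0
)
Volume WebVolRes (
Volume = WebVol
DiskGroup = WebDG
)
DiskGroup WebDGRes (
DiskGroup = WebDG
)
WebIP requires WebNIC
WebMount requires WebVolRes
WebVolRes requires WebDGRes
The requires statements express the resource dependencies described later in this lesson: each parent resource is brought online only after its child resource.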
Lab naming conventions: the cluster vcsx contains the systems trainxx and trainxx; service groups include your_nameSG1, your_nameSG2, their_nameSG1, their_nameSG2, and NetworkSG.
Course Overview
This training provides comprehensive instruction on the installation and initial
configuration of VERITAS Cluster Server (VCS). The course covers principles
and methods that enable you to prepare, create, and test VCS service groups and
resources using tools that best suit your needs and your high availability
environment. You learn to configure and test failover and notification behavior,
cluster additional applications, and further customize your cluster according to
specified design criteria.
Course Resources
This course uses this participant guide, which contains the lessons presented by your
instructor and the lab exercises that enable you to practice your new skills.
Lab materials are provided in three forms, with increasing levels of detail to suit a
range of student expertise levels.
Appendix A: Lab Synopses has high-level task descriptions and design
worksheets.
Appendix B: Lab Details includes the lab procedures and detailed steps.
Appendix C: Lab Solutions includes the lab procedures and steps with the
corresponding command lines required to perform each step.
Appendix D: Job Aids provides supplementary material that can be used as
on-the-job guides for performing some common VCS operations.
Appendix E: Design Worksheet Template provides a blank design
worksheet.
Additional supplements may be used in the classroom or provided to you by your
instructor.
Course Platforms
This course material applies to the VCS platforms shown in the slide. Indicators
are provided in slides and text where there are differences in platforms.
Refer to the VERITAS Cluster Server user documentation for your platform and
version to determine which features are supported in your environment.
The legend symbols used in this guide represent:
Server, node, or cluster system (terms used interchangeably)
Storage
Application service
Cluster interconnect
VCS resource
Introduction
Overview
This lesson introduces basic VERITAS Cluster Server terminology and concepts,
and provides an overview of the VCS architecture and supporting communication
mechanisms.
Importance
The terms and concepts covered in this lesson provide a foundation for learning
the tasks you need to perform to deploy the VERITAS Cluster Server product, both
in the classroom and in real-world applications.
After completing this lesson, you will be able to:
Cluster Terminology: Define clustering terminology.
Cluster Communication: Describe cluster communication mechanisms.
Maintaining the Cluster Configuration: Describe how the cluster configuration is maintained.
VCS Architecture: Describe the VCS architecture.
Supported Failover Configurations: Describe the failover configurations supported by VCS.
Outline of Topics
Cluster Terminology
Cluster Communication
Maintaining the Cluster Configuration
VCS Architecture
Supported Failover Configurations
Cluster Terminology
A Nonclustered Computing Environment
An example of a traditional, nonclustered computing environment is a single
server running an application that provides public network links for client access
and data stored on local or SAN storage.
If a single component fails, application processing and the business service that
relies on the application are interrupted or degraded until the failed component is
repaired or replaced.
A cluster is a collection of multiple independent systems working together under a management framework for increased service availability. The components shown in the slide are the application, the nodes, the storage, and the cluster interconnect.
Definition of a Cluster
A clustered environment includes multiple components configured such that if one
component fails, its role can be taken over by another component to minimize or
avoid service interruption.
This allows clients to have high availability to their data and processing, which is
not possible in nonclustered environments.
The term cluster, simply defined, refers to multiple independent systems or
domains connected into a management framework for increased availability.
Clusters have the following components:
Up to 32 systems, sometimes referred to as nodes or servers
Each system runs its own operating system.
A cluster interconnect, which allows for cluster communications
A public network, connecting each system in the cluster to a LAN for client
access
Shared storage (optional), accessible by each system in the cluster that needs to
run the application
An application service is a collection
of all the hardware and software
components required to provide a
service.
If the service must be migrated to
another system, all components
need to be moved in an orderly
fashion.
Examples include Web servers,
databases, and applications.
Failover
The service group can be online on only one system at a time.
VCS migrates the service group at the administrator's request and in response to faults.
Parallel
The service group can be online on multiple cluster
systems simultaneously.
An example is Oracle Real Application Cluster (RAC).
Hybrid
This is a special-purpose type of service group used to manage
service groups in replicated data clusters (RDCs), which are
based on VERITAS Volume Replicator.
Definition of a Resource
Resources are VCS objects that correspond to hardware or software components,
such as the application, the networking components, and the storage components.
VCS controls resources through these actions:
Bringing a resource online (starting)
Taking a resource offline (stopping)
Monitoring a resource (probing)
Resource Categories
Persistent
None
VCS can only monitor persistent resources; they cannot be brought online
or taken offline. The most common example of a persistent resource is a
network interface card (NIC), because it must be present but cannot be
stopped. FileNone and ElifNone are other examples.
On-only
VCS brings the resource online if required, but does not stop it if the
associated service group is taken offline. NFS daemons are examples of
on-only resources. FileOnOnly is another on-only example.
Nonpersistent, also known as on-off
Most resources fall into this category, meaning that VCS brings them online
and takes them offline as required. Examples are Mount, IP, and Process.
FileOnOff is an example of a test version of this resource.
Resource dependencies determine the online and offline order of resources.
A parent resource depends on a child resource.
There is no limit to the number of parent and child resources.
Persistent resources, such as NIC, cannot be parent resources.
Dependencies cannot be cyclical.
Resource Dependencies
Resources depend on other resources because of application or operating system
requirements. Dependencies are defined to configure VCS for these requirements.
Dependency Rules
These rules apply to resource dependencies:
A parent resource depends on a child resource. In the diagram, the Mount
resource (parent) depends on the Volume resource (child). This dependency
illustrates the operating system requirement that a file system cannot be
mounted without the Volume resource being available.
Dependencies are homogenous. Resources can only depend on other
resources.
No cyclical dependencies are allowed. There must be a clearly defined
starting point.
Solaris:
mount -F vxfs /dev/vx/dsk/WebDG/WebVol /web
Resource Attributes
Resource attributes define the specific characteristics of individual resources. As
shown in the slide, the resource attribute values for the sample resource of type
Mount correspond to the UNIX command line to mount a specific file system.
VCS uses the attribute values to run the appropriate command or system call to
perform an operation on the resource.
Each resource has a set of required attributes that must be defined in order to
enable VCS to manage the resource.
For example, the Mount resource on Solaris has four required attributes that must
be defined for each resource of type Mount:
The directory of the mount point (MountPoint)
The device for the mount point (BlockDevice)
The type of file system (FSType)
The options for the fsck command (FsckOpt)
The first three attributes are the values used to build the UNIX mount command
shown in the slide. The FsckOpt attribute is used if the mount command fails. In
this case, VCS runs fsck with the specified options (-y) and attempts to mount
the file system again.
Some resources also have additional optional attributes you can define to control
how VCS manages a resource. In the Mount resource example, MountOpt is an
optional attribute you can use to define options to the UNIX mount command.
For example, if this is a read-only file system, you can specify -ro as the
MountOpt value.
The resource type
specifies the
attributes needed to
define a resource of
that type.
For example, a Mount
resource has different
properties than an IP
resource.
Solaris:
mount [-F FSType] [options] block_device mount_point
The slide shows example resource types and resources: Mount (/web and /log), IP (10.1.2.3), NIC (eri0), DiskGroup (WebDG), and Volume (WebVol and logVol), together with the online, offline, monitor, and clean operations.
The VERITAS Cluster Server Bundled Agents Reference Guide defines all VCS resource types for all bundled agents. See http://support.veritas.com for product documentation.
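On a running cluster, you can also list the resource types VCS knows about and examine a type's attribute definitions from the command line; for example:
hatype -list
hatype -display Mount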
Cluster Communication
VCS requires a cluster communication channel between systems in a cluster to
serve as the cluster interconnect. This communication channel is also sometimes
referred to as the private network because it is often implemented using a
dedicated Ethernet network.
VERITAS recommends that you use a minimum of two dedicated communication
channels with separate infrastructures (for example, multiple NICs and separate
network hubs) to implement a highly available cluster interconnect. Although
recommended, this configuration is not required.
The cluster interconnect has two primary purposes:
Determine cluster membership: Membership in a cluster is determined by
systems sending and receiving heartbeats (signals) on the cluster interconnect.
This enables VCS to determine which systems are active members of the
cluster and which systems are joining or leaving the cluster.
In order to take corrective action on node failure, surviving members must
agree when a node has departed. This membership needs to be accurate and
coordinated among active members; nodes can be rebooted, powered off,
faulted, and added to the cluster at any time.
Maintain a distributed configuration: Cluster configuration and status
information for every resource and service group in the cluster is distributed
dynamically to all systems in the cluster.
Cluster communication is handled by the Group Membership Services/Atomic
Broadcast (GAB) mechanism and the Low Latency Transport (LLT) protocol, as
described in the next sections.
LLT:
Is responsible for sending heartbeat messages
Transports cluster communication traffic to every active system
Balances traffic load across multiple network links
Maintains the communication link state
Is a nonroutable protocol
Runs on an Ethernet network
Low-Latency Transport
VERITAS uses a high-performance, low-latency protocol for cluster
communications. LLT is designed for the high-bandwidth and low-latency needs
of not only VERITAS Cluster Server, but also VERITAS Cluster File System, in
addition to Oracle Cache Fusion traffic in Oracle RAC configurations. LLT runs
directly on top of the Data Link Provider Interface (DLPI) layer over Ethernet and
has several major functions:
Sending and receiving heartbeats over network links
Monitoring and transporting network traffic over multiple network links to
every active system
Balancing cluster communication load over multiple links
Maintaining the state of communication
Providing a nonroutable transport mechanism for cluster communications
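After VCS is installed, you can check the state of the LLT links from the command line; a typical check (output varies by platform and configuration) is:
lltstat -nvv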
GAB:
Is a proprietary broadcast protocol
Uses LLT as its transport mechanism
The fencing module (VCS 4.x):
Monitors GAB to detect cluster membership changes
Ensures a single view of cluster membership
Prevents multiple nodes from accessing the same Volume Manager shared storage devices
(In the 4.x architecture, the kernel modules are the fencing driver, GAB, and LLT.)
(The slide compares the VCS stack with a TCP/IP application stack: HAD and hashadow run as user processes, GAB and LLT run as kernel processes, and the NIC is the hardware layer; an application such as iPlanet runs over TCP and IP on a NIC.)
LLT Versus IP
LLT is driven by GAB, has specific targets in its domain, and assumes a constant
connection between servers; that is, it is a connection-oriented protocol. IP is a
connectionless protocol; it assumes that packets can take different paths to reach
the same destination.
The cluster configuration is maintained in memory on all systems simultaneously by way of GAB using LLT.
The configuration is preserved on disk in the main.cf file.
A simple text file is used to store the cluster configuration on disk. The file contents are described in detail later in the course.
include "types.cf"
cluster vcs (
UserNames = { admin = ElmElgLimHmmKumGlj }
Administrators = { admin }
CounterInterval = 5
)
system S1 (
)
system S2 (
)
group WebSG (
SystemList = { S1 = 0, S2 = 1 }
)
Mount WebMount (
MountPoint = "/web"
BlockDevice = "/dev/vx/dsk/WebDG/WebVol"
FSType = vxfs
FsckOpt = "-y"
)
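If you edit main.cf directly (described later in the course), you can verify the syntax of the configuration files before VCS uses them; for example:
hacf -verify /etc/VRTSvcs/conf/config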
VCS Architecture
The slide shows how the major components of the VCS architecture work together
to manage application services.
Active/Passive
In this configuration, an application runs on a primary or master server. A
dedicated redundant server is present to take over on any failover. The redundant
server is not configured to perform any other functions.
The redundant server is on standby with full performance capability. The next
examples show types of active/passive configurations:
Before Failover
After Failover
N-to-1
This configuration reduces the cost of hardware redundancy while still providing a
dedicated spare. One server protects multiple active servers, on the theory that
simultaneous multiple failures are unlikely.
This configuration is used when shared storage is limited by the number of servers
that can attach to it and requires that after the faulted system is repaired, the
original configuration is restored.
N+1
When more than two systems can connect to the same shared storage, as in a SAN
environment, a single dedicated redundant server is no longer required.
When a server fails in this environment, the application service restarts on the
spare. Unlike the N-to-1 configuration, after the failed server is repaired, it can
then become the redundant server.
Active/Active
In an active/active configuration, each server is configured to run a specific
application service, as well as to provide redundancy for its peer.
In this configuration, hardware usage appears to be more efficient because there
are no standby servers. However, each server must be robust enough to run
multiple application services, increasing the per-server cost up front.
N-to-N
This configuration is an active/active configuration that supports multiple
application services running on multiple servers. Each application service is
capable of being failed over to different servers in the cluster.
Careful testing is required to ensure that all application services are compatible to
run with other application services that may fail over to the same server.
Summary
This lesson introduced the basic VERITAS Cluster Server terminology and gave
an overview of VCS architecture and supporting communication mechanisms.
Next Steps
Your understanding of basic VCS functions enables you to prepare your site for
installing VCS.
Additional Resources
High Availability Design Using VERITAS Cluster Server
This course will be available in the future from VERITAS Education if you are
interested in developing custom agents or learning more about high availability
design considerations for VCS environments.
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
Introduction
Overview
This lesson describes guidelines and considerations for planning to deploy
VERITAS Cluster Server (VCS). You also learn how to prepare your site for
installing VCS.
Importance
Before you install VERITAS Cluster Server, you must prepare your environment
to meet the requirements needed to implement a cluster. By following these
guidelines, you can ensure that your system hardware and software are configured
to install VCS.
Hardware Requirements and Recommendations: Describe general VCS hardware requirements.
Software Requirements and Recommendations: Describe general VCS software requirements.
Preparing Cluster Information: Collect cluster design information to prepare for installation.
Outline of Topics
Planning for Implementation
Hardware Requirements and Recommendations
Software Requirements and Recommendations
Preparing Cluster Information
Prepare the site for VCS installation as described throughout this lesson.
Consider how these activities may affect running services.
Prepare or complete a design worksheet that is used during VCS installation and configuration, if this worksheet is not provided.
Validate the design worksheet as you prepare the site.

Cluster Definition    Value
Cluster Name          vcs
Required Attributes
UserNames             admin=password
ClusterAddress        192.168.3.91
Administrators        admin

System Definition     Value
System                S1
System                S2
Redundant storage arrays
Uninterruptible power supplies
Identically configured systems
System type
Network interface cards
Storage HBAs
Networking
VERITAS Cluster Server requires a minimum of two heartbeat channels for the
cluster interconnect, one of which must be an Ethernet network connection. While
it is possible to use a single network and a disk heartbeat, the best practice
configuration is two or more network links.
Loss of the cluster interconnect results in downtime and, in nonfencing
environments, can result in a split-brain condition (described in detail later in the
course).
For a highly available configuration, each system in the cluster must have a
minimum of two physically independent Ethernet connections for the cluster
interconnect:
Two-system clusters can use crossover cables.
Clusters with three or more systems require hubs or switches.
You can use layer 2 switches; however, this is not a requirement.
Note: For clusters using VERITAS Cluster File System or Oracle Real Application
Cluster (RAC), VERITAS recommends the use of multiple gigabit interconnects
and gigabit switches.
In the example, one system uses the typical default scsi-initiator-id of 7 and the other system's controller is changed to 5.
Use unique SCSI IDs for each system.
Check the controller SCSI ID on both systems and the SCSI
IDs of the disks in shared storage.
Change the controller SCSI ID on one system, if necessary.
Shut down, cable shared disks, and reboot.
Verify that both systems can see all the shared disks.
SCSI Interfaces
Additional considerations for SCSI implementations:
Both differential and single-ended SCSI controllers require termination;
termination can be either active or passive.
All SCSI devices on a controller must be compatible with the controller; use
only differential SCSI devices on a differential SCSI controller.
Mirror disks on separate controllers for additional fault tolerance.
Configurations with two systems can use standard cables; a bus can be
terminated at each system with disks between systems.
Configurations with more than two systems require cables with connectors that
are appropriately spaced.
Main Menu:Enter command>ser scsi init path value
Main Menu:Enter command>ser scsi init 8/4 5
3 Use the ioscan -fn command to verify shared disks after the system
reboots.
Linux
1 Connect the disk to the first cluster system.
2 Power on the disk.
3 Connect a terminator to the other port of the disk.
4 Boot the system. The disk is detected while the system boots.
5 Press the key sequence for your adapter to bring up the SCSI BIOS settings for
that disk.
6 Set Host adapter SCSI ID = 7 or to an appropriate value for your configuration.
7 Set Host Adapter BIOS in Advanced Configuration Options to Disabled.
Hardware Verification
Hardware may have been installed but not yet configured, or improperly
configured. Basic hardware configuration considerations are described next.
Network
Test the network connections to ensure that each cluster system is accessible on the
public network. Also verify that the cluster interconnect is working by temporarily
assigning network addresses and using ping to verify communications. You must
use different IP network addresses to ensure that traffic actually uses the correct
interface.
Also, depending on the operating system, you may need to ensure that network
interface speed and duplex settings are hard set and auto negotiation is disabled.
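For example, on Solaris systems you might temporarily assign test addresses to one interconnect link on each node and verify connectivity; the interface name and addresses here are illustrative:
ifconfig qfe1 plumb
ifconfig qfe1 10.10.10.1 netmask 255.255.255.0 up
ping 10.10.10.2
Remove the temporary addresses when the test is complete so that LLT can use the links exclusively.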
Storage
VCS is designed primarily as a shared data high availability product. In order to
fail over an application from one system to another, both systems must have access
to the data storage.
Other considerations when checking hardware include:
Switched-fabric zoning configurations in a SAN
Active-active versus active-passive on disk arrays
Use identical configurations:
Configuration files
User accounts
Disabled abort sequence (Solaris)
ssh or rsh configured during installation
Use volume management software for storage.
Verify that the operating system and network configuration files are the same.
Obtain license keys from:
vlicense.veritas.com
A VERITAS sales representative
VERITAS Support (for upgrades)
Software Verification
Verify that the VERITAS products in the high availability solution are compatible
with the operating system versions in use or with the planned upgrades.
Verify that the required operating system patches are installed on the systems
before installing VCS.
Obtain VCS license keys.
You must obtain license keys for each cluster system to complete the license
process. For new installations, use the VERITAS vLicense Web site,
http://vlicense.veritas.com, or contact your VERITAS sales
representative for license keys. For upgrades, contact VERITAS Support.
Also, verify that you have the required licenses to run applications on all
systems where the corresponding service can run.
Verify that operating system and network configuration files are configured to
enable application services to run identically on all target systems. For
example, if a database needs to be started with a particular user account, ensure
that user account, password, and group files contain the same configuration for
that account on all systems that need to be able to run the database.
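For example, if a database runs under a dedicated account, you might compare its entries on each system; the account name here is illustrative:
grep oracle /etc/passwd /etc/group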
public network.
VxSS provides a single sign-on for authenticated user
accounts.
All cluster systems must be authentication broker nodes.
VERITAS recommends using a system outside the cluster
to serve as the root broker node.
Summary
This lesson described how to prepare sites and application services for use in the
VCS high availability environment. Performing these preparation tasks ensures
that the site is ready to deploy VCS, and helps illustrate how VCS manages
application resources.
Next Steps
After you have prepared your operating system environment for high availability,
you can install VERITAS Cluster Server.
Additional Resources
VERITAS Cluster Server Release Notes
The release notes provide detailed information about hardware and software
supported by VERITAS Cluster Server.
VERITAS Cluster Server Installation Guide
This guide provides detailed information about installing VERITAS Cluster
Server.
http://support.veritas.com
Check the VERITAS Support Web site for supported hardware and software
information.
Visually inspect the classroom lab site.
Complete and validate the design worksheet.
Use the lab appendix best suited to your experience level:
Appendix A: Lab Synopses
Appendix B: Lab Details
Appendix C: Lab Solutions
System Definition    Sample Value    Your Value
System               train1
System               train2
See the next slide for lab assignments.
Goal
The purpose of this lab is to prepare the site, your classroom lab systems, for VCS
installation.
Results
The system requirements are validated, the interconnect is configured, and the
design worksheet is completed and verified.
Prerequisites
Obtain any classroom-specific values needed for your classroom lab environment
and record these values in your design worksheet included with the lab exercise
instructions.
Introduction
Overview
This lesson describes the automated VCS installation process carried out by the
VERITAS Common Product Installer.
Importance
Installing VCS is a simple, automated procedure in most high availability
environments. The planning and preparation tasks you perform prior to starting the
installation process ensure that VCS installs quickly and easily.
Outline of Topics
Using the VERITAS Common Product Installer
VCS Configuration Files
Viewing the Default VCS Configuration
Other Installation Considerations
The installvcs utility requires remote root access to other systems in the cluster while the script is being run.
The /.rhosts file: You can remove .rhosts files after VCS installation.
ssh: No prompting is permitted.
Options to installvcs
The installvcs utility supports several options that enable you to tailor the
installation process. For example, you can:
Perform an unattended installation.
Install software packages without configuring a cluster.
Install VCS in a secure environment.
Upgrade an existing VCS cluster.
For a complete description of installvcs options, see the VERITAS Cluster
Server Installation Guide.
User input to the script:
Select optional packages.
Configure the cluster (name, ID, interconnect).
Select root broker node.
Set up VCS user accounts.
Configure the Web GUI (device name, IP address, subnet mask).
Configure SMTP and SNMP notification.

What the script does:
Install VCS packages.
Configure VCS.
Start VCS.
When using ssh, it must be configured so that it operates without requests for passwords or passphrases.
Licensing VCS
The installation utility verifies the license status of each system. If a VCS license is
found on the system, you can use that license or enter a new license.
If no VCS license is found on the system, or you want to add a new license, enter a
license key when prompted.
Configuring Security
If you choose to configure VxSS security, you are prompted to select the root
broker node. The system acting as root broker node must be set up and running
before installing VCS in the cluster. All cluster nodes are automatically set up as
authentication broker nodes.
Directories                           Contents
/sbin, /usr/sbin, /opt/VRTSvcs/bin    Executables, scripts, libraries

Commonly used environment variables:
$VCS_CONF: /etc/VRTSvcs
$VCS_HOME: /opt/VRTSvcs/bin
$VCS_LOG:  /var/VRTSvcs
The default configuration contains the cluster name, all the systems where VCS is installed, and the information entered for the Web-based Cluster Manager:
include "types.cf"
cluster VCS (
)
system train1 (
)
system train2 (
)
group ClusterService (
)
IP webip (
Device = hme1
Address = "192.168.105.101"
NetMask = "255.255.255.0"
)
Log onto the VCS Web Console using the IP address specified during
installation:
http://IP_Address:8181/vcs
View the product documentation:
/opt/VRTSvcsdc
A S1 RUNNING 0
A S2 RUNNING 0
Viewing Status
After installation is complete, you can check the status of VCS components.
View VCS communications status on the cluster interconnect using LLT and
GAB commands. This topic is discussed in more detail later in the course. For
now, you can see that LLT is up by running the following command:
lltconfig
LLT is running
View GAB port a and port h memberships for all systems:
gabconfig -a
GAB Port Memberships
===============================================
Port a gen a36e003 membership 01
Port h gen fd57002 membership 01
View the cluster status:
hastatus -sum
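For a healthy two-system cluster, the system-state portion of the summary output is similar to the following (abbreviated):
-- SYSTEM STATE
-- System               State                Frozen
A  S1                   RUNNING              0
A  S2                   RUNNING              0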
Access the Cluster Manager Java Console to verify installation.
On UNIX systems, type hagui &.
On Windows systems, start the GUI using the Cluster Manager desktop icon.
Summary
This lesson described the procedure for installing VCS and viewing the cluster
configuration after the installation has completed.
Next Steps
After you install the VCS software, you can prepare your application services for
the high availability environment.
Additional Resources
VERITAS Cluster Server Release Notes
This document provides important information regarding VERITAS Cluster
Server (VCS) on the specified platform. It is recommended that you review
this entire document before installing VCS.
VERITAS Cluster Server Installation Guide
This guide provides information on how to install VERITAS Cluster Server on
the specified platform.
Web Resources
To verify that you have the latest operating system patches before installing
VCS, see the corresponding vendor Web site for that platform. For
example, for Solaris, see http://sunsolve.sun.com.
To contact VERITAS Technical Support, see:
http://support.veritas.com
To obtain VERITAS software licenses, see:
http://vlicense.veritas.com
Cluster: vcs1
Systems: train1 and train2
Link 1: ______    Link 2: ______    Public: ______
Subnet: ______
Software location: ______
4.x:      # ./installer
Pre-4.0:  # ./installvcs
Goal
The purpose of this lab exercise is to set up a two-system VCS cluster with a
shared disk configuration and to install the VCS software using the installation
utility.
Prerequisites
Pairs of students work together to install the cluster. Select a pair of systems or
use the systems designated by your instructor.
Obtain installation information from your instructor and record it in the design
worksheet provided with the lab instructions.
Results
A two-system cluster is running VCS with one system running the ClusterService
service group.
Introduction
Overview
In this lesson, you learn how to manage applications that are under the control of
VCS. You are introduced to considerations that must be taken into account when
managing applications in a highly available clustered environment.
Importance
It is important to understand how to manage applications when they are under
VCS control. An application is a member of a service group that also contains
resources necessary to run the application that needs to be managed. Applications
must be brought up and down using the VCS interface rather than by using a
traditional direct interface with the application. Application upgrades and backups
are handled differently in a cluster environment.
Outline of Topics
Managing Applications in a Cluster Environment
Service Group Operations
Using the VCS Simulator
Warning: You can mistakenly cause problems, such as forcing faults and preventing failover, if you manipulate resources outside of VCS.
You can use any of the VCS interfaces to manage the cluster environment,
provided that you have the proper VCS authorization. VCS user accounts are
described in more detail in the VCS Configuration Methods lesson.
For details about the requirements for running the graphical user interfaces (GUIs),
see the VERITAS Cluster Server Release Notes and the VERITAS Cluster Server
Users Guide.
Note: You cannot use the Simulator to manage a running cluster configuration.
Knowing how to display attributes and status about a VCS cluster, service groups,
and resources helps you monitor the state of cluster objects and, if necessary, find
and fix problems. Familiarity with status displays also helps you build an
understanding of how VCS responds to events in the cluster environment, and the
effects on application services under VCS control.
You can display attributes and status using the GUI or CLI management tools.
Note: Show a continuous hastatus display in one window and the command log in another to become familiar with VCS activities and operations.
Displaying Logs
You can display the HAD log to see additional status information about activity in
the cluster. You can also display the command log to see how the activities you
perform using the GUI are translated into VCS commands. You can also use the
command log as a resource for creating batch files to use when performing
repetitive configuration or administration tasks.
Note: Both the HAD log and command log can be viewed using the GUI.
The primary log file, the engine log, is located in /var/VRTSvcs/log/
engine_A.log. Log files are described in more detail later in the course.
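From the command line, you can follow the engine log as you perform operations; for example:
tail -f /var/VRTSvcs/log/engine_A.log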
When a service group is brought online, resources are brought online starting with
the lowest (child) resources and progressing up the resource dependency tree to the
highest (parent) resources.
In order to bring a failover service group online, VCS must verify that all
nonpersistent resources in the service group are offline everywhere in the cluster.
If any nonpersistent resource is online on another system, the service group is not
brought online.
A service group is considered online if all of its autostart and critical resources are
online.
An autostart resource is a resource whose AutoStart attribute is set to 1.
A critical resource is a resource whose Critical attribute is set to 1.
A service group is considered partially online if one or more nonpersistent
resources are online and at least one resource that is autostart-enabled and
critical is offline.
The state of persistent resources is not considered when determining the online or
offline state of a service group because persistent resources cannot be taken
offline.
hagrp -offline
When a service group is taken offline, resources are taken offline starting with the
highest (parent) resources in each branch of the resource dependency tree and
progressing down the resource dependency tree to the lowest (child) resources.
Persistent resources cannot be taken offline. Therefore, the service group is
considered offline when all nonpersistent resources are offline.
Switch the service group to system S2:
hagrp -switch
Freeze the service group:
hagrp -freeze
When you freeze a service group, VCS continues to monitor the resources, but
does not allow the service group (or its resources) to be taken offline or brought
online. Failover is also disabled, even if a resource faults.
You can also specify that the freeze is in effect even if VCS is stopped and
restarted throughout the cluster.
Warning: When frozen, VCS does not take action on the service group even if you
cause a concurrency violation by bringing the service online on another system
outside of VCS.
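The full command syntax includes the service group and target system; for example, using the names from the earlier design (illustrative):
hagrp -online WebSG -sys S1
hagrp -offline WebSG -sys S1
hagrp -switch WebSG -to S2
hagrp -freeze WebSG -persistent
hagrp -unfreeze WebSG -persistent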
hares -online
hares -offline
Taking resources offline should not be a normal occurrence. Doing so causes the
service group to become partially online, and availability of the application service
is affected.
If a resource needs to be taken offline, for example, for maintenance of underlying
hardware, then consider switching the service group to another system.
If multiple resources need to be taken offline manually, then they must be taken
offline in resource dependency tree order, that is, from top to bottom.
Taking a resource offline and immediately bringing it online may be necessary if,
for example, the resource must reread a configuration file due to a change.
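As with service groups, the resource commands take the resource and system names; for example (illustrative names):
hares -offline WebIP -sys S1
hares -online WebIP -sys S1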
hares -clear
hares -probe
Download the Simulator from http://van.veritas.com.
A graphical user interface, referred to as the Simulator Java Console, is provided
to create and manage Simulator configurations. Using the Simulator Java Console,
you can run multiple Simulator configurations simultaneously.
To start the Simulator Java Console:
On UNIX systems:
a Set the PATH environment variable to /opt/VRTScssim/bin.
b Set VCS_SIMULATOR_HOME to /opt/VRTScssim.
c Type /opt/VRTSvcs/bin/hasimgui &
On Windows systems, environment variables are set during installation. Start
the Simulator Java Console by double clicking the icon on the desktop.
When the Simulator Java Console is running, a set of sample Simulator
configurations is displayed, showing an offline status. You can start one or more
existing cluster configurations and then launch an instance of the Cluster Manager
Java Console for each running Simulator configuration.
You can use the Cluster Manager Java Console to perform all the same tasks as an
actual cluster configuration. Additional options are available for Simulator
configurations to enable you to test various failure scenarios, including faulting
resources and powering off systems.
You can also copy a main.cf file to the /opt/VRTSsim/cluster_name/conf/config directory before starting the simulated cluster.
# cd /opt/VRTSsim
# hasim -setupclus myclus -simport 16555 -wacport -1
# hasim -start myclus_sys1 -clus myclus
# VCS_SIM_PORT=16555
# WAC_SIM_PORT=-1
# export VCS_SIM_PORT WAC_SIM_PORT
# hasim clus display
< Output is equivalent to haclus -display >
# hasim sys state
#System        Attribute    Value
myclus_sys1 SysState Running
You can use the Simulator command-line interface (CLI) to add and manage
simulated cluster configurations. While there are a few commands specific to
Simulator activities, such as cluster setup shown in the slide, in general the hasim
command syntax follows the corresponding ha commands used to manage an
actual cluster configuration.
The procedure used to initially set up a Simulator cluster configuration is shown
below. The corresponding commands are displayed in the slide.
Note: This procedure assumes you have already set the PATH and
VCS_SIMULATOR_HOME environment variables.
1 Change to the /opt/VRTSsim directory if you want to view the new
structure created when adding a cluster.
2 Add the cluster configuration, specifying a unique cluster name and port. For
local clusters, specify -1 as the WAC port.
3 Start the cluster on the first system.
4 Set the VCS_SIM_PORT and WAC_SIM_PORT environment variables to the
values you specified when adding the cluster.
Now you can use hasim commands or Cluster Manager to test or modify the
configuration.
Summary
In this lesson, you learned how to manage applications that are under control of
VCS.
Next Steps
Now that you are more comfortable managing applications in a VCS cluster, you
can prepare your application components and deploy your cluster design.
Additional Resources
http://van.veritas.com
The VCS Simulator software is available for download from the VERITAS
Web site.
VERITAS Cluster Server Release Notes
The release notes provide detailed information about hardware and software
supported by VERITAS Cluster Server.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
Goal
The purpose of this lab is to reinforce the material learned in this lesson by
performing a directed series of operator actions on a simulated VCS configuration.
Prerequisites
Obtain the main.cf file for this lab exercise from the location provided by your
instructor.
Results
Each student has a Simulator running with the main.cf file provided for the lab
exercise.
Introduction
Overview
This lesson describes how to prepare application services for use in the VCS high
availability environment. Performing these preparation tasks also helps illustrate
how VCS manages application resources.
Importance
By following these requirements and recommended practices for preparing to
configure service groups, you can ensure that your hardware, operating system,
and application resources are configured to enable VCS to manage and monitor the
components of the high availability services.
Outline of Topics
Preparing Applications for VCS
One-Time Configuration Tasks
Testing the Application Service
Stopping and Migrating a Service
Validating the Design Worksheet
The example application service comprises network components (an IP address on a NIC), the application process, and storage components (a file system) that together serve end users.
Perform one-time configuration tasks on each system. Then start, verify, and stop services on one system at a time; when no more systems remain, the service is ready for VCS.
Details are provided in the following section.
Volume Manager example (performed once, from one system; the mount point is created on each system):
Create a volume:       vxassist -g DemoDG make DemoVol 1g
Make a file system:    mkfs [options] vxfs /dev/vx/rdsk/DemoDG/DemoVol
Make a mount point:    mkdir /demo
VxVM is shown for simplicity; objects and commands are essentially the same
on all platforms. The agents for other volume managers are described in the
VERITAS Cluster Server, Implementing Local Clusters participant guide.
Preparing shared storage, such as creating disk groups, volumes, and file systems,
is performed once, from one system. Then you must create mount point directories
on each system.
The options to mkfs may differ depending on platform type, as displayed in the
following examples.
Solaris
mkfs -F vxfs /dev/vx/rdsk/DemoDG/DemoVol
AIX
mkfs -V vxfs /dev/vx/rdsk/DemoDG/DemoVol
HP-UX
mkfs -F vxfs /dev/vx/rdsk/DemoDG/DemoVol
Linux
mkfs -t vxfs /dev/vx/rdsk/DemoDG/DemoVol
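The disk group itself is also created once, from one system, before the volume is made; a sketch of this step with an illustrative, platform-specific device name:
vxdg init DemoDG DemoDG01=c1t1d0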
Administrative IP Addresses
Administrative IP addresses (also referred to as base IP addresses or maintenance
IP addresses) are controlled by the operating system. The administrative IP
addresses are associated with a physical network interface on the system, such as
qfe1 on Solaris systems, and are configured whenever the system is brought up.
These addresses are used to access a specific system over the network and can also
be used to verify that the system is physically connected to the network even
before an application is brought up.
BROADCAST_ADDRESS[0]=
DHCP_ENABLE[0]=0
2 Edit /etc/hosts and assign an IP address to the interface name.
166.98.112.14 train14_lan2
3 Use ifconfig to manually configure the IP address to test the configuration
without rebooting:
ifconfig lan2 inet 166.98.112.114
ifconfig lan2 up
On system S1, bring up the resources, starting all of them in dependency order: shared storage, the virtual IP address, and the application software. Test the application, and then stop the resources. Repeat the same steps on S2 through Sn. When no more systems remain, the service is ready for VCS.
This procedure emulates how VCS manages application services. The actual
commands used may differ from those used in this lesson. However, conceptually,
the same type of action is performed by VCS.
Bringing Up Resources
Shared Storage
Verify that shared storage resources are configured properly and accessible. The
examples shown in the slide are based on using Volume Manager.
1 Import the disk group.
2 Start the volume.
3 Mount the file system.
Mount the file system manually for the purposes of testing the application
service. Do not configure the operating system to automatically mount any file
system that will be controlled by VCS.
If the file system is added to /etc/vfstab, it will be mounted on the first
system to boot. VCS must control where the file system is mounted.
Examples of mount commands are provided for each platform.
Solaris
mount -F vxfs /dev/vx/dsk/ProcDG/ProcVol /process
AIX
mount -V vxfs /dev/vx/dsk/ProcDG/ProcVol /process
HP-UX
mount -F vxfs /dev/vx/dsk/ProcDG/ProcVol /process
Linux
mount -t vxfs /dev/vx/dsk/ProcDG/ProcVol /process
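For steps 1 and 2, the corresponding Volume Manager commands might look like this:
vxdg import ProcDG
vxvol -g ProcDG start ProcVol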
Virtual IP addresses are configured on logical interfaces named interface:number.
Solaris
The qfe1:1 device is used for the first virtual IP address on the qfe1 interface;
qfe1:2 is used for the second.
1 Plumb the virtual interface and bring up the IP on the next available logical
interface:
ifconfig qfe1 addif 192.168.30.132 up
2 Edit /etc/hosts to assign a virtual hostname (application service name) to
the IP address.
192.168.30.132 process_services
Verify the disk group:    vxdg list DemoDG
Verify the volume:        dd if=/dev/vx/rdsk/DemoDG/DemoVol of=/dev/null count=1 bs=128
Verify the file system:   mount | grep /demo
Verify the admin IP:      ping same_subnet_IP
Verify the virtual IP:    ifconfig arguments
Verify the application:   ps arguments | grep process
Verifying Resources
You can perform some simple steps, such as those shown in the slide, to verify that
each component needed for the application service to function is operating at a
basic level.
This helps you identify any potential configuration problems before you test the
service as a whole, as described in the Testing the Integrated Components
section.
For an NFS service, you can also mount the exported file system from a client on the network. This is described in more detail later in the course.
Stop the application; for example:
/sbin/orderproc stop
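The virtual IP address brought up earlier is also removed when stopping the service; on Solaris this might look like the following, using the interface and address from the earlier example:
ifconfig qfe1 removeif 192.168.30.132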
(The slide shows the application service components, the IP address, NIC, process, and file system, on systems S1 and S2.)
Validate or complete your design worksheet to document the information
required to configure VCS to manage the services.
Use the procedures described in this lesson to configure and test the underlying
operating system resources.
attributes may be different.
These attributes are described in more detail later in the course.
Summary
This lesson described how to prepare sites and application services for use in the
VCS high availability environment. Performing these preparation tasks ensures
that the site is ready to deploy VCS, and helps illustrate how VCS manages
application resources.
Next Steps
After you have prepared your operating system environment and applications for
high availability, you can install VERITAS Cluster Server and then configure
service groups for your application services.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
High Availability Using VERITAS Cluster Server, Implementing Local
Clusters
This course provides detailed information on advanced clustering topics,
focusing on configurations of clusters with more than two nodes.
Lab configuration: each student's loopy process (/bob1/loopy and /sue1/loopy) runs from its own disk group and volume, bobDG1/bobVol1 mounted at /bob1 on disk1, and sueDG1/sueVol1 mounted at /sue1 on disk2, each on a separate disk/LUN.
See the next slide for classroom values.
Appendix C provides complete lab instructions and solutions.
Lab 5 Solutions: Preparing Application Services, page C-51
Goal
The purpose of this lab is to prepare the loopy process service for high availability.
Prerequisites
Obtain any classroom-specific values needed for your classroom lab environment
and record these values in your design worksheet included with the lab exercise
instructions.
Results
Each student's service can be started, monitored, and stopped on each cluster
system.
Introduction
Overview
This lesson provides an overview of the configuration methods you can use to
create and modify service groups. This lesson also describes how VCS manages
and protects the cluster configuration.
Importance
By understanding all methods available for configuring VCS, you can choose the
tools and procedures that best suit your requirements.
Outline of Topics
Overview of Configuration Methods
Controlling Access to VCS
Online Configuration
Offline Configuration
Starting and Stopping VCS
VCS 4.1
The halogin command is provided in VCS 4.1 to save authentication
information so that users do not have to enter credentials every time a VCS
command is run.
The command stores authentication information in the users home directory. You
must either set the VCS_HOST environment variable to the name of the node from
which you are running VCS commands, or add the node name to the /etc/
.vcshosts file.
If you run halogin for different hosts, VCS stores authentication information for
each host.
VCS 3.5 and 4.0
For releases prior to 4.1, halogin is not supported. When logged on to UNIX as
a nonroot account, the user is prompted to enter a VCS account name and
password every time a VCS command is entered.
To enable nonroot users to more easily administer VCS, you can set the
AllowNativeCliUsers cluster attribute to 1. For example, type:
haclus -modify AllowNativeCliUsers 1
When set, VCS maps the UNIX user name to the same VCS account name to
determine whether the user is valid and has the proper privilege level to perform
the operation. You must explicitly create each VCS account name to match the
UNIX user names and grant the appropriate privilege level.
User Accounts
You can ensure that the different types of administrators in your environment have
a VCS authority level to affect only those aspects of the cluster configuration that
are appropriate to their level of responsibility.
For example, if you have a DBA account that is authorized to take a database
service group offline or switch it to another system, you can make a VCS Group
Operator account for the service group with the same account name. The DBA can
then perform operator tasks for that service group, but cannot affect the cluster
configuration or other service groups. If you set AllowNativeCliUsers to 1, then
the DBA logged on with that account can also use the VCS command line to
manage the corresponding service group.
Setting VCS privileges is described in the next section.
For example, to add a user called DBSG_Op to the VCS configuration, type:
hauser -add DBSG_Op
In non-secure mode, VCS user accounts are stored in the main.cf file in
encrypted format. If you use a GUI or wizard to set up a VCS user account,
passwords are encrypted automatically. If you use the command line, you must
encrypt the password using the vcsencrypt command.
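For example, a minimal sketch of adding the account from the command line (the configuration must be open for writing):
haconf -makerw
hauser -add DBSG_Op
haconf -dump -makero
If you instead edit main.cf directly, generate the encrypted password string with the vcsencrypt -vcs command, which prompts for the password, and place the resulting string in the UserNames attribute.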
Note: In non-secure mode, if you change a UNIX account, this change is not
automatically reflected in the VCS main.cf file. You must manually modify
accounts in both places if you want them to be synchronized.
Online Configuration
Benefits
Online configuration has these advantages:
The VCS engine is up and running, providing high availability of existing
service groups during configuration.
This method provides syntax checking, which helps protect you from making
configuration errors.
This step-by-step procedure is suitable for testing each object as it is
configured, simplifying troubleshooting of configuration mistakes that you
may make when adding resources.
You do not need to be logged into the UNIX system as root to use the GUI and
CLI to make VCS configuration changes.
Considerations
Online configuration has these considerations:
Online configuration is more time-consuming for large-scale modifications.
The online process is repetitive. You have to add service groups and resources
one at a time.
The slide shows an online configuration change, such as hagrp -add, being applied by had to the in-memory configuration on each cluster system.
The VCS command-line interface is an alternate online configuration tool. When
you run ha commands, had responds in the same fashion.
Note: When two administrators are changing the cluster configuration
simultaneously, each sees all changes as they are being made.
The slide shows how the haconf commands affect the configuration files on each system:
haconf -makerw opens the configuration and creates a .stale file.
haconf -dump writes the in-memory configuration to main.cf; the .stale file remains.
haconf -dump -makero writes the configuration to main.cf, removes the .stale file, and closes the configuration.
To understand how this protection mechanism works, you must first understand
the normal VCS startup procedure.
Offline Configuration
In some circumstances, you can simplify cluster implementation or configuration
tasks by directly modifying the VCS configuration files. This method requires you
to stop and restart VCS in order to build the new configuration in memory.
The benefits of using an offline configuration method are that it:
Offers a very quick way of making major changes or getting an initial
configuration up and running
Provides a means for deploying a large number of similar clusters
One consideration when choosing to perform offline configuration is that you must
be logged on to a cluster system as root.
This section describes situations where offline configuration is useful. The next
section shows how to stop and restart VCS to propagate the new configuration
throughout the cluster. The Offline Configuration of Service Groups lesson
provides detailed offline configuration procedures and examples.
The slide shows how a large number of similar clusters can be deployed by copying and editing main.cf. Cluster1 (systems S1 and S2) runs the DB1 and DB2 service groups, and Cluster2 (systems S3 and S4) runs DB3 and DB4. For example, the main.cf for Cluster2 contains:
group DB3 (
SystemList = { S3 = 1, S4 = 2 }
AutoStartList = { S3 }
)
The slide shows the resource dependency trees for the DemoSG service group (DemoProcess, DemoIP, DemoMount, DemoNIC, DemoVol, DemoDG) and the AppSG service group (AppProcess, AppIP, AppMount, AppNIC, AppVol, AppDG).
The slide illustrates VCS startup. When hastart is run on each system, had and hashadow start. S1 enters the LOCAL_BUILD state and builds the cluster configuration in memory from its local main.cf, while S2 waits in CURRENT_DISCOVER_WAIT with no configuration in memory. When S1 is RUNNING, S2 performs a REMOTE_BUILD, receiving the configuration over the cluster interconnect; had on S2 then writes the configuration to its local main.cf and removes any .stale file.
The slide shows VCS starting on S1 when a .stale file is present: had and hashadow start, but S1 enters the STALE_ADMIN_WAIT state with no configuration in memory, and S2 remains in the UNKNOWN state.
A system enters this state when VCS was stopped while the configuration was open. This also occurs if you start VCS and the
main.cf file has a syntax error. This enables you to inspect the main.cf file and
decide whether you want to start VCS with that main.cf file. You may have to
modify the main.cf file if you made changes in the running cluster after saving
the configuration to disk.
The slide shows recovery from this state. The administrator runs hasys -force S1, which directs had on S1 to perform a LOCAL_BUILD from the local main.cf even though a .stale file is present; S2 waits for a running configuration. When S1 is RUNNING, S2 performs a REMOTE_BUILD, had on S2 writes the configuration to its local main.cf, and the .stale files are removed.
5 When had is in a running state on S1, this state change is broadcast on the
cluster interconnect by GAB.
6 S2 then performs a remote build to put the new cluster configuration into its
memory.
7 The had process on S2 copies the cluster configuration into the local
main.cf and types.cf files after moving the original files to backup
copies with timestamps.
8 The had process on S2 removes the .stale file, if present, from the local
configuration directory.
The slide shows starting the cluster with a modified configuration: hastart is run on S1, which performs a LOCAL_BUILD from the edited main.cf, and hastart -stale is run on the other systems, which wait for a running configuration and then perform a REMOTE_BUILD. had on each remote system then writes the new configuration to its local main.cf and removes the .stale file.
9 When VCS is in a running state on S1, HAD on S1 sends a copy of the cluster
configuration over the cluster interconnect to S2.
10 S2 performs a remote build to put the new cluster configuration in memory.
11 HAD on S2 copies the cluster configuration into the local main.cf and
types.cf files after moving the original files to backup copies with
timestamps.
12 HAD on S2 removes the .stale file from the local configuration directory.
The slide shows hastop -local being run on S1: the had and hashadow daemons stop on S1 while VCS continues to run on S2.
Stopping VCS
There are three methods of stopping the VCS engine (had and hashadow
daemons) on a cluster system:
Stop VCS and take all service groups offline, stopping application services
under VCS control.
Stop VCS and evacuate service groups to another cluster system where VCS is
running.
Stop VCS and leave application services running.
VCS can also be stopped on all systems in the cluster simultaneously. The hastop
command is used with different options and arguments that determine how running
services are handled.
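For example, a hedged summary of the corresponding commands (exact options can vary by VCS version):
hastop -local              Stop VCS on this system and take its service groups offline.
hastop -local -evacuate    Migrate service groups to another running system, then stop VCS.
hastop -local -force       Stop VCS but leave the application services running.
hastop -all [-force]       Stop VCS on all systems in the cluster.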
Summary
This lesson introduced the methods you can use to configure VCS. You also
learned how VCS starts and stops in a variety of circumstances.
Next Steps
Now that you are familiar with the methods available for configuring VCS, you
can apply these skills by creating a service group using an online configuration
method.
Additional Resources
VERITAS Cluster Server User's Guide
This guide provides detailed information on starting and stopping VCS, and
6
performing online and offline configuration.
VERITAS Cluster Server Command Line Quick Reference
This card provides the syntax rules for the most commonly used VCS
commands.
The slide shows the lab cluster vcs1 (systems train1 and train2) and the command used in this lab:
# hastop -all -force
Goal
The purpose of this lab is to observe the effects of stopping and starting VCS.
Prerequisites
Students must work together to coordinate stopping and restarting VCS.
Results
The cluster is running and the ClusterService group is online.
Introduction
Overview
This lesson describes how to use the VCS Cluster Manager graphical user
interface (GUI) and the command-line interface (CLI) to create a service group
and configure resources while the cluster is running.
Importance
You can perform all tasks necessary to create and test a service group while VCS is
running without affecting other high availability services.
Outline of Topics
Online Configuration Procedure
Adding a Service Group
Adding Resources
Solving Common Configuration Errors
Testing the Service Group
The slide shows the online configuration procedure: open the cluster configuration, add the service group, add and test resources, and repeat for each additional resource.
This procedure assumes that you have prepared and tested the application service on each system and that it is offline everywhere, as described in the Preparing Services for High Availability lesson.
main.cf:
group DemoSG (
SystemList = { S1 = 0, S2 = 1 }
AutoStartList = { S1 }
)
service group. In the example displayed in the slide, the S1 system is selected
as the system on which DemoSG is started when VCS starts up.
The Service Group Type selection is failover by default.
If you save the configuration after creating the service group, you can view the
main.cf file to see the effect of had modifying the configuration and writing the
changes to the local disk.
Considerations:
Add resources in the order of dependency, starting at the bottom (child) resource.
Configure all required attributes, and then enable the resource.
Bring each resource online before adding the next resource.
It is recommended that you set resources as non-critical until testing has been completed.
If a resource does not come online, troubleshoot it before continuing.
Adding Resources
Online Resource Configuration Procedure
Add resources to a service group in the order of resource dependencies starting
from the child resource (bottom up). This enables each resource to be tested as it is
added to the service group.
Adding a resource requires you to specify:
The service group name
The unique resource name
If you prefix the resource name with the service group name, you can more
easily identify the service group to which it belongs. When you display a list of
resources from the command line using the hares -list command, the
resources are sorted alphabetically.
The resource type
Attribute values
Use the procedure shown in the diagram to configure a resource.
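For example, a minimal sketch of adding and testing the DemoIP resource from the command line (attribute values match the example that follows):
haconf -makerw
hares -add DemoIP IP DemoSG
hares -modify DemoIP Critical 0
hares -modify DemoIP Device qfe1
hares -modify DemoIP Address "10.10.21.198"
hares -modify DemoIP Enabled 1
hares -online DemoIP -sys S1
haconf -dump -makero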
Notes:
It is recommended that you set each resource to be non-critical during initial
configuration. This simplifies testing and troubleshooting in the event that you
have specified incorrect configuration information. If a resource faults due to a
configuration error, the service group does not fail over if resources are non-
critical.
Enabling a resource signals the agent to start monitoring the resource.
main.cf:
IP DemoIP (
Critical = 0
Device = qfe1
Address = "10.10.21.198"
)
Adding an IP Resource
The slide shows the required attribute values for an IP resource in the DemoSG
service group. The corresponding entry is made in the main.cf file when the
configuration is saved.
Notice that the IP resource has two required attributes, Device and Address, which
specify the network interface and IP address, respectively.
Optional Attributes
NetMask: Netmask associated with the application IP address
The value may be specified in decimal (base 10) or hexadecimal (base 16). The
default is the netmask corresponding to the IP address class.
Options: Options to be used with the ifconfig command
ArpDelay: Number of seconds to sleep between configuring an interface and
sending out a broadcast to inform routers about this IP address
The default is 1 second.
IfconfigTwice: If set to 1, this attribute causes an IP address to be configured
twice, using an ifconfig up-down-up sequence. This behavior increases the
probability of gratuitous ARPs (caused by ifconfig up) reaching clients.
The default is 0.
main.cf:
DiskGroup DemoDG (
Critical = 0
DiskGroup = DemoDG
)
The slide shows the procedure for correcting a resource that does not come online or that faults: check the engine log, verify that the resource is offline at the operating system level on every system, and flush the service group to stop all online and offline processes. Then disable the resource, modify its attributes, re-enable it, clear the fault if necessary, and bring the resource online again.
Misconfigured resources can cause agent processes to appear to hang; always verify that the resource is stopped at the operating system level. For example:
hagrp -flush DemoSG -sys S1
hares -modify DemoIP Enabled 0
Disabling a Resource
Disable a resource before you start modifying attributes to fix a misconfigured
resource. When you disable a resource, VCS stops monitoring the resource, so it
does not fault or wait to come online while you are making changes.
When you disable a resource, the agent calls the close entry point, if defined. The
close entry point is optional.
When the close tasks are completed, or if there is no close entry point, the agent
stops monitoring the resource.
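For example, a hedged sketch of correcting a misconfigured DemoIP resource (the corrected address is illustrative):
haconf -makerw
hares -modify DemoIP Enabled 0
hares -modify DemoIP Address "10.10.21.199"
hares -modify DemoIP Enabled 1
hares -clear DemoIP
hares -online DemoIP -sys S1
haconf -dump -makero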
The slide shows how a resource dependency is recorded in main.cf when resources are linked:
DemoIP requires DemoNIC
To link the resources from the command line:
hares -link DemoIP DemoNIC
Linking Resources
When you link a parent resource to a child resource, the dependency becomes a
component of the service group configuration. When you save the cluster
configuration, each dependency is listed at the end of the service group definition,
after the resource specifications, in the format shown in the slide.
In addition, VCS creates a dependency tree in the main.cf file at the end of the
service group definition to provide a more visual view of resource dependencies.
This is not part of the cluster configuration, as denoted by the // comment
markers.
// NIC DemoNIC
// }
//}
Resource Dependencies
VCS enables you to link resources to specify dependencies. For example, an IP
address resource is dependent on the NIC providing the physical link to the
network.
Ensure that you understand the dependency rules shown in the slide before you
start linking resources.
nameProcess1 requires nameMount1
nameProcess1 requires nameIP1
nameMount1 requires nameVol1
nameIP1 requires nameNIC1
main.cf:
NIC DemoNIC (
Device = qfe1
)
After testing is complete, set the resource to critical:
hares -modify DemoNIC Critical 1
The slide shows the completed DemoSG dependency tree: the Process resource (/demo/orderproc) requires the IP (10.10.21.198) and Mount (/demo) resources; the IP resource requires the NIC (qfe1); and the Mount resource requires the Volume (DemoVol), which requires the DiskGroup (DemoDG).
cluster VCS (
UserNames = { admin = "j5_eZ_^]Xbd^\\_Y_d\\" }
Administrators = { admin }
CounterInterval = 5
)
system S2 (
)
group DemoSG (
SystemList = { S1 = 1, S2 = 2 }
AutoStartList = { S1 }
)
DiskGroup DemoDG (
Critical = 0
DiskGroup = DemoDG
)
IP DemoIP (
Critical = 0
Device = qfe1
Address = "10.10.21.198"
)
Mount DemoMount (
Critical = 0
MountPoint = "/demo"
BlockDevice = "/dev/vx/dsk/DemoDG/DemoVol"
FSType = vxfs
FsckOpt = "-y"
)
NIC DemoNIC (
Critical = 0
Device = qfe1
)
Process DemoProcess (
Critical = 0
PathName = "/bin/sh"
Arguments = "/sbin/orderproc up"
)
Summary
This lesson described the procedure for creating a service group and two tools for
modifying a running cluster: the Cluster Manager graphical user interface and
VCS ha commands.
Next Steps
After you familiarize yourself with the online configuration methods and tools,
you can modify configuration files directly to practice offline configuration.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
VERITAS Cluster Server User's Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
VERITAS Cluster Server Command Line Quick Reference
This card provides the syntax rules for the most commonly used VCS
commands.
Goal
The purpose of this lab is to create a service group while VCS is running using
either the Cluster Manager graphical user interface or the command-line interface.
Prerequisites
The shared storage and networking resources must be configured and tested. Disk
groups must be offline on all systems.
Results
New service groups defined in the design worksheet are running and tested on both
cluster systems.
Introduction
Overview
This lesson describes how to create a service group and configure resources by
modifying the main.cf configuration file.
Importance
In some circumstances, it is more efficient to modify the cluster configuration by
changing the configuration files and restarting VCS to bring the new configuration
into memory on each cluster system.
Outline of Topics
Offline Configuration Procedures
Using the Design Worksheet
Offline Configuration Tools
Solving Offline Configuration Problems
Testing the Service Group
The slide shows the basic offline configuration procedure.
First system:
Stop VCS on all systems:              hastop -all
Edit the configuration file:          vi main.cf
Verify configuration file syntax:     hacf -verify .
Start VCS on this system:             hastart
Verify that VCS is running:           hastatus -sum
All other systems:
Start VCS:                            hastart -stale
The edited main.cf in the slide contains, for example:
group WebSG (
SystemList = ...
AutoStartList = ...
)
NIC WebNIC (
Critical = 0
Device = xxxx
)
Stop VCS
Stop VCS on all cluster systems. This ensures that there is no possibility that
another administrator is changing the cluster configuration while you are
modifying the main.cf file.
include "types.cf"
cluster vcs (
UserNames = { admin = ElmElgLimHmmKumGlj }
ClusterAddress = "192.168.27.51"
Administrators = { admin }
CounterInterval = 5
)
system S1 (
)
system S2 (
)
group WebSG (
SystemList = { S1 = 1, S2 = 2 }
AutoStartList = { S1 }
)
DiskGroup WebDG (
Critical = 0
DiskGroup = WebDG
)
Mount WebMount (
Critical = 0
MountPoint = "/Web"
BlockDevice = "/dev/vx/dsk/WebDG/WebVol"
FSType = vxfs
)
NIC WebNIC (
Critical = 0
Device = qfe1
)
Process WebProcess (
Critical = 0
PathName = "/bin/ksh"
Arguments = "/sbin/tomcat"
)
Volume WebVol (
Critical = 0
Volume = WebVol
DiskGroup = WebDG
)
First system:
Close the configuration:              haconf -dump -makero
Change to the config directory:       cd /etc/VRTSvcs/conf/config
Create a working directory:           mkdir stage
Copy main.cf and types.cf:            cp main.cf types.cf stage
Change to the stage directory:        cd stage
Edit the configuration files:         vi main.cf
Verify configuration file syntax:     hacf -verify .
Existing Cluster
The diagram illustrates a process for modifying the cluster configuration when you
already have service groups configured and want to minimize the time that VCS is
not running to protect services that are running.
This procedure includes several built-in protections from common configuration
errors and maximizes high availability.
First System
Close the Configuration
Close the cluster configuration before you start making changes. This ensures that
the working copy you make has the latest in-memory configuration. This also
ensures that you do not have a stale configuration when you attempt to start the
cluster later.
Make a Staging Directory
Make a subdirectory of /etc/VRTSvcs/conf/config in which you can edit
a copy of the main.cf file. This ensures that your edits cannot be overwritten if
another administrator is making configuration changes simultaneously.
Copy the Configuration Files
Copy the main.cf file and types.cf from
/etc/VRTSvcs/conf/config to the staging directory.
First system:
Stop VCS; leave services running:     hastop -all -force
Copy the test main.cf file back:      cp main.cf ../main.cf
Start VCS on this system:             hastart
Verify that HAD is running:           hastatus -sum
All other systems:
Start VCS stale:                      hastart -stale
If you are modifying an existing service group, freeze the group persistently before stopping VCS. This prevents the group from failing over when VCS restarts if there are problems with the configuration.
Stop VCS
Note: If you have modified an existing service group, first freeze the service group
persistently to prevent VCS from failing over the group. This simplifies fixing
resource configuration problems, because the service group is not being switched
between systems.
Stop VCS on all cluster systems after making configuration changes. To leave
applications running, use the -force option, as shown in the diagram.
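For example, a minimal sketch of freezing the WebSG group persistently and then stopping VCS while leaving applications running:
haconf -makerw
hagrp -freeze WebSG -persistent
haconf -dump -makero
hastop -all -force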
Start VCS
Start VCS first on the system with the modified main.cf file.
Design worksheet for the AppIP resource:
Service Group Name:    AppSG
Resource Name:         AppIP
Resource Type:         IP
Required Attributes:   Device = qfe1, Address = 10.10.21.199
Optional Attributes:   NetMask = 255.255.255.0 (required only on HP-UX)
The slide also shows the AppSG dependency tree: Process (/app), IP (10.10.21.199), Mount (/app), AppNIC (qfe1), Volume (AppVol), and DiskGroup (AppDG).
Resource Dependencies
Document resource dependencies in your design worksheet and add the links at the
end of the service group definition, using the syntax shown in the slide. A
complete example service group definition is shown in the next section.
main.cf:
group AppSG (
SystemList = { S1 = 0, S2 = 1 }
AutoStartList = { S1 }
Operators = { SGoper }
)
DiskGroup AppDG (
Critical = 0
DiskGroup = AppDG
)
IP AppIP (
Critical = 0
Device = qfe1
Address = "10.10.21.199"
)
. . .
group AppSG (
SystemList = { S1 = 1, S2 = 2 }
AutoStartList = { S1 }
)
DiskGroup AppDG (
Critical = 0
DiskGroup = AppDG
)
Mount AppMount (
Critical = 0
MountPoint = "/app"
BlockDevice = "/dev/vx/dsk/AppDG/AppVol"
FSType = vxfs
)
NIC AppNIC (
Critical = 0
Device = qfe1
)
Process AppProcess (
Critical = 0
PathName = "/bin/ksh"
Arguments = "/app/appd test"
)
Volume AppVol (
Critical = 0
Volume = AppVol
DiskGroup = AppDG
)
You can create and modify main.cf and types.cf files using the VCS Simulator. This method does not affect the cluster configuration files; simulator configuration files are created in a separate directory. You can also use the simulator to test any main.cf file before putting the configuration into the actual cluster environment.
# ls -l /etc/VRTSvcs/conf/config
total 140
-rw-------   Mar 21 13:09   main.cf
-rw-------   Mar 14 17:22   main.cf.14Mar2004.17:22:25
-rw-------   Mar 16 18:00   main.cf.16Mar2004.18:00:54
-rw-------   Mar 20 11:37   main.cf.20Mar2004.11:37:49
-rw-------   Mar 21 13:09   main.cf.21Mar2004.13:09:11
-rw-------   Mar 21 13:10   main.cf.previous
-rw-------   Mar 21 13:09   types.cf
-rw-------   Mar 14 17:22   types.cf.14Mar2004.17:22:25
-rw-------   Mar 16 18:00   types.cf.16Mar2004.18:00:54
-rw-------   Mar 20 11:37   types.cf.20Mar2004.11:37:49
-rw-------   Mar 21 13:09   types.cf.21Mar2004.13:09:11
-rw-------   Mar 21 13:10   types.cf.previous
(All files are owned by root, group other.)
Summary
This lesson introduced a methodology for creating a service group by modifying
the main.cf configuration file and restarting VCS to use the new configuration.
Next Steps
Now that you are familiar with a variety of tools and methods for configuring
service groups, you can apply these skills to more complex configuration tasks.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
VERITAS Cluster Server User's Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
VERITAS Cluster Server Command Line Quick Reference
This card provides the syntax rules for the most commonly used VCS
commands.
The slide shows the lab configuration: the existing nameSG1 and nameSG2 service groups (each containing its Process and disk group resources) and the new service group to be added. Working together, follow the offline configuration procedure. Alternately, work alone and use the GUI to create a new service group.
Goal
The purpose of this lab is to add a service group by copying and editing the
definition in main.cf for nameSG1.
Prerequisites
Students must coordinate when stopping and restarting VCS.
Results
The new service group defined in the design worksheet is running and tested on
both cluster systems.
Introduction
Overview
This lesson describes how to create a parallel service group containing networking
resources shared by multiple service groups.
Importance
If you have multiple service groups that use the same network interface, you can
reduce monitoring overhead by using Proxy resources instead of NIC resources. If
you have many NIC resources, consider using Proxy resources to minimize any
potential performance impacts of monitoring.
Topic: After completing this lesson, you will be able to:
Sharing Network Interfaces: Describe how multiple service groups can share network interfaces.
Alternate Network Configurations: Describe alternate network configurations.
Using Parallel Service Groups: Use parallel service groups with network resources.
Localizing Resource Attributes: Localize resource attributes.
Outline of Topics
Sharing Network Interfaces
Alternate Network Configurations
Using Parallel Service Groups
Localizing Resource Attributes
The slide shows the main.cf files of several service groups (WebSG, DBSG, NFSSG, Ora1SG, and others), each defining its own IP and NIC resources that all specify the same network interface. For example (Solaris):
IP WebIP (
Device = qfe1
Address = "10.10.21.198"
)
NIC WebNIC (
Device = qfe1
)
WebIP requires WebNIC
Configuration View
The example shows a configuration with many service groups using the same
network interface specified in the NIC resource. Each service group has a unique
NIC resource with a unique name, but the Device attribute for all is qfe1 in this
Solaris example.
In addition to the overhead of many monitor cycles for the same resource, a
disadvantage of this configuration is the effect of changes in NIC hardware. If you
must change the network interface (for example in the event the interface fails),
you must change the Device attribute for each NIC resource monitoring that
interface.
The slide shows the WebSG and DBSG service groups: WebSG contains a NIC resource, and DBSG contains a Proxy resource instead of its own NIC resource. A Proxy resource mirrors the state of another resource (for example, a NIC).
Design worksheet for the Proxy resource:
Service Group Name:    DBSG
Resource Name:         DBProxy
Resource Type:         Proxy
Required Attributes:   TargetResName = WebNIC
The Proxy resource monitors the target resource on the local system, unless TargetSysName is specified. TargetResName must refer to a resource in a separate service group.
main.cf:
Proxy DBProxy (
Critical = 0
TargetResName = WebNIC
)
Optional Attributes
TargetSysName specifies the name of the system on which the target resource
status is monitored. If no system is specified, the local system is used as the target
system.
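For example, a hedged sketch of replacing DBSG's NIC resource with a Proxy resource (resource names follow the examples above):
haconf -makerw
hares -unlink DBIP DBNIC
hares -delete DBNIC
hares -add DBProxy Proxy DBSG
hares -modify DBProxy TargetResName WebNIC
hares -link DBIP DBProxy
haconf -dump -makero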
The slide shows a NetSG service group containing only a NetNIC resource on systems S1 and S2, and asks: how do you determine the status of a parallel service group that has only a persistent resource?
The slide shows the DBIP and WebIP resources on S1 and S2 and the NetSG parallel service group containing NIC and Phantom resources. A Phantom resource can be used to enable VCS to report the online status of a service group with only persistent resources.
Phantom Resources
The Phantom resource is used to report the actual status of a service group that
consists of only persistent resources. A service group shows an online status only
when all of its nonpersistent resources are online. Therefore, if a service group has
only persistent resources, VCS considers the group offline, even if the persistent
resources are running properly. When a Phantom resource is added, the status of
the service group is shown as online.
Note: Use this resource only with parallel service groups.
Service group definition (design worksheet):
Group:                  NetSG
Required Attributes:    Parallel = 1
SystemList:             S1 = 0, S2 = 1
Optional Attributes:    AutoStartList = S1, S2
Use an online method to set Parallel before adding resources.
main.cf (Solaris):
group NetSG (
SystemList = { S1 = 0, S2 = 1 }
AutoStartList = { S1, S2 }
Parallel = 1
)
NIC NetNIC (
Device = qfe1
)
Phantom NetPhantom (
)
The difference is that a parallel group can be online on more than one system without causing a concurrency fault.
The slide shows the NetSG group spanning S1 (interface qfe1) and S2 (interface hme0), with DBSG and WebSG depending on it. You can localize the Device attribute for NIC resources when systems have different network interfaces:
NIC NetNIC (
Device@S1 = qfe1
Device@S2 = hme0
)
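For example, a minimal sketch of localizing the Device attribute from the command line:
haconf -makerw
hares -local NetNIC Device
hares -modify NetNIC Device qfe1 -sys S1
hares -modify NetNIC Device hme0 -sys S2
haconf -dump -makero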
Summary
This lesson introduced a methodology for sharing network resources among
service groups.
Next Steps
Now that you are familiar with a variety of tools and methods for configuring
service groups, you can apply these skills to more complex configuration tasks.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
VERITAS Cluster Server User's Guide
This guide describes the behavior of parallel service groups and advantages of
using Proxy resources.
The slide shows the lab configuration: the nameSG1 and nameSG2 service groups now use Network (Proxy) resources, and a new parallel NetworkSG service group contains the NIC and Phantom resources.
Goal
The purpose of this lab is to add a parallel service group to monitor the NIC
resource and replace the NIC resources in the failover service groups with Proxy
resources.
Prerequisites
Students must coordinate when stopping and restarting VCS.
Results
A new parallel service group defined in the design worksheet is running on both
cluster systems. NIC is replaced with Proxy resources in all other service groups.
Introduction
Overview
This lesson describes how to configure VCS to provide event notification using e-
mail, SNMP traps, and triggers.
Importance
In order to maintain a high availability cluster, you must be able to detect and fix
problems when they occur. By configuring notification, you can have VCS
proactively notify you when certain events occur.
Configuring Notification: Configure notification using the NotifierMngr resource.
Using Triggers for Notification: Use triggers to provide notification.
Outline of Topics
Notification Overview
Configuring Notification
Using Triggers for Notification
The slide shows the notification architecture: the had daemons maintain a replicated message queue and pass events to the notifier daemon, which is managed by a NotifierMngr resource (with a NIC resource) and sends SNMP traps and SMTP e-mail.
Notification Overview
When VCS detects certain events, you can configure the notifier to:
Generate an SNMP (V2) trap to specified SNMP consoles.
Send an e-mail message to designated recipients.
Message Queue
VCS ensures that no event messages are lost while the VCS engine is running,
even if the notifier daemon stops or is not started. The had daemons
throughout the cluster communicate to maintain a replicated message queue.
If the service group with notifier configured as a resource fails on one of the nodes,
notifier fails over to another node in the cluster. Because the message queue is
guaranteed to be consistent and replicated across nodes, notifier can resume
message delivery from where it left off after it fails over to the new node.
Messages are stored in the queue until one of these conditions is met:
The notifier daemon sends an acknowledgement to had that at least one
recipient has received the message.
The queue is full. The queue is circular; the last (oldest) message is deleted in
order to write the current (newest) message.
Messages in the queue for one hour are deleted if notifier is unable to deliver to
the recipient.
Note: Before the notifier daemon connects to had, messages are stored
permanently in the queue until one of the last two conditions is met.
The slide shows the range of event severity levels, from Information to SevereError, that the had daemons pass to the notifier. A complete list of events and severity levels is included in the Job Aids Appendix.
Note: A NotifierMngr resource is added to only one service group, the ClusterService group.
Add a NotifierMngr type of resource to the ClusterService group.
If SMTP notification is required, modify the SmtpServer and SmtpRecipients attributes; optionally, modify the ResourceOwner and GroupOwner attributes.
If SNMP notification is required, modify the SnmpConsoles attribute of NotifierMngr and configure the SNMP console to receive VCS traps.
Modify any other optional attributes of NotifierMngr as desired.
Configuring Notification
While you can start and stop the notifier daemon manually outside of VCS,
you can make the notifier component highly available by placing the daemon
under VCS control.
Carry out the following steps to configure a highly available notification within the
cluster:
1 Add a NotifierMngr type of resource to the ClusterService group.
2 If SMTP notification is required:
a Modify the SmtpServer and SmtpRecipients attributes of the NotifierMngr
type of resource.
b If desired, modify the ResourceOwner attribute of individual resources
(described later in the lesson).
c You can also specify a GroupOwner e-mail address for each service group.
3 If SNMP notification is required:
a Modify the SnmpConsoles attribute of the NotifierMngr type of resource.
b Verify that the SNMPTrapPort attribute value matches the port configured
for the SNMP console. The default is port 162.
c Configure the SNMP console to receive VCS traps (described later in the
lesson).
4 Modify any other optional attributes of the NotifierMngr type of resource, as
desired.
notifier arguments shown in this example are:
See the manual pages for notifier and hanotify for a complete description
of notification configuration options.
main.cf:
NotifierMngr notifier (
SmtpServer = "smtp.veritas.com"
SmtpRecipients = { "vcsadmin@veritas.com" = SevereError }
PathName = "/opt/VRTSvcs/bin/notifier"
)
Optional Attributes
EngineListeningPort: The port that the VCS engine uses for listening.
The default is 14141.
Note: This optional attribute exists for VCS 3.5 for Solaris and for HP-UX.
This attribute does not exist for VCS 3.5 for AIX or VCS 4.0 for Solaris.
MessagesQueue: The number of messages in the queue
The default is 30.
NotifierListeningPort: Any valid unused TCP/IP port numbers
The default is 14144.
SnmpConsole: The fully qualified host name of the SNMP console and the
severity level
SnmpConsole is a required attribute if SMTP is not specified.
SnmpCommunity: The community ID for the SNMP manager
The default is public.
SnmpdTrapPort: The port to which SNMP traps are sent.
The value specified for this attribute is used for all consoles if more than one
SNMP console is specified.
The default is 162.
SmtpFromPath: A valid e-mail address, if a custom e-mail address is desired
for the FROM: field in the e-mail sent by notifier
SmtpReturnPath: A valid e-mail address, if a custom e-mail address is desired
for the Return-Path: <> field in the e-mail sent by notifier
SmtpServerTimeout: The time in seconds that notifier waits for a response
from the mail server for the SMTP commands it has sent to the mail server
This value can be increased if the mail server takes too much time to reply
back to the SMTP commands sent by notifier.
The default is 10.
SmtpServerVrfyOff: A toggle for sending SMTP VRFY requests
Setting this value to 1 results in notifier not sending an SMTP VRFY request to
the mail server specified in SmtpServer attribute, while sending e-mails. Set
this value to 1 if your mail server does not support the SMTP VRFY command.
The default is 0.
Notification Events
ResourceStateUnknown
ResourceMonitorTimeout
ResourceNotGoingOffline
ResourceRestartingByAgent
ResourceWentOnlineByItself
ResourceFaulted
hagrp -modify grp_name GroupOwner chris
Examples of service group events that cause VCS to send notification to the GroupOwner: Faulted, Concurrency Violation, Autodisabled.
The slide lists the notification-related triggers: ResNotOff, PreOnline, SysOffline, ResStateChange, PostOffline, and PostOnline. Some of these apply only to enabled service groups; others apply cluster-wide.
Summary
This lesson described how to configure VCS to provide notification using e-mail
and SNMP traps.
Next Steps
The next lesson describes how VCS responds to resource faults and the options
you can configure to modify the default behavior.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This document provides important reference information for the VCS agents
bundled with the VCS software.
VERITAS Cluster Server User's Guide
This document provides information about all aspects of VCS configuration.
The slide shows the nameSG1, nameSG2, and ClusterService service groups; the NotifierMngr resource is added to the ClusterService group.
Optional Lab: Triggers (resfault, nofailover, resadminwait)
SMTP Server: ___________________________________
Goal
The purpose of this lab is to configure notification.
Prerequisites
Students work together to add a NotifierMngr resource to the ClusterService
group.
Results
The ClusterService group now has a NotifierMngr resource and notification is
working.
Introduction
Overview
This lesson describes how VCS responds to resource faults and introduces various
components, such as resource type attributes, that you can configure to customize
the VCS engines response to resource faults. This lesson also describes how to
recover after a resource is put into a FAULTED or ADMIN_WAIT state.
Importance
In order to maintain a high availability cluster, you must understand how service
groups behave in response to resource failures and how you can customize this
behavior. This enables you to configure the cluster optimally for your computing
environment.
Event Handling triggers.
Outline of Topics
VCS Response to Resource Faults
Determining Failover Duration
Controlling Fault Behavior
Recovering from Resource Faults
Fault Notification and Event Handling
The slide shows how VCS responds when a resource goes offline unexpectedly:
If the service group is frozen, VCS does nothing except fault the resource and the service group.
If the group is not frozen, VCS checks FaultPropagation. If FaultPropagation is 0, VCS does not take any other resource offline. If FaultPropagation is 1, VCS faults the resource and takes all resources in the path offline.
If a critical resource is in the path, VCS faults the service group and takes the entire service group offline; if a failover target is available, the service group is brought online elsewhere, otherwise it is kept offline.
If no critical resource is in the path, the service group is kept partially online.
Frozen or TFrozen
These service group attributes are used to indicate that the service group is frozen
due to an administrative command. When a service group is frozen, all agent
actions except for monitor are disabled. If the service group is temporarily frozen
using the hagrp -freeze group command, the TFrozen attribute is set to 1,
and if the service group is persistently frozen using the hagrp -freeze group
-persistent command, the Frozen attribute is set to 1. When the service group
is unfrozen using the hagrp -unfreeze group [-persistent]
command, the corresponding attribute is set back to the default value of 0.
ManageFaults
This service group attribute can be used to prevent VCS from taking any automatic
actions whenever a resource failure is detected. Essentially, ManageFaults
determines whether VCS or an administrator handles faults for a service group.
If ManageFaults is set to the default value of ALL, VCS manages faults by
executing the clean entry point for that resource to ensure that the resource is
completely offline, as shown previously. The default setting of ALL provides the
same behavior as VCS 3.5.
FaultPropagation
The FaultPropagation attribute determines whether VCS evaluates the effects of a
resource fault on parents of the faulted resource.
If ManageFaults is set to ALL, VCS runs the clean entry point for the faulted
resource and then checks the FaultPropagation attribute of the service group. If this
attribute is set to 0, VCS does not take any further action. In this case, VCS fails
over the service group only on system failures and not on resource faults.
The default value is 1, which means that VCS continues through the failover
process shown in the next section. This is the same behavior as VCS 3.5 and
earlier releases.
Notes:
The ManageFaults and FaultPropagation attributes of a service group are
introduced in VCS version 3.5 for AIX and VCS version 3.5 MP1 (or 2.0 P4)
for Solaris. VCS 3.5 for HP-UX and any earlier versions of VCS on any other
platform do not have these attributes. If these attributes do not exist, the VCS
response to resource faults is the same as with the default values of these
attributes.
ManageFaults and FaultPropagation have essentially the same effect when
enabledservice group failover is suppressed. The difference is that when
ManageFaults is set to NONE, the clean entry point is not run and that resource
is put in an ADMIN_WAIT state.
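For example, a hedged sketch of suppressing automatic fault handling for a service group (the group name DemoSG is illustrative):
haconf -makerw
hagrp -modify DemoSG ManageFaults NONE
hagrp -modify DemoSG FaultPropagation 0
haconf -dump -makero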
The slide shows the next stage of the decision: VCS takes the entire group offline and then checks AutoFailOver. If AutoFailOver is 0, the service group is kept offline. If AutoFailOver is 1, VCS chooses a failover target from the SystemList based on FailOverPolicy; if a failover target is available, the service group is brought online elsewhere, otherwise it is kept offline.
AutoFailOver
This attribute determines whether automatic failover takes place when a resource
or system faults. The default value of 1 indicates that the service group should be
failed over to other available systems if at all possible. However, if the attribute is
set to 0, no automatic failover is attempted for the service group, and the service
group is left in an OFFLINE|FAULTED state.
The slide shows the elements that together determine the failover duration.
Note: If you change a resource type attribute, you affect all resources of that type.
Adjusting Monitoring
You can change some resource type attributes to facilitate failover testing. For
example, you can change the monitor interval to see the results of faults more
quickly. You can also adjust these attributes to affect how quickly an application
fails over when a fault occurs.
MonitorInterval
This is the duration (in seconds) between two consecutive monitor calls for an
online or transitioning resource.
The default is 60 seconds for most resource types.
OfflineMonitorInterval
This is the duration (in seconds) between two consecutive monitor calls for an
offline resource. If set to 0, offline resources are not monitored.
The default is 300 seconds for most resource types.
Refer to the VERITAS Cluster Server Bundled Agents Reference Guide for the
applicable monitor interval defaults for specific resource types.
Controlling Fault Behavior
Type Attributes Related to Resource Faults
Although the failover capability of VCS helps to minimize the disruption of
application services when resources fail, the process of migrating a service to
another system can be time-consuming. In some cases, you may want to attempt to
restart a resource on the same system before failing it over to another system.
Whether a resource can be restarted depends on the application service:
The resource must be successfully cleared (taken offline) after failure.
The resource must not be a child resource, which has dependent parent
resources that must be restarted.
If you have determined that a resource can be restarted without impacting the
integrity of the application, you can potentially avoid service group failover by
configuring these resource type attributes:
RestartLimit
The restart limit determines how many times a resource can be restarted within
the confidence interval before the resource faults.
For example, you may want to restart a resource such as the Oracle listener
process several times before it causes an Oracle service group to fault.
ConfInterval
When a resource has remained online for the specified time (in seconds),
previous faults and restart attempts are ignored by the agent. When this clock
expires, the restart or tolerance counter is reset to zero.
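For example, a hedged sketch of allowing one restart within a three-minute confidence interval, either for all resources of a type or for a single overridden resource (the values match the example that follows):
haconf -makerw
hatype -modify Process RestartLimit 1
hatype -modify Process ConfInterval 180
hares -override DemoProcess RestartLimit
hares -modify DemoProcess RestartLimit 1
haconf -dump -makero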
The slide shows a timeline in which a resource goes offline and is restarted, remains online for less than ConfInterval, goes offline again, and is then faulted; MonitorInterval determines how quickly each offline is detected.
Restart Example
This example illustrates how the RestartLimit and ConfInterval attributes can be
configured for modifying the behavior of VCS when a resource is faulted.
Setting RestartLimit = 1 and ConfInterval = 180 has this effect when a resource
faults:
1 The resource stops after running for 10 minutes.
2 The next monitor returns offline.
3 The ConfInterval counter is set to 0.
4 The agent checks the value of RestartLimit.
5 The resource is restarted because RestartLimit is set to 1, which allows one
restart within the ConfInterval.
6 The next monitor returns online.
7 The ConfInterval counter is now 60 (one monitor cycle has completed).
8 The resource stops again.
9 The next monitor returns offline.
10 The ConfInterval counter is now 120 (two monitor cycles have completed).
11 The resource is not restarted because the RestartLimit counter is now 2 and the
ConfInterval counter is 120 (seconds). Because the resource has not been
online for the ConfInterval time of 180 seconds, it is not restarted.
12 VCS faults the resource.
If the resource had remained online for 180 seconds, the internal RestartLimit
counter would have been reset to 0.
types.cf:
type NIC (
static int MonitorInterval = 15
static int OfflineMonitorInterval = 60
static int ToleranceLimit = 2
static str ArgList[] = { Device, ...
)
These static type attributes can be used to optimize agents and are applied to all resources of the specified type.
hatype -modify NIC ToleranceLimit 2
hares -override myMount MonitorInterval       Override MonitorInterval
hares -modify myMount MonitorInterval 10      Modify the overridden attribute
hares -display -ovalues myMount               Display overridden values
hares -undo_override myMount MonitorInterval  Restore default settings
main.cf:
Mount myMount (
MountPoint = "/mydir"
. . .
MonitorInterval = 10
. . .
)
To clear a nonpersistent resource fault:
1. Ensure that the fault is fixed outside of VCS and that the resource is completely offline.
2. Use the hares -clear resource command to clear the FAULTED status.
To clear a persistent resource fault:
1. Ensure that the fault is fixed outside of VCS.
2. Either wait for the periodic monitoring or probe the resource manually using the command:
hares -probe resource -sys system
A resource cannot be taken offline:                          Call resnotoff (if present).
A resource is placed in an ADMIN_WAIT state:                 Call resadminwait (if present).
A resource is brought online or taken offline successfully:  Call resstatechange (if present and configured).
The failover target does not exist:                          Call nofailover (if present).
Summary
This lesson described how VCS responds to resource faults and introduced various
components of VCS that enable you to customize VCS response to resource faults.
Next Steps
The next lesson describes how the cluster communication mechanisms work to
build and maintain the cluster membership.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This document provides important reference information for the VCS agents
bundled with the VCS software.
VERITAS Cluster Server User's Guide
This document provides information about all aspects of VCS configuration.
High Availability Design Using VERITAS Cluster Server instructor-led training
This course provides configuration procedures and practical exercises for
configuring triggers.
Note: Network interfaces for virtual IP addresses are unconfigured to force the IP resource to fault.
In your classroom, the interface you specify is: ______
Replace the variable interface in the lab steps with this value.
Goal
The purpose of this lab is to observe how VCS responds to faults in a variety of
scenarios.
Results
Each student observes the effects of failure events in the cluster.
Prerequisites
Obtain any classroom-specific values needed for your classroom lab environment
and record these values in your design worksheet included with the lab exercise
instructions.
Introduction
Overview
This lesson describes how the cluster interconnect mechanism works. You also
learn how the GAB and LLT configuration files are set up during installation to
implement the communication channels.
Importance
Although you may never need to reconfigure the cluster interconnect, developing a
thorough knowledge of how the cluster interconnect functions is key to
understanding how VCS behaves when systems or network links fail.
Outline of Topics
VCS Communications Review
Cluster Membership
Cluster Interconnect Configuration
Joining the Cluster Membership
The slide shows the VCS communication stack on each cluster system: agents communicate with had, which uses GAB over LLT.
LLT broadcasts a heartbeat on each interface every second. Each LLT module tracks the status of the heartbeat from each peer on each interface. LLT forwards the heartbeat status of each node to GAB.
GAB determines cluster membership by monitoring heartbeats transmitted
from each system over LLT.
# gabconfig -a
GAB Port Memberships
===============================================
Port a gen a36e003 membership 01 ; ;12
Port h gen fd57002 membership 01 ; ;12
Cluster Membership
GAB Status and Membership Notation
To display the cluster membership status, type gabconfig on each system. For
example:
gabconfig -a
If GAB is operating, the following GAB port membership information is returned:
Port a indicates that GAB is communicating, a36e0003 is a randomly
generated number, and membership 01 indicates that systems 0 and 1 are
connected.
Port h indicates that VCS is started, fd570002 is a randomly generated
number, and membership 01 indicates that systems 0 and 1 are both running
VCS.
Note: The port a and port h generation numbers change each time the membership
changes.
Solaris Example
S1# lltstat -nvv | pg
LLT node information:
Node       State    Link    Status   Address
* 0 S1     OPEN
                    qfe0    UP       08:00:20:AD:BC:78
                    hme0    UP       08:00:20:AD:BC:79
  1 S2     OPEN
                    qfe0    UP       08:00:20:B4:0C:3B
                    hme0    UP       08:00:20:B4:0C:3C
The lltstat Command
Use the lltstat command to verify that links are active for LLT. This command
returns information about the links for LLT for the system on which it is typed. In
the example shown in the slide, lltstat -nvv is typed on the S1 system to
produce the LLT status in a cluster with two systems.
The -nvv options cause lltstat to list systems with very verbose status:
Link names from llttab
Status
MAC address of the Ethernet ports
Other lltstat uses:
Without options, lltstat reports whether LLT is running.
The -c option displays the values of LLT configuration directives.
The -l option lists information about each configured LLT link.
You can also create a script that runs lltstat -nvv and checks the output for the string DOWN; run it periodically from cron to report failed links (a sketch of such a script follows the note below).
Use the exclude directive in llttab to eliminate information about
nonexistent systems.
Note: This level of detailed information about LLT links is only available through
the CLI. Basic status is shown in the GUI.
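For reference, here is a minimal sketch of such a monitoring script. The script path, temporary file, and mail recipient are illustrative assumptions; adapt them to your site, and confirm the location of lltstat on your platform (it is assumed to be /sbin/lltstat here).

#!/bin/sh
# check_llt.sh -- report LLT links that are DOWN (run periodically from cron)
TMP=/tmp/llt_down.$$
if /sbin/lltstat -nvv | grep DOWN > $TMP 2>&1; then
    # grep exits 0 only when a DOWN link was found
    mailx -s "LLT link DOWN on `uname -n`" root < $TMP
fi
rm -f $TMP

An example crontab entry to run the check every 15 minutes:
0,15,30,45 * * * * /usr/local/bin/check_llt.sh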
The LLT configuration files are located in the /etc directory.
# cat /etc/llttab
set-node S1
set-cluster 10
link qfe0 /dev/qfe:0 - ether - -
link hme0 /dev/hme:0 - ether - -

# cat /etc/llthosts
0 S1
1 S2

(Solaris example. Valid set-cluster values are 0 - 255; valid node IDs in llthosts are 0 - 31.)
A unique number must be assigned to each system in a cluster using the
set-node directive.
The value of set-node can be one of the following:
An integer in the range of 0 through 31 (32 systems per cluster maximum)
A system name matching an entry in /etc/llthosts
If a number is specified, each system in the cluster must have a unique llttab
file, which has a unique value for set-node. Likewise, if a system name is
specified, each system must have a different llttab file with a unique system
name that is listed in llthosts, which LLT maps to a node ID.
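For example, here is a sketch of the integer form on the first system, assuming node IDs 0 and 1; the second system's llttab would be identical except for set-node 1:

# cat /etc/llttab
set-node 0
set-cluster 10
link qfe0 /dev/qfe:0 - ether - -
link hme0 /dev/hme:0 - ether - -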
# cat /etc/llthosts
0 S1
1 S2
# cat /etc/VRTSvcs/conf/sysname
S1
The sysname file is an optional LLT configuration file. This file is used to store
the system (node) name. In later versions, the VCS installation utility creates the
sysname file on each system, which contains the host name for that system.
The purpose of the sysname file is to remove VCS dependence on the UNIX
uname utility for determining the local system name. If the sysname file is not
present, VCS determines the local host name using uname. If uname returns a
fully qualified domain name (sys.company.com), VCS cannot match the name
to the systems in the main.cf cluster configuration and therefore cannot start on
that system.
If uname returns a fully qualified domain name on your cluster systems, ensure
that the sysname file is configured with the local host name in
/etc/VRTSvcs/conf.
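For example, a minimal way to create the file on the system that main.cf names S1 (a sketch; substitute your own system name):

# echo "S1" > /etc/VRTSvcs/conf/sysname
# cat /etc/VRTSvcs/conf/sysname
S1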
Note: Although you can specify a name in the sysname file that is completely
different from the UNIX host name shown in the output of uname, this can lead to
problems and is not recommended. For example, consider a scenario where system
S1 fails and you replace it with another system named S3. You configure VCS on
S3 to make it appear to be S1 by creating a sysname file with S1. While this has
the advantage of minimizing VCS configuration changes, it can create a great deal
of confusion when troubleshooting problems. From the VCS point of view, the
system is shown as S1. From the UNIX point of view, the system is S3.
See the sysname manual page for a complete description of the file.
# cat /etc/gabtab
/sbin/gabconfig -c -n 4
The diagram shows HAD, GAB, and LLT on systems S1, S2, and S3 exchanging "I am alive" heartbeat messages across the cluster interconnect.
GAB and LLT are started automatically when a system starts up. HAD can only
start after GAB membership has been established among all cluster systems. The
mechanism that ensures that all cluster systems are visible on the cluster
interconnect is GAB seeding.
The diagram shows the startup sequence: LLT starts, GAB seeds when all systems are visible on the cluster interconnect, and HAD starts after GAB is seeded.
Manual Seeding
You can override the seed values in the gabtab file and manually force GAB to
seed a system using the gabconfig command. This is useful when one of the
systems in the cluster is out of service and you want to start VCS on the remaining
systems.
To seed the cluster, start GAB on one node with -x to override the -n value set in
the gabtab file. For example, type:
gabconfig -c -x
Warning: Only manually seed the cluster when you are sure that no other systems
have GAB seeded. In clusters that do not use I/O fencing, you can potentially
create a split brain condition by using gabconfig improperly.
After you have started GAB on one system, start GAB on other systems using
gabconfig with only the -c option. You do not need to force GAB to start with
the -x option on other systems. When GAB starts on the other systems, it
determines that GAB is already seeded and starts up.
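For example, here is a sketch of the sequence when one cluster system is known to be out of service:

# On one running system only, force GAB to seed:
gabconfig -c -x
# On each of the other running systems:
gabconfig -c
# Verify that port a membership has formed:
gabconfig -a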
HAD tells agents to probe (monitor) all resources on all systems in the SystemList to determine their status.
Summary
This lesson described how the cluster interconnect mechanism works and the
format and content of the configuration files.
Next Steps
The next lesson describes how system and communication failures are handled in a
VCS cluster environment that does not support I/O fencing.
Additional Resources
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
VERITAS Cluster Server Installation Guide
This guide provides detailed information on configuring VCS communication
mechanisms.
Introduction
Overview
This lesson describes how VCS handles system and communication failures in
clusters that do not implement I/O fencing.
Importance
A thorough understanding of how VCS responds to system and communication
faults ensures that you know how services and their users are affected in common
failure situations.
Outline of Topics
Ensuring Data Integrity
Cluster Interconnect Failures
Changing the Interconnect Configuration
Prior to any failures, systems S1, S2, and S3 are part of the regular membership of
cluster number 1.
Failover due to a resource fault or switchover of service groups at operator request is unaffected. The only change is that other systems are prevented from starting the jeopardized system's service groups if that system faults. VCS continues to operate as a single cluster as long as at least one network channel exists between the systems.
In the example shown in the diagram where one LLT link fails:
A jeopardy membership is formed that includes just system S3.
System S3 is also a member of the regular cluster membership with systems S1
and S2.
Service groups A, B, and C continue to run and all other cluster functions
remain unaffected.
Failover due to a resource fault or an operator request to switch a service group is unaffected.
If system S3 now faults or its last LLT link is lost, service group C is not
started on systems S1 or S2.
Jeopardy Membership
When a system is down to a single LLT link, VCS can no longer reliably
discriminate between loss of a system and loss of the last LLT connection. Systems
with only a single LLT link are put into a special cluster membership known as
jeopardy.
Jeopardy is a mechanism for preventing split-brain condition if the last LLT link
fails. If a system is in a jeopardy membership, and then loses its final LLT link:
Service groups in the jeopardy membership are autodisabled in the regular
cluster membership.
Service groups in the regular membership are autodisabled in the jeopardy
membership.
Jeopardy membership also occurs in the case where had stops and hashadow is
unable to restart had.
Because system S3 was in a jeopardy membership prior to the last link failing:
Service group C is autodisabled in the mini-cluster containing systems S1
and S2 to prevent either system from starting it.
Service groups A and B are autodisabled in the cluster membership for
system S3 to prevent system S3 from starting either one.
Service groups A and B can still fail over between systems S1 and S2.
In this example, the cluster interconnect has partitioned and two separate cluster
memberships have formed as a result, one on each side of the partition.
Recovery Behavior
When a cluster partitions because the cluster interconnect has failed, each of the
mini-clusters continues to operate. However, because they cannot communicate,
each maintains and updates only its own version of the cluster configuration and
the systems on different sides of the network partition have different cluster
configurations.
If you reconnect the LLT links without first stopping VCS on one side of the
partition, GAB automatically stops HAD on selected systems in the cluster to
protect against a potential split-brain scenario.
GAB protects the cluster as follows:
In a two-system cluster, the system with the lowest LLT node number
continues to run VCS and VCS is stopped on the higher-numbered system.
In a multisystem cluster, the mini-cluster with the most systems running
continues to run VCS. VCS is stopped on the systems in the smaller mini-
clusters.
If a multisystem cluster is split into two equal-size mini-clusters, the cluster
containing the lowest node number continues to run VCS.
If an application starts on multiple systems and can gain control of what are
normally exclusive resources, such as disks in a shared storage device, split brain
condition results and data can be corrupted.
The slide shows two cases: (1) no change in membership; (2) jeopardy membership: S3; regular membership: S1, S2, S3; the public network is now used for heartbeat and status.
The slide shows two scenarios: (1) a network partition with regular membership S1, S2 on one side and S3 on the other, where the SGHB resource faults when it is brought online; (2) S3 faults and service group C is started on S1 or S2, then the LLT links to S3 are disconnected, leaving regular membership S1, S2 and no membership for S3.
When systems that were part of a failed network partition restart, GAB seeding prevents HAD from starting on those systems.
In the scenario shown in the diagram, system S3 cannot start HAD when it reboots
because the network failure prevents GAB from communicating with any other
cluster systems; therefore, system S3 cannot seed.
Edit Files
# llttab file:
set-node S1
set-cluster 10
# llthosts file:
0 S1
1 S2
# gabtab file:
/sbin/gabconfig -c -n number
# sysname file:
S1
For example, if you added a system to a running cluster, you can change the value
of -n in the gabtab file without having to restart GAB. However, if you added
the -j option to change the recovery behavior, you must either restart GAB or
run the gabconfig command from the gabtab file manually for the change to take effect.
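For example, here is a sketch after adding a third system, assuming you have edited /etc/gabtab to raise the seed count:

# cat /etc/gabtab
/sbin/gabconfig -c -n 3

To make the change take effect immediately rather than at the next GAB start, you can run the same gabconfig command manually on each system.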
Similarly, if you add a host entry to llthosts, you do not need to restart LLT.
However, if you change llttab, or you change a host name in llthosts, you
must stop and restart LLT, and, therefore, GAB.
Regardless of the type of change made, the procedure shown in the slide ensures
that the changes take effect. You can also use the scripts in the /etc/rc*.d
directories to stop and start services.
Note: On Solaris, you must also unload the LLT and GAB modules if you are
removing a system from the cluster, or upgrading LLT or GAB binaries. For
example:
modinfo | grep gab
modunload -i gab_id
modinfo | grep llt
modunload -i llt_id
# cat /etc/llttab
# Solaris example
set-node S1
set-cluster 10
link qfe0 /dev/qfe:0 - ether - -
link hme0 /dev/hme:0 - ether - -
link ce0 /dev/ce:0 - ether - -
link-lowpri qfe1 /dev/qfe:1 - ether - -
Summary
This lesson described how VCS protects data in shared storage environments that
do not support I/O fencing. You also learned how you can modify the
communication configuration.
Next Steps
Now that you know how VCS behaves when faults occur in a non-fencing
environment, you can learn how VCS handles system and communication failures
in a fencing environment.
Additional Resources
VERITAS Cluster Server Installation Guide
This guide describes how to configure the cluster interconnect.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
Optional Lab: injeopardy trigger
Goal
The purpose of this lab is to configure a low-priority link and then pull network
cables and observe how VCS responds.
Prerequisites
Work together to perform the tasks in this lab. Obtain any classroom-specific values needed for your lab environment and record these values in your design worksheet included with the lab exercise.
Results
All network links are up and monitored when the lab is completed, and the InJeopardy trigger has been run.
Introduction
Overview
This lesson describes how the VCS I/O fencing feature protects data in a shared
storage environment.
Importance
Having a thorough understanding of how VCS responds to system and
communication faults when I/O fencing is configured ensures that you know how
services and their users are affected in common failure situations.
Outline of Topics
Data Protection Requirements
I/O Fencing Concepts and Components
I/O Fencing Operations
I/O Fencing Implementation
Configuring I/O Fencing
Recovering Fenced Systems
System Failure
In order to keep services highly available, the cluster software must be capable of taking corrective action when a system fails. Most cluster implementations are lights-out environments; the HA software must automatically respond to faults without administrator intervention.
Example corrective actions are:
Starting an application on another node
Reconfiguring parallel applications to no longer include the departed node in
locking operations
The animation shows conceptually how VCS handles a system fault. The yellow
service group that was running on Server 2 is brought online on Server 1 after
GAB on Server 1 stops receiving heartbeats from Server 2 and notifies HAD.
Interconnect Failure
A key function of a high availability solution is to detect and respond to system
faults. However, the system may still be running but unable to communicate
heartbeats due to a failure of the cluster interconnect. The other systems in the
cluster have no way to distinguish between the two situations.
This problem is faced by all HA solutions: how can the HA software distinguish a
system fault from a failure of the cluster interconnect? As shown in the example
diagram, whether the system on the right side (Server 2) fails or the cluster
interconnect fails, the system on the left (Server 1) no longer receives heartbeats
from the other system.
The HA software must have a method to prevent an uncoordinated view among
systems of the cluster membership in any type of failure scenario.
In the case where nodes are running but the cluster interconnect has failed, the HA
software needs to have a way to determine how to handle the nodes on each side of
the network split, or partition.
Network Partition
A network partition is formed when one or more nodes stop communicating on the
cluster interconnect due to a failure of the interconnect.
If a system hangs such that it seems to have failed, its services can be started on another
system. This can also happen on systems where the hardware supports a break and
resume function. If the system is dropped to command-prompt level with a break
and subsequently resumed, the system can appear to have failed. The cluster is
reformed and then the system recovers and begins writing to shared storage again.
The remainder of this lesson describes how the VERITAS fencing mechanism
prevents split brain condition in failure situations.
Cluster membership: the membership must be consistent.
Data protection: upon a membership change, only one cluster can survive and have exclusive control of shared data disks.
When the heartbeats stop, VCS needs to take action, but both failures have the same symptoms. Which failure is it? What action should be taken?
VCS uses a mechanism called I/O fencing to guarantee data protection. I/O fencing uses SCSI-3 persistent reservations (PR) to fence off data drives to prevent split brain condition.
SCSI-3 PR supports multiple nodes accessing a device while at the same time
blocking access to other nodes. Persistent reservations are persistent across SCSI
bus resets and PR also supports multiple paths from a host to a disk.
Coordinator Disks
The coordinator disks act as a global lock mechanism, determining which nodes
are currently registered in the cluster. This registration is represented by a unique
key associated with each node that is written to the coordinator disks. In order for a
node to access a data disk, that node must have a key registered on coordinator
disks.
When system or interconnect failures occur, the coordinator disks ensure that only
one cluster survives, as described in the I/O Fencing Operations section.
Data Disks
Data disks are standard disk devices used for shared data storage. These can be
physical disks or RAID logical units (LUNs). These disks must support SCSI-3
PR. Data disks are incorporated into standard VM disk groups. In operation,
Volume Manager is responsible for fencing data disks on a disk group basis.
Disks added to a disk group are automatically fenced, as are new paths to a device when they are discovered.
Each node registers a unique key with the data disks: Node 0 uses AVCS, Node 1 uses BVCS, and so on.
In the example shown in the diagram, Node 0 is registered to write to the data disks
in the disk group belonging to the DB service group. Node 1 is registered to write
to the data disks in the disk group belonging to the App service group.
After registering with the data disk, Volume Manager sets a Write Exclusive
Registrants Only reservation on the data disk. This reservation means that only the
registered system can write to the data disk.
System Failure
The diagram shows the fencing sequence when a system fails.
1 Node 0 detects that Node 1 has failed when the LLT heartbeat times out and
informs GAB. At this point, port a on Node 0 (GAB membership) shows only 0.
2 The fencing driver is notified of the change in GAB membership and Node 0
races to win control of a majority of the coordinator disks.
This means Node 0 must eject Node 1 keys (B) from at least two of three
coordinator disks. In coordinator disk serial number order, the fencing driver
ejects the registration of Node 1 (B keys) using the SCSI-3 Preempt and Abort
command. This command allows a registered member on a disk to eject the
registration of another. Because I/O fencing uses the same key for all paths
from a host, a single preempt and abort ejects a host from all paths to storage.
3 In this example, Node 0 wins the race for each coordinator disk by ejecting
Node 1 keys from each coordinator disk.
4 Now port b (fencing membership) shows only Node 0 because Node 1 keys
have been ejected. Therefore, fencing has a consistent membership and passes
the cluster reconfiguration information to HAD.
5 GAB port h reflects the new cluster membership containing only Node 0 and
HAD now performs whatever failover operations are defined for the service
groups that were running on the departed system.
Fencing takes place when a service group is brought online on a surviving
system as part of the disk group importing process. When the DiskGroup
resources come online, the agent online entry point instructs Volume Manager
to import the disk group with options to remove the Node 1 registration and
reservation, and place a SCSI-3 registration and reservation for Node 0.
1. Node 1 detects no more heartbeats from Node 0.
2. Nodes 0 and 1 race for the coordinator disks, ejecting each other's keys. Only one node can win each disk.
3. Node 0 wins the majority of coordinator disks.
4. Node 1 panics.
5. Node 0 now has perfect membership.
6. VCS fails over the App service group, importing the disk group and changing the reservation.
The disk group for DB has the key AVCS and a reservation giving Node 0 exclusive access; after failover, the disk group for App also has a reservation giving Node 0 exclusive access.
Interconnect Failure
The diagram shows how VCS handles fencing if the cluster interconnect is severed
and a network partition is created. In this case, multiple nodes are racing for
control of the coordinator disks.
1 LLT on Node 0 informs GAB that it has not received a heartbeat from Node 1
within the timeout period. Likewise, LLT on Node 1 informs GAB that it has
not received a heartbeat from Node 0.
2 When the fencing drivers on both nodes receive a cluster membership change
from GAB, they begin racing to gain control of the coordinator disks.
The node that reaches the first coordinator disk (based on disk serial number)
ejects the failed node's key. In this example, Node 0 wins the race for the first
coordinator disk and ejects the B------- key.
After the B key is ejected by Node 0, Node 1 cannot eject the key for Node 0
because the SCSI-PR protocol says that only a member can eject a member.
SCSI command tag queuing creates a stack of commands to process, so there is
no chance of these two ejects occurring simultaneously on the drive. This
condition means that only one system can win.
3 Node 0 also wins the race for the second coordinator disk.
Node 0 is favored to win the race for the second coordinator disk according to
the algorithm used by the fencing driver. Because Node 1 lost the race for the
first coordinator disk, Node 1 has to reread the coordinator disk keys a number
of times before it tries to eject the other node's key. This favors the winner of
the first coordinator disk to win the remaining coordinator disks. Therefore,
Node 1 does not gain control of the second or third coordinator disks.
If Node 1 is then restarted while the interconnect is still down, GAB does not seed automatically unless an administrator forces it with the gabconfig -x command.
4 As part of the initialization of fencing, the fencing driver receives a list of
current nodes in the GAB membership, reads the keys present on the
coordinator disks, and performs a comparison.
In this example, the fencing driver on Node 1 detects keys from Node 0 (A---
----) but does not detect Node 0 in the GAB membership because the cluster
interconnect has been severed.
gabconfig -a
GAB Port Memberships
===================================================
Port a gen b7r004 membership 1
Majority Clusters
The I/O fencing algorithm is designed to give priority to larger clusters in any
arbitration scenario. For example, if a single node is separated from a 16-node
cluster due to an interconnect fault, the 15-node cluster should continue to run. The
fencing driver uses the concept of a majority cluster. The algorithm determines if
the number of nodes remaining in the cluster is greater than or equal to the number
of departed nodes. If so, the larger cluster is considered a majority cluster. The
majority cluster begins racing immediately for control of the coordinator disks on
any membership change. The fencing drivers on the nodes in the minority cluster
delay the start of the race to give an advantage to the larger cluster. This delay is
accomplished by reading the keys on the coordinator disks a number of times. This
algorithm ensures that the larger cluster wins, but also allows a smaller cluster to
win if the departed nodes are not actually running.
The vxfen fencing driver:
Uses GAB port b for communication
Determines coordinator disks on vxfen startup
Intercepts RECONFIG messages from GAB destined for the VCS engine
Controls fencing actions carried out by Volume Manager
Reads the serial numbers of the coordinator disks and stores them in memory
d If this is the second or later member to register, obtains serial numbers of
coordinator disks from the first member
e Reads and compares the local serial number
f Errors out, if the serial number is different
g Begins a preexisting network partition check
h Reads current keys registered on coordinator disks
i Determines that all keys match the current port b membership
j Registers the key with coordinator disks
2 Membership is established (port b).
3 HAD is started and port h membership is established.
Fencing Driver
Fencing in VCS is implemented in two primary areas:
The vxfen fencing driver, which directs Volume Manager
Volume Manager, which carries out actual fencing operations at the disk group
level
The fencing driver is a kernel module that connects to GAB to intercept cluster
membership changes (reconfiguration messages). If a membership change occurs,
GAB passes the new membership in the form of a reconfiguration message to
vxfen on GAB port b. The fencing driver on the node with the lowest node ID in the
remaining cluster races for control of the coordinator disks, as described
previously. If this node wins, it passes the list of departed nodes to VxVM to have
these nodes ejected from all shared disk groups.
After carrying out required fencing actions, vxfen passes the reconfiguration
message to HAD.
Deport the disk group.
vxdg deport fendg
Create /etc/vxfendg on all systems.
echo fendg > /etc/vxfendg
-f file_name (Verify all disks listed in the file.)
-g disk_group (Verify all disks in the disk group.)
Note: You can check individual LUNs for SCSI-3 support to ensure that you
have the array configured properly before checking all disk groups. To
determine the paths on each system for that disk, use the vxfenadm utility to
check the serial number of the disk. For example:
vxfenadm -i disk_dev_path
After you have verified the paths to that disk on each system, you can run
vxfentsthdw with no arguments, which prompts you for the systems and
then for the path to that disk from each system. A verified path means that the
SCSI inquiry succeeds. For example, vxfenadm returns a disk serial number
from a SCSI-3 disk and an ioctl failed message from a non-SCSI-3 disk.
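For example, here is a sketch using a hypothetical Solaris device path; run the same command on each system and compare the results:

# vxfenadm -i /dev/rdsk/c2t13d0s2

If the SCSI inquiry succeeds, the command reports the disk serial number; the same serial number should appear for that disk on every system.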
Run the start script for fencing on each system.
/etc/init.d/vxfen start
Save and close the configuration.
haconf -dump -makero
Stop VCS on all systems.
hastop -all
Set UseFence in main.cf.
UseFence = SCSI3
Restart VCS.
hastart [-stale]
! You must stop and restart service groups so that the disk groups are imported using SCSI-3 reservations.
When the fencing driver is started, the start script populates the vxfentab file with the current list of all paths to the coordinator disks.
Note: This is the reason coordinator disks cannot be dynamically replaced. The
fencing driver must be stopped and restarted to populate the vxfentab file
with the updated paths to the coordinator disks.
6 Save and close the cluster configuration before modifying main.cf to ensure
that the changes you make to main.cf are not overridden.
7 Stop VCS on all systems. Do not use the -force option. You must stop and
restart service groups to reimport disk groups to place data under fencing
control.
8 Set the UseFence cluster attribute to SCSI3 in the main.cf file.
Note: You cannot set UseFence dynamically while VCS is running.
1 Node 2 is cut off from the heartbeat network, loses the race, and panics.
2 Shut down Node 2.
3 Fix the system or the interconnect.
4 Start Node 2.
Summary
This lesson described how VCS protects data in a shared storage environment,
focusing on the concepts and basic operations of the I/O fencing feature available
in VCS version 4.
Next Steps
Now that you understand how VCS behaves normally and when faults occur, you
can gain experience performing basic troubleshooting in a cluster environment.
Additional Resources
VERITAS Cluster Server Installation Guide
This guide describes I/O fencing configuration.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
VERITAS Volume Manager Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing storage using Volume Manager.
http://van.veritas.com
The VERITAS Architect Network provides access to technical papers
describing I/O fencing.
trainxx trainxx
Disk 1:___________________
Disk 3:___________________
nameDG1, nameDG2
Goal
The purpose of this lab is to set up I/O fencing in a two-node cluster and simulate
node and communication failures.
Prerequisites
Work with your lab partner to complete the tasks in this lab exercise.
Results
Each student observes the failure scenarios and performs the tasks necessary to
bring the cluster back to a running state.
Introduction
Overview
In this lesson you learn an approach for detecting and solving problems with
VERITAS Cluster Server (VCS) software. You work with specific problem
scenarios to gain a better understanding of how the product works.
Importance
To successfully deploy and manage a cluster, you need to understand the
significance and meaning of errors, faults, and engine problems. This helps you
detect and solve problems efficiently and effectively.
Outline of Topics
Monitoring VCS
Troubleshooting Guide
Cluster Communication Problems
VCS Engine Problems
Service Group and Resource Problems
Archiving VCS-Related Files
Monitoring VCS
VCS provides numerous resources you can use to gather information about the
status and operation of the cluster. These include:
VCS log files
VCS engine log file, /var/VRTSvcs/log/engine_A.log
Agent log files
hashadow log file, /var/VRTSvcs/log/hashadow_A.log
System log files:
/var/adm/messages (/var/adm/syslog on HP-UX)
/var/log/syslog
The hastatus utility
Notification by way of SNMP traps and e-mail messages
Event triggers
Cluster Manager
The information sources that have not been covered elsewhere in the course are
discussed in more detail in the next sections.
Unique Message
Identifier (UMI)
2003/05/20 16:00:09 VCS NOTICE V-16-1-10322
System S1 (Node '0') changed state from
STALE_DISCOVER_WAIT to STALE_ADMIN_WAIT
2003/05/20 16:01:27 VCS INFO V-16-1-50408
Received connection from client Cluster Manager -
Java Console (ID:400)
2003/05/20 16:01:31 VCS ERROR V-16-1-10069
All systems have configuration files marked STALE.
Unable to form cluster.
The most recent entries appear at the end of the log.
VCS Logs
In addition to the engine_A.log primary VCS log file, VCS logs information
for had, hashadow, and all agent programs in these locations:
had: /var/VRTSvcs/log/engine_A.log
hashadow: /var/VRTSvcs/log/hashadow_A.log
Agent logs: /var/VRTSvcs/log/AgentName_A.log
Messages in VCS logs have a unique message identifier (UMI) built from product,
category, and message ID numbers. Each entry includes a text code indicating
severity, from CRITICAL entries indicating that immediate attention is required,
to INFO entries with status information.
The log entries are categorized as follows:
CRITICAL: VCS internal message requiring immediate attention
Note: Contact Customer Support immediately.
ERROR: Messages indicating errors and exceptions
WARNING: Messages indicating warnings
UMIs map to Tech Note IDs.
UMI-Based Support
UMI support in all VERITAS 4.x products, including VCS, provides a mapping
between the message ID number and technical notes provided on the Support Web
site. This helps you quickly find solutions to the specific problem indicated by the
message ID.
Use the Support Web site to:
Download patches.
Track your cases.
Search for tech notes.
The VERITAS Architect Network (VAN) is another forum for technical information.
Start troubleshooting with: hastatus -sum
Troubleshooting Guide
VCS problems are typically one of three types:
Cluster communication
VCS engine startup
Service groups, resources, or agents
Procedure Overview
To start troubleshooting, determine which type of problem is occurring based on
the information displayed by hastatus -summary output.
Cluster communication problems are indicated by the message:
Cannot connect to server -- Retry Later
VCS engine startup problems are indicated by systems in the
STALE_ADMIN_WAIT or ADMIN_WAIT state.
Other problems are indicated when the VCS engine, LLT, and GAB are all
running on all systems, but service groups or resources are in an unexpected
state.
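For example, here is abbreviated hastatus -summary output from a healthy two-system cluster; the system and service group names are illustrative and the exact columns vary by VCS version:

# hastatus -summary
-- SYSTEM STATE
-- System        State          Frozen
A  S1            RUNNING        0
A  S2            RUNNING        0

-- GROUP STATE
-- Group     System    Probed    AutoDisabled    State
B  websg     S1        Y         N               ONLINE
B  websg     S2        Y         N               OFFLINE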
Checking GAB
Check the status of GAB using gabconfig:
gabconfig -a
If no port memberships are present, GAB is not seeded. This indicates a
problem with GAB or LLT.
Check LLT (next section). If all systems can communicate over LLT, check
/etc/gabtab and verify that the seed number is specified correctly.
If port h membership is not present, the VCS engine (had) is not running.
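As a check of the seed number mentioned above, in a two-system cluster you would typically expect the count in /etc/gabtab to match the number of systems (a sketch):

# cat /etc/gabtab
/sbin/gabconfig -c -n 2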
Checking LLT
Run the lltconfig command to determine whether LLT is running. If it is not
running:
Check the console and system log for messages indicating missing or
misconfigured LLT files.
Check the LLT configuration files, llttab, llthosts, and sysname to
verify that they contain valid and matching entries.
Use other LLT commands to check the status of LLT, such as lltstat and
lltconfig -a list.
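For example, a quick check sequence using the commands listed above:

# lltconfig              Reports whether LLT is running.
# lltstat -nvv | pg      Shows verbose link status for each node.
# lltconfig -a list      Lists the link addresses LLT is using.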
STALE_ADMIN_WAIT
If you try to start VCS on a system where the local disk configuration is stale and
there are no other running systems, the VCS engine transitions to the
STALE_ADMIN_WAIT state. This signals that administrator intervention is
required in order to get the VCS engine into the running state, because the
main.cf may not match the configuration that was in memory when the engine
stopped.
If the VCS engine is in the STALE_ADMIN_WAIT state:
1 Visually inspect the main.cf file to determine if it is up-to-date (reflects the
current configuration).
2 Edit the main.cf file, if necessary.
3 Verify the main.cf file syntax, if you modified the file:
hacf -verify config_dir
4 Start the VCS engine on the system with the valid main.cf file:
hasys -force system_name
The other systems perform a remote build from the system now running.
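For example, here is a sketch using the default configuration directory and system S1, after you have verified that main.cf on S1 is valid:

# hacf -verify /etc/VRTSvcs/conf/config
# hasys -force S1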
ADMIN_WAIT
The ADMIN_WAIT state results when a system is performing a remote build and
the last running system in the cluster fails before the configuration is delivered. It
can also occur if the VCS is performing a local build and the main.cf is missing
or invalid (syntax errors).
In either case, fix the problem as follows:
1 Locate a valid main.cf file from a main.cf.previous file on disk or a
backup on tape or other media.
2 Replace the invalid main.cf with the valid version on the local node.
3 Use the procedure specified for a stale configuration to force VCS to start.
Ensure that the resources are not running outside of VCS control.
Verify that there are no network partitions in the cluster.
To clear the AutoDisabled attribute, type:
hagrp -autoenable service_group -sys system_name
! This most commonly occurs if you are not using a sysname file and someone changes the UNIX host name.
Concurrency Violations
A concurrency violation occurs when a failover service group becomes fully or
partially online on more than one system. When this happens, VCS takes the
service group offline on the system that caused the concurrency violation and
invokes the violation event trigger on that system.
The Violation trigger is configured by default during installation. The violation
trigger script is placed in /opt/VRTSvcs/bin/triggers and no other
configuration is required.
The script notifies the administrator and takes the service group offline on the
system where the trigger was invoked.
The script can send a message to the system log and console on all cluster systems
and can be customized to send additional messages or e-mail messages.
Example: NFS service groups have this problem if an NFS client does not disconnect. The Share resource cannot come offline when a client is connected. You can configure ResNotOff to forcibly stop the share.
Resource Problems
Agent Problems
An agent process should be running on the system for each configured resource
type. If the agent process is stopped for any reason, VCS cannot carry out
operations on any resource of that type. Check the VCS engine and agent logs to
identify what caused the agent to stop or prevented it from starting. It could be an
incorrect path for the agent binary, the wrong agent name, or a corrupt agent
binary.
Use the haagent command to restart the agent. Ensure that you start the agent on
all systems in the cluster.
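For example, here is a sketch that restarts a stopped agent; the IP agent on system S1 is used purely for illustration:

# haagent -start IP -sys S1

Repeat the command for each system in the cluster on which the agent should be running.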
# hasnap -backup -f /tmp/vcs.tar -n -m Oracle_Cluster
V-8-1-15522 Initializing file "vcs.tar" for backup.
V-8-1-15526 Please wait...
Checking VCS package integrity
Collecting VCS information
..
Compressing /tmp/vcs.tar to /tmp/vcs.tar.gz
Done.
Option Purpose
-backup Copies the files to a local predefined directory
-restore Copies the files in the specified snapshot to a directory
-display Lists all snapshots and the details of a specified snapshot
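For example, here is a sketch of listing snapshots and restoring the one created in the backup example above; the file name and options simply mirror that example:

# hasnap -display
# hasnap -restore -f /tmp/vcs.tar -n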
Summary
This lesson described how to detect and solve problems in a VCS cluster. Common problem scenarios were described and solutions were provided, as well as a general-purpose troubleshooting methodology.
Next Steps
Now that you have learned how to configure, manage, and troubleshoot high
availability services in the VCS environment, you can learn how to manage more
complex cluster configurations, such as multinode clusters.
Additional Resources
Troubleshooting Job Aid
This quick reference is included with this participant guide.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
http://support.veritas.com
This Web site provides troubleshooting information about all VERITAS
products.
Optional lab: If your instructor indicates that your classroom has access to the VERITAS Support Web site, search http://support.veritas.com for technical notes to help you solve the problems created as part of this lab exercise.
Prerequisites
Wait for your instructor to indicate that your systems are ready for troubleshooting.
Results
The cluster is running with all service groups online.
Optional Lab
Some classrooms have access to the VERITAS Support Web site. If your instructor
indicates that your classroom network can access VERITAS Support, search http://support.veritas.com for technical notes to help you solve the problems created as part of this lab exercise.
A shutdown 5-12
start 5-17
abort sequence 2-14 application components
access control 6-8 stopping 5-20
access, controlling 6-6 application service
adding license 3-7 definition 1-7
admin account 2-16 testing 5-13
ADMIN_WAIT state atomic broadcast mechanism 12-4
definition 15-17 attribute
in ManageFaults attribute 11-8 display 4-7
in ResAdminWait trigger 11-25 local 9-13
recovering resource from 11-22 override 11-19
administration application 4-4 resource 1-12, 7-10
administrative IP address 5-8 resource type 11-13, 11-15
administrator, network 5-11 service group failover 11-7
agent service group validation 5-25
clean entry point 1-14, 11-5 verify 5-23
close entry point 7-28 autodisable
communication 12-4 definition 15-19
custom 15-32 in jeopardy 13-8
definition 1-14 service group 12-20
logs 15-5 AutoDisabled attribute 12-20, 15-19
monitor entry point 1-14 AutoFailover attribute 11-9
offline entry point 1-14 AutoStart attribute 4-9, 15-18
online entry point 1-14 AutoStartList attribute 15-18
troubleshooting 15-30
AIX
configure IP address 5-9
B
configure virtual IP address 5-16 backup configuration files 8-19
llttab 12-12 base IP address 5-8
lslpp command 3-14 best practice
SCSI ID 2-10 application management 4-4
startup files 12-18 application service testing 5-13
AllowNativeCliUsers attribute 6-7 cluster interconnect 2-7
application boot disk 2-8
clean 5-12 Bundled Agents Reference Guide 1-15
component definition 5-4
configure 5-12
IP address 5-15 C
management 4-4 cable, SCSI 2-9
managing 4-4
child resource
manual migration 5-21
configuration 7-9
preparation procedure 5-13
dependency 1-11
prepare 5-4
linking 7-31
service 5-4
clean entry point 11-5 protection 6-17
clear save 6-15
autodisable 15-19 cluster interconnect
resource fault 4-16 configuration files 3-12
CLI configure 3-8
online configuration 6-13 definition 1-16
resource configuration 7-16 VCS startup 6-23, 6-24
service group configuration 7-6 Cluster Manager
close installation 3-17
cluster configuration 6-16 online configuration 6-13
entry point 7-28 Windows 3-18
cluster Cluster Monitor 4-22
campus 14-26 cluster state
communication 12-4 GAB 6-27, 12-4
configuration 1-22 remote build 6-24, 6-27
configuration files 3-13 running 6-27
configure 3-5 Stale_Admin_Wait 6-25
create configuration 6-19 unknown 6-25
definition 1-5 Wait 6-26
design Intro-6, 2-6 ClusterService group
duplicate configuration 6-20 installation 3-8
duplicate service group configuration 6-21 main.cf file 3-13
ID 2-16, 12-11 notification 10-6
installation preparation 2-16 command-line interface 7-6
interconnect 1-16 communication
interconnect configuration 13-18
agent 12-4
maintenance 2-4
between cluster systems 12-5
managing applications 4-4
cluster problems 15-9
member systems 12-8
configure 13-18
membership 1-16, 12-7
fencing 14-21
membership seeding 12-17
within a system 12-4
membership status 12-7
component testing 5-19
name 2-16
Running state 6-24 concurrency violation
simulator 4-18 in failover service group 15-24
terminology 1-4 in frozen service group 4-13
troubleshooting 8-16 prevention 12-20
cluster communication configuration
configuration files 3-12 application 5-12
overview 1-16 application IP address 5-15
cluster configuration application service 5-6
backup files 8-19
build from file 6-28
build from file 6-28
close 6-16
cluster 3-5
in memory 6-22
cluster interconnect 3-8
in-memory 6-24
downtime 6-5
modification 8-14
fencing 3-16, 14-27
offline 6-18
files 1-23
open 6-14
GAB 12-16