Symantec Cluster
Server 6.x for UNIX:
Administration
Fundamentals
Lessons
100-002839-A
COURSE DEVELOPERS
Raj Kiran Prasad Thota

LEAD SUBJECT MATTER EXPERTS
Graeme Gofton
Sean Nockles
Brad Willer
Gaurav Dong

TECHNICAL CONTRIBUTORS AND REVIEWERS
Geoff Bergren
Kelli Cameron
Tomer Gurantz
Anthony Herr
James Kenney
Bob Lucas
Paul Johnston
Rod Pixley
Clifford Barcliff
Danny Yonkers
Antonio Antonucci
Satoko Saito
Steve Evans
Feng Liu
Maurizio Lancia

Copyright © 2014 Symantec Corporation. All rights reserved. Symantec, the Symantec Logo, and VERITAS are trademarks or registered trademarks of Symantec Corporation or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners.

THIS PUBLICATION IS PROVIDED “AS IS” AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID. SYMANTEC CORPORATION SHALL NOT BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES IN CONNECTION WITH THE FURNISHING, PERFORMANCE, OR USE OF THIS PUBLICATION. THE INFORMATION CONTAINED HEREIN IS SUBJECT TO CHANGE WITHOUT NOTICE.

No part of the contents of this book may be reproduced or transmitted in any form or by any means without the written permission of the publisher.

Symantec Cluster Server 6.x for UNIX: Administration Fundamentals
Symantec Corporation
World Headquarters
350 Ellis Street
Mountain View, CA 94043
United States
http://www.symantec.com
Course Introduction
Clustering concepts
The term cluster refers to multiple independent systems connected into a
management framework.
Types of clusters
A variety of clustering solutions are available for various computing purposes.
• HA clusters: Provide resource monitoring and automatic startup and failover
• Parallel processing clusters: Break large computational programs into smaller
tasks executed in parallel on multiple systems
• Load balancing clusters: Monitor system load and distribute applications
automatically among systems according to specified criteria
• High performance computing clusters: Use a collection of computing
resources to enhance application performance
Unlike the N-to-1 configuration, after the failed server is repaired, it can
become the redundant server.
• N-to-N—This configuration is an active/active configuration that supports
multiple application services running on multiple servers. Each application
service is capable of being failed over to different servers in the cluster.
In the example shown in the slide, utilization is increased by reconfiguring four
active/passive clusters and one active/active cluster into one N-to-1 cluster and one
N-to-N cluster. This enables a savings of four systems.
Campus clusters
The campus or stretch cluster environment is a single cluster stretched over
multiple locations, connected by an Ethernet subnet for the cluster interconnect
and a Fibre Channel SAN, with storage mirrored at each location.
Advantages of this configuration are:
• It provides local high availability within each site as well as protection against
site failure.
• It is a cost-effective solution; replication is not required.
Global clusters
Global clusters, or wide-area clusters, contain multiple clusters in different
geographical locations. Global clusters protect against site failures by providing
data replication and application failover to remote data centers.
Global clusters are not limited by distance because cluster communication uses
TCP/IP. Replication can be provided by hardware vendors or by a software
solution, such as Veritas Volume Replicator, for heterogeneous array support.
HA application services
An application service is a collection of hardware and software components
required to provide a service, such as a Web site, that an end-user can access by
connecting to a particular network IP address or host name. Each application
service typically requires components of the following three types:
• Application binaries (executables)
• Network
• Storage
If an application service needs to be switched to another system, all of the
components of the application service must migrate together to re-create the
service on another system.
These are the same components that the administrator must manually move from a
failed server to a working server to keep the service available to clients in a
nonclustered environment.
Application service examples include:
• A Web service consisting of a Web server program, IP addresses, associated
network interfaces used to allow access into the Web site, a file system
containing Web data files, and a volume and disk group containing the file
system.
• A database service may consist of one or more IP addresses, database
management software, a file system containing data files, a volume and disk
group on which the file system resides, and a NIC for network access.
External dependencies
Whenever possible, it is good practice to eliminate or reduce reliance by high
availability applications on external services. If it is not possible to avoid outside
dependencies, ensure that those services are also highly available.
For example, network name and information services, such as DNS (Domain
Name System) and NIS (Network Information Service), are designed with
redundant capabilities.
VCS terminology
VCS cluster
A VCS cluster is a collection of independent systems working together under the
VCS management framework for increased service availability.
VCS clusters have the following components:
• Up to 64 systems—sometimes referred to as nodes or servers
Each system runs its own operating system.
• A cluster interconnect, which enables cluster communications
• A public network, connecting each system in the cluster to a LAN for client
access
• Shared storage (optional), accessible by each system in the cluster that needs to
run the application
• The list of cluster systems on which you want the group to start automatically
Resource categories
• Persistent: never taken offline
– None
VCS can only monitor persistent resources—these resources cannot be brought online or taken offline.
Resource dependencies
Resources depend on other resources because of application or operating system
requirements. Dependencies are defined to configure VCS for these requirements.
Dependency rules
These rules apply to resource dependencies:
• A parent resource depends on a child resource. In the diagram, the Mount
resource (parent) depends on the Volume resource (child). This dependency
illustrates the operating system requirement that a file system cannot be
mounted without the Volume resource being available.
• Dependencies are homogeneous. Resources can depend only on other
resources.
• No cyclical dependencies are allowed. There must be a clearly defined
starting point.
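The Mount-on-Volume rule above can be written in main.cf syntax. The following sketch uses hypothetical resource names and an abbreviated attribute list, not the actual course configuration:

```
Mount appmnt (
    MountPoint = "/appdata"
    BlockDevice = "/dev/vx/dsk/appdatadg/appdatavol"
    FSType = vxfs
    )
appmnt requires appvol
```

The requires statement makes appmnt the parent and appvol the child, so VCS brings the volume online before attempting the mount.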
The difference between offline and clean is that offline is an orderly termination
and clean is a forced termination. In UNIX, this can be thought of as the difference
between exiting an application and sending the kill -9 command to the
process.
Each resource type needs a different way to be controlled. To accomplish this,
each agent has a set of predefined entry points that specify how to perform each of
the four actions. For example, the startup entry point of the Mount agent mounts a
block device on a directory, whereas the startup entry point of the IP agent uses the
ifconfig (Solaris, AIX, HP-UX) or ip addr add (Linux) command to set
the IP address on a unique IP alias on the network interface.
VCS provides both predefined agents and the ability to create custom agents.
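As a sketch of what an entry point does, the online action of the IP agent on Linux amounts to commands along these lines. The interface name and address are illustrative assumptions, not values from the course environment:

```shell
# Roughly what the IP agent's online entry point performs on Linux
ip addr add 10.10.21.198/24 dev eth0 label eth0:1   # plumb the virtual IP as an alias
ip addr show dev eth0                               # monitor checks that the alias exists
```

The real agent adds error handling and reads these values from resource attributes.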
Note: The Veritas Cluster User’s Guide provides an appendix with a complete
description of attributes for all cluster objects.
To obtain PDF versions of product documentation for VCS and agents, see the
SORT Web site.
Low-Latency Transport
Clustering technologies from Symantec use a high-performance, low-latency
protocol for communications. LLT is designed for the high-bandwidth and low-
latency needs of not only Veritas Cluster Server, but also Veritas Cluster File
System and Veritas Storage Foundation for Oracle RAC.
LLT runs directly on top of the Data Link Provider Interface (DLPI) layer over
Ethernet and has several major functions:
• Sending and receiving heartbeats over network links
• Monitoring and transporting network traffic over multiple network links to
every active system
• Balancing the cluster communication load over multiple links
• Maintaining the state of communication
I/O fencing
The fencing driver implements I/O fencing, which prevents multiple systems from
accessing the same Volume Manager-controlled shared storage devices in the
event that the cluster interconnect is severed. In the example of a two-node cluster
displayed in the diagram, if the cluster interconnect fails, each system stops
receiving heartbeats from the other system.
GAB on each system determines that the other system has failed and passes the
cluster membership change to the fencing module.
The fencing modules on both systems contend for control of the disks according to
an internal algorithm. The losing system is forced to panic and reboot. The
winning system is now the only member of the cluster, and it fences off the shared
data disks so that only systems that are still part of the cluster membership (only
one system in this example) can access the shared storage.
The winning system takes corrective action as specified within the cluster
configuration, such as bringing service groups online that were previously running
on the losing system.
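When fencing is configured, its state can be inspected from the command line. A hedged sketch; the exact output varies by platform and configuration:

```shell
vxfenadm -d      # display the fencing mode and current cluster membership
gabconfig -a     # a 'port b' membership entry indicates the fencing driver is running
```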
VCS architecture
Maintaining the cluster configuration
HAD maintains configuration and state information for all cluster resources in
memory on each cluster system. Cluster state refers to tracking the status of all
resources and service groups in the cluster. When any change to the cluster
configuration occurs, such as the addition of a resource to a service group, HAD
on the initiating system sends a message to HAD on each member of the cluster by
way of GAB atomic broadcast, to ensure that each system has an identical view of
the cluster.
Atomic means that all systems receive updates, or all systems are rolled back to the
previous state, much like a database atomic commit.
The cluster configuration in memory is created from the main.cf file on disk in
the case where HAD is not currently running on any cluster systems, so there is no
configuration in memory. When you start VCS on the first cluster system, HAD
builds the configuration in memory on that system from the main.cf file.
Changes to a running configuration (in memory) are saved to disk in main.cf
when certain operations occur. These procedures are described in more detail later
in the course.
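The save-to-disk behavior described above is driven by the haconf command. A minimal sketch, assuming a service group named websg and systems s1 and s2:

```shell
haconf -makerw                            # open the in-memory configuration read-write
hagrp -modify websg AutoStartList s1 s2   # example change; HAD broadcasts it to all nodes
haconf -dump -makero                      # write main.cf to disk on each node and close
```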
are restarted.
Labs and solutions for this lesson are located on the following pages.
• “Lab environment,” page A-3.
Cluster interconnect
Veritas Cluster Server requires a minimum of two heartbeat channels for the
cluster interconnect.
Loss of the cluster interconnect results in downtime and, in nonfencing
environments, can result in a split-brain condition (described in detail later in the
course).
Configure a minimum of two physically independent Ethernet connections on each
cluster system.
Shared storage
VCS is designed primarily as a shared data high availability product; however, you
can configure a cluster that has no shared storage.
For shared storage clusters, consider these recommendations:
• One HBA minimum for shared and one for nonshared (boot) disks:
When a Solaris system in a VCS cluster is paused with the Stop-A key sequence,
the system stops producing VCS heartbeats. This causes the other systems to
consider it a failed node.
Ensure that the only action possible after an abort is a reset. To ensure that you
never issue a go function after an abort, create an alias for the go function that
displays a message. See the Veritas Cluster Server Installation Guide for the
detailed procedure.
Preparation assistance
Several tools are available from the Symantec Operations Readiness Tools (SORT)
Web site to help you prepare your environment to implement clustering.
• Data collection and reporting tools
A data collector can be run from the Web site, or downloaded locally, to gather
system information, run preinstallation checks, and generate reports.
• Documentation and compatibility lists
All product documentation, as well as software and hardware compatibility
lists are available from SORT.
• Preparation checklists
Platform-specific checklists can be created to assist in preparing an
environment for clustering.
• Patch management
SORT provides access to all products in the Storage Foundation HA family.
• Risk assessment
Checklists and reports can be used to analyze your environment, identify
risks, and recommend remedies.
• Error code lookup
SORT enables you to search for additional information about error messages.
You can also request help for undocumented error codes.
• Inventory management service
Inventory management is a service that provides the ability to gather license
information from Storage Foundation HA deployments.
Alternatively, you can run installvcs from the location of your VCS product
distribution to check your environment, and examine the resulting log file to
assess readiness to install VCS.
cd sw_location
./installvcs -precheck system1 system2
For more information about these selections, see the Veritas Cluster Server
Installation Guide.
Lab solutions for this lesson are located on the following pages.
• “Lab 2: Validating site preparation,” page A-47
The Web installer supports most features of the installer utility. See the Veritas
Cluster Server Installation Guide for a description of supported options. The guide
also includes the browser types and versions supported by the Web installer.
If you are using VCS with shared storage devices that support SCSI-3 Persistent
Reservations, configure fencing after VCS is initially installed.
SCSI-3-based fencing provides the highest level of protection for data that is
located on shared storage and accessed by multiple cluster nodes.
You can configure fencing at any time using the installvcs -fencing
utility, as described in the “I/O Fencing” lesson. However, if you set up fencing
after you have service groups running, you must stop and restart VCS for fencing
to take effect.
Product documentation is not included with the software packages. You can
download all documentation from the SORT Web site.
This file contains the command line that is used to start GAB.
Cluster communication is described in detail later in the course.
Note: This command line shows status only if a module is using LLT, such as
GAB. If GAB is not running, the output shows a comm wait state.
The configured and active options show only nodes where LLT is
configured or active.
The lltconfig command just displays whether LLT is running, with no detail.
LLT is discussed in more detail later in the course. For now, you can see that LLT
is running using these commands.
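For reference, a sketch of the commands commonly used to check LLT and GAB status; output formats vary by platform and version:

```shell
lltconfig        # reports only whether LLT is running
lltstat -nvv     # verbose per-node, per-link LLT status
gabconfig -a     # GAB port memberships (port a is GAB itself, port h is HAD)
```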
Lab solutions for this lesson are located on the following pages.
• “Lab 3: Installing Storage Foundation HA 6.x,” page A-65.
• Determine the virtual IP address for the websg service group.
hares -value webip Address
#Resource Attribute System Value
webip Address global 10.10.27.93
• Determine the state of a resource on each cluster system.
hares -state webip
#Resource Attribute System Value
webip State s1 OFFLINE
webip State s2 ONLINE
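The tabular output above can be filtered with standard text tools. This sketch embeds the example output as sample text, so the parsing logic can be tried without a running cluster:

```shell
# Sample 'hares -state webip' output, embedded here for illustration
sample='#Resource Attribute System Value
webip State s1 OFFLINE
webip State s2 ONLINE'

# Print the system on which the resource is ONLINE (field 3 of matching rows)
online_sys=$(printf '%s\n' "$sample" | awk '$4 == "ONLINE" {print $3}')
echo "$online_sys"
```

Against the sample above, this prints s2.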
Provide the service group name and the name of the system where the service
group is to be brought online.
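For example, assuming the websg service group and the systems shown in the earlier output:

```shell
hagrp -online websg -sys s1    # bring websg online on system s1
```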
Note: The service group shown in the slide is partially online after the webdg
resource is brought online. This is depicted by the textured coloring of the
service group circle.
Lab solutions for this lesson are located on the following pages.
• “Lab 4: Performing common VCS operations,” page A-85.
The s1 system is now in the VCS local build state, meaning that VCS is building
the cluster configuration in memory from the local main.cf file.
The startup process is repeated on each system until all members have identical
copies of the cluster configuration in memory and matching main.cf files on
local disks. Synchronization is maintained by data transfer through LLT and GAB.
Use caution with this option. VCS does not warn you if the configuration is
open and you stop VCS using the -force option.
• The -local option causes the service group to be taken offline on s1 and
stops the VCS engine (had) on s1.
• The -local -evacuate options cause the service group on s1 to be
migrated to s2 and then stop had on s1.
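The stop variants discussed above can be sketched as follows, using the s1 system from the example:

```shell
hastop -local                # take service groups offline on s1, then stop had on s1
hastop -local -evacuate      # migrate service groups from s1 first, then stop had on s1
hastop -all -force           # stop had on all systems but leave applications running
```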
The had daemon communicates the configuration change to had on all other
nodes in the cluster, and each had daemon changes the in-memory configuration.
When the command to save the configuration is received from Cluster Manager,
had communicates this command to all cluster systems, and each system’s had
daemon writes the in-memory configuration to the main.cf file on its local disk.
The VCS command-line interface is an alternate online configuration tool. When
you run ha commands, had responds in the same fashion.
By default, only the UNIX root account is able to use VCS ha commands to
administer VCS from the command line.
Note: The effect of halogin only applies for that shell session.
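A hedged example; the user name is a placeholder for a VCS account that has already been created:

```shell
halogin vcsadmin password    # cache credentials; ha commands now work in this shell session
```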
Note: In non-secure mode, if you change a UNIX account, this change is not
reflected in the VCS configuration automatically. You must manually modify
accounts in both places if you want them to be synchronized.
Lab solutions for this lesson are located on the following pages.
• “Lab 5: Starting and stopping VCS,” page A-97.
– Network interfaces
• Application-related resources:
– Identical installation and configuration procedures
– Procedures to manage and monitor the application
– The location of application binary and data files
The following sections describe the aspects of these components that are critical to
understanding how VCS manages resources.
Note: If your systems are not configured identically, you must note those
differences in the design worksheet. The “Online Configuration” lesson
shows how you can configure a resource with different attribute values for
different systems.
Note: Although examples used throughout this course are based on Veritas
Volume Manager, VCS also supports other volume managers. VxVM is
shown for simplicity—objects and commands are essentially the same on
all platforms. The agents for other volume managers are described in the
Veritas Cluster Server Bundled Agents Reference Guide.
Preparing shared storage, such as creating disk groups, volumes, and file systems,
is performed once, from one system. Then you must create mount point directories
on each system.
• Apply licenses.
• Set up configuration files.
This ensures that you have correctly identified the information used by the VCS
agent scripts to control the application.
Note: The shutdown procedure should be a graceful stop, which performs any
cleanup operations.
AIX
mount -V vxfs /dev/vx/dsk/appdatadg/appdatavol /appdata
Linux
mount -t vxfs /dev/vx/dsk/appdatadg/appdatavol /appdata
Solaris/HP-UX
mount -F vxfs /dev/vx/dsk/appdatadg/appdatavol /appdata
Note: The admin IP address on s2 is also configured during system startup. This
address is unique and associated with only this system, unlike the virtual IP
address.
Note: These virtual IP addresses are only configured temporarily for testing
purposes. You must not configure the operating system to manage the
virtual IP addresses.
Note: In each case, you can edit /etc/hosts to assign a virtual host name
(application name) to the virtual IP address.
10.10.21.198 eweb.com
Follow the guidelines for your platform to remove an application from operating
system control in preparation for configuring VCS to control the application.
Note: To test the network resources, access one or more well-known addresses
outside of the cluster, such as local routers, or primary and secondary DNS
servers.
This helps you identify any potential configuration problems before you test the
service as a whole, as described in the “Testing the Integrated Components”
section.
exported file system, verify that you can mount the exported file system from a
client on the network. This is described in more detail later in the course.
Linux
ifdown eth0:1
Solaris
ifconfig e1000g0 removeif 10.10.21.198
Lab solutions for this lesson are located on the following pages.
• “Lab 6: Preparing application services,” page A-113.
• NetworkHosts: The list of hosts on the network that are used to determine if
the network connection is alive
It is recommended that you specify the IP address of the host rather than the
host name to prevent the monitor cycle from timing out due to DNS problems.
• Example device attribute values:
AIX: en0; HP-UX: lan2; Linux: eth0; Solaris: e1000g0
Optional Attributes
• NetMask: Netmask associated with the application IP address
– The value may be specified in decimal (base 10) or hexadecimal (base 16).
The default is the netmask corresponding to the IP address class.
– This is a required attribute on AIX.
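Putting these attributes together, a NIC resource definition might look like this main.cf sketch. The device name and network host address are illustrative assumptions:

```
NIC appnic (
    Device = eth0
    NetworkHosts = { "10.10.2.1" }
    )
```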
Note: As of version 4.1, VCS sets the vxdg autoimport option to no, which
disables autoimporting of disk groups.
• StartVolumes: Starts all volumes after the disk group is imported. This also
starts layered volumes by running vxrecover -s. The default is 1 (enabled)
on all UNIX platforms except Linux.
• StopVolumes: Stops all volumes with vxvol before the disk group is deported.
The default is 1 (enabled) on all UNIX platforms except Linux.
Note: The example operating system commands for unmounting a locked file
system are specific to Solaris. Other operating systems may use different
commands.
Note: Some resources must be disabled and reenabled. Only resources whose
agents have open and close entry points, such as MultiNICB, require you to
disable and enable again after fixing the problem. By contrast, a Mount
resource does not need to be disabled if, for example, you incorrectly
specify the MountPoint attribute.
Note: Faulted persistent resources should be probed to force the agent to monitor
the resource immediately. Otherwise, the resource is not shown as online
until the next OfflineMonitorInterval cycle, which can be up to five minutes.
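For example, to probe a resource immediately (resource and system names are illustrative):

```shell
hares -probe appnic -sys s1    # force the agent to monitor the resource now
```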
Test procedure
For simplicity, the example service group uses the default Priority failover policy.
That is, if a critical resource in appsg faults, the service group is taken offline and
brought online on the system with the lowest priority value that is available for
failover.
The “Handling Resource Faults” lesson provides additional information about
configuring and testing failover behavior. Additional failover policies are also
described in the Veritas Cluster Server for UNIX: Cluster Management participant
guide.
//{
//IP appip
// {
// NIC appnic
// }
//}
Note: You cannot use the // characters as general comment delimiters. VCS
strips out all lines with // upon startup and re-creates these lines based on the
requires statements in the main.cf file.
Note: When you set an attribute to its default value, the attribute is removed from
main.cf. For example, after you set Critical to 1 (its default) for a resource,
the Critical line no longer appears in the resource definition.
To see the values of all attributes for a resource, use the hares command. For
example:
hares -display appdg
Lab solutions for this lesson are located on the following pages.
• “Lab 7: Online configuration of a service group,” page A-125.
You can use the VOM to create and test a cluster configuration on Windows and
then copy the finalized configuration files into a real cluster environment. The
VOM enables you to create configurations for all supported UNIX, Linux, and
Windows platforms.
This only applies to the cluster configuration. You must perform all preparation
tasks to create and test the underlying resources, such as virtual IP addresses,
shared storage objects, and applications.
After the cluster configuration is copied to the real cluster and VCS is restarted,
you must perform complete testing of all objects, as shown later in this lesson.
VRTSvcs/conf/config directory.
3 Stop VCS.
Stop VCS on all cluster systems. This ensures that there is no possibility of
another administrator changing the cluster configuration while you are
modifying the main.cf file.
4 Edit the configuration files.
You must choose a system on which to modify the main.cf file. You can
choose any system. However, you must then start VCS first on that system.
5 Verify the configuration file syntax.
Note: The hacf command only identifies syntax errors, not configuration errors.
First system
Designate one system as the primary change management node. This makes
troubleshooting easier if you encounter problems with the configuration.
1 Save and close the configuration.
Save and close the cluster configuration before you start making changes. This
ensures that the working copy has the latest in-memory configuration.
Note: The dot (.) argument indicates that the current working directory is used as
the path to the configuration files. You can run hacf -verify from any
directory by specifying the path to the configuration directory:
hacf -verify /etc/VRTSvcs/conf/config
8 Stop VCS.
Stop VCS on all cluster systems after making configuration changes. To leave
applications running, use the -force option, as shown in the diagram.
main.cf file and loads the cluster configuration into local memory on s1.
5 Verify that VCS is in a local build or running state on s1 using hastatus
-sum.
Resource dependencies
Ensure that you create the resource dependency definitions at the end of the
service group definition. Add the links using the syntax shown in the slide.
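For example, dependency links at the end of a service group definition might read as follows; the resource names are illustrative:

```
appip requires appnic
appmnt requires appvol
appvol requires appdg
```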
Note: You cannot include comment lines in the main.cf file. The lines you see
starting with // are generated by VCS to show resource dependencies. Any
lines starting with // are stripped out during VCS startup.
Note: You must ensure that VCS is in the local build or running state on the
system with the recovered main.cf file before starting VCS on other
systems.
7 When HAD is in a running state on s1, this state change is broadcast on the
cluster interconnect by GAB.
8 Next, run hastart on s2 to start HAD.
9 HAD on s2 checks for a valid main.cf file. This system has an old version of
the main.cf.
10 HAD on s2 then checks for another node in a local build or running state.
11 Since s1 is in a local build or running state, HAD on s2 performs a remote
build from the configuration on s1.
12 HAD on s2 copies the cluster configuration into the local main.cf and
types.cf files after moving the original files to backup copies with
timestamps.
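The startup sequence above can be sketched as a pair of commands, assuming the configuration was edited on s1:

```shell
# On s1, the system with the edited main.cf:
hastart            # performs a local build from the main.cf file
hastatus -sum      # wait until s1 reports a local build or running state
# Then on s2:
hastart            # performs a remote build from s1's in-memory configuration
```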
Lab solutions for this lesson are located on the following pages.
• “Lab 8: Offline configuration,” page A-157.
Notification overview
When VCS detects certain events, you can configure the notifier to:
• Generate an SNMP (V2) trap to specified SNMP consoles.
• Send an e-mail message to designated recipients.
Message queue
VCS ensures that no event messages are lost while the VCS engine is running,
even if the notifier daemon stops or is not started. The had daemons
throughout the cluster communicate to maintain a replicated message queue.
If the service group with notifier configured as a resource fails on one of the nodes,
notifier fails over to another node in the cluster. Because the message queue is
guaranteed to be consistent and replicated across nodes, notifier can resume
message delivery from where it left off after it fails over to the new node.
Messages are stored in the queue until one of these conditions is met:
• The notifier daemon sends an acknowledgement to had that at least one
recipient has received the message.
• The queue is full. The queue is circular—the last (oldest) message is deleted in
order to write the current (newest) message.
• Messages that remain in the queue for one hour are deleted if notifier is
unable to deliver them to a recipient.
Note: Before the notifier daemon connects to had, messages are stored
permanently in the queue until one of the last two conditions is met.
• Cannot be autodisabled
• Switches to another node upon hastop -local on the online system
• Attempts to start on all miniclusters if a network partition occurs
ClusterService is also used to manage the wide-area connector process in a global
cluster environment.
Notification configuration
These high-level tasks are required to manually configure highly available
notification within the ClusterService group.
1 Add a NotifierMngr type of resource to the ClusterService group and link it to
the csgnic resource that is present in the group.
2 If SMTP notification is required:
a Modify the SmtpServer and SmtpRecipients attributes of the NotifierMngr
type of resource.
b Optionally, modify the ResourceOwner attribute of individual resources.
c Optionally, specify a GroupOwner e-mail address for each service group.
3 If SNMP notification is required:
a Modify the SnmpConsoles attribute of the NotifierMngr type of resource.
b Verify that the SNMPTrapPort attribute value matches the port configured
for the SNMP console. The default is port 162.
c Configure the SNMP console to receive VCS traps (described later in the
lesson).
4 Modify any other optional attributes of the NotifierMngr type of resource.
See the manual pages for notifier and hanotify for a complete description
of notification configuration options.
Note: Before modifying resource attributes, ensure that you take the resource
offline and disable it. The notifier daemon must be stopped and
restarted with new parameters in order for changes to take effect.
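For reference, a NotifierMngr resource configured for both SMTP and SNMP notification might look like this in main.cf (the host names, recipients, and severity levels are illustrative placeholders):

```
NotifierMngr ntfr (
    SmtpServer = "smtp.example.com"
    SmtpRecipients = { "admin@example.com" = SevereError }
    SnmpConsoles = { "snmpconsole.example.com" = Error }
    )

ntfr requires csgnic
```

The requires line links the NotifierMngr resource to the csgnic resource, as described in step 1.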
Overview of triggers
Using triggers
VCS provides an additional method for notifying users of important events. When
VCS detects certain events, you can configure a trigger to notify an administrator
or perform other actions. You can use event triggers in place of, or in conjunction
with, notification.
Triggers are executable programs (batch files, or shell or Perl scripts) associated with the predefined event types supported by VCS that are shown in the slide.
Triggers are configured by specifying one or more keys in the TriggersEnabled
attribute. Some keys are specific to service groups or resources.
The RESSTATECHANGE, RESRESTART, and RESFAULT keys apply to both
resources and service groups. When one of these keys is specified in TriggersEnabled
at the service group level, the trigger applies to each resource in the service group.
Examples of some trigger keys include:
• POSTOFFLINE: The service group went offline from a PARTIAL or ONLINE
state.
• POSTONLINE: The service group went online from OFFLINE state.
• RESFAULT: A resource faulted.
• RESRESTART: A resource was restarted after a fault.
For a complete description of triggers, see the Veritas Cluster Server
Administrator’s Guide.
Location of triggers
Trigger executable programs (batch files, or shell or Perl scripts) reside in /opt/VRTSvcs/bin/triggers by default.
You can change the location of triggers by specifying the TriggerPath attribute at
the service group or resource level. This attribute enables you to set up different
trigger programs for resources or service groups. In previous versions of VCS, the
same triggers applied to all resources or service groups in the cluster.
The value of the TriggerPath attribute is appended to /opt/VRTSvcs (also
referred to as VCS_HOME) to form a directory containing the trigger programs. In
the example shown in the slide, TriggerPath is set to bin/websg. Therefore, the
files executed when the PREONLINE key is specified for the websg service group
must be located in /opt/VRTSvcs/bin/websg.
The example portion of the main.cf file shows the PREONLINE trigger enabled
for websg on both s1 and s2, and the trigger path customized to map to
/opt/VRTSvcs/bin/websg.
Copy the script or program to each system in the cluster that can run the trigger.
Finally, modify the TriggersEnabled attribute to specify the key for each system
that can run the trigger.
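As a minimal sketch of what a trigger program can look like, the following shell script logs resource fault events to a file. The argument order and the log path are illustrative assumptions; see the Veritas Cluster Server Administrator's Guide for the exact arguments VCS passes to each trigger.

```shell
#!/bin/sh
# Sketch of a resfault-style trigger script. VCS invokes triggers with
# event details as positional arguments; the order shown here is an
# illustrative assumption.

log_fault() {
    system=$1       # system on which the resource faulted
    resource=$2     # name of the faulted resource
    prev_state=$3   # resource state before the fault
    logfile=${TRIGGER_LOG:-/var/tmp/vcs_resfault.log}

    # Record the event for later review; a real trigger might also
    # send mail or open a ticket here.
    echo "resfault: $resource on $system (was $prev_state)" >> "$logfile"
}

# When invoked by VCS, the event details are on the command line.
[ $# -ge 3 ] && log_fault "$@"
```

Such a script must be executable and must reside in the trigger directory (by default /opt/VRTSvcs/bin/triggers, or the customized TriggerPath location) on every system that can run it.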
Lab solutions for this lesson are located on the following pages.
• “Lab 9: Configuring notification,” page A-173.
You can customize failover behavior by setting one or more optional service group attributes. Failover determination and behavior are described throughout this lesson.
ManageFaults
The ManageFaults attribute can be used to prevent VCS from taking any automatic
actions whenever a resource failure is detected. Essentially, ManageFaults
determines whether VCS or an administrator handles faults for a service group.
If ManageFaults is set to the default value of ALL, VCS manages faults by executing the clean entry point for that resource to ensure that the resource is completely offline, as shown previously.
If this attribute is set to NONE, VCS places the resource in an ADMIN_WAIT
state and waits for administrative intervention. This is often used for service
groups that manage database instances. You may need to leave the database in its
FAULTED state in order to perform problem analysis and recovery operations.
Note: This attribute is set at the service group level. This means that any resource
fault within that service group requires administrative intervention if the
ManageFaults attribute for the service group is set to NONE.
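For example, a database service group might be configured with ManageFaults set to NONE directly in main.cf (the group and system names are illustrative):

```
group dbsg (
    SystemList = { s1 = 0, s2 = 1 }
    ManageFaults = NONE
    )
```

With this setting, any resource fault in dbsg leaves the faulted resource in the ADMIN_WAIT state until an administrator intervenes.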
AutoFailOver
This attribute determines whether automatic failover takes place when a resource
or system faults. The default value of 1 indicates that the service group should be
failed over to other available systems if at all possible. However, if the attribute is
set to 0, no automatic failover is attempted for the service group, and the service
group is left in an OFFLINE | FAULTED state.
MonitorInterval
This is the duration (in seconds) between two consecutive monitor calls for an
online or transitioning resource.
The default is 60 seconds for most resource types.
OfflineMonitorInterval
This is the duration (in seconds) between two consecutive monitor calls for an offline resource. The default is 300 seconds (five minutes) for most resource types.
Restart example
This example illustrates how the RestartLimit and ConfInterval attributes can be configured to modify the behavior of VCS when a resource faults.
Setting RestartLimit = 1 and ConfInterval = 180 has this effect when a resource
faults:
1 The resource stops after running for 10 minutes.
2 The next monitor returns offline.
3 The ConfInterval counter is set to 0.
4 The agent checks the value of RestartLimit.
5 The resource is restarted because RestartLimit is set to 1, which allows one restart within the ConfInterval time.
6 The next monitor returns online.
7 The ConfInterval counter is now 60; one monitor cycle has completed.
8 The resource stops again.
9 The next monitor returns offline.
10 The ConfInterval counter is now 120; two monitor cycles have completed.
11 The resource is not restarted because the RestartLimit counter is now 1 and the
ConfInterval counter is 120 (seconds). Because the resource has not been
online for the ConfInterval time of 180 seconds, it is not restarted.
12 VCS faults the resource.
If the resource had remained online for 180 seconds, the internal RestartLimit
counter would have been reset to 0.
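The settings used in this example correspond to the following static attribute values on the resource type, shown here as a types.cf sketch for the Process type (the type chosen is illustrative; as described below, these static attributes can also be overridden per resource):

```
type Process (
    static int RestartLimit = 1
    static int ConfInterval = 180
    static str ArgList[] = { PathName, Arguments }
    str PathName
    str Arguments
)
```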
Some predefined static resource type attributes (those resource type attributes that
do not appear in types.cf unless their value is changed, such as
MonitorInterval) and all static attributes that are not predefined (static attributes
that are defined in the type definition file) can be overridden. For a detailed list of
predefined static attributes that can be overridden, refer to the VERITAS Cluster
Server User’s Guide.
Note: You can also run hagrp -clear group [-sys system] to clear
all FAULTED resources in a service group. However, you have to ensure
that all of the FAULTED resources are completely offline and the faults are
fixed on all the corresponding systems before running this command.
The FAULTED status of a resource is cleared when the monitor returns an online
status for that resource. Note that offline resources are monitored according to the
value of OfflineMonitorInterval, which is 300 seconds (five minutes) by default.
To avoid waiting for the periodic monitoring, you can initiate the monitoring of the
resource manually by probing the resource.
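For example, the following command transcript (resource, group, and system names are illustrative) probes a repaired persistent resource and clears the FAULTED status, either per resource or for the whole group:

```
hares -probe webnic -sys s1
hares -clear webproc -sys s1
hagrp -clear websg -sys s1
```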
Lab solutions for this lesson are located on the following pages.
• “Lab 10: Configuring resource fault behavior,” page A-197.
Lab solutions for this lesson are located on the following pages.
• “Lab 11: IMF and AMF,” page A-243.
Note: The port a, port b, and port h generation numbers change each time the
membership changes.
• Specify the network device names used for the cluster interconnect.
• Modify LLT behavior, such as heartbeat frequency.
Note: Ensure that there is only one set-node line in the llttab file.
Note: The system (node) name does not need to be the UNIX host name found
using the hostname command. However, Symantec recommends that
you keep the names the same to simplify administration, as described in the
next section.
See the llthosts manual page for a complete description of the file.
Note: You can use the same cluster interconnect network infrastructure for
multiple clusters. The llttab file must specify the appropriate cluster ID
to ensure that there are no conflicting node IDs.
If you bypass the installer mechanisms for ensuring the cluster ID is unique and
LLT detects multiple systems with the same node ID and cluster ID on a private
network, the LLT interface is disabled on the node that is starting up. This prevents
a possible split-brain condition, where a service group might be brought online on
the two systems with the same node ID.
Note: If the sysname file contains a different name from the llttab/
llthosts/main.cf files, this “phantom” system is added to the cluster
upon cluster startup.
The sysname file can be specified for the set-node directive in the llttab
file. In this case, the llttab file can be identical on every node, which may
simplify reconfiguring the cluster interconnect in some situations.
See the sysname manual page for a complete description of the file.
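A minimal llttab using these directives might look like this (the cluster ID and Linux-style device names are illustrative; link directive syntax varies by platform):

```
set-node /etc/VRTSvcs/conf/sysname
set-cluster 10
link eth1 eth1 - ether - -
link eth2 eth2 - ether - -
```

Because set-node points at the sysname file rather than naming the node directly, this file can be identical on every node in the cluster.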
Note: Other gabconfig options are discussed later in this lesson. See the
gabconfig manual page for a complete description of the file.
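GAB is typically started at boot from the /etc/gabtab file, which contains a single gabconfig command. For a three-node cluster, a typical entry is:

```
/sbin/gabconfig -c -n3
```

The -c option configures the GAB driver and -n3 directs GAB to wait until three systems are communicating before seeding the cluster.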
Where HAD has been stopped with the hastop -all -force option, the
resources are marked as online.
In this example, there are two Ethernet LLT links for the cluster interconnect.
Prior to any failures, systems s1, s2, and s3 are part of the regular membership of
cluster number 1. When the s3 system fails, it is no longer part of the cluster
membership. Service group C fails over and starts up on either s1 or s2, according
to the SystemList and FailOverPolicy values.
The time required for the VCS policy module to determine the target system is
negligible, less than one second in all cases, in comparison to the other factors.
• Bring the service group online on another system in the cluster.
As described in an earlier lesson, the time required for the application service
to start up is a key factor in determining the total failover time.
CAUTION Only manually seed the cluster when you are sure that no other
systems have GAB seeded. In clusters that do not use I/O fencing,
you can potentially create a split-brain condition by using
gabconfig improperly.
After you have started GAB on one system, start GAB on other systems using
gabconfig with only the -c option. You do not need to force GAB to start with
the -x option on other systems. When GAB starts on the other systems, it
determines that GAB is already seeded and starts up.
and s2.
• Service groups A, B, and C continue to run and all other cluster functions
remain unaffected.
• Failover due to a resource fault or an operator request to switch a service group
is unaffected.
• If system s3 now faults or its last LLT link is lost, service group C is not started
on systems s1 or s2.
If an application starts on multiple systems and can gain control of what are normally exclusive resources, such as disks in a shared storage device, a split-brain condition results and data can be corrupted.
VCS uses the low-priority link only for heartbeats (at half the normal rate), unless
it is the only remaining link in the cluster interconnect.
Lab solutions for this lesson are located on the following pages.
• “Lab 12: Cluster communications,” page A-257.
Failure of the cluster interconnect presents identical symptoms. In this case, both
nodes determine that their peer has departed and attempt to take corrective action.
This can result in data corruption if both nodes are able to take control of storage in
an uncoordinated manner.
Other scenarios can cause this situation. If a system is so busy that it appears to be hung, it can appear to have failed from the perspective of another system in the cluster. The second
system would then take the corrective action of starting the services of the hung
system. This can also happen on systems where the hardware supports a break and
resume function. If the system is dropped to command-prompt level with a break
and subsequently resumed, the system can appear to have failed. The cluster is
reformed and then the system recovers and begins writing to shared storage again.
blocking access to other nodes. Persistent reservations are persistent across SCSI
bus resets and also support multiple paths from a host to a disk.
Coordinator disks
The coordinator disks act as a global lock mechanism used by the fencing driver to
determine which nodes are currently registered in the cluster. This registration is
represented by a unique key associated with each node that is written to the
coordinator disks. In order for a node to access a data disk, that node must have a
Note: The registration key is not actually written to disk, but is stored in the drive
electronics or RAID controller.
node 2 would be CVCS, and so on. For simplicity, these are shown as A and B in
the diagram.
After registering with the data disks, a Write Exclusive Registrants Only
reservation is set on the data disk. This reservation means that only the registered
system can write to the data disk.
in the disk group belonging to the dbsg service group. Node 1 is registered to write
to the data disks in the disk group belonging to the appsg service group.
After registering with the data disk, Volume Manager sets a Write Exclusive
Registrants Only reservation on the data disk.
because the SCSI-PR protocol says that only a member can eject a member.
This condition means that only one system can win.
3 Node 0 also wins the race for the second coordinator disk.
Node 0 is favored to win the race for the second coordinator disk according to
the algorithm used by the fencing driver. Because node 1 lost the race for the
first coordinator disk, node 1 has to sleep for one second (default) before it
tries to eject the other node’s key. This favors the winner of the first
coordinator disk to win the remaining coordinator disks. Therefore, node 1
does not gain control of the second or third coordinator disks.
Because VxVM controls access to the storage, adding or deleting disks is not a
problem. VxVM fences any new drive added to a disk group and removes keys
when drives are removed. VxVM also determines if new paths are added and
fences these, as well.
HAD starts service groups.
Using the coordinator=on option to vxdg for the coordinator disk group
ensures that the coordinator disk group has exactly three disks. This flag is set by
default when fencing is configured using the installer.
scsi3_disk_policy=dmp
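The scsi3_disk_policy setting shown above belongs in the /etc/vxfenmode file, which might contain entries such as the following (a sketch; your installation may include additional settings):

```
vxfen_mode=scsi3
scsi3_disk_policy=dmp
```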
Lab solutions for this lesson are located on the following pages.
• “Lab 13: Configuring SCSI3 disk-based I/O fencing,” page A-285.
Note: Before bringing VCS into the environment, ensure that all components are
properly configured, as described in the “Preparing Services for VCS”
lesson in the Veritas Cluster Server for UNIX: Install and Configure
participant guide.
agent process is not started on that system. The agent may be running on other
systems in the cluster if they are configured to run a resource of that type.
• A resource cannot be managed without an agent.
Custom and bundled VCS agents are located within subdirectories of the VCS bin
directory, /opt/VRTSvcs/bin. Database and other enterprise agents are located
in /opt/VRTSagents/ha/bin.
relationships among VCS components. The agent does not read the
configuration files directly. The VCS engine has the configuration in
memory and passes the configuration information to the agent when it
starts, when a new resource is created, and when an existing resource
configuration is modified. The agent then stores the configuration
information in memory.
Entry points
• Online: Runs StartProgram with the specified parameters in the specified user
context
• Offline: Runs StopProgram with the specified parameters in the specified user
context
• Monitor: If no MonitorProgram is specified, verifies that all processes listed in the PidFiles and MonitorProcesses attributes are running
Note: If you use only PidFiles for monitoring, you may receive a false indication
of online if the application has not cleared out the process IDs upon
restarting. The PIDs can be cleared using a startup script or an open entry
point.
Note: The MonitorProcesses values must match the output displayed by the ps
command exactly. For example, if the processes are displayed with full
path names, you must include the full path name when specifying the
processes to monitor.
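Putting these entry points together, an Application resource might be defined in main.cf as follows (all paths and process names are hypothetical placeholders):

```
Application app1 (
    User = root
    StartProgram = "/opt/myapp/bin/start"
    StopProgram = "/opt/myapp/bin/stop"
    PidFiles = { "/var/run/myapp.pid" }
    MonitorProcesses = { "/opt/myapp/bin/myappd -n listener" }
    )
```

Because MonitorProcesses entries must match ps output exactly, the full path and arguments are specified.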
3—Performs intelligent resource monitoring for both online and offline
resources (default for Application resource type)
The MonitorFreq key determines how often a resource is monitored by traditional
polling. When set to an integer greater than 0, the value of MonitorFreq is
multiplied by the value of the MonitorInterval and OfflineMonitorInterval
attributes to determine the frequency of running the poll-based monitor entry point
for online and offline resources, respectively.
RegisterRetry determines how many times the agent tries to register the resource
with the IMF notification module.
IMFRegList specifies the attributes registered with the IMF notification module
and should not be modified.
CONFIDENTIAL - NOT FOR DISTRIBUTION
Supported configurations
Intelligent monitoring is supported for the Application agent only under specific
configurations. The complete list of such configurations is provided in the table in
the slide.
See the Veritas Cluster Server Administrator’s Guide and Bundled Agents
Reference Guide for details about configuring IMF for the Application agent.
Lab solutions for this lesson are located on the following pages.
• “Lab 14: Configuring an Application resource,” page A-307.
Basic monitoring
The instance can be monitored by scanning the process table for the process IDs
(PIDs) for critical database processes. The processes monitored vary by database.
For example, the Oracle agent monitors the ora_smon, ora_dbw, ora_pmon,
and ora_lgwr processes.
a database server.
Note: Some databases, such as Oracle, install updates in a new directory. This directory can be on shared storage, which provides a way to use the updated installation from any system in the cluster.
Linux
Shared memory settings:
– For drivers built into the kernel, append parameters to the kernel command
line using the boot loader.
– For kernel modules, use /etc/modules.conf.
– For tunable parameters, use sysctl and /etc/sysctl.conf.
Solaris
Changes to /etc/system require a reboot to take effect.
Network configuration
Each database service group requires at least one IP address for client connections.
This IP address should fail over together with the database in case of any major
faults.
Therefore, you need to use an IP resource (or an IPMultiNIC resource) and
configure the host name of the service group IP address in the database. The clients
connect to the host name corresponding to this virtual IP address and not to the
local host names of the servers.
See the platform-specific database agent guides for details about how to design
your VCS configuration to meet your high availability requirements.
• NIC: Monitors one or more network interface cards used for remote client connections
The example shown on the slide assumes that the Oracle binaries are located on
local storage. The data files are located on a file system (rather than raw volumes).
The clients access Oracle services using the service group IP address defined by
the IP resource.
logs. In this case, you can create an SQL script to perform these actions, and
this script is called when you set StartUpOpt to CUSTOM.
You must create the script in /opt/VRTSagents/ha/bin/Oracle with
the name of start_custom_Sid.sql, where Sid is the same as the value
of the Sid attribute.
Home = "/hr_ora"
EnvFile = "/oracle/.ora_envfile"
AutoEndBkup = 0
Encoding = eucJP
)
The example value for the Encoding attribute sets encoding to the Japanese
language set. For a complete list of optional attributes, see the Veritas Cluster
Server Agent for Oracle Installation and Configuration Guide for your platform.
• MonScript: The executable script file containing the SQL statements VCS uses
when writing to the table
• EnvFile: The file containing environment variables sourced by the agent
Configuration prerequisites
• Create the database user and password for use by VCS.
• Create a test table within the monitored database.
• Create an executable script with SQL statements.
In this example, the user scott with the password tiger should be defined in
the HR database with update privileges to the table called testtable. This table
should be created in the database before the additional monitoring is enabled.
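Using the values from this example, the detail-monitoring attributes on the Oracle resource might be configured as follows (a sketch; <encrypted-password> stands for the string produced by the VCS encryption utility, and the MonScript path is assumed to be the agent's default script location, which you should verify for your version):

```
User = scott
Pword = <encrypted-password>
Table = testtable
MonScript = "./bin/Oracle/SqlTest.pl"
```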
Encrypting passwords
You can use the VCS encryption utility to encrypt database passwords before
configuring the Pword attribute in the Oracle agent configuration.
Note: The value of Pword is automatically encrypted when you use VOM or the
VCS Java GUI to configure the resource.
Note: Consider minimizing the number of volumes and disk groups used in
database service groups. Large numbers of objects complicate
administration and can slow service group startup.
value.
Otherwise, the database in the backup mode on the failover system cannot be
opened and VCS cannot bring the Oracle resource online. The following errors are
displayed to indicate this condition:
ORA-1110 "data file %s: ’%s’"
ORA-1113 "file %s needs media recovery"
Before VCS can bring the Oracle resource online on the failover system, you must
take the tablespaces out of backup mode and shut down the database instance so
that it can be reopened. Refer to the Oracle documentation for instructions on how
to change the state of the tablespaces.
Additional Oracle agent functions
The Oracle agent supports two additional entry points you can use to manage
database functions from within VCS:
• Action: Performs specified actions, such as backing up the Oracle database,
changing the database state, and suspending and resuming a database instance
This can be useful for scripting common database administration tasks that can be initiated by a VCS operator or administrator.
• Info: Checks the status of the instance
Lab solutions for this lesson are located on the following pages.
• “Lab 15: Configuring an Oracle service group,” page A-327.
• You invest a considerable amount of time, expense, and expertise to prepare for and complete a Symantec technical exam, which is undermined by those who engage in exam misconduct.
• Exam misconduct enables less qualified individuals to compete for the jobs and benefits YOU deserve.
• Exam misconduct erodes confidence in both Symantec programs and your skills as a certified IT professional and can lead to security and liability risks for your customers and/or employer.
• To confidentially report suspected cases of misconduct, please contact global_exams@symantec.com.
Symantec is committed to maintaining the security and integrity of its brand and
certification and accreditation exams. This ensures that our products are installed and
maintained by qualified IT Professionals and provides end users with the confidence
that their system software is operating at maximum efficiency. Symantec actively
investigates and takes corrective action against individuals and organizations who
attempt to compromise the security of our exams or engage in any form of exam
misconduct. To learn more about Symantec Testing Policies and Exam Security, visit
http://www.symantec.com/business/training/certification/path.jsp?pathID=policies
To learn more about the Symantec Certification Program and exams, visit http://go.symantec.com/certification