VERITAS Cluster Server for UNIX, Fundamentals (Lessons)
HA-VCS-410-101A-2-10-SRT (100-002149-A)
COURSE DEVELOPERS
Bilge Gerrits
Siobhan Seeger
Dawn Walker

LEAD SUBJECT MATTER EXPERTS
Geoff Bergren
Connie Economou
Paul Johnston
Dave Rogers
Jim Senicka
Pete Toemmes

TECHNICAL CONTRIBUTORS AND REVIEWERS
Billie Bachra
Barbara Ceran
Bob Lucas
Gene Henriksen
Margy Cassidy

Disclaimer
The information contained in this publication is subject to change without notice. VERITAS Software Corporation makes no warranty of any kind with regard to this guide, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. VERITAS Software Corporation shall not be liable for errors contained herein or for incidental or consequential damages in connection with the furnishing, performance, or use of this manual.

Copyright
Copyright 2005 VERITAS Software Corporation. All rights reserved. No part of the contents of this training material may be reproduced in any form or by any means or be used for the purposes of training or education without the written permission of VERITAS Software Corporation.

Trademark Notice
VERITAS, the VERITAS logo, and VERITAS FirstWatch, VERITAS Cluster Server, VERITAS File System, VERITAS Volume Manager, VERITAS NetBackup, and VERITAS HSM are registered trademarks of VERITAS Software Corporation. Other product names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

VERITAS Cluster Server for UNIX, Fundamentals
Participant Guide
April 2005 Release

VERITAS Software Corporation
350 Ellis Street
Mountain View, CA 94043
Phone 650-527-8000
www.veritas.com
Table of Contents
Course Introduction
VERITAS Cluster Server Curriculum ................................................................ Intro-2
Course Prerequisites......................................................................................... Intro-3
Course Objectives............................................................................................. Intro-4
Certification Exam Objectives........................................................................... Intro-5
Cluster Design Input .......................................................................................... Intro-6
Sample Design Input.......................................................................................... Intro-7
Sample Design Worksheet................................................................................. Intro-8
Lab Design for the Course ................................................................................ Intro-9
Lab Naming Conventions ................................................................................ Intro-10
Classroom Values for Labs............................................................................... Intro-11
Course Overview............................................................................................. Intro-12
Legend ............................................................................................................ Intro-15
Lesson 2: Preparing a Site for VCS
Planning for Implementation ................................................................................... 2-4
Implementation Needs .............................................................................................. 2-4
The Implementation Plan .......................................................................................... 2-5
Using the Design Worksheet..................................................................................... 2-6
Hardware Requirements and Recommendations ................................................... 2-7
SCSI Controller Configuration for Shared Storage .................................................. 2-9
Hardware Verification............................................................................................ 2-12
Software Requirements and Recommendations................................................... 2-13
Software Verification ............................................................................................. 2-15
Preparing Cluster Information ............................................................................... 2-16
VERITAS Security Services .................................................................................. 2-17
Lab 2: Validating Site Preparation ........................................................................ 2-19
Lesson 11: Configuring VCS Response to Resource Faults
Introduction ........................................................................................................... 11-2
VCS Response to Resource Faults ...................................................................... 11-4
Failover Decisions and Critical Resources ............................................................. 11-4
How VCS Responds to Resource Faults by Default............................................... 11-5
The Impact of Service Group Attributes on Failover.............................................. 11-7
Practice: How VCS Responds to a Fault............................................................... 11-10
Determining Failover Duration ............................................................................. 11-11
Failover Duration on a Resource Fault ................................................................. 11-11
Adjusting Monitoring............................................................................................ 11-13
Adjusting Timeout Values .................................................................................... 11-14
Controlling Fault Behavior................................................................................... 11-15
Type Attributes Related to Resource Faults.......................................................... 11-15
Modifying Resource Type Attributes.................................................................... 11-18
Overriding Resource Type Attributes ................................................................... 11-19
Recovering from Resource Faults....................................................................... 11-20
Recovering a Resource from a FAULTED State .................................................. 11-20
Recovering a Resource from an ADMIN_WAIT State ........................................ 11-22
Fault Notification and Event Handling ................................................................. 11-24
Fault Notification .................................................................................................. 11-24
Extended Event Handling Using Triggers ............................................................ 11-25
The Role of Triggers in Resource Faults .............................................................. 11-25
Lab 11: Configuring Resource Fault Behavior .................................................... 11-28
Index
VERITAS Cluster Server Curriculum:
VERITAS Cluster Server, Fundamentals
VERITAS Cluster Server, Implementing Local Clusters
High Availability Design Using VERITAS Cluster Server
VERITAS Cluster Server Agent Development
Disaster Recovery Using VVR 4.0 and Global Cluster Option
Course Prerequisites
This course assumes that you have an administrator-level understanding of one or
more UNIX platforms. You should understand how to configure systems, storage
devices, and networking in multiserver environments.
Course Objectives
In the VERITAS Cluster Server for UNIX, Fundamentals course, you are given a
high availability design to implement in the classroom environment using
VERITAS Cluster Server.
The course simulates the job tasks you perform to configure a cluster, starting with
preparing the site and application services that will be made highly available.
Lessons build upon each other, exhibiting the processes and recommended best
practices you can apply to implementing any cluster design.
The core material focuses on the most common cluster implementations. Other
cluster designs emphasizing additional VCS capabilities are provided to illustrate
the power and flexibility of VERITAS Cluster Server.
Web Service design requirements:
Start up on system S1.
Restart the Web server process 3 times before faulting it.
Fail over to S2 if any resource faults.
Notify patg@company.com if any resource faults.
Components required to provide the Web service: IP address 192.168.3.132, Mount /web, NIC eri0, Volume WebVol, and Disk Group WebDG.
Example: main.cf
group WebSG (
SystemList = { S1 = 0, S2 = 1 }
AutoStartList = { S1 }
)
IP WebIP (
Device = eri0
Address = 192.168.3.132
Netmask = 255.255.255.0
)
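The remaining resources from the Web service design can be sketched in the same format. This is only a sketch based on the design values shown above; the resource names other than WebIP are illustrative, and the full attribute lists for each type are covered later in the course.
Mount WebMount (
MountPoint = "/web"
BlockDevice = "/dev/vx/dsk/WebDG/WebVol"
FSType = vxfs
FsckOpt = "-y"
)
NIC WebNIC (
Device = eri0
)
Volume WebVolRes (
Volume = WebVol
DiskGroup = WebDG
)
DiskGroup WebDGRes (
DiskGroup = WebDG
)
WebIP requires WebNIC
WebMount requires WebVolRes
WebVolRes requires WebDGRes
The requires statements express the resource dependencies described later in this lesson: each parent resource is brought online only after its child resource.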
Lab naming conventions: the cluster vcsx contains the systems trainxx and trainxx; service groups include your_nameSG1, your_nameSG2, their_nameSG1, their_nameSG2, and NetworkSG.
Course Overview
This training provides comprehensive instruction on the installation and initial
configuration of VERITAS Cluster Server (VCS). The course covers principles
and methods that enable you to prepare, create, and test VCS service groups and
resources using tools that best suit your needs and your high availability
environment. You learn to configure and test failover and notification behavior,
cluster additional applications, and further customize your cluster according to
specified design criteria.
Course Resources
This course uses this participant guide, which contains the lessons presented by your
instructor and the lab exercises that enable you to practice your new skills.
Lab materials are provided in three forms, with increasing levels of detail to suit a
range of student expertise levels.
Appendix A: Lab Synopses has high-level task descriptions and design
worksheets.
Appendix B: Lab Details includes the lab procedures and detailed steps.
Appendix C: Lab Solutions includes the lab procedures and steps with the
corresponding command lines required to perform each step.
Appendix D: Job Aids provides supplementary material that can be used as
on-the-job guides for performing some common VCS operations.
Appendix E: Design Worksheet Template provides a blank design
worksheet.
Additional supplements may be used in the classroom or provided to you by your
instructor.
Course Platforms
This course material applies to the VCS platforms shown in the slide. Indicators
are provided in slides and text where there are differences in platforms.
Refer to the VERITAS Cluster Server user documentation for your platform and
version to determine which features are supported in your environment.
The legend symbols used in this guide represent:
Server, node, or cluster system (terms used interchangeably)
Storage
Application service
Cluster interconnect
VCS resource
Introduction
Overview
This lesson introduces basic VERITAS Cluster Server terminology and concepts,
and provides an overview of the VCS architecture and supporting communication
mechanisms.
Importance
The terms and concepts covered in this lesson provide a foundation for learning
the tasks you need to perform to deploy the VERITAS Cluster Server product, both
in the classroom and in real-world applications.
After completing this lesson, you will be able to:
Cluster Terminology: Define clustering terminology.
Cluster Communication: Describe cluster communication mechanisms.
Maintaining the Cluster Configuration: Describe how the cluster configuration is maintained.
VCS Architecture: Describe the VCS architecture.
Supported Failover Configurations: Describe the failover configurations supported by VCS.
Outline of Topics
Cluster Terminology
Cluster Communication
Maintaining the Cluster Configuration
VCS Architecture
Supported Failover Configurations
Cluster Terminology
A Nonclustered Computing Environment
An example of a traditional, nonclustered computing environment is a single
server running an application that provides public network links for client access
and data stored on local or SAN storage.
If a single component fails, application processing and the business service that
relies on the application are interrupted or degraded until the failed component is
repaired or replaced.
A cluster is a collection of multiple independent systems working together under a management framework for increased service availability. The components shown in the slide are the application, the nodes, the storage, and the cluster interconnect.
Definition of a Cluster
A clustered environment includes multiple components configured such that if one
component fails, its role can be taken over by another component to minimize or
avoid service interruption.
This allows clients to have high availability to their data and processing, which is
not possible in nonclustered environments.
The term cluster, simply defined, refers to multiple independent systems or
domains connected into a management framework for increased availability.
Clusters have the following components:
Up to 32 systems, sometimes referred to as nodes or servers
Each system runs its own operating system.
A cluster interconnect, which allows for cluster communications
A public network, connecting each system in the cluster to a LAN for client
access
Shared storage (optional), accessible by each system in the cluster that needs to
run the application
An application service is a collection
of all the hardware and software
components required to provide a
service.
If the service must be migrated to
another system, all components
need to be moved in an orderly
fashion.
Examples include Web servers,
databases, and applications.
Failover
The service group can be online on only one system at a time.
VCS migrates the service group at the administrator's request and in response to faults.
Parallel
The service group can be online on multiple cluster
systems simultaneously.
An example is Oracle Real Application Cluster (RAC).
Hybrid
This is a special-purpose type of service group used to manage
service groups in replicated data clusters (RDCs), which are
based on VERITAS Volume Replicator.
Definition of a Resource
Resources are VCS objects that correspond to hardware or software components,
such as the application, the networking components, and the storage components.
VCS controls resources through these actions:
Bringing a resource online (starting)
Taking a resource offline (stopping)
Monitoring a resource (probing)
Resource Categories
Persistent
None
VCS can only monitor persistent resources; they cannot be brought online
or taken offline. The most common example of a persistent resource is a
network interface card (NIC), because it must be present but cannot be
stopped. FileNone and ElifNone are other examples.
On-only
VCS brings the resource online if required, but does not stop it if the
associated service group is taken offline. NFS daemons are examples of
on-only resources. FileOnOnly is another on-only example.
Nonpersistent, also known as on-off
Most resources fall into this category, meaning that VCS brings them online
and takes them offline as required. Examples are Mount, IP, and Process.
FileOnOff is an example of a test version of this resource.
Resource dependencies determine the online and offline order of resources.
A parent resource depends on a child resource.
There is no limit to the number of parent and child resources.
Persistent resources, such as NIC, cannot be parent resources.
Dependencies cannot be cyclical.
Resource Dependencies
Resources depend on other resources because of application or operating system
requirements. Dependencies are defined to configure VCS for these requirements.
Dependency Rules
These rules apply to resource dependencies:
A parent resource depends on a child resource. In the diagram, the Mount
resource (parent) depends on the Volume resource (child). This dependency
illustrates the operating system requirement that a file system cannot be
mounted without the Volume resource being available.
Dependencies are homogenous. Resources can only depend on other
resources.
No cyclical dependencies are allowed. There must be a clearly defined
starting point.
Solaris:
mount -F vxfs /dev/vx/dsk/WebDG/WebVol /web
Resource Attributes
Resource attributes define the specific characteristics of individual resources. As
shown in the slide, the resource attribute values for the sample resource of type
Mount correspond to the UNIX command line to mount a specific file system.
VCS uses the attribute values to run the appropriate command or system call to
perform an operation on the resource.
Each resource has a set of required attributes that must be defined in order to
enable VCS to manage the resource.
For example, the Mount resource on Solaris has four required attributes that must
be defined for each resource of type Mount:
The directory of the mount point (MountPoint)
The device for the mount point (BlockDevice)
The type of file system (FSType)
The options for the fsck command (FsckOpt)
The first three attributes are the values used to build the UNIX mount command
shown in the slide. The FsckOpt attribute is used if the mount command fails. In
this case, VCS runs fsck with the specified options (-y) and attempts to mount
the file system again.
Some resources also have additional optional attributes you can define to control
how VCS manages a resource. In the Mount resource example, MountOpt is an
optional attribute you can use to define options to the UNIX mount command.
For example, if this is a read-only file system, you can specify -ro as the
MountOpt value.
The resource type
specifies the
attributes needed to
define a resource of
that type.
For example, a Mount
resource has different
properties than an IP
resource.
Solaris:
mount [-F FSType] [options] block_device mount_point
The slide shows example resource types and resources: Mount (/web and /log), IP (10.1.2.3), NIC (eri0), DiskGroup (WebDG), and Volume (WebVol and logVol), together with the online, offline, monitor, and clean operations.
The VERITAS Cluster Server Bundled Agents Reference Guide defines all VCS resource types for all bundled agents. See http://support.veritas.com for product documentation.
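On a running cluster, you can also list the resource types VCS knows about and examine a type's attribute definitions from the command line; for example:
hatype -list
hatype -display Mount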
Cluster Communication
VCS requires a cluster communication channel between systems in a cluster to
serve as the cluster interconnect. This communication channel is also sometimes
referred to as the private network because it is often implemented using a
dedicated Ethernet network.
VERITAS recommends that you use a minimum of two dedicated communication
channels with separate infrastructures (for example, multiple NICs and separate
network hubs) to implement a highly available cluster interconnect. Although
recommended, this configuration is not required.
The cluster interconnect has two primary purposes:
Determine cluster membership: Membership in a cluster is determined by
systems sending and receiving heartbeats (signals) on the cluster interconnect.
This enables VCS to determine which systems are active members of the
cluster and which systems are joining or leaving the cluster.
In order to take corrective action on node failure, surviving members must
agree when a node has departed. This membership needs to be accurate and
coordinated among active members; nodes can be rebooted, powered off,
faulted, and added to the cluster at any time.
Maintain a distributed configuration: Cluster configuration and status
information for every resource and service group in the cluster is distributed
dynamically to all systems in the cluster.
Cluster communication is handled by the Group Membership Services/Atomic
Broadcast (GAB) mechanism and the Low Latency Transport (LLT) protocol, as
described in the next sections.
LLT:
Is responsible for sending heartbeat messages
Transports cluster communication traffic to every active system
Balances traffic load across multiple network links
Maintains the communication link state
Is a nonroutable protocol
Runs on an Ethernet network
Low-Latency Transport
VERITAS uses a high-performance, low-latency protocol for cluster
communications. LLT is designed for the high-bandwidth and low-latency needs
of not only VERITAS Cluster Server, but also VERITAS Cluster File System, in
addition to Oracle Cache Fusion traffic in Oracle RAC configurations. LLT runs
directly on top of the Data Link Provider Interface (DLPI) layer over Ethernet and
has several major functions:
Sending and receiving heartbeats over network links
Monitoring and transporting network traffic over multiple network links to
every active system
Balancing cluster communication load over multiple links
Maintaining the state of communication
Providing a nonroutable transport mechanism for cluster communications
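After VCS is installed, you can check the state of the LLT links from the command line; a typical check (output varies by platform and configuration) is:
lltstat -nvv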
GAB:
Is a proprietary broadcast protocol
Uses LLT as its transport mechanism
The fencing module (VCS 4.x):
Monitors GAB to detect cluster membership changes
Ensures a single view of cluster membership
Prevents multiple nodes from accessing the same Volume Manager shared storage devices
(In the 4.x architecture, the kernel modules are the fencing driver, GAB, and LLT.)
(The slide compares the VCS stack with a TCP/IP application stack: HAD and hashadow run as user processes, GAB and LLT run as kernel processes, and the NIC is the hardware layer; an application such as iPlanet runs over TCP and IP on a NIC.)
LLT Versus IP
LLT is driven by GAB, has specific targets in its domain, and assumes a constant
connection between servers; that is, it is a connection-oriented protocol. IP is a
connectionless protocol; it assumes that packets can take different paths to reach
the same destination.
The cluster configuration is maintained in memory on all systems simultaneously by way of GAB using LLT.
The configuration is preserved on disk in the main.cf file.
A simple text file is used to store the cluster configuration on disk. The file contents are described in detail later in the course.
include "types.cf"
cluster vcs (
UserNames = { admin = ElmElgLimHmmKumGlj }
Administrators = { admin }
CounterInterval = 5
)
system S1 (
)
system S2 (
)
group WebSG (
SystemList = { S1 = 0, S2 = 1 }
)
Mount WebMount (
MountPoint = "/web"
BlockDevice = "/dev/vx/dsk/WebDG/WebVol"
FSType = vxfs
FsckOpt = "-y"
)
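If you edit main.cf directly (described later in the course), you can verify the syntax of the configuration files before VCS uses them; for example:
hacf -verify /etc/VRTSvcs/conf/config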
VCS Architecture
The slide shows how the major components of the VCS architecture work together
to manage application services.
Active/Passive
In this configuration, an application runs on a primary or master server. A
dedicated redundant server is present to take over on any failover. The redundant
server is not configured to perform any other functions.
The redundant server is on standby with full performance capability. The next
examples show types of active/passive configurations:
Before Failover
After Failover
N-to-1
This configuration reduces the cost of hardware redundancy while still providing a
dedicated spare. One server protects multiple active servers, on the theory that
simultaneous multiple failures are unlikely.
This configuration is used when shared storage is limited by the number of servers
that can attach to it and requires that after the faulted system is repaired, the
original configuration is restored.
N+1
When more than two systems can connect to the same shared storage, as in a SAN
environment, a single dedicated redundant server is no longer required.
When a server fails in this environment, the application service restarts on the
spare. Unlike the N-to-1 configuration, after the failed server is repaired, it can
then become the redundant server.
Active/Active
In an active/active configuration, each server is configured to run a specific
application service, as well as to provide redundancy for its peer.
In this configuration, hardware usage appears to be more efficient because there
are no standby servers. However, each server must be robust enough to run
multiple application services, increasing the per-server cost up front.
N-to-N
This configuration is an active/active configuration that supports multiple
application services running on multiple servers. Each application service is
capable of being failed over to different servers in the cluster.
Careful testing is required to ensure that all application services are compatible to
run with other application services that may fail over to the same server.
Summary
This lesson introduced the basic VERITAS Cluster Server terminology and gave
an overview of VCS architecture and supporting communication mechanisms.
Next Steps
Your understanding of basic VCS functions enables you to prepare your site for
installing VCS.
Additional Resources
High Availability Design Using VERITAS Cluster Server
This course will be available in the future from VERITAS Education if you are
interested in developing custom agents or learning more about high availability
design considerations for VCS environments.
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
Introduction
Overview
This lesson describes guidelines and considerations for planning to deploy
VERITAS Cluster Server (VCS). You also learn how to prepare your site for
installing VCS.
Importance
Before you install VERITAS Cluster Server, you must prepare your environment
to meet the requirements needed to implement a cluster. By following these
guidelines, you can ensure that your system hardware and software are configured
to install VCS.
Hardware Requirements and Recommendations: Describe general VCS hardware requirements.
Software Requirements and Recommendations: Describe general VCS software requirements.
Preparing Cluster Information: Collect cluster design information to prepare for installation.
Outline of Topics
Planning for Implementation
Hardware Requirements and Recommendations
Software Requirements and Recommendations
Preparing Cluster Information
Prepare the site for VCS installation as described throughout this lesson.
Consider how these activities may affect running services.
Prepare or complete a design worksheet that is used during VCS installation and configuration, if this worksheet is not provided.
Validate the design worksheet as you prepare the site.

Cluster Definition    Value
Cluster Name          vcs
Required Attributes
UserNames             admin=password
ClusterAddress        192.168.3.91
Administrators        admin

System Definition     Value
System                S1
System                S2
Redundant storage arrays
Uninterruptible power supplies
Identically configured systems
System type
Network interface cards
Storage HBAs
Networking
VERITAS Cluster Server requires a minimum of two heartbeat channels for the
cluster interconnect, one of which must be an Ethernet network connection. While
it is possible to use a single network and a disk heartbeat, the best practice
configuration is two or more network links.
Loss of the cluster interconnect results in downtime and, in nonfencing
environments, can result in a split-brain condition (described in detail later in the
course).
For a highly available configuration, each system in the cluster must have a
minimum of two physically independent Ethernet connections for the cluster
interconnect:
Two-system clusters can use crossover cables.
Clusters with three or more systems require hubs or switches.
You can use layer 2 switches; however, this is not a requirement.
Note: For clusters using VERITAS Cluster File System or Oracle Real Application
Cluster (RAC), VERITAS recommends the use of multiple gigabit interconnects
and gigabit switches.
In the example, one system uses the typical default scsi-initiator-id of 7 and the other system's controller is changed to 5.
Use unique SCSI IDs for each system.
Check the controller SCSI ID on both systems and the SCSI
IDs of the disks in shared storage.
Change the controller SCSI ID on one system, if necessary.
Shut down, cable shared disks, and reboot.
Verify that both systems can see all the shared disks.
SCSI Interfaces
Additional considerations for SCSI implementations:
Both differential and single-ended SCSI controllers require termination;
termination can be either active or passive.
All SCSI devices on a controller must be compatible with the controller; use
only differential SCSI devices on a differential SCSI controller.
Mirror disks on separate controllers for additional fault tolerance.
Configurations with two systems can use standard cables; a bus can be
terminated at each system with disks between systems.
Configurations with more than two systems require cables with connectors that
are appropriately spaced.
Main Menu:Enter command>ser scsi init path value
Main Menu:Enter command>ser scsi init 8/4 5
3 Use the ioscan -fn command to verify shared disks after the system
reboots.
Linux
1 Connect the disk to the first cluster system.
2 Power on the disk.
3 Connect a terminator to the other port of the disk.
4 Boot the system. The disk is detected while the system boots.
5 Press the key sequence for your adapter to bring up the SCSI BIOS settings for
that disk.
6 Set Host adapter SCSI ID = 7 or to an appropriate value for your configuration.
7 Set Host Adapter BIOS in Advanced Configuration Options to Disabled.
Hardware Verification
Hardware may have been installed but not yet configured, or improperly
configured. Basic hardware configuration considerations are described next.
Network
Test the network connections to ensure that each cluster system is accessible on the
public network. Also verify that the cluster interconnect is working by temporarily
assigning network addresses and using ping to verify communications. You must
use different IP network addresses to ensure that traffic actually uses the correct
interface.
Also, depending on the operating system, you may need to ensure that network
interface speed and duplex settings are hard set and auto negotiation is disabled.
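For example, on Solaris systems you might temporarily assign test addresses to one interconnect link on each node and verify connectivity; the interface name and addresses here are illustrative:
ifconfig qfe1 plumb
ifconfig qfe1 10.10.10.1 netmask 255.255.255.0 up
ping 10.10.10.2
Remove the temporary addresses when the test is complete so that LLT can use the links exclusively.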
Storage
VCS is designed primarily as a shared data high availability product. In order to
fail over an application from one system to another, both systems must have access
to the data storage.
Other considerations when checking hardware include:
Switched-fabric zoning configurations in a SAN
Active-active versus active-passive on disk arrays
Use identical configurations:
Configuration files
User accounts
Disabled abort sequence (Solaris)
ssh or rsh configured during installation
Use volume management software for storage.
Verify that the operating system and network configuration files are the same.
Obtain license keys from:
vlicense.veritas.com
A VERITAS sales representative
VERITAS Support (for upgrades)
Software Verification
Verify that the VERITAS products in the high availability solution are compatible
with the operating system versions in use or with the planned upgrades.
Verify that the required operating system patches are installed on the systems
before installing VCS.
Obtain VCS license keys.
You must obtain license keys for each cluster system to complete the license
process. For new installations, use the VERITAS vLicense Web site,
http://vlicense.veritas.com, or contact your VERITAS sales
representative for license keys. For upgrades, contact VERITAS Support.
Also, verify that you have the required licenses to run applications on all
systems where the corresponding service can run.
Verify that operating system and network configuration files are configured to
enable application services to run identically on all target systems. For
example, if a database needs to be started with a particular user account, ensure
that user account, password, and group files contain the same configuration for
that account on all systems that need to be able to run the database.
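For example, if a database runs under a dedicated account, you might compare its entries on each system; the account name here is illustrative:
grep oracle /etc/passwd /etc/group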
public network.
VxSS provides a single sign-on for authenticated user
accounts.
All cluster systems must be authentication broker nodes.
VERITAS recommends using a system outside the cluster
to serve as the root broker node.
Summary
This lesson described how to prepare sites and application services for use in the
VCS high availability environment. Performing these preparation tasks ensures
that the site is ready to deploy VCS, and helps illustrate how VCS manages
application resources.
Next Steps
After you have prepared your operating system environment for high availability,
you can install VERITAS Cluster Server.
Additional Resources
VERITAS Cluster Server Release Notes
The release notes provide detailed information about hardware and software
supported by VERITAS Cluster Server.
VERITAS Cluster Server Installation Guide
This guide provides detailed information about installing VERITAS Cluster
Server.
http://support.veritas.com
Check the VERITAS Support Web site for supported hardware and software
information.
Visually inspect the classroom lab site.
Complete and validate the design worksheet.
Use the lab appendix best suited to your experience level:
Appendix A: Lab Synopses
Appendix B: Lab Details
Appendix C: Lab Solutions
System Definition    Sample Value    Your Value
System               train1
System               train2
See the next slide for lab assignments.
Goal
The purpose of this lab is to prepare the site, your classroom lab systems, for VCS
installation.
Results
The system requirements are validated, the interconnect is configured, and the
design worksheet is completed and verified.
Prerequisites
Obtain any classroom-specific values needed for your classroom lab environment
and record these values in your design worksheet included with the lab exercise
instructions.
Introduction
Overview
This lesson describes the automated VCS installation process carried out by the
VERITAS Common Product Installer.
Importance
Installing VCS is a simple, automated procedure in most high availability
environments. The planning and preparation tasks you perform prior to starting the
installation process ensure that VCS installs quickly and easily.
Outline of Topics
Using the VERITAS Common Product Installer
VCS Configuration Files
Viewing the Default VCS Configuration
Other Installation Considerations
The installvcs utility requires remote root access to other systems in the cluster while the script is being run.
The /.rhosts file: You can remove .rhosts files after VCS installation.
ssh: No prompting is permitted.
Options to installvcs
The installvcs utility supports several options that enable you to tailor the
installation process. For example, you can:
Perform an unattended installation.
Install software packages without configuring a cluster.
Install VCS in a secure environment.
Upgrade an existing VCS cluster.
For a complete description of installvcs options, see the VERITAS Cluster
Server Installation Guide.
User input to the script:
Select optional packages.
Configure the cluster (name, ID, interconnect).
Select root broker node.
Set up VCS user accounts.
Configure the Web GUI (device name, IP address, subnet mask).
Configure SMTP and SNMP notification.

What the script does:
Install VCS packages.
Configure VCS.
Start VCS.
When using ssh, it must be configured so that it operates without requests for passwords or passphrases.
Licensing VCS
The installation utility verifies the license status of each system. If a VCS license is
found on the system, you can use that license or enter a new license.
If no VCS license is found on the system, or you want to add a new license, enter a
license key when prompted.
Configuring Security
If you choose to configure VxSS security, you are prompted to select the root
broker node. The system acting as root broker node must be set up and running
before installing VCS in the cluster. All cluster nodes are automatically set up as
authentication broker nodes.
Directories                           Contents
/sbin, /usr/sbin, /opt/VRTSvcs/bin    Executables, scripts, libraries

Commonly used environment variables:
$VCS_CONF: /etc/VRTSvcs
$VCS_HOME: /opt/VRTSvcs/bin
$VCS_LOG:  /var/VRTSvcs
The default configuration contains the cluster name, all the systems where VCS is installed, and the information entered for the Web-based Cluster Manager:
include "types.cf"
cluster VCS (
)
system train1 (
)
system train2 (
)
group ClusterService (
)
IP webip (
Device = hme1
Address = "192.168.105.101"
NetMask = "255.255.255.0"
)
Log onto the VCS Web Console using the IP address specified during
installation:
http://IP_Address:8181/vcs
View the product documentation:
/opt/VRTSvcsdc
A S1 RUNNING 0
A S2 RUNNING 0
Viewing Status
After installation is complete, you can check the status of VCS components.
View VCS communications status on the cluster interconnect using LLT and
GAB commands. This topic is discussed in more detail later in the course. For
now, you can see that LLT is up by running the following command:
lltconfig
LLT is running
View GAB port a and port h memberships for all systems:
gabconfig -a
GAB Port Memberships
===============================================
Port a gen a36e003 membership 01
Port h gen fd57002 membership 01
View the cluster status:
hastatus -sum
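For a healthy two-system cluster, the system-state portion of the summary output is similar to the following (abbreviated):
-- SYSTEM STATE
-- System               State                Frozen
A  S1                   RUNNING              0
A  S2                   RUNNING              0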
Access the Cluster Manager Java Console to verify installation.
On UNIX systems, type hagui &.
On Windows systems, start the GUI using the Cluster Manager desktop icon.
Summary
This lesson described the procedure for installing VCS and viewing the cluster
configuration after the installation has completed.
Next Steps
After you install the VCS software, you can prepare your application services for
the high availability environment.
Additional Resources
VERITAS Cluster Server Release Notes
This document provides important information regarding VERITAS Cluster
Server (VCS) on the specified platform. It is recommended that you review
this entire document before installing VCS.
VERITAS Cluster Server Installation Guide
This guide provides information on how to install VERITAS Cluster Server on
the specified platform.
Web Resources
To verify that you have the latest operating system patches before installing
VCS, see the corresponding vendor Web site for that platform. For
example, for Solaris, see http://sunsolve.sun.com.
To contact VERITAS Technical Support, see:
http://support.veritas.com
To obtain VERITAS software licenses, see:
http://vlicense.veritas.com
Cluster: vcs1
Systems: train1 and train2
Link 1: ______    Link 2: ______    Public: ______
Subnet: ______
Software location: ______
4.x:      # ./installer
Pre-4.0:  # ./installvcs
Goal
The purpose of this lab exercise is to set up a two-system VCS cluster with a
shared disk configuration and to install the VCS software using the installation
utility.
Prerequisites
Pairs of students work together to install the cluster. Select a pair of systems or
use the systems designated by your instructor.
Obtain installation information from your instructor and record it in the design
worksheet provided with the lab instructions.
Results
A two-system cluster is running VCS with one system running the ClusterService
service group.
Introduction
Overview
In this lesson, you learn how to manage applications that are under the control of
VCS. You are introduced to considerations that must be taken into account when
managing applications in a highly available clustered environment.
Importance
It is important to understand how to manage applications when they are under
VCS control. An application is a member of a service group that also contains
resources necessary to run the application that needs to be managed. Applications
must be brought up and down using the VCS interface rather than by using a
traditional direct interface with the application. Application upgrades and backups
are handled differently in a cluster environment.
Outline of Topics
Managing Applications in a Cluster Environment
Service Group Operations
Using the VCS Simulator
Warning: You can mistakenly cause problems, such as forcing faults and preventing failover, if you manipulate resources outside of VCS.
You can use any of the VCS interfaces to manage the cluster environment,
provided that you have the proper VCS authorization. VCS user accounts are
described in more detail in the VCS Configuration Methods lesson.
For details about the requirements for running the graphical user interfaces (GUIs),
see the VERITAS Cluster Server Release Notes and the VERITAS Cluster Server
Users Guide.
Note: You cannot use the Simulator to manage a running cluster configuration.
Knowing how to display attributes and status about a VCS cluster, service groups,
and resources helps you monitor the state of cluster objects and, if necessary, find
and fix problems. Familiarity with status displays also helps you build an
understanding of how VCS responds to events in the cluster environment, and the
effects on application services under VCS control.
You can display attributes and status using the GUI or CLI management tools.
Note: Show a continuous hastatus display in one window and the command log in another to become familiar with VCS activities and operations.
Displaying Logs
You can display the HAD log to see additional status information about activity in
the cluster. You can also display the command log to see how the activities you
perform using the GUI are translated into VCS commands. You can also use the
command log as a resource for creating batch files to use when performing
repetitive configuration or administration tasks.
Note: Both the HAD log and command log can be viewed using the GUI.
The primary log file, the engine log, is located in /var/VRTSvcs/log/
engine_A.log. Log files are described in more detail later in the course.
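From the command line, you can follow the engine log as you perform operations; for example:
tail -f /var/VRTSvcs/log/engine_A.log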
When a service group is brought online, resources are brought online starting with
the lowest (child) resources and progressing up the resource dependency tree to the
highest (parent) resources.
In order to bring a failover service group online, VCS must verify that all
nonpersistent resources in the service group are offline everywhere in the cluster.
If any nonpersistent resource is online on another system, the service group is not
brought online.
A service group is considered online if all of its autostart and critical resources are
online.
An autostart resource is a resource whose AutoStart attribute is set to 1.
A critical resource is a resource whose Critical attribute is set to 1.
A service group is considered partially online if one or more nonpersistent
resources are online and at least one resource that is autostart-enabled and
critical is offline.
The state of persistent resources is not considered when determining the online or
offline state of a service group because persistent resources cannot be taken
offline.
hagrp -offline
When a service group is taken offline, resources are taken offline starting with the
highest (parent) resources in each branch of the resource dependency tree and
progressing down the resource dependency tree to the lowest (child) resources.
Persistent resources cannot be taken offline. Therefore, the service group is
considered offline when all nonpersistent resources are offline.
Switch the service group to system S2:
hagrp -switch
Freeze the service group:
hagrp -freeze
When you freeze a service group, VCS continues to monitor the resources, but
does not allow the service group (or its resources) to be taken offline or brought
online. Failover is also disabled, even if a resource faults.
You can also specify that the freeze is in effect even if VCS is stopped and
restarted throughout the cluster.
Warning: When frozen, VCS does not take action on the service group even if you
cause a concurrency violation by bringing the service online on another system
outside of VCS.
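The full command syntax includes the service group and target system; for example, using the names from the earlier design (illustrative):
hagrp -online WebSG -sys S1
hagrp -offline WebSG -sys S1
hagrp -switch WebSG -to S2
hagrp -freeze WebSG -persistent
hagrp -unfreeze WebSG -persistent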
hares -online
hares -offline
Taking resources offline should not be a normal occurrence. Doing so causes the
service group to become partially online, and availability of the application service
is affected.
If a resource needs to be taken offline, for example, for maintenance of underlying
hardware, then consider switching the service group to another system.
If multiple resources need to be taken offline manually, then they must be taken
offline in resource dependency tree order, that is, from top to bottom.
Taking a resource offline and immediately bringing it online may be necessary if,
for example, the resource must reread a configuration file due to a change.
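As with service groups, the resource commands take the resource and system names; for example (illustrative names):
hares -offline WebIP -sys S1
hares -online WebIP -sys S1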
hares -clear
hares -probe
Download the Simulator from http://van.veritas.com.
A graphical user interface, referred to as the Simulator Java Console, is provided
to create and manage Simulator configurations. Using the Simulator Java Console,
you can run multiple Simulator configurations simultaneously.
To start the Simulator Java Console:
On UNIX systems:
a Set the PATH environment variable to /opt/VRTScssim/bin.
b Set VCS_SIMULATOR_HOME to /opt/VRTScssim.
c Type /opt/VRTSvcs/bin/hasimgui &
On Windows systems, environment variables are set during installation. Start
the Simulator Java Console by double clicking the icon on the desktop.
When the Simulator Java Console is running, a set of sample Simulator
configurations is displayed, showing an offline status. You can start one or more
existing cluster configurations and then launch an instance of the Cluster Manager
Java Console for each running Simulator configuration.
You can use the Cluster Manager Java Console to perform all the same tasks as an
actual cluster configuration. Additional options are available for Simulator
configurations to enable you to test various failure scenarios, including faulting
resources and powering off systems.
You can also copy a main.cf file to the /opt/VRTSsim/cluster_name/conf/config directory before starting the simulated cluster.
# cd /opt/VRTSsim
# hasim -setupclus myclus -simport 16555 -wacport -1
# hasim -start myclus_sys1 -clus myclus
# VCS_SIM_PORT=16555
# WAC_SIM_PORT=-1
# export VCS_SIM_PORT WAC_SIM_PORT
# hasim clus display
< Output is equivalent to haclus -display >
# hasim sys state
#System        Attribute    Value
myclus_sys1 SysState Running
You can use the Simulator command-line interface (CLI) to add and manage
simulated cluster configurations. While there are a few commands specific to
Simulator activities, such as cluster setup shown in the slide, in general the hasim
command syntax follows the corresponding ha commands used to manage an
actual cluster configuration.
The procedure used to initially set up a Simulator cluster configuration is shown
below. The corresponding commands are displayed in the slide.
Note: This procedure assumes you have already set the PATH and
VCS_SIMULATOR_HOME environment variables.
1 Change to the /opt/VRTSsim directory if you want to view the new
structure created when adding a cluster.
2 Add the cluster configuration, specifying a unique cluster name and port. For
local clusters, specify -1 as the WAC port.
3 Start the cluster on the first system.
4 Set the VCS_SIM_PORT and WAC_SIM_PORT environment variables to the
values you specified when adding the cluster.
Now you can use hasim commands or Cluster Manager to test or modify the
configuration.
Summary
In this lesson, you learned how to manage applications that are under control of
VCS.
Next Steps
Now that you are more comfortable managing applications in a VCS cluster, you
can prepare your application components and deploy your cluster design.
Additional Resources
http://van.veritas.com
The VCS Simulator software is available for download from the VERITAS
Web site.
VERITAS Cluster Server Release Notes
The release notes provide detailed information about hardware and software
supported by VERITAS Cluster Server.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
Goal
The purpose of this lab is to reinforce the material learned in this lesson by
performing a directed series of operator actions on a simulated VCS configuration.
Prerequisites
Obtain the main.cf file for this lab exercise from the location provided by your
instructor.
Results
Each student has a Simulator running with the main.cf file provided for the lab
exercise.
Introduction
Overview
This lesson describes how to prepare application services for use in the VCS high
availability environment. Performing these preparation tasks also helps illustrate
how VCS manages application resources.
Importance
By following these requirements and recommended practices for preparing to
configure service groups, you can ensure that your hardware, operating system,
and application resources are configured to enable VCS to manage and monitor the
components of the high availability services.
Outline of Topics
Preparing Applications for VCS
One-Time Configuration Tasks
Testing the Application Service
Stopping and Migrating a Service
Validating the Design Worksheet
The example application service comprises network components (an IP address on a NIC), the application process, and storage components (a file system) that together serve end users.
Perform one-time configuration tasks on each system. Then start, verify, and stop services on one system at a time; when no more systems remain, the service is ready for VCS.
Details are provided in the following section.
Volume Manager example (performed once, from one system; the mount point is created on each system):
Create a volume:       vxassist -g DemoDG make DemoVol 1g
Make a file system:    mkfs [options] vxfs /dev/vx/rdsk/DemoDG/DemoVol
Make a mount point:    mkdir /demo
VxVM is shown for simplicity; objects and commands are essentially the same
on all platforms. The agents for other volume managers are described in the
VERITAS Cluster Server, Implementing Local Clusters participant guide.
Preparing shared storage, such as creating disk groups, volumes, and file systems,
is performed once, from one system. Then you must create mount point directories
on each system.
The options to mkfs may differ depending on platform type, as displayed in the
following examples.
Solaris
mkfs -F vxfs /dev/vx/rdsk/DemoDG/DemoVol
AIX
mkfs -V vxfs /dev/vx/rdsk/DemoDG/DemoVol
HP-UX
mkfs -F vxfs /dev/vx/rdsk/DemoDG/DemoVol
Linux
mkfs -t vxfs /dev/vx/rdsk/DemoDG/DemoVol
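The disk group itself is also created once, from one system, before the volume is made; a sketch of this step with an illustrative, platform-specific device name:
vxdg init DemoDG DemoDG01=c1t1d0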
Administrative IP Addresses
Administrative IP addresses (also referred to as base IP addresses or maintenance
IP addresses) are controlled by the operating system. The administrative IP
addresses are associated with a physical network interface on the system, such as
qfe1 on Solaris systems, and are configured whenever the system is brought up.
These addresses are used to access a specific system over the network and can also
be used to verify that the system is physically connected to the network even
before an application is brought up.
BROADCAST_ADDRESS[0]=
DHCP_ENABLE[0]=0
2 Edit /etc/hosts and assign an IP address to the interface name.
166.98.112.14 train14_lan2
3 Use ifconfig to manually configure the IP address to test the configuration
without rebooting:
ifconfig lan2 inet 166.98.112.114
ifconfig lan2 up
On system S1, bring up the resources, starting all of them in dependency order: shared storage, the virtual IP address, and the application software. Test the application, and then stop the resources. Repeat the same steps on S2 through Sn. When no more systems remain, the service is ready for VCS.
This procedure emulates how VCS manages application services. The actual
commands used may differ from those used in this lesson. However, conceptually,
the same type of action is performed by VCS.
Bringing Up Resources
Shared Storage
Verify that shared storage resources are configured properly and accessible. The
examples shown in the slide are based on using Volume Manager.
1 Import the disk group.
2 Start the volume.
3 Mount the file system.
Mount the file system manually for the purposes of testing the application
service. Do not configure the operating system to automatically mount any file
system that will be controlled by VCS.
If the file system is added to /etc/vfstab, it will be mounted on the first
system to boot. VCS must control where the file system is mounted.
Examples of mount commands are provided for each platform.
Solaris
mount -F vxfs /dev/vx/dsk/ProcDG/ProcVol /process
AIX
mount -V vxfs /dev/vx/dsk/ProcDG/ProcVol /process
HP-UX
mount -F vxfs /dev/vx/dsk/ProcDG/ProcVol /process
Linux
mount -t vxfs /dev/vx/dsk/ProcDG/ProcVol /process
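For steps 1 and 2, the corresponding Volume Manager commands might look like this:
vxdg import ProcDG
vxvol -g ProcDG start ProcVol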
Virtual IP addresses are configured on logical interfaces named interface:number.
Solaris
The qfe1:1 device is used for the first virtual IP address on the qfe1 interface;
qfe1:2 is used for the second.
1 Plumb the virtual interface and bring up the IP on the next available logical
interface:
ifconfig qfe1 addif 192.168.30.132 up
2 Edit /etc/hosts to assign a virtual hostname (application service name) to
the IP address.
192.168.30.132 process_services
Verify the disk group:    vxdg list DemoDG
Verify the volume:        dd if=/dev/vx/rdsk/DemoDG/DemoVol of=/dev/null count=1 bs=128
Verify the file system:   mount | grep /demo
Verify the admin IP:      ping same_subnet_IP
Verify the virtual IP:    ifconfig arguments
Verify the application:   ps arguments | grep process
Verifying Resources
You can perform some simple steps, such as those shown in the slide, to verify that
each component needed for the application service to function is operating at a
basic level.
This helps you identify any potential configuration problems before you test the
service as a whole, as described in the Testing the Integrated Components
section.
For an NFS service, you can also mount the exported file system from a client on the network. This is described in more detail later in the course.
Stop the application; for example:
/sbin/orderproc stop
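The virtual IP address brought up earlier is also removed when stopping the service; on Solaris this might look like the following, using the interface and address from the earlier example:
ifconfig qfe1 removeif 192.168.30.132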
(The slide shows the application service components, the IP address, NIC, process, and file system, on systems S1 and S2.)
Validate or complete your design worksheet to document the information
required to configure VCS to manage the services.
Use the procedures described in this lesson to configure and test the underlying
operating system resources.
attributes may be different.
These attributes are described in more detail later in the course.
Summary
This lesson described how to prepare sites and application services for use in the
VCS high availability environment. Performing these preparation tasks ensures
that the site is ready to deploy VCS, and helps illustrate how VCS manages
application resources.
Next Steps
After you have prepared your operating system environment and applications for
high availability, you can install VERITAS Cluster Server and then configure
service groups for your application services.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
High Availability Using VERITAS Cluster Server, Implementing Local
Clusters
This course provides detailed information on advanced clustering topics,
focusing on configurations of clusters with more than two nodes.
Lab configuration: each student's loopy process (/bob1/loopy and /sue1/loopy) runs from its own disk group and volume, bobDG1/bobVol1 mounted at /bob1 on disk1, and sueDG1/sueVol1 mounted at /sue1 on disk2, each on a separate disk/LUN.
See the next slide for classroom values.
Appendix C provides complete lab instructions and solutions.
Lab 5 Solutions: Preparing Application Services, page C-51
Goal
The purpose of this lab is to prepare the loopy process service for high availability.
Prerequisites
Obtain any classroom-specific values needed for your classroom lab environment
and record these values in your design worksheet included with the lab exercise
instructions.
Results
Each student's service can be started, monitored, and stopped on each cluster
system.
Introduction
Overview
This lesson provides an overview of the configuration methods you can use to
create and modify service groups. This lesson also describes how VCS manages
and protects the cluster configuration.
Importance
By understanding all methods available for configuring VCS, you can choose the
tools and procedures that best suit your requirements.
Outline of Topics
Overview of Configuration Methods
Controlling Access to VCS
Online Configuration
Offline Configuration
Starting and Stopping VCS
VCS 4.1
The halogin command is provided in VCS 4.1 to save authentication
information so that users do not have to enter credentials every time a VCS
command is run.
The command stores authentication information in the users home directory. You
must either set the VCS_HOST environment variable to the name of the node from
which you are running VCS commands, or add the node name to the /etc/
.vcshosts file.
If you run halogin for different hosts, VCS stores authentication information for
each host.
VCS 3.5 and 4.0
For releases prior to 4.1, halogin is not supported. When logged on to UNIX as
a nonroot account, the user is prompted to enter a VCS account name and
password every time a VCS command is entered.
To enable nonroot users to more easily administer VCS, you can set the
AllowNativeCliUsers cluster attribute to 1. For example, type:
haclus -modify AllowNativeCliUsers 1
When set, VCS maps the UNIX user name to the same VCS account name to
determine whether the user is valid and has the proper privilege level to perform
the operation. You must explicitly create each VCS account name to match the
UNIX user names and grant the appropriate privilege level.
User Accounts
You can ensure that the different types of administrators in your environment have
a VCS authority level to affect only those aspects of the cluster configuration that
are appropriate to their level of responsibility.
For example, if you have a DBA account that is authorized to take a database
service group offline or switch it to another system, you can make a VCS Group
Operator account for the service group with the same account name. The DBA can
then perform operator tasks for that service group, but cannot affect the cluster
configuration or other service groups. If you set AllowNativeCliUsers to 1, then
the DBA logged on with that account can also use the VCS command line to
manage the corresponding service group.
Setting VCS privileges is described in the next section.
For example, to add a user called DBSG_Op to the VCS configuration, type:
hauser -add DBSG_Op
In non-secure mode, VCS user accounts are stored in the main.cf file in
encrypted format. If you use a GUI or wizard to set up a VCS user account,
passwords are encrypted automatically. If you use the command line, you must
encrypt the password using the vcsencrypt command.
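For example, a minimal sketch of adding the account from the command line (the configuration must be open for writing):
haconf -makerw
hauser -add DBSG_Op
haconf -dump -makero
If you instead edit main.cf directly, generate the encrypted password string with the vcsencrypt -vcs command, which prompts for the password, and place the resulting string in the UserNames attribute.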
Note: In non-secure mode, if you change a UNIX account, this change is not
automatically reflected in the VCS main.cf file. You must manually modify
accounts in both places if you want them to be synchronized.
Online Configuration
Benefits
Online configuration has these advantages:
The VCS engine is up and running, providing high availability of existing
service groups during configuration.
This method provides syntax checking, which helps protect you from making
configuration errors.
This step-by-step procedure is suitable for testing each object as it is
configured, simplifying troubleshooting of configuration mistakes that you
may make when adding resources.
You do not need to be logged into the UNIX system as root to use the GUI and
CLI to make VCS configuration changes.
Considerations
Online configuration has these considerations:
Online configuration is more time-consuming for large-scale modifications.
The online process is repetitive. You have to add service groups and resources
one at a time.
The slide shows an online configuration change, such as hagrp -add, being applied by had to the in-memory configuration on each cluster system.
The VCS command-line interface is an alternate online configuration tool. When
you run ha commands, had responds in the same fashion.
Note: When two administrators are changing the cluster configuration
simultaneously, each sees all changes as they are being made.
The slide shows how the haconf commands affect the configuration files on each system:
haconf -makerw opens the configuration and creates a .stale file.
haconf -dump writes the in-memory configuration to main.cf; the .stale file remains.
haconf -dump -makero writes the configuration to main.cf, removes the .stale file, and closes the configuration.
To understand how this protection mechanism works, you must first understand
the normal VCS startup procedure.
Offline Configuration
In some circumstances, you can simplify cluster implementation or configuration
tasks by directly modifying the VCS configuration files. This method requires you
to stop and restart VCS in order to build the new configuration in memory.
The benefits of using an offline configuration method are that it:
Offers a very quick way of making major changes or getting an initial
configuration up and running
Provides a means for deploying a large number of similar clusters
One consideration when choosing to perform offline configuration is that you must
be logged on to a cluster system as root.
This section describes situations where offline configuration is useful. The next
section shows how to stop and restart VCS to propagate the new configuration
throughout the cluster. The Offline Configuration of Service Groups lesson
provides detailed offline configuration procedures and examples.
The slide shows how a large number of similar clusters can be deployed by copying and editing main.cf. Cluster1 (systems S1 and S2) runs the DB1 and DB2 service groups, and Cluster2 (systems S3 and S4) runs DB3 and DB4. For example, the main.cf for Cluster2 contains:
group DB3 (
SystemList = { S3 = 1, S4 = 2 }
AutoStartList = { S3 }
)
The slide shows the resource dependency trees for the DemoSG service group (DemoProcess, DemoIP, DemoMount, DemoNIC, DemoVol, DemoDG) and the AppSG service group (AppProcess, AppIP, AppMount, AppNIC, AppVol, AppDG).
The slide illustrates VCS startup. When hastart is run on each system, had and hashadow start. S1 enters the LOCAL_BUILD state and builds the cluster configuration in memory from its local main.cf, while S2 waits in CURRENT_DISCOVER_WAIT with no configuration in memory. When S1 is RUNNING, S2 performs a REMOTE_BUILD, receiving the configuration over the cluster interconnect; had on S2 then writes the configuration to its local main.cf and removes any .stale file.
The slide shows VCS starting on S1 when a .stale file is present: had and hashadow start, but S1 enters the STALE_ADMIN_WAIT state with no configuration in memory, and S2 remains in the UNKNOWN state.
A system enters this state when VCS was stopped while the configuration was open. This also occurs if you start VCS and the
main.cf file has a syntax error. This enables you to inspect the main.cf file and
decide whether you want to start VCS with that main.cf file. You may have to
modify the main.cf file if you made changes in the running cluster after saving
the configuration to disk.
The slide shows recovery from this state. The administrator runs hasys -force S1, which directs had on S1 to perform a LOCAL_BUILD from the local main.cf even though a .stale file is present; S2 waits for a running configuration. When S1 is RUNNING, S2 performs a REMOTE_BUILD, had on S2 writes the configuration to its local main.cf, and the .stale files are removed.
5 When had is in a running state on S1, this state change is broadcast on the
cluster interconnect by GAB.
6 S2 then performs a remote build to put the new cluster configuration into its
memory.
7 The had process on S2 copies the cluster configuration into the local
main.cf and types.cf files after moving the original files to backup
copies with timestamps.
8 The had process on S2 removes the .stale file, if present, from the local
configuration directory.
The slide shows starting the cluster with a modified configuration: hastart is run on S1, which performs a LOCAL_BUILD from the edited main.cf, and hastart -stale is run on the other systems, which wait for a running configuration and then perform a REMOTE_BUILD. had on each remote system then writes the new configuration to its local main.cf and removes the .stale file.
9 When VCS is in a running state on S1, HAD on S1 sends a copy of the cluster
configuration over the cluster interconnect to S2.
10 S2 performs a remote build to put the new cluster configuration in memory.
11 HAD on S2 copies the cluster configuration into the local main.cf and
types.cf files after moving the original files to backup copies with
timestamps.
12 HAD on S2 removes the .stale file from the local configuration directory.
The slide shows hastop -local being run on S1: the had and hashadow daemons stop on S1 while VCS continues to run on S2.
Stopping VCS
There are three methods of stopping the VCS engine (had and hashadow
daemons) on a cluster system:
Stop VCS and take all service groups offline, stopping application services
under VCS control.
Stop VCS and evacuate service groups to another cluster system where VCS is
running.
Stop VCS and leave application services running.
VCS can also be stopped on all systems in the cluster simultaneously. The hastop
command is used with different options and arguments that determine how running
services are handled.
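For example, a hedged summary of the corresponding commands (exact options can vary by VCS version):
hastop -local              Stop VCS on this system and take its service groups offline.
hastop -local -evacuate    Migrate service groups to another running system, then stop VCS.
hastop -local -force       Stop VCS but leave the application services running.
hastop -all [-force]       Stop VCS on all systems in the cluster.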
Summary
This lesson introduced the methods you can use to configure VCS. You also
learned how VCS starts and stops in a variety of circumstances.
Next Steps
Now that you are familiar with the methods available for configuring VCS, you
can apply these skills by creating a service group using an online configuration
method.
Additional Resources
VERITAS Cluster Server User's Guide
This guide provides detailed information on starting and stopping VCS, and
6
performing online and offline configuration.
VERITAS Cluster Server Command Line Quick Reference
This card provides the syntax rules for the most commonly used VCS
commands.
The slide shows the lab cluster vcs1 (systems train1 and train2) and the command used in this lab:
# hastop -all -force
Goal
The purpose of this lab is to observe the effects of stopping and starting VCS.
Prerequisites
Students must work together to coordinate stopping and restarting VCS.
Results
The cluster is running and the ClusterService group is online.
Introduction
Overview
This lesson describes how to use the VCS Cluster Manager graphical user
interface (GUI) and the command-line interface (CLI) to create a service group
and configure resources while the cluster is running.
Importance
You can perform all tasks necessary to create and test a service group while VCS is
running without affecting other high availability services.
Outline of Topics
Online Configuration Procedure
Adding a Service Group
Adding Resources
Solving Common Configuration Errors
Testing the Service Group
The slide shows the online configuration procedure: open the cluster configuration, add the service group, add and test resources, and repeat for each additional resource.
This procedure assumes that you have prepared and tested the application service on each system and that it is offline everywhere, as described in the Preparing Services for High Availability lesson.
main.cf:
group DemoSG (
SystemList = { S1 = 0, S2 = 1 }
AutoStartList = { S1 }
)
service group. In the example displayed in the slide, the S1 system is selected
as the system on which DemoSG is started when VCS starts up.
The Service Group Type selection is failover by default.
If you save the configuration after creating the service group, you can view the
main.cf file to see the effect of had modifying the configuration and writing the
changes to the local disk.
Considerations:
Add resources in the order of dependency, starting at the bottom (child) resource.
Configure all required attributes, and then enable the resource.
Bring each resource online before adding the next resource.
It is recommended that you set resources as non-critical until testing has been completed.
If a resource does not come online, troubleshoot it before continuing.
Adding Resources
Online Resource Configuration Procedure
Add resources to a service group in the order of resource dependencies starting
from the child resource (bottom up). This enables each resource to be tested as it is
added to the service group.
Adding a resource requires you to specify:
The service group name
The unique resource name
If you prefix the resource name with the service group name, you can more
easily identify the service group to which it belongs. When you display a list of
resources from the command line using the hares -list command, the
resources are sorted alphabetically.
The resource type
Attribute values
Use the procedure shown in the diagram to configure a resource.
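For example, a minimal sketch of adding and testing the DemoIP resource from the command line (attribute values match the example that follows):
haconf -makerw
hares -add DemoIP IP DemoSG
hares -modify DemoIP Critical 0
hares -modify DemoIP Device qfe1
hares -modify DemoIP Address "10.10.21.198"
hares -modify DemoIP Enabled 1
hares -online DemoIP -sys S1
haconf -dump -makero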
Notes:
It is recommended that you set each resource to be non-critical during initial
configuration. This simplifies testing and troubleshooting in the event that you
have specified incorrect configuration information. If a resource faults due to a
configuration error, the service group does not fail over if resources are non-
critical.
Enabling a resource signals the agent to start monitoring the resource.
main.cf:
IP DemoIP (
Critical = 0
Device = qfe1
Address = "10.10.21.198"
)
Adding an IP Resource
The slide shows the required attribute values for an IP resource in the DemoSG
service group. The corresponding entry is made in the main.cf file when the
configuration is saved.
Notice that the IP resource has two required attributes, Device and Address, which
specify the network interface and IP address, respectively.
Optional Attributes
NetMask: Netmask associated with the application IP address
The value may be specified in decimal (base 10) or hexadecimal (base 16). The
default is the netmask corresponding to the IP address class.
Options: Options to be used with the ifconfig command
ArpDelay: Number of seconds to sleep between configuring an interface and
sending out a broadcast to inform routers about this IP address
The default is 1 second.
IfconfigTwice: If set to 1, this attribute causes an IP address to be configured
twice, using an ifconfig up-down-up sequence. This behavior increases the
probability of gratuitous ARPs (caused by ifconfig up) reaching clients.
The default is 0.
main.cf:
DiskGroup DemoDG (
Critical = 0
DiskGroup = DemoDG
)
The slide shows the procedure for correcting a resource that does not come online or that faults: check the engine log, verify that the resource is offline at the operating system level on every system, and flush the service group to stop all online and offline processes. Then disable the resource, modify its attributes, re-enable it, clear the fault if necessary, and bring the resource online again.
Misconfigured resources can cause agent processes to appear to hang; always verify that the resource is stopped at the operating system level. For example:
hagrp -flush DemoSG -sys S1
hares -modify DemoIP Enabled 0
Disabling a Resource
Disable a resource before you start modifying attributes to fix a misconfigured
resource. When you disable a resource, VCS stops monitoring the resource, so it
does not fault or wait to come online while you are making changes.
When you disable a resource, the agent calls the close entry point, if defined. The
close entry point is optional.
When the close tasks are completed, or if there is no close entry point, the agent
stops monitoring the resource.
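For example, a hedged sketch of correcting a misconfigured DemoIP resource (the corrected address is illustrative):
haconf -makerw
hares -modify DemoIP Enabled 0
hares -modify DemoIP Address "10.10.21.199"
hares -modify DemoIP Enabled 1
hares -clear DemoIP
hares -online DemoIP -sys S1
haconf -dump -makero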
The slide shows how a resource dependency is recorded in main.cf when resources are linked:
DemoIP requires DemoNIC
To link the resources from the command line:
hares -link DemoIP DemoNIC
Linking Resources
When you link a parent resource to a child resource, the dependency becomes a
component of the service group configuration. When you save the cluster
configuration, each dependency is listed at the end of the service group definition,
after the resource specifications, in the format shown in the slide.
In addition, VCS creates a dependency tree in the main.cf file at the end of the
service group definition to provide a more visual view of resource dependencies.
This is not part of the cluster configuration, as denoted by the // comment
markers.
// NIC DemoNIC
// }
//}
Resource Dependencies
VCS enables you to link resources to specify dependencies. For example, an IP
address resource is dependent on the NIC providing the physical link to the
network.
Ensure that you understand the dependency rules shown in the slide before you
start linking resources.
nameProcess1 requires nameMount1
nameProcess1 requires nameIP1
nameMount1 requires nameVol1
nameIP1 requires nameNIC1
main.cf:
NIC DemoNIC (
Device = qfe1
)
After testing is complete, set the resource to critical:
hares -modify DemoNIC Critical 1
The slide shows the completed DemoSG dependency tree: the Process resource (/demo/orderproc) requires the IP (10.10.21.198) and Mount (/demo) resources; the IP resource requires the NIC (qfe1); and the Mount resource requires the Volume (DemoVol), which requires the DiskGroup (DemoDG).
cluster VCS (
UserNames = { admin = "j5_eZ_^]Xbd^\\_Y_d\\" }
Administrators = { admin }
CounterInterval = 5
)
system S2 (
)
group DemoSG (
SystemList = { S1 = 1, S2 = 2 }
AutoStartList = { S1 }
)
DiskGroup DemoDG (
Critical = 0
DiskGroup = DemoDG
)
IP DemoIP (
Critical = 0
Device = qfe1
Address = "10.10.21.198"
)
Mount DemoMount (
Critical = 0
MountPoint = "/demo"
BlockDevice = "/dev/vx/dsk/DemoDG/DemoVol"
FSType = vxfs
FsckOpt = "-y"
)
NIC DemoNIC (
Critical = 0
Device = qfe1
)
Process DemoProcess (
Critical = 0
PathName = "/bin/sh"
Arguments = "/sbin/orderproc up"
)
Summary
This lesson described the procedure for creating a service group and two tools for
modifying a running cluster: the Cluster Manager graphical user interface and
VCS ha commands.
Next Steps
After you familiarize yourself with the online configuration methods and tools,
you can modify configuration files directly to practice offline configuration.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
VERITAS Cluster Server User's Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
VERITAS Cluster Server Command Line Quick Reference
This card provides the syntax rules for the most commonly used VCS
commands.
Goal
The purpose of this lab is to create a service group while VCS is running using
either the Cluster Manager graphical user interface or the command-line interface.
Prerequisites
The shared storage and networking resources must be configured and tested. Disk
groups must be offline on all systems.
Results
New service groups defined in the design worksheet are running and tested on both
cluster systems.
Introduction
Overview
This lesson describes how to create a service group and configure resources by
modifying the main.cf configuration file.
Importance
In some circumstances, it is more efficient to modify the cluster configuration by
changing the configuration files and restarting VCS to bring the new configuration
into memory on each cluster system.
Outline of Topics
Offline Configuration Procedures
Using the Design Worksheet
Offline Configuration Tools
Solving Offline Configuration Problems
Testing the Service Group
The slide shows the basic offline configuration procedure.
First system:
Stop VCS on all systems:              hastop -all
Edit the configuration file:          vi main.cf
Verify configuration file syntax:     hacf -verify .
Start VCS on this system:             hastart
Verify that VCS is running:           hastatus -sum
All other systems:
Start VCS:                            hastart -stale
The edited main.cf in the slide contains, for example:
group WebSG (
SystemList = ...
AutoStartList = ...
)
NIC WebNIC (
Critical = 0
Device = xxxx
)
Stop VCS
Stop VCS on all cluster systems. This ensures that there is no possibility that
another administrator is changing the cluster configuration while you are
modifying the main.cf file.
include "types.cf"
cluster vcs (
UserNames = { admin = ElmElgLimHmmKumGlj }
ClusterAddress = "192.168.27.51"
Administrators = { admin }
CounterInterval = 5
)
system S1 (
)
system S2 (
)
group WebSG (
SystemList = { S1 = 1, S2 = 2 }
AutoStartList = { S1 }
)
DiskGroup WebDG (
Critical = 0
DiskGroup = WebDG
)
Mount WebMount (
Critical = 0
MountPoint = "/Web"
BlockDevice = "/dev/vx/dsk/WebDG/WebVol"
FSType = vxfs
)
NIC WebNIC (
Critical = 0
Device = qfe1
)
Process WebProcess (
Critical = 0
PathName = "/bin/ksh"
Arguments = "/sbin/tomcat"
)
Volume WebVol (
Critical = 0
Volume = WebVol
DiskGroup = WebDG
)
First system:
Close the configuration:              haconf -dump -makero
Change to the config directory:       cd /etc/VRTSvcs/conf/config
Create a working directory:           mkdir stage
Copy main.cf and types.cf:            cp main.cf types.cf stage
Change to the stage directory:        cd stage
Edit the configuration files:         vi main.cf
Verify configuration file syntax:     hacf -verify .
Existing Cluster
The diagram illustrates a process for modifying the cluster configuration when you
already have service groups configured and want to minimize the time that VCS is
not running to protect services that are running.
This procedure includes several built-in protections from common configuration
errors and maximizes high availability.
First System
Close the Configuration
Close the cluster configuration before you start making changes. This ensures that
the working copy you make has the latest in-memory configuration. This also
ensures that you do not have a stale configuration when you attempt to start the
cluster later.
Make a Staging Directory
Make a subdirectory of /etc/VRTSvcs/conf/config in which you can edit
a copy of the main.cf file. This ensures that your edits cannot be overwritten if
another administrator is making configuration changes simultaneously.
Copy the Configuration Files
Copy the main.cf file and types.cf from
/etc/VRTSvcs/conf/config to the staging directory.
First system:
Stop VCS; leave services running:     hastop -all -force
Copy the test main.cf file back:      cp main.cf ../main.cf
Start VCS on this system:             hastart
Verify that HAD is running:           hastatus -sum
All other systems:
Start VCS stale:                      hastart -stale
If you are modifying an existing service group, freeze the group persistently before stopping VCS. This prevents the group from failing over when VCS restarts if there are problems with the configuration.
Stop VCS
Note: If you have modified an existing service group, first freeze the service group
persistently to prevent VCS from failing over the group. This simplifies fixing
resource configuration problems, because the service group is not being switched
between systems.
Stop VCS on all cluster systems after making configuration changes. To leave
applications running, use the -force option, as shown in the diagram.
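For example, a minimal sketch of freezing the WebSG group persistently and then stopping VCS while leaving applications running:
haconf -makerw
hagrp -freeze WebSG -persistent
haconf -dump -makero
hastop -all -force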
Start VCS
Start VCS first on the system with the modified main.cf file.
Design worksheet for the AppIP resource:
Service Group Name:    AppSG
Resource Name:         AppIP
Resource Type:         IP
Required Attributes:   Device = qfe1, Address = 10.10.21.199
Optional Attributes:   NetMask = 255.255.255.0 (required only on HP-UX)
The slide also shows the AppSG dependency tree: Process (/app), IP (10.10.21.199), Mount (/app), AppNIC (qfe1), Volume (AppVol), and DiskGroup (AppDG).
Resource Dependencies
Document resource dependencies in your design worksheet and add the links at the
end of the service group definition, using the syntax shown in the slide. A
complete example service group definition is shown in the next section.
main.cf:
group AppSG (
SystemList = { S1 = 0, S2 = 1 }
AutoStartList = { S1 }
Operators = { SGoper }
)
DiskGroup AppDG (
Critical = 0
DiskGroup = AppDG
)
IP AppIP (
Critical = 0
Device = qfe1
Address = "10.10.21.199"
)
. . .
group AppSG (
SystemList = { S1 = 1, S2 = 2 }
AutoStartList = { S1 }
)
DiskGroup AppDG (
Critical = 0
DiskGroup = AppDG
)
Mount AppMount (
Critical = 0
MountPoint = "/app"
BlockDevice = "/dev/vx/dsk/AppDG/AppVol"
FSType = vxfs
)
NIC AppNIC (
Critical = 0
Device = qfe1
)
Process AppProcess (
Critical = 0
PathName = "/bin/ksh"
Arguments = "/app/appd test"
)
Volume AppVol (
Critical = 0
Volume = AppVol
DiskGroup = AppDG
)
You can create and modify main.cf and types.cf files using the VCS Simulator. This method does not affect the cluster configuration files; simulator configuration files are created in a separate directory. You can also use the simulator to test any main.cf file before putting the configuration into the actual cluster environment.
# ls -l /etc/VRTSvcs/conf/config
total 140
-rw-------   Mar 21 13:09   main.cf
-rw-------   Mar 14 17:22   main.cf.14Mar2004.17:22:25
-rw-------   Mar 16 18:00   main.cf.16Mar2004.18:00:54
-rw-------   Mar 20 11:37   main.cf.20Mar2004.11:37:49
-rw-------   Mar 21 13:09   main.cf.21Mar2004.13:09:11
-rw-------   Mar 21 13:10   main.cf.previous
-rw-------   Mar 21 13:09   types.cf
-rw-------   Mar 14 17:22   types.cf.14Mar2004.17:22:25
-rw-------   Mar 16 18:00   types.cf.16Mar2004.18:00:54
-rw-------   Mar 20 11:37   types.cf.20Mar2004.11:37:49
-rw-------   Mar 21 13:09   types.cf.21Mar2004.13:09:11
-rw-------   Mar 21 13:10   types.cf.previous
(All files are owned by root, group other.)
Summary
This lesson introduced a methodology for creating a service group by modifying
the main.cf configuration file and restarting VCS to use the new configuration.
Next Steps
Now that you are familiar with a variety of tools and methods for configuring
service groups, you can apply these skills to more complex configuration tasks.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
VERITAS Cluster Server User's Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
VERITAS Cluster Server Command Line Quick Reference
This card provides the syntax rules for the most commonly used VCS
commands.
The slide shows the lab configuration: the existing nameSG1 and nameSG2 service groups (each containing its Process and disk group resources) and the new service group to be added. Working together, follow the offline configuration procedure. Alternately, work alone and use the GUI to create a new service group.
Goal
The purpose of this lab is to add a service group by copying and editing the
definition in main.cf for nameSG1.
Prerequisites
Students must coordinate when stopping and restarting VCS.
Results
The new service group defined in the design worksheet is running and tested on
both cluster systems.
Introduction
Overview
This lesson describes how to create a parallel service group containing networking
resources shared by multiple service groups.
Importance
If you have multiple service groups that use the same network interface, you can
reduce monitoring overhead by using Proxy resources instead of NIC resources. If
you have many NIC resources, consider using Proxy resources to minimize any
potential performance impacts of monitoring.
Topic: After completing this lesson, you will be able to:
Sharing Network Interfaces: Describe how multiple service groups can share network interfaces.
Alternate Network Configurations: Describe alternate network configurations.
Using Parallel Service Groups: Use parallel service groups with network resources.
Localizing Resource Attributes: Localize resource attributes.
Outline of Topics
Sharing Network Interfaces
Alternate Network Configurations
Using Parallel Service Groups
Localizing Resource Attributes
The slide shows the main.cf files of several service groups (WebSG, DBSG, NFSSG, Ora1SG, and others), each defining its own IP and NIC resources that all specify the same network interface. For example (Solaris):
IP WebIP (
Device = qfe1
Address = "10.10.21.198"
)
NIC WebNIC (
Device = qfe1
)
WebIP requires WebNIC
Configuration View
The example shows a configuration with many service groups using the same
network interface specified in the NIC resource. Each service group has a unique
NIC resource with a unique name, but the Device attribute for all is qfe1 in this
Solaris example.
In addition to the overhead of many monitor cycles for the same resource, a
disadvantage of this configuration is the effect of changes in NIC hardware. If you
must change the network interface (for example in the event the interface fails),
you must change the Device attribute for each NIC resource monitoring that
interface.
The slide shows the WebSG and DBSG service groups: WebSG contains a NIC resource, and DBSG contains a Proxy resource instead of its own NIC resource. A Proxy resource mirrors the state of another resource (for example, a NIC).
Design worksheet for the Proxy resource:
Service Group Name:    DBSG
Resource Name:         DBProxy
Resource Type:         Proxy
Required Attributes:   TargetResName = WebNIC
The Proxy resource monitors the target resource on the local system, unless TargetSysName is specified. TargetResName must refer to a resource in a separate service group.
main.cf:
Proxy DBProxy (
Critical = 0
TargetResName = WebNIC
)
Optional Attributes
TargetSysName specifies the name of the system on which the target resource
status is monitored. If no system is specified, the local system is used as the target
system.
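For example, a hedged sketch of replacing DBSG's NIC resource with a Proxy resource (resource names follow the examples above):
haconf -makerw
hares -unlink DBIP DBNIC
hares -delete DBNIC
hares -add DBProxy Proxy DBSG
hares -modify DBProxy TargetResName WebNIC
hares -link DBIP DBProxy
haconf -dump -makero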
The slide shows a NetSG service group containing only a NetNIC resource on systems S1 and S2, and asks: how do you determine the status of a parallel service group that has only a persistent resource?
The slide shows the DBIP and WebIP resources on S1 and S2 and the NetSG parallel service group containing NIC and Phantom resources. A Phantom resource can be used to enable VCS to report the online status of a service group with only persistent resources.
Phantom Resources
The Phantom resource is used to report the actual status of a service group that
consists of only persistent resources. A service group shows an online status only
when all of its nonpersistent resources are online. Therefore, if a service group has
only persistent resources, VCS considers the group offline, even if the persistent
resources are running properly. When a Phantom resource is added, the status of
the service group is shown as online.
Note: Use this resource only with parallel service groups.
Service group definition (design worksheet):
Group:                  NetSG
Required Attributes:    Parallel = 1
SystemList:             S1 = 0, S2 = 1
Optional Attributes:    AutoStartList = S1, S2
Use an online method to set Parallel before adding resources.
main.cf (Solaris):
group NetSG (
SystemList = { S1 = 0, S2 = 1 }
AutoStartList = { S1, S2 }
Parallel = 1
)
NIC NetNIC (
Device = qfe1
)
Phantom NetPhantom (
)
The difference is that a parallel group can be online on more than one system without causing a concurrency fault.
The slide shows the NetSG group spanning S1 (interface qfe1) and S2 (interface hme0), with DBSG and WebSG depending on it. You can localize the Device attribute for NIC resources when systems have different network interfaces:
NIC NetNIC (
Device@S1 = qfe1
Device@S2 = hme0
)
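For example, a minimal sketch of localizing the Device attribute from the command line:
haconf -makerw
hares -local NetNIC Device
hares -modify NetNIC Device qfe1 -sys S1
hares -modify NetNIC Device hme0 -sys S2
haconf -dump -makero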
Summary
This lesson introduced a methodology for sharing network resources among
service groups.
Next Steps
Now that you are familiar with a variety of tools and methods for configuring
service groups, you can apply these skills to more complex configuration tasks.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
VERITAS Cluster Server User's Guide
This guide describes the behavior of parallel service groups and advantages of
using Proxy resources.
The slide shows the lab configuration: the nameSG1 and nameSG2 service groups now use Network (Proxy) resources, and a new parallel NetworkSG service group contains the NIC and Phantom resources.
Goal
The purpose of this lab is to add a parallel service group to monitor the NIC
resource and replace the NIC resources in the failover service groups with Proxy
resources.
Prerequisites
Students must coordinate when stopping and restarting VCS.
Results
A new parallel service group defined in the design worksheet is running on both
cluster systems. NIC is replaced with Proxy resources in all other service groups.
Introduction
Overview
This lesson describes how to configure VCS to provide event notification using e-
mail, SNMP traps, and triggers.
Importance
In order to maintain a high availability cluster, you must be able to detect and fix
problems when they occur. By configuring notification, you can have VCS
proactively notify you when certain events occur.
Configuring Notification: Configure notification using the NotifierMngr resource.
Using Triggers for Notification: Use triggers to provide notification.
Outline of Topics
Notification Overview
Configuring Notification
Using Triggers for Notification
The slide shows the notification architecture: the had daemons maintain a replicated message queue and pass events to the notifier daemon, which is managed by a NotifierMngr resource (with a NIC resource) and sends SNMP traps and SMTP e-mail.
Notification Overview
When VCS detects certain events, you can configure the notifier to:
Generate an SNMP (V2) trap to specified SNMP consoles.
Send an e-mail message to designated recipients.
Message Queue
VCS ensures that no event messages are lost while the VCS engine is running,
even if the notifier daemon stops or is not started. The had daemons
throughout the cluster communicate to maintain a replicated message queue.
If the service group with notifier configured as a resource fails on one of the nodes,
notifier fails over to another node in the cluster. Because the message queue is
guaranteed to be consistent and replicated across nodes, notifier can resume
message delivery from where it left off after it fails over to the new node.
Messages are stored in the queue until one of these conditions is met:
The notifier daemon sends an acknowledgement to had that at least one
recipient has received the message.
The queue is full. The queue is circular; the last (oldest) message is deleted in
order to write the current (newest) message.
Messages in the queue for one hour are deleted if notifier is unable to deliver to
the recipient.
Note: Before the notifier daemon connects to had, messages are stored
permanently in the queue until one of the last two conditions is met.
The slide shows the range of event severity levels, from Information to SevereError, that the had daemons pass to the notifier. A complete list of events and severity levels is included in the Job Aids Appendix.
Note: A NotifierMngr resource is added to only one service group, the ClusterService group.
Add a NotifierMngr type of resource to the ClusterService group.
If SMTP notification is required, modify the SmtpServer and SmtpRecipients attributes; optionally, modify the ResourceOwner and GroupOwner attributes.
If SNMP notification is required, modify the SnmpConsoles attribute of NotifierMngr and configure the SNMP console to receive VCS traps.
Modify any other optional attributes of NotifierMngr as desired.
Configuring Notification
While you can start and stop the notifier daemon manually outside of VCS,
you can make the notifier component highly available by placing the daemon
under VCS control.
Carry out the following steps to configure a highly available notification within the
cluster:
1 Add a NotifierMngr type of resource to the ClusterService group.
2 If SMTP notification is required:
a Modify the SmtpServer and SmtpRecipients attributes of the NotifierMngr
type of resource.
b If desired, modify the ResourceOwner attribute of individual resources
(described later in the lesson).
c You can also specify a GroupOwner e-mail address for each service group.
3 If SNMP notification is required:
a Modify the SnmpConsoles attribute of the NotifierMngr type of resource.
b Verify that the SNMPTrapPort attribute value matches the port configured
for the SNMP console. The default is port 162.
c Configure the SNMP console to receive VCS traps (described later in the
lesson).
4 Modify any other optional attributes of the NotifierMngr type of resource, as
desired.
notifier arguments shown in this example are:
See the manual pages for notifier and hanotify for a complete description
of notification configuration options.
main.cf:
NotifierMngr notifier (
SmtpServer = "smtp.veritas.com"
SmtpRecipients = { "vcsadmin@veritas.com" = SevereError }
PathName = "/opt/VRTSvcs/bin/notifier"
)
Optional Attributes
EngineListeningPort: The port that the VCS engine uses for listening.
The default is 14141.
Note: This optional attribute exists for VCS 3.5 for Solaris and for HP-UX.
This attribute does not exist for VCS 3.5 for AIX or VCS 4.0 for Solaris.
MessagesQueue: The number of messages in the queue
The default is 30.
NotifierListeningPort: Any valid unused TCP/IP port numbers
The default is 14144.
SnmpConsole: The fully qualified host name of the SNMP console and the
severity level
SnmpConsole is a required attribute if SMTP is not specified.
SnmpCommunity: The community ID for the SNMP manager
The default is public.
SnmpdTrapPort: The port to which SNMP traps are sent.
The value specified for this attribute is used for all consoles if more than one
SNMP console is specified.
The default is 162.
SmtpFromPath: A valid e-mail address, if a custom e-mail address is desired
for the FROM: field in the e-mail sent by notifier
SmtpReturnPath: A valid e-mail address, if a custom e-mail address is desired
for the Return-Path: <> field in the e-mail sent by notifier
SmtpServerTimeout: The time in seconds that notifier waits for a response
from the mail server for the SMTP commands it has sent to the mail server
This value can be increased if the mail server takes too much time to reply
back to the SMTP commands sent by notifier.
The default is 10.
SmtpServerVrfyOff: A toggle for sending SMTP VRFY requests
Setting this value to 1 results in notifier not sending an SMTP VRFY request to
the mail server specified in SmtpServer attribute, while sending e-mails. Set
this value to 1 if your mail server does not support the SMTP VRFY command.
The default is 0.
Notification Events
ResourceStateUnknown
ResourceMonitorTimeout
ResourceNotGoingOffline
ResourceRestartingByAgent
ResourceWentOnlineByItself
ResourceFaulted
hagrp -modify grp_name GroupOwner chris
Examples of service group events that cause VCS to send notification to the GroupOwner: Faulted, Concurrency Violation, Autodisabled.
The slide lists the notification-related triggers: ResNotOff, PreOnline, SysOffline, ResStateChange, PostOffline, and PostOnline. Some of these apply only to enabled service groups; others apply cluster-wide.
Summary
This lesson described how to configure VCS to provide notification using e-mail
and SNMP traps.
Next Steps
The next lesson describes how VCS responds to resource faults and the options
you can configure to modify the default behavior.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This document provides important reference information for the VCS agents
bundled with the VCS software.
VERITAS Cluster Server User's Guide
This document provides information about all aspects of VCS configuration.
The slide shows the nameSG1, nameSG2, and ClusterService service groups; the NotifierMngr resource is added to the ClusterService group.
Optional Lab: Triggers (resfault, nofailover, resadminwait)
SMTP Server: ___________________________________
Goal
The purpose of this lab is to configure notification.
Prerequisites
Students work together to add a NotifierMngr resource to the ClusterService
group.
Results
The ClusterService group now has a NotifierMngr resource and notification is
working.
Introduction
Overview
This lesson describes how VCS responds to resource faults and introduces various
components, such as resource type attributes, that you can configure to customize
the VCS engines response to resource faults. This lesson also describes how to
recover after a resource is put into a FAULTED or ADMIN_WAIT state.
Importance
In order to maintain a high availability cluster, you must understand how service
groups behave in response to resource failures and how you can customize this
behavior. This enables you to configure the cluster optimally for your computing
environment.
Event Handling triggers.
Outline of Topics
VCS Response to Resource Faults
Determining Failover Duration
Controlling Fault Behavior
Recovering from Resource Faults
Fault Notification and Event Handling
The slide shows how VCS responds when a resource goes offline unexpectedly:
If the service group is frozen, VCS does nothing except fault the resource and the service group.
If the group is not frozen, VCS checks FaultPropagation. If FaultPropagation is 0, VCS does not take any other resource offline. If FaultPropagation is 1, VCS faults the resource and takes all resources in the path offline.
If a critical resource is in the path, VCS faults the service group and takes the entire service group offline; if a failover target is available, the service group is brought online elsewhere, otherwise it is kept offline.
If no critical resource is in the path, the service group is kept partially online.
Frozen or TFrozen
These service group attributes are used to indicate that the service group is frozen
due to an administrative command. When a service group is frozen, all agent
actions except for monitor are disabled. If the service group is temporarily frozen
using the hagrp -freeze group command, the TFrozen attribute is set to 1,
and if the service group is persistently frozen using the hagrp -freeze group
-persistent command, the Frozen attribute is set to 1. When the service group
is unfrozen using the hagrp -unfreeze group [-persistent]
command, the corresponding attribute is set back to the default value of 0.
ManageFaults
This service group attribute can be used to prevent VCS from taking any automatic
actions whenever a resource failure is detected. Essentially, ManageFaults
determines whether VCS or an administrator handles faults for a service group.
If ManageFaults is set to the default value of ALL, VCS manages faults by
executing the clean entry point for that resource to ensure that the resource is
completely offline, as shown previously. The default setting of ALL provides the
same behavior as VCS 3.5.
FaultPropagation
The FaultPropagation attribute determines whether VCS evaluates the effects of a
resource fault on parents of the faulted resource.
If ManageFaults is set to ALL, VCS runs the clean entry point for the faulted
resource and then checks the FaultPropagation attribute of the service group. If this
attribute is set to 0, VCS does not take any further action. In this case, VCS fails
over the service group only on system failures and not on resource faults.
The default value is 1, which means that VCS continues through the failover
process shown in the next section. This is the same behavior as VCS 3.5 and
earlier releases.
Notes:
The ManageFaults and FaultPropagation attributes of a service group are
introduced in VCS version 3.5 for AIX and VCS version 3.5 MP1 (or 2.0 P4)
for Solaris. VCS 3.5 for HP-UX and any earlier versions of VCS on any other
platform do not have these attributes. If these attributes do not exist, the VCS
response to resource faults is the same as with the default values of these
attributes.
ManageFaults and FaultPropagation have essentially the same effect when
enabledservice group failover is suppressed. The difference is that when
ManageFaults is set to NONE, the clean entry point is not run and that resource
is put in an ADMIN_WAIT state.
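For example, a hedged sketch of suppressing automatic fault handling for a service group (the group name DemoSG is illustrative):
haconf -makerw
hagrp -modify DemoSG ManageFaults NONE
hagrp -modify DemoSG FaultPropagation 0
haconf -dump -makero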
The slide shows the next stage of the decision: VCS takes the entire group offline and then checks AutoFailOver. If AutoFailOver is 0, the service group is kept offline. If AutoFailOver is 1, VCS chooses a failover target from the SystemList based on FailOverPolicy; if a failover target is available, the service group is brought online elsewhere, otherwise it is kept offline.
AutoFailOver
This attribute determines whether automatic failover takes place when a resource
or system faults. The default value of 1 indicates that the service group should be
failed over to other available systems if at all possible. However, if the attribute is
set to 0, no automatic failover is attempted for the service group, and the service
group is left in an OFFLINE|FAULTED state.
The slide shows the elements that together determine the failover duration.
Note: If you change a resource type attribute, you affect all resources of that type.
Adjusting Monitoring
You can change some resource type attributes to facilitate failover testing. For
example, you can change the monitor interval to see the results of faults more
quickly. You can also adjust these attributes to affect how quickly an application
fails over when a fault occurs.
MonitorInterval
This is the duration (in seconds) between two consecutive monitor calls for an
online or transitioning resource.
The default is 60 seconds for most resource types.
OfflineMonitorInterval
This is the duration (in seconds) between two consecutive monitor calls for an
offline resource. If set to 0, offline resources are not monitored.
The default is 300 seconds for most resource types.
Refer to the VERITAS Cluster Server Bundled Agents Reference Guide for the
applicable monitor interval defaults for specific resource types.
Controlling Fault Behavior
Type Attributes Related to Resource Faults
Although the failover capability of VCS helps to minimize the disruption of
application services when resources fail, the process of migrating a service to
another system can be time-consuming. In some cases, you may want to attempt to
restart a resource on the same system before failing it over to another system.
Whether a resource can be restarted depends on the application service:
The resource must be successfully cleared (taken offline) after failure.
The resource must not be a child resource, which has dependent parent
resources that must be restarted.
If you have determined that a resource can be restarted without impacting the
integrity of the application, you can potentially avoid service group failover by
configuring these resource type attributes:
RestartLimit
The restart limit determines how many times a resource can be restarted within
the confidence interval before the resource faults.
For example, you may want to restart a resource such as the Oracle listener
process several times before it causes an Oracle service group to fault.
ConfInterval
When a resource has remained online for the specified time (in seconds),
previous faults and restart attempts are ignored by the agent. When this clock
expires, the restart or tolerance counter is reset to zero.
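For example, a hedged sketch of allowing one restart within a three-minute confidence interval, either for all resources of a type or for a single overridden resource (the values match the example that follows):
haconf -makerw
hatype -modify Process RestartLimit 1
hatype -modify Process ConfInterval 180
hares -override DemoProcess RestartLimit
hares -modify DemoProcess RestartLimit 1
haconf -dump -makero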
The slide shows a timeline in which a resource goes offline and is restarted, remains online for less than ConfInterval, goes offline again, and is then faulted; MonitorInterval determines how quickly each offline is detected.
Restart Example
This example illustrates how the RestartLimit and ConfInterval attributes can be
configured for modifying the behavior of VCS when a resource is faulted.
Setting RestartLimit = 1 and ConfInterval = 180 has this effect when a resource
faults:
1 The resource stops after running for 10 minutes.
2 The next monitor returns offline.
3 The ConfInterval counter is set to 0.
4 The agent checks the value of RestartLimit.
5 The resource is restarted because RestartLimit is set to 1, which allows one
restart within the ConfInterval.
6 The next monitor returns online.
7 The ConfInterval counter is now 60 (one monitor cycle has completed).
8 The resource stops again.
9 The next monitor returns offline.
10 The ConfInterval counter is now 120 (two monitor cycles have completed).
11 The resource is not restarted because the RestartLimit counter is now 2 and the
ConfInterval counter is 120 (seconds). Because the resource has not been
online for the ConfInterval time of 180 seconds, it is not restarted.
12 VCS faults the resource.
If the resource had remained online for 180 seconds, the internal RestartLimit
counter would have been reset to 0.
types.cf:
type NIC (
static int MonitorInterval = 15
static int OfflineMonitorInterval = 60
static int ToleranceLimit = 2
static str ArgList[] = { Device, ...
)
These static type attributes can be used to optimize agents and are applied to all resources of the specified type.
hatype -modify NIC ToleranceLimit 2
hares -override myMount MonitorInterval       Override MonitorInterval
hares -modify myMount MonitorInterval 10      Modify the overridden attribute
hares -display -ovalues myMount               Display overridden values
hares -undo_override myMount MonitorInterval  Restore default settings
main.cf:
Mount myMount (
MountPoint = "/mydir"
. . .
MonitorInterval = 10
. . .
)
To clear a nonpersistent resource fault:
1. Ensure that the fault is fixed outside of VCS and that the resource is completely offline.
2. Use the hares -clear resource command to clear the FAULTED status.
To clear a persistent resource fault:
1. Ensure that the fault is fixed outside of VCS.
2. Either wait for the periodic monitoring or probe the resource manually using the command:
hares -probe resource -sys system
A resource cannot be taken offline:                          Call resnotoff (if present).
A resource is placed in an ADMIN_WAIT state:                 Call resadminwait (if present).
A resource is brought online or taken offline successfully:  Call resstatechange (if present and configured).
The failover target does not exist:                          Call nofailover (if present).
Summary
This lesson described how VCS responds to resource faults and introduced various
components of VCS that enable you to customize VCS response to resource faults.
Next Steps
The next lesson describes how the cluster communication mechanisms work to
build and maintain the cluster membership.
Additional Resources
VERITAS Cluster Server Bundled Agents Reference Guide
This document provides important reference information for the VCS agents
bundled with the VCS software.
VERITAS Cluster Server User's Guide
This document provides information about all aspects of VCS configuration.
High Availability Design Using VERITAS Cluster Server instructor-led training
This course provides configuration procedures and practical exercises for
configuring triggers.
Note: Network interfaces for virtual IP addresses are unconfigured to force the IP resource to fault.
In your classroom, the interface you specify is: ______
Replace the variable interface in the lab steps with this value.
Goal
The purpose of this lab is to observe how VCS responds to faults in a variety of
scenarios.
Results
Each student observes the effects of failure events in the cluster.
Prerequisites
Obtain any classroom-specific values needed for your classroom lab environment
and record these values in your design worksheet included with the lab exercise
instructions.
Introduction
Overview
This lesson describes how the cluster interconnect mechanism works. You also
learn how the GAB and LLT configuration files are set up during installation to
implement the communication channels.
Importance
Although you may never need to reconfigure the cluster interconnect, developing a
thorough knowledge of how the cluster interconnect functions is key to
understanding how VCS behaves when systems or network links fail.
Outline of Topics
VCS Communications Review
Cluster Membership
Cluster Interconnect Configuration
Joining the Cluster Membership
The slide shows the VCS communication stack on each cluster system: agents communicate with had, which uses GAB over LLT.
LLT broadcasts a heartbeat on each interface every second. Each LLT module tracks the status of the heartbeat from each peer on each interface. LLT forwards the heartbeat status of each node to GAB.
GAB determines cluster membership by monitoring heartbeats transmitted
from each system over LLT.
# gabconfig -a
GAB Port Memberships
===============================================
Port a gen a36e003 membership 01 ; ;12
Port h gen fd57002 membership 01 ; ;12
Cluster Membership
GAB Status and Membership Notation
To display the cluster membership status, type gabconfig on each system. For
example:
gabconfig -a
If GAB is operating, the following GAB port membership information is returned:
Port a indicates that GAB is communicating, a36e0003 is a randomly
generated number, and membership 01 indicates that systems 0 and 1 are
connected.
Port h indicates that VCS is started, fd570002 is a randomly generated
number, and membership 01 indicates that systems 0 and 1 are both running
VCS.
Note: The port a and port h generation numbers change each time the membership
changes.
Solaris Example
S1# lltstat -nvv | pg
LLT node information:
Node       State    Link    Status   Address
* 0 S1     OPEN
                    qfe0    UP       08:00:20:AD:BC:78
                    hme0    UP       08:00:20:AD:BC:79
  1 S2     OPEN
                    qfe0    UP       08:00:20:B4:0C:3B
                    hme0    UP       08:00:20:B4:0C:3C
The lltstat Command
Use the lltstat command to verify that links are active for LLT. This command
returns information about the links for LLT for the system on which it is typed. In
the example shown in the slide, lltstat -nvv is typed on the S1 system to
produce the LLT status in a cluster with two systems.
The -nvv options cause lltstat to list systems with very verbose status:
Link names from llttab
Status
MAC address of the Ethernet ports
Other lltstat uses:
Without options, lltstat reports whether LLT is running.
The -c option displays the values of LLT configuration directives.
The -l option lists information about each configured LLT link.
You can also create a script that runs lltstat -nvv and checks the output for the string DOWN; run it periodically from cron to report failed links (a sketch of such a script follows the note below).
Use the exclude directive in llttab to eliminate information about
nonexistent systems.
Note: This level of detailed information about LLT links is only available through
the CLI. Basic status is shown in the GUI.
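For reference, here is a minimal sketch of such a monitoring script. The script path, temporary file, and mail recipient are illustrative assumptions; adapt them to your site, and confirm the location of lltstat on your platform (it is assumed to be /sbin/lltstat here).

#!/bin/sh
# check_llt.sh -- report LLT links that are DOWN (run periodically from cron)
TMP=/tmp/llt_down.$$
if /sbin/lltstat -nvv | grep DOWN > $TMP 2>&1; then
    # grep exits 0 only when a DOWN link was found
    mailx -s "LLT link DOWN on `uname -n`" root < $TMP
fi
rm -f $TMP

An example crontab entry to run the check every 15 minutes:
0,15,30,45 * * * * /usr/local/bin/check_llt.sh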
The LLT configuration files are located in the /etc directory.
# cat /etc/llttab
set-node S1
set-cluster 10
link qfe0 /dev/qfe:0 - ether - -
link hme0 /dev/hme:0 - ether - -

# cat /etc/llthosts
0 S1
1 S2

(Solaris example. Valid set-cluster values are 0 - 255; valid node IDs in llthosts are 0 - 31.)
A unique number must be assigned to each system in a cluster using the
set-node directive.
The value of set-node can be one of the following:
An integer in the range of 0 through 31 (32 systems per cluster maximum)
A system name matching an entry in /etc/llthosts
If a number is specified, each system in the cluster must have a unique llttab
file, which has a unique value for set-node. Likewise, if a system name is
specified, each system must have a different llttab file with a unique system
name that is listed in llthosts, which LLT maps to a node ID.
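For example, here is a sketch of the integer form on the first system, assuming node IDs 0 and 1; the second system's llttab would be identical except for set-node 1:

# cat /etc/llttab
set-node 0
set-cluster 10
link qfe0 /dev/qfe:0 - ether - -
link hme0 /dev/hme:0 - ether - -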
# cat /etc/llthosts
0 S1
1 S2
# cat /etc/VRTSvcs/conf/sysname
S1
The sysname file is an optional LLT configuration file. This file is used to store
the system (node) name. In later versions, the VCS installation utility creates the
sysname file on each system, which contains the host name for that system.
The purpose of the sysname file is to remove VCS dependence on the UNIX
uname utility for determining the local system name. If the sysname file is not
present, VCS determines the local host name using uname. If uname returns a
fully qualified domain name (sys.company.com), VCS cannot match the name
to the systems in the main.cf cluster configuration and therefore cannot start on
that system.
If uname returns a fully qualified domain name on your cluster systems, ensure
that the sysname file is configured with the local host name in
/etc/VRTSvcs/conf.
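For example, a minimal way to create the file on the system that main.cf names S1 (a sketch; substitute your own system name):

# echo "S1" > /etc/VRTSvcs/conf/sysname
# cat /etc/VRTSvcs/conf/sysname
S1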
Note: Although you can specify a name in the sysname file that is completely
different from the UNIX host name shown in the output of uname, this can lead to
problems and is not recommended. For example, consider a scenario where system
S1 fails and you replace it with another system named S3. You configure VCS on
S3 to make it appear to be S1 by creating a sysname file with S1. While this has
the advantage of minimizing VCS configuration changes, it can create a great deal
of confusion when troubleshooting problems. From the VCS point of view, the
system is shown as S1. From the UNIX point of view, the system is S3.
See the sysname manual page for a complete description of the file.
# cat /etc/gabtab
/sbin/gabconfig -c -n 4
The diagram shows HAD, GAB, and LLT on systems S1, S2, and S3 exchanging "I am alive" heartbeat messages across the cluster interconnect.
GAB and LLT are started automatically when a system starts up. HAD can only
start after GAB membership has been established among all cluster systems. The
mechanism that ensures that all cluster systems are visible on the cluster
interconnect is GAB seeding.
The diagram shows the startup sequence: LLT starts, GAB seeds when all systems are visible on the cluster interconnect, and HAD starts after GAB is seeded.
Manual Seeding
You can override the seed values in the gabtab file and manually force GAB to
seed a system using the gabconfig command. This is useful when one of the
systems in the cluster is out of service and you want to start VCS on the remaining
systems.
To seed the cluster, start GAB on one node with -x to override the -n value set in
the gabtab file. For example, type:
gabconfig -c -x
Warning: Only manually seed the cluster when you are sure that no other systems
have GAB seeded. In clusters that do not use I/O fencing, you can potentially
create a split brain condition by using gabconfig improperly.
After you have started GAB on one system, start GAB on other systems using
gabconfig with only the -c option. You do not need to force GAB to start with
the -x option on other systems. When GAB starts on the other systems, it
determines that GAB is already seeded and starts up.
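For example, here is a sketch of the sequence when one cluster system is known to be out of service:

# On one running system only, force GAB to seed:
gabconfig -c -x
# On each of the other running systems:
gabconfig -c
# Verify that port a membership has formed:
gabconfig -a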
HAD tells agents to probe (monitor) all resources on all systems in the SystemList to determine their status.
Summary
This lesson described how the cluster interconnect mechanism works and the
format and content of the configuration files.
Next Steps
The next lesson describes how system and communication failures are handled in a
VCS cluster environment that does not support I/O fencing.
Additional Resources
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
VERITAS Cluster Server Installation Guide
This guide provides detailed information on configuring VCS communication
mechanisms.
Introduction
Overview
This lesson describes how VCS handles system and communication failures in
clusters that do not implement I/O fencing.
Importance
A thorough understanding of how VCS responds to system and communication
faults ensures that you know how services and their users are affected in common
failure situations.
Outline of Topics
Ensuring Data Integrity
Cluster Interconnect Failures
Changing the Interconnect Configuration
Prior to any failures, systems S1, S2, and S3 are part of the regular membership of
cluster number 1.
Failover due to a resource fault or switchover of service groups at operator request is unaffected. The only change is that other systems are prevented from starting the jeopardized system's service groups if that system faults. VCS continues to operate as a single cluster as long as at least one network channel exists between the systems.
In the example shown in the diagram where one LLT link fails:
A jeopardy membership is formed that includes just system S3.
System S3 is also a member of the regular cluster membership with systems S1
and S2.
Service groups A, B, and C continue to run and all other cluster functions
remain unaffected.
Failover due to a resource fault or an operator request to switch a service group is unaffected.
If system S3 now faults or its last LLT link is lost, service group C is not
started on systems S1 or S2.
Jeopardy Membership
When a system is down to a single LLT link, VCS can no longer reliably
discriminate between loss of a system and loss of the last LLT connection. Systems
with only a single LLT link are put into a special cluster membership known as
jeopardy.
Jeopardy is a mechanism for preventing split-brain condition if the last LLT link
fails. If a system is in a jeopardy membership, and then loses its final LLT link:
Service groups in the jeopardy membership are autodisabled in the regular
cluster membership.
Service groups in the regular membership are autodisabled in the jeopardy
membership.
Jeopardy membership also occurs in the case where had stops and hashadow is
unable to restart had.
Because system S3 was in a jeopardy membership prior to the last link failing:
Service group C is autodisabled in the mini-cluster containing systems S1
and S2 to prevent either system from starting it.
Service groups A and B are autodisabled in the cluster membership for
system S3 to prevent system S3 from starting either one.
Service groups A and B can still fail over between systems S1 and S2.
In this example, the cluster interconnect has partitioned and two separate cluster
memberships have formed as a result, one on each side of the partition.
Recovery Behavior
When a cluster partitions because the cluster interconnect has failed, each of the
mini-clusters continues to operate. However, because they cannot communicate,
each maintains and updates only its own version of the cluster configuration and
the systems on different sides of the network partition have different cluster
configurations.
If you reconnect the LLT links without first stopping VCS on one side of the
partition, GAB automatically stops HAD on selected systems in the cluster to
protect against a potential split-brain scenario.
GAB protects the cluster as follows:
In a two-system cluster, the system with the lowest LLT node number
continues to run VCS and VCS is stopped on the higher-numbered system.
In a multisystem cluster, the mini-cluster with the most systems running
continues to run VCS. VCS is stopped on the systems in the smaller mini-
clusters.
If a multisystem cluster is split into two equal-size mini-clusters, the cluster
containing the lowest node number continues to run VCS.
If an application starts on multiple systems and can gain control of what are
normally exclusive resources, such as disks in a shared storage device, split brain
condition results and data can be corrupted.
The slide shows two cases: (1) no change in membership; (2) jeopardy membership: S3; regular membership: S1, S2, S3; the public network is now used for heartbeat and status.
The slide shows two scenarios: (1) a network partition with regular membership S1, S2 on one side and S3 on the other, where the SGHB resource faults when it is brought online; (2) S3 faults and service group C is started on S1 or S2, then the LLT links to S3 are disconnected, leaving regular membership S1, S2 and no membership for S3.
When systems that were part of a failed network partition restart, GAB seeding prevents HAD from starting on those systems.
In the scenario shown in the diagram, system S3 cannot start HAD when it reboots
because the network failure prevents GAB from communicating with any other
cluster systems; therefore, system S3 cannot seed.
Edit Files
# llttab file:
set-node S1
set-cluster 10
# llthosts file:
0 S1
1 S2
# gabtab file:
/sbin/gabconfig -c -n number
# sysname file:
S1
For example, if you added a system to a running cluster, you can change the value
of -n in the gabtab file without having to restart GAB. However, if you added
the -j option to change the recovery behavior, you must either restart GAB or
run the gabconfig command from the gabtab file manually for the change to take effect.
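For example, here is a sketch after adding a third system, assuming you have edited /etc/gabtab to raise the seed count:

# cat /etc/gabtab
/sbin/gabconfig -c -n 3

To make the change take effect immediately rather than at the next GAB start, you can run the same gabconfig command manually on each system.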
Similarly, if you add a host entry to llthosts, you do not need to restart LLT.
However, if you change llttab, or you change a host name in llthosts, you
must stop and restart LLT, and, therefore, GAB.
Regardless of the type of change made, the procedure shown in the slide ensures
that the changes take effect. You can also use the scripts in the /etc/rc*.d
directories to stop and start services.
Note: On Solaris, you must also unload the LLT and GAB modules if you are
removing a system from the cluster, or upgrading LLT or GAB binaries. For
example:
modinfo | grep gab
modunload -i gab_id
modinfo | grep llt
modunload -i llt_id
# cat /etc/llttab
# Solaris example
set-node S1
set-cluster 10
link qfe0 /dev/qfe:0 - ether - -
link hme0 /dev/hme:0 - ether - -
link ce0 /dev/ce:0 - ether - -
link-lowpri qfe1 /dev/qfe:1 - ether - -
Summary
This lesson described how VCS protects data in shared storage environments that
do not support I/O fencing. You also learned how you can modify the
communication configuration.
Next Steps
Now that you know how VCS behaves when faults occur in a non-fencing
environment, you can learn how VCS handles system and communication failures
in a fencing environment.
Additional Resources
VERITAS Cluster Server Installation Guide
This guide describes how to configure the cluster interconnect.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
Optional Lab: injeopardy trigger
Goal
The purpose of this lab is to configure a low-priority link and then pull network
cables and observe how VCS responds.
Prerequisites
Work together to perform the tasks in this lab. Obtain any classroom-specific values needed for your lab environment and record these values in your design worksheet included with the lab exercise.
Results
All network links are up and monitored when the lab is completed, and the InJeopardy trigger has been run.
Introduction
Overview
This lesson describes how the VCS I/O fencing feature protects data in a shared
storage environment.
Importance
Having a thorough understanding of how VCS responds to system and
communication faults when I/O fencing is configured ensures that you know how
services and their users are affected in common failure situations.
Outline of Topics
Data Protection Requirements
I/O Fencing Concepts and Components
I/O Fencing Operations
I/O Fencing Implementation
Configuring I/O Fencing
Recovering Fenced Systems
System Failure
In order to keep services highly available, the cluster software must be capable of taking corrective action when a system fails. Most cluster implementations are lights-out environments; the HA software must automatically respond to faults without administrator intervention.
Example corrective actions are:
Starting an application on another node
Reconfiguring parallel applications to no longer include the departed node in
locking operations
The animation shows conceptually how VCS handles a system fault. The yellow
service group that was running on Server 2 is brought online on Server 1 after
GAB on Server 1 stops receiving heartbeats from Server 2 and notifies HAD.
Interconnect Failure
A key function of a high availability solution is to detect and respond to system
faults. However, the system may still be running but unable to communicate
heartbeats due to a failure of the cluster interconnect. The other systems in the
cluster have no way to distinguish between the two situations.
This problem is faced by all HA solutions: how can the HA software distinguish a
system fault from a failure of the cluster interconnect? As shown in the example
diagram, whether the system on the right side (Server 2) fails or the cluster
interconnect fails, the system on the left (Server 1) no longer receives heartbeats
from the other system.
The HA software must have a method to prevent an uncoordinated view among
systems of the cluster membership in any type of failure scenario.
In the case where nodes are running but the cluster interconnect has failed, the HA
software needs to have a way to determine how to handle the nodes on each side of
the network split, or partition.
Network Partition
A network partition is formed when one or more nodes stop communicating on the
cluster interconnect due to a failure of the interconnect.
If a system hangs such that it seems to have failed, its services can be started on another
system. This can also happen on systems where the hardware supports a break and
resume function. If the system is dropped to command-prompt level with a break
and subsequently resumed, the system can appear to have failed. The cluster is
reformed and then the system recovers and begins writing to shared storage again.
The remainder of this lesson describes how the VERITAS fencing mechanism
prevents split brain condition in failure situations.
Cluster membership: the membership must be consistent.
Data protection: upon a membership change, only one cluster can survive and have exclusive control of shared data disks.
When the heartbeats stop, VCS needs to take action, but both failures have the same symptoms. Which failure is it? What action should be taken?
VCS uses a mechanism called I/O fencing to guarantee data protection. I/O fencing uses SCSI-3 persistent reservations (PR) to fence off data drives to prevent split brain condition.
SCSI-3 PR supports multiple nodes accessing a device while at the same time
blocking access to other nodes. Persistent reservations are persistent across SCSI
bus resets and PR also supports multiple paths from a host to a disk.
Coordinator Disks
The coordinator disks act as a global lock mechanism, determining which nodes
are currently registered in the cluster. This registration is represented by a unique
key associated with each node that is written to the coordinator disks. In order for a
node to access a data disk, that node must have a key registered on coordinator
disks.
When system or interconnect failures occur, the coordinator disks ensure that only
one cluster survives, as described in the I/O Fencing Operations section.
Data Disks
Data disks are standard disk devices used for shared data storage. These can be
physical disks or RAID logical units (LUNs). These disks must support SCSI-3
PR. Data disks are incorporated into standard VM disk groups. In operation,
Volume Manager is responsible for fencing data disks on a disk group basis.
Disks added to a disk group are automatically fenced, as are new paths to a device when they are discovered.
Each node registers a unique key with the data disks: Node 0 uses AVCS, Node 1 uses BVCS, and so on.
In the example shown in the diagram, Node 0 is registered to write to the data disks
in the disk group belonging to the DB service group. Node 1 is registered to write
to the data disks in the disk group belonging to the App service group.
After registering with the data disk, Volume Manager sets a Write Exclusive
Registrants Only reservation on the data disk. This reservation means that only the
registered system can write to the data disk.
System Failure
The diagram shows the fencing sequence when a system fails.
1 Node 0 detects that Node 1 has failed when the LLT heartbeat times out and
informs GAB. At this point, port a on Node 0 (GAB membership) shows only 0.
2 The fencing driver is notified of the change in GAB membership and Node 0
races to win control of a majority of the coordinator disks.
This means Node 0 must eject Node 1 keys (B) from at least two of three
coordinator disks. In coordinator disk serial number order, the fencing driver
ejects the registration of Node 1 (B keys) using the SCSI-3 Preempt and Abort
command. This command allows a registered member on a disk to eject the
registration of another. Because I/O fencing uses the same key for all paths
from a host, a single preempt and abort ejects a host from all paths to storage.
3 In this example, Node 0 wins the race for each coordinator disk by ejecting
Node 1 keys from each coordinator disk.
4 Now port b (fencing membership) shows only Node 0 because Node 1 keys
have been ejected. Therefore, fencing has a consistent membership and passes
the cluster reconfiguration information to HAD.
5 GAB port h reflects the new cluster membership containing only Node 0 and
HAD now performs whatever failover operations are defined for the service
groups that were running on the departed system.
Fencing takes place when a service group is brought online on a surviving
system as part of the disk group importing process. When the DiskGroup
resources come online, the agent online entry point instructs Volume Manager
to import the disk group with options to remove the Node 1 registration and
reservation, and place a SCSI-3 registration and reservation for Node 0.
1. Node 1 detects no more heartbeats from Node 0.
2. Nodes 0 and 1 race for the coordinator disks, ejecting each other's keys. Only one node can win each disk.
3. Node 0 wins the majority of coordinator disks.
4. Node 1 panics.
5. Node 0 now has perfect membership.
6. VCS fails over the App service group, importing the disk group and changing the reservation.
The disk group for DB has the key AVCS and a reservation giving Node 0 exclusive access; after failover, the disk group for App also has a reservation giving Node 0 exclusive access.
Interconnect Failure
The diagram shows how VCS handles fencing if the cluster interconnect is severed
and a network partition is created. In this case, multiple nodes are racing for
control of the coordinator disks.
1 LLT on Node 0 informs GAB that it has not received a heartbeat from Node 1
within the timeout period. Likewise, LLT on Node 1 informs GAB that it has
not received a heartbeat from Node 0.
2 When the fencing drivers on both nodes receive a cluster membership change
from GAB, they begin racing to gain control of the coordinator disks.
The node that reaches the first coordinator disk (based on disk serial number)
ejects the failed node's key. In this example, Node 0 wins the race for the first
coordinator disk and ejects the B------- key.
After the B key is ejected by Node 0, Node 1 cannot eject the key for Node 0
because the SCSI-PR protocol says that only a member can eject a member.
SCSI command tag queuing creates a stack of commands to process, so there is
no chance of these two ejects occurring simultaneously on the drive. This
condition means that only one system can win.
3 Node 0 also wins the race for the second coordinator disk.
Node 0 is favored to win the race for the second coordinator disk according to
the algorithm used by the fencing driver. Because Node 1 lost the race for the
first coordinator disk, Node 1 has to reread the coordinator disk keys a number
of times before it tries to eject the other node's key. This favors the winner of
the first coordinator disk to win the remaining coordinator disks. Therefore,
Node 1 does not gain control of the second or third coordinator disks.
If Node 1 is then restarted while the interconnect is still down, GAB does not seed automatically unless an administrator forces it with the gabconfig -x command.
4 As part of the initialization of fencing, the fencing driver receives a list of
current nodes in the GAB membership, reads the keys present on the
coordinator disks, and performs a comparison.
In this example, the fencing driver on Node 1 detects keys from Node 0 (A---
----) but does not detect Node 0 in the GAB membership because the cluster
interconnect has been severed.
gabconfig -a
GAB Port Memberships
===================================================
Port a gen b7r004 membership 1
Majority Clusters
The I/O fencing algorithm is designed to give priority to larger clusters in any
arbitration scenario. For example, if a single node is separated from a 16-node
cluster due to an interconnect fault, the 15-node cluster should continue to run. The
fencing driver uses the concept of a majority cluster. The algorithm determines if
the number of nodes remaining in the cluster is greater than or equal to the number
of departed nodes. If so, the larger cluster is considered a majority cluster. The
majority cluster begins racing immediately for control of the coordinator disks on
any membership change. The fencing drivers on the nodes in the minority cluster
delay the start of the race to give an advantage to the larger cluster. This delay is
accomplished by reading the keys on the coordinator disks a number of times. This
algorithm ensures that the larger cluster wins, but also allows a smaller cluster to
win if the departed nodes are not actually running.
The vxfen fencing driver:
Uses GAB port b for communication
Determines coordinator disks on vxfen startup
Intercepts RECONFIG messages from GAB destined for the VCS engine
Controls fencing actions carried out by Volume Manager
Reads the serial numbers of the coordinator disks and stores them in memory
d If this is the second or later member to register, obtains serial numbers of
coordinator disks from the first member
e Reads and compares the local serial number
f Errors out, if the serial number is different
g Begins a preexisting network partition check
h Reads current keys registered on coordinator disks
i Determines that all keys match the current port b membership
j Registers the key with coordinator disks
2 Membership is established (port b).
3 HAD is started and port h membership is established.
Fencing Driver
Fencing in VCS is implemented in two primary areas:
The vxfen fencing driver, which directs Volume Manager
Volume Manager, which carries out actual fencing operations at the disk group
level
The fencing driver is a kernel module that connects to GAB to intercept cluster
membership changes (reconfiguration messages). If a membership change occurs,
GAB passes the new membership in the form of a reconfiguration message to
vxfen on GAB port b. The fencing driver on the node with the lowest node ID in the
remaining cluster races for control of the coordinator disks, as described
previously. If this node wins, it passes the list of departed nodes to VxVM to have
these nodes ejected from all shared disk groups.
After carrying out required fencing actions, vxfen passes the reconfiguration
message to HAD.
Deport the disk group.
vxdg deport fendg
Create /etc/vxfendg on all systems.
echo fendg > /etc/vxfendg
-f file_name (Verify all disks listed in the file.)
-g disk_group (Verify all disks in the disk group.)
Note: You can check individual LUNs for SCSI-3 support to ensure that you
have the array configured properly before checking all disk groups. To
determine the paths on each system for that disk, use the vxfenadm utility to
check the serial number of the disk. For example:
vxfenadm -i disk_dev_path
After you have verified the paths to that disk on each system, you can run
vxfentsthdw with no arguments, which prompts you for the systems and
then for the path to that disk from each system. A verified path means that the
SCSI inquiry succeeds. For example, vxfenadm returns a disk serial number
from a SCSI-3 disk and an ioctl failed message from a non-SCSI-3 disk.
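For example, here is a sketch using a hypothetical Solaris device path; run the same command on each system and compare the results:

# vxfenadm -i /dev/rdsk/c2t13d0s2

If the SCSI inquiry succeeds, the command reports the disk serial number; the same serial number should appear for that disk on every system.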
Run the start script for fencing on each system.
/etc/init.d/vxfen start
Save and close the configuration.
haconf -dump -makero
Stop VCS on all systems.
hastop -all
Set UseFence in main.cf.
UseFence = SCSI3
Restart VCS.
hastart [-stale]
! You must stop and restart service groups so that the disk groups are imported using SCSI-3 reservations.
When the fencing driver is started, the start script populates the vxfentab file with the current list of all paths to the coordinator disks.
Note: This is the reason coordinator disks cannot be dynamically replaced. The
fencing driver must be stopped and restarted to populate the vxfentab file
with the updated paths to the coordinator disks.
6 Save and close the cluster configuration before modifying main.cf to ensure
that the changes you make to main.cf are not overridden.
7 Stop VCS on all systems. Do not use the -force option. You must stop and
restart service groups to reimport disk groups to place data under fencing
control.
8 Set the UseFence cluster attribute to SCSI3 in the main.cf file.
Note: You cannot set UseFence dynamically while VCS is running.
1 Node 2 is cut off from the heartbeat network, loses the race, and panics.
2 Shut down Node 2.
3 Fix the system or the interconnect.
4 Start Node 2.
Summary
This lesson described how VCS protects data in a shared storage environment,
focusing on the concepts and basic operations of the I/O fencing feature available
in VCS version 4.
Next Steps
Now that you understand how VCS behaves normally and when faults occur, you
can gain experience performing basic troubleshooting in a cluster environment.
Additional Resources
VERITAS Cluster Server Installation Guide
This guide describes I/O fencing configuration.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
VERITAS Volume Manager Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing storage using Volume Manager.
http://van.veritas.com
The VERITAS Architect Network provides access to technical papers
describing I/O fencing.
trainxx trainxx
Disk 1:___________________
Disk 3:___________________
nameDG1, nameDG2
Goal
The purpose of this lab is to set up I/O fencing in a two-node cluster and simulate
node and communication failures.
Prerequisites
Work with your lab partner to complete the tasks in this lab exercise.
Results
Each student observes the failure scenarios and performs the tasks necessary to
bring the cluster back to a running state.
Introduction
Overview
In this lesson you learn an approach for detecting and solving problems with
VERITAS Cluster Server (VCS) software. You work with specific problem
scenarios to gain a better understanding of how the product works.
Importance
To successfully deploy and manage a cluster, you need to understand the
significance and meaning of errors, faults, and engine problems. This helps you
detect and solve problems efficiently and effectively.
Outline of Topics
Monitoring VCS
Troubleshooting Guide
Cluster Communication Problems
VCS Engine Problems
Service Group and Resource Problems
Archiving VCS-Related Files
Monitoring VCS
VCS provides numerous resources you can use to gather information about the
status and operation of the cluster. These include:
VCS log files
VCS engine log file, /var/VRTSvcs/log/engine_A.log
Agent log files
hashadow log file, /var/VRTSvcs/log/hashadow_A.log
System log files:
/var/adm/messages (/var/adm/syslog on HP-UX)
/var/log/syslog
The hastatus utility
Notification by way of SNMP traps and e-mail messages
Event triggers
Cluster Manager
The information sources that have not been covered elsewhere in the course are
discussed in more detail in the next sections.
Unique Message
Identifier (UMI)
2003/05/20 16:00:09 VCS NOTICE V-16-1-10322
System S1 (Node '0') changed state from
STALE_DISCOVER_WAIT to STALE_ADMIN_WAIT
2003/05/20 16:01:27 VCS INFO V-16-1-50408
Received connection from client Cluster Manager -
Java Console (ID:400)
2003/05/20 16:01:31 VCS ERROR V-16-1-10069
All systems have configuration files marked STALE.
Unable to form cluster.
The most recent entries appear at the end of the log.
VCS Logs
In addition to the engine_A.log primary VCS log file, VCS logs information
for had, hashadow, and all agent programs in these locations:
had: /var/VRTSvcs/log/engine_A.log
hashadow: /var/VRTSvcs/log/hashadow_A.log
Agent logs: /var/VRTSvcs/log/AgentName_A.log
Messages in VCS logs have a unique message identifier (UMI) built from product,
category, and message ID numbers. Each entry includes a text code indicating
severity, from CRITICAL entries indicating that immediate attention is required,
to INFO entries with status information.
The log entries are categorized as follows:
CRITICAL: VCS internal message requiring immediate attention
Note: Contact Customer Support immediately.
ERROR: Messages indicating errors and exceptions
WARNING: Messages indicating warnings
UMIs map to Tech Note IDs.
UMI-Based Support
UMI support in all VERITAS 4.x products, including VCS, provides a mapping
between the message ID number and technical notes provided on the Support Web
site. This helps you quickly find solutions to the specific problem indicated by the
message ID.
Use the Support Web site to:
Download patches.
Track your cases.
Search for tech notes.
The VERITAS Architect Network (VAN) is another forum for technical information.
Start troubleshooting with: hastatus -sum
Troubleshooting Guide
VCS problems are typically one of three types:
Cluster communication
VCS engine startup
Service groups, resources, or agents
Procedure Overview
To start troubleshooting, determine which type of problem is occurring based on
the information displayed by hastatus -summary output.
Cluster communication problems are indicated by the message:
Cannot connect to server -- Retry Later
VCS engine startup problems are indicated by systems in the
STALE_ADMIN_WAIT or ADMIN_WAIT state.
Other problems are indicated when the VCS engine, LLT, and GAB are all
running on all systems, but service groups or resources are in an unexpected
state.
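For example, here is abbreviated hastatus -summary output from a healthy two-system cluster; the system and service group names are illustrative and the exact columns vary by VCS version:

# hastatus -summary
-- SYSTEM STATE
-- System        State          Frozen
A  S1            RUNNING        0
A  S2            RUNNING        0

-- GROUP STATE
-- Group     System    Probed    AutoDisabled    State
B  websg     S1        Y         N               ONLINE
B  websg     S2        Y         N               OFFLINE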
Checking GAB
Check the status of GAB using gabconfig:
gabconfig -a
If no port memberships are present, GAB is not seeded. This indicates a
problem with GAB or LLT.
Check LLT (next section). If all systems can communicate over LLT, check
/etc/gabtab and verify that the seed number is specified correctly.
If port h membership is not present, the VCS engine (had) is not running.
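As a check of the seed number mentioned above, in a two-system cluster you would typically expect the count in /etc/gabtab to match the number of systems (a sketch):

# cat /etc/gabtab
/sbin/gabconfig -c -n 2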
Checking LLT
Run the lltconfig command to determine whether LLT is running. If it is not
running:
Check the console and system log for messages indicating missing or
misconfigured LLT files.
Check the LLT configuration files, llttab, llthosts, and sysname to
verify that they contain valid and matching entries.
Use other LLT commands to check the status of LLT, such as lltstat and
lltconfig -a list.
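For example, a quick check sequence using the commands listed above:

# lltconfig              Reports whether LLT is running.
# lltstat -nvv | pg      Shows verbose link status for each node.
# lltconfig -a list      Lists the link addresses LLT is using.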
STALE_ADMIN_WAIT
If you try to start VCS on a system where the local disk configuration is stale and
there are no other running systems, the VCS engine transitions to the
STALE_ADMIN_WAIT state. This signals that administrator intervention is
required in order to get the VCS engine into the running state, because the
main.cf may not match the configuration that was in memory when the engine
stopped.
If the VCS engine is in the STALE_ADMIN_WAIT state:
1 Visually inspect the main.cf file to determine if it is up-to-date (reflects the
current configuration).
2 Edit the main.cf file, if necessary.
3 Verify the main.cf file syntax, if you modified the file:
hacf -verify config_dir
4 Start the VCS engine on the system with the valid main.cf file:
hasys -force system_name
The other systems perform a remote build from the system now running.
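For example, here is a sketch using the default configuration directory and system S1, after you have verified that main.cf on S1 is valid:

# hacf -verify /etc/VRTSvcs/conf/config
# hasys -force S1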
ADMIN_WAIT
The ADMIN_WAIT state results when a system is performing a remote build and
the last running system in the cluster fails before the configuration is delivered. It
can also occur if the VCS is performing a local build and the main.cf is missing
or invalid (syntax errors).
In either case, fix the problem as follows:
1 Locate a valid main.cf file from a main.cf.previous file on disk or a
backup on tape or other media.
2 Replace the invalid main.cf with the valid version on the local node.
3 Use the procedure specified for a stale configuration to force VCS to start.
Ensure that the resources are not running outside of VCS control.
Verify that there are no network partitions in the cluster.
To clear the AutoDisabled attribute, type:
hagrp -autoenable service_group -sys system_name
! This most commonly occurs if you are not using a sysname file and someone changes the UNIX host name.
Concurrency Violations
A concurrency violation occurs when a failover service group becomes fully or
partially online on more than one system. When this happens, VCS takes the
service group offline on the system that caused the concurrency violation and
invokes the violation event trigger on that system.
The Violation trigger is configured by default during installation. The violation
trigger script is placed in /opt/VRTSvcs/bin/triggers and no other
configuration is required.
The script notifies the administrator and takes the service group offline on the
system where the trigger was invoked.
The script can send a message to the system log and console on all cluster systems
and can be customized to send additional messages or e-mail messages.
Example: NFS service groups have this problem if an NFS client does not disconnect. The Share resource cannot come offline when a client is connected. You can configure ResNotOff to forcibly stop the share.
Resource Problems
Agent Problems
An agent process should be running on the system for each configured resource
type. If the agent process is stopped for any reason, VCS cannot carry out
operations on any resource of that type. Check the VCS engine and agent logs to
identify what caused the agent to stop or prevented it from starting. It could be an
incorrect path for the agent binary, the wrong agent name, or a corrupt agent
binary.
Use the haagent command to restart the agent. Ensure that you start the agent on
all systems in the cluster.
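For example, here is a sketch that restarts a stopped agent; the IP agent on system S1 is used purely for illustration:

# haagent -start IP -sys S1

Repeat the command for each system in the cluster on which the agent should be running.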
# hasnap -backup -f /tmp/vcs.tar -n -m Oracle_Cluster
V-8-1-15522 Initializing file "vcs.tar" for backup.
V-8-1-15526 Please wait...
Checking VCS package integrity
Collecting VCS information
..
Compressing /tmp/vcs.tar to /tmp/vcs.tar.gz
Done.
Option Purpose
-backup Copies the files to a local predefined directory
-restore Copies the files in the specified snapshot to a directory
-display Lists all snapshots and the details of a specified snapshot
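For example, here is a sketch of listing snapshots and restoring the one created in the backup example above; the file name and options simply mirror that example:

# hasnap -display
# hasnap -restore -f /tmp/vcs.tar -n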
Summary
This lesson described how to detect and solve problems in a VCS cluster. Common problem scenarios were described and solutions were provided, as well as a general-purpose troubleshooting methodology.
Next Steps
Now that you have learned how to configure, manage, and troubleshoot high
availability services in the VCS environment, you can learn how to manage more
complex cluster configurations, such as multinode clusters.
Additional Resources
Troubleshooting Job Aid
This quick reference is included with this participant guide.
VERITAS Cluster Server Users Guide
This guide provides detailed information on procedures and concepts for
configuring and managing VCS clusters.
VERITAS Cluster Server Bundled Agents Reference Guide
This guide describes each bundled agent in detail.
http://support.veritas.com
This Web site provides troubleshooting information about all VERITAS
products.
Optional lab: If your instructor indicates that your classroom has access to the VERITAS Support Web site, search http://support.veritas.com for technical notes to help you solve the problems created as part of this lab exercise.
Prerequisites
Wait for your instructor to indicate that your systems are ready for troubleshooting.
Results
The cluster is running with all service groups online.
Optional Lab
Some classrooms have access to the VERITAS Support Web site. If your instructor
indicates that your classroom network can access VERITAS Support, search http://support.veritas.com for technical notes to help you solve the problems created as part of this lab exercise.
A shutdown 5-12
start 5-17
abort sequence 2-14 application components
access control 6-8 stopping 5-20
access, controlling 6-6 application service
adding license 3-7 definition 1-7
admin account 2-16 testing 5-13
ADMIN_WAIT state atomic broadcast mechanism 12-4
definition 15-17 attribute
in ManageFaults attribute 11-8 display 4-7
in ResAdminWait trigger 11-25 local 9-13
recovering resource from 11-22 override 11-19
administration application 4-4 resource 1-12, 7-10
administrative IP address 5-8 resource type 11-13, 11-15
administrator, network 5-11 service group failover 11-7
agent service group validation 5-25
clean entry point 1-14, 11-5 verify 5-23
close entry point 7-28 autodisable
communication 12-4 definition 15-19
custom 15-32 in jeopardy 13-8
definition 1-14 service group 12-20
logs 15-5 AutoDisabled attribute 12-20, 15-19
monitor entry point 1-14 AutoFailover attribute 11-9
offline entry point 1-14 AutoStart attribute 4-9, 15-18
online entry point 1-14 AutoStartList attribute 15-18
troubleshooting 15-30
AIX
configure IP address 5-9
B
configure virtual IP address 5-16 backup configuration files 8-19
llttab 12-12 base IP address 5-8
lslpp command 3-14 best practice
SCSI ID 2-10 application management 4-4
startup files 12-18 application service testing 5-13
AllowNativeCliUsers attribute 6-7 cluster interconnect 2-7
application boot disk 2-8
clean 5-12 Bundled Agents Reference Guide 1-15
component definition 5-4
configure 5-12
IP address 5-15 C
management 4-4 cable, SCSI 2-9
managing 4-4
child resource
manual migration 5-21
configuration 7-9
preparation procedure 5-13
dependency 1-11
prepare 5-4
linking 7-31
service 5-4
clean entry point 11-5 protection 6-17
clear save 6-15
autodisable 15-19 cluster interconnect
resource fault 4-16 configuration files 3-12
CLI configure 3-8
online configuration 6-13 definition 1-16
resource configuration 7-16 VCS startup 6-23, 6-24
service group configuration 7-6 Cluster Manager
close installation 3-17
cluster configuration 6-16 online configuration 6-13
entry point 7-28 Windows 3-18
cluster Cluster Monitor 4-22
campus 14-26 cluster state
communication 12-4 GAB 6-27, 12-4
configuration 1-22 remote build 6-24, 6-27
configuration files 3-13 running 6-27
configure 3-5 Stale_Admin_Wait 6-25
create configuration 6-19 unknown 6-25
definition 1-5 Wait 6-26
design Intro-6, 2-6 ClusterService group
duplicate configuration 6-20 installation 3-8
duplicate service group configuration 6-21 main.cf file 3-13
ID 2-16, 12-11 notification 10-6
installation preparation 2-16 command-line interface 7-6
interconnect 1-16 communication
interconnect configuration 13-18
agent 12-4
maintenance 2-4
between cluster systems 12-5
managing applications 4-4
cluster problems 15-9
member systems 12-8
configure 13-18
membership 1-16, 12-7
fencing 14-21
membership seeding 12-17
within a system 12-4
membership status 12-7
component testing 5-19
name 2-16
Running state 6-24 concurrency violation
simulator 4-18 in failover service group 15-24
terminology 1-4 in frozen service group 4-13
troubleshooting 8-16 prevention 12-20
cluster communication configuration
configuration files 3-12 application 5-12
overview 1-16 application IP address 5-15
cluster configuration application service 5-6
backup files 8-19
build from file 6-28
build from file 6-28
close 6-16
cluster 3-5
in memory 6-22
cluster interconnect 3-8
in-memory 6-24
downtime 6-5
modification 8-14
fencing 3-16, 14-27
offline 6-18
files 1-23
open 6-14
GAB 12-16