
International Technical Support Organization

System/390 MVS Parallel Sysplex
Continuous Availability SE Guide

December 1995

SG24-4503-00


Take Note! Before using this information and the product it supports, be sure to read the general information under Special Notices on page xvii.

First Edition (December 1995)


This edition applies to Version 5 Release 2 of MVS/ESA System Product (5655-068 or 5655-069).

Order publications through your IBM representative or the IBM branch office serving your locality. Publications are not stocked at the address given below. An ITSO Technical Bulletin Evaluation Form for reader's feedback appears facing Chapter 1. If the form has been removed, comments may be addressed to:

IBM Corporation, International Technical Support Organization
Dept. HYJF, Mail Station P099
522 South Road
Poughkeepsie, New York 12601-5400

When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

Copyright International Business Machines Corporation 1995. All rights reserved.
Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.

Abstract
This document discusses how the parallel sysplex can help an installation get closer to a goal of continuous availability. It is intended for customer systems and operations personnel responsible for implementing parallel sysplex, and the IBM Systems Engineers who assist them. It will also be useful to technical managers who want to assess the benefits they can expect from parallel sysplex in this area. The book describes how to configure both the hardware and software in order to eliminate planned outages and minimize the impact of unplanned outages. It describes how you can make hardware and software changes to the sysplex without disrupting the running of the applications. It also discusses how to handle unplanned hardware or software failures, and to recover from error situations with minimal impact to the applications. A knowledge of parallel sysplex is assumed. (296 pages)


Contents

Abstract  iii
Special Notices  xvii
Preface  xix
How This Document Is Organized  xix
Related Publications  xx
International Technical Support Organization Publications  xxi
ITSO Redbooks on the World Wide Web (WWW)  xxii
Acknowledgments  xxiii

Part 1. Configuring for Continuous Availability  1

Chapter 1. Hardware Configuration  3
1.1 What Is Continuous Availability?  3
1.1.1 Parallel Sysplex and Continuous Availability  3
1.1.2 Why N + 1?  4
1.2 Processors  6
1.3 Coupling Facilities  7
1.3.1 Separate Machines  7
1.3.2 How Many?  7
1.3.3 CF Links  7
1.3.4 Coupling Facility Structures  7
1.3.5 Coupling Facility Volatility/Nonvolatility  8
1.4 Sysplex Timers  10
1.4.1 Duplicating  10
1.4.2 Distance  11
1.4.3 Setting the Time in MVS  11
1.4.4 Protection  11
1.5 I/O Configuration  11
1.5.1 ESCON Logical Paths  12
1.6 CTCs  13
1.6.1 3088 and ESCON CTC  13
1.6.2 Alternate CTC Configuration  14
1.6.3 Sharing CTC Paths  14
1.6.4 IOCP Coding  14
1.6.5 3088 Maintenance  14
1.7 XCF Signalling Paths  14
1.8 Data Placement  15
1.9 DASD Configuration  16
1.9.1 RAMAC and RAMAC 2 Array Subsystems  17
1.9.2 3990 Model 6  17
1.9.3 3990 Model 3  17
1.9.4 DASD Path Recommendations  17
1.9.5 3990 Model 6 ESCON Logical Path Report  18
1.10 ESCON Directors  18
1.10.1 ESCON Manager  19
1.10.2 ESCON Director Switch Matrix  19
1.11 Fiber  20
1.11.1 9729  20
1.12 Consoles  21
1.12.1 Hardware Management Console (HMC)  21
1.12.2 How Many HMCs?  21
1.12.3 Using HMC As an MVS Console  21
1.12.4 MVS Consoles  21
1.12.5 Master Console Considerations  22
1.12.6 Console Configuration Considerations  23
1.13 Tape  25
1.13.1 3490  25
1.14 Communications  26
1.14.1 VTAM CTCs  26
1.14.2 3745s  26
1.14.3 CF Structure  26
1.15 Environmental  26
1.15.1 Uninterruptible Power Supply (UPS)  26
1.15.2 9672/9674 Protection against Power Disturbances  27

Chapter 2. System Software Configuration  29
2.1 Introduction  29
2.2 N, N+1 in a Software Environment  29
2.3 Shared SYSRES  29
2.3.1 Shared SYSRES Design  30
2.3.2 Indirect Catalog Function  30
2.4 Master Catalog  32
2.5 Dynamic I/O Reconfiguration  33
2.5.1 Exceptions  34
2.6 I/O Definition File  35
2.7 Couple Data Sets  35
2.8 JES2 Checkpoint  38
2.8.1 JES2 Checkpoint Reconfiguration  39
2.9 RACF Database  40
2.10 PARMLIB Considerations  40
2.10.1 Developing Naming Conventions  40
2.10.2 MVS/ESA SP V5.2 Enhancements  41
2.10.3 MVS Consoles  43
2.11 System Logger  46
2.11.1 Logstream and Structure Allocation  46
2.11.2 DASD Log Data Sets  46
2.11.3 Duplexing Coupling Facility Log Data  47
2.11.4 DASD Staging Data Sets  49
2.12 System Managed Storage Considerations  50
2.12.1 SMSplex  50
2.12.2 DFSMShsm Considerations  52
2.12.3 Continuous Availability Considerations  52
2.12.4 RESERVE Activity  53
2.13 Shared Tape Support  54
2.13.1 Planning  54
2.13.2 Implementing Automatic Tape Switching  54
2.14 Exploiting Dynamic Functions  55
2.14.1 Dynamic Exits  55
2.14.2 Dynamic Subsystem Interface (SSI)  56
2.14.3 Dynamic Reconfiguration of XES  57
2.15 Automating Sysplex Failure Management  57
2.15.1 Planning for SFM  58
2.15.2 The SFM Isolate Function  59
2.15.3 SFM Parameters  63
2.15.4 SFM Activation  69
2.15.5 Stopping SFM  72
2.15.6 SFM Utilization  72
2.16 Planning the Time Detection Intervals  73
2.16.2 Synchronous WTO(R)  79
2.17 ARM: MVS Automatic Restart Manager  79
2.17.1 ARM Characteristics  80
2.17.2 ARM Processing Requirements  80
2.17.3 Program Changes  82
2.17.4 ARM and Subsystems  82
2.18 JES3  87
2.18.1 Planning  87
2.18.2 JES3 Sysplex Considerations  89
2.18.3 JES3 Parallel Sysplex Requirements  90
2.18.4 JES3 Configurations  91
2.18.5 Additional JES3 Planning Information  93

Chapter 3. Subsystem Software Configuration  95
3.1 CICS V4 Transaction Subsystem  95
3.1.1 CICS Topology  96
3.1.2 CICS Affinities  97
3.1.3 File-Owning Regions  97
3.1.4 Resource Definition Online (RDO)  97
3.1.5 CSD Considerations  97
3.1.6 Subsystem Storage Protection  98
3.1.7 Transaction Isolation  98
3.2 CICSPlex SM V1  98
3.2.1 CICSPlex SM Configuration  99
3.3 IMS Transaction Subsystem  100
3.3.1 IMS Topology  100
3.3.2 IMS RESLIB  101
3.3.3 IMSIDs  101
3.3.4 Terminal Definitions  101
3.3.5 Data Set Sharing  102
3.3.6 IRLM Definitions  102
3.3.7 Coupling Facility Structures  102
3.3.8 Dynamic Update of IMS Type 2 SVC  102
3.3.9 Cloning Inhibitors  103
3.4 DB2 Subsystem  103
3.4.1 DB2 Environment  103
3.4.2 DB2 Structures  104
3.4.3 Changing Structure Sizes  105
3.4.4 DB2 Data Availability  105
3.4.5 IEFSSNXX Considerations  105
3.4.6 DB2 Subsystem Parameters  105
3.5 VSAM RLS  106
3.5.1 Control Data Sets  107
3.5.2 Defining the Database  108
3.5.3 Defining the SMSVSAM Structures  108
3.5.4 CICS Use of System Logger  109
3.6 TSO in a Parallel Sysplex  109
3.7 System Automation Tools  110
3.7.1 NetView  110
3.7.2 AOC/MVS  110
3.7.3 OPC/ESA  111
3.8 VTAM  112
3.8.1 Configuration  112

Part 2. Making Planned Changes  113

Chapter 4. Systems Management in a Parallel Sysplex  115
4.1 The Importance of Systems Management in Parallel Sysplex  115
4.1.1 Change Management  115
4.1.2 Problem Management  115
4.1.3 Operations Management  116
4.1.4 The Other System Management Disciplines  116
4.1.5 Summary  116

Chapter 5. Coupling Facility Changes  117
5.1 Structure Attributes and Allocation  117
5.2 Structure and Connection Disposition  118
5.2.1 Structure Disposition  118
5.2.2 Connection State and Disposition  119
5.3 Structure Dependence on Dumps  120
5.4 To Move a Structure  120
5.4.1 The Structure Rebuild Process  121
5.5 Altering the Size of a Structure  123
5.6 Changing the Active CFRM Policy  125
5.7 Reformatting the CFRM Couple Data Set  126
5.8 Adding a Coupling Facility  127
5.8.1 To Define the Coupling Facility LPAR and Connections  127
5.8.2 To Prepare the New CFRM Policy  127
5.8.3 Setting Up the Structure Exploiters  128
5.9 Servicing the Coupling Facility  132
5.9.1 Concurrent Hardware Upgrades  132
5.9.2 Concurrent LIC Upgrades  133
5.10 Removing a Coupling Facility  133
5.11 Coupling Facility Shutdown Procedure  134
5.11.1 Coupling Facility Exploiter Considerations  138
5.11.2 Shutting Down the Only Coupling Facility  141
5.12 Putting a Coupling Facility Back Online  142

Chapter 6. Hardware Changes  143
6.1 Processors  143
6.1.1 Adding a Processor  143
6.1.2 Removing a Processor  143
6.1.3 Changing a Processor  144
6.2 Logical Partitions (LPARs)  144
6.2.1 Adding an LPAR  145
6.2.2 Removing an LPAR  145
6.2.3 Changing an LPAR  145
6.3 I/O Devices  145
6.4 ESCON Directors  146
6.5 Changing the Time  146
6.5.1 Using the Sysplex Timer  146
6.5.2 Time Changes and IMS  146
6.5.3 Time Changes and SMF  148
6.5.4 Changing Time in the 9672 HMC and SE  148

Chapter 7. Software Changes  149
7.1 Adding a New MVS Image  149
7.1.1 Adding a New JES3 Main  150
7.2 Adding a New SYSRES  151
7.2.1 Example JCL  151
7.3 Implementing System Software Changes  154
7.4 Adding Subsystems  155
7.4.1 CICS  156
7.4.2 IMS Subsystem  157
7.4.3 DB2  158
7.4.4 TSO  159
7.5 Starting the Subsystems  159
7.5.1 CICS  159
7.5.2 DB2  160
7.5.3 IMS  160
7.6 Changing Subsystems  160
7.7 Moving the Workload  161
7.7.1 CICS  161
7.7.2 IMS  163
7.7.3 DB2  163
7.7.4 TSO  164
7.7.5 Batch  164
7.7.6 DFSMS  165
7.8 Closing Down the Subsystems  165
7.8.1 CICS  166
7.8.2 IMS  166
7.8.3 DB2  167
7.8.4 System Automation Shutdown  169
7.9 Removing an MVS Image  169

Chapter 8. Database Availability  171
8.1 VSAM  171
8.1.1 Batch  171
8.1.2 Backup  172
8.1.3 Reorg  172
8.2 IMS/DB  172
8.2.1 Batch  173
8.2.2 Backup  173
8.2.3 Reorg  174
8.3 DB2  174
8.3.1 Batch  174
8.3.2 Backup  175
8.3.3 Reorg  175

Part 3. Handling Unplanned Outages  177

Chapter 9. Parallel Sysplex Recovery  179
9.1 System Recovery  179
9.1.1 Sysplex Failure Management (SFM)  179
9.1.2 Automatic Restart Management (ARM)  179
9.1.3 What Needs to Be Done?  179
9.2 Coupling Facility Failure Recovery  180
9.3 Assessment of the Failure Condition  185
9.3.1 To Recognize a Structure Failure  185
9.3.2 To Recognize a Connectivity Failure  186
9.3.3 To Recognize When a Coupling Facility Becomes Volatile  186
9.3.4 Recovery from a Connectivity Failure  187
9.3.5 Recovery from a Structure Failure  188
9.4 DB2 V4 Recovery from a Coupling Facility Failure  189
9.4.1 DB2 V4 Built-In Recovery from Connectivity Failure  189
9.4.2 DB2 V4 Built-In Recovery from a Structure Failure  190
9.4.3 Coupling Facility Becoming Volatile  190
9.4.4 Manual Structure Rebuild  190
9.4.5 To Manually Deallocate and Reallocate a Group Buffer Pool  190
9.4.6 To Manually Deallocate a DB2 Lock Structure  191
9.4.7 To Manually Deallocate a DB2 SCA Structure  192
9.5 XCF Recovery from a Coupling Facility Failure  192
9.5.1 XCF Built-In Recovery from Connectivity or Structure Failure  192
9.5.2 Coupling Facility Becoming Volatile  193
9.5.3 Manual Invocation of Structure Rebuild  193
9.5.4 Manual Deallocation of the XCF Signalling Structures  193
9.5.5 Partitioning the Sysplex  193
9.6 RACF Recovery from a Coupling Facility Failure  194
9.6.1 RACF Built-In Recovery from Connectivity or Structure Failure  194
9.6.2 Coupling Facility Becoming Volatile  195
9.6.3 Manual Invocation of Structure Rebuild  195
9.6.4 Manual Deallocation of RACF Structures  196
9.7 VTAM Recovery from a Coupling Facility Failure  196
9.7.1 VTAM Built-In Recovery from Connectivity Failure  196
9.7.2 VTAM Built-In Recovery from a Structure Failure  196
9.7.3 The Coupling Facility Becomes Volatile  196
9.7.4 Manual Invocation of Structure Rebuild  196
9.7.5 Manual Deallocation of the VTAM GRN Structure  197
9.8 IMS/DB Recovery from a Coupling Facility Failure  197
9.8.1 IMS/DB Built-In Recovery from a Connectivity Failure  197
9.8.2 IMS/DB Built-In Recovery from a Structure Failure  198
9.8.3 Coupling Facility Becoming Volatile  198
9.8.4 Manual Invocation of Structure Rebuild  198
9.8.5 Manual Deallocation of an IRLM Lock Structure  199
9.8.6 Manual Deallocation of a OSAM/VSAM Cache Structure  199
9.9 JES2 Recovery from a Coupling Facility Failure  199
9.9.1 Connectivity Failure to a Checkpoint Structure  199
9.9.2 Structure Failure in a Checkpoint Structure  202
9.9.3 The Coupling Facility becomes Volatile  203
9.9.4 To Manually Move a JES2 Checkpoint  203
9.10 System Logger Recovery from a Coupling Facility Failure  203
9.10.1 System Logger Built-In Recovery from a Connectivity Failure  203
9.10.2 System Logger Built-In Recovery from a Structure Failure  203
9.10.3 Coupling Facility Becoming Volatile  203
9.10.4 Manual Invocation of Structure Rebuild  204
9.10.5 Manual Deallocation of Logstreams Structure  204
9.11 Automatic Tape Switching Recovery from a Coupling Facility Failure  204
9.11.1 Automatic Tape Switching Recovery from a Connectivity Failure  204
9.11.2 Automatic Tape Switching Built-In Recovery from a Structure Failure  204
9.11.3 Coupling Facility Becoming Volatile  204
9.11.4 Manual Invocation of Structure Rebuild  204
9.11.5 Consequences of Failing to Rebuild the IEFAUTOS Structure  205
9.11.6 Manual Deallocation of IEFAUTOS Structure  205
9.12 VSAM RLS Recovery from a Coupling Facility Failure  205
9.12.1 SMSVSAM Built-In Recovery from a Connectivity Failure  205
9.12.2 SMSVSAM Built-In Recovery from a Structure Failure  205
9.12.3 Coupling Facility Becoming Volatile  206
9.12.4 Manual Invocation of Structure Rebuild  206
9.12.5 Manual Deallocation of SMSVSAM Structures  206
9.13 Couple Data Set Failure  206
9.13.1 Sysplex (XCF) Couple Data Set Failure  206
9.13.2 Coupling Facility Resource Manager (CFRM) Couple Data Set Failure  207
9.13.3 Sysplex Failure Management (SFM) Couple Data Set Failure  207
9.13.4 Workload Manager (WLM) Couple Data Set Failure  207
9.13.5 Automatic Restart Manager (ARM) Couple Data Set Failure  207
9.13.6 System Logger (LOGR) Couple Data Set Failure  208
9.14 Sysplex Timer Failures  209
9.15 Restarting IMS  210
9.15.1 IMS/IRLM Failures Within a System  210
9.15.2 CEC or MVS Failure  210
9.15.3 Automating Recovery  211
9.16 Restarting DB2  211
9.17 Restarting CICS  211
9.17.1 CICS TOR Failure  211
9.17.2 CICS AOR Failure  212
9.18 Recovering Logs  212
9.18.1 Recovering an Application Failure  212
9.18.2 Recovering an MVS Failure  213
9.18.3 Recovering from a Sysplex Failure  213
9.18.4 Recovering from System Logger Address Space Failure  213
9.18.5 Recovering OPERLOG Failure  213
9.19 Restarting an OPC/ESA Controller  213
9.20 Recovering Batch Jobs under OPC/ESA Control  214
9.20.1 Status of Jobs on Failing CPU  214
9.20.2 Recovery of Jobs on a Failing CPU  214

Chapter 10. Disaster Recovery Considerations  215
10.1 Disasters and Distance  215
10.2 Disaster Recovery Sites  215
10.2.1 3990 Remote Copy  215
10.2.2 IMS Remote Site Recovery  216
10.2.3 CICS Recovery with CICSPlex SM  217
10.2.4 DB2 Disaster Recovery  218

Appendix A. Sample Parallel Sysplex MVS Image Members  221
A.1 Example Parallel Sysplex Configuration  221
A.2 IPLPARM Members  222
A.2.1 LOADAA  222
A.3 PARMLIB Members  222
A.3.1 IEASYMAA  223
A.3.2 IEASYS00 and IEASYSAA  224
A.3.3 COUPLE00  226
A.3.4 JES2 Startup Procedure in SYS1.PROCLIB  227
A.3.5 J2G  228
A.3.6 J2L42  232
A.4 VTAMLST Members  232
A.4.1 ATCSTR42  233
A.4.2 ATCCON42  234
A.4.3 APCIC42  235
A.4.4 APNJE42  235
A.4.5 CDRM42  236
A.4.6 MPC03  236
A.4.7 TRL03  236
A.4.8 APAPPCAA  237
A.5 Allocating Data Sets  238
A.5.1 ALLOC JCL  238

Appendix B. Structures, How to ...  241
B.1 To Gather Information on a Coupling Facility  241
B.2 To Gather Information on Structure and Connections  243
B.3 To Deallocate a Structure with a Disposition of DELETE  245
B.4 To Deallocate a Structure with a Disposition of KEEP  245
B.5 To Suppress a Connection in Active State  245
B.6 To Suppress a Connection in Failed-persistent State  246
B.7 To Monitor a Structure Rebuild  246
B.8 To Stop a Structure Rebuild  248
B.9 To Recover from a Hang in Structure Rebuild  248

Appendix C. Examples of CFRM Policy Transitioning  249
C.1 Changing the Structure Definition  249
C.2 Changing the Coupling Facility Definition  255

Appendix D. Examples of Sysplex Partitioning  259
D.1 Partitioning on Operator Request  259
D.2 System in Missing Status Update Condition  260

Appendix E. Spin Loop Recovery  263

Appendix F. Dynamic I/O Reconfiguration Procedures  267
F.1 Procedure to Make the System Dynamic I/O Capable  267
F.2 Procedure for Dynamic Changes  270
F.3 Hardware System Area Considerations  271
F.4 Hardware System Area Expansion Factors  272

Glossary  275

List of Abbreviations  285

Index  289

Figures

1. Sample Parallel Sysplex Continuous Availability Configuration  5
2. ESCON Logical Paths Configuration  13
3. CTC Configuration  15
4. Recommended XCF Signalling Path Configuration  16
5. Recommended DASD Path Configuration  19
6. ISCKDSF R16 ESCON Logical Path Report  20
7. Console Environment  23
8. Recommended Console Configuration  25
9. 9910 Local UPS and 9672 Rx2 and Rx3  28
10. Indirect Catalog Function with SYSRESA  31
11. Indirect Catalog Function with SYSRESB  32
12. Alternate Consoles  44
13. Example of Failure Dependent Connection  48
14. Example of Failure Dependent/Independence Connections  49
15. Basic Relationship between Sysplex Name and System Group  51
16. SMSplex Consisting of System Group and Individual System Name  51
17. Isolating a Failing MVS  59
18. INTERVAL and ISOLATETIME Relationship  61
19. SFM Policy with the ISOLATETIME Parameter  62
20. SFM LPARs Actions Timings  67
21. Sample JCL to Delete a SFM Policy  72
22. Figure to Show Timing Relationships  74
23. JES3 *I S Display Showing Non-Existent Systems  88
24. JES3-Managed and Auto-Switchable Tape  90
25. NJE Node Definitions Portion of JES3 Init Stream  91
26. Sample JES3 Proc for Use by Multiple Globals  92
27. Cloned CICSplex  96
28. CICSPlex SM  99
29. Sample IMS 5.1 Configuration  100
30. Sample DB2 Data Sharing Configuration  104
31. Sample VSAM RLS Data Sharing Configuration  107
32. START Command When Adding a New JES3 Global  151
33. Volume Initialization  152
34. Copy SYSRESA  152
35. SMP/E ZONEEDIT  153
36. Add IPL Text  153
37. Example parallel sysplex Environment  154
38. Introducing a New Software Level into the parallel sysplex  155
39. Redistributing Workload on TORs  162
40. Redistributing Workload on AORs  163
41. DB2 Data Sharing Availability  168
42. Sample Checkpoint Definition  200
43. 3990-6 Peer-to-Peer Remote Copy Configuration  217
44. 3990-6 Extended Remote Copy Configuration  218
45. IMS Remote Site Recovery Configuration  219
46. DB2 Data Sharing Disaster Recovery Configuration  220
47. Example Parallel Sysplex Configuration  221
48. LOADAA Member  222
49. IEASYMAA  223
50. IEASYS00  224
51. IEASYSAA  225
52. COUPLE00  226
53. JES2 Member in SYS1.PROCLIB  227
54. J2G  228
55. J2L42  232
56. ATCSTR42  233
57. ATCCON42  234
58. APCIC42  235
59. APNJE42  235
60. CDRM42  236
61. MPC03  236
62. TRL03  236
63. APAPPCAA  237
64. Allocating System Specific Data Sets  238
65. Coupling Facility Display  241
66. Structures and Connections Display  243
67. Monitoring Structure Rebuild through Exploiter's Messages  246
68. Monitoring Structure Rebuild by Displaying Structure Status  247
69. CFRM Policy Sample  250
70. JCL to Install a New CFRM Policy  252
71. Original CFRM Policy  256
72. New CFRM Policy  256
73. VARY OFF a System without SFM Policy Active  259
74. VARY OFF a System with an SFM Policy Active  260
75. System in Missing Status Update Condition and No Active SFM Policy  260
76. System in Missing Status Update with an Active SFM Policy and CONNFAIL(YES)  261
77. Resolution of a Spin Loop Condition  264
78. HCD Panel  268
79. CONFIG Frame Fragment  268
80. HCD Panel  269
81. Dynamic I/O Customization  270

Tables

1. Couple Data Set Placement Recommendations  37
2. JES2 Checkpoint Placement Recommendations  39
3. References Containing Information on the Use of System Symbols  42
4. Summary of SFM Keywords and Parameters  63
5. IMS Data Sets in Sysplex  102
6. Automation Recommendations  116
7. Support of REBUILD by IBM Exploiters  123
8. Support of ALTER by IBM Exploiters  124
9. DB2 Changes  158
10. Subsystem Recovery Summary Part 1  182
11. Subsystem Recovery Summary Part 2  184
12. Summary of Couple Data Sets  209

Special Notices
This publication is intended to help customers' systems and operations personnel and IBM systems engineers to plan, implement and use a parallel sysplex in order to get closer to a goal of continuous availability. It is not intended to be a guide to implementing or using parallel sysplex as such; it only covers topics related to continuous availability.

The information in this publication is not intended as the specification of any programming interfaces that are provided by MVS Version 5 or any other product mentioned in this redbook. See the PUBLICATIONS section of the IBM Programming Announcement for MVS Version 5, or other products, for more information about what publications are considered to be product documentation.

References in this publication to IBM products, programs or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functionally equivalent program that does not infringe any of IBM's intellectual property rights may be used instead of the IBM product, program or service.

Information in this book was developed in conjunction with use of the equipment specified, and is limited in application to those specific hardware and software products and levels.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the IBM Director of Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood, NY 10594 USA.

The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The information about non-IBM (VENDOR) products in this manual has been supplied by the vendor and IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

Reference to PTF numbers that have not been released through the normal distribution process does not imply general availability. The purpose of including these reference numbers is to alert IBM customers to specific information relative to the implementation of the PTF when it becomes available to each customer according to the normal IBM PTF distribution process.

The following terms are trademarks of the International Business Machines Corporation in the United States and/or other countries:
ACF/VTAM, Advanced Peer-to-Peer Networking, AIX, APPN, CICS, CICS/ESA, CICS/MVS, CUA, DATABASE 2, DB2, DFSMS, DFSMS/MVS, DFSMSdfp, DFSMSdss, DFSMShsm, DFSORT, Enterprise Systems Connection Architecture, ES/3090, ES/9000, ESA/370, ESA/390, ESCON, ESCON XDF, GDDM, Hardware Configuration Definition, IBM, IMS, IMS/ESA, IPDS, LPDA, Magstar, MVS/DFP, MVS/ESA, MVS/SP, MVS/XA, NetView, PR/SM, Processor Resource/Systems Manager, PS/2, RACF, RAMAC, RETAIN, RMF, S/370, S/390, SAA, SQL/DS, Sysplex Timer, System/360, System/370, System/390, Systems Application Architecture, SystemView, Virtual Machine/Enterprise Systems Architecture, Virtual Machine/Extended Architecture, VM/ESA, VM/XA, VSE/ESA, VTAM

The following terms are trademarks of other companies: C-bus is a trademark of Corollary, Inc. PC Direct is a trademark of Ziff Communications Company and is used by IBM Corporation under license. UNIX is a registered trademark in the United States and other countries licensed exclusively through X/Open Company Limited. Windows is a trademark of Microsoft Corporation.

Other trademarks are trademarks of their respective companies.


Preface
This document discusses how the parallel sysplex can help an installation get closer to a goal of Continuous Availability. This document is intended for customer systems and operations personnel responsible for implementing parallel sysplex, and the IBM Systems Engineers who assist them. It will also be useful to technical managers who want to assess the benefits they can expect from parallel sysplex in this area.

How This Document Is Organized


The document is in 3 parts:

Part 1, Configuring for Continuous Availability

This part describes how to configure both the hardware and software in order to eliminate planned outages and minimize the impact of unplanned outages.

Chapter 1, Hardware Configuration
This chapter discusses how to design a hardware configuration for continuous availability.

Chapter 2, System Software Configuration
This chapter describes how to configure the system to support continuous availability and minimize the effort needed to maintain and run it.

Chapter 3, Subsystem Software Configuration
This chapter deals with configuring the various subsystems to provide an environment that will support the goal of continuous availability.

Part 2, Making Planned Changes

This part describes how you can make changes to the sysplex without disrupting the running of the applications.

Chapter 4, Systems Management in a Parallel Sysplex
This chapter discusses the importance of maintaining good systems management disciplines in a parallel sysplex environment.

Chapter 5, Coupling Facility Changes
This chapter deals with changes that can be made to the coupling environment, for installation, planned or unplanned maintenance.

Chapter 6, Hardware Changes
This chapter discusses how to add, change or remove hardware elements of the sysplex in a non-disruptive way.

Chapter 7, Software Changes
This chapter discusses how to make changes such as adding, modifying or removing system images and subsystems.

Chapter 8, Database Availability
This chapter discusses subsystem (CICS, IMS, DB2) configuration options to minimise the impact of making database changes.

Part 3, Handling Unplanned Outages

This part describes how to handle unplanned outages and recover from error situations with minimal impact to the applications.

Chapter 9, Parallel Sysplex Recovery
This chapter discusses how to recover from unplanned hardware and software failures.

Chapter 10, Disaster Recovery Considerations
This chapter contains a discussion of disaster recovery considerations specific to the parallel sysplex environment.

Related Publications
The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this document. The publications listed are sorted in alphabetical order.

CICS/ESA Release Guide, GC33-0655
CICS VSAM Recovery Guide, SH19-6709
CICS/ESA Dynamic Transaction Routing in a CICSPlex, SC33-1012
CICS/ESA Version 4 Intercommunication Guide, SC33-1181
CICS/ESA Version 4 Recovery and Restart Guide, SC33-1182
CICS/ESA Version 4 CICS-IMS Database Control Guide, SC33-1184
Concurrent Copy Overview, GG24-3936
DB2 Version 4 Data Sharing: Planning and Administration, SC26-3269
DB2 Version 4 Release Guide, SC26-3394
DCAF V1.2.1 Installation and Using Guide, SH19-6838
DFSMS/MVS V1 R3 DFSMSdfp Storage Administration Reference, SC26-4920
ES/9000 and ES/3090 PR/SM Planning Guide, GA22-7123
ES/9000 9021 711-based Models Functional Characteristics, GA22-7144
ES/9000 9121 511-based Models Functional Characteristics, GA24-4358
Hardware Management Console Application Programming Interfaces, SC28-8141
Hardware Management Console Guide, GC38-0453
IBM CICS Transaction Affinities Utility User's Guide, SC33-1159
IBM CICSPlex Systems Manager for MVS/ESA Concepts and Planning, GC33-0786
IBM Token-Ring Network Introduction and Planning Guide, GA27-3677
IBM 3990 Storage Control Reference for Model 6, GA32-0274
IBM 9037 Sysplex Timer and System/390 Time Management, GG66-3264
Implementing Concurrent Copy, GG24-3990
IMS/ESA Version 5 Administration Guide: Data Base, SC26-8012
IMS/ESA Version 5 Administration Guide: System, SC26-8013
IMS/ESA Version 5 Administration Guide: Transaction Manager, SC26-8014
IMS/ESA V5 Operations Guide, SC26-8029
IMS/ESA Version 5 Sample Operating Procedures, SC26-8032
JES2 Multi-Access Spool in a Sysplex Environment, GG66-3263
Large System Performance Reference Document, SC28-1187
LPAR Dynamic Storage Reconfiguration, GG66-3262
MVS/ESA Hardware Configuration Definition: Planning, GC28-1445
MVS/ESA RMF User's Guide, GC33-6483
MVS/ESA RMF V5 Getting Started on Performance Management, LY33-9176
MVS/ESA SML: Implementing System-Managed Storage, SC26-3123
MVS/ESA SP V5 Hardware Configuration Definition: User's Guide, SC33-6468
MVS/ESA SP V5 Assembler Services Guide, GC28-1466
MVS/ESA SP V5 Authorized Assembler Services Guide, GC28-1467
MVS/ESA SP V5 Authorized Assembler Services Reference, Volume 2, GC28-1476
MVS/ESA SP V5 Conversion Notebook, GC28-1436
MVS/ESA SP V5 Initialization and Tuning Guide, SC28-1451
MVS/ESA SP V5 Initialization and Tuning Reference, SC28-1452
MVS/ESA SP V5 Installation Exits, SC28-1459
MVS/ESA SP V5 JCL Reference, GC28-1479
MVS/ESA SP V5 JES2 Initialization and Tuning Reference, SC28-1454
MVS/ESA SP V5 JES2 Commands, GC28-1443
MVS/ESA SP V5 JES3 Commands, GC28-1444
MVS/ESA SP V5 Planning: Global Resource Serialization, GC28-1450
MVS/ESA SP V5 Planning: Security, GC28-1439
MVS/ESA SP V5 Planning: Operations, GC28-1441
MVS/ESA SP V5 Planning: Workload Management, GC28-1493
MVS/ESA SP V5 Programming: Assembler Services References, GC28-1474
MVS/ESA SP V5 Programming: Sysplex Services Guide, GC28-1495
MVS/ESA SP V5 Programming: Sysplex Services Reference, GC28-1496
MVS/ESA SP V5 Setting Up a Sysplex, GC28-1449
MVS/ESA SP V5 System Commands, GC28-1442
MVS/ESA SP V5 Sysplex Migration Guide, SG24-4581
MVS/ESA SP V5 System Management Facilities (SMF), GC28-1457
S/390 MVS Sysplex Application Migration, GC28-1211
S/390 MVS Sysplex Hardware and Software Migration, GC28-1210
S/390 MVS Sysplex Overview: An Introduction to Data Sharing and Parallelism, GC28-1208
S/390 MVS Sysplex Systems Management, GC28-1209
S/390 9672/9674 Managing Your Processors, GC38-0452
S/390 9672/9674 System Overview, GA22-7148
SMP/E R8 Reference, SC28-1107
Sysplex Timer Planning, GA23-0365
TSO/E V2 User's Guide, SC28-1880
TSO/E V2 CLISTs, SC28-1876
TSO/E V2 Customization, SC28-1872
VTAM for MVS/ESA Version 4 Release 3 Migration Guide, GC31-6547

International Technical Support Organization Publications


Automating CICS/ESA Operations with CICSPlex SM and NetView, GG24-4424
Batch Performance, SG24-2557
CICS Workload Management Using CICSPlex SM And the MVS/ESA Workload Manager, GG24-4286
CICS/ESA and IMS/ESA: DBCTL Migration For CICS Users, GG24-3484
DFSMS/MVS Version 1 Release 3.0 Presentation Guide, GG24-4391
DFSORT Release 13 Benchmark Guide, GG24-4476
Disaster Recovery Library: Planning Guide, GG24-4210
MVS/ESA Software Management Cookbook, GG24-3481
MVS/ESA SP-JES2 Version 5 Implementation Guide, SG24-4583
MVS/ESA SP-JES3 Version 5 Implementation Guide, SG24-4582
MVS/ESA Version 5 Sysplex Migration Guide, SG24-4581
MVS/ESA Sysplex Migration Guide, GG24-3925
Planning for CICS Continuous Availability in an MVS/ESA Environment, SG24-4593
RACF Version 2 Release 1 Installation and Implementation Guide, GG2
RACF Version 2 Release 2 Technical Presentation Guide, GG24-2539
Sysplex Automation and Consoles, GG24-3854
S/390 Microprocessor Models R2 and R3 Overview, SG24-4575
S/390 MVS Parallel Sysplex Continuous Availability Presentation Guide, SG24-4502
S/390 MVS Parallel Sysplex Performance, GG24-4356
S/390 MVS/ESA Version 5 WLM Performance Studies, SG24-4352
Storage Performance Tools and Techniques for MVS/ESA, GG24-4045

A complete list of International Technical Support Organization publications, known as redbooks, with a brief description of each, may be found in:

International Technical Support Organization Bibliography of Redbooks, GG24-3070.


To get a catalog of ITSO redbooks, VNET users may type:

TOOLS SENDTO WTSCPOK TOOLS REDBOOKS GET REDBOOKS CATALOG


A listing of all redbooks, sorted by category, may also be found on MKTTOOLS as ITSOCAT TXT. This package is updated monthly.

How to Order ITSO Redbooks

IBM employees in the USA may order ITSO books and CD-ROMs using PUBORDER. Customers in the USA may order by calling 1-800-879-2755 or by faxing 1-800-445-9269. Most major credit cards are accepted. Outside the USA, customers should contact their local IBM office. For guidance on ordering, send a PROFS note to BOOKSHOP at DKIBMVM1 or E-mail to bookshop@dk.ibm.com.

Customers may order hardcopy ITSO books individually or in customized sets, called BOFs, which relate to specific functions of interest. IBM employees and customers may also order ITSO books in online format on CD-ROM collections, which contain redbooks on a variety of products.

ITSO Redbooks on the World Wide Web (WWW)


Internet users may find information about redbooks on the ITSO World Wide Web home page. To access the ITSO Web pages, point your Web browser to the following URL:

http://www.redbooks.ibm.com/redbooks
IBM employees may access LIST3820s of redbooks as well. The internal Redbooks home page may be found at the following URL:

http://w3.itsc.pok.ibm.com/redbooks/redbooks.html


Acknowledgments
This publication is the result of a residency conducted at the International Technical Support Organization, Poughkeepsie Center. The advisor for this project was:

G. Tom Russell International Technical Support Organization, Poughkeepsie

The authors of this document are:

Paola Bari IBM Italy

Margaret Beal IBM Australia

Horace Dyke IBM Canada

Patrick Kappeler IBM France

Paul O'Neill IBM Nordic

Ian Waite IBM UK


Part 1. Configuring for Continuous Availability


This part describes how to configure both the hardware and software in order to:

- Eliminate planned outages
- Minimize the impact of unplanned outages


Chapter 1. Hardware Configuration


This chapter discusses how to design a hardware configuration for continuous availability. This means eliminating all single points of failure, and making it possible to change hardware and software without disrupting the running applications.

1.1 What Is Continuous Availability?


When we speak about continuous availability we are really dealing with two different but interrelated topics: high availability and continuous operations.

High availability has to do with keeping the applications running without any breakdown during the planned opening hours. The way we achieve this is by a combination of high reliability for the individual components of the system and of redundancy of components, so that even if a component fails there is another one there that can replace it.

Continuous operations, on the other hand, is about keeping the applications and systems running without any planned stops. This in itself would not be too big a problem if it were not for the opposing but equally urgent need for responsiveness to changing business requirements, so the simplistic solution of freezing all changes just will not do. What the end users increasingly require is that the applications are kept running without any planned or unplanned stops, and this is what we mean by continuous availability.

Up to now the only real solution to these requirements has been redundancy at the system level. This is a costly solution, but organizations such as airlines that have these requirements often have two complete systems, where one runs the production and the other is a hot standby, and they can switch the production from one system to the other quickly. Then if they have an unplanned breakdown on the production system, the standby one takes over with a minimum delay. Having a second system also allows them to make planned changes to the standby system, and then switch the production over to it when they are ready to bring the change into operation.

1.1.1 Parallel Sysplex and Continuous Availability


The parallel sysplex was designed to:

- Provide a single system image to the end user of the application
- Support multiple copies of the applications, and provide services for dynamic balancing of the workload over the multiple copies
- Provide locking facilities to allow data to be shared among the multiple copies of the applications with integrity
- Provide services to facilitate communication between the multiple copies

From the perspective of continuous availability, the two most important functions provided by a parallel sysplex are:

Data Sharing
Which allows multiple instances of an application running on multiple systems to work on the same databases simultaneously.

Workload Balancing
Which means that the workload can be distributed evenly across these multiple application instances. This is made possible by the fact that they can share data.

These radically new possibilities provided by parallel sysplex change the way we approach continuous availability. Today, a specific system provides the infrastructure for a major customer application. The loss or degradation of that system can severely impact the customer's business. In the parallel sysplex environment, where multiple cooperating systems provide the infrastructure, the loss or degradation of one of the many identical systems has little impact. This means that we can now design a system that is fault-tolerant from both a hardware and software perspective, giving us the possibility of the following:

Very High Availability
With redundancy in both hardware and software we can eliminate points of failure, and workload balancing can ensure that the work being done on a lost component will be distributed across the remaining ones.

Nondisruptive Change
Hardware changes can be made by removing the system that needs to be changed from the sysplex while the applications continue to run on the remaining systems, making the change, and then returning the system to the sysplex. Software changes can be achieved in a similar way, provided that the changed version of the software in question can co-exist with the current ones in the sysplex. This coexistence (at level N and N+1) is a design objective of the IBM systems and subsystems that support parallel sysplex.

This shift in philosophy changes the way we think about designing the configuration in a parallel sysplex. In order to take advantage of (or exploit) the parallel sysplex, there must be more than one of each hardware component, and the software must be designed for cloning. If the application requires N images in order to provide the processing capacity, then the system designer should provide N+1 images in the sysplex.

1.1.2 Why N + 1 ?
When designing systems for high availability we must always consider the possibility that a component can fail. If we build the system with redundant components such that, even if any component does fail, the system will continue to function, then we have a fault-tolerant system. We can also say that we have no single point of failure. Obviously this component redundancy has a cost. The simplest, but most expensive, solution is to duplicate everything. This is often not an economically viable alternative. Fortunately there are others.


If we assume that the individual components of the system are inherently reliable, that is, that the probability of failure is very low for each component, then the probability of more than one failing at any one time is extremely low, and can be ignored. So, if we need a number of components (N) to do a particular job, all we need to do is allocate one extra to allow for the possibility of failure, and these N+1 components give us the redundancy we need. The larger the number of components (N) sharing the work, the less the relative cost of this redundancy.

In other words, if we are flying in a two-engined plane and want to be safe in the case of an engine failure, then one engine must be able to fly the plane. This means one of the two engines (50%) is redundant. If it is a four-engined plane then we want to be able to continue with three engines, so the fourth one (25%) is redundant.

In the same way we have been building hardware redundancy into computer systems for some time: the number of channels to I/O units, power supplies in the processor, and so on. Now with parallel sysplex we can take this concept one step further, and introduce N+1 redundancy in the number of machines or system images in the system. This allows us to configure for the failure of entire machines or system images and still keep the system on the air.

Figure 1. Sample Parallel Sysplex Continuous Availability Configuration. The coupling facilities, sysplex timers and all the links are duplicated to eliminate single points of failure.


1.2 Processors
The first prerequisite is that we have multiple processors following the N+1 philosophy outlined above.

1.2.1.1 CMOS-Only Sysplexes


If we are designing a configuration from scratch, using CMOS processors, then this is just a matter of deciding what the optimal processor size is and then configuring N+1 identical machines, where N of these are sufficient to run the workload. In theory, the larger N is (that is, the smaller the individual machines), the lower the cost of the redundant N+1 machine. In practice there are counterbalancing reasons, such as the following:

- The performance overhead on the sysplex (between 0.5% and 1% for each extra machine).
- The extra human effort in managing more machines (which will depend on how well the systems management procedures and tools can handle multiple machines).
- The extra work involved in maintaining more system images (which will depend on how well the clones are replicated and on how well the naming and other installation standards support this).
- How useful small machines are in handling the workload. If there are components in the workload that require larger machines to perform satisfactorily then this will tend to reduce the number of ways we can split the sysplex.

1.2.1.2 Mixed Sysplexes


Very often a sysplex will be a mixture of large bipolar and smaller CMOS machines. This is for many installations a natural evolution from their current bipolar configurations and allows these machines to continue their useful life into the parallel sysplex world. It may also be necessary to keep these larger machines because parts of the workload need either the larger system image or the more powerful engines that these provide. In many cases it is not realistic to adopt a simplistic N+1 approach to these configurations with large machines due to the high cost of having a redundant large processor. In any event we are often dealing here with a transition state, where not all of the work can be partitioned on a sysplex. What we need to consider from an availability viewpoint is the effect of the failure of each machine in the configuration, and particularly the larger ones. We must ensure that there is reserve capacity available to take over the essential work from that machine. This may involve removing or reducing the priority of some other nonessential work.


1.3 Coupling Facilities


The recommended configuration of coupling facilities for availability is to have at least two of them, and as separate 9674s, not partitions in processors doing other work.

1.3.1 Separate Machines


The reason for having them as separate machines is that if a coupling facility fails then the structures it contains will have to be rebuilt in another coupling facility, and this rebuild will be done using data from the coupled MVS-systems. If you run the coupling facility in a partition in a machine which is also running one of the systems in the sysplex, then a hardware failure on this machine will not only bring down the coupling facility but also one of the sources needed to rebuild it. The only way to recover from this situation is to restart the whole sysplex.

1.3.2 How Many?


In deciding how many coupling facilities you need, the same N+1 considerations apply as we have seen for processors. If one fails, we need to have sufficient processor capacity and memory available in the remaining ones to rebuild the structures and handle the load. The simplest design is where we have two coupling facilities, each of which has enough processor power and memory to handle the entire sysplex. In normal production we can then distribute the structures over these, and for each structure specify the other CF as the alternate for rebuild in case of a failure.

1.3.3 CF Links
The recommended number of CF links to each machine in the sysplex is at least two, for availability reasons. You may need more for performance. See Parallel Sysplex Performance, GG24-4356. Note that each of these receiver links (at the CF end) is separate. Sender links (at the MVS end) can be shared between partitions in a fashion similar to EMIF, so even if you have several partitions you will only need two links per machine for each CF you need to connect to. If you have an MP machine which you plan to partition for any reason, then this means two links per CF on each side of the machine.

In the coupling facility, one Intersystem Channel Adapter (fc #0014) is required for every two coupling links (#0007 or #0008). The Intersystem Channel Adapter is not hot pluggable, but the coupling links are. If you do not have a redundant 9674 to switch the coupling load to, you may want to consider installing additional Intersystem Channel Adapters to allow for additional coupling links to be installed without an outage in the future. For details on hot plugging, refer to the 9672/9674 System Overview, GA22-7148.

1.3.4 Coupling Facility Structures


There could be some planned activities that require a coupling facility shutdown. A coupling facility cannot be treated as a normal device: it requires a particular procedure to be unallocated by the subsystems, and the shutdown can be disruptive or not depending on the initial coupling facility setting and the usage made by each different user. Here we will go through some considerations that can be useful in designing the coupling facility environment and making it possible to remove structures.

While designing the coupling facility environment, you should consider which structures must be relocated to an alternate coupling facility. Some subsystems can continue to operate without their coupling facility structure, although there may be a loss of performance. For example, the JES2 checkpoint can be relocated to DASD and the RACF structure can simply be deallocated while coupling facility maintenance is being performed. For the remaining structures, you must ensure that enough capacity (storage, CPU cycles, link connections, structure IDs, and so on) exists on an alternate coupling facility to allow structures to be rebuilt there.

When you set up your coupling facility configuration you should provide definitions that enable the structures to be moved or rebuilt; structures being moved to the alternate coupling facility must have the alternate coupling facility name in the PREFLIST statement. The following is an example of how to define a structure that can be rebuilt:

STRUCTURE NAME(IEFAUTOS) SIZE(640)
          REBUILDPERCENT(20) PREFLIST(CF01,CF02)


For structures that will be moved (REBUILT) from the outgoing coupling facility to an alternate coupling facility, ensure that all systems using the structures have connectivity to the alternate coupling facility.
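
Such STRUCTURE statements are coded in a CFRM administrative policy, which is maintained with the IXCMIAPU utility. The following is a minimal sketch of such a policy job; the job name, policy name, coupling facility names, and the machine type, serial, sequence, and partition values are placeholders only and must be replaced with the values of your own coupling facilities:

//CFRMPOL  JOB (0),'DEFINE CFRM',CLASS=A,MSGCLASS=X
//DEFPOL   EXEC PGM=IXCMIAPU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DATA TYPE(CFRM) REPORT(YES)
  DEFINE POLICY NAME(CFRMPOL1) REPLACE(YES)
    CF NAME(CF01) TYPE(009674) MFG(IBM) PLANT(02)
       SEQUENCE(000000040104) PARTITION(1) CPCID(00)
       DUMPSPACE(2048)
    CF NAME(CF02) TYPE(009674) MFG(IBM) PLANT(02)
       SEQUENCE(000000040105) PARTITION(1) CPCID(00)
       DUMPSPACE(2048)
    STRUCTURE NAME(IEFAUTOS) SIZE(640)
       REBUILDPERCENT(20)
       PREFLIST(CF01,CF02)
/*

Once defined, the policy is activated with the SETXCF START,POLICY,TYPE=CFRM,POLNAME=CFRMPOL1 command.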

1.3.5 Coupling Facility Volatility/Nonvolatility


Planning a coupling facility configuration for continuous availability requires particular attention to the storage volatility of the coupling facility where shared data resides. The advantages of a nonvolatile coupling facility are that if you lose power to a coupling facility that is configured to be nonvolatile, the coupling facility enters power save mode, saving the data contained in the structures. Continuous availability of structures can be provided by making the coupling facility storage contents nonvolatile. This can be done in different ways depending on how long a power loss we want to allow for:

- With a UPS
- With an optional battery backup feature
- With a UPS plus a battery backup feature

For more details on this see 1.15.2, 9672/9674 Protection against Power Disturbances on page 27. The volatility or nonvolatility of the coupling facility is reflected by the volatility attribute, and can be monitored by the system and subsystems to decide on recovery actions in the case of power failure. There are some subsystems that are very sensitive to the status of this coupling facility attribute, like the system logger, and they can behave in different ways depending on the volatility status. To set the volatility attribute you should use the coupling facility control code command:

Mode Powersave
This is the default setup and automatically determines the volatility status of the coupling facility based on the presence of the battery backup feature. If the battery backup is installed and working, the CFCC sets its status to nonvolatile. The battery backup feature will preserve coupling facility storage contents across a certain time interval (default is 10 seconds).

Mode Non-Volatile
This command should be used to inform the CFCC to set non-volatile status for its storage because a UPS is installed.

Mode Volatile
This command informs the CFCC to put its storage in volatile status irrespective of whether there is a battery or not.

There are considerations in coupling facility planning depending on the sensitivity of subsystem users to coupling facility volatile/nonvolatile status:

JES2
JES2 can use a coupling facility structure for its primary checkpoint data set, and its alternate checkpoint data set can either be in a coupling facility or on DASD. Depending on the volatility of the coupling facility, JES2 will or will not allow you to have both primary and secondary checkpoint data sets on the coupling facility.

Logger
The system logger can be sensitive to the volatile/nonvolatile status of the coupling facility where the LOGSTREAM structures are allocated. Particularly, depending on the coupling facility status, the system logger is able to protect its data against a double failure (MVS failure together with the coupling facility). When you define a LOGSTREAM you can specify the following parameters:

STG_DUPLEX(NO/YES)
Specifies whether the coupling facility logstream data should be duplexed on DASD staging data sets. You can use this specification together with the DUPLEXMODE parameter to be configuration independent.

DUPLEXMODE(COND/UNCOND)
Specifies the conditions under which the coupling facility log data will be duplexed in DASD staging data sets. COND means that duplexing will be done only if the logstream contains a single point of failure and is therefore vulnerable to permanent log data loss:
- Duplexing will be done if the logstream is allocated to a volatile coupling facility residing on the same machine as the MVS system.
- Duplexing will not be done if the coupling facility for the logstream is nonvolatile and resides on a different machine than the MVS system.
A sample logstream definition using these parameters is shown at the end of this section.

DB2
DB2 requests that MVS allocate its structures in a nonvolatile coupling facility; however, this does not prevent allocation in a volatile coupling facility. DB2 does issue a warning message if allocation occurs into a volatile coupling facility. A change in volatility after allocation does not have an effect on your existing structures. The advantages of a nonvolatile coupling facility are that if you lose power to a coupling facility that is configured to be nonvolatile, the coupling facility enters power save mode, saving the data contained in the structures. When power is returned, there is no need to do a group restart, and there is no need to recover the data from the group buffer pools. For DB2 systems requiring high availability, nonvolatile coupling facilities are recommended.

SMSVSAM Lock
The coupling facility IGWLOCK00 lock structure is recommended to be allocated in a nonvolatile coupling facility. This lock structure is used to enforce the protocol restrictions for VSAM RLS data sets and maintain the record level locks. The support requires a single CF lock structure.

IRLM Lock
The lock structures for IMS or DB2 locks are recommended to be allocated in a nonvolatile coupling facility. Recovery after a power failure is faster if the locks are still available.

IMS Cache Directory
The cache directory structure for VSAM or OSAM databases can be allocated in a nonvolatile or volatile coupling facility.

VTAM
The VTAM Generic Resources structure ISTGENERIC can be allocated in either a nonvolatile or a volatile coupling facility. VTAM has no special processing for handling a coupling facility volatility change.
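
As an illustration of the system logger parameters described above, the following is a minimal sketch of a LOGR policy definition made with the IXCMIAPU utility. The structure name, logstream name, buffer sizes, and high-level qualifier are examples only, and the structure must also be defined in the CFRM policy:

//LOGRPOL  JOB (0),'DEFINE LOGR',CLASS=A,MSGCLASS=X
//DEFLOGR  EXEC PGM=IXCMIAPU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DATA TYPE(LOGR) REPORT(YES)
  DEFINE STRUCTURE NAME(LOG_STR01)
         LOGSNUM(1)
         AVGBUFSIZE(512)
         MAXBUFSIZE(4096)
  DEFINE LOGSTREAM NAME(SYSPLEX.OPERLOG)
         STRUCTNAME(LOG_STR01)
         STG_DUPLEX(YES)
         DUPLEXMODE(COND)
         HLQ(IXGLOGR)
/*

With DUPLEXMODE(COND), staging data sets are used only while the logstream would otherwise represent a single point of failure, as described above.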

1.4 Sysplex Timers


In a multi-system sysplex it is necessary to synchronize the Time-of-Day (TOD) clocks in all the systems very accurately in order to maintain data integrity. If all the systems are in the same CPC, under PR/SM, then this is no problem as they are all using the same TOD clock. If the systems are spread across more than one CPC then the TOD clocks in all these CPCs must be synchronized using a single external time source, the sysplex timer.

The IBM Sysplex Timer (9037) is a table-top unit that can synchronize the TOD clocks in up to 16 processors or processor sides, which are connected to it by fiber-optic links. For full details see IBM 9037 Sysplex Timer and System/390 Time Management, GG66-3264-00. The sysplex cannot continue to function without the sysplex timer. If any system loses the timer signal, it will be fenced from the sysplex and put in an unrestartable wait state.

1.4.1 Duplicating
When the Expanded Availability Feature is installed, two 9037 devices linked to one another provide a synchronized, redundant configuration. This ensures that the failure of one 9037, or of a fiber optic cable, will not cause loss of time synchronization. It is recommended that each 9037 have its own AC power source, so that if one source fails, both devices are not affected. Note that these two timers must be within 2.2 meters of one another. The sysplex timer attaches to the processor via the processor's Sysplex Timer Attachment Feature. Dual ports on the attachment feature permit redundant connections, so that there is no single point of failure.


1.4.2 Distance
The processors are connected to the timer by multi-mode fiber, and can be up to 3 km from the timer, depending on the fiber. Distances between the sysplex timer and CECs beyond 3,000 meters are supported by RPQ 8K1919, which allows the use of single mode fiber optic (laser) links between the processor and the 9037. To support single mode fiber on the 9037, a special LED/laser converter has been designed, called the 9036 Model 003. The 9036-003 is designed for use only with a 9037, and is available only as RPQ 8K1919. Two 9036-003 extenders (two RPQs) are required between the 9037 and each sysplex timer attachment port on the processor. The single-mode link between the two 9036-003 extenders can be up to 20 km.

1.4.3 Setting the Time in MVS


In a multi-CEC sysplex you must code ETRMODE YES in the CLOCKxx member in SYS1.PARMLIB for each system. This ensures that the TOD clocks are synchronized with the sysplex timer. The recommended MVS setup is to set the TOD clock to GMT (UTC) and use an offset for the local time. To do this you set ETRZONE YES in CLOCKxx. Now the only source for the time zone offset for all the systems is the sysplex timer, and you can only make time changes using the sysplex timer. This ensures time consistency across all the systems.
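
A CLOCKxx member reflecting these recommendations might look like the following sketch; the time zone offset shown (five hours west of GMT) and the ETRDELTA value are examples only:

ETRMODE  YES
ETRZONE  YES
ETRDELTA 10
TIMEZONE W.05.00.00

With ETRZONE YES, the time zone offset is taken from the sysplex timer, and the TIMEZONE value is used only if that offset cannot be obtained.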

1.4.4 Protection
To prevent accidental system disruption, installations should use the password protection provided by the 9037. In addition, authorized users should make it a practice to always leave the console set to Authorization Level 1 instead of Level 2. Authorization Level 2 is required to be set prior to any disruptive functions. Be aware that when a 9037 console user enters the Set the Time menu and performs the function, the 9037s will perform a power-on-reset. This is extremely disruptive to processors in a multisystem sysplex: all the MVS systems will enter a X'0A2' wait state.

1.5 I/O Configuration


There is not a lot of difference between the connectivity requirements for continuous availability in a parallel sysplex environment and those of most systems today. The important thing to remember is that connectivity to data and to other I/O units must be preserved even when the workload is moved around. In designing the I/O subsystem you must ensure that there is no single point of failure. This is business-as-usual for most installations. Today, when you configure four I/O paths to each DASD control unit, three of these are for performance, and the fourth is for availability.

While ESCON is not strictly a prerequisite for parallel sysplex implementation, it becomes necessary for connectivity, optimum performance, and high availability once the number of MVS images in the parallel sysplex exceeds four. For instance, IBM recommends four paths from each MVS image to each 3990 or RAMAC DASD subsystem, and this configuration is not possible for more than four systems using only parallel channels.


The general guidelines for configuring I/O devices for high availability include:

- Always try to configure with a minimum of two ESCON directors.
- Use of 3990 Dual Copy is recommended for critical DASD data sets, such as the MVS Master Catalog, which are placed behind 3990 subsystems.
- Where possible, for critical DASD data sets, use RAMAC, which has many availability features, such as RAID-5 operation, predictive failure analysis and redundant power supplies.
- Try to spread the channel paths to a device, using nonadjacent channel numbers, through nonadjacent ESCON director ports, to different controllers or control units.
- Duplicate single path devices, such as screens to be used as consoles.

In the subsequent sections, the specific configuration requirements for the following critical device types are discussed:

- CTC
- DASD
- ESCON directors
- Consoles
- Tape
- Network

1.5.1 ESCON Logical Paths


As well as understanding the physical configuration requirements of the I/O subsystem, it is also necessary to have an awareness of the ESCON logical path configuration, which can impact both performance and availability. Under some circumstances, providing a physical path from a processor to a device does not guarantee a working (logical) path. Each ESCON-capable control unit supports a specific number of ESCON logical paths, regardless of the number of physical channels connected and the number of IOCP paths defined. This is not only a connectivity problem, but can also become an availability issue if the logical path mapping and manipulation is not understood.

For example, as shown in Figure 2 on page 13, each of the 10 systems in the parallel sysplex has a physical path defined and configured to the ESCON 3174 supporting the device used as the MVS master console. However, in order for the device to be used as a console, the 3174 control unit is customized in non-SNA mode and supports only one ESCON logical path. The Allow/Prohibit attributes of the associated ESCON director ports must be correctly manipulated in order to manage the MVS paths to the device as required. Note that the configuration shown in the figure is not necessarily optimum, because the connected processors support EMIF and there is no need for a different physical channel to support the path from each system in the sysplex to the console device. The configuration is drawn this way to simplify the logical path explanation.

The ESCON logical path considerations for each specific control unit type are discussed in detail in the relevant sections below.


Figure 2. ESCON Logical Paths Configuration

1.6 CTCs
CTCs provide an inter-system communication vehicle for functions such as XCF and VTAM. While it is possible for inter-system communications to take place through mechanisms other than CTC devices, such as a coupling facility for XCF signalling paths, or a 3745 for VTAM, CTCs should be considered at least for backup purposes in a parallel sysplex environment.

1.6.1 3088 and ESCON CTC


The CTC function can be provided either via a 3088 or an ESCON CTC Channel (defined in the IOCP as type CHANNEL=SCTC). While the 3088 can support up to 32 different devices, the ESCON CTC channel supports up to 512 devices. In a parallel sysplex environment, the ESCON CTC provides more connectivity and flexibility than a 3088 when configuring for high availability. The 3088 can only be configured on a parallel channel, and so does not provide the connectivity benefits of ESCON:

- Increased connectivity through an ESCON director
- Shared EMIF channel

The next few sections discuss considerations for configuring CTC devices.


1.6.2 Alternate CTC Configuration


For each application using CTCs, ensure that alternate (or backup) devices are supported by different CHPIDs in the case of ESCON CTCs, or by different 3088s.

Ensure the different CHPIDs supporting an application's primary and alternate CTC devices are selected from different channel groups on the host processor. If the SCTC CHPIDs are configured through ESCON directors, ensure that the CHPIDs are attached to different ESCON directors.

Note that XCF has built-in flexibility in its ability to use CTCs. XCF will immediately start using any online unallocated CTC device as a signaling path when prompted through the SETXCF START,PI/PO operator command. It is not necessary to pre-define device numbers for XCF signalling path use.
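
For example, assuming CTC devices 4E00 and 4F00 are online and unallocated (the device numbers are examples only), the following commands would bring them into use as outbound and inbound signalling paths:

SETXCF START,PATHOUT,DEVICE=4E00
SETXCF START,PATHIN,DEVICE=4F00

The corresponding SETXCF STOP commands remove the paths again, which is useful when a CTC or its channel has to be taken out of service.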

1.6.3 Sharing CTC Paths


Avoid configuring high-connect time devices (such as tape) on the same path as either SCTC or 3088s. That is, if the ESCON channel supporting SCTC devices is configured through a dynamic port on an ESCON director, ensure that devices with potentially long connect time, for example tapes, are not also configured on the same ESCON channel. Refer to Figure 3 on page 15. This diagram shows an ESCON channel on the 9672-E03 connected to an ESCON CTC channel on the 9672-R52. While the ESCON channel supporting the ESCON CTC devices can also support paths to DASD devices, it should not be configured to support paths to tape devices.

1.6.4 IOCP Coding


Ensure correct coding of IOCP TIMEOUT parameter in HCD. Both ESCON CTC and 3088 should have TIMEOUT=YES coded. TIMEOUT=YES specifies that when an interface timeout is detected, an interface control check (IFCC) is generated, and the appropriate recovery can be performed to free the channel.
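
As a hedged sketch, an ESCON CTC control unit and its devices might be defined as follows; the CHPID, control unit, link, and device numbers are examples only, and the TIMEOUT value is assumed to be coded on the IODEVICE statement (or as the equivalent HCD device parameter):

CNTLUNIT CUNUMBR=0040,PATH=(30),LINK=(E4),UNIT=SCTC,UNITADD=((40,16))
IODEVICE ADDRESS=(4E40,16),CUNUMBR=(0040),UNIT=SCTC,TIMEOUT=Y,STADET=Y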

1.6.5 3088 Maintenance


The 3088 is not a concurrent maintenance device. Therefore any servicing activity on the 3088 will require the unit to be taken out of action. If the 3088 must be part of the parallel sysplex configuration consider installing a second 3088 for high availability.

1.7 XCF Signalling Paths


The following are some recommendations on planning the XCF signaling paths:

- For high availability you should plan redundant elements. The best solution is having more than one CTC in each direction and two XCF structures allocated in two different coupling facilities.
- Defining an XCF structure is easier than handling a CTC configuration.
- An XCF structure offers a better recovery since it can be rebuilt in case of failure.
- CTC connections are faster than a coupling facility in message switching.


Figure 3. CTC Configuration

In planning XCF signalling through a coupling facility structure, be careful to avoid a last structure condition. In this case, XCF will take longer to complete a rebuild process, because all the signalling required by the rebuild process itself has to go through the couple data set.
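
A COUPLExx parmlib member implementing these recommendations (CTC devices in each direction plus two signalling structures in different coupling facilities) might look like the following sketch. The sysplex name, couple data set names, device numbers, and structure names are examples only; the structures must also be defined in the CFRM policy, with preference lists pointing to different coupling facilities:

COUPLE   SYSPLEX(PLEX1)
         PCOUPLE(SYS1.XCF.CDS01)
         ACOUPLE(SYS1.XCF.CDS02)
PATHOUT  STRNAME(IXC_STR1,IXC_STR2)
PATHIN   STRNAME(IXC_STR1,IXC_STR2)
PATHOUT  DEVICE(4E00)
PATHIN   DEVICE(4F00)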

1.8 Data Placement


It is not enough to ensure that we have systems with processor power available; we must also ensure continuous availability of the data. This means that essential data will have to be mirrored on separate DASD or tape, and you will have to be able to survive the loss of one of the mirror copies, continue running on the surviving copy, and then reinstate the mirror copy. When configuring the parallel sysplex, the following products provide high availability solutions for data:

RAMAC
The RAMAC Array DASD and the RAMAC Array Subsystem are high availability, fault-tolerant storage subsystems which use a number of techniques to ensure full availability of data even when a hardware failure occurs. These include dynamic sparing, multi-level error correction (RAID-5 protection as well as drive and CKD error correction) and Dual Copy.

3990 PPRC
For more information refer to 10.2.1, 3990 Remote Copy on page 215.

3990 XRC
For more information refer to 10.2.1, 3990 Remote Copy on page 215.

IMS RSR
For more information refer to 10.2.2, IMS Remote Site Recovery on page 216.

Figure 4. Recommended XCF Signalling Path Configuration

An important point to remember is that while we can guard against physical loss of data by one of the mirroring techniques described above, this does not protect against logical corruption of the data by, for example, a bad program. This is a problem we have always had, and the solution remains the same. We must take backup copies of the database at regular intervals, and log all changes to it. We need procedures to be able to go back to any one of these backup copies, and then apply subsequent updates from the logs. Refer to Chapter 2, System Software Configuration on page 29 for specific information on data set placement guidelines for critical system data sets, and Chapter 3, Subsystem Software Configuration on page 95 for recommendations for critical subsystem data set placement.

1.9 DASD Configuration


The DASD configuration in a parallel sysplex is important both from the point of view of providing optimum performance as well as high system availability.


Recommendations for configuring paths to DASD attached to 3990s and RAMAC subsystems are provided, along with a discussion of availability features such as the 3990 Model 3 and Model 6 Dual Copy and RAMAC RAID-5. While a number of different DASD control units are discussed below, the main considerations for a high availability parallel sysplex configuration are the following:

- Availability features, such as Dual Copy
- Connectivity, that is, the number of ESCON logical paths supported by the control unit

1.9.1 RAMAC and RAMAC 2 Array Subsystems


Host connectivity options include four or eight parallel channels, four or eight ESCON channels, and mixed parallel and ESCON channel capability. ESCON configurations include 128 logical channel path addressing capability, and the ability to operate at distances of up to 43 kilometers (26.7 miles). The host parallel channels may support either 3 or 4.5MB/sec data rates, and both 10 and 17 MB/sec ESCON channels are supported. The array controller can be configured as either a dual cluster controller or a quad cluster (two cluster pairs) controller thereby providing options for performance tailoring and/or intermix of DASD volume emulation modes. Options for controller cache sizes range from 64 megabytes to 2GB. Cache memory is also resident in the drawer.

1.9.2 3990 Model 6


For environments with mixed requirements for ESCON and parallel channel attachments, the four-port ESCON card allows a combination of eight parallel and eight physical ESCON (64 logical) connections, doubling the physical ESCON connectivity. For customers without ESCON directors, the new four-port ESCON card doubles the ESCON physical connectivity to 16 physical ports. PPRC environments, which require dedicated links between 3990 Model 6 Storage Controls, are able to address connectivity requirements without additional ESCON directors.

1.9.3 3990 Model 3


Even with ESCON channels, the 3990 Model 3 only supports 16 logical paths. This limits the connectivity options in a parallel sysplex environment with more than four MVS images. Even if four paths from each system are not required for performance reasons, there should be at least two paths configured from each system for availability reasons.

1.9.4 DASD Path Recommendations


- Configure at least two paths to DASD.
- Configure multiple paths with the least number of common elements. Configure each path through a different:
  - RAMAC or 3990 Storage Cluster (with power separation)
  - Storage Path
  - ESCON director
  - Side, on a partitionable machine in single image mode
  - Channel group

- Do not define DASD paths in the IOCP which do not physically exist. During CHPID recovery, MVS stops I/O operations to all devices potentially affected by the CHPID problem. This stoppage includes devices with paths defined over the CHPID, even if they do not physically exist on that CHPID. To avoid the I/O response time delays that occur while a channel path is being recovered, define only those paths that physically exist to DASD devices.
- Ensure duplicate devices are correctly coded in the LPAR IOCDS.
- Avoid using external 3044 links on DASD paths. Some 3044 links are used to connect devices that are physically located in a different site. Avoid configuring such external 3044 links on CHPIDs that are also used for DASD paths. 3044 links that extend outside the building may be easily damaged.
- Order 3990 and RAMAC paths in the HCD/IOCP definition.
- Do not configure non-DASD devices on the same CHPID as a 3990 or RAMAC subsystem. Do not configure non-DASD devices on parallel CHPIDs attached to a 3990 or RAMAC. During recovery for the 3990 Reset Event Notification, a Reset Channel Path (RCHP) instruction is used. If, for example, you had a 37XX TP controller on the same channel as the 3990, the RCHP instruction would cause the loss of TP sessions.
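
As an illustration of these recommendations, the following is a hedged IOCP sketch of two paths to a 3990 subsystem routed through two different ESCON directors. The CHPID numbers, switch numbers, link addresses, and device numbers are examples only, and each statement is shown on a single line for readability:

CHPID PATH=(21),TYPE=CNC,SWITCH=01
CHPID PATH=(61),TYPE=CNC,SWITCH=02
CNTLUNIT CUNUMBR=0100,PATH=(21,61),LINK=(C4,D4),UNIT=3990,UNITADD=((00,64))
IODEVICE ADDRESS=(0100,64),CUNUMBR=(0100),UNIT=3390,STADET=Y

In line with the list above, CHPIDs 21 and 61 would be chosen from different channel groups and, on a partitionable machine, from different sides.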

1.9.5 3990 Model 6 ESCON Logical Path Report


Device Support Facilities (ICKDSF) Release 16 provides Logical Path Status Reporting. This function provides the capability to display information for all logical paths between host operating systems and a device on a 3990 Model 6 Storage Control Unit. The information displayed includes a logical path sequence number, the type of path (ESCON, parallel, etc.), the logical path status, and identification of the host with which the logical path is associated. This function is particularly useful in a multi-processor ESCON configuration where multiple logical paths can be established on each physical link. Figure 6 on page 20 below shows sample output.

1.10 ESCON Directors


Like ESCON channels, ESCON directors are not strictly required to implement a parallel sysplex. However, if the parallel sysplex consists of more than four MVS images, then in order to provide the redundancy required for high availability, ESCON directors become an essential part of the configuration. The 9032 Models 001 and 002 ESCON directors allow for hot-plugging of the fiber cables connecting to channels and control units into existing ports. However, to upgrade the director with additional ports is disruptive. The newer 9032 Model 003 ESCON director features concurrent hardware install, redundant hardware, and concurrent Licensed Internal Code (LIC) upgrade capability. So with this model, upgrades are not disruptive.


Figure 5. Recommended DASD Path Configuration

You should in any event spread I/O paths to the same control unit over several ESCON directors to minimize the effect of a failure of any one of them and, in the case of Models 001 or 002, to allow for the possibility of making changes to the directors.

1.10.1 ESCON Manager


ESCON Manager is highly recommended for installation in a parallel sysplex environment. ESCON Manager:

- Provides a single point of control for managing ESCON director switching changes
- Prevents accidental misconfiguration of paths
- Provides the operator with otherwise unavailable diagnostic information about a potentially complex I/O configuration in the parallel sysplex environment

1.10.2 ESCON Director Switch Matrix


A well-managed ESCON director has only those ports that are intended to be used set up with the allow attribute.


LOGICAL PATH      SYSTEM    ESCON     FULL SP     HOST PATH GROUP ID
NUMBER    TYPE    ADAPTER   LINK      FENCES      CPU           CPU    TIME
                  ID        ADDRESS   0 1 2 3     SERIAL #      TYPE   STAMP
1         E       00        E502                  0000021330    9672   ABCF560E
2         E       00        E703                  0000030250    9672   ABCE0103
3         E       01        CB03                  0000030250    9672   ABCE0103
4         E       01        F104                  0000041330    9672   ABCF534A
5         E       00        E702                  0000020250    9672   ABCF5635
6         E       01        F102                  0000021330    9672   ABCF560E
7         E       01        CB02                  0000020250    9672   ABCF5635
8         E       00        E302                  0000020256    9672   ABCF5650
...
111       E       14        E901                  0000011330    9672   ABCF53BE
112-128   N/E     14-17

LOGICAL PATH TYPE:  E = ESCON   N/E = NOT ESTABLISHED

Figure 6. ICKDSF R16 ESCON Logical Path Report

1.11 Fiber
When planning fiber connections between machine rooms, and particularly between separate buildings, remember that fiber cables are thin and can easily be broken. So if possible draw two cables by different routes with enough fiber in each for the total needs.

1.11.1 9729
When going over long distances and through common carrier fiber this could be expensive, so consider whether a pair of 9729-001s could be an economical alternative. The 9729-001 Optical Wavelength Division Multiplexor (sometimes called Muxmaster) enables multiple bit streams, each possibly using a different communications protocol, bit rate, and frame format, to be multiplexed onto a single optical fiber for transmission between geographically separate locations. The 9729-001 can multiplex 10 full duplex bit streams, each at up to 622 Mb/s, over a single optical fiber at up to 50 km distance. The 9729-001 uses wavelength division multiplexing (WDM) to transmit several independent bit streams over this single fiber link. The distance between the two locations can be up to 50 km (at a 200 Mb/s bit rate per channel) and goes down proportionally as the bit rate is increased. Thus the 9729 enables economical transmission of many simultaneous bit streams bidirectionally over a single fiber.


1.12 Consoles
Software and hardware consoles need to be configured in a parallel sysplex with regard to the possibility of a failure. Many of the considerations here are no different from those in any other environment.

1.12.1 Hardware Management Console (HMC)


This section covers the hardware console, or HMC, for the IBM 9672 processors. Recommendations for the IBM 9021, 9121 and 9221 remain unchanged. A S/390 processor system consists of one or more CPCs, each with an associated support element (SE), and one or more HMCs, all interconnected with a token-ring LAN. With previous S/390 systems, each CPC needed its own console to perform power-on-resets (PORs), problem analysis and other tasks necessary for system operation. The HMC provides a means of operating multiple CPCs, or processors, from the same console. It does this by communicating with each CPC through its SE. When tasks are performed at the HMC, the commands are sent to one or more SEs via the token-ring LAN. The SEs then issue commands to their CPCs. CPCs can be grouped at the HMC so that a single command can be passed along to as many as all of the CPCs in the S/390 microprocessor cluster.

1.12.2 How Many HMCs?


Each HMC can manage up to 32 CPCs. Each SE can be managed by up to 16 hardware management consoles. To achieve maximum availability, we recommend that you have at least two hardware management consoles. Since HMCs operate as peers, this provides full redundancy. It is also important to remember that any failure in a SE or HMC will not bring down the operating system or systems running on the CPC or CPCs.

1.12.3 Using HMC As an MVS Console


The 9672 does support console integration, so the HMC can be an MVS NIP console. This support is similar to using the hardware console on a 9021 but is significantly easier to use. Even so, having the HMC as the only MVS console is a poor choice in the vast majority of cases. The exception is a sysplex environment with most messages suppressed at IPL and handled by automation and/or routed to a sysplex console attached to a 3174 on another system during production operations. You can use the 3270 emulation sessions built into the HMC for MVS consoles. This requires attachment of the HMC 3270 connection adapter (a standard adapter) via a 3270 coax cable to a channel attached terminal control unit such as a 3174, or token-ring LAN connectivity to a channel attached SNA control unit.

1.12.4 MVS Consoles


One of the major changes that will be seen when designing a sysplex is in the area of console operations. The difference is that the cross-system coupling facility (XCF) enables MCS messages and commands to be transported between systems in the sysplex. This means that both MCS and extended MCS consoles can issue commands to, and receive replies from, any system in the sysplex. Because consoles on all systems are known across an entire sysplex, when planning the console configuration it is necessary that you understand the new roles played by master and alternate consoles.

In the MVS world there are the following different types of consoles, as shown in Figure 7 on page 23:

- MCS consoles
- Extended MCS consoles
- Integrated (system) consoles
- Subsystem consoles

Only MCS and EMCS Consoles are affected by changes in a parallel sysplex configuration and require some planning consideration.

1.12.5 Master Console Considerations


When not running as a member of a sysplex, every MVS system has its own master console. When running in a sysplex, however, there is only one master console for the entire sysplex regardless of the number of systems; there can, though, be any number of consoles that are defined to have the same authority as the master console. Initially, the sysplex master console will be determined by the system that initializes the sysplex and will be the first MCS console that is available with an AUTH of MASTER in its definition in CONSOLxx. Subsequent consoles defined with an AUTH of MASTER are simply consoles with MASTER authority. See 2.10.3, MVS Consoles on page 43 for information on coding CONSOLxx.

There is little distinction between the sysplex master console and a console with master authority. The master console is the candidate of last resort for a console switch: if a console cannot be switched anywhere else, it will switch to the master console. The master console receives undeliverable (UD) messages when there is no other console that is eligible. For example, if a WTO were issued with a route code of 27 and there was no console online eligible to receive that route code, the message would be considered a UD message and be delivered at the master console. The master console also receives messages issued to CONSID=0. These are such things as command responses to IEACMD00/COMMNDxx issued commands and a variety of system initiated messages. Only a real MCS console can be the sysplex master console; neither SYSCONS nor an EMCS console is eligible.

Because there can be only one active master console in the entire sysplex, when planning the console configuration you must ensure that there is always an alternate console available somewhere in the sysplex that can be switched to should the master console fail or the system to which it is attached be taken out of the sysplex, whether planned or unplanned.
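
As an illustration, consoles with master authority on two different systems could be defined with CONSOLE statements similar to the following sketch; the device numbers and console names are examples only:

CONSOLE  DEVNUM(0700)
         NAME(MSTCON1)
         AUTH(MASTER)
         ROUTCODE(ALL)
CONSOLE  DEVNUM(0701)
         NAME(MSTCON2)
         AUTH(MASTER)
         ROUTCODE(ALL)

Because console names are known sysplex-wide, naming the consoles in this way also helps when systems join and leave the sysplex, as discussed below under Console Number.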


Figure 7. Console Environment

1.12.6 Console Configuration Considerations


The following section discusses things to consider when configuring the consoles for a parallel sysplex.

1.12.6.1 Console Number


In a sysplex, the limit of 99 consoles for the entire complex still exists. This may have to be taken into consideration when planning the complex design. One possible way to eliminate this restriction is through the use of extended MCS consoles. For example, NetView Version 2.3 supports both subsystem and extended MCS consoles. Because of the limit of 99 consoles in a sysplex, it is recommended that NetView be implemented to use extended MCS consoles, where possible. Consoles that contribute to the maximum of 99 are those that are defined in CONSOLxx members via CONSOLE statements: MCS consoles and subsystem consoles. The limit does not include extended MCS consoles.

Emphasis should be placed on naming consoles. Consoles, and especially subsystem consoles, are treated as part of a sysplex-wide pool of subsystem-allocatable consoles available to any system in the sysplex. Because there is no system affinity to a subsystem console definition, even the subsystem consoles that were defined in the CONSOLxx member do not get deleted when that system leaves the sysplex. When this same system IPLs and rejoins the sysplex, unless you have named the subsystem consoles, MVS has no way of knowing that the subsystem consoles had been defined previously and will add them again; in this way, it is quite easy to reach the maximum of 99 consoles. A sysplex-wide IPL can be required once the limit is exceeded.

1.12.6.2 Hardware Requirements


In a traditional environment the recommendation has been to have a pair of dedicated non-SNA 3174 control units attached to each MVS system to avoid a single point of failure and to handle all kinds of recovery situations, like DCCF messages for instance. In a sysplex environment, it is no longer required that every MVS image has its own MCS console. Using command and message routing capabilities, you can control all the MVS images running in the sysplex from one MCS or extended MCS console. This multi-system console support means that there is no longer a requirement that every system has MCS consoles physically attached. Similarly, there is no longer a restriction that an MCS console has an alternate on the same system.

Even if it is not possible to eliminate all the non-SNA 3x74s or MCS consoles, it may be possible to reduce the number from what was required when running the same number of individual single systems outside of a sysplex. To avoid outages, you should carefully plan the console configuration to ensure that there is an adequate number to handle both the message traffic and potential hardware failures. For example, for better control during the startup and the shutdown procedures for the sysplex, you should ensure that the first system coming up has a physically attached console, and the same for the last system leaving the sysplex. This will make controlling the flow of all the systems joining or leaving the sysplex easier and guarantee a fast recovery in case something goes wrong.

It is recommended that to manage the parallel sysplex, the hardware configuration should include at least two screens that can be used as MVS consoles. These screens should be attached to dedicated non-SNA control units which are configured with the least number of common hardware elements. That is, the two screens to be used as MVS consoles should be configured on different channels, on different physical CPCs. If the screens are attached to ESCON 3174s (configured in non-SNA mode) on ESCON channels, they should be configured through different ESCON directors. ESCON 3174s are the preferred control unit for MVS console devices, because in the event of the failure of the system to which the screens are attached, they can easily be switched to another live MVS system in the sysplex and varied online as consoles. This relies on the console device number being replicated on all systems in the sysplex, as recommended in 2.10.3, MVS Consoles on page 43. With ESCON Manager installed on all systems in the parallel sysplex, it is possible to automate this physical console switching when the owning system fails. Figure 8 on page 25 illustrates the recommended configuration.


Figure 8. Recommended Console Configuration

1.13 Tape
Configuring tape devices for high availability is important when critical applications have a dependency on those devices. However, in a parallel sysplex, while it is possible that tapes may exist in the configuration, there should be no dependence on those devices from an availability point of view. That is, a high availability CICS subsystem should not be relying on tapes for logging, for example. If tapes are part of the parallel sysplex configuration, say for the purposes of batch work, or backup, then their potential impact on the critical subsystems should be considered in terms of their recovery characteristics during failures.

1.13.1 3490
There are several models of the 3490 control unit that provide ESCON channel attachment, and hence the connectivity required for a parallel sysplex environment. Ensure that each MVS image has two paths configured to each 3490 device, and that each path has as few common physical components as possible.


1.14 Communications
The data center's communication with its users must also be ensured. The same N+1 considerations apply to communications equipment, lines, fiber trunks and so on, even out to the telecom provider. The network configuration in a parallel sysplex environment can take many different forms. We can think of the network in the following terms:

- Physical configuration components: 37x5 vs CTC vs 3172
- Logical configuration: subarea vs APPN
- Users of VTAM and their requirements

In this chapter, the discussion is concerned with the availability aspects of the physical configuration. The logical network and its users will be discussed in subsequent chapters.

1.14.1 VTAM CTCs


VTAM can use either 3088s or ESCON CTCs. The availability considerations for these devices were discussed in 1.6, CTCs on page 13.

1.14.2 3745s
As discussed in 1.9.4, DASD Path Recommendations on page 17, do not configure TP devices on the same channels as 3990 or RAMAC subsystems. Also, do not configure 3745s on the same channels as 3490s or CTCs.

1.14.3 CF Structure
VTAM uses a coupling facility structure to maintain information about generic resources. The structure name (ISTGENERIC) is a VTAM-defined hardcoded name which must be used.
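
Like any other structure, ISTGENERIC must therefore be defined in the CFRM policy. A minimal sketch follows; the size shown is a placeholder that must be calculated for your configuration, and CF01 and CF02 are example coupling facility names:

STRUCTURE NAME(ISTGENERIC)
          SIZE(4096)
          PREFLIST(CF01,CF02)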

1.15 Environmental
An essential part of keeping the data center running is the availability of power, cooling and other basic functions.

1.15.1 Uninterruptible Power Supply (UPS)


For a continuous availability configuration it is essential to have an uninterruptible power supply (UPS) of some kind. Both the 9672 and 9674 machines have dual power cords, which can be used to connect to different circuits and thus further reduce the possibility of power failures. Something to remember in a sysplex, where the systems will be operated from a single MVS console that is probably outside the machine room, is that you also need to have this console, and its backup, on a UPS. This also applies to the Hardware Management Console (HMC) for the 9672 CMOS machines, where the HMC is a PC on a LAN and may very well be outside the machine room too. If any of these consoles connect through bridges or routers, you must also take into account the possibility of a power failure in those devices.


1.15.2 9672/9674 Protection against Power Disturbances


IBM offers alternate or additional means to protect the 9672 and 9674 machines against such failures.

1.15.2.1 Protecting 9672 Rx1 and 9674 C01


The battery backup feature (BBU, feature # 2011) can be installed into the CPC and I/O expansion cages. These are batteries providing enough power to sustain the CPC operations for about 3.5 minutes in case of a power drop. These batteries provide additional protection to the 9674 C01 or to any coupling facility partition defined in a 9672, in that a coupling facility can go into power save state depending upon the duration of the power failure. See the description of power save mode and state at 1.15.2.3, Power Save Mode and State. When in a power save state, the contents of the coupling facility memory can be preserved for about 80 minutes in the absence of primary power.

1.15.2.2 Protecting 9672 Rx2/Rx3 and 9674 C02/C03


The battery backup feature is not available for these machine types; instead, IBM offers a local UPS (machine type 9910), which is a standalone frame providing an alternate power source to the CPC, and to the I/O expansion cage(s) model 1000 and OSA cage, if any. Refer to Figure 9 on page 28 for connection of the 9910 unit to a 9672 or 9674. The local UPS is itself powered by built-in batteries which can keep a 9672/9674 fully operational for 6 to 12 minutes (depending on the number of optional batteries in the UPS and the CPC configuration to be supported). If providing alternate power to a 9674, or to a CF logical partition in a 9672, in power save state, the coupling facility memory can be preserved for 8 to 16 hours in the absence of the primary power source.

1.15.2.3 Power Save Mode and State


The description of Power Save state pertains only to a coupling facility (9674 or a CF partition in 9672) and is relevant only if the machine has the battery backup feature or 9910 local UPS installed. A coupling facility can be preset to enter the power save state after a certain duration of an external power failure by the CFCC operator commands MODE POWERSAVE and RIDEOUT=. Once the power save state is entered, the coupling facility CPC logic is quiesced and does not receive any more power from the alternate source, while the CPC memory receives enough power just not to lose its contents. When the primary external power source is up again, the coupling facility sleeping logic is woken up, and assuming that the memory contents have been preserved, the coupling facility is automatically back in operation with its structures still in the state they were just before the power drop. This may present significant recovery time advantages for the CF exploiters which will not have to allocate and build the structures again. Note: If the CF has been implemented as a logical partition in a 9672, this applies to the CF LPAR only, the non CF partitions in the 9672 will have to be IPLed to resume operations.


Figure 9. 9910 Local UPS and 9672 Rx2 and Rx3

1.15.2.4 Benefits of Having Battery Backup or 9910 Local UPS


These can be weighed with respect to the following three possible situations:

The machine room is fully protected, for a time duration that exceeds the capability of BBU or 9910. There is of course no point in having them installed.

The machine room is only partially protected, or the coupling facility is at a nonprotected remote location, and the BBUs/9910 are a cost effective alternative to providing protection.

The machine room protection is limited in time, and the power save state may provide the additional protection to rapidly recover from an extended power failure.


Chapter 2. System Software Configuration


This chapter discusses the configuration considerations for system software.

2.1 Introduction
Over time, installations have moved from large single images to multiple stand-alone systems where the workload is partitioned. This ensures the entire installation is not affected by a single system outage, but the workload that was partitioned onto the failed system still does not run. It has also required system programmers to manage several systems, SYSRES volumes, master catalogs and parmlib members, all of which will be different. Parallel sysplex improves on this situation by allowing the system programmer to manage several copies of a single system image. Sharing SYSRES, master catalog and parmlib members is possible as each system can be a clone of the others. The fact that each individual system has equal access to data enables one system to be lost and the workload balanced over the remaining systems. The ability to accommodate planned and unplanned outages and maintain availability is greatly improved in a parallel sysplex.

2.2 N, N+1 in a Software Environment


As discussed, cloning has been introduced to simplify the management of many MVS systems. A number of fundamental changes have been made to MVS to allow cloning. The design objective is to have logically a single library for definitions and programs. Every system points to this common library. All changes are made to this one library and are thus immediately available to all. This is where the concept of level N and level N+1 coexistence comes into play. Having just a single library creates a single point of failure. It is recommended therefore that customers run with at least two libraries for software, one at level N and the other at level N+1. This enables the introduction of software change to a single system in the parallel sysplex by pointing to the level N+1 library without exposing the entire capacity of the parallel sysplex to the risks of the change. Once tested the other systems in the sysplex can in turn be pointed to the level N+1 library. When using this dual library philosophy you must ensure that all systems in the parallel sysplex have connectivity to each library.

2.3 Shared SYSRES


The level N, level N+1 philosophy manifests itself in a parallel sysplex in the form of a shared SYSRES. Shared SYSRES is not mandatory in a parallel sysplex, but without it many of the advantages of a parallel sysplex are lost. Statistically, the availability of a system complex is improved with a shared SYSRES. The availability of individual systems within the complex is unaffected.


The analysis of this is presented in MVS/ESA Software Management Cookbook , GG24-3481. Recommendations for the design of a shared SYSRES are discussed in 2.3.1, Shared SYSRES Design. Having implemented a shared SYSRES it of course becomes a single critical resource within the installation. As such it is highly recommended that shared SYSRESs are backed up using dual copy to ensure a live backup at all times.

2.3.1 Shared SYSRES Design


In order to share SYSRES some thought needs to be given to the design and contents of the SYSRES. To facilitate the use of the N+1 philosophy and cloning of systems the following is a possible solution to the issue. The SYSRES volume should contain the following:

SMP/E target library data sets. These will be explicitly cataloged in the master catalog and would consist of:
SMPCSI, SMP/E target consolidated software inventory
SMPLTS, SMP/E load module temporary store
SMPMTS, SMP/E macro temporary store
SMPSCDS, SMP/E save control data set
SMPSTS, SMP/E source temporary store
SMPLOG, SMP/E log data set
SMPLOGA, SMP/E second log data set

System software data sets which would be cataloged using the indirect catalog function. See 2.3.2, Indirect Catalog Function for an explanation of the indirect catalog function.

Some system data sets cannot be shared between images in a parallel sysplex and therefore cannot be included on the shared SYSRES. These will need to be allocated specifically and placed on volumes other than the SYSRES. These data sets are:

LOGREC data sets
STGINDEX data sets
PAGE data sets
SMF data sets

However, by utilizing the substitution variables available in MVS, these data sets need only be defined once. Taking LOGREC as an example, using the symbolic &SYSNAME as part of the data set name for the LOGREC parameter enables the IEASYSxx member to be shared across the sysplex and reduces the number of required IEASYSxx specifications.
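As a small illustration, a shared IEASYSxx could carry a single LOGREC specification like the sketch below; the data set naming convention is an assumption for illustration only:

   LOGREC=SYS1.&SYSNAME..LOGREC

At IPL, each system resolves &SYSNAME to its own name, so system SY01 would use SYS1.SY01.LOGREC while SY02 would use SYS1.SY02.LOGREC.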

2.3.2 Indirect Catalog Function


A catalog entry for a data set contains, among other information, the volser and device type of the volume on which the data set resides. Cataloging a data set using the indirect catalog function results in a catalog entry that does not have the volume information. Those libraries, usually MVS target libraries, are cataloged in the master catalog with volume information of ****** and device type of 0000. Note: The indirect catalog function can be used only for non-VSAM data sets.


Libraries that are cataloged using the indirect catalog function must reside on the system residence volume, or a data-set-not-found condition arises. The indirect catalog function works for any library, whatever high-level qualifier or name it may have. In the shared SYSRES environment this function is exploited by the following:

The indirect catalog function is used to reference the system libraries located on the SYSRES. The SMP/E target environment data sets are cataloged in the master catalog with specific volume names.

Such a system design would enable you to use different levels of target libraries independent of the IPL device you choose, and the ability to utilize the same master catalog.
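As an illustration, an indirect catalog entry is created by specifying the ****** volume serial and a device type of 0000; a hedged IDCAMS sketch follows:

   DEFINE NONVSAM -
          (NAME(SYS1.LINKLIB) -
           DEVICETYPES(0000) -
           VOLUMES(******))

With an entry like this, each system resolves SYS1.LINKLIB to the volume it was IPLed from, which is what allows the same master catalog to be used with either SYSRESA or SYSRESB.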

Figure 10. Indirect Catalog Function with SYSRESA

Figure 10 shows an example of how the facility works when referencing SYS1.LINKLIB cataloged using the indirect catalog function VOLUME=******. SYS1.LINKLIB is located on the IPL volume. The active operational level of SYS1.LINKLIB is volume SYSRESA. Note that for the operational libraries, SYS1.PAGE01 and SYS1.HASPCKPT, the catalog data set entry has a specific volume pointer.


Figure 11. Indirect Catalog Function with SYSRESB

Figure 11 shows the indirect catalog function in the operational environment for SYS1.LINKLIB, where the active operational level is volume SYSRESB, the IPL volume. Note that the catalog pointers to system libraries SYS1.PAGE01 and SYS1.HASPCKPT are unaffected by the switch in IPL volumes. Therefore, it is possible to continue to use existing system libraries after a system upgrade. The indirect catalog function is a very common approach to enable alternating system residence volumes using the same master catalog.

2.4 Master Catalog


The master catalog is a critical data set and is likely to be shared across a parallel sysplex. As with a shared SYSRES, dual copy backup is essential. However, with any dual copy situation, bad as well as good data is replicated to the backup. Therefore, if the master catalog is corrupted in any way, preventing its use in the system, then the dual copy backup would be in the same state. A consideration might be to use two master catalogs across the parallel sysplex. For example, in an eight system parallel sysplex have four sharing one catalog and four sharing the other. If there is a failure in one master catalog (file corruption) then those systems accessing the other would still be available. At least half of the processing power would run non-disruptively, and recovery of the other systems can occur by either recovering the failed catalog or IPL-ing them off the surviving catalog. The major downside of this approach is the added complexity of keeping two separate catalogs manually synchronized. An additional consideration if using this method would be to keep the two catalogs on physically separate hardware, that is on different volumes behind different controllers.


Despite the fact that a shared master catalog is effectively a single point of failure, the increased complexity and management overhead of multiple master catalogs probably outweighs the risk of a shared catalog failure. For this reason the recommendation would still be a shared master catalog. It should be noted that the number of I/Os to the master catalog increases significantly when it is shared across a parallel sysplex. For performance reasons, therefore, we recommend that you use DASD caching for the shared master catalog volume.

2.5 Dynamic I/O Reconfiguration


This section describes how to ensure the system is dynamic I/O capable. Dynamic I/O configuration lets you change your I/O configuration without causing a system outage. It allows you to select a new I/O configuration definition without performing a power-on-reset (POR) of the hardware or an initial program load (IPL) of the MVS system. Dynamic I/O configuration allows you to add, delete, or modify the definitions of channel paths, control units, and I/O devices to both software and hardware I/O configurations. If your aim is continuous availability, you must implement the dynamic I/O reconfiguration capability. There are some cases where changes cannot be made through this facility. They will be discussed in 2.5.1, Exceptions on page 34. Dynamic I/O configuration provides the following benefits:

Increases system availability by allowing you to change the I/O configuration while MVS is running, thus eliminating the POR and IPL for selecting a new or changed I/O configuration definition. Allows you to make I/O configuration changes when your installation needs them rather than wait for a scheduled outage to make the changes. Minimizes the need to logically define hardware devices that do not physically exist in a configuration.

Hardware Configuration Definition (HCD) is the only way to provide a configuration file that is dynamic reconfiguration capable. The output of HCD is a file called an I/O definition file (IODF). Both hardware and software configurations are contained in the IODF. Not all devices support dynamic reconfiguration. Each device type is represented to the software by a unit information module (UIM), which is included in the product that contains the device support code. The UIM specifies whether or not the device type supports dynamic I/O configuration. If the device type does not support dynamic I/O configuration, the device definition can be added to the hardware I/O configuration definition while MVS is running, but the device cannot be added to the software I/O configuration definition. Thus, the device is not available for use until the next IPL of the configuration containing the device. If the device type supports dynamic I/O configuration, it is up to your installation to decide whether to define the device as dynamic in the software definition.


The specification for dynamic is through HCD, where each device has to be defined as DYNAMIC Yes or DYNAMIC No. You must use HCD processing to create an IOCDS from the IODF and then perform a power-on reset (POR), which places the information about the hardware configuration in the hardware system area (HSA). The same IODF must be used at IPL time to define the software configuration. The IODF is pointed to by the LOADxx member; we recommend you use ** as the IODF identification in LOADxx, as this will use the IODF that matches the IOCDS active in the hardware. During the IPL process, the system reads the IODF and constructs the UCBs, the EDT and all device and I/O configuration related blocks. To be able to perform a software and hardware dynamic change, the hardware and software definitions must match. When the same IODF is used to define the hardware and software definitions, they will automatically match. So, to avoid losing the dynamic reconfiguration capability, it is strongly recommended that you keep the software and hardware configuration IODF files in sync with one another. With the HCD ACTIVATE function or through the MVS ACTIVATE operator command, you can make changes to the current configuration without having to IPL the software or POR the hardware. Note: Dynamic changes are allowed from a hardware perspective only when they happen within the current LPAR setup. To add a new logical partition, a power-on reset is still required. Refer to Appendix F, Dynamic I/O Reconfiguration Procedures on page 267 for a complete discussion on how to make your processor I/O dynamic capable and on how to size the HSA storage.
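As a sketch of the operational pieces involved (the IODF suffix 23 and the high-level qualifier SYS1 are illustrative assumptions, and the exact LOADxx column layout is not shown), the LOADxx member points at the IODF and the operator then activates a changed configuration with the ACTIVATE command, normally after a test run:

   IODF     ** SYS1

   ACTIVATE IODF=23,TEST
   ACTIVATE IODF=23

The first line is the LOADxx IODF statement, using ** so that the IODF matching the active IOCDS is selected; the first ACTIVATE validates the proposed change without making it, and the second performs the actual hardware and software activation.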

2.5.1 Exceptions
There are a few exceptions that limit the capability of dynamic reconfiguration. Make sure that your installation is not affected by one of these exceptions:

2.5.1.1 JES3 Managed Devices


JES3 only partly supports dynamic I/O reconfiguration. Prior to JES3 4.2.1, it did not support dynamic reconfiguration of its managed devices. A JES3 requirement was that all JES3-managed devices be defined as installation static. This was because JES3 used MVS services (such as IOSLOOK) which can only retrieve unit control blocks for static devices. If a device was defined as dynamic, JES3 abended with an ABENDU0004. With JES3 4.2.1 and APAR OY61674, the restriction of defining JES3 devices as static has been relaxed. JES3 now tolerates devices in the JES3 initialization deck which are defined as dynamic. However, JES3 also protects itself by preventing JES3 devices from being deleted during a dynamic configuration change. Any attempt to perform a dynamic activate that deletes or modifies these devices will result in an activation failure, because JES3 pins the UCB of any JES3-managed device.


2.5.1.2 Consoles
Graphic devices can be dynamically reconfigured if they are not allocated. Be careful when reconfiguring graphic devices that map to MVS consoles, because you can create a mismatch between the I/O configuration and the group of devices defined as MVS consoles.

2.5.1.3 Coupling Facility Links Changes


Dynamic reconfiguration is not supported in a coupling facility logical partition. CFR channel paths cannot be dynamically reconfigured. This means that every time a change is required (adding/removing coupling facility links) a coupling facility processor shutdown is required to pick up the new configuration stored in the IOCDS.

2.6 I/O Definition File


As described in 2.5, Dynamic I/O Reconfiguration on page 33, the I/O Definition File (IODF) contains information about the I/O configuration. It is a VSAM linear data set, produced by HCD, and is used during IPL to build the UCBs representing I/O devices to MVS. The IODF resides on a DASD volume, the device number of which is specified in the IPL Load Parameters. It is recommended that all systems in the parallel sysplex share the same active production IODF, rather than each maintaining its own copy.

2.7 Couple Data Sets


When implementing a parallel sysplex, a number of couple data sets must be shared by some or all of the systems in the parallel sysplex. The couple data sets are the following:

Sysplex couple data sets (also known as XCF couple data sets)
Coupling Facility Resource Manager (CFRM) couple data sets
Sysplex Failure Management (SFM) couple data sets
Workload Manager (WLM) couple data sets
Automatic Restart Manager (ARM) couple data sets
System Logger (LOGR) couple data sets

Not all of those must be shared by every system in the sysplex. If they are not shared by some systems, then those systems will not be able to participate in whatever function that CDS is used for. The sysplex CDS must be shared by all systems in the parallel sysplex. When planning for the couple data sets, the following considerations should be taken into account. These considerations are applicable to not only the sysplex (or XCF) couple data sets but to the couple data sets for CFRM, SFM, WLM, ARM and LOGR policy data, as well.

An alternate couple data set. An alternate couple data set should be defined. To avoid a single point of failure in the sysplex, IBM recommends that for all couple data sets, you create an alternate couple data set on a different device, control unit, and channel from the primary.


A spare couple data set. When the alternate couple data set replaces the primary, the original primary data set is deallocated, and there is no longer an alternate couple data set. Because it is recommended to have an alternate couple data set always available to be switched, consider formatting three data sets before IPL. For example:
SYS1.XCF.CDS01   Specified as primary couple data set
SYS1.XCF.CDS02   Specified as alternate couple data set
SYS1.XCF.CDS03   Spare

Then, if the alternate (CDS02) becomes the primary, you can issue the SETXCF COUPLE,ACOUPLE command to make the spare data set (CDS03) the alternate. Details of the command are found in MVS/ESA SP V5 System Commands . A couple data set can be switched by the operator through use of the SETXCF command, and by the system because of error conditions. The SETXCF command can be used to switch from the primary couple data set to the alternate couple data set. When the alternate couple data set becomes the primary, MVS uses the new primary couple data set for all systems and stops using the old primary couple data set.

The sysplex couple data set format utility determines the size of the data set based on the parameters coded on the DEFINEDS statement. To simplify adding systems to the sysplex, ensure the MAXSYSTEM parameter specifies a number large enough to allow for growth in system images. This will enable the introduction of new system images without the need to create a new sysplex couple data set and switch to it using the SETXCF command (a sample format job is sketched at the end of this section). A multiple extent couple data set is not supported. For the sysplex couple data set, the format utility determines the size of the data set based on the number of groups, members, and systems specified, and allocates space on the specified volume for the data set. There must be enough contiguous space available on the volume for the couple data set. For the couple data sets that support administrative data, for example CFRM and SFM, the format utility determines the size of the data sets based on the number of parameters within the policy type that is specified.

A couple data set cannot span volumes. XCF does not support multi-volume data sets.

A couple data set is used by only one sysplex. The name of the sysplex for which a data set is intended must be specified when the couple data set is formatted. The data set can be used only by systems running in the sysplex whose name matches that in the couple data set. Each sysplex must have a unique name, and each system in the sysplex must have a unique name. Each couple data set for a sysplex, therefore, must be formatted using the sysplex name for which the couple data set is intended.

The couple data set must not exist prior to formatting. The format utility cannot use an existing data set. This prevents the accidental reformatting of an active couple data set. You must delete an existing couple data set before reformatting it.


Couple data set placement. Couple data sets should be placed on volumes that do not already have high I/O activity. It is essential that XCF be able to get to the volume whenever it has to. For the same reason, you should not place the couple data set on a volume that has any of the following characteristics:
Is subject to reserves
Has page data sets
Has an SVC dump data set allocated

If SFM is active for status update missing conditions, and such a condition occurs because of the I/O being disrupted by any of the above, then systems will be partitioned from the sysplex. If the volume that the couple data set resides on is one for which DFDSS does a full volume backup, you will have to take this into consideration and possibly plan to switch the primary to the alternate during the backup to avoid a status update missing condition due to the reserve against the volume by DFDSS. See 2.12.4, RESERVE Activity on page 53 for discussion of a possible solution to this issue. When selecting a volume for an alternate couple data set, use the same considerations as described for the primary. When XCF writes to the couple data set, it first writes to the primary, waits for a successful completion, and then writes to the alternate. Not until the write to the alternate is successful is the operation complete.

Performance and availability considerations. The placement of couple data sets can improve performance, as well as availability. For maximum performance and availability, each couple data set would be on its own volume. However, this is an expensive approach. The following example provides an approach that is workable. Do not place the primary sysplex couple data set on the same volume as the primary CFRM couple data set. This is because they are I/O intensive. Table 1 shows our recommendation for couple data set placement that ensures the system can continue in a DASD failure situation. The placement of the other primary and alternate data sets is less critical and could be as shown or spread across the four volumes, dependent on installation preference.

Table 1. Couple Data Set Placement Recommendations. (The table spreads the primary and alternate sysplex, CFRM, SFM, WLM, ARM and LOGR couple data sets across four volumes, A through D, placing each primary and its alternate on different volumes and keeping the primary sysplex and primary CFRM couple data sets apart.)

Place couple data sets on volumes that are attached to cached control units with the DASD fast write (DFW) feature. This recommendation
applies to all couple data sets in any size sysplex. Those couple data sets most affected by this are the sysplex couple data set and the CFRM couple data set. The recommendation becomes more critical the more systems you have in the sysplex. Place couple data sets on volumes that are not subject to reserve/release contention or significant I/O contention from sources not related to couple data sets. This is true even if the I/O contention is sporadic.

MIH considerations for couple data set. The interval for missing interrupts is specified on the DASD parameter of the MIH statement in the IECIOSxx parmlib member. The default time is 15 seconds. If there is little or no I/O contention on the DASD where the couple data sets reside, consider specifying a lower interval (such as seven seconds) to be used by MIH in scanning for missing interrupts. A lower value alerts MVS to a problem with a couple data set earlier.

Security considerations for couple data sets. Consider RACF-protecting the couple data sets with the appropriate level of security. If you are using RACF, you want to ensure that XCF has authorization to access RACF-protected sysplex resources. The XCF STC must have an associated RACF user ID defined in the RACF started task procedure table. The started procedure name is XCFAS.
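To tie these recommendations together, the following is a minimal sketch of formatting a sysplex couple data set with the IXCL1DSU format utility, and of the operator commands used to switch and replenish the data sets. The data set names follow the CDS01/CDS02/CDS03 example above; the sysplex name PLEX1, the volume serials and the ITEM values are assumptions for illustration only:

   //FMTCDS   EXEC PGM=IXCL1DSU
   //SYSPRINT DD  SYSOUT=*
   //SYSIN    DD  *
     DEFINEDS SYSPLEX(PLEX1)
              DSN(SYS1.XCF.CDS01) VOLSER(CDSV01)
              MAXSYSTEM(8)
              DATA TYPE(SYSPLEX)
                   ITEM NAME(GROUP)  NUMBER(100)
                   ITEM NAME(MEMBER) NUMBER(200)
   /*

   SETXCF COUPLE,TYPE=SYSPLEX,PSWITCH
   SETXCF COUPLE,TYPE=SYSPLEX,ACOUPLE=(SYS1.XCF.CDS03,CDSV03)

The first SETXCF command makes the current alternate the primary; the second brings the spare in as the new alternate so that an alternate is always available.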

2.8 JES2 Checkpoint


JES2 can use a coupling facility structure for the primary checkpoint data set. The alternate checkpoint data set can reside in a coupling facility structure or on DASD. The current recommendation is for a customer to start with the primary checkpoint in a coupling facility structure and the alternate on DASD. Depending on the nonvolatile characteristics of the installation's coupling facilities, having both primary and alternate checkpoint data sets in a coupling facility is possible. The potential for a cold start must be evaluated should both coupling facilities containing checkpoint structures fail. Should you decide to use coupling facilities for both the primary and alternate checkpoint, be certain to place the structures in separate coupling facilities. See 1.3.5, Coupling Facility Volatility/Nonvolatility on page 8 for information on volatility in a coupling facility. The JES2 checkpoint structure can reside in any coupling facility that has space. There are no special considerations regarding structures from which it should be isolated. Place the structure in a coupling facility that has the processing power to support the access rate. The current recommendation for JES2 checkpoint structure placement is summarized in Table 2 on page 39.


Table 2. JES2 Checkpoint Placement Recommendations. The checkpoint definitions used here are the same as are used in the JES2 initialization deck. For more information, please refer to JES2 Version 5 Initialization and Tuning Reference.

Checkpoint Definition    Checkpoint Placement
CKPT1                    coupling facility
CKPT2                    DASD
NEWCKPT1                 coupling facility
NEWCKPT2                 DASD

Note: NEWCKPT1 should not be in the same coupling facility as CKPT1 for availability reasons.

Note: It is recommended that if you are running with the JES2 primary checkpoint in a coupling facility, even if that coupling facility is nonvolatile, you should run with a duplex checkpoint on DASD as specified in the CKPT2 keyword of the checkpoint definition. This may require a modification to the checkpoint definition in the JES2 initialization parameters. More information on setting up a MAS in a parallel sysplex environment is found in JES2 Multi-Access Spool in a Sysplex Environment and MVS/ESA SP-JES2 Version 5 Implementation Guide. JES2 does not rebuild structures in the manner of other coupling facility users. Failure of a coupling facility with a JES2 checkpoint structure will invoke the JES2 reconfiguration dialog. At this time, you should have already planned the recovery route. If your recovery plans are to move the primary checkpoint to another coupling facility, then you should have predefined the structure in the active CFRM policy. For a performance comparison between JES2 checkpoints on DASD and a coupling facility, refer to S/390 MVS Parallel Sysplex Performance.

2.8.1 JES2 Checkpoint Reconfiguration


JES2 enters the reconfiguration dialog when it is necessary to reconfigure the checkpoint data set. Prior to MVS V5.1, this was a manual process in which the operator had to enter commands on every member and make replies on every member. Since MVS V5.1, JES2 is able to exploit XCF as the communication vehicle to automate the processing. By using a combination of VOLATILE=(ONECKPT=DIALOG,ALLCKPT=DIALOG) and OPVERIFY=NO on the CKPTDEF initialization statement, you can fully automate the invocation of the checkpoint reconfiguration dialog. This applies both to the loss of a coupling facility and to a coupling facility becoming volatile, that is, losing its battery backup. Despite the fact that the checkpoint reconfiguration dialog can be invoked fully automatically, thus enhancing continuous availability, it should be noted that no other JES2 processing can be performed anywhere in the MAS during the dialog. This will manifest itself to end users in the following manner:

Users are unable to log on to, or log off from, TSO.
Jobs cannot be submitted.
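A hedged sketch of a CKPTDEF statement that matches the placement in Table 2 and the automation options just described might look as follows; the structure names, data set names and volume serials are assumptions for illustration only:

   CKPTDEF CKPT1=(STRNAME=JES2CKPT1,INUSE=YES),
           CKPT2=(DSN=SYS1.JES2.CKPT2,VOLSER=JESVL1,INUSE=YES),
           NEWCKPT1=(STRNAME=JES2CKPT2),
           NEWCKPT2=(DSN=SYS1.JES2.NEWCKPT2,VOLSER=JESVL2),
           VOLATILE=(ONECKPT=DIALOG,ALLCKPT=DIALOG),
           OPVERIFY=NO

Here NEWCKPT1 names a structure intended for a different coupling facility from CKPT1, as recommended in the note to Table 2.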


2.9 RACF Database


RACF uses the coupling facility as a large sysplex-wide store-through cache for the RACF database that reduces contention and I/O to DASD. If for some reason there is a coupling facility failure, RACF can function using its DASD resident database just as today. It is possible, through the RVARY RACF command, to switch between DASD usage and a coupling facility structure. The RACF primary data set structures will see greater activity than the alternate data set structure. For strictly performance purposes, place the primary and alternate structures in separate coupling facilities. Spread the primary structures across the available coupling facilities as their space requirements and access requirements allow, trying to keep the alternate database structures separated from the primary database structures. If the installation has the RACF database split into three data sets today, you might try to place the structures in coupling facilities based upon the amount of I/O to each of the data sets. Be aware that RACF uses different local buffer management schemes when a coupling facility structure is being used than when there is no structure. It is currently considered that the change in buffer management will require less physical I/O to the data sets. So the correlation between coupling facility accesses to a structure and the physical I/O to a data set may not be very good. However, it is better than no estimate at all.
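As an example of the switching mentioned above, RVARY is used to move RACF between data sharing mode, which uses the coupling facility structures, and non-data sharing mode, which uses only the DASD data sets. A sketch of the two commands follows (subject to the installation's normal RVARY approval controls):

   RVARY DATASHARE
   RVARY NODATASHARE

The first command switches RACF into data sharing mode; the second falls back to DASD-only operation, for example while a coupling facility problem is being resolved.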

2.10 PARMLIB Considerations


As systems work together in a parallel sysplex to process work, multiple copies of a software product in the parallel sysplex need to appear as a single system image. This single system image is important for managing MVS, the JES2 MAS configuration or JES3 complex, transaction managers, database managers, and ACF/VTAM. However, the notion of sysplex scope is important to any software product that you replicate to run on MVS systems in the parallel sysplex. To manage these separate parts as a single image, it is required to establish a naming convention for systems and subsystems that run in the parallel sysplex.

2.10.1 Developing Naming Conventions


You need to develop a flexible, consistent set of names for MVS systems and subsystems in the sysplex. 1. Develop consistent names for CICS, IMS TM, IMS DB, IRLM, DB2 and VTAM for use in data sharing:

Naming conventions for applications: For detailed information and recommendations on application subsystem naming conventions, please refer to System/390 MVS Sysplex Application Migration .

2. Specify the same names for the following on each system:


MVS system name
SMF system identifier (SID)
JES2 member name


3. Keep MVS system names short (for example, three to four characters). Short system names are easy for operators to use and reduce the chance of operator error. Consider the following examples of system names for a sysplex:

System 1 : S01 or SY01
System 2 : S02 or SY02
...
System 32: S32 or SY32

4. Develop consistent and usable naming conventions for the following system data sets that systems in the sysplex cannot share:

LOGREC data sets
STGINDEX data sets
PAGE/SWAP data sets
SMF data sets

Allow the names to be defined in one place, namely IEASYSxx. MVS/ESA SP 5.1 allows these non-shareable data sets to be defined with substitution variables so that the system name can be substituted for each MVS image. As a result, you only need to define these data sets once in IEASYSxx of SYS1.PARMLIB. The following is an example of how the MVS system name SY01 is substituted when a variable is used for the SYS1.LOGREC data set:

Variable name: SYS1.&SYSNAME..LOGREC
After substitution: SYS1.SY01.LOGREC

2.10.2 MVS/ESA SP V5.2 Enhancements


MVS/ESA SP 5.2 enhances the ability of two or more systems to share system definitions in a multisystem environment. Systems can use the same commands, dynamic allocations, parmlib members, and job control language (JCL) for started tasks while retaining unique values where required. A system symbol acts like a variable in a program; it can take on different values, based on the input to the program. Picture a situation where several systems require the same definition, but one value within that definition must be unique. You can have all the systems share the definition, and use a system symbol as a place holder for the unique value. When each system processes the shared definition, it replaces the system symbol with the unique value it has defined to the system symbol. If all systems in a multisystem environment can share definitions, you can view the environment as a single system image with one point of control. Sharing resource definitions has the following benefits:

Provides a single place to change installation definitions for all systems in a multisystem environment. For example, you can specify a single SYS1.PARMLIB data set that all systems share.

Reduces the number of installation definitions by allowing systems to share definitions that require unique values. For example, you can specify a single data set definition in which different systems can specify unique data set names.

Allows one to ensure that systems specify unique values for commands or jobs that can flow through several systems. For example, you can use single commands to start multiple instances of started tasks with unique names.

Helps maintain meaningful and consistent naming conventions for system resources.


When system symbols are specified in a definition that is shared by two or more systems, each system substitutes its own unique defined values for those system symbols. There are two types of system symbols:

Static system symbols have substitution texts that remain fixed for the life of an IPL. Dynamic system symbols have substitution texts that can change during an IPL.

MVS/ESA SP 5.1 introduced support for system symbols in a limited number of parmlib members and system commands. MVS/ESA SP 5.2 enhances that support by allowing system symbols in the following:

Dynamic allocations
JES2 initialization statements and commands
JES3 commands
JCL for started tasks and TSO/E logon procedures
Most MVS parmlib members
Most MVS system commands

If your installation wants to substitute text for system symbols in other interfaces, such as application or vendor programs, it can call a service to perform symbolic substitution. MVS/ESA SP 5.1 introduced support for the &SYSNAME and &SYSPLEX static system symbols, which represent the system name and the sysplex name, respectively. MVS/ESA SP 5.2 enhances that support by adding the following:

&SYSCLONE, a one or two character abbreviation for the system name
Up to 100 system symbols that your installation defines

You can also define the &SYSPLEX system symbol earlier in system initialization than in MVS/ESA SP 5.1. The early processing of &SYSPLEX allows you to use its defined substitution text in other parmlib members. See MVS/ESA SP V5 Initialization and Tuning Reference for information about how to set up support for system symbols. Then, for information about how to use system symbols, see the following books:
Table 3. References Containing Information on the Use of System Symbols

Use in:                              Reference
Application programs                 Using the system symbol substitution service in MVS/ESA SP V5 Assembler Services Guide
Dynamic allocations                  Providing input to the DYNALLOC macro in MVS/ESA SP V5 Auth Assembler Services Guide
JCL for started tasks                Using system symbols in JCL in MVS/ESA SP V5 JCL Reference
JES2 commands                        Using system symbols in JES2 commands in MVS/ESA SP V5 JES2 Commands
JES2 initialization statements       Using system symbols in JES2 initialization statements in MVS/ESA SP V5 JES2 Initialization and Tuning Reference
JES3 commands                        Using system symbols in JES3 commands in MVS/ESA SP V5 JES3 Commands
Parmlib members                      Using system symbols in parmlib members in MVS/ESA SP V5 Initialization and Tuning Reference
SYS1.VTAMLST data set                Using MVS system symbols in VTAM definitions in ACF/VTAM V3R4 VTAMLST Enhancements: Cloning VTAM Applications
System commands                      Managing messages and commands in MVS/ESA SP V5 System Commands
TSO/E REXX and CLIST variables       Accessing system symbols through REXX and CLIST variables in TSO/E V2 User's Guide and TSO/E V2 CLISTs
TSO/E logon procedures               Setting up logon processing in TSO/E V2 Customization

Examples of how symbolics can be used within the various parmlib members are shown in Appendix A, Sample Parallel Sysplex MVS Image Members on page 221.
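As a further illustration of the MVS/ESA SP 5.2 support, installation symbols and the &SYSCLONE value can be defined in an IEASYMxx parmlib member; in the sketch below the system name SY01, the SYSCLONE value and the symbol names are assumptions for illustration only:

   SYSDEF SYSNAME(SY01)
          SYSCLONE(01)
          SYMDEF(&SYSR2='RES2A')
          SYMDEF(&CMDPFX='$')

A SYSDEF statement of this kind can be filtered, for example by hardware or LPAR name, so that one shared IEASYMxx member can assign each system its own values.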

2.10.3 MVS Consoles


MCS and subsystem consoles must be defined in the parmlib CONSOLxx member. To simplify software management, it is recommended to specify unique console names and console device numbers in CONSOLxx. The same CONSOLxx member can then be used to describe all MCS consoles in the sysplex. If you specify the same console device numbers for different consoles on different systems in the sysplex, you must keep separate CONSOLxx members, which increases the management overhead and complicates the introduction of other systems into the sysplex.
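A sketch of a CONSOLE statement in a shared CONSOLxx follows; the console name, device number and alternate group name are assumptions for illustration:

   CONSOLE DEVNUM(0700)
           NAME(MCS01)
           AUTH(MASTER)
           ALTGRP(SYSCONS)

Because the device number and name are the same on every system, the member can be shared across the sysplex; the ALTGRP group referenced here would be defined in CNGRPxx, as described in 2.10.3.2, Console Groups.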

2.10.3.1 Alternate Consoles Considerations


Generally speaking, in the past an alternate console was the one that would be switched to should the primary console fail. In a sysplex, because a given console may be used to fulfill a function for all systems in the sysplex, plans for alternates must be handled in a different way. Alternate consoles must also be considered across the entire sysplex, especially for the sysplex master console. You should plan the console configuration in such a way that an alternate to the sysplex master console is available at all times. In MVS/ESA SP Version 4.2.2, changes were introduced to enhance the management of alternate consoles. Now, if a console fails, the system tries to switch to the next console specified in the CONSOLxx member, independently of where the new target console is physically attached.

2.10.3.2 Console Groups


Another change was the introduction of console groups. SYS1.PARMLIB member CNGRPxx allows you to define MCS or extended MCS consoles as members of console groups which can be referred to and used as follows:

ALTGRP allows defining a group of consoles from which the system can select an alternate for a console during a console switch. Extended MCS consoles can be included in the ALTGRP console group and be used as alternates for MCS consoles or other extended MCS consoles. ALTGRP is specified on the CONSOLE statement for MCS consoles, or in the RACF OPERPARM segment for extended MCS consoles. Figure 12 shows an example of using the ALTGRP keyword.

Figure 12. Alternate Consoles

NOCCGRP allows you to define a group of consoles from which the system can select a master console when a no consoles condition occurs. NOCCGRP is specified on the INIT statement.

SYNCHDEST allows you to define a group of consoles that the system can use to display synchronous messages. Synchronous messages, previously known as DCCF messages, are WTO or WTOR messages that are typically issued during initialization or recovery situations, or by programs that want messages to bypass normal message queuing. In a sysplex, a console can display a synchronous message only if it is physically attached to the system that issues the message. See 2.10.3.3, Synchronous WTO(R) Messages on page 45 for a fuller explanation.

HCPYGRP allows you to define a group of console devices from which the system can select a backup device for the hardcopy log. HCPYGRP is specified on the HARDCOPY statement.

For a detailed description of console parameters, please refer to MVS/ESA Planning: Operations, GC28-1441.
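The following sketch shows how such groups might be defined and referenced; the group and console names are assumptions carried over from the earlier CONSOLE example, and the CNGRPxx suffix is illustrative:

   CNGRP01:
     GROUP NAME(SYSCONS) MEMBERS(MCS01,MCS02)
     GROUP NAME(SYNCGRP) MEMBERS(MCS01,*SYSCON*)

   CONSOLxx:
     INIT    CNGRP(01) NOCCGRP(SYSCONS)
     DEFAULT SYNCHDEST(SYNCGRP)

The INIT statement activates the console group member and names the group used when a no consoles condition occurs, while the DEFAULT statement points synchronous messages at the SYNCGRP group, which ends with the system console as a last resort.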


2.10.3.3 Synchronous WTO(R) Messages


You can define the master console, the system console, or other MCS consoles as members of an alternate console group in CNGRPxx to receive synchronous messages. Synchronous messages are WTO or WTOR messages that can be issued during initialization or recovery situations, or by programs that want messages to bypass normal message queuing. In a sysplex, a console can display a synchronous message only if it is attached to the system that issues the message. Synchronous messages are handled in a manner similar to that of the DCCF message of older MVS versions. The keyword SYNCHDEST on the DEFAULT statement of CONSOLxx can be used to control the display of a synchronous message. MVS selects an eligible console based on the order of the console members specified in the group. You can specify valid MCS console names as members of the group or you can specify *MSTCON* (the master console in the system or sysplex) or *SYSCON* (the system console). To receive the synchronous message, the console must be attached to the system that issues the message. If you do not specify an alternate console group on SYNCHDEST or none of the consoles on SYNCHDEST are active, the system that issues the message tries to select the following console: 1. The master console, if it is active and physically attached to the system that issues the message. 2. The system console on the system that issues a wait-state message or WTOR message. For a sysplex environment, you should understand and plan where your synchronous messages will be displayed. If you have not activated a SYNCHDEST console group, such messages will be displayed on one of the following: 1. The master console, if it exists and is attached to the system where the message was issued. 2. Otherwise, the system console. Synchronous messages can be displayed only on the system where they originated. They can be displayed on any MCS console attached to the system, but you must specify the console(s) to be used in the SYNCHDEST console group. Systems with no attached MCS consoles will use the system console for these messages. The SYNCHDEST console group is an ordered list of consoles where MVS is to attempt to display synchronous messages. The system console can be specified in the list. If an MCS console in the list is not attached to the system where the message is issued, it is skipped. So, the same SYNCHDEST group can be used for all systems, if you wish. If the system attempts to use a console for a synchronous message and fails, the next console in the SYNCHDEST group which is attached to this system will be used. The system console can be specified in the group, and will also be used as a last resort, if all other console attempts have failed. If MCS consoles share a control unit and an operator tries to respond to a synchronous message on one of the consoles, interruptions from the other consoles can make it impossible for the operator to reply to a synchronous message. When you plan your sysplex recovery, you should attach the MCS console that is to display synchronous messages to its own control unit without any other attached console. If it shares a control unit, there is a higher probability of failure on the console; the message will then be attempted on the next suitable console in the SYNCHDEST group, or on the system console. For a detailed description on console definition, please refer to MVS/ESA Initialization and Tuning Guide, GC28-1451.

2.11 System Logger


The MVS/ESA 5.2 system logger is a set of services that allows an application to write, browse and delete log data. System logger addresses the problem of complex log management in a multisystem MVS, creating a single image view. System logger provides the merging of log data generated in several systems in parallel sysplex. Initially the system logger functions are going to be exploited by multiple console support (MCS) for Syslog data (also called OPERLOG) and SVC 76 for Logrec records. A future release of CICS is planned to exploit the system logger. System logger is a system component running in its own address space and uses list structures of a coupling facility in a parallel sysplex through XES services. Each address space is a member of a XCF system logger group.

2.11.1 Logstream and Structure Allocation


The logstream is a collection of data used as a log at application level. Logstream data can reside in either a coupling facility list structure or on DASD. Logstreams are merged by system logger based on timestamp sequence. Through the LOGR policy, it is possible to relate a logstream to a list structure. It is possible to assign either one or multiple logstreams to a single coupling facility structure. When planning the logstream configuration, it should be noted that the maximum number of structures supported by a CFRM policy is 256. An installation with many subsystems may need to have multiple logstreams mapped to a single coupling facility structure to avoid exceeding the structure limit. A range from ten to twenty logstreams per single structure is recommended. The LOGR couple data set is performance sensitive and care should be taken with regard to allocation and placement. See 2.7, Couple Data Sets on page 35 for recommendations on couple data sets.
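As an illustration, the structure and its logstreams are defined in the LOGR policy with the IXCMIAPU utility. In the sketch below, the structure name, the buffer sizes and the logstream attributes are assumptions for illustration only; SYSPLEX.OPERLOG is used as the logstream name for the operations log, and the staging data set duplexing keywords anticipate the discussion in 2.11.3 and 2.11.4:

   //DEFLOGR  EXEC PGM=IXCMIAPU
   //SYSPRINT DD  SYSOUT=*
   //SYSIN    DD  *
     DATA TYPE(LOGR)
     DEFINE STRUCTURE NAME(LOG_OPER)
            LOGSNUM(10)
            AVGBUFSIZE(512)
            MAXBUFSIZE(4096)
     DEFINE LOGSTREAM NAME(SYSPLEX.OPERLOG)
            STRUCTNAME(LOG_OPER)
            HLQ(IXGLOGR)
            LS_DATACLAS(LOGRDC)
            STG_DUPLEX(YES) DUPLEXMODE(COND)
   /*

LOGSNUM(10) keeps the number of logstreams per structure within the ten to twenty range recommended above.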

2.11.2 DASD Log Data Sets


System logger uses VSAM linear data sets to store log stream data that has been moved from the coupling facility. This happens when the coupling facility structure space allocated for the log stream reaches its installation defined threshold and system logger has to initiate the offload of log data from the coupling facility to DASD. A log stream can have data in multiple DASD log data sets; as a log stream fills log data sets on DASD, system logger automatically allocates new ones for the log stream. System logger increments the sequence number in the log stream data set name as new data sets are added for a particular log stream.


It is recommended that DASD log data sets be managed by System Managed Storage (SMS). You can manage log stream data sets by either:

Modifying automatic class selection (ACS) routines.
Defining the SMS data class, storage class and management class explicitly in the log stream definition using the IXCMIAPU utility or the IXGINVNT service.

MVS V5.2 imposes a limit of 168 log data sets per logstream. Based on the amount of log data, sizing the log data set is an important task in order to keep available all the log data required for a certain application. The index for the 168 log data sets is kept in the LOGR couple data set and the change management activities must be done manually. If the 168 limit is reached, system logger will stop. Procedures must be in place to ensure that appropriate action is taken before this limit is reached. Notes: 1. For information on how to use the IXCMIAPU utility, see MVS/ESA SP V5 Setting Up a Sysplex , GC28-1449. 2. For information on how to use the IXGINVNT service, see MVS/ESA V5 Authorized Assembler Services Reference, Volume 2, ENF-ITT , GC28-1476.

2.11.3 Duplexing Coupling Facility Log Data


Whenever the system logger writes a log block to the corresponding coupling facility list structure, it maintains another copy of the data to avoid a single point of failure. How to maintain another copy of the coupling facility resident log data is installation dependent and we will go through all possible configuration options. If an installation chooses the wrong option for the current environment, data loss may result. It is very important that an installation understands its topology and chooses the appropriate duplexing options. Maintaining a second copy of the coupling facility resident data is referred to as duplexing . An installation has the following two choices for duplexing:

Maintain a copy of the coupling facility resident log records in the MVS system logger storage buffers.
Maintain a copy of the coupling facility resident log records in a staging data set, on a per-logstream basis. There is a staging data set per logstream per system in the sysplex.

Depending on the duplexing option, the local buffers or the staging data set contains data written by the system logger but not yet written to a logstream DASD data set. Another concept needs to be considered when configuring the system logger environment: the failure dependent or independent attribute of each coupling facility connection. Depending on the location of the coupling facility as well as its volatility or non-volatility status, a connection to a logstream can be identified to be failure dependent or failure independent. This may affect the system logger configuration and behavior. The rules for determining the attribute of a logstream connection are as follows:


If the system logger and the coupling facility to which it is connected on behalf of a given logstream are both executing on the same CPC, then the connection is failure dependent regardless of the volatility status of the coupling facility. Figure 13 on page 48 is an example of a failure dependent connection between a system logger running on an MVS and a coupling facility running in an LPAR on the same CEC.

Figure 13. Example of Failure Dependent Connection

If MVS and the coupling facility are separate, then: If the coupling facility is non-volatile, the connection is failure independent. An example is shown in Figure 14 on page 49, where different connections to the same logstream may each have different failure independence/dependence characteristics.


Figure 14. Example of Failure Dependent/Independence Connections

If the coupling facility is volatile, then the connection is failure dependent.

The failure dependent/independent attribute can vary depending on failures or on operator commands issued to the coupling facility. The system logger is sensitive to such changes and can switch back and forth between the two states. For critical applications using the system logger, the recommendation would be to put system logger structures in a failure independent coupling facility and to duplex logstream data to DASD.

2.11.4 DASD Staging Data Sets


System logger uses VSAM linear DASD staging data sets to hold a backup copy of coupling facility log data for log streams that specify it. When duplexing of log data to staging data sets is requested, system logger creates staging data sets, if necessary, during connection processing for the first IXGCONN request issued against a log stream from a particular system. System logger will automatically allocate further staging data sets as needed. As with DASD log data sets, it is recommended that DASD staging data sets be managed by SMS; see 2.11.2, DASD Log Data Sets on page 46. It is also important that all the staging data sets reside on devices that all systems in the sysplex have connectivity to. Other systems may need access to the backed up log data in case of system or coupling facility failure. If peer systems do not have connectivity to staging data sets, system logger may not be able to recover all data in case of failure. Staging data sets will be used only during a recovery procedure initiated automatically by the system logger through a structure rebuild process in a new coupling facility structure. Staging data sets are performance sensitive and the DASD fast write option is strongly recommended. There are important considerations for staging data set sizing; log data offload activity will be initiated as soon as either the coupling facility structure or the staging data set becomes full. The high threshold parameter applies to both the coupling facility structure and the staging data set. To minimize the offload activity, ensure that the staging data set is as big as the coupling facility structure. For more detailed information on setting up and exploiting system logger, please refer to MVS/ESA SP V5 Sysplex Migration Guide, SG24-4581.

2.12 System Managed Storage Considerations


In an MVS environment, the implementation of system-managed storage is accomplished by DFSMS/MVS. The policy (also known as a configuration) that is used to system-manage an installation's storage is known as the source control data set (SCDS). The policy defined in an SCDS is implemented by the installation when the SCDS is activated. The process of activation is accomplished by SMS copying the SCDS into an active control data set (ACDS). The ACDS contains the currently active policy. One further DFSMS/MVS control data set is required to implement system-managed storage, and that is the communication data set (COMMDS). The COMMDS is used for communications among systems in an SMSplex.

2.12.1 SMSplex
An SMSplex is a system (an MVS image) or collection of systems that share a common SMS configuration. The systems in an SMSplex share a common ACDS and COMMDS pair. DFSMS/MVS 1.1.0 supports a maximum of eight systems in an SMSplex. DFSMS/MVS 1.2.0 introduces the concept of SMS system group names, which allows the specification of a system group as a member of an SMSplex. This enables more than eight systems to be defined in one SMSplex. A system group consists of all systems that are part of the same parallel sysplex and are running SMS with the same configuration, minus any systems in the parallel sysplex that are specifically defined in the SMS configuration. The following figures show examples by way of explanation.


Figure 15 (schematic): sysplex SYSPL01 contains systems S1 through S6, defined by COUPLE SYSPLEX(SYSPL01) in the COUPLExx member; in the SMSplex view, all of these systems are represented in the SCDS/ACDS base configuration by the single system group name SYSPL01.

Figure 15. Basic Relationship between Sysplex Name and System Group

The SMS system group name must be the same as the parallel sysplex name defined in the COUPLExx member in PARMLIB, and the individual system names must match system names in the IEASYSxx member in PARMLIB. When a system group name is defined in the SMS configuration, all systems in the named parallel sysplex are represented by the same name in the SMSplex, as shown in Figure 15.

Figure 16 (schematic): sysplex SYSPL01 contains systems S1 through S6; in the SMSplex view, S1 is defined individually by its system name, while the system group name SYSPL01 represents the remaining systems in the SCDS/ACDS base configuration.
Figure 16. SMSplex Consisting of System Group and Individual System Name

The SMSplex does not have to mirror a parallel sysplex; you can choose to configure individual systems and Parallel Sysplexes into an SMSplex configuration. Figure 16 shows an SMSplex where system S1 has been

separately defined as an individual system name. SMS considers S1 and SYSPL01 as two members of the SMSplex. Systems S2 through S6 are represented by SYSPL01 and must be addressed simultaneously with regard to SMS functions. It is recommended, however, that the SMSplex match the parallel sysplex for better manageability of your data. Note: JES3 does not support SMS system group names. When the DFSMS/MVS 1.2.0 configuration is defined using parallel sysplex names, JES3 does not provide data set integrity and scheduling services for the SMS-managed data sets. If the SMS configuration is defined using a combination of system names and system group names, JES3 SMS data set services are available on each system whose name matches the system names defined in the SMS configuration. JES3 SMS data set services are available on seven CPCs if there are more than eight MVS systems in the SMSplex.

2.12.2 DFSMShsm Considerations


DFSMShsm support for migration, backup and dump processing has been changed to support system group names. The enhanced DFSMShsm processes storage groups for automatic space management, data availability management and automatic dump processing in the following way:

A storage group may be processed by any system if the system name in the storage group is blank.
A storage group may be processed by a subset of systems if the system name in the storage group is a system group name.
A storage group may be processed by a specific system if the system name in the storage group is a system name.

In addition to the sharing of the ACDS and COMMDS across the parallel sysplex, the following DFSMShsm data sets need to be shared:

The DFSMShsm migration control data set, MCDS
The DFSMShsm backup control data set, BCDS
The DFSMShsm offline control data set, OCDS
The DFSMShsm journal

2.12.3 Continuous Availability Considerations


Basic rules need to be applied to ensure continuous availability from a DFSMS point of view across the sysplex. In an SMS complex, the operating systems communicate by sharing a common configuration stored in the ACDS and common system-managed volume statistics stored in the COMMDS. It is recommended that the SCDS, ACDS and COMMDS reside on different volumes. Backups of all three are recommended in case of hardware failure or accidental data loss. The volumes that contain these SMS control data sets and their backups must be accessible from all systems that are part of the SMS complex. The SCDS must be accessible from all systems that need to perform an activate of the configuration. If you have more than 16 systems in an SMS


complex, you need to define the ACDS and COMMDS on volumes attached through a 3990 Model 6 storage controller (the 3990 Model 3 does not have enough paths to make it possible to share attached volumes with more than 16 systems). Similar considerations apply to the shared DFSMShsm data sets: the MCDS, BCDS, OCDS and journal.

Prior to DFSMS/MVS 1.2, logical connectivity for all system-managed volumes and storage groups was controlled at the individual system level. Allocations, deletions, and accesses could be performed only on systems that had the logical (SMS and MVS) and physical (hardware) connectivity; otherwise job failures would occur. This also applied to DFSMShsm operations. In addition, the required catalogs needed to be accessible. With DFSMS/MVS 1.2, when you define a volume or storage group to have connectivity to a system group, the volume or storage group must be accessible to all systems that are part of the system group. Otherwise, job failures will occur.

When a common set of classes, groups, ACS routines, and a base configuration is applied across an MVS/ESA multisystem environment, the environment is a simple one. However, if SMS is not active on one of the systems, that system is not able to do the following:

Create data sets on system-managed volumes
Delete system-managed data sets
Extend system-managed data sets to new volumes
Use JCL keywords supported by SMS

The COMMDS does not record DASD space usage changes for a system that has not activated SMS. For more information regarding defining system group names and implementing DFSMS across a parallel sysplex, refer to MVS/ESA SML: Implementing System-Managed Storage, SC26-3123.

2.12.4 RESERVE Activity


When DFSMSdss is doing a full volume backup, it issues a reserve against the volume. If the volume contains data sets such as a couple data set, the volume backup activity could cause a status update missing condition, because the couple data set cannot be accessed during the backup. Planning to switch from the primary to the alternate couple data set in time for scheduled backups could be difficult to manage and may lead to disruption within the sysplex. A possible solution to this issue is to convert the reserve activity into a global ENQ. This can be done by ensuring the following statement is included in the active GRSRNLxx member of parmlib:

RNLDEF RNL(CON) TYPE(GENERIC) QNAME(SYSVTOC)   /* CONVERT VTOC */

Installations should check that converting reserve activity to global ENQ does not impose performance problems, prior to implementing the solution.


Review MVS/ESA SP V5 Planning: Global Resource Serialization, GC28-1450, for other possible implications in your environment before implementing reserve conversion.

2.13 Shared Tape Support


MVS/ESA SP Version 5.2 introduced the ability to manage the allocation of 3480/3490 cartridge drives across multiple systems in a parallel sysplex configuration. Previously, while drives could be varied online to more than one system at a time, there was no cross-system allocation management. Operator intervention was required to move devices from one system to another. Alternatively, the devices could be managed by JES3 or a vendor product. Automatic tape switching expands the MVS allocation function, introduces new locking serialization and requires customer planning and setup to use effectively.

2.13.1 Planning
Planning for autoswitchable devices is discussed in MVS/ESA Hardware Configuration Definition: Planning , GC28-1445. The issue discussed in the manual for autoswitchable devices is how many to define as autoswitchable and how many to dedicate to particular systems. The device selection process is slightly longer for autoswitch devices than for dedicated devices due to the sysplex wide scope of the allocation. If the workload that requires tape drives is predictable on certain systems, the allocation of some devices as dedicated and others as shared may provide benefits. If the usage is likely to be spread or unpredictable, however, management of the devices may be simplified by defining all devices as autoswitchable.

2.13.2 Implementing Automatic Tape Switching


The automatic tape switching function can only be used by systems within a parallel sysplex, and MVS/ESA V5.2 is required on all systems that will use the shared tapes. All systems need access to the CFRM couple data set, an IEFAUTOS structure must be defined in a CFRM policy, and the policy must be activated. The SYS1.PARMLIB member GRSRNLxx must be updated to promote tape volume allocation from SYSTEM to SYSTEMS scope. In order to define and activate drives with the autoswitch attribute using HCD, the latest level of the tape UIM CBDUS005 must be installed and active in SYS1.NUCLEUS.
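As an illustration of the coupling facility setup, the following is a minimal sketch of the CFRM policy statements that define the IEFAUTOS structure (run through IXCMIAPU with DATA TYPE(CFRM)) and of the command that activates the policy. The policy name, structure size, and coupling facility names CF01 and CF02 are examples only, and the CF definitions themselves are omitted; size the structure according to the number of tape devices and systems in your configuration.

DATA TYPE(CFRM) REPORT(YES)
DEFINE POLICY NAME(CFRMPOL1) REPLACE(YES)
   STRUCTURE NAME(IEFAUTOS)
             SIZE(1024)
             PREFLIST(CF01,CF02)

The policy would then be activated from one system with:

SETXCF START,POLICY,TYPE=CFRM,POLNAME=CFRMPOL1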

2.13.2.1 Failure of a System in the Sysplex


In the event of a failure of a system with dedicated devices, operator intervention will be required to move the devices to surviving systems. To reduce the need for operator intervention, define as many devices as autoswitchable as possible.

2.13.2.2 IPL of a System in the Sysplex


Most sites define tape devices to be offline at IPL in their HCD IODF. This setting is still recommended for autoswitch devices to avoid pathing problems during IPL. In order to reset the status of autoswitch devices without operator intervention, add the appropriate vary commands to IEACMDxx or COMMNDxx in SYS1.PARMLIB or to your automation program. Note that only a VARY xxx,ONLINE command is required if the devices are defined to HCD with AUTOSWITCH Yes.
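As a sketch, assuming the tape drives are at device numbers 0B80 through 0B87 and are defined in HCD with AUTOSWITCH Yes (both assumptions for illustration only), the COMMNDxx entry could look like the following:

COM='VARY 0B80-0B87,ONLINE'

If the autoswitch attribute were not set in HCD, an additional command such as VARY 0B80-0B87,AUTOSWITCH,ON would also be needed; verify the exact command syntax for your MVS level in MVS/ESA SP V5 System Commands, GC28-1442.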


Refer to 2.18.2, JES3 Sysplex Considerations on page 89 for information on shared tape support in a JES3 environment.

2.14 Exploiting Dynamic Functions


The following sections discuss how the new dynamic functions help in achieving continuous operation goals.

2.14.1 Dynamic Exits


MVS/ESA V5.1 introduces a dynamic exits capability that provides system control of multiple exit routines called for an exit, and allows updates to exit processing without an IPL. The SMF and allocation installation exits exploit this capability. Users can associate multiple exit routines with the SMF and allocation exits and control their use at IPL or while the system is running. Users can also use the dynamic exits capability to define their own exits and control the use of those exits within a program. The new EXIT statement of the PROGxx parmlib member allows you to do the following:

Add exit routines to an exit that has been defined to the dynamic exits facility
Modify or delete exit routines for an exit
Change the attributes of an exit at or after IPL
Undefine an implicitly defined exit

The following operator commands allow you to control the use of dynamic exits and exit routines:

SET PROG=xx specifies the particular PROGxx parmlib member the system is to use.
SETPROG EXIT adds exit routines to an exit, changes the state of an exit routine, deletes an exit routine from an exit, undefines an implicitly defined exit, and changes the attributes of an exit.
DISPLAY PROG,EXIT displays exits that have been defined or have had exit routines associated with them.
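For illustration, a PROGxx EXIT statement and its operator command equivalents might look like the following sketch; the exit routine module name MYSMF83 is a placeholder, while SYS.IEFU83 is the dynamic exit name associated with the SMF IEFU83 exit:

EXIT ADD EXITNAME(SYS.IEFU83) MODNAME(MYSMF83)

The same association can be made, and then displayed, from the console without an IPL:

SETPROG EXIT,ADD,EXITNAME=SYS.IEFU83,MODNAME=MYSMF83
DISPLAY PROG,EXIT,EXITNAME=SYS.IEFU83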

The CSVDYNEX macro allows you to define exits, associate exit routines with those exits, and control the use of exits and exit routines within a program.

Further Reading: For more information regarding the utilization of dynamic exits, refer to MVS/ESA SP V5 Installation Exits, SC28-1459.


2.14.2 Dynamic Subsystem Interface (SSI)


Dynamic SSI allows installations to define and manage subsystems without requiring an IPL. Previously, subsystems could be defined only at IPL through the IEFSSNxx parmlib member. The dynamic SSI support includes a set of authorized system services that subsystems can invoke to do the following:

Define and add a subsystem
Activate a subsystem
Deactivate a subsystem
Swap subsystem functions
Store and retrieve subsystem-specific information
Define subsystem options, which includes deciding the following:
  If a subsystem can respond to dynamic SSI commands
  Under which subsystem a subsystem should be started

Query subsystem information

All of the features of the dynamic SSI support last only for the life of the IPL. If you re-IPL after using any of these authorized system services, you must issue the services again. Dynamic SSI provides the following benefits:

Supports continuous operations by allowing you to add a new subsystem or to upgrade an existing subsystem without an IPL.
Reduces service costs associated with modifying SSI control blocks, by removing the need for subsystems to modify SSI control blocks and by allowing a set of system services to make the necessary changes.

The services that the dynamic SSI support provides are available only to subsystems defined to the SSI in one of the following ways:

Processing the keyword format of the IEFSSNxx parmlib member during IPL
Issuing the IEFSSI macro
Issuing the SETSSI system command
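As a sketch, the following shows the keyword format of an IEFSSNxx entry and the equivalent SETSSI command for dynamically adding the same subsystem; the subsystem name ABCD and initialization routine name ABCDINIT are placeholders:

SUBSYS SUBNAME(ABCD) INITRTN(ABCDINIT)

SETSSI ADD,SUBNAME=ABCD,INITRTN=ABCDINIT

The result can then be checked with the DISPLAY SSI command described below.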

The dynamic SSI services include the following:

IEFSSVT macro, which:
  Creates an SSVT (REQUEST=CREATE)
  Enables additional function codes (REQUEST=ENABLE)
  Disables supported function codes (REQUEST=DISABLE)
  Replaces the function routine associated with a supported function code (REQUEST=CHANGE)

IEFSSVTI macro, which:
  Associates function codes and function routines

IEFSSI macro, which:
  Defines and adds a subsystem (REQUEST=ADD)
  Activates a subsystem (REQUEST=ACTIVATE)
  Deactivates a subsystem (REQUEST=DEACTIVATE)
  Exchanges subsystem functions (REQUEST=SWAP)
  Defines subsystem options (REQUEST=OPTIONS)
  Gets (retrieves) subsystem information (REQUEST=GET)
  Puts (stores) subsystem information (REQUEST=PUT)
  Queries subsystem information (REQUEST=QUERY)

SETSSI command, which:
  Defines and adds a subsystem (SETSSI ADD)
  Activates a subsystem (SETSSI ACTIVATE)
  Deactivates a subsystem (SETSSI DEACTIVATE)

DISPLAY SSI command, which:
  Displays subsystem information (DISPLAY SSI)

SSIDATA IPCS subcommand, which:
  Displays information about subsystems

The dynamic SSI support also introduces the IEFJFRQ installation exit, which provides a way for vendor products and installation applications to examine and modify subsystem function requests. For more information regarding the macros associated with dynamic SSI, refer to MVS/ESA SP V5 Authorized Assembler Services Reference, Volume 2, GC28-1476. For information regarding the use of the SETSSI command, refer to MVS/ESA SP V5 System Commands, GC28-1442.

2.14.3 Dynamic Reconfiguration of XES


The structure alter function available with MVS/ESA V5.2 allows XES to dynamically change the size of a coupling facility structure without disrupting its use by structure connectors. This function permits the installation to continue running its applications while the structure is changed, for example because of growth in application data or varying use of the structure during different periods. For more information on dynamic reconfiguration of coupling facility structures see 5.5, Altering the Size of a Structure on page 123.
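As a hedged illustration, the operator can start an alter with a command of the following form; the structure name and size shown are placeholders, and a structure can only be expanded up to the SIZE specified for it in the active CFRM policy:

SETXCF START,ALTER,STRNAME=LOG_STRUCT1,SIZE=20480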

2.15 Automating Sysplex Failure Management


Complexity in managing failures in a sysplex, from the operator standpoint, is expected to grow exponentially as the number of participating MVS images increases. This has been taken into account when designing MVS Version 5 sysplex management facilities, in that built-in automation has been included to assist the sysplex operator in making timely decisions and to automatically carry out failure management actions. The following are the two major MVS functions in automating sysplex-wide failure management:


The Sysplex Failure Management (SFM) function, which requires all systems in the sysplex to have connectivity to the SFM couple data set, and is in operation when an SFM policy is active.
The Automatic Restart Manager, which also requires all systems in the sysplex to have connectivity to an ARM couple data set, and is in operation when an ARM policy is active.

These two functions are executed in the XCF address space. The purpose of the following sections is to give advice and recommendations on the setup and utilization of these functions. It is expected that the reader also has at hand Setting Up a Sysplex and the PR/SM Planning Guide.

2.15.1 Planning for SFM


SFM allows you to automatically handle the following:

Connectivity failure. Either XCF connectivity failure between XCF members of the sysplex or connectivity failure between structure exploiters and the structures themselves.
System failures. In this sense, system failure means the inability of an MVS image to update its status in the sysplex couple data set for a time interval greater than the INTERVAL value coded in the COUPLExx member used by the first system IPLed in the sysplex. The failing system is then in a missing status update condition. This condition may result from a true system failure, but may also be a temporary situation resulting from other events such as the following:
  An SVC dump is being obtained on the system and is taking longer than the INTERVAL value.
  A spin loop is occurring.
  The system is in a restartable wait state.
  The system is going through reconfiguration.
  Some system has a RESERVE on the volume with the couple data set.
  The operator stopped the system.
  The system is communicating with the operator by means of a branch entry synchronous WTOR macro.

To take these possible situations into account, and to avoid initiating failure management because of a temporarily delayed status update, the parameters specifying the detection intervals have to be tuned. These are the following:

INTERVAL and OPNOTIFY in SYS1.PARMLIB(COUPLExx).
RESETTIME, DEACTTIME and ISOLATETIME in the SFM policy.
Indirectly, SPINTIME, by its default value or the value specified in SYS1.PARMLIB(EXSPATxx), since it dictates the time that can be spent by a processor in a spin loop.

The time detection intervals are discussed in detail in 2.16, Planning the Time Detection Intervals on page 73. Some factors which will influence the use of SFM and the contents of the policy are the following:

Are there logical partitions running MVS in the sysplex, and do we want specific actions when a logical partition fails, such as:
  Partition reset
  Partition deactivation
  Re-attribution of processor storage to a surviving partition

Do we want to automate the initiation of structure rebuild when connectivity is lost to the structure, with options such as:
  Rebuild as soon as one exploiter has lost connectivity to the structure.
  Initiate rebuild only when important connectors have lost connectivity, or initiate rebuild when a certain number of connectors have lost connectivity.

Do we want to automate the partitioning of the sysplex up to the point where the system being partitioned is automatically isolated from the rest of the sysplex, that is, its hardware is prevented from starting new I/O operations and its reserved devices are released.

Although automation is always desirable, and is most probably mandatory in the Sysplex context, there may be cases where automated failure management has to be shut down, for example:

When investigating a problem, where failure management is simply not wanted to occur.
When educating operators, where manual takeover is part of the education.

It is believed that in a normal production environment, most installations will use the MVS Version 5 automated sysplex failure management.

2.15.2 The SFM Isolate Function


Refer to Figure 17.

Figure 17. Isolating a Failing MVS


The coupling facility provides a function called isolate, or fencing, which consists of sending a signal, over the CFC link, to a designated target system. Upon reception of the isolate signal, the target system will:
1. Drain all ongoing I/O operations (this includes operations being performed over the CFC links as well).
2. Freeze its channel subsystem so that no new I/O operation can be initiated.
3. Perform an I/O system reset over its channel interfaces so that reserved devices are released.
4. Finally, go into non-restartable wait state X'0A2'.
The isolate function is intended to fence from the rest of the sysplex a system that is either in a missing status update condition or the target of a VARY XCF,sysname,OFFLINE operator command. However:

Isolation can be executed only through a coupling facility link; therefore systems which do not share connectivity to the same coupling facility cannot request or be the target of the isolate function.
Isolation is initiated only if there is an SFM policy active, and requires in some cases that the proper keywords are set up in the policy, as explained in 2.15.3, SFM Parameters on page 63.
The isolation is performed at the target system strictly by hardware. The isolate signal sent over the link by the coupling facility is directly interpreted by the target channel subsystem hardware, and subsequent actions at the target system are initiated without software involvement.
Isolation is not performed by the target system hardware if the target system is already in a system reset state.

Isolation is performed at the whole CPC level if the target system is in basic mode, or at the logical partition level only (that is, the logical partition in which the target MVS is executing) if the target system is running in PR/SM mode. The only way to exit from an isolated state is to perform a system reset of the CPC or of the logical partition. IPLing the target system will therefore result in exiting from the isolated state.

Note that if the isolate function cannot be performed (because of lack of connectivity with the coupling facility, for instance), the alternative is to go to the target system hardware console and manually invoke the hardware system reset function, in order to stop all activity at the failing system and to release its reserved devices.

2.15.2.1 Operator Request to Vary a System Off the Sysplex


When the VARY XCF,sysname,OFFLINE command is issued by the operator, this command translates into:

An isolation request to the target system if:
  Both the requesting system and the target system have connectivity to the same coupling facility.
  And there is currently an SFM policy active (whatever keywords are set in the policy).

Message IXC102A, indicating that one must go to the target system to manually initiate a system reset, if:
  The above conditions are not met.
  Or the above conditions are met but the requesting MVS has been informed by the coupling facility that the attempt to isolate has been unsuccessful. This may occur because of:
    - A severe hardware malfunction at the target system.
    - Or, too high a volume of I/O operations to be drained for the isolation request to complete in due time.
    - Or, the target system is already reset.

2.15.2.2 Automatic Isolation of a System from the Sysplex


The automatic isolation is available through an SFM policy set up with the keyword ISOLATETIME(xx). ISOLATETIME(xx) indicates how many seconds after detection of the missing status update condition the isolate signal must be sent (see Figure 18 for a description of the timing-related parameters), or how many seconds a system can stand without seeing any activity on an XCF inbound signalling path.

Figure 18 (timeline): MVS A fails to update the sysplex couple data set. After INTERVAL(xx) expires, MVS B declares MVS A in missing status update condition; after a further ISOLATETIME(xx), MVS B sends the order to the coupling facility to fence MVS A. INTERVAL(xx) is coded in the SYS1.PARMLIB COUPLExx member and pertains to all systems in the sysplex; ISOLATETIME(xx) is coded in the active SFM policy and pertains to a specific system.

Figure 18. INTERVAL and ISOLATETIME Relationship

An example of an SFM policy is shown in Figure 19 on page 62. Further details about SFM keywords can be found in Setting Up a Sysplex, GC28-1449.


DEFINE POLICY NAME(POLICY1)
   SYSTEM NAME(*)
      ISOLATETIME(0)
   SYSTEM NAME(MVSA)
      ISOLATETIME(10)

SYSTEM : specifies the definition of a system within the scope of the named SFM policy (POLICY1)

NAME : target MVS system name. * designates all MVS systems in the configuration, except for systems whose names are explicitly indicated with other NAME parameters.

In this example, any MVS in the configuration will be automatically isolated immediately upon detection of its missing status update condition, except for MVSA, which is to be isolated 10 seconds after the detection.

Figure 19. SFM Policy with the ISOLATETIME Parameter

Examples of partitioning sequences are given in Appendix D, Examples of Sysplex Partitioning on page 259. As with the manual invocation of the isolate function through the VARY XCF,sysname,OFF command, the automatic invocation may end up with message IXC102A being issued, indicating that the isolate may have failed. See the recommendations in 2.15.2.3, Recommendations.

2.15.2.3 Recommendations

The recommendation is to use the automatic isolate capability provided by the SFM policy. This provides a good level of built-in automation that can be very helpful to the sysplex operator and is less prone to delay and error when dealing with multiple MVS images. The intent of the ISOLATETIME value is to provide finer control, at the individual system level, over when to actually start isolating once the system is in missing status update condition. It is recommended to tune the INTERVAL parameter in COUPLExx so that all systems can have ISOLATETIME(0), that is, isolate immediately upon the missing status update condition. This makes parameter management and tuning easier. However, there may be some systems with very specific characteristics which would make the INTERVAL parameter too short for them. In these cases, ISOLATETIME can be used to personalize the time interval.


If the requesting MVS gets back to the operator with message IXC102A, implying that the isolate may have failed, it is recommended that you examine the SYS1.LOGREC hardware and software records written during isolation to help determine why the isolation did not complete automatically.

It is recommended that you not automate responses to IXC102A if the sysplex is running with an active SFM policy with ISOLATE. An operator intervention is required to prevent exposure to sysplex integrity problems.

2.15.3 SFM Parameters


The specifications described by the SFM policy keywords can relate to the whole sysplex or to only one system within the sysplex. See Table 4.
Table 4. Summary of SFM Keywords and Parameters

CONNFAIL(YES|NO)
  Relates to: sysplex.
  Keyword. Used to drive built-in recovery for loss of connectivity. Used with the WEIGHT parameter, and with the CFRM REBUILDPERCENT parameter.

SYSTEM
  Relates to: sysplex.
  Keyword. Indicates the beginning of one system specification and is followed by the keywords NAME, WEIGHT, DEACTTIME, RESETTIME, ISOLATETIME, PROMPT.

NAME(sysname|*)
  Relates to: a system.
  Indicates to which MVS sysname the specifications pertain. A value of NAME(*) means any system in the sysplex, when not specifically defined by NAME(sysname) in the policy.

WEIGHT(value)
  Relates to: a system.
  Gives the relative importance of the system in the sysplex. Used to make automatic decisions on sysplex partitioning or structure rebuild (when REBUILDPERCENT is specified in the CFRM policy) when XCF connectivity or structure connectivity is lost.

DEACTTIME(value)
  Relates to: a system.
  Used when the specified MVS is running in a logical partition and automatic deactivation of the logical partition is wanted when the system fails to update its status.

RESETTIME(value)
  Relates to: a system.
  Used when the specified MVS is running in a logical partition and automatic system reset of the logical partition is wanted when the system fails to update its status.

ISOLATETIME(value)
  Relates to: a system.
  Used when the specified MVS is to be automatically isolated in case of failure to update its status.

PROMPT
  Relates to: a system.
  SFM will prompt the operator without engaging automatic actions when the system fails to update its status. PROMPT is the default when the DEACTTIME, RESETTIME, and ISOLATETIME parameters are not specified.

RECONFIG
  Relates to: a set of systems.
  Keyword. Indicates the beginning of a reconfiguration specification and is followed by the parameters FAILSYS, ACTSYS, TARGETSYS, STORE, ESTORE. Used to drive PR/SM in automatic logical partition storage reconfiguration when a system fails.

FAILSYS(sysname)
  Relates to: a system.
  This is the MVS to be monitored for failure. It can be executing in either LPAR mode or in basic mode.

ACTSYS(sysname)
  Relates to: a system.
  This is the MVS which is going to acquire storage resources from the TARGETSYS system. ACTSYS and TARGETSYS must be executing in LPAR mode on the same physical CPC.

TARGETSYS(sysname|ALL)
  Relates to: a system.
  This is the MVS whose logical partition is to be deactivated when FAILSYS fails. TARGETSYS processor storage is then acquired by ACTSYS. FAILSYS and TARGETSYS can be different systems on different CPCs. ALL indicates that all logical partitions in the logical addressing range of ACTSYS must be deactivated when FAILSYS fails.

STORE(NO|YES)
  Relates to: a system.
  Indicates that ACTSYS is to acquire central storage from the deactivated TARGETSYS. The storage to be acquired must have been defined as reserved to the ACTSYS logical partition.

ESTORE(NO|YES)
  Relates to: a system.
  Indicates that ACTSYS is to acquire expanded storage from the deactivated TARGETSYS. The storage to be acquired must have been defined as reserved to the ACTSYS logical partition.

2.15.3.1 SFM Parameters in Basic or PR/SM Environment


SFM is designed to initiate actions directed towards MVS images running either in basic mode or in logical partitions in LPAR mode, and it can also initiate actions directly at the logical partitions themselves, as per the keywords and parameters shown in Table 4 on page 63.

This paragraph describes the SFM actions that can be designed to operate independently of the type of environment.

Planning for Automatically Partitioning the Sysplex: Automatic partitioning of the sysplex is intended to vary off from the sysplex, without requiring operator intervention, an MVS image which is either in missing status update condition, or which has lost connectivity to other MVS image(s). To automatically initiate partitioning of the sysplex, the following keywords and parameters must have been set up in the policy:


CONNFAIL(YES|NO)
This keyword must be set to CONNFAIL(YES), which is also the default value, to allow SFM to automatically initiate actions when XCF connectivity fails. Having CONNFAIL(NO) will result in the operator being prompted without initiating automatic action.

WEIGHT(value)
This parameter gives SFM some guidance on how to automatically partition the sysplex. When an XCF connectivity failure is detected between two systems in the sysplex, SFM must choose which one to exclude from the sysplex (assuming that both are known to be still working). By giving a WEIGHT value to each one of the MVS images in the sysplex, SFM chooses the final sysplex configuration which yields the highest sum of WEIGHTs after removing one system. This can be seen as a way to preserve the most important MVS images when the sysplex has to be partitioned, or conversely to choose to partition the less important images off the sysplex to get around the XCF connectivity problem.

As an example, assume there is a sysplex with three participating MVS systems: MVS A, MVS B and MVS C. MVS A has WEIGHT(10), MVS B has WEIGHT(10) and MVS C has WEIGHT(30). Assuming that there is an XCF connectivity failure between MVS B and MVS C, sysplex operations can be carried on with the images still sharing XCF connectivity. The alternative is then to continue with MVS A and MVS B (total WEIGHT=20) or MVS A and MVS C (total WEIGHT=30). The latter configuration will be kept; that is, MVS B will be varied off the sysplex. Weights can be attributed, as an example, on the basis of any of the following:

ITRs of the systems in the sysplex
Configuration dependences, such as a unique feature or I/O connected to only one system in the sysplex.

Weight can have a value from 1 to 9999. Specifying no weight is the same as specifying WEIGHT(1). That is, if there are no WEIGHTs in the policy, every system is given the same importance when it comes to partitioning.
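To illustrate, a minimal sketch of an SFM policy implementing the weights from the example above could be installed with IXCMIAPU (DATA TYPE(SFM)); the policy name is a placeholder, the system names MVSA, MVSB and MVSC correspond to MVS A, MVS B and MVS C above, and the ISOLATETIME values are shown only for completeness:

DEFINE POLICY NAME(SFMWGT1) CONNFAIL(YES) REPLACE(YES)
   SYSTEM NAME(MVSA) WEIGHT(10) ISOLATETIME(0)
   SYSTEM NAME(MVSB) WEIGHT(10) ISOLATETIME(0)
   SYSTEM NAME(MVSC) WEIGHT(30) ISOLATETIME(0)

The policy would then be activated with SETXCF START,POLICY,TYPE=SFM,POLNAME=SFMWGT1, as described in 2.15.4, SFM Activation.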

Planning for Automatically Rebuilding Structures: This pertains to rebuilding a structure because of a loss of connectivity between the exploiter and the structure, that is, a problem affecting the coupling technology in either one of the sysplex MVS images or in the coupling facility itself.
Rebuilding a structure upon a loss of connectivity is the structure exploiter's decision. Some exploiters decide to rebuild the structure as soon as one single exploiter instance has lost connectivity to the structure; others will listen to the XES recommendation. This recommendation is passed to the structure's exploiter and indicates either that the subsystem should disconnect from the structure or that the structure is being rebuilt. More details on structure rebuild can be found in Chapter 5, Coupling Facility Changes on page 117. XES makes the recommendation on the basis of what has been set up in the active SFM policy:


If CONNFAIL(NO), the MVS recommendation is always to disconnect.
If CONNFAIL(YES), the recommendation will depend on the WEIGHTs given to the MVS images and on the REBUILDPERCENT value given to the affected structure in the active CFRM policy.

As an example, suppose that a structure has been given a REBUILDPERCENT of 50% in the current CFRM policy. Assume that an exploiter of the structure is running in MVSA and an exploiter of the structure is running in MVSB. Also assume that MVSA is given a WEIGHT of 30 in the active SFM policy and MVSB is given a WEIGHT of 90. As these are the only two MVS images in the sysplex, the total sysplex weight is 30 + 90 = 120. The REBUILDPERCENT indicates that MVS is to start rebuilding the structure if the total WEIGHT of the systems with loss of connectivity to the structure is greater than 50% of the total WEIGHT of the sysplex, that is, greater than 60. If MVSA loses connectivity to the structure, XES recommends that the exploiters disconnect from the structure (30/120 = 25%). If MVSB loses connectivity to the structure, XES starts rebuilding the structure (90/120 = 75%). If the specific exploiter code has been designed not to rebuild in that case, it will stop the XES-initiated rebuild.

Planning additional structure space for rebuild: Rebuilding a structure implies temporarily having two copies of the structure in terms of coupling facility space occupancy. Proper consideration must be given to what additional space has to be planned for in the coupling facility.

Default values for WEIGHT and REBUILDPERCENT: The WEIGHT default is (1), and the REBUILDPERCENT default is (100). Therefore, if CONNFAIL is not specified (the default is CONNFAIL(YES)) and neither WEIGHT nor REBUILDPERCENT is specified for any MVS image or structure, then MVS will initiate a structure rebuild only if all currently connected exploiters have lost connectivity to the structure.
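As a sketch of how such a threshold would be specified, the structure definition in the CFRM policy can carry a REBUILDPERCENT value; the structure name, size, and preference list below are placeholders, and the CF definitions for CF01 and CF02 are omitted:

STRUCTURE NAME(EXAMPLE1)
          SIZE(10240)
          REBUILDPERCENT(50)
          PREFLIST(CF01,CF02)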

2.15.3.2 SFM Parameters in a PR/SM Environment


This next section discusses aspects of SFM unique to the PR/SM environment.

Compatibility with XCF PR/SM Policy: MVS Version 5 still supports the XCFPOLxx specifications, but they are mutually exclusive with the usage of an SFM policy, that is:

If an SFM policy is active, XCFPOLxx specifications are discarded.
If the sysplex includes MVS Version 4 image(s), then the only possible failure management policy to be activated sysplex-wide is the XCFPOLxx member, which only pertains to the failure management of a logical partition.

The consequence is that previous XCFPOLxx members, if any, will have to be rewritten as SFM policies to take full advantage of the automated recovery in MVS Version 5.


Timing Relationships of SFM Actions: The actions that can be taken automatically under SFM control against logical partitions are adjusted to occur a certain amount of time after a system is declared to be in a missing status update condition. A summary of the timing relationships for SFM actions is shown in Figure 20 on page 67.
Figure 20 (timeline): MVS A fails to update the sysplex couple data set. After INTERVAL(ss) expires, MVS x declares MVS A in missing status update condition; after OPNOTIFY(ss) expires, message IXC402D is issued (see Note). After RESETTIME(ss), DEACTTIME(ss) or ISOLATETIME(ss), an MVS image sharing the physical CPC with MVS A, and with the proper authority, can either system reset or deactivate the MVS A LPAR, and any MVS sharing connectivity to the same coupling facility as MVS A can isolate MVS A. INTERVAL(ss) and OPNOTIFY(ss) are coded in the SYS1.PARMLIB COUPLExx member. RESETTIME(ss), DEACTTIME(ss) and ISOLATETIME(ss) are coded in the SFM policy and are mutually exclusive. Note: if SFM is active and ISOLATETIME is specified for a system, then OPNOTIFY is nullified for this system.

Figure 20. SFM LPARs Actions Timings

Planning to Automatically Reset a Failing Logical Partition


RESETTIME(nostatus_interval)
This is the interval of time between the missing status update condition and the initiation of the system reset at the failing partition.
Notes:
1. RESETTIME cannot be specified for the same system with DEACTTIME or ISOLATETIME.
2. If the failing system resumes its status update before the RESETTIME interval has expired, the system reset function is not performed.
3. At least one logical partition activated in the same physical CPC must have Cross Partition Authority set in its definition and must be running an operational member of the sysplex. The Cross Partition Authority is set in the LPDEF frame on 9021 systems and in the image profile on 9672 systems. It permits the logical partition to reset another logical partition.


RESETTIME is intended to provide adjustable and personalized timing on top of the INTERVAL value in COUPLExx. The value to be given as nostatus_interval can be one of the following:

0
  This indicates that the failing partition should be reset as soon as INTERVAL expires.
Any other value from 1 to 86400 seconds
  This can be chosen because of any known peculiarity in the related system that can justify adding time to INTERVAL. For example, INTERVAL may have been set up for other members of the sysplex running in basic mode, and this member, running in a logical partition, may therefore need additional time to update its status.

Planning to Automatically Deactivate a Failing Partition


DEACTTIME(nostatus_interval)
This is the elapsed time between the missing status update condition and the moment when the failing partition is to be deactivated. The purpose of deactivating a logical partition is to free the physical CPC resources it had allocated.
Notes:
1. DEACTTIME cannot be specified for the same system with RESETTIME or ISOLATETIME.
2. If the failing system resumes its status update before the DEACTTIME interval has expired, the deactivate function is not performed.
3. At least one logical partition activated in the same physical CPC must have Cross Partition Authority set in its definition and must be running an operational member of the sysplex. The Cross Partition Authority is set in the LPDEF frame on 9021 systems and in the image profile on 9672 systems. It permits the logical partition to deactivate another logical partition.
DEACTTIME is intended to provide adjustable and personalized timing on top of the INTERVAL value in COUPLExx. The value to be given as nostatus_interval can be one of the following:

0
  This indicates that the failing partition should be deactivated as soon as INTERVAL expires.
Any other value from 1 to 86400 seconds
  This can be chosen because of any known peculiarity in the related system that can justify adding time to INTERVAL. For example, INTERVAL may have been set up for other members of the sysplex running in basic mode, and this member, running in a logical partition, may therefore need additional time to update its status.

Planning to Automatically Isolate a Logical Partition


ISOLATETIME(nostatus_interval)
This is the elapsed time between the missing status update condition and the moment when the failing partition is to be isolated. Further information on the isolation function can be found in 2.15.2, The SFM Isolate Function on page 59. If the failing system resumes its status update before the ISOLATETIME interval has expired, the isolate function is not performed.


ISOLATETIME is intended to provide adjustable and personalized timing on top of the INTERVAL value in COUPLExx. The value to be given as nostatus_interval can be one of the following:

0
  This indicates that the failing partition should be isolated as soon as INTERVAL expires.
Any other value from 1 to 86400 seconds
  This can be chosen because of any known peculiarity in the related system that can justify adding time to INTERVAL. For example, INTERVAL may have been set up for other members of the sysplex running in basic mode, and this member, running in a logical partition, may therefore need additional time to update its status.

Planning to Automatically Acquire Processor Storage from a Logical Partition: PR/SM allows an MVS image running in a logical partition to dynamically acquire processor storage (central and/or expanded storage) from a logical partition defined on the same physical CPC.
Proper usage of this facility assumes that:
1. The giving logical partition (TARGETSYS, in the policy) is either the normal production logical partition or a logical partition to be sacrificed, which will be deactivated and its storage acquired by a backup partition (ACTSYS, in the policy).
2. The processor storage for the receiving logical partition must have been defined with a reserved part which overlaps the giving logical partition's processor storage.
3. The receiving logical partition has proper authority in PR/SM to acquire resources from another logical partition. This is the Cross Partition Authority, set in the LPDEF frame on 9021 systems and in the image profile on 9672 systems.
4. The backup partition is to take over the workload of the failing logical partition. However, it is up to the software running in the sysplex to manage transferring the workload from the failing partition to the backup one. All SFM and PR/SM will do is reallocate the physical CPC resources.
The storage being acquired is brought online to the receiving system by an automatically issued CF STOR|ESTOR(E=1),ONLINE command. Proper consideration must be given to the receiving system's RSU value if the acquired central storage is to be dynamically released later on.
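A minimal sketch of the corresponding RECONFIG statement in the SFM policy is shown below; the system names are placeholders, assuming MVSP is the monitored production image whose partition is to be deactivated and MVSB is the backup image in a logical partition on the same physical CPC:

RECONFIG FAILSYS(MVSP)
         ACTSYS(MVSB)
         TARGETSYS(MVSP)
         STORE(YES)
         ESTORE(YES)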

2.15.4 SFM Activation


Sysplex Failure Management (SFM) in MVS is driven by the contents of the active SFM policy. Having no SFM policy active means that the automated recovery capability of MVS SFM is not enabled, and operator intervention will always be required to handle situations such as partitioning a failing MVS image off the sysplex, or moving structures because of a severe connectivity failure.


2.15.4.1 Manipulating SFM Policies


This section deals with the activities you can perform with the SFM policies.

Installation and Activation: There can be as many as 50 SFM policies set up in the SFM couple data set, of which only one can be active.
An SFM policy is installed in the SFM couple data set with the IXCMIAPU Administrative Data Utility program. Remember that if you wish to know what the contents of a policy are, you either have to look into the IXCMIAPU JCL you used to install the policy, or you can run the Administrative Data Utility with the DATA TYPE(SFM) and REPORT(YES) keywords. This results in listing only the couple data set characteristics and the policies currently installed in the SFM couple data set.

Activating an SFM policy is an operator-initiated operation using the SETXCF START,POLICY,TYPE=SFM,POLNAME=polname command. Once the command is issued, the new active policy is immediately in operation, unless recovery actions are already in progress under control of the previous policy. Proper messages are then issued to let the operator know about the delay in activating the new policy and the reason for it.

SETXCF START,POL,TYPE=SFM,POLNAME=SFM1
IXC602I SFM POLICY SFM1 INDICATES FOR SYSTEM SG1 A STATUS UPDATE MISSING ACTION OF PROMPT AND AN INTERVAL OF 15 SECONDS. THE ACTION IS THE SYSTEM DEFAULT.
IXC609I SFM POLICY SFM1 INDICATES FOR SYSTEM SG1 A SYSTEM WEIGHT OF 75 SPECIFIED BY SPECIFIC POLICY ENTRY
IXC601I SFM POLICY SFM1 HAS BEEN STARTED BY SYSTEM SG1

SFM activation: An active SFM policy remains active across IPLs. However, when an MVS image IPLs, the SFM function is not available to this image until after NIP completion.

Controlling Which Policy Is Currently Active: The operator can display the name of the current SFM policy:

D XCF,POL,TYPE=SFM
IXC364I 19.01.34 DISPLAY XCF 407
TYPE: SFM
POLNAME: SFMPOL01
STARTED: 10/11/95 12:49:50   LAST UPDATED: 06/02/95 09:47:22
SYSPLEX FAILURE MANAGEMENT IS ACTIVE


2.15.4.2 Changing the SFM Couple Data Set Definitions


The couple data set Format Utility (IXCL1DSU) will allocate space to the SFM couple data set based on inputs such as the following:

Maximum planned number of policies to install in the couple data set.
Maximum planned number of systems to be characterized by the SYSTEM keyword in a policy.
Maximum planned number of RECONFIG actions to be described in a policy.

Should the size of the SFM couple data set turn out to be wrong, the following procedure can be used to dynamically bring online a new couple data set with the appropriate size. Note that this procedure works only for increasing the size of the couple data set.

To Decrease the Size of a Couple Data Set: Decreasing the size of a couple data set cannot be done non-disruptively; an alternate couple data set smaller than the primary couple data set cannot be brought online concurrently. You must prepare the new couple data set and IPL the sysplex using this new couple data set.

1. Run IXCL1DSU against a spare couple data set with the new couple data set specifications.
2. When the spare couple data set is formatted, use the command SETXCF COUPLE,ACOUPLE=(spare_dsname,spare_volume),TYPE=SFM to make the spare couple data set the new alternate SFM couple data set. Note: As soon as the spare couple data set has been switched into alternate, the new alternate couple data set will be loaded with the primary couple data set policy's contents.
3. Then switch the new alternate to new primary couple data set using the SETXCF COUPLE,TYPE=SFM,PSWITCH command.
4. The previous primary couple data set is no longer in use, and can be enlarged by the same process.

Keeping COUPLExx in Synch: It is recommended that the COUPLExx member be updated after swapping the couple data sets, so that an operator intervention to retrieve the last used couple data sets is not required at the next IPL.
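As a sketch of step 1, the format utility step could look like the following (add a JOB statement as in Figure 21); the sysplex name, data set name, volume and ITEM counts are placeholders to be adjusted to your planned maximums:

//FMTSFM   EXEC PGM=IXCL1DSU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DEFINEDS SYSPLEX(SYSPL01)
           DSN(SYS1.XCF.SFM02) VOLSER(XCFVL2)
           DATA TYPE(SFM)
                ITEM NAME(POLICY) NUMBER(10)
                ITEM NAME(SYSTEM) NUMBER(16)
                ITEM NAME(RECONFIG) NUMBER(4)
/*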

2.15.4.3 Updating the Active SFM Policy


The recommended way to update a policy is to create a new policy by merging the previous policy contents and the intended changes and to give the policy a new name. Then activate this new policy with the following command:

SETXCF START,POL,TYPE=SFM,POLNAME=new_polname.
This allows you to explicitly track all changes made to the SFM policy with the policy name (a policy name can be up to eight characters long).


Should the SFM couple data set already contain the maximum allowed number of policies, one of them can be deleted to make room by using the JCL in Figure 21 on page 72.

//DELSFM   JOB (999,POK),'L06R',CLASS=A,REGION=4096K,
//         MSGCLASS=T,TIME=10,MSGLEVEL=(1,1),NOTIFY=&SYSUID
//******************************************************************
//*  JCL TO DELETE A SFM POLICY
//*
//******************************************************************
//STEP1    EXEC PGM=IXCMIAPU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DATA TYPE(SFM) REPORT(YES)
  DELETE POLICY NAME(target_pol)
Figure 21. Sample JCL to Delete a SFM Policy

However, if for some reason the new version of the policy must keep the same name, updates can be made dynamically to the active SFM policy by doing the following:
1. Run the Administrative Data Utility IXCMIAPU against the currently active policy (with REPLACE(YES)). This does not disrupt system operations, since the active policy is in fact a duplicate of the policy residing on the couple data set; IXCMIAPU updates the administrative copy of the policy in the couple data set while the active copy is left unchanged.
2. Activate the same policy by typing the following command:

SETXCF START,POL,TYPE=SFM,POLNAME=same_name
This will refresh the active copy with the just modified administrative copy.

2.15.5 Stopping SFM


To stop SFM, one has to stop the active SFM policy with the command:

SETXCF STOP,POLICY,TYPE=SFM
Once the command is issued the policy is immediately stopped unless recovery actions are already in progress under control of the previous policy. Proper messages are then issued to let the operator know about the delay in stopping the policy and the reason for it.

2.15.6 SFM Utilization


It is up to the sysplex user to decide whether an SFM policy should be used or not. Our recommendation is to have one policy active to assist the operator in the task of managing a set of MVS images in a timely manner. However, it is also important to understand by which means an SFM policy provides the expected services:

All functions pertaining to logical partition reconfiguration will be accomplished when the monitored system (FAILSYS(sysname) in the policy) fails, only if the system designated to initiate the recovery action (ACTSYS(sysname)) is still up and running.


The isolate function can be manually invoked only if there is an SFM policy active (see 2.15.2, The SFM Isolate Function on page 59). It is automatically invoked by using the ISOLATETIME(value) keyword in the policy, provided there is at least one active system in the sysplex sharing coupling facility connectivity with the system to be isolated.
Using the WEIGHT(value) parameter in the SFM policy along with the REBUILDPERCENT parameter in the CFRM policy only provides a recommendation on whether or not to rebuild a structure in case of loss of connectivity. MVS delivers the recommendation to the structure's exploiters which, in turn, decide whether to follow the recommendation or not. This is discussed for each one of the IBM structure exploiters in 9.2, Coupling Facility Failure Recovery on page 180.

2.16 Planning the Time Detection Intervals


Because of the cumulative effect of time spent in each one of the sysplex MVS images, some time detection intervals must be planned with great care. These are the following:

The SPINTIME value. The default or the value set in the EXSPATxx member, if any, and activated by SET EXS xx.
The INTERVAL value. The default or the value explicitly set in COUPLExx.
The TOLINT value. The default value or the value explicitly set in GRSCNFxx.

Equally important is to understand the relationship between the time intervals these parameters represent. This is shown in Figure 22 on page 74.


Figure 22 (two timelines): In the XCF timeline, MVS A makes its last status update in the sysplex couple data set, then goes through two SPINTIME spin cycles followed by the TERM and ACR recovery actions before it updates its status again; INTERVAL(ss) in COUPLExx must cover this period before any other MVS declares MVS A in missing status update. The recommendation is: INTERVAL = 2 * SPINTIME + 5. In the GRS timeline, TOLINT(ss) in GRSCNFxx is the time MVS A allows for the RSA to travel around the ring (from MVS A to MVS B, MVS C, MVS D and back to MVS A) before MVS A disrupts the ring because the RSA was not received. The recommendation is: TOLINT = 180 sec.

Figure 22. Figure to Show Timing Relationships

2.16.1.1 INTERVAL in COUPLExx


This document assumes that the fix for OW11965 is installed. OW11965 changes the default XCF failure detection INTERVAL timeout values. The XCF failure detection interval must be set to a time value relative to how long an MVS image is allowed to appear dormant, that is, not updating its status in the sysplex couple data set, to other systems in the sysplex. An acceptable time that an MVS image may appear dormant (not update its status in the couple data set) before XCF decides that the instance of MVS is inoperative must be determined by each installation. Probably the longest amount of time that MVS is alive but appears dormant to other systems in the sysplex is when it is experiencing spin loops. Spin loops are often a recoverable condition which may take a long time to resolve, depending on what is specified or defaulted to in SPINTIME and SPINRCVY. IBM recommends that XCF's failure detection interval timeout value be based on MVS's ability to recover from a spin loop. For most installations, the default settings for SPINTIME and INTERVAL will be acceptable. However, in the case where MVS is running in a shared logical


partition (LP) or under VM, some customers may want to detect a dormant MVS image earlier than the default INTERVAL timeout value in order to expedite the dormant system's removal from the sysplex. The default INTERVAL values are:

After APAR OW11965:
  25 seconds when MVS is running on native hardware or in a dedicated logical partition.
  85 seconds when MVS is running in a shared logical partition or under VM.

Note: the default values for INTERVAL are in fact:

INTERVAL_default = SPINTIME_default * 2 + 5
See 2.16.1.3, SPINTIME in EXSPATxx on page 77 for a discussion of SPINTIME. INTERVAL can be adjusted by specifying an INTERVAL(value) in COUPLExx, where value equals 3 to 86400 seconds. The recommendations are summarized here:

Decreasing INTERVAL below 25 seconds is not recommended.
If you run with default SPINTIME and SPINRCVY settings, allow XCF to select a default setting. In this case it is strongly recommended that the fix for OW11965 be installed.
As a rule, try to keep INTERVAL set to:

5 + (2 * SPINTIME)
For example, if SPINTIME = 20 and SPINRCVY = TERM,ACR, set INTERVAL to 45 seconds. This results from MVS taking two spin cycles before escalating to an action specified in SPINRCVY, which could be SPIN again, ABEND or TERM (that is ABEND without retry). The extra five seconds is to allow the recovering MVS enough time to catch up and update its status in the sysplex couple data set. Because of the way XCF works internally, the five second catch up time is considered to be more than adequate for MVS to catch up and update the sysplex couple data sets. However, if there is I/O contention on the sysplex couple data sets, additional time may be needed to perform the status update. The rationale for the MVS recommendation to set INTERVAL to five seconds beyond the time it would take to reach the ABEND (or TERMinate) action during an excessive spin condition is this:

Most spin loops, if resolvable, will be resolved by the ABEND or TERM action. Hence, in most cases, there will be no need for a third occurrence of SPINTIME.
Setting the XCF failure detection interval to 5 + (2 * SPINTIME) is a compromise between giving MVS enough time to recover from an excessive spin condition and removing a failed MVS from the sysplex as quickly as possible.

Each installation must decide the relative importance between:


Allowing MVS sufficient time to recover from conditions where the MVS image appears dormant to other systems in the sysplex.
The expeditious removal of an MVS system from the sysplex in cases where the MVS image appears dormant to other systems in the sysplex.

If you choose an INTERVAL that is too high (say three minutes), and the MVS image fails while holding a critical resource, the surviving systems may have to wait for the resource to be freed before continuing. It might appear as if the entire sysplex hangs for three minutes or more until the failed MVS is partitioned out of the sysplex. If you choose an INTERVAL that is too low (say 15 seconds), and the MVS image is dormant but alive, a missing status update condition can occur causing the system to be erroneously removed from the sysplex. Erroneous partitioning of a healthy system is more probable when you have set INTERVAL too low and have activated a sysplex Failure Management (SFM) policy using a low ISOLATETIME. If the interval you select represents an unacceptable amount of time to wait for an MVS image to respond, consider increasing the amount of CPU resource given to the MVS image. This will reduce the amount of time needed to resolve a spin loop which thereby reduces the recommended failure detection interval timeout value.

2.16.1.2 OPNOTIFY in COUPLExx


OPNOTIFY specifies the amount of elapsed time after which XCF on another system notifies the operator, by means of message IXC402D, that the system has not updated its status. This value must be greater than or equal to the value specified on the INTERVAL keyword.
OPNOTIFY and ISOLATETIME: If there is an SFM policy active with ISOLATETIME specified for a system, then OPNOTIFY will not issue a message for this system.

The default value for OPNOTIFY is:

OPNOTIFY = 3 + value of INTERVAL, in seconds.


OPNOTIFY can be given an explicit value by coding OPNOTIFY(value) in COUPLExx, where value can be from 3 to 86400 seconds. As an example:

SPINTIME = 15 seconds
SPINRCVY = TERM,ACR
INTERVAL = 35 seconds
OPNOTIFY = 35 seconds

In this example, once the status update missing condition occurs, the operator is notified without delay via message IXC402D. Automating a response to IXC402D is NOT recommended; doing so may compromise sysplex integrity.
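As a sketch only, the values in this example could be coded on the COUPLE statement of the COUPLExx member roughly as follows; the sysplex name and couple data set names are hypothetical:

COUPLE SYSPLEX(PLEX01)
       PCOUPLE(SYS1.XCF.CDS01)
       ACOUPLE(SYS1.XCF.CDS02)
       INTERVAL(35)
       OPNOTIFY(35)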


2.16.1.3 SPINTIME in EXSPATxx


The EXSPATxx member is peculiar in that it is not inspected at IPL time; the spin recovery parameters are always given the default values at IPL. To make the EXSPATxx parameters active, the operator (or a command in the COMMNDxx member) has to issue the command:

SET EXS xx
SET EXS xx does not have sysplex scope; it must therefore be issued on every system where the change is to take effect.

For MVS running on native hardware or in a dedicated LP, use the default SPINTIME of ten seconds. For MVS running in a shared LP or under VM, the default SPINTIME is 40 seconds and the default XCF INTERVAL is 85 seconds. If the default values are acceptable, skip to SPINRCVY below. If you want to decrease INTERVAL below 85 seconds, SPINTIME must also be adjusted below the default of 40 seconds. The minimum recommended value for SPINTIME can be calculated based on the amount of CPU resource given to the MVS image.

The general rules are the following:

The higher the amount of real CPU resource available to the MVS image, the lower the amount of time needed to resolve spin loops. Hence, SPINTIME may be set to a lower value. For example, if you have a logical partition with engines that are receiving 95% of the REAL CPU resource you can set SPINTIME to, say, 12 seconds instead of using the default of 40 seconds.

Likewise, the lower the amount of real CPU resource available to the MVS image, the higher the amount of time needed to resolve spin loops. Hence, SPINTIME must be set to a higher value. For example, if you have a logical partition with engines that are receiving 10% of the REAL CPU resource, you should set SPINTIME to 40 seconds (the default for shared LPs). Specifying a SPINTIME that is too low may cause premature excessive spin conditions. When an excessive spin condition occurs, MVS will select a first action of SPIN. Along with the first action of SPIN, MVS will also write an ABEND071-10 logrec entry and issue message IEE178I to the hardcopy log. ABEND071-10 is non-disruptive. Message IEE178I can be automated to inform the system programmer when an excessive spin condition occurs. If you see repetitive ABEND071-10 logrec entries or IEE178I messages AND the spin loops recover without escalating through the excessive spin recovery actions, you are probably experiencing premature excessive spin conditions. To remedy this condition, increase SPINTIME.

To Compute Minimum SPINTIME

See the BLWSPINR member of SYS1.SAMPLIB for additional information on how to calculate the minimum SPINTIME for MVS images running in shared LPs. BLWSPINR shipped in UW18884 (MVS 5.1) and UW18885 (MVS 5.2).
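As a minimal sketch, an EXSPATxx member corresponding to the SPINTIME and SPINRCVY values used in the earlier example might contain statements of the following form; the exact member syntax should be verified against the initialization and tuning documentation for your MVS level:

SPINTIME=15
SPINRCVY=TERM,ACR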


More details on the spin recovery parameters and the excessive spin condition are given in Appendix E, Spin Loop Recovery on page 263.

SPINRCVY OPER Is Not Recommended
A spin recovery action of OPER can be set in EXSPATxx, which means that MVS is to prompt the operator before taking any further action. In a sysplex, specifying a SPINRCVY action of OPER is not recommended because an operator may not respond quickly enough to prevent the remaining systems in the sysplex from partitioning the ailing system out of the sysplex.

2.16.1.4 TOLINT in GRSCNFxx


The RSA, as it passes through all the active systems in the sysplex, accumulates the delays incurred in each individual system. Choosing the GRS toleration interval is therefore a balance: GRS should neither disrupt the ring too hastily nor wait too long before disrupting it.

Care is needed to determine an acceptable timeout value for TOLINT because of the number of variables involved, such as the following:

The excessive spin time and recovery actions for each system in the sysplex / ring
Number of systems in the sysplex / ring
Speed of systems in the sysplex / ring
Inter-system signalling configuration and activity
Paging of GRS common area storage for each system in the sysplex / ring
RESMIL time for each system in the sysplex / ring

Typically, the RSA should proceed quickly around the ring. However, the RSA may be delayed significantly in cases where the following occurs:

An MVS image is recovering from a spin loop
An MVS image is taking an SVC dump
There are delays in inter-system communications
There are real storage shortages
There are auxiliary storage page-in delays

IBM recommends that TOLINT be set to the default value of 180 seconds. This value may be raised or lowered depending on the installation. Keeping TOLINT set to a relatively high timeout value (180 seconds) will prevent premature GRS ring disruptions in cases where there are unexpected but recoverable delays involving systems that participate in the GRS ring. If a system fails and is partitioned out of the sysplex, GRS is notified of this condition and will commence a ring rebuild operation without waiting for the TOLINT timeout value to expire.

At the time of writing, the toleration timeout interval that GRS uses to limit RSA travel time is the lesser of TOLINT (specified in GRSCNFxx) and INTERVAL (specified in COUPLExx). APAR OW11016 was taken to honor TOLINT as specified in GRSCNFxx. Until a fix for OW11016 is applied, you need to inflate the INTERVAL setting to achieve the desired TOLINT timeout value. In MVS 5.1.0 the default storage isolation for GRS was removed. Without storage isolation, excessive paging can occur in the GRS address space resulting in GRS


ring disruptions. APAR OW12444 was taken to restore storage isolation to GRS's working set in WLM compatibility mode. When MVS is running in a shared logical partition or under VM, consider the effect that the MVS image will have on the rest of the sysplex. If the image is CPU or storage constrained, it can have a degrading effect on functions that perform inter-system communication (such as GRS, XCF, and console communications). Systems that reside in a sysplex must have sufficient resources to ensure that sysplex performance is not adversely affected. An example of a constrained MVS image is one whose logical CPUs receive only 5% of the REAL CPU resource.

2.16.2 Synchronous WTO(R)


It is essential in a parallel sysplex environment to minimize the potential for synchronous WTO (previously known as DCCF) messages. Note that even though XCF provides the mechanism for displaying standard messages from all systems in the sysplex on a single console, synchronous WTO(R)s can only be displayed on a console physically attached to the issuing system. If there is no physically attached console, the only other place the message can be displayed is on the hardware console, the HMC in the case of the 9672 or the hardware system console in the case of the ES/9000. Few operators will respond to the message issued at the hardware console in a timely manner, and a sysplex disruption is the likely result. In MVS/ESA SP Version 5.2, many recovery situations have been automated through facilities such as Sysplex Failure Management, but there are still some recovery situations which can be controlled through PARMLIB member settings. Check for the following:

IECIOSxx This member contains the Missing Interrupt Handler (MIH) settings as well as the Hot I/O recovery options. Ensure that the Hot I/O recovery options do not include OPER. This specifies to MVS to request the desired recovery action from the operator through a synchronous WTO(R) when a hot I/O condition occurs.

EXSPATxx This member contains the spin loop recovery options. Ensure that OPER is not included. This specifies to MVS to request the desired recovery action from the operator through a synchronous WTO(R) when a spin loop occurs.

2.17 ARM: MVS Automatic Restart Manager


The purpose of automatic restart management (ARM) is to provide fast, efficient restarts for critical applications when they fail. The application can be in the form of a batch job or a started task (STC). ARM can be used to restart these automatically, whether the outage is the result of an abend, a system failure, or the removal of a system from the sysplex. ARM uses event-driven failure recognition, such as end of memory (EOM) and sysgone processing, to trigger restart activities. It forms part of the integrated sysplex recovery along with the following:

Sysplex Failure Management (SFM)


Workload Manager (WLM)

ARM also integrates with existing functions within both automation (AOC/MVS) and production control (OPC/ESA) products. However, care needs to be taken when planning and implementing ARM to ensure that multiple products (OPC/ESA and AOC/MVS, for example) are not trying to restart the same elements. If they are, the results may not be what is required.

2.17.1 ARM Characteristics


ARM is a function introduced in MVS V5.2.0. It runs in the XCF address space and maintains its own data spaces. ARM requires a couple data set to contain policy information as well as status information for registered elements. Both JES2 and JES3 are supported. The following are the main functional considerations:

ARM provides only job and STC recovery. Transaction or database recovery is the responsibility of the restarted applications.
Initial starting of applications (first or subsequent IPLs) is not provided by ARM. Automation or production control products provide this function.
Interface points are provided through exits, event notifications (ENFs), and macros.
The system or sysplex must have sufficient spare capacity to guarantee a successful restart.
To be eligible for ARM processing, elements (jobs/STCs) must be registered with ARM. This is achieved through the IXCARM macro.
A registered element that terminates unexpectedly is restarted on the same system. Registered elements that are on a system that fails are restarted on another system. Related elements are restarted on the same system.
The intended exploiters of the ARM function are the jobs and STCs of certain strategic transaction and resource managers, such as the following:
CICS/ESA
CP/SM
DB2
IMS/TM
IMS/DBCTL
ACF/VTAM

These products, at the correct level, already have the capability to exploit ARM. When they detect that ARM has been enabled, they register an element with ARM to request a restart if a failure occurs.

2.17.2 ARM Processing Requirements


ARM requires a couple data set and an active ARM policy. The policy can be installation-written, or the default policy settings can be used.


2.17.2.1 ARM Couple Data Set


As with any couple data set, consideration needs to be given to the creation and placement of the ARM couple data set. Refer to 2.7, Couple Data Sets on page 35 for recommendations on couple data set placement. The ARM couple data set is allocated and formatted using the IXCL1DSU utility. Its size is determined by the ITEM NAME control statements. For ARM, these are the following:

POLICY   the maximum number of user-defined ARM policies that can be in the couple data set at any given time.
MAXELEM  the maximum number of elements per policy.
TOTELEM  the maximum number of elements that are anticipated to be registered with ARM across the sysplex at any given time.

Increasing the parameters can be done dynamically using the SETXCF COUPLE command. Decreasing the parameters requires a sysplex-wide IPL. Care therefore needs to be taken when allocating the couple data set. A good starting point for defining the ARM couple data set would be:

POLICY=3 MAXELEM=40 TOTELEM=200
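A sketch of an IXCL1DSU job that formats an ARM couple data set with these starting values follows; the sysplex name, data set name, and volume serial are hypothetical:

//FORMAT   EXEC PGM=IXCL1DSU
//SYSPRINT DD  SYSOUT=*
//SYSIN    DD  *
  DEFINEDS SYSPLEX(PLEX01)
    DSN(SYS1.ARM.CDS01) VOLSER(ARM001)
    DATA TYPE(ARM)
      ITEM NAME(POLICY)  NUMBER(3)
      ITEM NAME(MAXELEM) NUMBER(40)
      ITEM NAME(TOTELEM) NUMBER(200)
/*

To grow the values later without an IPL, format a larger data set, bring it in as the alternate with SETXCF COUPLE,TYPE=ARM,ACOUPLE=(dsname,volser), and then make it the primary with SETXCF COUPLE,TYPE=ARM,PSWITCH.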

Refer to MVS/ESA SP V5 Setting up a Sysplex , GC28-1449, for more information on defining the ARM couple data set.

2.17.2.2 ARM Policy


The ARM policy indicates how registered elements are restarted if they, or the system on which they are running, fail. The policy is optional. ARM has default policy parameters that it applies when there is no installation-defined policy, or when elements are not specifically covered in an installation's policy. An installation can have a number of unique policies, but only one can be active at any one time. Information specified in an ARM policy falls into the following two categories:

Parameters relating to groups of ARM elements. Elements that have interdependencies on each other are referred to as restart groups. Restart groups are pertinent only to ARM's restarting of elements after a system has left the sysplex. Elements from a departed system that are in the same restart group are restarted on the same system. ARM also allows control of the order in which elements in a restart group are restarted, for example, restarting a failed DB2 region before restarting the associated CICS AORs and TORs.

Parameters relating to individual ARM elements. The policy parameters for individual elements indicate how to restart that element in particular situations, and whether to restart it at all. When an element's policy entry has parameters that conflict with each other, the entry that indicates whether to restart the element takes precedence.
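As a sketch, an installation-written policy covering one restart group might be defined with the IXCMIAPU administrative data utility as follows; the policy, restart group, and CICS element names are invented for illustration, while DSNDB0GDB1G follows the DB2 element naming convention discussed in 2.17.4.3, ARM and DB2:

//DEFARM   EXEC PGM=IXCMIAPU
//SYSPRINT DD  SYSOUT=*
//SYSIN    DD  *
  DATA TYPE(ARM)
  DEFINE POLICY NAME(ARMPOL01) REPLACE(YES)
    RESTART_GROUP(CICSDB2A)
      ELEMENT(DSNDB0GDB1G)
        RESTART_ATTEMPTS(3)
      ELEMENT(CICSAOR1)
        RESTART_ATTEMPTS(3)
/*

Such a policy would be activated with the SETXCF START,POLICY,TYPE=ARM,POLNAME=ARMPOL01 command.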

Refer to MVS/ESA SP V5 Setting up a Sysplex , GC28-1449, for more information on defining the ARM policy.


2.17.3 Program Changes


The only way to tell ARM that it is to restart a particular job or started task is for the program to register for ARM services. The intended exploiters of the function (see 2.17.1, ARM Characteristics on page 80) have the capability to exploit this function. For installation-written programs, there are the following two ways of achieving this:

Change the program that is executed to request ARM services by invoking the IXCARM macro.
Avoid changing source by utilizing an ARM driver program. This works by calling the ARM driver program instead of the application program and providing the original program call as a PARM for the driver program.

An example of how to change a program using the IXCARM macro is contained in MVS/ESA SP V5 Sysplex Migration Guide , SG24-4581. The same publication contains a full sample ARM driver program along with detailed examples of its use.

2.17.4 ARM and Subsystems


When a subsystem like CICS, IMS, or DB2 fails in a parallel sysplex, it impacts other instances of the subsystem in the sysplex because of such things as retained locks. The requirement, therefore, is to restart these subsystems as soon as possible after a failure so that recovery actions can be started and disruption across the sysplex is kept to a minimum. Using ARM to provide the restart mechanism ensures that the subsystem is restarted automatically, efficiently, and in a pre-planned manner without waiting for human intervention. This ensures that disruption due to retained locks or partly completed transactions is kept to a minimum. It is recommended, therefore, that ARM is implemented to restart major subsystems in the event of failure of the subsystem or of the system on which it was executing within the parallel sysplex. The following sections provide some information about setting up ARM to support the various subsystems.

2.17.4.1 ARM and CICS


If you are using CICS V4.1 and MVS/ESA 5.2, you can exploit the MVS automatic restart management facility to implement a sysplex-wide integrated automatic restart mechanism. To use the MVS ARM facility with CICS:

Implement ARM on the MVS images that the CICS workload is to run on.
Ensure that the CICS startup JCL used to restart CICS regions is suitable for ARM. Each CICS restart can use the previous startup JCL and system initialization parameters, or can use a new job and parameters.
Specify appropriate CICS START options.
Specify appropriate MVS workload policies.


Implementing ARM for CICS: Implementing ARM for CICS generally involves the following steps:

Ensure that the MVS images available for automatic restarts have access to the databases, logs, and program libraries required for the workload.
Identify those CICS regions for which you want to use ARM.
Define restart processes for the candidate CICS regions.
Define ARM policies for the candidate CICS regions.
Ensure that the system initialization parameter XRF=NO is specified for CICS startup. You cannot specify XRF=YES if you want to use ARM. If the XRF system initialization parameter is changed to XRF=YES for a CICS region being restarted by ARM, CICS issues message DFHKE0407 to the console and then terminates.

CICS START Options: It is recommended that START=AUTO is specified. This causes a warm start after a normal shutdown, and an emergency restart after failure. (START=AUTO also resolves to a cold start when you start a region for the first time with newly initialized catalogs.)
It is also recommended to always use the same JCL, even if it specifies START=COLD, to ensure that CICS restarts correctly when restarted by ARM after a failure. With ARM support, if the start-up system initialization parameter specifies START=COLD and the ARM policy specifies that ARM is to use the same JCL for a restart following a CICS failure, then CICS overrides the start parameter when restarted by ARM and enforces START=AUTO. This is reported by message DFHPA1934, and ensures recoverable data is correctly handled by the resultant emergency restart. If the ARM policy specifies different JCL for an automatic restart, and that JCL specifies START=COLD, CICS obeys this parameter with a risk of loss of data integrity. Therefore, if there is a need to specify different JCL to ARM, START=AUTO should be specified to ensure data integrity.
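As a sketch only, the relevant SIT parameters might be supplied as overrides in the SYSIN stream of the CICS startup job as follows; this assumes overrides are read from SYSIN, and the many other SIT parameters a real region needs are omitted:

//SYSIN    DD  *
START=AUTO
XRF=NO
.END
/*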

CICSPlex SM/ESA: CICSPlex SM/ESA extends the ability to manage CICS systems to include a logical set of systems treated as a single entity, the CICSplex. CICSPlex SM/ESA has been extended to include the ability to terminate a damaged or hung CICS system that is being managed by ARM. Once the region is terminated, ARM will request a restart of the CICS system.
CICSPlex SM ARM support was provided via APAR PN65642 for V1.1.1 only. CICSPlex SM/ESA ARM support will allow both on demand and automatic ARM restart to be requested. This support is implemented as follows:

On demand restart. The CICSRGN set of views has been extended to support an ARM primary and line command. When requested, this results in the cancellation of the CICS region with a request for an ARM restart.
Automatic restart. The ACTION definition has been extended to allow specification of an ARM restart option. If specified, this directs RTA to request cancellation of all CICS regions within the scope of the outstanding event.


ARM support is only available if the following criteria are met for a CICS region:

The CICS region must be connected to a CMAS as a local MAS.
The operating system release must be MVS/ESA 5.2.
ARM must be active in the MVS image.
The CICS release must be CICS/ESA 4.1 or greater.
The CICS region must have registered with ARM during initialization.
The current ARM policy must allow the region to be restarted.

If all the above criteria are true the CICS region will be terminated by internally issuing the following MVS command:

CANCEL name,ARMRESTART
The specification of ARMRESTART tells ARM to become involved. The CANCEL command will be issued in the CMAS to which the CICS region is connected as a local MAS. There is no interface for attempting ARM restart for a CICS region connected as a remote MAS.

2.17.4.2 ARM and IMS


IMS V5.1 provides support for MVS ARM via APAR PN71392. The code in this APAR is designed such that it can be applied to an IMS V5.1 system without regard to the MVS operating system level. It will operate on any MVS level but will only be active on an MVS SP V5.2 or higher operating system. The following information provides details of how IMS supports ARM.

The IMS environments supported are TM-DB, DCCTL, DBCTL, and XRF. DL/1 batch, DBB batch, and the IMS utilities are not supported.
The IMS control region is the only region restarted by ARM. The DL/1 SAS and DBRC regions are started internally by the IMS control region. IMS dependent regions are not automatically restarted, as these are normally restarted by some form of automation after the IMS control region has restarted.
The element name that IMS registers with ARM is the IMSID. The IMSID should be unique across the sysplex because ARM attempts to move IMS to a surviving system if the system IMS is executing on fails. If the IMSID is not unique, ARM may move the failing IMS to a system that already has an IMS with the same IMSID.
The element type IMS specifies when registering with ARM is SYSIMS.
The default is for IMS to register with ARM and allow ARM to restart IMS in case of failure. The following startup parameter has been added to allow the user to stop IMS from registering with ARM:

ARMRST=Y   allow ARM to restart IMS (default)
ARMRST=N   do not allow ARM to restart IMS

In an XRF environment, when the backup IMS (alternate) has started the tracking phase, the active IMS system is deregistered from ARM and is not automatically restarted. This is necessary to ensure that the active system does not automatically restart after the backup takes over. If the old active


IMS was allowed to restart, the integrity of the IMS online log data sets (OLDS) and message queues could be destroyed.

If IMS abends before it completes restart (the XRF tracking phase is considered to be restart complete for an XRF backup), it deregisters from ARM and is not automatically restarted.
If IMS is cancelled, IMS is not automatically restarted by ARM unless the ARMRESTART option was specified on the CANCEL or FORCE command.
IMS maintains a user abend table and deregisters from ARM any time one of the abends in this abend table is experienced. The abend codes currently in this table are:

U0020 (USER 20)     MODIFY
U0028 (USER 28)     /CHE ABDUMP
U0604 (USER 604)    /SWITCH
U0758 (USER 758)    QUEUES FULL
U0759 (USER 759)    QUEUE I/O ERROR
U2476 (USER 2476)   CICS TAKEOVER

All of these abends are either a result of operator intervention or require some external changes before IMS can be restarted.

If any call to ARM fails, IMS issues a warning message and continues to execute. The message is:

DFS0403W IMS xxxxxxxxx CALL TO MVS ARM FAILED RETURN CODE=nn, REASON CODE=nnnn

The values for xxxxxxxxx are:
REGISTER    register with ARM
READY       tell ARM that IMS is ready to accept work
ASSOCIATE   tell ARM that this is an XRF alternate
UNKNOWN     unknown request sent to DFSARM00

IMS enables an ENF listening exit for ENF signal 38. This is the signal value that ARM uses to indicate that it has experienced a failure of some kind, at which point IMS deregisters. The ENF signal is issued again when the ARM failure condition has been corrected, and IMS reregisters.
If IMS is being restarted by ARM, it ignores the AUTO=NO value in the IMS start parameters. ARM indicates whether this is an XRF alternate or not, so IMS does not need the restart command to know how to start.

2.17.4.3 ARM and DB2


DB2 Version 4 supports the ARM function of MVS whether DB2 is sharing or non-sharing. DB2 must be installed with a command prefix scope of started to take advantage of automatic restart. See Command Prefixes on page 86 for more information on command prefixes.

Using an Automatic Restart Policy: As with other subsystems, DB2 will be restarted in the event of a failure, as specified in either the default ARM policy or the installation-written one. In any policy, the job or STC is referred to as an element. In a data sharing group, the element name is the concatenated DB2 group name and member name (such as DSNDB0GDB1G). Wild cards (such as DSNDB0G*) can be specified if a single policy statement is to be used for all members in the group.


To specify that DB2 is not to be restarted after a failure, RESTART_ATTEMPTS(0) should be included in the policy for that DB2 element.
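For example, a policy fragment of the following form, using the group-wide wild card mentioned above, would suppress ARM restarts for all members of the hypothetical data sharing group DSNDB0G; the restart group name is invented for illustration:

    RESTART_GROUP(DB2GRP)
      ELEMENT(DSNDB0G*)
        RESTART_ATTEMPTS(0)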

Command Prefixes: DB2 Version 4 includes support for one to eight character command prefixes. A command prefix replaces the existing subsystem recognition character (SRC) for recognizing commands and is used in message displays as well.
To use multiple-character command prefixes with DB2 V4 the IEFSSNxx subsystem definition statements in SYS1.PARMLIB must be updated. The subsystem definition statement is changed to allow the command prefix to be specified. The format of the SYS1.PARMLIB IEFSSNxx subsystem definition statement to define the command prefix is as follows:

ssname,DSN3INI, DSN3EP,prefix,scope,group-attach
where:

ssname   the one to four character DB2 subsystem name.
prefix   the one to eight character command prefix.
scope    the one character scope for the command prefix. DB2 Version 4 registers its command prefix with MVS. When this is done, the scope of the command prefix is controlled by the value chosen:
         M   system scope (one MVS system); the prefix is registered at IPL.
         X   sysplex scope; the prefix is registered at IPL.
         S   started; the prefix is registered with sysplex scope at DB2 startup instead of at MVS IPL time.

It is recommended that S be chosen. This allows for a single IEFSSNxx parmlib member to be used by all MVS systems in the sysplex. It also simplifies the task of moving a DB2 from one system to another; DB2 can be stopped on one system and started on another. There is no need to re-IPL the system. For more information about the command prefix facility of MVS, see MVS/ESA SP V5 Planning: Operations , GC28-1441.

Group-attach This is the group attachment name. This is specified on installation panel DSNTIPK.

Here is an example definition of a subsystem with a name of DB1G and a started scope command prefix of -DB1G.

DB1G,DSN3INI, DSN3EP,-DB1G,S,DB0G
The existing one-character subsystem recognition character can continue to be used as the command prefix. This means the existing IEFSSNxx definitions can be used whilst migrating to DB2 Version 4. These one-character command prefixes are given a started scope. To change the command prefix parameters, the IEFSSNxx entry must be changed and the host system IPLed. Unless circumstances dictate otherwise, to minimize the requirement for IPLs, system or sysplex-wide, code the scope as S.


2.17.4.4 ARM and ACF/VTAM


ACF/VTAM V4.3 supports ARM. VTAM will be restarted automatically based on the ARM policy, except following a HALT, HALT CANCEL or HALT QUICK command. No significant amount of additional storage is required to use ARM for VTAM restarts. Before using ARM, deactivate any tools or user-written programs for automatically restarting VTAM or application programs. If this is not done there is a risk of multiple restarts of the same VTAM occurring after failure. During initialization, VTAM registers with ARM using the following options:

ELEMENT=NET@sscp_name
ELEMTYPE=SYSVTAM
TERMTYPE=ELEMTERM

2.17.4.5 ARM and JES3


In order to exploit ARM in a JES3 environment, the system level must be MVS/ESA SP Version 5.2 and JES3 Version 5.2.1 or later. Note that if the parallel sysplex consists of more than one JES3 complex, that is, more than one JES3 Global, then the ARM element can only be restarted on a JES3 main within the scope of the JES3 complex where it initially registered with ARM. When a batch job or started task registers with ARM, ARM in turn registers that element with JES3. This process involves JES3 assigning a token which includes information about the JES XCF group in which the job or started task is running. The XCF Group Name distinguishes one JES3 complex from another inside a sysplex, and so a batch job or started task can only be restarted using ARM on an MVS image associated with the same XCF JES3 Group Name as where the batch job or started task originally registered with ARM. As already described in 2.17.1, ARM Characteristics on page 80, ARM element restart on another system only applies in the case of a system failure. If the batch job or started task abends, rather than failing as a result of its supporting system terminating, then it is restarted on the same system.

2.18 JES3
This section discusses the JES3 items required to achieve high availability in a JES3 complex.

2.18.1 Planning
To achieve the goal of continuous availability in a parallel sysplex environment, an installation must configure the hardware and software such that no planned or unplanned outage will disrupt all systems in the sysplex at the same time. Planned outages include installing new software releases or hardware upgrades, or changing the configuration. JES3, even at the current Version 5.2.1, requires


a concurrent restart of all systems in the JES3 complex for the following changes:

Any addition of MAINs to the initialization deck.
Any addition of RJPs to the initialization deck.
Any change in the JES3-managed device configuration.
Any change in the JES3 user exits.
Any upgrade of JES3 software maintenance or release.

Therefore, it is very important in a JES3 environment to plan ahead.

2.18.1.1 Initialization Deck


Ensure that the JES3 initialization deck contains definitions for future configuration requirements:

Main Definitions
Adding a MAINPROC statement to the JES3 initialization deck requires a JES3 complex-wide warm start. If the JES3 complex maps one-to-one with the sysplex, this means a sysplex-wide disruption, because today the IPL and warm start of the systems in the JES3 complex must be concurrent. Prepare for future additions of mains by coding additional MAINPROC statements. The cost is a small amount of additional storage required for control blocks to support the extra processor definitions, and potentially some operational confusion when displays contain the names of non-existent mains, as shown in Figure 23.

*I S
IAT5619 ALLOCATION QUEUE = 00000      BREAKDOWN QUEUE = 00000
IAT5619 SYSTEM SELECT QUEUE = 00000   ERROR QUEUE = 00000
IAT5619 SYSTEM VERIFY QUEUE = 00000   FETCH QUEUE = 00000
IAT5619 UNAVAILABLE QUEUE = 00000     RESTART QUEUE = 00000
IAT5619 WAIT VOLUME QUEUE = 00000     VERIFY QUEUE = 00000
IAT5619 ALLOCATION TYPE = AUTO
IAT5619 CURRENT SETUP DEPTH ALL PROCESSORS = 00000
IAT5619 MAIN NAME  STATUS           SDEPTH   DASD         TAPE
IAT5619 SC50       ONLINE  IPLD     020,000  00208,00000  00032,00000
IAT5619 SC49       ONLINE  IPLD     020,000  00208,00000  00032,00000
IAT5619 SC43       ONLINE  IPLD     020,000  00208,00000  00032,00000
IAT5619 SCNEW1     ONLINE  NOTIPLD  020,000  00208,00000  00032,00000
IAT5619 SCNEW2     ONLINE  NOTIPLD  020,000  00208,00000  00032,00000
Figure 23. JES3 *I S Display Showing Non-Existent Systems

Note that it is possible to change the names of mains in the initialization deck without requiring a warm start.

RJP Workstation Definitions
The same considerations as those discussed above for adding JES3 mains apply to RJP workstation definitions. Any addition of RJP workstation (RJPWS) definitions to the JES3 initialization deck requires a JES3 complex-wide IPL and warm start. Therefore, it is necessary to plan ahead and define additional RJPs in advance.

JES3-Managed Devices


In the MVS/ESA SP Version 5.2 parallel sysplex environment, there may be good reasons, as discussed below in 2.18.2, JES3 Sysplex Considerations on page 89, for removing tape and DASD devices from JES3 control. Once these devices are no longer JES3-managed, they can be dynamically added and deleted from the configuration without the requirement for a JES3 complex-wide warm start. Note that as of JES3 Version 5.2.1, JES3 no longer supports JES3 operator consoles. That is, the JES3 initialization deck CONSOLE statement is ignored.

2.18.2 JES3 Sysplex Considerations


When a JES3 Global-Local complex maps to a sysplex, or subset of systems within a sysplex, there are a number of considerations concerning traditional JES3-managed devices. While the device types themselves do not necessarily impact continuous availability, the fact that they are JES3-managed means that any definition additions or deletions require a JES3 complex-wide, and hence potentially sysplex-wide disruption.

MVS/ESA 5.2 Shared Tape
MVS/ESA SP Version 5.2 introduces allocation support for sharing tapes between multiple systems in the sysplex. Up until now, JES3 has always managed the sharing of tapes between systems in the JES3 complex. In order to manage tape sharing, the tape devices were defined to JES3 in the JES3 initialization deck as JES3-managed devices; that is, the tapes were identified to JES3 on the DEVICE statement of the initialization deck. In a JES3 parallel sysplex environment, it is necessary to choose between having JES3 manage tapes and the new MVS/ESA SP Version 5.2 shared tape support. A tape device cannot be both auto-switchable and JES3-managed at the same time, as shown in Figure 24 on page 90. Note that once a device has been varied online to a JES3 system, that device remains under JES3 control for the life of the IPL and, as in the example shown in the figure, cannot be used as an auto-switchable device for the remainder of the IPL.
The advantage of JES3-managed tape devices is:
JES3 soft allocation: JES3 performs setup, or soft allocation, of the devices required for a job before it begins execution.
The disadvantages of JES3-managed tape devices include:
No Dynamic I/O Reconfiguration support.
Loss of I/O symmetry across all systems in the sysplex.
A JES3 complex-wide warm start is required to change the configuration.

Consoles
In JES3 Version 5.2.1, JES3 consoles no longer exist, and their definitions should be removed from the JES3 initialization stream.


-D U,TAPE,,000,64
 IEE457I 11.05.28 UNIT STATUS 086
 UNIT TYPE STATUS     VOLSER   VOLSTATE
 . . .
 0B38 349S OFFLINE-AS          /REMOV
 0B39 349S OFFLINE-AS          /REMOV
 0B3A 349S OFFLINE-AS          /REMOV

*V B3A ONLINE
 IEE302I 0B3A ONLINE
 IEF259I UNIT 0B3A IS NO LONGER DEFINED AS AUTOSWITCH
 IAT5510 0B3A VARIED ONLINE ON GLOBAL

D U,,,B3A,1
 IEE457I 11.22.08 UNIT STATUS 102
 UNIT TYPE STATUS     VOLSER   VOLSTATE
 0B3A 349S O -M                /REMOV

*V B3A OFF SC50
 IAT8180 0B3A VARIED OFFLINE TO JES3 ON SC50
 IEF281I 0B3A NOW OFFLINE

D U,,,B3A,1
 IEE457I 11.56.08 UNIT STATUS 253
 UNIT TYPE STATUS     VOLSER   VOLSTATE
 0B3A 349S F-NRD               /REMOV

V B3A,AS,ON
 IEE461I UNIT 0B3A CANNOT BE DEFINED AS AUTOSWITCH BECAUSE IT IS A JES3-MANAGED TAPE.

Figure 24. JES3-Managed and Auto-Switchable Tape

2.18.3 JES3 Parallel Sysplex Requirements


A sysplex configuration is required for a multi-system JES3 Version 5.1.1 complex. In the following discussion, the required configuration components are discussed, with the underlying assumption that the parallel sysplex consists of more than one MVS image, that is, PLEXCFG=MULTISYSTEM is specified in the IEASYSxx member of PARMLIB. Note that at this release, JES3 does not directly exploit the coupling facility; that is, JES3 does not connect to any coupling facility structures.

2.18.3.1 JES3 Signalling Paths


In a parallel sysplex environment, JES3 exploits the cross-system coupling facility (XCF) services of MVS to communicate between the Global and the Local(s), thus eliminating JES3-owned CTCs. Like other XCF exploiters, JES3 inter-system communication is now provided by XCF, over XCF-managed signalling paths, which may be supported either by CTCs or by a coupling facility. The configuration requirements for XCF signalling paths are described in 1.7, XCF Signalling Paths on page 14. JES3 has no specific signalling path requirements, and it is not necessary to define an XCF transport class for exclusive use by JES3 Global-Local communications traffic unless RMF, or other monitoring tools, indicate that this would solve a specific performance problem.


2.18.4 JES3 Configurations


All the JES3 Version 5 systems in a JES3 complex must be contained within the same sysplex, which may either be a basic or a parallel sysplex. While IBM strongly recommends that the JES3 complex maps to the sysplex (JES3PLEX = SYSPLEX), it is possible for a sysplex to contain more than one JES3 complex. In fact, it is possible for the sysplex to contain only JES3 globals (without locals), and there may be some installations with requirements for just such a configuration. As a result, the implications of each type of configuration will be discussed.

2.18.4.1 JES3PLEX = SYSPLEX


In this configuration, the parallel sysplex consists of only one JES3 global system; all other MVS images in the parallel sysplex are JES3 locals. This is the recommended configuration for JES3 in a parallel sysplex environment. In this environment, any JES3 local can use DSI to take over the global functions in the event of a failure of the global main.

2.18.4.2 JES3PLEX < SYSPLEX


The parallel sysplex comprises more than one JES3 complex; that is, there is more than one JES3 global. This configuration is possible with the support introduced by PTFs UW19140 and UW19148. Considerations:

XCF Group Name
The fencing of each JES3 complex within the sysplex is defined by the XCF group name. The XCF group name may be explicitly coded on the OPTIONS statement in the JES3 initialization deck, or, the recommended way, the group name can be allowed to default to the node name corresponding to the NJE home node, that is, where HOME=YES is coded. A portion of a sample JES3 initialization deck is shown in Figure 25. The XCF group name for this JES3 complex is WTSCPLX9.

*--------------------------------
* NJE NODE DEFINITIONS
*--------------------------------
NJECONS,CLASS=S12,SIZE=128
NJERMT,NAME=WTSCPLX9,HOME=YES
NJERMT,NAME=WTSCPLX1,TYPE=SNA
NJERMT,NAME=WTSCMXA,TYPE=SNA
NJERMT,NAME=C5JES3,PATH=WTSCMXA
NJERMT,NAME=C2JES2,PATH=WTSCMXA
NJERMT,NAME=C5JES2,PATH=WTSCMXA
NJERMT,NAME=C2JES3,PATH=WTSCMXA
NJERMT,NAME=WTSCPOK,PATH=WTSCMXA
NJERMT,NAME=WTSCPLX1,PATH=WTSCMXA
*--------------------------------
Figure 25. NJE Node Definitions Portion of JES3 Init Stream

Command Prefix


The JES3 command prefix requires careful consideration in this environment. If there is a single JES3 complex within the sysplex (JES3PLEX=SYSPLEX), then * will be the default for PLEXSYN. If there is more than one JES3 complex within the sysplex, then it is necessary to change all JES3 initialization decks to specify a PLEXSYN value other than *, and it is necessary to specify a value of * for the SYN parameter. Both the PLEXSYN and SYN parameters are specified on the CONSTD statement in the JES3 initialization deck.

JES3 Proc
The parallel sysplex philosophy and cloning support eliminate the need to keep individual copies of critical data sets and definition libraries for each system in the sysplex. With this in mind, it is recommended to maintain a single SYS1.PROCLIB for all systems in the sysplex, and to take advantage of the cloning support to tailor the JES3 proc for the different globals within the sysplex. Figure 26 provides an example of how the JES3 proc may be coded to accommodate its use by multiple JES3 globals within the sysplex.

//JES3 PROC JES=JES3,ID=01
//JES3 EXEC PGM=IATINTK,DPRTY=(15,15),TIME=1440,REGION=0M
//STEPLIB DD DSN=SYS1.JES3LIB,DISP=SHR
//* ----------------------------------------------------*
//*
//* JES3 PROCEDURE: JES3
//*
//* RELEASE:  ALL RELEASES
//* CONFIG:
//* STEPLIB:  SYS1.JES3LIB
//* CHKPT:    SYS1.JES3CKPT
//*           SYS1.JES3CKP2
//* DISK RDR: SYS1.JES3DR
//* JCT:      SYS1.JES3JCT
//* SPOOL:    SYS1.JESPACE
//* JES3OUT:  ON-LINE PRINTER
//* DUMPS:    ON-LINE PRINTER
//* PROCLIB:  NONE DEFINED (DYNALLOC REQUIRED)
//* JSMSSTAB: SYS1.JES3MSS
//* INISH:    SYS1.JES3IN
//*
//* ----------------------------------------------------*
//CHKPNT   DD DSN=SYS1.&JES.CKPT,DISP=SHR
//CHKPNT2  DD DSN=SYS1.&JES.CKP2,DISP=SHR
//JES3DRDS DD DSN=SYS1.&JES.DR,DISP=SHR
//JES3JCT  DD DSN=SYS1.&JES.JCT,DISP=SHR
//SPOOL1   DD DSN=SYS1.&JES.SPL1,DISP=SHR
//JES3OUT  DD DSN=SYS1.&JES.OUT,DISP=SHR
//JES3SNAP DD DUMMY
//SYSMDUMP DD DSN=SYS1.&JES.DUMP,DISP=SHR
//JESABEND DD DUMMY
//JES3IN   DD DSN=SYS1.PARMLIB(JES3IN&ID),DISP=SHR
//*
Figure 26. Sample JES3 Proc for Use by Multiple Globals

Shared JES3 data sets


2.18.4.3 SYSPLEX Contains All JES3 Globals


In this configuration, all MVS images in the parallel sysplex support JES3 globals only. Recovery of the global function is not possible if a global system fails, since there is no local available to perform DSI.

2.18.5 Additional JES3 Planning Information


Other documentation that discusses setting up JES3 in a sysplex environment includes the following reference:

JES3 V5 Implementation Guide, SG24-4582


Chapter 3. Subsystem Software Configuration


This chapter deals with configuring the various subsystems to provide an environment that will support the goal of continuous availability. The following products are included in this discussion:

CICS/ESA V4
CICSPlex SM V1
IMS/ESA V5
DB2 V4
DFSMS 1.3 (VSAM RLS)
TSO/E 2.4
NetView V3.1
AOC/MVS
OPC/ESA
VTAM 4.2

These products generally fall into one of the three following categories:

Transaction management
Database management
Systems management

Just as with the hardware, redundancy of software subsystem components is one of the keys to providing continuous availability. In a parallel sysplex, you must have the capability to detect and route work around unavailable resources. This is the job of a transaction manager like CICS. The database managers, such as DB2, have the capability of providing concurrent access to the data. The system management products are needed to help simplify the complexity of the coupled environment. All of these products must be configured in such a way that they do not represent a single point of failure. Also, any instance of a product must be able to be added to or removed from the system without impact to end-user availability.

3.1 CICS V4 Transaction Subsystem


CICS, for many years, has had the ability to route transactions from one address space to another. CICS provides the following two mechanisms to do this:

Multiple region operation (MRO)
Inter-system communication (ISC)

CICS/ESA Version 4 allows MRO operation between systems using XCF. With CICS/ESA Version 3, a dynamic transaction routing exit was added allowing routing decisions to be made based on programmable criteria. CICSPlex SM provides sophisticated routing algorithms to perform workload balancing and failure avoidance.


3.1.1 CICS Topology


An ideal setup would have one terminal-owning region (TOR) per MVS image and some number of application-owning regions (AORs) spread across all MVS images. The AORs would need to have enough capacity to handle peak transaction rates and be able to pick up the workload for a failed MVS image with no perceivable impact on transaction response times. Availability is further increased if all transactions can execute on any AOR. In addition, CICS AORs need access to a database manager, which is responsible for the data sharing. To ensure availability, each MVS image should have an instance of the database manager(s) being used by CICS. The following figure provides an example of a simple CICSplex configuration.

Figure 27. Cloned CICSplex. Multiple CICS subsystems share the same data. Each CICS is a clone of one another.

VTAM generic resource capability along with CICS 4.1 or higher allows the terminal user to log on to any of the terminal owning regions (TORs). This enables VTAM to perform dynamic balancing of the sessions across the available terminal-owning regions, thus removing a potential single point of failure. The terminal-owning regions can in turn perform dynamic workload balancing using the CICS dynamic transaction routing facility, which leads to improved availability for individual transactions. Dynamic transaction routing is controlled through the CICSPlex SM product.


3.1.2 CICS Affinities


CICS transactions use many different techniques to pass data from one to another. Some of these techniques require that the transactions exchanging data must execute in the same CICS AOR, and therefore impose restrictions on dynamic routing. If transactions exchange data in ways that impose such restrictions, there is said to be an affinity between them. It is important to identify these transactions and take one of the following actions:

Customize CICSPlex SM to recognize the affinity and route to the appropriate AOR.
Modify the application to remove the affinity and remove the single point of failure from the AOR for those transactions.

3.1.3 File-Owning Regions


File-owning regions are needed when a transaction requires access to other data (which currently is not supported by data sharing through the coupling facility), such as VSAM and BDAM files. Application-owning regions must function ship file requests to the file-owning region to maintain data integrity. Thought should be given to having multiple FORs to limit impact of an FOR failure. Including a queue-owning region in the CICSplex is important today as it avoids any inter-transaction affinities that occur with temporary storage or transient data queues. An alternative to a queue-owning region is to create a combined FOR/QOR region. Note: The QOR will no longer be necessary when CICS/ESA 5.1 has implemented a mechanism for shared temporary storage across the CICSplex using the coupling facility.

3.1.4 Resource Definition Online (RDO)


CICS has been evolving from macro definitions to RDO. Macro definitions are a problem when aiming for continuous availability as they take a CICS recycle to become active. For this reason RDO or CICS auto-install should be used for definitions. The following RDO and auto-install functions have a great impact in lowering the number of planned outages:

Support for VSAM files and data tables
CSD management
Dynamic addition of MRO connections
Autoinstall for programs, mapsets, and partitionsets
Autoinstall for LU6.2 parallel sessions

3.1.5 CSD Considerations


The CSD should be defined as recoverable, so that changes that were incomplete when an abend occurred will be backed out. To avoid the CSD filling while CICS is running, ensure that the data set is defined with primary and secondary space parameters, and that there is sufficient DASD space available for the secondary extents.


3.1.6 Subsystem Storage Protection


CICS/ESA 3.3 introduced storage protection, which uses an extension to MVS storage keys to physically separate CICS code and control blocks from user storage. This protects CICS code and control blocks from being overwritten by a wayward user application. These extensions allocate separate storage areas (with separate storage keys) for user application programs and for CICS code and control blocks. The SIT parameter STGPROT must be coded to activate this feature. The key of the UDSA and EUDSA is controlled using the STGPROT parameter:

STGPROT=NO indicates that storage protection is not active and storage is acquired in key 8, as for previous releases.
STGPROT=YES indicates that storage protection is active and that storage will be acquired in storage protect key 9.

3.1.7 Transaction Isolation


Transaction isolation in CICS/ESA 4.1 offers storage protection between transactions, ensuring that a program of one transaction does not accidentally overwrite the storage of another transaction. MVS/ESA Version 5 Release 1 introduces the subspace group facility, which can be used for storage isolation to preserve data integrity within an address space. Programs defined with EXECKEY(USER) execute in their own subspace, with appropriate access to any shared storage or to CICS storage. Thus, a user transaction is limited to its own view of the address space. Programs defined with EXECKEY(CICS) execute in the base space, and have the same privileges as in CICS/ESA 3.3. Additional information can be found in the manuals Planning for CICS Continuous Availability in a MVS/ESA Environment and System/390 MVS Sysplex Application Migration.

3.2 CICSPlex SM V1
IBM CICSPlex System Manager (CICSPlex SM) is a system-management tool that provides the following functions:

A real-time single-system image
A single point of control
Automated workload management
Automated exception reporting for CICS resources
Collection of statistical data for CICS resources

A single-system image means that the CICSPlex SM operator can manage multiple CICS systems, distributed across the parallel sysplex as if they were one system. A single command is sufficient to make changes throughout the CICSPlex. CICSPlex SM can balance the enterprise workload dynamically across the available AORs, thereby enabling you to manage a variable workload without operator intervention. CICSPlex SM routes transactions away from busy regions and from those that are failing or likely to fail, giving improved throughput and availability to the end user. Furthermore, planned service is made much easier. A CICS region can be taken down for maintenance without having to worry about that region s work, because CICSPlex SM can simply route the work dynamically to another region.


RTA resource monitoring evaluates the status of any CICS resource. External notifications are issued when the resource moves outside of the declared status range. For example, RTA resource monitoring can warn you that dynamic storage area (DSA) free space is falling, that a file is disabled, that a journal is closed, that the number of users of a transaction is growing, and so on. Once the exception is detected, RTA can issue an SNA generic alert to NetView, thereby allowing NetView to take corrective action. External messages, which are directed to the console by default, can be intercepted for automation by other products. For example, external messages may be intercepted and processed by AOC CICS automation.

3.2.1 CICSPlex SM Configuration


The coordinating address space (CAS) is the gateway to CICSPlex SM. It is an MVS/ESA subsystem whose main function is to support the end-user interface. The CICSPlex SM address space (CMAS) implements the monitoring, real-time analysis, workload management, and operations functions of CICSPlex SM, and maintains configuration information about the CICSplexes it is managing. It also contains information about its own links with other CMASs. In short, the CMAS is responsible for the single-system image that CICSPlex SM presents to the operator. Having a single-system image means that a CICSplex can be managed as if it were a single CICS system, regardless of the number of CICS systems it contains. To achieve the best performance and availability, link every CMAS directly to every other CMAS.

Figure 28. CICSPlex SM. The CAS in each system communicates with the CMASs to collect information and present a single view of the CICSplex.


3.3 IMS Transaction Subsystem


IMS is now evolving to fully participate in parallel sysplex architecture. IMS 5.1 introduces parallel sysplex support for IMS databases, often referred to as n-way data sharing. With IMS 5.1 it is possible to run multiple IMS TM, CICS DBCTL, and batch systems across multiple MVS systems, all reading and updating the shared databases efficiently and with integrity. A subsequent release of IMS will provide exploitation of coupling facility architecture by the IMS Transaction Manager. The IMS message queues will use the coupling facility and be a single sysplex wide resource. Logically, the function of the IMS systems can be considered to be split into network driving and transaction processing. The network driving IMS systems will receive messages from the various networks (SNA and non-SNA) and place them on shared message queues. Transaction processing IMS systems will take messages off the queues and insert replies to the queues. The network driving IMS systems will then take the replies from the queues and send them to their final destinations.

3.3.1 IMS Topology


The following figure shows an example of IMS 5.1 configuration for data sharing in a parallel sysplex.

Figure 29. Sample IMS 5.1 Configuration. Each MVS image contains an IMS subsystem, IRLM, shared Recon data and a shared database.


When migrating to an IMS data sharing environment the following items need to be considered:

Set up your IMS RESLIBs so you can clone your IMS subsystems across the parallel sysplex.
Ensure that the IMSID is unique for each IMS subsystem in the sysplex, so that the IMS subsystem can be moved to any MVS image if necessary.
Ensure that all terminal names, LU names, and ETO user IDs in the network are unique.
Divide the network to balance your workload and to minimize the network outage if you lose one of your IMS subsystems.
Convert batch jobs to BMP programs, to minimize the number of connections to the coupling facility.

3.3.2 IMS RESLIB


When cloning your IMS subsystems, aim for a single RESLIB. Use defaults or common coding during the stage 1 generation (wherever possible) so that the definitions can be overridden during execution.

3.3.3 IMSIDs
Having unique IMSIDs lets you move your IMS subsystems to another MVS image when necessary. The IMSID also has to be different from any non-IMS subsystem identifier defined to the MVS under which IMS is running. The IMSID specified in the IMSCTRL macro (part of stage 1) can be overridden at execution by specifying a keyword in the DFSPBxxx member or a parameter on the EXEC statement.
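As an illustration only, such overrides could appear in an execution-parameter member such as DFSPBxxx as keyword=value entries of the following form; the IMSID shown is hypothetical, and the member's exact coding rules should be checked against the IMS installation documentation:

IMSID=IMSA,
ARMRST=Y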

3.3.4 Terminal Definitions


Just as with the IMSID, you should ensure that your logical unit (LU) names, logical terminal (LTERM) names, physical terminal (PTERM) names, and ETO user IDs are unique across your IMS network. With unique names, you can move any terminal, LU 6.2 application, or user anywhere in the network without having to worry about duplicate names. IMS does not have a generic resource capability, so it is important that terminal definitions are not all defined to only one image. By spreading the terminal definitions across the IMS images, the loss of a single IMS will only affect the terminal users defined to that IMS instance. It is recommended that the Extended Terminal Option (ETO) be used to improve your system's availability by reducing scheduled outages for the addition or deletion of VTAM local and remote terminals or LTERMs. This feature also enables you to add VTAM terminals, and users for these terminals, to your IMS TM system without predefining them through the system definition process. IMS TM dynamically builds the required queues and blocks based on VTAM information and IMS ETO descriptors (a set of IMS TM skeleton definitions).


3.3.5 Data Set Sharing


The following table lists the data sets that should be shared and those that must be unique within the parallel sysplex.
Table 5. IMS Data Sets in Sysplex
Shared data sets:    ACBLIBn, DBDLIB, FORMATn, IMSTFMT, MATRIXn, MODBLKSn, PGMLIB, PROCLIB, PSBLIB, RECONn, RESLIB
Unshared data sets:  DFSOLDSn, DFSOLPnn, DFSWADSn, IEFRDRn, IMSMON, LGMSG, MODSTAT, MSDBCPn, MSDBDUMP, MSDBINIT, QBLKS, RDSn, SHMSG

3.3.6 IRLM Definitions


For IRLM to belong to a data sharing group, you must specify the name of the data sharing group and the name of the IRLM structure in the startup procedure. Each IRLM in the group must:

Have a unique IRLMID
Specify the same data sharing group name in the group parameter
Specify the same lock structure using the LOCKTABL parameter

Although you can specify these definitions on the IRLM startup procedure, the recommended method is to define them using the CFNAMES control statement.
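As a hedged illustration (the structure names are placeholders), a CFNAMES control statement coded in the DFSVSMxx member, or in the DFSVSAMP DD data set for batch, might look like this:

CFNAMES,CFIRLM=IMSLOCK01,CFOSAM=OSAMCACHE01,CFVSAM=VSAMCACHE01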

3.3.7 Coupling Facility Structures


IMS uses structures for OSAM and VSAM buffer invalidation, while IRLM uses a lock structure to provide locking services. Information on structure sizing, placement, and definitions can be found in the IMS V5 Administration Guide: System.

3.3.8 Dynamic Update of IMS Type 2 SVC


In the past, adding a new SVC has required MVS to be shut down and re-IPLed. The IMS 5.1 utility (DFSUSVC0) allows the user to add or replace an IMS type 2 SVC dynamically in MVS. All IMS activity using the SVC to be modified must be stopped before executing the new utility.


Note: The SVC utility does not remove the need to add the IMS SVC to the MVS nucleus. Every MVS IPL regresses the system back to the old IMS SVC and requires running the utility again to reinstall the new IMS SVC.

3.3.9 Cloning Inhibitors


The following IMS databases cannot be shared:

- MSDBs
- DEDBs with SDEPs
- DEDBs using the Virtual Storage Option (VSO)

A possible solution for these databases is to place them in only one system and have the transactions routed using Multiple System Coupling (MSC) to that IMS. Another solution would be to convert these databases.

3.4 DB2 Subsystem


DB2 data can be accessed from CICS, IMS, TSO, or from a large variety of distributed applications. Each transaction environment has different characteristics that must be considered when designing the software configuration.

3.4.1 DB2 Environment


With Version 4, DB2 introduces a function that provides applications with full read and write concurrent access to shared databases. The DB2 parallel sysplex environment allows up to 32 instances of DB2 subsystems, thereby minimizing the impact of a failure in DB2. Unlike read-only data sharers, all members of a data sharing group have equal concurrent read/write access to databases. DB2 data sharing requires the subsystems to share the DB2 catalog, directory and user databases. Subsystems sharing the data are known as a data sharing group. As we see in Figure 30 on page 104, each DB2 system has its own set of log data sets and its own bootstrap data set (BSDS). However, these data sets must reside on DASD that is shared between all members of the data sharing group. This is to allow all systems access to all of the available log data in the event of a DB2 subsystem failure. Each member also has its own buffer pools in MVS storage.


Figure 30. Sample DB2 Data Sharing Configuration. Each MVS image contains a DB2 subsystem, IRLM, and a shared database.

Once a DB2 data sharing group has been established, you can stop and start individual members in the data sharing group while the other members continue to process. You also can configure a new DB2 subsystem into the group without affecting the existing members.

3.4.2 DB2 Structures


Always make sure there is enough space for the structures to be rebuilt in an alternative coupling facility, if the need arises. The alternative coupling facility must be specified in the CFRM policy preference list. If space exists, the SCA and lock structures can be automatically rebuilt if a coupling facility fails. Group buffer pools are not automatically rebuilt when they fail. However, alternative allocation information is still needed to allow a new group buffer pool to be allocated in the alternate coupling facility to allow recovery from a group buffer pool failure. Ideally, to reduce the disruption caused by reconfiguring coupling facilities, different group buffer pools should be placed in different coupling facilities. Group buffer pool recovery requires information in the lock and SCA structures to determine which databases must be recovered. This is known as damage assessment. The lock and SCA structures should be put in a different coupling facility than important cache structures. Should the lock structure and SCA be lost at the same time as one of the group buffer pools, DB2 waits until the lock structure and SCA are rebuilt before doing damage assessment.


3.4.3 Changing Structure Sizes


DB2 V4 supports the dynamic structure alter function provided by MVS 5.2 together with CFLEVEL 1 licensed internal code (LIC). This allows DB2 to continue using the structures while the size alteration is taking place. Note: This method cannot be used to increase the size of the lock table portion of a lock structure. Increasing the lock table requires that a new CFRM policy be activated, which is disruptive. This makes it important to size the lock structure correctly.

3.4.4 DB2 Data Availability


DB2 data availability considerations for a data sharing group are basically the same as for a single subsystem. For a data sharing group, the catalog and directory data sets are even more important than with a single subsystem because there is a single catalog and directory for all members of the group. Placing the catalog and directory behind a 3990 control unit with dual copy capability, or on a RAMAC Array Subsystem, can provide improved data availability.

3.4.5 IEFSSNXX Considerations


As the number of sysplex members grows, it becomes important to remember that every DB2 and IRLM you define to MVS in the IEFSSNxx parmlib member requires an MVS system linkage index (LX). The number of LXs is defined by the NSYSLX parameter in SYS1.PARMLIB. When these are exhausted, new subsystems cannot be added to that MVS image. The default number of these indexes that MVS reserves is 55.

3.4.5.1 Command Prefixes


DB2 Version 4 includes support for one- to eight-character command prefixes. A command prefix replaces the existing subsystem recognition character (SRC) for recognizing commands. Using multiple-character command prefixes requires updating IEFSSNxx in SYS1.PARMLIB. The subsystem definition statement provides a one-character scope parameter that controls the scope of the commands issued with the corresponding prefix. It is recommended that S (the default) be selected, which allows a single IEFSSNxx parmlib member to be used by all systems in the sysplex. It also allows stopping DB2 on one image and starting it on another without IPLing MVS. Selecting X would require an IPL to do this. For additional information, see MVS/ESA Planning: Operations and the MVS/ESA V5.2 Initialization and Tuning Reference.
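As a hedged sketch (the subsystem name and command prefix are placeholders), a keyword-format IEFSSNxx entry for a DB2 member with sysplex-wide command scope might look like this:

SUBSYS SUBNAME(DB2A) INITRTN(DSN3INI) INITPARM('DSN3EPX,-DB2A,S')

The S at the end of the INITPARM value is the scope parameter discussed above.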

3.4.6 DB2 Subsystem Parameters


Even though most installation parameters affect the operation of only a single DB2, some parameters must be the same on all the sharing subsystems; for example, the catalog alias name. Other installation parameters must be unique for each member. Most installation parameters do not have to be unique, but in a parallel sysplex it is always a good idea to make these definitions the same in order to simplify operating procedures. See the DB2 Version 4 Data Sharing and Administration manual for recommendations on how to code the various parameters.


3.5 VSAM RLS


Record Level Sharing (RLS) is a new VSAM function that will be provided by DFSMS Version 1 Release 3 and exploited by a future CICS. RLS enables VSAM data to be shared, with full update capability, across many applications running in many CICS regions across the parallel sysplex. RLS removes the need for CICS file-owning regions (FORs) and thus eliminates a single point of failure. Instead, DFSMS 1.3 supports a new data sharing subsystem, SMSVSAM, which runs in its own address space and provides the RLS support required by CICS AORs, and batch jobs, within the parallel sysplex environment. The SMSVSAM server, which is initialized automatically during an IPL, uses the coupling facility for its cache structures and lock structures. See Figure 31 on page 107 for an example. To enable VSAM RLS, you must:

- Run all systems performing RLS as a sysplex.
- Define and activate sharing control data sets (SHCDS).
- Define CF cache and lock structures to MVS, using the coupling facility resource manager (CFRM) policy, and to the SMS base configuration.
- Associate CF cache set names with storage class definitions, and write routines to associate data sets with storage class definitions that map to cache structures.
- Change the attributes for a data set to specify whether the data set is recoverable or nonrecoverable. Specify LOG(NONE) if the data set is nonrecoverable. Specify LOG(UNDO) or LOG(ALL) if the data set is recoverable.

Figure 31 on page 107 shows the major components involved in VSAM RLS.


Figure 31. Sample VSAM RLS Data Sharing Configuration. Each SMSVSAM Address space has access to the coupling facility which contains the lock and cache structures.

3.5.1 Control Data Sets


VSAM RLS requires a number of data sets for controlling record-level data sharing. These data sets are called Sharing Control Data Sets (SHCDS). Sharing control is a key element in maintaining data integrity in a shared environment. Because persistent record locks are maintained in the CF, several classes of failure could occur, such as a sysplex, system, or SMSVSAM address space restart, or a CF lock structure failure. The SHCDS is designed to contain the information required for DFSMS/MVS to continue processing with a minimum of unavailable data when a failure occurs. The SHCDS acts as a log for sharing support and contains the following information:

- The name of the CF lock structure in use
- The system status for each system or failed system instance
- The time that the system failed
- A list of subsystems and their status
- A list of open data sets using the CF

The VSAM sharing control data sets are logically-partitioned, linear data sets. They can be defined with secondary extents, but all the extents for each data set must be on the same volume. You should define at least three sharing control data sets, for use as follows:

- VSAM requires two active data sets for use in duplexing mode.
- VSAM requires the third data set as a spare in case of failure of one of the active data sets.


Place the SHCDSs on volumes with global connectivity. VSAM RLS processing is only available on those systems which currently have access to an active SHCDS. Ensure that the space allocation for active and spare SHCDSs is the same. See the DFSMS/MVS Version 1 Release 3 DFSMSdfp Storage Administration Reference for more information about sharing control data sets and for sample JCL for defining them.

3.5.2 Defining the Database


VSAM provides some new data set attributes that are specified via the define or alter functions of IDCAMS. The LOG parameter provides the capability to define the following:

- No recovery required; the sphere is not recoverable.
- Backout required; the sphere is recoverable.
- Backout and forward recovery required; the sphere is recoverable.

It is strongly recommended that LOG(ALL) be specified when defining the VSAM data sets, so that CICS and VSAM can provide a high availability environment with forward recovery and backout capabilities. With LOG(ALL), you must specify in the ICF catalog the name of the MVS log structure that CICS is to use as the forward recovery log. This log stream name must match the name of a log stream defined to the MVS system logger.
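A hedged sketch of the corresponding IDCAMS statement, with hypothetical data set and log stream names, might look like this:

ALTER PROD.CICS.ACCOUNTS LOG(ALL) LOGSTREAMID(PROD.FWDRECOV.LOG01)

The same LOG and LOGSTREAMID parameters can be specified on DEFINE CLUSTER when the data set is first created.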

3.5.3 Defining the SMSVSAM Structures


The VSAM data sharing environment uses two different structure types: cache structures to manage the data, and a lock structure to maintain data integrity.

3.5.3.1 LOCK Structure


DFSMS/MVS requires a master coupling facility lock structure, IGWLOCK00. The total number of locks and the amount of acceptable false contention are the two factors that you must consider when defining the size of the coupling facility lock structure. The coupling facility lock structure must be accessible from all systems in the sysplex that support VSAM RLS processing; thus it must have universal connectivity. The coupling facility lock structure should also be nonvolatile. As soon as data sharing is initialized, each member connects to the coupling facility lock structure. Defining this structure requires only a CFRM policy update.
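As a hedged sketch (the size and coupling facility names are placeholders), the CFRM policy update might add a structure definition such as:

STRUCTURE NAME(IGWLOCK00) SIZE(14336) PREFLIST(CF01,CF02)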

3.5.3.2 CACHE Structure


You may have more than one coupling facility, and more than one coupling facility cache structure that DFSMS/MVS uses. The size of any coupling facility cache structure is established when the coupling facility cache structure is connected. After installation you can determine the initial and maximum size of the DFSMS/MVS cache structure; DFSMS/MVS connects using the initial size specified in the policy. You can alter (expand or reduce) this size with the SETXCF ALTER command. The ISMF Control Data Set Application is the only way of defining the coupling facility cache structures to DFSMS/MVS. To define the cache structures, the following actions are required:


1. CFRM policy update

   You define the coupling facility cache structures through the XES coupling definition process.

2. SMS configuration and ACS routine updates

   From an SMS point of view, the following actions are required:

- Update the base configuration. The base configuration section now includes the list of cache set names. This new list has an entry for each unique cache set name specified in the storage class. Each cache set name has an associated list of up to eight coupling facility cache structure names.

- Set up the storage class construct. The CACHE SET parameter has been added to the storage class construct. The cache set name is used to map the storage class to a cache set that you specify in the base configuration.

- Update the storage class ACS routine. You will be required to update the current storage class routine to match the new configuration, as illustrated in the sketch that follows this list.
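A minimal sketch of such a storage class ACS routine fragment, assuming hypothetical data set name filters and a hypothetical storage class name, is shown here:

PROC STORCLAS
  FILTLIST RLSDATA INCLUDE(PROD.CICS.**)
  SELECT
    WHEN (&DSN = &RLSDATA)
      SET &STORCLAS = 'CICSRLS'
  END
END

The storage class CICSRLS would name a CACHE SET in its definition, which in turn maps to the coupling facility cache structures listed in the base configuration.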

Additional Information: For further details, please refer to DFSMS/MVS 1.3 Implementation Guide, GG24-4391.

3.5.4 CICS Use of System Logger


It is planned that a future CICS will use the MVS system logger for logging. CICS log streams, in particular the system log, are extremely important because they are the key to maintaining the integrity of system resources. The following setup guidelines are recommended:

- Share structures between MVS images. This provides immediate logstream recovery for the logs used by the failing image. Otherwise recovery will be delayed until the next time a system connects to the failed logstream.
- Use a standard naming convention for the log structures that equates the structure to the type of logstream. For example:
  LOG_DFHLOG_001    CICS system logs
  LOG_DFHSHUNT_001  CICS secondary system logs

- In the CFRM policy, specify REBUILDPERCENT(0) to ensure structures are always rebuilt.
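Following that naming convention, a hedged sketch of the corresponding system logger (LOGR) policy definitions, with hypothetical log stream name and sizes, might look like this:

DATA TYPE(LOGR)
DEFINE STRUCTURE NAME(LOG_DFHLOG_001) LOGSNUM(10) AVGBUFSIZE(400) MAXBUFSIZE(4096)
DEFINE LOGSTREAM NAME(CICSA.DFHLOG) STRUCTNAME(LOG_DFHLOG_001)

These statements are processed by the IXCMIAPU administrative data utility, in the same way as the CFRM policy definitions described in Chapter 5.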

3.6 TSO in a Parallel Sysplex


Currently TSO/E is not sysplex aware (other than the broadcast function), and as such each image must be managed as a separate entity. A TSO/E user with a single user ID will not be able to have concurrent sessions on different MVS images because of the JES2 restriction on duplicate jobnames in a MAS. A simple modification to JES2 will allow the logon of duplicate TSO user IDs in the MAS. The manual GG66-3263, JES2 Multi-Access Spool in a Sysplex Environment, provides the documentation, with some precautions.


3.7 System Automation Tools


Automation is an essential component of systems management in a parallel sysplex. The increased message traffic during normal operations of multiple MVS images routed to one or two consoles makes message suppression essential. During an error or failure situation, human reaction time is too slow to meet the high availability requirements of systems in a parallel sysplex. In the past, automation has been used to reduce the amount of message traffic displayed at a console, and to some degree has started to address recovery situations, but now, in a parallel sysplex environment, automation of recovery should be exploited to the fullest.

Apart from the critical response time component of automation, an equally important factor is the complexity of the data sharing environment. When a failure occurs in a component in the parallel sysplex, for example if a system supporting an IMS fails, preserving data integrity is paramount. The recovery involves restarting the failed IMS on another system and linking that IMS to the correct IRLM to allow recovery of coupling facility locks to take place. This process cannot be left to an operator who has to rely on potentially out-of-date hand-written documentation. Automation, using MVS/ESA SP Version 5.2 facilities such as the Automatic Restart Manager (ARM), is essential.

Note that there is some overlap between the different automation tools, particularly as regards restart handling. You can define restarts today in multiple systems: AOC, ARM, and OPC. However, if you define the same job or task in more than one, then the results will certainly not be what you want.

3.7.1 NetView
In multisystem environments today, the recommendation is to have one of the systems act as a focal point system to provide a single point of control, where the operator is notified of exception conditions only. This is still the case in a parallel sysplex. Another recommendation is to automate as close to the source as possible. This means having both NetView and AOC/MVS installed on all systems in the parallel sysplex, to enable automation to take place on the system where the condition or message occurs. For continuous availability, a backup focal point NetView should also be planned for in your configuration. This will allow the current focal point system to be taken out of the parallel sysplex for planned outages. In addition, unplanned outages on the focal point will not render the parallel sysplex inoperable. It is recommended that the NetView focal point and the focal point backup exist on the two VTAM network nodes in the parallel sysplex. Refer to 3.8, VTAM on page 112 for information about the VTAM configuration.

3.7.2 AOC/MVS
One of the principal tools for system automation is IBM SystemView Automated Operations Control/MVS (AOC/MVS). AOC/MVS extends NetView to provide automated operations facilities that improve system control. With Release 4 of AOC/MVS, support is provided for the Automatic Restart Management (ARM) function, so failing applications can be automatically restarted. With improved status coordination between focal point and target systems, the status of monitored resources can be accurately reflected on the AOC/MVS graphical interface. Operators can take prompt recovery actions for outages.

3.7.3 OPC/ESA
OPC/ESA provides automation for planning, controlling, and managing the batch workload over multiple MVS systems today, so there is little change when these systems form a parallel sysplex.

3.7.3.1 OPC/ESA Setup


OPC/ESA consists of one controller and a number of trackers. The controller is the focal point of the OPC/ESA configuration. It contains the controlling functions, the ISPF dialogs, the databases, and the plans. It can control the entire OPC/ESA configuration, the OPCplex, both local and remote. You need at least one OPC/ESA controlling system, and if you lose it your entire batch production will come to a halt, as no more jobs will be started. OPC/ESA systems that communicate with the controlling system are called controlled, or tracker systems. The tracker communicates event records with the controlling system either through shared DASD, XCF, or NCF (which uses ACF/VTAM to link OPC/ESA systems). When one or more trackers are connected to the controller via XCF communication links, the OPC/ESA systems form an XCF group. The systems use XCF group, monitoring, and signaling services to communicate. The controller submits work and control information to the trackers using XCF signaling services. The trackers use XCF services to transmit events back to the controller.

3.7.3.2 Standby OPC/ESA Controller


A standby system can take over the functions of the active controller if the controller fails, or if the MVS/ESA system that it was active on fails. You can create a standby controller on one or more OPC/ESA controlled systems within an XCF group. Each standby system must have access to the same resources as the controller. These include data sets and VTAM cross-domain resources. The standby system is started the same way as the other OPC/ESA address spaces, but is not activated unless a failure occurs or unless it is directed to take over by an MVS/ESA operator modify command. The standby controller will detect the failure of the active one from an XCF signal and will then activate itself. For this reason you should not define the OPC/ESA controller to ARM.

3.7.3.3 Job Routing


You should only route jobs to a particular system in the sysplex if there is a reason to do so, such as nonsymmetric resource availability, or because the job requires a lot of CPU resources and must therefore be run on a particular processor. The normal approach should be to submit jobs so that they can be run anywhere in the JESplex, which will normally be the same as the sysplex. This will minimize the impact if any one system is stopped.


3.8 VTAM
VTAM can be configured in a multitude of ways in and outside of the sysplex. It is not our intention to describe all of the possible VTAM configurations; instead, only the items that apply to availability will be discussed.

3.8.1 Configuration
The generic resources function is provided by a sysplex using Advanced Peer-to-Peer Networking (APPN). You need VTAM 4.2 for generic resources support, and at least one VTAM in the sysplex must be an APPN network node, with the other VTAMs being APPN end nodes. Each VTAM must be connected to the coupling facility and be part of the same sysplex. A high availability environment requires more than one APPN network node in the sysplex.
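As a hedged sketch (only the APPN role option is shown; all other required start options are omitted), the node type is selected in the ATCSTRxx start option list of each VTAM:

NODETYPE=NN   (for the VTAMs acting as APPN network nodes)
NODETYPE=EN   (for the VTAMs acting as APPN end nodes)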


Part 2. Making Planned Changes


This part describes how you can make changes to the sysplex without disrupting the running of the applications. These changes may consist of:

- Adding hardware or updating software
- Removing hardware or software

The kind of software setup you need in order to accomplish this is also discussed.


Chapter 4. Systems Management in a Parallel Sysplex


This chapter discusses the importance of maintaining good Systems Management disciplines in a parallel sysplex environment.

4.1 The Importance of Systems Management in Parallel Sysplex


The parallel sysplex, from a simple two-system sysplex up to a full-blown configuration of 32 MVS images with corresponding CICS, IMS, and DB2 subsystems, is a complex environment. There could also be a number of sysplexes; for example, test, development, and production. In any event, properly managing an environment as complex as this necessitates extremely good systems management disciplines. Change and problem management are key elements within these disciplines and will directly affect availability. Since changes may be cascaded through the sysplex to reduce or eliminate outages, knowing the state of a change is more complex. Determining why you have a problem in one system or subsystem and not in others ties problem and change management together.

4.1.1 Change Management


Change management is essential for high availability. You must be able to plan and track all proposed changes to the sysplex. The key to achieving this is good communication between all involved parties. As a result:

- Changes can be introduced in an orderly manner, with all personnel being aware of and ready to respond to the change.
- The introduction of the change can be tied to the capacity planning and performance management disciplines to ensure sufficient resources exist to support the change.
- The change management process forces more detailed planning. This reduces problems and allows either more time to implement additional planned changes, or consolidating changes to reduce the number of planned outages, rather than spending time reacting to problems.

The change management process must ensure that all changes are tracked adequately. With the philosophy in a parallel sysplex being that change is introduced in one place and then propagated through the sysplex over a period of time, you must be able to determine, at any point in time, what changes have been implemented on what elements within the parallel sysplex.

4.1.2 Problem Management


Problem management is required to track, report, and resolve all problems. It is closely tied to the change management process, as most problems will require some change to be implemented to prevent recurrence. Good problem recording and tracking disciplines result in the creation of a problem database. This database can be used to spot trends and specific areas prone to problems, and allows management to judge where appropriate investments may be made to reduce the number, duration, or severity of problems. The end result is higher availability.


4.1.3 Operations Management


In a parallel sysplex, the major tasks of operating the system do not change greatly, and you can still use consoles to receive messages and issue commands to accomplish tasks. However, the potential exists for the number of consoles to increase, the number of messages to increase, and for the tasks to become more repetitive and more complex because of the existence of multiple images of multiple products. In a sysplex environment, it becomes critical to exploit products that provide a single system image and to set up workstations so that groups of tasks can be performed from a single point. Minimal human intervention is desirable in any computing operations environment, but the need for it becomes more acute in a parallel sysplex. To ensure availability in a sysplex, operations management disciplines need to be reviewed and improved through console integration, message suppression, and automation. The desired results are:

- The only tasks left for operators are those that cannot be automated.
- Operators are alerted only to exception conditions requiring them to take some action.
- Operators are aware of the status of the sysplex.
Table 6. Automation Recommendations. This table contains a list of strongly recommended automation suggestions.

Item                          Reference
IMS/ARM                       2.17.4.2, ARM and IMS on page 84
Console Switching             1.12.6.2, Hardware Requirements on page 24
Couple Data Set Activation    9.13.1, Sysplex (XCF) Couple Data Set Failure on page 206

4.1.4 The Other System Management Disciplines


The other systems management disciplines, business management, performance management, and configuration management, are as important in a parallel sysplex as they are in any other operating environment. For a detailed discussion of all the aspects of systems management in a sysplex, refer to System/390 MVS Sysplex Systems Management, GC28-1209.

4.1.5 Summary
The bottom line is, if the increased availability potential of a parallel sysplex is to be realized, then the installation's systems management disciplines need to be:

- In place
- Of a very high standard
- Adhered to rigorously
- Reviewed regularly and updated accordingly


Chapter 5. Coupling Facility Changes


This chapter deals with general changes that can be made to the coupling environment for installation, and for both planned and unplanned maintenance of a coupling facility. The chapter begins with an overview of how coupling facility structures can be manipulated; it then describes in detail the tasks to be performed in order to add a coupling facility or remove one for servicing.

5.1 Structure Attributes and Allocation


An exploiter of the coupling facility obtains access to a particular structure by using the connection service, and more precisely the IXLCONN macro. If an exploiter issuing the connection request through IXLCONN is the first one to connect to the designated structure, the coupling facility control code (CFCC) allocates (that is, builds) the structure first, and then connects the requesting exploiter to the structure. When an exploiter no longer needs access to the structure, it is expected to disconnect from the structure by issuing the IXLDISC macro. The IXLCONN macro issued by the first exploiter to connect to the structure therefore initiates the creation of the structure and sets the structure attributes. Attribute values are picked as follows:

- From the active CFRM policy:
  - Structure name
  - Structure initial size
  - Structure maximum size

- From the exploiter's internal code:
  - Structure type (cache, list, or lock)
  - Requested volatility state
  - Structure disposition
  - Permission to rebuild
  - Permission to alter size
  - Apportioning specifications, such as:
    - Directory to data element ratio for a cache structure
    - List entry to list element ratio for a list structure
    - Lock entries and users numbers for a lock structure
  - The level of CFCC required (CFLEVEL0 or CFLEVEL1 for the time being)

Further details can be found in Programming: Sysplex Services Guide , GC28-1495. The allocation of the structure is performed as per the current CFRM policy preference and exclusion lists for the structure. The selection process points to the first available coupling facility in the preference list which satisfies the following:

- It has connectivity to the system on which the request is made.
- It has a CFLEVEL equal to or greater than the requested CFLEVEL.


- It meets the volatility requirement.
- It meets the failure independence requirement.
- It does not contain structures in the exclusion list.
- It has the requested space available.

If there is no coupling facility meeting the above criteria, XES goes through the allocation selection again, relaxing some criteria:

- It ignores the exclusion list.
- It ignores the failure independence requirement.
- It ignores the volatility requirement.
- It returns a structure in the coupling facility with the most storage that meets or exceeds the CFLEVEL requirement.

5.2 Structure and Connection Disposition


This section describes how a coupling facility structure reacts depending on the type of connection established with the user.

5.2.1 Structure Disposition


When issuing the connection request via the IXLCONN macro, the requestor must specify several keywords. These define the type of structure (list, cache, or lock), the structure disposition (KEEP or DELETE), and the requirement for a nonvolatile coupling facility. A structure disposition of KEEP indicates that even when there are no more exploiters connected to the structure, because of normal or abnormal disconnection, the structure is to remain allocated in the coupling facility processor storage. In contrast, a structure disposition of DELETE implies that as soon as the last connected exploiter disconnects from the structure, the structure is deallocated from the coupling facility processor storage.

To manually deallocate a structure with a disposition of DELETE, one has to have all exploiters disconnected from the structure. An example of such a process is deallocating the XCF signalling structure. This can be achieved by stopping, in every member of the sysplex, the PATHINs and PATHOUTs using the structure, with the following MVS operator commands:

RO *ALL,SETXCF STOP,PI,STRNM=xcf_strname RO *ALL,SETXCF STOP,PO,STRNM=xcf_strname


To deallocate a structure with a disposition of KEEP, one has to force the structure out of the coupling facility by the SETXCF FORCE command.
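A hedged sketch of the command form (the structure name is a placeholder) is:

SETXCF FORCE,STRUCTURE,STRNAME=strname

Because of the risk of data loss described in the notes below, this command should only be issued once the state of all connections has been verified.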


Notes:

1. A structure with active or failed-persistent connections cannot be deallocated. The connections must be put in the undefined state first. See 5.2.2, Connection State and Disposition on page 119.
2. A structure with a related dump still in the coupling facility dump space cannot be deallocated. The dump has to be deleted first. See 5.3, Structure Dependence on Dumps on page 120.
3. Because of the risk of data loss, care must be taken when using the SETXCF FORCE command. All consequences must be well understood before issuing the command.

IBM Exploiters Using Structures with a Disposition of DELETE

- XCF signalling structures
- RACF database structures
- Automatic tape switching structure
- System Logger logstream structures
- DB2 GBP structure
- IMS/DB OSAM and VSAM caches
- VTAM generic resource name structure

5.2.2 Connection State and Disposition


A connection is the materialization of a structure exploiter's access to the structure. The connection can be in one of three states:

- Undefined means that the connection is not established.
- Active means that the connection is currently being used.
- Failed-persistent means that the connection has abnormally terminated but is logically remembered, although it is not physically active.

At connection time, another parameter in the IXLCONN macro indicates the disposition of the connection. A connection can have a disposition of KEEP or DELETE. A connection with a disposition of KEEP is placed in what is called a failed-persistent state if it terminates abnormally, that is, without a proper completion of the exploiter task. When in the failed-persistent state, a connection will become active again as soon as the connectivity to the structure is recovered. The failed-persistent state can be thought of as a place holder for the connection to be recovered. Note that in some special cases a connection with a disposition of KEEP may be left in the undefined state even after an abnormal termination. A connection with a disposition of DELETE is placed in an undefined state if it terminates abnormally. When the connectivity to the structure is recovered, the exploiter has to reestablish a new connection.
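A quick way to check the state and disposition of the connections to a given structure, sketched here with a placeholder structure name, is the display command:

D XCF,STRUCTURE,STRNAME=strname

The command output lists the connection names together with their state and disposition.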


To Check for Structure and Connection State and Disposition: Refer to Appendix B, Structures, How to ... on page 241.

5.3 Structure Dependence on Dumps


The SVC dump service can be requested to capture structure data, either from a program issuing the SDUMPX macro or by a request from the system operator through the DUMP command with the STRLIST parameter.

No SVC Dump for Lock Structures: Lock structures do not support dumping by any kind of MVS dump tool, and therefore the considerations in this section do not apply to this structure type.

Having a dump taken against a structure has the following implications:

- The access to the structure by exploiters is delayed until the capture of the dump data is complete. Although the dump space defined in the coupling facility is intended to capture dump data without holding the structure's exploiters for too long a time, a dump serialization time limit is specified as a parameter of the IXLCONN macro. When this time limit is reached, the dump function is terminated and the structure is released. This limit can furthermore be enforced or overridden by DUMP command parameters.
- A structure dump residing in the coupling facility dump space prevents structure deallocation until it is transferred onto the dump data set or until it is deleted by using the SETXCF FORCE command. The SETXCF FORCE command allows the following:
  - Deletion of a structure dump
  - Release of the dump serialization for a structure
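As a hedged sketch (the structure name is a placeholder), the two corresponding command forms are:

SETXCF FORCE,STRDUMP,STRNAME=strname
SETXCF FORCE,STRDUMPSERIAL,STRNAME=strname

The first deletes the structure dump held in the coupling facility dump space; the second releases dump serialization so that exploiters can access the structure again.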

5.4 To Move a Structure


There are two ways of moving a structure:

1. Moving can be done by deallocating and then reallocating the structure. Deallocating a structure requires knowledge of the current state of the connections to the structure and of any dump still residing in the coupling facility dump space, so that proper action can be taken. Refer to Appendix B, Structures, How to ... on page 241. Reallocating a structure after deallocation is the same as going through the initial allocation process; that is, the new allocation will be performed by scanning the active preference list for the structure as soon as an exploiter reconnects to it. If the current failure mode allows reallocation in the same initial coupling facility, chances are great that the structure will end up in the same coupling facility. For the structure to move, the active CFRM policy must have been changed since the original allocation of the structure, so that the preference list now shows another coupling facility as the best candidate to allocate into.


2. Another way to move a structure is to use the rebuild function against the structure, either by an operator-initiated rebuild (using the SETXCF START,REBUILD command) or by a dynamic rebuild request issued by the structure exploiter during its recovery process (using the IXLREBLD macro). The structure rebuild function can be explicitly required to rebuild the structure in another coupling facility. For further details, refer to 5.4.1, The Structure Rebuild Process.

Important Notice: Not all structures can be rebuilt. A structure can be originally allocated with rebuild disallowed; in such a case, all requests to rebuild the structure will be denied, and structure movement will have to be performed by deallocation and reallocation.

5.4.1 The Structure Rebuild Process


The process of rebuilding a structure consists of creating a new instance of an existing structure. It implies the following:

1. At least one active connection to the original structure instance.
2. Agreement of all the connected instances of the exploiter code on rebuilding the structure (that is, all of them have connected with rebuild allowed).
3. For the rebuild process to be considered a viable means of recovery or maintenance:
   a. There must be enough physical resources available, in terms of coupling facilities defined in the active preference list and their processor storage, to allow the creation of the new (and temporarily duplicate) instance of the structure.
   b. All the potential and active exploiters of the structure must have connectivity to the new instance of the structure.

The rebuild process can be started for one of the following reasons:

- An operator-initiated request to rebuild
- A connected exploiter request because of:
  - Loss of connectivity to the structure
  - Structure failure
  - A specific exploiter reason (as per the specific exploiter's conventions)

The rebuild process can also be stopped (and therefore does not complete successfully) because of one of the following reasons:

- An operator-initiated request to stop the process
- A connected exploiter request because of:
  - Loss of connectivity to the original structure while rebuilding
  - Original structure failure while rebuilding
  - A specific exploiter reason

When the rebuild process completes successfully, the original instance of the structure is deallocated and processing resumes using the new instance of the structure. When the rebuild process does not complete successfully, the new instance of the structure is deallocated, and the original instance remains as it was when entering the rebuild.

5.4.1.1 To Manually Invoke the Structure Rebuild Process


You can use one of the following forms of the rebuild request:

- To rebuild a structure in the first matching coupling facility in the structure's active preference list. This will probably result in rebuilding in the same coupling facility, unless the preference list has been changed since structure allocation:

SETXCF START,REBUILD,STRNM=strname,LOCATION=NORMAL

- To rebuild a structure as per the active preference list, but excluding the current coupling facility:

SETXCF START,REBUILD,STRNM=strname,LOCATION=OTHER

Rebuild can also be invoked for the whole contents of a coupling facility:

SETXCF START,REBUILD,CFNAME=cfname,LOCATION=NORMAL SETXCF START,REBUILD,CFNAME=cfname,LOCATION=OTHER


To manually stop the rebuild process one has to issue the command:

SETXCF STOP,REBUILD,STRNAME=strname
or

SETXCF STOP,REBUILD,CFNAME=cfname
Structure Rebuild Affects Performance The utilization of the structure is suspended for the complete duration of the rebuild process. This may temporarily affect system throughput if dealing with a heavily used structure.

5.4.1.2 Dynamic Rebuild of a Structure


Dynamic rebuild of a structure is initiated by one of the structure's exploiters by issuing the IXLREBLD macro, and can therefore be a means to automate the recovery action in case of coupling facility failure. This is developed in further detail in 9.3.4.3, Automated Recovery from a Connectivity Failure on page 188.

5.4.1.3 Changing Structure Attributes or Location by Rebuilding


When creating the new instance of the structure during the rebuild, the structure attributes may be modified either because of a change in the active CFRM policy, or because of the new attributes being set up by the exploiter code invoking the dynamic rebuild. In this perspective, the structure rebuild process can be seen as a way to recover from improperly designed structure attributes or to validate structure changes induced by a new CFRM policy (see 5.6, Changing the Active CFRM Policy on page 125). The structure attributes that can be modified by using rebuild are as follows:

- The structure size, either by the rebuilder's decision or because of a modification to the active CFRM policy.
- Attributes set up by the requestor code, such as:


  - Request for a nonvolatile coupling facility
  - Directory to element ratio for a cache structure
  - Entry to element ratio for a list structure
  - Lock entries for a lock structure

Details on structure allocation and rebuild can be found in Sysplex Services Reference, GC28-1496. Information on how IBM exploiters support rebuild can be found in Table 7.
Table 7. Support of REBUILD by IBM Exploiters

Exploiting function      Structure name                  Rebuild supported
XCF Signalling           IXC.....                        Yes
System Logger            user defined name               Yes
JES2 CKPT                user defined name               No
Shared Tape              IEFAUTOS                        Yes
RACF                     IRRXCF00                        Yes, but ignores LOC=OTHER, always rebuilds all structures altogether, and ignores STOP REBUILD
IMS OSAM Cache           user defined name               Yes
IMS VSAM Cache           user defined name               Yes
IRLM lock table IMS      user defined name               Yes
IRLM lock table DB2      groupname_LOCK1                 Yes
VTAM generic resource    VTAM 4.2: ISTGENERIC;           Yes
                         VTAM 4.3: can be user defined
DB2 SCA                  groupname_SCA                   Yes
DB2 GBP                  groupname_GBP                   No
SMSVSAM                  user defined name               Yes

5.5 Altering the Size of a Structure


The size of the structure can be dynamically modified by operator request, using the following command:

SETXCF START,ALTER,STRNAME=strname
The structure cannot be expanded beyond the SIZE parameter value specified in the active CFRM policy. To check for the maximum size allowed use the following:

D XCF,STRUCTURE,STRNAME=strname
A structure can also be dynamically altered by a program using the IXLALTER service. The IXLALTER service allows modification to the structure size and the entry to element ratio attribute, and is intended to provide dynamic structure reapportionment capability.
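As a hedged sketch (the structure name and target size are placeholders), an operator-initiated expansion specifies the new size directly on the command:

SETXCF START,ALTER,STRNAME=LIST01,SIZE=20480

The request is accepted only under the conditions listed next.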

XES accepts the ALTER request if all of the following are true:

- SCP is MVS 5.2 or higher.
- The structure to be altered is in a coupling facility with CFLEVEL=1 or higher.
- All currently active or failed-persistent connectors to the structure allowed structure alter when they connected.
- The structure is not already in the rebuild process.

Structure ALTER for a Persistent Structure

A structure with a disposition of KEEP and no active or failed-persistent connectors can be altered.

A structure alteration can be stopped either by a connecting program or by the operator with the following command:

SETXCF STOP,ALTER,STRNAME=strname
Structure rebuild and structure alter can be thought of as complementary functions. The structure rebuild function allows the changing of many of the structure attributes but requires planning in that coupling facility space must be available for later rebuild use. Structure alter does not require additional space to be reserved beyond the maximum SIZE specified in the active CFRM policy and does not disrupt the processing of connectors to the structure while it is being altered. Information on how IBM exploiters support alter can be found in Table 8.
Table 8. Support of ALTER by IBM Exploiters

Exploiting function      Structure name                  Alter supported
XCF Signalling           IXC.....                        Yes
System Logger            user defined name               Yes
JES2 CKPT                user defined name               Yes
Shared Tape              IEFAUTOS                        No
RACF                     IRRXCF00                        No
IMS OSAM Cache           user defined name               No
IMS VSAM Cache           user defined name               No
IRLM lock table IMS      user defined name               Yes
IRLM lock table DB2      groupname_LOCK1                 Yes
VTAM generic resource    VTAM 4.2: ISTGENERIC;           No
                         VTAM 4.3: can be user defined
DB2 SCA                  groupname_SCA                   Yes
DB2 GBP                  groupname_GBP                   Yes
SMSVSAM                  user defined name               Yes


5.6 Changing the Active CFRM Policy


The CFRM couple data set contains the policies installed by the IXCMIAPU administrative data utility. These are the administrative copies. When a policy is started as the active CFRM policy by the SETXCF START,POL,TYPE=CFRM,POLNAME=polname command, another copy is made on the CFRM couple data set from the selected policy and becomes the active policy. Changes can be made to the currently active policy by executing IXCMIAPU with the new policy parameters; they will only affect the administrative copy of the policy, and the active copy remains unchanged. To actually change the active CFRM policy, the operator must issue the SETXCF START,POL,TYPE=CFRM,POLNAME=polname command with the same policy name to transfer the administrative copy changes to the active copy. However, this is not the recommended way of managing policies, since it may lead to confusion regarding the proper identification of the level of the currently active CFRM policy. It is instead recommended to create a new policy with a new name. When a CFRM policy is started as active by the SETXCF START command:

- If no active policy is currently available, the activation takes effect immediately.
- If a policy is already active, the transition to the new policy parameters may not occur immediately:
  - When adding a new coupling facility, the preference list for each structure definition in the active policy has to be updated with the new coupling facility logical name. This updating takes some time, and it may prevent the operator from immediately using the new coupling facility logical name in commands.
  - When changing the dump space size in the new policy, MVS attempts to change the dump space size immediately in the coupling facility and, if not successful, continues to attempt the change. Use the DISPLAY CF command to determine the dump space size in the policy and the dump space actually defined in the coupling facility.
  - When deleting a coupling facility or structure, or when modifying a structure in the new policy, the following occurs:
    - The change takes effect immediately if the coupling facility resources are not allocated for the particular structure.
    - The change remains pending if coupling facility resources are allocated for the structure. The D XCF command provides specific information about a structure and any pending policy changes. The structure resources need to be deallocated, either by an operator command such as SETXCF FORCE, or, if structure rebuild is allowed, SETXCF START,REBUILD can be used to rebuild a new instance of the structure as per the new parameters in the CFRM policy.

The addition of a coupling facility or structure takes effect immediately, assuming that the CFRM couple data set has free space available to record these new resources. If not, other resources have to be freed, or the CFRM couple data set must be reformatted to accommodate the new additional resources. Refer to 5.7, Reformatting the CFRM Couple Data Set on page 126 for the procedure to reformat a couple data set nondisruptively.

Examples of CFRM Policy Transitioning: See Appendix C, Examples of CFRM Policy Transitioning on page 249.

5.7 Reformatting the CFRM Couple Data Set


The couple data set format utility (IXCL1DSU) will allocate space to the CFRM couple data set based on inputs such as the following:

- Maximum planned number of policies to install in the couple data set
- Maximum number of structures, and maximum number of connectors for any given structure
- Maximum number of coupling facilities in the installation
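A hedged sketch of a format job for a larger CFRM couple data set follows; the sysplex name, data set name, volume, and counts are placeholders only:

//FMTCDS   EXEC PGM=IXCL1DSU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DEFINEDS SYSPLEX(PLEX1)
    DSN(SYS1.XCF.CFRM02) VOLSER(CDSVL1)
    DATA TYPE(CFRM)
      ITEM NAME(POLICY) NUMBER(6)
      ITEM NAME(CF) NUMBER(4)
      ITEM NAME(STR) NUMBER(64)
      ITEM NAME(CONNECT) NUMBER(16)
/*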

If the size of the CFRM couple data set proves to have been planned incorrectly, the following procedure can be used to dynamically bring online a new couple data set with the appropriate size. Note that this procedure works only when increasing the size of the couple data set.

To Decrease the Size of the Couple Data Set: Decreasing the size of a couple data set cannot be done nondisruptively; an alternate couple data set smaller than the primary couple data set cannot be brought online concurrently. You must prepare the new couple data set and the new COUPLExx member, then IPL the sysplex using this new couple data set.

1. Run IXCL1DSU against a spare couple data set with the new couple data set specifications.

2. When the spare couple data set is formatted, use the command SETXCF COUPLE,ACOUPLE=(spare_dsname,spare_volume),TYPE=CFRM to make the spare couple data set the new alternate CFRM couple data set.

   Note: As soon as the spare couple data set has been switched into alternate, the new alternate couple data set will be loaded with the primary couple data set policy contents.

3. Then switch the new alternate into the new primary couple data set:

SETXCF COUPLE,TYPE=CFRM,PSWITCH
4. The previous primary couple data set is no longer in use, and can be enlarged by the same process before becoming in turn a new alternate couple data set.

Keep COUPLExx in Sync: It is recommended that the COUPLExx member be updated after swapping the couple data sets, so that an operator intervention to retrieve the last used couple data sets is not required at the next IPL.
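As a hedged sketch (the data set and volume names are placeholders carried over from the format job above), the COUPLExx statement to keep in step with this example would look like:

DATA TYPE(CFRM)
     PCOUPLE(SYS1.XCF.CFRM02,CDSVL1)
     ACOUPLE(SYS1.XCF.CFRM01,CDSVL2)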


5.8 Adding a Coupling Facility


A coupling facility can be added for different reasons:

- As a new permanent device in a production or test configuration; this is expected in most cases to be an additional 9674 coupling facility.
- As a temporary alternate coupling facility while the primary coupling facility is being serviced. In this case, it is conceivable that the alternate coupling facility be a logical partition in one of the sysplex CPCs. Note that the latter will require CFR CHPIDs to be dedicated to the coupling facility logical partition. This configuration is not a recommended production configuration, since structure recovery can be seriously impacted if the CPC where the coupling facility and some MVS images cohabit were to fail.

5.8.1 To Define the Coupling Facility LPAR and Connections

- Define the coupling facility via HCD. HCD must be used to define the logical partition for the coupling facility and to define connectivity between the coupling facility senders and receivers. Keep track of the partition number you specify for the logical partition, so that you can match it with the partition number in the CFRM policy. Specify the SIDE parameter for a CPC only in a physically partitioned configuration. The SIDE parameter is needed in order to define the coupling facility to one or the other physical side of the CPC. Use caution when splitting or merging physical sides of a processor that contains a coupling facility; the action might change the SIDE information that identifies the coupling facility.

- Define the coupling facility logical partition. If the coupling facility is in an ES/9000 LPAR, the IOCDS must be reloaded from HCD to get the new LPAR information and to display the new partition on LPDEF at the next POR time. Fill in the LPDEF and LPCTL definitions for the coupling facility LPAR. If the coupling facility is in a 9672 LPAR, configure the HMC environment with a Reset and Image profile. Then download the HCD IOCDS to the coupling facility service element through an MVS running on a CPC on the same SE LAN.

- The coupling facility CPC must go through a power-on reset so that the new coupling facility logical partition is known. This is achieved by running CONFIG POR on a 9021/9121 CPC, and by activating the CPC via the proper reset profile on a 9672/9674 CPC. Activate the coupling facility partition.

5.8.2 To Prepare the New CFRM Policy


This step can be performed prior to defining and activating the new coupling facility, in that the currently active CFRM policy can already have the new coupling facility specified in the preference lists.

Use the following command to get the information required to set up a new CFRM policy.

D CF,CFNAME=xxxx

- Define a new CFRM policy with the administrative data utility IXCMIAPU:
  - Define a new policy with the new coupling facility information.
  - Associate the structures to the coupling facility.
  - Define the amount of dump space in the coupling facility for dumping coupling facility structure data.
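A hedged sketch of such an IXCMIAPU job follows; the policy name, coupling facility identification (type, serial, partition), structure name, and sizes are placeholders that must be replaced with the values reported by the D CF command:

//POLDEF   EXEC PGM=IXCMIAPU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DATA TYPE(CFRM) REPORT(YES)
  DEFINE POLICY NAME(CFRMPOL2) REPLACE(YES)
    CF NAME(CF02) TYPE(009674) MFG(IBM) PLANT(02)
       SEQUENCE(000000040032) PARTITION(1) CPCID(00)
       DUMPSPACE(2000)
    STRUCTURE NAME(IXCSTR1) SIZE(10000)
       PREFLIST(CF02,CF01)
/*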

Activate the new policy in order to start using the coupling facility with the following command from any active system in the sysplex:

SETXCF START,POLICY,TYPE=CFRM,POLNAME=policy name

Verify that each MVS image that requires connectivity is connected to the coupling facility. To obtain information about the system connectivity for the coupling facility, issue the following command and specify the name of the coupling facility:

D XCF,CF,CFNAME=name
From now on, any exploiter can connect to the structures defined in the coupling facility.

5.8.3 Setting Up the Structure Exploiters


This section describes the setup to perform for the IBM structure exploiters, so that they can properly connect to the designated structures.

5.8.3.1 IMS Lock Structure


1. Define the IRLM lock structure in the CFRM policy. The structure name is user-defined. In the IRLM startup procedure for IMS DB, specify the following values:

SCOPE=GLOBAL
LOCKTABL=name-of-the-structure
2. Specify the lock structure on the CFNAMES,CFIRLM= control statement in one of the following procedures:

- The VSPEC member (DFSVSMxx) in the IMS procedure
- The DFSVSAMP DD statement in the DLIBATCH or DBBATCH procedures

3. Specify the IRLM parameters during system definition to connect IMS to IRLM:

- IRLM=YES in the IMSCTRL macro
- IRLM=Y in the IMS, DBBBATCH, or DLIBATCH procedure

  Note: This specification overrides the specification in the IMSCTRL macro.

- IRLMNM= in the IMSCTRL macro, or in the IMS, DBBBATCH, or DLIBATCH procedure.

4. Ensure that the correct CFRM policy has been started; then start the IRLMs which are to use this structure and the DBMS they are connected to.

5.8.3.2 IMS OSAM and VSAM Structures


1. For data sharing of OSAM and VSAM databases, define the cache structure for OSAM buffer invalidation and the structure for VSAM buffer invalidation in the CFRM policy. The structure names are user-defined.

2. For database recovery control (DBRC) in IMS, register the databases as:


sharelvl 3
3. For the command to start IMS, specify the following application access:

access = up
4. Specify the OSAM and VSAM buffer invalidate structures on the CFNAMES control statement in one of the following procedures:

- The VSPEC member (DFSVSMxx) in the IMS procedure
- The DFSVSAMP DD statement in the DLIBATCH or DBBATCH procedures

5. Ensure that the VSAM share options are (3 3).

6. Ensure the correct CFRM policy has been started, then start the DBMSs which are to use these structures.

5.8.3.3 XCF Signalling Structure


1. Determine the number of list structures that you require for signalling.

2. Define all required list structures in the CFRM policy. The structure names must begin with the following prefix:

IXCxxxxx
The remaining characters (xxxxx) can be alphanumeric or national characters (&, #, or @).

3. For each MVS in the sysplex, specify the names of the structures in the STRNAME keyword of the appropriate PATHIN and PATHOUT statements in the COUPLExx member of SYS1.PARMLIB.

4. Ensure that the correct CFRM policy is active; the new PATHINs and PATHOUTs can then be dynamically started by issuing the following from each MVS image:

SETXCF START,PATHIN,STRNAME=str_name
SETXCF START,PATHOUT,STRNAME=str_name
Or, if required for other reasons, the sysplex can be re-IPLed.
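Alternatively, as a hedged sketch (the structure name is a placeholder), the equivalent definitions can be made permanent in COUPLExx so that they are picked up at every IPL:

PATHOUT STRNAME(IXCSTR1)
PATHIN  STRNAME(IXCSTR1)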

5.8.3.4 JES2 Checkpoint Data Set Structure


1. Define the list structures for the JES2 checkpoint data sets in the CFRM policy. The structure names are user-defined.

2. On the JES2 CKPTDEF initialization statement, use the STRNAME= subparameter (of the CKPTn and NEWCKPTn parameters) to specify the JES2 checkpoint data set structure name. Delete the DSNAME and VOLSER subparameters.

3. Ensure the right CFRM policy is active, and then enter the checkpoint reconfiguration dialog to forward the checkpoint to the designated structure. Refer to 9.9, JES2 Recovery from a Coupling Facility Failure on page 199 for an example of the checkpoint reconfiguration.
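A hedged sketch of such a CKPTDEF statement, with hypothetical structure, data set, and volume names, might look like this:

CKPTDEF CKPT1=(STRNAME=JES2CKPT1,INUSE=YES),
        CKPT2=(DSNAME=SYS1.JES2.CKPT2,VOLSER=CKPT02,INUSE=YES),
        NEWCKPT1=(STRNAME=JES2CKPT2)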

5.8.3.5 VTAM Generic Resource Names Structure


1. Define the structure for the VTAM generic resources to the CFRM policy. If you are using VTAM 4.2, the structure name must be:

ISTGENERIC
If you are using VTAM 4.3, the structure name can be user-defined; in that case it must be specified to VTAM with the STRGR start option (refer to the VTAM 4.3 Network Implementation Guide).

2. Customize the generic resources exit routine if needed.


3. If you are using RACF or another security management product, authorize all the CICS TORs to register the generic resource name. To authorize the CICS TORs to access a VTAM generic resource, you must:

   Define a VTAMAPPL profile with the generic resource name as the VTAMAPPL name.
   Authorize each CICS TOR with READ access to the VTAMAPPL profile.

   (See the sample RACF commands after this list.)

4. After activating the new CFRM policy, re-initialize VTAM on each system.

5. Specify the generic resource name as a system initialization parameter in the system initialization table (SIT), or as an override, for each CICS TOR that is a member of the generic resource set. To activate the change, you must restart CICS on the system.
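
For example, assuming RACF, a generic resource name of CICSGR, and a TOR region user ID of CICSTOR1 (both names are assumptions made for this sketch), the authorization might be done as follows:

   RDEFINE VTAMAPPL CICSGR UACC(NONE)
   PERMIT CICSGR CLASS(VTAMAPPL) ID(CICSTOR1) ACCESS(READ)
   SETROPTS CLASSACT(VTAMAPPL) RACLIST(VTAMAPPL)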

5.8.3.6 RACF Database Structures


1. Remove the generic resource name SYSZRACF from the global resource serialization reserve conversion RNL.

2. Define the RACF database structures in the CFRM policy. Understand that once you cache the RACF database in the coupling facility, systems that are not part of the RACF data sharing group cannot share the database. Also, consider the following when you plan for RACF data sharing:

   Sysplex-wide RACF commands operate whether you use RACF data sharing mode or not.
   Do not define more than one RACF data sharing group to the sysplex.

3. Specify the following name for each structure:

IRRXCF00_ayyy
Here a = P for the primary or B for the backup database, and yyy = the RACF database sequence number. You require one structure for the RACF primary database and one for the RACF backup database (see the example structure definitions after this list).

4. Use the data set range table (ICHRRNG) to determine on which data sets the RACF profiles are to reside.

5. For the first RACF database that initializes RACF data sharing, set the sysplex communication bit and the default mode bit in ICHRDSNT to indicate data sharing.

6. IPL all systems in the sysplex to activate RACF data sharing mode, or use the RACF command RVARY DATASHARE on each system that is enabled for RACF sysplex data sharing, and ensure that the CFRM policy is active.

7. To control RACF data sharing dynamically after IPL, use the following RACF command:

RVARY DATASHARE|NODATASHARE
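
As an illustration of step 3, for a single primary and backup database (sequence number 001) the CFRM policy might contain statements such as the following; the sizes and preference lists are assumptions for this sketch:

   STRUCTURE NAME(IRRXCF00_P001) SIZE(4096) PREFLIST(CF01,CF02)
   STRUCTURE NAME(IRRXCF00_B001) SIZE(4096) PREFLIST(CF02,CF01)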

5.8.3.7 DB2 Structures


1. Verify that there is a single user integrated catalog facility catalog that has connectivity to all MVS systems and is pointed to by the master catalog of each MVS system in the sysplex.

2. Verify the connectivity from each system on which DB2 resides to each of the following:


A set of DB2 target libraries
A single DB2 catalog and directory
All DB2 databases
Log data sets and BSDS data sets that are to be shared
A user integrated catalog facility catalog for shared databases
All coupling facilities

3. Define the following structures for use with DB2 data sharing (see the example definitions after this list):

A cache structure for the total number of DB2 group buffer pools. (The group buffer pool consists of data pages and directory entries.)
A lock structure for the total number of DB2 group members.
A list structure for the DB2 SCA.
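
As a hedged sketch, for a DB2 data sharing group named DSNDB0G (the group name, sizes, and preference lists here are assumptions made for this example), the CFRM policy might include:

   STRUCTURE NAME(DSNDB0G_SCA)   SIZE(8192)  PREFLIST(CF01,CF02)
   STRUCTURE NAME(DSNDB0G_LOCK1) SIZE(16384) PREFLIST(CF02,CF01)
   STRUCTURE NAME(DSNDB0G_GBP0)  SIZE(20480) PREFLIST(CF01,CF02)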

5.8.3.8 Log Stream Structures


1. Define the structure and the log stream names in the CFRM policy. The structure name is installation-defined. For the log stream names, specify the following:

For the LOGREC log stream, the name is SYSPLEX.LOGREC.ALLRECS
For the operations log stream, the name is SYSPLEX.OPERLOG

2. Use the IXCL1DSU utility to format the LOGR couple data set.

3. Specify the name of the LOGR couple data set in COUPLExx.

4. Use the IXCMIAPU utility to define the log streams and structures to the coupling facility (see the example job after this list).

5. Plan to use SMS-managed DASD for staging log stream data.
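
A minimal sketch of the IXCMIAPU input for step 4 follows; the structure name OPERLOG_STR and the sizing values are assumptions made for this example:

//DEFLOGR  EXEC PGM=IXCMIAPU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DATA TYPE(LOGR)
  DEFINE STRUCTURE NAME(OPERLOG_STR) LOGSNUM(1)
         AVGBUFSIZE(512) MAXBUFSIZE(4096)
  DEFINE LOGSTREAM NAME(SYSPLEX.OPERLOG)
         STRUCTNAME(OPERLOG_STR) HLQ(IXGLOGR)
/*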

5.8.3.9 Shared Tape Structures


Once you have determined the number of devices to be defined as autoswitch and their addresses, the following tasks are required to implement the function:

1. Determine the size of the IEFAUTOS structure based on the number of devices and the number of sharing systems. IEFAUTOS does not support dynamic size alteration, so keep SIZE and INITSIZE equal.

2. Define the IEFAUTOS structure through a CFRM policy update (see the example after this list).

3. Activate the CFRM policy and verify the results.

4. Update GRSRNLxx with the SYSZVOLS definition.

5. Once the IEFAUTOS structure becomes active, all the systems react and connect themselves to the structure. Once the environment is ready, the devices can be varied to autoswitch using an MVS command, ESCON Manager, or HCD.
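
As an illustration (the size value and the device number are assumptions made for this sketch), the CFRM definition and an MVS command to make one tape device autoswitchable might be:

   STRUCTURE NAME(IEFAUTOS) SIZE(832) INITSIZE(832) PREFLIST(CF01,CF02)

   V 0B80,AS,ON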

5.8.3.10 SMSVSAM Structures

Determine the size and number of the coupling facility cache structures depending on the following requirements:

   Number of available coupling facilities
   Amount of space available in each coupling facility
   Amount of data that will be accessed through each coupling facility


   Continuous availability requirements for CF reconfiguration
   Performance requirements for various applications

Determine the size of the coupling facility lock structure IGWLOCK00.
Update the SMS configuration and the ACS routines to reflect the SMSVSAM environment and to map the storage classes to a cache set specified in the base configuration.
Define the sharing control data sets (SHCDS) to maintain data integrity in a shared environment. Consider converting the RESERVEs for the sharing control data sets.
Define the cache and lock structures using a CFRM policy (see the example after this list).
Update the IGDSMSxx member.
Update the CICS procedures for using VSAM RLS.
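
A hedged sketch of the CFRM definitions might look like this; the cache structure name and all sizes are assumptions for this example (IGWLOCK00 is the required lock structure name):

   STRUCTURE NAME(IGWLOCK00) SIZE(14336) PREFLIST(CF01,CF02)
   STRUCTURE NAME(CACHE01)   SIZE(20480) PREFLIST(CF02,CF01)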

5.9 Servicing the Coupling Facility


Because the coupling facility hardware may have to be upgraded or maintained (including the coupling facility CPC LIC and the CFCC), the question arises of how to keep the sysplex running while the coupling facility is serviced. This section discusses what can and cannot be done concurrently. If the coupling facility cannot be serviced concurrently with sysplex operations, then consideration must be given to the procedures describing how to temporarily add a coupling facility (5.8, Adding a Coupling Facility on page 127) and how to remove a coupling facility (5.11, Coupling Facility Shutdown Procedure on page 134). The assumption here is that there is at least one alternate coupling facility, and therefore changes can be rippled, if needed, through the coupling facilities one at a time.

5.9.1 Concurrent Hardware Upgrades


Here is a description of the major upgrades that can be made to a coupling facility.

5.9.1.1 Adding CFRs to the Coupling Facility


Some degree of concurrency can be achieved in adding CFR CHPIDs to a coupling facility, in that the physical installation can possibly be done without disrupting coupling facility operations. However, making the new CHPIDs usable will always require a power-on reset to install the new CHPID definitions into HSA, since dynamic I/O reconfiguration does not support CFRs. The concurrency of the physical installation can also be limited, depending on which processor family the coupling facility belongs to (9021 or 9672) and on the current CHPID configuration in the coupling facility. The latter dictates whether slots are readily available to concurrently insert new CFC cards, or whether slots have to be made available in a nonconcurrent manner. This means the following:

For the 9672 or 9674 processor family: The physical upgrade can be concurrent if there are sufficient free slots in the already plugged CFC adapter cards (fc 0014) to concurrently receive link cards (fc 0007 or 0008). If this is not the case, physically adding adapter cards cannot be done concurrently.

For the 9021 processor family: The physical upgrade can be concurrent with coupling facility operations if there are sufficient CFC slots available across both CPC sides, so that no additional adapter card(s) need to be plugged. The adapter cards cannot be plugged concurrently.

5.9.1.2 Other Physical Hardware Upgrades


Physical hardware upgrades to a coupling facility other than adding CFRs are disruptive to the coupling facility operations except for upgrades related to:

The 9672/9674 HMC or SE
The 9021 PCE

Hardware Upgrades and HSA

Proper consideration must be given to the increase in HSA size that could be incurred because of additional hardware such as CFRs.

5.9.2 Concurrent LIC Upgrades


Most of the LIC upgrades on the 9021, whether CPC LIC or CFCC, can be done concurrently, whereas most of the LIC upgrades on the 9672/9674 cannot. Note that the concurrency of LIC upgrades on the 9672/9674 is improving with each new model family but is, even with the Rx2 and Rx3 models, quite limited compared with the 9021.

5.10 Removing a Coupling Facility


In order to remove a coupling facility, you must first remove any structures it contains, rebuild them in another CF, and then configure all CHPIDs to this CF offline. There is no online/offline command to enable or disable the allocation of structures in a coupling facility; to deallocate a structure, you need to operate through the CFRM policy. When the coupling facility is powered off, structure data will be lost unless nonvolatility is preserved throughout the maintenance procedure. It is recommended that only one CF be removed from service at a time. This makes it easier to control the moving of structures. Before physically removing the coupling facility, you should ensure that no connections are still outstanding. This avoids failures in the subsystems using the coupling facility. For this reason, it is advisable to resolve FAILED-PERSISTENT and NO CONNECTOR structure conditions prior to coupling facility shutdown. You can use the following commands to display structures with FAILED-PERSISTENT connectors and NO CONNECTOR conditions:


D XCF,STR,STRNAME=ALL,STATUS=(FPCONN)
D XCF,STR,STRNAME=ALL,STATUS=(NOCONN)

If a failed-persistent connector or a structure with no connector exists, you should determine whether this is a normal state for the connector or structure to be in, and you should know how to resolve these conditions before going on with the coupling facility shutdown procedure.

5.11 Coupling Facility Shutdown Procedure


A procedure for removing a coupling facility for service is as follows:

1. Create a new CFRM policy which removes the target coupling facility definition and lists only the alternate coupling facility in the preference lists for all structures in the target coupling facility (a sample policy job is shown after this procedure). We recommend that a new policy name be used so that when the service is complete, the original policy can be restored. The reason for defining a new CFRM policy with the target coupling facility removed from the policy and the policy preference lists is to prevent structures from being created in the target coupling facility while it is being quiesced.

   A new CFRM policy may not be necessary, depending on which functions are exploiting the coupling facility. If, for example, you only have an XCF signalling structure defined, it is not necessary to define a new policy to prevent XCF signalling from rebuilding in the target CF; once the XCF signalling structure is moved to the alternate CF, the chances are slim that XCF would attempt to relocate the structure back into the target CF. If you can prevent functions from creating structures in the target CF during CF shutdown without using a new policy to bar access to the target CF, you may choose to eliminate the policy switching steps.

2. Determine the CHPIDs in use by the target CF by issuing the following command on each system connected to the target CF:


D CF,CFNAME=cfname

IXL150I  08.23.52  DISPLAY CF 160
 COUPLING FACILITY 009672.IBM.02.000000040104
                   PARTITION: 1  CPCID: 00
                   CONTROL UNIT ID: FFFE
 NAMED CF01
 COUPLING FACILITY SPACE UTILIZATION
  ALLOCATED SPACE                DUMP SPACE UTILIZATION
   STRUCTURES:     18944 K        STRUCTURE DUMP TABLES:       0 K
   DUMP SPACE:      2048 K        TABLE COUNT:                 0
   FREE SPACE:    201216 K        FREE DUMP SPACE:          2048 K
   TOTAL SPACE:   222208 K        TOTAL DUMP SPACE:         2048 K
                                  MAX REQUESTED DUMP SPACE:    0 K
  VOLATILE:  NO                   STORAGE INCREMENT SIZE:    256 K
  CFLEVEL:   1
 COUPLING FACILITY SPACE CONFIGURATION
                        IN USE        FREE         TOTAL
  CONTROL SPACE:        20992 K     201216 K     222208 K
  NON-CONTROL SPACE:        0 K          0 K          0 K
 SENDER PATH       PHYSICAL      LOGICAL       STATUS
   70              ONLINE        ONLINE        VALID
   72              ONLINE        ONLINE        VALID
 COUPLING FACILITY DEVICE   SUBCHANNEL   STATUS
   FFFC                     1696         OPERATIONAL/IN USE
   FFFD                     1697         OPERATIONAL/IN USE
   FFFE                     1698         OPERATIONAL/IN USE
   FFFF                     1699         OPERATIONAL/IN USE

In our example, CHPIDs 70 and 72 are the sender ISC links used by this MVS image to connect to the coupling facility named CF01. Once the new policy has taken effect, the D CF,CFNAME=cfname command cannot be used to display information about the target coupling facility, because the coupling facility will no longer be in the active CFRM policy. However, the D CF command can still be used to see the physical connections to the target coupling facility. Since D CF displays information on all the coupling facilities, it is necessary to know the NODE, PARTITION, and CPCID to identify the target coupling facility.

3. Activate the new policy with the following command:

SETXCF START,POLICY,TYPE=CFRM,POLNAME=newpolicyname
4. Get the names of all the allocated structures in the target CF by issuing the following MVS command:

D XCF,CF,CFNAME=cfname
If no structures are allocated, you will receive message IXC362I with one of the following statements:


NO COUPLING FACILITIES MATCH THE SPECIFIED CRITERIA or NO STRUCTURES ARE IN USE BY THIS SYSPLEX IN THIS COUPLING FACILITY
5. Depending on the subsystems connected to the structures, you will be required to follow different procedures. Not all subsystems support structure rebuilding. For instance, DB2, RACF, and JES2 require particular actions. For these subsystems, follow the recommended procedure as described in 5.11.1, Coupling Facility Exploiter Considerations on page 138.

6. Move the remaining structures out of the target CF by attempting a rebuild of the structures. You can initiate a rebuild either for all the structures at once or for each structure individually, using one of the following commands:

SETXCF START,REBUILD,CFNAME=cfname,LOC=OTHER
  where cfname = name of CF being shut down


or

SETXCF START,REBUILD,STRNAME=strname,LOC=OTHER
If a structure does not support rebuild, an IXC message will inform you. You can expect structure rebuild to take several minutes to complete.

7. Once the previous command has completed, check that no structures are still allocated in the target coupling facility with the following command:

D XCF,CF,CFNAME=cfname
  where cfname = name of CF being shut down


If you receive message IXC362I (NO COUPLING FACILITIES MATCH THE SPECIFIED CRITERIA) or message IXC362I (NO STRUCTURES ARE IN USE BY THIS SYSPLEX IN THIS COUPLING FACILITY), you can proceed to physically shut down the coupling facility because all structures have been deallocated and the new activated policy has taken effect. Otherwise IXC362I will list those structures still allocated to the target coupling facility. Determine the status and how to deallocate each of these structures. Issue the following command to determine the status of each of the remaining structures:

D XCF,STR,STRNAME=strname
There are some situations that prevent the deallocation of structures as part of the structure rebuild process:

The function that owns the structure does not support rebuild. In this case, it may be necessary to bring down the application to deallocate the structure.


The structure has no connectors. A structure cannot be rebuilt without an active connector. If the function associated with the structure supports structure rebuild, initialize the function to obtain a connector to the structure and attempt a rebuild by issuing the following command:

SETXCF START,REBUILD,STRNAME=strname,LOCATION=OTHER

The structure has failed-persistent connector(s). Generally, a structure with failed-persistent connector(s) should have recovery actions invoked for the connectors prior to the deallocation of the structure. Whether or not the recovery of failed-persistent connectors is mandatory depends on the program that owns the connection/structure. The existence of failed-persistent connectors may prevent a program from rebuilding or deallocating the structure. This may require re-initialization of the function associated with the failed-persistent connector to recover the connection. When all of the failed-persistent connections have been recovered, the rebuild (or other method of deallocation) should be retried.

The following commands can be used to clean up resources related to structures in the target coupling facility. Use them carefully, because forcing the deletion of a structure may cause a loss of data. You should not force deletion of a structure or of a failed-persistent connector unless you understand its use in the sysplex and the impact of the force operation.

For structures with no connectors to the structure, force the structure by issuing the following command:

SETXCF FORCE,STR,STRNAME=strname

If there are only failed-persistent connectors, force the failed-persistent connectors and then force the structure.

SETXCF FORCE,CON,STRNAME=strname,CONNAME=ALL
and if necessary
SETXCF FORCE,STR,STRNAME=strname


8. Configure all CHPIDs to the target coupling facility offline.

On each system, configure all CHPIDs to the target coupling facility offline with the following command:

CONFIG CHP(xx,yy),OFFLINE
MVS will refuse to vary offline the last path to a coupling facility that contains one or more structures in use by an active XES connection. Ensure that all the structures and connectors have been removed from the target coupling facility. If for some reason the CHPID has to be varied offline anyway, this can be achieved by issuing the following command:

CONFIG CHP(xx),OFFLINE,FORCE
Configuring the ISC links offline is optional; it can be considered a clean way to quiesce the coupling facility. If you do not take the CHPIDs offline, error messages and recording of link failures can occur. These messages and logouts can be ignored.

Issue the following command on all systems connected to the target coupling facility:

D CF
Verify that all sender paths were taken offline.


Power off the target coupling facility. When the maintenance procedure is complete, bring the coupling facility back into service by restoring the original CFRM policy and repeating the above actions in reverse: add the target coupling facility instead of deleting it, and move structures into it instead of out of it.
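
The following is a minimal sketch of the IXCMIAPU input referred to in step 1, defining a policy that contains only the alternate coupling facility. The policy, CF, and structure names, the CF identification values, and the sizes shown here are assumptions made for this example and must be replaced with your own values:

//CFRMPOL  EXEC PGM=IXCMIAPU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DATA TYPE(CFRM) REPORT(YES)
  DEFINE POLICY NAME(MAINTPOL) REPLACE(YES)
    CF NAME(CF02) TYPE(009674) MFG(IBM) PLANT(02)
       SEQUENCE(000000040105) PARTITION(1) CPCID(00)
       DUMPSPACE(2048)
    STRUCTURE NAME(IXCSIG01) SIZE(10000) PREFLIST(CF02)
/*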

5.11.1 Coupling Facility Exploiter Considerations


Here are the exploiter-by-exploiter approaches to removing structures from a coupling facility. Further details on the tasks to perform can also be found in 9.2, Coupling Facility Failure Recovery on page 180 and in Appendix B, Structures, How to ... on page 241.

JES2 Checkpoint: JES2 does not allow structure rebuild. To move the JES2 checkpoint either to another CF or to DASD, you have to invoke the checkpoint reconfiguration dialog. To do this, you must have a NEWCKPTn defined on the JES2 CKPTDEF statement, either in the JES2 initialization parameters or defined dynamically via the $T CKPTDEF command. Then switch to this alternate checkpoint via the $T CKPTDEF,RECONFIG=YES command and follow the reconfiguration dialog prompts (a sample command sequence is shown below). When done, issue the $D CKPTDEF command to verify that the old structure is no longer being used by JES2. The old checkpoint structure will remain allocated; force it off via the following command:

SETXCF FORCE,STR,STRNAME=strname
Remember to update your installation's JES2 initialization parameters to point to this new checkpoint so you'll find it on your next JES2 warm start. Remember that data can be transferred to and from a coupling facility much faster than to and from DASD, so you should plan to return the checkpoint data set to a coupling facility as soon as possible. When the checkpoint data set is located on DASD, JES2 uses a hardware reserve to ensure data integrity among all members, which affects I/O performance on the checkpoint pack.
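
A hedged example of the command sequence follows; the data set name and volume serial are placeholders chosen for this sketch:

   $T CKPTDEF,NEWCKPT1=(DSNAME=SYS1.JES2.NEWCKPT1,VOLSER=CKPT01)
   $T CKPTDEF,RECONFIG=YES
   $D CKPTDEF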

System Logger: The current exploiters for the system logger are:

Logrec
Operations Log

The system logger fully supports the rebuild function, and rebuild is the recommended procedure for moving its structures to another coupling facility. In case of a failure during a rebuild, you will get IXG messages (for instance IXG106I, IXG107I), and some actions may be required to recover operations. OPERLOG automatically switches hardcopy logging to SYSLOG if the SYSLOG data set has been initialized by JES. If it is unable to switch to SYSLOG, an attempt is made to send hardcopy to a printer console; if this fails, the hardcopy is lost. LOGREC buffers logrec entries up to a point, then discards new entries until the logger is operational again.

Shared Tape: During an off-peak maintenance window, IEFAUTOS is a structure that you can potentially live without while coupling facility maintenance is being performed.


The suggested method to move IEFAUTOS is through the REBUILD command.

VTAM: VTAM Generic Resource structure can be moved using the REBUILD function. When the VTAM generic resource structure becomes unavailable, no new sessions to generic resources will be allowed. This includes sessions to both the generic and real name. Existing sessions should not be affected. During the period required for the rebuild, sessions are not rejected. They are queued and processed after the rebuild completes.
If VTAM is down, a failed-persistent connector will remain associated with the structure. Forcing the failed-persistent connector will result in the deallocation of the structure and the loss of persistent affinities. Failed-persistent connectors to the structure must not be forced if LU 6.1 or LU 6.2 sync level 2 is being used by the applications or subsystems.

IRLM: Deallocating the IRLM lock structure should be done by REBUILDing it to the alternate CF.
IRLM always disconnects from the lock structure abnormally (IXLDISC with REASON=FAILURE). This leaves the IRLM lock structure connections in a failed-persistent state. It is safe to force the failed-persistent connectors of a lock structure when there are no retained locks on any of the DBMSs identified to any IRLM in the data sharing group that is using that lock structure. An alternative procedure for deallocating the IRLM lock structure is as follows:

1. Use the IRLM status command to see if any of the DBMSs that were previously in the data sharing group have retained locks:

F irlm_name,STATUS,ALLD
The IRLM has to be connected to the data sharing group in order to get information on all DBMSs that are in the group. IRLM connects to the group as soon as a DBMS (IMS or DB2) identifies to it. If a DBMS has retained locks, restart that DBMS so it can recover and clean up the retained locks.

2. Shut down all DBMSs identified to IRLM normally. IRLM will disconnect from the data sharing group. If a subsystem does not shut down normally, retained locks may exist and the subsystem must be restarted until a normal shutdown occurs.

3. Stop IRLM.

4. Force the IRLM failed-persistent connectors.

5. Force the IRLM lock structure.

RACF Database Cache: RACF uses the coupling facility as a large sysplex-wide store-through cache for the RACF database to reduce contention and I/O to DASD.
RACF uses a new serialization protocol to replace reserve/release when the coupling facility is in use. The new protocol protects the data without the disadvantages of reserve/release, so in case of a coupling facility shutdown RACF can also operate in non-data sharing mode in a parallel sysplex. However, there could be a performance impact to the installation if the I/O activity rate against the RACF database is high. To deallocate the structures, take RACF out of data sharing mode via the RVARY NODATASHARE command; once RACF is out of data sharing mode, the structures are deallocated. You can stay in this mode until the coupling facility maintenance is completed. Then, when the CF is back online, re-enable RACF data sharing mode via RVARY DATASHARE.

XCF Signalling: If the target coupling facility has an XCF signalling structure, there must be a full set of redundant signalling paths even if the XCF signalling structure is going to be rebuilt in the alternate coupling facility. If for example, you do not have alternate signalling structure or CTCs for the sysplex, you will lose system connectivity during the rebuild. The loss of connectivity will lengthen the time it takes to rebuild the signalling structure and may result in XCF timeouts, especially if multiple structures are being rebuilt at the same time.
If redundant signalling paths cannot be made available, XCF's failure detection interval, COUPLExx(INTERVAL), and GRS's toleration interval, GRSCNFxx(TOLINT), should be increased to prevent timeouts. This is particularly important if you have an active SFM policy with ISOLATETIME rather than PROMPT specified. Furthermore, the XCF signalling structure should be rebuilt separately from other structures. When sufficient redundant XCF signalling capacity exists to allow signalling through the target coupling facility to be shut down temporarily, you can decide not to rebuild the target coupling facility's XCF signalling structure in the alternate coupling facility. In this case, you can simply stop XCF signalling by issuing the following commands on each system connected to the target coupling facility:

SETXCF STOP,PATHOUT,STRNAME=strname SETXCF STOP,PATHIN,STRNAME=strname


Once all PATHOUTs and PATHINs are stopped, XCF will deallocate the signalling structure. This will prevent XCF from rebuilding the signalling structure in the alternate coupling facility. When the coupling facility maintenance is complete, target coupling facility connectivity is restored and the original policy is restored, issue the following SETXCF commands on each system connected to the target coupling facility to reestablish XCF communication through the target coupling facility.

SETXCF START,PATHIN,STRNAME=strname SETXCF START,PATHOUT,STRNAME=strname


DB2 Structures: DB2 uses two different structures in a data sharing environment:

SCA   Shared Communications Area
GBP   Group Bufferpool

If maintenance is required on the coupling facility, different methods are needed depending on the type of structure to be moved out of the coupling facility. For the SCA structure, the REBUILD function is supported and is the recommended way to move the SCA into an alternate coupling facility. The easiest way to deallocate a GBP structure is to stop all the DB2s in the data sharing group; once the DB2s are stopped, the GBP is automatically deallocated. There is no REBUILD support for this type of structure. Further details on GBP deallocation can be found in 9.4.5, To Manually Deallocate and Reallocate a Group Buffer Pool on page 190.

During the REBUILD function, consider the following items:

You should plan enough spare capacity on the alternate coupling facility to absorb the work from the target coupling facility. During the planning session, consider how to spread the coupling facility structures across the coupling facilities in case of recovery to obtain optimal performance. For instance, mixing the lock structure and a highly accessed GBP in the same coupling facility can significantly degrade performance.

Rebuilding a structure with very high access frequencies can be very disruptive to the workloads. For instance, while the rebuild of the IRLM lock structure is in progress, all IRLM requests are queued, so you may receive more timeouts; in the extreme case, IRLM wait queues may build up to an unmanageable level and IRLM may come to a halt.

SMSVSAM Structures: SMSVSAM uses different structures:


IGWLOCK00, the lock structure
Cache structures for data

If maintenance is required on the coupling facility, SMSVSAM supports the REBUILD function for both structure types, and it is the recommended way to move them into an alternate coupling facility. During the REBUILD function, consider the following items:

You should plan enough spare capacity on the alternate coupling facility to absorb the work from the target coupling facility. During the planning session, consider how to spread the coupling facility structures across the coupling facilities in case of recovery to obtain optimal performance.

Rebuilding a structure with very high access frequencies can be very disruptive to the workloads.

5.11.2 Shutting Down the Only Coupling Facility


Some additional considerations apply if you are shutting down the only coupling facility in the sysplex. If there is the potential for a data integrity exposure from forcing persistent structures and/or connectors off the only coupling facility in the sysplex, the recommendation is to add an alternate coupling facility to the sysplex and follow the procedures outlined above. Otherwise, the following procedure may be used. You won't be able to keep subsystems like IMS and DB2 up while the maintenance is occurring. However, you should take care of your structures before deactivating the coupling facility:

JES2            Move the checkpoint structure from the coupling facility to DASD before deactivating the coupling facility. Otherwise, you'll have to do a cold start when the coupling facility maintenance is complete.

XCF signalling  Ensure you have sufficient CTCs so that sysplex communication can continue without the coupling facility signalling structures.


RACF            Take RACF out of data sharing mode before deactivating the coupling facility.

Once you have the JES2 checkpoint out of the single coupling facility, and have ensured you have enough CTCs to keep the sysplex alive without the coupling facility, you should initiate an orderly shutdown of any subsystems still using that coupling facility (for instance IMS, IRLM, DB2, CICS). Force off any structures still allocated on the coupling facility. If you are using ICMF (Integrated Coupling Migration Facility), you should use the same procedure to rebuild structures into an alternate coupling facility. However, do not configure the coupling facility CHPIDs offline in this case. When all structures in the target coupling facility have been deallocated, the MVS images that exist on the same hardware image as the ICMF should be shut down. The hardware image can then be powered-off.

5.12 Putting a Coupling Facility Back Online


When a coupling facility recovers connectivity to MVS images, for example because of a CFC link being put into operation again, or because of the coupling facility being successfully activated, XES will check if this coupling facility is known in the active CFRM policy. If the coupling facility is known:

The sysplex is granted ownership of the coupling facility.
Structure exploiters are made aware of the availability of the coupling facility:
   By the EVENT exit, for exploiters that currently have active connections to structures.
   By the ENF service (event code 35), for exploiters not currently having an active connection to any structure.

This only makes exploiters aware of a change in resource availability. As far as IBM exploiters are concerned, no spontaneous structure movements are initiated because of this event. The new coupling facility will eventually be used for a new structure allocation or for a structure rebuild, if any.


Chapter 6. Hardware Changes


This chapter discusses how to add, change or remove hardware elements of the sysplex in a nondisruptive way. Most of these changes would be disruptive without the parallel sysplex.

6.1 Processors
Here is a description of how to add, remove, and maintain a processor in a nondisruptive manner.

6.1.1 Adding a Processor


Installing a new machine nondisruptively in a sysplex is a straightforward matter:

1. Install the hardware.

2. Connect the channels to the I/O. For ESCON channels this is only a matter of connecting the fiber cables to the devices, which can be done without disrupting processing on the other machines and I/O devices in any way. For parallel channels you run the risk of power surges if you connect channel cables to devices that are running, so you should power down any devices you are connecting together.

3. Configure the I/O for the new processor. You can do this on another system in the sysplex. If the new machine is not a 9672, then you will have to transfer the IOCDS to it using a tape and run the stand-alone IOCP. If the new machine is a 9672 and is on the same HMC LAN as the existing 9672s, then the process is easier: you can use the HCD dialog on another machine to download the IOCDS into the service element of the new machine, and then all you need to do is power-on reset the new machine. See F.2, Procedure for Dynamic Changes on page 270 for details.

6.1.2 Removing a Processor


Once all the systems on the processor have been stopped, there are no problems in removing it. From the MVS console, issue the command VARY XCF,sysname,OFFLINE for each system on the processor, or deactivate the processor from the hardware console. You can then power it down without affecting the other systems in the sysplex, and make any changes you want to the processor. If you need to physically remove the processor, it is quite safe to unplug the parallel channel cables to any I/O devices that are working with another system.


6.1.3 Changing a Processor


There are many changes that can be made to a processor without a disruption. These are listed below. Any other changes require that you first remove the processor from the sysplex before making the change.

6.1.3.1 9021 711-Based Machines


1. Concurrent channel maintenance allows replacement of a failing channel card with no requirement to power off the system.

2. The enhanced power subsystem is capable of supporting a processor complex through most power supply failures without interrupting system operation. Dynamic replacement of a failed power supply without a system interruption is supported.

3. A central processor (CP) can be varied offline for TCM maintenance with concurrent CP TCM maintenance.

4. Dynamic add, delete, or modify of the I/O configuration definitions for channel paths, control units, coupling link CHPIDs, and I/O devices.

5. Concurrent maintenance of most air-moving devices.

6. Concurrent Licensed Internal Code maintenance for most microcode patches.

7. Concurrent maintenance of the IBM 9022 Processor Controller.

6.1.3.2 9672 R1 Machines


1. Concurrent channel maintenance allows replacement of a failing channel card with no requirement to power off the system.

2. The Enhanced Power Subsystem (N+1) is capable of supporting a processor complex through most power supply failures without interrupting system operation. Dynamic replacement of a failed power supply is done without a system interruption.

3. Dynamic add, delete, or modify of I/O configuration definitions for channel paths, control units, and I/O devices.

4. Concurrent maintenance of the support element and hardware system console.

6.1.3.3 9672 R2 and R3 Machines


In addition to what is possible with the R1 machines, you can also:

1. Concurrently add ESCON and parallel channels to an I/O cage if there are free slots in it.

2. Apply some Licensed Internal Code (LIC) patches concurrently.

6.2 Logical Partitions (LPARs)


Here is a description of how to add, remove, and maintain a logical partition.


6.2.1 Adding an LPAR


LPAR reconfiguration is disruptive for the machine involved. You cannot add an LPAR without a POR.

6.2.2 Removing an LPAR


The only practical reason for deactivating an LPAR is to allocate its storage to another partition using Dynamic Storage Reconfiguration, as discussed below.

6.2.3 Changing an LPAR


Changing an LPAR is for the most part disruptive to the system running in that LPAR, and in many cases to the entire machine as it often requires a POR. However, there are some changes that are, or can be, nondisruptive.

6.2.3.1 Processing Weights


Processing weights are used to specify the portion of the shared processor resources allocated to a logical partition. These can be changed at any time using the LPCTL frame on the hardware console. If you use resource capping this can also be changed dynamically in the same way. The same does not apply, however, to dedicated processors. The logical partition must be deactivated to dedicate processors to a logical partition, or to change from dedicated to shared processors.

6.2.3.2 Dynamic Storage Reconfiguration


This is a feature in the processor which allows you to take central and expanded storage that is either not allocated or belongs to a partition that has been deactivated, and allocate it to a partition that is running MVS without disrupting that system. In order to be able to do this you must have planned it in advance, as you need to reserve addressability ranges at LPAR definition for the storage you want to be able to dynamically acquire later on. In addition, the 9672 and 511-based processors require that the storage that is being added be contiguous with the storage of the partition that is acquiring it. The 711-based processors do not have this requirement. For more information see LPAR Dynamic Storage Reconfiguration , GG66-3262 and the operations manual of the processor you are using.

6.3 I/O Devices


If the device supports dynamic reconfiguration, then it is possible to install it without disrupting the sysplex. See 2.5, Dynamic I/O Reconfiguration on page 33 for details.

1. Install the hardware.

2. Connect the device to the channels, or to one or more ESCON Directors. For ESCON devices this is only a matter of connecting the fiber cables to the devices, which can be done without disrupting processing on the other machines and I/O devices in any way. For parallel-attached devices you run the risk of power surges if you connect channel cables to devices when they are running, so you should power down any two devices you are connecting together.

3. Define the new device to the processors. You can run HCD on any system in the sysplex to modify the IODF to include the new I/O device, and then use the sysplex-wide activate function (MVS 5.2 only) to activate it on all the systems in the sysplex.

6.4 ESCON Directors


The 9032 Models 001 and 002 ESCON Directors allow for hot-plugging the fiber connecting channels and control units into existing ports. However, to upgrade the director with additional ports is disruptive. The newer 9032 Model 003 ESCON Director features concurrent hardware install, redundant hardware, and concurrent Licensed Internal Code (LIC) upgrade capability. With the concurrent upgrade capabilities, additional ports can be added at any time, in increments of 4, up to the maximum of 124. The channels and control units can be attached and brought on-line with no disruption (provided the new control units were predefined to the host).

6.5 Changing the Time


This problem arises every time there is a shift in the local time from winter (standard time) to summer (daylight savings time) and back again. This has implications for continuous operations.

6.5.1 Using the Sysplex Timer


Assuming you have the recommended setup of ETRMODE YES and ETRZONE YES in the CLOCKxx member of SYS1.PARMLIB (see the example member below), you make the time changes using the sysplex timer. The only source for the time zone offset for all systems in the sysplex is the sysplex timer. You change between summer and winter time using the Sysplex Timer Console Change Offsets panel, and you can even schedule the change to occur at a later time. This is convenient, as the official time for the changeover between summer and winter time is usually 2 AM local time. These changes can be made while the operating system is running and do not necessitate an IPL. However, to ensure consistency in the logs, some applications and subsystems may have to be shut down. This applies to IMS; see 6.5.2, Time Changes and IMS. There may also be accounting implications; see 6.5.3, Time Changes and SMF on page 148.
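
A sketch of such a CLOCKxx member follows; the TIMEZONE offset and ETRDELTA value shown are example values only (with ETRZONE YES the time zone offset is actually taken from the sysplex timer):

   ETRMODE  YES
   ETRZONE  YES
   ETRDELTA 10
   TIMEZONE W.05.00.00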

6.5.2 Time Changes and IMS


Resetting the MVS clocks has serious implications if you are running IMS subsystems during this period, since IMS currently uses the local time for time stamps in its logs. Database recovery is sensitive to these log record time stamps. Hopefully this restriction will be removed in a future release of IMS/VS.


IMS subsystems include IMS online (DC and DBCTL) systems, IMS DLI/DBB batch jobs, CICS-DL/I systems, the IMS database utilities (image copy, change accum, recovery, unload, reload, prefix update, etc.), and the logging utilities (log recovery and archive).

6.5.2.1 Changing from Standard to Daylight Savings Time


This involves moving the clock forward one hour. The following steps should be followed when changing the MVS clocks:

1. Terminate all IMS subsystems within the sysplex.

2. Reset (move forward) all of the MVS clocks using the sysplex timer.

3. Restart the IMS subsystems.

Failure to follow this procedure could lead to database recovery and other operational problems following the time change. To understand why, consider the following scenario:

1. An IMS online subsystem (A) is running when the MVS clock is moved forward an hour at 2:00 AM. Since IMS computes the time by issuing STCK instructions and converting the result, rather than by issuing TIME macros, subsystem A is not aware of the time change. Thus, it will continue to observe the old (standard) time.

2. At 4:00 AM daylight savings time (3:00 AM as far as subsystem A is concerned), the user deallocates a database and starts a 15-minute batch job that updates the database. The batch job thinks that it is 4:00 AM and will time stamp its log records with times in the range of 4:00 AM to 4:15 AM.

3. The database is then given back to subsystem A at 4:30 AM daylight savings time (3:30 AM as far as subsystem A is concerned) and is updated. The updates by subsystem A will be time stamped from 3:30 AM onward.

As a result, the updates from step 3 will be time stamped before those from step 2, even though they occurred after them. This could cause database recovery problems.

6.5.2.2 Changing from Daylight Savings to Standard Time


This is when the clocks are set back one hour. The following is the recommended procedure:

1. Bring all IMS subsystems down.

2. Reset the clocks using the sysplex timer.

3. Do not run any IMS subsystems during the duplicate time period (the one-hour period between the original 2:00 AM and the new 2:00 AM).

All IMS subsystems must be terminated prior to the time change. IMS subsystems generally use the hardware clock (STCK instruction) rather than the MVS TIME macro to obtain the time. At subsystem initialization time, IMS computes an adjustment factor to convert the STCK value to a date and time. This adjustment factor could be recomputed at midnight under certain, unpredictable circumstances. If a user were to reset the MVS clock while an IMS subsystem was running, the subsystem would not recognize the time change immediately. However, IMS subsystems started after the time change would use the new time. Thus, the new subsystems would be an hour behind the old subsystems. In addition, if the IMS subsystems that were running prior to the time change continued execution past midnight (old time), they could reset their internal clocks back an hour and cause negative time breaks in the log. Almost all of the record types in the RECON data sets have time stamps. Some of these time stamps are provided by the IMS modules that invoke DBRC, and some are obtained by the DBRC code (via the TIME macro). DBRC assumes that time never goes backwards and is coded with that basic assumption. Therefore, the user has no option other than to terminate all IMS subsystems prior to resetting the MVS clock and to not run any IMS subsystems for an hour.

6.5.3 Time Changes and SMF


SMF records contain time stamps in local time, and these may be used by your accounting system to compute durations. Setting the time back an hour will cause the SMF records time to overlap for an hour if the systems are not stopped. Here you have the option to either stop the systems for an hour to prevent the times from overlapping, or keep the systems running and then schedule the time to be reset at a time when the systems will not be affected by this SMF anomaly. Setting the time forward an hour could cause the duration of jobs that were running at the time of the change to appear one hour longer than they really were. This can obviously cause problems for accounting routines, as well as affecting statistics and possibly job scheduling systems. So you may find that you at least need to stop all production while you change the clocks in either direction.

6.5.4 Changing Time in the 9672 HMC and SE


When you change the time to or from summer or winter time at the sysplex timer, the change is immediately reflected in all the processors and MVS images in the sysplex. The 9672 Service Element (SE) and Hardware Management Console (HMC) also have clocks. These will also be automatically synchronized to the changed time. This synchronization occurs, however, only once daily, at 23.00 for the SE and 23.15 for the HMC. This means that if you change to summer or winter time at the official time of 02.00, it will take until the following night before the SE and HMC clocks are synchronized. This is not a serious problem, although it may be an inconvenience for the operators. The only thing you must remember is that, if you have problems during that day and have to compare the console logs with any other system logs, you must take account of the hour's difference between their time stamps.


Chapter 7. Software Changes


This chapter discusses how to make changes such as adding, modifying or removing system images and subsystems.

7.1 Adding a New MVS Image


This section lists the steps you must take to nondisruptively add a completely new MVS image to the sysplex. It assumes that the cloning setup resembles the example in Appendix A, Sample Parallel Sysplex MVS Image Members on page 221.

Allocate specific LOGREC, PAGE, STGINDEX, and SMF data sets for this MVS image. See Appendix A, Sample Parallel Sysplex MVS Image Members on page 221 for a JCL example.

Check SYS0.IPLPARM(LOADxx) for the IEASYMxx in use.

Check SYS1.PROCLIB(JES2) for the names of the JES2 clone members.

Modify SYS1.PARMLIB as follows (see the example statements after this list):
   IEASYMxx   add a SYSDEF for the new system.
   COUPLE00   add PATHIN/PATHOUT statements for the XCF signalling paths.
   J2G        add the name of the new system to the JES2 global member.
   J2Lxx      add the specific JES2 member for the new system.

Modify ISMF if needed. Using SDSF check active SCDS. Check that the groupname is the same as the sysplexname. Check ACS-routines for any system-specific code.

Check the HSM startup procedure.

Activate the new SMS configuration and verify that SMS is OK.

Start the new XCF pathins and pathouts on all systems:

   /RO *ALL,SETXCF START,PI,DEV=(4230,4238)
   /RO *ALL,SETXCF START,PO,DEV=(5230,5238)

Create new VTAMLST members for the new system:
   ATCSTRxx
   ATCCONxx
   APNJExx
   APCICxx
   CDRMxx (also modify all the other CDRM members to include the new system)
   MPCxx
   TRLxx (also modify the TRL members for the network nodes to include the new system)


Vary the VTAM CTCs online in all other network node machines.

IPL the new system.

Review the ARM policy.
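
As a hedged sketch of the IEASYMxx and COUPLE00 additions mentioned above, the new statements might look like the following. The system name, hardware and LPAR names, and the SYSCLONE value are assumptions made for this example; the device numbers match the RO commands shown earlier:

   IEASYMxx:   SYSDEF HWNAME(CPC2)
                      LPARNAME(LP04)
                      SYSNAME(SYSD)
                      SYSCLONE(D4)

   COUPLE00:   PATHIN  DEVICE(4230,4238)
               PATHOUT DEVICE(5230,5238)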

7.1.1 Adding a New JES3 Main


When adding a new JES3 main, there are different considerations depending on whether the JES3 main is a JES3 local joining an existing JES3 complex in the sysplex, or whether the system is to be a new global, thereby creating a new JES3 complex within the sysplex.

7.1.1.1 Adding a Local to an Existing JES3 Complex


To add a new local to an existing JES complex within the sysplex, there are some additional considerations after completing the tasks described in 7.1, Adding a New MVS Image on page 149:

In order for the addition of a JES3 local to be nondisruptive to the existing JES3 complex, the initialization deck must already have a definition for a new main included. It is not possible to add a new main definition (MAINPROC statement) without a JES3 complex-wide warm-start. However, it is possible to change the name of a previously defined main without a complex-wide warm-start.

7.1.1.2 Adding a New Global to the Sysplex


To add a new JES3 global to a sysplex involves adding a new JES3 complex to the sysplex. There are some important considerations to be aware of after completing the tasks described in 7.1, Adding a New MVS Image on page 149:

Check maintenance levels. It is only possible to run multiple JES3 complexes within a parallel sysplex when the appropriate JES3 and JESXCF maintenance is installed. The required PTFs are UW19140 and UW19148.

Define a new XCF group name. The JES3 group name, defined to XCF through the JES3 initialization deck, is the distinguishing attribute that separates one JES3 complex from another within the sysplex. For more information on specifying the XCF group name, refer to page 91.

Define a new command prefix. JES3 makes use of the MVS Command Prefix Facility (CPF). For more information on how to specify the JES3 command prefix, refer to 2.18.4.2, JES3PLEX < SYSPLEX on page 91.

Allocate unshared JES3 data sets. The new JES3 complex requires some unique data sets:
   JES3 Checkpoint (provide two for increased availability)
   JES3 JCT
   JES3 Spool
   JES3 Initialization Stream

It is possible to share the following data sets with another JES3 complex within the sysplex:


JES3 Dump
JES3 OUT

Check the JES3 proc and the PARMLIB COMMNDxx member. As described in 2.18.4.2, JES3PLEX < SYSPLEX on page 91, it is possible to share the JES3 proc between different JES3 complexes within the same sysplex. This requires a change to the START JES3 command issued out of the PARMLIB command (COMMNDxx) member. For example, the START command for the new system being added might look like:

S JES3,JES=JES9,ID=09,SUB=MSTR
Figure 32. START Command When Adding a New JES3 Global

7.2 Adding a New SYSRES


To be able to apply maintenance, add new products, or upgrade existing ones within a parallel sysplex without an overall sysplex outage, the installation requires the ability to clone additional SYSRESs into the environment. The following explains the process and provides examples of how this can be done, assuming the SYSRES is designed as in 2.3.1, Shared SYSRES Design on page 30. Let's assume that an additional SYSRES is to be introduced into the sysplex. The existing SYSRES is as follows:

VOLSER is SYSRESA
SMP/E target zone is TGTRESA
Target SMP/E data set high-level qualifier is SMP.RESA.**

The process to create the new SYSRES, called SYSRESB, would be as follows:

1. Initialize the new volume.

2. Copy all the data sets from SYSRESA to SYSRESB, excluding the VTOC, VVDS, and SMP/E data sets, and do not catalog them.

3. Copy the SMP/E target data sets to SYSRESB, rename them to SMP.RESB.**, and catalog them.

4. Use SMP/E ZONEEDIT on the target zone on SYSRESB to:

Change the SMP/E target zone name from TGTRESA to TGTRESB
Change the VOLUME entry in the DDDEFs from SYSRESA to SYSRESB
Change the DATASET entry in the DDDEFs for the SMP/E data sets from SMP.RESA.** to SMP.RESB.**

5. Add IPL text to SYSRESB.

7.2.1 Example JCL


Some example JCL to achieve the previous process is provided in the following figures.


//DSFINIT  JOB (999,POK),'INITIALIZE VOLUME',
//         MSGCLASS=X,NOTIFY=&SYSUID
//INIT1    EXEC PGM=ICKDSF
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  INIT UNITADDRESS(FD0) DEVTYP(3390) VOLID(SYSRESB) -
       VTOC(1113,0,90) INDEX(1110,0,45) PURGE NOVERIFY
/*
Figure 33. Volume Initialization. Initialize volume and name it SYSRESB

//DSSCOPY  JOB (999,POK),'COPY RES PACK',MSGCLASS=X,NOTIFY=&SYSUID
//STEP1    EXEC PGM=ADRDSSU,REGION=0M
//SYSPRINT DD SYSOUT=*
//RESIN    DD UNIT=3390,VOL=SER=SYSRESA,DISP=SHR
//RESOUT   DD UNIT=3390,VOL=SER=SYSRESB,DISP=SHR
//SYSIN    DD *
  COPY INDD(RESIN) OUTDD(RESOUT) -
       DS(EXCLUDE(SYS1.VTOCIX.* SYS1.VVDS.* SMP.**)) -
       TOLERATE(ENQF) SHARE ALLEXCP ALLDATA(*) CANCELERROR
  COPY INDD(RESIN) OUTDD(RESOUT) -
       DS(INCLUDE(SMP.RESA.**)) -
       RENAMEU(SMP.RESA.**,SMP.RESB.**) CATALOG -
       TOLERATE(ENQF) SHARE ALLEXCP ALLDATA(*) CANCELERROR
/*
Figure 34. Copy SYSRESA. Job copies data sets excluding VTOC, VVDS and SMP/E, and then copies, renames and catalogs the SMP/E data sets.


//SMPZEDIT JOB (999,POK),'ZONE EDIT',MSGCLASS=X,NOTIFY=&SYSUID,
//         TIME=1440,TYPRUN=HOLD
//SMP      EXEC PGM=GIMSMP,TIME=1440,REGION=0M
//SMPCSI   DD DISP=SHR,DSN=SMP.GLOBAL.CSI
//SYSPRINT DD SYSOUT=*
//SMPRPT   DD SYSOUT=*
//SMPOUT   DD SYSOUT=*
//SMPCNTL  DD *
  SET BDY(GLOBAL) .
  UCLIN.
    DEL GLOBALZONE ZONEINDEX((TGTRESB)) .
  ENDUCL.
  ZONERENAME(TGTRESA) TO(TGTRESB) OPTIONS(OPTMVST)
    RELATED(DLIB001) NEWDATASET(SMP.RESB.CSI) .
  SET BDY(TGTM02C) .
  ZONEEDIT DDDEF.
    CHANGE VOLUME(SYSRESA,SYSRESB) .
  ENDZONEEDIT .
  UCLIN.
    REP DDDEF(SMPLTS)  DATASET(SMP.RESB.LTS) .
    REP DDDEF(SMPMTS)  DATASET(SMP.RESB.MTS) .
    REP DDDEF(SMPSCDS) DATASET(SMP.RESB.SCDS) .
    REP DDDEF(SMPSTS)  DATASET(SMP.RESB.STS) .
  ENDUCL.
/*
Figure 35. SMP/E ZONEEDIT. Job renames target zone to TGTRESB, changes all DDDEF volumes to SYSRESB and changes the SMP/E target data set DDDEFs to SMP.RESB.**.

//IPLTEXT  JOB (999,POK),'IPL TEXT',MSGCLASS=T,NOTIFY=&SYSUID,
//         TYPRUN=HOLD
//IPLTEXT  PROC VOL=,UNIT=3390
//DSF      EXEC PGM=ICKDSF,REGION=1M
//SYSPRINT DD SYSOUT=*
//IPLVOL   DD DISP=SHR,VOL=SER=&VOL,UNIT=&UNIT
//IPLTEXT  DD DSN=SYS1.SAMPLIB(IPLRECS),
//         DISP=SHR,UNIT=&UNIT,VOL=SER=&VOL
//         DD DSN=SYS1.SAMPLIB(IEAIPL00),
//         DISP=SHR,UNIT=&UNIT,VOL=SER=&VOL
//         PEND
//STEP1    EXEC IPLTEXT,VOL=SYSRESB,UNIT=3390
//SYSIN    DD *
  REFORMAT DDNAME(IPLVOL) IPLDD(IPLTEXT) NOVERIFY BOOTSTRAP
/*
Figure 36. Add IPL Text. Job creates and places IPL text on SYSRESB.


7.3 Implementing System Software Changes


Having seen how to add a new SYSRES to the sysplex, it is now possible to see how to implement a change into the parallel sysplex without causing disruption to the overall sysplex. As an example let us continue with a sysplex consisting of eight images, and two SYSRES volumes, SYSRESA and SYSRESB, both at the same software level of N. All images are currently IPLed from SYSRESB.

Figure 37. Example parallel sysplex Environment

To introduce system software changes into the sysplex, such as maintenance, new products, or a product upgrade, the process is as follows:

Apply the change to SYSRESA. This has no effect on the sysplex, as all images are IPLed from SYSRESB.

Clone a new SYSRES volume, SYSRESC, from SYSRESA using the procedure described in 7.2, Adding a New SYSRES on page 151.

IPL one image in the sysplex from SYSRESC.

Ripple IPL all other images in a controlled manner from SYSRESC.


Figure 38. Introducing a New Software Level into the parallel sysplex

The result of this is that, for a period of time, the images within the sysplex are at the N and N+1 levels. Should the N+1 level, in this example SYSRESC, cause a problem, then the N level is still available to fall back to. By employing the ripple IPL technique, any potential problem is initially limited to one image, thereby reducing the impact on the whole sysplex. It can be seen, therefore, that the minimum number of SYSRESs required for this technique is three: one to act as the medium for change, and two to be the N and N+1 levels. An installation may need more than three SYSRESs. For example, in an eight-image sysplex there may be four images sharing one SYSRES and four sharing another. Either way, there will need to be at least two other SYSRESs available to facilitate the process of introducing change with minimum disruption as described. This basic philosophy of rippling a change through the parallel sysplex can be employed to propagate subsystem changes as well as system software changes through the sysplex. The manner in which the changes are implemented for a specific subsystem may differ from that for system software; this is discussed further in 7.6, Changing Subsystems on page 160.

7.4 Adding Subsystems


The following sections detail specific considerations for adding subsystem elements to a parallel sysplex configuration.


7.4.1 CICS
Actions required to add a new CICS subsystem may vary depending on the type of CICS region being introduced on the MVS image. Therefore, we point out which activities are specific to a TOR (Terminal Owning Region) and which to an AOR (Application Owning Region). Most of the definitions should already be in place because we are only adding a cloned CICS region. However, before activating a new CICS region, you should execute or verify the following activities:

- Verify that the new CICS data sets are protected using RACF or an equivalent external security manager.
- Verify the MVS definitions required by the functions being used by the new CICS region.
- Verify the existing subsystem definition in IEFSSNxx.
- Verify the existing entry for CICS in SCHEDxx.
- Verify the existing SMSVSAM server definitions (only for an AOR).
- If you are going to use MVS workload management with CICS, set up the appropriate MVS definitions and ensure that the CICS performance parameters match the current definitions.
- If you want to use the MVS Automatic Restart Manager (ARM) facility to handle the new CICS, verify the following (a sample policy definition follows this list):
  - Check that ARM is active on the MVS image.
  - Ensure that the MVS images available for ARM have access to the databases, logs, and program libraries required for the workload.
  - Ensure that the CICS startup JCL used to restart CICS regions is suitable for MVS ARM.
  - Ensure that the system initialization parameter XRF=NO is specified for CICS startup.
  - Specify appropriate CICS START options.
  - Define ARM policies for the new CICS region.
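To illustrate the last item, the following is a minimal sketch of an ARM policy definition using the XCF administrative data utility, IXCMIAPU. The policy name, restart group, and element name shown are hypothetical; the element name a CICS region registers under and the restart options you need depend on your CICS level and installation standards.

  //ARMPOL   JOB (999,POK),'ARM POLICY',MSGCLASS=X
  //DEFARM   EXEC PGM=IXCMIAPU
  //SYSPRINT DD SYSOUT=*
  //SYSIN    DD *
    DATA TYPE(ARM)
    DEFINE POLICY NAME(ARMPOL01) REPLACE(YES)
      RESTART_GROUP(CICSPROD)
        ELEMENT(SYSCICS_CICSAOR5)
          RESTART_ATTEMPTS(3)
          RESTART_METHOD(BOTH,PERSIST)
  /*

Once defined, the policy is activated with SETXCF START,POLICY,TYPE=ARM,POLNAME=ARMPOL01, assuming an ARM couple data set is already in use in the sysplex.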

- Set up all definitions required to enable the logging function. If you are adding either a TOR belonging to an existing and active application or a cloned AOR, you will not need to make all the logging definitions: the new CICS region joins the existing environment and uses predefined structures. In that case you should only plan the following activities (a sample log stream definition follows this list):
  - Verify that the existing coupling facility structure for log data has enough storage to support the increased activity rate. If necessary, expand the coupling facility structure size, if previously planned for, or change it through a new CFRM policy.
  - If log data duplexing is required, plan for the staging data set allocation.
  - Update the LOGR data set: define the new CICS log streams, which are implemented as two MVS system log streams (primary and secondary).
  - Activate the new LOGR definition.
  - Verify the archiving procedures.
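As a sketch of the LOGR update for a cloned region, the log streams can be defined with the IXCMIAPU utility, DATA TYPE(LOGR). The log stream and structure names below are hypothetical, the coupling facility structures are assumed to be already defined in the LOGR policy (as stated above), and the duplexing keywords are shown only as an example:

  //LOGRDEF  EXEC PGM=IXCMIAPU
  //SYSPRINT DD SYSOUT=*
  //SYSIN    DD *
    DATA TYPE(LOGR)
    DEFINE LOGSTREAM NAME(CICSAOR5.DFHLOG)
           STRUCTNAME(LOG_DFHLOG_001)
           STG_DUPLEX(YES) DUPLEXMODE(COND)
    DEFINE LOGSTREAM NAME(CICSAOR5.DFHSHUNT)
           STRUCTNAME(LOG_DFHSHUNT_001)
           STG_DUPLEX(YES) DUPLEXMODE(COND)
  /*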


Set up all security definitions for the logging function. The CICS region user ID must be authorized to write to (and create if necessary) the log streams that are used for its system log and general logs. If the setup of your installation allows for several CICS regions to share the same CICS region user ID, you can make profiles more generic by specifying an * for the APPLID qualifier. If this were done, then most of the definitions should already exist.

If you intend to use VTAM with CICS, you must define to VTAM each CICS region that is to use VTAM. You must also ensure that any VTAM terminal definitions are properly specified for connection to CICS (only for a TOR). To define your CICS regions to VTAM, you must (a sample definition follows this list):
- Define a VTAM application program major node (APPL)
- Issue a VARY ACT command to activate the APPL definition
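For example, a TOR could be defined to VTAM with an application program major node along the following lines, and then activated with a VARY command. The VTAMLST member name CICSAPPL, the APPL name CICSTOR5, and the operands shown are assumptions; use the values and options required by your installation.

  CICSAPPL VBUILD TYPE=APPL
  CICSTOR5 APPL  AUTH=(ACQ,VPACE,PASS),ACBNAME=CICSTOR5,PARSESS=YES

  V NET,ACT,ID=CICSAPPL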

- Allocate the data sets unique to the new CICS region.
- Verify or customize the DL/1 interface (only for an AOR).
- Verify or customize the DB2 support (only for an AOR).
- Verify the MRO and ISC support.
- Define the new CICS region to the CP/SM environment.

7.4.2 IMS Subsystem


This topic describes the major activities that must be done in order to create a cloned IMS subsystem, which includes IMS TM/DB and IRLM 2.1. It is assumed that the Extended Terminal Option (ETO) will be used for terminal definitions. Before bringing a new IMS into the data sharing group, the following actions are required:

- Define the IMS system parameters that are unique to this IMS instance (for example, the IMSID).
- Create the data sets required to support the new IMS.
- Update the IEFSSNxx member of SYS1.PARMLIB to define the new subsystem to MVS. The definition can be activated via the SETSSI command (see the sketch after this list).
- Define the ARM policy for the new IMS.
- Verify that the coupling facility structure sizes are large enough to accommodate the addition of the subsystem.
- Create the IRLM procedure.
- Use ETO to make changes to the terminal definitions.
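As a sketch of the subsystem definition step, assuming a hypothetical subsystem name of IMSB, the IEFSSNxx entry and the activation command could look as follows; check the exact IEFSSNxx format and SETSSI operands for your MVS level:

  IEFSSNxx entry:   SUBSYS SUBNAME(IMSB)

  Activation:       SETSSI ADD,SUBNAME=IMSB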

Refer to the following manuals for the installation details:


- IMS/ESA V5 Administration Guide: System
- IMS/ESA V5 Installation Volume 1
- IMS/ESA V5 Installation Volume 2


7.4.3 DB2
In this section we discuss how to add a new member to a DB2 data sharing group. DB2 data sharing is the only way to satisfy the requirements of applications that need very high levels of availability. With data sharing, you can run applications on many DB2 subsystems and access the same shared data. If one system must come down, either for planned maintenance or because of a failure, the work can be rerouted to another DB2 subsystem with no perceived outage to end users. In the same way, you are able to add a new subsystem to support increased workload demand.

DB2 subsystems that share data must belong to a DB2 data sharing group. A data sharing group is a collection of one or more DB2 subsystems accessing shared DB2 data. Each DB2 subsystem belonging to a particular data sharing group is a member of that group. All members of the group use the same shared DB2 catalog and directory.

Changes that occur during scheduled maintenance can be made on one DB2 at a time. If a DB2 or MVS must come down for the change to take place, and the outage is unacceptable to users, you can move those users onto another DB2. Most changes can be made on one DB2 at a time, as shown in Table 9, with no application disruption.
Table 9. DB2 Changes
Type of change      Action required
DB2 Code            Bring down and restart each DB2 member independently.
Attachment Code     Apply the change and restart the transaction manager or application.
System parameters   For those that cannot be changed dynamically, update using DB2's update
                    process. Stop and restart the DB2 to activate the updated parameter.

Adding a new member to the group should be treated as a new installation. You cannot take an independently existing DB2 subsystem and merge it into the group. The new member begins using the DB2 catalog of the originating member. The following list shows the actions required to add a new DB2 data sharing member: 1. Update IEFSSNxx with the subsystem definition and activate the changes through the following MVS command:

T SSN=xx
2. On panel DSNTIPA1, specify:

INSTALL TYPE          ===> INSTALL
DATA SHARING FUNCTION ===> MEMBER


3. On panel DSNTIPK, specify the name of the new member:

MEMBER NAME ===> new member name


4. Complete the installation path. It is recommended that you rename the customized SDSNSAMP data set for each member. This data set contains tailored JCL for each member. If you do not rename it, it will be overwritten when you install a new member name. We suggest that you choose a new name, for example prefix.NEW.SDSNSAMP, on installation panel DSNTIPT.
5. Define the system data sets (BSDS and active log data sets).
6. Initialize the system data sets.
7. Define the DB2 initialization statements.
8. Optionally:
   - Record DB2 data to SMF.
   - Establish security.
   - Connect IMS to DB2.
   - Connect CICS to DB2.
9. IPL the MVS image if you are using a multi-character command prefix.
10. Start the DB2 subsystem.
11. Define the temporary work files.

7.4.4 TSO
Adding a TSO application requires the following actions:

- Verify the startup procedure in a PROCLIB library.
- Check the contents of the IKJTSOxx member in parmlib.
- Define the APPL to VTAM.

There is no workload balancing for TSO sessions. A future release will support session balancing through the use of VTAM generic resources combined with WLM.

7.5 Starting the Subsystems


Starting the various subsystems within a parallel sysplex is virtually unchanged from the processes used in a normal environment. Subsystems can be started using commands, batch jobs, or as started tasks. However, to facilitate the cloning of subsystems across the sysplex, starting subsystems as started tasks is recommended. Cloning support is essentially provided by system symbolic substitution: JCL for started tasks supports system symbols, while batch job JCL does not. The following sections address any additional considerations for the sysplex environment.
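For example, a single started task procedure can be shared by all cloned regions when system symbols such as &SYSCLONE select the image-specific members. The following is a minimal sketch for a hypothetical CICS AOR; DFHSIP is the CICS initialization program, and the data set names and symbol usage are assumptions to be adapted to your naming conventions:

  //CICSAOR  PROC START=AUTO
  //CICS     EXEC PGM=DFHSIP,REGION=0M,PARM='SYSIN,START=&START'
  //STEPLIB  DD DISP=SHR,DSN=CICS.PROD.SDFHAUTH
  //DFHRPL   DD DISP=SHR,DSN=CICS.PROD.SDFHLOAD
  //SYSIN    DD DISP=SHR,DSN=CICS.PROD.SYSIN(AOR&SYSCLONE.)

Because the SYSIN member name is resolved from &SYSCLONE at start time, the same procedure started on each image picks up that image's initialization parameters.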

7.5.1 CICS
There is no difference between starting a CICS region in a traditional environment and in a parallel sysplex. You can start your CICS region either as a started task or as a batch job. Even if you are using CICS V5 in data sharing mode, no further user actions are required to open the connection to the SMSVSAM server: the CICS interface with SMSVSAM is through a control ACB, and CICS registers with this ACB to open the connection. CICS registers automatically during the initialization process.


7.5.2 DB2
There is a new process available in DB2 V4 called group restart, which is needed only in the rare event that critical resources in a coupling facility are lost and cannot be rebuilt. When this happens, all members of the group terminate abnormally. Group restart is required to rebuild this lost information from individual member logs. However, unlike data recovery, this information can be applied in any order. Because there is no need to merge log records, many of the restart phases for individual members can be done in parallel. An automated procedure can be used to start all members of the group. If a particular DB2 is not started, then one of the started DB2s performs a group restart on behalf of that stopped DB2.

7.5.3 IMS
A number of operator enhancements were made to IMS 5.1 to assist in the management of databases. Commands with the GLOBAL parameter globally affect data. If, for example, you enter the /START command with the GLOBAL parameter on one subsystem and specify several database names, then the IRLM transmits the command to other sharing subsystems. It deletes the names of any databases that are invalid for the local system before it transmits the command, and all sharing subsystems process the command. In these online data sharing systems, you observe the following messages:

DFS3334I GLOBAL START COMMAND seqno INITIATED BY SUBSYSTEM ssid FOR THE FOLLOWING DATABASES
DFS3328I GLOBAL START COMMAND seqno COMPLETE

The variable seqno is a reply sequence number uniquely identifying the command and associating this message with the completion message that follows. The variable ssid is the name of the system originating the command. The message includes the database names that were in your command. If you omit the GLOBAL parameter (or specify LOCAL), the command applies only to the local online system and does not affect access by any other IMS subsystem.

7.6 Changing Subsystems


Within a parallel sysplex, subsystems will need to be changed from time to time due to the need for maintenance, product upgrades, and so on. Such a change will need to be applied to one system and then propagated throughout the sysplex using the ripple technique described in 7.3, Implementing System Software Changes on page 154. It is recommended that changes always be implemented first on the same system in the sysplex. This helps operators and support staff: if a problem appears on that system and not elsewhere in the sysplex, then the problem is likely to have been caused by the change, and the appropriate backout and recovery actions can be taken.


The method for implementing changes to the subsystems will differ from that of system software in that there is no residence volume for the subsystems. The simplest way to make changes to the subsystem software is by the use of a STEPLIB statement in the initialization JCL. For example, for the subsystems on the system where you choose to do your changes, you might have a STEPLIB concatenation such as:

//STEPLIB  DD DSN=SUBSYS.TEST.RESLIB,DISP=SHR
//         DD DSN=SUBSYS.PROD.RESLIB,DISP=SHR


The following illustrates how these data sets would be used:

- Prior to any change, the first data set in the concatenation is empty.
- The contents of SUBSYS.PROD.RESLIB are copied to SUBSYS.TEST.RESLIB and the changes are applied to this library.
- The subsystem is closed down and restarted. The subsystem now accesses modules from the newly updated SUBSYS.TEST.RESLIB.
- Provided no problems are encountered running from this library over a suitable period of time, the TEST library can be renamed to PROD and the remaining subsystems started from this new level. The original PROD level becomes a backup library.
- Should problems occur after the initial change, fallback to the PROD level for that subsystem is straightforward.

7.7 Moving the Workload


The following sections detail any specific considerations when moving the workload from a subsystem within a parallel sysplex. This will be a requirement if a system or subsystem needs to be closed down to facilitate a disruptive change activity.

7.7.1 CICS
Different considerations apply depending on whether the target region to be closed is a TOR or an AOR. Figure 39 on page 162 illustrates a high availability configuration with multiple front-end CICS regions distributing the incoming workload over multiple AORs. This kind of configuration is able to balance the sessions using the VTAM generic resource feature. The generic resource name is normally shared by a number of CICS TORs. A VTAM application such as CICS can be known by a generic resource name in addition to its own VTAM application program name (APPLID). Both of the names are defined in the VTAM APPL definition statement for the CICS TOR, and VTAM keeps a list of the APPLIDs that are members of the same generic resource name set. For this reason, redistributing the new sessions to the TORs that remain active under the generic resource name is done automatically by VTAM/GR. Terminals connected to the outgoing TOR have to log on again to the generic resource and sign on again to CICS.


Figure 39. Redistributing Workload on TORs

In workload balancing scenarios, you would typically keep the AORs as similar as possible. The ideal AOR for a parallel sysplex is one that is capable of running any transaction. If all your AORs are identical, then the dynamic routing program has great flexibility in making routing decisions, and workload balancing is most effective. This will allow you, as shown in Figure 40 on page 163, to shut down an AOR region without any impact to operations. CICSPlex SM will automatically redirect the new transactions to the remaining AORs.


Figure 40. Redistributing Workload on AORs

7.7.2 IMS
Moving workload from one IMS instance to another requires a short outage for each terminal connected to the IMS that is being stopped. NetView automation can be used to bring down the terminal sessions and re-establish them on another IMS. Once all sessions have been moved, IMS can be stopped. Another approach is to shut down IMS and restart it in another MVS image. This method requires spare capacity in that MVS image to accommodate the moved IMS. This process takes longer than moving sessions and is hence more disruptive. Which procedure a customer uses will depend upon how critical the disruption of service is for their business. Regardless of how the work is moved, VTAM routes must exist from each terminal to the new IMS and must not all traverse a single VTAM node. Additional information can be found in the IMS/ESA V5 Operations Guide and the IMS/ESA V5 Sample Operating Procedures.

7.7.3 DB2
DB2 workload enters the system in the traditional ways. In the following sections we explore what actions are required to move workload coming either from transaction managers or from batch/TSO.


7.7.3.1 CICS and IMS


If you need to shut down a DB2 member in a data sharing group, you will be required to move the workload to the other members. You cannot use the group attachment for CICS and IMS applications, because these transaction managers must be aware of the particular DB2 subsystem to which they are attached so that they can resolve indoubt units of work in case of failure. The recommended way is to reroute transactions to the other DB2 members inside the data sharing group. For instance, for CICS it is recommended that routing be handled through CP/SM. This prevents new workload from being routed to the AORs attached to the outgoing DB2 member.

7.7.3.2 TSO and Batch


The suggested method to connect TSO and batch users to DB2 is through the group attachment name instead of the specific subsystem name. The group attachment name acts as a generic name for the DB2 subsystems in a data sharing group. This method allows you to easily move the TSO and batch workload around the parallel sysplex members without any further intervention. The group attachment name resolves to the DB2 subsystem running on the MVS image from which the job was submitted, so TSO and batch jobs are not sensitive to the particular subsystem name.
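As an illustration, a TSO batch job can name the group attachment on the DSN command rather than a specific member. The names DB0G, MYPROG, MYPLAN, and the load library below are hypothetical:

  //RUNPGM   EXEC PGM=IKJEFT01,DYNAMNBR=20
  //SYSTSPRT DD SYSOUT=*
  //SYSTSIN  DD *
    DSN SYSTEM(DB0G)
    RUN PROGRAM(MYPROG) PLAN(MYPLAN) LIB('DB2.RUNLIB.LOAD')
    END
  /*

The same JCL can then be submitted on any image in the sysplex that runs a member of the data sharing group.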

7.7.3.3 Call Attachment Facility (CAF) Applications


Programs can access DB2 in two ways using CAF: explicitly or implicitly. With either type of connection, CAF applications can use the group attachment name to connect to the DB2 subsystem. This makes the CAF connections insensitive to the subsystem name and easier to move around the sysplex. The only restriction when using the group attachment name is that the waiting feature cannot be used. If you need this feature, you have to connect explicitly using the particular subsystem name you need. This type of connection may require manual rerouting to another DB2 member.

7.7.4 TSO
There is no automated way to move TSO sessions from a quiescing TSO application to another TSO. The only way to redrive the logons is to log off the users, who must then log on again to another system. While you are quiescing the TSO application, you can prevent new users from logging on via the command F TSO,USERMAX=0. Currently, TSO does not support the VTAM/GR facility. TSO users must specifically access a new TSO application to redrive their logon.

7.7.5 Batch
You can stop new jobs from being run on a system by stopping all initiators. You will then have to wait until all running jobs have completed. Alternatively, you can cancel the jobs and resubmit them on another system. This assumes that they are restartable, and it may involve a lot of work to back out updates made by a job before it abended, so it is an alternative you must choose with care. A better way to transfer batch work in a planned way is to let your job scheduling system handle it for you.
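On a JES2 system, for example, the initiators can be drained and checked with operator commands along the following lines (a sketch; the exact commands depend on your JES2 setup):

  $P I          drain all JES2 initiators on this member
  $D I          display initiator status to confirm that they have drained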


7.7.5.1 Redirecting Batch Work under OPC/ESA Control


OPC/ESA supports redirecting of work from one system to another. You can manually redirect the work to an alternate system by using the OPC/ESA Modify Current Plan dialog. You can also specify an alternate system where work will be redirected in the event of a failure. This is recursive; that is, if you specify a loop A,B,C,A, then B will replace A, C will replace B, and so on. If all systems are down, OPC/ESA detects that as well and submits nothing.

7.7.6 DFSMS
Before removing an SMS element from the sysplex, you should verify that no specific activities or affinities belong to this system. Here is a list of potential activities that need to be redirected to another system:

Storage Group affinity
You should verify that there are no storage groups where new allocations are allowed only from the outgoing system. In that case you have to review the storage group status attributes, as shown in the following example, and activate a new SMS configuration so that new allocations can be made in these storage groups from another system.

ENABLE   Full access enabled by SMS
NOTCON   Not connected
DISNEW   New allocation disabled by SMS
DISALL   All allocation disabled by SMS
QUINEW   New job access quiesced by SMS
QUIALL   All job access quiesced by SMS
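For example, the status of a storage group can be displayed and changed from the console with commands along these lines, where SGPROD1 is a hypothetical storage group name; the change can also be made through ISMF and activated as a new SMS configuration:

  D SMS,STORGRP(SGPROD1),LISTVOL
  V SMS,STORGRP(SGPROD1),ENABLE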

DFSMShsm activities
In a multiple-processor environment, you define one DFSMShsm processor as the primary DFSMShsm processor. The primary processor automatically performs the primary processor functions of backup and dump. Primary processor functions are functions not related to one data set or volume. The following functions are performed only by the primary processor:
- Backing up control data sets
- Backing up data sets
- Deleting expired dump copies automatically
- Deleting excess dump VTOC copy data sets

The primary DFSMShsm processor is identified in the DFSMShsm startup procedure by the HOST parameter. Therefore, if the MVS image to be removed is the DFSMShsm primary processor, you should move these functions to another MVS in the sysplex. After closing the primary DFSMShsm, close the alternate DFSMShsm and restart it with the primary attribute.
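As a sketch, the primary attribute is carried in the HOST keyword of the DFSMShsm startup procedure. In the hypothetical fragment below, HOST=1Y identifies host 1 as the primary, and an alternate would be started with, for example, HOST=2N; the remaining startup parameters are omitted here:

  //DFHSM    PROC CMD=00,HOST=1Y
  //DFHSM    EXEC PGM=ARCCTL,REGION=0M,TIME=1440,
  //         PARM=('CMD=&CMD','HOST=&HOST')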

7.8 Closing Down the Subsystems


Closing down subsystems within a parallel sysplex differs little from closing down subsystems in a non-sysplex environment. The following sections highlight any additional considerations for the sysplex environment.


7.8.1 CICS
Stopping a CICS region requires the following actions:

- Check the definitions and setup within CP/SM to remove the CICS references.
- If you are closing a TOR, terminals connected to this region have to log on again to the generic resource and sign on again to CICS. If you are closing an AOR, no further actions are required.
- Shut down the CICS region with the normal stop procedure.

7.8.2 IMS
A common sequence for shutting down the entire online system is as follows:
1. Stop data communications
2. Stop dependent regions
3. Stop the control region
4. Stop the IRLM

The following sections describe these operations. The command used to shut down the control region also forces termination of communication and the dependent regions if they have not already been terminated in an orderly way.

7.8.2.1 Stopping Data Communications


In a VTAM environment, the /STOP DC command prevents new users from logging on, but does not terminate existing sessions. You can use the IMS /PURGE TRAN ALL command to prevent existing BTAM and VTAM terminal users from entering new transactions. Disable the BTAM terminals using the /STOP LINE and /IDLE LINE commands. Terminate the VTAM sessions with the IMS command /CLSDST NODE.

7.8.2.2 Stopping APPC


In an APPC environment, the /STOP APPC command prevents new users from allocating LU 6.2 conversations with IMS. The /STOP APPC CANCEL command causes APPC/MVS to initiate a shutdown request when you expect a lot of time before shutdown (for example, at the end of the day).

7.8.2.3 Stopping Dependent Regions


The /STOP REGION command terminates dependent regions. You also terminate dependent regions if you shut down the control region using a /CHECKPOINT command.

7.8.2.4 Stopping the Control Region


You can use the /CHECKPOINT command with the FREEZE, DUMPQ, or PURGE option to shut down the control region. If you use the IMS monitor, stop it with the /TRACE SET OFF MON command before using the /CHECKPOINT command. In a VTAM environment, you can allow active VTAM terminals to complete processing before the shutdown begins.


After the shutdown process has begun, you can use the /DISPLAY SHUTDOWN STATUS command to see how many and which communication lines and terminals still contain active messages. You can use the /IDLE command to stop I/O operations on the specified lines to speed up the shutdown process. If IMS fails to shut down after you have followed these shutdown procedures, and if logging resources are available, you must force it to terminate.

7.8.2.5 Shutting Down an IMS Network


The shutdown of an IMS network may or may not include a shutdown of IMS. The IMS /CHECKPOINT command is used to invoke termination of the network and a shutdown of IMS. The format of /CHECKPOINT used determines whether the network termination occurs immediately or waits for processing to complete.

/CHECKPOINT FREEZE|DUMPQ|PURGE causes immediate session termination for all logical units as follows:

FREEZE   Immediately after the current input/output message
DUMPQ    After blocks are checkpointed
PURGE    After all queues are empty

/CHECKPOINT FREEZE|DUMPQ|PURGE QUIESCE allows all network nodes to complete normal processing before initiating the shutdown processing.

7.8.2.6 Terminating an ISC Session from CICS


ISC sessions may be terminated by CICS by means of CICS control operator commands only. The CICS operator may use the CEMT command to release the session by entering CEMT SET TERMINAL(termid) RELEASED OUTSERVICE, where termid is the four-character session name (SESSNAME) on the DEFINE SESSIONS command in the CICS system definition utility (CSD), or is the terminal ID (TRMIDNT) on the DFHTCT TYPE=TERMINAL macro.

7.8.2.7 Stopping the IRLM


You can stop the IRLM from the system console using either the MODIFY irlmproc,ABEND,NODUMP or STOP irlmproc commands. The presence of the IRLM subsystem is required during the entire online execution, if you use IRLM as your resource lock manager. When all IMS subsystems making use of the IRLM have completed their processing, terminate the IRLM subsystem to release resources.

7.8.3 DB2
As shown in Figure 41 on page 168 there is no problem in accessing data when one subsystem comes down. Users can still access their DB2 data from another subsystem. Transaction managers are informed that DB2 is down and can switch new user work to another DB2 subsystem in the group.


Figure 41. DB2 Data Sharing Availability

There might be a situation in which you want to remove members from the group permanently or temporarily. For example, assume your group does the job it needs to do 11 months of the year. However, you get a surge of additional work every December that requires you to expand your capacity. It is possible to quiesce some members of the group for those 11 months. Those members are dormant until you restart them. The same principle is used to remove a member of the group permanently. You quiesce the member to be removed, and keep the log data sets until they are no longer needed for recovery (other members might need updates that are recorded on that member's log).

In summary, to quiesce a member of the group, you must:
1. Stop the DB2 you are going to quiesce. Our example assumes you want to quiesce member DB3G.

-DB3G STOP DB2 MODE(QUIESCE)


2. From another member, enter the following commands:

DISPLAY GROUP
DISPLAY UTILITY (*) MEMBER(member-name)
DISPLAY DATABASE(*) RESTRICT
If there is no unresolved work, no further action is required. However, if you want to create an archive log, go to step 4 on page 169.


3. If there is unresolved work, or if you want to do optional logging to create a disaster recovery archive log, start the quiesced member with ACCESS(MAINT).

-DB3G START DB2 ACCESS(MAINT)


If there is unresolved work, resolve any remaining activity for the member, such as resolving indoubt threads, finishing or stopping utility work, and so on. 4. Optionally, to create an archive log that can be sent to a disaster recovery site, archive the log for the member by entering the following command:

-DB3G ARCHIVE LOG


5. Stop DB2 again with MODE(QUIESCE).

-DB3G STOP DB2 MODE(QUIESCE)


A quiesced member (whether you intend for it to be quiesced forever or only temporarily) still appears in displays and reports. It appears in DISPLAY GROUP output with a status of QUIESCED.

7.8.4 System Automation Shutdown


When closing down a system we must take into account the automation tools that are running on that system. Local tools will be shut down as part of the system shutdown. Global tools must have their function transferred to another system in the sysplex. For example, if the system being closed down is running the focal point AOC/MVS or NetView, then this function will have to be transferred to the backup focal point system.

7.8.4.1 Closing Down the OPC/ESA Tracker


You must wait until all the OPC-controlled batch jobs have completed before shutting down the OPC/ESA tracker on a system that you are closing down. Otherwise you will lose the job completion events for these jobs and the applications will not be able to continue.

7.8.4.2 Transferring the OPC/ESA Controller


You can have a standby controller, which can be used to take over the functions of the active controller, on one or more OPC/ESA controlled systems within the XCF group. The standby system is started in the same way as the other OPC/ESA address spaces, but is not activated unless a failure occurs or unless it is directed to take over via an MVS/ESA operator modify command.

7.9 Removing an MVS Image


If you need to remove an MVS image permanently, the only thing you need to do is tidy up the various PARMLIB and other library members where it has been defined. See 7.1, Adding a New MVS Image on page 149 to find out what these are. This is a relatively trivial job and is nondisruptive.


Chapter 8. Database Availability


Database availability is as important as the availability of the supporting system infrastructure. To an IT department offering 24x7 service levels to their customers, continuous availability means 24x7 access to data; having the database unavailable for three-hour windows for backup or reorganization is becoming less and less acceptable. This chapter discusses options to address the three main causes of disruption to database availability:

1. Batch
   Most installations divide their database processing day into online and batch. Generally, the two periods cannot overlap. The overnight batch processing can only begin once the online systems have been stopped, and conversely, the online systems cannot be brought up again next morning until the overnight batch window has closed. The reduction of the batch window can be addressed by a number of technology solutions, including techniques such as Data-in-Memory, BatchPipes, and so on.

2. Backup
   The impact of taking backups can be alleviated to some extent with database mirroring solutions such as 3990 Remote Copy.

3. Reorg
   Database reorganization can occur either on a scheduled basis or on an emergency basis.

8.1 VSAM
VSAM files usually belong to a CICS transaction manager. They can be accessed locally by a single CICS region, or shared between multiple CICS AORs either traditionally through a file-owning region (FOR) or, with a future release of CICS, directly through record level sharing (RLS). With a no-single-point-of-failure configuration, CICS is able to provide 24x7 service. In this environment, availability of the data is the key concern for continuous operations. Some data set operations can be done without stopping the online activity, but some still cannot be executed concurrently with online activities. The next sections discuss the kinds of application outages that are still necessary.

8.1.1 Batch
Currently, there is no capability to share VSAM databases between online and batch workloads. Before starting the batch processing, all required databases must be deallocated from the online transaction manager. With a future release of CICS this restriction will be lifted. In this RLS environment there will be the capability to share VSAM databases between online and batch processing, with the restriction that batch can access the database only for read operations.


8.1.2 Backup
There are different techniques for backing up VSAM files. In this section we summarize which methods and products you can use to avoid deallocating the VSAM files from the online workload. Starting with MVS/DFP V3.2, DFHSM V2.5, and DFDSS V2.5, CICS is able to provide backup of its VSAM files while they are still open for online updates.

CICS Backup While Open (BWO) is an online backup facility that allows data sets to be backed up even while they are being updated. BWO uses DFSMSdss, through DFHSM, as the data mover and creates a fuzzy copy. When restoring a fuzzy backup, you must also include any logs of changes made since the backup process started. If a file is eligible for BWO, CICS sets the BWO attribute in the catalog entry and writes information in the catalog entry at regular intervals. This information includes the time from which the forward recovery utility must start applying records, and is defined as the recovery point.

However, there are some things to consider when the BWO technique is used. DFSMSdss reads data sets sequentially, so if a control interval (CI) or control area (CA) split occurs, it cannot assure data integrity and the backup is flagged as invalid. Therefore, if you want to use BWO with a file in which many records are inserted, you should schedule it during a period of low activity. To avoid this problem, DFSMS Concurrent Copy can be used with the BWO function. Once the DFSMS Concurrent Copy begins, any CI or CA splits that occur will not invalidate the copy. BWO and Concurrent Copy provide a point-in-time backup of CICS VSAM files with full data integrity.

With DFSMS version 1.2, the only operational consideration for Concurrent Copy with BWO is the possibility that a CI or CA split is already in progress when the DFHSM backup of the VSAM file begins. In this case, DFDSS will fail the backup and will not retry. You must either schedule a manual backup or wait until the next backup cycle. CICSVR provides forward recovery of the CICS VSAM file using the backup copy and all CICS journal records logged after the backup was taken.

For further reading, refer to Implementing Concurrent Copy, GG24-3990; Concurrent Copy Overview, GG24-3936; CICS/ESA Release Guide, GC33-0655; and CICS VSAM Recovery Guide, SH19-6709.

8.1.3 Reorg
In general, most VSAM file reorganizations require that the file be removed from the online system.

8.2 IMS/DB
Applications running in a DL/1 environment are well positioned to offer a 24x7 continuous operation. The major issue is the database reorganization process.


8.2.1 Batch
IMS supports concurrent access from online and batch programs through the use of batch message processing programs (BMPs). Therefore, if your installation requires 24x7 service, you must use BMPs for all batch jobs. BMPs have characteristics of programs in both online and batch environments in that they run online but are started with job control language (JCL), like programs in a batch environment. Input for BMPs can be from an MVS file or from the IMS message queue. BMPs do not necessarily process messages, although they can; BMPs can access the database concurrently with MPPs, even if your installation does not use data sharing. However, with data sharing, true batch programs, as well as BMPs or MPPs, can access the same database concurrently. Although BMPs are generally used to perform batch-type processing online, they can send or receive messages. There are two kinds of BMPs:

- A transaction-oriented BMP accesses message queues for its input and output. It can also process input from MVS files and it can create MVS files for its output.
- A batch-oriented BMP does not access the message queue for input; it is simply a batch program that runs online. It can send its output to any MVS output device.

8.2.2 Backup
The techniques used to back up a DL/1 database depend on the type of database. IMS databases are divided into Full Function Databases (FFDBs) and Fast Path Databases (FPDBs). FPDBs support higher transaction rates and offer some enhancements in data management. On the other hand, FPDBs require more virtual storage than FFDBs. FPDBs can be further divided into Data Entry Databases (DEDBs) and Main Storage Databases (MSDBs).

Full Function Database: The FFDB is the standard IMS database. The access methods are HSAM, HISAM, HDAM, and HIDAM. The image copy utilities supported depend on the access methods used.

Data Entry Database: DEDBs are similar to FFDBs; the main differences are:

- DEDBs support partitioning of the database into multiple areas, each of which exists in a separate area data set.
- DEDBs support maintaining multiple copies of any area data set. If the area data set is defined as having multiple copies, the recovery procedure must rebuild the first copy, and then a separate process must re-establish the multiple copies.
- With DEDBs, the log contains only the after image of the data. The DASD copy of a DEDB is not written until the transaction reaches a commit point, usually when it finishes.

Main Storage Database: The MSDB is located in main storage. This means that an MSDB can be accessed faster than a DEDB. However, it also means that MSDBs are more limited in size. MSDBs also have very significant functional limitations. For example, you cannot add or delete a root segment in an MSDB without shutting down IMS. MSDBs are not supported by IMS V5. With this release of IMS, a new VSO option for DEDBs causes IMS to place the entire contents of a DEDB into storage. This gives the performance advantage of main storage occupancy without the functional limitations of MSDBs.

A backup of an IMS database is called an image copy. An image copy can be produced using either an IMS utility or a user utility, and may be performed either online or offline (batch). In this section we put the emphasis on the online techniques available for IMS databases.

Concurrent Image Copy Option: This is the database image copy utility run with the concurrent image copy (CIC) parameter in the EXEC statement. This allows an image copy to be taken while the database continues to be updated. IMS concurrent image copy supports DEDBs and has been enhanced to support FFDBs in IMS/ESA Version 4.1. The resulting image copy is not a point-in-time backup; however, it can be used with the appropriate log to recover the database. This is sometimes called a fuzzy image copy. VSAM KSDSs are not supported by the concurrent image copy option.

Online Database Image Copy Utility (DFSUICP0): This utility is executed as an online utility. It runs as a batch message processing (BMP) program. You can use it only for HISAM, HIDAM, and HDAM databases. All logs active while the image copy is being created are required as input to the recovery. DBRC plays an important role here by maintaining in the RECON data sets the recovery information obtained from log archive activity and image copy executions. DBRC uses system timestamps to determine the various logs required for a potential database recovery.

8.2.3 Reorg
In most cases, reorganizing a database requires that the database be removed from the online service, causing an outage for the users of the system. However, DEDBs can be reorganized online as long as the space allocation does not have to be changed. There are also some vendor products that extend the standard IBM utilities to provide online database reorganization for other database organizations.

8.3 DB2
As yet there is no full 24x7 availability for DB2 databases. However, through database partitioning and hardware features, DB2 is able to limit outages of the database during backup and reorganization processing.

8.3.1 Batch
There is no particular restriction on batch activities against DB2 databases. DB2 can support online, batch, CAF, and TSO queries concurrently. The only concern is performance.


8.3.2 Backup
In DB2, the term image copy refers only to data copies that are taken with the DB2 image copy utility. Image copies are taken at the table space level. Until DB2 Version 3, no other data copies, such as DFSMSdss or IDCAMS copies, could be used by DB2 for recovery. DB2 keeps track of the image copies by registering them in the DB2 catalog. It automatically selects the correct image copy for any recovery needed. An image copy that is not registered in the DB2 catalog cannot be used for recovery.

DB2 utilities provide some functions to back up databases while they are still in use. For example, image copies can be taken concurrently while other applications update the data. The DB2 image copy utility can be invoked with either:

SHRLEVEL CHANGE      Concurrent update is allowed.
SHRLEVEL REFERENCE   Read only; no concurrent update is permitted.
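As an illustration, a minimal image copy utility statement that allows concurrent updates could look as follows; the table space name DSNDB04.TS1 is hypothetical, and the copy is written to the SYSCOPY DD of the utility job:

  COPY TABLESPACE DSNDB04.TS1 SHRLEVEL CHANGE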

Starting with DB2 V3 you can also use the DFSMS Concurrent Copy feature to speed up DB2 database backup. However, DB2 is not aware of a copy taken in this manner, so you have to manage the copies, and recovery from them, yourself in the event that a recovery is required. There is a new option in the recovery process, called log only, that is designed to work as a follow-on to a restore from a copy taken with DFSMSdss concurrent copy.

8.3.3 Reorg
Up to now there is no full capability for online database reorganization. However, there is granularity in DB2 database reorganization. It is not required that the entire database be deallocated in order to be reorganized; you have to stop only the partition of the database that needs to be reorganized.


Part 3. Handling Unplanned Outages


This part describes how to handle unplanned outages and recover from error situations with minimal impact to the applications.


Chapter 9. Parallel Sysplex Recovery


This chapter discusses how to recover from unplanned hardware and software failures.

9.1 System Recovery


Here we consider the recovery actions if you lose an entire system. This may be caused either by a hardware problem which causes the CPC to fail, or by a software error which causes the MVS to fail. In either case, we must recover the entire workload which was running on that system. The sysplex provides two functions to assist you:

9.1.1 Sysplex Failure Management (SFM)


SFM provides automated sysplex recovery for signalling connectivity failures, missing system status updates, and PR/SM reconfiguration. Its job is to isolate or fence the failing system, drain ongoing I/O, and perform an I/O system reset to release any reserved devices. For more details on SFM, see 2.15, Automating Sysplex Failure Management on page 57.

9.1.2 Automatic Restart Management (ARM)


ARM is an MVS function which is invoked in case of a failure in the sysplex. It uses installation defined policies to identify the critical jobs or started tasks, which it then automatically restarts. This ensures continued running of important work. For details on using ARM to assist in recovery, see 2.17, ARM: MVS Automatic Restart Manager on page 79.

9.1.3 What Needs to Be Done?


In the case of a software problem you may choose to restart MVS on the same machine and try to get back in operation as quickly as possible. If it is a hardware problem it may take some time before you can restart MVS on the failing CPC, so here our priority will be to distribute the workload over the other systems as quickly as possible. In either case you will need to take some cleanup actions to take care of jobs or transactions that were running at the time of the problem and have been abended or stopped. These will need to be backed out and restarted in some way. In order to ensure that all this is done quickly and correctly, we recommend that you automate these processes as far as possible.


9.2 Coupling Facility Failure Recovery


As seen from the coupling facility exploiter's standpoint, the coupling facility may fail to provide service because of one of the following:

Connectivity failure. This is a solid failure affecting the communication between the host MVS and the coupling facility, such as defective CFS or CFR CHPIDs or defective CFC links, and there are no more communication links available to the coupling facility. Note: a coupling facility going out of operation also causes a connectivity failure.

Structure failure. This is a functional problem reported by the coupling facility against the structure(s) it is keeping in its processor storage. Getting a structure failure indication implies that the coupling facility is operative enough to report an internal problem which may have affected the structure contents. By its nature, a structure failure has a more pervasive effect on the sysplex than a connectivity failure, except if the connectivity failure is due to a coupling facility going not operational.

The coupling facility becomes volatile. A coupling facility switches from the nonvolatile state to the volatile state when:
- An operator at the CFCC console enters the command MODE VOLATILE (refer to 1.3.5, Coupling Facility Volatility/Nonvolatility on page 8), or
- The coupling facility power control system detects a potential malfunction in the battery backup unit or the local UPS (refer to 1.15.2, 9672/9674 Protection against Power Disturbances on page 27).

Switching from the nonvolatile to the volatile state does not affect coupling facility operation provided that the primary power is still present, but it may matter to connected exploiters that requested the structure to be allocated in a nonvolatile coupling facility.

In most cases the recovery for a coupling facility failure is to move the affected structure(s) to another location that is not in the current failure domain. The new location can be either inside the same coupling facility or in another coupling facility. Note also that the recovery can be either attempted immediately or deferred. If deferred, the sysplex continues with some members possibly affected by the failure.

These failures are reported to the coupling facility exploiters by XES via the exploiter's EVENT exit, along with some additional information intended to help the connected exploiter in making a recovery decision. If possible, recovery is automatically initiated. Note that the philosophy here is only to provide the structure's exploiter with information and facilities to help it drive recovery. It is up to the exploiter code to decide whether or not recovery should be performed, and to what extent.

MVS provides services, based on the Sysplex Failure Management (SFM) service, to help the coupling facility exploiters automate recovery from a coupling facility failure. The recommendation is to use SFM whenever applicable. If SFM is not applicable, or should the automated recovery fail, then operator intervention can be considered. SFM is explained in further detail in 2.15, Automating Sysplex Failure Management on page 57.

The following sections indicate the ways of recovering from a coupling facility/coupling technology failure. Information on how to move a structure can be found in 5.4, To Move a Structure on page 120. Information on how IBM subsystems specifically recover from a coupling facility failure can be found in these sections:

- DB2 at 9.4, DB2 V4 Recovery from a Coupling Facility Failure on page 189
- XCF at 9.5, XCF Recovery from a Coupling Facility Failure on page 192
- RACF at 9.6, RACF Recovery from a Coupling Facility Failure on page 194
- VTAM at 9.7, VTAM Recovery from a Coupling Facility Failure on page 196
- IMS/DB at 9.8, IMS/DB Recovery from a Coupling Facility Failure on page 197
- JES2 at 9.9, JES2 Recovery from a Coupling Facility Failure on page 199
- System logger at 9.10, System Logger Recovery from a Coupling Facility Failure on page 203
- Tape switching at 9.11, Automatic Tape Switching Recovery from a Coupling Facility Failure on page 204
- VSAM RLS at 9.12, VSAM RLS Recovery from a Coupling Facility Failure on page 205

We also describe what a system operator can do to assess the problem and what the recommended course of action is.
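Where a structure's exploiter supports rebuild, the operator can check a structure and move it out of the failure domain with standard XCF commands; the following is a sketch using ISTGENERIC purely as an example structure name:

  D XCF,STR,STRNAME=ISTGENERIC
  SETXCF START,REBUILD,STRNAME=ISTGENERIC,LOCATION=OTHER

The LOCATION=OTHER keyword asks XES to rebuild the structure in another coupling facility in the CFRM preference list.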

RACF  VTAM 4.2  Structure Disp: DELETE Connection Disp: DELETE Connection Disp: KEEP Connection Disp: DELETE Structure Disp: KEEP Structure Disp: KEEP Processing as for VTAM 4.2. VTAM 4.3  JES2 Checkpoint  VTAM Generic Resources  Logger  If a VTAM Node fails, other VTAMs provide necessary cleanup. Generic resources can be restarted on another VTAM node. Other instances of Logger coordinate migration of logstream data that had not been written to DASD by failed system. Structure Disp: DELETE Connection Disp: KEEP Local persistent data may exist for LU61 and LU62 sessions with SYNCPT data. Structure Disp: KEEP Connection Disp: KEEP Processing as for VTAM 4.2. System Logger initiates structure rebuild. See 9.10.1 on page 203. VTAM initiates rebuild of structure ISTGENERIC when any VTAM member in the sysplex loses connectivity. See 9.7.1 on page 196. System which loses connectivity to RACF structure switches to read-only mode; rest of sysplex continues in data-sharing mode. Checkpoint is moved, either to another structure or DASD, according to specification of OPVERIFY in JES2 initialization parms. Structure has disposition of KEEP and remains allocated even if checkpoint is forwarded to DASD. See 9.9.1 on page 199. If all RACF instances lose connectivity, structure is automatically deallocated and reallocated as per CFRM policy preference list. See 9.6.1.2 on page 194. Processing as per No Active SFM Policy. Processing as per No Active SFM Policy. VTAM initiates rebuild of structure ISTGENERIC as per the active SFM policy WEIGHTs and CFRM policy REBUILDPERCENT. VTAM 4.2 ignores REBUILDPERCENT and initiates structure rebuild as soon as one VTAM member loses connectivity. See 9.7.1 on page 196. No action. See 9.6.2 on page 195. No action. See 9.7.3 on page 196. No action. See 9.7.3 on page 196. Processing as specified on CKPTDEF stmt: issue msg, enter Chkpt Reconfig Dialog, or ignore. See 9.9.3 on page 203. Logger initiates structure rebuild. See 9.10.2 on page 203. RACF initiates structure rebuild. If not possible, switches to non data sharing mode. See 9.6.1.3 on page 195. JES2 does not support rebuild. Operator must use Checkpoint Reconfiguration Dialog to switch checkpoint to DASD. Processing as for CF connectivity failure. See 9.9.2 on page 202. If structure ISTGENERIC fails, each VTAM attempts to initiate structure rebuild. New ISTGENERIC is replenished from the local data of each VTAM node in the generic resource configuration. See 9.7.2 on page 196. RACF supports operator initiated structure rebuild but with restrictions. See 9.6.3 on page 195. Operator must enter JES2 Checkpoint Reconfiguration Dialog. See 9.9.4 on page 203. VTAM supports operator initiated structure rebuild. See 9.7.4 on page 196. VTAM supports operator initiated structure rebuild. See 9.7.4 on page 196. Logger supports operator initiated structure rebuild. See 9.10.4 on page 204. If structure fails, each VTAM attempts to initiate structure rebuild. New structure is replenished from the local data of each VTAM node in the generic resource configuration. See 9.7.2 on page 196. Logger rebuilds the logstream structures into another CF. See 9.10.3 on page 203. VTAM initiates rebuild of structure as per the active SFM policy WEIGHTs and CFRM policy REBUILDPERCENT. VTAM 4.3 initiates structure rebuild when REBUILDPERCENT is reached. See 9.7.1 on page 196. Logger initiates structure rebuild as per the active SFM policy WEIGHTs and CFRM policy REBUILDPERCENT. See 9.10.1 on page 203.

Table 10 (Page 1 of 2). Subsystem Recovery Summary Part 1. The table summarizes recovery actions for the subsystems for different failure types.

Subsystem: XCF    Element: Signalling Paths    (Structure Disp: DELETE, Connection Disp: DELETE)

Single System Failure: System is partitioned out of sysplex. No persistent data in XCF signalling structure.

Loss of CF Connectivity (No Active SFM Policy, or SFM Policy Specifies CONNFAIL(NO)): XCF initiates structure rebuild when any sysplex member loses connectivity to the XCF signalling structure. If any member cannot recover connectivity to the new and only structure, it is partitioned out of the sysplex. See 9.5.1 on page 192.

Loss of CF Connectivity (Active SFM Policy with CONNFAIL(YES)): Processing as per No Active SFM Policy.

CF Volatility Change: No action.

CF Structure Failure: Processing as per No Active SFM Policy.

Manual Structure Rebuild: XCF supports operator initiated structure rebuild. See 9.5.3 on page 193.

Table 10 (Page 2 of 2). Subsystem Recovery Summary Part 1. The table summarizes recovery actions for the subsystems for different failure types.
RACF  VTAM 4.2  Put all RACF data-sharing instances into non data sharing mode: See 9.7.5 on page 197. See 9.7.5 on page 197. RVARY NODATASHARE See 9.6.4 on page 196. Not supported. Stop all connected instances of VTAM. Stop all connected instances of VTAM. VTAM 4.3  All logstream exploiters must disconnect from System Logger: See 9.10.5 on page 204. JES2 Checkpoint  VTAM Generic Resources  Logger

Subsystem: XCF    Element: Signalling Paths

Manual Deallocation of Structure: Remove all connections by stopping the signalling paths using the structure:

  SETXCF STOP,PI,STRNAME=strname
  SETXCF STOP,PO,STRNAME=strname

See 9.5.4 on page 193.

DB2 GBP  Possibility of failed persistent data. Connection Disp: KEEP Structure Disp: KEEP Connection Disp: DELETE Cache Connection Disp: DELETE Data sharing group member fails. See 9.12.1 on page 205. See 9.4.1 on page 189. Data sharing group member fails. Allocation initiates IEFAUTOS structure rebuild. See 9.11.1 on page 204. Connection Disp: KEEP Cache Structure Disp: KEEP Structure Disp: KEEP Lock Connection Disp: DELETE Possibility of failed persistent data. Structure Disp: DELETE Lock Structure Disp: KEEP SCA Lock Structure Disp: DELETE Connection Disp: DELETE IMS  SMSVSAM (VSAM RLS)  Tape Sharing (IEFAUTOS)  IMS does not initiate structure rebuild. The data sharing member which lost connectivity enters non data sharing mode. See 9.8.1 on page 197. SMSVSAM attempts to rebuild both cache and lock structures. DB2 initiates structure rebuild in alternate CF, if possible. See 9.4.1 on page 189. See 9.8.1 on page 197. See 9.4.1 on page 189. DB2 initiates structure rebuild in alternate CF, if possible. IMS initiates structure rebuild according to the active SFM policy WEIGHTs and CFRM policy REBUILDPERCENT. SMSVSAM initiates structure rebuild according to the active SFM policy WEIGHTs and CFRM policy REBUILDPERCENT. See 9.12.1 on page 205. No action. See 9.12.3 on page 206. Allocation initiates IEFAUTOS structure rebuild according to the active SFM policy WEIGHTs and CFRM policy REBUILDPERCENT. See 9.11.1 on page 204. No action. See 9.11.3 on page 204. DB2 issues warning message but does not initiate structure rebuild. DB2 issues warning message but does not initiate structure rebuild. Neither IMS nor IRLM take any action if the CF becomes volatile. See 9.8.3 on page 198. DB2 initiates structure rebuild in alternate CF. See 9.4.2 on page 190. See 9.4.2 on page 190. DB2 initiates structure rebuild in alternate CF. IMS initiates structure rebuild in alternate CF. SMSVSAM attempts to rebuild the structure (either cache or lock). See 9.12.2 on page 205. All connectors to IEFAUTOS attempt to start rebuild. If system cannot continue with rebuild, it disconnects. See 9.11.2 on page 204. Operator initiated rebuild supported. See 9.11.4 on page 204. See 9.8.4 on page 198. SETXCF START,REBUILD See 9.12.4 on page 206. DB2 supports manual rebuild of SCA structure via the command: SETXCF START,REBUILD without disruption to data sharing group members. See 9.4.4 on page 190. SETXCF START,REBUILD without disruption to data sharing group members. See 9.4.4 on page 190. IGWLOCK00 lock structure no support Not supported. See 9.11.6 on page 205. Stop all DB2 instances currently connected to structure. Lock structure has disp KEEP, so use command: SETXCF FORCE See 9.4.7 on page 199. OSAM/VSAM cache structure: stop all connected DBMS instances. DB2 supports manual rebuild of lock structure via the command: IMS supports the manual rebuild of the lock, OSAM and VSAM structures. SMSVSAM supports rebuild of both the lock and cache structures via the command: Stop all DB2 instances currently connected to structure. SCA structure has disp KEEP, so use command: SETXCF FORCE See 9.4.7 on page 199. SMSVSAM supports manual deallocation against the SMSVSAM cache structure through the VARY SMS command. IRLM lock structure: identify all IRLMs connected to the structure and stop all DBMS instances connected to those IRLMs. Note: lock structure has disposition of KEEP. See 9.8.5 on page 199. See 9.12.5 on page 206.

Table 11. Subsystem Recovery Summary Part 2. The table summarises recovery actions for the subsystems for different failure types.

Subsystem

Element

Single System Failure

Possibility of failed persistent data.

Structure Disp: DELETE

Connection Disp: KEEP

Loss of CF Connectivity

No Active SFM Policy

Continuous Availability with PTS

SFM Policy Specifies CONNFAIL(NO)

Transactions using GBP when connectivity is lost add their pages to the Logical Page List (LPL), making them unavailable for other transactions. N e w transactions attempting to access affected GBP receive SQL rc=904 indicating pages not available.

See 9.4.1 on page 189.

Loss of CF Connectivity

Active SFM Policy with CONNFAIL(YES)

Processing as for no active SFM policy, above.

CF Volatility Change

D B2 issu es w arnin g message but does not initiate structure rebuild.

CF Structure Failure

Processing as for Loss of CF Connectivity .

Manual Structure Rebuild

Not supported.

See 9.4.4 on page 190.

Manual Deallocation of Structure

Stop usage of GBP by stopping DB2 instances, stopping databases, or adjusting buffer pools.

9.3 Assessment of the Failure Condition


The following is an introduction to the recovery actions necessary in the case of a coupling facility failure.

9.3.1 To Recognize a Structure Failure


A structure failure is not reported directly by MVS; instead it is reported by the structure's exploiters, either explicitly with messages describing the extent of the failure and what recovery actions, if any, are in progress, or implicitly, with messages indicating only that some actions are in progress. Here are examples of these messages:

When a lock structure fails:

DXR143I IRLK REBUILDING LOCK STRUCTURE BECAUSE IT HAS FAILED OR
  AN IRLM LOST CONNECTION TO IT
DXR146I IRLK REBUILD OF LOCK STRUCTURE COMPLETED SUCCESSFULLY

When an XCF signalling structure fails:

IXC467I STARTED REBUILD FOR PATH STRUCTURE IXCPLEX_PATH1
  RSN: STRUCTURE FAILURE
  DIAG073: 08880001 092A0000 2000E800 000000000
IXC457I REBUILT STRUCTURE IXCPLEX_PATH1 ALLOCATED WITH 1000 LISTS
  WHICH SUPPORTS FULL SIGNALLING CONNECTIVITY AMONG 32 SYSTEMS
  AND UP TO 14428 SIGNALS
IXC465I REBUILD REQUEST FOR STRUCTURE IXCPLEX_PATH1 WAS SUCCESSFUL
  WHY REBUILT: STRUCTURE FAILURE

When a RACF structure fails:

IRRX007I RACF DATASHARING GROUP IS INITIATING A REBUILD FOR
  STRUCTURE IRRXCF00_P001.
ICH15019I INITIATING PROPAGATION OF RVARY COMMAND TO MEMBERS OF
  RACF DATA SHARING GROUP I IN RESPONSE TO A REBUILD REQUEST.
ICH15020I RVARY COMMAND INITIATED IN RESPONSE TO THE REBUILD REQUEST
  HAS FINISHED PROCESSING

When a VTAM structure fails:

IST1381I REBUILD STARTED FOR STRUCTURE ISTGENERIC
. . .
IST1383I REBUILD COMPLETE FOR STRUCTURE ISTGENERIC

There are, as of now, two known variations on the way a structure failure is indicated:

VSAM or OSAM structure: There is currently no message stating that damage has been detected in the structure, nor is there a message when the structure is rebuilt, even though the rebuild is done automatically. The only indication that might appear would be U3033 abends in transactions whose database calls fail while the structure is being rebuilt.


JES2 checkpoint: A checkpoint structure which fails is treated by JES2 as an I/O error, and the checkpoint reconfiguration can be automatically initiated. This is further explained at 9.9, JES2 Recovery from a Coupling Facility Failure on page 199.

9.3.2 To Recognize a Connectivity Failure


The system has lost connectivity to a structure when all paths available to access the structure become not operational. Message IXL158I, as shown in the following example, is issued each time a path becomes not operational; as long as at least one path to the structure remains operational, the system has not lost connectivity to the structure. When all paths to a coupling facility are not operational, message IXC518I indicates that the coupling facility is no longer usable.

In this example, the system is connected to CF2 via CHPIDs 12 and 14.

IXL158I PATH 12 IS NOW NOT-OPERATIONAL TO CUID: FFF8 066
  COUPLING FACILITY 009672.IBM.51.000000060043
  PARTITION: 3 CPCID: 00
IXL158I PATH 14 IS NOW NOT-OPERATIONAL TO CUID: FFF8 067
  COUPLING FACILITY 009672.IBM.51.000000060043
  PARTITION: 3 CPCID: 00

IXC518I SYSTEM SF1 NOT USING 081
  COUPLING FACILITY 009672.IBM.51.000000060043
  PARTITION: 3 CPCID: 00
  NAMED CF2
  REASON: CONNECTIVITY LOST.
  REASON FLAG: 13300001.

D XCF,CF,CFNAME=CF2
IXC362I 16.38.36 DISPLAY XCF 101
  CFNAME: CF2
  COUPLING FACILITY : 009672.IBM.51.000000060043
                      PARTITION: 3 CPCID: 00
  POLICY DUMP SPACE SIZE: 2000 K
  ACTUAL DUMP SPACE SIZE: 2048 K
  STORAGE INCREMENT SIZE:  256 K
  NO SYSTEMS ARE CONNECTED TO THIS COUPLING FACILITY

Note that the same messages will show up if the coupling facility becomes not operational.

9.3.3 To Recognize When a Coupling Facility Becomes Volatile


As with a structure failure, the indication that a coupling facility has switched from the nonvolatile to the volatile state comes from the coupling facility exploiters. The volatility state of the coupling facility can then be verified with the D CF,CFNAME=cfname command.


IXG104I STRUCTURE REBUILD INTO STRUCTURE SYSTEM_OPERLOG 608
  HAS BEEN STARTED FOR REASON: COUPLING FACILITY VOLATILITY STATE CHANGE
IXG209I RECOVERY FOR LOGSTREAM SYSPLEX.OPERLOG 316
  IN STRUCTURE SYSTEM_OPERLOG COMPLETED SUCCESSFULLY.
IXG110I STRUCTURE REBUILD FOR STRUCTURE SYSTEM_OPERLOG IS COMPLETE. 317
  LOGSTREAM DATA DEFINED TO THIS STRUCTURE MAY BE LOST FOR CERTAIN LOGSTREAMS

D CF
IXL150I 23.01.31 DISPLAY CF 927
  COUPLING FACILITY 009674.IBM.51.000000060041
                    PARTITION: 3 CPCID: 00
                    CONTROL UNIT ID: FFF6
  NAMED CF1
  COUPLING FACILITY SPACE UTILIZATION
   ALLOCATED SPACE              DUMP SPACE UTILIZATION
    STRUCTURES:    11264 K       STRUCTURE DUMP TABLES:        0 K
    DUMP SPACE:     2048 K       TABLE COUNT:                  0
    FREE SPACE:    17920 K       FREE DUMP SPACE:           2048 K
    TOTAL SPACE:   31232 K       TOTAL DUMP SPACE:          2048 K
                                 MAX REQUESTED DUMP SPACE:     0 K
  VOLATILE:  YES                 STORAGE INCREMENT SIZE:     256 K
  CFLEVEL: 1
  COUPLING FACILITY SPACE CONFIGURATION
                        IN USE         FREE          TOTAL
   CONTROL SPACE:      13312 K      17920 K        31232 K
   NON-CONTROL SPACE:      0 K          0 K            0 K
  SENDER PATH       PHYSICAL     LOGICAL      STATUS
   13                ONLINE       ONLINE       VALID
  COUPLING FACILITY DEVICE   SUBCHANNEL   STATUS
   FFEA                       0361         OPERATIONAL/IN USE
   FFEB                       0362         OPERATIONAL/IN USE

9.3.4 Recovery from a Connectivity Failure


The way to recover from a connectivity failure is to move the structure to another coupling facility to which all exploiters have proper connectivity. The movement can be accomplished either by deallocating the structure and reallocating it in another coupling facility, or by invoking the rebuild process. The choice is dictated by criteria such as the capability to rebuild, the possible disruption to operations, and whatever automated recovery the exploiter provides. Built-in automated recovery should be the first choice.

9.3.4.1 Using the Deallocation/Reallocation Process


1. While the structure is still in its original location, switch the active CFRM policy to a new policy whose preference list makes the targeted coupling facility the best candidate for allocation. Changing an active CFRM policy is discussed in 5.6, Changing the Active CFRM Policy on page 125.

2. Deallocate the structure. In most cases this translates into shutting down the structure's exploiters. However, some exploiters may have been coded to only disconnect from the structure when such a failure occurs. Deallocating a structure is discussed in Appendix B, Structures, How to ... on page 241.

3. Reallocate the structure. In most cases this translates into restarting the structure's exploiters.

Whether this process is applicable, and how it can be applied to the IBM structure exploiters, is indicated for each of them in the sections dedicated to their individual recovery.

9.3.4.2 Using the Manual Rebuild


Rebuilding a structure to solve a connectivity problem implies the following:

- All the current connectors to the structure allow the structure to be rebuilt.
- There is at least one connection to the structure still active.
- The alternate coupling facility is on the preference list in the active CFRM policy for the structure to be rebuilt.
- The candidate coupling facility must have enough free space to accommodate the new instance of the structure.
- All exploiters have connectivity to the new instance of the structure.
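If these conditions are met, the rebuild can be requested with the SETXCF command. As an illustration (strname is a placeholder for the structure name), the following asks XES to allocate the new instance in a coupling facility other than the one currently holding the structure, subject to the preference list:

SETXCF START,REBUILD,STRNAME=strname,LOC=OTHER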

9.3.4.3 Automated Recovery from a Connectivity Failure


As already stated, in the case of a connectivity failure to a structure, MVS/XES informs the structure's exploiters of the failure and provides them with additional information and a recommendation as to whether they should disconnect or attempt to rebuild the structure. Some exploiters (such as XCF) may elect to attempt rebuilding every time; others (such as JES2) may decide not to even attempt rebuilding and will always require operator intervention to recover. Others may decide to drop all connections and reconnect as an attempt to recover from the failure. The decision to rebuild can also be weighted using the SFM policy and the REBUILDPERCENT parameter in the active CFRM policy. This is intended to delay rebuilding until a significant number of the structure's exploiters, or some important exploiters, have lost connectivity to the structure. SFM weights are explained in 2.15, Automating Sysplex Failure Management on page 57.

9.3.5 Recovery from a Structure Failure


As with a connectivity failure, the recovery from a structure failure is to move the structure. However, because this type of problem can be related to a specific functional area in the coupling facility (such as a defective memory block), you can attempt to build the new instance of the structure in the same coupling facility as the original one. The same applies to attempting recovery by deallocating/reallocating the structure; however, the general recommendation is to reallocate or rebuild into another coupling facility whenever possible.


9.4 DB2 V4 Recovery from a Coupling Facility Failure


The following describes the DB2 behavior to resolve a coupling facility failure.

9.4.1 DB2 V4 Built-In Recovery from Connectivity Failure


The recovery decision made by DB2 will depend on whether there is an active SFM policy or not.

9.4.1.1 No Active SFM Policy, or Active Policy with CONNFAIL(NO)


Loss of Connectivity to the SCA or Lock Structure: The data sharing group members losing connectivity to either the lock or the Shared Common Area (SCA) structure are brought down. They are recovered by restarting their DB2 instances once the connectivity problem has been repaired. As an alternate solution, they can be restarted from another host MVS that still has connectivity to the structure.

Loss of Connectivity to a Group Buffer Pool (GBP) Structure: The data sharing group members losing connectivity to a GBP attempt to carry on operations without the affected GBP:

- Transactions which were using the GBP at the time of loss of connectivity add their pages to the Logical Page List (LPL), making these pages unavailable for other transactions.
- New transactions which try to access the affected GBP receive a SQL return code of -904, indicating that the pages are not available.

To recover from this situation:

- Repair the connectivity failure. As soon as connectivity to the GBP is restored, the affected data sharing group members will automatically reconnect to the GBP.
- You could also stop the affected DB2 members and restart them from a host MVS which still has connectivity to the GBP structure.

If the connectivity failure cannot be repaired, or if the GBP contents have been damaged, delete all the connections left to the GBP (these are failed-persistent connections, so SETXCF FORCE will have to be used). This deallocates the GBP, and one of the DB2 members then performs a damage assessment and marks the affected DB2 objects as GBP recovery pending (GRECP), with messages to indicate which databases have been affected. The next step is to restart the affected databases with the START DB command; this reallocates the GBP, and the objects are recovered. This can be done from any DB2 member.
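As an illustration of deleting the failed-persistent connections (gbp_structure_name is a placeholder for the actual GBP structure name), the remaining connections can be displayed and then forced with:

D XCF,STR,STRNAME=gbp_structure_name
SETXCF FORCE,CONNECTION,STRNAME=gbp_structure_name,CONNAME=ALL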

9.4.1.2 Active SFM Policy with CONNFAIL(YES)


Loss of Connectivity to SCA or Lock Structure: This will result in automatically rebuilding the structure in an alternate coupling facility provided that:

- The alternate coupling facility is on the active preference list for the involved structure, and has enough free processor storage to accommodate a new instance of the structure.
- All participating DB2 members have connectivity to the alternate coupling facility.
- The active REBUILDPERCENT threshold for the involved structure has been reached.

Loss of Connectivity to a GBP Structure: Having an active SFM policy does not affect the way a loss of connectivity to a GBP is handled. The recovery has to be performed as in 9.4.1.1, No Active SFM Policy, or Active Policy with CONNFAIL(NO) on page 189, that is, either of the following:

- Recover connectivity to the GBP.
- Proceed with GBP deallocation and restart the affected database.

9.4.2 DB2 V4 Built-In Recovery from a Structure Failure


Not having an active SFM policy does not affect the way a structure failure is handled by DB2 V4.

9.4.2.1 Structure Failure in SCA or Lock Structure


DB2 V4 always attempts to rebuild the structure in an alternate coupling facility.

9.4.2.2 Structure Failure in GBP Structure


The recovery has to be done manually by GBP deallocation and recovery of affected objects as in 9.4.1.1, No Active SFM Policy, or Active Policy with CONNFAIL(NO) on page 189.

9.4.3 Coupling Facility Becoming Volatile


If a coupling facility containing DB2 structures becomes volatile, DB2 issues warning messages but does not attempt to move the structure.

9.4.4 Manual Structure Rebuild


The lock and SCA structures can be manually rebuilt using SETXCF START,REBUILD, without stopping the data sharing group members. However, rebuild is not supported for Group Buffer Pool (GBP) structures. Manually moving a GBP structure from one coupling facility to another requires that you first deallocate the GBP structure from the first coupling facility and then reallocate it. This basically means that the GBP users must be temporarily stopped and/or disconnected. How to deallocate a GBP is explained in 9.4.5, To Manually Deallocate and Reallocate a Group Buffer Pool.
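For example, using placeholder names for the DB2 lock and SCA structures (substitute the actual structure names of your data sharing group), the rebuilds are requested with:

SETXCF START,REBUILD,STRNAME=lock_structure_name
SETXCF START,REBUILD,STRNAME=sca_structure_name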

9.4.5 To Manually Deallocate and Reallocate a Group Buffer Pool


There are basically four possible methods to deallocate a GBP. They are described in detail in DB2 for MVS/ESA Data Sharing and Administration Guide, SC26-3269. A new CFRM policy with a properly modified preference list must be started before reallocation to ensure the reallocation into another coupling facility. A summary of the four methods follows:

The simplest one is to stop all DB2 members in the data sharing group. This method is mandatory if the GBP to deallocate is GBP0 (GBP0 contains catalog and directory). As the GBP structure has a disposition of DELETE, it will be automatically deallocated. Reallocation will be performed upon restart of the DB2 instances.


If it is not possible to stop all members and the GBP to be deallocated is not GBP0, then one of the three following methods can be used.

- Delete the virtual buffer pool by altering its size to 0. This will initiate disconnection from the related GBP.
- Stop all databases, thereby removing the page sets' dependence on the GBP, which eventually results in disconnecting from the GBP.
- Stop only the page sets which use the associated buffer pool. This is the most granular way to minimize the GBP deallocation impact.

To reallocate the structure, restart the stopped elements or alter the size of the virtual buffer pool, depending on the method selected for deallocation.
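As a sketch of the "alter the virtual buffer pool size to 0" method (the buffer pool name BP1 and the sizes are illustrative; verify the exact syntax in the DB2 command reference), the virtual buffer pool could be deleted and later restored with the DB2 commands:

-ALTER BUFFERPOOL(BP1) VPSIZE(0)
-ALTER BUFFERPOOL(BP1) VPSIZE(1000)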

9.4.6 To Manually Deallocate a DB2 Lock Structure


The procedure is as follows:

1. Identify all IRLMs connected to the structure by using the following command:

D XCF,STR,STRNAME=lock_structure_name
This will result in displaying at the console a list of all connectors, that is, a list of the IRLMs currently using the structure or with connections in the failed-persistent state, if any.

D XCF,STR,STRNAME=IRLMLOCK1
STRNAME: IRLMLOCK1
 STATUS: ALLOCATED
 POLICY SIZE    : 32000 K
 POLICY INITSIZE: N/A
 REBUILD PERCENT: 1
 PREFERENCE LIST: CF1      CF2
 EXCLUSION LIST IS EMPTY

 ACTIVE STRUCTURE
 ----------------
 ALLOCATION TIME: 11/01/95 08:22:20
 CFNAME         : CF1
 COUPLING FACILITY: 009674.IBM.02.000000040020
                    PARTITION: 1 CPCID: 00
 ACTUAL SIZE    : 32000 K
 STORAGE INCREMENT SIZE: 256 K
 VERSION        : ABE834ED BC6B2002
 DISPOSITION    : KEEP
 ACCESS TIME    : 0
 MAX CONNECTIONS: 23
 # CONNECTIONS  : 4

 CONNECTION NAME  ID VERSION  SYSNAME JOBNAME ASID STATE
 ---------------- -- -------- ------- ------- ---- ----------------
 IRLMGRP1$IRLA001 07 0007006D Z0      IRLMA   0081 ACTIVE
 IRLMGRP1$IRLB002 02 0002006D J80     IRLMB   0038 ACTIVE
 IRLMGRP1$IRLC003 01 0001007F J90     IRLMC   003B ACTIVE
 IRLMGRP1$IRLD004 05 0005006E JA0     IRLMD   008D ACTIVE

2. For each one of the IRLMs identified above, identify the DB2 instances using the following command:

F irlm_name,STATUS

F IRLMD,STATUS
DXR101I IRLD STATUS SCOPE=GLOBAL
SUBSYSTEMS IDENTIFIED
NAME  STATUS  UNITS  HELD  WAITING  RET_LKS
IMSD  UP      5      271   0        0

3. Stop all the DB2 instances identified by the previous command. Note that lock structures have a disposition of KEEP; therefore the SETXCF FORCE command will have to be used to complete the deallocation. SETXCF FORCE must also be used if any connection remains in the failed-persistent state. Reallocation is performed when the DB2 instances are restarted. Reallocation can be directed to another coupling facility by changing the active preference list before restarting the DB2s.
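As a sketch, using the structure name from the example above (and assuming all connectors have been stopped or are in the failed-persistent state), the remaining connections and then the structure could be forced with:

SETXCF FORCE,CONNECTION,STRNAME=IRLMLOCK1,CONNAME=ALL
SETXCF FORCE,STRUCTURE,STRNAME=IRLMLOCK1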

9.4.7 To Manually Deallocate a DB2 SCA Structure


The procedure is as follows:

1. Identify all DB2 members connected to the structure by using the following command:

D XCF,STR,STRNAME=sca_structure_name
This will result in displaying at the console a list of all connectors, that is, the DB2 members currently using the structure or with connections in the failed-persistent state, if any.

2. Stop all DB2 instances identified by the above command. Note that SCA structures have a disposition of KEEP; therefore the SETXCF FORCE command will have to be used to complete the deallocation. SETXCF FORCE must also be used if any connection remains in the failed-persistent state. Reallocation is performed when the DB2 members are restarted. Reallocation can be directed to another coupling facility by changing the active preference list before restarting the DB2s.

9.5 XCF Recovery from a Coupling Facility Failure


The following describes the XCF behavior to resolve a coupling facility failure.

9.5.1 XCF Built-In Recovery from Connectivity or Structure Failure


XCF always automatically attempts to rebuild the structures in an alternate coupling facility, whether there is an active SFM policy or not. It ignores the REBUILDPERCENT parameter; it always initiates rebuild as soon as one sysplex member loses connectivity to the XCF signalling structure. However, if the attempt to move the structure is unsuccessful, that is, at least one sysplex member cannot recover XCF connectivity even to the new structure location (assuming it does not have backup CTCs), XCF proceeds with partitioning of the nonconnected member (see 9.5.5, Partitioning the Sysplex on page 193).


Note that the decision to partition is made by XCF itself, without consulting the XCF exploiters. An XCF signalling exploiter cannot in any way prevent XCF from partitioning a member out of the sysplex when XCF loses signalling connectivity. We recommend that both CTC paths and coupling facility structures be available for XCF signalling.

9.5.2 Coupling Facility Becoming Volatile


XCF does not attempt to move the structure(s) to another coupling facility.

9.5.3 Manual Invocation of Structure Rebuild


XCF supports the manual invocation of a structure rebuild, even if the structure being rebuilt is the last XCF signalling vehicle available in the sysplex. However, rebuilding the last XCF signalling structure in the sysplex, without XCF CTC links available, is a lengthy process. Because XCF connectivity is disrupted for the duration of the process, this may lead to the active SFM policy taking over and deciding to partition temporarily non-responding system(s) out of the sysplex. Ways of avoiding this problem are:

- Have backup XCF signalling structures in another coupling facility (recommended).
- Have backup CTC links for XCF signalling (recommended).
- Modify the INTERVAL parameter in COUPLExx to account for the longer time (not recommended).

9.5.4 Manual Deallocation of the XCF Signalling Structures


To manually deallocate an XCF signalling structure, remove all the connections to the structure by stopping all the PATHINs and PATHOUTs using the structure. This is accomplished by issuing the following commands:

SETXCF STOP,PI,STRNAME=xcf_strname
SETXCF STOP,PO,STRNAME=xcf_strname


An XCF signalling structure is reallocated as soon as any PATHIN or PATHOUT is started with a specification to use the structure.
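For example, using the same placeholder structure name, the structure is reallocated when the signalling paths are started again:

SETXCF START,PATHIN,STRNAME=xcf_strname
SETXCF START,PATHOUT,STRNAME=xcf_strname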

9.5.5 Partitioning the Sysplex


A system which has lost XCF connectivity to any other member of the sysplex must be partitioned off the sysplex. This can be initiated by the system operator or by SFM when a SFM policy with CONNFAIL(YES) is active. SFM is discussed in 2.15, Automating Sysplex Failure Management on page 57.

9.5.5.1 Manual Sysplex Partitioning


Manual partitioning occurs if there is no active SFM policy or if the currently active SFM policy has CONNFAIL(NO). The operator can deliberately partition a system off the sysplex with the following command:

V XCF,sysname,OFFLINE
In this case the operator will be prompted with the following message:

IXC371D CONFIRM REQUEST TO VARY SYSTEM sysname OFFLINE. REPLY SYSNAME=sysname TO REMOVE sysname OR C TO CANCEL.

The operator must go to the system partitioned off the sysplex and perform a system reset if no SFM policy is active. If a SFM policy is active, the partitioned system will be automatically isolated and an I/O interface reset performed, as long as the system to be isolated shares coupling facility connectivity with any other operating MVS image in the sysplex. For further details on the isolate function refer to 2.15.2, The SFM Isolate Function on page 59.

9.5.5.2 Automatic Sysplex Partitioning


The partitioning is automatically initiated if there is an active SFM policy with CONNFAIL(YES). The sysplex configuration after partitioning depends on the WEIGHTs coded for the participating systems (note that no WEIGHT is the same as WEIGHT(1)). See 2.15.3, SFM Parameters on page 63 for a discussion of the WEIGHT parameter. The partitioning process ends with the target system being automatically isolated from the sysplex, via hardware signals issued through the coupling facility, provided the system to be isolated shares coupling facility connectivity with any other operating MVS image in the sysplex. Examples of sysplex partitioning are given in Appendix D, Examples of Sysplex Partitioning on page 259.

9.5.5.3 To Re-join the Sysplex


The partitioned system has to be IPLed to re-join the sysplex, whatever the method of partitioning (manual or automatic). This assumes that the XCF connectivity has been restored.

9.6 RACF Recovery from a Coupling Facility Failure


The following describes the RACF behavior to resolve a coupling facility failure.

9.6.1 RACF Built-In Recovery from Connectivity or Structure Failure


RACF recovery does not depend on the contents of the SFM policy.

9.6.1.1 Connectivity Failure


Unless all participating RACF instances experience the connectivity problem to the same structure, there is no attempt by RACF to move the structure. That is, if only one system, as an example, loses connectivity to a RACF structure, that RACF instance switches to read-only mode while the rest of the sysplex still proceeds in RACF data sharing mode. If all RACF instances in the sysplex lose connectivity to the structure, the structure will be automatically deallocated and then reallocated as per the CFRM policy preference list. At deallocation time, all RACF instances in the data sharing group switch to non-data-sharing mode, and they resume data sharing mode as soon as the reallocation is successful.

9.6.1.2 To Recover from an Individual RACF Loss of Connectivity


The structure has to be manually rebuilt into a location to which the disconnected RACF instance has connectivity. Once the rebuild in this location completes successfully, the disconnected instance automatically reconnects and resumes data sharing mode.


RACF structure rebuild: Because of the specific way structure rebuild is implemented in RACF, rebuilding into another coupling facility requires that you modify the active preference list first. See 9.6.3, Manual Invocation of Structure Rebuild on page 195.

9.6.1.3 RACF Recovery from Structure Failure


On a structure failure, RACF will always proceed with deallocation and reallocation of the structure, as per the active CFRM preference list.

9.6.2 Coupling Facility Becoming Volatile


RACF does not attempt to move the structure(s) when the coupling facility becomes volatile.

9.6.3 Manual Invocation of Structure Rebuild


RACF supports the SETXCF START,REBUILD command for the database structures; however, the structure movement is actually achieved by deallocation and reallocation of the structures. The deallocation and reallocation is initiated by an automatically issued RVARY NODATASHARE followed by an RVARY DATASHARE, both of which are broadcast sysplex-wide.

SETXCF START,REBUILD,STRNM=IRRXCF00_P001
IXC367I THE SETXCF START REBUILD REQUEST FOR STRUCTURE 527
  IRRXCF00_P001 WAS ACCEPTED.
IXC521I REBUILD FOR STRUCTURE IRRXCF00_P001 HAS BEEN STOPPED
ICH15019I INITIATING PROPAGATION OF RVARY COMMAND 299
  TO MEMBERS OF RACF DATA SHARING GROUP IRRXCF00
  IN RESPONSE TO A REBUILD REQUEST.
IXC509I CFRM ACTIVE POLICY RECONCILIATION EXIT HAS STARTED. 300
  TRACE THREAD: 0000018F.
..........................
IXC509I CFRM ACTIVE POLICY RECONCILIATION EXIT HAS COMPLETED. 301

This brings the following restriction to the RACF rebuild:


- The rebuild always occurs as if LOC=NORMAL.
- For any one rebuild request to RACF, all the RACF structures currently allocated are rebuilt.

The implication is that the rebuild always scans the CFRM preference list, and it is likely that the same coupling facility will be selected again to receive the structure. In order to have the RACF structures rebuilt into a different coupling facility, one of the following conditions must be met:

The rebuild cannot be performed on the original coupling facility because of a permanent failure and there is another coupling facility available in the preference list.


The CFRM active policy is changed with a preference list designating another operational coupling facility as the best candidate for the allocation of the structure.

9.6.4 Manual Deallocation of RACF Structures


To manually deallocate RACF structures, put all RACF data sharing group members in non data sharing mode, by issuing the following command:

RVARY NODATASHARE
This command has a sysplex scope and initiates disconnection of the data sharing group members from the structures, and hence the structures are deallocated. The RACF structures are reallocated when you issue the following:

RVARY DATASHARE

9.7 VTAM Recovery from a Coupling Facility Failure


The following section describes the VTAM behavior to resolve a coupling facility failure.

9.7.1 VTAM Built-In Recovery from Connectivity Failure


9.7.1.1 No Active SFM Policy, or an Active Policy with CONNFAIL(NO)
VTAM always attempts to rebuild the structure as soon as any VTAM member in the sysplex loses connectivity to the structure.

9.7.1.2 Active SFM Policy with CONNFAIL(YES)


VTAM initiates structure rebuild as per the active SFM policy WEIGHTs and CFRM policy REBUILDPERCENT. Note that VTAM 4.2 ignores the REBUILDPERCENT parameter and rebuilds the structure as soon as one member loses connectivity. VTAM 4.3 starts rebuilding the structure as soon as the REBUILDPERCENT threshold is reached.

9.7.2 VTAM Built-In Recovery from a Structure Failure


VTAM always initiates rebuild on a structure failure, whether an SFM policy is active or not.

9.7.3 The Coupling Facility Becomes Volatile


VTAM does not attempt to move the generic resource name structure if the coupling facility becomes volatile.

9.7.4 Manual Invocation of Structure Rebuild


VTAM supports manual rebuild of the generic resource name structure. Note that VTAM 4.2 will ignore the rebuild request if it detects that one or more of the members has no connectivity to the new structure location. VTAM 4.3 will rebuild in any case.


9.7.5 Manual Deallocation of the VTAM GRN Structure


The manual deallocation must be performed by stopping all the connected instances of VTAM. Note: If VTAM is not properly brought down, connections to the VTAM structure will remain in a failed-persistent state. The consequences of forcing these connections must be well understood, particularly if LU 6.1 or LU 6.2 sync level 2 are used, because these applications or subsystems expect permanent connection affinities.

9.8 IMS/DB Recovery from a Coupling Facility Failure


The following section describes the IMS behavior to resolve a coupling facility failure.

9.8.1 IMS/DB Built-In Recovery from a Connectivity Failure


9.8.1.1 No Active SFM Policy, or an Active Policy with CONNFAIL(NO)
IMS does not initiate the rebuild of the affected structure. The data sharing member which lost connectivity goes to non-data-sharing mode.

For a Connectivity Failure to the Lock Structure:


- The affected IRLMs remain active in failure status.
- The batch jobs using the affected IRLMs abend.
- The IMS/TMs or DBCTLs using the affected IRLMs are quiesced.
- Dynamic backout is invoked for in-flight transactions.

If the connectivity is restored, IRLM reconnects automatically to the lock structure, IMS and DBCTL reconnect automatically to IRLM, and operations resume. If the connectivity cannot be restored, either of the following must be done:

- Manually rebuild the lock structure into another coupling facility that all IRLMs have connectivity to.
- Restart the IMS and DBCTL instances on a system where IRLM still has connectivity to the lock structure. Note that an IMS/DB instance can be restarted and use a different IRLM to process any new locks or previously retained locks. The restart of the new IMS instance can be implemented as an automatic operation using ARM (refer to 2.17, ARM: MVS Automatic Restart Manager on page 79).

If batch jobs abended while the connectivity to the coupling facility was lost, you may need to run the Batch Backout utility to recover the databases used by those batch jobs.

For a Connectivity Failure to a Cache Structure:


Note: An IMS cache structure (OSAM or VSAM) is used only for its directory part; the loss of the structure prevents data sharing because it is not possible to perform cross-invalidation.

- The local buffers are invalidated.



- IMS stops all databases with SHARELVL=2 or 3 (that is, stops data sharing).
- If IMS/TM is used, affected transactions are put in the suspend queue.

If the connectivity is restored, IMS reconnects automatically to the cache structure and starts the affected databases. The transactions are released from the suspend queue. If the connectivity cannot be restored, do either of the following:

- Manually rebuild the cache structure into another coupling facility to which all IRLMs in the data sharing group have connectivity.
- Restart the IMS instances on a system which still has connectivity to the cache structure.

9.8.1.2 Active SFM Policy with CONNFAIL(YES)


IMS will initiate the rebuild of the affected structure as per the SFM policy WEIGHTs and CFRM policy REBUILDPERCENT.

9.8.2 IMS/DB Built-In Recovery from a Structure Failure


On a structure failure, there is always a request for a dynamic rebuild either by IRLM or IMS. Transactions are held during the rebuild and are automatically resumed upon successful completion of the rebuild.

9.8.2.1 IRLM Lock Structure Failure


IRLM requests a dynamic rebuild. The rebuild can be successful only if all involved IRLMs can connect to the new instance of the structure. If the dynamic rebuild is not successful, a manual rebuild can be attempted after modifying the active preference list to designate a new best candidate coupling facility to rebuild into.

9.8.2.2 OSAM/VSAM Cache Structure Failure


The local buffers are invalidated IMS requests a dynamic rebuild of the structure. Data sharing operations automatically resume on the successful completion of the rebuild.

If the dynamic rebuild is not successful, a manual rebuild can be attempted after modifying the active preference list to designate a new best candidate coupling facility to rebuild into.

9.8.3 Coupling Facility Becoming Volatile


Neither IMS nor IRLM attempts to move the structures when the coupling facility becomes volatile.

9.8.4 Manual Invocation of Structure Rebuild


IMS/DB supports the manual rebuild of the lock, OSAM and VSAM structures.
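For example, using placeholder structure names (substitute the actual IRLM lock, OSAM and VSAM structure names):

SETXCF START,REBUILD,STRNAME=irlm_lock_structure_name
SETXCF START,REBUILD,STRNAME=osam_structure_name
SETXCF START,REBUILD,STRNAME=vsam_structure_name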


9.8.5 Manual Deallocation of an IRLM Lock Structure


The procedure is as follows:

1. Identify all IRLMs connected to the structure by using the following command:

D XCF,STR,STRNAME=lock_structure_name
This will result in displaying at the console a list of all connectors, that is, the IRLMs currently using the structure or with connections in the failed-persistent state, if any.

2. For each one of the IRLMs identified above, identify the DBMS instances connected to this IRLM, using:

F irlm_name,STATUS
3. Stop all DBMS instances connected to this IRLM (IMS/DB, DBCTL or DL/I batch); this results in IRLM disconnecting from the lock structure. Note that the lock structures have a disposition of KEEP; therefore the SETXCF FORCE command will have to be used to complete the deallocation. The SETXCF FORCE command must also be used for those connections which remain in the failed-persistent state for a lock structure. Reallocation is performed when the IMS and/or DBCTL instances are restarted.

9.8.6 Manual Deallocation of a OSAM/VSAM Cache Structure

Use the D XCF,STR,STRNAME=cache_strname command to identify all the DBMS instances connected to the structure. Stop all the identified DBMS instances.

The structure will be automatically reallocated when the DBMS instances are started again.

9.9 JES2 Recovery from a Coupling Facility Failure


JES2 does not support the rebuild function for the checkpoint structure. Any checkpoint structure movement has to be accomplished through the JES2 checkpoint reconfiguration facility. Connectivity failures or structure failures are treated as checkpoint I/O errors by JES2.

9.9.1 Connectivity Failure to a Checkpoint Structure


Depending on the value of the OPVERIFY parameter in the JES2 initialization parameters, the checkpoint movement can be automatically initiated by JES2 (OPVERIFY=NO), or the operator can be prompted to enter the checkpoint reconfiguration dialog (OPVERIFY=YES). As an example, consider the checkpoint definition shown in Figure 42 on page 200.


CKPTDEF  CKPT1=(STRNAME=JES2CKPT_1,INUSE=YES,VOLATILE=NO),
         CKPT2=(DSNAME=SYS1.JES2.CKPT1,
                VOLSER=TOTSM1,INUSE=YES,VOLATILE=NO),
         NEWCKPT1=(STRNAME=JES2CKPT_2),
         NEWCKPT2=(DSNAME=SYS1.JES2.CKPT2,VOLSER=TOTPD0),
         MODE=DUPLEX,DUPLEX=ON,LOGSIZE=1,APPLCOPY=NONE,
         VERSIONS=(STATUS=ACTIVE,NUMBER=50,WARN=80,MAXFAIL=0,
                   NUMFAIL=0,VERSFREE=50,MAXUSED=2),RECONFIG=NO,
         VOLATILE=(ONECKPT=DIALOG,ALLCKPT=DIALOG),OPVERIFY=NO

Figure 42. Sample Checkpoint Definition

With this definition, on a connectivity failure to JES2CKPT_1 the checkpoint is forwarded to JES2CKPT_2 by JES2. Note that the JES2 checkpoint structures are allocated as soon as they are assigned as a checkpoint, and they remain allocated (disposition=KEEP) even if the checkpoints are forwarded to DASD.


Connectivity is lost to the checkpoint structure:

IXC518I SYSTEM SC47 NOT USING 483
  COUPLING FACILITY 009672.IBM.02.000000040104
  PARTITION: 1 CPCID: 01
  NAMED CF02
  REASON: CONNECTIVITY LOST.
  REASON FLAG: 13300002.

Because of OPVERIFY=NO, the checkpoint reconfiguration is automatically initiated:

$HASP285 JES2 CHECKPOINT RECONFIGURATION STARTING
$HASP290 MEMBER SC47 -- JES2 CKPT1 IXLLIST LOCK REQUEST FAILURE 490
  *** CHECKPOINT DATA SET NOT DAMAGED BY THIS MEMBER ***
  RETURN CODE = 0000000C
  REASON CODE = 0C080C06
  RECORD = UNKNOWN
*$HASP275 MEMBER SC47 -- JES2 CKPT1 DATA SET - I/O ERROR - REASON CODE 491
  CF2
$HASP233 REASON FOR JES2 CHECKPOINT RECONFIGURATION IS CKPT1 I/O 492
  ERROR(S) ON 1 MEMBER(S)
$HASP285 JES2 CHECKPOINT RECONFIGURATION STARTED - DRIVEN BY 493
  MEMBER SC50
$HASP280 JES2 CKPT1 DATA SET (STRNAME JES2CKPT_2) IS NOW IN USE

JES2 informs the operator that there is no longer a NEWCKPT1 defined, since the checkpoint has just been forwarded to the previously defined NEWCKPT1:

$HASP256 FUTURE AUTOMATIC FORWARDING OF CKPT1 IS SUSPENDED UNTIL 378
  NEWCKPT1 IS RESPECIFIED.
  ISSUE $T CKPTDEF,NEWCKPT1=(...) TO RESPECIFY

$DCKPTDEF
$HASP829 CKPTDEF 547
$HASP829 CKPTDEF CKPT1=(STRNAME=JES2CKPT_2,INUSE=YES,
$HASP829         VOLATILE=NO),
$HASP829         CKPT2=(DSNAME=SYS1.JES2.CKPT1,
$HASP829         VOLSER=TOTSM1,INUSE=YES,VOLATILE=NO),
$HASP829         NEWCKPT1=(DSNAME=,VOLSER=),
$HASP829         NEWCKPT2=(DSNAME=SYS1.JES2.CKPT2,
$HASP829         VOLSER=TOTPD0),MODE=DUPLEX,DUPLEX=ON,
$HASP829         LOGSIZE=1,APPLCOPY=NONE,
$HASP829         VERSIONS=(STATUS=ACTIVE,NUMBER=50,
$HASP829         WARN=80,MAXFAIL=0,NUMFAIL=0,
$HASP829         VERSFREE=50,MAXUSED=0),RECONFIG=NO,
$HASP829         VOLATILE=(ONECKPT=DIALOG,
$HASP829         ALLCKPT=DIALOG),OPVERIFY=NO

If the OPVERIFY parameter had been coded OPVERIFY=YES, then the operator would be prompted to make the decision:


Connectivity is lost to the checkpoint structure:

IXC518I SYSTEM SC47 NOT USING 602
  COUPLING FACILITY 009672.IBM.02.000000040104
  PARTITION: 1 CPCID: 00
  NAMED CF01
  REASON: CONNECTIVITY LOST.
  REASON FLAG: 13300001.

JES2 initiates checkpoint reconfiguration, but prompts the operator to continue:

$HASP285 JES2 CHECKPOINT RECONFIGURATION STARTING
*$HASP275 MEMBER SC47 -- JES2 CKPT1 DATA SET - I/O ERROR - REASON CODE 609
  CF2
$HASP290 MEMBER SC47 -- JES2 CKPT1 IXLLIST LOCK REQUEST FAILURE 610
  *** CHECKPOINT DATA SET NOT DAMAGED BY THIS MEMBER ***
  RETURN CODE = 0000000C
  REASON CODE = 0C080C06
  RECORD = UNKNOWN
$HASP285 JES2 CHECKPOINT RECONFIGURATION STARTING
$HASP233 REASON FOR JES2 CHECKPOINT RECONFIGURATION IS CKPT1 I/O 611
  ERROR(S) ON 1 MEMBER(S)
$HASP285 JES2 CHECKPOINT RECONFIGURATION STARTED - DRIVEN BY 612
  MEMBER SC42
*$HASP273 JES2 CKPT1 DATA SET WILL BE ASSIGNED TO NEWCKPT1 STRNAME 270
  JES2CKPT_1
  VALID RESPONSES ARE:
  CONT                    - PROCEED WITH ASSIGNMENT
  TERM                    - TERMINATE MEMBERS WITH I/O ERROR ON CKPT1
  DELETE                  - DISCONTINUE USING CKPT1
  CKPTDEF (NO OPERANDS)   - DISPLAY MODIFIABLE SPECIFICATIONS
  CKPTDEF (WITH OPERANDS) - ALTER MODIFIABLE SPECIFICATIONS
*161 $HASP272 ENTER RESPONSE
R 161,CONT
IEE600I REPLY TO 161 IS;CONT
$HASP280 JES2 CKPT1 DATA SET (STRNAME JES2CKPT_1) IS NOW IN USE
*$HASP256 FUTURE AUTOMATIC FORWARDING OF CKPT1 IS SUSPENDED UNTIL 275
  NEWCKPT1 IS RESPECIFIED.
  ISSUE $T CKPTDEF,NEWCKPT1=(...) TO RESPECIFY
$HASP255 JES2 CHECKPOINT RECONFIGURATION COMPLETE

9.9.2 Structure Failure in a Checkpoint Structure


A structure failure in a JES2 Checkpoint is treated as a connectivity failure. All the examples shown in 9.9.1, Connectivity Failure to a Checkpoint Structure on page 199 apply.


9.9.3 The Coupling Facility becomes Volatile


The CKPTDEF in the JES2 initialization deck states what action JES2 must take if the checkpoint structure becomes volatile. The action specified is one of the following:

- JES2 issues a message to the operator to suspend or confirm the use of the structure as a checkpoint data set (VOLATILE=(ONECKPT=WTOR)).
- JES2 automatically enters the checkpoint reconfiguration dialog (VOLATILE=(ONECKPT=DIALOG)).
- JES2 ignores the volatility state of the structure (VOLATILE=(ONECKPT=IGNORE)).

9.9.4 To Manually Move a JES2 Checkpoint


The operator has to enter the JES2 checkpoint reconfiguration dialog. Refer to 2.8.1, JES2 Checkpoint Reconfiguration on page 39 for a detailed explanation of the JES2 checkpoint reconfiguration process.
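As an illustration (verify the syntax against the JES2 commands reference for your level), an operator-initiated checkpoint reconfiguration dialog can be started with the JES2 command:

$T CKPTDEF,RECONFIG=YES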

9.10 System Logger Recovery from a Coupling Facility Failure


The system logger is a service provider for:

- OPERLOG
- LOGREC

All recovery decisions resulting from a coupling facility failure affecting a logstream structure will be made by the system logger itself or the system operator. The system logger exploiters will either suspend their processing or switch to an alternate logging solution, if any, while the logstream recovery is in process.

9.10.1 System Logger Built-In Recovery from a Connectivity Failure


9.10.1.1 No Active SFM Policy, or an Active Policy with CONNFAIL(NO)
System logger initiates the rebuild of the affected structure, for any instance of the logger that loses connectivity to the structure.

9.10.1.2 Active SFM Policy with CONNFAIL(YES)


The system logger will follow the MVS recommendation. It will initiate a dynamic rebuild as per the active SFM policy WEIGHTs and CFRM policy REBUILDPERCENT.

9.10.2 System Logger Built-In Recovery from a Structure Failure


The system logger always initiates a structure rebuild upon a structure failure.

9.10.3 Coupling Facility Becoming Volatile


The system logger always rebuilds the logstream structures when it detects that the coupling facility became volatile. Note that if the only coupling facility available for rebuild is itself volatile, then the logger will carry on with the rebuild.


9.10.4 Manual Invocation of Structure Rebuild


The system logger supports manual rebuild of any logstream structure.
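For example, using the structure name from the earlier OPERLOG example:

SETXCF START,REBUILD,STRNAME=SYSTEM_OPERLOG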

9.10.5 Manual Deallocation of Logstreams Structure


In order to have the system logger disconnect from a logstream structure, all exploiters of all logstreams using the structure must disconnect from the system logger. This is achieved by the following:

For OPERLOG: by deactivating the OPERLOG logstream:

VARY OPERLOG,HARDCPY,OFF
This is issued at each occurrence of OPERLOG.

For LOGREC: by deactivating the LOGREC logstream:

SETLOGRC DATASET
This is issued at each occurrence of LOGREC. This works only if a LOGREC data set name was specified in IEASYSxx at the last IPL. The recommendation is to have a LOGREC data set defined in IEASYSxx.

For other exploiters of the system logger, if any, refer to the specific operating procedures for that product.

9.11 Automatic Tape Switching Recovery from a Coupling Facility Failure


The following describes how the system reacts to a coupling facility IEFAUTOS structure failure.

9.11.1 Automatic Tape Switching Recovery from a Connectivity Failure


Depending on the SFM set up, we can expect the following situations:

9.11.1.1 No Active SFM Policy, or an Active Policy with CONNFAIL(NO)


There is always an attempt to rebuild the IEFAUTOS structure.

9.11.1.2 Active SFM Policy with CONNFAIL(YES)


The IEFAUTOS structure is rebuilt as per the active SFM policy WEIGHTs and CFRM policy REBUILDPERCENT.

9.11.2 Automatic Tape Switching Built-In Recovery from a Structure Failure


The IEFAUTOS structure is always rebuilt on a structure failure.

9.11.3 Coupling Facility Becoming Volatile


Allocation does not rebuild the IEFAUTOS structure when the coupling facility becomes volatile.

9.11.4 Manual Invocation of Structure Rebuild


The IEFAUTOS structure can be manually rebuilt.
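For example, since the structure is named IEFAUTOS, an operator-initiated rebuild can be requested with:

SETXCF START,REBUILD,STRNAME=IEFAUTOS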


9.11.5 Consequences of Failing to Rebuild the IEFAUTOS Structure


If the rebuild fails:

- The tape devices which are not allocated are taken OFFLINE. The operator can vary them back online, but they will be dedicated to the single system they are online to.
- The tape devices which are allocated are kept online, but become dedicated to the systems that had them allocated.

If another instance of IEFAUTOS can eventually be created, with the proper connectivity, reconnection to the structure is automatically performed and the tape devices will again be sharable.

9.11.6 Manual Deallocation of IEFAUTOS Structure


There is no method to manually deallocate an IEFAUTOS structure.

9.12 VSAM RLS Recovery from a Coupling Facility Failure


To achieve VSAM Record Level Sharing (RLS), DFSMS supports a new subsystem, SMSVSAM, which runs in its own address space and acts as a server to provide RLS services. The SMSVSAM server uses the coupling facility for multiple cache structures (a cache set) and a single central lock structure. Like other service providers using the coupling facility, SMSVSAM handles the recovery from coupling facility failures on its own, without involvement of the RLS exploiters. The RLS exploiters' processing may be temporarily or permanently affected by the results of these recovery actions.

9.12.1 SMSVSAM Built-In Recovery from a Connectivity Failure


This section discusses SMSVSAM behavior to resolve coupling facility connectivity problems.

9.12.1.1 No Active SFM Policy, or an Active Policy with CONNFAIL(NO)


SMSVSAM will always attempt to rebuild both cache and lock structures.

9.12.1.2 Active SFM Policy with CONNFAIL(YES)


The MVS recommendation to rebuild or disconnect will be followed. SMSVSAM will abend if it cannot reconnect to the structure. This means that the other systems still having connectivity would create retained locks. Once connectivity is reestablished and SMSVSAM restarts, it will build its own instances of the retained locks and reconnect. CICS then automatically performs its recovery and, if successful, the VSAM files are then available to transactions.

9.12.2 SMSVSAM Built-In Recovery from a Structure Failure


This section discusses SMSVSAM behavior to resolve coupling facility structure problems.


9.12.2.1 SMSVSAM Cache Structure Failure


SMSVSAM attempts to rebuild the structure. If the rebuild fails, VSAM will switch all the files that were using the failed structure to use another cache structure in the cache set.

9.12.2.2 SMSVSAM Lock Structure Failure


SMSVSAM always attempts to rebuild the structure. If for some reason the rebuild fails, the locks are lost and the RLS data sets become unavailable to the applications. In-flight transactions that attempt access to RLS are abended. This produces a condition that RLS calls "lost locks". Before any recovery can be attempted, the condition that prevented the rebuild must be resolved. The lost locks recovery consists of backout of in-flight transactions and resolution of any indoubt transactions. CICS cannot execute new transactions that access RLS files until all of the CICS regions have completed the lost locks recovery.

9.12.3 Coupling Facility Becoming Volatile


SMSVSAM does not rebuild the cache or lock structures when the coupling facility becomes volatile.

9.12.4 Manual Invocation of Structure Rebuild


The SMSVSAM server will support the rebuild of both the lock and cache structures via the SETXCF START,REBUILD command.
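For example, using the IGWLOCK00 lock structure name given later in this section and a placeholder for a cache structure name:

SETXCF START,REBUILD,STRNAME=IGWLOCK00
SETXCF START,REBUILD,STRNAME=cache_structure_name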

9.12.5 Manual Deallocation of SMSVSAM Structures


There is no capability to manually deallocate the IGWLOCK00 structure, because this action will be disruptive to the data sharing environment. However, it is possible to manage the cache structure through the following command:

VARY SMS,CFCACHE(cache_name),ENABLE    (abbreviation: E)
VARY SMS,CFCACHE(cache_name),QUIESCE   (abbreviation: Q)

9.13 Couple Data Set Failure


A number of different couple data sets are required in the parallel sysplex. These contain status and policy information, and some are essential to the availability of the sysplex. The types of couple data sets and recommendations for their placement are described in 2.7, Couple Data Sets on page 35. This section discusses recovery procedures when the access to a couple data set is lost, or the volume containing the couple data set fails.

9.13.1 Sysplex (XCF) Couple Data Set Failure


When XCF loses access to the primary sysplex couple data set and an active alternate couple data set is available, it automatically switches to the alternate. This switch is transparent to the XCF group members. The old alternate becomes the new primary and the old primary is no longer allocated as a couple data set. At this time, XCF has only one sysplex couple data set available, which has now become a single point of failure. A new alternate sysplex couple data set should be made available as soon as possible.
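For example (the data set name and volume serial are placeholders), a new alternate sysplex couple data set can be brought into use with:

SETXCF COUPLE,TYPE=SYSPLEX,ACOUPLE=(SYS1.XCF.CDS03,CDSVOL)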


Note: If XCF loses access to all sysplex couple data sets, the system enters a nonrestartable wait state. If the parallel sysplex is running with only one sysplex couple data set, and that volume fails, then all systems in the sysplex enter nonrestartable wait states. Consideration should be given to automating the activation of the spare couple data set, triggered by the issuing of the message indicating there is no alternate couple data set:

IXC267I PROCESSING WITHOUT AN ALTERNATE COUPLE DATA SET FOR typename ISSUE SETXCF COMMAND TO ACTIVATE A NEW ALTERNATE

9.13.2 Coupling Facility Resource Manager (CFRM) Couple Data Set Failure
When a system loses access to the CFRM couple data set, the system enters a nonrestartable wait state. Loss of access to the CFRM policy implies that XES cannot ensure that connectors on this system are in a consistent state.

9.13.3 Sysplex Failure Management (SFM) Couple Data Set Failure


All systems require connectivity to the SFM couple data set in order to be able to make use of SFM policy:

- Any system in the sysplex at a pre-SP5.1.0 level will disable SFM for the entire sysplex.
- Any system in the sysplex without connectivity to the SFM couple data set will disable SFM for the entire sysplex.
- Not having a started SFM policy will disable SFM for the entire sysplex.

Loss of access to SFM policy by any system in the sysplex results in the inactivation of SFM policy. The sysplex reverts to pre-SP5.1.0 mechanisms for failure processing.

9.13.4 Workload Manager (WLM) Couple Data Set Failure


If access to the primary couple data set is lost and an active alternate couple data set is available, XCF automatically switches to the alternate. This switch is transparent to the workload manager. If access to the only couple data set is lost, the workload manager continues to run, but in independent mode. This means that the workload manager only operates on local data, and does not transmit data to other members of the sysplex. The following message may be issued:

IWM012E POLICY ACTIVATION FAILED, WLM COUPLE DATA SET NOT AVAILABLE

9.13.5 Automatic Restart Manager (ARM) Couple Data Set Failure


If access to the primary couple data set is lost and an active alternate couple data set is available, XCF automatically switches to the alternate. This switch is transparent to ARM. The following sequence of messages indicating the switch may be issued:


IXC253I PRIMARY COUPLE DATA SET xxxxxx FOR ARM
  IS BEING REMOVED BECAUSE OF AN I/O ERROR DETECTED BY SYSTEM sysname
IXC263I REMOVAL OF THE PRIMARY COUPLE DATA SET xxxxxx FOR ARM IS COMPLETE
*IXC267I PROCESSING WITHOUT AN ALTERNATE COUPLE DATA SET FOR ARM
  ISSUE SETXCF COMMAND TO ACTIVATE A NEW ALTERNATE.

If access to the only ARM couple data set is lost, Automatic Restart Manager services are not available until a primary couple data set is made active. Automatic Restart Manager does not cause the system to enter a wait state when both of its couple data sets are lost. In this case, the elements registered with Automatic Restart Manager will be deregistered. The following messages may be issued:

IXC808I ELEMENTS FROM TERMINATED SYSTEM sysname WERE NOT PROCESSED BY THIS SYSTEM. ARM COUPLE DATA SET IS NOT AVAILABLE TO THIS SYSTEM

In this case, the system that issued this message does not have access to the ARM couple data set. Therefore, it cannot initiate restarts of automatic restart manager elements, if any, from the failed system. The other remaining systems in the sysplex can restart elements from the failed system.

IXC809I ELEMENTS REGISTERED ON SYSTEM sysname WERE DEREGISTERED DUE TO LOSS OF ACCESS TO THE ARM COUPLE DATA SET

The system issuing this message has lost access to the ARM couple data set. All elements running on this system will be deregistered by other systems in the sysplex that have access to the ARM couple data set. The deregistered programs will continue to run. Once an ARM couple data set becomes available again, the elements cannot be reregistered without ending and restarting their jobs or started tasks.

9.13.6 System Logger (LOGR) Couple Data Set Failure


When the logger loses connectivity to its inventory data set (both primary and secondary), the logger address space on the system that lost connectivity terminates itself by issuing CALLRTM TYPE=MEMTERM. If there are peer connections to the CF structures in use by the instance of the logger that terminated itself, then other instances of the logger in the sysplex are informed of the failure. They will initiate a flush of the log streams that map to the structures in use by the logger that lost inventory connectivity. Updating of the inventory as necessary will be performed by peer instances provided they have connectivity. If there are no peer connections to the structure(s) in use by the failed system, recovery is initiated upon connection to the first log stream that maps to the structure.


Table 12. Summary of Couple Data Sets. The table summarizes which couple data sets are essential for parallel sysplex availability.

Sysplex (XCF) primary, active alternate available:
  Switch to alternate. XCF automatically switches to the available alternate, and issues message:
  IXC267I PROCESSING WITHOUT AN ALTERNATE COUPLE DATA SET FOR typename.
          ISSUE SETXCF COMMAND TO ACTIVATE A NEW ALTERNATE.

Sysplex (XCF) primary, no active alternate available:
  WAIT0A2-10. XCF enters a nonrestartable wait state when it loses access to all couple data sets.

Sysplex (XCF) alternate:
  XCF issues message:
  IXC267I PROCESSING WITHOUT AN ALTERNATE COUPLE DATA SET FOR typename.
          ISSUE SETXCF COMMAND TO ACTIVATE A NEW ALTERNATE.

CFRM primary, no active alternate available:
  System enters a nonrestartable wait state.

SFM, no active alternate available:
  Sysplex Failure Management is disabled for the entire sysplex. The sysplex reverts to pre-SP5.1.0 mechanisms for failure processing.

WLM, no active alternate available:
  WLM switches into independent state and uses only local data. No data is transmitted to other members of the sysplex.

ARM, no active alternate available:
  Processing continues, but ARM services are no longer available until a new primary couple data set is made active.

System logger (LOGR), no active alternate available:
  Logger terminates.

9.14 Sysplex Timer Failures


If a system loses access to the sysplex timer, it enters a nonrestartable disabled wait state. If a sysplex timer fails but the expanded availability feature is installed, the sysplex continues operating without disruption, using the second ETR as the timer synchronization source. If the last sysplex timer in the configuration fails, all systems in the sysplex are placed in nonrestartable disabled wait states. In this situation you cannot restart your sysplex until the sysplex timer failure has been recovered. If you want the capability to restart at least one system of the originating complex, you should plan to use the PLEXCFG=ANY parameter in IEASYSxx. This allows you to IPL a single system outside of the sysplex, using the machine TOD instead of the ETR. When the sysplex timer is repaired, this system will synchronize itself with the ETR. The other systems will then be allowed to IPL into the sysplex.

9.15 Restarting IMS


IMS subsystem failures fall into the following categories:

The IMS or IRLM fails within a system in the sysplex.
The CEC or MVS image fails.
A coupling facility fails.

The way in which support is provided in IMS V5 to handle coupling facility failures is discussed in 9.8.1.1, No Active SFM Policy, or an Active Policy with CONNFAIL(NO) on page 197.

9.15.1 IMS/IRLM Failures Within a System


Operating procedures for IMS within an image in a sysplex differ little from those in a normal environment. The data sharing support provided with the IMS and IRLM products ensures data integrity and isolation of the failing element. As a result, the restarting and recovery of individual failed elements is relatively straightforward and differs little from business as usual. Full details of restarting failed IMS subsystems and IRLMs are provided in IMS/ESA V5 Operations Guide , SC26-8029. The key thing is to restart the failed element as quickly as possible, not to continue processing additional work, but to release the retained locks. This minimizes the effect on other elements in the parallel sysplex and ensures that any disruption experienced by the end users is kept to a minimum. Once the retained resources are freed, the failed element can be shut down in an orderly manner and the cause of the failure can be investigated.

9.15.2 CEC or MVS Failure


There are two alternatives when considering how to restart IMS subsystems in the event of a CEC or MVS failure:

1. Restart the failing IMS and its associated IRLM on another system in the sysplex.
2. Restart just the failing IMS on another system and link it to a surviving IRLM.

Given that the purpose of restarting the failed IMS is to release any locked resources, the key thing to consider is speed of recovery. If you restart both IMS and IRLM on another system, the IRLM has to be started and ready before the IMS can be started, adding extra time. If just the IMS is restarted and linked to an existing IRLM, the restart time is shorter and the retained resources are freed much earlier. As the aim is to minimize the effect of any failure on the end user, the second option is the recommended course of action, as sketched below. The operational procedures are again little different from a non-sysplex environment.
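
As an illustration of the second option, the restarted IMS control region is pointed at the surviving IRLM through its execution parameters; the IRLM subsystem name IRL2 used here is purely hypothetical:

IRLM=Y,IRLMNM=IRL2

These parameters tell IMS to use IRLM locking and to connect to the IRLM whose subsystem name is IRL2, that is, the IRLM already active on the system where IMS is restarted.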


9.15.3 Automating Recovery


IMS/ESA V5.1 provides support for the MVS Automatic Restart Manager. ARM can be used to restart IMS in either of the above scenarios. Using ARM speeds up failure recognition and the initiation of restart actions. This removes the requirement for human intervention and avoids unnecessary error or delay. For a description of the Automatic Restart Manager and its subsystem support, see 2.17, ARM: MVS Automatic Restart Manager on page 79. For information on implementing ARM, see MVS/ESA SP V5 Setting up a Sysplex , GC28-1449. Once the failed IMSs are up and all retained locks are freed, the process is to bring the subsystem back down in an orderly fashion. To further enhance the automation of the recovery, this could be achieved by using AOC or a similar automation product or routine. Doing so would remove the need for human intervention entirely from the recovery of the failed IMSs and enable the operators to concentrate on determining the cause of the system failure. With the potential complexity of parallel sysplex setups, it is essential that as much restart and recovery activity as possible be automated, removing the chance of human error or delay.
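
As an illustration only, an ARM policy is defined with the XCF administrative data utility and then activated with a SETXCF command. The policy, restart group, and element names below are hypothetical; the element names under which IMS and IRLM register are generated by the products themselves and should be taken from their documentation:

//DEFARM   EXEC PGM=IXCMIAPU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DATA TYPE(ARM)
  DEFINE POLICY NAME(ARMPOL01)
    RESTART_GROUP(IMSGRP1)
      TARGET_SYSTEM(*)
      ELEMENT(imselementname)
        RESTART_ATTEMPTS(3)
/*

The policy is then activated with the command SETXCF START,POLICY,TYPE=ARM,POLNAME=ARMPOL01.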

9.16 Restarting DB2


When a DB2 subsystem fails or the MVS image fails, the surviving DB2 members of the data sharing group convert the modify locks held by the failed DB2 to retained locks. These locks are held until the failed DB2 is restarted in the parallel sysplex. During the time that DB2 is down, all transactions that request locks that contend with the retained locks are given a resource not available reason code. It is recommended that the Automatic Restart Manager (ARM) be used to quickly restart the failed DB2. If the failed DB2 cannot be restarted, some operator actions will be required. Recovery from this point forward will depend upon how severely the retained locks are affecting transactions on the other DB2s. The DISPLAY DATABASE(xxxx) LOCKS command can be used to help in determining recovery actions, as shown below. For more detailed information on specific recovery actions, refer to the DB2 V4 Data Sharing Planning and Administration manual.
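
For example (the command prefix -DB1G and the database name are illustrative only), the following form of the command displays the locks, including retained locks, held on the page sets of a database:

-DB1G DISPLAY DATABASE(DSNDB04) SPACENAM(*) LOCKS LIMIT(*)

Page sets reported with retained locks identify the work that will continue to receive the resource not available reason code until the failed member is restarted.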

9.17 Restarting CICS


CICS recovery depends on the type of failure and where it occurs. Recovery actions are different for a TOR than for an AOR. The CICS subsystem may also require some additional recovery actions depending on the type of data: DB2, DL/1, or VSAM.

9.17.1 CICS TOR Failure


Before the introduction of VTAM persistent session support, if the TOR failed, all the current sessions were broken and users had to rebind through another TOR. With persistent session support, VTAM is able to provide restart-in-place of a failed CICS without rebinding. This support is valid for all LU-LU sessions except LU0 and LU6.1 sessions.


Therefore, if a failed CICS is restarted within a predefined time interval, it can use the retained sessions immediately and there is no need for network flows to rebind them. The CICS sessions are held by VTAM in the recovery pending state and may be recovered during the emergency restart of CICS. There are some instances where it is not possible to reestablish a pre-existing session, such as:

Performing a COLD start after the CICS failure
The toleration interval has expired (see the note following this list)
VTAM, MVS, or CPC failure
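
The toleration interval is set through a CICS system initialization parameter. As a sketch only: PSDINT is the persistent session delay interval, but the value shown and its format should be verified against the CICS system definition documentation for your release:

PSDINT=300

If the failed CICS is not restarted within this interval, VTAM releases the retained sessions and users must rebind in the normal way.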

9.17.2 CICS AOR Failure


If a CICS region fails, you restart CICS with an emergency restart to back out any transactions that were in-flight at the time of failure. If the failed CICS region was running with VSAM RLS, SMSVSAM converts any active exclusive locks held by the failed system into retained locks, pending the CICS restart. This means that the records are protected from being updated by any other CICS region in the sysplex. Retained locks also ensure that other regions trying to access the protected records do not wait on the locks until the failed region restarts. CICS emergency restart performs CICS RLS restart processing, during which orphan locks are eliminated. An orphan lock is one that is held by SMSVSAM but unknown to any CICS region. Orphan locks can occur if a CICS region acquires an RLS lock but then fails before logging it. Records associated with orphan locks cannot have been updated; therefore CICS releases any orphan locks that it finds during its RLS restart. As soon as the retained locks condition has been cleared, the CICS region can be stopped if it is not required from an installation or performance point of view. It is recommended that you use ARM to quickly restart the failed AOR.

9.18 Recovering Logs


The following sections describe failure scenarios and the actions taken by the MVS system logger.

9.18.1 Recovering an Application Failure


When an application fails while still having active connections to one or more logstreams, the following processing occurs:

The application failure is recognized by either the end-of-task (EOT) or end-of-memory (EOM) resource manager that was established by the MVS system logger on behalf of the connector.
The system logger ensures that all logstream data written to a logstream to which a connection still exists is flushed from the coupling facility and written to DASD.
After all data is flushed to DASD, the system logger automatically disconnects the application from any logstreams to which the application is still connected.


9.18.2 Recovering an MVS Failure


When an application is connected to one or more logstreams, the MVS image on which it executes might fail. Depending on the situation, recovery proceeds as follows:

Multisystem sysplex when other logstream connections exist: Other instances of the MVS system logger in the sysplex are notified of the failure. The surviving instances of the MVS system logger coordinate among themselves to migrate logstream data that was not yet written to DASD by the failed system.

Multisystem sysplex when no other logstream connections exist: Data still resident in the coupling facility continues to exist in the coupling facility. When another instance of the application connects to the logstream, it has access to the coupling facility data.

9.18.3 Recovering from a Sysplex Failure


When all systems in a sysplex fail, there are no surviving systems to participate in the recovery. Data in the coupling facility continues to exist in the coupling facility. After the sysplex is re-IPLed and instances of the application connect to the logstream, they may have access to the coupling facility data. The data written by instances of the application exists in the coupling facility and, if staging has been defined, in the staging data sets.

9.18.4 Recovering from System Logger Address Space Failure


While an application is connected to a logstream, the supporting instance of the MVS system logger might fail independently of the exploiting application. When the MVS system logger address space fails, connections to logstreams are automatically disconnected by the system logger. All requests to connect are rejected. When the recovery processing completes, the system logger is restarted and an ENF 48 is broadcast. On receipt of the ENF, applications may connect to logstreams and resume processing.

9.18.5 Recovering OPERLOG Failure


If the OPERLOG is the hardcopy medium and the OPERLOG fails on one system, the system attempts to switch hardcopy to the SYSLOG. The system on which the OPERLOG fails writes messages about the failure to SYSLOG. Any systems in the sysplex that are not affected by the failure continue to write to the operations log. If the SYSLOG is also inactive, or if the use of the SYSLOG has been prevented by the WRITELOG CLOSE command, the system attempts to switch hardcopy to an appropriate printer console. For further information, see the section on log switching and JES2 restart in MVS/ESA SP V5 Initialization and Tuning Reference , SC28-1452.

9.19 Restarting an OPC/ESA Controller


If you have a standby OPC/ESA controller, it can automatically take over the functions of the active controller if the controller fails, or if the MVS/ESA system on which it was active fails. The standby controller is started on the backup system but is not activated unless a failure occurs, or unless it is directed to take over by an MVS/ESA operator modify command. The activation is done by the standby controller itself on a signal from XCF; ARM is not involved. If you do not have a standby controller, you will have to start one in order to continue running batch production.

9.20 Recovering Batch Jobs under OPC/ESA Control


You should normally use the OPC/ESA functions to restart the batch work on a failing system and not ARM. You should never define a job as restartable to both OPC/ESA and ARM.

9.20.1 Status of Jobs on Failing CPU


The status OPC/ESA assigns to operations that are running when the system fails is decided by the WSFAILURE and WSOFFLINE keywords of the JTOPTS statement, as in the example below. You should mark jobs running on a failing processor as ended-in-error. They are then handled in the same way as other operations that have ended in error. If a RECOVER statement has been defined to cover the situation, as determined by the job code and step code, it is invoked automatically in the same way as for other failures.
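
A sketch of the corresponding initialization statement follows; the keyword values shown are assumptions to be checked against the OPC/ESA customization documentation for your release:

JTOPTS WSFAILURE(ERROR) WSOFFLINE(ERROR)

With these values, operations that were running on the failed system are set to ended-in-error status, so that any RECOVER statements defined for them can take effect.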

9.20.2 Recovery of Jobs on a Failing CPU


OPC/ESA supports automatic recovery by having its own job statements that take effect when a job fails. These job statements look like comments to MVS and JES. For jobs tracked on an MVS system, OPC/ESA also notices when the catalog has been updated by a job and is able to undo the catalog updates (step by step, if required) to the point before the job ran, for all data sets allocated with JCL DD statements. This facility is called Catalog Management. For example, when a job creates a data set, a rerun often fails because the data set already exists. With Catalog Management active for the job, OPC/ESA uncatalogs and deletes the data set before resubmitting the job. There are some restrictions to this automatic handling of the catalog if the job uses GDGs. OPC/ESA can fix GDGs for job-level restarts, and can fix step-level restarts for non-GDG jobs if RECOVER is specified. Step-level restart for jobs with GDGs when RECOVER is not specified requires that the job be restarted manually using an OPC/ESA dialog.


Chapter 10. Disaster Recovery Considerations


This chapter contains a discussion of disaster recovery considerations specific to the parallel sysplex environment. One consideration is the supporting hardware at the disaster recovery site, and whether a parallel sysplex is required. This may be necessary to provide capacity or function. It may also be desirable from the point of view of providing consistency to operations and end-users. This is not an exhaustive description of disaster recovery planning. There is more information in Disaster Recovery Library: Planning Guide , GG24-4210. Further information on disaster recovery considerations for a CICS data-sharing environment can be found in Planning for CICS Continuous Availability in an MVS/ESA Environment , SG26-4593.

10.1 Disasters and Distance


Depending on the kind of disasters that you are trying to guard against, the distance between the primary and disaster recovery sites may need to be greater or smaller. Major disasters such as earthquakes, hurricanes, floods, and wars normally affect a wide area, so in order to have a backup against these eventualities the backup site needs to be very far away. Some parts of the world are more prone to major disasters than others, so in these places it may be necessary to have a backup plan that takes them into account and to locate the recovery site remotely. There is another class of disasters, which includes fires, local flooding, building damage, terrorist bombings, and others, which are quite local in scope but can be equally disastrous for a business if they strike the data center. These also need to be guarded against by having a backup site, but for these contingencies the backup site does not need to be very remote. If a risk analysis shows that it is only the latter kind of disaster that it is realistic to protect against, then other considerations such as cost and convenience may dictate that the sites be reasonably close to each other. For instance, it may then be possible for one staff to operate both sites. In some countries this is a common practice.

10.2 Disaster Recovery Sites


This section describes the requirements for configuring a remote disaster recovery site for the various data-sharing subsystems.

10.2.1 3990 Remote Copy


The 3990 Model 6 provides two options for maintaining remote copies of data, both of which address the problem of out-of-date data that occurs between the last safe backup and the time of failure:

Peer-to-Peer Remote Copy (PPRC)


Peer-to-Peer Remote Copy provides a mechanism for synchronous copying of data to the remote site, which means that no data is lost between the time of the last backup at the application system and the time of the recovery at the remote site. The impact on performance must be evaluated, since an application write to the primary subsystem is not considered complete until the data has also been transferred to the remote subsystem. Figure 43 on page 217 shows a sample Peer-to-Peer Remote Copy configuration. The Peer-to-Peer Remote Copy implementation requires ESCON links between the primary site 3990 and the remote (recovery) site 3990.

Extended Remote Copy (XRC)
Extended Remote Copy provides a mechanism for asynchronous copying of data to the remote site; only data that is in transit between the failed application system and the recovery site is lost. Note that, in general, the delay in transmitting the data from the primary subsystem to the recovery subsystem is measured in seconds. Figure 44 on page 218 shows a sample Extended Remote Copy configuration. The Extended Remote Copy implementation involves the transfer of data between the primary subsystem and the recovery subsystem under the control of a DFSMS/MVS host system, which can exist at the primary site, at the recovery site, or anywhere in between.

The 3990 Remote Copy solutions are data-independent; that is, beyond the performance considerations, there is no restriction on the data that can be mirrored at a remote site using these solutions.

10.2.2 IMS Remote Site Recovery


IMS/ESA Version 5.1 introduces the Remote Site Recovery (RSR) feature for IMS DB and TM resources, such as logs and databases. The implementation of this feature enables installations to resume IMS service at a remote site in the event of an extended outage at the primary site, with minimal or no data loss. Note that other solutions, such as 3990 Remote Copy, need to be considered for other required resources. The IMS system at the primary site must be using DBRC, and so it has RECON data sets, logs, and databases in use. At the remote site, another IMS is running which must also have RECON data sets, but this IMS cannot be used as a DB/DC system; it is used solely for tracking the primary system. The remote site may be connected to the active site via an ESCON CTC if the remote site is within ESCON distance, or via a wide area network if longer distances are required. Figure 45 on page 219 shows a sample Remote Site Recovery configuration. RSR offers two levels of support, which can be selected on an individual database basis:

Database Level Tracking With this level of support, the database is shadowed at the remote site, thus eliminating the need to recover the databases in the event of a primary site outage.


Figure 43. 3990-6 Peer-to-Peer Remote Copy Configuration

Recovery Level Tracking With this level of support, the databases are not shadowed. The logs are transmitted electronically to the remote site, and the databases must be recovered as part of the disaster recovery process.

RSR supports the recovery of IMS full function databases, Fast Path DEDBs, the IMS message queues and the telecommunications network. For more information on the IMS Remote Site Recovery feature, refer to the following sources:

MKTTOOLS packages

IMS5RSR IMS/ESA Remote Site Recovery Overview IMSRSR IMS Remote Site Recovery

10.2.3 CICS Recovery with CICSPlex SM


CICSPlex SM can manage CICS systems across any configuration of MVS systems. There is no requirement for a sysplex and no dependency on VTAM generic resources. A possible disaster recovery configuration which would provide high availability for CICS users might consist of a TOR and several AORs. The CICSPlex SM workload manager would contain a transaction routing policy which would route all transactions to the AOR at the primary site as long as the AOR at that site was available. If the AOR were down, CICSPlex SM operators would install an alternate workload policy which would route transactions to the AOR at the secondary site.


Figure 44. 3990-6 Extended Remote Copy Configuration

Note that before the above rerouting of transactions can be effective, you must have addressed the issue of making the databases available at the recovery site through the techniques discussed previously in 10.2.2, IMS Remote Site Recovery on page 216 and 10.2.1, 3990 Remote Copy on page 215, or some other technique. To provide high availability for the TOR, you would want to either start the TOR on the alternate CPU or perhaps issue a VTAM command to install an alternate USS table which would direct 3270 logons to the alternate TOR. CICSPlex SM can trigger automation software by issuing an alert or console message when the TOR fails. Note that it is still necessary to consider the recovery of the application data at the remote site. Data sharing can only occur within a sysplex, but features such as IMS RSR, described above, can significantly reduce the recovery time required for databases.

10.2.4 DB2 Disaster Recovery


The procedures for DB2 data recovery are fundamentally the same as for previous releases of DB2, but the parallel sysplex data sharing group configuration adds an extra consideration: the recovery site must have a data sharing group identical to the one at the primary site, regardless of the supporting physical configuration or the number of supporting MVS images. It must have the same group name, the same number of members, and the same member names. The CFRM policies at the remote site must define the coupling facility structures with the same names, as illustrated below. Note that an ICMF LPAR can be used instead of a standalone coupling facility.
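
For example, if the data sharing group were named DSNDB0G (an illustrative name), the CFRM policy at the recovery site would have to contain structure definitions using the standard DB2 structure names for that group, in the same format as the policy shown in Appendix C; the sizes and preference lists here are placeholders only:

STRUCTURE NAME(DSNDB0G_LOCK1) SIZE(16000) PREFLIST(CF01,CF02)
STRUCTURE NAME(DSNDB0G_SCA)   SIZE(8000)  PREFLIST(CF01,CF02)
STRUCTURE NAME(DSNDB0G_GBP0)  SIZE(4000)  PREFLIST(CF02,CF01)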


Figure 45. IMS Remote Site Recovery Configuration

Obviously you must also have a coupling facility at the recovery site. DB2 does not provide a utility for remote site database shadowing. Transportation of required recovery information, such as logs, image copies and so on, to the remote site is manual. For more information on DB2 disaster recovery, refer to DB2 Data Sharing: Planning and Administration , SC26-3269.


Figure 46. DB2 Data Sharing Disaster Recovery Configuration


Appendix A. Sample Parallel Sysplex MVS Image Members


This appendix shows a sample parallel sysplex configuration. The corresponding MVS image members are provided to illustrate the widespread use of symbolics to facilitate the cloning of new images. These examples can be used in conjunction with 7.1, Adding a New MVS Image on page 149. For more information on the coding and use of the various members described below, please refer to MVS/ESA SP V5 Initialization and Tuning Reference , SC28-1452.

A.1 Example Parallel Sysplex Configuration

Figure 47. Example Parallel Sysplex Configuration


A.2 IPLPARM Members


The key member is:

LOADAA, located in SYS0.IPLPARM

A.2.1 LOADAA
IODF      37 SYS6 L06RMVS1 01
NUCLEUS   1
SYSCAT    TOTCAT113CCATALOG.TOTICFM.VTOTCAT
NUCLST    AA
IEASYM    (AA,L)
SYSPLEX   WTSCPLX1 X

Figure 48. LOADAA Member. LOADxx member for all systems in sysplex WTSCPLX1

In this example the key statements are:

IEASYM It defines one or more suffixes of the IEASYMxx members of PARMLIB that are to be used. The IEASYMxx member is used to define the static system symbols; IEASYMxx is the only place where installations can define static system symbols. These are required to keep the management effort of cloning new systems to a minimum. L indicates that a list of member names is displayed in message IEA900I during IPL.

SYSPLEX This statement defines the name of the sysplex in which the system participates. It is also the substitution text for the &SYSPLEX system symbol. Any non-blank character (in this case X) should be coded after the name to tell the system to issue message IXC217I and prompt the operator to respecify the suffix of COUPLExx if the &SYSCLONE system symbol is not unique for every system in the sysplex.

A.3 PARMLIB Members


The key members are:

IEASYMAA IEASYS00 COUPLE00 J2G J2Lxx


A.3.1 IEASYMAA
Figure 49 contains the IEASYMAA member for the sample sysplex. This is the member pointed to by the LOADxx member.

SYSDEF

SYSDEF

SYSDEF

SYSDEF

SYSDEF

SYMDEF(&CLNLST= AA ) /* CLONING NAME */ SYSCLONE(&SYSNAME(3:2)) SYMDEF(&IEASYSP= AA ) SYMDEF(&APPCLST1=&SYSCLONE. ) SYMDEF(&CMDLIST1=&SYSCLONE.,00 ) SYMDEF(&LNKLIST2= 0 0 ) SYMDEF(&LPALIST2= 0 0 ) SYMDEF(&LPALSTJ1= AA ) SYMDEF(&MLPALST1= AA ) SYMDEF(&RSULST01= 0 0 ) SYMDEF(&SSNLST01= AA ) HWNAME(ITSO942A) LPARNAME(T5) SYSPARM(AA) SYSNAME(SC47) SYMDEF(&MLPALST1= AA,AB ) HWNAME(P101) LPARNAME(A1) SYSPARM(AA) SYSNAME(SC52) SYMDEF(&APPCLST1= AA ) SYMDEF(&CMDLIST1= AA,00 ) HWNAME(P101) LPARNAME(A2) SYSPARM(AA) SYSNAME(SC53) SYMDEF(&APPCLST1= AA ) SYMDEF(&CMDLIST1= AA,00 ) HWNAME(P201) LPARNAME(A1) SYSPARM(AA) SYSNAME(SC42) SYMDEF(&APPCLST1= AA ) SYMDEF(&CMDLIST1= AA,00 )

Figure 49. IEASYMAA

The first SYSDEF statement is global. The value parameters will apply to all systems in the sysplex. The remaining SYSDEF statements are local. The value parameters here apply only to the system that HWNAME or LPARNAME identifies. In the previous example, system SC47 will use command list members COMMND47 and COMMND00. The remaining systems use COMMNDAA and COMMND00. SYSPARM specifies that member IEASYSAA is to be concatenated with the default IEASYS00.


A.3.2 IEASYS00 and IEASYSAA


Figure 50 contains the IEASYS00 member for the sample sysplex.

ALLOC=00, ALLOCATION DEFAULTS CLOCK=00, TOD CLOCK INITIALIZATION CMB=(UNITR,COMM,GRAPH,CHRDR), ADDITIONAL CMB ENTRIES CON=(00), CONSOLE DEFINITIONS COUPLE=00, COUPLE DEFINITIONS CSA=(2048,20480), MVS/XA CSA RANGE DIAG=01, CSA/SQA TRACING DUMP=NO, DYNAMIC ALLOCATION ACTIVE (COMMND00) FIX=00, FIX MODULES SPECIFIED /*J3*/ GRS=TRYJOIN, LETS GET THIS BABY GOING GRSCNF=00, GRS CONFIG DEFINITIONS GRSRNL=01, GRS RNLS DEFINITIONS ICS=00, SELECT IEAICS00 INSTALL CNTL SPECS FOR SRM LNK=00, SPECIFY LNKLST00 LNKAUTH=APFTAB, LINKLIST APF AUTHORIZATION VIA APFTAB LPA=00, SELECT LPALST00 CONCATENATED LPA LIBRARY LOGCLS=Y, WILL NOT BE PRINTED BY DEFAULT LOGLMT=999999, MUST BE 6 DIGITS, MAX WTL MESSAGES QUEUED LOGREC=LOGSTREAM, LOGREC GOES TO LOGR LOGSTREAM MAXCAD=25, CICSPLEX CMAS NUMBER OF COMMON DSPACES MAXUSER=250, (SYS TASKS + INITS + TSOUSERS) MLPA=02, SELECT IEALPA02 MODULES LOADED INTO PLPA MSTRJCL=01, MSTJCL WITHOUT UADS & WITH IEFJOBS NSYSLX=55, CICSPLEX CAS/ESSS LINKAGE INDEXES OPI=YES, ALLOW WOL OVERRIDE TO IEASYS00 OPT=00, SPECIFY IEAOPT00 (SRM TUNING PARMETERS) PAGE=(PAGE.&SYSNAME..PLPA, PLPA PAGE DATA SET PAGE.&SYSNAME..COMMON, COMMON PAGE DATA SET PAGE.&SYSNAME..LOCAL1,L), LOCAL PAGE DATA SET PAGTOTL=(8,3), ALLOW ADDITION 5 PAGE D/S AND 3 SWAP D/S PAK=00, IEAPAK00 PLEXCFG=(MULTISYSTEM,OPI=NO), MULTI-SYSTEM SYSPLEX ONLY PROG=(00), DYNAMIC APF REAL=512, ALLOWS 2 64K JOBS OR 1 128K JOB TO RUN V=R RSVSTRT=25, RESERVED ASVT ENTRIES DEFAULT RSVNONR=25, RESERVED ASVT ENTRIES DEFAULT SCH=00, SCHEDULER LIST SCHED00 SMF=00, SELECT SMFPRM00, SMF PARMETERS SMS=00, SMS PARAMETER SQA=(3,18), MVS/XA SQA APPROX 1MB SSN=00, SUBSYSTEM INITIALIZATION NAMES SVC=00, SVC TABLE IEASVC00 VAL=00, SELECT VATLST00 DEFAULT VIODSN=SYS1.&SYSNAME..STGINDEX, DATASET NAME FOR STGINDEX-DS VRREGN=512 DEFAULT REAL-STORAGE REGION SIZE DEFAULT
Figure 50. IEASYS00

Figure 51 on page 225 contains the IEASYSAA member for the sample sysplex.


CMD=(&CMDLIST1.), LNK=&LNKLIST2., LPA=(&LPALSTJ1.,&LPALIST2.), MLPA=(&MLPALST1.), RSU=&RSULST01., SSN=(&SSNLST01.)


Figure 51. IEASYSAA

The LNK statement in IEASYSAA resolves to a concatenation of LNKLST00, as specified in the global SYSDEF statement in the IEASYMAA member in Figure 49 on page 223:

SYMDEF(&LNKLIST2=00)

The other statements resolve in the same manner.


A.3.3 COUPLE00
Figure 52 contains the COUPLE00 member for the sample sysplex.

COUPLE SYSPLEX(WTSCPLX1) PCOUPLE(SYS1.XCF.CDS10) ACOUPLE(SYS1.XCF.CDS20) INTERVAL(85) OPNOTIFY(85) CLEANUP(30) MAXMSG(500) RETRY(10) CLASSLEN(1024) /* DEFINITIONS FOR CFRM POLICY */ DATA TYPE(CFRM) PCOUPLE(SYS1.XCF.CFRM1X) ACOUPLE(SYS1.XCF.CFRM2X) /* DATASETS FOR SFM POLICY */ DATA TYPE(SFM) PCOUPLE(SYS1.XCF.SFM10) ACOUPLE(SYS1.XCF.SFM20) /* DATASETS FOR WLM POLICY */ DATA TYPE(WLM) PCOUPLE(SYS1.XCF.WLM10) ACOUPLE(SYS1.XCF.WLM20) /* DATASETS FOR LOGR POLICY */ DATA TYPE(LOGR) PCOUPLE(SYS1.XCF.LOGR10) ACOUPLE(SYS1.XCF.LOGR20) /* DATASETS FOR ARM POLICY */ DATA TYPE(ARM) PCOUPLE(SYS1.XCF.ARM10) ACOUPLE(SYS1.XCF.ARM1X) /* LOCAL XCF MESSAGE TRAFFIC */ LOCALMSG MAXMSG(512) CLASS(DEFAULT) /* PATH DEFINITIONS FOR DEFAULT SIGNALLING */ PATHIN DEVICE(4010,4020,4030,4040,4050) PATHIN DEVICE(4018,4028,4038,4048,4058) PATHOUT DEVICE(5010,5020,5030,5050) PATHOUT DEVICE(5018,5028,5038,5058) PATHOUT STRNAME(IXC_DEFAULT_1,IXC_DEFAULT_2) PATHIN STRNAME(IXC_DEFAULT_1,IXC_DEFAULT_2)
Figure 52. COUPLE00

In the previous example, the naming convention for the CTC device numbering is based on the form XYYZ, where:

X is 4 for inbound CTCs (PATHIN) or 5 for outbound CTCs (PATHOUT).
YY corresponds to the MVS system image number.
Z indicates to which of the two ESCON directors the CTC is associated.

For example, PATHIN device 4030 is an inbound CTC (X=4) associated with MVS image 03 (YY=03) through the first of the two ESCON directors (Z=0).


A.3.4 JES2 Startup Procedure in SYS1.PROCLIB


Figure 53 contains the JES2 startup procedure for the sample sysplex.

//JES2 PROC // // // //IEFPROC EXEC //HASPLIST DD //HASPPARM DD // DD //PROC00 DD // DD // DD //PROC01 DD // DD //STEPLIB DD

M=J2G,M1=J2L&SYSCLONE, N=SYS1,L=LINKLIB,U=,PN=SYS1,PL=PARMLIB, PROC00= SYS1.PROCLIB , PROC01= ESA.SYS1.PROCLIB , OPCSTC= OPCESA.V1R3M0.COMMON.STC PGM=HASJES20,TIME=1440,DPRTY=(15,15) DDNAME=IEFRDER UNIT=&U,DSN=&PN..&PL(&M),DISP=SHR UNIT=&U,DSN=&PN..&PL(&M1),DISP=SHR DSN=&OPCSTC,DISP=SHR DSN=&PROC00,DISP=SHR DSN=&PROC01,DISP=SHR DSN=&PROC01,DISP=SHR DSN=&PROC00,DISP=SHR DSN=&N..&L,DISP=SHR

Figure 53. JES2 Member in SYS1.PROCLIB

This member uses both system symbolic substitution and JCL symbolic substitution to point to the appropriate members for JES2 initialization:

J2G for global initialization parameters.
J2L&SYSCLONE for specific system parameters, where &SYSCLONE resolves to the last two characters of the SYSNAME parameter specified in IEASYMAA (see Figure 49 on page 223). For example, on system SC42, &SYSCLONE resolves to 42, so the HASPPARM concatenation consists of members J2G and J2L42.


A.3.5 J2G
Figure 54 contains the JES2 initialization member for the sample sysplex.

LOGON(1) APPLID=WTSC&SYSNAME. LOGON(2) APPLID=&SYSNAME.RJE SPOOLDEF DSNAME=SYS1.HASPACE, VOLUME=TOTSP, BUFSIZE=3992, FENCE=NO, SPOOLNUM=32, TGBPERVL=5, TGSIZE=30, TGSPACE=(MAX=48864,WARN=80), TRKCELL=6 MASDEF AUTOEMEM=ON, CKPTLOCK=ACTION, HOLD=0, SHARED=CHECK, XCFGRPNM=XCFJES2A RESTART=YES, LOCKOUT=1000, DORMANCY=(0,100), SYNCTOL=120,

/* SYMBOLIC */ /* SYMBOLIC */

/* CONTENTION */

CKPTDEF CKPT1=(STRNAME=JES2CKPT_1,INUSE=YES), CKPT2=(DSN=SYS1.JES2.CKPT1,VOL=TOTSM1,INUSE=YES), NEWCKPT1=(STRNAME=JES2CKPT_2), NEWCKPT2=(DSN=SYS1.JES2.CKPT2,VOL=TOTPD0), MODE=DUPLEX, DUPLEX=ON, APPLCOPY=NONE, OPVERIFY=NO, VOLATILE=(ONECKPT=DIALOG,ALLCKPT=DIALOG) MEMBER(1) MEMBER(2) MEMBER(3) MEMBER(4) NJEDEF NAME=SC47 NAME=SC52 NAME=SC53 NAME=SC42

DELAY=300, HDRBUF=(LIMIT=100,WARN=50), JRNUM=1, JTNUM=1, SRNUM=7, STNUM=7, LINENUM=40, MAILMSG=YES, MAXHOP=0, NODENUM=999, OWNNODE=1, PATH=1, RESTMAX=0, RESTNODE=100, RESTTOL=0, TIMETOL=0 PATHMGR=YES, PATHMGR=YES, PATHMGR=NO, PATHMGR=YES, SUBNET=LOCAL SUBNET=WTSCNET SUBNET=WTSCPOK SUBNET=WTSCTEST

N1 N4 N5 N6

NAME=WTSCPLX1, NAME=WTSCNET, NAME=WTSCPOK, NAME=WTSCTEST,

Figure 54 (Part 1 of 5). J2G


APPL(WTSCSC47) APPL(WTSCSC52) APPL(WTSCSC53) APPL(WTSCSC42) APPL(SCHNJE) APPL(SCGRSCS) APPL(WTSCJEST) APPL(WTSCJESU)

NODE=1 NODE=1 NODE=1 NODE=1 NODE=4 NODE=5, REST=3 NODE=6 NODE=6

DESTID(MVS3827) DEST=WTSCMXA.U11 DESTID(MVS3900) DEST=WTSCMXA.LOCAL JOBDEF ACCTFLD=REQUIRED, JOBNUM=2000, PRTYHIGH=10, PRTYJECL=YES, RANGE=(1000-30000) COPIES=128, AUTOCMD=50, CONCHAR=$, SCOPE=SYSTEM, RDRCHAR=$ DSN=SYS1.JES2.OFFLOAD1 DSN=SYS1.JES2.OFFLOAD2 DSN=SYS1.JES2.OFFLOAD3 DSN=SYS1.JES2.OFFLOAD4 WS=(/) DISP=KEEP,WS=(/) WS=(/) DISP=KEEP,WS=(/) WS=(/) DISP=KEEP,WS=(/) WS=(/) DISP=KEEP,WS=(/) WS=(/) DISP=KEEP,WS=(/) WS=(/) DISP=KEEP,WS=(/) WS=(/) DISP=KEEP,WS=(/) WS=(/) DISP=KEEP,WS=(/) NIFCB=STD3, NIUCS=GF15, JCLERR=YES, JNUMWARN=60, PRTYLOW=5, PRTYJOB=YES, JOENUM=10000, BUFNUM=200, DISPLEN=56, JOBWARN=60, PRTYRATE=144, JOEWARN=70 BUFWARN=60,

OUTDEF CONDEF

OFFLOAD1 OFFLOAD2 OFFLOAD3 OFFLOAD4 OFF1.JR OFF1.JT OFF1.SR OFF1.ST OFF2.JR OFF2.JT OFF2.SR OFF2.ST OFF3.JR OFF3.JT OFF3.SR OFF3.ST OFF4.JR OFF4.JT OFF4.SR OFF4.ST

PRINTDEF FCB=9, LINECT=60, UCS=P11 PUNCHDEF DBLBUFR=YES


Figure 54 (Part 2 of 5). J2G


TPDEF

BELOWBUF=(LIMIT=10,SIZE=3840,WARN=50), EXTBUF=(LIMIT=50,SIZE=3840,WARN=50), MBUFSIZE=550, RMTMSG=50, SESSION=31 BELOWBUF=(LIMIT=120,WARN=60), EXTBUF=(LIMIT=200,WARN=60) BUFNUM=100 OUTPUT=YES,

BUFDEF

SMFDEF

TSUCLASS AUTH=ALL, BLP=YES, LOG=NO, REGION=2M, MSGCLASS=S, SWA=ABOVE, CONDPURG=NO STCCLASS AUTH=ALL, BLP=NO, LOG=NO, REGION=2M, MSGCLASS=S, SWA=ABOVE, TIME=(1440,0) JOBCLASS(A-Z) AUTH=ALL, BLP=YES, COMMAND=DISPLAY, JOURNAL=NO, RESTART=NO, MSGLEVEL=(1,1), REGION=2M, SWA=ABOVE, TIME=(450,00) JOBCLASS(0-9) AUTH=ALL, BLP=YES, COMMAND=DISPLAY, JOURNAL=YES, RESTART=YES, MSGLEVEL=(1,1), REGION=2M, SWA=ABOVE, TIME=(450,00) JOBPRTY1 JOBPRTY2 JOBPRTY3 JOBPRTY4 JOBPRTY5 JOBPRTY6 JOBPRTY7 JOBPRTY8 JOBPRTY9 INTRDR PRIORITY=9,TIME=1 PRIORITY=8,TIME=2 PRIORITY=7,TIME=4 PRIORITY=6,TIME=8 PRIORITY=5,TIME=16 PRIORITY=4,TIME=32 PRIORITY=3,TIME=64 PRIORITY=2,TIME=128 PRIORITY=1,TIME=256

OUTPUT=YES,

LOG=YES,

LOG=YES,

AUTH=(JOB=YES,DEVICE=YES,SYSTEM=YES), RDINUM=20 HASPFSSM=HASPFSSM HASPFSSM=HASPFSSM

FSSDEF(FSS382B) PROC=APS382B, FSSDEF(FSS382C) PROC=APS382C,


Figure 54 (Part 3 of 5). J2G


PRINTER1 FSS=FSS382C, MODE=FSS, PRMODE=(PAGE,LINE), CLASS=IU, UCS=0, CKPTPAGE=100, MARK=YES, START=NO, ROUTECDE=U10 PRINTER3 FSS=FSS382B, MODE=FSS, PRMODE=(PAGE,LINE), CLASS=IU, UCS=0, CKPTPAGE=100, MARK=YES, START=NO, ROUTECDE=U12 /* PRINTER9 START=NO,UNIT=831,UCS=P11,FCB=STD2,CLASS=U */ PUNCH1 PUNCH2 PUNCH3 PUNCH4 START=NO START=NO START=NO START=NO START=NO START=NO START=NO START=NO

READER1 READER2 READER3 READER4

ESTLNCT NUM=25,INT=10000,OPT=0 ESTIME NUM=10,INT=10,OPT=NO OUTCLASS(A) OUTCLASS(B) OUTCLASS(C) OUTCLASS(D) OUTCLASS(E-J) OUTCLASS(K) OUTCLASS(L) OUTCLASS(M-R) OUTCLASS(S-T) OUTCLASS(U-W) OUTCLASS(X) OUTCLASS(Y) OUTCLASS(Z) OUTCLASS(0-9) OUTPRTY1 OUTPRTY2 OUTPRTY3 OUTPRTY4 OUTPRTY5 OUTPRTY6 OUTPRTY7 OUTPRTY8 OUTPRTY9 OUTPUT=PUNCH OUTDISP=(HOLD,HOLD) OUTPUT=PUNCH OUTDISP=(HOLD,HOLD) OUTDISP=(HOLD,HOLD) OUTDISP=(HOLD,HOLD) OUTPUT=DUMMY,OUTDISP=(PURGE,PURGE),TRKCELL=NO RECORD=600, RECORD=1200, RECORD=3000, RECORD=12000, RECORD=15000, RECORD=20000, RECORD=25000, RECORD=30000, RECORD=40000, PAGE=10 PAGE=20 PAGE=50 PAGE=200 PAGE=250 PAGE=300 PAGE=350 PAGE=400 PAGE=500

PRIORITY=144, PRIORITY=128, PRIORITY=112, PRIORITY=96, PRIORITY=80, PRIORITY=64, PRIORITY=48, PRIORITY=32, PRIORITY=16,

Figure 54 (Part 4 of 5). J2G


LINE1 LINE2 LINE3 LINE4 LINE5 LINE6 LINE7 LINE8 LINE9

UNIT=SNA UNIT=SNA UNIT=SNA UNIT=SNA UNIT=SNA UNIT=SNA UNIT=SNA UNIT=SNA UNIT=SNA


Figure 54 (Part 5 of 5). J2G. Global JES2 Initialization Parameters

A.3.6 J2L42
Figure 55 contains the JES2 initialization member that is unique for a member of the sysplex.

INITDEF PARTNUM=20 INIT(1) INIT(2) INIT(3) INIT(4) INIT(5) INIT(6) INIT(7) INIT(8) INIT(9) INIT(10) INIT(11) INIT(12) INIT(13) INIT(14) INIT(15) INIT(16) INIT(17) INIT(18) INIT(19) INIT(20) NAME=A, NAME=A, NAME=A, NAME=A, NAME=A, NAME=B, NAME=B, NAME=B, NAME=C, NAME=C, NAME=C, NAME=D, NAME=D, NAME=D, NAME=E, NAME=E, NAME=E, NAME=Z, NAME=Z, NAME=Z, CLASS=ABCDE, CLASS=ABCDE, CLASS=ABCDE, CLASS=ABCDE, CLASS=ABCDE, CLASS=01234, CLASS=01234, CLASS=56789, CLASS=A, CLASS=A, CLASS=A, CLASS=ABCDEFG, CLASS=ABCDEFG, CLASS=ABCDEFG, CLASS=0123456789, CLASS=0123456789, CLASS=0123456789, CLASS=S, CLASS=S, CLASS=S, START=YES START=YES START=YES START=YES START=YES START=NO START=NO START=NO START=NO START=NO START=NO START=NO START=NO START=NO START=NO START=NO START=NO START=YES START=YES START=YES

Figure 55. J2L42. Specific System JES2 Initialization Parameters

A.4 VTAMLST Members


In this configuration the members concerned are:

ATCSTRxx system specific start list
ATCCONxx system specific configuration list
APCICxx system specific CICS definitions
APNJExx system specific JES2/VTAM interface
CDRMxx system specific CDRM members
MPCxx system specific TRL local major node definitions


TRLxx system specific TRL definitions
APAPPCAA global APPC application member
APHCMAA global HCM application member
APISPFAA global ISPF application member
APTCPAA global TCP/IP application member
APTSOAA global TSO application member
ECHOAA global ECHO member

The global members utilize symbolic substitution. Therefore, when cloning new MVS images into the parallel sysplex, no changes need to be made to these members. One example of such a member, APAPPCAA, is included in the following. All system specific members have been included so as to relate to 7.1, Adding a New MVS Image on page 149.

A.4.1 ATCSTR42
******************************************************************** * START LIST FOR VTAM IN IMG03/SC42 * ******************************************************************** CONFIG=42, SSCPID=42, NOPROMPT, SSCPNAME=SC42M, NETID=USIBMSC, HOSTSA=42, NODETYPE=EN, CONNTYPE=APPN, APPNCOS=#INTER, DEFAULT APPN COS CPCP=YES, SUPP=NOSUP, IOPURGE=180, HOSTPU=SC42MPU, PPOLOG=YES, DYNLU=YES, CRPLBUF=(208,,15,,1,16), IOBUF=(182,440,19,,8,48), LPBUF=(9,,0,,6,1)
Figure 56. ATCSTR42

X X X X X X X X X X X X X X X X X


A.4.2 ATCCON42
********************************************************************** * CONFIG LIST FOR IMG03/SC42 * ********************************************************************** PATH42, PATH DECK TRL23, TRL DEFINITIONS MPC23, MPC LOCAL MAJOR NODE CDRM42, CDRMS ECHOAA, ECHO APPL APTSOAA, TSO APPLICATION APISPFAA, ISPF APPC LU APNJE42, JES2/VTAM INTERFACE APCIC42, CICS AND CPSM ACPPLICATIONS COSAPPN, DEFAULT APPN COS TABLE APPNTGP, TRANSMISSION GROUP PROFILE FOR APPN APAPPCAA, APPC LU APTCPAA TCP/IP TELNET TERMINALS CULN8A0
Figure 57. ATCCON42

X X X X X X X X X X X X X

Note: This member points to global application members.


A.4.3 APCIC42
*************************************************** * * * CICS DEFINITIONS * * * *************************************************** VBUILD TYPE=APPL CICSPAC1 APPL AUTH=(ACQ,VPACE,PASS),VPACING=0,EAS=5000,PARSESS=YES, SONSCIP=YES CICSPAC2 APPL AUTH=(ACQ,VPACE,PASS),VPACING=0,EAS=5000,PARSESS=YES, SONSCIP=YES CICSPAC3 APPL AUTH=(ACQ,VPACE,PASS),VPACING=0,EAS=5000,PARSESS=YES, SONSCIP=YES CICSPAC4 APPL AUTH=(ACQ,VPACE,PASS),VPACING=0,EAS=5000,PARSESS=YES, SONSCIP=YES CICSPFC1 APPL AUTH=(ACQ,VPACE,PASS),VPACING=0,EAS=5000,PARSESS=YES, SONSCIP=YES CICSPFC2 APPL AUTH=(ACQ,VPACE,PASS),VPACING=0,EAS=5000,PARSESS=YES, SONSCIP=YES CICSPFC3 APPL AUTH=(ACQ,VPACE,PASS),VPACING=0,EAS=5000,PARSESS=YES, SONSCIP=YES CICSPFC4 APPL AUTH=(ACQ,VPACE,PASS),VPACING=0,EAS=5000,PARSESS=YES, SONSCIP=YES * CICSPCC1 APPL AUTH=(ACQ,VPACE,PASS,SPO),EAS=10,PARSESS=YES,APPC=NO, ACBNAME=CICSPCC1,VPACING=5, SONSCIP=YES * CAS42 APPL AUTH=(ACQ), ACBNAME=CAS42, PARSESS=YES, MODETAB=EYUSMPMT
Figure 58. APCIC42

X X X X X X X X

X X

X X X

A.4.4 APNJE42
*************************************************** * * * JES2/VTAM INTERFACE * * * *************************************************** VBUILD TYPE=APPL WTSCSC42 APPL AUTH=(ACQ),EAS=5,ACBNAME=WTSCSC42,VPACING=7, MODETAB=NJETAB,DLOGMOD=PKNJE77
Figure 59. APNJE42


A.4.5 CDRM42
*********************************************************************** * CDRMS FOR IMG03 SC42 * *********************************************************************** CDRMSC VBUILD TYPE=CDRM NETWORK NETID=USIBMSC * SC42M CDRM SUBAREA=42, SC42 * CDRDYN=YES, * CDRSC=OPT, * ISTATUS=ACTIVE SC47M CDRM SUBAREA=47, SC47 * CDRDYN=YES, * CDRSC=OPT, * ISTATUS=INACTIVE SC52M CDRM SUBAREA=52, SC52 * CDRDYN=YES, * CDRSC=OPT, * ISTATUS=INACTIVE SC53M CDRM SUBAREA=53, SC53 * CDRDYN=YES, * CDRSC=OPT, * ISTATUS=INACTIVE
Figure 60. CDRM42

A.4.6 MPC03
*********************************************************************** * TRL LOCAL MAJOR NODE FOR IMG03 * *********************************************************************** TRL03L VBUILD TYPE=LOCAL TRL0305P PU TRLE=MPC0305,ISTATUS=ACTIVE,VPACING=0, X SSCPFM=USSSCS,CONNTYPE=APPN,CPCP=YES
Figure 61. MPC03

Note: The network node, in this example image 05, will require PU statements for all other images in the sysplex.

A.4.7 TRL03
*********************************************************************** * VTAM TRL DEFINITIONS FOR IMG03 * *********************************************************************** TRL03 VBUILD TYPE=TRL MPC0305 TRLE LNCTL=MPC,MAXBFRU=5, * READ=(4053,405B),WRITE=(5053,505B)
Figure 62. TRL03


Note: The network node, in this example image 05, will require TRLE statements for all other images in the sysplex.

A.4.8 APAPPCAA
APAPPC&SYSCLONE VBUILD TYPE=APPL * SC&SYSCLONE.APPC APPL ACBNAME=SC&SYSCLONE.APPC, APPC=YES, AUTOSES=10, DDRAINL=NALLOW, DMINWNL=2, DMINWNR=2, DRESPL=NALLOW, DSESLIM=10, EAS=509, MODETAB=MTAPPC, SECACPT=CONV, SRBEXIT=YES, VERIFY=NONE, VPACING=2 SC&SYSCLONE.SRV APPL ACBNAME=SC&SYSCLONE.SRV, APPC=YES, AUTOSES=10, DDRAINL=NALLOW, DMINWNL=2, DMINWNR=2, DRESPL=NALLOW, DSESLIM=10, EAS=509, MODETAB=MTAPPC, SECACPT=ALREADYV, SRBEXIT=YES, VERIFY=NONE, VPACING=2 SC&SYSCLONE.SGT APPL ACBNAME=SC&SYSCLONE.SGT, APPC=YES, AUTOSES=10, DDRAINL=NALLOW, DMINWNL=2, DMINWNR=2, DRESPL=NALLOW, DSESLIM=10, EAS=509, MODETAB=MTAPPC, SECACPT=ALREADYV, SRBEXIT=YES, VERIFY=NONE, VPACING=20
Figure 63. APAPPCAA

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *


A.5 Allocating Data Sets


The following is sample JCL for allocating LOGREC, PAGE, STGINDEX and SMF data sets when cloning an MVS image as referenced in 7.1, Adding a New MVS Image on page 149. These data sets are system specific and cannot be shared.

A.5.1 ALLOC JCL


//ALLOCMVS JOB (999,POK), L06R , NOTIFY=&SYSUID, // CLASS=A,MSGCLASS=T,TIME=1439, // REGION=5000K,MSGLEVEL=(1,1) //******************************************************************** //* DEFINE LOGREC AND DIP THE LOGREC * //******************************************************************** //LOGREC EXEC PGM=IFCDIP00 //SYSPRINT DD SYSOUT=* //SERERDS DD DSN=SYS1.SC42.LOGREC,DISP=(,CATLG), // SPACE=(CYL,5,,CONTIG), // DSORG=PSU,RECFM=U,LRECL=0,BLKSIZE=1944, // UNIT=3390,VOL=SER=MVS004 //******************************************************************** //* DEFINE PAGE DATASETS * //******************************************************************** //PAGEDEF EXEC PGM=IDCAMS //SYSPRINT DD SYSOUT=* //SYSIN DD * /* DEFINE LOCAL PAGE */ DEFINE PAGESPACE (NAME(PAGE.SC42.LOCAL1) CYLINDERS(400) VOLUME(MVS004) UNIQUE) CATALOG(CATALOG.TOTICFM.VTOTCAT) /* DEFINE COMMON PAGE */ DEFINE PAGESPACE (NAME(PAGE.SC42.COMMON) CYLINDERS(70) VOLUME(MVS004) UNIQUE) CATALOG(CATALOG.TOTICFM.VTOTCAT) /* DEFINE PLPA PAGE */ DEFINE PAGESPACE (NAME(PAGE.SC42.PLPA) CYLINDERS(80) VOLUME(MVS004) UNIQUE) CATALOG(CATALOG.TOTICFM.VTOTCAT) /*
Figure 64 (Part 1 of 3). Allocating System Specific Data Sets


//******************************************************************** //* DEFINE STGINDEX DATASET * //******************************************************************** //STGDEF EXEC PGM=IDCAMS //SYSPRINT DD SYSOUT=* //SYSIN DD * /* DEFINE STGINDEX */ DEFINE CLUSTER (NAME(SYS1.SC42.STGINDEX) CYLINDERS(5) VOLUME(MVS004) KEYS(12 8) BUFFERSPACE(20480) RECORDSIZE(2041 2041) REUSE) DATA (CONTROLINTERVALSIZE(2048)) INDEX (CONTROLINTERVALSIZE(4096)) /* //******************************************************************** //* DEFINE SMF DATASETS * //******************************************************************** //SMFDEF EXEC PGM=IDCAMS //SYSPRINT DD SYSOUT=* //SYSIN DD * DEFINE CLUSTER (NAME(SYS1.SC42.MAN1) CYLINDERS(40) VOLUME(MVS004) RECORDSIZE(26614 32767) NONINDEXED SPEED BUFFERSPACE(737280) SPANNED SHAREOPTIONS (2) REUSE CONTROLINTERVALSIZE(26624)) DEFINE CLUSTER (NAME(SYS1.SC42.MAN2) CYLINDERS(40) VOLUME(MVS004) RECORDSIZE(26614 32767) NONINDEXED SPEED BUFFERSPACE(737280) SPANNED SHAREOPTIONS (2) REUSE CONTROLINTERVALSIZE(26624))
Figure 64 (Part 2 of 3). Allocating System Specific Data Sets


DEFINE CLUSTER (NAME(SYS1.SC42.MAN3) CYLINDERS(40) VOLUME(MVS004) RECORDSIZE(26614 32767) NONINDEXED SPEED BUFFERSPACE(737280) SPANNED SHAREOPTIONS (2) REUSE CONTROLINTERVALSIZE(26624)) /* //******************************************************************* //* CLEAR THE SPECIFIED SMF DATASETS * //******************************************************************* //SMFCLR1 EXEC PGM=IFASMFDP //SYSPRINT DD SYSOUT=* //DUMPIN DD DSN=SYS1.SC42.MAN1,DISP=SHR //DUMPOUT DD DUMMY //SYSIN DD * INDD(DUMPIN,OPTIONS(CLEAR)) /* //SMFCLR2 EXEC PGM=IFASMFDP //SYSPRINT DD SYSOUT=* //DUMPIN DD DSN=SYS1.SC42.MAN2,DISP=SHR //DUMPOUT DD DUMMY //SYSIN DD * INDD(DUMPIN,OPTIONS(CLEAR)) /* //SMFCLR3 EXEC PGM=IFASMFDP //SYSPRINT DD SYSOUT=* //DUMPIN DD DSN=SYS1.SC42.MAN3,DISP=SHR //DUMPOUT DD DUMMY //SYSIN DD * INDD(DUMPIN,OPTIONS(CLEAR)) /*
Figure 64 (Part 3 of 3). Allocating System Specific Data Sets


Appendix B. Structures, How to ...


This appendix provides operational information and examples on how to inquire about structures and connections, along with examples of structure manipulation.

B.1 To Gather Information on a Coupling Facility


Use the command D CF,CFNAME=cfname to get information on a coupling facility:

D CF,CFNAME=CF1 IXL150I 23.01.31 DISPLAY CF 927 COUPLING FACILITY 009674.IBM.51.000000060041 PARTITION: 3 CPCID: 00 CONTROL UNIT ID: FFF6 NAMED CF1 COUPLING FACILITY SPACE UTILIZATION ALLOCATED SPACE DUMP SPACE UTILIZATION 1 STRUCTURES: 11264 K 3 STRUCTURE DUMP TABLES: 0 K 2 DUMP SPACE: 2048 K 4 TABLE COUNT: 0 FREE SPACE: 17920 K FREE DUMP SPACE: 2048 K TOTAL SPACE: 31232 K TOTAL DUMP SPACE: 2048 K 5 MAX REQUESTED DUMP SPACE: 0 K 6 VOLATILE: YES STORAGE INCREMENT SIZE: 256 K 7 CFLEVEL: 1 COUPLING FACILITY SPACE CONFIGURATION IN USE 8 CONTROL SPACE: 13312 K 9 NON-CONTROL SPACE: 0 K SENDER PATH 10 13 PHYSICAL ONLINE LOGICAL ONLINE SUBCHANNEL 0361 0362 FREE 17920 K 0 K STATUS VALID STATUS OPERATIONAL/IN USE OPERATIONAL/IN USE TOTAL 31232 K 0 K

11 COUPLING FACILITY DEVICE FFEA FFEB

Figure 65. Coupling Facility Display

Note:

1 Allocated space for STRUCTURES. This is the coupling facility processor storage currently used by the allocated structures. It is a multiple of STORAGE INCREMENT SIZE. 2 DUMP SPACE is the space reserved to capture structure dump data in
the coupling facility, before offloading them onto the dump data set. DUMP SPACE is the value given to DUMPSPACE in the CFRM policy active at structure allocation time, rounded to the next multiple of STORAGE INCREMENT SIZE.

3 STRUCTURE DUMP TABLES is the space currently allocated to captured structure dump data waiting to be offloaded onto dump data set.

4 TABLE COUNT is the number of captured dumps still in the coupling facility dump space. 5 MAX REQUESTED DUMP SPACE. This is the maximum amount of dump space which has been requested to be assigned to a dump table. 6 VOLATILE. This is the volatility status of the coupling facility. 7 CFLEVEL is the CFCC level currently running in this coupling facility 8 CONTROL SPACE should be understood as Central Storage space in the coupling facility processor storage. 9 NON-CONTROL SPACE should be understood as Expanded Storage space in the coupling facility processor storage. 10 The following is information related to the status of the CFS CHPID(s) defined to this MVS image and connected to this coupling facility:

SENDER PATH is the CHPID number.

PHYSICAL can be either:
ONLINE
OFFLINE - This means that there is no physical CHPID assigned to this MVS image. This can result from a definition error or from the CHPID being offline due to a CF CHP command.

LOGICAL can be either:
ONLINE
OFFLINE - This means that there is no path to the coupling facility associated with this CHPID. This can result from a malfunction or from the path being offline due to a V PATH command.

STATUS can be either:
VALID
MISCABLED - This means either that the CFS CHPID is not connected to the coupling facility as it was defined during the configuration phase through HCD, or that the HCD configuration phase did not complete properly.
NOT OPERATIONAL, reported with one of the following reasons:
FACILITY PAUSED - This means that the last path validation attempted received a Facility Paused status; therefore the path is not operational.
PATH NOT AVAILABLE - This means that the last path validation attempted received a Path Not Available status; therefore the path is not operational.


B.2 To Gather Information on Structure and Connections


The D XCF,STRUCTURE,STRNAME=strname command provides much information about the current characteristics of the structure and its connections. Look at Figure 66 for an example of what information is available.

D XCF,STR,STRNM=SYSTEM_LOGREC IXC360I 15.18.30 DISPLAY XCF 361 STRNAME: SYSTEM_LOGREC STATUS: ALLOCATED POLICY SIZE : 32256 K POLICY INITSIZE: 16128 K REBUILD PERCENT: N/A PREFERENCE LIST: CF01 CF02 EXCLUSION LIST IS EMPTY ACTIVE STRUCTURE ---------------ALLOCATION TIME: 10/16/95 12:52:08 CFNAME : CF02 COUPLING FACILITY: 009672.IBM.02.000000040104 PARTITION: 1 CPCID: 01 ACTUAL SIZE : 17152 K STORAGE INCREMENT SIZE: 256 K VERSION : ABD445FB B991D202 DISPOSITION : DELETE ACCESS TIME : 0 MAX CONNECTIONS: 32 # CONNECTIONS : 9 CONNECTION NAME ---------------IXGLOGR_SC42 IXGLOGR_SC43 IXGLOGR_SC47 IXGLOGR_SC49 IXGLOGR_SC50 IXGLOGR_SC52 IXGLOGR_SC53 IXGLOGR_SC54 IXGLOGR_SC55 ID -02 03 01 08 05 04 09 07 06 VERSION -------00020D96 000309FB 00010077 0008005C 000502F9 0004041A 00090006 000700A3 00060230 SYSNAME -------SC42 SC43 SC47 SC49 SC50 SC52 SC53 SC54 SC55 JOBNAME -------IXGLOGR IXGLOGR IXGLOGR IXGLOGR IXGLOGR IXGLOGR IXGLOGR IXGLOGR IXGLOGR ASID ---00 0011 0011 0011 0011 0011 0011 0011 0011 STATE ---------------ACTIVE ACTIVE ACTIVE ACTIVE ACTIVE ACTIVE ACTIVE ACTIVE

1 2

3 4 5 6 7 8

Figure 66. Structures and Connections Display

Note:

1 POLICY SIZE is the maximum size which can be reached by the structure. This is the value given to the SIZE parameter in the CFRM policy active at structure allocation time, rounded to the next multiple of STORAGE INCREMENT SIZE. 2 POLICY INITSIZE is the size actually given to the structure at allocation time. This is the value of the INITSIZE parameter in the CFRM policy active at allocation time, rounded to the next multiple of STORAGE INCREMENT SIZE. Giving a value to INITSIZE implies that the user intends to use the ALTER function if necessary. 3 ACTUAL SIZE is the current size of the structure and is a multiple of the STORAGE INCREMENT SIZE.


4 VERSION is a pseudo random number generated by XES at structure allocation time. It uniquely identifies this instance of the structure. Rebuilding a structure, or deallocating/reallocating the structure will change the version number for the new instance of the structure. 5 DISPOSITION is DELETE or KEEP. This is set up by the first exploiter to connect to the structure (therefore at structure allocation) and this is not adjustable by policy parameter. 6 ACCESS TIME is the length of time (in tenths of second) the connectors can tolerate not having access to the structure because of a structure SVC dump being in progress. This value is set up by the first exploiter to connect to the structure and this is not adjustable by a policy parameter. In the case shown here, ACCESS TIME: 0 means that the structure cannot be included in a SVC dump. 7 MAX CONNECTIONS is the maximum total number of connections to this structure that can be active or failed-persistent at any point in time. This value is defined when formatting the CFRM data set with the IXCL1DSU. It pertains to all structures created in the sysplex while this couple data set is in use. 8 # CONNECTIONS is the current number of connectors to the structure. 9 The following is information relative to the current connections to the structure:

CONNECTION NAME is a value given by the connector internal code.
CONNECTION ID is a value given by XES when granting the connection.
CONNECTION VERSION is a pseudo random number which uniquely identifies the instance of the connection.
SYSNAME, JOBNAME and ASID allow you to locate the connector.
The STATE of the connection can be:
FAILED-PERSISTENT
DISCONNECTING - The connection is in the process of disconnecting.
FAILING - The connection is in the process of abnormally ending.
ACTIVE
ACTIVE & - The connection is in the active state, but the connector has lost physical connectivity to the structure.
ACTIVE OLD - The structure is being rebuilt and the connector is connected to the old structure.
ACTIVE &OLD - The structure is being rebuilt and the connector is connected to the old structure, but it has lost physical connectivity to the structure.


ACTIVE NEW,OLD The structure is being rebuilt and the connector is connected to the old and new structure.

ACTIVE NEW,&OLD The structure is being rebuilt and the connector is connected to the old and new structure, and it has lost physical connectivity to the old structure.

ACTIVE &NEW,OLD The structure is being rebuilt and the connector is connected to the old and new structure, and it has lost physical connectivity to the new structure.

ACTIVE &NEW,&OLD The structure is being rebuilt and the connector is connected to the old and new structure, and it has lost physical connectivity to the old and new structure.

B.3 To Deallocate a Structure with a Disposition of DELETE


To deallocate a structure with a disposition of DELETE, all the structure's exploiters must be disconnected from the structure; that is, there should not be any connections to the structure in the active or failed-persistent state. Refer to B.5, To Suppress a Connection in Active State and B.6, To Suppress a Connection in Failed-persistent State on page 246 for information on how to suppress connections.

B.4 To Deallocate a Structure with a Disposition of KEEP


Ensure that there are no connections in either the active or failed-persistent state, then force the structure out of the coupling facility with the following command:

SETXCF FORCE,STRUCTURE,STRNAME=strname

B.5 To Suppress a Connection in Active State


The only way to suppress a connection in the active state is to have all the connectors disconnect. The connection then switches to the failed-persistent or undefined state. The SETXCF FORCE command does not operate on an active connection. Information on how IBM exploiters can be made to disconnect from the structure is available in 9.2, Coupling Facility Failure Recovery on page 180.


B.6 To Suppress a Connection in Failed-persistent State


Note the structure name and the connection names, which appear as a result of the following command:

D XCF,STRUCTURE,STRNAME=strname
If any of the connections are in the failed-persistent state, then force them out by using the following command:

SETXCF FORCE,CONNECTION,STRNM=strname,CONNAME=conname

B.7 To Monitor a Structure Rebuild


Once started, a structure rebuild can be monitored by looking at the messages issued at the MVS console by the exploiters themselves, or by periodically requesting the state of the structure using the D XCF,STRUCTURE,STRNAME=strname command. Figure 67 and Figure 68 on page 247 contain two examples.

SETXCF START,REBUILD,STRNM=IEFAUTOS,LOC=OTHER
IXC367I THE SETXCF START REBUILD REQUEST FOR STRUCTURE 879
IEFAUTOS WAS ACCEPTED.
IEF265I AUTOMATIC TAPE SWITCHING: REBUILD IN PROGRESS BECAUSE
THE OPERATOR REQUESTED IEFAUTOS REBUILD.
IEF265I AUTOMATIC TAPE SWITCHING: REBUILD IN PROGRESS BECAUSE
THE OPERATOR REQUESTED IEFAUTOS REBUILD.
IEF265I AUTOMATIC TAPE SWITCHING: REBUILD IN PROGRESS BECAUSE
THE OPERATOR REQUESTED IEFAUTOS REBUILD.
IEF265I AUTOMATIC TAPE SWITCHING: REBUILD IN PROGRESS BECAUSE
THE OPERATOR REQUESTED IEFAUTOS REBUILD.
IEF265I AUTOMATIC TAPE SWITCHING: REBUILD IN PROGRESS BECAUSE
THE OPERATOR REQUESTED IEFAUTOS REBUILD.
IEF265I AUTOMATIC TAPE SWITCHING: REBUILD IN PROGRESS BECAUSE
THE OPERATOR REQUESTED IEFAUTOS REBUILD.
IEF265I AUTOMATIC TAPE SWITCHING: REBUILD IN PROGRESS BECAUSE
THE OPERATOR REQUESTED IEFAUTOS REBUILD.
IEF268I AUTOMATIC TAPE SWITCHING IS AVAILABLE. 881
IEFAUTOS WAS SUCCESSFULLY REBUILT.

Figure 67. Monitoring Structure Rebuild through Exploiter's Messages


SETXCF START,REBUILD,STRNM=SYSTEM_LOGREC,LOC=OTHER
IXC367I THE SETXCF START REBUILD REQUEST FOR STRUCTURE 057
SYSTEM_LOGREC WAS ACCEPTED.
D XCF,STR,STRNM=SYSTEM_LOGREC
IXC360I 15.17.51 DISPLAY XCF 059
STRNAME: SYSTEM_LOGREC
 STATUS: REASON SPECIFIED WITH REBUILD START:
           OPERATOR INITIATED
         REBUILD PHASE: COMPLETE
 POLICY SIZE    : 32256 K
 POLICY INITSIZE: 16128 K
 REBUILD PERCENT: N/A
 PREFERENCE LIST: CF01  CF02
 EXCLUSION LIST IS EMPTY
 REBUILD NEW STRUCTURE
 ---------------------
  ALLOCATION TIME: 10/25/95 15:17:50
  CFNAME         : CF01
  COUPLING FACILITY: 009672.IBM.02.000000040104
                     PARTITION: 1  CPCID: 00
  ACTUAL SIZE    : 16128 K
  STORAGE INCREMENT SIZE: 256 K
  VERSION        : ABDFB755 B5945204
  DISPOSITION    : DELETE
  ACCESS TIME    : 0
  MAX CONNECTIONS: 32
  # CONNECTIONS  : 8
 REBUILD OLD STRUCTURE
 ---------------------
  ALLOCATION TIME: 10/16/95 12:52:08
  CFNAME         : CF02
  COUPLING FACILITY: 009672.IBM.02.000000040104
                     PARTITION: 1  CPCID: 01
  ACTUAL SIZE    : 17152 K
  STORAGE INCREMENT SIZE: 256 K
  VERSION        : ABD445FB B991D202
  ACCESS TIME    : 0
  MAX CONNECTIONS: 32
  # CONNECTIONS  : 8
 * ASTERISK DENOTES CONNECTOR WITH OUTSTANDING REBUILD RESPONSE
 CONNECTION NAME  ID VERSION  SYSNAME  JOBNAME  ASID STATE
 ---------------- -- -------- -------- -------- ---- ----------------
 *IXGLOGR_SC42    02 00020D97 SC42     IXGLOGR  0011 ACTIVE NEW,OLD
 *IXGLOGR_SC43    03 000309FC SC43     IXGLOGR  0011 ACTIVE NEW,OLD
 *IXGLOGR_SC47    01 0001007C SC47     IXGLOGR  0011 ACTIVE NEW,OLD
 *IXGLOGR_SC49    08 0008005D SC49     IXGLOGR  0011 ACTIVE NEW,OLD
 *IXGLOGR_SC50    05 000502FA SC50     IXGLOGR  0011 ACTIVE NEW,OLD
 *IXGLOGR_SC52    07 000700A5 SC52     IXGLOGR  0011 ACTIVE NEW,OLD
 *IXGLOGR_SC54    04 0004041C SC54     IXGLOGR  0011 ACTIVE NEW,OLD
 *IXGLOGR_SC55    06 00060231 SC55     IXGLOGR  0011 ACTIVE NEW,OLD

Figure 68. Monitoring Structure Rebuild by Displaying Structure Status


B.8 To Stop a Structure Rebuild


In a normal situation the operator is not expected to stop a structure rebuild. However, the facility is available and can be used for recovery purposes, such as stopping a rebuild that appears to be hung. To stop the rebuild, use the following command:

SETXCF STOP,REBUILD,STRNAME=strname
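For example, to stop the rebuild of the SYSTEM_LOGREC structure started in Figure 68:

SETXCF STOP,REBUILD,STRNAME=SYSTEM_LOGREC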

B.9 To Recover from a Hang in Structure Rebuild


If the rebuild appears to be hung, the operator can try to prompt a system action by requesting a stop of the rebuild using SETXCF STOP,REBUILD. If the hang condition appears to be due to a connector having a rebuild response outstanding (see Figure 68 on page 247 for an example of outstanding rebuild responses), an attempt can be made to vary the connector's system out of the sysplex by using V XCF,sysname,OFF.
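A possible recovery sequence, using names from the examples in this appendix (treat it as a sketch rather than a fixed procedure), is:

SETXCF STOP,REBUILD,STRNAME=SYSTEM_LOGREC    (try to stop the hung rebuild)
D XCF,STR,STRNM=SYSTEM_LOGREC                (find connectors flagged with an outstanding rebuild response)
V XCF,SC42,OFFLINE                           (last resort: remove the unresponsive connector's system)

The VARY should be regarded as a last resort, since it removes the whole system, not just the connector, from the sysplex.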


Appendix C. Examples of CFRM Policy Transitioning


This appendix contains examples of changes introduced in an active CFRM policy that do not immediately take effect. It shows how to recognize that changes are pending and what has to be done to complete the CFRM policy transitioning.

C.1 Changing the Structure Definition


Assume that the following CFRM policy, shown in Figure 69 on page 250, is currently active.


CF NAME(CF01) DUMPSPACE(2048) PARTITION(1) CPCID(00)
   TYPE(009672) MFG(IBM) PLANT(02) SEQUENCE(000000040104)
CF NAME(CF02) DUMPSPACE(2048) PARTITION(1) CPCID(01)
   TYPE(009672) MFG(IBM) PLANT(02) SEQUENCE(000000040104)
STRUCTURE NAME(IEFAUTOS) SIZE(640) REBUILDPERCENT(20)
   PREFLIST(CF01, CF02)
STRUCTURE NAME(IRRXCF00_B001) SIZE(332)
   PREFLIST(CF02, CF01) EXCLLIST(IRRXCF00_P001)
STRUCTURE NAME(IRRXCF00_P001) SIZE(1644)
   PREFLIST(CF01, CF02) EXCLLIST(IRRXCF00_B001)
STRUCTURE NAME(ISTGENERIC) SIZE(328)
   PREFLIST(CF02, CF01)
STRUCTURE NAME(IXC_DEFAULT_1) SIZE(16128)
   PREFLIST(CF02, CF01) EXCLLIST(IXC_DEFAULT_2)
STRUCTURE NAME(IXC_DEFAULT_2) SIZE(16128)
   PREFLIST(CF01, CF02) EXCLLIST(IXC_DEFAULT_1)
STRUCTURE NAME(JES2CKPT_1) SIZE(4096) INITSIZE(2048)
   PREFLIST(CF02, CF01) EXCLLIST(JES2CKPT_2)
STRUCTURE NAME(SYSTEM_LOGREC) SIZE(32256) INITSIZE(16128)
   PREFLIST(CF01, CF02)
STRUCTURE NAME(SYSTEM_OPERLOG) SIZE(1024)
   PREFLIST(CF02, CF01)
STRUCTURE NAME(TEST) SIZE(2048) INITSIZE(1024) REBUILDPERCENT(20)
   PREFLIST(CF01, CF02)
Figure 69. CFRM Policy Sample

Let's display the status of the policy-defined structures:


D XCF,STR
IXC359I 10.14.49 DISPLAY XCF 020
STRNAME          ALLOCATION TIME     STATUS
IEFAUTOS         10/12/95 13:25:26   ALLOCATED
IRRXCF00_B001    10/16/95 16:55:19   ALLOCATED
IRRXCF00_P001    10/16/95 16:55:18   ALLOCATED
ISTGENERIC       10/12/95 14:35:55   ALLOCATED
IXC_DEFAULT_1    10/17/95 09:42:11   ALLOCATED
IXC_DEFAULT_2    --                  NOT ALLOCATED
JES2CKPT_1       10/12/95 14:46:43   ALLOCATED
SYSTEM_LOGREC    10/16/95 12:52:08   ALLOCATED
SYSTEM_OPERLOG   10/12/95 14:36:31   ALLOCATED
TEST             --                  NOT ALLOCATED

For the sake of the example, we create and activate a new CFRM policy with two major changes:

1. The SIZE of the IEFAUTOS structure is modified.

2. The IXC_DEFAULT_1 signalling structure is not defined in the new CFRM policy.

Note that these two structures are currently allocated. We first check the current size of the IEFAUTOS structure:

D XCF,STR,STRNM=IEFAUTOS
IXC360I 10.15.43 DISPLAY XCF 022
STRNAME: IEFAUTOS
 STATUS: ALLOCATED
 POLICY SIZE    : 640 K
 POLICY INITSIZE: N/A
 REBUILD PERCENT: 20
 PREFERENCE LIST: CF01  CF02
 EXCLUSION LIST IS EMPTY
 ACTIVE STRUCTURE
 ----------------
  ALLOCATION TIME: 10/12/95 13:25:26
  CFNAME         : CF02
  COUPLING FACILITY: 009672.IBM.02.000000040104
                     PARTITION: 1  CPCID: 01
  ACTUAL SIZE    : 768 K
  STORAGE INCREMENT SIZE: 256 K
  VERSION        : ABCF45F7 0891D281
  DISPOSITION    : DELETE
  ACCESS TIME    : NOLIMIT
  MAX CONNECTIONS: 32
  # CONNECTIONS  : 9
  CONNECTION NAME  ID VERSION  SYSNAME  JOBNAME  ASID STATE
  ---------------- -- -------- -------- -------- ---- ------
  IEFAUTOSSC42     06 00060035 SC42     ALLOCAS  000F ACTIVE
  IEFAUTOSSC43     09 00090005 SC43     ALLOCAS  000F ACTIVE
  IEFAUTOSSC47     01 0001017B SC47     ALLOCAS  000F ACTIVE
  IEFAUTOSSC49     08 00080006 SC49     ALLOCAS  000F ACTIVE
  IEFAUTOSSC50     05 0005003B SC50     ALLOCAS  000F ACTIVE
  IEFAUTOSSC52     02 00020045 SC52     ALLOCAS  000F ACTIVE
  IEFAUTOSSC53     04 00040033 SC53     ALLOCAS  000F ACTIVE
  IEFAUTOSSC54     07 00070028 SC54     ALLOCAS  000F ACTIVE
  IEFAUTOSSC55     03 0003003F SC55     ALLOCAS  000F ACTIVE


The new CFRM policy is installed into the CFRM couple data set by the JCL shown in Figure 70 on page 252.

//ADDCFRM  JOB (999,POK),L06R,CLASS=A,REGION=4096K,
//         MSGCLASS=T,TIME=10,MSGLEVEL=(1,1),NOTIFY=&SYSUID
//******************************************************************
//* JCL TO INSTALL A NEW CFRM POLICY
//*
//******************************************************************
//STEP1    EXEC PGM=IXCMIAPU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DATA TYPE(CFRM) REPORT(YES)
  DEFINE POLICY NAME(TESTPK)
    CF NAME(CF01) DUMPSPACE(2048) PARTITION(1) CPCID(00)
       TYPE(009672) MFG(IBM) PLANT(02) SEQUENCE(000000040104)
    CF NAME(CF02) DUMPSPACE(2048) PARTITION(1) CPCID(01)
       TYPE(009672) MFG(IBM) PLANT(02) SEQUENCE(000000040104)
    STRUCTURE NAME(IEFAUTOS) SIZE(1000) REBUILDPERCENT(20)
       PREFLIST(CF01, CF02)
    STRUCTURE NAME(IRRXCF00_B001) SIZE(332)
       PREFLIST(CF02, CF01) EXCLLIST(IRRXCF00_P001)
    STRUCTURE NAME(IRRXCF00_P001) SIZE(1644)
       PREFLIST(CF01, CF02) EXCLLIST(IRRXCF00_B001)
    STRUCTURE NAME(ISTGENERIC) SIZE(328)
       PREFLIST(CF02, CF01)
    STRUCTURE NAME(IXC_DEFAULT_2) SIZE(16128)
       PREFLIST(CF01, CF02)
    STRUCTURE NAME(JES2CKPT_1) SIZE(4096) INITSIZE(2048)
       PREFLIST(CF02, CF01) EXCLLIST(JES2CKPT_2)
    STRUCTURE NAME(SYSTEM_LOGREC) SIZE(32256) INITSIZE(16128)
       PREFLIST(CF01, CF02)
    STRUCTURE NAME(SYSTEM_OPERLOG) SIZE(1024)
       PREFLIST(CF02, CF01)
    STRUCTURE NAME(TEST) SIZE(2048) INITSIZE(1024) REBUILDPERCENT(20)
       PREFLIST(CF01, CF02)
Figure 70. JCL to Install a New CFRM Policy


Once the policy has been installed, it is started as a new active policy:

SETXCF START,POL,TYPE=CFRM,POLNM=TESTPK
IXC511I START ADMINISTRATIVE POLICY TESTPK FOR CFRM ACCEPTED
IXC512I POLICY CHANGE IN PROGRESS FOR CFRM 025
TO MAKE TESTPK POLICY ACTIVE.
2 POLICY CHANGE(S) PENDING.

Note that there are two pending changes indicated:

D XCF,STR
IXC359I 10.17.10 DISPLAY XCF 027
STRNAME          ALLOCATION TIME     STATUS
IEFAUTOS         10/12/95 13:25:26   ALLOCATED  POLICY CHANGE PENDING
IRRXCF00_B001    10/16/95 16:55:19   ALLOCATED
IRRXCF00_P001    10/16/95 16:55:18   ALLOCATED
ISTGENERIC       10/12/95 14:35:55   ALLOCATED
IXC_DEFAULT_1    10/17/95 09:42:11   ALLOCATED  POLICY CHANGE PENDING
IXC_DEFAULT_2    --                  NOT ALLOCATED
SYSTEM_LOGREC    10/16/95 12:52:08   ALLOCATED
SYSTEM_OPERLOG   10/12/95 14:36:31   ALLOCATED
TEST             --                  NOT ALLOCATED

In this example, we already knew what to expect from this CFRM change. In a real-life situation, these pending changes may result from mistakes in writing the new policy, and may therefore require that you compare the previous and current active policies to locate the differences. We now initiate a rebuild of the IEFAUTOS structure. This causes the creation of a new instance of the structure, built according to the new parameters in the active policy:

SETXCF START,REBUILD,STRNM=IEFAUTOS,LOC=NORMAL
IXC367I THE SETXCF START REBUILD REQUEST FOR STRUCTURE 033
IEFAUTOS WAS ACCEPTED.
IEF265I AUTOMATIC TAPE SWITCHING: REBUILD IN PROGRESS BECAUSE 492
THE OPERATOR REQUESTED IEFAUTOS REBUILD.
IEF268I AUTOMATIC TAPE SWITCHING IS AVAILABLE. 035
IEFAUTOS WAS SUCCESSFULLY REBUILT.
IXC512I POLICY CHANGE IN PROGRESS FOR CFRM 793
TO MAKE TESTPK POLICY ACTIVE.
1 POLICY CHANGE(S) PENDING.


D XCF,STR
IXC359I 10.19.48 DISPLAY XCF 037
STRNAME          ALLOCATION TIME     STATUS
IEFAUTOS         10/17/95 10:19:14   ALLOCATED
IRRXCF00_B001    10/16/95 16:55:19   ALLOCATED
IRRXCF00_P001    10/16/95 16:55:18   ALLOCATED
ISTGENERIC       10/12/95 14:35:55   ALLOCATED
IXC_DEFAULT_1    10/17/95 09:42:11   ALLOCATED  POLICY CHANGE PENDING
IXC_DEFAULT_2    --                  NOT ALLOCATED
JES2CKPT_1       10/12/95 14:46:43   ALLOCATED
SYSTEM_LOGREC    10/16/95 12:52:08   ALLOCATED
SYSTEM_OPERLOG   10/12/95 14:36:31   ALLOCATED
TEST             --                  NOT ALLOCATED

We can check for the new size of IEFAUTOS:

D XCF,STR,STRNM=IEFAUTOS
IXC360I 10.20.02 DISPLAY XCF 039
STRNAME: IEFAUTOS
 STATUS: ALLOCATED
 POLICY SIZE    : 1000 K
 POLICY INITSIZE: N/A
 REBUILD PERCENT: 20
 PREFERENCE LIST: CF01  CF02
 EXCLUSION LIST IS EMPTY
 ACTIVE STRUCTURE
 ----------------
  ALLOCATION TIME: 10/17/95 10:19:14
  CFNAME         : CF01
  COUPLING FACILITY: 009672.IBM.02.000000040104
                     PARTITION: 1  CPCID: 00
  ACTUAL SIZE    : 1024 K
  STORAGE INCREMENT SIZE: 256 K
  VERSION        : ABD565AC 32D98806
  DISPOSITION    : DELETE
  ACCESS TIME    : NOLIMIT
  MAX CONNECTIONS: 32
  # CONNECTIONS  : 9
  CONNECTION NAME  ID VERSION  SYSNAME  JOBNAME  ASID STATE
  ---------------- -- -------- -------- -------- ---- ------
  IEFAUTOSSC42     06 00060035 SC42     ALLOCAS  000F ACTIVE
  IEFAUTOSSC43     09 00090005 SC43     ALLOCAS  000F ACTIVE
  IEFAUTOSSC47     01 0001017B SC47     ALLOCAS  000F ACTIVE
  IEFAUTOSSC49     08 00080006 SC49     ALLOCAS  000F ACTIVE
  IEFAUTOSSC50     05 0005003B SC50     ALLOCAS  000F ACTIVE
  IEFAUTOSSC52     02 00020045 SC52     ALLOCAS  000F ACTIVE
  IEFAUTOSSC53     04 00040033 SC53     ALLOCAS  000F ACTIVE
  IEFAUTOSSC54     07 00070028 SC54     ALLOCAS  000F ACTIVE
  IEFAUTOSSC55     03 0003003F SC55     ALLOCAS  000F ACTIVE

To remove the last pending change, we have to deallocate the IXC_DEFAULT_1 structure. This is achieved by stopping all PATHINs and PATHOUTs to this structure:


SETXCF STOP,PI,STRNM=IXC_DEFAULT_1
IXC467I STOPPING PATHIN STRUCTURE IXC_DEFAULT_1 041
RSN: OPERATOR REQUEST
IXC307I SETXCF STOP PATHIN REQUEST FOR STRUCTURE IXC_DEFAULT_1 042
COMPLETED SUCCESSFULLY
SETXCF STOP,PO,STRNM=IXC_DEFAULT_1
IXC467I STOPPING PATHOUT STRUCTURE IXC_DEFAULT_1 044
RSN: OPERATOR REQUEST
IXC307I SETXCF STOP PATHOUT REQUEST FOR STRUCTURE IXC_DEFAULT_1 045
COMPLETED SUCCESSFULLY
IXC307I STOP PATH REQUEST FOR STRUCTURE IXC_DEFAULT_1 046
COMPLETED SUCCESSFULLY: NOT DEFINED AS PATHOUT OR PATHIN
IXC513I COMPLETED POLICY CHANGE FOR CFRM. 047
TESTPK POLICY IS ACTIVE.

DISPLAY XCF,CF
IXC361I 10.21.18 DISPLAY XCF 049
CFNAME   COUPLING FACILITY
CF01     009672.IBM.02.000000040104
         PARTITION: 1  CPCID: 00
CF02     009672.IBM.02.000000040104
         PARTITION: 1  CPCID: 01
D XCF,STR
IXC359I 10.21.25 DISPLAY XCF 051
STRNAME          ALLOCATION TIME     STATUS
IEFAUTOS         10/17/95 10:19:14   ALLOCATED
IRRXCF00_B001    10/16/95 16:55:19   ALLOCATED
IRRXCF00_P001    10/16/95 16:55:18   ALLOCATED
ISTGENERIC       10/12/95 14:35:55   ALLOCATED
JES2CKPT_1       10/12/95 14:46:43   ALLOCATED
SYSTEM_LOGREC    10/16/95 12:52:08   ALLOCATED
SYSTEM_OPERLOG   10/12/95 14:36:31   ALLOCATED
TEST             --                  NOT ALLOCATED
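When it is not obvious which definitions caused the pending changes, the administrative policies stored in the CFRM couple data set can be listed and compared with a report-only run of the Administrative Data Utility. A minimal sketch, patterned on the JCL in Figure 70 (the job name and job card values are illustrative):

//CFRMRPT  JOB (999,POK),L06R,CLASS=A,MSGCLASS=T
//STEP1    EXEC PGM=IXCMIAPU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DATA TYPE(CFRM) REPORT(YES)
/*

Because no DEFINE statement is supplied, the utility only reports the policies currently installed in the CFRM couple data set.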

C.2 Changing the Coupling Facility Definition


In this example, the original active CFRM policy defined two coupling facilities, CF01 and CF02. We set up a new CFRM policy with only CF01 defined. Note that for this new policy to be accepted by the Administrative Data Utility, IXCMIAPU, all references to CF02 in the preference lists had to be removed as well. Failing to do so results in the utility not installing the new policy and ending with return code 12. Refer to Figure 71 on page 256 and Figure 72 on page 256. The exclusion lists have been left as they were before, as the involved exploiters are known to accept allocations deviating from the exclusion lists.


DEFINE POLICY NAME(CFRM07)
  CF NAME(CF01) DUMPSPACE(2048) PARTITION(1) CPCID(00)
     TYPE(009672) MFG(IBM) PLANT(02) SEQUENCE(000000040104)
  CF NAME(CF02) DUMPSPACE(2048) PARTITION(1) CPCID(01)
     TYPE(009672) MFG(IBM) PLANT(02) SEQUENCE(000000040104)
  STRUCTURE NAME(IEFAUTOS) SIZE(640) REBUILDPERCENT(20)
     PREFLIST(CF01, CF02)
  STRUCTURE NAME(IRRXCF00_B001) SIZE(332)
     PREFLIST(CF02, CF01) EXCLLIST(IRRXCF00_P001)
  STRUCTURE NAME(IRRXCF00_P001) SIZE(1644)
     PREFLIST(CF01, CF02) EXCLLIST(IRRXCF00_B001)
  STRUCTURE NAME(JES2CKPT_1) SIZE(4096) INITSIZE(2048)
     PREFLIST(CF02, CF01) EXCLLIST(JES2CKPT_2)
Figure 71. Original CFRM Policy

DEFINE POLICY NAME(TSTPK)
  CF NAME(CF01) DUMPSPACE(2048) PARTITION(1) CPCID(00)
     TYPE(009672) MFG(IBM) PLANT(02) SEQUENCE(000000040104)
  STRUCTURE NAME(IEFAUTOS) SIZE(640) REBUILDPERCENT(20)
     PREFLIST(CF01)
  STRUCTURE NAME(IRRXCF00_B001) SIZE(332)
     PREFLIST(CF01) EXCLLIST(IRRXCF00_P001)
  STRUCTURE NAME(IRRXCF00_P001) SIZE(1644)
     PREFLIST(CF01) EXCLLIST(IRRXCF00_B001)
  STRUCTURE NAME(JES2CKPT_1) SIZE(4096) INITSIZE(2048)
     PREFLIST(CF01) EXCLLIST(JES2CKPT_2)
Figure 72. New CFRM Policy

When starting the new policy, four pending changes are indicated against the structures and one pending change against the coupling facility itself.


SETXCF START,POL,TYPE=CFRM,POLNAME=TSTPK
IXC511I START ADMINISTRATIVE POLICY TSTPK FOR CFRM ACCEPTED
IXC512I POLICY CHANGE IN PROGRESS FOR CFRM 118
TO MAKE TSTPK POLICY ACTIVE.
8 POLICY CHANGE(S) PENDING.
D XCF,STR
IXC359I 17.01.41 DISPLAY XCF 120
STRNAME          ALLOCATION TIME     STATUS
IEFAUTOS         10/25/95 15:11:53   ALLOCATED  POLICY CHANGE PENDING
IRRXCF00_B001    10/17/95 10:44:14   ALLOCATED  POLICY CHANGE PENDING
IRRXCF00_P001    10/17/95 10:44:12   ALLOCATED  POLICY CHANGE PENDING
JES2CKPT_1       10/12/95 14:46:43   ALLOCATED  POLICY CHANGE PENDING
D XCF,CF
IXC361I 17.01.57 DISPLAY XCF 122
CFNAME   COUPLING FACILITY
CF01     009672.IBM.02.000000040104
         PARTITION: 1  CPCID: 00
CF02     009672.IBM.02.000000040104
         PARTITION: 1  CPCID: 01
         POLICY CHANGE PENDING

The pending changes are due to the following:

1. Changes to the structures already allocated in CF01 (the ongoing coupling facility): their preference lists have been modified so as not to mention CF02. This pertains to structures IEFAUTOS and IRRXCF00_P001.

2. Changes to the structures already allocated in CF02 (the outgoing coupling facility): their coupling facility is not defined in the new active policy. This pertains to structures IRRXCF00_B001 and JES2CKPT_1. Note, however, that all accesses to the structures in CF02 keep proceeding normally.

3. A change to coupling facility CF02 itself: it is no longer usable by the sysplex MVS images as a target of allocation or rebuild.

To resolve the pending changes:

The original policy can be re-installed and re-started. Or, if the new policy is to be kept as is:

1. The changes pending against the CF01 structures are resolved using the following command:

SETXCF START,REBUILD,CFNAME=CF01,LOC=NORMAL

This causes the new structure attributes (in this case, the new preference list) to be taken into account.

2. The changes pending against the structures in CF02 are resolved with the following command:

SETXCF START,REBUILD,CFNAME=CF02,LOC=NORMAL

This initiates the rebuild of all the structures currently in CF02 as per the active preference lists. That is, all the rebuilds are done into CF01.

This works for IRRXCF00_B001, but it does not work for JES2CKPT_1, since JES2 does not support structure rebuild. The JES2 checkpoint therefore has to be moved onto DASD or into CF01 using the JES2 checkpoint reconfiguration dialog (a command sketch follows this list). In either case, JES2CKPT_1 must eventually be deleted from CF02 using the following command:

SETXCF FORCE,STRUCTURE,STRNAME=JES2CKPT_1
3. When this is done, all the pending changes have been resolved, and a D XCF,CF will show CF01 as the only usable coupling facility for the sysplex members.
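As a hedged illustration of the JES2 route mentioned in step 2, the checkpoint reconfiguration dialog is normally entered with a JES2 command of the following form (the exact options and replies depend on the JES2 level and the CKPTDEF setup of the installation):

$T CKPTDEF,RECONFIG=YES

JES2 then prompts on the console for the new checkpoint placement, which can be a DASD data set or a coupling facility structure in CF01.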


Appendix D. Examples of Sysplex Partitioning


This appendix shows some examples of the SYSLOGs obtained when manually or automatically (with SFM) partitioning a system out of the sysplex.

D.1 Partitioning on Operator Request


The operator requests sysplex partitioning by using the following command:

V XCF,sysname,OFFLINE
Further operator intervention is required if there is no SFM policy active, as shown in Figure 73. The partitioning is handled entirely without operator intervention (the system being varied out of the sysplex is automatically isolated using the SFM isolate function) if both of the following are true:

- There is an SFM policy active.
- An operational system in the sysplex shares coupling facility connectivity with the system to be isolated (the coupling facility is the intermediary in forwarding the isolation signal to the target system).

V XCF,SC42,OFF
*007 IXC371D CONFIRM REQUEST TO VARY SYSTEM SC42 OFFLINE.
REPLY SYSNAME=SC42 TO REMOVE SC42 OR C TO CANCEL.
R 7,SYSNAME=SC42
IEE600I REPLY TO 007 IS;SYSNAME=SC42
IXC101I SYSPLEX PARTITIONING IN PROGRESS FOR SC42
........................
*008 IXC102A XCF IS WAITING FOR SYSTEM SC42 DEACTIVATION.
REPLY DOWN WHEN MVS ON SC42 IS DOWN.
R 8,DOWN
IEE600I REPLY TO 008 IS;DOWN
........................
ISG178E GLOBAL RESOURCE SERIALIZATION HAS BEEN DISRUPTED.
GLOBAL RESOURCE REQUESTORS WILL BE SUSPENDED.
IEA257I CONSOLE PARTITION CLEANUP IN PROGRESS FOR SYSTEM SC42.
ISG011I SYSTEM SC42 - BEING PURGED FROM GRS COMPLEX
ISG013I SYSTEM SC42 - PURGED FROM GRS COMPLEX
ISG173I SYSTEM SC43 RESTARTING GLOBAL RESOURCE SERIALIZATION.
IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR SC42 321
- PRIMARY REASON: OPERATOR VARY REQUEST
- REASON FLAGS: 000004
Figure 73. VARY OFF a System without SFM Policy Active


V XCF,SC42,OFFLINE
*006 IXC371D CONFIRM REQUEST TO VARY SYSTEM SC42 OFFLINE.
REPLY SYSNAME=SC42 TO REMOVE SC42 OR C TO CANCEL.
R 6,SYSNAME=SC42
IEE600I REPLY TO 006 IS;SYSNAME=SC42
....................................................
IXC101I SYSPLEX PARTITIONING IN PROGRESS FOR SC42
....................................................
IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR SC42 255
- PRIMARY REASON: OPERATOR VARY REQUEST
- REASON FLAGS: 000004
IEA258I CONSOLE PARTITION CLEANUP COMPLETE FOR SYSTEM SC42.
Figure 74. VARY OFF a System with an SFM Policy Active

D.2 System in Missing Status Update Condition


A system not updating its status in the sysplex couple data set or not sending any XCF signals for the duration of the INTERVAL specified in the COUPLExx member is a candidate to be partitioned out of the sysplex by XCF. XCF prompts the operator before partitioning if there is no SFM policy active or if the currently active policy has CONNFAIL(NO). See Figure 75. XCF automatically proceeds with partitioning if there is an active SFM policy with CONNFAIL(YES), or without CONNFAIL being specified. See Figure 76 on page 261.

*242 IXC402D SC42 LAST OPERATIVE AT 18:40:14. REPLY DOWN
IF MVS IS DOWN OR INTERVAL=SSSSS TO SET A REPROMPT TIME.
R 242,DOWN
IEE600I REPLY TO 242 IS;DOWN
IXC101I SYSPLEX PARTITIONING IN PROGRESS FOR SC42
ISG011I SYSTEM SC42 - BEING PURGED FROM GRS COMPLEX
ISG013I SYSTEM SC42 - PURGED FROM GRS COMPLEX
IEA257I CONSOLE PARTITION CLEANUP IN PROGRESS FOR SYSTEM SC42.
IEA258I CONSOLE PARTITION CLEANUP COMPLETE FOR SYSTEM SC42.
IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR SC42 109
- PRIMARY REASON: SYSTEM STATUS UPDATE MISSING
- REASON FLAGS: 000008
Figure 75. System in Missing Status Update Condition and No Active SFM Policy


IXC101I SYSPLEX PARTITIONING IN PROGRESS FOR SC42
IEA257I CONSOLE PARTITION CLEANUP IN PROGRESS FOR SYSTEM SC42.
IEA258I CONSOLE PARTITION CLEANUP COMPLETE FOR SYSTEM SC42.
IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR SC42 879
- PRIMARY REASON: SYSTEM REMOVED BY SYSPLEX FAILURE MANAGEMENT
  BECAUSE ITS STATUS UPDATE WAS MISSING
- REASON FLAGS: 000100
Figure 76. System in Missing Status Update with an Active SFM Policy and CONNFAIL(YES)
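The automatic behavior shown in Figure 74 and Figure 76 requires an SFM policy to be defined and started. A minimal sketch of such a policy, defined with the Administrative Data Utility, is shown below; the policy name, job card and parameter values are illustrative assumptions, not taken from the residency systems used in this book:

//DEFSFM   JOB (999,POK),L06R,CLASS=A,MSGCLASS=T
//STEP1    EXEC PGM=IXCMIAPU
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DATA TYPE(SFM)
  DEFINE POLICY NAME(SFMPOL01) CONNFAIL(YES)
    SYSTEM NAME(*) ISOLATETIME(0)
/*

The policy is then started with SETXCF START,POL,TYPE=SFM,POLNAME=SFMPOL01.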



Appendix E. Spin Loop Recovery


This appendix provides details on how spin loop recovery is handled by MVS. MVS will perform a SPIN as the first action taken when an excessive spin condition occurs. On subsequent expirations of the SPINTIME interval, MVS will escalate through the actions specified by SPINRCVY until the spin loop is resolved. The default SPINRCVY actions taken after SPIN are:

ABEND, TERM, ACR


Specifying SPINRCVY = ABEND,TERM,ACR will result in MVS escalating through the following recovery actions when an excessive spin condition occurs:

SPIN, ABEND, TERM, ACR


Specifying SPINRCVY = SPIN,ABEND,TERM,ACR will result in MVS escalating through the following recovery actions when an excessive spin condition occurs:

SPIN, SPIN, ABEND, TERM, ACR


Given the fact that MVS enforces a first action of SPIN, specifying a SPINRCVY action of SPIN is not recommended. In a sysplex it is important to resolve a spin loop as quickly as possible. Sysplex customers should set the following recovery actions:

TERM, ACR
Specifying TERM as the first disruptive action will ABEND the failing work unit. The work unit's recovery will be given control; however, the recovery routine will not be allowed to retry. The terminating abend thereby eliminates the possibility of the recovery routine retrying back into misbehaving code. In a sysplex, specifying a SPINRCVY action of OPER is not recommended, because an operator may not respond quickly enough to prevent the remaining systems in the sysplex from partitioning the ailing system out of the sysplex.

An example of a spin loop occurring on an MVS system and being resolved is shown in Figure 77 on page 264. SYSA is in a sysplex with systems SYSB and SYSC. SYSA is running in a shared logical partition with SPINTIME=20 and SPINRCVY=TERM,ACR. At time minus 10, CPU 1 gets into a never-ending disabled loop. At time 0, CPU 0 tries to signal CPU 1 but gets no response. CPU 0 enters a spin loop waiting for CPU 1 to enable.


Time in
seconds   CPU 0 (on MVS system SYSA)                CPU 1 (on MVS system SYSA)

 -10                                                Enters a never-ending
                                                    disabled loop.
 -7       Updates the couple data set (CDS)         Looping
          with status.
 -4       Updates the CDS with status.              Looping
 -1       Updates the CDS with status.              Looping
  0       SIGP to CPU 1; no response. Enters a      Looping
          spin loop waiting for a response from
          CPU 1.
 20       SPINTIME timeout - an excessive spin      Writes ABEND071-10 to
          condition is declared. IEE178I is         LOGREC, then redispatches
          written to SYSLOG and a SIGP RESTART      the interrupted (looping)
          is sent to CPU 1. The recovery action     program.
          is to continue the SPIN.
 40       An excessive spin condition is declared   Looping
          again; the SPINRCVY action is to TERM
          the work on CPU 1.
 41       SIGP RESTART is sent to CPU 1.            ABEND071-30; the disabled
                                                    loop is terminated.
 42       The response to the SIGP issued at        Responds to the SIGP.
          time 0 is received; CPU 0 leaves its
          spin loop and updates the CDS with
          status.

Figure 77. Resolution of a Spin Loop Condition

In this figure, after waiting 20 seconds for CPU 1 to enable, CPU 0 declared an excessive spin condition. MVS's response to the excessive spin condition at time 20 was to collect diagnostic data on CPU 1 and to have the spinning routine on CPU 0 repeat the SPIN. After waiting an additional 20 seconds (time 40) for CPU 1 to enable, CPU 0 declared another excessive spin condition. At this point MVS selected the next excessive spin recovery action specified by SPINRCVY. The TERM action successfully ended the disabled loop on CPU 1 and resolved the spin loop on CPU 0.

In this example, there was a 43-second lapse of time (-1 through +42) between updates of SYSA's status in the sysplex couple data set. During this time, SYSA would appear to be dormant to systems SYSB and SYSC. If SYSA's failure detection interval was 40 seconds, SYSB or SYSC may have initiated a partitioning action against SYSA to remove it from the sysplex before SYSA had a chance to recover from the spin loop. It is therefore very important to choose the XCF failure detection interval carefully.

Note: It is possible for a spin loop to tie up multiple CPUs in an MP environment. If SYSA had 10 engines, a spin loop on one CP could tie up all ten CPs and make the MVS image appear dormant to other systems in the sysplex.
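The SPINTIME and SPINRCVY values used in this appendix are specified in the EXSPATxx member of SYS1.PARMLIB. A minimal sketch of such a member, assuming the recommended actions and the 20-second interval used in the example (both values should be adjusted to the installation), is:

SPINTIME=20
SPINRCVY=TERM,ACR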



Appendix F. Dynamic I/O Reconfiguration Procedures


This appendix describes how to ensure that the system is dynamic I/O capable and how to size the related HSA storage.

F.1 Procedure to Make the System Dynamic I/O Capable


A system can be capable of dynamic I/O even if none of the devices are defined as dynamic. The first dynamic change operation could be to add new devices, or to change a device defined as static to dynamic. A system is capable of dynamic I/O definition if MVS is IPLed using an IODF file matching the IOCDS used at power-on reset of the processor. HCD must be used to prepare the IODF. The system must also be enabled for dynamic I/O; the setting of this option is done differently on the ES/9000 and the 9672.

1. First of all, prepare an IODF file with the hardware and software configuration. If desired, some devices can be defined as dynamic. The following is the HCD panel where you can configure them:

View Device Parameter / Feature Definition
Command ===> __________________________________________  Scroll ===>

Configuration ID . : MVSW1
Device number  . . : 018B          Device type  . . . : 3390
Generic / VM device type . . . . : 3390

ENTER to continue.

Parameter/
Feature    Value  Req.  Description
OFFLINE    No           Device considered online or offline at IPL
DYNAMIC    Yes    <---- Device supports dynamic configuration
ALTCTRL    No           Separate physical control unit path
SHARED     Yes          Device shared with other systems
SHAREDUP   No           Shared when system physically partitioned

2. Now, depending on the type of processor, you have to follow different steps to make the hardware dynamically reconfigurable:

ES/9000

a. HCD is used to build the IOCDS from the IODF. As shown in Figure 78 on page 268, option 2.2 is used to create the IOCDS.


Activate or Process Configuration Data

Select one of the following tasks.  ---> 2

 1.  Build production I/O definition file
 2.  Build IOCDS
 3.  Build IOCP input data set
 4.  Create JES3 initialization stream data
 5.  View active configuration
 6.  Activate configuration dynamically
 7.  Activate configuration sysplex-wide
 8.  Activate switch configuration
 9.  Save switch configuration
10.  Build HCPRIO input data set
11.  Build and manage S/390 microprocessor IOCDSs and IPL attributes

Figure 78. HCD Panel

b. Hardware enablement of dynamic reconfiguration management is selected on the hardware console CONFIG frame, H=I/O DEFINITION selection.

H= I/O Definition
 1. Percent Expansion   Total : ______   Shared: ______
 2. Allow Modification

Figure 79. CONFIG Frame Fragment

c. Perform a power-on reset, and IPL from the same IODF. Note that the IODF is pointed to by the LOADxx member.

9672

a. As shown in Figure 80 on page 269, option 2.11 is used to build the IOCDS on a 9672.


Activate or Process Configuration Data

Select one of the following tasks.  ------> 11

 1.  Build production I/O definition file
 2.  Build IOCDS
 3.  Build IOCP input data set
 4.  Create JES3 initialization stream data
 5.  View active configuration
 6.  Activate configuration dynamically
 7.  Activate configuration sysplex-wide
 8.  Activate switch configuration
 9.  Save switch configuration
10.  Build HCPRIO input data set
11.  Build and manage S/390 microprocessor IOCDSs and IPL attributes

Figure 80. HCD Panel

You can also use HCD to create a stand-alone IOCP input deck on a diskette and load it via the HMC workstation. For the 9672 model 1, the HCD token will not be preserved and the resulting IOCDS file will not be dynamic capable. You should first POR and IPL without dynamic reconfiguration capability and then reload the IOCDS file through HCD to regain the dynamic capability. The 9672 models 2 and 3 are dynamic I/O capable from the first power-on reset.

b. Enable the dynamic I/O configuration option in the RESET profile to be used at the next POR, as shown in Figure 81 on page 270.


Figure 81. Dynamic I/O Customization

c. Perform a power-on reset, and IPL using the same IODF.

d. Update the RESET profile to indicate that the system should use the last active IOCDS. This ensures that the system will come back after a power off with the same IOCDS as was active when the system went down. Because you may have done some dynamic I/O activates from MVS, this may not be the same IOCDS as was used at the last POR.

3. IPL the machine using the same IODF file.

4. You can verify that both hardware and software are in sync by using the D IOS,CONFIG command and checking the results:

IOS506I 13.44.29 I/O CONFIG DATA 011
ACTIVE IODF DATA SET = SYS5.IODF23                    <-- SW Token
CONFIGURATION ID = MVSW1     EDT ID = 11
TOKEN: PROCESSOR DATE     TIME     DESCRIPTION
SOURCE: ITSO942A 95-09-21 11:54:18 SYS5     IODF23    <-- HW Token

F.2 Procedure for Dynamic Changes


1. Build a production IODF with the new device definitions.

2. Vary off the devices which will be modified or deleted (in all partitions).

3. Configure the channel paths to be modified offline (in all partitions).

4. You may perform an activation in TEST mode to ensure that there are no conditions that might inhibit a dynamic I/O reconfiguration.

5. Physically change the hardware configuration.


6. If you are running in LPAR mode, perform a software and hardware activation in the driving partition. For the software and hardware activation, specify YES for Allow hardware delete on the Activate New Hardware and Software Configuration panel. In all the other partitions, perform a software-only activation. If you are running in Basic mode, the hardware and software activation should be done once from the running MVS. (A command sketch of these activations follows this list.)

7. Configure the channel paths online.

8. Vary on the new devices.

9. If you have a specific IODF named in your LOADxx member, update it to point at the new IODF. We recommend you use ** as the IODF number in LOADxx, as this indicates that the system should use the IODF that matches the one currently active in the hardware.
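The activations in steps 4 and 6 can be driven from the HCD panels. Where the ACTIVATE operator command is used instead, a rough sketch is shown below; the IODF suffix (23) is taken from the earlier D IOS,CONFIG example and, like the exact keyword combinations, should be treated as illustrative rather than definitive:

ACTIVATE IODF=23,TEST     (test mode, no changes are made)
ACTIVATE IODF=23          (hardware and software activation, driving partition)
ACTIVATE IODF=23,SOFT     (software-only activation in the other partitions)

After the activation, D IOS,CONFIG can be used again to confirm that the software and hardware tokens match the new IODF.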

F.3 Hardware System Area Considerations


HSA storage is an important resource for dynamic reconfiguration. The system can absorb dynamic changes only as long as it finds available HSA storage to hold them. At POR time you are asked to specify a percentage of HSA to be kept free for dynamic changes. Once all the available HSA storage is used, you receive message IOS500I with reason code 0156, telling you that no more HSA is available to handle the additions; IOS also reports how many resources were being added and how many could have been added.

IOS500I ACTIVATE RESULTS
TEST DETECTED CONDITIONS WHICH WOULD RESULT IN ACTIVATE FAILURE
REASON=0156,NOT ENOUGH SPACE TO ACCOMMODATE HARDWARE CHANGES
DESCTEXT=NET # [SUB|CU|LCU] TO BE ADDED = xxxxxxxx,
         # [SUB|CU|LCU] AVAIL = yyy

There is not enough storage in the hardware system area (HSA) to store the
changes to the hardware I/O configuration. More subchannels (SUB), control
units (CU) or logical control units (LCU) must be available in the HSA to
store the changes to the hardware I/O configuration.

In the message text:
xxxxxxxx  The number of subchannels, CUs, or LCUs that the system is adding
          because of the configuration change.
yyy       The number of subchannels, CUs, or LCUs that are currently
          available in the HSA.
In this situation the system rejects the ACTIVATE request, and most of the time you will have to power-on reset the machine with a larger expansion factor for the HSA. Starting with MVS/ESA SP V5.1 there is an extension of the operator command D IOS,CONFIG to request HSA information relative to dynamic I/O. In this way you can monitor HSA growth and prevent unplanned outages caused by the machine no longer being dynamic capable due to a lack of HSA storage. The command D IOS,CONFIG(HSA) or D IOS,CONFIG(ALL) will provide the following as part of the IOS506I response:


xxx PHYSICAL CONTROL UNITS
xxx SUBCHANNELS FOR SHARED CHANNEL PATHS
xxx SUBCHANNELS FOR UNSHARED CHANNEL PATHS
xxx LOGICAL CONTROL UNITS FOR SHARED CHANNEL PATHS
xxx LOGICAL CONTROL UNITS FOR UNSHARED CHANNEL PATHS

where xxx indicates the available number.

F.4 Hardware System Area Expansion Factors


Two values must be specified: Percent Expansion and Percent Expansion for Shared devices. If dynamic reconfiguration management is to occur, a value must be specified in the Percent Expansion selection at POR time or in the Dynamic I/O panel of the Reset Profile; how much dynamic change can be accommodated is controlled by these expansion factors.

In order to dynamically change (add devices to) the hardware configuration, storage must be set aside in the HSA at POR time to accommodate the additional channel subsystem control blocks (primarily subchannels) that will be added as a result of a dynamic change to the configuration. The indication of how many additional subchannels space should be set aside for is specified either on the hardware system console CONFIG frame, H=I/O DEFINITION selection, for an ES/9000 processor, or in the Reset Profile for a S/390 microprocessor. With the advent of EMIF (ESCON Multiple Image Facility), subchannels must also be reserved as either shared or nonshared subchannels. To accommodate both nonshared and shared subchannels, two values can be specified:

Percent Expansion TOTAL   Specifies the percentage of additional subchannels for which space is to be reserved in the HSA. The percentage specified is applied at POR time to the number of subchannels contained in the IOCDS file. A value of zero (0) is allowed and will satisfy the requirement that a value must be specified in this field if a dynamic change is to occur. However, a value of 0 will not allow any dynamic additions, and may impact dynamic deletions. For this reason, a value of 0 is not recommended.

Percent Expansion SHARED   Specifies the number of additional shared subchannels for which space is to be set aside in the HSA. The number of additional shared subchannels is specified as a percentage of the number set aside for total devices. The value specified can be from 0 to 100.

Since the size of the HSA may be affected by the total number of subchannels to be supported by a specific I/O configuration plus any planned dynamic expansion, an understanding of how the expansion values are used in determining the total number of subchannels is necessary in order to determine the potential size of the HSA. Whether the actual HSA size increases, or whether the additional subchannels can be accommodated within an existing increment of HSA, depends on the specific processor family and the granularity of its HSA allocation. What follows is a brief example showing how the percent expansion TOTAL and SHARED values are used to determine the number of additional subchannels to reserve space for in the HSA in a dynamic I/O EMIF environment.


The example applies to a 9021-711 based processor. Along with the CONFIG frame Percent Expansion values, information from the HCD Device Detail Report, or from the end of the IOCP IODEVICE report, as well as the total number of logical partitions defined in the IOCDS, is required to determine the resulting total number of subchannels. The following process can be used to determine the number of nonshared and shared subchannels to be added, as well as the total number of subchannels to be accommodated in the HSA, based on the user-specified percent expansion TOTAL and SHARED values. In this example, the user plans to increase the number of subchannels three-fold, and of that number one-half will be shared subchannels. To accomplish this, a percent expansion TOTAL value of 300 and a percent expansion SHARED value of 50 would be specified.

Information from the HCD Device Detail Report:

   TOTALS FOR CHPIDS, SUBCHANNELS, AND CONTROL UNITS
                               Non-     Additional          HSA
                      Shared   shared   generated   Total   Total
                      ------   -------  ----------  -----   -----
   CHPIDs                 28        96        n/a     124     n/a
   Physical Control
   Units                   8       271        n/a     279     n/a
   Subchannels           320      1938       1332    3590    4870
   Logical Control
   Units                   5       172         70     247     267

CONFIG frame H values:

   H= I/O Definition
    1. Percent Expansion   Total: 300   Shared: 50

Total number of logical partitions is 5.

Determine the expansion value:
   IOCDS total subchannels x expansion factor
   3590 x 3.00 = 10770
   (This number plus the IOCDS total, 3590 in the above calculation, cannot
   exceed the maximum subchannels in an IOCDS supported by the processor.)

Determine the shared portion:
   Expansion value x shared factor
   10770 x .5 = 5385

Determine the nonshared portion:
   Expansion value minus shared portion
   10770 - 5385 = 5385

Determine the new HSA total:
   a. HSA total              4870   (from the IOCDS I/O Device Report or
                                     the HCD Device Detail Report)
   b. Nonshared addition     5385   (nonshared portion from above)
   c. Shared addition       26925   (shared portion from above x total
                                     number of partitions)
                            -----
   d. New HSA total         37180


Glossary
The following terms and abbreviations are defined as they are used in this book. For terms that do not appear in this glossary see the IBM Vocabulary for Data Processing, Telecommunications, and Office Systems, GC20-1699 , or the glossaries of related publications. The following cross-references are used in this glossary: Contrast with. This refers to a term that has an opposed or substantively different meaning. Deprecated term for. This indicates that the term should not be used. It refers to a preferred term, which is defined in the glossary. See. This refers the reader to multiple-word terms in which this term appears. See also. This refers the reader to terms that have a related, but not synonymous, meaning. Synonym for. This indicates that the term has the same meaning as a preferred term, which is defined in the glossary. Synonymous with. This is a backward reference from a defined term to all other terms that have the same meaning.

A
abend . Abnormal end of task: termination of a task prior to its completion because of an error condition that cannot be resolved by recovery facilities while the task is executing. AC . Alternating current. ACF . See advanced communications function . ACS . See automatic class selection .

activate . To load the contents of an SCDS into SMS address space storage and into an ACDS , or to load the contents of an existing ACDS into SMS address space storage. This establishes a new storage management policy for the SMS complex. active configuration . In an Enterprise Systems Connection Director (ESCD), the configuration determined by the status of the currently active set of connectivity attributes. Contrast with saved configuration . active control data set (ACDS) . A VSAM linear data set that contains a copy of the most recently activated configuration ( SCDS ) and subsequent updates. All systems in an SMS complex use the ACDS to manage storage. adapter . (1) A general term for a device that provides some transitional function between two or more devices. (2) In an Enterprise Systems Connection link environment, hardware used to join different connector types. address . (1) A value that identifies a register, a particular part of storage, a data source, or a data sink. The value is represented by one or more characters. (2) To refer to a device or an item of data by its address. (3) The location in the storage of a computer where data is stored. (4) In data communication, the unique code assigned to each device or workstation connected to a network. (5) The identifier of a location, source, or destination.

address space . In ESA/390, the range of virtual storage addresses that provide each user with a unique address space and which maintains the distinction between programs and data within that space. advanced communications functions (ACF) . A group of IBM program products (ACF/VTAM, ACF/NCP, and more) that uses the concepts of SNA. AIX . Advanced Interactive Executive

alert . A unit of information, usually indicating the loss of a system resource, passed from one machine or program to a host to signal an error. alphanumeric . Consisting of both letters and numbers and often other symbols, such as punctuation marks and mathematical symbols. APA . All-points-addressable. APAR . See authorized program analysis report .

application . (1) The use to which an information processing system is put; for example, a payroll application, an airline reservation application, a network application. (2) A collection of software components used to perform specific types of work on a computer. asynchronous . Without regular time relationship. Unexpected or unpredictable with respect to the p r o g r a m s instructions, or to time. Contrast with synchronous. authorized program analysis report (APAR) . A report of a problem caused by a suspected defect in a current unaltered release of a program. automatic class selection (ACS) . A mechanism for assigning SMS classes and storage groups.


automatic dump . In DFHSM , the process of using DFDSS to automatically do a full volume dump of all allocated space on primary volumes to designated tape dump volumes. auxiliary storage . Data storage other than main storage; usually, a direct storage device. availability . For a storage subsystem, the degree to which a data set can be accessed when requested by a user.

central storage . Storage that is an integral part of the processor unit. Central storage includes both main storage and the hardware system area. channel (CHN) . (1) A path along which signals can be sent, for example, data channel, output channel. (A) (2) In the channel subsystem, each channel controls an I/O interface between the channel control element and the attached control units. channel-attached . Pertaining to attachment of devices directly by data channels (I/O channels) to a computer. Synonym for local . Contrast with telecommunication-attached. channel path . Is the physical medium by which a channel subsystem exchanges data with an I/O device in ESA/390 mode. A channel path can have byte or burst character, and up to eight paths can be assigned to a device from the same system. channel subsystem (CSS) . A collection of subchannels that directs the flow of information between I/O devices and main storage, relieves the processor of communication tasks, and does path management functions. CICS . CIM . class. See Customer Information Control System. See computer-integrated manufacturing. See SMS class .

B
backup . The process of copying data and storing it for use in case the original data is somehow damaged or destroyed. In DFHSM , the process of copying a data set residing on a level 0 volume, level 1 volume, or a volume not managed by DFHSM to a backup volume. See automatic backup and incremental backup . BASIC (beginners all-purpose symbolic instruction code) . An easy-to-use problem-solving language that lets you write programs in English-like statements. batch . Pertaining to a program or operation that is performed with little or no interaction between the user and the system. Contrast with interactive . block . A string of data elements recorded or transmitted as a unit. The element may be characters, words, or physical records. (T) bus . In a processor, a physical facility on which data is transferred to all destinations, but from which only addressed destinations may read in accordance with appropriate conventions. (I)

CLIST . A sequential list of commands and control statements assigned a single name; when the name is invoked the commands in the list are executed in sequential order. CLIST (command list) . A data set in which commands and possibly subcommands and data are stored for subsequent execution. cluster controller . A device that can control the input/output operations of more than one device connected to it. coax . Coaxial.

C
CA . (1) Channel address. (2) Communication adapter. (3) Common adapter. catalog . A data set that contains extensive information required to locate other data sets, to allocate and deallocate storage space, to verify the access authority of a program or operator, and to accumulate data set usage statistics. CDS . CDS . Configuration data set. See control data set .

column . A vertical arrangement of data. Contrast with r o w . command . (1) A request for system action. (2) A request from a terminal for the performance of an operation or the execution of a particular program. (3) A value sent on an I/O interface from a channel to a control unit that specifies the operation to be performed. common carrier . In data communication, any government-regulated company that provides communication services to the general public. complex . See SMS complex .

central processor . The electronic circuitry (and licensed internal code) responsible for the execution of the instructions that reside in main storage and constitute the operating system and the user applications.


component . (1) Hardware or software that is part of a functional unit. (2) A functional part of an operating system; for example, the scheduler or supervisor. computer-integrated manufacturing (CIM) . A strategy that encompasses the integration of information from engineering design, production, business systems, and the plant floor. config . (1) Configuration. (2) Configurator. (3) Configure. configuration . (1) The arrangement of a computer system or network as defined by the nature, number, and the chief characteristics of its functional units. More specifically, the term configuration may refer to a hardware configuration or a software configuration. (I) (A) (2) In an Enterprise Systems Connection Director (ESCD), the physical interconnection capability determined by a set of attributes. The attribute values specify the connectivity control status and identifiers associated with the ESCD and its ports. See also active configuration, configuration matrix, connectivity attributes, and saved configuration . configuration matrix . In an Enterprise Systems Connection Director (ESCD), an array of connectivity attributes, displayed in rows and columns, that can be used to alter both active and saved configurations. configure . To describe to the system the devices and optional features installed on the system. connected . In an Enterprise Systems Connection Director (ESCD), the attribute that, when set, establishes a dedicated connection. Contrast with disconnected . connection . In an Enterprise Systems Connection Director (ESCD), an association established between two ports that provides a physical communication path between them. connectivity . A term used to describe the physical interconnections of multiple devices/computers/networks employing similar or different technology and/or architecture together to accomplish effective communication between and among connected members involving data exchange and/or resource sharing. connectivity . Relationship that establishes the eligibility of a given system in an SMS complex to access a VIO storage group, a pool storage group, and the individual volumes within a pool storage group. The relationship can be NOTCON (not connected), indicating ineligibility, or any of the following, all of which imply eligibility: ENABLE , QUIALL (quiesce all), QUINEW (quiesce new), DISALL (disable all), DISNEW (disable new).

console . A logical device that is used for communication between the user and the system. See also service console . control data set (CDS) . With respect to SMS , a VSAM linear data set containing configurational, operational, or communication information. SMS introduces three types of control data sets: source control data set, active control data set, and communications data set. controller . A unit that controls input/output operations for one or more devices. control unit . A general term for any device that provides common functions for other devices or mechanisms. Synonym for controller . conversion . (1) In programming languages, the transformation between values that represent the same data item but belong to different data types. Information can be lost through conversion because accuracy of data representation varies among different data types. (2) The process of changing from one method of data processing to another or from one data processing system to another. (3) The process of changing from one form of representation to another; for example, to change from decimal representation to binary representation. CP . CS . (1) Control program. (2) Central processor. (1) Central storage. (2) Cycle steal.

CUA . Control unit address (channel, control unit, and device address). Customer Information Control System (CICS) . An IBM licensed program that enables transactions entered at remote terminals to be processed concurrently by user-written application programs. It includes facilities for building, using, and maintaining data bases. Customize . To change a data processing installation or network to meet the needs of particular users.

D
DASD . Direct access storage device. data class . A list of allocation attributes that the system uses for the creation of data sets. data stream . A continuous or concentrated flow of data bytes including control characters that will influence the processing of this string of bytes. data system . Refers to the storage and retrieval of data, its transmission to terminals, and controls to provide adequate protection and ensure proper usage.


default . Pertaining to an attribute, value, or option that is assumed when none is explicitly specified. device . A mechanical, electrical, or electronic contrivance with a specific purpose. direct access storage device (DASD) . A device in which access time is effectively independent of the location of the data. disconnected . In an Enterprise Systems Connection Director (ESCD), the attribute that, when set, removes a dedicated connection. Contrast with connected . diskette . A flexible magnetic disk enclosed in a protective container. display . See display device, display image, and display screen . display device . A device that presents information on a screen. See also display screen . display image . Information, pictures, or illustrations that appear on a display screen. See also display device . display screen . The surface of a display device on which information is presented to a user. See also display image . duplex . Pertaining to communication in which data can be sent and received at the same time. Synonymous with full duplex . dynamic . Pertaining to an operation that occurs at the time it is needed rather than at a predetermined or fixed time.

esoteric name . A name used to define a group of devices having similar hardware characteristics, such as TAPE or SYSDA . See generic name . event . (1) An occurrence or happening. (2) A n occurrence of significance to a task; for example, the completion of an asynchronous operation, such as an input/output operation. expanded storage . (1) Optional integrated high-speed storage that transfers 4K-byte pages to and from central storage. (2) Additional (optional) storage that is addressable by the system control program. Expanded storage improves system response and system performance. (3) All storage above 256MB. Storage between 64MB and 256MB can be partitioned between central storage and expanded storage. extent . A continuous space on a DASD volume occupied by a data set or portion of a data set.

F
FC . Feature code. feature . A part of an IBM product that can be ordered separately by the customer. FORTRAN (formula translation). . A mathematically oriented high-level programming language, useful for applications ranging from simple problem solving to large scale numeric systems using optimization techniques. frame . (1) A housing for machine elements. (2) The hardware support structure, covers, and all electrical parts mounted therein that are packaged as one entity for shipping. (3) A formatted display. full duplex . Synonym for duplex .

E
element . A major part of a component (for example, the buffer control element) or a major part of the system (for example, the system control element). emulation . (1) The imitation of all or part of one system by another, primarily by hardware, so that the imitating system accepts the same data, executes the same programs, and achieves the same results as the imitated computer system. (2) The use of programming techniques and special machine features to allow a computing system to execute programs written for another system. (3) Imitation; for example, imitation of a computer or device. (4) Contrast with simulation . end user . A person in a data processing installation who requires the services provided by the computer system. ESA/390 . ESCON . Enterprise Systems Architecture/390. Enterprise Systems Connection.

G
GDDM . Graphical Data Display Manager. generic name . A name assigned to a class of devices (such as 3380) that is derived from the IODEVICE statement in the MVS configuration program. See esoteric name .

H
hardware system area (HSA) . A logical area of central storage that is used to store Licensed Internal Code and control information (not addressable by application programs). host (computer) . (1) In a computer network, a computer that provides end users with services such

278

Continuous Availability with PTS

as computation and data bases and that usually performs network control functions. (2) The primary or controlling computer in a multiple-computer installation. HSA . See hardware system area .

I
ID . Identifier. identifier (ID) . (1) One or more characters used to identify or name a data element and possibly to show certain properties of that data element. (2) In an Enterprise Systems Connection Director (ESCD), a user-defined symbolic name of 24 characters or less that identifies a particular ESCD. See also password identifier and port address name . incremental backup . In DFHSM , the process of copying a data set that has been opened for other than read-only access since the last backup version was created, and that has met the backup frequency criteria. initialization . Preparation of a system, device, or program for operation. initialize . To set counters, switches, addresses, or storage contents to zero or other starting values at the beginning of, or at prescribed points in, the operation of a computer routine. (A) initial program load (IPL) . The initialization procedure that causes an operating system to start operation. input/output (I/O) . (1) Pertaining to a device whose parts can perform an input process and an output process at the same time. (I) (2) Pertaining to a functional unit or channel involved in an input process, output process, or both, concurrently or not, and to the data involved in such a process. input/output configuration data set (IOCDS) . A configuration definition built by the I/O Configuration Program (IOCP) and stored on disk files associated with the processor controller. Input/output configuration program (IOCP) . The program that defines the I/O configuration data required by the processor complex to control I/O requests. intelligent printer data stream (IPDS) . A type of printer control that allows you to present text, raster images, vector graphics, bar codes, and previously stored overlays at any point on a page. interactive . Pertaining to a program or system that alternately accepts input and then responds. A n interactive system is conversational; that is, a continuous dialog exists between user and system. Contrast with batch .

interface . (1) A shared boundary between two functional units, defined by functional characteristics, common physical interconnection characteristics, signal characteristics, and other characteristics as appropriate. (2) A shared boundary. An interface can be a hardware component to link two devices or a portion of storage or registers accessed by two or more computer programs. (3) Hardware, software, or both, that links systems, programs, or devices. I/O . Input/output.

IOCDS . See input/output configuration data set.

I/O configuration . The collection of channel paths, control units, and I/O devices that attaches to the processor unit. IOCP . See input/output configuration program. IPDS . See intelligent printer data stream. IPL . See initial program load.

J
JES (Job Entry Subsystem) . A system facility for spooling, job queuing, and managing I/O. job . A unit of work to be done by a system. May consist of more than one program. job control language (JCL) . A problem-oriented language used to express statements in a job that identify the job or describe its requirements to an operating system.

L
LAN . See local area network. LIC . Licensed Internal Code.

link problem determination aid (LPDA) . A series of test commands executed by IBM DCE to determine which of various network components may be causing an error in the network. local . Pertaining to a device accessed directly without use of a telecommunication line. Synonym for channel-attached . Contrast with remote . local area network (LAN) . A data network located on the user's premises in which serial transmission is used for direct data communication among data stations. (T) It services a facility without the use of common carrier facilities. log . To record; for example, to log error information onto the system disk.



logically partitioned mode (LPAR) . A mode that allows the operator to allocate hardware resources of the processor unit among several logical partitions. logical partition . In LPAR mode, a subset of the processor unit resources that is defined to support the operation of a system control program (SCP). logical unit (LU) . In SNA, a port to the network through which an end user can communicate with another end user. loop . (1) A sequence of instructions processed repeatedly while a certain condition prevails. (2) A closed unidirectional signal path connecting input/output devices to a network. LPAR . See logically partitioned mode. LPDA . See link problem determination aid.

mode . In any cavity or transmission line, one of those electromagnetic field distributions that satisfies Maxwell's equations and the boundary conditions. The field pattern of a mode depends on wavelength, refractive index, and cavity or waveguide geometry. (A) multidrop (network) . A network configuration in which there are one or more intermediate nodes on the path between a central node and an endpoint node. multiple preferred guests . A VM/XA facility that, with the Processor Resource/Systems Manager (PR/SM), supports up to six preferred virtual machines. See also preferred virtual machine . multiplexing . In data transmission, a function that permits two or more data sources to share a common transmission medium so that each data source has its own channel. MVS . Multiple Virtual Storage, consisting of MVS/System Product Version 1 and the MVS/370 Data Facility Product operating on a System/370 processor. See also MVS/XA . MVS/SP . Multiple Virtual Storage/System Product.

M
main storage . A logical entity that represents the program addressable portion of central storage. See also central storage. All user programs are executed in main storage. management class . A list of data set migration, backup, and retention attributes that DFHSM uses to manage storage at the data set level. MAP (manufacturing automation protocol) . A communication protocol used mainly to communicate between electronic equipment associated with the manufacturing process. master catalog . A catalog that points to user catalogs. See catalog . Mb . Megabit. MB . Megabyte; 1 048 576 bytes.

MVS/XA . Multiple Virtual Storage/Extended Architecture, consisting of MVS/System Product Version 2 and the MVS/XA Data Facility Product, operating on a System/370 processor in the System/370 extended architecture mode. MVS/XA allows virtual storage addressing to 2 gigabytes. See also MVS .

N
NetView . An IBM licensed program used to monitor a network, manage it, and diagnose its problems. network . An arrangement of programs and devices connected for sending and receiving information. node . A junction point in a network, represented by one or more physical units.

medium . A physical carrier of electrical or optical energy.

megabit (Mb) . A unit of measure for throughput. 1 megabit = 1 048 576 bits. megabyte (MB) . (1) A unit of measure for storage size. One megabyte equals 1 048 576 bytes. (2) Loosely, one million bytes. migration . In DFHSM , the process of moving a cataloged data set from a primary volume to a migration level 1 volume or migration level 2 volume, from a migration level 1 volume to a migration level 2 volume, or from a volume not managed by DFHSM to a migration level 1 or migration level 2 volume.

O
office system . A set of applications that provide support in areas like decision support, text services, electronic mail, data base access, and professional support. They integrate text, data, graphics, and image processing. offline . Not controlled directly by, or not communicating with, a computer. Contrast with online. offload . To move data or programs out of storage.



online . Being controlled directly by, or directly communicating with a computer. Contrast with offline. online . Pertaining to equipment, devices, or data under the direct control of the processor. operating system (OS) . Software that controls the execution of programs. An operating system may provide services such as resource allocation, scheduling, input/output control, and data management. (I) (A) Note: Although operating systems are predominantly software, partial or complete hardware implementations are possible.

PICK . An operating system made by PICK Systems for various applications written for asynchronous machines. pool . See storage pool . POR . See power-on reset.

port . (1) An access point for data entry or exit. (2) A connector on a device to which cables for other devices such as display stations and printers are attached. port address name . In an Enterprise Systems Connection Director (ESCD), a user-defined symbolic name of 24 characters or less that identifies a particular port. power-on reset . The state of the machine after a logical power-on before the control program is IPLed. preferred virtual machine . A virtual machine that runs in the V = R area. The control program gives this virtual machine preferred treatment in the areas of performance, processor assignment, and I/O interrupt handling. See also multiple preferred guests . processor controller element (PCE) . Hardware that provides support and diagnostic functions for the processor unit. The processor controller communicates with the processor unit through the logic service adapter and the logic support stations, and with the power supplies through the power thermal controller. It includes: primary support processor (PSP), initial power controller (IPC), input/output support processor (IOSP), and the control panel assembly. Processor Resource/Systems Manager (PR/SM) . A function that allows the processor unit to operate several system control programs (SCPs) simultaneously in LPAR mode. It provides for logical partitioning of the real machine and support of multiple preferred guests. See also multiple preferred guests . processor storage. . (1) The storage in a processing unit. (2) In virtual storage systems, synonymous with real storage. profile . Data that describes the significant characteristics of a user, a group of users, or one or more computer resources. PROFS . Professional office system.

option . (1) A specification in a statement, a selection from a menu, or a setting of a switch, that can be used to influence the execution of a program. (2) A hardware or software function that can be selected or enabled as part of a configuration process. (3) A piece of hardware (such as a network adapter) that can be installed in a device to modify or enhance device function. OS . See operating system.

P
page . In a virtual storage system, a fixed-length block that has a virtual address and is transferred as a unit between real storage and auxiliary storage. (I) (A) parallel channel . A data path along which a group of signals representing a character or any other entity of data can be sent simultaneously. parameter . (1) A variable that is given a constant value for a specified application and that can denote the application. (2) An item in a menu for which the user specifies a value or for which the system provides a value when the menu is interpreted. (3) Data passed between programs or procedures. Pascal . A high-level programming language that is effective for system development and technical problem solving. password identifier . In an Enterprise Systems Connection Director (ESCD), a user-defined symbolic name of 24 characters or less that identifies the password user. path . In a network, a route between any two nodes. PCE . Processor controller element.

performance . For a storage subsystem, a measurement of effective data processing speed against the amount of resource that is consumed by a complex. Performance is largely determined by throughput, response time, and system availability.

program temporary fix (PTF) . A temporary solution or by-pass of a problem diagnosed by IBM as resulting from an error in a current unaltered release of the program. protocol . (1) A set of semantic and syntactic rules that determines the behavior of functional units in achieving communication. (2) In SNA, the meanings of and the sequencing rules for requests and responses used for managing the network, transferring data, and synchronizing the states of network components. (3) A specification for the format and relative timing of information exchanged between communicating parties. PR/SM . See Processor Resource/Systems Manager. ps . Picosecond. PTF . See program temporary fix.

session . In SNA, a logical connection between two network addressable units (NAUs) that can be activated, tailored to provide various protocols, and deactivated as requested. SMS . See storage management subsystem

SMS class . A list of attributes that SMS applies to data sets having similar allocation (data class), performance (storage class), or availability (management class) needs. SNA . See Systems Network Architecture.

SQL/DS . Structured query language/data system.

standard . Something established by authority, custom, or general consent as a model or example. station . (1) An input or output point of a system that uses telecommunication facilities; for example, one or more systems, computers, terminals, devices, and associated programs at a particular location that can send or receive data over a telecommunication line. (2) A location in a device at which an operation is performed; for example, a read station. (3) In SNA, a link station. storage . A unit into which recorded text can be entered, in which it can be retained and processed, and from which it can be retrieved. storage class . A list of storage performance and availability service requests. storage group . VIO , a list of real DASD volumes, or a list of serial numbers of volumes that no longer reside on a system but that end users continue to reference in their JCL . storage management subsystem (SMS) . An operating environment that helps automate and centralize the management of storage. To manage storage, SMS provides the storage administrator with control over data class, storage class, management class, storage group, and ACS routine definitions. storage pool . A predefined set of DASD volumes used to store groups of logically related data according to user requirements for service or according to storage management tools and techniques. subchannel . The channel facility required for sustaining a single I/O operation. subchannel (SCH) . In ESA/370 mode, a group of contiguous words in the hardware system area that provides all of the information necessary to initiate, control, and complete an I/O operation. subsystem . A secondary or subordinate system, or programming support, usually capable of operating independently of or asynchronously with a controlling system.

R
real time . Pertains to the actual time during which a physical process transpires. remote . Pertaining to a system, program, or device that is accessed through a telecommunication line. Contrast with local. request for price quotation (RPQ) . A custom feature for a product.

RETAIN . Remote technical assistance and information network. row . A horizontal arrangement of data. Contrast with column . RPQ . See request for price quotation.

S
SAA . See Systems Application Architecture. saved configuration . In an Enterprise Systems Connection Director (ESCD), a stored set of connectivity attributes whose values determine an ESCD configuration that can be used to replace all or part of the configuration currently active. Contrast with active configuration . SCP . System control programming. SEC . System engineering change.

service console . A logical device used by service representatives to maintain the processor unit and to isolate failing field replaceable units. The service console can be assigned to any of the physical displays attached to the input/output support processor.



synchronous . (1) Pertaining to two or more processes that depend on the occurrences of a specific event, such as common timing signal. (2) Occurring with a regular or predictable time relationship. system . (1) The processor unit and all attached and configured I/O and communication devices. (2) In information processing, a collection of machines, programs, and methods organized to accomplish a set of specific functions. system control programming (SCP) . IBM-supplied programming that is fundamental to the operation and maintenance of the system. It serves as an interface with licensed programs. system-managed storage . An approach to storage management in which the system determines data placement and an automatic data manager handles data backup, movement, space, and security. system reset (SYSRESET) . To reinitialize the execution of a program by repeating the initial program load (IPL) operation. Systems Application Architecture (SAA) . An architecture developed by IBM that consists of a set of selected software interfaces, conventions, and protocols, and that serves as a common framework for application development, portability, and use across different IBM hardware systems. Systems Network Architecture (SNA) . The description of the logical structure, formats, protocols, and operational sequences for transmitting information units through, and controlling the configuration and operation of, networks. S/370 . System/370 mode.

token . A sequence of bits passed from one device to another on the token-ring network that signifies permission to transmit over the network. It consists of a starting delimiter, an access control field, and an end delimiter. The access control field contains a bit that indicates to a receiving device that the token is ready to accept information. If a device has data to send along the network, it appends the data to the token. When data is appended, the token then becomes a frame. Token-Ring . A network with a ring topology that passes tokens from one attaching device (node) to another, complying with the IEEE 802.5 standard. A node that is ready to send can capture a token and insert data for transmission. topology . The geometric configuration of connected units.

track . A portion of a disk that is accessible to a given read/write head position. transmission control protocol/internet protocol (TCP/IP) . A public domain networking protocol with standards maintained by US Department of Defense to allow unlike vendor systems to communicate.

U
upgrade . To add features to a system. user ID . A predefined set of one to eight characters that uniquely identifies a user to the system.

V
V = R. Virtual equals real. virtual machine (VM) . (1) A functional simulation of a computer and its associated devices. Each virtual machine is controlled by a suitable operating system. VM/370 controls concurrent execution of multiple virtual machines on a single System/370. (2) In VM, a functional simulation of either a System/370 computing system or a System/370-Extended Architecture computing system. Each virtual machine is controlled by an operating system. VM controls concurrent execution of multiple virtual machines on a single system. virtual storage (VS) . (1) The storage space that can be regarded as addressable main storage by the user of a computer system in which virtual addresses are mapped into real addresses. The size of virtual storage is limited by the addressing scheme of the computer system and by the amount of auxiliary storage available, not by the actual number of main storage locations. (2) Addressable space that is apparent to the user as the processor storage space, from which the instructions and the data are mapped into the processor storage locations.

S/390 . System/390. Any ES/9000 system including its associated I/O devices and operating system(s).

T
table . Information presented in rows and columns. TCP/IP . See transmission control protocol/internet protocol. telecommunication-attached . Pertaining to the attachment of devices by teleprocessing lines to a host processor. Synonym for remote . Contrast with channel-attached . terminal . In data communication, a device, usually equipped with a keyboard and display device, that can send and receive information.


virtual telecommunication access method (VTAM) . This program provides for workstation and network control. It is the basis of a Systems Network Architecture (SNA) network. It supports SNA and certain non-SNA terminals. VTAM supports the concurrent execution of multiple telecommunications applications and controls communication among devices in both single-processor and multiple-processor networks. VM . See virtual machine.

VTAM . See virtual telecommunications access method.

W
wait . The condition of a processing unit when all operations are suspended. wide area network . A network that provides communication services to a geographic area larger than that served by a local area network. workstation . (1) An I/O device that allows either transmission of data or the reception of data (or both) from a host system, as needed to perform a job; for example, a display station or printer. (2) A configuration of I/O equipment at which an operator works. (3) A terminal or microcomputer, usually one connected to a mainframe or network, at which a user can perform tasks. write . To make a permanent or transient recording of data in a storage device or on a data medium.

VM/XA . Virtual Machine/Extended Architecture.

volume . A certain portion of data, together with its data carrier, that can be mounted on the system as a unit; for example, a tape reel or a disk pack. For DASD, a volume refers to the amount of space accessible by a single actuator. VSE (Virtual Storage Extended) . An operating system that is an extension of DOS/VS. A VSE system consists of a) a licensed VSE/Advanced Functions support and b) any IBM-supplied and user-written programs required to meet the data processing needs of a user. VSE and the hardware it controls form a complete data processing system. Its current version is called VSE/ESA. VSE/ESA (Virtual Storage Extended/Enterprise Systems Architecture) . The most advanced VSE system currently available.

X
XA . Extended architecture.



List of Abbreviations
ABEND  abnormal end
ACB  access method control block
ACF  advanced communications function (MVS-based software)
ACF/VTAM  advanced communications function for virtual telecommunications access method (MVS-based software)
ACR  alternate CPU recovery
AOC/MVS  automated operations control/multiple virtual storage (IBM)
AOR  application owning region
APAR  authorized program analysis report
APPC  advanced program-to-program communication
APPN  advanced peer-to-peer networking (IBM program product)
ARM  automatic restart manager component of MVS
ASID  address space identifier (MVS)
BBU  battery backup unit
BSDS  bootstrap dataset (DB2)
BTAM  basic telecommunications access method
BWO  Backup While Open (IBM DFHSM enhanced backup option)
CA  continuous availability
CD-ROM  (optically read) compact disk - read only memory
CDRMS  cross-domain resource managers
CDS  configuration data set
CEC  central electronics complex, synonym for CPC
CEMT  Master Terminal Transaction (CICS)
CF  coupling facility
CFC  Coupling Facility Channel
CFR  Coupling Facility Receiver
CFRM  Coupling Facility Resource Manager
CFS  Coupling Facility Sender
CHP  channel path
CHPID  channel path id
CI  control interval
CICS  customer information control system (IBM)
CICS/ESA  customer information control system/enterprise systems architecture (IBM)
CICSVR  CICS VSAM recovery (IBM program product, MVS or VSE)
CKD  count key data
CKPT  checkpoint
CLIST  command list
CMAS  CICS Managing address space
CMOS  complementary metal oxide semiconductor
CPC  central processing complex
CPU  central processing unit
CSD  CICS system definition
CSS  channel subsystem
CTC  channel to channel
CU  control unit
DASD  direct access storage device
DB  data base
DBCTL  Data Base Control Subsystem
DBMS  data base management system
DBRC  data base recovery control (IMS)
DCAF  distributed console access facility
DCCF  disabled console communications facility (MVS)
DCE  Distributed Computing Environment (OSF)
DDDEF  a JCL dynamic allocation statement for MVS
DEDB  data entry data base
DFDSS  data facility data set services (IBM software product)
DFHSM  data facility hierarchical storage manager
DFSMS  Data Facility Storage Management Subsystem (MVS and VM)
DFSMS/MVS  Data Facility Storage Management Subsystem/MVS
DFSORT  data facility sort (IBM program product)
DFW  DASD fast write
DL/I  data language 1
DSA  dynamic storage area
DSI  dynamic system interchange (JES3)
EDT  eligible devices table (MVS control block)
EMIF  ESCON multiple image facility
ENF  event notification facility
ENQ  enqueue
EOM  end of memory
ESCD  ESCON director (ES/9000)
ESCON  enterprise systems connection (architecture, IBM System/390)
ETO  Extended Terminal Option (IMS DC)
ETR  external time reference
FOR  file owning region
GB  gigabyte (10**9 bytes or 1,000,000,000 bytes)
GBP  group buffer pool (DB2)
GRS  global resource serialization (MVS)
HCD  hardware configuration definition (MVS/SP)
HMC  hardware management console
HSA  hardware system area
I/O  input/output
IBM  International Business Machines Corporation
IDCAMS  the program name for access method services (OP SYS)
IEEE  Institute of Electrical and Electronics Engineers
IFCC  interface control check
IMS  information management system
IMS/DB  information management system/data base
IMS/ESA  information management system/enterprise systems architecture
IMS/VS  information management system/virtual storage
INIT  initialize/initial/initiate
IOCDS  I/O configuration data set
IOCP  I/O configuration program
IODF  input/output definition file
IOS  input/output supervisor
IPC  inter-processor communication
IPCS  interactive problem control system
IPDS  intelligent printer data stream (IBM)
IPL  initial program load
IRLM  IMS/VS resource lock manager
ISC  inter-system communications
ITSC  International Technical Support Center (IBM)
ITSO  International Technical Support Organization
JCL  job control language (MVS and VSE)
JES  job entry subsystem (MVS)
LAN  local area network
LCU  logical control unit
LIC  licensed internal code
LOGREC  logout recorder (error recording DB in OS/VS)
LPAR  logically partitioned mode
LPDA  link problem determination aid
LTERM  logical terminal
LU  logical unit
LX  link/linkage index
MAS  multi-access spool (JES2)
MCS  multiple console support
MIH  missing interruption handler
MP  multi-processing
MVS  multiple virtual storage (IBM System 370 & 390)
MVS/ESA  multiple virtual storage/enterprise systems architecture (IBM)
MVS/XA  multiple virtual storage/extended architecture (IBM)
NIP  nucleus initialization program
NJE  network job entry
OPC  operations, planning & control (IBM program product)
OPC/ESA  operations planning & control/enterprise systems architecture (IBM)
OSA  open systems adapter
OSAM  overflow sequential access method
PARM  parameter
PARMLIB  MVS initialization parameter library
PC  personal computer
PCE  processor controller element
POR  power on reset
PPRC  Peer-to-Peer Remote Copy (IBM 3990 Model 6)
PR/SM  processor resource/systems manager (IBM)
PTERM  physical terminal
PUBS  publications
RACF  resource access control facility
RAMAC  brand name and trademark of IBM
RC  return code
RDO  resource definition on-line (CICS)
RECON  recovery control (data set)
REXX  restructured extended executor language
RJP  remote job processing
RLS  record level sharing
RMF  resource measurement facility (MVS)
RNL  resource name lists
RSA  ring processing system authority message (MVS control block)
RSR  remote site recovery (IMS)
RTA  real-time analysis (CICS)
SCA  Shared Communications Area (MVS/XCF Coupling Facility)
SCDS  save control data set
SCH  subchannel
SE  service element
SFM  Sysplex Failure Manager
SID  system identification
SIGP  signal processor
SIT  system initialization table
SMF  system management facility
SMP/E  system modification program/extended (MVS)
SNA  systems network architecture (IBM)
SQL  structured query language
SSI  subsystem interface (MVS)
STC  started task control
STCK  store clock
SUBSYS  subsystem
SVC  supervisor call instruction (IBM System/360)
SYSCAT  system catalog
SYSDEF  system definition (frame)
SYSLOG  system log
SYSPLEX  systems complex
SYSRES  system residence file/disk
SYSRESET  system reset
TCP/IP  Transmission Control Protocol/Internet Protocol (USA, DoD, ARPANET; TCP=layer 4, IP=layer 3, UNIX-ish/Ethernet-based system-interconnect protocol)
TELNET  U.S. Dept. of Defense's virtual terminal protocol, based on TCP/IP
TM  transaction manager
TOD  time of day
TOR  terminal owning region
TSO  time sharing option
TSO/E  time sharing option extensions
TXT  text
UP  uni-processor
UPS  uninterruptible power supply/system
USERID  user identification
USS  unformatted system services (SNA)
VIO  virtual input output
VM  virtual machine (IBM System 370 & 390)
VM/XA  virtual machine/extended architecture (IBM)
VOLSER  volume serial
VSAM  virtual storage access method (IBM)
VTAM  virtual telecommunications access method (IBM) (runs under MVS, VM, & DOS/VSE)
VTAMLST  VTAM definition library
WTOR  write to operator with reply
XCF  cross-system coupling facility (MVS)
XES  cross-system extended services (MVS)
XRF  extended recovery facility

Index Numerics
3088 configuration 13 maintenance 14 3174 MVS console 24 sysplex console attachment 21 3490 25 3990 concurrent copy 172, 175 DB2 data 105 dual copy 12 Extended Remote Copy 216 model 3 17 model 6 17, 53 Peer-to-Peer Remote Copy 215 remote copy 215 9021 711-based processors 144, 145 cross partition authority 67, 68, 69 9032 18, 146 9036-003 11 9037 10 9121 511-based processors 145 9672 clock 148 cross partition authority 68, 69 dynamic storage reconfiguration 145 HMC 21 image profile 67, 68, 69 IOCDS 143 power 27 R1 machines 144 R2 and R3 machines 144 9674 7, 27 9729 20 9910 27 AOC/MVS (continued) description 110 graphical interface 111 NetView 110 shutdown 169 area data set 173 ARM See Automatic Restart Manager (ARM) ARMRESTART 84, 85 ARMRST 84 auto-switchable tape 89 Automatic Restart Manager (ARM) AOC/MVS interaction 110 automating sysplex failure management characteristics 80 CICS definition 156 CICS implementation 83 CICS support 82 couple data set 35, 37, 81 DB2 support 85 definition 81 description 79 IMS element name 84 IMS element type 84 IMS support 84 IMSID 84 parameters 81 Policy TOTELEM keyword 81 subsystem interaction 82 SYSIMS 84 VTAM support 87 automation AOC/MVS 110 NetView 110 OPC/ESA 111 tools 110, 169 AUTOSWITCH 54 availability database considerations 171 DB2 subsystem 105 high 3 RLS database 108

58

A
Abbreviations 285 ABEND 75, 77, 85 ACDS 50, 52 ACQUIRE 69 Acronyms 285 ACTIVATE command 34 activation 70 ACTSYS Parameter 64 alternate consoles 43 ALTGRP 43 AOC/MVS ARM interaction 80 ARM restart 110

B
backup database considerations DB2 database 175 DL/1 database 173 VSAM database 172 batch database considerations DB2 database 174 DL/1 database 173 171

171



batch (continued) moving workload 164 OPC/ESA 165 VSAM database 171 battery backup 8, 27 BCDS 52, 53 BLWSPINR 77

C
CANCEL parameter 85 central processing complex (CPC) number that can be managed by one HMC 21 central storage 145 CFCC See Coupling Facility Control Code (CFCC) CFRM See Coupling Facility Resource Manager (CFRM) channel card 144 ESCON 143 parallel 143 CICS 83 adding a subsystem 156 affinities 97 ARM Implementation 83 ARM Support 82 backup while open (BWO) considerations 172 CICSPlex SM 217 coupling facility structure 106 CSD 97 disaster recovery 217 failure 82 file-owning region 97 logging 109 moving workload 161 Resource Definition Online (RDO) 97 restarting a TOR 211 restarting an AOR 212 RLS control data set 107 RLS database 108 shared temporary storage 97 shutdown 166 SMSVSAM 106 starting 159 storage protection 98 topology 96 transaction isolation 98 VSAM structure 108 CICSPlex SM affinities 97 configuration 99 description 83, 98 disaster recovery considerations 217 CLOCKxx 11 cloning 6 CMOS processors 6 sysplex 6

CNGRPxx 45 Command Prefix Facility (CPF) 150 command prefixes 86 COMMDS 50, 52, 53 COMMNDxx 54 concurrent maintenance 9032 model 003 18 channel 144 CP 144 LIC patches 144 CONNFAIL 64, 65, 66 CONSOLE statement 44 consoles 9672 21 alternate 43 C O N S I D = 0 22 extended MCS 22, 23 groups 43 integrated 22 JES3 89 master 22 MCS 22, 43 MSTCON 45 MVS 43 subsystem 22 system 22, 45 CONSOLxx 43 continuous availability 3 configuration 5 operations 3 couple data set alternate 35 Automatic Restart Manager (ARM) 35, 37, 80, 207 COUPLE00 member 149 Coupling Facility Resource Manager (CFRM) 35, 37, 54, 207 description 35 determining size 36 failure 206 performance and availability 37 placement 37 reformatting 126 spare 36 swapping 71 sysplex 35, 37, 206 Sysplex Failure Management (SFM) 35, 37, 207 System Logger (LOGR) 35, 37, 46, 208 Workload Manager (WLM) 35, 37, 207 COUPLEXX INTERVAL parameter 58, 62, 68, 69, 73, 74 OPNOTIFY parameter 58, 76 sample member 226 SFM considerations 58 SMS group name 51 coupling facility alternate 8 CFCC 9



coupling facility (continued) CFRM policy changes 125 CICS logstream 109 configuration 7 DB2 104, 105 DB2 structure 130 dump space 120 exploiters 128 IMS 102 IMS lock structure 128 JES2 structure 129 links 7, 35 logstream structure 131 maintenance 132 moving a structure 120 OSAM and VSAM structure 128 RACF structure 130 shared tape structure 131 shutdown procedure 134 SMSVSAM structure 131 structure allocation 117 configuration 7 connections 118 connectivity 8 DB2 9 definition 8 IEFAUTOS 131 IMS lock 128 ISTGENERIC 10 JES2 checkpoint 9, 129 last structure condition 15 LOGR 46 logstream 46, 131 OSAM and VSAM 128 RACF 8, 130 rebuilding 121 relocation 8 shared tape 131 SMSVSAM 131 system logger 9 VTAM 129 VTAM generic resources 10 XCF signalling paths 14 XCF structure 129 volatility 8 VSAM 108 VSAM RLS 106 VTAM structure 129 XCF structure 129 Coupling Facility Control Code (CFCC) 9 Coupling Facility Resource Manager (CFRM) couple data set 35, 37, 54 policy 54, 66, 73 PREFLIST statement 8 CSVDYNEX 55 CTC 13

D
DASD path configuration 17 DASD Fast Write (DFW) 37, 50 data set IMS 102 LOGREC 41, 149, 238 PAGE 149, 238 PAGE/SWAP 41 SMF 41, 149, 238 STGINDEX 41, 149, 238 data sharing 3 DATA TYPE 70 DB2 adding a member 158 ARM support 85 availability 105 Call Attachment Facility (CAF) 164 CICS and IMS considerations 164 coupling facility structure 9, 104 database considerations 174 database structure 130 description 103 disaster recovery 218 failure 82 moving workload 163 shutdown 167 starting 160 subsystem definition 105 subsystem parameters 105 TSO and batch considerations 164 DB2 group name 85 DCCF 24, 44, 45 DEACTIVATE 68 DEACTTIME 58, 64, 67, 68 DFSMS 50, 52 coupling facility structure 10 DFSMShsm considerations 165 moving workload 165 SMSVSAM structure 131 DFSMSdss 37, 53 DFSMShsm 52, 53, 165 DFSMShsm journal 52 DSI 91, 93 dual copy 12 DUPLEXMODE 9 DYNAMIC 34, 57 dynamic exits 55 dynamic I/O reconfiguration 33 dynamic sparing 15 dynamic storage reconfiguration 145 dynamic subsystem interface (SSI) 56

E
EDT 34 ELEMENT 85



EMIF 7 ENF 85 ENQ 53 ESCON channels 11, 143 CTC 13 devices 145 director 12, 18, 145, 146 EMIF 7 I/O configuration 11 logical paths 12, 18 manager 19 ESCON Manager 24 ESTORE 64 expanded storage 145 exploiting dynamic functions 55 EXSPATxx 58, 73, 77 Extended Remote Copy 216

I
I/O configuration 11 connectivity 11 definition file (IODF) 35 devices 145 dynamic reconfiguration 33 ICKDSF logical path report 18 ICMF 7, 218 IEACMDxx 54 IEASYM 222 IEASYMxx 222, 223 IEASYSxx 41, 51, 224 IECIOSxx 38 IEFAUTOS 54 IEFAUTOS structure 131 IEFJFRQ 57 IEFSSI 56 IEFSSNxx 56, 86 IEFSSVT 56 IEFSSVTI 56 Image 69, 221 image profile 67, 68 IMS area data set 173 ARM support 84 cache directory 10 cloning 101, 102, 103, 157 coupling facility structure 102 database considerations 172 DEDB database 173 disaster recovery 216 failure 82 FFDB database 173 fuzzy image copy 174 IRLM definitions 102 lock structure 128 moving workload 163 OSAM and VSAM structure 128 RSR 216 shared data sets 102 shutdown 166 starting 160 subsystem identifier 101 SVC 102 terminal definition 101 topology 100 unique data sets 102 IMSID 84 indirect catalog 30 installation 70 INTERVAL description 74 interval detection 73 recommendations 62 SFM planning 58 values definitions 68

F
FAILING 67 FAILSYS 64, 72 failure detection 74 fault-tolerant system 4 fiber 20 FORCE parameter 85 fuzzy backup 172 fuzzy image copy 174

G
generic resources 10 glossary 275 GRSCNFxx 73, 78 GRSRNLxx 53, 54

H
hardware management console (HMC) changing time 148 description 21 usage during NIP 21 HCD ACTIVATE function 34 adding I/O device 146 download IOCDS 143 dynamic capable IOCDS 33 TIMEOUT parameter 14 HCPYGRP 44 HMC See hardware management console (HMC) hot I/O 79 HSA 34 HWNAME 223



IOCDS 34, 35, 143 IOCP stand-alone 143 TIMEOUT parameter 14 IODF 33, 54 IPL load parameters 35 message suppression 21 IPLPARM members 222 IRLM lock structure 10 ISOLATE 63, 68 ISOLATETIME 58, 62, 64, 68, 73, 76 ISOLATETIME. 67 ISTGENERIC structure 10 ITEM NAME 81 IXCARM 80, 82 IXCL1DSU 71, 81 IXCMIAPU 70, 72 IXCMIAPU utility 47 IXGCONN 49 IXGINVNT service 47

JESXCF group name 87, 91 software maintenance journal 53

150

L
LIC patches 144 LOADxx 34, 222, 268 local UPS 27 log data sets 46 log data, duplexing 47 logger See system logger LOGR 46 LOGREC data set 41, 149, 238 logs 148 LOGSTREAM 9, 46 LPAR adding 34, 144 dynamic storage reconfiguration isolation 68 processing weights 145 resetting 67 SFM parameters 64 storage acquisition 69 LPARNAME 223 LPDEF 67, 68, 69

145

J
JCL CICS startup 82 started tasks 42 system symbols 42 JES2 checkpoint 8 performance 39 placement 38 reconfiguration 39 structure placement 38 checkpoint coupling facility structure CKPT structure 129 duplicated TSO logon 109 startup procedure 227 structure failure 39 JES3 adding a global 150 adding a local 150 adding a subsystem 150 ARM exploitation 87 CONSTD statement 92 DSI 91, 93 initialization stream CONSOLE statement 89 DEVICE statement 88, 89 MAINPROC statement 88, 150 OPTIONS statement 91 RJPWS statement 88 managed devices 34, 88 managing tape allocation 54 planning changes 87 PLEXSYN keyword 92 SMS support 52 SYN keyword 92

M
master catalog 32 MAXELEM 81 MCDS 52, 53 messages IWM012E 207 IXC253I 208 IXC263I 208 IXC267I 207, 208, 209 IXC808I 208 IXC809I 208 undeliverable (UD) 22 MIH 38 MVS ACTIVATE command 34 adding a new SYSRES 151 adding an image 149 NIP console 21 removing an image 169 ripple IPL 154

N
N and N + 1 29, 155 NAME 64 naming conventions 40 NetView description 110 focal point 110



NIP 21 NOCCGRP 44 non-volatile, coupling facility nondisruptive change 4

PROGxx 55 PROMPT 64 38, 47

R
RACF coupling facility structure 8 database 40 database structure 130 structure failure 40 RAMAC 11, 12, 15, 17 REBUILD 65 REBUILDPERCENT 66, 73 RECONFIG 64, 71 reconfiguration 33 redundancy 4 Remote Site Recovery (IMS) 216 reorganization database considerations 171 DB2 database 175 DL/1 database 174 VSAM database 172 REPORT 70 reserve 37, 38, 53 reset failing logical partition 67 RESETTIME 58, 64, 67, 68 RESMIL 78 restart groups 81 RPQ 8K1919 11 RSA 78 RVARY command 40

O
OCDS 52, 53 OPC/ESA ARM support 80 controller 111 description 111 job routing 111 moving workload 165 shutdown 169 OPERLOG 213 OPNOTIFY 58, 76

P
PAGE data set 149, 238 PAGE/SWAP data set 41 parallel attached devices 146 channels 143 parallel sysplex sample configuration 221 PARMLIB See SYS1.PARMLIB PARTITION 64, 68 Peer-to-Peer Remote Copy 215 performance 6 PLEXSYN 92 policy 81 active SFM 70 Automatic Restart Manager (ARM) 81 Coupling Facility Resource Manager (CFRM) 66, 73 DFSMS 50 Sysplex Failure Management (SFM) 71 XCF PR/SM 66 POR 33 power battery backup 8 failure 8 save state 27 subsystem 144 supply 144 UPS 8, 26 PR/SM 10 processing weights 145 processor 69 adding 143 bipolar 6 changing 144 CMOS 6 configuration for continuous availability 6 N+1 6 removing 143

S
54, SCDS 50, 52 SCTC 13 SE See service element (SE) service element (SE) clock 148 SET 55 SETPROG EXIT 55 SETSSI 56, 57 SETXCF command 36, 70, 71, 81 SFM 66 See also Sysplex Failure Management (SFM) shared SYSRES 29, 30 shared tape 54, 89 single point of failure 4 single system image 40 SMF allocation sample 238 data sets 41 dynamic exit 55 system cloning 149 system identifier (SID) 40 time changes 148 SMSplex 50



spin loop 79 SPINRCVY 74, 77 SPINTIME 58, 73, 74, 75, 76, 77 SSIDATA 57 staging data sets 49, 50 standards 6 standby system 3 START 83 status update missing 37 STC 80 STGINDEX data set 41, 149, 238 STOR 69 See also Sysplex Failure Management (SFM) STORAGE 69 STORE 64 subsystem adding a CICS subsystem 156 changing 160 CICS 166 CICS startup 159 DB2 167 DB2 startup 160 IMS 166 IMS startup 160 shutdown 165 starting 159 SYMDEF 223 SYN 92 SYNCHDEST 44, 45 synchronous WTO(R) 45 SYS1.PARMLIB CLOCKxx member 11, 146 CNGRPxx member 45 COMMNDxx member 22 considerations 40 CONSOLxx member 22, 43, 45 COUPLE00 member 149 DEFAULT statement 45 ETRMODE keyword 11, 146 ETRZONE keyword 11, 146 EXSPATxx member 79 GRSRNLxx 53, 54 IEACMDxx member 22 IEASYMxx member 149 IECIOSxx member 79 IEFSSNxx member 156, 158 members 222 OPER Keyword 79 PLEXCFG parameter 90 SCHEDxx member 156 SPINRCVY keyword 74 SPINTIME keyword 74 SYNCHDEST keyword 45 TERM keyword 75 TOLINT keyword 73, 78 XCFPOLxx 66 SYS1.PROCLIB 227

70

SYS1.SAMPLIB 77 SYSCLONE 42 SYSCONS 22 SYSDEF 223 SYSIMS 84 SYSNAME 42 SYSPARM 223 sysplex CMOS only 6 keyword in LOADxx member 222 mixed 6 name 51 symbolic 42 timer 10, 146 Sysplex Failure Management (SFM) activation 69 active policies 70 ARM considerations 79 automating 57 couple data set 35, 37 isolate function 59 native environment considerations planning 58 policies 70, 71 Policy ACTSYS keyword 64 CONNFAIL keyword 64 DEACTTIME keyword 64 ESTORE keyword 64 FAILSYS keyword 64 ISOLATETIME keyword 64 NAME keyword 64 PROMPT keyword 64 RECONFIG keyword 64 RESETTIME keyword 64 STORE keyword 64 SYSTEM keyword 64 TARGETSYS keyword 64 WEIGHT keyword 64, 65 PR/SM environment considerations stopping 72 timing 67 utilization 72 SYSPLEX symbolic 42 system group 51 system logger address space failure 213 application failure 212 CICS definition 156 coupling facility sensitivity 8 description 46 logstream allocation 46 logstream structure 131 OPERLOG failure 213 sysplex failure 213 system failure 213 SYSTEM parameter 64

64

64



system software changes system symbolic 41, 42 systems management 6

154

W
WEIGHT 64, 65, 73 WLM See Workload Manager (WLM) workload balancing 4 batch 164 CICS 161 DB2 163 DFSMS 165 IMS 163 moving 161 TSO 164 Workload Manager (WLM) compatibility mode 79 couple data set 35, 37 sysplex recovery 80

T
tape 54 tape switching 54 TARGETSYS 64, 69 TERM 75 time changing 146 9672 HMC and SE 148 IMS 146 SMF 148 daylight savings 146, 147 detection interval 73 local 146 log timestamps 148 MVS clocks 146 setting in MVS 11 standard 146, 147 summer 146 TOD clock 10 winter 146 zone offset 146 time-of-day (TOD) clock 10 TOLINT 73, 78 TOTELEM 81 TSO/E adding 159 description 109 moving workload 164

X
XCF (Cross Systems Coupling Facility) address space 80 connectivity failure 65 couple data set 35, 37 failure detection interval 74 group name JES3 91, 150 PR/SM policy 66 signalling paths alternate 14 configuration 14 JES3 use 90 transport class 90 signalling structure 129 XCFPOLxx 66 XES 57 XRF 83

U
UCB 34 UIM 33 uninterruptible power supply (UPS) 8, 26 UPS See uninterruptible power supply (UPS)

V
VARY device command 54 VARY XCF 60, 62 volatile, coupling facility 39, 47, 49 VSAM database 171 VTAM APPN 112 ARM support 87 configuration 112 coupling facility structure 10 generic resources 10 ISTGENERIC structure 129 VTAMLST 232



