Everything you ever wanted to know about the Cluster Health Monitor (CHM)
Markus Michalewicz Principal Product Manager Oracle RAC & Oracle Clusterware
The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Agenda
Introduction
Cluster Health Monitor (CHM): What is it? Why use it? Where to get it?
Installation
Of the tool; of the GUI
Possible outcomes:
Oracle Support finds the answer in one of the logs.
Oracle Support needs more node-specific information to answer the question.
For the latter: this is why you need a tool such as Cluster Health Monitor (CHM).
For the latter: CHM provides a historical view of collected data for analysis.
>crfgui -d "00:05:00" -m 192.168.2.8
Cluster Health Analyzer V1.10
Look for Loggerd via node 192.168.2.8
...reading 300 sec from the past
Connected to Loggerd on rac1
Note: Node rac1 is now up
Cluster 'MyCluster', 2 nodes. Ext time=2010-08-18 23:22:30
Installation
Summary of installation steps:
1. Download the software.
2. Unzip the downloaded file.
3. Install the tool: $CHM_install_DIR/install/crfinst.pl -i node1,node2 -b /BDBdirectory
   Do not install from a shared file system.
   The user must have passwordless SSH access to all nodes; it can be the same user as the Oracle Grid Infrastructure owner.
   Do not use a shared destination for the BDBdirectory location.
   The software is distributed automatically across all nodes specified under -i.
4. Define one of the nodes as the master node.
5. Run crfinst.pl -f -b /BDBdirectory as root on all nodes to enable the tool.
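When the setup is scripted for many nodes, the install and enable commands can be assembled programmatically. The following is a minimal, hypothetical Python sketch: the function names, the install directory, and the BDB path in the usage are illustrative assumptions. It only builds argument lists from the -i, -b, -m, and -f options shown in this document and does not execute anything.

```python
import shlex
from typing import List, Optional


def crfinst_install_argv(install_dir: str, nodes: List[str], bdb_dir: str,
                         master: Optional[str] = None) -> List[str]:
    """Hypothetical helper: build 'crfinst.pl -i <nodelist> -b <bdb> [-m <master>]'."""
    if not nodes:
        raise ValueError("at least one node is required")
    # The node list is passed as a single comma-separated argument.
    argv = [f"{install_dir}/install/crfinst.pl", "-i", ",".join(nodes), "-b", bdb_dir]
    if master is not None:
        # The master node must be one of the nodes being installed.
        if master not in nodes:
            raise ValueError("master must be one of the listed nodes")
        argv += ["-m", master]
    return argv


def crfinst_enable_argv(bdb_dir: str) -> List[str]:
    """Hypothetical helper: the enable step, to be run as root on every node."""
    return ["crfinst.pl", "-f", "-b", bdb_dir]


if __name__ == "__main__":
    # Illustrative paths only: use your own unzip location and a local (non-shared) BDB directory.
    cmd = crfinst_install_argv("/home/oracle/chm", ["rac1", "rac2"], "/u01/crfdb", master="rac1")
    print(shlex.join(cmd))
    print(shlex.join(crfinst_enable_argv("/u01/crfdb")))
```

Keeping the commands as argument lists rather than concatenated strings avoids quoting mistakes if they are later passed to a process launcher.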
The passwordless SSH setup follows the same configuration that you would use for an Oracle Grid Infrastructure (Oracle Clusterware 11g Release 2) installation, whose installer can set up passwordless SSH automatically. If you plan to deploy Oracle Grid Infrastructure on this system, you may want to do that first and then install Cluster Health Monitor.
[root@rac2 ~]# ps -ef
root 28025 27949
root 28028 27949
oracle 28089 1
root 28127 1
0 21:26 ?
If your client is a Windows client, download the Windows version of the tool. Unzip it and install the GUI using:
Usage: crfinst.pl -a [<nodelist>]
                  -c [<nodelist>]
                  -d
                  -f [-b <bdb loc>]
                  -g <ui install dir>
                  -h
                  -i <nodelist> -b <bdb loc> [-m <master>]
                  -N ClusterName
Administration
Administration part 1
The main administration tool for CHM: oclumon
[oracle@rac1 ~]$ oclumon -h
For help from command line : oclumon <verb> -h
For help in interactive mode : <verb> -h
Currently supported verbs are :
showtrail, showobjects, dumpnodeview, manage, version, debug, quit and help
[oracle@rac1 ~]$ oclumon version
Instantaneous Problem Detection - OS Tool, Version 1.04.20091223 - Production
Copyright 2009 Oracle. All rights reserved.
Administration part 2
How long can I go back in time?
Reviewing historical data is limited by the size of the Berkeley DB
By default, the database retains the node views from all nodes for the last 24 hours in a circular manner. This limit can be increased to 72 hours with the following oclumon command: 'oclumon manage -bdb resize 259200'. The resize value is given in seconds (72 × 3600 = 259200). In the current release (as of 11.2.0.1), you cannot query the current retention time; you can, however, set it to whatever time you consider appropriate.
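Because the resize value is given in seconds, converting a desired retention in hours is simple arithmetic. The following Python sketch builds the command string shown above; the function name is hypothetical, and treating 72 hours as the upper bound is an assumption based on the limit stated for this release.

```python
SECONDS_PER_HOUR = 3600
MAX_RETENTION_HOURS = 72  # assumption: upper bound taken from the limit stated above for 11.2.0.1


def bdb_resize_command(hours: int) -> str:
    """Hypothetical helper: build 'oclumon manage -bdb resize <seconds>' for a retention in hours."""
    if not 0 < hours <= MAX_RETENTION_HOURS:
        raise ValueError(f"retention must be between 1 and {MAX_RETENTION_HOURS} hours")
    seconds = hours * SECONDS_PER_HOUR
    return f"oclumon manage -bdb resize {seconds}"


if __name__ == "__main__":
    print(bdb_resize_command(24))  # default retention
    print(bdb_resize_command(72))  # the 72-hour value from the text
```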
Whenever a time is specified in the format HH:MM:SS, it refers to how far back you want to go (in hours, minutes, and seconds). For example, this command:
crfgui -d "00:35:00" -m 192.168.2.8
displays the data from 35 minutes ago.
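The HH:MM:SS value is a look-back offset, not a clock time. A minimal Python sketch of that interpretation (the helper names are hypothetical) converts the offset into a timedelta and subtracts it from the current time. Under this reading, "00:05:00" corresponds to the 300 seconds that crfgui reports in the introduction example.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_offset(hhmmss: str) -> timedelta:
    """Convert an 'HH:MM:SS' look-back value into a timedelta."""
    # Unpacking raises ValueError if there are not exactly three fields.
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    if hours < 0 or not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"invalid HH:MM:SS value: {hhmmss!r}")
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def window_start(hhmmss: str, now: Optional[datetime] = None) -> datetime:
    """Return the point in time that the look-back offset refers to."""
    return (now or datetime.now()) - parse_offset(hhmmss)


if __name__ == "__main__":
    print(parse_offset("00:05:00").total_seconds())  # 300.0
    print(window_start("00:35:00", datetime(2010, 8, 18, 23, 22, 30)))
```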
Administration part 3
Get me information on the command line
> oclumon dumpnodeview -v -n rac1 -last "00:00:03"
(-last "00:00:03" returns the data of the last 3 seconds.)
---------------------------------------
Node: rac1 Clock: '08-19-10 03.53.53 UTC' SerialNo:63193
---------------------------------------
SYSTEM:
#cpus: 2 cpu: 4.5 cpuq: 1 physmemfree: 13896 mcache: 959952 swapfree: 1900208 ior: 0 iow: 297 ios: 17 netr: 57.9 netw: 43.56 procs: 187 rtprocs: 11 #fds: 2658 #sysfdlimit: 6815744 #disks: 7 #nics: 4 nicErrors: 0
TOP CONSUMERS:
topcpu: 'osysmond(13446) 0.66' topprivmem: 'ologgerd(13532) 102260' topshm: 'ologgerd(13532) 46680' topfd: 'crsd.bin(10754) 102' topthread: 'crsd.bin(10754) 58'
PROCESSES:
name: 'osysmond' pid: 13446 #procfdlimit: 1024 cpuusage: 0.66 memusage: 78912 shm: 41196 #fd: 22 #threads: 9 priority: 139
name: 'orarootagent.bi' pid: 10890 #procfdlimit: 65536 cpuusage: 0.66 memusage: 6420 shm: 10032 #fd: 7 #threads: 34 priority: 19
name: 'ologgerd' pid: 13532 #procfdlimit: 1024 cpuusage: 0.0 memusage: 102260 shm: 46680 #fd: 19 #threads: 9 priority: 139
DEVICES:
sdf ior: 0.0 iow: 0.0 ios: 0 qlen: 0 wait: 0 type: SYS
sdf1 ior: 0.0 iow: 0.0 ios: 0 qlen: - wait: - type: SYS
sde ior: 0.0 iow: 0.0 ios: 0 qlen: 0 wait: 0 type: SYS
sde1 ior: 0.0 iow: 0.0 ios: 0 qlen: - wait: - type: SYS
sdd ior: 0.0 iow: 0.0 ios: 0 qlen: 0 wait: 0 type: SYS
NICS:
lo netrr: 21.3 netwr: 21.3 neteff: 42.7 nicerrors: 0 pktsin: 7 pktsout: 7 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 7 innonunicast: 0 type: PUBLIC
eth0 netrr: 25.65 netwr: 15.94 neteff: 41.60 nicerrors: 0 pktsin: 13 pktsout: 13 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 13 innonunicast: 0 type: PRIVATE latency: <1
eth1 netrr: 10.27 netwr: 6.58 neteff: 16.85 nicerrors: 0 pktsin: 30 pktsout: 22 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 30 innonunicast: 0 type: PRIVATE latency: <1
eth2 netrr: 0.12 netwr: 0.0 neteff: 0.12 nicerrors: 0 pktsin: 0 pktsout: 0 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 0 innonunicast: 0 type: PUBLIC latency: <1
PROTOCOL ERRORS:
IPHdrErr: 0 IPAddrErr: 0 IPUnkProto: 0 IPReasFail: 0 IPFragFail: 0 TCPFailedConn: 50 TCPEstRst: 13 TCPRetraSeg: 69 UDPUnkPort: 41 UDPRcvErr: 0
End of data
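When command-line output is saved for later analysis, the key: value pairs on a metrics line such as SYSTEM can be turned into a dictionary. This is a hypothetical Python sketch, assuming a section's metrics appear on a single line as above; quoted multi-word values such as the TOP CONSUMERS entries are not handled.

```python
import re
from typing import Dict, Union

# A key (optionally prefixed with '#'), then ': ' and a value that is not itself a key.
# The lookahead makes a bare label such as 'SYSTEM:' fail to match, so it is skipped.
PAIR = re.compile(r"(#?\w+):\s+([^\s:]+)(?=\s|$)")


def parse_metrics(line: str) -> Dict[str, Union[float, str]]:
    """Parse 'key: value' pairs, converting numeric values to float."""
    metrics: Dict[str, Union[float, str]] = {}
    for key, value in PAIR.findall(line):
        try:
            metrics[key] = float(value)
        except ValueError:
            metrics[key] = value  # non-numeric values such as '-', '<1', or 'PRIVATE'
    return metrics


if __name__ == "__main__":
    system = ("SYSTEM: #cpus: 2 cpu: 4.5 cpuq: 1 physmemfree: 13896 "
              "swapfree: 1900208 procs: 187 #nics: 4 nicErrors: 0")
    m = parse_metrics(system)
    print(m["#cpus"], m["cpu"], m["physmemfree"])
```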
Administration part 4
Time is crucial: the clock
> oclumon dumpnodeview -n rac1 -s "2010-08-19 02.00.01" -e "2010-08-19 02.00.03"
---------------------------------------
Node: rac1 Clock: '08-19-10 02.00.01 UTC' SerialNo:58695
---------------------------------------
SYSTEM:
#cpus: 2 cpu: 4.20 cpuq: 4 physmemfree: 17728 mcache: 953248 swapfree: 1900208 ior: 0 iow: 103 ios: 7 netr: 46.36 netw: 39.29 procs: 187 rtprocs: 11 #fds: 2658 #sysfdlimit: 6815744 #disks: 7 #nics: 4 nicErrors: 0
TOP CONSUMERS:
topcpu: 'osysmond(13446) 1.31' topprivmem: 'ologgerd(13532) 102260' topshm: 'ologgerd(13532) 46680' topfd: 'crsd.bin(10754) 102' topthread: 'crsd.bin(10754) 58'
End of data
Alternative: oclumon dumpnodeview -allnodes -s "2010-08-19 02.00.01" -e "2010-08-19 02.00.03"
The "Clock:" value in the oclumon output is printed in the time zone that the master daemon runs with.
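The -s and -e values in the examples above follow the pattern YYYY-MM-DD HH.MM.SS, with dots rather than colons in the time part. A hypothetical Python helper that produces a dumpnodeview argument list in that format could look like this. It performs no time zone conversion, so choosing timestamps consistent with how the cluster reports time is left to the caller.

```python
from datetime import datetime
from typing import List, Optional

# Timestamp pattern taken from the -s/-e examples above (dots in the time part).
OCLUMON_TS = "%Y-%m-%d %H.%M.%S"


def dumpnodeview_argv(start: datetime, end: datetime, node: Optional[str] = None) -> List[str]:
    """Build an oclumon dumpnodeview call for one node (-n) or all nodes (-allnodes)."""
    if end < start:
        raise ValueError("end must not be before start")
    argv = ["oclumon", "dumpnodeview"]
    argv += ["-n", node] if node else ["-allnodes"]
    argv += ["-s", start.strftime(OCLUMON_TS), "-e", end.strftime(OCLUMON_TS)]
    return argv


if __name__ == "__main__":
    start = datetime(2010, 8, 19, 2, 0, 1)
    end = datetime(2010, 8, 19, 2, 0, 3)
    print(dumpnodeview_argv(start, end, "rac1"))
    print(dumpnodeview_argv(start, end))
```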
Administration part 5
Sampling data and refresh rate
Two independent rates to distinguish:
1. The sampling rate of the tool
2. The refresh rate of the GUI
The sampling rate of the tool depends on the currently active processes and devices on the system. With up to a total of 1000 active processes and disks on an ideal system, the sampling interval is approximately 1 second. The GUI refreshes every 1 second by default, but a different refresh interval can be specified using the -r parameter followed by the interval in seconds.
Example: crfgui -r 5 -m 192.168.2.8
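The -d look-back and the -r refresh interval are independent options and can be combined with -m. The sketch below is a hypothetical Python helper that formats a timedelta as HH:MM:SS and assembles a crfgui argument list. The option order it produces and the requirement that -r be a positive whole number of seconds are assumptions, not documented rules.

```python
from datetime import timedelta
from typing import List, Optional


def format_offset(delta: timedelta) -> str:
    """Format a non-negative timedelta as HH:MM:SS (hours may exceed 24)."""
    total = int(delta.total_seconds())
    if total < 0:
        raise ValueError("look-back must not be negative")
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def crfgui_argv(node_ip: str, look_back: Optional[timedelta] = None,
                refresh_secs: Optional[int] = None) -> List[str]:
    """Build a crfgui call: optional -d (look-back), optional -r (refresh), then -m."""
    argv = ["crfgui"]
    if look_back is not None:
        argv += ["-d", format_offset(look_back)]
    if refresh_secs is not None:
        # Assumption: the refresh interval is a whole number of seconds, at least 1.
        if refresh_secs < 1:
            raise ValueError("refresh interval must be at least 1 second")
        argv += ["-r", str(refresh_secs)]
    argv += ["-m", node_ip]
    return argv
```

For instance, crfgui_argv("192.168.2.8", refresh_secs=5) reproduces the example above, and crfgui_argv("192.168.2.8", look_back=timedelta(minutes=35)) reproduces the 35-minute look-back command from the previous section.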
FAQ #1
Is CHM a CVU (Cluster Verification Utility) replacement?
NO
CVU is a separate tool with a completely different purpose. CVU neither gathers nor provides the data that CHM provides. For more information on CVU, go to:
http://www.oracle.com/goto/rac
On this page, follow this link: Cluster Verification Utility - Download
FAQ #2
Can CHM be used as an OS Watcher replacement?
YES
OS Watcher (OSW) is a collection of UNIX shell scripts(*) intended to collect and archive operating system and network metrics to aid support in diagnosing performance issues.
Note: In some specific environments, OS Watcher may provide additional information (e.g. version 3.0 of OS Watcher adds additional collections for Exadata, as described in the OS Watcher MOS note).
(*) On Unix; there is also a Windows version of OS Watcher.
FAQ #3
Is CHM the standard tool to be used?
YES
Oracle RAC Development recommends using CHM whenever possible:
When using Oracle Clusterware, Oracle Grid Infrastructure, or Oracle RAC.
The current release is available on Linux and Windows, both 32-bit and 64-bit.
CHM will be the standard tool moving forward; therefore, more operating systems will be supported in the future.
More Information
Going forward, all OS supported for Oracle Grid Infrastructure will be supported for Cluster Health Monitor.
Support for more operating systems is planned as 11.2.0.2 becomes available on them (the last ones are planned for 11.2.0.3).
http://www.oracle.com/goto/rac
Download link: Cluster Health Monitor - Download
http://www.oracle.com/goto/clusterware
Technical White Paper Oracle Clusterware 11g Release 2 Technical Overview
For OS Watcher
My Oracle Support doc ID 301137.1 - OS Watcher User Guide
OTN Migration
A migration with some impact
Note that the Oracle Technology Network (OTN) has been migrated; URLs containing http://otn.oracle.com/ have moved.
Individual items (e.g. papers) were migrated to a new content management system, so direct links to those items using the old URLs may no longer work.
Some links to main pages should be redirected to new pages, e.g.:
http://otn.oracle.com/rac (might go away over time) redirects to http://www.oracle.com/technetwork/database/clustering/overview/index.html
Items on the main pages are linked using the new URLs. Tip: follow the links on the main pages until the migration is complete.