
An Oracle White Paper
August 2013

Oracle Trace File Analyzer Collector (TFA)

Executive Overview

Business Drivers for Implementing TFA
  Reduced Cost
  Reduced Complexity
  Increased Quality of Service
  Improved Agility

TFA Architecture and Basics

TFA Usage
  Diagcollect
  Sample Collection Output
  Other Sample Command Output

Summary
  Goals
  Approach


Executive Overview

As the systems supporting today's complex IT environments become larger, more complex and more numerous, sometimes with many hosts clustered into a single system or distributed among different data centers, it can be difficult and time-consuming for those managing these systems to know which diagnostics to collect when problems occur. First level support personnel may not be as experienced as second or third level support staff in knowing how and what to collect.
Trace File Analyzer Collector, hereafter referred to as TFA, is designed to solve these problems by providing a simple, efficient and thorough mechanism for collecting all the relevant first failure diagnostic data in one pass, from all nodes in the TFA configuration. Customer staff need only know the approximate time that a problem occurred, and with this information one simple command can be run that will collect all the relevant files from the entire configuration based on that time. The collection will include operating system, Oracle Clusterware, Oracle Automatic Storage Management and Oracle Database diagnostic data.

Business Drivers for Implementing TFA


Four key business drivers should be considered in a decision to
implement Trace File Analyzer Collector.

Reduced Cost
When a problem occurs, whether it is a service interruption or something less serious, it costs money and ties up resources. The faster service can be restored or the problem solved, the more quickly IT staff can get back to more productive activities. Time is money. When time is spent collecting and uploading diagnostic data over and over, whether through lack of knowledge about what is needed or through pressure of time, the resolution cycle is lengthened. TFA can help compress this aspect of the cycle.

Reduced Complexity
Staff often may not know exactly what happened or what to collect, and the response is frequently a "kitchen sink" approach in which large amounts of data are provided but not much information. The Oracle Support analyst has to transfer this data to internal systems and then sift through it to determine what the customer has provided. In clustered systems, diagnostics are often required from the entire cluster so that events can be correlated, and this requirement is frequently neglected because the staff considers the collection too complex or time consuming.

Increased Quality of Service


Oracle Support can provide much quicker resolutions and more valuable support if the proper first failure diagnostics are uploaded in one transfer instead of requiring multiple communications with the Customer. Customers often feel that a request for additional diagnostic files is a stalling tactic by Support. Collecting all relevant files in one collection would eliminate much of this kind of misunderstanding, allowing Oracle to provide a much higher quality of customer service.

Improved Agility
If staff are freed from the time-consuming task of collecting diagnostic data, they can be redeployed to work on more productive and interesting tasks. First level production support personnel can collect and upload diagnostics just as efficiently and as precisely as higher level staff through automation and standardization.

TFA Architecture and Basics


TFA is implemented by way of a Java Virtual Machine (JVM) that is installed and runs as a daemon on each host in the TFA configuration. Each TFA JVM communicates with peer TFA JVMs in the configuration through secure sockets. There is a Berkeley DB (BDB) on each host in which TFA stores metadata about the configuration and about the directories and files that TFA monitors on each host.
Supported versions (August 2013) of Oracle Clusterware, ASM and Database are 10.2, 11.1, 11.2 and 12.1. TFA is shipped and installed in an 11.2.0.4 Grid Infrastructure home.
Supported platforms (August 2013) are Linux, Solaris, AIX and HP-UX.
Use the TFA My Oracle Support (MOS) Note to download TFA and to monitor for updates on version and platform support.
There is a simple command line interface (CLI) named tfactl which is used to execute the commands that are supported by TFA. The JVMs only accept commands from tfactl. tfactl is used to monitor the status of the TFA configuration, to make changes to the configuration and ultimately to take collections.
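As an illustration (assuming the sudo configuration described later in this paper, and run from the TFA_HOME bin directory), typical tfactl invocations look like the following; sample output for the first three commands appears later in this paper.
$ sudo ./tfactl print status
$ sudo ./tfactl print config
$ sudo ./tfactl print repository
$ sudo ./tfactl diagcollect -since 1h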
Collections are initiated from any host in the configuration, and if files are needed from other hosts in the configuration the initiating host's JVM communicates the collection requirement to the peer JVMs on those hosts. The collections across the configuration are run concurrently and are copied back to the TFA repository on the initiating node. When all the collections are complete the user can then upload the collections from a single host to the Oracle Service Request.
The TFA repository can reside on a shared filesystem, in which case each host's collection is stored in the shared repository in host-specific subdirectories as it is completed. If the TFA repository is configured locally on each host, then each host's collection is copied to the initiating host's repository as it is completed. In either case it is convenient for the user to obtain the files needed for upload to the Service Request from a single location, which is listed at the end of the output for each collection.
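For example, the repository location and its current and maximum size on a host can be checked at any time with:
$ sudo ./tfactl print repository
Sample output for this command appears later in this paper.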
TFA collects only files that are relevant based on the time of the problem. The user has only to know the approximate time of the problem and to use the proper time modifier in the CLI in order to gather all the files that Support would typically need to triage the problem. As an example:
$ sudo tfactl diagcollect -since 1h
This command communicates to TFA that the problem occurred sometime in the last hour and that it should collect all OS, Clusterware, ASM, Database, Cluster Health Monitor and OS Watcher files that are relevant to that time, i.e., that were modified within the last hour. There are other arguments that can be used to limit the collection to a subset of hosts and components, but if not much is known about what exactly the problem may have been or which components may have been involved, the above command might collect more than was necessary but at least nothing will have been missed. The more precisely the time of the problem can be specified, the smaller the collection will be.
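If more is known about the problem, the collection can be narrowed accordingly. For example, a collection limited to one database on two specific nodes (the database and node names below are placeholders) might look like:
$ sudo ./tfactl diagcollect -database mydb -node node1,node2 -since 1h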
TFA prunes larger files by skipping data from traces and logs that is outside of the time specified. Again, the more precisely the time can be specified, the more these larger files can be pruned. The design goal of TFA is to make the collections as complete, as relevant and as small as possible, to save time in copying, uploading and transferring files for both the customer and Oracle Support, and to collect only data that would be potentially relevant to the problem. A file that was not modified around the time of the problem is not likely to be relevant.
It is trivial to add nodes to the TFA configuration after the initial install. New databases added to a configuration will be discovered automatically. TFA runs an auto-discovery and a file inventory periodically (every 6 hours at this writing, but that is subject to change in a future version), so configuration changes are discovered at least at that interval.
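For example, adding a newly provisioned node to an existing configuration can be done through the tfactl host command (listed in the tfactl help later in this paper). The host name below is a placeholder and the exact argument form may vary by TFA version; see the TFA MOS Note for details.
$ sudo ./tfactl host add mynewnode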

A file inventory is kept for every directory registered in the TFA configuration, and metadata about those files is maintained in the BDB. Examples of the metadata kept are file name, first timestamp, last timestamp, file type, etc. Differing timestamp formats are also normalized, as the format of timestamps can vary from one file type to the next. When a collection is taken, the first step is to take a pre-collection inventory in case any files have been modified or created since the last periodic inventory. In the case of a pre-collection inventory, only files for the specified databases and/or components are inventoried.
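Additional directories can be registered for inventory and collection through the tfactl directory command (listed in the tfactl help later in this paper). As an illustration only, with a placeholder path and an argument form that may vary by TFA version:
$ sudo ./tfactl directory add /u01/app/oracle/custom_logs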
The TFA resource footprint is expected to be very small. Usually the user would not even be aware that it is running. The only time TFA will consume any noticeable CPU is when doing an inventory or a collection, and then only for brief periods and on a single CPU.
Under normal operating conditions TFA spawns a thread to monitor the end of each alert log in the configuration (Database, ASM and Clusterware). TFA monitors for certain events, such as node or instance evictions, and certain errors, such as ORA-00600, ORA-07445, etc. TFA stores metadata about these events in the BDB for future reference. TFA can also be configured to take collections automatically when those events occur and store them in the repository. There is a built-in flood control mechanism for the auto-collection feature to prevent a rash of duplicate errors or events from triggering multiple duplicate collections. Should the repository reach its configurable maximum size, TFA will suspend taking collections until space is cleared in the repository.
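Automatic collections are controlled through the tfactl set command (listed in the tfactl help later in this paper). As an illustration only, and assuming the parameter name autodiagcollect, enabling event-driven collections might look like:
$ sudo ./tfactl set autodiagcollect=ON
The print config output later in this paper shows whether automatic diagnostic collection is currently ON or OFF.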

TFA Usage

The primary purpose for using TFA is to take collections. Below is a listing of the tfactl diagcollect syntax from the interactive help.
NOTE: TFA must be installed as root and tfactl must be run as root. Installing and running under sudo control is also supported. To configure TFA for sudo control, here is an example /etc/sudoers configuration for a host named myhost, assuming the TFA installer is staged in /home/oracle and that TFA_HOME is /opt/oracle/tfa/tfa_home/.
oracle myhost=/home/oracle/installTFALite.sh
oracle myhost=/opt/oracle/tfa/tfa_home/bin/tfactl
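With this sudoers configuration in place, the oracle user can install TFA and take collections under sudo, for example:
$ sudo /home/oracle/installTFALite.sh
$ sudo /opt/oracle/tfa/tfa_home/bin/tfactl diagcollect -since 1h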

Diagcollect
$ sudo ./tfactl diagcollect -h
Usage: /opt/oracle/tfa/tfa_home/bin/tfactl diagcollect [-all | -database <all|d1,d2..> | -asm | -crs | -os | -install | -chmos | -nochmos]
       [-node <all | local | n1,n2,..>] [-tag <description>] [-z <filename>]
       [-since <n><h|d> | -from <time> -to <time> | -for <time>] [-nocopy] [-nomonitor]
Options:
-all              Collect all logs (If no time is given for collection then files for the last 4 hours will be collected)
-crs              Collect CRS logs
-asm              Collect ASM logs
-database         Collect database logs from databases specified
-os               Collect OS files such as /var/log/messages
-install          Collect Oracle Installation related files
-chmos            Collect CHMOS files (Note that this data can be large for longer durations)
-nochmos          Do not collect CHMOS data when it would normally have been collected
-node             Specify comma separated list of host names for collection
-nocopy           Does not copy back the zip files to initiating node from all nodes
-nomonitor        This option is used to submit the diagcollection as a background process
-since <n><h|d>   Files from past 'n' [d]ays or 'n' [h]ours
-from "MMM/dd/yyyy hh:mm:ss"   From <time>
-to "MMM/dd/yyyy hh:mm:ss"     To <time>
-for "MMM/dd/yyyy"             For <date>.
-tag <tagname>    The files will be collected into tagname directory inside repository
Examples:
/opt/oracle/tfa/tfa_home/bin/tfactl diagcollect
    Trim and Zip all files updated in the last 4 hours as well as chmos/osw data from across the cluster and collect at the initiating node.
    Note: This collection could be larger than required but is there as the simplest way to capture diagnostics if an issue has recently occurred.
/opt/oracle/tfa/tfa_home/bin/tfactl diagcollect -all -since 8h
    Trim and Zip all files updated in the last 8 hours as well as chmos/osw data from across the cluster and collect at the initiating node.
/opt/oracle/tfa/tfa_home/bin/tfactl diagcollect -database hrdb,fdb -since 1d -z foo
    Trim and Zip all files from databases hrdb & fdb in the last 1 day and collect at the initiating node.
/opt/oracle/tfa/tfa_home/bin/tfactl diagcollect -crs -os -node node1,node2 -since 6h
    Trim and Zip all crs files, o/s logs and chmos/osw data from node1 & node2 updated in the last 6 hours and collect at the initiating node.
/opt/oracle/tfa/tfa_home/bin/tfactl diagcollect -asm -node node1 -from Mar/4/2013 -to "Mar/5/2013 21:00:00"
    Trim and Zip all ASM logs from node1 updated between the from and to times and collect at the initiating node.
/opt/oracle/tfa/tfa_home/bin/tfactl diagcollect -for "Mar/2/2013"
    Trim and Zip all log files updated on "Mar/2/2013" and collect at the initiating node.
/opt/oracle/tfa/tfa_home/bin/tfactl diagcollect -for "Mar/2/2013 21:00:00"
    Trim and Zip all log files updated from 09:00 on March 2 to 09:00 on March 3 (i.e. 12 hours before and after the time given) and collect at the initiating node.

Sample collection output


$ sudo ./tfactl diagcollect
Collecting data for the last 4 hours for all components...
Running an inventory clusterwide ...
Collection name tfa_Fri_Aug_30_22_02_55_PDT_2013.zip
Getting list of files satisfying time range [Fri Aug 30 18:03:10 PDT
2013, Fri Aug 30 22:03:10 PDT 2013]
myhost1: Zipping File:
/opt/oracle/oak/oswbb/archive/oswnetstat/myhost1_netstat_13.08.30.2100.dat.gz
myhost1: Zipping File:
/opt/oracle/oak/oswbb/archive/oswmeminfo/myhost1_meminfo_13.08.30.1800.dat.gz
myhost1: Zipping File:
/opt/oracle/oak/oswbb/archive/oswvmstat/myhost1_vmstat_13.08.30.1900.dat.gz
myhost1: Zipping File: /u01/app/11.2.0.3/grid/log/myhost1/ctssd/octssd.log
myhost1: Zipping File: /opt/oracle/oak/log/myhost1/oak/oakd.log
Trimming file : /opt/oracle/oak/log/myhost1/oak/oakd.log with original file size : 4.5MB
myhost1: Zipping File: /u01/app/11.2.0.3/grid/log/myhost1/alertmyhost1.log
Trimming file : /u01/app/11.2.0.3/grid/log/myhost1/alertmyhost1.log with original file
size : 256kB
[... output truncated for brevity ...]
myhost1: Zipping File:
/u01/app/11.2.0.3/grid/log/myhost1/agent/ohasd/oracssdagent_root/oracssdagent_root.log
Trimming file :
/u01/app/11.2.0.3/grid/log/myhost1/agent/ohasd/oracssdagent_root/oracssdagent_root.log
with original file size : 6.7MB

myhost1: Zipping File: /u01/app/oracle/diag/rdbms/test2/test21/trace/alert_test21.log


myhost1: Zipping File:
/opt/oracle/oak/oswbb/archive/oswslabinfo/myhost1_slabinfo_13.08.30.2200.dat
myhost1: Zipping File:
/opt/oracle/oak/oswbb/archive/oswiostat/myhost1_iostat_13.08.30.2200.dat
Collecting extra files...
Total Number of Files checked : 10588
Total Size of all Files Checked : 4.1GB
Number of files containing required range : 72
Total Size of Files containing required range : 127MB
Number of files trimmed : 13
Total Size of data prior to zip : 123MB
Saved 51MB by trimming files
Zip file size : 7.1MB
Total time taken : 47s
Completed collection of zip files.

Logs are collected to:


/opt/oracle/tfa/tfa_home/repository/collection_Fri_Aug_30_22_02_55_PDT_
2013_node_all/myhost1.tfa_Fri_Aug_30_22_02_55_PDT_2013.zip
/opt/oracle/tfa/tfa_home/repository/collection_Fri_Aug_30_22_02_55_PDT_
2013_node_all/myhost2.tfa_Fri_Aug_30_22_02_55_PDT_2013.zip

Other sample command output


$ sudo ./tfactl print status
.---------------------------------------------------------------------------------------.
| Host     | Status of TFA | PID   | Port | Version | Build ID       | Inventory Status |
+----------+---------------+-------+------+---------+----------------+------------------+
| myhost1  | RUNNING       | 28922 | 5000 | 2.5.1.5 | 20130830055620 | COMPLETE         |
| myhost2  | RUNNING       | 28922 | 5000 | 2.5.1.5 | 20130830055620 | COMPLETE         |
'----------+---------------+-------+------+---------+----------------+------------------'

$ sudo ./tfactl print config


.----------------------------------------------------.
| Configuration Parameter                  | Value   |
+------------------------------------------+---------+
| TFA version                              | 2.5.1.5 |
| Automatic diagnostic collection          | OFF     |
| Trimming of files during diagcollection  | ON      |
| Repository current size (MB) in myhost1  | 0       |
| Repository maximum size (MB) in myhost1  | 10240   |
| Trace level                              | 1       |
'------------------------------------------+---------'
$ sudo ./tfactl print hosts
Host Name : myhost1
Host Name : myhost2
$ sudo ./tfactl print repository
.------------------------------------------------------------.
|                           myhost1                           |
+----------------------+-------------------------------------+
| Repository Parameter | Value                               |
+----------------------+-------------------------------------+
| Location             | /opt/oracle/tfa/tfa_home/repository |
| Maximum Size (MB)    | 10240                               |
| Current Size (MB)    | 3                                   |
| Status               | OPEN                                |
'----------------------+-------------------------------------'
$ sudo ./tfactl -h
Usage : /opt/oracle/tfa/tfa_home/bin/tfactl <command> [options]
<command> =
    print        Print requested details
    purge        Delete collections from TFA repository
    directory    Add or Remove or Modify directory in TFA
    host         Add or Remove host in TFA
    set          Turn ON/OFF or Modify various TFA features
    diagcollect  Collect logs from across nodes in cluster


Summary
Goals
Improved comprehensive first failure diagnostics collection
Efficient collection, packaging and transfer of data for Customers
Reduce round trips between Customers and Oracle
Supports 10.2, 11.1, 11.2 and above
Included in the 11.2.0.4 patchset and future versions

Approach
Collect all relevant components (OS, Grid Infrastructure, ASM, RDBMS)
One command to collect all required information
Prune large files based on temporal criteria
Collect time relevant IPS (incident) packages on RAC nodes
Collect time relevant CHMOS, OSWatcher data on RAC nodes
On-demand (default) and Event Driven diagnostic collections
TFA download: available from My Oracle Support (see the TFA MOS Note)

Oracle Trace File Analyzer Collector (TFA) August 2013 Author: Bob Caldwell
Contributing Authors: Bill Burton, Sandesh Rao
Oracle Corporation World Headquarters 500 Oracle Parkway Redwood Shores, CA 94065 U.S.A.
Worldwide Inquiries: Phone: +1.650.506.7000 Fax: +1.650.506.7200
oracle.com
Copyright 2011, Oracle and/or its affiliates. All rights reserved. This document is provided for
information purposes only and the contents hereof are subject to change without notice. This
document is not warranted to be error-free, nor subject to any other warranties or conditions,
whether expressed orally or implied in law, including implied warranties and conditions of
merchantability or fitness for a particular purpose. We specifically disclaim any liability with
respect to this document and no contractual obligations are formed either directly or indirectly by
this document. This document may not be reproduced or transmitted in any form or by any
means, electronic or mechanical, for any purpose, without our prior written permission.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be
trademarks of their respective owners.
AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered
trademarks of Advanced Micro Devices. Intel and Intel Xeon are trademarks or registered
trademarks of Intel Corporation. All SPARC trademarks are used under license and are
trademarks or registered trademarks of SPARC International, Inc. UNIX is a registered trademark
licensed through X/Open Company, Ltd.
