
Sun Systems Fault Analysis Workshop
ST-350
Student Guide With Instructor Notes

Sun Microsystems, Inc.


UBRM05-104
500 Eldorado Blvd.
Broomfield, CO 80021
U.S.A.
Revision E

Copyright 2002 Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, California 94303, U.S.A. All rights reserved.
This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and
decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of
Sun and its licensors, if any.
Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.
Sun, Sun Microsystems, the Sun Logo, Solaris, Ultra, SunSolve Online, Sun Explorer Data Collector, Sun Enterprise Ultra, SunSpectrum,
BigAdmin, Sun System Configuration Check, Sun Blade, Sun Fire, Sun4U, Solaris Management Console, SunSolve Online, SunVTS,
AnswerBook2, and OpenWindows are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.
All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and
other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd.
Adobe is a registered trademark of Adobe Systems, Incorporated.
Federal Acquisitions: Commercial Software - Government Users Subject to Standard License Terms and Conditions
Export Laws. Products, Services, and technical data delivered by Sun may be subject to U.S. export controls or the trade laws of other
countries. You will comply with all such laws and obtain all licenses to export, re-export, or import as may be required after delivery to
You. You will not export or re-export to entities on the most current U.S. export exclusions lists or to any country subject to U.S. embargo
or terrorist controls as specified in the U.S. export laws. You will not use or provide Products, Services, or technical data for nuclear, missile,
or chemical biological weaponry end uses.
DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS, AND
WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE
OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE
LEGALLY INVALID.

THIS MANUAL IS DESIGNED TO SUPPORT AN INSTRUCTOR-LED TRAINING
(ILT) COURSE AND IS INTENDED TO BE USED FOR REFERENCE PURPOSES IN
CONJUNCTION WITH THE ILT COURSE. THE MANUAL IS NOT A STANDALONE
TRAINING TOOL. USE OF THE MANUAL FOR SELF-STUDY WITHOUT CLASS
ATTENDANCE IS NOT RECOMMENDED.
Export Control Classification Number (ECCN): 5E992




Table of Contents
About This Course ............................................................Preface-xvii
Course Goals........................................................................Preface-xvii
Course Map........................................................................ Preface-xviii
Topics Not Covered............................................................. Preface-xix
How Prepared Are You?...................................................... Preface-xx
Introductions ........................................................................ Preface-xxi
How to Use Course Materials ........................................... Preface-xxi
Conventions .........................................................................Preface-xxii
Icons .............................................................................Preface-xxii
Typographical Conventions ....................................Preface-xxiii
Introducing the Fault Analysis and Diagnosis Methodology .......1-1
Objectives ........................................................................................... 1-1
Relevance............................................................................................. 1-2
Additional Resources ........................................................................ 1-3
Describing the Fault Analysis and Diagnosis Methodology ....... 1-4
Stating the Problem Clearly..................................................... 1-5
Listing Facts ............................................................................... 1-6
Documenting Each Item Carefully ......................................... 1-9
Introducing the Fault Diagnosis Methodology ........................... 1-11
Prioritizing Planned Tests...................................................... 1-12
Verifying the Corrective Action............................................ 1-14
Documenting Each Item......................................................... 1-14
Identifying the Basic Layers and Error Types in
Sun Systems ................................................................................... 1-17
Overview of the Four Basic Layers of a Sun System ......... 1-17
Introducing Types of Faults in Sun Systems....................... 1-18
Identifying Error-Reporting Mechanisms ........................... 1-20
Exercise: Performing Fault Analysis and Diagnosis................... 1-22
Preparation............................................................................... 1-22
Tasks ......................................................................................... 1-22
Fault Analysis and Diagnosis Worksheet Template.................. 1-23
Analysis Phase......................................................................... 1-23
Diagnosis Phase....................................................................... 1-24

Copyright 2002 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services, Revision E

Exercise Summary............................................................................ 1-26


Exercise Solution .............................................................................. 1-27
Analysis Phase......................................................................... 1-27
Diagnosis Phase....................................................................... 1-28
Introducing OBP Components, Features, and Diagnostics......... 2-1
Objectives ........................................................................................... 2-1
Relevance............................................................................................. 2-2
Additional Resources ........................................................................ 2-3
Introducing OBP Components......................................................... 2-4
Introducing Boot PROM .......................................................... 2-5
Introducing NVRAM................................................................ 2-9
Listing Common OBP Variables ........................................... 2-10
Modifying OBP Variables and Running Diagnostics ................. 2-12
Modifying OBP Variables ...................................................... 2-12
Preparing for Manual OBP Diagnostics .............................. 2-15
Using Manual OBP Diagnostic Commands........................ 2-15
Using OBP Commands to Display System
Information ........................................................................... 2-18
Exercise: Modifying the OBP Variables ........................................ 2-22
Preparation............................................................................... 2-22
Tasks ......................................................................................... 2-22
Exercise Solutions ............................................................................ 2-23
Exercise: Performing Manual OBP Diagnostics .......................... 2-24
Preparation............................................................................... 2-24
Tasks ........................................................................................ 2-25
Exercise Summary............................................................................ 2-26
Exercise Solutions ............................................................................ 2-27
Enabling and Monitoring POST Diagnostics................................. 3-1
Objectives ........................................................................................... 3-1
Relevance............................................................................................. 3-2
Additional Resources ........................................................................ 3-3
Introducing POST Concepts............................................................. 3-4
Identifying the Testable Components.................................... 3-4
Describing the diag-switch? Variable ................................ 3-5
Identifying the Methods to Enable Extended POST
Diagnostics.............................................................................. 3-5
Booting From the diag-device or boot-device
Variable.................................................................................... 3-8
Viewing Extended Diagnostics During POST ............................. 3-10
Using the tip Command ....................................................... 3-10
Using the prtdiag Command .............................................. 3-14
Using the show-post-results Command ........................ 3-22
Exercise: Enabling and Monitoring POST Diagnostics .............. 3-24
Preparation............................................................................... 3-24
Tasks ........................................................................................ 3-25

Exercise Summary............................................................................ 3-27
Exercise Solutions ............................................................................ 3-28
Introducing the OBP Device Tree and the Boot Sequence ..........4-1
Objectives ........................................................................................... 4-1
Relevance............................................................................................. 4-2
Additional Resources ........................................................................ 4-3
Introducing the OBP Device Tree.................................................... 4-4
Device Path Name..................................................................... 4-6
Automated OBP Probing ......................................................... 4-8
Navigating and Examining the OBP Device Tree ................ 4-9
Creating Custom Device Aliases .......................................... 4-12
Introducing the Boot Sequence ...................................................... 4-17
Boot Sequence.......................................................................... 4-17
Using boot Commands.......................................................... 4-30
Exercise: Introducing the OBP Device Tree and the Boot
Sequence......................................................................................... 4-33
Preparation............................................................................... 4-33
Tasks ......................................................................................... 4-33
Exercise Summary............................................................................ 4-35
Exercise Solutions ............................................................................ 4-36
Performing Solaris OE Diagnostics................................................5-1
Objectives ........................................................................................... 5-1
Relevance............................................................................................. 5-2
Additional Resources ........................................................................ 5-3
Using the Device Management Commands .................................. 5-4
Using the devfsadm Command .............................................. 5-4
Using the Pre-Solaris 8 OE Device Commands .................... 5-6
Using the Disk and File System Management Commands ......... 5-7
Using the format Command .................................................. 5-7
Using the fsck Command....................................................... 5-8
Using the fstyp Command................................................... 5-11
Using the iostat Command ................................................ 5-11
Using the Software Package Management Commands ............. 5-14
Using the pkgchk Command ................................................ 5-14
Using the pkginfo Command .............................................. 5-15
Using the pkgadd Command ................................................ 5-16
Using the pkgrm Command................................................... 5-17
Using the File-Checking Commands ............................................ 5-18
Checking for Hidden Characters .......................................... 5-18
Comparing File Contents....................................................... 5-20
Using the CPU and Memory Management Commands ............ 5-23
Using the ps Command ......................................................... 5-23
Using the vmstat Command ................................................ 5-25
Using the psrinfo Command .............................................. 5-27
Using the mpstat Command ................................................ 5-27

Using the modinfo Command .............................................. 5-29
Using the pgrep Command................................................... 5-31
Using the Network Management Commands............................. 5-32
Using the ping Command..................................................... 5-32
Using the traceroute Command........................................ 5-33
Using the ifconfig Command ............................................ 5-35
Using the arp Command ....................................................... 5-37
Using the netstat Command .............................................. 5-39
Using the snoop Command................................................... 5-42
Using the General-Purpose Commands ....................................... 5-43
Using the find Command..................................................... 5-43
Using the script Command ................................................ 5-44
Using the file Command..................................................... 5-45
Using the tail Command..................................................... 5-45
Using the uname Command................................................... 5-46
Using the showrev Command .............................................. 5-48
Using the prtconf Command .............................................. 5-49
Using the sysdef Command ................................................ 5-51
Using the nm Command ......................................................... 5-52
Using the swap Command..................................................... 5-53
Using the Program Execution Management Commands .......... 5-55
Using the truss Command................................................... 5-55
Using the coreadm Command .............................................. 5-56
Exercise: Performing Solaris OE Diagnostics............................... 5-59
Preparation............................................................................... 5-59
Tasks ......................................................................................... 5-59
Exercise Summary............................................................................ 5-62
Exercise Solutions ............................................................................ 5-63
Diagnosing Faults Using Online Tools .......................................... 6-1
Objectives ........................................................................................... 6-1
Relevance............................................................................................. 6-2
Additional Resources ........................................................................ 6-3
Using the Online Man Pages ............................................................ 6-4
The MANPATH Variable .............................................................. 6-4
Using the man -l Option......................................................... 6-5
Using the man -s Option......................................................... 6-5
Using the man -M Option......................................................... 6-6
Using the man -k Option......................................................... 6-6
Diagnosing Problems by Using the SunSolve Online
Service............................................................................................... 6-9
Accessing the SunSolve Online Service ................................. 6-9
Using the SunSolve Online Service ..................................... 6-10
Performing Search Operations in the SunSolve
Online Service...................................................................... 6-14
Identifying Patch Support Tools.......................................... 6-17

Using the Sun Explorer Data Collector Utility ........................... 6-21
Obtaining Explorer ................................................................. 6-21
Installing Explorer .................................................................. 6-21
Configuring and Executing Explorer ................................... 6-21
Reviewing the Explorer Output............................................ 6-22
Using the docs.sun.com Web Site .............................................. 6-23
Browsing the docs.sun.com Web Site ................................ 6-23
Performing a Search Operation on the
docs.sun.com Web Site ..................................................... 6-27
Printing Files From the docs.sun.com Web Site.............. 6-30
Icon Legends in the docs.sun.com Web Site..................... 6-33
Exercise: Using the man Command ............................................... 6-36
Preparation............................................................................... 6-36
Tasks ......................................................................................... 6-36
Exercise Solutions ............................................................................ 6-37
Exercise: Diagnosing Problems Using the SunSolve Online
Service............................................................................................. 6-38
Preparation............................................................................... 6-38
Tasks ......................................................................................... 6-38
Exercise Summary............................................................................ 6-40
Exercise Solutions ............................................................................ 6-41
Introducing Types of System Failures ...........................................7-1
Objectives ........................................................................................... 7-1
Relevance............................................................................................. 7-2
Additional Resources ........................................................................ 7-3
Introducing the Causes of System Panics....................................... 7-4
Introducing System Panics ...................................................... 7-6
Introducing a System Hang..................................................... 7-9
Generating a System Crash Dump ................................................ 7-10
Writing the System Crash Dump......................................... 7-11
Configuring the System to Process Crash Dumps ............. 7-12
Using the dumpadm Command .............................................. 7-12
Using the savecore Command Automatically.................. 7-14
Using the savecore Command Manually.......................... 7-15
Introducing Watchdog Resets ........................................................ 7-19
Identifying Causes and Effects of Watchdog Resets.......... 7-19
Identifying the watchdog-reboot? OBP Variable ............ 7-20
Displaying the Register Contents by Using OBP
Commands ............................................................................ 7-20
Identifying the misc/obpsym Kernel Module .................... 7-21
Exercise: Introducing Types of System Failures.......................... 7-22
Preparation............................................................................... 7-22
Tasks ......................................................................................... 7-22
Exercise Summary............................................................................ 7-23
Exercise Solutions ............................................................................ 7-24

Analyzing Core Dumps Using the mdb Utility ................................ 8-1
Objectives ........................................................................................... 8-1
Relevance............................................................................................. 8-2
Additional Resources ........................................................................ 8-3
Introducing the mdb Utility............................................................... 8-4
Features of the mdb Utility ....................................................... 8-6
Limitations of the mdb Utility .................................................. 8-8
General mdb Command Formats............................................. 8-9
Relationship Between the mdb and adb Utilities................... 8-9
Using the mdb Utility ....................................................................... 8-10
Identifying Macros and Registers......................................... 8-10
Examining System Dumps by Using the mdb Utility........ 8-14
Exercise: Analyzing Core Dumps Using the mdb Utility............ 8-21
Preparation............................................................................... 8-21
Tasks ......................................................................................... 8-21
Exercise Summary............................................................................ 8-23
Exercise Solutions ............................................................................ 8-24
Sample Outputs ...............................................................................A-1
Output of the eeprom Command on a Sun4U Enterprise
Server ............................................................................................... A-2
Sample Report of the PatchDiag Tool ......................................... A-4
Additional Information.....................................................................B-1
The probe Commands ......................................................................B-2
The test Commands ........................................................................B-4
The watch Commands ......................................................................B-6
Architecture of the Ultra 5 and Ultra 10 Workstations.................B-7
The show-post-results Command..............................................B-8
Obtaining a SunSolve Account ......................................................B-11
Workshop Exercises........................................................................C-1
Introduction ...................................................................................... C-1
Preparatory Tasks .................................................................... C-1
Fault #1 Blank Monitor.................................................................. C-4
Analysis Phase.......................................................................... C-4
Diagnosis Phase....................................................................... C-5
Fault #2 Unknown Device ............................................................ C-6
Analysis Phase.......................................................................... C-6
Diagnosis Phase....................................................................... C-7
Fault #3 The ps Command Does Not Work............................... C-8
Analysis Phase.......................................................................... C-8
Diagnosis Phase..................................................................... C-10
Fault #4 Repetitive Boot Sequences ........................................... C-11
Analysis Phase........................................................................ C-11
Diagnosis Phase..................................................................... C-12

Fault #5 Login Problem ............................................................... C-13
Analysis Phase........................................................................ C-13
Diagnosis Phase..................................................................... C-14
Fault #6 Problem With the root Login ..................................... C-15
Analysis Phase........................................................................ C-15
Diagnosis Phase..................................................................... C-16
Fault #7 Problem in the Network .............................................. C-17
Analysis Phase........................................................................ C-17
Diagnosis Phase..................................................................... C-18
Fault #8 Hung System ................................................................. C-19
Analysis Phase........................................................................ C-19
Diagnosis Phase..................................................................... C-21
Fault #9 Problem With the CDE................................................. C-22
Analysis Phase........................................................................ C-22
Diagnosis Phase..................................................................... C-23
Fault #10 Problem With the ftp Service................................... C-24
Analysis Phase........................................................................ C-24
Diagnosis Phase..................................................................... C-25
Fault #11 Problem With the Non-root User Accounts .......... C-26
Analysis Phase........................................................................ C-26
Diagnosis Phase..................................................................... C-27
Fault #12 Problem in the Network ............................................ C-28
Analysis Phase........................................................................ C-28
Diagnosis Phase..................................................................... C-29
Fault #13 Problem With the CDE............................................... C-30
Analysis Phase........................................................................ C-30
Diagnosis Phase..................................................................... C-31
Fault #14 Problem With the CDE Login Screen....................... C-32
Analysis Phase........................................................................ C-32
Diagnosis Phase..................................................................... C-33
Fault #15 Problem With the root Account .............................. C-34
Analysis Phase........................................................................ C-34
Diagnosis Phase..................................................................... C-35
Fault #16 Problem in the Network ............................................ C-36
Analysis Phase........................................................................ C-36
Diagnosis Phase..................................................................... C-37
Fault #17 Problem With the Network Printer.......................... C-38
Analysis Phase........................................................................ C-38
Diagnosis Phase..................................................................... C-39
Fault #18 Problem in the Network ............................................ C-40
Analysis Phase........................................................................ C-40
Diagnosis Phase..................................................................... C-41
Fault #19 Problem With Read-only File System ...................... C-42
Analysis Phase........................................................................ C-42
Diagnosis Phase..................................................................... C-43

Copyright 2002 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services, Revision E

Fault #20 Problem With the CDE............................................... C-44


Analysis Phase........................................................................ C-44
Diagnosis Phase..................................................................... C-45
Fault #21 Corrupt Network File................................................. C-46
Analysis Phase........................................................................ C-46
Diagnosis Phase..................................................................... C-47
Fault #22 Problem in the Network ............................................ C-48
Analysis Phase........................................................................ C-48
Diagnosis Phase..................................................................... C-49
Fault #23 Problem With Admintool .......................................... C-50
Analysis Phase........................................................................ C-50
Diagnosis Phase..................................................................... C-52
Fault #24 Boot Failure.................................................................. C-53
Analysis Phase........................................................................ C-53
Diagnosis Phase..................................................................... C-54
Fault #25 Hung System ............................................................... C-55
Analysis Phase........................................................................ C-55
Diagnosis Phase..................................................................... C-56
Fault #26 Problem in the Network ............................................ C-57
Analysis Phase........................................................................ C-57
Diagnosis Phase..................................................................... C-58
Fault #27 Script Hangs the System ............................................ C-59
Analysis Phase........................................................................ C-59
Diagnosis Phase..................................................................... C-61
Fault #28 Inappropriate Halts .................................................... C-62
Analysis Phase........................................................................ C-62
Diagnosis Phase..................................................................... C-63
Fault #29 SunSolve Workshop ................................................... C-64
Analysis Phase........................................................................ C-64
Diagnosis Phase..................................................................... C-66
Fault #30 Corrupt File System.................................................... C-67
Analysis Phase........................................................................ C-67
Diagnosis Phase..................................................................... C-68
Fault #31 Insufficient File Permission ....................................... C-69
Analysis Phase........................................................................ C-69
Diagnosis Phase..................................................................... C-71
Fault #32 Problem in the Network ............................................ C-72
Analysis Phase........................................................................ C-72
Diagnosis Phase..................................................................... C-73
Fault #33 Login Problem ............................................................. C-74
Analysis Phase........................................................................ C-74
Diagnosis Phase..................................................................... C-75
Fault #34 Analyze System Crash Dumps ................................. C-76
Analysis Phase........................................................................ C-76
Diagnosis Phase..................................................................... C-77

Fault #35 Problem in the Network ............................................ C-78


Analysis Phase........................................................................ C-78
Diagnosis Phase..................................................................... C-79
Fault #36 Faulty CD-ROM .......................................................... C-80
Analysis Phase........................................................................ C-80
Diagnosis Phase..................................................................... C-81
Fault #37 Turn the Page............................................................... C-82
Analysis Phase........................................................................ C-82
Diagnosis Phase..................................................................... C-83
Fault #38 Login Problem ............................................................. C-84
Analysis Phase........................................................................ C-84
Diagnosis Phase..................................................................... C-85
Fault #39 Do Not Point at Me...................................................... C-86
Analysis Phase........................................................................ C-86
Diagnosis Phase..................................................................... C-87
Fault #40 Problem in the Network ............................................ C-88
Analysis Phase........................................................................ C-88
Diagnosis Phase..................................................................... C-89
Fault #41 No Space on the File System ..................................... C-90
Analysis Phase........................................................................ C-90
Diagnosis Phase..................................................................... C-91
Fault #42 Cannot Mount a File System ..................................... C-92
Analysis Phase........................................................................ C-92
Diagnosis Phase..................................................................... C-93
Fault #43 Problem in the Network ............................................ C-94
Analysis Phase........................................................................ C-94
Diagnosis Phase..................................................................... C-95
Fault #44 User Login Problem.................................................... C-96
Analysis Phase........................................................................ C-96
Diagnosis Phase..................................................................... C-97
Fault #45 Problem in the Network ............................................ C-98
Analysis Phase........................................................................ C-98
Diagnosis Phase..................................................................... C-99
Fault #46 System Displays a Panic Message .......................... C-100
Analysis Phase...................................................................... C-100
Diagnosis Phase................................................................... C-101
Fault #47 Corrupt File System.................................................. C-102
Analysis Phase...................................................................... C-102
Diagnosis Phase................................................................... C-103
Fault #48 Remote Login Failure ............................................... C-104
Analysis Phase...................................................................... C-104
Diagnosis Phase................................................................... C-105
Fault #49 Corrupt File System.................................................. C-106
Analysis Phase...................................................................... C-106
Diagnosis Phase................................................................... C-107

Fault #50 Student Designed Workshop .................................. C-108


Analysis Phase...................................................................... C-108
Diagnosis Phase................................................................... C-109
Workshop Exercises........................................................................D-1
Fault #1 Blank Monitor.................................................................. D-3
Fault #2 Unknown Device ............................................................ D-5
Fault #3 The ps Command Does Not Work............................... D-7
Fault #4 Repetitive Boot Sequences ............................................. D-9
Fault #5 Login Problem ............................................................... D-12
Fault #6 Problem With the root Login ..................................... D-14
Fault #7 Problem in the Network .............................................. D-16
Fault #8 Hung System ................................................................. D-18
Fault #9 Problem With the CDE................................................. D-20
Fault #10 Problem With the ftp Service................................... D-23
Fault #11 Problem With Non-root User Accounts................. D-25
Fault #12 Problem in the Network ............................................ D-27
Fault #13 Problem With the CDE............................................... D-29
Fault #14 Problem With the CDE Login Screen....................... D-31
Fault #15 Problem With the root Account .............................. D-33
Fault #16 Problem in the Network ............................................ D-35
Fault #17 Problem With the Network Printer.......................... D-36
Fault #18 Problem in the Network ............................................ D-38
Fault #19 Problem With Read-only File System ...................... D-40
Fault #20 Problem With the CDE............................................... D-42
Fault #21 Corrupt Network File................................................. D-44
Fault #22 Problem in the Network ............................................ D-46
Fault #23 Problem With Admintool .......................................... D-48
Fault #24 Boot Failure.................................................................. D-50
Fault #25 Hung System ............................................................... D-52
Fault #26 Problem in the Network ............................................ D-54
Fault #27 Script Hangs the System ............................................ D-56
Fault #28 Inappropriate Halts .................................................... D-59
Fault #29 SunSolve Workshop ................................................... D-61
Fault #30 Corrupt File System.................................................... D-63
Fault #31 Insufficient File Permission ....................................... D-65
Fault #32 Problem in the Network ............................................ D-67
Fault #33 Login Problem ............................................................. D-68
Fault #34 Analyze System Crash Dumps ................................. D-70
Fault #35 Problem in the Network ............................................ D-78
Fault #36 Faulty CD-ROM .......................................................... D-79
Fault #37 Turn the Page............................................................... D-81
Fault #38 Login Problem ............................................................. D-83
Fault #39 Do Not Point at Me..................................................... D-85
Fault #40 Problem in the Network ............................................ D-87
Fault #41 No Space on the File System ..................................... D-89

Fault #42 Cannot Mount a File System ..................................... D-91
Fault #43 Problem in the Network ............................................ D-93
Fault #44 User Login Problem.................................................... D-94
Fault #45 Problem in the Network ............................................ D-95
Fault #46 System Displays a Panic Message ............................ D-96
Fault #47 Corrupt File System.................................................... D-98
Fault #48 Remote Login Failure ............................................... D-100
Fault #49 Corrupt File System.................................................. D-102
Fault #50 Student Designed Workshop .................................. D-104
Glossary/Acronyms............................................................ Glossary-1
Index .......................................................................................... Index-1


Preface

About This Course


Course Goals
Upon completion of this course, you should be able to:

Describe the fault analysis and diagnosis methodology

Describe OpenBoot PROM (OBP) components, features, and diagnostics

Enable and monitor power-on self-test (POST) diagnostics

Describe the OBP device tree and boot sequence

Perform Solaris Operating Environment (Solaris OE) diagnostics

Diagnose faults using online tools

Describe types of major system failures

Analyze core dumps using the mdb utility

Use this module to get the students interested in this course.


Ask the students how many signed up for this course because of the information in the Sun Educational
Services course catalog. Use this introduction to the course to determine how well students are equipped
with prerequisite knowledge and skills, and what they know and expect of the stated objectives.
You can use this information as a tool to manage your time in covering the material in this course.
The strategy provided by About This Course is to introduce students to the course before they introduce
themselves to you and one another. By familiarizing them with the content of the course first, their
introductions have more meaning in relation to the course prerequisites and objectives.

Course Map
The following course map enables you to see what you have
accomplished and where you are going in reference to the course goals.

Fundamentals of Fault Analysis and Diagnosis
    Introducing the Fault Analysis and Diagnosis Methodology

POST and OBP Diagnostics
    Introducing OBP Components, Features, and Diagnostics
    Enabling and Monitoring POST Diagnostics
    Introducing the OBP Device Tree and BOOT Sequence

Sun Software
    Performing Solaris OE Diagnostics
    Diagnosing Faults Using Online Tools

System Crash Dump Analysis
    Introducing Types of System Failures
    Analyzing Core Dumps Using the mdb Utility

Topics Not Covered


This course does not cover the following topics. Many of these topics are
covered in other courses offered by Sun Educational Services:

Basic system administration topics: Covered in SA-239, Intermediate
System Administration for the Solaris 9 Operating Environment

Software installation

User account configuration

Printer management

Basic security policies

Advanced system administration topics: Covered in SA-299, Advanced System
Administration for the Solaris 9 Operating Environment

Device naming conventions

Network File System (NFS), cachefs, and automounter administration

Network Information Service/Network Information Service Plus (NIS/NIS+)
administration

Disk and file system configuration

Network configuration

The Service Access Facility (SAF) utility

Refer to the Sun Educational Services catalog for specific information and
registration.

How Prepared Are You?


To be sure you are prepared to take this course, can you answer one of the
following questions in the affirmative?

Have you completed SA-119: UNIX Essentials Featuring the Solaris
9 Operating Environment, or SA-239: Intermediate System
Administration for the Solaris 9 Operating Environment, and SA-299:
Advanced System Administration for the Solaris 9 Operating
Environment?

or

Do you have at least six months of field system administration or
system maintenance experience in the Solaris OE?

If any students indicate they cannot do the above, meet with them at the first break to decide how to
proceed with the class. Do they want to take the class at a later date? Is there some way to get the
extra help needed during the week?

It might be appropriate here to recommend resources from the Sun Educational Services catalog that
provide training for topics not covered in this course.

Introductions
Now that you have been introduced to the course, introduce yourself to
the other students and the instructor, addressing the items shown on the
overhead.

How to Use Course Materials


To enable you to succeed in this course, these course materials employ a
learning model that is composed of the following components:

Goals: You should be able to accomplish the goals after finishing
this course and meeting all of its objectives.

Objectives: You should be able to accomplish the objectives after
completing a portion of the instructional content. The objectives
support goals and can support other higher-level objectives.

Lecture: The instructor will present information specific to the
objective of the module. This information will help you learn the
knowledge and skills necessary to succeed with the activities.

Activities: The activities take on various forms, such as an exercise,
discussion, and demonstration. Activities are used to facilitate
mastery of an objective.

Visual aids: The instructor might use several visual aids to convey a
concept, such as a process, in a visual form. Visual aids commonly
contain graphics and summarized text.

Conventions
The following conventions are used in this course to represent various
training elements and alternative learning resources.

Icons
Additional resources: Indicates other references that provide additional
information on the topics described in the module.

Discussion: Indicates a small-group or class discussion on the current
topic is recommended at this time.

Visual aid: This is an instructor-only icon, labeled "Insert Title of OH
here" as a placeholder. Use the visual aid icon to indicate which
slide to present.

Note: Indicates additional information that can help students but is not
crucial to their understanding of the concept being described. Students
should be able to understand the concept or complete the task without
this information. Examples of notational information include keyword
shortcuts and minor system adjustments.

Caution: Indicates that there is a risk of personal injury from a
nonelectrical hazard, or risk of irreversible damage to data, software, or
the operating system. A caution indicates that a hazard is possible
(as opposed to certain), depending on the action of the user.

Typographical Conventions
Courier is used for the names of commands, files, directories,
programming code, and on-screen computer output, such as:
The /etc/hostname.hme0 file on the faulty system does not have
the Internet Protocol (IP) address defined in it.
Courier bold is used for characters and numbers that you type, such as:
To check the architecture of your system, use the following
command:
# uname -m

Courier italics is used for variables and command-line placeholders
that are replaced with a real name or value, such as:
The following is the syntax for the setenv command:
setenv variablename

Courier italic bold is used to represent variables whose values are to
be entered by the student as part of an activity, such as:
Type chmod a+rwx filename to grant read, write, and execute
rights for filename to world, group, and users.
Palatino italics is used for book titles, new words or terms, or words that
you want to emphasize, such as:
To change the write protection state of Ultra Enterprise systems, turn
the external front panel keyswitch to the Diagnostic position.

Notes to the Instructor


To help you accomplish the course objectives easily and within the given time frame, a series of
tools and support materials has been developed for your discretionary use.
A consistent structure has been used throughout this course. This structure is outlined in the Course Goals
section. The suggested flow for each module is:

Objectives

Relevance

References

Lecture information with appropriate overheads

Lab exercises

Discussion, either as whole class or in small groups

Emphasize that the main purpose of this course is to develop an approach to fixing system errors. The
performance of students depends not on the number of faults they fix but on the approach they use to
debug faults.
To enable you to follow this structure, the following supplementary materials are provided with this course:

Relevance

These questions or scenarios set the context of the module. It is suggested that you ask these questions and
discuss the answers with students. The answers are provided only in the instructor guide.

Course map

The course map gives students a visual picture of the course. It also helps students track their
progress. The course map is presented in the About This Course section of the student guide.

Lecture overheads

Overheads for the course are provided in two formats:

The paper-based format can be copied onto standard transparencies and used on a standard overhead
projector. These overheads are also provided in the student guide.
The web browser-based format is in HTML and can be projected using a projection system that displays
from a workstation. This format lets the instructor allow students to view the overhead
information on individual workstations. It also allows random access to the overheads.

Small-group discussion

After the lab exercises, it is a good idea to debrief the students. You can gather them back into the classroom
and have them discuss their discoveries, problems, and issues in programming the solution to the problem in
small groups of four or five, one-on-one, or one-on-many.

General timing recommendations

Each module contains a Relevance section after the course map. This section may present a scenario
relating to the content presented in the module, or it may present questions that stimulate students to think
about the content that will be presented. Engage the students in relating experiences or posing possible
answers to the questions. Spend no more than 10-15 minutes on this section.

Module      Lecture (Minutes)   Lab (Minutes)   Total Time (Minutes)
Preface     30                  NA              30
Module 1    60                  90              150
Module 2    90                  90              180
Module 3    90                  90              180
Module 4    90                  90              180
Module 5    120                 90              210
Module 6    90                  60              150
Module 7    45                  45              90
Module 8    45                  90              135

Note: Approximately 50 percent of the class consists of small-group
workshop sessions, which occur primarily in the afternoon
according to the suggested course agenda.
Refer to the README file, SETUP file, or both files for additional information.
The README file contains information specific to the content of the course.
The SETUP file contains instructions for setting up this course. It also contains any
special instructions for setting up the HTML overheads. The templatesetup.txt file contains a sample
of the type of information that could go in the SETUP file.
Both the README and SETUP files are text files.


Module 1

Introducing the Fault Analysis and Diagnosis Methodology

Objectives
Overview on page OH 1-2

Upon completion of this module, you should be able to:

Describe the Fault Analysis methodology

Describe the Diagnosis methodology

Identify the basic layers in Sun systems

Identify the error types that occur in Sun systems

Relevance
Present the following questions to stimulate students and get them thinking about the issues and topics
presented in this module. While they are not expected to know the answers to these questions, the answers
should be of interest to them and inspire them to learn the material presented in this module.

Relevance on page OH 1-3

Discussion: These questions are relevant to understanding the activities
that you perform in the Solaris OE:

List some of the faults that you encounter in your day-to-day work.

How do you analyze the faults that you encounter?

How do you diagnose the faults that you encounter?

Ask students to recount their experiences, and list the faults on a white board or a flip chart. If the faults
encompass a wide spectrum, use them to highlight the differences in problems that occur in different
circumstances. Alternatively, provide examples of diverse problems that students might encounter at their
work places. Use these examples to explain why it is essential to follow a consistent Fault Analysis and
Diagnosis methodology.

Additional Resources
Additional resources: The following references can provide additional
information on the topics discussed in this module:

Analytical Problem Solving
(http://www.alamols.com/training/advant.htm), accessed 16 January 2002.

Analytical Problem Solving
(http://www-ctd.ucsd.edu/hndbk/7AnProb.html), accessed 25 February 2002.

Problem Solving and Analytical Techniques
(http://www.psywww.com/mtsite/page2.html), accessed 25 February 2002.

Watchdog-resets.pdf
(http://sunsolve.sun.com/kmsattachments/41107.watchdogresets.pdf), accessed 25 February 2002.

Describing the Fault Analysis and Diagnosis Methodology


The Fault Analysis and Diagnosis methodology provides a powerful tool
to isolate and repair system faults. This methodology helps system
administrators to gather data and use their collective experience to
analyze and diagnose system faults.
The Fault Analysis and Diagnosis methodology uses a two-stage process:

Fault analysis

Fault diagnosis

Inform students that the first part of the module describes the Fault Analysis methodology and the second
part focuses on the Diagnosis methodology.

Introducing the Fault Analysis Methodology on page OH 1-4

To record information generated in each step of the Fault Analysis and
Diagnosis methodology, you use a Fault Analysis and Diagnosis
Worksheet template (fad_worksheet.sdw). You can save this template for
future reference and customize it to suit your requirements. Figure 1-1
shows the steps in the Analysis stage of the Fault Analysis and Diagnosis
methodology.

Figure 1-1 Analysis Stage of the Fault Analysis and Diagnosis Methodology

Stating the Problem Clearly


The first step in the Fault Analysis phase is to state the problem. You
create a problem statement based on the original customer complaint. A
clearly written problem statement helps you to analyze the fault and
diagnose the problem.
You use the problem statement to perform the following:

Identify the fault that occurred on the system

Identify the faulty part of the system

Identifying the fault correctly is critical to the success of the Fault Analysis
and Diagnosis methodology. Incorrect fault analysis can cause complete
system failure. Many faults become critical because the initial
identification of the fault is incorrect.
Caution: While creating a problem statement, do not assume the cause of
the fault.

Listing Facts
Listing Facts on page OH 1-5

The next step is to list the facts about the problem. This helps to establish
the possible causes of the fault. Figure 1-2 shows the recommended order
of steps to arrive at a list of facts.

Figure 1-2 Creating a List of Facts About a System Fault

Identifying Information Sources


Explain to students that it is easy to overlook an information source. They might unknowingly miss an
important piece of information while collecting information.

Typically, information about a system fault is available from a number of
sources. The following are some critical sources of information:

Problem statement: Describes the problem in terms of the fault that
occurred on the system and the faulty part of the system.

Customer interviews: Provide details about the system fault. The
following are a few recommended questions for customers:

Who first observed the fault?

Describe the fault.

What is the location of the fault?

When was the fault first observed?

What error messages were generated by the fault?

What changes were made to the system before you observed the fault?

What is the size or magnitude of the fault?

Note: The questions to be asked vary, depending on the fault and the
user. Use your judgment, and ask questions that help you to analyze the
fault.
If required, expand each bullet by presenting an example from student experiences.

Additional interviews: Provide additional information about the
fault. If required, interview other members of the technical staff, such
as system administrators, network administrators, and
programmers.

System crash dumps: Help identify the potential causes of the fault.
Analyze the /var/crash/`uname -n`/unix.n and
/var/crash/`uname -n`/vmcore.n dump files, if they exist.

Log files: Provide information about the fault. Evaluate the messages
recorded in system log files, such as the /var/adm/messages file.

After you identify the sources of information, you should start collecting
the information.
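
As a starting point for collecting information, the following Python sketch lists the crash dump files for a host. It assumes only the unix.n and vmcore.n naming shown above. The function name find_crash_dumps and its parameters are illustrative and are not part of any Sun tool.

```python
import os
import re
from pathlib import Path


def find_crash_dumps(crash_root="/var/crash", hostname=None):
    """Return {n: [file names]} for the unix.n and vmcore.n files of one host."""
    # os.uname().nodename is the same value that `uname -n` prints.
    hostname = hostname or os.uname().nodename
    dump_dir = Path(crash_root) / hostname
    dumps = {}
    if not dump_dir.is_dir():
        return dumps  # no dump directory means no saved crash dumps
    pattern = re.compile(r"^(unix|vmcore)\.(\d+)$")
    for entry in sorted(dump_dir.iterdir()):
        match = pattern.match(entry.name)
        if match:
            dumps.setdefault(int(match.group(2)), []).append(entry.name)
    return dumps
```

A dump number that has only one of the two files might indicate an incomplete dump, which is itself a fact worth recording.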

Collecting Relevant Error Messages


Analyzing the error messages that are generated on a faulty system helps
to establish the possible causes of the fault. For example, a Sun Blade
2000 workstation running the Solaris 9 OE displays a data-access
exception error message in the /var/adm/messages log.

When explaining the preceding example, you can refer students to Solaris Common Messages and
Troubleshooting Guide at the docs.sun.com Web site. The guide specifies that a data-access exception
indicates one of the following:

An incorrectly installed dual inline memory module (DIMM)

Problems in the hard disk
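
One way to collect the relevant messages is to filter a log file for keywords taken from the fault description. The following Python sketch shows the idea. The function name matching_messages is hypothetical, and the keywords that you pass in are your own choice, not a list defined by the Solaris OE.

```python
def matching_messages(lines, keywords):
    """Return the log lines that contain any keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        line.rstrip("\n")
        for line in lines
        if any(keyword in line.lower() for keyword in lowered)
    ]
```

For example, you might open the /var/adm/messages file, pass the file object as lines, and pass ["data access"] as keywords.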

Identifying Recent System Changes


Recent system changes are an important source of information about a
system fault. For example, the system administrator removes read
permissions from files in the /etc directory, such as passwd and group.
This change is relevant for troubleshooting the problems of users who
cannot log in to the system.
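
The permission change in this example can be checked with a short script. The following Python sketch reports files that are missing or that lack the read permission for other users. The function name files_not_world_readable is hypothetical, and the check covers only the permission bits, not access control lists.

```python
import os
import stat


def files_not_world_readable(paths):
    """Return the paths that are missing or that lack the world-read bit."""
    problems = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode
        except FileNotFoundError:
            problems.append(path)  # a missing file is also worth recording
            continue
        if not mode & stat.S_IROTH:
            problems.append(path)
    return problems
```

For the scenario above, you might call the function with ["/etc/passwd", "/etc/group"].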

Performing Controlled Comparisons


To determine the probable causes of the fault, you compare the faulty
system with a similar system that does not display the same fault.
To establish an effective set of comparative facts, make the following
observations:

Are the two systems that are being compared similar, in terms of
hardware architecture, OE revisions, patch levels, and application
versions?

Does the known functional system display the same symptoms and
conditions as the faulty system?

What events occurred in the environment that might have
contributed to the fault?

Analyzing Comparison Results


You analyze the results generated by controlled comparisons to identify
the causes of the system fault.
The following lists the guidelines for a comparative system analysis:

Focus on one set of comparisons at a time.

State facts. Do not allow opinions to confuse facts.

Identify the differences between the faulty system and the nonfaulty
system.

Analyze the comparative facts about the systems for any similarities.

Note: While the similarities between systems do not directly identify the
source of the fault, they can help to eliminate most of the potential
sources. Therefore, you can more easily isolate the possible sources of the
system fault.
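
The guidelines above can be sketched as a comparison of two sets of recorded facts. In the following Python sketch, each system is described by a dictionary of facts. The function name compare_systems and the fact names that you choose are illustrative assumptions.

```python
def compare_systems(faulty, reference):
    """Split two fact dictionaries into differences and similarities."""
    differences, similarities = {}, {}
    for key in sorted(set(faulty) | set(reference)):
        left, right = faulty.get(key), reference.get(key)
        if left == right:
            similarities[key] = left           # helps to eliminate sources
        else:
            differences[key] = (left, right)   # candidates for the cause
    return differences, similarities
```

A fact that is recorded on only one system appears as a difference with None on the other side, which prompts you to collect the missing fact before drawing conclusions.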

Identifying the Magnitude of the Fault


Identifying the size or magnitude of the fault enables you to determine
whether multiple systems are involved, which narrows the focus in a
large networked environment. For example, a single system that cannot
communicate with a printer probably has the fault on the system.
However, if none of the systems can communicate with the printer, the
fault is probably with the printer rather than with all the systems.
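
The printer reasoning can be expressed as a small rule. The following Python sketch is a simplification that assumes you have recorded, for each client system, whether it can reach the shared device. The function name likely_fault_location and the returned strings are hypothetical.

```python
def likely_fault_location(results):
    """results maps a client name to True if it can reach the shared device."""
    failed = sorted(name for name, reachable in results.items() if not reachable)
    if not failed:
        return "no fault observed"
    if len(failed) == len(results) and len(results) > 1:
        # Every client fails, so the shared device is the more probable source.
        return "shared device"
    return "clients: " + ", ".join(failed)
```

The result is only a first hypothesis. A network fault between the clients and the device could produce the same pattern.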

Documenting Each Item Carefully


You use the information gathered in the Analysis phase of the Fault
Analysis and Diagnosis methodology as the basis for diagnosing the fault.
To ensure that all the factors are considered while performing the
diagnosis, document all the information.
Use the following scenario to demonstrate how the Fault Analysis and Diagnosis Worksheet template helps
to capture all the information about a system fault. Emphasize the ease with which all the details about the
system are captured in the template.

Consider a scenario in which a system running the Solaris 9 OE has a
security hole in the in.ftpd daemon. When you attempt to install a patch
to fix the security hole, the patchadd command fails. The following is the
method for completing the Analysis section of the Fault Analysis and
Diagnosis Worksheet template:

When describing the contents of the Fault Analysis and Diagnosis worksheet, refer students to the phases in
the Analysis section of the Fault Analysis and Diagnosis methodology. Link each entry in the worksheet with
the phase from which it is derived.

Initial Customer Description


I cannot install patches on my UltraSPARC system. When I run the
patchadd -p command, the command terminates, and a core dump is
created in the /var/sadm/patch directory.

Problem Statement
When attempting to run the patchadd -p command, the system
generates a core dump in the /var/sadm/patch directory and terminates
the command.

Resources
Problem Statement The patchadd -p command is not executing
correctly, and it terminates with the error message listed in Table 1-1.

Problem Description
Table 1-1 shows the problem description.
Table 1-1 Problem Description

Error Messages: patchadd: Program unexpectedly terminates with signal 11.

Symptoms and Conditions: The patchadd -p command is terminated on
execution.

Recent Changes: The remon package was recently installed on the system.

Comparative Facts: When comparing the /var/sadm/pkg directory on
the faulty system with the same directory on a system that does not
display the fault, the following observations were made:

Permissions are set at 555 on both systems.

No links to other file systems exist in the directory tree on both
systems.

All subdirectories contain the pkginfo file on both systems.
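
The comparative facts in Table 1-1 can be gathered the same way on both systems with a short script. The following Python sketch is a simplified survey that checks only the top level of the directory and treats any symbolic link as a possible link to another file system. The function name survey_pkg_dir is hypothetical.

```python
import os
import stat


def survey_pkg_dir(root):
    """Collect the three comparative facts for one package directory."""
    facts = {
        "mode": oct(stat.S_IMODE(os.stat(root).st_mode)),
        "symlinks": [],
        "missing_pkginfo": [],
    }
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.islink(path):
            facts["symlinks"].append(name)
        elif os.path.isdir(path) and not os.path.isfile(os.path.join(path, "pkginfo")):
            facts["missing_pkginfo"].append(name)
    return facts
```

Running the survey on the /var/sadm/pkg directory of each system gives two results that you can compare entry by entry.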

Discussion: Examine the completed Analysis section of the Fault Analysis and
Diagnosis worksheet, and determine the conclusions about the fault that
are apparent from the Analysis section.

Ask students to recall the original problem. Compare it with the responses they provide for this question, and
highlight the amount of details they now have on the fault.
Ask students to explain the logic behind their proposed solutions. Capture student responses, and use them
to introduce the Fault Diagnosis methodology.

Introducing the Fault Diagnosis Methodology


Introducing the Fault Diagnosis Methodology on page OH 1-6

The Diagnosis phase of the Fault Analysis and Diagnosis methodology
uses the data that you collect in the Analysis phase. The steps in the
Diagnosis phase of the Fault Analysis and Diagnosis methodology are
shown in Figure 1-3.

Figure 1-3 Fault Diagnosis Methodology

Prioritizing Planned Tests


Prioritizing Planned Tests on page OH 1-7

After evaluating all the data gathered in the Analysis phase, you generate
a list of probable causes for the fault. Next, you prioritize the probable
causes and various feasible test methods, according to the steps shown in
Figure 1-4.

Figure 1-4 Steps in Prioritizing Planned Tests

Formulating Hypotheses
A hypothesis states the most probable cause of the fault and is based on
the data collected in the Analysis phase. Although multiple hypotheses
might exist, test each hypothesis separately, starting with the most
probable hypothesis.

To formulate a hypothesis, first state the hypothesis in the form of a
question. The answer to the question forms the hypothesis statement.

Choosing the Test Methodology


After you formulate a hypothesis, identify a test methodology. The test
methodology might be one of the following:

Factual: In this approach, the testing of probable causes is based on
past experience and on the information gathered in the Fault
Analysis worksheet. This results in isolating the most probable
cause.

Realistic: In this approach, each probable cause must pass an
experiment that indicates whether it is the actual
cause. For example, try a new driver without overwriting the old
one. This provides a quick, nondisruptive verification method.
However, this approach is not completely conclusive.

Result-oriented: In this approach, you rely on previous experiences
to make educated guesses about the actual cause and take the
corrective action. This is the least conclusive verification and can be
disruptive, expensive, and time-consuming, especially if the
assumption is incorrect.

Note: Follow the factual approach when testing a hypothesis.

Identifying the Impact of the Chosen Methodology


Each test methodology has its strengths and weaknesses. You must choose
a methodology that is most appropriate for a given situation. You must
consider the impact of the methodology on the system when you select
the methodology.

Testing the Hypothesis


After you select the test methodology and evaluate the impact that it
might have on the system, you test the hypothesis. Testing the hypothesis
helps to identify the actual cause of the fault.

To test the hypothesis, complete the following steps:

1. Investigate the possible causes of the fault.

2. Eliminate each possible cause that fails the tests.

3. Continue to eliminate possible causes until you identify the main
cause of the fault.

Caution: Do not change observed facts to support the hypothesis
because you are in a hurried or stressful situation.
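
The elimination steps can be sketched as a loop that tests one hypothesis at a time, starting with the most probable. In the following Python sketch, the Hypothesis class, the likelihood estimate, and the diagnose function are all illustrative assumptions. The methodology itself does not prescribe numeric likelihoods.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    cause: str
    likelihood: float          # 0.0 to 1.0, estimated from the Analysis facts
    test: Callable[[], bool]   # returns True when the test confirms the cause


def diagnose(hypotheses: List[Hypothesis]) -> Optional[str]:
    """Test hypotheses one at a time, most probable first."""
    ranked = sorted(hypotheses, key=lambda h: h.likelihood, reverse=True)
    for hypothesis in ranked:
        if hypothesis.test():
            return hypothesis.cause  # confirmed, so stop testing
        # Otherwise this cause is eliminated and the next one is tested.
    return None  # every listed cause was eliminated
```

A result of None means that the list of probable causes is incomplete, which sends you back to the Analysis phase instead of inviting you to adjust the facts.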

Verifying the Corrective Action


The final step in the Diagnosis phase of the Fault Analysis and Diagnosis
methodology is to take a corrective action and verify that it repairs the
fault.

Documenting Each Item


Use the Fault Analysis and Diagnosis Worksheet template to record the
data that you collect in the Analysis and Diagnosis phases. You can
then use the completed worksheet as a reference for future faults of a
similar nature.

For example, for the fault in the Sun Enterprise Ultra 5 hardware, the
following sections describe the Diagnosis section of the Fault Analysis
and Diagnosis Worksheet template.

Test and Verification


Table 1-2 shows the steps to test and verify the problem described in the
Analysis section.
Table 1-2 Test and Verification

Probable Causes: The remon package is corrupted.

Tests: Execute the truss patchadd command on the system.

Results: The output indicates a potential fault with the SUNWremon
package. Refer to the following partial output.

Verification: Run the pkgchk command on the SUNWremon package.

# truss patchadd
...
open("/var/sadm/pkg/SUNWremon/pkginfo", O_RDONLY) = 4
fstat(4, 0xEFFFDA98) = 0
ioctl(4, TCGETA, 0xEFFFDA24) Err#25 ENOTTY
read(4, " C L A S S E S = S T A T".., 8192) = 1296
lseek(4, 0xFFFFFF45, SEEK_CUR) = 1109
close(4) = 0
Incurred fault #6, FLTBOUNDS %pc = 0x00014B30
siginfo: SIGSEGV SEGV_MAPERR addr=0x0000000B
Received signal #11, SIGSEGV [default]
siginfo: SIGSEGV SEGV_MAPERR addr=0x0000000B
*** process killed ***
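
When the truss output is long, it can help to extract the last file that was opened successfully before the signal arrived. The following Python sketch assumes the line layout shown in the partial output above. The function name last_open_before_signal and the two patterns are illustrative, and they are not a complete parser for truss output.

```python
import re

# Matches a successful open() call and captures the path and the descriptor.
OPEN_RE = re.compile(r'^open\("([^"]+)",.*\)\s*=\s*(\d+)\s*$')
# Matches the line that reports the signal number and name.
SIGNAL_RE = re.compile(r"Received signal #(\d+), (\w+)")


def last_open_before_signal(truss_lines):
    """Return (path, signal name) for the last open() seen before a signal."""
    last_path = None
    for line in truss_lines:
        line = line.strip()
        opened = OPEN_RE.match(line)
        if opened:
            last_path = opened.group(1)
            continue
        signalled = SIGNAL_RE.search(line)
        if signalled:
            return last_path, signalled.group(2)
    return None  # no signal was reported in the output
```

For the output above, the result points at the pkginfo file of the SUNWremon package, which is consistent with the probable cause in Table 1-2. The last file opened is a lead to verify, not proof of the cause.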

Corrective Action
Table 1-3 shows the corrective action that you must take for the preceding
problem.
Table 1-3 Corrective Action

Final Repair: Uninstall and reinstall the remon package.

Communication: Inform the system administrator that the remon package
requires reinstallation. Verify that the remon package is reinstalled
correctly and that the patchadd -p command executes successfully.

Documentation: Update the system logs, including the initials and
the timestamp.

Discussion: Examine the completed Diagnosis section of the Fault Analysis and
Diagnosis worksheet, and determine the
following:

What are your inferences about the fault, based on the Diagnosis
section of the Fault Analysis and Diagnosis worksheet?

Ask students to recall the original problem statement. Compare it with their responses for this question, and
highlight the process by which they achieved the result. Next, explain that the same methodology is
applicable to all the problems that the student might encounter in their day-to-day work.

Why is it essential to complete the entire Fault Analysis and
Diagnosis methodology to identify and remove system faults?

Remind students that while it might not seem necessary to complete the entire Fault Analysis and Diagnosis
methodology, they should not skip any of the steps. Actions taken on a system without proper analysis and
diagnosis often do more harm than good.

Identifying the Basic Layers and Error Types in Sun Systems

Any Sun system can be divided into four distinct layers: hardware,
firmware, the Solaris OE, and the application layer. A fault can occur in
any layer of the system. Therefore, you must be familiar with each layer
and how the layers interact with one another. The following section
provides a brief overview of various layers that constitute a Sun system.

Overview of the Four Basic Layers of a Sun System


Overview of the Four Basic Layers of a Sun System on page OH 1-8

A Sun system consists of four layers, as shown in Figure 1-5.

Figure 1-5 Basic Layers of a Sun System

Hardware: Represents the physical components of a system.

Firmware: Governs hardware diagnostics and the system before the
start of the boot process. Firmware is the software that is embedded
within the special chips of a system.

Operating environment: Governs all the programs in a system. The
default operating environment in Sun systems is the Solaris OE,
which is designed for both SPARC and Intel architectures.

Applications: Perform a specific function for users or other
applications. Examples of applications in a Sun system include
databases and mail and web servers.

Introducing Types of Faults in Sun Systems


General Types of Errors on page OH 1-9

Figure 1-6 shows the types of faults that might occur in a Sun system. The
corrective action depends on the type of the fault.

Figure 1-6 Error Categories

Software Errors
All errors that do not originate in the hardware are known as software
errors. The system processor detects and reports these errors. Examples of
software errors include programming errors in applications and bugs in
kernel code.

Hardware Errors
A hardware interrupt can indicate a hardware error. Examples of
hardware errors include corrupt disks and failures of power supplies and
fan trays.

Note: An interrupt is a signal that is generated by either a device that is
attached to the system or a program within the system. The interrupt
notifies the system of an event that can cause a program to suspend itself
temporarily so that the central processing unit (CPU) can process the
relevant interrupt.

Among the various hardware errors, hardware-corrected errors are always
signaled by an interrupt. This type of hardware error usually requires no
recovery action. For example, a 1-bit error from memory is
corrected by the error checking and correcting (ECC) logic.

You can provide the following explanation about ECC logic:
You can build memory systems using various protection systems. The simplest memory system is
unprotected and has neither parity nor ECC protection. Each bit of memory holds one bit of data. In the
unprotected memory system, any error is undetected and results in corrupt data being delivered to the
system. Parity provides an additional bit of memory that contains the bit-wise exclusive OR (XOR) of the
data bits. This enables the detection of single-bit errors. If the system detects a parity error, the system
treats the error as fatal and all processing activities are stopped immediately. In ECC memory systems,
multiple ECC bits are held for some number of data bits, such as 3 ECC bits per 8 data bits. These ECC bits
hold the Hamming code of the data and are used to detect and correct all 1-bit errors. ECC can also detect,
but not correct, double-bit errors and some multiple-bit errors. The cost of providing this additional
protection is reduced speed and increased size of memory. Parity-based memory protection increases the
number of memory cells required to hold an amount of data by 12 percent, and having ECC results in a
37.5 percent increase.
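
The following Python sketch illustrates the two ideas in this explanation: a parity bit computed as the XOR of the data bits, and a Hamming code that corrects a single flipped bit. It uses the small Hamming(7,4) code, which stores 3 check bits per 4 data bits, purely for illustration. This is an assumption made for brevity and is not the layout of the ECC logic in any Sun memory system. All function names are hypothetical.

```python
from functools import reduce


def parity_bit(bits):
    """Even parity: the XOR of all data bits."""
    return reduce(lambda a, b: a ^ b, bits, 0)


def hamming74_encode(data):
    """Encode 4 data bits as 7 bits laid out p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # 0 means that no error was detected
    if position:
        c[position - 1] ^= 1          # flip the bit that the syndrome names
    return [c[2], c[4], c[5], c[6]]


def overhead_percent(extra_bits, data_bits):
    """Extra memory cells as a percentage of the data cells."""
    return 100.0 * extra_bits / data_bits
```

With these definitions, overhead_percent(3, 8) gives the 37.5 percent figure quoted for ECC, and one parity bit per 8 data bits gives 12.5 percent. This plain Hamming(7,4) code cannot distinguish a double-bit error from a single-bit error. Detecting double-bit errors requires an additional overall parity bit.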

Critical Errors
Critical errors require immediate attention. You must shut down the
system immediately. Examples of critical errors include the following:

Single power supply failure in a system with redundant power
supplies

Fan failure, resulting in an increased operating temperature

Fatal Errors
A fatal error corresponds to an error in which you cannot guarantee
system recovery. Examples of fatal errors include the following:

Power supply failure in a system with a single power supply

Component burnout due to high temperature

System Panics
A system panic occurs when the system detects a fatal error that can
corrupt data. The system responds by halting all processes and calling the
panic() kernel function. The panic() kernel function is not an error
condition but a protective reaction to an error condition that is designed
to safeguard system data. The panic() kernel function performs the
following:

Displays a panic message at the console

Performs a stack trace and lists routines that led to the panic

Generates a crash dump image of system memory on the dump
device

Resets the system

You analyze the crash dump files generated during a system panic to
determine the cause of the panic.

Identifying Error-Reporting Mechanisms


Inform students that the error-reporting mechanisms in a Sun system provide clues about the type of error.
Therefore, knowledge about error-reporting mechanisms helps students to take the appropriate corrective
action.

Sun systems have a number of error-reporting mechanisms. These include
the following:

Bus errors

Interrupts

Watchdog resets:

CPU

System

Bus Errors
A bus error occurs when a process receives a signal indicating that it
attempted to perform input/output (I/O) operations on a device that is
either restricted or does not exist.

Interrupts
An interrupt is a signal to the CPU that requires a response to an
event. An example of such an event can be a completed I/O request or a
hardware-error condition. Interrupts are categorized as hardware or
software interrupts. Hardware interrupts are generated by I/O devices,
and software interrupts are established through a call to the kernel
add_softintr() function.
When the CPU receives an interrupt, it stops processing the instructions
of the current process, locates the interrupt in the trap table, and then
branches off to a special kernel code, known as the Interrupt Service
Routine, to manage the interrupt. After managing the interrupt
successfully, the process resumes its activity.
Each hardware component provides different services to the system in
different ways. Therefore, each Interrupt Service Routine is uniquely
tailored for the supporting device.

Watchdog Resets
A watchdog reset occurs when the CPU is reset. In such a situation, the
system immediately drops to the programmable read-only memory
(PROM) monitor without creating a system crash dump. The absence of a
crash dump makes the watchdog reset condition difficult to analyze.
Hardware or software faults cause watchdog resets. The following are the
types of watchdog resets:

CPU watchdog reset: Occurs on a single processor system. In this
type of reset, the CPU receives a trap before it can resolve an existing
trap.

System watchdog reset: Occurs due to a hardware fault. This type
of reset affects all the CPUs and the I/O devices on the system.

Note: At the ok prompt, you can use the .traps command to view the types of traps.

Exercise: Performing Fault Analysis and Diagnosis


In this exercise, you use the Fault Analysis and Diagnosis methodology to
analyze and diagnose a system fault. Document your observations by
using the Fault Analysis and Diagnosis Worksheet template.
The following exercise is not designed to challenge the Fault Analysis and Diagnosis skills of students. The
exercise introduces students to the Fault Analysis and Diagnosis methodology and increases their familiarity
with the Fault Analysis and Diagnosis Worksheet template.

Preparation
Based on the size of the class, the instructor will divide students into
groups of two or three. In this exercise, the instructor plays the role of a
customer and provides answers to the questions asked by students.
Provide students with the following guidelines to perform the exercise:

Write down the steps for solving the fault to ensure that all the steps are completed.

Discuss the fault in a group to clarify the thinking process and validate assumptions.

Summarize and document the steps followed and the solution of the fault when you complete the
exercise.

Tasks
Consider the following fault description:
When attempting a telnet connection from the Instructor1 system to the
Host1 system, the command fails. The following error message is
displayed:
Telnet: Unable to connect to remote host: Connection refused
However, when attempting to reverse the telnet connection from the
Host1 system to the Instructor1 system, the command is successful. In
addition, when attempting to open a File Transfer Protocol (FTP)
connection to the Instructor1 system, the connection is refused. The
following error message is displayed:
> ftp: connect: Connection refused
The same error is reported when attempting an FTP connection to the
Host1 system.


The preceding error is caused by removing the following entries from the /etc/services file:

On the system Host1: telnet and ftp

On the system Instructor1: ftp

After removing the entries from the /etc/services file and using the ftp command on the systems, restart
the inetd daemon or reboot each affected system.
Provide this information to students through the client interviews conducted by students.

Use the methods described in this module to analyze and diagnose the
fault. Document the observations in the Fault Analysis and Diagnosis
worksheet. A template for the Fault Analysis and Diagnosis worksheet is
provided in the following pages.

Fault Analysis and Diagnosis Worksheet Template


You use the Fault Analysis and Diagnosis worksheet to log faults and
observations made during the Fault Analysis and Diagnosis phases. You
can modify the worksheet to suit the requirements of your organization.

Analysis Phase
Document the observations made during the analysis phase.

Initial Customer Description

Problem Statement

Resources

Problem Description
Table 1-4 describes the problem.
Table 1-4 Problem Description

Error Messages:

Symptoms and Conditions:

Recent Changes:

Comparative Facts:

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Table 1-5 lists the results of tests and verification.
Table 1-5 Test and Verification

Probable Causes:

Tests:

Results:

Verification:

Corrective Action
Table 1-6 lists the corrective action taken.
Table 1-6 Corrective Action

Final Repair:

Communication:

Documentation:

Exercise Summary

Discussion: Take a few minutes to discuss what experiences, issues, or
discoveries you had during the lab exercise.

Manage the discussion based on the time allowed for this module, which was provided in the About This
Course module. If you do not have time to spend on discussion, highlight just the key concepts students
should have learned from the lab exercise.

Experiences

Ask students what their overall experiences with this exercise have been. Go over any trouble spots or
especially confusing areas at this time.

Interpretations

Ask students to interpret what they observed during any aspect of this exercise.

Conclusions

Have students articulate any conclusions they reached as a result of this exercise experience.

Applications

Explore with students how they might apply what they learned in this exercise to situations at their workplace.

Exercise Solution
You use the Fault Analysis and Diagnosis worksheet to log the faults and
observations made during the Fault Analysis and Fault Diagnosis phases.
You can modify the worksheet to suit the requirements of your
organization.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


I cannot set up the telnet connection to the system Host1. I also cannot
establish an FTP connection to the systems Host1 and Instructor1.

Problem Statement
Two systems in the network are refusing the connections shown in
Table 1-7.

Table 1-7 Problem Statement

System Name    Fault
Host1          Telnet connection refused
               FTP connection refused
Instructor1    FTP connection refused

Resources
The following are the available resources:

Customer interviews

The man pages on the ftp and telnet commands

Other functioning systems

Technical colleagues (workshop group members)

Problem Description
Table 1-8 describes the problem.
Table 1-8 Problem Description

Error Messages: Telnet: Unable to connect to remote host: Connection
refused
Symptoms and Conditions: The telnet connection is refused by the
remote host.
Recent Changes and/or History: Modifications were made to the
/etc/services file.
Comparative Facts: An entry is missing for the telnet service in the
/etc/services file on the system Host1.

Error Messages: > ftp: connect: Connection refused
Symptoms and Conditions: The FTP connection is refused by the remote
host.
Recent Changes and/or History: Modifications were made to the
/etc/services file.
Comparative Facts: An entry is missing for the FTP service in the
/etc/services file on the systems Host1 and Instructor1.

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Table 1-9 lists the results of tests and verification.
Table 1-9 Test and Verification

Probable Causes: An entry is missing for the telnet service in the
/etc/services file on the system Host1.
Tests: Remove the entry for the telnet service from a system, and
attempt a telnet connection to the system.
Results: The error is replicated.
Verification: After inserting the entry in the file, you can establish a
telnet connection with the system.

Probable Causes: An entry is missing for the FTP service in the
/etc/services file on the systems Host1 and Instructor1.
Tests: Remove the entry for the FTP service from a system, and then
attempt an FTP connection to the system.
Results: The error is replicated.
Verification: After inserting the entry in the file, you can establish
an FTP connection with the system.

Corrective Action
Table 1-10 lists the corrective action taken.
Table 1-10 Corrective Action

Final Repair: Add the following entry for the telnet service in the
/etc/services file on the Host1 system:
telnet 23/tcp
Communication: None
Documentation:

Final Repair: Add the following entry for the FTP service in the
/etc/services file on both Host1 and Instructor1 systems:
ftp 21/tcp
Communication: None
Documentation:
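
To verify the repair, you can check that the required entries are present in the services file. The following Python sketch parses the text of a services file and reports the entries that are missing. The function name missing_services is hypothetical, and the sketch assumes the usual layout of a service name followed by a port/protocol field, with comments introduced by the # character.

```python
def missing_services(services_text, required):
    """Return the required (name, port/protocol) pairs that are absent."""
    present = set()
    for line in services_text.splitlines():
        fields = line.split("#", 1)[0].split()   # drop comments, then split
        if len(fields) >= 2:
            present.add((fields[0], fields[1]))
    return [entry for entry in required if entry not in present]
```

For this exercise, you might pass the contents of the /etc/services file together with [("telnet", "23/tcp"), ("ftp", "21/tcp")]. An empty result indicates that both entries are in place.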

Module 2

Introducing OBP Components, Features, and Diagnostics

Objectives

Overview on page OH 2-2

Upon completion of this module, you should be able to:

Describe the OBP components

Modify the OBP variables and run diagnostics

Relevance
Present the following questions to stimulate students and get them thinking about the issues and topics
presented in this module. While they are not expected to know the answers to these questions, the answers
should be of interest to them and inspire them to learn the material presented in this module.

Relevance on page OH 2-3

Discussion: The following questions are relevant to understanding the
diagnostic activities that you perform in the Solaris OE:

What is OpenBoot firmware?

OpenBoot firmware is the resident firmware in all Sun systems, which provides basic hardware testing and
initialization operations before the system boots.

Which tasks can you accomplish in the OpenBoot environment?

You can perform the basic functionality test of hardware components in the OpenBoot environment. The OBP
diagnostic commands include the test, watch, and probe commands.

Additional Resources
Additional resources: The following references provide additional
information on the topics described in this module:

Sun System Handbook (http://sunsolve.sun.com/handbook_pub),
accessed 07 May 2002.

OpenBoot Command Reference, part number 800-6076.

OpenBoot 3.x Quick Reference Card, part number 802-3240.

OpenBoot 4.x Command Reference Manual (http://docs.sun.com),
accessed 14 March 2002.


Introducing OBP Components


Explain to students why it is important to know about OBP components. Inform them that these components
store OBP variables, which you examine and modify to reconfigure or diagnose the system environment.

All systems, regardless of the manufacturer, require special hardware with
embedded software to control the first phase of the boot process. This
embedded software is known as firmware. The firmware in Sun systems that
are based on SPARC technology is known as OpenBoot firmware.
OpenBoot firmware provides basic hardware testing and initialization at
system power on, bootstrap program support to load the kernel from a disk
or over the network, and a set of diagnostic tools to troubleshoot the
hardware.

Note: OpenBoot is an open standard defined in Institute of Electrical and
Electronics Engineers (IEEE) Standard 1275-1994 for Boot Firmware.

OBP Components on page OH 2-4

OBP consists of the following components on each system board:

Boot PROM

Nonvolatile random access memory (NVRAM)



Figure 2-1 shows the OBP components.

Figure 2-1 OBP Components

Introducing Boot PROM


Each Sun system has a 1-Mbyte boot PROM chip. This chip is typically
located on the same board as the CPU. On the earlier systems, boot PROM
chips are usually located in a socket. However, starting from the 3.x
PROM revision, the boot PROM chips are permanently soldered to the
main system board because they are flash-updateable and do not need
replacement.
Boot PROM is commonly referred to as the PROM monitor or the ok
prompt. The following lists the primary tasks of the boot PROM:

Testing and initializing system hardware

Determining system configuration

Booting the operating system (OS) from the network or from a storage device

Providing interactive debugging facilities

Enabling the use of third-party devices


Listing Boot PROM Features


Information about the boot PROM features helps you to understand the
capabilities of OBP and the scope of activities that you can perform using
the OBP diagnostic commands. Table 2-1 shows the main features of boot
PROM.
Table 2-1 Features of Boot PROM

Feature | Description
Programmable user interface | OBP is based on the interactive Forth language. This enables you to combine user commands to make complete programs for debugging hardware and software.
FCode interpreter | Most plug-in drivers are written in a machine-independent language called FCode. The OBP includes an FCode interpreter that enables you to use the same device driver on systems with different CPU instruction sets.
Facilities for dynamically constructing a device tree structure in nonpageable memory | The OBP probes hardware devices and dynamically constructs a device tree data structure in nonpageable memory. The device tree is hierarchically organized and represents all the hardware devices available on the system. You can view the device tree at the firmware level to determine if hardware is available on the system.
Plug-in device drivers | Every SBus and peripheral component interconnect (PCI) card contains a firmware chip, called an FCode PROM, that has a minimal device driver and a customized POST routine. The FCode PROM enables the peripheral card to specify how it is tested and probed at the firmware level. This enables you to add and remove various peripheral cards without making any changes to the boot PROM.
Diagnostic commands | OBP provides an extensive set of diagnostic commands that help you to perform system reconfiguration and hardware diagnostic tasks.


Associating PROM Revisions With Platforms

Boot PROM Revisions on page OH 2-5

Each Sun system supports a minimum revision of the boot PROM. There
are four generations of Sun boot PROMs, and 4.x is the latest revision.
Table 2-2 shows the PROM revision numbers and examples of the
corresponding Sun platforms.

Table 2-2 PROM Revision Numbers and Examples of the Corresponding Sun Platforms

PROM Revision | Platform
1.x (The original SPARC boot PROM) | SPARCstation 1, SPARCstation 1+, SPARCstation IPC, and SPARCstation SLC systems
2.x (The first OBP) | SPARCstation 2, SPARCstation 5, SPARCstation 10, and SPARCstation 20 systems
3.x (The OBP with a flash update feature) | Ultra workstations, such as Ultra 5, Ultra 10, Ultra 30, Ultra 60, and Ultra 80, and Sun Enterprise servers, such as Sun Enterprise 250, Sun Enterprise 450, and Sun Enterprise 3x00-6x00
4.x (Enhanced debugging features using FCode) | Sun Fire and Sun Blade systems

Note: The flash update feature enables you to upgrade the revision of the
firmware in the OBP without replacing the boot PROM.


Introducing FPROM Upgrades

FPROM Upgrades on page OH 2-6

The flash-upgradeable PROM in Ultra workstations is known as FPROM.
Due to a design flaw that might lock the CPU while running a 64-bit
kernel, Ultra workstations that have a CPU speed of less than or equal to
200 MHz are shipped with a firmware revision that allows them to
operate only with a 32-bit kernel. Therefore, to use the 64-bit kernel on
these workstations, you must upgrade the FPROM. Table 2-3 lists the
workstation type and the minimum firmware revision for an FPROM
upgrade.

Table 2-3 Workstation Type and Minimum Firmware Revision for the FPROM Upgrade

Workstation Type | Firmware Revision
Ultra 1 | 3.11.1
Ultra 2 | 3.11.2
Ultra 450 | 3.7.107
Sun Enterprise Ultra server | 3.2.17

If students ask why Ultra 5, Ultra 10, and other Sun workstations are not listed in Table 2-3, inform them that
the Ultra 1, Ultra 2, Ultra 450, and Enterprise Ultra servers are the only Ultra systems that have a
minimum OBP revision requirement to support the 64-bit architecture. If students want further details, refer
them to collection document #21434 on the sunsolve.sun.com Web site.
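
When you check an installed revision against the minimums in Table 2-3, compare the dotted fields as numbers rather than as text. The following is a minimal sketch under the assumption that each field is a plain decimal number; the function name version_at_least is hypothetical, and the awk extraction shown in the comment assumes that prtconf -V prints the revision as its second field.

```shell
#!/bin/sh
# version_at_least (hypothetical helper): succeeds (exit status 0) when the
# dotted revision in $1 is greater than or equal to the revision in $2.
# Each field is compared as a number, so 3.7.107 is older than 3.11.1.
version_at_least() {
    awk -v have="$1" -v need="$2" 'BEGIN {
        n = split(have, h, ".")
        m = split(need, r, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            a = h[i] + 0        # missing fields count as 0
            b = r[i] + 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}

# On a live system, the installed revision might be obtained with
# (assumed output layout):  prtconf -V | awk '{ print $2 }'
if version_at_least 3.25.3 3.11.1; then
    echo "3.25.3 meets the 3.11.1 minimum"
fi
if ! version_at_least 3.7.107 3.11.1; then
    echo "3.7.107 is older than 3.11.1"
fi
```

The second comparison is the case that a plain text sort gets wrong, because the text "3.7" sorts after "3.11".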

The following lists the steps to upgrade the FPROM firmware:

1. Capture and save the existing parameter settings in the NVRAM
   chip. You can restore the saved settings after completing the
   upgrade.

Inform students that they can use the eeprom command at the OS level to save the configuration settings in
a file. The eeprom command is discussed later in this module.

2. Set the write-protect FPROM jumper, J2003, to the write-enabled
   position on the Ultra 1 and Ultra 2 workstations and to the
   diagnostic position on the Sun Enterprise servers.

Caution: You must power off the system before changing the state of the
FPROM jumper.

For details on the location of the Flash jumper on the system and the steps to upgrade FPROM, ask students
to access www.sun.com/products-n-solutions/hardware/docs/pdf/802-3233-21.pdf.

3. Execute the FPROM upgrade script from a bootable compact disc
   (CD) or by downloading the appropriate patch from the
   sunsolve.sun.com Web site.

Note: To check the OBP firmware revision on your system, use either the
banner or .version command at the ok prompt or the prtconf -V
command at the shell prompt.
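
One way to handle the save-and-restore part of step 1 from the Solaris OE is to redirect the eeprom output to a file before the upgrade and convert it back into eeprom commands afterward. The following is a minimal sketch, not a Sun-supplied procedure. The function name restore_commands and the file name are hypothetical, and the sketch assumes that eeprom prints one name=value pair per line and reports unset variables with a "data not available" message. Review the generated commands before you run any of them.

```shell
#!/bin/sh
# Before the upgrade (as superuser, on the Sun system):
#   eeprom > /var/tmp/nvram-settings.txt      # hypothetical file name

# restore_commands (hypothetical helper): reads a saved eeprom listing on
# standard input and prints one quoted eeprom command per name=value line.
# Lines without "=" (for example, "nvramrc: data not available.") are
# skipped. Values that contain a single quote are not handled.
restore_commands() {
    while IFS= read -r line; do
        case "$line" in
            *=*) printf "eeprom '%s'\n" "$line" ;;
        esac
    done
}

# Demonstration with sample lines in the assumed eeprom output format.
restore_commands <<'EOF'
auto-boot?=true
boot-device=disk net
nvramrc: data not available.
EOF
```

After the upgrade, the same function could be run against the saved file (restore_commands < /var/tmp/nvram-settings.txt) to produce a reviewable list of commands.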

Introducing NVRAM
NVRAM is a pluggable chip on the main system board. The NVRAM chip
uses battery-backed complementary metal-oxide semiconductor (CMOS)
memory to store customized system configuration variables, macros,
and device aliases. NVRAM also contains the Time of Day (TOD) chip,
which provides the date and time to the system. A single lithium battery
provides the backup for the NVRAM chip and the clock.
The NVRAM chip includes the following information:

A unique host identification number (ID)

A unique 48-bit hardware address of the Ethernet interface

A unique serial number

Note: The host ID on the NVRAM chip forms the basis for a number of
software licenses. You must retain the chip if a new system board is
installed. If the chip fails, Sun replaces it with a chip containing the same
host ID and Ethernet address.

Storing Custom Values in the nvramrc Variable


The nvramrc configuration variable stores user-defined commands that
the system executes during the startup process. These commands are
known as a script and are stored in the nvramrc variable as American
Standard Code for Information Interchange (ASCII) text.
If the use-nvramrc? configuration variable is set to true, the nvramrc
script is evaluated during the OpenBoot startup sequence, after the
firmware initializes the system.


Listing Common OBP Variables


Listing Common OBP Variables on page OH 2-7

OBP variables provide you with the flexibility to modify the default
behavior of different aspects of the OBP firmware. You can use various
OBP commands to query and set OBP variables. Table 2-4 displays the
common OBP variables on a Sun4U system along with their
descriptions.

Table 2-4 OBP Variables on a Sun4U System

OBP Variable | Description
auto-boot? | Specifies whether a system boots automatically after a power on or reset of the system
diag-device | Specifies the diagnostic boot source device
diag-switch? | Specifies whether the system runs in diagnostic mode
sbus-probe-list | Specifies the SBus slots to be probed and the order in which to conduct the probe
pcia-probe-list | Controls the probe order of the plug-in devices attached to the pcia bus
pcib-probe-list | Controls the probe order of the plug-in devices attached to the pcib bus
security-mode | Controls the firmware security level
tpe-link-test? | Enables or disables the network interface link test for the built-in twisted pair Ethernet
watchdog-reboot? | Specifies whether the system must reboot automatically when a watchdog reset occurs

Inform students that several OBP variables exist in Sun systems. However, Table 2-4 explains only the
common OBP variables. For information on all the OBP variables, refer students to the OpenBoot 3.x Command
Reference Manual and the OpenBoot 4.x Command Reference Manual on the docs.sun.com Web site.

Most variables on a Sun4U Enterprise server are identical to the OBP
variables on a Sun4U desktop system. You can use the eeprom command
to view the OBP variables in the Solaris OE.


Using the printenv Command


You use the printenv command at the ok prompt to print the OBP
variables along with their values.

Note: If you run the printenv command on a new system, all the OBP
variables with their corresponding default values are displayed.

Ask students to discuss instances in which they run the printenv command.

The following shows a partial output of the printenv command on an
Ultra 10 workstation:
ok printenv
Variable Name         Value          Default Value
tpe-link-test?        true           true
...
...
output-device         screen         screen
boot-command          boot           boot
auto-boot?            true           true
watchdog-reboot?      false          false
diag-file
diag-device           net            net
boot-file
boot-device           disk net       disk net
local-mac-address?    false          false
ansi-terminal?        true           true
...
ok

You can also use the printenv command to display a single OBP variable
and its value. For example, to display the value of the boot-device
variable, you type the following at the ok prompt:
ok printenv boot-device
boot-device =         disk


Modifying OBP Variables and Running Diagnostics


You can modify OBP variables to reconfigure system boot and
environment settings, such as watchdog-reboot? and
local-mac-address?.
Use the following example to explain how to reconfigure the system environment by modifying the OBP
local-mac-address? variable.

Consider a scenario in which you want to enable multipathing on the web
servers. To do this, all the network interface cards (NICs) must have
different media access control (MAC) addresses. By default, all the
NICs on a SPARC system use the same MAC address, which is stored in the
NVRAM chip. New NICs also have a unique local MAC address. To
make the NICs use their local MAC addresses, you set the
local-mac-address? variable to true at the ok prompt.
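
To confirm from the Solaris OE whether interfaces currently share one MAC address, you can compare the ether lines that the ifconfig -a command reports to the superuser. The following is a small sketch of that check; the function name duplicate_macs is hypothetical, and the interface listing in the demonstration is made-up sample text, not output captured from a course system.

```shell
#!/bin/sh
# duplicate_macs (hypothetical helper): reads ifconfig-style text on
# standard input and prints each Ethernet address that appears on more
# than one "ether" line.
duplicate_macs() {
    awk '$1 == "ether" { print $2 }' | sort | uniq -d
}

# On a live system (as superuser):  ifconfig -a | duplicate_macs
# Demonstration with made-up sample text for two interfaces:
duplicate_macs <<'EOF'
hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        ether 8:0:20:f5:58:89
qfe0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        ether 8:0:20:f5:58:89
EOF
```

If the function prints nothing, every listed interface already reports a distinct address.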

Modifying OBP Variables


Modifying the OBP Variables on page OH 2-9

Table 2-5 shows the commands that you use to modify OBP variables and
the locations from where you run each command.

Table 2-5 Commands to Modify OBP Variables

Commands to Modify OBP Variables | Commands Run From
setenv, set-default, set-defaults | ok prompt
stop-n | Keyboard at power on
eeprom | Shell command prompt

Note: The Stop-N key sequence is not supported on Universal Serial Bus
(USB)-equipped systems, such as Sun Fire servers. On these systems, the
functionality of the Stop-N key sequence is simulated by using the Safe
NVRAM mode.
Provide the following information to students about the Safe NVRAM mode: During the boot process, if you
lose access to the system console due to a failed NVRAM configuration change, use the Safe NVRAM mode
to restore access to the console. The settings of the Safe NVRAM mode are temporary and ensure a
successful recovery boot.


Using the setenv Command


You use the setenv command to modify the current values assigned to
the OBP variables at the ok prompt.
For example, to prevent the system from booting automatically, use the
setenv command to set the auto-boot? variable to false:
ok setenv auto-boot? false
auto-boot? = false

Using the set-default Command


You use the set-default command to reset a single OBP variable to its
default value.
For example, use the set-default command to reset the auto-boot?
variable to its default value:
ok set-default auto-boot?
ok

Using the set-defaults Command


You use the set-defaults command to reset all the OBP variables to
their default values. The following message is displayed when you run
the set-defaults command at the ok prompt:
ok set-defaults
Setting NVRAM parameters to default values.
ok
The set-default and set-defaults commands affect only those OBP
variables that have default values.
Ask students to modify the value of an OBP variable, such as the last-hardware-update variable, which
does not have a default value. Run the printenv command to verify that you have set the value. Next, try to
reset the last-hardware-update variable by using either the set-default or set-defaults command.
Notice that the value is not reset. This is because the last-hardware-update variable does not have a
default value.

Using the stop-n Command


You use the stop-n command to reset OBP variables to their default
values during power on. To run the stop-n command, press the Stop-N
keys during power on.



Consider a scenario in which you set the value of the boot-device
variable to a device that does not exist. If the system does not boot, you
can use the Stop-N keys during power on to boot the system from the
default boot device.
You use the Stop-N keys:

Before the ok prompt is displayed

If the security modes are not set

Inform students that they can set the firmware security level at the ok prompt. For information on various
security levels and steps to set the security level, refer students to the documentation at the docs.sun.com
Web site.

Using the eeprom Command


You use the eeprom command to view and modify OBP variables from the
shell. However, you can modify an OBP variable by using the eeprom
command only if you are the superuser. The eeprom command stores the
changes made to the OBP variables in the NVRAM chip without resetting
the system.
Table 2-6 shows various uses of the eeprom command and the
corresponding syntax.
Table 2-6 Uses of the eeprom Command

Syntax | Use
# eeprom | Displays all the variables with their corresponding values
# eeprom parameter | Displays a single parameter and its current value
# eeprom parameter=value | Changes the current value of a parameter

For example, to prevent the system from automatically booting after
completing the power-on self-test (POST) diagnostics, you must alter the
auto-boot? variable. To do this from within the Solaris OE, use the
eeprom command:
# eeprom 'auto-boot?=false'


Note: When you use the eeprom command on variables whose names contain a
question mark, either enclose the variable in quotes or precede the
question mark with an escape character (\). You do this to prevent the
shell from interpreting the question mark as a wildcard character.
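
The effect of the question mark can be reproduced without touching NVRAM. The following sketch runs in a scratch directory that contains a file whose name happens to match the pattern; the helper show_args is hypothetical and only prints the arguments that a command such as eeprom would receive.

```shell
#!/bin/sh
# Work in an empty scratch directory so the demonstration is repeatable.
scratch=$(mktemp -d) && cd "$scratch" || exit 1

# Create a file whose name matches the pattern auto-boot?=false
# ("?" matches any single character, here "X").
: > 'auto-bootX=false'

# show_args (hypothetical helper): prints each argument on its own line.
show_args() { printf '%s\n' "$@"; }

show_args auto-boot?=false      # unquoted: the shell substitutes auto-bootX=false
show_args 'auto-boot?=false'    # quoted: passed through unchanged
show_args auto-boot\?=false     # escaped: passed through unchanged

cd / && rm -rf "$scratch"
```

In the Bourne shell, an unquoted pattern that matches no file is passed through unchanged, which is why the unquoted form often appears to work; quoting removes the dependence on the contents of the current directory.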

Preparing for Manual OBP Diagnostics


When you power on a system, the POST routines within the OBP
firmware execute automatically to perform initial hardware checks on the
system. If the ok prompt is displayed after initial testing, you can
manually diagnose the system by running various diagnostic commands.
However, if your system is set to boot automatically, you must bring it to
the ok prompt and use the setenv command to set the auto-boot?
variable to false. This prevents the system from booting automatically.
ok setenv auto-boot? false

Using Manual OBP Diagnostic Commands


The following diagnostic commands enable you to determine the status of
hardware components at the firmware level:

The probe commands

The test commands

The watch commands

These commands run at the ok prompt.


Using the probe Commands


Inform students that the probe commands help to diagnose hardware devices that are attached to the
system at the firmware level.

Using the probe Commands on page OH 2-10

To diagnose peripheral devices, such as disks, tape drives, and CD-ROMs
that are connected to your system, use the probe commands, as shown in
Table 2-7.

Table 2-7 The probe Commands

Command | Purpose | Output
probe-ide | Probes the Integrated Drive Electronics (IDE) devices that are connected to the on-board IDE interface of the system | Displays the target address, unit number, device type, and manufacturer name of each IDE device
probe-scsi | Probes the small computer system interface (SCSI) devices, such as disks and tape drives, which are attached to the on-board SCSI controller | Displays the target address, unit number, device type, and manufacturer name of each SCSI device
probe-scsi-all | Probes the devices attached to the on-board SCSI controller as well as the devices that are attached to the PCI and SBus SCSI controllers | Identifies SCSI devices by their target addresses

You must use the probe-scsi-all command on an Ultra 5, Ultra 10, or
Sun Blade system if the system has a SCSI card attached to it. The
probe-scsi command probes only the on-board SCSI controllers and not
the peripheral card controllers. Before you use the probe-scsi or
probe-scsi-all command, you must power on all the SCSI devices,
because these commands can detect the connected SCSI devices only if
the devices are powered on.
To view the sample outputs of the probe commands, refer to Appendix B,
Additional Information, on page B-2.


Using the test Commands

Using the test Commands on page OH 2-11

To test the hardware devices attached to the system, use the test
commands, as shown in Table 2-8.

Table 2-8 The test Commands

Command | Purpose
test-all | Tests all the devices in a system that have a built-in test program, such as SBus cards. This command does not test the tape drives, CD-ROMs, and hard disks.
test floppy | Tests the response of the floppy drive to commands.
test net | Performs internal and external loopback tests on the autoselected system Ethernet interface.

When you run the test-all or the test floppy command to test the
removable media drives, such as a diskette or a CD-ROM drive, ensure
that the media is inserted in the drive.

Note: Before running the test and watch commands, you must reset the
system once after dropping to the ok prompt. This clears all buffers and
registers and helps to prevent the system from hanging.
To view sample outputs of the test commands, refer to Appendix B,
Additional Information, on page B-4.


Using the watch Commands

Using the watch Commands on page OH 2-12

To monitor the network traffic and clock function of the system, use the
watch commands, as shown in Table 2-9.

Table 2-9 The watch Commands

Command | Purpose
watch-net | Monitors broadcast Ethernet packets on the Ethernet cables that are connected to the system
watch-net-all | Monitors Ethernet packets on all the Ethernet interfaces that are installed on the system
watch-clock | Displays seconds from the TOD chip in NVRAM

To view sample outputs of the watch commands, refer to Appendix B,
Additional Information, on page B-6.

Using OBP Commands to Display System Information


You use the following OBP commands to display information about the
system configuration at the ok prompt:


The banner command

The .version command

The .speed command

The .enet-addr command

The sifting command

The see command


Using the banner Command


You use the banner command to obtain information about the system,
such as processor information, installed memory, the host ID, the MAC
address, the serial number, and the firmware version. However, the
banner command displays only an abbreviated firmware version (3.25 in
the following sample), not the complete revision.
The following is a sample output of the banner command:
ok banner
Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 440MHz), Keyboard Present
OpenBoot 3.25, 256MB (50 ns) memory installed, Serial # 16078985.
Ethernet address 8:0:20:f5:58:89, HOST ID: 80f55889
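
In this sample output, the identification fields are related: the last three bytes of the Ethernet address (f5:58:89) match the last six hexadecimal digits of the host ID (80f55889), and the serial number is that same value written in decimal. The following sketch checks the arithmetic for the sample values only. It is an observation about this output rather than a rule stated in the course, and the function name hostid_to_serial is hypothetical.

```shell
#!/bin/sh
# hostid_to_serial (hypothetical helper): drops the first two hexadecimal
# digits of an 8-digit host ID and prints the remaining 24 bits in decimal.
hostid_to_serial() {
    low=${1#??}                 # 80f55889 -> f55889
    printf '%d\n' "0x$low"
}

hostid_to_serial 80f55889       # prints 16078985, the serial number in the banner
```

A mismatch between these fields on a real system would be worth a closer look at the NVRAM chip contents.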
Inform students that they can use the oem-banner and oem-banner? configuration variables to modify the
text field in the banner. To insert a custom text field in the power-on banner, type the following commands:
ok setenv oem-banner This is a custom banner.
ok setenv oem-banner? true

Using the .version Command


You use the .version command to display the version and date of OBP.
The .version command also provides the version number of POST.
The following is a sample output of the .version command:
ok .version
Release 3.25 Version 3 created 2000/06/29 14:12
OBP 3.25.3 2000/06/29 14:12
POST 3.1.0 2000/06/27 13:56
Inform students that on a Sun Enterprise server, the .version command also displays the version of
OpenBoot Diagnostics (OBDiag).


Using the .speed Command


You use the .speed command to display the clock frequency of the CPU,
Ultra Port Architecture (UPA), and SBus or PCI buses.
The following is a sample output of the .speed command:
ok .speed
CPU speed : 440.00MHz
UPA speed : 110.00MHz
PCI Bus A : 33MHz
PCI Bus B : 33MHz

Using the .enet-addr Command


You use the .enet-addr command to display the current Ethernet address
of your system.
The following shows a sample output of the .enet-addr command:
ok .enet-addr
8:0:20:f5:58:89
ok
If required, ask students to recall the example about multipathing, which is described in Modifying OBP
Variables and Running Diagnostics on page 2-12. Explain that while assigning unique MAC addresses to the
NICs, you use the .enet-addr command to view the current MAC addresses.
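
The sample output prints each byte of the Ethernet address without leading zeros (8:0:20:f5:58:89), whereas inventory records and other tools often use two digits per byte. When you compare MAC addresses from different sources, it can help to normalize them first. The following is a small sketch; the function name normalize_mac is hypothetical.

```shell
#!/bin/sh
# normalize_mac (hypothetical helper): pads each byte of a colon-separated
# Ethernet address to two lowercase hexadecimal digits.
normalize_mac() {
    old_ifs=$IFS
    IFS=:
    set -- $1                   # split the address into its six bytes
    IFS=$old_ifs
    printf '%02x:%02x:%02x:%02x:%02x:%02x\n' \
        "0x$1" "0x$2" "0x$3" "0x$4" "0x$5" "0x$6"
}

normalize_mac 8:0:20:f5:58:89   # prints 08:00:20:f5:58:89
```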

Using the sifting Command


The sifting command displays the names of all the commands that contain
a specified text string. The sifting command is similar to the UNIX
grep command.
For example, to list the commands for performing operations on
devices, you use the sifting command with the argument dev:
ok sifting dev
In vocabulary known-int-properties
(f0061264) devsel-speed
(f00611c0) device-id
In vocabulary magic-properties
(f0029940) device-type
In vocabulary forth
(f005f600) show-pci-config-dev#)
(f005b9cc) cfg>dev#
(f004fd18) test-dev
(f002f624) map-device
....
....
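
For students who know the Solaris OE better than the ok prompt, the comparison with grep can be made concrete. The following sketch applies the same substring match to a short word list in the shell. The function name sift is hypothetical, and the list mixes names from the sample output with two other OBP commands from this module for contrast.

```shell
#!/bin/sh
# sift (hypothetical helper): prints the lines on standard input that
# contain the fixed string given as the first argument, which is how
# "sifting dev" selects word names at the ok prompt.
sift() {
    grep -F -- "$1"
}

sift dev <<'EOF'
devsel-speed
device-id
banner
test-dev
map-device
printenv
EOF
```

The names banner and printenv are filtered out because they do not contain the string dev.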


Using the see Command


You can use the built-in Forth language decompiler to recreate or
customize the source code for any previously defined Forth word. The
see command displays the source code for the command that is passed to
it as an argument.
In the following example, the .version command is passed as the
argument to the see command. The output displays the source of
the .version command:
ok see .version
: .version
.version current-device >r /flashprom find-device version get-property if
exit
then 0 left-parse-string type cr type cr device-end r> push-device
If students want to know more about Forth coding, they can view the source for the commands described
earlier in the module. Additional details about the Forth language decompiler are provided at the
docs.sun.com Web site.


Exercise: Modifying the OBP Variables


In this exercise, you use the OBP commands to display and modify OBP
variables.
Explain to students that the set of questions in the exercise facilitates the revision of the content described in
the module. Instruct students to perform tasks and attempt questions in the sequence that they appear.
Inform them that they can refer to the lecture notes to attempt the exercise.

Preparation
To complete system diagnostics, perform a shutdown procedure to access
the OBP environment and run the OBP commands at the ok prompt.

Note: Due to different PROM revisions, the syntax for the OBP
commands can vary slightly. For more information, refer to the OpenBoot
3.x Quick Reference Card or the OpenBoot 4.x Command Reference Manual.

Tasks
Perform the following tasks:

1. Access the ok prompt, and set the appropriate variable so that the
   system does not boot automatically.

2. Use the appropriate command to display the list of OBP variables on
   your system. Record the current and default values for the
   boot-device variable.

3. Modify the boot-device variable so that the system boots from the
   disk disk0.

4. Verify that the system boots from the disk disk0.

5. Reset the boot-device variable to its default value.


Exercise Solutions
The following are solutions for the exercise steps:

1. Access the ok prompt, and set the appropriate variable so that the
   system does not boot automatically.
ok printenv auto-boot?
ok setenv auto-boot? false

2. Use the appropriate command to display the list of the OBP
   variables on your system. Record the current and default values for
   the boot-device variable.
   Run either of the following commands:
ok printenv
ok printenv boot-device

3. Modify the boot-device variable so that the system boots from the
   disk disk0.
ok setenv boot-device disk0

4. Verify that the system boots from the disk disk0.
ok boot
   If the system boots from the net device instead of the disk disk0, check
   whether the diag-switch? variable is set to true. If it is, set it to false,
   and reset the system. Verify that the system now boots from the disk disk0.

5. Reset the boot-device variable to its default value.
   Access the ok prompt, and run the following command:
ok set-default boot-device


Exercise: Performing Manual OBP Diagnostics


In this exercise, you use the OBP commands to perform the following
diagnostics on your system:

Examine system information

Display and modify OBP variables at both firmware and OS levels

Perform hardware diagnostics

Set the default values of the OBP variables

Explain to students that the set of questions in this exercise facilitates the revision of the content in the
module. Instruct students to perform tasks and attempt questions in the sequence that they appear. Inform
them that they can refer to the lecture notes to attempt the exercise.

Preparation
Perform a shutdown procedure to access the OBP environment, and run
the OBP commands at the ok prompt to perform system diagnostics.
If the systems in the lab do not have a SCSI disk, arrange at least one
system with a SCSI drive. You can connect this system to the overhead
projector to enable students to observe the output of the probe-scsi and
probe-scsi-all diagnostics commands.

Note: The exercise tasks that involve running the probe-scsi and
probe-scsi-all commands cannot be completed if the system does not
have a SCSI-based hard disk.


Tasks
Perform the following tasks:

1. Prepare for OBP diagnostics.
   a. Ensure that the system is at the ok prompt.
   b. Set the appropriate variable so that the system does not boot
      automatically.
   c. When the ok prompt is displayed, set OBP variables to the
      default values.

2. Display the following information about the system:
   a. Installed memory
   b. PROM serial number
   c. MAC address
   d. Host ID

3. Use the appropriate commands to display the version of the boot
   PROM firmware and the speed of the system buses.

4. Use the probe commands to display the list of IDE and SCSI devices
   that are attached to your system.

5. Study the information displayed by the probe commands, and
   identify the main differences between various probe commands.

6. Run the test command to test the system.

7. Use the watch command to check the clock function of the system.

8. Use the appropriate command to check the version of POST on your
   system.

9. Display all the commands containing the string probe.

Exercise Summary

Discussion - Take a few minutes to discuss what experiences, issues, or
discoveries you had during the lab exercise.

Manage the discussion based on the time allowed for this module, which was provided in the About This
Course module. If you do not have time to spend on discussion, highlight just the key concepts students
should have learned from the lab exercise.

Experiences

Ask students what their overall experiences with this exercise have been. Go over any trouble spots or
especially confusing areas at this time.

Interpretations

Ask students to interpret what they observed during any aspect of this exercise.

Conclusions

Have students articulate any conclusions they reached as a result of this exercise experience.

Applications

Explore with students how they might apply what they learned in this exercise to situations at their workplace.

Exercise Solutions
The following are the solutions to the exercise tasks:
1. Prepare for OBP diagnostics.
   a. Ensure that the system is at the ok prompt.
   b. Set the appropriate variable so that the system does not boot
      automatically.
      ok setenv auto-boot? false
   c. When the ok prompt is displayed, set OBP variables to their
      default values.
      ok set-defaults
2. Display the following information about the system:
   a. Installed memory
   b. PROM serial number
   c. MAC address
   d. Host ID
   ok banner
3. Use the appropriate commands to display the version of the boot
   PROM firmware and the speed of the system buses.
   ok .version
   ok .speed
4. Use the probe commands to display the list of IDE devices and SCSI
   devices that are attached to your system.
   ok probe-ide
   ok probe-scsi
   ok probe-scsi-all
5. Study the information displayed by the probe commands, and
   identify the main differences between various probe commands.
   The probe-ide command identifies the peripheral devices that are
   attached to the on-board IDE controller.
   The probe-scsi command identifies the SCSI devices, such as disks,
   tape drives, and CD-ROMs, that are attached to the on-board SCSI
   controller.
   The probe-scsi-all command identifies the SCSI devices attached to
   both the on-board SCSI and SBus SCSI controllers.
6. Run the test command to test the system.
   ok test-all
7. Use the watch command to check the clock function of the system.
   ok watch-clock
8. Use the appropriate command to check the version of POST on your
   system.
   ok .version
9. Display all the commands containing the string probe.
   ok sifting probe

Module 3

Enabling and Monitoring POST Diagnostics


Objectives
Overview on page OH 3-2

Upon completion of this module, you should be able to:

Describe the concepts of POST

Identify ways to view extended diagnostics during POST

Relevance
Present the following questions to stimulate the students and get them to think about the issues and topics
presented in this module. While they are not expected to know the answers to these questions, the answers
should be of interest to them and inspire them to learn the material presented in this module.

Relevance on page OH 3-3

Discussion - The following questions are relevant to understanding the
reasons for performing POST diagnostics:

What hardware problems, if any, have you experienced with your
systems?

Did the POST messages displayed on the screen during the boot
process help you to troubleshoot problems with your systems?

What other resources help you to troubleshoot hardware problems?

Allow students to share their work experiences, and ask them to list the hardware problems that they
encountered while working on their systems.

Additional Resources
Additional resources - The following references can provide additional
information on the topics discussed in this module:

Field Engineer Handbook, part numbers 800-4006-16 and 800-4247
(http://sunsolve.sun.com), accessed 17 December 2001.

Sun System Handbook (http://sunsolve.sun.com/handbook_pub),
accessed 07 May 2002.

OpenBoot Command Reference, part number 800-6076.

OpenBoot 2.x Quick Reference Card, part number 802-1958.

OpenBoot 3.x Quick Reference Card, part number 802-3240.

OpenBoot 4.x Command Reference Manual (http://docs.sun.com),
accessed 14 March 2002.

Solaris User and System Administration Answer Books
(http://docs.sun.com), accessed 17 December 2001.

Note - The OpenBoot Command Reference and Quick Reference Card are now
part of the software supplement for the Solaris 9 OE CD.

Introducing POST Concepts


Inform students that this section describes the importance, capabilities, and limitations of POST in identifying
and resolving system faults. Focus on the features of POST and how it helps to analyze system faults.

POST is a binary program written for the SPARC processor. It resides
within OBP and executes automatically when the system is powered on.
POST initializes and tests the hardware that is part of the system.
POST performs a series of diagnostic tests on the hardware components of
the system board to verify that all the components are functioning
properly. POST also helps to determine which components failed and
must be replaced. The error messages displayed during the POST
sequence help administrators and support personnel to determine whether
hardware problems exist on the system.

Identifying the Testable Components


POST performs tests on the following:
POST performs tests on the following: on page OH 3-4

CPU modules

Memory management units (MMUs)

Memory, with tests such as Init Memory and Block Memory Addr Tests

Interrupts

NVRAM, with tests such as NVRAM Battery Detect Test, NVRAM Scratch
Addr Test, and NVRAM Scratch Data Test

Cache, with tests such as Ecache Tests and Basic Cache Tests

Inform students that Ecache RAM Addr, Ecache Tag Addr, Ecache RAM, and Ecache Tag are part of
Ecache Tests. In addition, Dcache RAM, Dcache Tag, Icache RAM, Icache Tag, Icache Next, and
Icache Predecode are part of Basic Cache Tests.

Register tests

POST does not perform extensive tests on components that are attached
to the main logic board, such as SBus or PCI cards and their associated
I/O devices. You test these components by using OBP.

Describing the diag-switch? Variable


The diag-switch? variable is an OBP variable whose value is either true
or false. The default value is false. For example, on an Ultra 10
workstation, when you set the value of the diag-switch? variable to
true, the system boots in extended diagnostic mode. When you set the
value of the diag-switch? variable to false, the system boots in normal
mode. Table 3-1 describes the normal and extended diagnostic modes on
an Ultra 10 workstation.
Table 3-1 Normal and Extended Diagnostic Modes

Mode                 Description
-------------------  --------------------------------------------------
Normal diagnostic    In this mode, no progressive test messages are
                     displayed on the console during the execution of
                     POST tests. If an error occurs, the error messages
                     are displayed on either the tty-type terminal or
                     the console.

Extended diagnostic  In this mode, POST displays progressive test
                     messages on the console during the execution of
                     POST. If POST is successful, control is
                     transferred to OBP. Then, OBP probes the installed
                     SBus or PCI modules.

If required, describe a tty-type terminal. You can provide the following definition:
A tty-type terminal is the serial port for the system console. To define the communication parameters on the
serial port, you set the configuration variables for the port.

Identifying the Methods to Enable Extended POST Diagnostics

You enable extended POST in the Sun firmware to run the extended
POST diagnostic tests. When you power on the system, extended POST
diagnostic tests are invoked automatically if you have already performed
the following:

Set the diag-switch? variable to true

Pressed the Stop-D keys after powering on the system
Set the value of the diag-level variable to max

You can set three different levels of diagnostic tests for POST and
choose to run all or some of the tests. The OBP firmware initiates the
selected level of POST, based on the value set for the diag-level
variable. You can set the diag-level variable to the following values:

off (no testing) - No tests are performed during system power on.

max (maximum) - POST performs an extended set of diagnostic-level
tests.

min (minimum) - POST performs an abbreviated set of diagnostic-level
tests.

Note - You must ensure that the diag-device variable is set to a
bootable device when you enable extended POST.

Inform students that the diag-device parameter is described later in the module.

Setting the diag-switch? Variable to true


To boot the system in extended diagnostic mode, you must set the value
of the diag-switch? variable to true.
To set the value of the diag-switch? variable to true, type the following
command at the ok prompt:
ok setenv diag-switch? true
ok
When the diag-switch? variable is set to true, the system:

Performs self-tests during any subsequent system power on.

Displays additional status messages.

Uses different configuration variables to boot a system. For example:

If the auto-boot? variable is set to true, the system boots
from the boot device specified in the diag-device variable.

If the auto-boot? variable is set to false, the system remains
at the PROM monitor without booting.
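
The interaction between the diag-switch? and auto-boot? variables can be
sketched as a small decision function. The following Python model is an
illustration only, not firmware code: the function name
select_boot_source and its parameters are hypothetical, and the default
values for diag-device and boot-device are the defaults stated in this
module.

```python
from typing import Optional


def select_boot_source(diag_switch: bool, auto_boot: bool,
                       diag_device: str = "net",
                       boot_device: str = "disk net") -> Optional[str]:
    """Return the variable value the system would boot from, or None.

    Hypothetical model of the behavior described in this module.
    """
    if not auto_boot:
        # auto-boot? false: the system remains at the ok prompt.
        return None
    # diag-switch? decides which variable supplies the boot device.
    return diag_device if diag_switch else boot_device


if __name__ == "__main__":
    for diag in (True, False):
        for auto in (True, False):
            print(f"diag-switch?={diag} auto-boot?={auto} ->",
                  select_boot_source(diag, auto))
```

The sketch returns None when the system would stay at the PROM monitor,
which keeps the "no boot" case distinct from any device name.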

Using the Stop-D Key Sequence


You use the Stop-D keys to control the POST phase.
When you press the Stop-D keys after you power on the system, the value
of the diag-switch? variable is set to true. The firmware automatically
runs POST diagnostic tests at the level specified by the diag-level
variable. After the tests are complete, the system boots from the device
that is specified in the diag-device variable.

Using the System Key Switch


You can show the OH slide of Figure 3-1 on page 3-8 at this time.

Different modes of the system key on page OH 3-5

You use the system key switch to control the power-on mode of a system.
For example, on a Sun Enterprise 250 server, the system key switch has
four positions. Table 3-2 describes the function of each key-switch
position.
Table 3-2 System Key-Switch Settings and Functions

Name of
Switch Position  Description
---------------  ------------------------------------------------------
Power-On         Starts the system power-on process

Diagnostics      Starts the system power-on process and runs the POST
                 and OpenBoot diagnostic tests

Locked           Starts the system power-on process and disables the
                 keyboard Stop-A key sequence and the terminal Break
                 key

Standby          Turns off the power to all internal system components
                 and sets the power supply to standby mode

Inform students that locked mode prevents users from suspending system operations and accessing the ok
prompt. This prevents you from modifying the OBP parameters that are stored in the NVRAM chip from the
console unless you are logged in as a superuser.
Inform students that when the key switch is in the Standby position, the keyboard power switch is disabled.
Inform students that on a Sun Fire server, the Standby mode is known as the off mode. For more information
on Sun Fire servers, refer students to the Sun Fire 280R Server Owner's Guide available at
http://www.sun.com/products-n-solutions/hardware/docs/pdf/806-4806-10.pdf.

Figure 3-1 illustrates the four different modes of the system key on a Sun
Enterprise 250 server.

Figure 3-1 Different Modes of the System Key

Booting From the diag-device or boot-device Variable

You can boot the system from the device named in either the diag-device
or the boot-device variable, depending on the value of the diag-switch?
variable.

Using the diag-device Variable

The diag-device variable contains the name of the default diagnostic
mode boot device. When you set the value of the diag-switch? variable
to true and power on the system, it boots from the device specified in
the diag-device variable. The default value of the diag-device variable
is net.

Explain to students that when they type boot at the ok prompt, the system boots from the default startup
device if the diag-switch? variable is not set. You can set the value of the startup device in the
boot-device variable.

In the following example, you set the value of the diag-device variable
to disk and boot the system:
Inform students that this is a partial output.

ok setenv diag-device disk
diag-device =         disk
ok boot
Boot device: disk File and args:
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Ultra-5_10/ufsboot
Loading: /platform/sun4u/ufsboot
........ <output truncated>

Using the boot-device Variable


The boot-device variable contains the name of the device from which
the system boots when the diag-switch? variable is set to false. The
boot-device variable contains one or more device specifiers separated by
spaces. Each device specifier is a device alias.
The boot PROM attempts to open each successive device specifier in the
list if the previous device is not available. The system uses the first device
specifier that opens successfully. The default value of the boot-device
variable is disk net.
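
The fallback behavior of the boot-device list can be sketched as follows.
This is a hedged Python model rather than firmware code: the function
first_bootable and the can_open callback are hypothetical stand-ins for
the boot PROM attempting to open each device alias in turn.

```python
from typing import Callable, Optional


def first_bootable(boot_device: str,
                   can_open: Callable[[str], bool]) -> Optional[str]:
    """Return the first device specifier that opens successfully.

    boot_device holds one or more device aliases separated by spaces,
    as in the boot-device variable. can_open is a hypothetical probe
    that reports whether a given alias can be opened.
    """
    for alias in boot_device.split():
        if can_open(alias):
            return alias
    # No device in the list could be opened.
    return None


if __name__ == "__main__":
    # Simulate a system whose disk is unavailable but whose network is up.
    available = {"net"}
    print(first_bootable("disk net", lambda alias: alias in available))
```

In this simulation the disk alias fails to open, so the sketch falls
through to net, which mirrors the default value disk net.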
In the following example, you set the value of the boot-device variable
to disk and boot the system:
Inform students that this is a partial output.

ok setenv diag-switch? false
diag-switch? =        false
ok setenv boot-device disk
boot-device =         disk
ok boot
Rebooting with command: boot
Boot device: disk File and args:
SunOS Release 5.9 Version s81_54 64-bit
Copyright 1983-2001 Sun Microsystems, Inc. All rights reserved.
configuring IPv4 interfaces: hme0.
.........<output truncated>

Viewing Extended Diagnostics During POST


You use the following commands to view the execution and results of
POST diagnostics:

tip - An OS-level command run from a second system that you use
to establish a serial connection to the first system

prtdiag - An OS-level command run on a system to display
diagnostic information about the system

show-post-results - An OBP-level command that you use at the
firmware level of a system to display information about the last
executed POST

Using the tip Command


On Sun hardware, a keyboard plugged into the keyboard port of the
system is discovered at system startup and assigned as the input channel
for the system console, /dev/console. However, if no keyboard is
discovered at startup, all console input and output is routed to serial
port A, /dev/term/a.
The serial port is a standard serial device that you can connect to a
dumb terminal, a dial-in modem, or another Sun system. If you connect
the serial port to another Sun system, you can use the tip command on
that functional system to interact with the system running POST. The
following types of serial devices are available on a Sun system:

Data Terminal Equipment (DTE) devices, such as terminals

Data Communication Equipment (DCE) devices, such as modems

DTE devices use pin 2 to transmit data and pin 3 to receive data, while
DCE devices use pin 2 to receive data and pin 3 to transmit data. This
pin setup works well for terminal-to-modem communication (DTE to DCE).
However, when you set up a tip connection between two DTE devices, such
as two Sun systems, a straight-through modem cable does not allow
communication (DTE to DTE) because both devices transmit on pin 2 and
receive on pin 3.

Note - A null modem cable, also known as a crossover cable, switches or
crosses the transmit signal in the cable from pin 2 to pin 3 and switches
the receive signal from pin 3 to pin 2.
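
The pin assignments above can be checked with a short model. The
following Python sketch is illustrative only: the dictionaries and the
link_works function are hypothetical names, and the model covers only
the transmit and receive pins (2 and 3) that this section describes.

```python
# Pin roles as described in this section.
DTE = {"tx": 2, "rx": 3}
DCE = {"tx": 3, "rx": 2}

# Each cable maps a pin at end A to the pin it reaches at end B.
STRAIGHT_THROUGH = {2: 2, 3: 3}
NULL_MODEM = {2: 3, 3: 2}


def link_works(end_a: dict, end_b: dict, cable: dict) -> bool:
    """Return True if each transmit pin reaches the peer's receive pin."""
    # Invert the cable to follow signals travelling from end B to end A.
    reverse = {b_pin: a_pin for a_pin, b_pin in cable.items()}
    a_to_b = cable[end_a["tx"]] == end_b["rx"]
    b_to_a = reverse[end_b["tx"]] == end_a["rx"]
    return a_to_b and b_to_a


if __name__ == "__main__":
    print("DTE-DCE, straight-through:", link_works(DTE, DCE, STRAIGHT_THROUGH))
    print("DTE-DTE, straight-through:", link_works(DTE, DTE, STRAIGHT_THROUGH))
    print("DTE-DTE, null modem:      ", link_works(DTE, DTE, NULL_MODEM))
```

Under these assumptions, only the null modem mapping connects two DTE
devices, which is why a tip connection between two Sun systems needs a
crossover cable.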

The tip command is an OS-level command in the Solaris OE that
establishes a serial session between two Sun systems through a null
modem (crossover) cable connected to the /dev/term/a port of each
system.
The following are the two ways of invoking the tip command:

Use the hardwire argument. For example:
# tip hardwire
You use the hardwire argument as an index in the /etc/remote file
to retrieve specific serial device configuration information as shown:
cuab:dv=/dev/cua/b:br#2400
dialup1|Dial-up system:\
        :pn=2015551212:tc=UNIX-2400:
hardwire:\
        :dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D:
.......<output truncated>
The tip command establishes a serial connection at 9600 bits per
second (9600 baud) through the /dev/term/a port.

Invoke the tip command manually, and provide the baud rate as an
option. In addition, provide the serial port to use as an
argument. For example:
# tip -9600 /dev/term/a
The tip command establishes a serial connection at 9600 baud
through the /dev/term/a port.
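
To see how the hardwire entry yields the device and baud rate, the entry
can be split into its fields. The following Python sketch is a
simplified reader written for illustration, not the parser that tip
itself uses. It assumes two-character capability names followed by = for
string values or # for numeric values, as in the excerpt above. The
function name parse_remote_entry is hypothetical.

```python
def parse_remote_entry(entry: str) -> dict:
    """Split one /etc/remote entry into a dictionary of capabilities."""
    # Join backslash-continued lines before splitting on ':'.
    flat = entry.replace("\\\n", "")
    names, *fields = flat.split(":")
    caps = {"names": names.strip().split("|")}
    for field in fields:
        field = field.strip()
        if not field:
            continue
        key, sep, value = field[:2], field[2:3], field[3:]
        if sep == "=":
            caps[key] = value          # string capability, such as dv
        elif sep == "#":
            caps[key] = int(value)     # numeric capability, such as br
        else:
            caps[field] = True         # boolean capability
    return caps


HARDWIRE = ("hardwire:\\\n"
            "        :dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D:")

if __name__ == "__main__":
    caps = parse_remote_entry(HARDWIRE)
    print("device:", caps["dv"])
    print("baud rate:", caps["br"])
```

For the hardwire entry shown in this section, the sketch reports the
/dev/term/a device and a rate of 9600, the same values that the manual
invocation supplies on the command line.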

Note - Workstations, such as the Ultra 5 and Ultra 10, have two serial
ports: serial port A, a 25-pin female port, and serial port B, a 9-pin
male port. Most null modem cables are 25-pin male to 25-pin female
cables. As a result, you can use the hardwire entry as an argument to
the tip command only if you either edit the hardwire entry within the
/etc/remote file or use a 9-pin female to 25-pin male crossover cable.

To manage the tip session, you can use the following tilde commands:

The ~. or ~<CTRL-D> command - Enables you to exit a tip session

The ~# command - Enables you to send a break sequence to the
remote system

The ~? command - Displays a short menu of the available
tip tilde commands

Setting up a tip Connection


To set up a tip connection, you must complete the following steps on the
remote system that runs the tip command to observe POST:

1. Edit the hardwire entry within the /etc/remote file to provide
   support for serial port A:
   hardwire:\
           :dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D:
2. Attach one end of a null modem cable to serial port A of the test
   system and the other end to the remote Sun system.
3. In a terminal window, invoke the tip command:
   # tip hardwire

Complete the following steps on the system that runs POST:

1. In a terminal window, use the init command to drop to the PROM
   monitor:
   # init 0
2. When the ok prompt is displayed, ensure that you set the following
   OBP variables that control extended POST:
   ok setenv diag-switch? true
   ok setenv diag-level max
3. Power down the system.
4. Remove the keyboard cable from the rear of the system.
5. Power on the system.

Note - When POST completes execution on the test system, you can
terminate the tip connection from the remote system by using the ~.
command.

Unless noted otherwise, the sample POST output in this module was
generated on Ultra 10 workstations. Refer to the Architecture of the
Ultra 5 and Ultra 10 Workstations section on page Appendix B-7 for the
graphic displaying the architecture of the Ultra 5 and Ultra 10
workstations.
Inform students that this is a partial output of the POST sequence.

# tip hardwire
connected
@(#) Sun Ultra 5/10 UPA/PCI 3.25 Version 3 created 2000/06/29 14:12
Probing keyboard Done
%o0 = 0000.0000.0000.4001
Executing Power On SelfTest
@(#) Sun Ultra 5/10 (Darwin) POST 3.1.0 (Build No. 626) 13:56 on 06/27/00
CPU: UltraSPARC-LC (Clock Frequency: 440MHz, Ecache Size: 2048KB)
Init POST BSS
Init System BSS
NVRAM
...............<output truncated>
Power On Selftest Completed
Software Power ON0.0000.0000.0000 ffff.ffff.f00b.4858 0002.3333.0200.001b
...............<output truncated>
The following is POST output with errors, which was generated on a Sun
Enterprise 3500 server. The environmental probing on the server detects
a failure caused by a disconnected fan. This is an example of POST
detecting a problem that is not fatal, so the system continues to boot.
However, if you do not shut down the system and correct the fault, the
system can overheat and shut down automatically.
# tip hardwire
connected
Hardware Power ON
@(#) Ultra Enterprise 3.2 Version 29 created 2001/06/18 17:28
CPU = 0000.0000.0000.0006
Probing keyboard Done
3,0>
3,0>@(#) POST 3.9.29 2001/06/18 17:50
3,1>
3,0>Copyright 2001 Sun Microsystems, Inc. All rights reserved.
3,1>@(#) POST 3.9.29 2001/06/18 17:50
3,0>
SelfTest Initializing (Diag Level 10, ENV 00004001) IMPL 0011 MASK a0
3,1>Copyright 2001 Sun Microsystems, Inc. All rights reserved.
3,0>Board 3 CPU FPROM Test
3,1>
SelfTest Initializing (Diag Level 10, ENV 00000000) IMPL 0011 MASK a0
.......................................<output truncated>
3,0>Board 3 Environmental Probe Test
3,0>
Environmental Probe
3,0>ERROR: TEST=Environmental Probe,SUBTEST=Environmental Probe
ID=1f.1
3,0>Component under test: Board 3 System Interrupt

3,0>Disk Fan failed
3,0>Checking Power Supply Configuration
3,0>Power is more than adequate, load 4 ps 3
3,0>
..................................<output truncated>
POST COMPLETE
3,0>Entering OBP
....................................<output truncated>
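
When you capture POST output through a tip session, you can scan the log
for error records instead of reading every line. The following Python
sketch is a hypothetical helper, not a Sun tool. It assumes that each
message line starts with a prefix of the form 3,0> and that error lines
follow the ERROR: TEST=...,SUBTEST=... layout shown in the preceding
output. Treating the two prefix numbers as a board number and a CPU
number is an assumption made for illustration.

```python
import re

# Prefix such as "3,0>" followed by the message text.
LINE = re.compile(r"^(?P<board>\d+),(?P<cpu>\d+)>(?P<text>.*)$")
# Error record layout taken from the sample output in this module.
ERROR = re.compile(r"ERROR:\s*TEST=(?P<test>[^,]+),\s*SUBTEST=(?P<subtest>.+)$")


def find_post_errors(lines):
    """Return a list of dictionaries, one per POST error line."""
    errors = []
    for raw in lines:
        line_match = LINE.match(raw.strip())
        if not line_match:
            continue  # banner lines and truncation markers have no prefix
        error_match = ERROR.search(line_match.group("text"))
        if error_match:
            errors.append({
                "board": int(line_match.group("board")),
                "cpu": int(line_match.group("cpu")),
                "test": error_match.group("test").strip(),
                "subtest": error_match.group("subtest").strip(),
            })
    return errors


SAMPLE = [
    "3,0>Board 3 Environmental Probe Test",
    "3,0>ERROR: TEST=Environmental Probe,SUBTEST=Environmental Probe",
    "3,0>Component under test: Board 3 System Interrupt",
    "3,0>Disk Fan failed",
    "POST COMPLETE",
]

if __name__ == "__main__":
    for error in find_post_errors(SAMPLE):
        print(error)
```

The lines that follow an error record, such as the component under test
and the Disk Fan failed message, still need to be read in context. The
sketch only locates the records.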

Using the prtdiag Command


The prtdiag command displays the following information about the
Sun4U and Sun4d systems:

System configuration, including information about the frequency of
the clock, the CPU, memory, and the I/O card types

Diagnostic information

Failed field-replaceable units (FRUs)

Note - The prtdiag command does not display diagnostic information
and environmental status when executed on the Sun Enterprise 10000
server. You must refer to the
/var/opt/SUNWssp/adm/${SUNW_HOSTNAME}/messages file on the
System Service Processor (SSP) to obtain this status information.
The following is the syntax for the prtdiag command:
/usr/platform/`uname -m`/sbin/prtdiag [-v][-l]

Table 3-3 lists the options supported by the prtdiag command.

Table 3-3 Options Supported by the prtdiag Command

Option  Description
------  ---------------------------------------------------------------
-v      Specifies verbose mode. The -v option displays the time of the
        most recent alternating current (AC) power failure and the
        most recent hardware fatal-error information. The hardware
        fatal-error information helps you to repair hardware faults
        and provides detailed diagnostic information about the FRUs.
        In addition, the -v option displays the environmental status,
        if applicable.

-l      Logs the output. If the prtdiag command detects failures or
        errors on the system, the output is sent to the syslogd
        daemon.

When you execute the prtdiag command, the following exit values are
returned:

0 - Indicates that no failures or errors are detected in the system.

1 - Indicates that failures or errors are detected in the system.

2 - Indicates that an internal prtdiag error has occurred in the
system. For example, the system has run out of memory.
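
A monitoring script can act on these exit values. The following Python
sketch shows one possible approach under stated assumptions: the
function names are hypothetical, and the run_prtdiag wrapper assumes
that os.uname().machine returns the same platform name as the uname -m
command used in the syntax above. Only the exit-value mapping comes from
this module.

```python
import os
import subprocess

# Exit values as documented in this module.
PRTDIAG_EXIT_MEANINGS = {
    0: "no failures or errors detected",
    1: "failures or errors detected",
    2: "internal prtdiag error",
}


def describe_prtdiag_exit(code: int) -> str:
    """Translate a prtdiag exit value into a short description."""
    return PRTDIAG_EXIT_MEANINGS.get(code, f"unexpected exit value {code}")


def run_prtdiag(verbose: bool = True) -> str:
    """Run prtdiag and describe its exit value (Solaris systems only)."""
    path = f"/usr/platform/{os.uname().machine}/sbin/prtdiag"
    args = [path, "-v"] if verbose else [path]
    result = subprocess.run(args, capture_output=True, text=True)
    return describe_prtdiag_exit(result.returncode)


if __name__ == "__main__":
    for code in (0, 1, 2):
        print(code, "->", describe_prtdiag_exit(code))
```

The mapping function is kept separate from the wrapper so that it can be
exercised on any system, even one where the prtdiag command is not
installed.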

The following example displays prtdiag output with no errors, which was
generated on a Sun Enterprise 450 server. You use the prtdiag -v
command to view the following output:
# /usr/platform/sun4u/sbin/prtdiag -v
System Configuration: Sun Microsystems sun4u Sun Enterprise 450 (2 X UltraSPARC-II 400MHz)
System clock frequency: 100 MHz
Memory size: 2048 Megabytes

========================= CPUs =========================

                Run   Ecache CPU    CPU
Brd CPU Module  MHz   MB     Impl.  Mask
--- --- ------- ----- ------ ------ ----
SYS 1   1       400   4.0    US-II  10.0
SYS 3   3       400   4.0    US-II  10.0

========================= Memory =========================

Memory Interleave Factor = 2-way

       Interlv.  Socket  Size
Bank   Group     Name    (MB)  Status
----   -----     ------  ----  ------
0      0         1901    256   OK
0      0         1902    256   OK
0      0         1903    256   OK
0      0         1904    256   OK
1      0         1801    256   OK
1      0         1802    256   OK
1      0         1803    256   OK
1      0         1804    256   OK

========================= IO Cards =========================

    Bus  Freq
Brd Type MHz  Slot Name                             Model
--- ---- ---- ---- -------------------------------- ----------------------
SYS PCI  33   2    TSI,gfxp                         GFXP

No failures found in System
===========================

========================= Environmental Status =========================

System Temperatures (Celsius):
------------------------------
AMBIENT   27
CPU 1     41
CPU 3     45
=================================

Front Status Panel:
-------------------
Keyswitch position is in On mode.

System LED Status:   POWER       GENERAL ERROR   ACTIVITY
                     [ ON]       [OFF]           [ ON]
                     DISK ERROR  THERMAL ERROR   POWER SUPPLY ERROR
                     [OFF]       [OFF]           [OFF]

Disk LED Status:     OK = GREEN  ERROR = YELLOW
                     DISK 2: [OK]   DISK 3: [OK]
                     DISK 0: [OK]   DISK 1: [OK]
=================================

Fans:
-----
Fan Bank   Speed   Status
--------   -----   ------
CPU        49      OK
PWR        31      OK

Power Supplies:
---------------
Supply   Rating   Temp   Status
------   ------   ----   ------
0        550 W    38     OK
1        550 W    37     OK
2        550 W    36     OK

========================= HW Revisions =========================

ASIC Revisions:
---------------
STP2223BGA: Rev 4
STP2223BGA: Rev 4
STP2223BGA: Rev 4
STP2003QFP: Rev 1
STP2205BGA: Rev 1

System PROM revisions:
----------------------
OBP 3.16.2 2000/01/11 15:42
POST 6.0.9 2000/01/11 15:43
You can interpret the preceding output in the following way:

The first section of the output displays basic system information,
such as the system clock frequency and the amount of RAM installed.

System Configuration: Sun Microsystems sun4u Sun Enterprise 450 (2 X UltraSPARC-II 400MHz)
System clock frequency: 100 MHz
Memory size: 2048 Megabytes

The second section provides information on the CPUs located on
each CPU/memory board within the system. Lower-end systems
have only a single processor.

========================= CPUs =========================

                Run   Ecache CPU    CPU
Brd CPU Module  MHz   MB     Impl.  Mask
--- --- ------- ----- ------ ------ ----
SYS 1   1       400   4.0    US-II  10.0
SYS 3   3       400   4.0    US-II  10.0

The CPU section contains information on the following:

The number of system boards on the system

The number of CPUs on the system

The speed of each CPU

The amount of external cache (Ecache) on each CPU

The third section provides information about the memory on each
CPU/memory board installed on the system.

========================= Memory =========================

Memory Interleave Factor = 2-way

       Interlv.  Socket  Size
Bank   Group     Name    (MB)  Status
----   -----     ------  ----  ------
0      0         1901    256   OK
0      0         1902    256   OK
0      0         1903    256   OK
0      0         1904    256   OK
1      0         1801    256   OK
1      0         1802    256   OK
1      0         1803    256   OK
1      0         1804    256   OK

The preceding output shows that the system has two banks of memory,
fully populated with eight memory modules containing 256 Mbytes of
memory each.
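
You can cross-check the memory table against the Memory size line in the
first section of the output. The following Python sketch is an
illustration only. The rows are transcribed from the sample output
above, and the summarize_memory function is a hypothetical helper.

```python
# (bank, socket name, size in Mbytes) transcribed from the sample output.
MEMORY_ROWS = [
    (0, "1901", 256), (0, "1902", 256), (0, "1903", 256), (0, "1904", 256),
    (1, "1801", 256), (1, "1802", 256), (1, "1803", 256), (1, "1804", 256),
]


def summarize_memory(rows):
    """Return (number of banks, number of modules, total Mbytes)."""
    banks = {bank for bank, _socket, _size in rows}
    total_mb = sum(size for _bank, _socket, size in rows)
    return len(banks), len(rows), total_mb


if __name__ == "__main__":
    banks, modules, total_mb = summarize_memory(MEMORY_ROWS)
    print(f"{banks} banks, {modules} modules, {total_mb} Mbytes in total")
```

Eight modules of 256 Mbytes each give 2048 Mbytes, which agrees with the
Memory size value that prtdiag reports for this system.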

The fourth section displays the names and model numbers of all the
peripheral cards installed on the system.

========================= IO Cards =========================

    Bus  Freq
Brd Type MHz  Slot Name                             Model
--- ---- ---- ---- -------------------------------- ----------------------
SYS PCI  33   2    TSI,gfxp                         GFXP

No failures found in System
===========================

The fifth section displays environmental information: the operating
temperatures of components within the system, the front status panel
and LED status, and the status of the cooling fans and power supplies.

========================= Environmental Status =========================

System Temperatures (Celsius):
------------------------------
AMBIENT   27
CPU 1     41
CPU 3     45
=================================

Front Status Panel:
-------------------
Keyswitch position is in On mode.

System LED Status:   POWER       GENERAL ERROR   ACTIVITY
                     [ ON]       [OFF]           [ ON]
                     DISK ERROR  THERMAL ERROR   POWER SUPPLY ERROR
                     [OFF]       [OFF]           [OFF]

Disk LED Status:     OK = GREEN  ERROR = YELLOW
                     DISK 2: [OK]   DISK 3: [OK]
                     DISK 0: [OK]   DISK 1: [OK]
=================================

Fans:
-----
Fan Bank   Speed   Status
--------   -----   ------
CPU        49      OK
PWR        31      OK

Power Supplies:
---------------
Supply   Rating   Temp   Status
------   ------   ----   ------
0        550 W    38     OK
1        550 W    37     OK
2        550 W    36     OK

The sixth section displays application-specific integrated circuit
(ASIC) revision information about each board installed on the
system. An ASIC is a microchip that is designed for a special
application, such as a particular type of transmission protocol or a
hand-held computer.

========================= HW Revisions =========================

ASIC Revisions:
---------------
STP2223BGA: Rev 4
STP2223BGA: Rev 4
STP2223BGA: Rev 4
STP2003QFP: Rev 1
STP2205BGA: Rev 1

The seventh section provides information on the OBP and POST
versions installed in the boot PROM on each CPU/memory board on
the system. This section also shows the creation date and time for
each item.

System PROM revisions:
----------------------
OBP 3.25.3 2000/06/29 14:12
POST 3.1.0 2000/06/27 13:56

The following is a sample output of POST with errors from a Sun
Enterprise 3500 server. This is an example of POST detecting a problem.
However, this error was not fatal, and the system completed the booting
process. You use the prtdiag -v command to view the following output:
# /usr/platform/sun4u/sbin/prtdiag -v
System Configuration: Sun Microsystems sun4u 5-slot Sun Enterprise E3500
System clock frequency: 100 MHz
Memory size: 512Mb

========================= CPUs =========================

                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 3     6     0      400     4.0   US-II   10.0
 3     7     1      400     4.0   US-II   10.0
 5    10     0      400     4.0   US-II   10.0
 5    11     1      400     4.0   US-II   10.0

========================= Memory =========================

                                              Intrlv.  Intrlv.
Brd   Bank   MB    Status   Condition  Speed   Factor   With
---  -----  ----  -------  ----------  -----  -------  -------
 3     0     256   Active      OK      60ns    2-way     A
 5     0     256   Active      OK      60ns    2-way     A



========================= IO Cards =========================

     Bus   Freq
Brd  Type  MHz   Slot  Name                              Model
---  ----  ----  ----  --------------------------------  ----------------
 1   SBus   25     1   cgsix                             SUNW,501-2253
 1   SBus   25     3   SUNW,hme
 1   SBus   25     3   SUNW,fas/sd (block)
 1   SBus   25    13   SUNW,socal/sf (scsi-3)            501-3060
 7   SBus   25     3   SUNW,hme
 7   SBus   25     3   SUNW,fas/sd (block)
 7   SBus   25    13   SUNW,socal/sf (scsi-3)            501-3060

No failures found in System
===========================

Detected System Faults
======================
Disk Drive Fan failure
    Detected Fri Mar 1 15:57:14 2002
PROM detected failure
    Detected Fri Mar 1 15:57:14 2002

========================= Environmental Status =========================
Keyswitch position is in Normal Mode
System Power Status: Redundant
System LED Status:    GREEN     YELLOW     GREEN
WARNING                ON         ON       BLINKING

Fans:
-----
Unit   Status
----   ------
Disk   FAIL

System Temperatures (Celsius):
------------------------------
Brd   State   Current  Min  Max  Trend
---  -------  -------  ---  ---  -----
 1     OK        41     41   41  unknown
........<output truncated>
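
The Detected System Faults section is the part of this output that you
act on first. The following Python sketch shows one way to pull the
fault names and detection times out of saved prtdiag -v text. It
assumes the layout shown above: a Detected System Faults heading, an
underline of equal signs, and fault lines that are each followed by a
line beginning with the word Detected. The function name is
hypothetical.

```python
# Sample text copied from the Detected System Faults section above.
SAMPLE = """\
No failures found in System
===========================
Detected System Faults
======================
Disk Drive Fan failure
Detected Fri Mar 1 15:57:14 2002
PROM detected failure
Detected Fri Mar 1 15:57:14 2002
========================= Environmental Status =========================
Keyswitch position is in Normal Mode
"""


def detected_faults(text):
    """Return the faults listed under the Detected System Faults heading."""
    faults = []
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Detected System Faults":
            in_section = True
            continue
        if not in_section or not line:
            continue
        if set(line) == {"="}:
            continue  # underline directly below the heading
        if line.startswith("="):
            break  # banner of the next section ends the fault list
        if line.startswith("Detected ") and faults:
            # Timestamp line that belongs to the fault just above it.
            faults[-1]["detected"] = line[len("Detected "):]
        else:
            faults.append({"fault": line, "detected": None})
    return faults


if __name__ == "__main__":
    for entry in detected_faults(SAMPLE):
        print(entry["fault"], "|", entry["detected"])
```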


Using the show-post-results Command


On an Ultra workstation, you can view the POST output in real time by
attaching a terminal device to serial port A. However, if no terminal
device is available, you can use the OBP command show-post-results
to view the results after executing POST diagnostics.
The following example displays a POST output without errors. You use
the show-post-results command to view the following output:
ok show-post-results
Status 0=Pass, Non-Zero=Fail (%o0)=0
Message String               (%o1):
Board Descriptor             (%o2):233330200001b
To view the preceding output, complete the following steps:
1. Set the diag-switch? variable to true.
   ok setenv diag-switch? true
2. Set the diag-level variable to max.
   ok setenv diag-level max
3. Set the auto-boot? variable to false.
   ok setenv auto-boot? false
4. Power down the system.
5. Power on the system.
   After you power on the system, execute the show-post-results
   command at the ok prompt.

Caution: On an Ultra 10 workstation, if you do not power off and then
power on the system, the show-post-results command does not report
the POST run status although the system executes POST successfully.
When executing the show-post-results command, the following
message is displayed at the ok prompt: Power On Selftest not run
on last reset.
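
If you capture the console session to a file, the status line and the
caution message above can be checked mechanically. The following Python
sketch is an illustration only. It assumes that the status appears in
the form (%o0)=value, as in the sample output, and it treats the value
as hexadecimal, which is an assumption that the sample (a value of 0)
cannot confirm. The function name is hypothetical.

```python
import re

# The (%o0) field of the status line, as shown in the sample output.
STATUS = re.compile(r"\(%o0\)\s*=\s*([0-9a-fA-F]+)")


def post_passed(output):
    """Return True for a pass, False for a failure, None if unknown.

    None is returned when POST did not run on the last reset or when
    no status line is present in the captured text.
    """
    if "Power On Selftest not run" in output:
        return None
    match = STATUS.search(output)
    if match is None:
        return None
    # Assumption: the value is hexadecimal. Zero means pass either way.
    return int(match.group(1), 16) == 0


if __name__ == "__main__":
    sample = "Status 0=Pass, Non-Zero=Fail (%o0)=0"
    print(post_passed(sample))
```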



The following is a sample output of POST with errors, generated on a
Sun Enterprise 3500 server. You use the show-post-results
command to view the following output:
ok show-post-results
Slot 1 - Status=Okay, Type: I/O Type 4
Sysio0=P  Sysio1=P  FEPS=P  FEPSFC=0  Sbus0=P  Sbus1=P  Sbus2=P   SOC=P
AC=P      FHC=P     SRAM=P  FPROM=P   TODC=P   JTAG=P   CntrPl=P  LabCon=Not  Ovtemp=Not
DC=ff
Slot 3 - Status=Okay, Type: CPU/Memory
Cpu0=P      Cpu0-OK=P   FailCode=0  Cpu1=P     Cpu1-OK=P  FailCode=0
AC=P        FHC=P       SRAM=P      FPROM=P    JTAG=P     CntrPl=P
Ovtemp=Not  LabCon=Not  DTag0=P     DTag1=P
Bank0=0     Bank1=0     Bank0=P     Bank1=Not  DC=ff
Slot 5 - Status=Okay, Type: CPU/Memory
Cpu0=P      Cpu0-OK=P   FailCode=0  Cpu1=P     Cpu1-OK=P  FailCode=0
AC=P        FHC=P       SRAM=P      FPROM=P    JTAG=P     CntrPl=P
Ovtemp=Not  LabCon=Not  DTag0=P     DTag1=P
Bank0=0     Bank1=0     Bank0=P     Bank1=Not  DC=ff
Slot 7 - Status=Okay, Type: I/O Type 4
Sysio0=P  Sysio1=P  FEPS=P  FEPSFC=0  Sbus0=P  Sbus1=P  Sbus2=P   SOC=P
AC=P      FHC=P     SRAM=P  FPROM=P   TODC=P   JTAG=P   CntrPl=P  LabCon=Not  Ovtemp=Not
DC=ff
Slot 16 - Status=Fail, Type: Clock
Clock=P   Serial=P    KbdMse=P  PPS-DC=P  5.0V=P      Triger=P  DCReg0=P  DCReg1=P
AC=P      ACFan=P     KeyFan=P  PSFail=0  Ovtemp=Not  TODC=P
V5-P=P    V12-P=P     V5-Aux=P  V5P-PC=P  V12-PC=P    V3-PC=P
V5-PC=P   RckFan=***  Coolng=P  AC-REV=P
3.3V=P
P   = Present or Passed
*** = Failed Component
Not = Not present
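
In output of this kind, the failed component is easy to miss among
the passing entries. The following Python sketch scans saved
show-post-results text and reports, for each slot, the entries whose
value is *** (failed component, according to the legend above). It
assumes only that each slot begins with a line of the form
Slot n - Status=..., Type: ... and that the entries are name=value
pairs separated by white space. The function name and the shortened
sample are hypothetical.

```python
import re

# Header line that introduces each slot.
SLOT = re.compile(r"^Slot\s+(\d+)\s+-\s+Status=(\w+),\s+Type:\s+(.+)$")

# Shortened, hypothetical sample in the same style as the output above.
SAMPLE = """\
Slot 7 - Status=Okay, Type: I/O Type 4
Sysio0=P Sysio1=P FEPS=P FEPSFC=0
Slot 16 - Status=Fail, Type: Clock
Clock=P Serial=P RckFan=*** 3.3V=P
P = Present or Passed
*** = Failed Component
Not = Not present
"""


def failed_components(text):
    """Map each slot number to its status and its failed entries."""
    result = {}
    slot = None
    for raw in text.splitlines():
        line = raw.strip()
        header = SLOT.match(line)
        if header:
            slot = int(header.group(1))
            result[slot] = {"status": header.group(2), "failed": []}
            continue
        if slot is None:
            continue
        for token in line.split():
            name, sep, value = token.partition("=")
            # Legend lines contain a bare "=" token, which is skipped
            # because its value is empty rather than "***".
            if sep and name and value == "***":
                result[slot]["failed"].append(name)
    return result


if __name__ == "__main__":
    for number, info in sorted(failed_components(SAMPLE).items()):
        print(number, info["status"], info["failed"])
```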

Exercise: Enabling and Monitoring POST Diagnostics


In this exercise, you perform remote diagnostic tests between two
classroom systems. You can apply the procedure and the skills used here
to situations that require a remote diagnostic approach to fault analysis.

Preparation
The instructor divides the class into small groups. Two systems and a null
modem cable are required for each group. The students can review the
OBP variables diag-switch?, ttya-mode, and diag-device, which are
set or referenced within the remote diagnostic procedure.
Use a monitor or an ASCII terminal for remote diagnostic sessions. You
can perform this lab procedure on both the Sun4m and Sun4U
architectures.
The steps to insert a fault in the system are provided in the classroom setup file located at the
education.central Web site.

Note Before you begin, make sure that the functional system has the
Solaris OE running in multiuser mode and a remote terminal window is
attached.

POST tests just enough of the electronic circuitry to ensure that you can perform the boot command
and execute the instruction.

If POST fails, students take appropriate action, according to their company policy. The POST examples
in the Student Guide provide visual examples for students who have never viewed POST diagnostics.

Note: If you use one of the tip commands, a Conley monitor, or an
ASCII terminal, the execution of POST is visually displayed. You can use
the messages displayed to troubleshoot problems of a blank monitor.


Tasks
Use the following procedure for setting up a remote diagnostic session:
1. Connect an RS-232C null modem cable to port B of the functional
   workstation.
2. Connect the other end of the cable to port A of the faulty system.
3. Stop the faulty system by pressing the appropriate key sequence.
4. Set the diag-switch? parameter to true on the faulty system.
5. Turn off the faulty system to prevent blowing the keyboard fuse.
6. Disconnect the keyboard from the rear of the faulty system. With the
   keyboard disconnected, the system sends its output to serial
   port A, ttya.
7. Start the terminal window on the functional system from the
   Programs menu. (You can run the tip command in a non-Microsoft
   Windows environment, but if the tip command does not respond,
   there is no method to enter the system to release or kill the tip
   command.)

Note The hardwire argument for ttya-mode specifies that the tip
command requires 9600 baud, 8 data bits, and 1 stop bit at port A on the
CPU board. These parameters are set for port A when a system is
powered on without using a keyboard.
8. If you have a SPARC processor with two serial ports on your
   functional system, you do not have to edit the /etc/remote file. If
   your workstation is an Ultra 5 or if only port A is available, edit the
   /etc/remote file on the functional system to set the port to port A.
9. Change the following:
   :dv=/dev/term/b:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D
   to
   :dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D



10. In the terminal window, type the tip hardwire command.
    The workstation should respond with a message, connected. If it
    does not, check the following:

    The wrong port is selected, physically or logically, in the
    /etc/remote file.

    The selected port is already active. (Invoke the Solaris
    Management Console (SMC), and ensure that the port is
    disabled.)

    A /var/spool/locks/LCK file exists from a previous tip or
    uucp session. (This file exists if the system administrator did not
    properly exit the tip connection using the ^D or ~. command.)

11. Power on the faulty system. You can observe the power-on
    diagnostic messages in the terminal window of the second system. If
    you do not see the messages, the cause might be one of the
    following:

    Wrong physical or logical port selected at either end

    Faulty null modem cable

    System is not in diag mode, or the keyboard is still plugged in

12. When the POST diagnostic tests are complete, record all the errors.

Note: If the systems in your classroom are connected to a jump server,
the system locates the server while trying to boot over the net, which
automatically becomes the default boot device when the diag-switch?
variable is set to true. Usually, this invokes an installation procedure,
which should be aborted. If the classroom systems are not connected to a
server, an error can occur when the attempt to boot over the network fails.
13. Type the appropriate command to end the tip session.
14. If your classroom has an Ultra workstation, view the saved POST
results. Type the appropriate command. Remember to power off the
system when you reconnect the keyboard.
15. Return the faulty system to a running state.
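
The edit in step 9 changes a single field, dv, in the hardwire entry
of the /etc/remote file. The following Python sketch applies the same
change to the text of that one line, which can be useful when you
prepare several classroom systems. It is a sketch only: the function
name is hypothetical, it works on a string rather than on the file
itself, and it assumes that the line contains exactly one
dv=/dev/term/a or dv=/dev/term/b field.

```python
import re

# Matches only the port letter in the dv field, between the device
# directory and the colon that ends the field.
DV_PORT = re.compile(r"(?<=:dv=/dev/term/)[ab](?=:)")


def retarget_dv(capability_line, port):
    """Return the capability line with its dv field set to the port."""
    if port not in ("a", "b"):
        raise ValueError("port must be 'a' or 'b'")
    new_line, count = DV_PORT.subn(port, capability_line)
    if count != 1:
        raise ValueError("expected exactly one dv=/dev/term/ field")
    return new_line


if __name__ == "__main__":
    original = ":dv=/dev/term/b:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D"
    print(retarget_dv(original, "a"))
```

Keep a copy of the original /etc/remote file before you write any
changed text back to it.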


Exercise Summary

Discussion: Take a few minutes to discuss what experiences, issues, or
discoveries you had during the lab exercises.

Manage the discussion here based on the time allowed for this module, which was given in the About This
Course module. If you find you do not have time to spend on discussion, then just highlight the key concepts
students should have learned from the lab exercise.

Experiences

Ask students what their overall experiences with this exercise have been. You may want to go over any
trouble spots or especially confusing areas at this time.

Interpretations

Ask students to interpret what they observed during any aspects of this exercise.

Conclusions

Have students articulate any conclusions they reached as a result of this exercise experience.

Applications

Explore with students how they might apply what they learned in this exercise to situations at their workplace.


Exercise Solutions
The following are the solutions for the tasks listed in this exercise:
1. Connect an RS-232C null modem cable to port B of the functional
   workstation.
2. Connect the other end of the cable to port A of the faulty system.
3. Stop the faulty system by pressing the appropriate key sequence.
   To stop the faulty system, press the Stop-A keys.

Note: If the system is in multiuser mode, use the init 0 command to
stop the faulty system.
4. Set the diag-switch? parameter to true on the faulty system. You
   can also set the value of the diag-switch? parameter to true by
   pressing the Stop-D keys while turning on the power.
   Use the following commands to set the value of the diag-switch?
   parameter to true:
   ok setenv diag-switch? true
   ok reset
5. Turn off the faulty system to prevent blowing the keyboard fuse.
6. Disconnect the keyboard from the rear of the faulty system. With the
   keyboard disconnected, the system sends its output to serial
   port A, ttya. Remember to turn off the power when you reconnect
   the keyboard.
7. Start the terminal window on the functional system from the
   Programs menu. (You can run the tip command in a non-Microsoft
   Windows environment, but if the tip command does not respond,
   there is no method to enter the system to release or kill the tip
   command.)
8. If you have a SPARC processor with two serial ports on your
   functional system, you do not have to edit the /etc/remote file. If
   your workstation is an Ultra 5 or if only port A is available, edit the
   /etc/remote file on the functional system to set the port to port A.
9. Change the following:
   :dv=/dev/term/b:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D
   to
   :dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D

10. In the terminal window, type the tip hardwire command.
    Type the following command:
    # tip hardwire
    The workstation should respond with a message, connected. If it
    does not, check the following:

    The wrong port is selected, physically or logically, in the
    /etc/remote file.

    The selected port is already active. (Invoke the SMC and ensure
    that the port is disabled.)

    A /var/spool/locks/LCK file exists from a previous tip or
    uucp session. (This file exists if the system administrator did not
    properly exit tip with a ^D or ~.)

11. Power on the faulty system. You can observe the power-on
    diagnostic messages in the terminal window of the second system. If
    you do not see the messages, the cause might be one of the
    following:

    Wrong physical or logical port selected at either end

    Faulty null modem cable

    System is not in diag mode or still has the keyboard plugged in

12. When the POST diagnostic tests are complete, record all the errors.

Note: If the systems in your classroom are connected to a jump server,
the system locates the server as it tries to boot over the net, which
automatically becomes the default when the diag-switch? variable is set
to true. Usually, this invokes an installation procedure, which should be
aborted. If the classroom systems are not connected to a server, an error
can occur when the attempt to boot over the network fails.
13. Type the appropriate command to end the tip session.
    Type ~. to end the tip session.
14. If your classroom has an Ultra workstation, view the saved POST
    results. Type the appropriate command.
    Use the following command to view the saved POST results:
    # prtdiag -v

15. Return the faulty system to a running state. Remember to power off
the system when you reconnect the keyboard.
To return the faulty system to a running state:
    a. Turn off the system and plug in the keyboard.
    b. Turn on the system and wait for the ok prompt.
    c. Run the following commands:
       ok setenv diag-switch? false
       ok reset
    d. Verify that the system reboots.


Module 4

Introducing the OBP Device Tree and the Boot Sequence

Objectives
Overview on page OH 4-2

Upon completion of this module, you should be able to:

Describe the OBP device tree

Describe the boot sequence

Relevance
Present the following questions to stimulate the students and get them to think about the issues and topics
presented in this module. While they are not expected to know the answers to these questions, the answers
should be of interest to them and inspire them to learn the material presented in this module.

Relevance on page OH 4-3

Discussion: The following questions are relevant to understanding the
OBP device tree and the boot sequence:

Describe the structure of the OBP device tree.

A device tree organizes the devices that are attached to the system. Each node in the device tree represents
a device or firmware service on the system.

List the commands that you use to navigate and examine the OBP
device tree.

Ask students which commands they use to navigate the OBP device tree.

Have you experienced any failure during the boot sequence?

Allow students to share their work experiences and note the inputs on a white board or a flip chart.


Additional Resources
Additional resources: The following references provide additional
information on the topics discussed in this module:

Field Engineer Handbook, part numbers 800-4006-16 and 800-4247.

OpenBoot Command Reference, part number 800-6076.

OpenBoot 2.x Quick Reference Card, part number 802-1958.

OpenBoot 3.x Quick Reference Card, part number 802-3240.

OpenBoot 4.x Command Reference Manual (http://docs.sun.com),
accessed 14 March 2002.

Solaris User and System Administration Answer Books
(http://docs.sun.com), accessed 17 December 2001.

Solaris 8 System Administration Guide, Volume 1, part number
805-7228-10.


Introducing the OBP Device Tree


Introducing the OBP Device Tree on page OH 4-4

Sun hardware uses the concept of a device tree to organize the devices
that are attached to the system. The OpenBoot firmware builds the device
tree from the information generated during POST and loads the device
tree into memory.
The kernel refers to the device tree during the boot process to determine
the hardware configuration of the system. For example, to identify the
card and slot configuration on your system, map the driver names, unit
addresses, and device arguments to the physical devices and their
locations on the system. You can examine the device path on a system by
using the following:

The /devices directory


This directory contains the physical device names for the devices
that are attached to the system.

The prtconf -vp command

Consider the following example in which you must determine the
location of an internal disk installed on a PCI I/O board of an Ultra 10
workstation:
/pci@1f,0/pci@1,1/ide@3/disk@0,0
In this example, /pci@1f,0/pci@1,1/ide@3/disk@0,0 is the device path
that represents the on-board internal disk on the PCI I/O board.
Consider the following example in which you determine the location of a
SCSI tape drive:
/pci@4,2000/scsi@2,0/st@5,0
In this example:

pci@4 is the I/O board in the system

scsi@2,0 is a SCSI controller card on the I/O board

st@5 is a SCSI tape drive attached to target ID 5 of the controller


Figure 4-1 shows the elements of an OBP device tree.

Figure 4-1 Elements of an OBP Device Tree



In Figure 4-1 on page 4-5, disk@0,0 and cdrom@2,0 are the IDE devices
for the disk drive and for the CD-ROM drive attached to the IDE
controller. The SCSI disk device and the SCSI tape device, which are
attached to the PCI-based SCSI controller isptwo@4, are the sd@3,0 and
st@4,0 devices, respectively.

Device Path Name


A system identifies devices by the path name of the device nodes. A node
in the device tree represents a device or a firmware service. The nodes
that include subnodes usually represent system buses and controllers
associated with system buses. The subnodes represent the devices that are
connected to buses or controllers.
Each device has a unique path name that represents the type of device
and the location of the device in the overall addressing structure. In a
device tree, the path for a device begins with a slash (/). The slash
represents the root of the device tree. The following is the format for the
name of a device tree node:

device-name@unit-address:device-arguments
Table 4-1 describes each parameter in a device path name.

Table 4-1 Parameters of a Device Path Name

Path Name Parameter   Description

device-name           Includes the manufacturer and model names of
                      devices, separated by a comma. This parameter
                      is a string of 1 to 31 characters, which can
                      include punctuation characters and which
                      usually has a mnemonic value.
                      Precedes the address parameter.

unit-address          Includes a text string that stores the physical
                      address of the device in the address space of the
                      parent node. The format of this parameter is
                      bus-dependent.
                      Precedes the device argument parameter.

device-arguments      Includes a text string that you use to pass
                      additional information to the device software.
                      The format of this parameter is
                      device-dependent.
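
The node format device-name@unit-address:device-arguments can be
illustrated with a few lines of Python that split a device path into
its nodes and each node into the three parameters of Table 4-1. This
is a minimal sketch for study purposes. The function names are
hypothetical, and the sketch assumes that the at sign and the colon
appear in a node only as the separators described above.

```python
def parse_node(node):
    """Split one node into device-name, unit-address, device-arguments."""
    rest, _, arguments = node.partition(":")
    name, _, unit_address = rest.partition("@")
    return {
        "device-name": name,
        "unit-address": unit_address or None,
        "device-arguments": arguments or None,
    }


def parse_path(path):
    """Split a full device path into a list of parsed nodes."""
    if not path.startswith("/"):
        raise ValueError("a device path begins with a slash (/)")
    return [parse_node(node) for node in path.split("/") if node]


if __name__ == "__main__":
    for node in parse_path("/pci@1f,0/pci@1,1/ide@3/cdrom@2,0:f"):
        print(node)
```

A node without a unit address or without device arguments, such as
fdthree, yields None for the missing parameter.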


Note: The device-name parameter in a device path name is
case-sensitive.
The path of a device in a device tree varies, depending on the type of
system and the configuration of the device.

Disk Device Path for an Ultra Workstation on page OH 4-5

Figure 4-2 shows the sample name of a disk device on an Ultra
workstation with a PCI bus. The path name is divided into five sections,
and each section has a label that specifies its purpose and significance.

Figure 4-2 Disk Device Path for an Ultra Workstation


Automated OBP Probing


After power on or reset, system control is transferred to OBP, which
starts an automatic probing function. The automatic probing function
searches for the devices that are attached to the system and includes
the following:

UPA probing: Probes the system ports of UltraSPARC systems that are
based on the high-speed Ultra Port Architecture (UPA).
You cannot control the probe order of the system ports on
UltraSPARC systems. However, you can exclude a list of ports from
probing by setting the upa-port-skip-list variable that is stored
in the NVRAM chip.

PCI probing: Searches the PCI slots for the ID of PROM.

For example, the Sun Ultra 250 UPA/PCI workstation has four PCI
plug-in slots that are distributed across a single PCI bus. Table 4-2
displays the two NVRAM configuration variables that control the probing
order of slots for the PCI buses attached to an Ultra 250 UPA/PCI
workstation.
Table 4-2 NVRAM Configuration Variables for PCI Probing on an Ultra
250 UPA/PCI Workstation

Variable             Default Value   Description

pci0-probe-list      3,2,4,5         Controls the probe order of
                                     plug-in devices under the
                                     pci0 variable

pci-slot-skip-list   none            Controls which PCI plug-in
                                     slots to skip while probing



For example, the Sun Ultra 5 and Ultra 10 PCI workstations have two PCI
buses, pcia and pcib. Table 4-3 displays the two NVRAM configuration
variables that control the probing order of slots for the PCI buses attached
to an Ultra 10 workstation.
Table 4-3 NVRAM Configuration Variables for PCI Probing on an Ultra
10 Workstation

Variable          Default Value   Description

pcia-probe-list   1,2,3,4         Controls the probe order of PCI
                                  slots under the pcia variable

pcib-probe-list   1,2,3           Controls the probe order of PCI
                                  slots under the pcib variable

Consider a scenario in which you have four plug-in devices on an Ultra 10
workstation. To define a probe order of 2,3,1 for the pcia bus slots, type
the following command at the ok prompt:
ok setenv pcia-probe-list 2,3,1
In the preceding example, slot 4 of the pcia bus is excluded from the
probe list.

Note You can also specify dashes while defining the probe order of the
PCI slots.
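
Because a slot that is left out of a probe list is not probed, it
helps to compare a proposed list with the default before you type the
setenv command. The following Python sketch makes that comparison for
comma-separated lists such as the ones in the preceding example. The
function names are hypothetical, and the sketch does not handle the
dash notation that the preceding note mentions.

```python
def parse_probe_list(value):
    """Convert a comma-separated probe list into a list of slot numbers."""
    return [int(item) for item in value.split(",") if item.strip()]


def skipped_slots(default_list, new_list):
    """Return the default slots that the new probe list leaves out."""
    probed = set(parse_probe_list(new_list))
    return [slot for slot in parse_probe_list(default_list)
            if slot not in probed]


if __name__ == "__main__":
    # Default pcia-probe-list and the order used in the example above.
    print(skipped_slots("1,2,3,4", "2,3,1"))
```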

Navigating and Examining the OBP Device Tree


You navigate the device tree to examine and modify individual device tree
nodes. You use the following commands to navigate the OBP device tree
on an Ultra workstation:

The dev command

The device-end command

The .properties command

The words command

The ls command


Using the dev Command


To run any commands on a device node, you must first use the dev
command to make the node the current node. The following is the syntax
for the dev command:
dev device-path
where the device-path parameter is the path of the node that you want
to select as the current node.
The following example uses the dev command to make the specified node
the current node. The device path of the node is specified at the command
line.
ok dev /pci@1f,0/pci@1,1/SUNW,m64B@2
ok pwd
/pci@1f,0/pci@1,1/SUNW,m64B@2
Refer students to Figure 4-2 on page 4-7, and explain the preceding output of the dev command.

Note You can use the cd command to select a node as the current node
in a device tree.

Using the device-end Command


You use the device-end command to exit the device tree.
For example, to exit the device tree from the screen device node, you
type the following command:
ok cd screen
ok pwd
/pci@1f,0/pci@1,1/SUNW,m64B@2
ok device-end
Not at a device tree node. Use dev<device-pathname>
ok


Using the .properties Command


You use the .properties command to display the properties of the
current node. The output of the .properties command displays the
names and values of all the current-node properties, such as the device
type, the model name, the device ID, and the vendor ID.
For example, the following output is generated when you select the
screen device node and run the .properties command at the ok
prompt:
ok cd screen
ok .properties
address                 fe000000
assigned-addresses      82011010 00000000 e1000000 00000000 01000000
                        82011018 00000000 e2000000 00000000 00001000
aty,fcode               1.60
aty,card#               109-41900-00
aty,rom#                113-41901-104
model                   ATY,GT-C
name                    SUNW,m64B
............<output truncated>
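
Each assigned-addresses entry for a PCI device consists of five
32-bit cells. The following Python sketch decodes one entry, assuming
the cell layout defined by the PCI bus binding to IEEE 1275 (phys.hi,
phys.mid, phys.lo, size.hi, size.lo) and an entry written as five
hexadecimal cells on one line. The function name and the dictionary
keys are hypothetical labels chosen for this illustration.

```python
# Address space codes held in bits 25-24 of the phys.hi cell.
SPACES = {0: "config", 1: "io", 2: "mem32", 3: "mem64"}


def decode_assigned_address(cells):
    """Decode one assigned-addresses entry given as five hex cells."""
    values = [int(cell, 16) for cell in cells.split()]
    if len(values) != 5:
        raise ValueError("expected five cells")
    phys_hi, phys_mid, phys_lo, size_hi, size_lo = values
    return {
        "space": SPACES[(phys_hi >> 24) & 0x3],
        "bus": (phys_hi >> 16) & 0xFF,
        "device": (phys_hi >> 11) & 0x1F,
        "function": (phys_hi >> 8) & 0x7,
        "register": phys_hi & 0xFF,
        "address": (phys_mid << 32) | phys_lo,
        "size": (size_hi << 32) | size_lo,
    }


if __name__ == "__main__":
    entry = decode_assigned_address(
        "82011010 00000000 e1000000 00000000 01000000")
    for key, value in entry.items():
        print(key, value if key == "space" else hex(value))
```

Under these assumptions, the first entry in the sample decodes to PCI
device 2, which agrees with the unit address in SUNW,m64B@2, and to a
region of 0x01000000 bytes (16 Mbytes).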

Using the words Command


You use the words command to display the names of methods that
belong to the current node. The words command also displays a list of
commands and arguments that you can specify at the ok prompt.
For example, the following output is generated when you select the
screen device node and run the words command at the ok prompt:
ok cd screen
ok pwd
/pci@1f,0/pci@1,1/SUNW,m64B@2
ok words
selftest        disp-test       close           restore
draw-logo       write           open            self-test
read-rectangle  fill-rectangle  get-colors      color!
remove          install         draw-rectangle  set-colors
.....................<output truncated>



Inform students that this is a partial output of the words command. The preceding example lists the methods
that are executed for the screen device. As part of an exercise, you can ask students to execute the words
command for a device node and view the output on their respective systems.

Using the ls Command


You use the ls command to list the contents of the current device node.
For example, to list the contents of the ide device node, you type the
following command:
ok cd ide
ok ls
f00811fc cdrom
f0080b50 disk

Creating Custom Device Aliases


A device alias is a short name that represents a device path. For
example, the disk alias represents the device path
/pci@1f,0/pci@1,1/ide@3/disk@0,0.
Systems usually have predefined device aliases for the commonly used
devices. However, the external devices attached to a system do not have
built-in device aliases associated with them. You can create custom device
aliases for these external devices.

Note The custom device aliases are not saved after a system reset or
power cycle. To create permanent aliases, you must manually store the
alias names in the nvramrc variable of the NVRAM chip or use the
nvalias and nvunalias commands.
You use the following commands to examine, create, and change device
aliases:

The devalias command

The show-devs command

The show-disks command

The show-nets command

The nvalias command

The nvunalias command


Using the devalias Command


You use the devalias command to display all the current device aliases
defined in the system.
The following output is generated when you run the devalias command
at the ok prompt:
ok devalias
screen      /pci@1f,0/pci@1,1/SUNW,m64B@2
net         /pci@1f,0/pci@1,1/network@1,1
cdrom       /pci@1f,0/pci@1,1/ide@3/cdrom@2,0:f
disk        /pci@1f,0/pci@1,1/ide@3/disk@0,0
disk3       /pci@1f,0/pci@1,1/ide@3/disk@3,0
disk2       /pci@1f,0/pci@1,1/ide@3/disk@2,0
disk1       /pci@1f,0/pci@1,1/ide@3/disk@1,0
disk0       /pci@1f,0/pci@1,1/ide@3/disk@0,0
ide         /pci@1f,0/pci@1,1/ide@3
floppy      /pci@1f,0/pci@1,1/ebus@1/fdthree
ttyb        /pci@1f,0/pci@1,1/ebus@1/se:b
ttya        /pci@1f,0/pci@1,1/ebus@1/se:a
keyboard!   /pci@1f,0/pci@1,1/ebus@1/su@14,3083f8:forcemode
keyboard    /pci@1f,0/pci@1,1/ebus@1/su@14,3083f8
mouse       /pci@1f,0/pci@1,1/ebus@1/su@14,3062f8
name        aliases
The names of the device aliases are displayed to the left of the command
output, and the physical address of each device is displayed to the right of
the command output.
You use the devalias command to display a device path name
corresponding to an alias. The following is the syntax of the devalias
command that you use to display a device path name corresponding to an
alias:
ok devalias alias
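
If you save the devalias output from a tip session, you can look up
aliases offline. The following Python sketch builds a table of aliases
from such text. It assumes that each alias line holds the alias name,
white space, and a device path that begins with a slash. Lines that do
not fit this pattern, such as the final name aliases line, are
ignored. The function name and the shortened sample are hypothetical.

```python
# Shortened sample taken from the devalias output shown earlier.
SAMPLE = """\
screen      /pci@1f,0/pci@1,1/SUNW,m64B@2
disk        /pci@1f,0/pci@1,1/ide@3/disk@0,0
keyboard!   /pci@1f,0/pci@1,1/ebus@1/su@14,3083f8:forcemode
name        aliases
"""


def parse_devalias(output):
    """Return a dictionary that maps each alias to its device path."""
    aliases = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        # Keep only lines whose second field looks like a device path.
        if len(parts) == 2 and parts[1].startswith("/"):
            aliases[parts[0]] = parts[1].strip()
    return aliases


if __name__ == "__main__":
    table = parse_devalias(SAMPLE)
    print(table["disk"])
```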


Using the show-devs Command


You use the show-devs command to view the entire device tree. The
show-devs command displays a list of all the device tree paths that are
available from the root level.
The following output is generated when you run the show-devs
command at the ok prompt:
ok show-devs
/SUNW,UltraSPARC-IIi@0,0
/pci@1f,0
/virtual-memory
/memory@0,10000000
/aliases
/options
/openprom
/chosen
/packages
/pci@1f,0/pci@1
/pci@1f,0/pci@1,1
/pci@1f,0/pci@1,1/ide@3
/pci@1f,0/pci@1,1/SUNW,m64B@2
/pci@1f,0/pci@1,1/network@1,1
/pci@1f,0/pci@1,1/ebus@1
/pci@1f,0/pci@1,1/ide@3/cdrom
/pci@1f,0/pci@1,1/ide@3/disk
/pci@1f,0/pci@1,1/ebus@1/SUNW,CS4231@14,200000
..................<output truncated>
To display all the devices directly under a specific device in the device
tree, type the following command at the ok prompt:
ok show-devs devicepath
If there is enough time, inform students about the following commands that are related to the preceding
command. Ask students to run these commands on their respective systems for system-related information.

show-tapes Displays a list of device paths for the installed SCSI tape controllers

show-displays Displays a list of device paths for the installed display devices

show-sbus Displays a list of installed SBus devices


Using the show-disks Command


You use the show-disks command to display the available disks on the
system and select the device path that relates to the disk that you want to
use in a custom device alias.
The following output is generated when you run the show-disks
command at the ok prompt:
ok show-disks
a) /pci@1f,0/pci@1,1/ide@3/cdrom
b) /pci@1f,0/pci@1,1/ide@3/disk
c) /pci@1f,0/pci@1,1/ebus@1/fdthree@14,3023f0
q) NO SELECTION
valid choice: a...c, q to quit a
/pci@1f,0/pci@1,1/ide@3/cdrom has been selected.
Type ^Y ( Control-Y ) to insert it in the command line.
e.g. ok nvalias mydev ^Y for creating devalias mydev for
/pci@1f,0/pci@1,1/ide@3/cdrom
ok
Inform students that a shortcut provided with the show-disks command helps to select a device. They can
use the Control-Y keys to copy the device path onto the command line.

Using the show-nets Command


You use the show-nets command to display a list of device paths for the
Ethernet controllers installed on your system. You select the device path
that is associated with the controller for which you want to create an alias.
The following output is generated when you run the show-nets
command at the ok prompt:
ok show-nets
a) /pci@1f,0/pci@1,1/network@1,1
q) NO SELECTION
Enter Selection, q to quit: a
/pci@1f,0/pci@1,1/network@1,1 has been selected.
Type ^Y (Control-Y) to insert it into the command line
e.g. ok nvalias mydev ^Y for creating devalias mydev for
/pci@1f,0/pci@1,1/network@1,1
ok


Using the nvalias Command


You use the nvalias command to create an alias name for a device. The
following is the syntax of the nvalias command for creating a custom
alias name:
ok nvalias aliasname devicepath
In the following example, you add a device alias. The new alias is mydisk,
and it is aliased to a device at the IDE target 0. The IDE target is connected
to the on-board IDE controller on the system. After you create the alias,
you must make it the default boot device by setting the boot-device
variable to the alias and booting the system.
To create the alias mydisk, type the following command at the ok prompt:
ok nvalias mydisk /pci@1f,0/pci@1,1/ide@3/disk@0,0
where:

mydisk represents the new device alias

/pci@1f,0/pci@1,1/ide@3/disk@0,0 represents the device path of
the mydisk alias

To make the mydisk alias the default boot device, use the following
command:
ok setenv boot-device mydisk
ok boot

Using the nvunalias Command


You use the nvunalias command to remove a device alias. To remove a
device alias, type the following command at the ok prompt:
ok nvunalias aliasname

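Note The nvunalias command removes the alias from the nvramrc
variable, but the alias remains in the devalias list until the system is
reset. Use the reset-all command for the change to take effect.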


Introducing the Boot Sequence


The boot sequence refers to the chain of events, which spans four distinct
phases and takes the system from a powered off state to a state in which
the hardware is powered on and the Solaris OE is running.
To troubleshoot boot problems effectively, you must understand the
events that occur and when they occur during each phase of the boot
sequence.

Boot Sequence
You use the boot command at the ok prompt to boot the Solaris OE.
When you power on the system, the system invokes the POST diagnostic
tests. POST tests the hardware and memory of the system. If no errors are
detected, the system begins the automatic boot process.
Phases of the Boot Sequence on page OH 4-6

The boot process occurs in the following four phases, as shown in
Figure 4-3.

Figure 4-3 Phases of the Boot Sequence


The Boot PROM Phase

The Boot PROM Phase on page OH 4-7

Figure 4-4 shows the events that occur during the boot PROM phase.

Figure 4-4 The Boot PROM Phase

The following events occur during the boot PROM phase:

1. PROM runs POST diagnostics.

   The boot PROM firmware runs POST diagnostic tests to verify the
   hardware and memory of the system. PROM displays the system
   identification banner, which includes the model type, the amount of
   memory installed, the PROM version and serial numbers, the
   Ethernet address, and the host ID of the system.

2. The boot command determines the device from which the system
   boots. In this step, the boot command reads the value specified in
   the boot-device variable.

3. The boot command locates the bootblk program on the device named
   by the boot-device variable or, if the diag-switch? variable is set
   to true, on the device named by the diag-device variable.

   The bootblk program is the primary boot program, which is located
   on sectors 1-15 of the boot device. If the bootblk program is not
   present or must be regenerated, you can install it by running the
   installboot command during system installation.

   Note The installboot utility resides in the /usr/sbin/ directory.

4. The boot command loads the bootblk program from its location on
   the boot device into memory.

   A copy of the bootblk program is available in the
   /usr/platform/`uname -i`/lib/fs/ufs directory.

The Boot Programs Phase

The Boot Programs Phase on page OH 4-8

Figure 4-5 shows the events that occur during the boot programs phase.

Figure 4-5 The Boot Programs Phase

This section discusses the concept of a disk boot. If required, explain to students the concept of a network
boot. During a network boot, the bootblk program is not read from the disk. Instead, the inetboot program is
read from the network by using tftp. In addition, no ufsboot program is required.



The following events occur during the boot programs phase:

1. The bootblk program loads the secondary boot program, ufsboot,
   from the boot device into memory.

   The secondary boot program is located in the UNIX File System
   (UFS) in the boot device. The path to the ufsboot program is
   recorded in the bootblk program.

2. The ufsboot program locates and loads the kernel.

   Note The ufsboot program is platform-dependent and resides in the
   /platform/`uname -i`/ directory.

   The kernel is composed of a static core consisting of the genunix and
   unix files. The genunix file is the platform-independent generic
   kernel file, and the unix file is the platform-specific kernel file.
   When the ufsboot program loads these two files into memory, they
   combine to form the basis of a running kernel.

   On a 32-bit system, the kernel is located in the
   /platform/`uname -m`/kernel directory.

   On a 64-bit system, the kernel is located in the
   /platform/`uname -m`/kernel/sparcv9 directory.
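The two kernel locations above can be expressed as a small lookup. The following is a minimal sketch, not part of the course material: kernel_dir is a hypothetical helper, and its two arguments are assumed to be the machine name (for example, the output of uname -m) and the kernel word size supplied by the caller.

```shell
#!/bin/sh
# Sketch: print the directory that holds the platform-specific kernel file.
# kernel_dir is a hypothetical helper, not a Solaris command.
#   $1  machine name, for example the output of `uname -m`
#   $2  kernel word size: 32 or 64 (assumed to be known by the caller)
kernel_dir() {
    machine=$1
    bits=$2
    if [ "$bits" = "64" ]; then
        echo "/platform/$machine/kernel/sparcv9"
    else
        echo "/platform/$machine/kernel"
    fi
}

kernel_dir sun4u 64    # prints: /platform/sun4u/kernel/sparcv9
kernel_dir sun4u 32    # prints: /platform/sun4u/kernel
```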


The Kernel Initialization Phase

The Kernel Initialization Phase on page OH 4-9

Figure 4-6 shows the events that occur during the kernel initialization
phase.

Figure 4-6 The Kernel Initialization Phase

The following events occur during the kernel initialization phase:

1. The kernel reads the /etc/system configuration file.

   The /etc/system file is the control file in which you specify the
   modules and parameters loaded by the kernel during the boot
   sequence. To modify the configuration of the kernel, edit the
   /etc/system file. When you edit the /etc/system file, you modify
   the modules and parameters loaded by the kernel.

   You use the following variables to edit the modules in the
   /etc/system file:

   moddir      Sets the path of the default modules loaded by the
               kernel during the boot sequence

   forceload   Forces a module to be loaded during the boot sequence

   exclude     Excludes a particular kernel module from being loaded

   rootfs      Sets the type of the root file system

   rootdev     Specifies the path of the physical root device

   set         Specifies new values for the tunable kernel parameters

   Note Before you edit the /etc/system file, make a copy. If you specify
   incorrect values in the /etc/system file, the system might not boot.

2. The kernel initializes itself and starts to load modules.

   Modules consist of device drivers, file systems, and streams that you
   use to perform specific tasks within the system. The modules, which
   are a part of the kernel, are located in the /kernel and /usr/kernel
   directories. Each subdirectory located under these directories is a
   collection of similar types of modules.

Depending on the experience of students, explain the following types of module subdirectories in the
/kernel and /usr/kernel directories:

sys      Contains system calls, which are defined interfaces used by applications

exec     Contains executable file formats

fs       Contains types of file systems, such as ufs, nfs, and proc

misc     Contains miscellaneous modules

sched    Contains scheduling classes

strmod   Contains STREAMS modules

drv      Contains device drivers for booting systems

   Note The kernel uses the ufsboot program to load kernel modules.
   When the kernel loads enough modules to mount the root file system,
   the kernel unmaps the ufsboot program and proceeds to the next step.

3. The kernel starts the /sbin/init process. The /sbin/init process
   starts other processes by reading the /etc/inittab file.
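The advice to copy the /etc/system file before editing it can be scripted. The following is a minimal sketch under stated assumptions: backup_file is a hypothetical helper (not a Solaris utility), and the dated suffix is only one possible naming choice.

```shell
#!/bin/sh
# Sketch: keep a dated copy of a file before you edit it.
# backup_file is a hypothetical helper; it prints the name of the copy.
backup_file() {
    src=$1
    copy="$src.`date +%Y%m%d`"          # for example /etc/system.20020321
    cp -p "$src" "$copy" && echo "$copy"
}

# Example (run as root before editing):
#   backup_file /etc/system
```

If an edit later prevents the system from booting, the saved copy is a candidate for the alternative system file that an interactive boot (boot -a) asks for.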


The init Phase

The init Phase on page OH 4-10

The init phase is the final phase in the boot process. Figure 4-7 shows the
events that occur during the init phase.

Figure 4-7 The init Phase

The following events occur during the init phase:

1. The init daemon reads the /etc/default/init and /etc/inittab
   configuration files.

   The /etc/default/init file allows some internal variables to be
   set for the init phase. The /etc/inittab file gives detailed
   instructions on what processes to run to bring the system to the
   desired functioning state.

2. The init daemon scans the inittab file for the sysinit and
   initdefault entries, executing the sysinit entries as found and
   recording the initdefault value.

3. The init daemon starts the specified run level.

   If no run level was explicitly passed as an argument to the init
   daemon, the highest initdefault level specified in the inittab file
   is used. All other inittab entries that specify the desired run
   level, other than those already processed, are now processed.



4. The default inittab file executes the system startup scripts
   (/etc/rc#), each of which corresponds to a run level. The /etc/rc#
   scripts execute the files in the /etc/rc#.d directories, which in turn
   start the system daemons.

5. The final entries in the default inittab file specify the sac and
   console login session.

   Note If the /etc/rc2.d/S99dtlogin script is executed during startup,
   the script starts the X server and the X-based login process, which does
   not allow the default console login window to display on any specified
   graphics device.

After the init phase is complete, the system login prompt is displayed on
the console.
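To see which run level the init daemon would select by default, you can read the initdefault entry yourself. The following is a minimal sketch, assuming the usual inittab entry layout of id:rstate:action:process and run levels written as single digits; default_run_level is a hypothetical helper, not a Solaris command.

```shell
#!/bin/sh
# Sketch: report the default run level recorded in an inittab-format file.
# Assumes entries of the form id:rstate:action:process.
# default_run_level is a hypothetical helper; $1 is the file to read.
default_run_level() {
    awk -F: '
        /^#/ { next }                     # skip comment lines
        $3 == "initdefault" {
            max = ""
            n = length($2)
            for (i = 1; i <= n; i++) {    # keep the highest level listed
                c = substr($2, i, 1)
                if (c > max) max = c
            }
            print max
            exit
        }
    ' "$1"
}

# Example:
#   default_run_level /etc/inittab
```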

Examining a Successful Boot Sequence


The following is a summary of a successful boot sequence on SPARC
processor-based systems:

1. The system is powered on.

2. The boot PROM firmware performs hardware diagnostics and
   displays device information.

3. The PROM reads the primary boot program, bootblk.

4. The primary boot program, bootblk, loads the secondary boot
   program, ufsboot.

5. The secondary boot program, ufsboot, loads the kernel.

6. After the kernel starts running, it performs tasks, such as:

   Making additional hardware checks, such as checking the
   device drivers

   Loading configuration modules

   Initializing memory and buffer caches for processes

7. The kernel starts the init process.

8. The init process starts the rc scripts.


Identifying Common Errors in a Boot Sequence


During a boot sequence, the system displays some common errors.
Consider a scenario in which the system that should boot from the disk
boots from the network.
The following are the two possible causes for the preceding scenario:

The diag-switch? parameter is set to true.


To set the diag-switch? parameter to false, interrupt the boot
process with Stop-A, and type the following command at the ok
prompt:
ok setenv diag-switch? false
ok boot
The system now boots from the disk.

The boot-device parameter is set to net instead of disk.


To set the boot-device parameter to disk, interrupt the boot
process with Stop-A, and type the following commands at the ok
prompt:
ok setenv boot-device disk
ok boot
The system now boots from the disk.

Note The preceding commands cause the system to boot from the disk
defined as disk in the list of device aliases.
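The two causes above follow one rule: when the diag-switch? parameter is true, the boot command uses the diag-device parameter; otherwise it uses the boot-device parameter. The following is a minimal sketch of that rule. The helper name effective_boot_device is hypothetical, and its three arguments are assumed to be the current values of diag-switch?, boot-device, and diag-device.

```shell
#!/bin/sh
# Sketch of the selection rule: diag-switch? decides which variable
# names the boot device.
# effective_boot_device is a hypothetical helper.
#   $1  value of diag-switch?   $2  value of boot-device
#   $3  value of diag-device
effective_boot_device() {
    if [ "$1" = "true" ]; then
        echo "$3"
    else
        echo "$2"
    fi
}

effective_boot_device true disk net     # prints: net
effective_boot_device false disk net    # prints: disk
```

On a running Solaris system, the eeprom command can display these parameters (for example, eeprom boot-device), so the values can be checked before the next reboot.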
Consider a scenario in which the system boots from the wrong disk. For
example, you have more than one disk in your system. You want the
system to boot from the disk disk2. However, the system boots from the
disk disk1.
The possible cause for the preceding scenario is that the boot-device
parameter is not set to the correct disk.
To set the boot-device parameter to the disk disk2, interrupt the
boot process with Stop-A, and type the following command at the
ok prompt:
ok setenv boot-device disk2
ok boot
The system will now boot from the disk disk2.



Consider a scenario in which you are booting the system from a disk and
the system fails. The following message is displayed:
The file just loaded does not appear to be executable
The possible cause for the preceding scenario is that the boot block is
missing or corrupted. To correct this problem, install a new boot block.

Identifying Boot Problems by Using Run-Level Isolation


You can identify boot problems by changing the run levels of a system.
You can then compare the two run levels and identify which components
fail when you change the run level of the system.
The run level is a digit or letter representing a default collection of system
services. The run level defines which services and resources are currently
available to users. You can also refer to run levels as init states because
you can use the init daemon to initiate run-level transitions.
Run Levels of the Solaris OE on page OH 4-11

The Solaris OE has eight run levels that determine various modes of
system operation. These run levels are described in Table 4-4.
Table 4-4 Run Levels of the Solaris OE

Run Level   Function

0           Shuts down the Solaris OE and displays the ok prompt.
            This indicates that it is safe to turn off the power to the
            system.

s or S      Runs the system in single-user mode with all file
            systems mounted and accessible.

1           Indicates that the system runs in a single-user
            administrative state and allows access to all file
            systems.

2           Indicates that the system runs in a multiuser state.
            However, Kerberos, Secure Shell, Dynamic Host
            Configuration Protocol (DHCP), Simple Network
            Management Protocol (SNMP) daemons, and Samba services
            are launched at run level 3, not at run level 2.

3           Indicates that the system is running in the multiuser
            state and the NFS resource-sharing facility is available.
            This level is specified as the default run level in the
            /etc/inittab file.

4           Is currently not defined completely.

5           Shuts down the Solaris OE and powers off the system
            on a Sun4U architecture.

6           Shuts down the Solaris OE and then prompts the OBP
            to perform a default boot.

To determine the current run level of a system, type the following
command at the console:
# who -r
   .       run-level 3  Mar 21 15:25

Current Run Level of a System on page OH 4-12

Figure 4-8 shows the current run level of a system.

Figure 4-8 Current Run Level of a System
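When you compare run levels from a script, it helps to extract only the level from the who -r output. The following is a minimal sketch, assuming the output has the shape shown above (the word run-level followed by the level); run_level_from_who is a hypothetical helper that reads the line on standard input.

```shell
#!/bin/sh
# Sketch: extract the run level from one line of `who -r` output.
# run_level_from_who is a hypothetical helper; it reads standard input.
run_level_from_who() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "run-level") { print $(i + 1); exit }
    }'
}

# Using the sample line from this section:
echo "   .       run-level 3  Mar 21 15:25" | run_level_from_who    # prints: 3

# On a live system:
#   who -r | run_level_from_who
```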


Troubleshooting Scripts by Using a Shell Command Set


During the boot process, the Solaris OE runs a number of scripts that exist
in the /etc/rcS.d, /etc/rc2.d, and /etc/rc3.d directories.
Each of these directories corresponds to a run level that the system enters
while switching to full multiuser mode.
Sometimes, during system startup, an incorrectly configured system file
can cause an error in the system. In the /sbin directory, several run
control script files control the execution of scripts in the /etc/rc#.d
directories. Each of these scripts starts with the following line:
#!/sbin/sh
You can change the script so that it executes in debug mode. You use the
following shell options to execute scripts in debug mode:

-x Prints commands and their arguments as they are executed

-v Prints shell input lines as they are read

Executing the scripts in debug mode helps you to know the command
that is causing the error. Therefore, you can check the system
configuration that relates to the command for possible errors.
To execute the script in debug mode, add the line set -xv to the script:
#!/sbin/sh
set -xv
This prints each line of the executed script. Next, by viewing the last lines
executed before the error, you can track the errors that occur during the
boot process.
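Tracing can also be limited to the part of a script that you suspect. The following is a minimal sketch of that variation; check_config is a hypothetical stand-in for the commands under test, and the trace lines it produces go to standard error, each one normally prefixed with a plus sign.

```shell
#!/bin/sh
# Sketch: trace only the suspect part of a script.
# check_config is a hypothetical stand-in for the commands being debugged.
check_config() {
    set -x                  # start printing commands as they are executed
    name=demo
    echo "checking $name"
    set +x                  # stop tracing so that later commands stay quiet
}

check_config
echo "done"
```

Standard output still carries only the normal messages (checking demo, then done), so the trace can be captured separately by redirecting standard error to a file.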

Note Some scripts, specifically those with the .sh suffix, execute in the
context of the calling shell. Therefore, the preceding set -xv command
affects the calling shell and any subsequent processing by that shell. If
this occurs, add the set +xv command to the end of the script to
rectify the problem.
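The difference between scripts that run in the calling shell and scripts that run in a child shell can be seen in a small run control loop. The following is a minimal sketch modeled on the behavior described in the preceding note; it is an illustration, not the contents of any file in the /sbin directory. The function name run_start_scripts and its directory argument are hypothetical, and the sketch should only be pointed at a scratch directory.

```shell
#!/bin/sh
# Sketch of a run control loop: process every S* file in a directory in
# name order. Files ending in .sh are read into this shell; all other
# files are run by a child shell with the argument "start".
# run_start_scripts is a hypothetical helper; $1 is the directory.
run_start_scripts() {
    for f in "$1"/S*; do
        [ -s "$f" ] || continue            # skip missing or empty files
        case "$f" in
            *.sh) . "$f" ;;                # runs in the context of this shell
            *)    /bin/sh "$f" start ;;    # runs in a separate child shell
        esac
    done
}

# Example (scratch directory only):
#   run_start_scripts /tmp/rctest.d
```

Because a .sh file is read into the loop's own shell, a set -xv line inside it keeps tracing every script that follows, which is why the note recommends ending such a file with set +xv.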


Restoring the bootblk or ufsboot Program


During the boot programs phase, the bootblk program loads the
secondary boot program, ufsboot, from the boot device into memory. If
the bootblk program or the secondary boot programs are corrupt, you
can boot the system either from the CD-ROM device or from other disks
on which the OS is loaded.
In the following example, you restore a corrupt file system from the
CD-ROM device. This procedure assumes that the root file system was
backed up using the ufsdump command.
To restore the corrupt boot program files, you must install the boot block
program files either from the CD-ROM device or from other partitions to
the boot area of the disk partition. The following lists the steps to restore
the file system:
1. Load the Solaris CD into the CD-ROM device.

2. Boot the CD-ROM in single-user mode.

   ok boot cdrom -s

3. Install the boot block.

   # cd /usr/platform/`uname -i`/lib/fs/ufs
   # installboot bootblk /dev/rdsk/c0t0d0s0

   where c0t0d0s0 is the slice that holds the root file system.

4. Reboot the system.

   # init 6

Note The boot block is platform-dependent and resides in the
/usr/platform/platform-name/lib/fs/ufs directory. You can locate
the platform name of a system by using the -i option of the uname
command.


Using boot Commands


You use the boot command to boot the Solaris OE from the ok prompt.
When you specify the boot command at the ok prompt, the system boots
automatically to the default run level that is specified in the
/etc/inittab file.
To boot a system, type the following command at the ok prompt:
ok boot
The boot command supports the following options:

-s   Boots the system to single-user mode and prompts for the root
     password.

-v   Displays detailed device information at the console during the
     boot sequence. This option is useful for troubleshooting problems
     during the boot sequence.

-r   Performs a reconfiguration boot during the boot sequence. This
     option identifies any new devices that might be attached and creates
     entries for these devices in the /devices and /dev directories. This
     option also updates the /etc/path_to_inst file.

-a   Boots the system in interactive mode. You use this option to
     specify an alternative system file or kernel by specifying different
     configuration options during the boot process.

Note Interactive booting enables you to test the changes made during
the booting process and recover from system problems quickly. This
procedure assumes that the system is already shut down.



Table 4-5 displays the information that you specify when you boot the
system by using the -a option.
Table 4-5 Interactive Boot Procedure

If the system displays:
   Enter filename [kernel/sparcv9/unix]:
Perform the following:
   Provide the name of another kernel to boot. Alternatively, press
   Return to use the default kernel (kernel/sparcv9/unix).

If the system displays:
   Name of system file [etc/system]:
Perform the following:
   Provide the name of an alternative system file, and press Return.
   Alternatively, press Return to use the default /etc/system file.
   If no valid alternative system file is available, you can use the
   /dev/null file to have no system file. This works if there are no
   critical changes performed by the system file, such as using root
   disk mirroring.

If the system displays:
   Enter default directory for modules
   [/platform/SUNW,Ultra5_10/kernel /platform/sun4u/kernel
   /kernel /usr/kernel]:
Perform the following:
   Provide an alternative path for the modules directory, and press
   Return. Alternatively, press Return to use the default modules
   directory path.

If the system displays:
   root filesystem type [ufs]:
Perform the following:
   Press Return to use the default root file system type: UFS for
   local disk booting or NFS for diskless clients.

If the system displays:
   Enter physical name of root device
   [/pci@1f,0/pci@1,1/ide@3/disk@0,0:a]:
Perform the following:
   Provide an alternative device name, and press Return.
   Alternatively, press Return to use the default physical name of the
   root device.


How the auto-boot? Variable Affects the Boot Command


The auto-boot? variable is an OBP variable. When you set the value of
the auto-boot? variable to true, the boot process starts automatically at
system power on. If the value of the auto-boot? variable is set to false,
the system does not attempt to boot and drops to the ok prompt.


Exercise: Introducing the OBP Device Tree and the Boot Sequence
In this exercise, you use the OBP and Solaris OE commands to perform
the tasks described in the module.

Preparation
Perform a shutdown procedure to access the OBP environment and run
OBP commands at the ok prompt to perform system diagnostics. Ensure
that the systems are already at the ok prompt.
To complete the tasks for restoring the corrupt boot program files, ensure that the lab is set up with a system,
Host1, which has a corrupt file system.

Note Due to different PROM levels and architectures, the syntax for
OBP commands can vary slightly. For more information, refer to OpenBoot
3.x Quick Reference Card and OpenBoot 4.x Quick Reference Card.

Tasks
To navigate the OBP device tree, complete the following steps:
1. Display a list of all the device aliases defined on your system.

2. Select a device node on the device tree. Use an appropriate
   command to make the node current.

3. Display all the properties of the current node.

4. Display all the methods that belong to the current node.

5. Use the appropriate command to switch to the root of the device
   tree.

To create a custom alias, complete the following steps:


1. Use the appropriate command to display the full device path name
   for the disk alias. Note the device path name.

2. Use the appropriate command to display the disks available on your
   system, and select the device path that relates to the disk that you
   recorded from Step 1.

3. Run the appropriate command to create a device alias named
   mydisk. Set the mydisk alias to the path and disk name that you
   recorded in Steps 1 and 2.

4. Verify that the new alias is correctly defined.

5. Boot your system by using the mydisk alias.

6. Use the appropriate command to remove the mydisk alias.

7. Verify that the mydisk alias is removed successfully from the
   nvramrc variable.

8. Use the appropriate command to check whether the mydisk alias
   exists in the list of device aliases.

9. Reset your system, and then check whether the mydisk alias still
   exists in the list of device aliases.

10. Set the OBP variables to their default values, and boot the system
    from the default boot device.

11. Verify that the system boots.
To restore the corrupt boot program files from the CD-ROM device,
complete the following steps:
1. Load the Solaris CD into the CD-ROM device.

2. Boot the CD-ROM in single-user mode.

3. Install the boot block.

4. Reboot the system.

To boot the system in interactive mode, complete the following steps:

1. Halt the automatic boot sequence.

2. Display a list of all the OBP parameters defined on your system.

3. Boot the system in interactive mode.

4. Specify the default file name of the kernel.

5. Specify the default directory for modules.

6. Specify the name of the system file.

7. Specify the default root file system type.

8. Specify the physical name of the root device.


Exercise Summary

Discussion Take a few minutes to discuss what experiences, issues, or
discoveries you had during the lab exercises.

Manage the discussion here based on the time allowed for this module, which was given in the About This
Course module. If you find you do not have time to spend on discussion, then just highlight the key concepts
students should have learned from the lab exercise.

Experiences

Ask students what their overall experiences with this exercise have been. You may want to go over any
trouble spots or especially confusing areas at this time.

Interpretations

Ask students to interpret what they observed during any aspects of this exercise.

Conclusions

Have students articulate any conclusions they reached as a result of this exercise experience.

Applications

Explore with students how they might apply what they learned in this exercise to situations at their workplace.


Exercise Solutions
To navigate the OBP device tree, complete the following steps:
1. Display a list of all the device aliases defined on your system.

   ok devalias

2. Select a device node on the device tree. Use an appropriate
   command to make the node current.

   ok dev device_pathname

3. Display all the properties of the current node.

   ok .properties

4. Display all the methods that belong to the current node.

   ok words

5. Use the appropriate command to switch to the root of the device
   tree.

   ok device-end

To create a custom alias, complete the following steps:


1. Use the appropriate command to display the full device path name
   for the disk alias. Note the device path name.

   ok devalias disk

2. Use the appropriate command to display the disks available on your
   system and select the device path that relates to the disk that you
   recorded from step 1. Continue with step 2.

   ok show-disks
   (select a disk from the list)

   Note Select the device path that relates to the disk from step 1.

3. Run the appropriate command to create a device alias named
   mydisk. Set mydisk to the path and disk name that you recorded in
   steps 1 and 2.

   ok nvalias mydisk pathname

4. Verify that the new alias is correctly defined.

   ok devalias mydisk

5. Boot your system by using the mydisk alias.

   ok boot mydisk

6. Use the appropriate command to remove the mydisk alias.

   ok nvunalias mydisk

7. Verify that the mydisk alias is successfully removed from the
   nvramrc variable.

   ok printenv nvramrc

8. Use the appropriate command to check if the mydisk alias exists in
   the list of device aliases.

   ok devalias mydisk

9. Reset your system and then check if the mydisk alias still exists in
   the list of device aliases.

   ok devalias mydisk

10. Set the OBP variables to their default values and boot the system
    from the default boot device.

    ok set-defaults
    ok boot

11. Verify that the system boots.
To restore the corrupt boot program files from the CD-ROM, complete the
following steps:
1. Load the Solaris CD into the CD-ROM device.

2. Boot the CD-ROM in single-user mode.

   ok boot cdrom -s

3. Install the boot block.

   # cd /usr/platform/`uname -i`/lib/fs/ufs
   # installboot bootblk /dev/rdsk/c0t0d0s0

   where c0t0d0s0 is the root file system.

4. Reboot the system.

   # init 6

To boot the system in interactive mode, complete the following steps:
1. Halt your system to display the ok prompt.
Press the Stop-A keys or specify the following command:
# init 0

2. Display a list of all the OBP parameters defined on your system.
ok printenv

3. Boot the system in interactive mode.
ok boot -a

4. Specify the default file name of the kernel.
Press the Return key when the system prompts for the default file name of
the kernel.

5. Specify the default directory for modules.
Press the Return key when the system prompts for the name of the default
directory for modules.

6. Specify the name of the system file.
Press the Return key when the system prompts for the name of the system
file.

7. Specify the default root file system type.
Press the Return key when the system prompts for the name of the default
root file system type.

8. Specify the physical name of the root device.
Press the Return key when the system prompts for the physical name of the
root device.


Module 5

Performing Solaris OE Diagnostics


Objectives
Overview on
page OH 5-2

Upon completion of this module, you should be able to:

Use the device management commands

Use the disk and file system management commands

Use the software package management commands

Use the file-checking commands

Use the CPU and memory management commands

Use the network management commands

Use the general-purpose commands

Use the program execution management commands

Relevance
Present the following questions to stimulate the students and get them to think about the issues and topics
presented in this module. While they are not expected to know the answers to these questions, the answers
should be of interest to them and inspire them to learn the material presented in this module.
Relevance on
page OH 5-3

Discussion: These questions are relevant to understanding the
functionality and application of the diagnostic commands and tools that
are available in the Solaris 9 OE:

In which areas do the Solaris OE commands help you to diagnose
system faults?

Which commands and tools of the Solaris OE do you find the most
useful?

Allow students to share their work experiences and list the commands and tools that they find useful.

Can you describe a recent system problem and the commands and
tools that you used to solve the problem?

Ask students to share their work experiences and describe a recent system problem and the commands and
tools that they used to solve the problem.

Additional Resources
Additional resources: The following references can provide additional
information on the topics discussed in this module:

Solaris Manual Pages (http://docs.sun.com), accessed 07 January 2002.

Solaris User and System Administration Answer Books
(http://docs.sun.com), accessed 07 January 2002.

Using the Device Management Commands


Devices are one of the primary areas that you check when diagnosing a
problem in the Solaris OE. Maintaining devices helps to ensure high
availability and uptime of business-critical servers. In the Solaris 9 OE, the
devfsadm command replaces the Solaris 7 OE device management tools,
such as the drvconfig, disks, tapes, devlinks, and ports tools.

Note: In the Solaris 9 OE, only the root user can run the device
management commands.

Using the devfsadm Command


The devfsadm command maintains the /dev and /devices directories on
Sun systems. The devfsadm command performs the following tasks:

Loads all available drivers on the system

Creates files for the devices in the /devices directory

Creates logical links for the devices in the /dev directory

In addition to managing the /dev and /devices directories, the devfsadm
command maintains the path_to_inst database. This database is a
device instance number file that contains the mappings of physical device
names to instance numbers.
Ask students to open the path_to_inst database in a text editor and view the mappings of physical device
names to instance numbers.
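
The mapping that the path_to_inst database stores can be illustrated with a short sketch. It assumes that each entry consists of a quoted physical device name, an instance number, and a quoted driver name. The two sample entries and the list_instances function name are hypothetical and do not come from a real system. On a live system, you would read the actual file (typically /etc/path_to_inst) instead of the sample text.

```shell
#!/bin/sh
# Hypothetical sample entries in the assumed three-field layout:
# "physical-device-name" instance-number "driver-name"
sample='"/pci@1f,0/ide@d/dad@0,0" 0 "dad"
"/pci@1f,0/ide@d/sd@2,0" 2 "sd"'

# Splitting each line on the double quote puts the physical path in $2,
# the instance number (padded with spaces) in $3, and the driver name in $4.
# The function prints the driver name joined with the instance number,
# a tab, and then the physical path.
list_instances() {
    awk -F'"' 'NF >= 5 { gsub(/ /, "", $3); print $4 $3 "\t" $2 }'
}

instances=$(printf '%s\n' "$sample" | list_instances)
printf '%s\n' "$instances"
```

To try the same filter on a real system, you might replace the sample text with the file itself, for example: list_instances < /etc/path_to_inst.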

The /sbin/rcS boot script starts the /etc/rcS.d/S50devfsadm script.
This script handles reconfiguration boots and starts the syseventd
daemon. This daemon starts the devfsadmd daemon on demand.

Note: You need not run the devfsadm command interactively because
the devfsadmd daemon automatically detects the changes in device
configuration.


The following syntax displays the options for the devfsadm command:
/usr/sbin/devfsadm [-C] [-c device_class] [-i driver_name]
[-n] [-r root_dir] [-s] [-t table_file] [-v]
Table 5-1 lists the options supported by the devfsadm command with
their descriptions.
Table 5-1 Options of the devfsadm Command and Their Descriptions
Option            Description

-C                Invokes clean-up routines to remove dangling logical
                  links to devices that are no longer attached to the
                  system.

-c device_class   Restricts operations to the devices in the class that
                  is specified in the device_class variable. The
                  supported device classes are disk, tape, port, audio,
                  and pseudo.

-i driver_name    Configures the devices only for the named driver.

-n                Does not attempt to load drivers for new hardware or
                  add new nodes to the kernel device tree.

-s                Suppresses the changes made to the /dev or /devices
                  directory.

-t table_file     Reads the alternative devlink.tab file specified by
                  the table_file variable. The devlink.tab file contains
                  the rules for creating the logical links in the /dev
                  directory.

-r root_dir       Assumes that the /dev and /devices directory trees are
                  located under the root_dir directory and not directly
                  under the root (/) directory. You use this option when
                  the /dev/dsk/c0t0d0s0 disk slice is mounted on a
                  directory other than the root (/) directory.

-v                Prints the changes made to the /dev and /devices
                  directories in verbose mode.


Using the Pre-Solaris 8 OE Device Commands


The devfsadm command replaces the drvconfig, disks, tapes,
devlinks, and ports commands that were present in the Solaris 8 OE
and pre-Solaris 8 OEs. In the Solaris 9 OE, the devfsadm command is the
preferred command to manage the devices on the system. Table 5-2 briefly
describes the pre-Solaris 8 OE commands.
Table 5-2 Pre-Solaris 8 OE Commands and Their Descriptions
Command     Description

drvconfig   Configures the /devices directory tree structure. This
            command reads the /etc/minor_perm file and determines
            which permissions to assign to the new nodes. The
            drvconfig command does not change permissions on
            existing nodes.

disks       Creates symbolic links for the disk devices in the
            /dev/dsk and /dev/rdsk directories. These symbolic links
            point to the special files for the actual disk devices in
            the /devices directory tree.

tapes       Creates symbolic links in the /dev/rmt directory. These
            symbolic links point to the special files for the physical
            tape devices in the /devices directory. The tapes command
            runs automatically when the system performs a
            reconfiguration boot.

devlinks    Creates symbolic links in the /dev directory for all the
            entries in the /devices directory tree. The links are
            created according to the specifications in the
            /etc/devlink.tab file.

ports       Creates symbolic links in the /dev/term and /dev/cua
            directories. These symbolic links correspond to the serial
            port entries in the /devices directory. The ports command
            also creates entries in the /etc/inittab file for the
            nonsystem ports on the system.

Note: To run the disks, tapes, devlinks, and ports commands
manually, you first run the drvconfig command to ensure that the
required entries are present in the /devices directory.

Using the Disk and File System Management Commands


You monitor disks and file systems to ensure the integrity of the systems.
The following commands are available in the Solaris OE to manage disks
and file systems:

The format command

The fsck command

The fstyp -v command

The iostat command

Note: In the Solaris 9 OE, only the root user can run the disk and file
system management commands.

Using the format Command


You use the format command to format, label, repair, and analyze the
disks on your system. You also use the format command to list the
available disks on the system and create, view, and modify the partitions
on the disks.
The following syntax displays the options for the format command:
format [-f command-file] [-l log-file] [-x data-file]
[-d disk-name] [-t disk-type] [-p partition-name] [-s] [-m]
[-M] [-e] [disk-list]
Refer to the online man pages for information on the options supported by the format command.

Multiple levels of prompts exist within the format command. For
example, to create a partition, you invoke the format command from the
terminal window and select the disk to be partitioned from the list of
disks on the system. This invokes the format command prompt. You then
run the partition command at the format command prompt on the
selected disk. This launches the partition menu. You use the partition
menu to view and modify partition information.

The following is an example of a partition menu:
PARTITION MENU:
        0      - change `0' partition
        1      - change `1' partition
        2      - change `2' partition
        3      - change `3' partition
        4      - change `4' partition
        5      - change `5' partition
        6      - change `6' partition
        7      - change `7' partition
        select - select a predefined table
        modify - modify a predefined partition table
        name   - name the current table
        print  - display the current table
        label  - write partition map and label to the disk
        !<cmd> - execute <cmd>, then return
        quit
If time permits, ask students to execute the format command on their respective systems and view the
partition information on the disks on their systems.

Note: You cannot use the format command on diskette drives, CD-ROM
drives, or tape drives.

Using the fsck Command


If the control structures are found to be inconsistent while mounting a file
system, you must check and make the structures consistent. The
Solaris OE provides the fsck program to check and repair the control
structures for all file systems, including the default Solaris OE file system,
UFS.
If a file system is found to be inconsistent during the boot process, the
system automatically invokes the fsck program to make the file system
consistent. However, if the system reports serious problems, the boot
process is suspended, and you must repair the file system manually
before booting the system.

Note: The control structures of the UFS file system that are repaired by
the fsck command include the superblock, the boot block, the inode
block, and the inode count.

The following syntax displays the options for the fsck command:
fsck [-F FSType] [-m] [-V] [special ...]
fsck [-F FSType] [-n | N | y | Y] [-V]
[-o FSType-specific-options] [special ...]
Refer to the online man pages for information on the options supported by the fsck command.

When you use the -m option, the fsck command verifies whether the file
system is ready for mounting. If the file system is ready for mounting, the
fsck command displays the following message:
ufs fsck: sanity check: /dev/rdsk/c0t3d0s1 okay
If time permits, ask students to run the fsck command with the -m option and check whether the file system
on their respective systems is ready for mounting.

Consider a scenario in which you reboot a Sun system. When you start
the system, it reports the following error:
THE FOLLOWING FILE SYSTEM(S) HAD AN UNEXPECTED
INCONSISTENCY: /dev/rdsk/c0t0d0s7 (/export/home)
WARNING - Unable to repair one or more filesystems.
Run fsck manually (fsck filesystem...).
Exit the shell when done to continue the boot process.
Type control-d to proceed with normal startup,
(or give root password for system maintenance):

Caution: You can run the fsck -y command to check and repair all the
file systems on the system without user intervention. However, this action
might cause serious damage to the file system and should be used only as
a last resort. The best method is to run the fsck command without the -y
option.

In the preceding scenario, the corrupt file system is located on the
/dev/rdsk/c0t0d0s7 (/export/home) disk slice. To repair the file system,
you use the following command:
# fsck -F ufs /dev/rdsk/c0t0d0s7

Caution: Before running the fsck command, ensure that the system is in
single-user mode. This is essential for the fsck command to repair
damaged file systems.


Repairing a Corrupt Superblock


When the superblock in a UFS file system is corrupted, you must replace
the superblock with the contents of the backup superblock. A backup
superblock is located at sector 32 of every file system. The following is the
format of the fsck command that you use to replace a corrupt superblock
on the /dev/rdsk/c0t0d0s7 file system:
# fsck -o b=32 /dev/rdsk/c0t0d0s7
Alternate superblock location: 32.
** /dev/rdsk/c0t0d0s7
** Last mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE?
If the backup superblock at sector 32 is also corrupted, use the
following newfs command to obtain a list of alternative backup
superblocks:
# newfs -N /dev/rdsk/c0t0d0s7
The -N option specifies that the newfs command does not create a new
file system on the specified slice but instead reports where the backup
superblocks would be placed.

Caution: If you do not specify the -N option, the newfs command creates
a new file system on the slice and destroys the existing data.


Using the fstyp Command


To diagnose faults on a file system, you must know the type of file system
that is installed in the Solaris OE. The fstyp command identifies the type
of file system installed on a disk. You also use the fstyp command to read
and display the contents of the superblock and the cylinder group
information.
Refer to the online man pages for information on the options supported by the fstyp command.

For example, to identify the type of file system located on the
/dev/dsk/c0t0d0s7 slice along with the cylinder group
information, you use the following command:
# fstyp -v /dev/dsk/c0t0d0s7
ufs
...
cylinders in last group 24
blocks in last group 1512
...
If time permits, ask students to execute the fstyp command by using the -v option. Ask them to use different
file systems and compare the results.

Note: The fstyp -v command displays verbose output on the types of
file systems.

Using the iostat Command


In the Solaris OE, the iostat command reports statistics, such as disk
input and output. The command also produces measures of throughput,
utilization, queue lengths, transaction rates, and service time. The iostat
command iteratively reports terminal, disk, and tape I/O activity as well
as CPU utilization.
To compute information on I/O statistics, the kernel maintains a number
of counters. The kernel counts the reads, writes, bytes read, and bytes
written of each disk. The iostat command uses these values to produce
accurate measures of throughput, utilization, queue lengths, transaction
rates, and service time. For terminals, the kernel counts the number of
input and output characters.


The following syntax displays the options for the iostat command:
/usr/bin/iostat [-cCdDeEiImMnpPrstxz] [-l n] [-T u | d]
[disk ...] [interval [count]]
Refer to the online man pages for information on the options supported by the iostat command.

Consider a scenario in which you suspect that an I/O bottleneck is
responsible for the degrading performance on a Sun system. When you
run the iostat -xt command, the following output is generated:
# iostat -xt
                  extended device statistics                 tty
device    r/s   w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b  tin tout
dad0      3.1   2.1    8.0    4.2   3.0   4.0   65.8   4   9    2    2
dad1      4.0   3.1    9.3    6.7   3.2   3.0   22.3   5   8
fd0       2.0   1.8    6.0    6.0   2.1   3.0   33.0   8   7
sd0       4.3   4.1    8.0    5.0   4.1   2.0   46.0   8   8
nfs1      3.3   3.0    9.0    6.0   3.7   4.0   39.7   7   6

Table 5-3 lists the contents displayed in the output fields.

Table 5-3 Fields of the iostat Command

Field    Representation

r/s      Number of reads per second
w/s      Number of writes per second
kr/s     Number of Kbytes read per second
kw/s     Number of Kbytes written per second
wait     Average number of transactions waiting for service
actv     Average number of transactions that are being serviced
svc_t    Average service time in milliseconds
%w       Percentage of time when transactions are waiting for service
%b       Percentage of time the disk is busy
tin      Number of characters read from a terminal per second
tout     Number of characters written to a terminal per second


The output of the iostat command shows that the system is performing
several read and write operations every second. In addition, the system
has a large value for the wait and svc_t fields. These values indicate that
the system is overloaded with many small I/O operations. A bottleneck
might be the problem between the dad0 and dad1 devices. The iostat
command generates data that supports the hypothesis that the system has
an I/O bottleneck.
Ask students to run the iostat command on their systems to view information about the terminal, disk,
diskette, tape, file system, and CPU usage of the system.
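
The reasoning above, which looks for large svc_t values, can be sketched as a small filter. The rows below are copied from the sample output in this section. The slow_devices function name and the 40-millisecond limit are assumptions chosen only for illustration and are not Sun recommendations. The sketch also assumes that svc_t is the eighth field, as in the layout shown above.

```shell
#!/bin/sh
# Rows copied from the sample iostat -xt output in this section:
# device r/s w/s kr/s kw/s wait actv svc_t %w %b
sample='dad0 3.1 2.1 8.0 4.2 3.0 4.0 65.8 4 9
dad1 4.0 3.1 9.3 6.7 3.2 3.0 22.3 5 8
fd0 2.0 1.8 6.0 6.0 2.1 3.0 33.0 8 7
sd0 4.3 4.1 8.0 5.0 4.1 2.0 46.0 8 8
nfs1 3.3 3.0 9.0 6.0 3.7 4.0 39.7 7 6'

# Print the device name and svc_t (field 8) for every row whose average
# service time exceeds the limit given as the first argument.
# Header lines are skipped because their eighth field is not a number
# greater than the limit.
slow_devices() {
    awk -v limit="${1:-40}" '$8 + 0 > limit + 0 { print $1, $8 }'
}

slow=$(printf '%s\n' "$sample" | slow_devices 40)
printf '%s\n' "$slow"
```

With the sample rows and a limit of 40, the filter reports the dad0 and sd0 devices. On a live system, you might pipe the output of the iostat -x command into the same function, after confirming that the column order matches.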

Using the Software Package Management Commands


In the Solaris OE, you use the software package management commands
to analyze or display information about installed software packages.
These commands also check the versions of the software package files
installed on a system.
The Solaris OE contains the following software package management
commands:

The pkgchk command

The pkginfo command

The pkgadd command

The pkgrm command

Note: In the Solaris 9 OE, only the root user can run the software
package management commands.

Using the pkgchk Command


You use the pkgchk command to verify the integrity, specific path name,
file contents, and file attributes of an installed package. For example, to
check the installation of the man pages on a Sun system, you use the
following command:
# pkgchk SUNWman
If the man pages are installed correctly, the pkgchk command returns to
the prompt without any messages. However, if the man pages are not
installed or there is a problem with the installation of the man pages, the
system returns the following message:
WARNING: no pathnames were associated with <SUNWman>
The pkgchk command checks the following:

The package installation scripts

The contents or attributes of the objects that are currently installed
on the system

The contents of a spooled, uninstalled package

The contents or attributes, or both, of the objects that are described
in the specified pkgmap file


The following syntax displays the options for the pkgchk command:
pkgchk [-l | -acfnqvx] [-i file] [-p path ...]
[-R root_path] [[-m pkgmap [-e envfile]] | [pkginst]... ]
pkgchk -d device [-l|-fv] [-i file] [-M] [-p path...]
[-V fs_file] [pkginst...]
Refer to the online man pages for information on the options supported by the pkgchk command.

Note: To view information about all the packages installed on the
system, refer to the /var/sadm/install/contents file.

The first set of options in the pkgchk syntax lists or checks the contents
and attributes of the objects that are currently installed on the system or in
the indicated pkgmap file. By default, all the content on a system is
checked.
The second set of the options in the pkgchk syntax lists or checks the
contents of a package that is spooled on the specified device but not
installed. The pkgchk command cannot check the attributes for the
spooled packages.

Using the pkginfo Command


The pkginfo command displays information about the software
packages installed on the system. The command also displays information
about the software packages that are located on a particular device.
The following syntax displays the options for the pkginfo command:
pkginfo [-q | -x| -l] [-p | -i] [-r] [-a arch] [-v version]
[-c category...] [pkginst ...]
pkginfo [-d device] [-R root_path] [-q | -x| -l] [-a arch]
[-v version] [-c category ...] [pkginst ...]
Refer to the online man pages for information on the options supported by the pkginfo command.

The first set of the pkginfo syntax displays information about the
software packages that are installed on the system.


The second set of the pkginfo syntax displays information about the
software packages that reside on a particular device or directory.
The pkginfo command lists the primary category, the package instance,
and the names of all completely and partially installed packages.
For example, to check whether the SUNWman package is installed on the
system, you type the following command:
# pkginfo -l SUNWman
The following output is generated if the SUNWman package is not installed:
ERROR: information for "SUNWman" was not found

Using the pkgadd Command


The pkgadd command transfers software packages from the distribution
directory to the system. By default, the pkgadd command searches the
/var/spool/pkg directory for packages. For example, to install the
SUNWman package from the /var/spool/pkg directory, you use the
following command:
# pkgadd SUNWman
The following is the syntax for the pkgadd command:
pkgadd [-nv] [-a admin] [-d device] [ [-M] -R root_path]
[-r response] [-V fs_file]
[pkginst... -Y category[,category...]]
pkgadd -s spool [-d device]
[pkginst... -Y category[,category...]]
To install the SUNWman package from a mounted Solaris CD-ROM, type the
following:
# pkgadd -d <path to directory holding SUNWman> SUNWman
If you use the pkgadd command with the -s option, the pkgadd command
writes the package to a spool directory instead of installing it on the
system.

Note: You cannot use the -r, -n, and -a options when transferring a
package to a spool directory.

Using the pkgrm Command


You use the pkgrm command to remove a previously installed package
from the system. Before the pkgrm command removes a package, the
command checks whether any other packages depend on the package that
is being removed. If a dependency exists, the resultant action is defined in
the admin file.

Note: The pkgrm command first searches the current working directory
for the admin file. If the specified admin file is not in the current working
directory, the pkgrm command searches the /var/sadm/install/admin
directory for the admin file.

The following is the syntax of the pkgrm command:
pkgrm [-nv] [-a admin] [ [ -A| -M] -R root_path]
[-V fs_file] [ pkginst... -Y category[,category...]]
pkgrm -s spool [ pkginst... -Y category[,category...]]
By default, the pkgrm command runs in interactive mode. You use the -n
option to change to noninteractive mode.
Consider a scenario in which one or more files within the SUNWjunk
package are corrupted. You use the following pkgrm command to remove
all files and directories associated with the SUNWjunk package:
# pkgrm SUNWjunk

Using the File-Checking Commands


You use file-checking commands to analyze and display information
about the contents of files and to check for hidden characters in files. You
also use these file-checking commands to compare file contents.

Checking for Hidden Characters


Often, you reuse the scripts developed by your peers for routine activities,
such as monitoring disk usage. Sometimes, due to certain hidden
characters, the scripts that you develop on one platform do not work on a
different platform. You use the vi and cat -vet commands to check for
hidden or nonprintable characters in a file and display the contents of the
file.

Using the vi Command


The vi command launches a display-oriented text editor based on the line
editor ex. You can use the command mode of the ex editor from within
the vi command and the command mode of the vi editor from within the
ex command. To open a file in vi, type the following command:
# vi /etc/services
...
# Network services, Internet style
#
tcpmux          1/tcp
echo            7/tcp
echo            7/udp
discard         9/tcp           sink null
discard         9/udp           sink null
systat          11/tcp          users
daytime         13/tcp
daytime         13/udp
netstat         15/tcp
chargen         19/tcp          ttytst source
chargen         19/udp          ttytst source
ftp-data        20/tcp
ftp             21/tcp
......<output truncated>
Refer to the online man pages for information on the options supported by the vi command.


When you use the vi command, the changes that you make to a file are
displayed on the terminal screen.
The vi editor has a special editing mode that enables you to view hidden
and nonprintable characters as either special characters or control
sequences. To invoke this special mode within the vi editor, complete the
following steps:
1. Open a text file within the vi editor.

2. Press the <Esc> key.

3. Type a colon (:) and, at the colon prompt, type the following:
set list

In this mode, the $ character marks the end of each line, and the ^I
sequence marks each tab. A space is represented as a space.
Refer students to the man pages for details on using the vi editor.

Using the cat Command


You use the cat command to concatenate and display the contents of files.
The cat command reads each file in sequence and writes it on the
standard output. For example, to view the contents of the /etc/services
file, you type the following command:
# cat /etc/services
...
#
# Network services, Internet style
#
tcpmux          1/tcp
echo            7/tcp
echo            7/udp
discard         9/tcp           sink null
discard         9/udp           sink null
systat          11/tcp          users
daytime         13/tcp
daytime         13/udp
netstat         15/tcp
chargen         19/tcp          ttytst source
chargen         19/udp          ttytst source
ftp-data        20/tcp
ftp             21/tcp
.....<output truncated>


The -v option of the cat command enables you to print all nonprinting
characters visibly, except new lines and tabs. The -e option of the cat
command inserts a $ sign at the end of each line. Similarly, the -t option
prints ^I for tabs and ^L for form feed characters. Therefore, the
/etc/services file shows the following output for the cat -vet
command:
# cat -vet /etc/services
...
#$
# Network services, Internet style$
#$
tcpmux^I^I1/tcp$
echo^I^I7/tcp$
echo^I^I7/udp$
discard^I^I9/tcp^I^Isink null$
discard^I^I9/udp^I^Isink null$
systat^I^I11/tcp^I^Iusers$
daytime^I^I13/tcp$
daytime^I^I13/udp$
netstat^I^I15/tcp$
chargen^I^I19/tcp^I^Ittytst source$
chargen^I^I19/udp^I^Ittytst source$
ftp-data^I20/tcp$
ftp^I^I21/tcp$
......<output truncated>
Refer to the online man pages for information on the options supported by the cat command.

Caution: When reading two or more files, if you redirect the output of
the cat command to one of the files, the original data in that file is lost.
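
As an illustration of checking a script for hidden characters with the cat -vet command, the following sketch builds a small file with DOS-style line endings and counts the lines that end in a carriage return. The file name and its contents are made up for the example, and the choice of carriage returns as the hidden character is an assumption about a typical cross-platform problem.

```shell
#!/bin/sh
# Create a scratch file that stands in for a script copied from another
# platform. Each line ends in carriage return + line feed, and the second
# line starts with a tab.
tmpfile="${TMPDIR:-/tmp}/catdemo.$$"
printf 'echo start\r\n\tdf -k\r\n' > "$tmpfile"

# cat -vet shows a carriage return as ^M, a tab as ^I, and the end of
# each line as $.
visible=$(cat -vet "$tmpfile")
printf '%s\n' "$visible"

# Count the lines on which ^M appears immediately before the end-of-line $.
cr_lines=$(printf '%s\n' "$visible" | grep -c '\^M\$$')
echo "lines ending in a carriage return: $cr_lines"

rm -f "$tmpfile"
```

A count greater than zero suggests that the file carries line endings from another platform, which is one common reason for a script to fail after it is copied between systems.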

Comparing File Contents


You use various commands to compare the changed versions of files with
the original files in the Solaris OE. You should always make copies of
system files before you modify the files. Later, to identify the exact change
made to a system file, you use commands, such as cmp, diff, and sum to
compare the modified files with the copies of original files.

Using the cmp Command


In the Solaris OE, the cmp command compares two files. By default, the
command reports the byte and line number at which the first difference
occurs. With the -l option, the command lists every differing byte. If
the two files are the same, no output is displayed.
Consider a scenario in which the /etc/services file is backed up as the
/etc/old_services file. The nntp service is then disabled because the
organization policy states that users must not access the usenet service.
The cmp command results in the following output:
# cmp /etc/old_services /etc/services
/etc/old_services /etc/services differ: char 1406, line 56
The following is the syntax of the cmp command:
cmp [-l] [-s] file1 file2 [skip1] [skip2]
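
The following sketch shows one way to use the cmp command in a script. It works on scratch copies with made-up contents so that no real system file is touched, and it relies on the -s option, which suppresses output and leaves only the exit status (0 for identical files, 1 for different files). The directory and variable names are hypothetical.

```shell
#!/bin/sh
# Scratch files that stand in for the backup and the edited services file.
workdir="${TMPDIR:-/tmp}/cmpdemo.$$"
mkdir -p "$workdir"
printf 'ftp 21/tcp\nnntp 119/tcp\n'  > "$workdir/old_services"
printf 'ftp 21/tcp\n#nntp 119/tcp\n' > "$workdir/services"

# cmp -s prints nothing. The exit status alone says whether the files match.
if cmp -s "$workdir/old_services" "$workdir/services"; then
    cmp_result="identical"
else
    cmp_result="different"
fi
echo "The two files are $cmp_result"

# Without -s, cmp reports where the first difference occurs. The command
# exits with status 1 here, so "|| true" keeps a "set -e" script running.
first_diff=$(cmp "$workdir/old_services" "$workdir/services" || true)
echo "$first_diff"

rm -rf "$workdir"
```

The exact wording of the message in first_diff varies between implementations (for example, char versus byte), so a script should test the exit status rather than parse the text.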

Using the diff Command


In the Solaris 9 OE, the diff command displays line-by-line differences
between a pair of text files. This command compares the first file with the
second file and then writes a list of the changes required to convert the
first file into the second file.
For example, the following is the result of running the diff command on
the /etc/old_services and /etc/services files:
# diff /etc/old_services /etc/services
56c56
< nntp            119/tcp         usenet          # Network News Transfer
---
> #nntp           119/tcp         usenet          # Network News Transfer
Provide students with the myfile1 and myfile2 files, which are identical. Ask students to execute the diff
command on these files and view the output. Inform students that no list is generated if the files are identical.
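
A minimal sketch of the same kind of comparison on scratch files follows. The file contents are made up, and the variable names are hypothetical. The sketch captures the diff output for a changed file and for an identical pair, which produces no output.

```shell
#!/bin/sh
# Scratch files: a backup copy and an edited copy with one line commented out.
workdir="${TMPDIR:-/tmp}/diffdemo.$$"
mkdir -p "$workdir"
printf 'ftp 21/tcp\nnntp 119/tcp\n'  > "$workdir/old_services"
printf 'ftp 21/tcp\n#nntp 119/tcp\n' > "$workdir/services"

# diff exits with status 1 when the files differ, so "|| true" keeps a
# "set -e" script running.
changes=$(diff "$workdir/old_services" "$workdir/services" || true)
printf '%s\n' "$changes"

# Lines that start with "< " come from the first file.
changed_old=$(printf '%s\n' "$changes" | grep -c '^< ')
echo "changed lines in the first file: $changed_old"

# Comparing a file with an identical copy produces no output at all.
cp "$workdir/services" "$workdir/services.copy"
no_changes=$(diff "$workdir/services" "$workdir/services.copy" || true)

rm -rf "$workdir"
```

The first line of the captured output, 2c2, means that line 2 of the first file must be changed to match line 2 of the second file.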

Using the sum Command


You use the sum command to calculate and print a 16-bit checksum value
for a file. The sum command also prints the number of 512-byte blocks in
the file. You can use the sum command to validate a file that is
transferred over a transmission line.
The following syntax displays the options for the sum command:
sum [-r] [file ...]
Refer to the online man pages for more information on the options supported by the sum command.

For example, you run the sum command on a patch that you want to
install on your system. Then, you compare the checksum value generated
against the checksum value reported in the SunSolve Online service and
verify whether the patch was successfully downloaded.
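
The checksum comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the file names and contents are hypothetical, and the reference checksum is computed locally from the original file, whereas a real check would compare against the checksum value published for the patch.

```shell
#!/bin/sh
# Sketch: confirm that a copied file matches the original by comparing
# sum checksums. File names and contents are hypothetical.
workdir=$(mktemp -d)
printf 'example patch payload\n' > "$workdir/patch.orig"
cp "$workdir/patch.orig" "$workdir/patch.copy"

# Reading from standard input keeps the file name out of the output,
# so the first field is always the checksum.
expected=$(sum < "$workdir/patch.orig" | awk '{print $1}')
actual=$(sum < "$workdir/patch.copy" | awk '{print $1}')

if [ "$expected" = "$actual" ]; then
    verdict="checksum matches"
else
    verdict="checksum differs"
fi
echo "$verdict"

# A copy with one altered byte produces a different checksum.
printf 'example patch payloae\n' > "$workdir/patch.bad"
bad=$(sum < "$workdir/patch.bad" | awk '{print $1}')

rm -rf "$workdir"
```

Both checksums must come from the same implementation of the sum command, because different implementations can use different algorithms.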


Using the CPU and Memory Management Commands


When diagnosing problems in Sun systems, you must check the CPU and
memory. In the Solaris OE, you use the following commands to examine
CPU and memory activity:

The ps command

The vmstat command

The psrinfo command

The mpstat command

The modinfo command

The pgrep command

Note To execute the CPU and memory management commands, you
require user rights on the system.

Using the ps Command


You use the ps command to monitor the processes running on a system.
Without options, the ps command prints information about the processes
that have the same effective user ID (UID) and the same controlling
terminal as the invoker. The output of the ps command displays the
process ID (PID), the terminal identifier, the cumulative execution time,
and the command name.
The Solaris OE provides the following versions of the ps command:

The /usr/bin/ps version

The /usr/ucb/ps version

Depending on the path name that you use, the command options and the
output of the command differ.

Note The Solaris OE also provides the Berkeley Software Distribution
(BSD) version of the ps command as the /usr/ucb/ps command. The
BSD version provides a detailed performance-related summary. The
/usr/ucb/ps command collects all the process data at one time, sorts the
data based on CPU usage, and displays the result.



For example, to display information about the processes running on the
system, you type the following command:
$ ps -ef | more
     UID   PID  PPID  C    STIME TTY      TIME CMD
    root     0     0  0 16:10:51 ?        0:15 sched
    root     1     0  0 16:10:51 ?        0:00 /etc/init
    root     2     0  0 16:10:51 ?        0:00 pageout
    root     3     0  0 16:10:51 ?        1:41 fsflush
    root   311     1  0 16:13:40 ?        0:00 /usr/lib/saf/sac -t 300
    root   232     1  0 16:13:33 ?        0:00 /usr/lib/utmpd
    root   190     1  0 16:13:31 ?        0:00 /usr/sbin/syslogd
    root    54     1  0 16:13:05 ?        0:00 /usr/lib/sysevent/syseventd
............<output truncated>

Table 5-4 lists the fields of the preceding output and their descriptions.
Table 5-4 Fields and Descriptions of the Output of the ps Command

Field   Description

UID     The name of the user who initiates the process.
PID     The PID number that is assigned by the system.
PPID    The PID of the parent process.
C       An obsolete value that is retained for backward
        compatibility. The value was originally used for
        processor scheduling. This value is not printed when
        you use the -c option.
STIME   The starting time or date of the process.
TTY     The name of the controlling terminal of the process.
TIME    The total amount of CPU time that the process has
        already used.
CMD     The command that is being executed.

Note The ps command takes a snapshot of the current process list. By
the time the data appears, the process list might change significantly.



You use the output of the ps command to perform the following tasks:

Determine which resource is causing a bottleneck for the process: A
process that has a low TIME value along with an STIME value that
occurred several minutes or hours ago is probably blocked on the
I/O queue.

Check process status: You use the status (S) field, which is displayed
when you specify the -l option, to check the status of different processes.

Determine how to set process priorities: You use the
/usr/bin/nice command to set process priorities. You use the
priocntl command to display or set the scheduling parameters of
specified processes.

Using the vmstat Command


The vmstat command reports virtual memory statistics regarding kernel
threads, virtual memory, disks, traps, and CPU activity. You use the
vmstat command to retrieve information about the paging statistics of a
system. The vmstat command also displays the amount of free memory
on the system.

Note On multiprocessor systems, the vmstat command averages the CPU
statistics across all of the processors in the output.

Inform students that they should refer to the mpstat command for information on the statistics of each
processor. The mpstat command displays information about CPU usage and the frequency of occurrence for
events, such as interrupts, page faults, and locking. Ask students to execute the mpstat command and view
the output displayed on the screen.

If you do not specify any options with the vmstat command, the
command displays a one-line summary of the virtual memory activity
that occurred since system startup.
The following is a sample output of the vmstat command:
# vmstat
 kthr      memory            page             disk          faults      cpu
 r b w   swap  free  re  mf  pi  po  fr de sr f0 s1 s2 s3   in   sy   cs us sy id
 0 0 0  15020  4304   9  58 198 228 220  0  3  0 16         86 1173   46 24 30 46



Table 5-5 lists the fields of the preceding output and their descriptions.
Table 5-5 Fields and Descriptions of the Output of the vmstat Command

Field    Description

kthr     The number of kernel threads in different states. Each
         user lightweight process (LWP) uses a kernel thread.
memory   The usage of virtual and real memory.
page     The page faults and paging activity.
disk     The number of disk operations per second.
faults   The trap and interrupt rates per second.
cpu      The percentage usage of CPU time.

Consider the following sample output of the vmstat utility:


# vmstat 3 4
kthr
memory
r b w
swap free re mf
sy id
29 0 0 444512 8080 18 79
94 1
33 0 0 446080 8376 19 88
91 3
35 0 0 447876 9980
3 122
94 1
30 0 0 449176 11032 0 111
94 0

page
disk
pi po fr de sr s0 s1 s4 --

faults
in
sy

cpu
cs us

32 11 11

0 25

565 2283 1180

24 27 27

2 35

588 1900

946

51

0 21

550 1943 1142

0 10

423 1496

901

In the preceding output, observe the values in the r field of the kthr
section and the id field of the cpu section. These values indicate a large
number of kernel threads in the run queue and a low value for CPU idle time,
respectively. This information suggests that the system is overloaded with
multiple processes waiting for simultaneous execution. Additional
investigation would be required to determine if this is the actual case.
The sr field, which represents the scan rate of the system, also helps you
to identify any problems with memory. A sustained nonzero value for this
field suggests that the system is short of memory.
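
The checks described above (run queue, scan rate, and idle time) can be sketched as a small awk filter. This is an illustrative sketch, not a tuning rule: the two sample rows are made up, the field positions assume the 22-column layout shown in the header of the sample (the number of disk columns varies between systems), and the thresholds, including the processor count passed as an argument, are assumptions.

```shell
#!/bin/sh
# Sketch: flag vmstat samples that show a long run queue, a nonzero scan
# rate, or little idle time. Sample rows and thresholds are hypothetical.
vmstat_sample() {
cat <<'EOF'
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s4 --   in   sy   cs us sy id
 30 0 0 440000 8000 10 80 30 10 10 0 20 0 0 0 0 500 2000 1000 5 94 1
 0 0 0 512000 90000 0 5 0 0 0 0 0 0 0 0 0 120 300 150 2 3 95
EOF
}

# Field 1 is r, field 12 is sr, and field 22 is id in this layout.
# The first argument is the assumed number of processors (default 1).
check_vmstat() {
    awk -v ncpu="${1:-1}" '
        $1 ~ /^[0-9]+$/ {
            n++
            msg = ""
            if ($1 > ncpu) msg = msg " run-queue=" $1
            if ($12 > 0)   msg = msg " scan-rate=" $12
            if ($22 < 10)  msg = msg " idle=" $22 "%"
            if (msg == "") msg = " ok"
            print "sample " n ":" msg
        }'
}

report=$(vmstat_sample | check_vmstat 1)
printf '%s\n' "$report"
```

On a live system, the same filter could read the output of a command such as vmstat 3 4 instead of the sample function, provided that the field positions are adjusted to match the columns that the system actually prints.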


Using the psrinfo Command


The psrinfo command displays information about configured
processors. This command displays information, such as whether the
processor is online, offline, noninterruptible, or powered off and when the
status of the processor last changed.
The following syntax displays the options for the psrinfo command:
psrinfo [-v] [processor_id ...]
psrinfo -s processor_id
The psrinfo command supports the following options:

-v Specifies the verbose mode option.

-s processor_id Specifies the silent mode option. The -s option
displays the value 1 if the specified processor is fully online
and the value 0 if the specified processor is noninterruptible, offline,
or powered off.

Note Use the -s option when you use the psrinfo command in shell
scripts.

Refer to the online man pages for information on the operand supported by the psrinfo command.

The -s option with the psrinfo command determines whether a specified
processor is enabled on a system. For example, if processor 2 is enabled, the
following command displays a value of 1:
# psrinfo -s 2
1
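
Because the -s option prints only 1 or 0, its result is easy to test in a shell script. The following sketch shows one way to do this. The processor_state helper and the processor ID are hypothetical, and the fallback value is used only so that the sketch also runs on a system that does not provide the psrinfo command.

```shell
#!/bin/sh
# Sketch: interpret the value printed by psrinfo -s in a script.
# processor_state is a hypothetical helper that maps the value to text.
processor_state() {
    case "$1" in
        1) echo "online" ;;
        0) echo "not fully online" ;;
        *) echo "unknown" ;;
    esac
}

cpu_id=2    # hypothetical processor ID
if command -v psrinfo >/dev/null 2>&1; then
    # On a Solaris system, query the real processor state.
    state=$(psrinfo -s "$cpu_id" 2>/dev/null) || state=""
else
    # Elsewhere, use a placeholder value so that the sketch still runs.
    state=1
fi
echo "processor $cpu_id: $(processor_state "$state")"
```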

Using the mpstat Command


You use the mpstat command to display information about the activity of
individual CPUs. The mpstat command displays the following
information:

Page faults, interrupts, and thread migrations to other processors

CPU usage, system calls, locking, and context switches



The mpstat command also provides the frequency of events, such as
interrupts, page faults, and locking. The output of the mpstat command
reports the performance of each CPU on the system and is useful for
locating CPU-related performance problems.
The following is a sample output of the mpstat utility generated on a
system with a single CPU:
# mpstat 3 5
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   13   1    0   435  323  256   13    0    0    0   370    1   1   1  97
  0    3   0    0   759  548  658  286    0    2    0  8007   57  30   1  11
  0    0   0    0   730  531  642  284    0    1    0  7843   59  32   2   7
  0    0   0    0   535  434  551  236    0    1    0  9016   79  21   0   0
  0    0   0    0   533  432  543  235    0    0    0  8695   83  17   0   0
In the preceding output, observe that after the first sample, the usr and
sys fields have large values and the idl field has a low value. This
information indicates that the CPU is almost fully occupied with executing
user and system code.
Table 5-6 lists the fields of the preceding output and their descriptions.
Table 5-6 Fields and Descriptions of the mpstat Command

Field   Description

CPU     The CPU ID on the system
minf    The number of minor faults
mjf     The number of major faults
xcal    The number of interprocessor cross-calls
intr    The number of interrupts
ithr    The number of interrupts handled as threads
csw     The number of context switches
icsw    The number of involuntary context switches
migr    The number of thread migrations to other processors
smtx    The number of spins on mutual exclusion locks (mutexes),
        that is, the number of times the CPU fails to obtain a
        mutex on the first try
srw     The number of spins on reader and writer locks that
        fail on the first try
syscl   The number of system calls
usr     The percentage of CPU user time
sys     The percentage of CPU system time
wt      The percentage of CPU wait time
idl     The percentage of CPU idle time

The mpstat command displays processor information in a tabular format.
Each row of the table represents the activity of one processor. Event
counts are reported as rates (events per second). The usr, sys, wt, and
idl fields are reported as percentages.

Using the modinfo Command


You use the modinfo command to display information about the kernel
modules loaded on the system. For example, to display the status of
module 10, you type the following command:
# modinfo -i 10
 Id Loadaddr   Size Info Rev Module Name
 10  1166dd8  29db9    2   1  ufs (filesystem for ufs)
Refer to the online man pages for information on the options supported by the modinfo command.



Table 5-7 lists the fields of the preceding output and their descriptions.
Table 5-7 Fields and Descriptions of the modinfo Command

Field         Description

Id            The module ID.
Loadaddr      The starting text address, in hexadecimal.
Size          The size of text, data, and Block Started by Symbol
              (bss), in hexadecimal bytes.
              The bss section is comparable to the data and text
              sections that are used in assembly language programming.
              The bss section is a part of an object file. Object files
              consist of the following three main sections:

              .text   Contains the program code. The .text section
                      is set to read-only mode.
              .data   Contains the initialized data that is allocated
                      space in the program.
              .bss    Contains uninitialized data space. The .bss
                      section of an object file acts as a method of
                      compression. All data values that have a bit
                      image of zeros are collected in the virtual image
                      during compilation and linking. When the final
                      program files are written onto the disk, the file
                      does not store the zeros, because the kernel
                      zero-fills a page on demand when the .bss section
                      is referenced. In large programs, this approach
                      helps to save the disk space that is required to
                      store an executable file.
Info          Module-specific information.
Rev           The revision number of the loadable modules system.
Module Name   The file name and description of the module.


Using the pgrep Command


You use the pgrep command to examine the active processes on the
system. You also use the pgrep command to report the process IDs of the
processes whose attributes match the criteria specified at the command
line.
For each attribute option, you specify a set of values separated by commas
on the command line. For example, to list the processes owned by the
root or daemon user, you type the following command:
# pgrep -u root,daemon
Consider another example in which you want to obtain the PID of the
sendmail utility. Type the following command to obtain the PID:
# pgrep -x -u root sendmail
283


Using the Network Management Commands


In the Solaris OE, you use the following network management commands
to analyze or display information about network usage:

The ping command

The traceroute command

The ifconfig command

The arp command

The netstat command

The snoop command

Note While you can view information about the status or configuration
of the network with user rights, you must have superuser access to make
changes to the parameters of network management commands.

Using the ping Command


You use the ping command to determine if a host can communicate with
another host over the network. The ping command sends Internet Control
Message Protocol (ICMP) echo request datagrams to a network host and
waits for the ICMP echo reply datagrams.
You use the ping command to perform the following tasks:

Determine the status of network hosts

Track and isolate hardware and software problems for managing
networks

Refer to the online man pages for information on the options supported by the ping command.

If a network host responds, the ping command displays the following
output:
host is alive
Ask students to execute the ping command on their respective systems to verify whether the system
receives any response from the network host.



If a network host does not respond even after the default time-out value
(20 seconds) expires, the ping command displays the following output:
no answer from host
With the -s option, the ping command sends one datagram per second and
prints one line of output for every ICMP echo response that it receives.
No output is displayed until the target host sends an ICMP echo response,
which makes the option useful when you attempt to contact a remote host
that might not be available. For example, to gauge the load on the network
and the speed of a link, you execute the following command:
# ping -s Hammer
PING Hammer: 56 data bytes
64 bytes from hammer(172.16.128.101): icmp_seq=0. time=1. ms
64 bytes from hammer(172.16.128.101): icmp_seq=1. time=0. ms
64 bytes from hammer(172.16.128.101): icmp_seq=2. time=0. ms
64 bytes from hammer(172.16.128.101): icmp_seq=3. time=0. ms
^C
--- Hammer ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0/0/1 ms
The preceding output indicates the time that each packet takes to make the
round trip over the network. The output also provides the minimum, average,
and maximum round-trip times and any packet loss that might occur due to
reasons, such as network congestion.
Another method for troubleshooting network problems by using the ping
command is to send ICMP echo requests to the entire network. To do this,
you use the broadcast address as the target host. You use the -s option to
get information about the systems that are available on the network.

Using the traceroute Command


You use the traceroute command to display the route followed by
network packets to reach a host on the network from the local system. If
you suspect that a host is not reachable, you use the traceroute command
to determine how far along the route a data packet travels.
Refer to the online man pages for information on the options supported by the traceroute command.



The traceroute command uses both the Internet Protocol version 4
(IPv4) and Internet Protocol version 6 (IPv6) protocols to display the route
of a network packet.
Inform students that the traceroute command uses the ttl (time to live) field of the IPv4 protocol or the
hop limit field of the IPv6 protocol to determine the route of a network packet.

Note The traceroute command attempts to elicit an ICMP or ICMP6
TIME_EXCEEDED response from each gateway along the path and a
PORT_UNREACHABLE response from the destination host.

The traceroute command is especially useful for determining problems
in routing configuration and routing path failures. For example, if a
particular host is unreachable, you use the traceroute command to
identify the path followed by a packet to reach the host.
The traceroute command requires the name or IP address of the
destination system.
The following is a sample output of the traceroute command:
# traceroute 172.17.22.93
traceroute to 172.17.22.93 (172.17.22.93), 30 hops max, 40 byte packets
1 172.17.66.1 (172.17.66.1) 0.873 ms 0.670 ms 0.596 ms
2 172.17.64.2 (172.17.64.2) 2.079 ms 2.123 ms 2.301 ms
3 172.16.101.26 (172.16.101.26) 4.572 ms 4.187 ms 4.156 ms
4 172.17.8.30 (172.17.8.30) 4.108 ms 4.125 ms 4.048 ms
5 172.17.22.93 (172.17.22.93) 4.553 ms 4.533 ms 4.431 ms
The preceding output is generated when the traceroute command
executes successfully. The output of the traceroute command also
displays the number of times a packet hops before reaching the
destination. For example, in the preceding output, the data packet hops
five times before reaching the destination computer.
The following output is generated when the traceroute command fails:
traceroute: Warning: Multiple interfaces found; using 172.17.22.93 @ hme0
traceroute to 172.16.32.100 (172.16.32.100), 30 hops max, 40 byte packets
1 172.17.22.1 (172.17.22.1) 1.057 ms 0.875 ms 0.790 ms
2 172.17.22.1 (172.17.22.1) 0.808 ms !H 0.810 ms !H 0.812 ms !H
In the preceding output, the !H annotation indicates that the host is
unreachable and the network packet did not proceed beyond the default
gateway.


Using the ifconfig Command


You use the ifconfig command to display and configure the network
interface parameters. You also use the ifconfig command to assign an
Internet Protocol (IP) address to a network interface. The ifconfig
command also analyzes the status of network interfaces.
The ifconfig command is useful when troubleshooting network
problems. You can use this command to display the current status of an
interface, including the settings for the following fields:

Address family

IP address

Netmask

Broadcast address

Ethernet address (MAC address)

Refer students to the online man pages for information on the options supported by the ifconfig command.

The Solaris OE provides two versions of the ifconfig command:

The /sbin/ifconfig command

The /usr/sbin/ifconfig command

The /sbin/ifconfig and /usr/sbin/ifconfig commands behave
differently with respect to name services. You cannot change the order in
which the /sbin/ifconfig command references names when the system
boots. However, if you change the /etc/nsswitch.conf file, the behavior
of the /usr/sbin/ifconfig command might be affected.
You use the ifconfig command during system startup to define the
network address of each interface on the system. You can also use the
ifconfig command to redefine the address of an interface or other
operating parameters.

Note The boot scripts that execute the ifconfig command reside in the
/etc/rcS.d and /etc/rc2.d directories.



If you specify the -a option, the ifconfig command displays the current
configuration for all network interfaces. If you specify an address family,
the ifconfig command reports only the details that are specific to the
address family. The following is a sample output of the ifconfig -a
command:
# ifconfig -a
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index
        inet 172.17.22.93 netmask ffffff80 broadcast 172.17.22.127
        ether 8:0:20:f9:12:25
qfe0: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 172.16.32.100 netmask ffffff00 broadcast 172.16.32.255
        ether 8:0:20:f9:12:25
qfe1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index
        inet 172.16.64.100 netmask ffffff00 broadcast 172.16.64.255
        ether 8:0:20:f9:12:25
qfe2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index
        inet 0.0.0.0 netmask ff000000
        ether 8:0:20:f9:12:25
qfe3: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index
        inet 172.16.128.100 netmask ffffff00 broadcast 172.16.128.255
        ether 8:0:20:f9:12:25

Note Only a root user can view the Ethernet address of a network
interface and has the permission to modify the configuration of a network
interface.
You can also use the ifconfig command to configure a network
interface. To configure an interface, you specify the interface name using
the plumb keyword.

Note The plumb keyword opens the devices associated with the physical
interface name and sets up the streams required by the IP to use the
device.
For example, to configure the hme0 interface, you type the following
command:
# ifconfig hme0 plumb



Inform students that when they use the ifconfig command with the network interface name (hme0, qfe0),
the output displays the IP address, netmask, and MAC address of that particular interface. If they use the
ifconfig command with the -a option, the output displays information about all the network cards.
Ask students to run the ifconfig commands on their systems to check if the network interface card is
functional on their systems.

You also use the plumb keyword when troubleshooting the interfaces that
you add and configure manually. Often, an interface reports that it is
functional, but a snoop session from another host shows that no traffic is
flowing out of that interface. The plumb keyword helps to resolve this
communication problem.

Using the arp Command


You use the arp command to display and modify the Address Resolution
Protocol (ARP) tables of the kernel. For example, when you change an
Ethernet card on a client, the MAC address for the client changes. In this
case, you use the arp command to update the arp table on the server to
allow continued access to the server.
The arp command supports the following options:

-a Displays all ARP entries in the table

-d Deletes an entry from the ARP table

-s Adds an entry to the table

Refer to the online man pages for information on the options supported by the arp command.
Inform students that they can use the Internet dot notation to specify the host by either name or number.



The following is the output of the arp -a command:
# arp -a
Net to Media Table: IPv4
Device IP Address               Mask            Flags Phys Addr
------ ------------------------ --------------- ----- -----------------
qfe1   172.16.64.102            255.255.255.255       00:a0:c9:36:54:3b
qfe2   172.16.96.102            255.255.255.255       00:00:21:27:bb:90
hme0   172.17.22.1              255.255.255.255       00:01:30:51:b0:00
hme0   172.17.22.7              255.255.255.255       00:10:5a:9f:af:21
qfe3   sun3.atlantic.oceans.com 255.255.255.255 SP    08:00:20:f9:12:25
hme0   172.17.22.43             255.255.255.255       00:10:5a:9f:ae:dc
hme0   172.17.22.35             255.255.255.255       00:50:ba:89:d8:3c
hme0   172.17.22.33             255.255.255.255       00:80:48:d6:de:b7
hme0   172.17.22.56             255.255.255.255       00:50:ba:d7:27:b8
qfe1   sun1.pacific.oceans.com  255.255.255.255 SP    08:00:20:f9:12:25

Inform students that they can refer to the example for the definitions of the flags in the arp table.

Each entry in the ARP table might have the following flags associated
with it:

Publish (P) Includes the entries that you use to respond to ARP
requests for this address.

Static (S) Includes the entries that are manually inserted and are
not defined by the ARP protocol.

Unresolved (U) Includes the entries in which an ARP request for
this address was sent but no response was received.

Mapping (M) Includes the entries that you use to map to the
Ethernet multicast MAC addresses in the range of 01:00:5e:00:00:00
through 01:00:5e:ff:ff:ff.

You can use the arp utility when attempting to locate network problems
that relate to duplicate IP address usage. For example, you complete the
following steps to determine if the correct system is responding:

1. Determine the Ethernet address of the target host. To do this, use the
banner command at the ok prompt or the ifconfig utility at a shell
prompt on a Sun system.

2. Determine if you can reach the target host by using its IP address
with the ping command.

3. Use the arp utility and verify that the arp table reflects the correct
Ethernet (MAC) address.


Using the netstat Command


You use the netstat command to display the network status, such as
network utilization and network traffic. You also use the netstat
command to analyze information on network tuning, such as active
routes and the status of interfaces, routing tables, and various protocols.
The netstat utility has several reports that include information on the
following:

Interface state, socket state, and DHCP

Routing tables

STREAMS statistics

The netstat command for the Solaris 9 OE can also generate reports for
the IPv6 protocols. The -i option with the netstat command shows the
state of the interfaces that the system uses for IP traffic.
The following is a sample output of the netstat -i command:
# netstat -i
Name  Mtu  Net/Dest     Address      Ipkts  Ierrs Opkts  Oerrs Collis Queue
lo0   8232 loopback     localhost    17367  0     17367  0     0      0
hme0  1500 SUN          SUN          150518 84    161769 42    1532   0
qfe0  1500 router-qfe0  router-qfe0  6821   0     4658   0     115    0
qfe1  1500 router-qfe1  router-qfe1  4241   0     1156   0     0      0
qfe2  1500 router-qfe2  router-qfe2  0      0     952    0     0      0
qfe3  1500 router-qfe3  router-qfe3  902    0     1989   0     0      0
........


Table 5-8 lists the fields of the preceding output and their descriptions.
Table 5-8 Fields and Descriptions of the netstat -i Command

Field      Description

Name       The name of the network interface
Mtu        The maximum size packet that will be transmitted by
           this interface
Net/Dest   The network (subnet) or destination address
Address    The name of the system
Ipkts      The number of packets received
Ierrs      The number of received packets that have errors
Opkts      The number of packets sent
Oerrs      The number of sent packets that have errors
Collis     The number of output packets that result in collisions
Queue      The number of queued packets

In the preceding output, the Queue field should have a value of zero, and
the value in the Collis field should not be greater than five percent of the
Opkts field. In addition, the value of the Ierrs field should be zero or, at
most, less than one percent of the Ipkts field.
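
The percentage checks on the Collis and Ierrs fields can be worked through with a short awk filter. This is a sketch under stated assumptions: the sample rows copy three interfaces from the preceding netstat -i output, the field positions assume that ten-column layout, and the check_netstat name is hypothetical.

```shell
#!/bin/sh
# Sketch: compute collisions as a percentage of Opkts and input errors
# as a percentage of Ipkts for each interface in netstat -i style output.
netstat_sample() {
cat <<'EOF'
Name  Mtu  Net/Dest     Address      Ipkts  Ierrs Opkts  Oerrs Collis Queue
lo0   8232 loopback     localhost    17367  0     17367  0     0      0
hme0  1500 SUN          SUN          150518 84    161769 42    1532   0
qfe0  1500 router-qfe0  router-qfe0  6821   0     4658   0     115    0
EOF
}

# Field 5 is Ipkts, field 6 is Ierrs, field 7 is Opkts, field 9 is Collis.
# Rows with no input or output packets are skipped to avoid dividing by zero.
check_netstat() {
    awk 'NR > 1 && $5 > 0 && $7 > 0 {
        coll = 100 * $9 / $7
        ierr = 100 * $6 / $5
        status = "ok"
        if (coll > 5 || ierr >= 1) status = "check"
        printf "%s collisions=%.2f%% input-errors=%.2f%% %s\n", $1, coll, ierr, status
    }'
}

report=$(netstat_sample | check_netstat)
printf '%s\n' "$report"
```

For the hme0 interface, 1532 collisions out of 161769 output packets is about 0.95 percent, and 84 input errors out of 150518 input packets is about 0.06 percent, so both values are within the limits given above.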
The information generated by the netstat command is critical for tuning
your network because the netstat command reports data on network
usage and network traffic. However, it is difficult to interpret the output
of the netstat command for a system with many network interfaces
because of the complexity of the output.



To view the state of the kernel routing tables on the system, use the
netstat -r command. The following is a sample output of the
netstat -r command:
# netstat -r
Routing Table: IPv4
  Destination            Gateway              Flags  Ref    Use  Interface
---------------------- -------------------- ----- ----- ------ ---------
172.17.22.0            SUN                  U         1     13  hme0
172.16.32.0            172.16.32.100        U         1   1792  qfe0
172.16.64.0            172.16.64.100        U         1      3  qfe1
172.16.96.0            172.16.96.100        U         1      6  qfe2
172.16.128.0           172.16.128.100       U         1      1  qfe3
172.17.0.0             172.17.22.1          UG        1     46
BASE-ADDRESS.MCAST.NET SUN                  U         1      0  hme0
localhost              localhost            UH       44  59327  lo0
The Flags field in the preceding output shows the status of each route.
The Flags field can have the following values:

U Indicates that the route is up

G Indicates that the route uses a gateway

H Indicates that the route is to a host system

D Indicates that the route was created dynamically by a redirect

You use the netstat -s command to display the statistics related to the
Transmission Control Protocol (TCP), IP, ICMP, and Internet Group
Management Protocol (IGMP).
Ask students to execute the netstat -s command on their machines and read the statistics related to
various protocols.


Using the snoop Command


When you transfer data over a network, the data packets might be
damaged. To check the validity of data packets, you use the snoop
command. The snoop command captures data packets from the network
and displays their contents.

Note You can use the snoop command to display network packets while
they are received or save them to a file for later analysis.

Inform students that the file in which you save the captured network packets must be RFC 1761-compliant.

The snoop command performs the following tasks:

Logs or displays packets selectively

Provides accurate time stamps for checking the response time of the
network Remote Procedure Call (RPC)

Formats packets and protocol information in a user-friendly manner

Refer to the online man pages for information on the options supported by the snoop command.

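To capture and display only a specific type of network packet, you
provide a filter expression. The snoop command then captures and
displays only the packets that match the expression. To select the
network interface on which packets are captured, you use the -d option.
For example, to track packets that pass through the hme0 adapter, you
type the following command. If you add the -o capture_snoop option, the
snoop command saves the packets to the capture_snoop file instead of
displaying them.
# snoop -d hme0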
Using device /dev/hme (promiscuous mode)
172.17.20.50 -> 172.17.22.15 TCP D=2189 S=3615 -----P Ack=3218425283
Seq=4142908314 Len=349 Win=17026
172.17.20.50 -> 172.17.22.15 TCP D=2189 S=3615 -F---- Ack=3218425283
Seq=4142908663 Len=0 Win=17026
172.17.22.15 -> 172.17.20.50 TCP D=3615 S=2189 ------ Ack=4142908664
Seq=3218425283 Len=0 Win=17171
172.17.22.15 -> 172.17.20.50 TCP D=3615 S=2189 -F---- Ack=4142908664
Seq=3218425283 Len=0 Win=17171
172.17.20.50 -> 172.17.22.15 TCP D=2189 S=3615 ------ Ack=3218425284
Seq=4142908664 Len=0 Win=17026
172.17.22.15 -> 172.17.20.50 UDP D=1745 S=2024 LEN=180
172.17.20.50 -> 172.17.22.43 TCP D=1111 S=3749 -----P Ack=43343
Seq=4142236211 Len=1460 Win=16963
........<output truncated>
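
When a capture is long, a short summary can be easier to read than the raw lines. The following sketch assumes the summary-line format shown in the sample output and a POSIX awk, and counts packets per source address. The function name count_by_source is hypothetical. Continuation lines that begin with Seq= are ignored because their second field is not ->.

```shell
#!/bin/sh
# Hypothetical helper: count snoop summary lines per source address.
# Reads the files named as arguments, or standard input if none are given.
count_by_source() {
    awk '$2 == "->" { n[$1]++ } END { for (s in n) print s, n[s] }' "$@" | sort
}

# Summary lines copied from the sample output in this section.
count_by_source <<'EOF'
172.17.20.50 -> 172.17.22.15 TCP D=2189 S=3615 -F---- Ack=3218425283
Seq=4142908663 Len=0 Win=17026
172.17.22.15 -> 172.17.20.50 TCP D=3615 S=2189 ------ Ack=4142908664
Seq=3218425283 Len=0 Win=17171
172.17.22.15 -> 172.17.20.50 UDP D=1745 S=2024 LEN=180
EOF
```

For this input, the function reports one packet from 172.17.20.50 and two packets from 172.17.22.15.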


Using the General-Purpose Commands


You use the general-purpose commands to analyze or display information
about the contents of a file. These commands also display information
about the system status or configuration. Normal user privileges are
sufficient to run most of the general-purpose commands.
The Solaris OE contains the following general-purpose commands:

The find command

The script command

The file command

The tail command

The uname command

The showrev command

The prtconf command

The sysdef command

The nm command

The swap command

Using the find Command


You use the find command to search for specific files in the file system
structure.
Inform students that they can refer to the online man pages for information on the options supported by the
find command.

The following is the syntax for the find command:


find / -name <filename> -print
For example, to locate the dtterm command, you type the following
command:
$ find / -name dtterm -print
/usr/share/lib/terminfo/d/dtterm
/usr/dt/bin/dtterm
/usr/dt/share/examples/dtterm
/opt/sfw/share/terminfo/d/dtterm
$
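
A search that starts at the root directory can take a long time and, for a normal user, can print permission errors for unreadable directories. The following sketch shows the same kind of search restricted to regular files, with error messages discarded. It runs against a throwaway directory tree so that it is safe to try anywhere. The directory layout is made up for the demonstration, and the sketch assumes that the mktemp command is available.

```shell
#!/bin/sh
# Build a small throwaway tree (layout is made up for this demonstration).
demo=$(mktemp -d)
mkdir -p "$demo/usr/dt/bin" "$demo/usr/share/lib"
: > "$demo/usr/dt/bin/dtterm"
: > "$demo/usr/share/lib/dtterm"
: > "$demo/usr/share/lib/other"

# -type f limits matches to regular files; 2>/dev/null hides
# "permission denied" messages from unreadable directories.
find "$demo" -type f -name dtterm -print 2>/dev/null | sort
```

When you are finished, you can remove the demonstration tree with rm -rf "$demo".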


Using the script Command


You use the script command to record everything that is displayed on
the screen during a terminal session. The script command writes the
output to the file that you name. If you do not specify a file name, the
script command saves the output in a file named typescript.
Refer to the online man pages for information on the options supported by the script command.

When you use the -a option, the script command appends the output to
the file instead of overwriting it.

Note The script command records all the output, including the
prompts displayed on the screen, in the file.
The following is the syntax for the script command:
script <file-name>
For example, to store the location of the dtterm files to a script file called
dtterm_location, you type the following:
$ script dtterm_location
Script started, file is dtterm_location
$ find / -name dtterm -print
...
$ exit
Script done, file is dtterm_location

Note To view a script file, type more filename.


Using the file Command


You use the file command to determine the type of a file. The file
command performs a series of tests on each file and classifies it as a
particular file type, such as a first-in first-out (FIFO) file, a block
special file, or a character special file. If a regular file has zero
length, the file command identifies it as an empty file.
Refer students to the online man pages for information on the options supported by the file command.

If the file appears to be a text file, the file command examines the first
512 bytes and tries to determine the type of the file. To do this, the file
command uses the control file /etc/magic, which contains magic
numbers and sequences for different file formats and built-in rules for
natural languages. If the file is a symbolic link, by default, the file
command follows the link and tests the file referred to by the symbolic
link.
The following is the syntax for the file command:
file <filename>
In the following example, the file command displays the type of the
/etc/hosts file.
# file /etc/hosts
/etc/hosts:     ascii text

Note The file command does not use the name of the file to determine
the file type.

Using the tail Command


You use the tail command to display the last part of a file.
Refer students to the online man pages for more information on the options supported by the tail command.

The following is the syntax of the tail command with the -f option:
tail -f <filename>



In this example, you use the tail command with the -f option. The
output displays the last 10 lines of the /var/adm/messages file and all
subsequent lines that are appended to the file.
# tail -f /var/adm/messages
Dec 27 14:33:59 sun pseudo: [ID 129642 kern.info] pseudo-device: vol0
Dec 27 14:33:59 sun genunix: [ID 936769 kern.info] vol0 is /pseudo/vol@0
Dec 27 14:34:03 sun scsi: [ID 193665 kern.info] sd0 at uata0: target 2 lun 0
Dec 27 14:34:03 sun genunix: [ID 936769 kern.info] sd0 is /pci@1f,0/pci@1,1/ide@3/sd@2,0
Dec 27 14:34:05 sun ebus: [ID 521012 kern.info] fd0 at ebus0: offset 14,3023f0
Dec 27 14:34:05 sun genunix: [ID 936769 kern.info] fd0 is /pci@1f,0/pci@1,1/ebus@1/fdthree@14,3023f0
Dec 27 14:34:06 sun ebus: [ID 521012 kern.info] se0 at ebus0: offset 14,400000
Dec 27 14:34:06 sun genunix: [ID 936769 kern.info] se0 is /pci@1f,0/pci@1,1/ebus@1/se@14,400000
Dec 27 14:34:10 sun pseudo: [ID 129642 kern.info] pseudo-device: pm0
Dec 27 14:34:10 sun genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
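
The -f option keeps the tail command running until you interrupt it. When you need only a fixed number of trailing lines, the -n option prints them and exits. The following sketch uses a temporary file with made-up contents rather than a real log file, and assumes that the mktemp command is available.

```shell
#!/bin/sh
# Create a 15-line demonstration file (contents are made up).
log=$(mktemp)
i=1
while [ "$i" -le 15 ]; do
    echo "line $i" >> "$log"
    i=$((i + 1))
done

# Print the last three lines and exit (no -f, so tail does not wait).
tail -n 3 "$log"
```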

Using the uname Command


You use the uname command to print information about the current
system. When you specify options with the uname command, symbols
representing one or more system characteristics are displayed. If no
options are specified, the uname command prints the name of the current
operating system.
The following syntax displays the options for the uname command:
uname [-aimnprsvX]
uname [-S system_name]
Refer to the online man pages for more information on the options supported by the uname command.



The uname command supports the following options:

-a Prints information that is currently available from the system.

-i Prints the name of the hardware platform.

-m Prints the name of the system hardware class.

-n Prints the node name. The node name is the name by which the
system is known to a communications network.

-p Prints the instruction set architecture (ISA) or type of processor
of the current host.

-r Prints the release level of the operating system.

-s Prints the name of the operating system.

-v Prints the version of the operating system.

-X Prints expanded system information, including system name,
node number, release number, version, machine type, number of
CPUs, bus type, serial number, users, original equipment
manufacturer (OEM) number, and origin number.

To display the kernel architecture, run the uname command with the -m
option.
# uname -m
sun4u
To display the node name, run the uname command with the -n option.
# uname -n
sun-sparc-1
To display system information, run the uname command with the -a
option.
# uname -a
SunOS sun 5.9 Beta sun4u sparc SUNW,Ultra-5_10
In the preceding output, information is displayed in the following order:
<system name> <node name> <release> <version>
<kernel architecture> <processor type> <hardware platform>
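
To make the field order concrete, the following sketch labels each field of the sample uname -a line. The function name label_uname_fields is hypothetical, and the function assumes the seven-field Solaris format shown above, with no white space inside any field. Other operating systems print a different set of fields.

```shell
#!/bin/sh
# Hypothetical helper: label the seven fields of one "uname -a" line.
# Assumes the Solaris field order and no white space inside a field.
label_uname_fields() {
    set -- $1    # split the line into positional parameters
    printf '%s: %s\n' "system name" "$1"
    printf '%s: %s\n' "node name" "$2"
    printf '%s: %s\n' "release" "$3"
    printf '%s: %s\n' "version" "$4"
    printf '%s: %s\n' "kernel architecture" "$5"
    printf '%s: %s\n' "processor type" "$6"
    printf '%s: %s\n' "hardware platform" "$7"
}

label_uname_fields "SunOS sun 5.9 Beta sun4u sparc SUNW,Ultra-5_10"
```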


Using the showrev Command


The showrev command displays revision information for the current
hardware and software installed on the system.
The following syntax displays the options for the showrev command:
/usr/bin/showrev [-a] [-p] [-w] [-c command] [-s hostname]
Refer to the online man pages for information on the options supported by the showrev command.

Without any arguments, the showrev command displays the system
revision information, including the host name, the host ID, the release,
the kernel architecture, the application architecture, the hardware
provider, the domain, and the kernel version. When you use the showrev
command with the -p option, the revision information about the patches
installed on a system is displayed.
Inform students that this is a partial output of the showrev command.

The following is a sample output of the showrev -p command that is
executed on an Ultra 10 workstation:
$ showrev -p
Patch: 109134-10 Obsoletes: Requires: 109318-06, 110386-01
    Incompatibles: Packages: SUNWwbapi, SUNWwbcor, SUNWwbcou, SUNWmgapp
Patch: 109889-01 Obsoletes: 109353-04 Requires: Incompatibles:
    Packages: SUNWkvmx, SUNWkvm, SUNWctu, SUNWhea, SUNWmdb, SUNWpstl,
    SUNWpstlx
Patch: 110370-01 Obsoletes: Requires: Incompatibles: Packages:
    SUNWkvmx, SUNWkvm, SUNWhea, SUNWmdb, SUNWpstl, SUNWpstlx
Patch: 110376-01 Obsoletes: Requires: Incompatibles: Packages:
    SUNWkvmx, SUNWkvm, SUNWhea, SUNWmdb, SUNWpstl, SUNWpstlx
Patch: 108528-05 Obsoletes: 108874-01, 109153-01, 109656-01, 109291-06,
    109663-01, 109309-02, 109345-02, 109313-02, 109880-01, 108966-06,
    108979-10, 109236-01, 109296-05, 109348-05, 109350-06, 109571-02,
    109801-02, 110096-05, 110118-02, 110134-02, 110121-01, 110132-02,
    110133-03, 110141-02, 110201-01, 110225-01, 110231-01
    Requires: 110383-01 Incompatibles: 109079-01
    Packages: SUNWkvmx, SUNWkvm, SUNWcsu, SUNWcsr, SUNWcslx, SUNWcsl,
    SUNWcarx, SUNWcar, FJSVhea, SUNWcpr, SUNWcprx, SUNWcsxu, SUNWdrr,
    SUNWdrrx, SUNWidn, SUNWidnx, SUNWpmu, SUNWpmr, SUNWpmux, SUNWarc,
    SUNWarcx, SUNWcstl, SUNWcstlx, SUNWhea, SUNWmdb, SUNWmdbx, SUNWsrh,
    SUNWtnfc, SUNWtnfcx
...............<Output truncated>

Note The showrev -p command displays both the current and obsolete
patches on the system.
If no patches are installed on the system, the showrev -p command
displays the following output:
$ showrev -p
No patches are installed
Ask students to run the showrev -p command on their systems and view the information about the patches
installed on the systems.

Note In the Solaris 9 OE, the showrev command is obsolete and the
patchadd -p command is used instead to list the installed patches.
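
The patch listing is long, so it is often filtered rather than read in full. The following sketch assumes that each patch record begins with the word Patch: followed by the patch ID, as in the sample output, and that a POSIX awk is available. The function names list_patch_ids and has_patch are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for text in the "Patch: <id> Obsoletes: ..." format.

# Print only the patch IDs (second field of each line that starts a record).
list_patch_ids() {
    awk '$1 == "Patch:" { print $2 }' "$@"
}

# Succeed if the given base patch number (without revision) is installed.
# $1 is a number such as 109889; the listing is read from standard input.
has_patch() {
    list_patch_ids | grep -q "^$1-"
}

# Two records copied from the sample output in this section.
list_patch_ids <<'EOF'
Patch: 109134-10 Obsoletes: Requires: 109318-06, 110386-01 Incompatibles: Packages: SUNWwbapi, SUNWwbcor, SUNWwbcou, SUNWmgapp
Patch: 110370-01 Obsoletes: Requires: Incompatibles: Packages: SUNWkvmx, SUNWkvm, SUNWhea, SUNWmdb, SUNWpstl, SUNWpstlx
EOF
```

On a live system, the same functions could read the output of the showrev -p command through a pipe, for example showrev -p | has_patch 109889. Patches that appear only in an Obsoletes field are not matched, because only the Patch: field is examined.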

Using the prtconf Command


You use the prtconf command to display information about system
configuration. The prtconf command displays information on various
devices and the status of device drivers, such as the PROM version, the
bus slots, the total amount of memory, and the configuration of system
peripherals formatted as a device tree.
The following syntax displays the options for the prtconf command on
SPARC systems:
/usr/sbin/prtconf [-V] | [-F] | [-x] | [-vpPD]
Refer to the online man pages for information on the options supported by the prtconf command.



The prtconf command supports the following options:

-V Displays platform-dependent PROM or boot system version
information. The -V option displays the output as a string. The
format of the string is arbitrary and platform-dependent.

-F Is supported only on SPARC systems. The -F option returns the
device path name of the console frame buffer.
For example, if the console frame buffer on a SPARCstation 1 is
cgthree in SBus slot #3, the command returns
/sbus@1,f80000000/cgthree@3,0. You use the -F option to create
a symbolic link for the /dev/fb file to the actual console device.

-x Reports whether the firmware on a system is 64-bit ready.

-v Specifies the verbose mode.

-p Displays the information that is derived from the device tree
provided by the firmware PROM on SPARC processors.

Inform students that the device tree information displayed using this option is a snapshot of the initial
configuration and might not accurately reflect the reconfiguration events that occur later.

-P Includes information about the pseudo devices. By default, the
information on the pseudo devices is omitted.

-D For each system peripheral in the device tree, displays the name
of the device driver that manages the peripheral.

Inform students that some existing platforms might require a firmware upgrade to run the 64-bit kernel of the
system.

Consider a scenario in which you must allocate memory resources on
your system. To do this efficiently, you must know the total amount of
physical memory available on your system. You can use the prtconf
command to determine the total physical memory.
Ask students to run the prtconf -v command to view the physical memory available on their systems.



The following is a partial output of the prtconf command on the
Sun Enterprise 450 server:
$ prtconf -v
System Configuration: Sun Microsystems sun4u
Memory size: 2048 Megabytes
System Peripherals (Software Nodes):
SUNW,Ultra-4
System properties:
name <relative-addressing> length <4>
value <0x00000001>.
name <MMU_PAGEOFFSET> length <4>
value <0x00001fff>.
name <MMU_PAGESIZE> length <4>
value <0x00002000>.
name <PAGESIZE> length <4>
value <0x00002000>.
...
The preceding output indicates that the total amount of physical memory
available in the system is 2048 Mbytes.
Inform students that if the system does not have enough RAM to run the workload, system performance
degrades rapidly. To tune the system for the optimum utilization of memory, you must understand the usage
of physical memory in the system.
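
When only the memory figure is needed, the Memory size line can be picked out of the prtconf output. The following sketch assumes the line format shown in the sample output and a POSIX awk. The function name memory_mbytes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the number from a "Memory size: N Megabytes"
# line. Reads the files named as arguments, or standard input if none.
memory_mbytes() {
    awk -F': *' '/^Memory size:/ { split($2, a, " "); print a[1]; exit }' "$@"
}

# Lines copied from the sample prtconf output in this section.
memory_mbytes <<'EOF'
System Configuration: Sun Microsystems sun4u
Memory size: 2048 Megabytes
System Peripherals (Software Nodes):
EOF
```

On a live system, the function could read the command output through a pipe, for example prtconf | memory_mbytes.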

Using the sysdef Command


You use the sysdef command to display the output of the current system
definition in a tabular format. The sysdef command lists all hardware
devices, pseudo devices, system devices, loadable modules, and the
values of selected kernel-tunable parameters.
The following syntax displays the options for the sysdef command:
/usr/sbin/sysdef [-i] [-n namelist]
/usr/sbin/sysdef [-h] [-d] [-i] [-D]
Refer to the online man pages for information on the options supported by the sysdef command.



The sysdef command generates output by analyzing the contents of
physical and virtual memory. You use the information displayed by the
sysdef command to tune or enhance the performance of a system.

Note The default namelist is /dev/kmem.

For example, you can use the -i option with the sysdef command to
display the configuration of the running system, including the values of
kernel-tunable parameters. The following is a partial output of the
sysdef -i command:
# sysdef -i
...
* IPC Shared Memory
*
1048576 max shared memory segment size (SHMMAX)
      1 min shared memory segment size (SHMMIN)
    100 shared memory identifiers (SHMMNI)
      6 max attached shm segments per process (SHMSEG)
*
* Time Sharing Scheduler Tunables
*
60      maximum time sharing user priority (TSMAXUPRI)
SYS     system class name (SYS_NAME)
#
Ask students to read the output of the sysdef command on their systems.
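
Because each tunable line in the sysdef output ends with the parameter name in parentheses, a single value can be looked up by name. The following sketch assumes that line format and an awk that supports the -v option (on the Solaris OE, this is likely nawk rather than the default awk). The function name tunable_value is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one tunable from sysdef-style text.
# $1 is the tunable name, for example SHMMAX; the text is read from stdin.
tunable_value() {
    awk -v name="($1)" '$NF == name { print $1; exit }'
}

# Lines copied from the sample sysdef output in this section.
tunable_value SHMMAX <<'EOF'
1048576 max shared memory segment size (SHMMAX)
      1 min shared memory segment size (SHMMIN)
    100 shared memory identifiers (SHMMNI)
EOF
```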

Using the nm Command


You use the nm command to print the name list of objects in archive
libraries or in the symbol table of the kernel. The nm command displays
the symbol table of each Executable and Linking Format (ELF) object file.
If no symbolic information is available for a valid input file, the nm
command reports that fact but does not consider it an error
condition.
Inform students that the object file can be an application or the kernel.

The following syntax displays the options for the nm command:


/usr/ccs/bin/nm [-ACDhlnPprRsTuVv] [-efox] [-g | -u]
[-t format] file ...
/usr/xpg4/bin/nm [-ACDhlnPprRsTuVv] [-efox] [-g | -u]
[-t format] file ...



Refer to the online man pages for information on the options supported by the nm command.

For example, to locate a symbol in the symbol table of the kernel, you
type the following command:
# /usr/ccs/bin/nm /dev/ksyms | more
/dev/ksyms:
[Index]   Value      Size    Type  Bind  Other Shndx   Name
[677]   |  16987860|       0|NOTY |LOCL |0    |ABS    |$done
[668]   |  16987880|       0|NOTY |LOCL |0    |ABS    |$done1
[669]   |  16987900|       0|NOTY |LOCL |0    |ABS    |$done2
[670]   |  16987916|       0|NOTY |LOCL |0    |ABS    |$done3
[675]   |  16987796|       0|NOTY |LOCL |0    |ABS    |$nowalgnd
[671]   |  16987728|       0|NOTY |LOCL |0    |ABS    |$s1algn
[672]   |  16987740|       0|NOTY |LOCL |0    |ABS    |$s2algn
[673]   |  16987784|       0|NOTY |LOCL |0    |ABS    |$s3algn
.........<output truncated>

Using the swap Command


You use the swap command to add, delete, and monitor the system swap
areas used by the memory manager.
The following syntax displays the options for the swap command:
/usr/sbin/swap -a swapname [swaplow] [swaplen]
/usr/sbin/swap -d swapname [swaplow]
/usr/sbin/swap -l
/usr/sbin/swap -s
Refer to the online man pages for information on the options supported by the swap command.



The swap command supports the following options:

-a swapname Adds the specified swap area. Only a superuser can
use this option. For example, swapname is the name of the swap
device /dev/dsk/c0t0d0s1. When you use the swap command with
the -a option, the size of the swap space increases.

-d swapname Deletes the specified swap area. Only a superuser can
use this option. For example, swapname is the name of the swap
device /dev/dsk/c0t0d0s1.
When the swap -d command is executed, swap blocks can no longer
be allocated from this area because all the swap blocks in this area
are transferred to other swap areas.

-l Lists the status of all swap areas. The list displays the total
number of blocks and the number of free blocks for each swap area.

-s Prints information about the availability and usage of the total
swap space on the system.

Note The output from the swap -s command includes the portion of
physical memory that is available for general programs and for all swap
spaces. However, the output from the swap -l command does not
include physical memory.
For example, if the output of the vmstat command indicates a shortage of
RAM, you can increase the swap space or upgrade RAM. After you increase
the swap space, the swap -l command displays the total number of blocks
and the number of free blocks in each swap area:
Ask students to execute the swap command on their respective machines to view the amount of available
swap space.

# swap -l
swapfile             dev  swaplo  blocks    free
/dev/dsk/c0t0d0s1   136,1     16 2048240 2048240
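
The blocks and free columns of the swap -l output count 512-byte blocks, so 2048 blocks make up one Mbyte. The following sketch converts a block count to whole Mbytes. The function name blocks_to_mbytes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert 512-byte blocks to whole Mbytes.
# 1 Mbyte = 1048576 bytes = 2048 blocks of 512 bytes.
blocks_to_mbytes() {
    echo $(( $1 / 2048 ))
}

blocks_to_mbytes 2048240   # prints: 1000
```

For the swap area in the preceding output, 2048240 blocks correspond to approximately 1000 Mbytes, all of which are free.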

Note The Solaris OE supports the concept of the swapfs file system,
which enables a swap area to be either a file residing on a file system
or a logical or physical partition on a device.


Using the Program Execution Management Commands


You use the program execution management commands to analyze or
display information about the execution of programs on a system. These
commands help you to manage core files and determine how programs
execute. Some options of these commands require superuser
privileges.
The Solaris OE contains the following program execution management
commands:

The truss command

The coreadm command

Using the truss Command


You use the truss command to trace system calls and signals. The truss
command helps you to determine how programs are executed and
identify the points of failure in programs that return error conditions.
Each line of the truss command output reports either the fault, the signal
name, or the system call name with its arguments and return values.
The truss command executes an application and determines the
following:

The system calls made by the application

The signals received by the application

The time stamps of each event

The faults encountered by the application

You also use the truss command to analyze the stale or sick processes on
a system.
The following syntax displays the options for the truss command:
truss [-fcaeildD] [- [tTvx] [!] syscall , ...] [- [sS] [!]
signal , ...] [- [mM] [!] fault , ...] [- [rw] [!] fd ,
...] [- [uU] [!] lib , ... : [:] [!] func , ...] [-o
outfile] command | -p pid ...
Refer students to the online man pages for information on the options supported by the truss command.



Consider a scenario in which a user attempts to run the admintool
command from a remote terminal. The user runs the admintool command
under the truss command to trace the system calls and isolate any
problems:
$ truss -f -t open,close,read,write,stat admintool
...
In the preceding command, the -f option follows all the child processes
that are created by the process that is being traced. The
-t open,close,read,write,stat option limits the trace to the open,
close, read, write, and stat system calls.

Using the coreadm Command


You use the coreadm command to manage application core dump files.
For example, you use the coreadm command to configure a system in
which all the core files are placed in a single directory. You can then
examine the core files in the specific directory whenever a process or
daemon terminates abnormally.
You can also use the coreadm command to set the name pattern for core
files. For example, if the global core file path is set to
/var/core/core.%f.%p and a sendmail process with PID 12345
terminates abnormally, the system generates a core file,
/var/core/core.sendmail.12345.
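
The following sketch mimics how the %f and %p tokens in a core file name pattern are replaced, using the example values from the preceding paragraph. It is an illustration only and does not call the coreadm command. The function name expand_core_pattern is hypothetical, and the function handles only the %f and %p tokens and assumes that the executable name contains no slash characters.

```shell
#!/bin/sh
# Hypothetical helper: expand %f and %p in a core file name pattern.
# $1 = pattern, $2 = executable file name (%f), $3 = process ID (%p)
expand_core_pattern() {
    printf '%s\n' "$1" | sed -e "s/%f/$2/g" -e "s/%p/$3/g"
}

expand_core_pattern /var/core/core.%f.%p sendmail 12345
# prints: /var/core/core.sendmail.12345
```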
The coreadm command has the following forms:

coreadm [-g pattern] [-i pattern] [-d option ...] [-e option ...]
Only a superuser can use this form of the coreadm command. You use
these options to configure system-wide core file options. The
system-wide core file options include a global core file name
pattern and a per-process core file name pattern for the init
process. These settings are saved in the configuration file
/etc/coreadm.conf.

coreadm [-p pattern] [pid ...]
Nonprivileged users can use this form of the coreadm command. You
use these options to specify a file name pattern. The operating
environment uses the file name pattern to generate a per-process
core file.


coreadm -u
Only a superuser can use this form of the coreadm command. You use
this option to update all system-wide core file options from the
contents of the /etc/coreadm.conf file. The startup script
/etc/init.d/coreadm runs the coreadm -u command when the system boots.

Refer students to the online man pages for more information on the options supported by the coreadm
command.

To display the current core file configuration of the system, run the
coreadm command without any parameters.
The following output is displayed when you run the coreadm command
at the command line:
$ coreadm
     global core file pattern:
       init core file pattern: core
            global core dumps: disabled
       per-process core dumps: enabled
      global setid core dumps: disabled
 per-process setid core dumps: disabled
     global core dump logging: disabled

Ask students to run the coreadm command in their respective systems to view the configuration information
about their respective systems.

To set the name pattern for a per-process core file, type the following at
the command line:
# coreadm -i $HOME/corefiles/%f.%p
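The preceding command sets the init core file pattern, which processes
inherit by default, so that new per-process core files are written to
the corefiles subdirectory of the home directory. The shell expands
$HOME before the coreadm command runs, so the directory belongs to the
user who types the command. Existing core files are not moved.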
To set the name pattern for a global core file, type the following at the
command line:
# coreadm -g /var/corefiles/%f.%p
If time permits, provide the following information to students. You can also ask students to perform the
following steps as part of an exercise.



To enable a per-process core file, complete the following steps:

Log in as the superuser.

Run the following command to enable the per-process core file:

# coreadm -e process

Run the following command to display the core file path for the current process to verify the
configuration:

# coreadm $$
To enable a global core file, complete the following steps:

Log in as the superuser.

Run the following command to enable the global core file:

# coreadm -e global -g /var/core/core.%f.%p

Run the coreadm command to verify the configuration.

# coreadm

To display the name pattern of the per-process core file for one or more
processes, run the coreadm command at the command line with a list of
PIDs.
$ coreadm 278 5678
278: core.%f.%p
5678: /home/george/cores/%f.%p.%t


Exercise: Performing Solaris OE Diagnostics


In this exercise, you use some of the diagnostic tools and files described in
this module to perform tests on your system.
Explain to students that the set of questions in the exercise facilitate the revision of the content in the module.
Instruct students to perform tasks and attempt questions in the given sequence. Inform them that they can
refer to the lecture notes to attempt the exercise.

Preparation
A standard Solaris 9 OE installation with access to the man pages is
required for this exercise.

Tasks
Complete the following tasks to perform diagnostics on the system:
1. Log in as the root user, and open a terminal window. Use the
ifconfig command to display basic configuration information
about the network interfaces on the system.
Record the information for the following attributes.

Attribute            Value
IP address           ______________________________
Ethernet address     ______________________________
Netmask              ______________________________
Interface up/down    ______________________________

2. On two systems, start a snoop session and monitor the output.

3. Use the appropriate command to verify that your system can contact
the network interface on another system in the network. Does the
output of the snoop command contain requests and replies (yes or
no)?



4. On only one system in the pair, use the ifconfig command to
mark its primary interface as down and then again execute the
ifconfig command. Does the ifconfig command display any
change in the information?

5. On the system whose interface remains up, use the ping command
to contact the system whose interface is down. What does the ping
command report?

6. On the system whose interface is down, use the ifconfig command
to mark its primary interface as up. Verify that the change took place.

7. On the system whose interface remained up, again attempt to use the
ping command to contact the other system.

What does the ping command report?

_____________________________________________________

Does the snoop command report a reply from the target host?

_____________________________________________________

8. Use the appropriate command to list the driver modules that are
loaded on your system.

9. Use the appropriate command to determine the amount of memory
that is configured on your system.

10. Use the appropriate command to determine your Ethernet hardware
address. Check the IP address next to the keyword inet and ensure
that it matches the value for your system specified in the
/etc/hosts file.
11. Use the diagnostic tools and online system files to answer the
following question about the state and configuration of your system:

What is the size of the swap partition on your system?

_____________________________________________________________
_____________________________________________________________
_____________________________________________________________
12. Use the appropriate command to identify the system calls made by
the ls command.
13. Use the appropriate command to display information about the
active processes running on the system.
14. Use the appropriate option with the netstat command to display a
list of active sockets for each protocol on the system.
15. Use the appropriate command to append a session record to the file
myfile1.



16. Use the appropriate option with the uname command to display the
name of the hardware platform on the system.
17. Use the appropriate prtconf command to display information about
the name of the device driver that manages a peripheral device on a
SPARC processor.


Exercise Summary

Discussion Take a few minutes to discuss what experiences, issues, or
discoveries you had during the lab exercise.

Manage the discussion based on the time allowed for this module, which was provided in the About This
Course module. If you do not have time to spend on discussion, highlight just the key concepts students
should have learned from the lab exercise.

Experiences

Ask students what their overall experiences with this exercise have been. Go over any trouble spots or
especially confusing areas at this time.

Interpretations

Ask students to interpret what they observed during any aspect of this exercise.

Conclusions

Have students articulate any conclusions they reached as a result of this exercise experience.

Applications

Explore with students how they might apply what they learned in this exercise to situations at their workplace.


Exercise Solutions
The solutions for the tasks listed in this exercise are:
1. Log in as the root user, and open a terminal window. Use the
ifconfig command to display basic configuration information
about the network interfaces on the system.
# ifconfig -a
Record the information for the following attributes.

Attribute            Value
IP address           It varies according to the system in use.
Ethernet address     It varies according to the system in use.
Netmask              It varies according to the system in use.
Interface up/down    The interface should be UP.

2.

On two systems, start a snoop session and monitor the output.


# snoop host1 host 2

3.

Use the appropriate command to verify that your system can contact
the network interface on another system in the network. Does the
output of the snoop command contain requests and replies (yes or
no)?
# ping host 2
The output of the snoop command contains both requests and replies.

4.

On only one system in the pair, use the ifconfig command to


mark its primary interface as down and then again execute the
ifconfig command.
# ifconfig hme0 down
# ifconfig hme0
Does the ifconfig command display any change in the
information?
The ifconfig command no longer lists the interface as UP.

5. On the system whose interface remains up, use the ping command
to contact the system whose interface is down.
# ping host
What does the ping command report?
After a time-out period, the ping command reports no answer from
host.
6. On the system whose interface is down, use the ifconfig command
to mark its primary interface as up. Verify that the change took place.
# ifconfig hme0 up
# ifconfig hme0
7. On the system whose interface remained up, again attempt to use the
ping command to contact the other system.
# ping host2
What does the ping command report?
host is alive
Does the snoop command report a reply from the target host?
Yes
8. Use the appropriate command to list the driver modules that are
loaded on your system.
# modinfo
9. Use the appropriate command to determine the amount of memory
configured on your system.
# prtconf -v
10. Use the appropriate command to determine your Ethernet hardware
address. Check the IP address next to the keyword inet and ensure
that it matches the value for your system specified in the
/etc/hosts file.
# ifconfig -a
11. Use the diagnostic tools and online system files to answer the
following question on the state and configuration of your system:
What is the size of the swap partition on your system?
Run the following command to determine the size of the swap
partition on your system:
# swap -l

12. Use the appropriate command to identify the system calls made by
the ls command.
# truss ls
13. Use the appropriate command to display information about the
active processes running on the system.
# ps -e
14. Use the appropriate option with the netstat command to display a
list of active sockets for each protocol on the system.
# netstat -an
15. Use the appropriate command to append a session record to the file
myfile1.
# script -a myfile1
16. Use the appropriate option with the uname command to display the
name of the hardware platform on the system.
# uname -i
17. Use the appropriate prtconf command to display information about
the name of the device driver that manages a peripheral device on a
SPARC processor.
# /usr/sbin/prtconf -D
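
The interface check from step 4 and the swap question from step 11 can also be answered by a script. The following is a minimal sketch and not part of the original lab. The helper names iface_is_up and swap_total_mb are hypothetical, the sample text stands in for real ifconfig and swap -l output, and the size conversion assumes that swap -l reports the blocks column in 512-byte blocks.

```shell
# Sketch only: helper names are hypothetical, and the sample text below
# stands in for real `ifconfig hme0` and `swap -l` output.

# Succeeds if the first flags= line read from stdin lists the UP flag.
iface_is_up() {
    awk '/flags=/ { up = ($0 ~ /[<,]UP[,>]/); exit }
         END { exit (up ? 0 : 1) }'
}

# Sums the "blocks" column (assumed to be 512-byte blocks) and prints
# the total swap size in whole megabytes.
swap_total_mb() {
    awk 'NR > 1 { blocks += $4 }
         END { printf "%d\n", blocks * 512 / 1048576 }'
}

sample_ifconfig='hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 192.168.1.10 netmask ffffff00 broadcast 192.168.1.255'

sample_swap='swapfile             dev  swaplo blocks   free
/dev/dsk/c0t0d0s1   136,1      16 1049312 1049312'

if printf '%s\n' "$sample_ifconfig" | iface_is_up; then
    echo "hme0 is UP"
else
    echo "hme0 is down"
fi
echo "swap: $(printf '%s\n' "$sample_swap" | swap_total_mb) MB"
```

On a live system, you would pipe the real command output into the helpers instead of the sample text, for example ifconfig hme0 | iface_is_up.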


Module 6

Diagnosing Faults Using Online Tools


Objectives
Overview on
page OH 6-2

Upon completion of this module, you should be able to:

Use the online man pages

Diagnose problems by using the SunSolve Online service

Use the Sun Explorer Data Collector utility

Use the docs.sun.com Web site

Relevance
Present the following questions to stimulate students and get them thinking about the issues and topics
presented in this module. While they are not expected to know the answers to these questions, the answers
should be of interest to them and inspire them to learn the material presented in this module.

Relevance on
page OH 6-3

Discussion The following questions are relevant to understanding the
activities you perform in the Solaris Operating Environment (Solaris OE):

Which online tools do you use to diagnose problems in the Solaris OE?

Which online diagnostic tools do you find effective and easy to use
for solving problems in the Solaris OE?

How do you ensure that a problem in your system is not caused by
a known bug?

Allow students to share their experiences. Provide examples of diverse problems that students might
encounter at their workplaces. Use these examples to highlight the significance of the online man pages in
the Solaris OE, in the SunSolve Online service, and on the docs.sun.com Web site.

Additional Resources
Additional resources The following references provide additional
information on the topics described in this module:

SunSolve Online Contents (http://sunsolve.sun.com), accessed 14 March 2002.

Solaris Manual Pages (http://docs.sun.com), accessed 14 March 2002.

Using the Online Man Pages


The Solaris OE has many commands that you can use to check the
configuration, status, and resource information of the system. The Solaris
OE documentation is available online in the form of man pages for
reference. You can check the online man pages for additional information,
such as the syntax, usage, flags, and parameters of commands.
To retrieve specific man pages, you use the man command with various
options at the shell prompt. The man command locates the appropriate
man pages and displays them on the screen.
The following are different options for using the man command:

man [-] [-M path] [-s section] command_name

man [-M path] -k keyword

man [-M path] -f file

Explain to students that they can run the man command with any options provided in the preceding syntax.
The choice of an option depends on the subject of the search. Inform students that the -M, -s, -l, and -k
options for the man command are described in the following pages.

The MANPATH Variable


The man pages are organized into directories and subdirectories, where
each subdirectory corresponds to a section of the reference manual.
When you run the man command, it searches for the specified string in the
directories specified by the MANPATH variable. MANPATH is an
environment variable that defines the search path for directories and
man page sections.
The default path for the directories is the /usr/share/man directory, and
the default search order for sections is defined in the man.cf file.
However, if you set the MANPATH variable, the search path in the MANPATH
variable overrides the default search path.
Note You can use the man -M and man -s options to override the search
path specified in the MANPATH variable.
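
The precedence described above can be modeled in a few lines of shell. This is a sketch of the rule only, not the implementation of the man command, and the helper name effective_manpath is hypothetical.

```shell
# Sketch only: models the precedence described in the text.
# effective_manpath is a hypothetical helper, not part of the Solaris OE.

# $1 is the path passed with -M, or empty if -M was not used.
effective_manpath() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"              # -M overrides everything else
    elif [ -n "${MANPATH:-}" ]; then
        printf '%s\n' "$MANPATH"        # MANPATH overrides the default
    else
        printf '%s\n' /usr/share/man    # default directory
    fi
}

MANPATH=/usr/share/man:/usr/local/man
effective_manpath ""           # prints /usr/share/man:/usr/local/man
effective_manpath /opt/man     # prints /opt/man
unset MANPATH
effective_manpath ""           # prints /usr/share/man
```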


If required, inform students that the man -M and man -s options are discussed later in the module.

Using the man -l Option


You use the man -l option to list all the man pages that match the search
criteria. The following is the syntax for using the man command with the
-l option:
$ man -l command_name
Consider a scenario in which you require information about the crypt
library function to encrypt a file. The name crypt is described at
multiple locations in the reference manual. To check these locations, you
use the man -l option.
$ man -l crypt
crypt (1)       -M /usr/man
crypt (3c)      -M /usr/man
$

The preceding output lists two man pages, with their section numbers,
that contain information about crypt. You can then search
the particular section in the manual.
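
To see where such a listing comes from, you can imitate it by looking for matching page files in each section subdirectory. The following is a minimal sketch under stated assumptions: list_pages is a hypothetical helper, the directory layout (man1/crypt.1, man3c/crypt.3c) is assumed, and the demonstration uses a throwaway directory rather than the real manual.

```shell
# Sketch only: imitates the kind of listing that man -l produces by
# globbing a man page tree. list_pages is a hypothetical helper.

# Usage: list_pages name dir...
list_pages() {
    name=$1
    shift
    for dir in "$@"; do
        for page in "$dir"/man*/"$name".*; do
            [ -f "$page" ] || continue      # skip an unmatched glob
            # The file suffix is taken as the section name.
            printf '%s (%s)\t-M %s\n' "$name" "${page##*.}" "$dir"
        done
    done
}

# Demonstration against a throwaway tree so the sketch runs anywhere.
demo=$(mktemp -d)
mkdir "$demo/man1" "$demo/man3c"
: > "$demo/man1/crypt.1"
: > "$demo/man3c/crypt.3c"
list_pages crypt "$demo"
rm -rf "$demo"
```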

Using the man -s Option


You use the -s option to specify a section of the reference manual in
which the man command must search for a given command. The
following is the syntax for using the man command with the -s option:
$ man -s section command_name
To search multiple sections, separate each section name with a comma.

Note If you do not specify a section, the man command searches each
directory in the search path and displays the first matching man page.


For example, to display information about the crypt library function, you
must search section 3C. If you do not specify the section, the man
command displays the first matching page, which is in section 1.
Ask students to run the man command, once with and once without specifying the section, to search for the
crypt function. Highlight the difference in the two output results.

# man -s3C crypt
Standard C Library Functions                          crypt(3C)

NAME
     crypt - string encoding function

SYNOPSIS
...
...<output truncated>

The section name at the command line limits the search for the crypt
function to section 3C. The first line of the output specifies the section of
the crypt library function.
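
The effect of a section list on the search can be sketched as follows. This models the behavior described in the text and is not the man command itself. The helper name find_page and the directory layout are assumptions, and the demonstration tree is created only for this example.

```shell
# Sketch only: models how a comma-separated section list narrows the
# search. find_page is a hypothetical helper, not the man command.

# Usage: find_page name root sections   (sections such as "1,3c")
# Prints the first matching page, trying the sections in the order given.
find_page() {
    name=$1
    root=$2
    old_ifs=$IFS
    IFS=,
    set -- $3                 # split the section list on commas
    IFS=$old_ifs
    for section in "$@"; do
        page=$root/man$section/$name.$section
        if [ -f "$page" ]; then
            printf '%s\n' "$page"
            return 0
        fi
    done
    return 1
}

# Demonstration against a throwaway tree.
demo=$(mktemp -d)
mkdir "$demo/man1" "$demo/man3c"
: > "$demo/man1/crypt.1"
: > "$demo/man3c/crypt.3c"
find_page crypt "$demo" 3c      # finds only the section 3C page
find_page crypt "$demo" 1,3c    # section 1 is tried first and matches
rm -rf "$demo"
```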

Using the man -M Option


You use the man -M option to specify an alternative search path for the
man pages. For example, when you specify the alternative search path as
/usr/share/man:/usr/local/man, the man command first searches for
the command name in the default location and then in the alternative
/usr/local/man directory.
The following is the syntax for using the man command with the -M
option:
$ man -M path command_name

Using the man -k Option


If you do not know which command to search for, use the man -k option.
The man -k option helps you to search the reference manual by using a
keyword.
The following is the syntax for using the man command with the -k
option:
$ man -k keyword


The -k option searches the windex database file and prints one-line
summaries for all the entries in the file that contain the keyword.

Note The windex database file is similar to an index file, which lists the
keywords and their corresponding reference pages.
For example, the following output is displayed when you run the man
command with the -k option:
# man -k passwd
d_passwd        d_passwd (4)         - dial-up password file
getpw           getpw (3c)           - get passwd entry from UID
kpasswd         kpasswd (1)          - change a user's Kerberos password
nispasswd       nispasswd (1)        - change NIS+ password information
nispasswdd      rpc.nispasswdd (1m)  - NIS+ password update daemon
...<output truncated>
The preceding output shows one-line summaries for all the entries in the
windex database file that contain the keyword passwd.
The keyword look-up feature for searching the man pages is not enabled
by default. To enable it, you must create the windex database file on the
system. If you attempt to run the man -k option without a windex
database file on the system, an error is displayed.
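
A keyword lookup of this kind amounts to a case-insensitive search of the index file. The following is a minimal sketch: keyword_lookup is a hypothetical helper, the tab-separated three-column layout is an assumption, and the sample index is created only for the demonstration.

```shell
# Sketch only: keyword_lookup is a hypothetical helper. It assumes a
# tab-separated index with the columns keyword, page, and description.

# Usage: keyword_lookup keyword index_file
keyword_lookup() {
    if [ ! -r "$2" ]; then
        echo "$2: index not found; create it first" >&2
        return 1
    fi
    # Print every entry that contains the keyword, ignoring case.
    grep -i -- "$1" "$2"
}

# Demonstration with a two-line sample index.
index=$(mktemp)
printf 'getpw\tgetpw (3c)\t- get passwd entry from UID\n' > "$index"
printf 'ls\tls (1)\t- list contents of directory\n' >> "$index"
keyword_lookup passwd "$index"
rm -f "$index"
```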
Inform students that to create a windex database file, they must use the catman command. This command is
described in the following section.

Using the catman -w Option


You use the catman command with the -w option to create an indexed
version of the online reference manual. This indexed version is called the
windex database file and is created at the path specified by the MANPATH
variable or the catman -M option.
The following is the syntax for using the catman command with the -w
option:
$ catman -w


The windex database file helps to improve the speed of the search
operation. This file contains a list with the following three columns:

Keyword

Reference page to which the keyword points

Text that describes the purpose of the command or utility that is
documented on the reference page

The catman command indexes each page of the manual. If you make any
changes to the man pages, you must run the catman command to recreate
the windex database file.
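
One way to decide whether the index needs to be recreated is to compare file modification times. The following is a sketch under stated assumptions: windex_is_stale is a hypothetical helper, it assumes the index is a file named windex at the top of the man directory, and it only reports staleness; it does not run the catman command.

```shell
# Sketch only: windex_is_stale is a hypothetical helper. It assumes the
# index is a file named windex at the top of the given man directory.

# Usage: windex_is_stale man_dir
# Succeeds if the index is missing or any other file is newer than it.
windex_is_stale() {
    [ -f "$1/windex" ] || return 0
    newer=$(find "$1" -type f ! -name windex -newer "$1/windex" | head -n 1)
    [ -n "$newer" ]
}

# Demonstration against a throwaway tree with fixed timestamps.
demo=$(mktemp -d)
mkdir "$demo/man1"
: > "$demo/man1/ls.1"
: > "$demo/windex"
touch -t 200101010000 "$demo/windex"
touch -t 200201010000 "$demo/man1/ls.1"    # page changed after indexing
if windex_is_stale "$demo"; then
    echo "index is out of date; rerun catman -w"
fi
rm -rf "$demo"
```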

Note Only the whatis command and the man -f and -k options use the
windex database file to perform search operations.

Refer students to the man pages for more information on the whatis command and the man -f and -k
options.

Diagnosing Problems by Using the SunSolve Online Service
The SunSolve Online service is the support knowledge database that
provides various diagnostic tools and utilities for Sun systems. This
database also provides the latest patches available from Sun. You can use
the information provided in this database to troubleshoot and resolve
problems in the Solaris OE.
To discuss this section of the module, divide students into groups. Log in students to the SunSolve Online
service on one system in each group. The user name and password for the SunSolve Online Web site are
available in the classroom setup document located at the education.central Web site.

Accessing the SunSolve Online Service


You can access the SunSolve Online service by visiting one of the
following URLs:

http://sunsolve.sun.com

http://sunsolve1.sun.com

http://docs.sun.com

You must have a valid SunSpectrum(SM) contract ID before you register for
a SunSolve Online account.

Note The SunSpectrum support program provides five levels of support
for Sun hardware and software. You can access the SunSpectrum Web
page by visiting www.sun.com/service/support/sunspectrum.

Using the SunSolve Online Service


Contents of the
SunSolve Online
Service on
page OH 6-4

The SunSolve Online service contains several tools and utilities that help
you to diagnose and troubleshoot faults in your system. Figure 6-1
displays the main contents of this database.

Figure 6-1 Contents of the SunSolve Online Service

Patches
Ask students to define a patch. Note student responses on the white board, and identify the close-to-correct
and correct responses. Add your inputs to the correct responses to define a patch as a bug fix, a firmware
upgrade, or a software revision upgrade.

A patch contains a set of files and directories that corrects known bugs
in the system or adds product enhancements. You can download the
recommended and security patches provided by Sun without logging in
to the SunSolve Online service. However, to download the
product-related and operating system patches, you must be a registered
user of this database.

Diagnostic Tools
The SunSolve Online service provides a set of diagnostic tools and
utilities. In addition, it provides links to related tools that help to diagnose
the problem in the system.
Ask students if they referred to the SunSolve Online service to perform system diagnosis. If yes, ask them
which tools and utilities they used to resolve the problems in the Solaris OE.


Table 6-1 displays some of the diagnostic tools provided by the SunSolve
Online service.

Table 6-1 Diagnostic Tools of the SunSolve Online Service

Diagnostic Tool              Description
---------------------------  ---------------------------------------------
PatchDiag tool               Helps to determine the patch levels in the
                             Solaris OE based on the Recommended and
                             Security patch list provided by Sun.

Sun Explorer Data Collector  Helps to collect system configuration
                             information to check the status of the
                             system. This tool also enables you to send
                             the report to Sun for troubleshooting any
                             problems in the system.

Online Support Center        Allows you to submit an online service order
                             to the nearest Sun Solution Center.

Sun Validation Test Suite    Tests and validates Sun hardware by verifying
(SunVTS)                     the configuration and functionality of
                             hardware controllers, devices, and platforms.
                             The SunSolve Online service provides a link
                             to SunVTS. You can directly access SunVTS
                             from www.sun.com/microelectronics/vts/.

If students are logged in to the SunSolve Online service, ask them to access the diagnostic tools from the
SunSolve Online home page to know more about the tools. Depending on the available time, decide on the
time to be spent on this exercise.

Collection Documents
Documents containing related or similar information are grouped as
collections in the SunSolve Online service. This database provides several
collection documents that help you to perform system diagnosis.
The following lists some of the collection documents available in the
SunSolve Online service:

Bug Reports

FAQs

Early Notifiers

Info Docs

Patch Descriptions and Reports

Sun Alert Notifications

Sun Security Bulletins


Consider a scenario in which you want to check whether a known bug is
causing a network problem on your system. To do this, you can access the
bug reports on the SunSolve Online service. If a bug report describes a
similar problem, you can check the report for workaround actions and
take the appropriate corrective action on your system.

Note If you are a registered user of the SunSolve Online service, you can
mark a collection document for downloading or for receiving a
notification whenever the document is modified.
Inform students that to access a collection, they must click the Searchable Collections link on the
SunSolve Online home page.

Security Information
The SunSolve Online service provides information about the bugs and
security issues of various products. The SunSolve Online service also
provides information about the security patches to resolve
security-related bugs in the system. Table 6-2 displays the security-related
information items available in the SunSolve Online service.

Table 6-2 Security Information

Security Information Item    Description
---------------------------  ---------------------------------------------
Latest security bulletin     Includes the latest security information on
                             the Solaris OE.

Security bulletin archive    Contains the earlier security bulletins along
                             with their cross-referenced documents.

Security t-patches           Provides temporary and emergency patches for
                             the current security-related issues in
                             products. These temporary patches are not
                             guaranteed to be released as official patches.

Sun alert notifications      Provides Sun alert notifications for various
                             products, symptoms, and remedies for known
                             bugs and relevant patch IDs.

Security Pretty Good         Ensures the protection of the security
Privacy (PGP) key            bulletins on the SunSolve Online service. To
                             verify the PGP signature on a security
                             bulletin, use the PGP key of the Sun Security
                             Coordination team.

Note To ensure nonrepudiation of the security bulletin, Sun signs
each security bulletin with its private PGP key and publishes the
corresponding public key. You use the public key to verify the signature
and confirm that the security bulletin originated from Sun.
Inform students that they can access the PGP key of the Sun Security Coordination team by clicking the
Security PGP Key hyperlink on the SunSolve Online home page. If students are logged in to the SunSolve
Online service, ask them to click the Security PGP Key hyperlink and view the public PGP key of the Sun
Security Coordination team. Students can also view the security bulletins in this database.

BigAdmin(SM) Services
The SunSolve Online service provides a link to the BigAdmin(SM) service,
which is a web-based, community-driven repository of resources for
system administrators. The BigAdmin service enables users to receive and
post information, resources, and tips.
FAQs, documentation, education resources, patches, scripts, software, and
services and support form an integral part of the BigAdmin service. The
service also includes discussion groups and technical guidance on shell
commands.
You can also access the BigAdmin service by visiting
www.sun.com/bigadmin/.

Performing Search Operations in the SunSolve Online Service
The SunSolve Online service provides an advanced search utility that
enables you to search one or more collection documents.
The search fields vary, depending on the collections that you select for the
search. For example, if you select the Patch Reports collection, the search
fields include Bugs Fixed, Date, Keywords, and OS. However, if you select
more than one collection, only the fields that are common to all the
collections are displayed.
Search Syntax
on page OH 6-5

The advanced search screen provides a number of options, such as date
range, sort order, and the OS to help you restrict the search. You can also
combine a number of terms to perform specific search operations by using
the search syntax, as shown in Table 6-3.

Table 6-3 Search Syntax

Operator  Name      Description
--------  --------  ------------------------------------------------------
" "       Verbatim  Searches for exact matches for the string within
                    quotes

[]        AND       Searches for documents that contain all the terms
                    within the bracket

{}        OR        Searches for documents that contain one or more terms
                    within the bracket

()        Near      Searches for documents in which the specified terms
                    are located within 255 terms

          Suffix    Searches for a term that starts with a specific prefix
                    and ends with any character
Performing a Search Operation


You must complete a set of steps to search for a document. For example,
to search for documentation on how to send a core dump to an alternative
dump device, you complete the following steps:
If you can access the Internet, ensure that students are logged in to the SunSolve Online service. Inform
students that they can refer to the instruction text on the search screens to perform the search. If the Internet
is not accessible, ask students to view the figures provided with each step.

1. Select the collections that you want to search. Figure 6-2 displays the
selected Info Docs collection.

Figure 6-2 Collections for a Search


2. To specify the search criteria, type dump in the synopsis field, as
shown in Figure 6-3. You can specify other options to restrict your
search. Click Go to begin the search.

Figure 6-3 Search Criteria

Figure 6-4 displays the search results for the Info Docs.

Figure 6-4 Search Results

Identifying Patch Support Tools


Explain to students that the SunSolve Online service provides access to various patch support tools and
utilities. Table 6-4 provides an overview of these tools and utilities.

The SunSolve Online service contains the patch-related tools and
utilities shown in Table 6-4.

Table 6-4 Patch Support Tools

Patch Utility            Description
-----------------------  -------------------------------------------------
Patch Check              Determines the patch levels in the Solaris OE as
                         compared to the Recommended and Security patch
                         list provided by Sun.

Recommended and          Help you to protect the Solaris OE from the most
Security Patches         critical system, user, or security-related bugs
                         that have been reported and fixed.

PatchPro                 Generates a custom patch list for your Solaris
                         OE. It is currently available only for
                         Enterprise systems, Storage products, and Sun
                         Cluster software.

Checksum file            Contains a list of checksums, which is generated
                         daily in the SunSolve Online service for all the
                         available patches.

Automate Downloads       Helps you to download patches automatically from
                         the SunSolve Online service.

Patch Finder             Helps you to search for a patch based on the
                         patch ID.

Solaris Patches          Enable you to view all the patches along with
                         their descriptions for the selected Solaris OE
                         release.

Product Patches          Help you to locate all the patches for the
                         selected product.

Sun Alert Patch Report   Displays a list of Sun alert messages for the
                         selected product. The report provides
                         information about alert messages and the
                         relevant patch IDs.

Note You can access the Automate Downloads, Solaris Patches, and
Product Patches utilities only if you log in to the SunSolve Online service.
To access a patch tool or utility, click the corresponding link on the
SunSolve Online home page.
Viewing the Current Patch Report


The SunSolve Online service contains a comprehensive list of patches for
the Solaris 1.x OE through the Solaris 9 OE release. To obtain the current
patch report for a Solaris OE release, click the Solaris Patches link on
the SunSolve home page.
If students are logged in to SunSolve Online, ask them to click the Solaris Patches hyperlink on the
SunSolve home page to view the current patch list for the latest Solaris OE release.
If the Internet is not accessible, ask students to view Figure 6-5 for a sample patch report.

Figure 6-5 displays a section of the patch report for the Solaris 8 OE.

Figure 6-5 Solaris OE Patches

You can view the patches currently applied to your Solaris OE by running
the patchadd -p command.
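
When the list of applied patches is long, it can help to reduce it to patch IDs only. The following is a minimal sketch: installed_patch_ids is a hypothetical helper, and the sample lines are illustrative and assume that each entry starts with the word Patch: followed by the patch ID.

```shell
# Sketch only: installed_patch_ids is a hypothetical helper. The sample
# lines are illustrative and assume entries of the form "Patch: ID ...".

# Reads patch listing output on stdin and prints one patch ID per line.
installed_patch_ids() {
    awk '$1 == "Patch:" { print $2 }'
}

sample='Patch: 108528-13 Obsoletes: 108874-01 Requires:  Incompatibles:  Packages: SUNWcsu, SUNWcsr
Patch: 108827-05 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsl'

printf '%s\n' "$sample" | installed_patch_ids
```

On a live system, you would pipe the output of patchadd -p into the helper instead of the sample text.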

The PatchDiag Tool


The SunSolve Online service provides the PatchDiag tool that enables you
to determine the patch levels for your system as compared to the patches
that are currently available for the Solaris OE. The PatchDiag tool
examines the patches with respect to the following:

Latest revisions

Recommended patches

Security patches

Y2K patches

Patches relevant to the software environment of the system

To access the PatchDiag tool, you must be a registered user of the
SunSolve Online service.

Note The PatchDiag tool is a compiled Perl script. The Perl source is in
the patchdiag.pl file of the installation directory.
To install and run the PatchDiag tool, complete the following steps:
1. Download the patchdiag_1.0.4.tar.Z package from the
PatchDiag Tool Web page at
http://sunsolve.sun.com/patchdiag.

2. Uncompress and untar the tar.Z package. A PatchDiag tool
subdirectory is created in the current directory.

3. Download the PatchDiag cross-reference file, patchdiag.xref, from
the PatchDiag Tool Web page.

4. Copy the patchdiag.xref file into the same directory as the
patchdiag.pl script.

Note The patchdiag.xref file contains the latest data about all the
patches. All users must have access to the patchdiag.xref file to run the
PatchDiag tool successfully.


5. Check that the following files are present in the Solaris OE. The
PatchDiag tool uses these files to generate the patch report:

patchdiag.xref - Used as a cross-reference data file for patch
information

/usr/bin/pkginfo - Used to obtain information about the
packages currently installed on the system

/usr/bin/showrev - Used to obtain information about the
patches installed on the system

/usr/bin/uname - Used to determine the system information
when you do not specify the PatchDiag tool options

6. Run the patchdiag_setup script.

7. Run the patchdiag.pl script with one of the options shown in
Table 6-5.

Table 6-5 Options of the PatchDiag Tool

Option                              Description
----------------------------------  --------------------------------------
-l                                  Displays a long audit report and
                                    includes the patches that are related
                                    to the installed packages

-s <sfile> <os_ver> <arch>          Displays a standard audit report by
                                    using the file that contains the
                                    output of the showrev -p command

-p <pfile> <sfile> <os_ver> <arch>  Displays a long audit report by using
                                    files that contain the output of the
                                    showrev -p and pkginfo -l commands

-x <xref>                           Uses a different cross-reference file
                                    than the file in the same directory as
                                    the patchdiag script

-h | -?                             Displays the PatchDiag user guide

If you do not specify an option, the PatchDiag tool runs the showrev -p
command in the Solaris OE and prints the standard audit report. The
audit report contains information about the installed patches, the security
patches, and the uninstalled recommended patches.
To identify the patches that you must download and install on your
system, review the audit report generated by the PatchDiag tool.
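
The comparison at the heart of such an audit can be sketched in a few lines. This is not the PatchDiag tool and does not read the patchdiag.xref file. The helper name outdated_patches and both input formats are hypothetical: one file lists installed patches as ID-REV, and the other lists the latest known revision as ID REV.

```shell
# Sketch only: outdated_patches is a hypothetical helper, and both input
# formats are simplified stand-ins, not the real patchdiag.xref layout.

# Usage: outdated_patches installed_file latest_file
#   installed_file: one "ID-REV" per line, for example 108528-13
#   latest_file:    one "ID REV" per line, for example 108528 15
outdated_patches() {
    awk '
        NR == FNR { latest[$1] = $2; next }   # first file: latest revisions
        {
            split($1, p, "-")                 # second file: installed ID-REV
            if ((p[1] in latest) && (p[2] + 0 < latest[p[1]] + 0))
                printf "%s-%s -> %s-%s\n", p[1], p[2], p[1], latest[p[1]]
        }
    ' "$2" "$1"
}

# Demonstration with illustrative data.
work=$(mktemp -d)
printf '108528-13\n108827-05\n' > "$work/installed"
printf '108528 15\n108827 05\n' > "$work/latest"
outdated_patches "$work/installed" "$work/latest"   # prints 108528-13 -> 108528-15
rm -rf "$work"
```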

Using the Sun Explorer Data Collector Utility


The Sun Explorer Data Collector utility is commonly referred to as
Explorer. Explorer is a collection of shell scripts that helps you to collect
system information, compress the output, and send the report to Sun. The
engineers at Sun use the Explorer report to either describe the
configuration of the Solaris OE or troubleshoot a system problem.

Obtaining Explorer
You download Explorer from the www.sunsolve.sun.com Web site.

Installing Explorer
To unpack and install Explorer, complete the following steps:
1. Copy the SUNWexplo.tar.Z file into the current directory.

2. Type the following at the command line to unpack Explorer:
% zcat SUNWexplo.tar.Z | tar xf -

3. Type the following at the command line to install Explorer in the
Solaris OE:
# pkgadd -d . SUNWexplo
During installation, you are prompted for information, such as your
SunSpectrum contract ID, the serial number of the system, and the
company name. This information helps you to track the Explorer
output effectively when you send the Explorer report to Sun
engineers for analysis.

Configuring and Executing Explorer


You can configure Explorer to run either automatically or manually. When
installing Explorer, you are prompted to specify the frequency of
execution of Explorer.
To configure Explorer to run automatically, select the option to run
Explorer on a weekly basis. You can modify the frequency of execution
later.


To run Explorer manually, type the following at the command line:
# /opt/SUNWexplo/bin/explorer -e
The preceding command assumes that Explorer is installed in the
/opt/SUNWexplo directory. The -e option sends the Explorer output
through email to the recipients specified in the
/opt/SUNWexplo/etc/default/explorer configuration file.

Note You must have root privileges to run Explorer.


The Explorer script calls the scripts located in the
/opt/SUNWexplo/tools directory, which initiate the tools that collect
information about the system.

Note If you want to run Explorer on a cluster, select one node at a time
instead of selecting all the nodes simultaneously.

Reviewing the Explorer Output


Explorer generates an output report, which is automatically saved in the
/opt/SUNWexplo/output directory. You can review the Explorer output
report to identify any problems in the system.
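
After a manual run, it can be handy to find the report that Explorer just wrote. The following is a minimal sketch, not part of Explorer itself: the function name latest_explorer_output is hypothetical, and the default directory is the /opt/SUNWexplo/output path named above.

```shell
#!/bin/sh
# Hypothetical helper: print the newest entry in an Explorer output
# directory. The default path is the one named in this section; pass a
# different directory as the first argument if Explorer lives elsewhere.
latest_explorer_output() {
    dir=${1:-/opt/SUNWexplo/output}
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # ls -t sorts by modification time, newest first
    newest=$(ls -t "$dir" | head -1)
    [ -n "$newest" ] || { echo "no reports in $dir" >&2; return 1; }
    echo "$dir/$newest"
}
```

For example, run /opt/SUNWexplo/bin/explorer -e as root, and then call latest_explorer_output to print the path of the most recent report.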
You can also obtain a thorough analysis and status report for the Solaris
OE from Sun, which is based on the Explorer output report.
Note You must buy the Sun System Configuration Check service to
request a thorough analysis report from Sun.

Using the docs.sun.com Web Site


The docs.sun.com Web site contains reference manuals, technical
documents, and extensive documentation on Sun products. This web site
also provides glossary definitions for various terms used in the
documentation and troubleshooting guides for several products. In
addition, you can use this site to read, browse, search, and print Sun
documentation. Figure 6-6 shows the docs.sun.com home page.

Figure 6-6

The docs.sun.com Home Page

Browsing the docs.sun.com Web Site


The docs.sun.com Web site is organized under the following groups:

Subject Categories

Collection Titles

Product Categories

Subject Categories
The docs.sun.com Web site contains several books that are grouped
according to the subject matter. These subjects include the following
categories:

System Administration

Programming

Desktop Manuals

Hardware

Manpages

Each subject category is further divided into subcategories. For example,


the Manpages category has several subcategories, as shown in Figure 6-7.

Figure 6-7

Subject Categories and Subcategories

Subject categories and subcategories enable you to follow an organized


and systematic approach to locate information on the docs.sun.com Web
site.

Collection Titles
A collection is a set of books that are grouped if they fulfill the following
requirements:

Relate to a single product or product line

Contain the same subject matter

Address a specific audience

Collections help you to browse Sun documentation because you can easily
track information on any subject if you know the type of content. For
example, to locate information on managing users in the Solaris 9 OE, you
can select the Solaris 9 System Administration collection.
Ask students to locate the OpenBoot collection title and observe how books with similar subject matter are
grouped in the OpenBoot collection.

Note An individual book can appear in more than one collection.


Document collections are listed alphabetically, according to the collection
title, as shown in Figure 6-8.

Figure 6-8

Collection Titles

Product Categories
This category groups and displays books according to the product
described by the books. The books are categorized under hardware and
software products, which are further categorized according to specific
products.
Figure 6-9 shows the product categories at the docs.sun.com Web site.

Figure 6-9

Product Categories

Ask students to access the docs.sun.com Web site. Ask them which document structure they prefer for
locating the man pages for the Solaris 9 OE. Note student responses. Highlight the fact that using the
Manpages subject category is the most convenient way of locating the required man pages because the
subject is known.

Performing a Search Operation on the docs.sun.com Web Site
The docs.sun.com Web site provides the facility to search the entire
documentation for specific words or phrases. You can use the following
options to make the search useful and effective:

Search In functionality

Search Bar Options

Search Syntax

The result of the search displays a list of books in decreasing order of


relevance.

Using the Search In Functionality


The Search In functionality helps you to define the scope of the search.
You can use the Within drop-down menu on the search bar to search
books, collections, and subject or product categories.
Search In
Options on
page OH 6-6

Table 6-6 shows the options provided by the Search In drop-down menu.
Table 6-6 Search In Options

Search Option                 Description
All Books                     Searches all the books on the docs.sun.com
                              Web site. This option is available for all
                              search operations.
Subject or Product category   Searches the current subject or product
                              category or subcategory. This option is
                              available for a subject or product category
                              or a book.
This collection               Searches the current collection. This option
                              is available for a book or a collection.
This Book                     Searches the current book. This option is
                              available for a book.

Using Search Bar Options


In addition to the Search In options, the docs.sun.com Web site provides
the following two search bar options:

Search book titles only - Performs the search only in the titles of
books.

Ignore old editions - Performs the search only in the latest editions
of published books. For example, while searching for the Solaris
Advanced User's Guide, the result shows the Advanced User's
Guide for the Solaris 9 OE only. This is because the latest edition of
the book is published for the Solaris 9 OE.

Figure 6-10 shows the two search bar options.

Figure 6-10 Search Bar Options

Using the Search Syntax


You can use a single word or a combination of words to perform a search
in the docs.sun.com Web site. The search syntax can contain any of the
following options, as shown in Table 6-7.
Table 6-7 Search Syntax Options

Option    Description                              Sample Syntax
Words     Searches for a book that contains one    OpenBoot firmware
          or more words. Type the words in the
          Search For text field, separating each
          word with a space.
Phrases   Searches for a book that contains a      "ok setenv"
          phrase. Type the phrase within           "ok printenv"
          quotation marks in the Search For text
          field. You can combine multiple
          phrases for a single search request.
AND       Searches for a book that contains a      "ok setenv" AND
          combination of words and phrases.        OpenBoot
          Combine the words and phrases by using
          the Boolean operator AND.
OR        Searches for a book that contains any    "Sun Enterprise" OR
          or all of the specified words or         "Ultra Enterprise"
          phrases. Combine the words or phrases
          by using the Boolean operator OR.

Printing Files From the docs.sun.com Web Site


The docs.sun.com Web site provides most books in the portable
document format (pdf) for printing. You can either download the pdf file
in a browser and print it or download the file by using the ftp command.

Downloading a File by Using the Browser


To print a book from the docs.sun.com Web site, complete the following
steps:
1. Click the Download PDF tab on the document template, as shown in
Figure 6-11.

Figure 6-11 Downloading a PDF File

2. Click the link to the book that you want to print, as shown in
Figure 6-12.

Figure 6-12 Printing a PDF File

The docs.sun.com Web site uses the Adobe Acrobat browser
plug-in to display the pdf file in the browser window.

3. Use the Print function in Adobe Acrobat Reader to print the file.

Downloading a File by Using the ftp Command


The docs.sun.com Web site allows you to download a pdf file by using
the ftp command instead of downloading the file to the browser.
To download the file using the ftp command, complete the following
steps:
1. Move the mouse device over the pdf file that you want to download,
and note the URL of the file displayed on the status bar, as shown in
Figure 6-13.

Figure 6-13 Downloading a File by Using the ftp Command


2. Open a terminal window in the Solaris OE, and change to the directory
in which you want to save the pdf file.

3. Run the ftp command by using the host address from the URL that you
noted in Step 1. For example, if the URL address of the pdf file is
ftp://192.18.99.138/802-1958/802-1958.pdf, run the
following ftp command:

$ ftp 192.18.99.138

4. Log in to the FTP proxy server by using the User ID anonymous.

5. Switch to the book_part_number directory on the proxy server, and
use the get command to download the file. For example, if the part
number of the book is 802-1958, type the following:

ftp> cd 802-1958
ftp> get 802-1958.pdf

After downloading the pdf file, you can view it in the browser by using
Adobe Acrobat Reader.
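
The preceding steps split the URL from the status bar into a host, a directory, and a file name by hand. The following sketch shows one way to do the same split with shell parameter expansion. The function names are hypothetical and are not part of any Sun tool. The binary command is an addition that is not in the steps above; it is a standard ftp client command that keeps a pdf file from being altered during transfer.

```shell
#!/bin/sh
# Split an ftp:// URL (as noted in Step 1) into its parts.
# All function names here are illustrative only.
ftp_host() { rest=${1#ftp://}; echo "${rest%%/*}"; }
ftp_dir()  { rest=${1#ftp://}; path=${rest#*/}; echo "${path%/*}"; }
ftp_file() { echo "${1##*/}"; }

# Print the commands to type for a given URL: the ftp invocation
# first, followed by the commands to enter at the ftp> prompt.
ftp_steps() {
    echo "ftp $(ftp_host "$1")"
    echo "binary"
    echo "cd $(ftp_dir "$1")"
    echo "get $(ftp_file "$1")"
}
```

Calling ftp_steps ftp://192.18.99.138/802-1958/802-1958.pdf prints the ftp command from Step 3, followed by the binary, cd, and get commands to type at the ftp> prompt.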

Icon Legends in the docs.sun.com Web Site


To help you navigate Sun documentation, the docs.sun.com Web site
provides three types of symbols. These symbols help you to identify the
type of document structure and the current location in the site.

Introducing Icons

Icon Legends
on page OH 6-7

Icons help you to identify the type of document structure at the


docs.sun.com Web site. For example, you can identify a book, a
collection, or a file by observing the icon preceding the name of the
document structure. Figure 6-14 shows different icons with their
descriptions.

Figure 6-14 Icon Legends

Introducing Control Symbols

Control
Legends on
page OH 6-8

The control symbols enable you to view contents, wherever required. You
can click the control symbols to either expand or collapse collections or
groups of documents.
Figure 6-15 shows various control symbols and their associated
descriptions.

Figure 6-15 Control Legends

Introducing Indicator Symbols

Indicator
Legends on
page OH 6-9

Indicator symbols help you to determine the current location in the


docs.sun.com Web site. You use these symbols to determine your
location in the hierarchy of a book or a collection. Figure 6-16 shows
various Indicator symbols and their associated descriptions.

Figure 6-16 Indicator Legends

Discussion Ask students to browse the docs.sun.com Web site to


locate information about IP network multipathing. After completing the
search, spend some time discussing how students performed the search
and how much time they spent to locate the manual. Arrive at the most
systematic approach followed.

To locate information on IP network multipathing, select the Solaris 9 System Administration collection. Next,
select the System Administration Guide: IP Services book. Type multipathing in the Search For text field on
the search bar. Identify the indicator legend to select the most relevant search result.

Exercise: Using the man Command


In this exercise, you use the man command to explore system
documentation and search the reference manual:

By a section name

By a keyword

Preparation
Boot the Solaris OE, and log in as the root user if necessary.

Tasks
Perform the following tasks to display information from the online
reference manual:

1.

Run the man command with an appropriate option to search the


online reference manual for the chmod system call. Confine your
search to section 2 of the reference manual.

2.

Run the man command with an appropriate option to list all the man
pages that contain information about the passwd command.

3.

Use the appropriate command to create a windex database file for


the online reference manual.

4.

Use the appropriate command to display one-line summaries for all


the entries in the windex database file containing the keyword
device.

Exercise Solutions
The following are the solutions for the tasks listed in the exercise:
1.

Run the man command with an appropriate option to search the


online reference manual for the chmod system call. Confine your
search to section 2 of the reference manual.
$ man -s2 chmod

2.

Run the man command with an appropriate option to list all the
manual pages that contain information about the passwd command.
$ man -l passwd
passwd (1)
-M /usr/man
passwd (4)
-M /usr/man

3.

Use the appropriate command to create a windex database file for


the online reference manual.
$ catman -w

4.

Use the appropriate command to display one-line summaries for all


the entries in the windex database file containing the keyword
device.
$ man -k device

Exercise: Diagnosing Problems Using the SunSolve Online Service
In this exercise, you use the following tools and utilities in the SunSolve
Online service:

The search utility to search for a collection document

The Patch Finder utility to find a patch based on the patch ID

The PatchDiag tool to determine the patch levels available in the


Solaris OE as compared to the recommended and security patches

Preparation

For the purpose of this exercise, divide students into manageable groups, depending on the strength of
the class.

If Internet access is available, log in students to the SunSolve Online service on one system in each
group. The login information is provided in the setup file located at the education.central Web site.

If the Internet is not accessible, ask students to write the steps for performing the exercise tasks. They
can refer to the figures in the module, which display the screens from the SunSolve Online service.

For Step 5, refer to the /ST350_LF/Sunsolvediag/patchdiag/ directory
for the relevant files that are downloaded from the SunSolve Online
service.
If there is enough time, ask students to install Explorer on their systems. The tar file for Explorer is available
in the /ST350_LF/Sunsolvediag/explorer/ directory. Inform students that they will be prompted for
information while installing Explorer. They can skip the details for the SunSpectrum ID and continue with the
installation. To run Explorer manually, ask them to run the following command:
# /opt/SUNWexplo/bin/explorer -e
A sample output of Explorer is provided in Appendix A, Sample Outputs.

Tasks
Perform the following tasks:

1.

Run the appropriate command to display all the patches installed in


your Solaris OE.

2.

Explorer is installed in the /opt/SUNWexplo directory. Run the
appropriate command to run Explorer manually.

3.

Search the collection documents on the SunSolve Online service to


locate a white paper on security in Sun systems.

4.

Use the appropriate utility to find the patch with the patch ID
112552-01 on the SunSolve Online service. Note the bug IDs that are
fixed by the patch.

5.

The patchdiag_1.0.4.tar.Z and patchdiag.xref files are


available in the /ST350_LF/Sunsolvediag/patchdiag/ directory
on your system. Follow the appropriate steps to install the PatchDiag
tool in your Solaris OE, and generate the long audit report.
a.

Uncompress and untar the tar.Z package.

b.

Copy the patchdiag.xref file into the same directory as the


patchk.pl script.

c.

Run the patchdiag_setup script.

d.

Run the patchdiag.pl script with the -l option.

Exercise Summary

Discussion Take a few minutes to discuss what experiences, issues, or


discoveries you had during the lab exercise.

Manage the discussion based on the time allowed for this module, which was provided in the About This
Course module. If you do not have time to spend on discussion, highlight just the key concepts students
should have learned from the lab exercise.

Experiences

Ask students what their overall experiences with this exercise have been. Go over any trouble spots or
especially confusing areas at this time.

Interpretations

Ask students to interpret what they observed during any aspect of this exercise.

Conclusions

Have students articulate any conclusions they reached as a result of this exercise experience.

Applications

Explore with students how they might apply what they learned in this exercise to situations at their workplace.

Exercise Solutions
The following are the solutions for the tasks listed in the exercise:
1.

Run the appropriate command to display all the patches installed in


your Solaris OE.
# patchadd -p

2.

Explorer is installed in the /opt/SUNWexplo directory. Run the
appropriate command to run Explorer manually.
# /opt/SUNWexplo/bin/explorer -e

3.

Search the collection documents on SunSolve Online to locate a


white paper on security in Sun systems.
Complete the following steps:
a.

Click the Searchable Collections link on the SunSolve


Online home page.

b.

Select the White Papers/Tech Bulletins check box, and click


Next.

c.

Type sun system security in the Synopsis text field, and click
Go.

The document ID of the white paper on Sun Security System is 26922.


4.

Use the appropriate utility to find the patch having the patch ID
112552-01 on SunSolve Online. Note the bug IDs that are fixed by the
patch.
Use the Patch Finder utility to locate the patch on SunSolve Online.
Patch 112552-01 fixes the bugs 4607337 and 4624965 in the Solaris 9 OE.

5.

The patchdiag_1.0.4.tar.Z and patchdiag.xref files are


available in the /ST350_LF/Sunsolvediag/patchdiag/ directory
on your system. Follow the appropriate steps to install the PatchDiag
tool in your Solaris OE and generate the patch report.
The following are the steps to install the PatchDiag tool:
a.

Uncompress and untar the tar.Z package.

# zcat patchdiag_1.0.4.tar.Z | tar xvf -


b.

Copy the patchdiag.xref file into the same directory as the


patchk.pl script.

c.

Run the patchdiag_setup script.

# directory-path/patchdiag-1.0.4/patchdiag_setup
d.

Run the patchdiag.pl script with the -l option.

# perl directory-path/patchdiag-1.0.4/patchdiag.pl -l
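
The patch listing from solution 1 can also be filtered to check for a single patch ID, such as the one used in task 4. The following is a minimal sketch: the function name patch_installed is hypothetical, and the sketch assumes that each installed patch appears on a line that begins with "Patch: " followed by the patch ID and a space, which is how patchadd -p output is commonly laid out.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given patch ID appears in a patch
# listing read from standard input.
# Assumption: each installed patch is reported on a line that starts
# with "Patch: <id> " (for example, "Patch: 112552-01 Obsoletes: ...").
patch_installed() {
    grep "^Patch: $1 " >/dev/null
}
```

For example, patchadd -p | patch_installed 112552-01 && echo installed reports whether that revision of the patch is present.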
Module 7

Introducing Types of System Failures


Objectives
Overview on
page OH 7-2

Upon completion of this module, you should be able to:

Describe the causes of system panics and dumps

Describe the process of system crash dump generation

Describe watchdog resets

Relevance
Present the following questions to stimulate students and get them thinking about the issues and topics
presented in this module. While they are not expected to know the answers to these questions, the answers
should be of interest to them and inspire them to learn the material presented in this module.

Relevance on
page OH 7-3

Discussion The following questions are relevant to understanding the


reasons for system failures:

What is a core dump?

Ask students what they understand by a core dump. Use the following description to explain a core dump.
A core dump is a file that contains the memory image of a process that was terminated by the kernel in
abnormal circumstances. When an application code attempts to perform an illegal action, the kernel causes
the application to terminate and creates a core dump file on the disk, representing the process memory. Core
dumps enable you to perform a postmortem analysis of the offending application. By viewing the process
memory image at the exact moment of termination, you can determine the cause of the problem within the
source code. The process that dumps the core file is the only one that is affected.

What is a system crash dump?

Ask students what they understand by a system crash dump. Use the following description to explain a
system crash dump.
A system crash dump is the conceptual equivalent of a core dump, which is generated when the kernel code
performs an illegal action that jeopardizes data integrity. If data integrity is jeopardized, the kernel notes the
disparity and calls a special kernel routine, known as a panic, to manage the situation. The panic routine
causes the memory image and symbol table of the kernel to be saved to the swap space, by default, and
forces the system to reboot. When the kernel generates a crash dump, all the applications are affected.

What is a system crash?

A system crash occurs when either the computer stops working or an application aborts unexpectedly. A
system crash signifies either a hardware fault or a critical software bug.

What is a system hang?

When a system crashes in such a way that it does not respond to any inputs from the keyboard, the mouse
device, or any other program, it is known as a system hang.

Additional Resources
Additional resources The following references provide additional
information on the topics described in this module:

Drake, Chris, and Kimberley Brown. Panic! UNIX System Crash Dump
Analysis. Upper Saddle River, New Jersey: Prentice Hall PTR, May
1995.

Goodheart, Berny, and James Cox. The Magic Garden Explained. Upper
Saddle River, New Jersey: Prentice Hall Books, January 1994.

The SPARC Architecture Manual, SPARC International.

Introducing the Causes of System Panics


The kernel has unlimited access to all the memory, hardware, and
peripherals within a system. Therefore, the data that is being manipulated
by the kernel must retain its integrity.
To ensure data integrity, different routines within the kernel validate the
values of the manipulated data. When these routines doubt the validity of
the data, they call a panic routine within the kernel. This panic routine
writes the contents of the kernel memory to the dump device for later
analysis and then reboots the machine. By default, the dump device is the
swap partition on the system.

Note Before the release of the Solaris 2.6 OE, system crash dumps were
stored in the first dump device, as defined in the /etc/vfstab file. In the
Solaris 9 OE, you use the dumpadm command to specify the location where
the system stores the crash dumps.
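
When run as root with no arguments, the dumpadm command prints the current crash dump configuration. The following sketch pulls the savecore directory out of that output. The function name savecore_dir is hypothetical, and the sketch assumes that the output contains a line of the form "Savecore directory: path".

```shell
#!/bin/sh
# Hypothetical helper: print the savecore directory from dumpadm-style
# output read on standard input.
# Assumption: the output has a "Savecore directory: <path>" line.
savecore_dir() {
    sed -n 's/^ *Savecore directory: *//p'
}
```

For example, dumpadm | savecore_dir prints the directory in which crash dumps are saved after a reboot.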
Ask students to discuss the system failures that they experience at their work places. Note the responses on
the white board, and inform students that by the end of this module, they will be categorizing the list of
system failures as different types of system crashes.

After an application terminates abnormally or a system crashes,


programmers and system administrators examine the postmortem
information in the dump files to identify the type and cause of the fault in
the system.
When the kernel terminates an application, a core dump file is generated
in the directory from which the application was executed.

Causes of System Crash Dumps


A system crash dump is generated due to the following types of faults:

System panics - When the kernel code detects data inconsistency, it
calls a special function, known as panic, which writes the kernel
memory to the disk and reboots the system.

Note The administrator can also use the savecore utility with the -L
option to generate a crash dump of the running system without causing
the system to reboot.

Bad traps - The kernel panics the system if it encounters a bad
hardware or software trap.

Note A system hang is another fault that occurs on a system. A hung
system cannot generate a system dump by itself. Therefore, you must
force a system panic to generate a system dump that you can use to
analyze the fault that caused the system hang.

Causes of Application Core Dumps


An application core dump is generated when a process or application
running in the Solaris OE terminates abnormally. The following events
might generate an application core dump:

User intervention to request a core file

A signal that another application or user sends to the application

Note Not all signals sent to an application generate a core dump.

An abnormal event, such as a critical bug, which is encountered by


the application.
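
The effect of a signal on a process can be observed from the shell. The following is a small sketch, not a diagnostic tool: the function name signal_exit_status is hypothetical, and sleep stands in for a real application. An exit status above 128 means that the process was killed by signal number (status - 128). Whether a core file is also written depends on the default action of the signal and on the core file size limit of the process.

```shell
#!/bin/sh
# Hypothetical demonstration: start a throwaway process, send it the
# named signal, and print the exit status that the shell reports.
signal_exit_status() {
    sleep 30 &
    pid=$!
    kill -s "$1" "$pid"
    status=0
    # wait returns 128 + signal number when the child dies from a signal
    wait "$pid" || status=$?
    echo "$status"
}
```

For example, signal_exit_status ABRT prints 134 (128 + 6), and signal_exit_status TERM prints 143 (128 + 15). The default action for SIGABRT includes a core dump, whereas the default action for SIGTERM terminates the process without one.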

Introducing System Panics


When the kernel code detects data inconsistency within the running
kernel, the code responds by calling the panic() routine. A call to
panic() is not an error condition but a protective reaction to
an error condition that safeguards the data in the system.

Overview of the panic() Routine


The panic() routine, which is a kernel function and not a system call,
performs the following tasks:

Displays a panic message on the console

Performs a stack trace to list the routines that caused the panic

Saves a core dump image of the system memory in the dump device

Resets the system

Note Each system crash dump image is delimited by short dump


records. These short dump records help the savecore utility in the Solaris
OE to identify the system crash dump.

Overview of Bad Traps


A trap can be a response to a hardware interrupt, a hardware or software
error condition, or a software request for kernel services. A trap causes the
current process to be suspended and also causes an immediate branch to
low-level kernel code to respond to and service the trap.
Traps handle both hardware-related events, such as input from a
keyboard, a mouse, or a port, and software-related events, such as a
process requesting system services from the kernel, for example, reading
a file from a file system.

Note The processor subcomponent that interprets system language


instructions is called the Instruction Unit (IU). However, before executing
an instruction, the IU checks for pending interrupts or errors that must be
managed. When the IU detects pending interrupts or errors, it generates a
trap.

If an error condition occurs during trap handling and prevents the trap
from being serviced, a bad trap error condition occurs. Bad traps that occur
during the processing of user code cause the kernel to terminate the
offending process, which generates a core dump. However, the bad traps
that occur during the processing of the kernel code cause the kernel to
panic. This panic condition causes a system crash dump to be written to
the swap device and also causes the system to reboot.
Table 7-1 shows some examples of bad traps.
Table 7-1 Examples of Bad Traps

Trap                           Type number   Cause
Data fault                                   Access to an unmapped memory
                                             location
Memory alignment                             Access to unaligned memory
Illegal software instruction                 Unrecognized instruction

If the Solaris OE reboots after a bad trap, the trap messages are saved in
the /var/adm/messages file. However, if the trap messages are not saved,
the system administrator must run the dmesg command immediately after
system reboot. The dmesg command displays the messages generated by
the system crash.

Note The dmesg command identifies the recently generated diagnostic


messages and prints them on the standard output device.

Inform students that line #1 of the output indicates the trap type and the date and time of the trap. Line #2
indicates the name of the trap.

The contents of the /var/adm/messages file and the output of the dmesg
command are identical after a bad trap. The following is the output of a
memory-alignment bad trap:
...
1  Dec 21 03:36:49 mysun unix: BAD TRAP: type=7 rp=f0bbeb8c addr=0
   mmu_fsr=0 rw=0
2  Dec 21 03:36:49 mysun unix: find: Memory address alignment
3  Dec 21 03:36:49 mysun unix: pid=916, pc=0xfc2550e4, sp=0xf0bbebd8,
4  psr=0x1f0000c0, context=1930
5  Dec 21 03:36:49 mysun unix: g1-g7: f004f51c, 8000000, f007702c, c0,
   fd7a1a68,
6  1, fcbaa020
7  Dec 21 03:36:49 mysun unix: panic: cross-call at high interrupt level
8  Dec 21 03:36:49 mysun unix: syncing file systems... 3 3 3 3 3 3 3 3 3
   3 3 3 3
9  3 3 3 3 3 3 3 done
10 Dec 21 03:36:49 mysun unix: 14849 static and sysmap kernel pages
...<output truncated>
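
To pick the trap type out of a long message log, you can filter for the BAD TRAP line. The following is a minimal sketch: the function name bad_trap_types is hypothetical, and the sketch assumes the "BAD TRAP: type=N" wording shown in the preceding sample output.

```shell
#!/bin/sh
# Hypothetical helper: print the trap type from each BAD TRAP line
# read on standard input.
# Assumption: lines use the "BAD TRAP: type=N" wording of the sample.
bad_trap_types() {
    sed -n 's/.*BAD TRAP: type=\([0-9a-fA-F]*\).*/\1/p'
}
```

For example, dmesg | bad_trap_types or bad_trap_types < /var/adm/messages prints one trap type per BAD TRAP message. For the sample output, the result is 7.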

For more information on the dmesg command, refer students to the online man pages or the docs.sun.com
Web site.

When a bad trap occurs, the stack traceback of the process or thread that
caused the bad trap enables you to determine the cause of the trap. A
stack traceback provides the history of the thread that caused the trap.
The traceback also enables you to identify the sequence in which routines
were called before the trap.

Note You can compare the stack trace with the traces in bug reports to
verify whether the trap occurred because of a known bug.

Introducing a System Hang


A system hang is a condition in which the system appears to have
stopped processing. When a system hangs, the kernel does not question
the integrity of system data and the system does not reboot. Therefore, a
system hang is different from a system panic.
To diagnose a system hang, force the system to panic so that you can
conduct a postmortem analysis of memory. To force the system to panic,
complete the following steps:
1. Use the Stop-A key sequence to access the ok prompt.

2. Run the OBP diagnostic commands, such as the .registers and
   .locals commands, to capture the status of the registers.

3. Run the sync command at the ok prompt to force a system panic and
   a reboot.

4. Check for the system dump files in the /var/crash/`uname -n`
   directory when the system reboots.

5. Examine the system dump to identify the cause of the hang.
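
The check for dump files after the reboot can be scripted. The following is a sketch under stated assumptions: latest_dump_pair is a hypothetical helper name, and it assumes that the dump files use the unix.n and vmcore.n naming that the savecore utility produces.

```shell
# Hypothetical helper: print the highest-numbered unix.N/vmcore.N pair in
# a savecore directory, or return 1 if no complete pair exists.
latest_dump_pair() {
    dir=$1
    best=
    for f in "$dir"/vmcore.*; do
        [ -f "$f" ] || continue                  # no match: glob stays literal
        n=${f##*.}                               # numeric suffix after the dot
        case $n in ''|*[!0-9]*) continue ;; esac # skip non-numeric suffixes
        [ -f "$dir/unix.$n" ] || continue        # need the matching namelist
        if [ -z "$best" ] || [ "$n" -gt "$best" ]; then
            best=$n
        fi
    done
    [ -n "$best" ] || return 1
    printf '%s %s\n' "$dir/unix.$best" "$dir/vmcore.$best"
}
```

For example, latest_dump_pair "/var/crash/`uname -n`" prints the pair to examine, or prints nothing and returns a nonzero status if no dump was saved.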

If required, inform students that the .registers and .locals commands are discussed later in the module.

Generating a System Crash Dump


Use the OH to explain the process of crash dump generation.

Process of System Crash Dump Generation on page OH 7-4

This section describes the process of generating a system crash dump. The
first two phases are preparatory steps for generating a crash dump, and
the remaining phases are a part of the process of crash dump generation.
Figure 7-1 lists the steps in the process of generating a system crash
dump.

Figure 7-1 Process of Generating a System Crash Dump

Writing the System Crash Dump


When a system crashes, a copy of kernel memory is first written to a
dump device and then saved to a file system for analysis and debugging.

Writing to the Dump Device


A swap partition in the Solaris OE is configured as the default dump
device. A swap partition is a disk partition that is reserved in the Solaris
OE as a backup store of virtual memory for the OS. The swap partition
does not contain any permanent information but stores the address space
of the kernel in case of a system panic.

Note: The swap devices configured on the system are listed in the
/etc/vfstab file. The first swap entry in the file corresponds to the
primary swap device.

You can configure the system with a single primary swap partition and
multiple secondary swap partitions. If the dump is too large for the
primary swap partition, the system writes the core dump to the secondary
swap partitions.
Ask students to open the /etc/vfstab file and identify various swap devices configured on their systems.
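
The swap entries can also be listed from the command line. The following is a sketch, assuming the standard seven-field /etc/vfstab layout in which the fourth field holds the file system type; swap_devices is a hypothetical helper name.

```shell
# Hypothetical helper: print the device of every swap entry in a
# vfstab-format file, in file order (the first line printed is the
# primary swap device). Defaults to /etc/vfstab.
swap_devices() {
    awk '$1 !~ /^#/ && $4 == "swap" { print $1 }' "${1:-/etc/vfstab}"
}
```

An entry such as "swap - /tmp tmpfs - yes -" is not reported because its file system type is tmpfs, not swap.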

Note: If the aggregate size of all the swap partitions is less than the
size of the system crash dump, the kernel does not create a system crash
dump.

Each system crash dump contains a header, which the system always
writes to the end of the primary swap partition. The header contains
information about the size and location of the dump. The header
information enables the system to locate and save the dump when the
system reboots.
You can configure the system to save a system crash dump either partially
or completely. A partial dump contains the crash dump header and a
copy of a part of physical memory. A complete dump contains the dump
header and a copy of the entire physical memory.

Note: If a swap device does not exist or is not configured as a dump
device, the system crash dump feature is disabled, and an error message
is printed on the system console.

You use the dumpadm command to configure the appropriate swap device
as the dump device. The dumpadm command also enables you to
reconfigure crash dump parameters to modify the dump configuration.
To view the current dump configuration in your Solaris OE, type the
dumpadm command at the command line without any arguments. The
following is a sample output of the dumpadm command:
# dumpadm
Dump content: kernel pages
Dump device: /dev/dsk/c0t0d0s3 (swap)
Savecore directory: /var/crash/sun
Savecore enabled: yes
Inform students that the dumpadm command is described later in the module.
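
If a script needs a single value from this report, the output can be filtered by label. The following is a sketch only: dumpadm_field is a hypothetical helper name, and it assumes the "label: value" layout of the sample output above.

```shell
# Hypothetical helper: read dumpadm-style output on standard input and
# print the value that follows the given label.
# $1 is the label without the colon, for example "Dump device".
dumpadm_field() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*//p"
}

# Demonstration with the sample output from this section.
printf '%s\n' \
    'Dump content: kernel pages' \
    'Dump device: /dev/dsk/c0t0d0s3 (swap)' \
    'Savecore directory: /var/crash/sun' \
    'Savecore enabled: yes' |
    dumpadm_field 'Savecore directory'
```

On a Solaris OE system, the equivalent call would be dumpadm | dumpadm_field 'Dump device'.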

Configuring the System to Process Crash Dumps


You must configure the Solaris OE to process a system crash dump when
the system crashes. You use the dumpadm command to specify the dump
device.

Using the dumpadm Command


The dumpadm command is an administrative command that helps you to
configure the crash dump utility in the Solaris OE. You also use the
dumpadm command to display and reconfigure crash dump parameters.
The following is the syntax of the dumpadm command:
/usr/sbin/dumpadm [-nuy] [-c content-type] [-d dump-device]
[-m mink|minm|min%] [-s savecore-dir] [-r root-dir]

Options of the dumpadm Command on page OH 7-5

Table 7-2 shows the options supported by the dumpadm command.

Table 7-2 Options of the dumpadm Command

Option               Description

-c content-type      Specifies the contents of the crash dump. Valid
                     content types include the following:
                       kernel   Kernel memory pages only
                       all      All memory pages
                       curproc  Memory pages of the process that
                                crashed and the kernel

-d dump-device       Modifies the dump configuration to use the
                     specified dump device

-m mink|minm|min%    Creates a minfree file in the current savecore
                     directory, indicating that the savecore utility
                     should maintain at least the specified amount of
                     free space in the file system in which the
                     savecore directory is located

-n                   Modifies the dump configuration so that the
                     savecore utility does not run automatically on
                     system reboot

-r root-dir          Specifies an alternative root directory relative to
                     which the dumpadm command should create files

-s savecore-dir      Modifies the location of the savecore directory
                     to the absolute path in the savecore-dir argument

-y                   Enables the automatic execution of the savecore
                     utility on system reboot

If necessary, remind students that the savecore utility is enabled by default when the system reboots.

To view the current dump configuration, run the dumpadm command at
the command line.

The following is a sample output when you run the dumpadm command:
# dumpadm
Dump content: kernel pages
Dump device: /dev/dsk/c0t0d0s3 (swap)
Savecore directory: /var/crash/SunSparc1
Savecore enabled: yes

Ask students to run the dumpadm command to view the current dump configuration on their systems.

In the following example, the -s savecore-dir option modifies the
location of the savecore directory:
# dumpadm -s /var/crash/mydump
Dump content: kernel pages
Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/mydump
Savecore enabled: yes
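
Several dumpadm options can be given in one invocation. The following is a sketch, not a definitive procedure: configure_dump and the APPLY variable are names introduced here for illustration. By default the function only prints the command line so that you can review it, and it runs the command only when APPLY=1 is set.

```shell
# Hypothetical wrapper around dumpadm.
# Usage: configure_dump CONTENT-TYPE DUMP-DEVICE SAVECORE-DIR MINFREE
configure_dump() {
    # Build the command from the -c, -d, -s, and -m options.
    set -- /usr/sbin/dumpadm -c "$1" -d "$2" -s "$3" -m "$4"
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"                    # change the dump configuration
    else
        printf '%s\n' "$*"      # dry run: show the command only
    fi
}

# Dry run with values taken from the examples in this section.
configure_dump kernel /dev/dsk/c0t0d0s3 /var/crash/mydump 10%
```

Printing the command first is a deliberate choice: a wrong dump device or savecore directory is easier to catch before the configuration changes.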

Using the savecore Command Automatically


When a pre-Solaris 7 OE boots, it executes the /etc/init.d/sysetup
script by default. The /etc/init.d/sysetup script, in turn, executes the
savecore utility.
The /etc/init.d/sysetup script contains a section of Bourne shell
code that runs the savecore utility. However, by default, this section of
code is commented out. You must remove the comment characters from the
following lines of code in the /etc/init.d/sysetup script to run the
savecore utility.
##
## Default is to not do a savecore
##
# if [ ! -d /var/crash/`uname -n` ]
# then mkdir -m 0700 -p /var/crash/`uname -n`
# fi
# echo 'checking for crash dump...\c '
#savecore /var/crash/`uname -n`
# echo ''

Note: The savecore utility is enabled by default in the Solaris 9 OE.
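
For comparison, the sysetup fragment with the comment characters removed might look like the following sketch. It is wrapped in a function so that it can be tried safely; run_savecore is a hypothetical name, and the check for the presence of the savecore utility is an addition that is not in the original script.

```shell
# Sketch of the enabled sysetup fragment.
# $1 is the savecore directory; the default matches the original script.
run_savecore() {
    dir=${1:-/var/crash/`uname -n`}
    if [ ! -d "$dir" ]; then
        mkdir -m 0700 -p "$dir" || return 1
    fi
    echo 'checking for crash dump...'
    # Addition for illustration: call savecore only where it exists.
    if command -v savecore >/dev/null 2>&1; then
        savecore "$dir"
    else
        echo 'savecore not found; skipping'
    fi
}
```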

If time permits, ask students to open the /etc/init.d/sysetup script on their respective systems and
identify the difference between the script on their systems and the script from the pre-Solaris 7 OE provided
in the student guide.

Using the savecore Command Manually


The savecore command saves a crash dump of the kernel and writes a
reboot message in the shutdown log. You can also run the savecore
command manually from the command line.
The following is the syntax for the savecore command:
/usr/bin/savecore [-Lvd] [-f dumpfile] [directory]
The savecore command supports the following options:

-L  Saves a crash dump of the running Solaris OE without rebooting or
    modifying any configuration parameters. To perform a live crash
    dump, you must configure a dedicated dump device in the Solaris OE.
    The following output is displayed when you run the savecore
    command with the -L option:
    # savecore -L
    dumping to /dev/dsk/c0t0d0s3, offset 65536, content: kernel
    100% done: 6499 pages dumped, compression ratio 2.84, dump succeeded
    System dump time: Mon Jan 14 12:06:24 2002
    Constructing namelist /var/crash/sun/unix.1
    Constructing corefile /var/crash/sun/vmcore.1
    100% done: 6499 of 6499 pages saved

-v  Enables verbose error messages from the savecore command.

-d  Forces the savecore command to save a crash dump even if the
    dump header indicates that the dump is already saved.

-f dumpfile  Saves a core dump from the specified file instead of
    the dump device.

directory  Saves the crash dump to the specified directory. If you
    do not specify a directory on the command line, the savecore
    command saves the core dump to the default savecore directory.

Ask students to locate the default savecore directory on their respective systems. The savecore directory is
located in the /var/crash/hostname directory. For more information on the dumpadm utility, refer students to
the relevant man pages.

For more information on the savecore command, refer students to the man pages at the docs.sun.com Web
site.

Copying From the Dump Device to the savecore Directory


When a system reboots after a crash, the /etc/init.d/savecore script
invokes the savecore command. The savecore command performs the
following tasks:

- Determines if a system crash dump is written to the dump device as
  configured by the dumpadm command.

- Checks the amount of free space in the /var/crash file system.

- Compares the amount of space in the /var/crash file system with
  the value of the minfree variable.

  Note: If the minfree variable is not specified, the system assumes a
  default value of 1 Mbyte.

- Moves the system crash dump and a copy of the kernel files from the
  dump device to the file system specified by the dumpadm command.
  This is true if the free space on the file system is greater than the
  value specified by the minfree variable.

Note: The savecore command saves the unix.n and vmcore.n files.
The variable n is incremented each time a system saves a crash dump.
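
The free-space comparison can be sketched as simple arithmetic. This example assumes that the rule is "after the dump is saved, the free space must not fall below minfree", which matches the description of the -m option; the function name and the use of Kbytes as the unit are choices made here for illustration.

```shell
# Hypothetical check: succeed if a dump of dump_kb Kbytes can be saved
# while at least minfree_kb Kbytes stay free.
# Usage: dump_can_be_saved FREE_KB DUMP_KB [MINFREE_KB]
dump_can_be_saved() {
    free_kb=$1
    dump_kb=$2
    minfree_kb=${3:-1024}       # 1 Mbyte default, as stated in the note
    [ $((free_kb - dump_kb)) -ge "$minfree_kb" ]
}
```

For example, dump_can_be_saved 500000 100000 succeeds, whereas dump_can_be_saved 101000 100000 fails because only 1000 Kbytes would remain free.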

Managing Application Core Dumps


The preceding section in this module focused on using the dumpadm
command to manage system crash dumps in case of a kernel panic. The
Solaris OE also enables you to manage application core dumps. When the
kernel terminates an application abnormally, a core file is created,
which you use for postmortem analysis. The coreadm command manages
these core files.
You use the coreadm command to create descriptive file names for core
files and to specify the location on a system in which the application core
files are saved. Therefore, you can save all the core files, each with its own
unique name, within a central repository, such as an NFS-exported file
system.

You use the coreadm command to configure the following core file paths:

- Per-process: Enables you to specify core file naming specifications
  for a particular PID. Only the process owner has read and write
  permissions on the generated core file.
  For example:
  # coreadm -p /var/mycore.%p.%u.%n 314
  where the -p option specifies a per-process pattern.

  Note: The per-process core file path is enabled by default.

- Global: Generates an additional core file with the same content as
  the per-process core file. Only the superuser has read and write
  permissions on the generated file.
  For example:
  # coreadm -g /var/corefiles.%p.%u -e global
  where:

  - The -g option specifies a global pattern

  - The -e global option enables the global pattern

  In the preceding example, you configure the kernel to write the
  application core dumps for all the processes on the system in the
  /var directory with the name corefiles. This name is followed by a
  dot, the PID of the process that ended, a dot, and the effective UID
  of the process.
Consider an example in which the global core file path is set to
/var/core/core.%f.%p and a sendmail process with PID 12345
terminates abnormally. In this example, the system generates a core file,
/var/core/core.sendmail.12345.

You use the coreadm command to set name patterns for core files.
Table 7-3 displays the variables that you use to specify name patterns for
core files.
Table 7-3 Variables for Core File Names

Variable Name    Variable Definition

%p               PID

%u               Effective UID

%g               Effective group ID

%f               Executable file name

%n               System node name (equivalent to the output of the
                 uname -n command)

%m               Machine name (equivalent to the output of the
                 uname -m command)

%t               Decimal value of the time(2) system call

%%               Literal percent sign

To display the name pattern of the per-process core file for one or more
processes, run the coreadm command at the command line with a list of
PIDs.
$ coreadm 278 5678
278: core.%f.%p
5678: /home/george/cores/%f.%p.%t
Refer students to Module 5, Performing Solaris OE Diagnostics, for more information on the coreadm
command.
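
To check how a name pattern expands before you set it, the substitution can be imitated outside the kernel. The following is an illustrative sketch only: expand_core_pattern is a hypothetical function that implements the variables in Table 7-3, and the values for each variable are passed in by the caller instead of being taken from a real process.

```shell
# Hypothetical expander for coreadm-style name patterns.
# Usage: expand_core_pattern PATTERN PID UID GID FILE NODE MACHINE TIME
expand_core_pattern() {
    awk -v pat="$1" -v p="$2" -v u="$3" -v g="$4" -v f="$5" \
        -v n="$6" -v m="$7" -v t="$8" 'BEGIN {
        out = ""
        # Copy text up to each "%", then substitute the variable after it.
        while ((i = index(pat, "%")) > 0) {
            out = out substr(pat, 1, i - 1)
            c = substr(pat, i + 1, 1)
            if      (c == "p") out = out p
            else if (c == "u") out = out u
            else if (c == "g") out = out g
            else if (c == "f") out = out f
            else if (c == "n") out = out n
            else if (c == "m") out = out m
            else if (c == "t") out = out t
            else if (c == "%") out = out "%"
            else               out = out "%" c   # unknown: kept as written
            pat = substr(pat, i + 2)
        }
        print out pat
    }'
}

# The sendmail example from this section: PID 12345, executable sendmail.
expand_core_pattern '/var/core/core.%f.%p' 12345 0 0 sendmail mysun sun4u 0
```

The call at the end reproduces the earlier sendmail example and prints /var/core/core.sendmail.12345.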

Introducing Watchdog Resets


A watchdog reset is an extremely rare condition, which is triggered by the
processor instead of the kernel.

Identifying Causes and Effects of Watchdog Resets


Watchdog resets can occur in the following scenarios:

- When the processor cannot manage traps, a CPU watchdog reset is
  generated.
  CPU watchdog resets can have their origin in either hardware or
  software. However, system watchdog resets occur only in hardware.

- When the hardware of a multiboard system identifies any fatal
  condition, a system watchdog reset is generated.
  A system watchdog reset affects all the CPUs and I/O devices.
  Writes in progress during the watchdog reset might be lost.
  However, the state of the main memory is preserved.

- Different revisions of the SPARC architecture have different levels of
  tolerance for managing multiple traps simultaneously. For example:

  - The Sun4m (version 8) architecture of the SPARC specification states
    that the processor can manage a single trap at any point in time.
    Therefore, if a new trap occurs while the processor is managing a
    trap, a CPU watchdog reset is generated.

    Note: The Sun4m architecture of the SPARC specification does not
    support nested traps. However, the Sun4U (version 9) architecture
    of the SPARC specification defines multiple trap levels and manages
    nested traps.

  - The Sun4U (version 9) architecture of the SPARC specification
    enables the system to manage nested traps up to a maximum of five
    levels. If a trap occurs at the maximum trap level, a CPU watchdog
    reset might be generated.

Note: After a watchdog reset occurs, the words Watchdog Reset
appear on the system console, and the system drops to the ok prompt.
However, there is no crash dump written to the dump device to help with
the analysis because the kernel did not panic the system. Therefore, a
watchdog reset is difficult to diagnose.

Identifying the watchdog-reboot? OBP Variable


The value of the watchdog-reboot? variable determines whether the
system reboots after the watchdog reset or remains at the ok prompt. If
the parameter value is set to true, the system reboots automatically after
the watchdog reset. If the parameter value is set to false, the system
remains at the ok prompt.
The watchdog-reboot? variable should generally be set to false. This
ensures that the ok prompt is accessible and you can use diagnostic tools,
such as the .registers and .locals commands, to identify the source
of the watchdog reset. You can also run the sync command to generate a
core dump, which you can use for further diagnosis.
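
The variable can also be set from a running Solaris OE. The following sketch assumes that the eeprom utility is available on the system, which this module does not cover; set_watchdog_reboot and the APPLY variable are hypothetical names. By default the function prints the command instead of running it.

```shell
# Hypothetical wrapper: set the watchdog-reboot? OBP variable.
# Usage: set_watchdog_reboot true|false
set_watchdog_reboot() {
    case $1 in
        true|false) ;;
        *) echo 'usage: set_watchdog_reboot true|false' >&2; return 2 ;;
    esac
    # The quotes keep the shell from treating "?" as a wildcard.
    if [ "${APPLY:-0}" = 1 ]; then
        eeprom "watchdog-reboot?=$1"
    else
        printf '%s\n' "eeprom 'watchdog-reboot?=$1'"
    fi
}

# Dry run: show the command that keeps the system at the ok prompt.
set_watchdog_reboot false
```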

Displaying the Register Contents by Using OBP Commands


You run various OBP commands at the ok prompt to check the behavior
of the system after a watchdog reset.
The following commands display the contents of the registers during a
watchdog reset:

- .registers: Displays the internal registers of the current CPU.

- .locals: Displays the registers in the current register window.

- ctrace: Displays the kernel stack. If the misc/obpsym kernel
  module is loaded, the output includes useful symbolic information.
  However, if the misc/obpsym module is not loaded, you must
  interpret the kernel stack along with a crash dump.

Identifying the misc/obpsym Kernel Module


The misc/obpsym kernel module installs the OpenBoot callback handlers
that provide symbolic information in the PROM environment. The
obpsym module enables you to use kernel-symbolic names anywhere in
the OpenBoot firmware command interface. For example, the ctrace
command displays a symbolic name instead of a numeric address after
the system installs the obpsym module. This helps you to debug watchdog
resets.

Note: By default, the PROM environment displays the information as
addresses without including any symbolic textual information.

To make the symbolic information available, you must load the
misc/obpsym kernel module in the Solaris OE.
To check whether the misc/obpsym kernel module is loaded, run the
following command:
# modinfo | grep obpsym
Use the following modload command to load the module from the
command line:
# modload -p misc/obpsym
You run the modinfo command after loading the misc/obpsym kernel
module to generate the following output:
# modinfo | grep obpsym
244 780e5a39 72a obpsym<OBP symbol callbacks 1.23>

To ensure that the obpsym module is loaded across system resets, add the
following entry in the /etc/system file:
forceload:misc/obpsym
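
If you script this change, it is worth making the edit idempotent so that the entry is not added twice. The following is a minimal sketch: ensure_forceload is a hypothetical helper name, and the file argument is provided so that you can try the function on a copy before you touch /etc/system.

```shell
# Hypothetical helper: append the obpsym forceload entry to a
# system(4)-style file unless an equivalent entry is already present.
ensure_forceload() {
    file=${1:-/etc/system}
    # Accept optional white space after the colon when checking.
    if grep '^[[:space:]]*forceload:[[:space:]]*misc/obpsym' "$file" \
        >/dev/null 2>&1; then
        return 0
    fi
    echo 'forceload:misc/obpsym' >> "$file"
}
```

For example, run ensure_forceload /tmp/system.copy against a copy of the file, review the result, and then apply the same change to /etc/system.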

Exercise: Introducing Types of System Failures


In this exercise, you reinforce what you learned in this module by
describing the types of system failures and the process of system dump
generation.
Inform students that they can refer to the student guide to attempt the exercise. Display the relevant OH
slides as students attempt the exercise questions.

Preparation
Boot the Solaris OE.

Tasks
Answer the following questions:

1. List the types of system faults.

2. Which kernel function is executed when a system panic occurs?
   What actions are performed by the kernel function?

3. List the steps to diagnose a system hang.

4. List the tasks in the process of generating a core dump.

5. In which files of the default crash directory does the savecore utility
   save the core dump?

6. Which command do you use to set the path name for a global core
   file to include the PID and the name of the executable file? Use the
   default crash directory path for the core file.

7. Which is the command that you use to assign the swap device in the
   Solaris OE as the dump device?

8. Which is the command that you use to check whether the
   misc/obpsym module is loaded in the Solaris OE? If the module is
   not loaded, which command do you use to load the module?

Exercise Summary

Discussion: Take a few minutes to discuss what experiences, issues, or
discoveries you had during the lab exercise.

Manage the discussion based on the time allowed for this module, which was provided in the About This
Course module. If you do not have time to spend on discussion, highlight just the key concepts students
should have learned from the lab exercise.

Experiences

Ask students what their overall experiences with this exercise have been. Go over any trouble spots or
especially confusing areas at this time.

Interpretations

Ask students to interpret what they observed during any aspect of this exercise.

Conclusions

Have students articulate any conclusions they reached as a result of this exercise experience.

Applications

Explore with students how they might apply what they learned in this exercise to situations at their workplace.

Exercise Solutions
The following are the solutions for the questions in the exercise:

1. List the types of system faults.

   The following are the types of system faults:

   - System panics

   - Bad traps

   - System hangs

2. Which kernel function is executed when a system panic occurs?
   What actions are performed by the kernel function?

   The panic() kernel routine is executed when a system panics. The
   panic() kernel routine performs the following actions in response to
   the system panic:

   - Displays a panic message at the console

   - Performs a stack trace

   - Generates a memory dump

   - Reboots the system

3. List the steps to diagnose a system hang.

   To diagnose a system hang, complete the following steps:

   a. Use the Stop-A key sequence to switch the system to the ok prompt.

   b. Run the OBP diagnostic commands to capture the status of the
      system.

   c. Run the sync command to force a system panic.

   d. Check for the core dump when the system reboots.

   e. Study the core dump to determine the cause of the system hang.

4. List the tasks in the process of core dump generation.

   The process of core dump generation includes the following tasks:

   - Displaying an error message on the console

   - Saving a copy of the physical memory onto the dump device

   - Copying data from the dump device to the crash dump directory

5. In which files of the default crash directory does the savecore utility
   save the core dump?

   The savecore utility saves the core dump data in the vmcore.n file and
   the copy of the kernel files in the unix.n file.

6. Which command do you use to set the path name for a global core
   file to include the PID and the name of the executable file? Use the
   default crash directory path for the core file.

   # coreadm -g /var/crash/%f.%p

7. Which is the command that you use to assign the swap device in the
   Solaris OE as the dump device?

   To check the swap dump device in the Solaris OE, run the following
   command:
   # dumpadm
   To configure the swap dump device as the dedicated dump device, run the
   following command:
   # dumpadm -d swap

8. Which is the command that you use to check if the misc/obpsym
   module is loaded in the Solaris OE? If the module is not loaded,
   which command do you use to load the module?

   To check if the misc/obpsym module is loaded in the Solaris OE, run the
   following command:
   # modinfo | grep obpsym
   To load the misc/obpsym module, run the following command:
   # modload -p misc/obpsym

Module 8

Analyzing Core Dumps Using the mdb Utility


Objectives
Overview on page OH 8-2

Upon completion of this module, you should be able to:

- Describe the mdb utility

- Use the mdb utility

Relevance
Present the following questions to stimulate the students and get them to think about the issues and topics
presented in this module. While they are not expected to know the answers to these questions, the answers
should be of interest to them and inspire them to learn the material presented in this module.

Relevance on page OH 8-3

Discussion: The following questions are relevant to understanding the
tools required to analyze core dumps:

- Have you experienced a system panic in the Solaris OE?

Allow students to share their work experiences and describe how they analyzed the problem of a system
panic. Ask them to list the steps they performed to reach a solution.

- How do you successfully configure a system for processing core
  dumps?

Allow students to list the steps they performed at their work places to configure a system that helps them to
process core dumps successfully.

Additional Resources
Additional resources: The following references provide additional
information on the topics discussed in this module:

- Solaris Manual Pages (http://docs.sun.com), accessed 18 January 2002.

- Solaris User and System Administration Answer Books
  (http://docs.sun.com), accessed 18 January 2002.

Introducing the mdb Utility


The mdb utility is an extensible modular debugging utility. This utility
provides an application programming interface (API) that enables you to
compile modules and perform tasks within the context of the debugger.
You use the mdb utility to debug complex software systems, the Solaris OE
kernel, and related device drivers and kernel modules. The mdb utility
also provides a dynamic module utility that you can use to develop
modules and debugging commands.
Inform students that the mdb utility is the recommended debugging utility for the Solaris OE kernel.

The mdb utility supports the following flags:
mdb [-fkmuwyAFMS] [±o option] [-p pid] [-s distance]
[-I path] [-L path] [-P prompt] [-R root] [-V dis-version]
[object [core] | core | suffix]
Refer students to the man pages for more information on the flags supported by the mdb utility.

Note: The syntax of the mdb utility is compatible with the syntax of the
kadb and adb utilities. The mdb utility can execute all the macros of the
kadb and adb utilities.

You use the following command to launch the mdb utility against the
running kernel:
# mdb -k
Loading modules: [ unix krtld genunix ip ufs_log nfs random
ptm lofs ipc logindmux cpc ]
>
After you launch the mdb utility, you can change the default prompt:
> $P"mdb: "
You can also invoke help to find out which options are available for a
particular command. For example, you use the following command to
find out the options that are available with the ps command:
mdb: ::help ps
::ps [-fltTP] - list processes (and associated thr,lwp)


In the preceding output, the -l and -t options enable you to display the
LWPs and threads that are associated with each process, along with their
current status.
mdb: ::ps -lt
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R      0      0      0      0      0 0x00000019 00000000014393b8 sched
        T  t0 <TS_STOPPED>
        L  lwp0 ID: 1
R      3      0      0      0      0 0x00020019 00000300004e6008 fsflush
        T  0x300004e37c0 <TS_SLEEP>
        L  0x300004e14a8 ID: 1
R      2      0      0      0      0 0x00020019 00000300004e6a20 pageout
        T  0x300004e3a60 <TS_SLEEP>
        L  0x300004e1818 ID: 1
R      1      0      0      0      0 0x00004008 00000300004e7438 init
        T  0x300004e3d00 <TS_SLEEP>
        L  0x300004e1b88 ID: 1
R    337      1    337    337      0 0x00010008 0000030001b7a060 sendmail
        T  0x30000c44040 <TS_SLEEP>
        L  0x30000c20aa8 ID: 1
R    320      1    320    320      0 0x00000008 0000030001b8d488 devfsadm
        T  0x30000c44d60 <TS_SLEEP>
        L  0x30000c21bd8 ID: 1
        T  0x30000c44ac0 <TS_SLEEP>
        L  0x30000c21868 ID: 2
        T  0x30000b8f7a0 <TS_SLEEP>
        L  0x30000b9e038 ID: 3
        T  0x30000c442e0 <TS_SLEEP>
        L  0x30000c20e18 ID: 4
R    304      1    304    304      0 0x00020008 0000030001b8ca70 snmpXdmid
        T  0x30001b42800 <TS_SLEEP>
        L  0x30001b40048 ID: 1
        T  0x30000b8e7e0 <TS_SLEEP>
        L  0x30000bf6a90 ID: 2
R    301      1    301    301      0 0x00000008 0000030001b6f478 dmispd
        T  0x30001b43a60 <TS_SLEEP>
        L  0x30001b41858 ID: 1
        T  0x30001b42fe0 <TS_SLEEP>
        L  0x30001b40a98 ID: 2
R   3388   3387   3388   3388      0 0x00004008 0000030001c5a088 dtterm
        T  0x30001c61520 <TS_SLEEP>
        L  0x30001c5f198 ID: 1
R   3560    426    426    426      0 0x00004008 0000030001cbf4c0 mdb
        T  0x30001c61a60 <TS_ONPROC>

In the preceding output, a thread in the TS_ONPROC state is the thread
that was running on a processor. When you examine a crash dump, you use
this state to determine the running thread that caused the panic.
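
On a system with many processes, you can filter a saved copy of the ::ps -lt output instead of reading it by eye. The following is a sketch under a stated assumption: it expects each process on one line with at least nine fields ending in the process name, followed by indented T and L lines, as mdb prints them. The onproc_threads name is hypothetical.

```shell
# Hypothetical filter: read "::ps -lt" output on standard input and print
# "PID NAME THREAD-ADDRESS" for every thread in the TS_ONPROC state.
onproc_threads() {
    awk 'NF >= 9 { pid = $2; name = $NF; next }   # process line
         $1 == "T" && $NF == "<TS_ONPROC>" { print pid, name, $2 }'
}

# Demonstration with two processes taken from the output in this section.
printf '%s\n' \
    'R 3388 3387 3388 3388 0 0x00004008 0000030001c5a088 dtterm' \
    '        T  0x30001c61520 <TS_SLEEP>' \
    '        L  0x30001c5f198 ID: 1' \
    'R 3560 426 426 426 0 0x00004008 0000030001cbf4c0 mdb' \
    '        T  0x30001c61a60 <TS_ONPROC>' |
    onproc_threads
```

Process lines are told apart from thread lines by their field count, so a stopped process whose state column is also T is not mistaken for a thread.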

Features of the mdb Utility


The mdb utility is a general purpose debugger and analyzer for the kernel
and the user processes. You also use the mdb utility to examine and
modify the running kernel, kernel crash dumps, running processes,
process core files, and object files.
If students have enough experience, ask them to list the features of the mdb utility. Use a flip chart to list the
inputs from students. Explain the applicable and non-applicable features to students, and cross out the
non-applicable features.

The mdb utility has the following features:

- Enables the postmortem analysis of the Solaris OE kernel crash
  dumps and the user process core dumps.
  The mdb utility includes a collection of debugger modules, which
  facilitates the analysis of the Solaris OE kernel and the process states.
  The debugger modules enable you to formulate complex queries for
  the following tasks:

  - Locating all the memory allocated by a particular thread

  - Printing a visual representation of a kernel STREAM

  - Determining the type of structure that is referred to by a
    particular address

  - Locating leaked memory blocks in the kernel

  - Analyzing memory to locate stack traces

- Implements debugger commands and analysis tools by using a
  programming API.
  The mdb utility has a set of loadable modules, which provides
  support for debugging the core dumps. Each module provides a set
  of commands, which extends the capabilities of the debugger. The
  debugger provides an API of core services, such as reading and
  writing memory and accessing symbol table information. You can
  use the mdb utility to develop modules without recompiling or
  modifying the debugger.

Inform students that they can use the mdb utility to debug existing software programs and develop their own
modules. This helps them to debug the drivers and applications in the Solaris OE.

Provides compatibility with other debugging utilities.


The mdb utility provides backward compatibility with debugging
utilities, such as the adb and crash utilities. The mdb utility is a
superset of the adb utility and supports all existing adb macros and
commands. The mdb utility also provides commands that exceed the
functionality of the crash utility.

Offers other usability features:

Command-line editing

Command history

Built-in output pager

Syntax error-checking and handling

Online help

Interactive session logging

Command pipelining


Limitations of the mdb Utility


The mdb utility has the following limitations:

Examining process core dumps: The mdb utility does not provide
support for examining the process core dumps generated on the
Solaris 2.4 OE to Solaris 2.6 OE versions. The runtime link editor
debugging interface (librtld_db) might not be initialized if you
examine the core dump on one Solaris OE version from another
Solaris OE version. Therefore, the symbol information for shared
libraries is not available. In addition, if the mappings for the shared
libraries are not available in the user core dumps, the text section
and the read-only data of the shared libraries might not be the same
as the data in the core dump.

Examining crash dumps: The mdb utility uses the libkvm library
routine from the corresponding operating system release to examine
the crash dumps that are generated on the Solaris 2.4 OE to Solaris 7
OE versions. If you use debugger modules (dmods) from one Solaris
OE version to examine a crash dump on another Solaris OE version,
the changes in kernel implementation might prevent some debugger
commands (dcmds) or walkers from functioning properly.

Provide the following information to the students:

The debugger command (dcmd)

A debugger command or dcmd (pronounced dee-command) is a routine in the debugger that can access any
properties of the current target. The mdb utility parses commands from the standard input and executes the
corresponding dcmds. Each dcmd can also accept a list of string or numerical arguments.

debugger module (dmod)

A debugger module or dmod (pronounced dee-mod) is a dynamically loaded library that contains a set of
dcmds and walkers. During initialization, the mdb utility attempts to load the dmods that correspond to the
load objects in the target. You can subsequently load or unload dmods any time while running the mdb utility.

walker

A set of routines that describe how to iterate through the elements of a particular program data structure. A
walker encapsulates the implementation of a data structure from dcmds and the mdb utility. You can use
walkers interactively or use them to build other dcmds or walkers.

Note: The mdb utility might not provide support for examining the core
and crash dumps on an Intel platform from a SPARC platform or on a
SPARC platform from an Intel platform.


General mdb Command Formats


The following is the general command format of the mdb utility:
[ address ] [ ,count ] command [ ; ]

Note: The address parameter is a kernel symbol or a numeric address. If
you do not specify the address parameter, the mdb utility uses the current
location. The dot (.) also refers to the current location. If you do not
specify the count parameter, the mdb utility uses the default value of 1.

The commands of the mdb utility consist of a verb followed by a modifier
or a list of modifiers.
The following are the verbs that you specify in the mdb command:

The ? verb: Displays code or variables in an executable object file

The / verb: Displays data from the core file

The = verb: Prints values in different formats

The $< verb: Includes macro invocations for miscellaneous commands

The > verb: Assigns a value to a variable or a register

The < verb: Reads a value from a variable or a register

The Return verb: Repeats the previous command with a count of 1
and increments the current location represented by a dot (.)
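
As a rough illustration of the general command format, the following Python sketch splits a command line into its address, count, verb, and modifier parts. The split_mdb_command name is hypothetical and is not part of the mdb utility. The sketch covers only the verbs listed above and does not handle a register reference (<name) inside the address expression.

```python
# Verbs taken from the list in this section.
VERBS = ("$<", "?", "/", "=", ">", "<")


def split_mdb_command(line):
    """Split '[address][,count]verb[modifiers][;]' into its parts.

    Hypothetical helper for illustration only. It does not reproduce
    how the mdb utility itself parses its input.
    """
    text = line.strip().rstrip(";").strip()
    found = None  # (position, verb) of the leftmost verb in the line
    for verb in VERBS:
        pos = text.find(verb)
        # The leftmost match wins, so "$<" (which starts one character
        # before its own "<") is chosen for a macro invocation.
        if pos != -1 and (found is None or pos < found[0]):
            found = (pos, verb)
    if found is None:
        raise ValueError("no verb found in: " + line)
    pos, verb = found
    address, _, count = text[:pos].partition(",")
    return {
        "address": address.strip() or ".",  # default: current location
        "count": count.strip() or "1",      # default: count of 1
        "verb": verb,
        "modifiers": text[pos + len(verb):].strip(),
    }


if __name__ == "__main__":
    for command in ("ksyms_open/20i", "3000178fa80$<thread", "panic_thread/K"):
        print(command, "->", split_mdb_command(command))
```

The count is kept as text rather than converted to a number, because the debugger, not this sketch, decides how to evaluate it.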

Relationship Between the mdb and adb Utilities


The mdb utility provides support for the adb syntax, built-in commands,
and command-line options. In addition, the mdb utility supports all
existing adb macros and commands. If you know about the usage of the
adb utility, you can use the mdb utility without knowing about the mdb
commands. The adb utility is implemented as a link to the mdb utility in
the Solaris 9 OE.
The /usr/bin/adb path name is a link that invokes the mdb utility
and automatically enables the adb compatibility mode.
Ask students to refer to the online man pages for more information on the compatibility mode between the
mdb and adb utilities. Inform students that this compatibility mode is activated, by default, when you execute
the adb link.


Using the mdb Utility


The mdb utility uses macro files and the information stored in registers to
debug the kernel and related device drivers.

Identifying Macros and Registers


When analyzing a core dump, the mdb utility can use text files that contain
sequences of mdb commands called macros. These macros help you to
examine various kernel data structures. You can use the mdb command to
view the contents of the processor registers to extract information that
enables you to determine the cause of a system crash.
Ask students what they understand about macros. Focus on the advantages of macros and their ease of use.
List the inputs from students on a flip chart.

A macro file is a text file that contains a set of commands. Macro files
automate the process of displaying commonly referenced programming
structures. For example, the proc macro displays the process structure,
the thread macro displays the thread structure, and the inode macro
displays the inode structure. Macros annotate their output, which helps
you to interpret the information in these programming structures.
Inform students that macros facilitate working with the debugger.

Note: The mdb utility provides backward compatibility to execute macro
files that are written for the adb utility. The Solaris OE also includes a set
of macro files to debug the Solaris OE kernel. You can use these macro
files with either the mdb or adb utility.

When you specify a macro name, the mdb utility searches the following
locations for the macro file:

The current directory

The standard /usr/lib/adb directory

The directory specified at the command line by using the -I option



The directory location for the macro files depends on the system
architecture:

On a 32-bit kernel system, the macro files are located in the
/usr/platform/`uname -i`/lib/adb directory.

On a 64-bit kernel system, the macro files are located in the
/usr/lib/adb/sparcv9 and
/usr/platform/`uname -i`/lib/adb/sparcv9 directories.
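
The locations described above can be collected into one list, as in the following Python sketch. The macro_directories name, the placeholder platform string, and the order of the returned list are assumptions made for illustration. This module does not state the order in which the mdb utility searches these directories.

```python
def macro_directories(platform, kernel_bits, include_dirs=()):
    """Return candidate macro directories for a given kernel type.

    platform     -- the string printed by `uname -i` on the target system
    kernel_bits  -- 32 or 64
    include_dirs -- directories named with the -I option, if any

    Hypothetical helper: the list order is an assumption, not the
    documented search order of the mdb utility.
    """
    if kernel_bits not in (32, 64):
        raise ValueError("kernel_bits must be 32 or 64")
    # 64-bit kernels keep their macros in a sparcv9 subdirectory.
    suffix = "/sparcv9" if kernel_bits == 64 else ""
    directories = ["."]                 # the current directory
    directories.extend(include_dirs)    # directories given with -I
    directories.append("/usr/lib/adb" + suffix)
    directories.append("/usr/platform/" + platform + "/lib/adb" + suffix)
    return directories


if __name__ == "__main__":
    # "SUNW,Example-Platform" is a placeholder, not a real platform name.
    for directory in macro_directories("SUNW,Example-Platform", 64):
        print(directory)
```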

Ask students to assign the /usr/lib/adb directory as the current directory and run the ls command to
view the list of available macros.

Macro files display information about programming structures. The
definitions for these structures are located in the header files on the
system.
You use the header files to identify the fields that a macro displays.
Each header file defines one or more structures. For
example, the proc.h header file defines the contents of the process
structure.
If the class has enough experience, ask students to list the frequently used header files and their associated
macros. Use a flip chart to list the inputs from students.

The following are some of the frequently used header files with their
associated macros:

The /usr/include/sys/proc.h file for the proc macro

The /usr/include/sys/thread.h file for the thread macro

The /usr/include/sys/klwp.h file for the lwp macro

The /usr/include/sys/user.h file for the user macro

The /usr/include/sys/cred.h file for the cred macro

The /usr/include/vm/as.h file for the as macro

The /usr/include/vm/seg.h file for the seg macro



The header files are located in the following directories:

/usr/include/sys: Contains system header files

/usr/include/vm: Contains the header files, such as the page.h
and seg.h files, which describe virtual memory structures

/usr/include/sys/fs: Contains header files that describe the
structures and types of file systems

/usr/platform/arch_name/include/sys: Contains the
architecture-dependent structures that are defined in the header files

/usr/include: Contains the net, nfs, rpc, protocols, inet, and
netinet directories with headers that define networking data
structures

Identifying Register References for the mdb Utility


Registers contain information about the status of the system. While
analyzing a system dump, you can use the mdb utility to examine registers
and extract information to determine the cause of the crash.
The mdb utility uses the percent sign (%) and the less than sign (<) in
register references. The SPARC version 9 (v9) specification allows up to
528 general purpose 64-bit registers. Each process can access up to 32 of
these registers at any time. The following are the 32 general-purpose
registers available for analyzing system crash dumps:


Eight general-purpose registers %g0 through %g7

Eight input registers %i0 through %i7

Eight output registers %o0 through %o7

Eight local registers %l0 through %l7



Among the registers available for analyzing system dumps, the following
are reserved for specific purposes:

%o6: Stack pointer, or %sp

%o7: Address of the most recent call instruction, from which the
return address is derived

%i6: Frame pointer, which provides a trace through the stack to the
previous function

%g0: Register whose value is always zero

%g7: Address of the current thread

In a system dump analysis, the most important registers are the program
counter and the stack pointer. The program counter, %pc, is a separate
register that contains the address of the current instruction. The stack
pointer, %o6 or %sp, points to the current stack frame for use with local
variables or return addresses.
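
The figures of 528 and 32 registers can be checked with a short calculation. The following Python sketch assumes the register partition described in the SPARC V9 architecture manual, which this module does not spell out: 8 global registers, 8 alternate global registers, and 16 registers for each of up to 32 register windows. The constant and function names are chosen for illustration.

```python
GLOBALS = 8               # %g0 through %g7
ALTERNATE_GLOBALS = 8     # second set of globals (assumed from SPARC V9)
REGISTERS_PER_WINDOW = 16 # each window adds 16 registers, because the
                          # outs of one window are the ins of the next
MAX_WINDOWS = 32          # assumed upper limit from SPARC V9


def total_registers(windows=MAX_WINDOWS):
    """Number of general-purpose registers for a given window count."""
    return GLOBALS + ALTERNATE_GLOBALS + REGISTERS_PER_WINDOW * windows


def visible_registers():
    """Names of the 32 registers that are visible at any one time.

    Register numbers 0-7 are globals, 8-15 outs, 16-23 locals,
    and 24-31 ins.
    """
    return ["%" + group + str(index)
            for group in "goli"
            for index in range(8)]


if __name__ == "__main__":
    print("maximum registers:", total_registers())
    print("visible registers:", len(visible_registers()))
    print("register 14 is", visible_registers()[14])  # the stack pointer
```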


Examining System Dumps by Using the mdb Utility


Consider a scenario in which you capture a system dump. You use the
mdb utility to determine the following:

The instruction that failed.

The thread that was running at the time of the system panic.

The process that was running at the time of the system panic.

The arguments that were passed to a failing process.

To use the debugger utility for analyzing the crash dump, switch to the
directory in which the dump is located. To switch to the crash directory,
use the following syntax:
# cd /var/crash/`uname -n`
where /var/crash/`uname -n` is the default crash directory.
In this example, the crash directory is /var/crash/sun-sparc-1. You can
use the dumpadm command to determine the current crash directory.
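
The default location can be expressed as a small calculation. The following Python sketch builds the paths of the unix.n and vmcore.n files for a given node name. The crash_dump_files name is hypothetical, and the sketch covers only the default directory. If the crash directory was changed, use the directory that the dumpadm command reports.

```python
import posixpath


def crash_dump_files(nodename, n=0, crash_root="/var/crash"):
    """Return the default (unix.n, vmcore.n) paths for a node name.

    Hypothetical helper for illustration. It only joins path names and
    does not check that the files exist.
    """
    directory = posixpath.join(crash_root, nodename)
    return (posixpath.join(directory, "unix.%d" % n),
            posixpath.join(directory, "vmcore.%d" % n))


if __name__ == "__main__":
    unix_file, vmcore_file = crash_dump_files("sun-sparc-1")
    print(unix_file)
    print(vmcore_file)
```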
Consider a scenario in which one of the Sun systems panics. To solve this
problem, you must first understand the cause of the panic.
Discuss with students what you can achieve from an administrator's perspective after a system panic.

As a system administrator without a programming background, you can
use the mdb utility to perform the following:

Identify the address of the instruction that caused the panic

Identify the address of the thread that was running during the panic

Identify the name and arguments of the processes that were running
during the panic

Note: In addition to using the mdb utility, SunSpectrum support
customers can provide the preceding information, the image
of kernel memory (vmcore.n), the kernel symbol table (unix.n), and the
ISCDA script output directly to the Sun Customer Support Center. The
kernel engineers at Sun use this information for a detailed analysis.



The following section focuses on a limited and manageable subset of the
functionality of the mdb utility. This section helps you to achieve the
preceding objectives by completing the following steps:
1. Invoke the mdb utility on the live kernel.

2. Introduce a bug into the ksyms driver that the system uses to access
the symbol table of the kernel.
Device drivers run in a fully privileged state, and any critical
problem within the driver code causes the kernel to panic the
system.

To invoke the mdb utility on the live kernel, complete the following steps:
1. Type the following command to invoke the mdb utility on the live
kernel:
# mdb -kw /dev/ksyms /dev/mem
Loading modules: [ unix krtld genunix ip usba ufs_log
logindmux ptm isp cpc ipc random nfs ]

2. Display 20 disassembled instructions, starting at the virtual memory
address that corresponds to the kernel symbol ksyms_open.
The repeat count (20) in the command is interpreted as a decimal value.
> ksyms_open/20i
ksyms_open:
ksyms_open:     save      %sp, -0xb0, %sp
                sethi     %hi(0x115f400), %g2
                clr       %o2
                add       %g2, 0x218, %l4
                clr       %l2
                mov       %l4, %o0
                mov       %i0, %l3
                call      -0x76fcf4e8     <ksyms_snapshot>
                clr       %o1
                orcc      %g0, %o0, %l0
                clr       %l1
                bleu,pn   %xcc,+0x44      <ksyms_open+0x70>
                mov       %l2, %o0
                call      -0x76fd12b0     <kmem_free>
                mov       %l1, %o1
                mov       %l0, %o0
                mov       %l0, %l1
                call      -0x76fd1380     <kmem_alloc>
                clr       %o1
                mov       %o0, %l2



3. Write the 4-byte value 0 to the virtual memory location that is
20 bytes (14 in hexadecimal) past the address of the kernel symbol
ksyms_open. The mdb utility interprets the offset in the command as a
hexadecimal (hex) value.
> ksyms_open+14/W 0
ksyms_open+0x14:                0x90100014      =       0x0

4. Display 20 disassembled instructions again, starting at the virtual
memory address that corresponds to the kernel symbol ksyms_open, to
verify the sabotage of the ksyms_open routine. The repeat count (20) in
the command is interpreted as a decimal value.
> ksyms_open/20i
ksyms_open:
ksyms_open:     save      %sp, -0xb0, %sp
                sethi     %hi(0x115f400), %g2
                clr       %o2
                add       %g2, 0x218, %l4
                clr       %l2
PROBLEM! ->     illtrap   0
                mov       %i0, %l3
                call      -0x76fcf4e8     <ksyms_snapshot>
                clr       %o1
                orcc      %g0, %o0, %l0
                clr       %l1
                bleu,pn   %xcc,+0x44      <ksyms_open+0x70>
                mov       %l2, %o0
                call      -0x76fd12b0     <kmem_free>
                mov       %l1, %o1
                mov       %l0, %o0
                mov       %l0, %l1
                call      -0x76fd1380     <kmem_alloc>
                clr       %o1
                mov       %o0, %l2

5. Type the following command to quit the mdb utility:


> ::quit



6. Next, invoke the nm command to access the symbol table of the
kernel.
In the preceding steps, you sabotaged the ksyms_open routine
within the ksyms driver. When the nm command makes a system call
to the kernel for accessing the symbol table using the corrupted
driver, the kernel encounters an illegal trap and panics the system.
# /usr/ccs/bin/nm /dev/ksyms
PANIC!

Note: The kernel displays a stack traceback on the console to show the
routines that led to the panic and also displays the source of the panic.
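
The effect of the 4-byte write in the preceding procedure can be checked by hand. The following Python sketch decodes the two instruction words involved: the original word 0x90100014 and the replacement word 0. It assumes the standard SPARC instruction field layout and handles only these two cases. The function names are hypothetical. The disassembler prints the decoded or %g0, %l4, %o0 instruction in its shorter form, mov %l4, %o0.

```python
REGISTER_GROUPS = "goli"  # r0-r7 globals, r8-r15 outs, r16-r23 locals, r24-r31 ins


def register_name(number):
    """Map a register number (0-31) to its assembler name."""
    return "%" + REGISTER_GROUPS[number // 8] + str(number % 8)


def describe_word(word):
    """Describe the two instruction words used in this walkthrough.

    Illustration only: any other word is reported as unhandled.
    """
    op = (word >> 30) & 0x3
    if op == 0 and ((word >> 22) & 0x7) == 0:
        # op = 0 with op2 = 0 is the illegal trap instruction.
        return "illtrap 0x%x" % (word & 0x3FFFFF)
    op3 = (word >> 19) & 0x3F
    uses_immediate = (word >> 13) & 0x1
    if op == 2 and op3 == 0x02 and not uses_immediate:
        rd = register_name((word >> 25) & 0x1F)
        rs1 = register_name((word >> 14) & 0x1F)
        rs2 = register_name(word & 0x1F)
        return "or %s, %s, %s" % (rs1, rs2, rd)
    return "unhandled word 0x%08x" % word


def instruction_index(offset):
    """Zero-based index of the 4-byte instruction at a byte offset."""
    return offset // 4


if __name__ == "__main__":
    print(describe_word(0x90100014))   # the word that was overwritten
    print(describe_word(0x0))          # the word that replaced it
    print(instruction_index(0x14))     # position within ksyms_open
```

An offset of 0x14 gives index 5, which is the sixth instruction in the listing and the position of the illtrap entry in the second disassembly.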
After the kernel panics the system, the system reboots. Next, you use the
mdb utility to examine the offending address and thread that caused the
system to panic. To examine the cause of the panic, complete the
following steps:
1. Invoke the mdb utility on the unix.0 and vmcore.0 files:
# mdb -k unix.0 vmcore.0

2. Use the mdb utility to dump the values of the registers at the time of
the crash:
> $r

Note: The mdb utility automatically pages the output to prevent scrolling.

3. Press the space bar once, and look for the following register:
%pc = 0x00000000780ff8c0 ksyms_open+0x14
The %pc register contains the address of the instruction that the
processor was executing when the exception or error condition
occurred.

Note: The mdb utility formats the output to display the hexadecimal
address (0x00000000780ff8c0) of the instruction that the processor was
executing. The address is followed by the symbolic name (ksyms_open)
associated with the routine and the hexadecimal offset (+0x14) from the
beginning of that routine.



4. Disassemble the instruction that caused the fault:
> 0x00000000780ff8c0/ai
ksyms_open+0x14:
ksyms_open+0x14:        illtrap   0x14

The executed instruction was an illegal trap, and it resided at an offset
of 20 (decimal) bytes from the memory location specified by the symbol
ksyms_open.

5. Display the pointer to the address of the thread that was executing
when the system panicked:
> panic_thread/K
panic_thread:
panic_thread:   3000178fa80

6. Run the thread macro along with the pointer to the data structure
address of the thread that was running when the system panicked.
Search for the procp pointer, which is the address of the proc
structure of the process that contains the thread.
> 3000178fa80$<thread
...........<output truncated>
0x3000178fb30:  lpl             intr            did
                142d3b8         0               42363
0x3000178fb50:  tnf_tpdp        tid             waitfor
                30000922490     1               -1
0x3000178fb60:  sigqueue        sig             hold
                0               0               0
0x3000178fb78:  forw            back            thlink
                3000178fa80     3000178fa80     0
0x3000178fb90:  lwp             procp           audit_data
                300017a0e10     300017c0060     0
0x3000178fba8:  next            prev            trace
                3000178ed60     30000d542e0     0
0x3000178fbc0:  whystop         whatstop        dslot
                0               0               0
0x3000178fbc8:  pollstate       pollcache       cred
                0               0               300001cbce8
0x3000178fbe0:  start           lbolt           stoptime
                3cb4b228        7bfcb91         0
0x3000178fbf8:  pctcpu          sysnum          delay_cv
                100000          5               0
0x3000178fc00:  delay_lock
0x3000178fc00:  owner
                0
............<output truncated>



7. Run the proc2u macro to display the user data structure of the process
that was executing when the system panicked. In the following
output, the psargs field contains the name of the process and any
associated arguments:
> 300017c0060$<proc2u
                auxv
                300017c0398
0x300017c04c8:  start.tv_sec    start.tv_nsec
                3cb4b228        343daa11
0x300017c0390:  execsw          ticks
                140e620         7bfcb89
0x300017c04f1:  psargs  /usr/ccs/bin/sparcv9/nm /dev/ksyms\0\0\0\0\0\0\0\0\0\0\0
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
0x300017c04e0:  comm    nm\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
0x300017c0544:  argc            argv                    envp
                2               ffffffff7ffffca8        ffffffff7ffffcc0
0x300017c0558:  cdir            rdir            mem
                300003ebdb8     0               28
0x300017c0570:  cmask           acflag          systrap
                022             0               0
                entrymask
                300017c0578
                exitmask
                300017c059c
0x300017c05c0:  signodefer      sigonstack      sigresethand
                0               0               0
0x300017c05d8:  sigrestart
                0
........<output truncated>

8. Type the following command to quit the mdb utility:


> ::quit
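
The %pc line from the $r output carries three pieces of information: the address, the symbol, and the offset. The following Python sketch separates them and derives the start address of the routine. The parse_pc_line name and the regular expression are assumptions that match only the line format shown in this module.

```python
import re

# Matches a line such as: %pc = 0x00000000780ff8c0 ksyms_open+0x14
PC_LINE = re.compile(r"%pc\s*=\s*(0x[0-9a-fA-F]+)\s+(\w+)\+(0x[0-9a-fA-F]+)")


def parse_pc_line(line):
    """Extract the address, symbol, and offset from a %pc line.

    Hypothetical helper for illustration. It accepts only the
    'address symbol+offset' form shown in this module.
    """
    match = PC_LINE.search(line)
    if match is None:
        raise ValueError("not a recognized %pc line: " + line)
    address = int(match.group(1), 16)
    offset = int(match.group(3), 16)
    return {
        "address": address,
        "symbol": match.group(2),
        "offset": offset,
        # The routine starts 'offset' bytes before the failing address.
        "symbol_start": address - offset,
    }


if __name__ == "__main__":
    info = parse_pc_line("%pc = 0x00000000780ff8c0 ksyms_open+0x14")
    print(info["symbol"], "starts at", hex(info["symbol_start"]))
    print("offset in decimal:", info["offset"])
```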



You can use the mdb command to extract the following information from
the system crash dump files:

1. Identify the address of the instruction that caused the panic.
%pc = 0x00000000780ff8c0 ksyms_open+0x14

2. Identify the address of the thread that was running during the panic.
panic_thread:   3000178fa80

3. Identify the name of the process that was running during the panic.
0x300017c04e0: comm nm

4. Identify the arguments passed to the program.
0x300017c04f1: psargs /usr/ccs/bin/sparcv9/nm /dev/ksyms
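
The psargs field is stored with trailing NUL padding, as the proc2u output shows. The following Python sketch strips the padding and separates the command name from its arguments. The split_psargs name is hypothetical. Because the field joins the arguments with spaces, an argument that itself contains a space cannot be recovered exactly by this approach.

```python
def split_psargs(raw):
    """Split a NUL-padded psargs field into (command, arguments).

    Hypothetical helper for illustration. 'raw' is the field content
    as bytes, including its trailing NUL padding.
    """
    # Keep only the text before the first NUL byte.
    text = raw.split(b"\0", 1)[0].decode("ascii")
    words = text.split()
    if not words:
        raise ValueError("psargs field is empty")
    return words[0], words[1:]


if __name__ == "__main__":
    # The field content from the proc2u output, followed by NUL padding.
    field = b"/usr/ccs/bin/sparcv9/nm /dev/ksyms" + b"\0" * 46
    command, arguments = split_psargs(field)
    print("command:  ", command)
    print("arguments:", arguments)
```

The command plus one argument gives two words, which agrees with the argc value of 2 in the same output.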


Exercise: Analyzing Core Dumps Using the mdb Utility


In this exercise, you explore some of the features of the mdb utility and
also analyze a system dump. This familiarizes you with the kernel
structures that you must examine when analyzing a system crash or a
hung system.
Explain to students that the set of questions in the exercise facilitates the revision of the content in the
module. Instruct students to perform tasks and attempt questions in the sequence in which they appear.
Inform students that they can refer to the lecture notes to attempt the exercise. Use the system crash dump
generated in the course of the module to examine the offending address and thread that caused the system
to panic.

Preparation
Consult the instructor to access the files required for the lab. While
performing the lab exercise, refer to the examples in this module, the
online header files, and the online man pages.

Tasks
Answer the following questions:
1. List the features of the mdb utility.

2. List the tasks for which the mdb utility enables you to formulate
complex queries.

3. List the usability features of the mdb utility.

4. List the limitations of the mdb utility.

5. List the general-purpose registers available for analyzing a core
dump.

Complete the following tasks to analyze the system crash dump:

1. Launch the mdb utility to examine the core dump.

2. Run the $c command to display the stack trace, which enables you to
determine the routines that led to the panic and also displays the
source of the panic.

3. Run the $r command to display the registers at the time of the panic.

4. From the displayed registers, use the %pc (the program counter)
value to display the instruction that caused the system to fail.



5. Issue the ::status dcmd to display a part of the message
that was displayed during the panic.

6. Use the ::ps -lt dcmd to examine the processes.

7. Determine the running thread that caused the panic.

8. Use the address from the output of the ::ps -lt dcmd to
display the thread structure.

9. Use the address under the procp field with the proc2u macro to
view the command name and arguments that caused the panic.

10. Cross-check the preceding information by displaying the message
buffer from the time of the panic.


Exercise Summary

Discussion: Take a few minutes to discuss what experiences, issues, or
discoveries you had during the lab exercise.

Manage the discussion based on the time allowed for this module, which was provided in the About This
Course module. If you do not have time to spend on discussion, highlight just the key concepts students
should have learned from the lab exercise.

Experiences

Ask students what their overall experiences with this exercise have been. Go over any trouble spots or
especially confusing areas at this time.

Interpretations

Ask students to interpret what they observed during any aspect of this exercise.

Conclusions

Have students articulate any conclusions they reached as a result of this exercise experience.

Applications

Explore with students how they might apply what they learned in this exercise to situations at their workplace.


Exercise Solutions
The following are the solutions for the questions in the exercise:
1. List the features of the mdb utility.
The following are the features of the mdb utility:

Enables the postmortem analysis of the Solaris OE kernel crash dumps
and the user process core dumps

Implements debugger commands and analysis tools by using a
programming API

Provides compatibility with other debugging utilities

Offers other usability features

2. List the tasks for which the mdb utility enables you to formulate
complex queries.
The mdb utility enables you to formulate complex queries for the following
tasks:

Locating all the memory allocated by a particular thread

Printing a visual representation of a kernel STREAM

Determining the type of structure referred to by a particular address

Locating leaked memory blocks in the kernel

Analyzing memory to locate stack traces

3. List the usability features of the mdb utility.
The mdb utility provides the following usability features:

Command-line editing

Command history

Built-in output pager

Syntax error-checking and handling

Online help

Interactive session logging

Command pipelining

4. List the limitations of the mdb utility.
The mdb utility has the following limitations:

Does not provide support for examining the process core dumps that
are generated in the Solaris 2.4 OE version to the Solaris 2.6 OE
version

Uses the libkvm library routine to examine the crash dumps
generated in the Solaris 2.4 OE version to the Solaris 7 OE version

Uses the 64-bit debugger that is running in a 64-bit Solaris OE to
debug 64-bit target programs

5. List the general-purpose registers available for analyzing a core
dump.
The following are the general-purpose registers that are available for
analyzing a core dump:

Eight general-purpose registers %g0 through %g7

Eight input registers %i0 through %i7

Eight output registers %o0 through %o7

Eight local registers %l0 through %l7

Complete the following tasks to analyze the system crash dump:


1. Launch the mdb utility to examine the core dump:
# mdb unix.n vmcore.n
where n is a value, such as 0, 1, 2, or 3.

2. Run the $c command to display the stack trace, which enables you to
determine the routines that led to the panic and also displays the
source of the panic.
> $c

3. Run the $r command to display the registers at the time of the panic.
> $r

4. From the displayed registers, use the %pc (the program counter)
value to display the instruction that caused the system to fail.

5. Issue the ::status dcmd to display a part of the message
that was displayed during the panic.
> ::status

6. Use the ::ps -lt dcmd to examine the processes.
> ::ps -lt

7. Determine the running thread that caused the panic.
The answer differs from system to system, because it depends on the
system crash dump.

8. Use the address from the output of the ::ps -lt dcmd to
display the thread structure.
The answer differs from system to system, because it depends on the
system crash dump.

9. Use the address under the procp field with the proc2u macro to
view the command name and arguments that caused the panic.
The answer differs from system to system, because it depends on the
system crash dump.

10. Cross-check the preceding information by displaying the message
buffer from the time of the panic.
> $<msgbuf


Appendix A

Sample Outputs
This appendix provides sample outputs for the following:

The eeprom command on a Sun4U Enterprise server

The PatchDiag tool


Output of the eeprom Command on a Sun4U Enterprise Server
The following is the output of the eeprom command on a Sun4U
Enterprise server:
# eeprom
scsi-initiator-id=7
keyboard-click?=false
keymap: data not available.
ttyb-rts-dtr-off=false
ttyb-ignore-cd=false
ttya-rts-dtr-off=false
ttya-ignore-cd=false
ttyb-mode=9600,8,n,1,-
ttya-mode=9600,8,n,1,-
pcia-probe-list=1,2
pcib-probe-list=1,3,2,4,5
enclosure-type=540-4284
banner-name=Sun Enterprise 220R
energystar-enabled?=false
mfg-mode=off
diag-level=min
#power-cycles=41
system-board-serial#=5015606071913
system-board-date=39ce8ff3
fcode-debug?=false
output-device=screen
input-device=keyboard
load-base=16384
boot-command=boot
auto-boot?=true
watchdog-reboot?=false
diag-file: data not available.
diag-device=net
boot-file: data not available.
boot-device=disk:a disk net
local-mac-address?=false
ansi-terminal?=true
screen-#columns=80
screen-#rows=34
silent-mode?=false
use-nvramrc?=false
nvramrc: data not available.
security-mode=none



security-password: data not available.
security-#badlogins=0
oem-logo: data not available.
oem-logo?=false
oem-banner: data not available.
oem-banner?=false
hardware-revision: data not available.
last-hardware-update: data not available.
diag-switch?=false
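
The eeprom output uses two line forms: name=value for a parameter that is set, and name: data not available. for a parameter without data. The following Python sketch reads such output into a dictionary. The parse_eeprom name is hypothetical, and the sketch assumes only the two line forms that appear in this sample.

```python
NO_DATA = ": data not available."


def parse_eeprom(text):
    """Parse eeprom output into a dictionary of parameter values.

    Hypothetical helper for illustration. Parameters that report
    'data not available.' are stored with the value None.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(NO_DATA):
            settings[line[:-len(NO_DATA)]] = None
        elif "=" in line:
            # Split at the first "=" only, so values keep any later "=".
            name, _, value = line.partition("=")
            settings[name] = value
    return settings


if __name__ == "__main__":
    sample = "\n".join([
        "scsi-initiator-id=7",
        "keymap: data not available.",
        "boot-device=disk:a disk net",
        "auto-boot?=true",
    ])
    for name, value in parse_eeprom(sample).items():
        print(name, "->", value)
```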


Sample Report of The PatchDiag Tool


The following is a sample report of the PatchDiag tool:
foo% patchdiag
============================================================
System Name: foo SunOS Vers: 5.7 Arch: sparc
Cross Reference File Date: 30/Nov/99
PatchDiag Version: 1.0.4
============================================================
Report Note:
Recommended patches are considered the most important and highly
recommended patches that avoid the most critical system, user, or
security related bugs which have been reported and fixed to date.
A patch not listed on the recommended list does not imply that it
should not be used if needed. Some patches listed in this report
may have certain platform specific or application specific dependencies
and thus may not be applicable to your system. It is important to
carefully review the README file of each patch to fully determine
the applicability of any patch with your system.
============================================================
INSTALLED PATCHES
Patch  Installed Latest   Synopsis
  ID   Revision  Revision
------ --------- -------- ----------------------------------
105346    09        10    Solstice Internet Mail Server 2.0:
                          Misc. fixes
106541    05        08    SunOS 5.7: Kernel update patch
106725    01     CURRENT  OpenWindows 3.6.1: mailtool
                          vacation security patch
106793    01        03    SunOS 5.7: ufsdump and ufsrestore
                          patch
106934    03     CURRENT  CDE 1.3: libDtSvc Patch
106952    01     CURRENT  SunOS 5.7: /usr/bin/uux patch
106960    01     CURRENT  SunOS 5.7: Manual Pages for
                          patchadd.1m and patchrm.1m



107001  01         CURRENT   OBSOLETED by 107887
107022  02         05        CDE 1.3: Calendar Manager patch
107038  01         CURRENT   SunOS 5.7: apropos/catman/man/whatis patch
107171  02         04        SunOS 5.7: Fixes for patchadd and patchrm
107200  03         09        CDE 1.3: dtmail patch
============================================================
UNINSTALLED RECOMMENDED PATCHES
Patch   Ins  Lat  Age  Require    Incomp     Synopsis
ID      Rev  Rev       ID         ID
------  ---  ---  ---  ---------  ---------  -------------------------
107359  N/A  02   33                         SunOS 5.7: Patch for SPARCompiler Binary Compatibility Libraries
107544  N/A  03   41                         SunOS 5.7: /usr/lib/fs/ufs/fsck patch
107587  N/A  01   211                        SunOS 5.7: /usr/lib/acct/lastlogin patch
108343  N/A  02   1    108374-01             CDE 1.3: sdtperfmeter patch
============================================================
UNINSTALLED SECURITY PATCHES
NOTE: This list includes the Security patches that are also Recommended
Patch   Ins  Lat  Age  Require    Incomp     Synopsis
ID      Rev  Rev       ID         ID
------  ---  ---  ---  ---------  ---------  -------------------------
106944  N/A  02   210                        SunOS 5.7: /kernel/fs/fifofs and /kernel/fs/sparcv9/fifofs patch
106978  N/A  09   15   107456-01             SunOS 5.7: sysid patch
107115  N/A  02   177                        SunOS 5.7: LP Patch
107259  N/A  01   22                         SunOS 5.7: /usr/sbin/vold patch
107451  N/A  02   50   107117-03             SunOS 5.7: /usr/sbin/cron patch
107454  N/A  03   64                         SunOS 5.7: /usr/bin/ftp patch
107456  N/A  01   160                        SunOS 5.7: /etc/nsswitch.dns patch



107684  N/A  01   209                        SunOS 5.7: Sendmail patch
107792  N/A  01   75                         SunOS 5.7: /usr/bin/pax patch
107972  N/A  01   99                         SunOS 5.7: /usr/sbin/static/rcp patch
108301  N/A  01   47                         SunOS 5.7: /usr/sbin/in.tftpd patch
107219  N/A  02   163  106934-02             CDE 1.3: dtprintinfo patch
107887  N/A  08   1                          CDE 1.3: Actions Patch
108219  N/A  01   79                         CDE 1.3: dtaction Patch
108221  N/A  01   79                         CDE 1.3: dtspcd Patch
107337  N/A  01   257                        OpenWindows 3.6.1: KCMS configure tool has a security vulnerabilit
107893  N/A  02   65                         OpenWindows 3.6.1: Tooltalk patch
============================================================
UNINSTALLED Y2K PATCHES
NOTE: This list includes the Y2K patches that are also Recommended
Patch   Ins  Lat  Age  Require    Incomp     Synopsis
ID      Rev  Rev       ID         ID
------  ---  ---  ---  ---------  ---------  -------------------------
107359  N/A  02   33                         SunOS 5.7: Patch for SPARCompiler Binary Compatibility Libraries
107587  N/A  01   211                        SunOS 5.7: /usr/lib/acct/lastlogin patch
108343  N/A  02   1    108374-01             CDE 1.3: sdtperfmeter patch
============================================================
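
When you review a PatchDiag report during fault analysis, the installed patches whose Latest Revision column is not CURRENT are usually the first ones to examine. The following Python sketch is illustrative only and is not part of the PatchDiag tool. It assumes that each data row of the INSTALLED PATCHES section has been captured on one line with whitespace-separated columns; the names InstalledPatch, parse_installed, and out_of_date are hypothetical.

```python
from typing import List, NamedTuple


class InstalledPatch(NamedTuple):
    patch_id: str
    installed: str   # installed revision, for example "09"
    latest: str      # latest revision, or the word CURRENT
    synopsis: str


def parse_installed(rows: str) -> List[InstalledPatch]:
    """Parse INSTALLED PATCHES rows (assumed layout: one patch per line)."""
    patches = []
    for line in rows.splitlines():
        fields = line.split(None, 3)
        # A data row starts with a six-digit patch ID and has four columns.
        if len(fields) == 4 and fields[0].isdigit() and len(fields[0]) == 6:
            patches.append(InstalledPatch(*fields))
    return patches


def out_of_date(patches: List[InstalledPatch]) -> List[InstalledPatch]:
    """Return the patches for which a later revision is available."""
    return [p for p in patches if p.latest != "CURRENT"]


# Three rows taken from the sample report in this appendix.
SAMPLE = """\
105346 09 10 Solstice Internet Mail Server 2.0: Misc. fixes
106541 05 08 SunOS 5.7: Kernel update patch
106725 01 CURRENT OpenWindows 3.6.1: mailtool vacation security patch
"""

if __name__ == "__main__":
    for patch in out_of_date(parse_installed(SAMPLE)):
        print(patch.patch_id, patch.installed, "->", patch.latest)
```

Header lines, separator lines, and the report note are skipped because they do not start with a six-digit patch ID.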


Appendix B

Additional Information
This appendix provides additional information for the following:

The probe commands

The test commands

The watch commands

The architecture of the Ultra 5 and Ultra 10 workstations

The show-post-results command

The process of obtaining a SunSolve account


The probe Commands


To probe peripheral devices, such as disks, tape drives, and CD-ROMs
that are connected to your system, use the following probe commands:

probe-ide Probes the internal and external IDE devices connected
to the on-board IDE interface of the system.
The following is the output of the probe-ide command:
ok probe-ide
Device 0 (Primary master)
ATA Model : ST 39111A
Device 1 (Primary slave)
Not Present
Device 2 (Secondary master)
Removable ATAPI Model: CRD-8322B
Device 3 (Secondary slave)
Not Present
ok
The preceding output lists each IDE device position (primary or
secondary, master or slave). For each device that is present, the
output displays the device type and model. An empty position is
reported as Not Present.

probe-scsi Probes the SCSI devices, such as disks, tape drives,
and CD-ROMs that are attached to the on-board SCSI controller and
are active.
The probe-scsi command identifies the peripheral devices by their
target addresses. The following is the output of the probe-scsi
command:
ok probe-scsi
This command may hang the system if a Stop-A or halt
command has been executed. Please type reset-all to
reset the system before executing this command.
Do you wish to continue? (y/n) n
ok reset-all
ok probe-scsi
This command may hang the system if a Stop-A or halt
command has been executed. Please type reset-all to
reset the system before executing this command.
Do you wish to continue? (y/n) y
Primary UltraSCSI bus:
Target 0
Unit 0 Disk
SEAGATE ST34371W SUN4.2G8254
Target 1
Unit 0 Disk
SEAGATE ST34371W SUN4.2G8254
...
ok

The preceding output displays the target address, unit number,
device type, and manufacturer name of each device.

Note The reset-all command resets the SCSI bus and memory to
ensure an effective probe of the devices.

probe-scsi-all Identifies the devices attached to the on-board
SCSI controller as well as those devices that are attached to SBus
SCSI controllers.
The following is the output of the probe-scsi-all command:
ok probe-scsi-all
This command may hang the system if a Stop-A or halt
command has been executed. Please type reset-all to
reset the system before executing this command.
Do you wish to continue? (y/n) y
/pci@6,4000/scsi@4,1
Target 0
Unit 0 Disk
SEAGATE ST34371W SUN4.2G8254
Target 1
Unit 0 Disk
SEAGATE ST34371W SUN4.2G8254
/pci@6,4000/scsi@4
Target 4
Unit 0 Disk
CONNER CFP1080E SUN1.055150
....
ok
In the preceding output, the probe-scsi-all command identifies
SCSI devices by their target addresses.
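
If you capture probe output from the console, you can extract the device list to record it in a worksheet. The following Python sketch is illustrative only and is not a Sun utility. The function name parse_probe_scsi is hypothetical, and the sketch assumes the layout printed in this appendix: a Target line, a Unit line that carries the device type, and the model string on the same line or on the next line.

```python
import re
from typing import Dict, List, Optional


def parse_probe_scsi(text: str) -> List[Dict[str, str]]:
    """Return one dictionary per device found in captured probe-scsi text."""
    devices: List[Dict[str, str]] = []
    target: Optional[str] = None
    current: Optional[Dict[str, str]] = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Target (\w+)$", line)
        if match:
            target = match.group(1)
            current = None
            continue
        match = re.match(r"Unit (\d+)\s+(\S+)\s*(.*)$", line)
        if match and target is not None:
            current = {
                "target": target,
                "unit": match.group(1),
                "type": match.group(2),
                "model": match.group(3),
            }
            devices.append(current)
            continue
        # Treat the next plain line as the model string when the Unit line
        # did not carry one. Prompts, ellipses, and device paths are skipped.
        if (current is not None and line and not current["model"]
                and not line.startswith(("ok", ".", "/"))):
            current["model"] = line
    return devices


# Sample text copied from the probe-scsi-all output in this appendix.
SAMPLE = """\
/pci@6,4000/scsi@4,1
Target 0
Unit 0 Disk
SEAGATE ST34371W SUN4.2G8254
Target 1
Unit 0 Disk
SEAGATE ST34371W SUN4.2G8254
/pci@6,4000/scsi@4
Target 4
Unit 0 Disk
CONNER CFP1080E SUN1.055150
ok
"""

if __name__ == "__main__":
    for device in parse_probe_scsi(SAMPLE):
        print(device["target"], device["unit"], device["type"], device["model"])
```

The sketch does not record which controller path a device belongs to; adding that would require tracking the most recent line that starts with a slash.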


The test Commands


To test the hardware devices attached to the system, use the test
commands. Before you test a removable media drive, such as a diskette
drive or a CD-ROM drive, ensure that media is inserted in the drive.

test-all Executes the self-test provided with each device that is
attached to the system.
The following is a sample output of the test-all command:
ok test-all
Testing /pci@1f,0/pci@1,1/SUNW,M64B@2
Test hardware registers - passed ok
...
...
Testing /pci@1f,0/pci@1,1/ebus@1/fdthread@14,3023f0
Testing floppy disk system. A formatted disk should be
in the drive.
Testing floppy drive
Test succeeded.

test floppy Executes the self-test to diagnose the diskette drive
attached to the system.
The following is the output of the test floppy command:
ok test floppy
Testing floppy disk system. A formatted disk should be
in the drive.
Testing floppy drive
Test succeeded.
If a formatted diskette is not inserted in the diskette drive, the
following message is displayed when you run the test floppy
command:
No diskette, or incorrect format.
self test fail. Return code=-1
ok

test-memory Executes the tests to diagnose the main memory.
The following is the output of the test-memory command:
ok test memory
Testing 256 megs of memory at addr 4000000 11
ok


test net Tests the on-board Ethernet controller.
The following is the output of the test net command:
ok test net
Internal loopback test -- succeeded
Transceiver check -- Passed


The watch Commands


To monitor network traffic and the clock function of the system,
use the following watch commands:

watch-net Monitors Ethernet packets on the Ethernet interfaces
that are connected to the system.
The following is the output of the watch-net command:
ok watch-net
Internal loopback test -- succeeded
Transceiver check -- Passed
Looking for Ethernet Packets
. is a good packet. x is a bad packet.
Type any key to stop.
.......................................................
The output indicates good packets by a period (.). A bad packet is
indicated by an x and the associated error, such as a cyclic
redundancy check error.
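
To compare network health between a faulty system and a functional one, you can count the packet markers from a captured watch-net session. The following Python sketch is illustrative only. The function name summarize_watch_net is hypothetical, and the sketch assumes that you pass only the marker line itself (periods and x characters), without the error text that accompanies a bad packet.

```python
from typing import Dict, Union


def summarize_watch_net(markers: str) -> Dict[str, Union[int, float]]:
    """Count good (.) and bad (x) packet markers in a watch-net marker line."""
    good = markers.count(".")
    bad = markers.count("x")
    total = good + bad
    # Avoid dividing by zero when no packets were seen.
    ratio = bad / total if total else 0.0
    return {"good": good, "bad": bad, "bad_ratio": ratio}


if __name__ == "__main__":
    print(summarize_watch_net("....x..."))
```

A bad-packet ratio that is clearly higher on one system than on a comparable system on the same network segment is a comparative fact worth recording.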

watch-net-all Monitors Ethernet packets on the Ethernet
interfaces that are connected to the system, including those in SBus slots.
The following is the output of the watch-net-all command:
ok watch-net-all
/pci@1f,0/pci@1,1/network@1,1
Internal loopback test -- succeeded
Transceiver check -- Passed
Looking for Ethernet Packets
. is a good packet. x is a bad packet.
Type any key to stop.
.......................................................

watch-clock Enables you to test the clock function of the system.
The following is the output of the watch-clock command:
ok watch-clock
Watching the 'seconds' register of the real time clock chip.
It should be 'ticking' once a second.
Type any key to stop.
50
ok
The watch-clock command reads the NVRAM chip and displays
the result as a seconds counter.


Architecture of the Ultra 5 and Ultra 10 Workstations


The sample POST outputs included in this module were captured on Ultra
10 workstations.
Figure B-1 illustrates the architecture of Ultra 5 and Ultra 10 workstations.

Figure B-1 Ultra 5 and Ultra 10 Architecture


The show-post-results Command


Table B-1 lists the fields and descriptions of the output of the
show-post-results command.

Table B-1 Fields and Descriptions of the Output of the show-post-results Command

Field        Description
Cpu0/Cpu1    CPU modules on the system board
CPU{0,1}-OK  CPU module status
FailCode     Failure code (valid only if CPU failed)
FHC          Fire Hose Controller
SRAM         Static RAM
FPROM        Flash PROM
LabCon       Lab Console
Ovtemp       Overtemp
Bank0        Bank0 status (a bit indicates a missing or failed SIMM)
Bank1        Bank1 status (a bit indicates a missing or failed SIMM)
DTag0        DTags0 status
DTag1        DTags1 status
JTAG         JTAG status
CntrPl       Centerplane status
DC           Data Controllers (0 bit indicates a failed DC)
Sysio0       SysIO 0 status
Sysio1       SysIO 1 status
FEPS         On-board FEPS chip
FEPSFC       FEPS fail code (valid only if failed)
SOC          On-board SOC status
FFB          FFB card status
Sbus0        SBus0 slot status
Sbus1        SBus1 slot status
Sbus2        SBus2 slot status
AC           Address Controller
TODC         Time of Day Clock
Disk0        Disk0 ID (valid only if disk present)
Disk1        Disk1 ID (valid only if disk present)
Disk0P       Disk0 Present
Disk1P       Disk1 Present
VDDOK        SCSI VDD status
Fan          Fan Fail status
Clock        Clock running
Serial       Serial Port
KBytes       Keyboard Mouse status
PPS-DC       Peripheral PS ok (all DC levels OK)
AC           AC power status
ACFan        AC box fan status
KeyFan       KeySwitch fan status
PSFail       Power Supply fail status (bit position indicates which ps failure)
Ovtemp       Overtemp
V5-P         Peripheral 5V
V12-P        Peripheral 12V
V5-Aux       Auxiliary 5V
V5P-PC       Peripheral 5V Precharge
V12-PC       Peripheral 12V Precharge
V3-PC        System 3.3V Precharge
V5-PC        System 5.0V Precharge
RKFan        Rack Fan Status
3.3V         Clock board 3.3 V
5.0V         Clock board 5.0 V


Obtaining a SunSolve Account


You must have a valid SunSpectrum contract ID before you register for a
SunSolve Online account.
To obtain a SunSolve account, perform the following steps:
1. Go to the SunSolve home page at the appropriate URL.
2. Click Register.
Figure B-2 displays the SunSolve Online home page.

Figure B-2 SunSolve Home Page

If the students have access to the SunSolve Online Web site, instruct them to follow steps 1-3 and open the
registration form used to create a SunSolve Online account. Ask students who do not have Internet access to
refer to the figures provided with each step.



3. Click the Create hyperlink to register for a SunSolve Online
account. Figure B-3 displays the SunSolve Online registration screen.

Figure B-3 SunSolve Online Registration Screen

Note If you are a registered user of the SunSolve Online service, you can
click the Edit hyperlink to modify your current user profile.

4. Complete the registration form and click Submit Account Info.
Figure B-4 and Figure B-5 display the registration form you use to
create an account on the SunSolve Online Web site.

Inform students that the registration form is displayed in two parts. Figure B-4 displays the first
part of the form, and Figure B-5 displays the remaining part of the form.

Figure B-4 SunSolve Registration Form

Figure B-5 SunSolve Registration Form

Your contract information is checked for authenticity, and you are notified
through an email message when your SunSolve account is activated.


Appendix C

Workshop Exercises
Introduction
The Analysis and Diagnosis Worksheet templates provided in this
appendix are similar to those presented in Module 1, Introducing the
Fault Analysis and Diagnosis Methodology. In workshop groups, you
apply the methodology described in Module 1 and record key observations
about the analysis and diagnosis for each problem.
You are not required to complete any particular number of workshops.
However, it is important to apply a logical fault analysis and diagnosis
methodology to the workshops that you complete.
A worksheet template is provided with each fault in the appendix. You do
not have to complete each field in the worksheet. The amount of
information that you record might vary for each problem.
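
If you prefer to keep your notes electronically, the worksheet can be modeled as a simple record. The following Python sketch is illustrative only. The field names mirror the headings of the worksheet tables in this appendix, but the class name FaultWorksheet and the method blank_fields are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class FaultWorksheet:
    title: str
    # Analysis phase
    initial_customer_description: str = ""
    problem_statement: str = ""
    resources: str = ""
    # Problem Description table
    error_messages: str = ""
    symptoms_and_conditions: str = ""
    relevant_changes: str = ""
    comparative_facts: str = ""
    # Test and Verification table
    likely_causes: str = ""
    tests: str = ""
    results: str = ""
    verification: str = ""
    # Corrective Action table
    final_repair: str = ""
    communication: str = ""
    documentation: str = ""

    def blank_fields(self) -> List[str]:
        """List the fields that have not been filled in yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    sheet = FaultWorksheet(
        title="Fault #5 Login Problem",
        error_messages="Invalid user shell, login rejected",
    )
    print(sheet.blank_fields())
```

Because you are not required to complete every field, blank_fields is a reminder of what remains open, not a validation check.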

Preparatory Tasks
If a non-root account does not exist on your system, create one during
your first workshop session. You require this non-root account for
troubleshooting faults in some workshops and for comparing faults in
other workshops.
Try to use all the troubleshooting tools in your fault analysis and
diagnosis workshops, and explore the use of new utilities.

This appendix contains the following fault worksheets:

Fault Worksheet #1 Blank Monitor

Fault Worksheet #2 Unknown Device

Fault Worksheet #3 The ps Command Does Not Work

Fault Worksheet #4 Repetitive Boot Sequences

Fault Worksheet #5 Login Problem

Fault Worksheet #6 Problem With the root Login

Fault Worksheet #7 Problem in the Network

Fault Worksheet #8 Hung System

Fault Worksheet #9 Problem with the CDE

Fault Worksheet #10 Problem With the ftp Service

Fault Worksheet #11 Problem With the Non-root User Accounts

Fault Worksheet #12 Problem in the Network

Fault Worksheet #13 Problem with the CDE

Fault Worksheet #14 Problem with the CDE Login Screen

Fault Worksheet #15 Problem With the root Account

Fault Worksheet #16 Problem in the Network

Fault Worksheet #17 Problem With the Network Printer

Fault Worksheet #18 Problem in the Network

Fault Worksheet #19 Read-only File System

Fault Worksheet #20 Problem with the CDE

Fault Worksheet #21 Corrupt Network File

Fault Worksheet #22 Problem in the Network

Fault Worksheet #23 Problem With Admintool

Fault Worksheet #24 Boot Failure

Fault Worksheet #25 Hung System

Fault Worksheet #26 Problem in the Network

Fault Worksheet #27 Script Hangs the System

Fault Worksheet #28 Inappropriate Halts

Fault Worksheet #29 SunSolve Workshop

Fault Worksheet #30 Corrupt File System

Fault Worksheet #31 Insufficient File Permission


Fault Worksheet #32 Problem in the Network

Fault Worksheet #33 Login Problem

Fault Worksheet #34 Analyze System Crash Dumps

Fault Worksheet #35 Problem in the Network

Fault Worksheet #36 Faulty CD-ROM

Fault Worksheet #37 Turn the Page

Fault Worksheet #38 Login Problem

Fault Worksheet #39 Do not Point at Me

Fault Worksheet #40 Problem in the Network

Fault Worksheet #41 No Space on the File System

Fault Worksheet #42 Cannot Mount a File System

Fault Worksheet #43 Problem in the Network

Fault Worksheet #44 User Login Problem

Fault Worksheet #45 Problem in the Network

Fault Worksheet #46 System Displays a Panic Message

Fault Worksheet #47 Corrupt File System

Fault Worksheet #48 Remote Login Failure

Fault Worksheet #49 Corrupt File System

Fault Worksheet #50 Student Designed Workshop


Fault #1 Blank Monitor


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The administrator upgraded the PROM on the system and customized
some of the settings. However, when the system reboots, the monitor is
blank.

Problem Statement

Resources

Problem Description
Use Table C-1 to document the problem description.
Table C-1 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-2 to document the results of testing and verification.
Table C-2 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-3 to document the corrective action.
Table C-3 Corrective Action

Final Repair    Communication    Documentation

Fault #2 Unknown Device


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The boot sequence appears to start correctly and then reports an unknown
device. When the user runs the boot -a command to start the system
with all the default parameters, the system boots successfully.
The boot sequence is incomplete due to apparent file system corruption
after the last system crash.
The system goes into a loop.

Problem Statement

Resources

Problem Description
Use Table C-4 to document the problem description.
Table C-4 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-5 to document the results of testing and verification.
Table C-5 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-6 to document the corrective action.
Table C-6 Corrective Action

Final Repair    Communication    Documentation

Fault #3 The ps Command Does Not Work


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The administrator reconfigured the disk drives. Now, when the user
attempts to run the ps command, it does not work correctly.
The system displays the following messages and prompts you to enter the
root password for system maintenance:
.........<Output truncated>
failed to open /etc/coreadm.conf:Read only file system
INIT: Cannot create /var/adm/utmpx
....<Output truncated>
When you run the ps command, the following message is displayed:
# ps
ps: getexecname() failed

Problem Statement

Resources


Problem Description
Use Table C-7 to document the problem description.
Table C-7 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-8 to document the results of testing and verification.
Table C-8 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-9 to document the corrective action.
Table C-9 Corrective Action

Final Repair    Communication    Documentation

Fault #4 Repetitive Boot Sequences


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system reboots continuously.

Problem Statement

Resources

Problem Description
Use Table C-10 to document the problem description.
Table C-10 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-11 to document the results of testing and verification.
Table C-11 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-12 to document the corrective action.
Table C-12 Corrective Action

Final Repair    Communication    Documentation

Fault #5 Login Problem


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system administrator created an account for a new user. However,
during login, the system displays an error message and immediately logs
out the user.
The following error message is displayed:
Invalid user shell, login rejected

Problem Statement

Resources

Problem Description
Use Table C-13 to document the problem description.
Table C-13 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-14 to document the results of testing and verification.
Table C-14 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-15 to document the corrective action.
Table C-15 Corrective Action

Final Repair    Communication    Documentation

Fault #6 Problem With the root Login


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The users cannot directly log in to the root account through a remote
login from the network, such as the Telnet service.
Every time a user attempts to log in as the root user, the following error
message is displayed on the console and the login fails:
Not on system console.
Connection to host lost.

Problem Statement

Resources

Problem Description
Use Table C-16 to document the problem description.
Table C-16 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-17 to document the results of testing and verification.
Table C-17 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-18 to document the corrective action.
Table C-18 Corrective Action

Final Repair    Communication    Documentation

Fault #7 Problem in the Network


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


Users cannot communicate with the local network from system B.
When you run the ping or rup command, the system does not respond.

Problem Statement

Resources

Problem Description
Use Table C-19 to document the problem description.
Table C-19 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-20 to document the results of testing and verification.
Table C-20 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-21 to document the corrective action.
Table C-21 Corrective Action

Final Repair    Communication    Documentation

Fault #8 Hung System


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system does not boot and drops to the ok prompt.
The following error message is displayed:
/etc/rcS.d/S30rootusr.sh:read vfstab: not found.
.....
WARNING: /proc could not be mounted
......
Program terminated

Problem Statement

Resources


Problem Description
Use Table C-22 to document the problem description.
Table C-22 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-23 to document the results of testing and verification.
Table C-23 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-24 to document the corrective action.
Table C-24 Corrective Action

Final Repair    Communication    Documentation

Fault #9 Problem With the CDE


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


When the user attempts to log in, the system returns to the login prompt.
The following error message is displayed:
Login fails.

Problem Statement

Resources

Problem Description
Use Table C-25 to document the problem description.
Table C-25 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-26 to document the results of testing and verification.
Table C-26 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-27 to document the corrective action.
Table C-27 Corrective Action

Final Repair    Communication    Documentation

Fault #10 Problem With the ftp Service


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The users cannot transfer files by using the File Transfer Protocol (FTP)
service.
The following error message is displayed:
ftp:connect: Connection refused

Problem Statement

Resources

Problem Description
Use Table C-28 to document the problem description.
Table C-28 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-29 to document the results of testing and verification.
Table C-29 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-30 to document the corrective action.
Table C-30 Corrective Action

Final Repair    Communication    Documentation

Fault #11 Problem With the Non-root User Accounts


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


Non-root users cannot log in to the system from the login screen: the login
fails and the login screen reappears. They can log in only by using a direct
command-line login into the shell environment.

Problem Statement

Resources

Problem Description
Use Table C-31 to document the problem description.
Table C-31 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-32 to document the results of testing and verification.
Table C-32 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-33 to document the corrective action.
Table C-33 Corrective Action

Final Repair | Communication | Documentation

Fault #12 Problem in the Network


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The network communication does not work. Users cannot communicate
over the network.
While attempting to run the rlogin service from a functional host to the
faulty host, the following error message is displayed:
Unable to connect to remote host

Problem Statement

Resources

Problem Description
Use Table C-34 to document the problem description.
Table C-34 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-35 to document the results of testing and verification.
Table C-35 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-36 to document the corrective action.
Table C-36 Corrective Action

Final Repair | Communication | Documentation

Fault #13 Problem With the CDE


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The CDE environment is unavailable. Only a direct login to the shell is
possible.

Problem Statement

Resources

Problem Description
Use Table C-37 to document the problem description.
Table C-37 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-38 to document the results of testing and verification.
Table C-38 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-39 to document the corrective action.
Table C-39 Corrective Action

Final Repair | Communication | Documentation

Fault #14 Problem With the CDE Login Screen


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system does not run the CDE default login screen.

Problem Statement

Resources

Problem Description
Use Table C-40 to document the problem description.
Table C-40 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-41 to document the results of testing and verification.
Table C-41 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-42 to document the corrective action.
Table C-42 Corrective Action

Final Repair | Communication | Documentation

Fault #15 Problem With the root Account


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system administrator tried to secure the system before going on
vacation. The system now displays an error during the boot sequence and
fails to boot.
The error message indicates that the root user cannot be identified. As a
result, the inetd daemon and other services fail to start.
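
When services report that the root user cannot be identified, one thing worth checking is whether the root entry in the local account files is still well formed. The sketch below is only an illustration of that idea, not the workshop solution. It checks that a passwd-format line has the seven colon-separated fields (name, password placeholder, UID, GID, comment, home directory, shell). The function name and the messages it returns are assumptions.

```python
def check_passwd_entry(line):
    """Return a list of problems found in one passwd-format line."""
    problems = []
    fields = line.rstrip("\n").split(":")
    if len(fields) != 7:
        # A missing or extra colon shifts every later field.
        problems.append("expected 7 fields, found %d" % len(fields))
        return problems
    name, _password, uid, gid, _comment, home, _shell = fields
    if not name:
        problems.append("empty user name")
    if not uid.isdigit():
        problems.append("UID is not a number")
    if not gid.isdigit():
        problems.append("GID is not a number")
    if not home.startswith("/"):
        problems.append("home directory is not an absolute path")
    return problems
```

An empty list means that the line passed these basic checks. It does not prove that the entry is correct, only that its layout is plausible.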

Problem Statement

Resources

Problem Description
Use Table C-43 to document the problem description.
Table C-43 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-44 to document the results of testing and verification.
Table C-44 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-45 to document the corrective action.
Table C-45 Corrective Action

Final Repair | Communication | Documentation

Fault #16 Problem in the Network


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


Network communication has not functioned since the system was
installed on the current network.

Problem Statement

Resources

Problem Description
Use Table C-46 to document the problem description.
Table C-46 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-47 to document the results of testing and verification.
Table C-47 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-48 to document the corrective action.
Table C-48 Corrective Action

Final Repair | Communication | Documentation

Fault #17 Problem With the Network Printer


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The network printer does not respond to print commands.

Problem Statement

Resources

Problem Description
Use Table C-49 to document the problem description.
Table C-49 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-50 to document the results of testing and verification.
Table C-50 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-51 to document the corrective action.
Table C-51 Corrective Action

Final Repair | Communication | Documentation

Fault #18 Problem in the Network


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The user cannot access system A from any other system.
The following error message is displayed:
Could not open a connection to host. Connect failed.

Problem Statement

Resources

Problem Description
Use Table C-52 to document the problem description.
Table C-52 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-53 to document the results of testing and verification.
Table C-53 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-54 to document the corrective action.
Table C-54 Corrective Action

Final Repair | Communication | Documentation

Fault #19 Problem With Read-only File System


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system cannot boot into multiuser mode.
The following error message is displayed:
Read-only file system. No utmpx.
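
The message suggests that the file system holding the utmpx file is mounted read-only. As a hedged sketch of one way to spot such a mount, the function below scans text laid out like the Solaris mount table (special device, mount point, file system type, options, mount time) and lists the mount points whose options include ro. The function name and the sample data are assumptions; the sample is invented and was not captured from a real system.

```python
def read_only_mounts(mnttab_text):
    """Return mount points whose option list contains "ro".

    Each line is assumed to hold whitespace-separated fields in the order
    special, mount point, file system type, options, mount time.
    """
    result = []
    for line in mnttab_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


# Hypothetical sample data, not output captured from a real system.
SAMPLE = """\
/dev/dsk/c0t0d0s0 / ufs ro,intr,largefiles 1017000000
/dev/dsk/c0t0d0s7 /export/home ufs rw,intr,largefiles 1017000000
"""

print(read_only_mounts(SAMPLE))
```

Splitting the option string on commas before testing for ro avoids false matches on longer option names that merely contain those two letters.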

Problem Statement

Resources

Problem Description
Use Table C-55 to document the problem description.
Table C-55 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-56 to document the results of testing and verification.
Table C-56 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-57 to document the corrective action.
Table C-57 Corrective Action

Final Repair | Communication | Documentation

Fault #20 Problem With the CDE


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The CDE environment is unavailable. Only a direct login to the shell is
possible.

Problem Statement

Resources

Problem Description
Use Table C-58 to document the problem description.
Table C-58 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-59 to document the results of testing and verification.
Table C-59 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-60 to document the corrective action.
Table C-60 Corrective Action

Final Repair | Communication | Documentation

Fault #21 Corrupt Network File


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system administrator has been working with several network services.
After the administrator reconfigured the network files, the system no
longer boots successfully.
The following error message is displayed:
Missing or bad password entry for the <root> user.
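
The message concerns a lookup in the password database, and the customer reports changes to network configuration files. One file that connects the two on the Solaris OE is /etc/nsswitch.conf, which names the sources consulted for each database. Whether that file is at fault here is for the worksheet to establish. The sketch below only illustrates how text in that format (database name, colon, list of sources) can be read and inspected. The function name and the sample text are assumptions.

```python
def parse_nsswitch(text):
    """Map each database name to its list of lookup sources."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        database, sources = line.split(":", 1)
        table[database.strip()] = sources.split()
    return table


# Hypothetical example, not taken from a real system.
SAMPLE = """\
# sample name service switch entries
passwd:     files
group:      files
hosts:      files dns
"""

table = parse_nsswitch(SAMPLE)
print(table.get("passwd"))
```

A missing passwd key, or a passwd entry with no usable source, would be one reason for a system to fail to resolve the root user. Action tokens in square brackets, if present, are returned as part of the source list and are not interpreted by this sketch.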

Problem Statement

Resources

Problem Description
Use Table C-61 to document the problem description.
Table C-61 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-62 to document the results of testing and verification.
Table C-62 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-63 to document the corrective action.
Table C-63 Corrective Action

Final Repair | Communication | Documentation

Fault #22 Problem in the Network


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


Users cannot communicate over the network after restarting the system.
When you try to connect to a network system, the following error
message is displayed:
ICMP Host unreachable from localhost.

Problem Statement

Resources

Problem Description
Use Table C-64 to document the problem description.
Table C-64 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-65 to document the results of testing and verification.
Table C-65 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-66 to document the corrective action.
Table C-66 Corrective Action

Final Repair | Communication | Documentation

Fault #23 Problem With Admintool


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


When the user opens Admintool and selects the Browse -> Software
option, an error message appears stating that an incompatible release of
the Solaris OE is being used.
The following error message is displayed:
You are possibly running admintool with incompatible version
of the Solaris OE. The software add and remove capability
will be disabled.

Problem Statement

Resources

Problem Description
Use Table C-67 to document the problem description.
Table C-67 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-68 to document the results of testing and verification.
Table C-68 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-69 to document the corrective action.
Table C-69 Corrective Action

Final Repair | Communication | Documentation

Fault #24 Boot Failure


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system does not boot. The problem might have occurred when the
system crashed during a power failure.
The following error message is displayed:
Boot load failed. The file just loaded does not appear to be
executable.

Problem Statement

Resources

Problem Description
Use Table C-70 to document the problem description.
Table C-70 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-71 to document the results of testing and verification.
Table C-71 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-72 to document the corrective action.
Table C-72 Corrective Action

Final Repair | Communication | Documentation

Fault #25 Hung System


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system does not boot and drops to the ok prompt.
The following error message is displayed:
(Can't load specfs) Program terminated.

Problem Statement

Resources

Problem Description
Use Table C-73 to document the problem description.
Table C-73 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-74 to document the results of testing and verification.
Table C-74 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-75 to document the corrective action.
Table C-75 Corrective Action

Final Repair | Communication | Documentation

Fault #26 Problem in the Network


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


Network communication was disrupted after the relocation of systems.
The following error message is displayed:
telnet: Unable to connect to remote host: Network is
unreachable.

Problem Statement

Resources

Problem Description
Use Table C-76 to document the problem description.
Table C-76 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-77 to document the results of testing and verification.
Table C-77 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-78 to document the corrective action.
Table C-78 Corrective Action

Final Repair | Communication | Documentation

Fault #27 Script Hangs the System


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The following problems are encountered:

Keyboard input is not accepted

The arrow on the screen does not move

LEDs (if available) are in motion

The rlogin command from other machines on the network fails or times out

The ping command works intermittently

No error messages are displayed

Problem Statement

Resources

Problem Description
Use Table C-79 to document the problem description.
Table C-79 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-80 to document the results of testing and verification.
Table C-80 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-81 to document the corrective action.
Table C-81 Corrective Action

Final Repair | Communication | Documentation

Fault #28 Inappropriate Halts


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system powers off, halts, or reboots at inappropriate times.
The following error message is displayed:
.....<output truncated>
syncing file system....done

Problem Statement

Resources

Problem Description
Use Table C-82 to document the problem description.
Table C-82 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-83 to document the results of testing and verification.
Table C-83 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-84 to document the corrective action.
Table C-84 Corrective Action

Final Repair | Communication | Documentation

Fault #29 SunSolve Workshop


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


You receive calls from two customers regarding problems for which
patches might be available. You first search the SunSolve Online service to
locate the patch identification numbers and related bug reports for the
calls:

Customer call #1
The customer complains that the Time-of-Day-Clock checksum value
is destroyed when the machine is power cycled. The message Fatal
Error Reset is displayed and, sometimes, so is the wrong year. The
customer has an Ultra 3000 workstation running the Solaris 7 OE.

Customer call #2
The customer notices problems with the at and cron utilities on the
Solaris 2.6 OE. Audit records are not properly generated, and the
date 2/29/2000, in particular, causes errors with the at utility.

Note: This workshop is slightly different from the others in Appendix C
because the solution is to locate relevant information in the SunSolve
Online service rather than to repair a faulty system in the classroom.
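
The date named in customer call #2 is a leap day: the year 2000 is a leap year under the Gregorian rule because it is divisible by 400, even though most century years are not leap years. The short sketch below spells out that rule as an aid to understanding why this date is a special case. It is an illustration only and says nothing about how the at or cron utilities are implemented; the function name is an assumption.

```python
import calendar
import datetime


def is_leap_year(year):
    """Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Compare the hand-written rule with the standard library for a few years.
for year in (1900, 1996, 2000, 2001):
    print(year, is_leap_year(year), calendar.isleap(year))

# 29 February 2000 is a valid calendar date.
print(datetime.date(2000, 2, 29))
```

Code that applies only the "divisible by 100" exception and omits the "divisible by 400" clause treats 2000 as a common year and rejects this date.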

Problem Statement

Resources

Problem Description
Use Table C-85 to document the problem description.
Table C-85 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-86 to document the results of testing and verification.
Table C-86 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-87 to document the corrective action.
Table C-87 Corrective Action

Final Repair | Communication | Documentation

Fault #30 Corrupt File System


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The boot sequence is incomplete due to apparent file system corruption
after a system crash.
The following error message is displayed:
/dev/rdsk/c0t0d0s7: CANNOT read: BLK 5097440
The following file system(s) had an unexpected
inconsistency: /dev/rdsk/c0t0d0s7 (/export/home).
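
The device name in the message follows the Solaris disk naming pattern of controller, target, disk, and slice numbers (c, t, d, and s). As a small illustrative sketch, the function below splits such a name into its parts, which can help when matching an error message to a physical disk and slice. The function name and the returned dictionary keys are assumptions.

```python
import re

# Matches the trailing cXtYdZsN part of a device path.
_DEVICE = re.compile(r"c(\d+)t(\d+)d(\d+)s(\d+)$")


def parse_disk_slice(path):
    """Return controller, target, disk, and slice numbers, or None."""
    match = _DEVICE.search(path)
    if match is None:
        return None
    controller, target, disk, slice_ = (int(g) for g in match.groups())
    return {
        "controller": controller,
        "target": target,
        "disk": disk,
        "slice": slice_,
    }


print(parse_disk_slice("/dev/rdsk/c0t0d0s7"))
```

Names that do not end in this four-part pattern, such as devices without a target number, return None in this sketch.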

Problem Statement

Resources

Problem Description
Use Table C-88 to document the problem description.
Table C-88 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-89 to document the results of testing and verification.
Table C-89 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-90 to document the corrective action.
Table C-90 Corrective Action

Final Repair | Communication | Documentation

Fault #31 Insufficient File Permission


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The user cannot write to the /home directory. The problem appeared
while creating the test directory in the /home directory.
The following error messages are displayed:
mkdir: Failed to make directory "test"; Operation not
applicable
touch: test cannot create
vi Operation not applicable.

Problem Statement

Resources

Problem Description
Use Table C-91 to document the problem description.
Table C-91 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-92 to document the results of testing and verification.
Table C-92 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-93 to document the corrective action.
Table C-93 Corrective Action

Final Repair | Communication | Documentation

Fault #32 Problem in the Network


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The user cannot use the telnet command to communicate with the
systems in other networks. However, the local network systems are
accessible.
The following error message is displayed:
Network unreachable
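
Hosts on the local network are reached directly, while traffic to any other network has to be handed to a router. The sketch below illustrates that distinction: given a local address and netmask, it reports whether a destination lies outside the local subnet and therefore depends on a route through a router. The function name and all addresses are made-up examples, not values from the classroom network.

```python
import ipaddress


def needs_router(local_address, netmask, destination):
    """Return True if destination lies outside the local subnet."""
    # strict=False lets a host address stand in for its network.
    local_net = ipaddress.ip_network(
        "%s/%s" % (local_address, netmask), strict=False)
    return ipaddress.ip_address(destination) not in local_net


# All addresses below are made-up examples.
print(needs_router("192.168.1.10", "255.255.255.0", "192.168.1.20"))
print(needs_router("192.168.1.10", "255.255.255.0", "192.168.2.20"))
```

When only the destinations for which this function returns True are unreachable, the routing configuration of the faulty host is a reasonable area to examine.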

Problem Statement

Resources

Problem Description
Use Table C-94 to document the problem description.
Table C-94 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-95 to document the results of testing and verification.
Table C-95 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-96 to document the corrective action.
Table C-96 Corrective Action

Final Repair | Communication | Documentation

Fault #33 Login Problem


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system administrator created a user account, but the user cannot log
in successfully. The system accepts the login ID and password and a login
session seems to start, but the user is then logged out automatically.

Problem Statement

Resources

Problem Description
Use Table C-97 to document the problem description.
Table C-97 Problem Description

Error Messages | Symptoms and Conditions | Relevant Changes | Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-98 to document the results of testing and verification.
Table C-98 Test and Verification

Likely Causes | Tests | Results | Verification

Corrective Action
Use Table C-99 to document the corrective action.
Table C-99 Corrective Action

Final Repair | Communication | Documentation

Fault #34 Analyze System Crash Dumps


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


One of the Sun systems that is administered by the customer panics and a
system crash dump is generated.

Problem Statement

Resources

Problem Description
Use Table C-100 to document the problem description.
Table C-100 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-101 to document the results of testing and verification.
Table C-101 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-102 to document the corrective action.
Table C-102 Corrective Action

Final Repair    Communication    Documentation

Fault #35 Problem in the Network


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The user cannot use the ftp command to communicate with system A
from other systems, and the login fails.

Problem Statement

Resources

Problem Description
Use Table C-103 to document the problem description.
Table C-103 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-104 to document the results of testing and verification.
Table C-104 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-105 to document the corrective action.
Table C-105 Corrective Action

Final Repair    Communication    Documentation

Fault #36 Faulty CD-ROM


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The user cannot download any files or directories from the CD-ROM
drive of the server.
The ls command does not display any files or directories, or the
following error message is displayed:
:no such directory

Problem Statement

Resources

Problem Description
Use Table C-106 to document the problem description.
Table C-106 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-107 to document the results of testing and verification.
Table C-107 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-108 to document the corrective action.
Table C-108 Corrective Action

Final Repair    Communication    Documentation

Fault #37 Turn the Page


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The pg and passwd commands do not work. Only users with no password
can log in.
The pg command causes the system to hang.

Problem Statement

Resources

Problem Description
Use Table C-109 to document the problem description.
Table C-109 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-110 to document the results of testing and verification.
Table C-110 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-111 to document the corrective action.
Table C-111 Corrective Action

Final Repair    Communication    Documentation

Fault #38 Login Problem


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The root user cannot log in successfully. The system accepts the login
name and the password, if one is required. A login appears to start, but
then the system logs out.

Problem Statement

Resources

Problem Description
Use Table C-112 to document the problem description.
Table C-112 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-113 to document the results of testing and verification.
Table C-113 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-114 to document the corrective action.
Table C-114 Corrective Action

Final Repair    Communication    Documentation

Fault #39 Do Not Point at Me


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The user cannot communicate with other systems on the network from
system C. Other systems cannot reach system C by its system name.
System C cannot communicate with other systems on the same subnet.

Problem Statement

Resources

Problem Description
Use Table C-115 to document the problem description.
Table C-115 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-116 to document the results of testing and verification.
Table C-116 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-117 to document the corrective action.
Table C-117 Corrective Action

Final Repair    Communication    Documentation

Fault #40 Problem in the Network


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The user receives an error message when attempting a telnet, rlogin, or
rsh session or when trying to bring up an xterm login.
The following error message is displayed:
could not grant slave pty

Problem Statement

Resources

Problem Description
Use Table C-118 to document the problem description.
Table C-118 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-119 to document the results of testing and verification.
Table C-119 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-120 to document the corrective action.
Table C-120 Corrective Action

Final Repair    Communication    Documentation

Fault #41 No Space on the File System


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The user cannot create any files in the file system.
The following error message is displayed:
No space left on device

Problem Statement

Resources

Problem Description
Use Table C-121 to document the problem description.
Table C-121 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-122 to document the results of testing and verification.
Table C-122 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-123 to document the corrective action.
Table C-123 Corrective Action

Final Repair    Communication    Documentation

Fault #42 Cannot Mount a File System


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The user cannot mount a file system on a particular directory but can
mount it on the /mnt directory.
The following error message is displayed:
nfs mount: mount:/ :Device busy

Problem Statement

Resources

Problem Description
Use Table C-124 to document the problem description.
Table C-124 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-125 to document the results of testing and verification.
Table C-125 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-126 to document the corrective action.
Table C-126 Corrective Action

Final Repair    Communication    Documentation

Fault #43 Problem in the Network


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


No connection can be made because the target machine actively refuses to
make a connection.
The following error message is displayed:
telnet: Unable to connect to remote host: Connection
refused

Problem Statement

Resources

Problem Description
Use Table C-127 to document the problem description.
Table C-127 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-128 to document the results of testing and verification.
Table C-128 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-129 to document the corrective action.
Table C-129 Corrective Action

Final Repair    Communication    Documentation

Fault #44 User Login Problem


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


An error message appears when a user logs in. The DT messaging system
could not be started.

Problem Statement

Resources

Problem Description
Use Table C-130 to document the problem description.
Table C-130 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-131 to document the results of testing and verification.
Table C-131 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-132 to document the corrective action.
Table C-132 Corrective Action

Final Repair    Communication    Documentation

Fault #45 Problem in the Network


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The following error message is displayed while trying to invoke any TCP
services:
inetd<int>: string/tcp: unknown service

Problem Statement

Resources

Problem Description
Use Table C-133 to document the problem description.
Table C-133 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-134 to document the results of testing and verification.
Table C-134 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-135 to document the corrective action.
Table C-135 Corrective Action

Final Repair    Communication    Documentation

Fault #46 System Displays a Panic Message


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system does not boot to the default run level and goes into a loop.
The machine is rebooting continuously and displays the following panic
message:
.....<output truncated>
can't invoke /etc/init

Problem Statement

Resources

Problem Description
Use Table C-136 to document the problem description.
Table C-136 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-137 to document the results of testing and verification.
Table C-137 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-138 to document the corrective action.
Table C-138 Corrective Action

Final Repair    Communication    Documentation

Fault #47 Corrupt File System


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system does not boot. The problem might have occurred when the
system crashed during a power failure.
The following error message is displayed:
The file just loaded does not appear to be executable.

Problem Statement

Resources

Problem Description
Use Table C-139 to document the problem description.
Table C-139 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-140 to document the results of testing and verification.
Table C-140 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-141 to document the corrective action.
Table C-141 Corrective Action

Final Repair    Communication    Documentation

Fault #48 Remote Login Failure


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


Remote login fails and the following error message is displayed:
inetd<PID>: /usr/sbin/in.rlogind: cannot execute:
Permission denied

Problem Statement

Resources

Problem Description
Use Table C-142 to document the problem description.
Table C-142 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-143 to document the results of testing and verification.
Table C-143 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-144 to document the corrective action.
Table C-144 Corrective Action

Final Repair    Communication    Documentation

Fault #49 Corrupt File System


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


The system boots into maintenance mode.
The following error message is displayed:
/usr/sbin/fsck not found
cannot mount /usr filesystem

Problem Statement

Resources

Problem Description
Use Table C-145 to document the problem description.
Table C-145 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-146 to document the results of testing and verification.
Table C-146 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-147 to document the corrective action.
Table C-147 Corrective Action

Final Repair    Communication    Documentation

Fault #50 Student Designed Workshop


Use the worksheet to complete the analysis and diagnosis of the fault.

Analysis Phase
Document the observations made during the Analysis phase.

Initial Customer Description


Create your own.
Students design a workshop for another group in the class to solve. This
exercise is optional and involves working in groups to design a workable
problem with a customer description that can be given to another group
for fault analysis.

Problem Statement

Resources

Problem Description
Use Table C-148 to document the problem description.
Table C-148 Problem Description

Error Messages    Symptoms and Conditions    Relevant Changes    Comparative Facts

Diagnosis Phase
Document the observations made during the Diagnosis phase.

Test and Verification


Use Table C-149 to document the results of testing and verification.
Table C-149 Test and Verification

Likely Causes    Tests    Results    Verification

Corrective Action
Use Table C-150 to document the corrective action.
Table C-150 Corrective Action

Final Repair    Communication    Documentation

Appendix D

Workshop Exercises
This appendix contains the following fault worksheets:

Fault Worksheet #1 Blank Monitor

Fault Worksheet #2 Unknown Device

Fault Worksheet #3 The ps Command Does Not Work

Fault Worksheet #4 Repetitive Boot Sequences

Fault Worksheet #5 Login Problem

Fault Worksheet #6 Hung System

Fault Worksheet #7 Problem in the Network

Fault Worksheet #8 Hung System

Fault Worksheet #9 Problem With the CDE

Fault Worksheet #10 Problem With the ftp Service

Fault Worksheet #11 Problem With the Non-root User Accounts

Fault Worksheet #12 Problem in the Network

Fault Worksheet #13 Problem With the CDE

Fault Worksheet #14 Problem With the CDE Login Screen

Fault Worksheet #15 Problem With the root Account

Fault Worksheet #16 Problem in the Network

Fault Worksheet #17 Problem With the Network Printer

Fault Worksheet #18 Problem in the Network

Fault Worksheet #19 Problem With Read-only File System

Fault Worksheet #20 Problem With the CDE

Fault Worksheet #21 Corrupt Network File

Fault Worksheet #22 Problem in the Network

Fault Worksheet #23 Problem With Admintool

Fault Worksheet #24 Boot Failure

Fault Worksheet #25 Hung System

Fault Worksheet #26 Problem in the Network

Fault Worksheet #27 Script Hangs the System

Fault Worksheet #28 Inappropriate Halts

Fault Worksheet #29 SunSolve Workshop

Fault Worksheet #30 Corrupt File System

Fault Worksheet #31 Insufficient File Permission

Fault Worksheet #32 Problem in the Network

Fault Worksheet #33 Login Problem

Fault Worksheet #34 Analyze System Crash Dumps

Fault Worksheet #35 Problem in the Network

Fault Worksheet #36 Faulty CD-ROM

Fault Worksheet #37 Turn the Page

Fault Worksheet #38 Login Problem

Fault Worksheet #39 Do Not Point at Me

Fault Worksheet #40 Problem in the Network

Fault Worksheet #41 No Space on the File System

Fault Worksheet #42 Cannot Mount a File System

Fault Worksheet #43 Problem in the Network

Fault Worksheet #44 User Login Problem

Fault Worksheet #45 Problem in the Network

Fault Worksheet #46 System Displays a Panic Message

Fault Worksheet #47 Corrupt File System

Fault Worksheet #48 Remote Login Failure

Fault Worksheet #49 Corrupt File System

Fault Worksheet #50 Student Designed Workshop

Fault #1 Blank Monitor


The following is the description and possible fixes of the fault.

Initial Customer Description


The administrator upgraded the PROM on the system and customized
some of the settings. However, when the system reboots, the monitor is
blank.

Error Messages or Symptoms


None recorded.

Probable Causes
The following are the probable causes:

Faulty monitor

Disconnected cable

Missing frame buffers

Inappropriate OBP settings

Fault Insertion
Use the setenv command to modify the pcib-probe-list variable to an
invalid value.
The following is an example for a Sun4u PCI-based system:
ok printenv pcib-probe-list
ok setenv pcib-probe-list 1,3
ok reset
The system restarts with a blank monitor. Students might use the Stop-N
key sequence during power on to set the default values of the
environment variables. However, encourage students to debug the system
and analyze the cause of the problem.

Possible Fixes
Complete the following steps:
1. Set up a remote diagnostic session with the tip connection to
   observe the POST output.

2. Use the printenv command to check the values of environment
   variables.

3. Use the setenv command to set the probe list variable to its default
   value.
   Alternatively, you can use the Stop-N (L1-N) key sequence.

Learning
Set up a remote diagnostic session with the tip connection to perform
diagnostics on a remote system.

Fault #2 Unknown Device


The following is the description and possible fixes of the fault.

Initial Customer Description


The boot sequence appears to start correctly and then reports an unknown
device. When the user runs the boot -a command to start the system
with all the default parameters, the system boots successfully.
The boot sequence is incomplete due to apparent file system corruption
after the last system crash.

Error Messages or Symptoms


The system goes into a loop.

Probable Causes
The following are the probable causes:

A hardware problem, such as a device set to an incorrect target

Incorrect configuration of the SBus

Improper device definitions specified in OBP

Fault Insertion
Complete the following steps:
1. Modify the /etc/system file to reflect a root device that does not
   exist.

2. Make a copy of the original /etc/system file.
   # cp /etc/system /var/tmp/.system

3. In the /etc/system file, remove the comment preceding the example of
   the rootdev path name to enable the rootdev entry. If the rootdev
   parameter of the system file is set, change a value in the path name.

   For example, edit the /etc/system file:
   Before edit:
   rootdev:/sbus@1,f8000000/esp@0,800000/sd@3,0:a
   After edit:
   rootdev:/pci@1f,0/pci@1,1/disk@0,0

4. Reboot the system.

Possible Fixes
Complete the following steps:
1. Verify system configuration by using OBP commands.

2. Use the boot -a command, and specify the /etc/system file
   containing errors. The system boots successfully with the boot -a
   command even when errors are present in the /etc/system file.

3. Emphasize that the system boots successfully in spite of an invalid
   entry in the /etc/system file because the interactive boot responses
   override the invalid information in the file.

Learning
Learn one of the many uses of the /etc/system file. In this example, you
use the /etc/system file to change the root device after loading the
initial boot and the kernel from a different device.
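
One way to confirm which rootdev entry is active is to filter out the
comment lines, which begin with an asterisk (*) in the /etc/system file.
The following is a minimal sketch only. The active_rootdev function name
and the scratch file are illustrative assumptions, and the sample entries
are the ones shown in this exercise.

```shell
#!/bin/sh
# Hypothetical helper: print the active (uncommented) rootdev entries in
# the system file given as $1. Lines that begin with "*" are comments,
# so they never match the pattern below.
active_rootdev() {
    grep '^[[:space:]]*rootdev:' "$1"
}

# Demonstration on a scratch file rather than the live /etc/system file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
* rootdev:/sbus@1,f8000000/esp@0,800000/sd@3,0:a
rootdev:/pci@1f,0/pci@1,1/disk@0,0
EOF
active_rootdev "$sample"
rm -f "$sample"
```

Running the helper against a copy of the file, such as the
/var/tmp/.system backup, makes it possible to compare the original and
modified entries without touching the live file.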

Fault #3 The ps Command Does Not Work


The following is the description and possible fixes of the fault.

Initial Customer Description


The administrator reconfigured the disk drives. Now, when the user
attempts to run the ps command, it does not work correctly.

Error Messages or Symptoms


The system displays the following messages and prompts you to enter the
root password for system maintenance:
.........<Output truncated>
failed to open /etc/coreadm.conf:Read only file system
INIT: Cannot create /var/adm/utmpx
....<Output truncated>
When you run the ps command, the following message is displayed:
# ps
ps: getexecname() failed

Probable Causes
The following are the probable causes:

Operator error

Corrupt ps executable file

Fault Insertion
Modify the /proc entry option field in the /etc/vfstab file.
Complete the following steps to modify the /etc/vfstab file:
1. Make a copy of the /etc/vfstab file:
   # cp /etc/vfstab /var/tmp/.vfstab

2. Use the vi editor to open the /etc/vfstab file:
   # vi /etc/vfstab

   Change the line that reads:
   /proc   -   /proc   proc   -   no   -
   to
   /proc   -   /proc   proc   -   no   suid

3. Relocate the /etc/rcS.d/S40standardmounts.sh file:
   # mv /etc/rcS.d/S40standardmounts.sh \
   /etc/rcS.d/.orig.S40standardmounts.sh

4. Reboot the system.

Possible Fixes
The modification in the /etc/rcS.d directory causes the root file system
to be mounted in read-only mode. Therefore, the edit session becomes
more complex because students cannot edit files on the root file system.
To debug the problem, students must boot the system from a CD-ROM.
To boot the system from a CD-ROM, complete the following steps:
1. Boot the system from the CD-ROM in single-user mode.

2. Type the fsck command to repair the root file system:
   # fsck /dev/dsk/c0t0d0s0
   where c0t0d0s0 is the root file system.

3. Mount the root file system onto the /a directory:
   # mount /dev/dsk/c0t0d0s0 /a

4. Restore the modified files.

5. Reboot the system.
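
Step 4 can be sketched as follows, assuming that the backup copies
created during fault insertion (/var/tmp/.vfstab and
/etc/rcS.d/.orig.S40standardmounts.sh) are reachable under the mounted
root file system. The restore_fault3 function name is hypothetical. If
/var is a separate file system, it would need to be mounted under /a
first.

```shell
#!/bin/sh
# Sketch: undo the Fault #3 insertion on a root file system mounted at $1.
# Assumes the backups made during fault insertion are on that file system.
restore_fault3() {
    root=$1
    # Put the saved vfstab back, then give the startup script its
    # original name so that it runs again at boot.
    cp "$root/var/tmp/.vfstab" "$root/etc/vfstab" &&
    mv "$root/etc/rcS.d/.orig.S40standardmounts.sh" \
       "$root/etc/rcS.d/S40standardmounts.sh"
}

# Typical call after step 3, with the root slice mounted on /a:
#   restore_fault3 /a
```

Passing the mount point as an argument keeps the sketch from touching
the files of the running system by accident.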

Learning
Learn about the files required for system operations, and review the
contents of the /etc/vfstab file.
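
When reviewing the /etc/vfstab file, it can help to print only the mount
options field for one mount point. The following is a minimal sketch
under stated assumptions: the vfstab_options function name is
hypothetical, and the sample entry reproduces the modified /proc line
from this exercise in the standard seven-field layout.

```shell
#!/bin/sh
# Hypothetical helper: print the mount options (seventh field) of the
# entry whose mount point (third field) equals $2 in the vfstab file $1.
# Comment lines, which begin with "#", are skipped.
vfstab_options() {
    awk -v mp="$2" '$1 !~ /^#/ && $3 == mp { print $7 }' "$1"
}

# Demonstration on a scratch file rather than the live /etc/vfstab file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#device   device   mount   FS    fsck  mount    mount
#to mount to fsck  point   type  pass  at boot  options
/proc     -        /proc   proc  -     no       suid
EOF
vfstab_options "$sample" /proc
rm -f "$sample"
```

For the modified entry the helper prints suid. For an entry with no
mount options it prints a hyphen (-).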

Fault #4 Repetitive Boot Sequences


The following is the description and possible fixes of the fault.

Initial Customer Description


The system is continuously rebooting.

Error Messages or Symptoms


None recorded.

Probable Causes
The probable cause is an operator error.

Fault Insertion
Complete the following steps:
1. Make a copy of the /etc/inittab file, and name the copy
   /etc/inittab.org.

2. Set the TERM parameter:
   # TERM=sun
   # export TERM
   The TERM parameter enables you to edit files by using the vi editor.

3. Use the vi editor to open the /etc/inittab file.

4. Change the run level from 3 to 6 in the /etc/inittab file.

5. Edit the /etc/inittab file:
   Before edit:
   ap::sysinit:/sbin/autopush -f /etc/iu.ap
   fs::sysinit:/sbin/rcS>/dev/console 2>&1 </dev/console
   is:3:initdefault:
   p3:s1234:powerfail:/sbin/shutdown -y -i0 -g0 >/dev/console 2>&1

   After edit:
   ap::sysinit:/sbin/autopush -f /etc/iu.ap
   fs::sysinit:/sbin/rcS>/dev/console 2>&1 </dev/console
   is:6:initdefault:
   p3:s1234:powerfail:/sbin/shutdown -y -i0 -g0 >/dev/console 2>&1

Note: Only the is:6:initdefault: entry differs from the original file.

Possible Fixes Using a CD-ROM


Complete the following steps:
1.

Use the Stop-A keys to access the ok prompt.

2.

Boot the system from the CD-ROM in single-user mode.

3.

Type the fsck command to repair the root file system:


# fsck /dev/dsk/c0t0d0s0
where c0t0d0s0 is the root file system.

4.

Mount the root file system onto the /a directory:


# mount /dev/dsk/c0t0d0s0 /a

5.

Set the TERM parameter:


# TERM=sun
# export TERM
The TERM parameter enables you to edit files by using the vi editor.

6.

Use the vi editor to open the /a/etc/inittab file, and restore the
original settings.

Possible Fixes if the CD-ROM Is not Available


Complete the following steps:
1.

Use the Stop-A keys to access the ok prompt.

2.

Boot the system in single-user mode.

3.

Set the TERM parameter:


# TERM=sun
# export TERM
The TERM parameter enables you to edit files by using the vi editor.

4.

Use the vi editor to open the /etc/inittab file, and restore the
original settings.

Learning
Learn about the /etc/inittab file, which the init process reads during
the boot sequence.


Fault #5 Login Problem


The following is the description and possible fixes of the fault.

Initial Customer Description


The system administrator created an account for a new user. However,
during login, the system displays an error message and immediately logs
out the user.

Error Messages or Symptoms


Invalid user shell, login rejected

Probable Causes
The following are the probable causes:

Corrupt password, shadow, or shell-startup files

Modification of the /etc/passwd file, resulting in an invalid shell
entry

System overload

Fault Insertion
Modify the /etc/passwd file to reflect either an improper shell or no shell
for the new user, and then reboot the system.
Complete the following steps to edit the /etc/passwd file:
1.

Set the TERM parameter:


# TERM=sun
# export TERM

2.

Use the vi editor to open the /etc/passwd file.


3.

Edit the /etc/passwd file:


Before edit:
user1:x:100:1::/home/user1:/bin/sh
After edit:
user1:x:100:1::/home/user1:/sbin/csh

Note You can also insert the fault by adding a trailing space after the
user1:x:100:1::/home/user1:/bin/sh entry.

Possible Fixes
Log in as the root user, and correct the invalid shell entry in the
/etc/passwd file.
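To locate the invalid entry quickly, you can test every shell field in the file. The following is a minimal sketch under stated assumptions: bad_shells is a hypothetical helper, the sample file stands in for the /etc/passwd file, and the user2 entry is invented to illustrate the trailing-space variant of the fault.

```shell
#!/bin/sh
# bad_shells is a hypothetical helper, not a Solaris command. It prints
# every account whose shell field (field 7) is empty or does not name an
# executable file. The brackets make stray spaces visible.
bad_shells() {
    while IFS=: read -r name pw uid gid gecos home shell; do
        if [ -z "$shell" ] || [ ! -x "$shell" ]; then
            echo "$name:[$shell]"
        fi
    done < "$1"
}

# Sample data standing in for /etc/passwd; user2 has a trailing space.
sample=$(mktemp)
printf '%s\n' \
    'root:x:0:1:Super-User:/:/bin/sh' \
    'user1:x:100:1::/home/user1:/sbin/csh' \
    'user2:x:101:1::/home/user2:/bin/sh ' > "$sample"

result=$(bad_shells "$sample")
echo "$result"
rm -f "$sample"
```

An entry is reported only when the path in its last field cannot be executed, so valid accounts such as root produce no output.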

Learning
Learn about the /etc/passwd file and the significance of the parameters
specified in the file.


Fault #6 Problem With the root Login


The following is the description and possible fixes of the fault.

Initial Customer Description


The users cannot directly log in to the root account through a remote
login from the network.

Error Messages or Symptoms


Every time a user attempts to log in as the root user, the following error
message is displayed on the console, and the login fails:
Not on system console.

Probable Causes
The following are the probable causes:

Faulty ASCII terminal

Security software

Incorrect settings in system files

Fault Insertion
Complete the following steps:
1.

Set the TERM parameter:


# TERM=sun
# export TERM

2.

Use the vi editor to open the /etc/default/login file.


3.

Uncomment the CONSOLE=/dev/console line:


Before edit:
#CONSOLE=/dev/console
After edit:
CONSOLE=/dev/console

Note You can also insert the fault by ensuring that in the
/etc/default/login file the CONSOLE line is uncommented and the
parameter of CONSOLE is set to /dev/null. This ensures that you cannot
log in as the root user from any terminal.

Possible Fixes
Complete the following steps:
1.

Perform diagnostics to check the hardware.

2.

Restore the default settings in the /etc/default/login file to
enable the root login.

Learning
You can disable or enable remote root login by uncommenting or
commenting out the CONSOLE=/dev/console line, respectively, in the
/etc/default/login file.
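The state of the CONSOLE setting can be read without opening an editor. The following is a minimal sketch, assuming a hypothetical helper named console_setting and a sample file in place of the real /etc/default/login file.

```shell
#!/bin/sh
# console_setting is a hypothetical helper, not a Solaris command.
# It prints the value of the last active CONSOLE= line; commented-out
# lines do not match because they start with '#'.
console_setting() {
    sed -n 's/^CONSOLE=//p' "$1" | tail -1
}

# Sample data standing in for /etc/default/login.
sample=$(mktemp)
printf '%s\n' '#PASSREQ=YES' 'CONSOLE=/dev/console' > "$sample"

if grep -q '^CONSOLE=' "$sample"; then
    status="root login restricted to $(console_setting "$sample")"
else
    status="root login allowed from any terminal"
fi
echo "$status"
rm -f "$sample"
```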


Fault #7 Problem in the Network


The following is the description and possible fixes of the fault.

Initial Customer Description


Users cannot communicate with the local network from system B.

Error Messages or Symptoms


When you run the ping or rup command, the system does not respond.

Probable Causes
The probable cause is the incorrect execution of the install, ifconfig,
or sys-unconfig command.

Fault Insertion
Edit the /etc/hosts file to replace the digit 1 in the IP addresses with
the lowercase letter l.
Complete the following steps:
1.

Use the vi editor to open the /etc/hosts network file.

2.

Modify the IP addresses in the /etc/hosts file by typing the
following commands:
# cp /etc/hosts /var/tmp/.hosts
# cd /etc
# vi hosts
Before edit:
127.0.0.1 localhost
172.16.64.101 mako loghost
172.17.22.31 sun
172.17.128.101 hammer
After edit:
l27.0.0.l localhost
172.16.64.l0l mako loghost
172.17.22.3l sun
172.17.128.l0l hammer
# touch -am 01121234 *


3.

Restart the system.

4.

Log in and clear the screen.

Possible Fixes
To restore the /etc/hosts file, type the following command:
# cp /var/tmp/.hosts /etc/hosts
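Because the letter l and the digit 1 look alike on many terminals, a mechanical check is more reliable than reading the file. The following is a minimal sketch, assuming a hypothetical helper named bad_host_lines and a sample file in place of the real /etc/hosts file. It handles IPv4 entries only.

```shell
#!/bin/sh
# bad_host_lines is a hypothetical helper, not a Solaris command. It
# prints each line whose first field is not a dotted-quad IPv4 address.
# Comments and blank lines are skipped; IPv6 entries are out of scope.
bad_host_lines() {
    awk '
        /^[ \t]*#/ || /^[ \t]*$/ { next }
        {
            n = split($1, octet, ".")
            ok = (n == 4)
            for (i = 1; ok && i <= n; i++)
                if (octet[i] !~ /^[0-9]+$/ || octet[i] + 0 > 255)
                    ok = 0
            if (!ok) print NR ": " $0
        }' "$1"
}

# Sample data standing in for the faulty /etc/hosts file.
sample=$(mktemp)
printf '%s\n' \
    'l27.0.0.l localhost' \
    '172.16.64.l0l mako loghost' \
    '172.17.22.31 sun' > "$sample"

bad=$(bad_host_lines "$sample")
echo "$bad"
rm -f "$sample"
```

Each reported line is prefixed with its line number in the file so that it is easy to find in the vi editor.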

Learning
Learn about the files that you must check for configuration errors when
network operations are faulty. Use fault analysis techniques to isolate
problems.


Fault #8 Hung System


The following is the description and possible fixes of the fault.

Initial Customer Description


The system does not boot and drops to maintenance mode.

Error Messages or Symptoms


/etc/rcS.d/S40standardmounts:read vfstab: not found.
.....
INIT: Cannot create /var/adm/utmpx
......
INIT SINGLE USER MODE
......

Probable Causes
The probable cause is the incorrect or missing
/etc/rcS.d/S40standardmounts.sh file.

Fault Insertion
Move the /etc/rcS.d/S40standardmounts.sh file to a different
location, or rename the file so that it loses the .sh suffix, by typing
the following commands:
# cd /etc/rcS.d
# mv S40standardmounts.sh S40standardmounts

Possible Fixes
Complete the following steps to repair the fault:
1.

Boot the system in single-user mode, and log in as the root user.

2.

Restore the /etc/rcS.d/S40standardmounts.sh file by typing the
following commands:
# mount -o rw,remount /
# cd /etc/rcS.d
# mv S40standardmounts S40standardmounts.sh
# uadmin 2 1

3.

Reboot the system.

Learning
Learn about the various startup scripts and their significance.
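The .sh suffix matters because of how the startup scripts are run. The following is a minimal sketch of the idea, assuming that the rcS script sources files that end in .sh and runs all other files in a child shell; the directory, the readvfstab function body, and the run_script helper are illustrative stand-ins, not the actual Solaris code.

```shell
#!/bin/sh
# Hypothetical demo directory; this is not the real /etc/rcS.d.
demo=$(mktemp -d)

# Stand-in for a helper function that the parent script defines for the
# fragments that it sources.
readvfstab() { echo "readvfstab called for $1"; }

# The same fragment stored under two names.
echo 'readvfstab /usr' > "$demo/S40standardmounts.sh"
cp "$demo/S40standardmounts.sh" "$demo/S40standardmounts"

# run_script is a hypothetical helper that mimics the assumed dispatch.
run_script() {
    case "$1" in
        *.sh) . "$1" ;;              # sourced: sees the parent functions
        *)    /bin/sh "$1" start ;;  # child shell: does not see them
    esac
}

sourced=$(run_script "$demo/S40standardmounts.sh" 2>&1)
child=$(run_script "$demo/S40standardmounts" 2>&1)
child_status=$?
echo "with .sh:    $sourced"
echo "without .sh: $child (exit status $child_status)"
rm -rf "$demo"
```

The renamed copy fails because the function exists only in the parent shell, which matches the "not found" symptom in this fault.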


Fault #9 Problem With the CDE


The following is the description and possible fixes of the fault.

Initial Customer Description


When the user attempts to log in, the system returns to the login prompt.

Error Messages or Symptoms


Login fails

Probable Causes
The following are the probable causes:

Modified password or shadow file

Introduction of a cracker into the system

Corrupt software

Fault Insertion
Modify the entry in the /etc/nsswitch.conf file.
1.

Change the lines for the passwd and group file entries in the
/etc/nsswitch.conf file:
Before alteration:
passwd: files
group: files
After alteration:
passwd: dns
group: dns

2.

Log out from the system.

Note When a user logs in, the system consults the /etc/nsswitch.conf
file to determine where to look up passwd and group information. If the
entries no longer point to the local files, the system cannot find the
user in the passwd and shadow files, and authentication fails.


Possible Fixes
You must boot the system from the CD-ROM to fix this fault.
Complete the following steps to boot the system from the CD-ROM:
1.

Press the Stop-A keys to halt the system.
The ok prompt is displayed.

2.

Restart the system.

3.

Type the following command to boot the system from the CD-ROM
in single-user mode:
ok boot cdrom -s

4.

Run the fsck command to check and repair the root file system:
# fsck /dev/dsk/c0t0d0s0
where c0t0d0s0 is the root file system.

5.

Mount the root file system onto the /a directory by typing the
following command:
# mount /dev/dsk/c0t0d0s0 /a

6.

Set the TERM parameter:


# TERM=sun
# export TERM
The TERM parameter enables you to edit files using the vi editor.

7.

Edit the /a/etc/nsswitch.conf file to correct the settings.

Note Instead of editing the nsswitch.conf file, you can overwrite it with
the nsswitch.files file, which contains the default values.

8.

Restart the system.

Note If the system does not boot, you might have to reinstall the Solaris
OE.


Learning
When a user cannot log in to the system, first check the settings in the
/etc/nsswitch.conf file.
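That first check can be scripted. The following is a minimal sketch, assuming two hypothetical helpers named nss_sources and check_db and a sample file in place of the real nsswitch.conf file. It reports any database whose lookup order does not include the exact keyword files.

```shell
#!/bin/sh
# nss_sources and check_db are hypothetical helpers, not Solaris commands.
# nss_sources prints the sources listed for one database.
nss_sources() {
    awk -v db="$2:" '$1 == db { $1 = ""; sub(/^ +/, ""); print; exit }' "$1"
}

# check_db reports whether the exact keyword "files" is among the sources.
check_db() {
    case " $(nss_sources "$1" "$2") " in
        *" files "*) echo "$2: ok" ;;
        *)           echo "$2: local files are not consulted" ;;
    esac
}

# Sample data standing in for the faulty nsswitch.conf file.
sample=$(mktemp)
printf '%s\n' 'passwd: dns' 'group: files' > "$sample"

passwd_state=$(check_db "$sample" passwd)
group_state=$(check_db "$sample" group)
echo "$passwd_state"
echo "$group_state"
rm -f "$sample"
```

The comparison is case-sensitive on purpose, so a misspelled keyword is reported in the same way as a wrong name service.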


Fault #10 Problem With the ftp Service


The following is the description and possible fixes of the fault.

Initial Customer Description


The users cannot transfer files by using the File Transfer Protocol (FTP)
service.

Error Messages or Symptoms


ftp:connect: Connection refused

Probable Causes
The following are the probable causes:

Network problem

Incorrect configuration of the server

Fault Insertion
Disable the entry for the ftp service in the /etc/inetd.conf file by
completing the following steps:
1.

Use the vi editor to open the /etc/inetd.conf file.

2.

Edit the line for the ftp service to disable the service:
Before edit:
ftp stream tcp nowait root /usr/sbin/in.ftpd in.ftpd
After edit:
#ftp stream tcp nowait root /usr/sbin/in.ftpd in.ftpd

3.

Force the inetd process to reread its configuration file, and then
test the ftp service, by typing the following commands:
# pkill -HUP inetd
# ftp localhost


Possible Fixes
Complete the following steps:
1.

Edit the /etc/inetd.conf file to restore the correct settings.

2.

Restart the inetd process.


# pkill -HUP inetd
# ftp localhost
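Before restarting the inetd process, you can confirm which services have an active entry. The following is a minimal sketch, assuming hypothetical helpers named service_enabled and state_of and a sample file in place of the real /etc/inetd.conf file.

```shell
#!/bin/sh
# service_enabled and state_of are hypothetical helpers, not Solaris
# commands. service_enabled succeeds only if an uncommented line starts
# with the service name; "#ftp" does not match "ftp".
service_enabled() {
    awk -v svc="$2" '$1 == svc { found = 1 } END { exit !found }' "$1"
}

state_of() {
    if service_enabled "$1" "$2"; then echo enabled; else echo disabled; fi
}

# Sample data standing in for /etc/inetd.conf.
sample=$(mktemp)
printf '%s\n' \
    '#ftp stream tcp nowait root /usr/sbin/in.ftpd in.ftpd' \
    'telnet stream tcp nowait root /usr/sbin/in.telnetd in.telnetd' \
    > "$sample"

ftp_state=$(state_of "$sample" ftp)
telnet_state=$(state_of "$sample" telnet)
echo "ftp: $ftp_state"
echo "telnet: $telnet_state"
rm -f "$sample"
```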

Learning
Learn about the files that are essential to provide network services. In
addition, learn how to restrict network services by editing the appropriate
files.


Fault #11 Problem With Non-root User Accounts


The following is the description and possible fixes of the fault.

Initial Customer Description


Non-root users cannot log in to the desktop environment. They can log in
only by using a direct command-line login into the shell environment.

Error Messages or Symptoms


Login fails, and the login screen reappears.

Probable Causes
The following are the probable causes:

Incorrect CDE configuration

Restrictions on the files that you require to log in to the desktop
environment

Fault Insertion
Remove access permissions for the users of the /tmp directory by typing
the following commands:
# init 0
At the ok prompt, type the following:
ok boot -s
# chmod 1700 /tmp
The /tmp directory provides read and write permissions to all users by
default. A number of commands generate temporary files during
execution. Any command that creates temporary files fails because of
modified access rights.


Possible Fixes
Reset the permissions on the /tmp directory. The sys user and the sys
group must own the /tmp directory, and the directory must have access
rights of 1777, which include the sticky bit. Type the following
commands to reset access permissions on the /tmp directory:
# init 0
ok boot -s
# chmod 1777 /tmp

Note The sticky bit ensures that only the owner of a file, the owner of
the directory, or the root user can delete or rename that file in the
/tmp directory.
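The difference between the faulty and the repaired permissions is visible in the mode string that the ls command prints. The following is a minimal sketch that uses a scratch directory instead of the real /tmp directory; dir_mode is a hypothetical helper.

```shell
#!/bin/sh
# dir_mode is a hypothetical helper, not a Solaris command. It prints
# the 10-character mode string that ls -ld shows for a directory.
dir_mode() {
    ls -ld "$1" | cut -c1-10
}

# A scratch directory stands in for /tmp.
demo=$(mktemp -d)
chmod 1700 "$demo"
faulty=$(dir_mode "$demo")
chmod 1777 "$demo"
fixed=$(dir_mode "$demo")
echo "faulty: $faulty"
echo "fixed:  $fixed"
rmdir "$demo"
```

A trailing t shows the sticky bit together with execute permission for other users. A capital T shows the sticky bit without that permission, which is the faulty state in this exercise.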

Learning
You can locate the SunSolve documents related to the preceding fault
because a relevant bug exists in an earlier release of the Solaris OE.


Fault #12 Problem in the Network


The following is the description and possible fixes of the fault.

Initial Customer Description


The network communication is disrupted. Users cannot communicate
over the network.

Error Messages or Symptoms


While attempting to run the rlogin service from a functional host to the
faulty host, the following error message is displayed:
Unable to connect to remote host

Probable Causes
The following are the probable causes:

Incorrect settings in network files

Network problem

Fault in the hardware connections to the network

Fault Insertion
Complete the following steps:
1.

Use the vi editor to open the /etc/hosts network file.

2.

Modify the IP address in the /etc/hosts file to reflect an invalid IP
address of the client system known as hammer:
# vi /etc/hosts
Before edit:
127.0.0.1 localhost loghost
172.16.64.101 hammer
After edit:
127.0.0.1 localhost loghost
172.16.128.101 hammer

Note Make sure that there is no router on the network.

Possible Fixes
To fix the fault, restore the correct IP address in the /etc/hosts file.

Learning
Learn about the files in which you must check for configuration errors
when network operations are faulty.


Fault #13 Problem With the CDE


The following is the description and possible fixes of the fault.

Initial Customer Description


The CDE environment is unavailable. Only a direct login to the shell is
possible.

Error Messages or Symptoms


None recorded.

Probable Causes
The probable cause is that the user corrupted or accidentally deleted a
device file or installed the device file incorrectly.

Fault Insertion
Corrupt the /devices/pseudo/conskbd@0:kbd device file, and then
reboot the system.
To modify the /devices/pseudo/conskbd@0:kbd file, type the following
commands:
# cd /devices/pseudo
# mv conskbd@0:kbd conskbd@0:kbd.old
# ln pts2l@ttyrf conskbd@0:kbd
The /devices/pseudo/conskbd@0:kbd file is the console keyboard device
file, which the CDE environment requires. If you move or corrupt the
file, the system fails to start the CDE environment.


Possible Fixes
Compare the faulty system with a functional system, especially the
directory trees in the /dev and /devices directories. A reconfiguration
reboot fixes the problem. However, students must try to determine the
corrupt file.
Restore the correct device file by typing the following command:
# devfsadm -C

Learning
Determine which device files are required for proper console operation,
including the CDE and OpenWindows environments.


Fault #14 Problem With the CDE Login Screen


The following is the description and possible fixes of the fault.

Initial Customer Description


The system does not run the CDE default login screen.

Error Messages or Symptoms


None recorded.

Probable Cause
The probable cause is that the administrator accidentally deleted the
startup file for the desktop environment.

Fault Insertion
Type the dtconfig command with the -d option to disable the daemon.
This disables the S99dtlogin script in the /etc/rc2.d directory.

Note The S99dtlogin script starts the CDE.


To insert the fault, complete the following steps:
1.

Type the following command at the root prompt:


# /usr/dt/bin/dtconfig -d

2.

Restart the system.

Possible Fixes
Complete the following steps:
1.

Either copy the S99dtlogin file from another system, or type the
following command to enable the daemon:
# /usr/dt/bin/dtconfig -e

2.

Restart the system.


Learning
Learn about CDE configuration and administration, including the default
settings and the command interface.


Fault #15 Problem With the root Account


The following is the description and possible fixes of the fault.

Initial Customer Description


The system administrator tried to secure the system before going on
vacation. The system now displays an error during the boot sequence and
fails to boot.

Error Messages or Symptoms


The error message indicates that the root user cannot be identified. The
inetd daemon and other services fail to start because the root user
cannot be identified.

Probable Causes
The following are the probable causes:

Corrupt passwd file

Corrupt system files

Fault Insertion
Edit the /etc/passwd file, and modify the name of the root login.
Complete the following steps:
1.

Use the vi editor to open the /etc/passwd file for editing.

2.

Change the following line:


root:x:0:1:Super-User:/:/sbin/sh
to:
root::x:0:1:Super-User:/:/sbin/sh

3.

Restart the system.


Possible Fixes
Complete the following steps:
1.

Boot the system from the CD-ROM in single-user mode.

2.

Run the fsck command to check and repair the root file system:
# fsck /dev/dsk/c0t0d0s0
where /dev/dsk/c0t0d0s0 is the root file system.

3.

Mount the root file system onto the /a directory by typing the
following command:
# mount /dev/dsk/c0t0d0s0 /a

4.

Set the TERM parameter:


# TERM=sun
# export TERM
The TERM parameter enables you to edit files by using the vi editor.

5.

Edit the /a/etc/passwd file to restore the original root login.

6.

Reboot the system.

Learning
The Solaris OE cannot run without a valid root account.
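An extra colon is hard to see, but it changes the number of fields in the entry. The following is a minimal sketch, assuming a hypothetical helper named malformed_entries and a sample file in place of the real passwd file. It reports every line that does not have exactly seven colon-separated fields.

```shell
#!/bin/sh
# malformed_entries is a hypothetical helper, not a Solaris command.
# A passwd entry has seven colon-separated fields; the last one may be
# empty, as in the daemon entry below.
malformed_entries() {
    awk -F: 'NF != 7 { print NR ": " $0 }' "$1"
}

# Sample data standing in for the faulty passwd file.
sample=$(mktemp)
printf '%s\n' \
    'root::x:0:1:Super-User:/:/sbin/sh' \
    'daemon:x:1:1::/:' > "$sample"

bad=$(malformed_entries "$sample")
echo "$bad"
rm -f "$sample"
```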


Fault #16 Problem in the Network


The following is the description and possible fixes of the fault.

Initial Customer Description


The network communication does not function after the system was
installed on the current network.

Error Messages or Symptoms


None recorded.

Probable Causes
The following are the probable causes:

Faulty network files

Faulty network hardware

Faulty network cables

Fault Insertion
Complete the following steps:
1.

Disconnect the RJ connector from the workstation.

2.

Place a small piece of tape over pins 1, 2, and 3 of the connector.

3.

Reconnect the connector to the workstation.

Possible Fixes
Verify the network hardware connections, and check the network files to
ensure that you specify correct hosts and IP addresses.
In this exercise, remove the tape from pins 1, 2, and 3 of the RJ connector.

Learning
Use diagnostic checks to determine the cause of the problem.


Fault #17 Problem With the Network Printer


The following is the description and possible fixes of the fault.

Initial Customer Description


The network printer does not respond to print commands.

Error Messages or Symptoms


None recorded.

Probable Causes
The following are the probable causes:

Faulty printer hardware

Defective cables

Faulty network connection

Fault Insertion
Complete the following steps:
1.

Use the vi editor to open the /etc/passwd file.

2.

Prevent the lpsched service from running by removing or
corrupting the lp account information, which is saved in the
/etc/passwd file. To do this, change the following entry for the lp
account:
lp:x:71:8:0000-lp(0000):/usr/spool/lp

3.

To remove the lp account information, comment out the line in the
/etc/passwd file.

4.

To disable the lp run control scripts, type the following
commands:
# cd /etc/rc2.d
# mv S80lp s80lp
# mv K20lp .org.K20lp

Possible Fixes
Restore the account information for the lp account in the /etc/passwd
file.

Learning
The lp account in the /etc/passwd file is necessary to run a network
printer successfully.


Fault #18 Problem in the Network


The following is the description and possible fixes of the fault.

Initial Customer Description


The user cannot access system A from any other system.

Error Messages or Symptoms


Could not open a connection to host. Connect failed.

Probable Cause
The probable cause is that the system administrator accidentally deleted
the entry for the system name in the /etc/hosts file.

Fault Insertion
Remove or modify the host name of the system in the /etc/hosts file.
Complete the following steps to edit the /etc/hosts file:
1.

Make a copy of the /etc/hosts file by typing the following
command:
# cp /etc/hosts /var/tmp/.hosts

2.

Use the vi editor to open and edit the /etc/hosts file:
# cd /etc
# vi hosts
Before edit:
127.0.0.1 localhost
129.150.28.39 forward loghost
129.150.182.68 hammer
After edit:
l27.0.0.l localhost
l29.l50.28.39 forward loghost
129.150.182.68 hammer11
# touch -am 01121234 *

3.

Restart the system.

Possible Fixes
Restore the correct settings in the /etc/hosts file, or restore the
saved copy of the /etc/hosts file by typing the following commands:
# cp /var/tmp/.hosts /etc/hosts
# init 6

Learning
Learn about system files for network operations. Use fault analysis
techniques to isolate problems.


Fault #19 Problem With Read-only File System


The following is the description and possible fixes of the fault.

Initial Customer Description


The system cannot boot into multiuser mode.

Error Messages or Symptoms


Read-only file system. No utmpx.

Probable Cause
The following are the probable causes:

A corrupt vfstab entry

Corrupt rc scripts

Fault Insertion
Change the rcS script to point to the /etc/fstab file instead of the
/etc/vfstab file. The /etc/fstab file mounts the root file system in
read-only mode.

Note Set the TERM parameter. The TERM parameter enables you to edit
files by using the vi editor.
To corrupt the rcS script, complete the following steps:
1.

Edit the /etc/rcS script:


# vi /etc/rcS
Before edit:
vfstab=/etc/vfstab
After edit:
vfstab=/etc/fstab

2.

Copy the /etc/vfstab file to the /etc/fstab file:


# cp /etc/vfstab /etc/fstab

3.

Edit the /etc/fstab file to make the root file system read-only:
# vi /etc/fstab
Before edit:
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
After edit:
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no ro

Note The change in the preceding output is the ro mount option in the
last field.
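The mount options of the root entry can be extracted from any vfstab-format file. The following is a minimal sketch, assuming a hypothetical helper named root_mount_options and a sample file in place of the real file. The options are taken from the seventh field.

```shell
#!/bin/sh
# root_mount_options is a hypothetical helper, not a Solaris command.
# It prints field 7 (mount options) of the entry whose mount point,
# field 3, is the root directory.
root_mount_options() {
    awk '$1 !~ /^#/ && $3 == "/" { print $7; exit }' "$1"
}

# Sample data standing in for the faulty /etc/fstab copy.
sample=$(mktemp)
printf '%s\n' \
    '#device device mount FS fsck mount mount' \
    '/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no ro' > "$sample"

opts=$(root_mount_options "$sample")
case ",$opts," in
    *,ro,*) message="root file system is mounted read-only" ;;
    *)      message="root mount options: $opts" ;;
esac
echo "$message"
rm -f "$sample"
```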

Possible Fixes
1.

To repair the preceding fault, boot the system in single-user mode
from the CD-ROM, and then check and mount the root file system by
using the following commands:
ok boot cdrom -s
# fsck /dev/dsk/c0t0d0s0
# mount /dev/dsk/c0t0d0s0 /a

2.

Use the vi editor to restore the vfstab=/etc/vfstab setting in the
/a/etc/rcS script.

Learning
Insert echo statements in the rc scripts to determine the source of the
problem. This is similar to the concept of single-stepping through the
rc script execution.


Fault #20 Problem With the CDE


The following is the description and possible fixes of the fault.

Initial Customer Description


The CDE environment is unavailable. Only a direct login attempt to the
shell is possible.

Error Messages or Symptoms


None recorded.

Probable Cause
The probable cause is that the user corrupted or accidentally deleted the
device file or installed the device file incorrectly.

Fault Insertion
Corrupt the /devices/pseudo/consms@0:mouse device file, and reboot
the system.
To modify the /devices/pseudo/consms@0:mouse file, type the
following commands:
# cd /devices/pseudo
# mv consms@0:mouse consms@0:mouse.old
# touch consms@0:mouse


Possible Fixes
The /devices/pseudo/consms@0:mouse file is the device file that you
use to activate the mouse. If you move or corrupt the file, the system fails
to start the CDE environment.
Restore the correct device file by typing the following command:
# devfsadm -C
Compare the faulty system with a functional system, especially the
directory trees in the /dev and /devices directories. A reconfiguration
reboot fixes the problem. However, students must try to determine the
location of the corrupt file.
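A file that the touch command creates is a regular file, not a device node, and the shell can tell the two apart. The following is a minimal sketch, assuming a hypothetical helper named node_type; /dev/null and a scratch file stand in for the real entries in the /devices/pseudo directory.

```shell
#!/bin/sh
# node_type is a hypothetical helper, not a Solaris command.
# It classifies a path by using the standard test operators.
node_type() {
    if [ -c "$1" ]; then
        echo "character device"
    elif [ -f "$1" ]; then
        echo "regular file"
    elif [ -e "$1" ]; then
        echo "other"
    else
        echo "missing"
    fi
}

# /dev/null stands in for a healthy device file; the scratch file
# stands in for the file created by touch in the fault insertion.
fake=$(mktemp)
good=$(node_type /dev/null)
bad=$(node_type "$fake")
echo "healthy entry: $good"
echo "faulty entry:  $bad"
rm -f "$fake"
```

This check does not catch a device file that was replaced by a link to a different device, because that entry is still a character device.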

Learning
Determine the device files required for console operation, including the
CDE and OpenWindows environments.


Fault #21 Corrupt Network File


The following is the description and possible fixes of the fault.

Initial Customer Description


The system administrator is working across different network services.
However, the system does not boot successfully after reconfiguring the
network files.

Error Messages or Symptoms


Missing or bad password entry for the root user.

Probable Cause
The probable cause is that the system administrator modified the network
files incorrectly.

Fault Insertion
Corrupt the /etc/nsswitch.conf file to address the wrong services.
Complete the following steps:
1.

Make a copy of the /etc/nsswitch.conf file before editing it by
typing the following commands:
# cd /etc
# cp nsswitch.conf /var/tmp/.nsswitch.conf

2.

Use the vi editor to open the /etc/nsswitch.conf file.


# vi /etc/nsswitch.conf

3.

Edit the /etc/nsswitch.conf file:


Before edit:
passwd: files
group: files
After edit:
passwd: Files
group: Files

4.

Restart the system.

Possible Fixes
Select the correct name services and related files, and restore the modified
files. Restart the system.

Learning
Learn the types of problems that occur if you specify the wrong services
in the nsswitch.conf file.


Fault #22 Problem in the Network


The following is the description and possible fixes of the fault.

Initial Customer Description


Users cannot communicate over the network after restarting the system.

Error Messages or Symptoms


When you try to connect to a network system, the following error
messages are displayed:
ping: ICMP Host unreachable from localhost.
telnet: unable to connect to remote host: Network is unreachable

Probable Causes
The following are the probable causes:

While testing the system, the system administrator or programmer
added a file to the /etc/rc3.d directory and forgot that the file
existed.

Modified network files

Hardware error

Fault Insertion
Complete the following steps:
1.

Add a file to the /etc/rc3.d directory by typing the following
command, and enter the ifconfig line as the contents of the file:
# vi /etc/rc3.d/K99.dtdown
ifconfig hme0 down

2.

Run the /etc/rc3 script:


# /etc/rc3
# clear


Possible Fixes
In this exercise, locate and remove the K99.dtdown file, and restart the
system.

Learning
Familiarize yourself with the ifconfig command and the rc scripts.
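A search of the run control directory finds a stray script faster than reading each file. The following is a minimal sketch that uses a scratch directory in place of the real /etc/rc3.d directory; the S99demo file is an invented example of a harmless script.

```shell
#!/bin/sh
# A scratch directory stands in for /etc/rc3.d.
demo=$(mktemp -d)
echo 'ifconfig hme0 down' > "$demo/K99.dtdown"
echo 'echo "starting demo service"' > "$demo/S99demo"

# List the files that bring a network interface down.
suspects=$(cd "$demo" && grep -l 'ifconfig.*down' *)
echo "$suspects"
rm -rf "$demo"
```

The grep pattern matches any line that contains ifconfig followed by down, so scripts that mention only one of the two words are not listed.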


Fault #23 Problem With Admintool


The following is the description and possible fixes of the fault.

Initial Customer Description


When the user opens Admintool and selects the Browse -> Software
option, an error message appears stating that an incompatible release of
the Solaris OE is being used.

Error Messages or Symptoms


You are possibly running admintool with incompatible
version of the Solaris OE. The software add and remove
capability will be disabled.

Probable Cause
The probable cause is that the system administrator inadvertently
renamed a file.

Fault Insertion
Rename the /var/sadm/system/admin/INST_RELEASE file as
/var/sadm/system/admin/inst_release.

Note The INST_RELEASE file is an ASCII text file that contains
information about the OS name, the revision, and the version number.
To rename the file, type the following commands:
# cd /var/sadm/system/admin
# mv INST_RELEASE inst_release

Possible Fixes
Rename the inst_release file to INST_RELEASE.


Learning
Use the truss command to trace applications about which you have no
prior knowledge.


Fault #24 Boot Failure


The following is the description and possible fixes of the fault.

Initial Customer Description


The system does not boot. The problem might have occurred when the
system crashed during a power failure.

Error Messages or Symptoms


Boot load failed.
The file just loaded does not appear to be executable.

Probable Causes
The probable cause is a corrupt boot block, boot file (/platform/`uname -i`/ufsboot), or kernel
(/kernel/unix).

Fault Insertion
Corrupt the boot block or boot file, or move the file to a different location.
Type the following commands to move the boot file to another location:
# mkdir /saved
# mv /platform/`uname -i`/ufsboot /saved
# reboot

Possible Fixes
Provide an alternative boot block to students for booting the system.
To repair the system, boot it in single-user mode from the CD-ROM, and then
type the following commands:
ok boot cdrom -s
# fsck /dev/dsk/c0t0d0s0
# mount /dev/dsk/c0t0d0s0 /a
# cd /a/platform/`uname -i`
For example, for a Sun4u system, type the following:
# cd /a/platform/sun4u
# cp /a/saved/ufsboot .
# reboot
Alternatively, if the boot block is corrupt, complete the following steps to
reinstall it from the CD-ROM:
1. Boot the system in single-user mode from the CD-ROM.

2. Use the uname -i command to note the platform name that is displayed:
# uname -i
SUNW,Ultra-5_10

3. Run the following command to install the boot block:
# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk \
/dev/rdsk/c0t0d0s0
where c0t0d0s0 is the root file system.

4. Restart the system.

Note: The installboot command repairs a corrupt boot block (bootblk). It does
not restore the boot file (ufsboot), which you must copy back into place.

Learning
Learn about the files related to the boot sequence and how to restore the
files using a CD-ROM.

Fault #25 Hung System

The following is the description and possible fixes of the fault.

Initial Customer Description


The system does not boot and drops to the ok prompt.

Error Messages or Symptoms


(Can't load specfs) Program terminated.

Probable Causes
The following are the probable causes:

Corrupt kernel

Incorrect input parameters during the boot sequence

Operator error

Fault Insertion
Modify the /etc/system file.
Complete the following steps to corrupt the /etc/system file:
1. Make a copy of the /etc/system file before editing it.
# cp /etc/system /var/tmp/.system

2. Edit the file.
Before edit:
*moddir:
After edit:
moddir:

3. Reboot the system.

Possible Fixes
Complete the following steps:
1. Boot the system in interactive mode by typing the following command at
the ok prompt:
ok boot -a

2. When the system prompts for the system file during the boot
sequence, type the following:
/var/tmp/.system

3. Restore the /etc/system file by typing the following command:
# cp /var/tmp/.system /etc/system
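
Before you restore the file, it can help to see which entries are actually active. The following sketch prints every line of a system file that is neither blank nor a comment, based on the convention shown above that a leading asterisk (*) comments out an entry. The function name active_settings is hypothetical.

```shell
#!/bin/sh
# Sketch: print the active (uncommented, non-blank) lines of a
# system file. Comment lines are assumed to begin with an asterisk,
# as in the *moddir: example. active_settings is a hypothetical helper.
active_settings() {
    grep -v '^[[:space:]]*\*' "$1" | grep -v '^[[:space:]]*$'
}
```

For example, comparing the output of active_settings /etc/system with that of active_settings /var/tmp/.system shows the stray moddir: entry without the surrounding comments.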

Learning
Learn about the /etc/system file. As a system administrator, you can set
kernel parameters in the /etc/system file, which the kernel reads during the
boot sequence.

Fault #26 Problem in the Network

The following is the description and possible fixes of the fault.

Initial Customer Description


Network communication was disrupted after the systems were relocated.

Error Messages or Symptoms


telnet: Unable to connect to remote host: Network is
unreachable.

Probable Causes
The following are the probable causes:

Network files not restored properly after testing

Network problem

Fault in the hardware connections to the network

Fault Insertion
Complete the following steps:
1. Use the vi editor to open the /etc/hostname.hme0 or .lme0 network file.
# vi /etc/hostname.hme0

2. Modify the host name in the /etc/hostname.hme0 or .lme0 network file.
Before edit:
hammer
After edit:
hammers

3. Reboot the system.

Possible Fixes
Complete the following steps:
1. Restore the correct host name in the /etc/hostname.hme0 file.

2. Restart the system.
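
The following sketch checks whether the name stored in an interface file also appears in a hosts file, which is the mismatch this fault introduces. It assumes that the interface file holds a host name rather than an IP address, and it uses awk -v, which requires a POSIX awk (nawk on Solaris). The function name host_in_hosts is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if the first word of HOSTNAME_FILE appears as a
# host name or alias (any field after the address) in HOSTS_FILE.
# host_in_hosts is a hypothetical helper, not a Solaris command.
host_in_hosts() {
    name=`awk 'NF { print $1; exit }' "$1"`
    [ -n "$name" ] || return 1
    awk -v n="$name" '
        { sub(/#.*/, "") }                     # drop trailing comments
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$2"
}
```

For example, host_in_hosts /etc/hostname.hme0 /etc/hosts || echo "name not found in /etc/hosts" flags the modified entry.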

Learning
Learn about the files that you must check for configuration errors when
network operations are faulty.

Fault #27 Script Hangs the System

The following is the description and possible fixes of the fault.

Initial Customer Description


The following problems are encountered:

Keyboard input is not accepted.

The arrow on the screen does not move.

LEDs (if available) are in motion.

The rlogin command from other systems on the network fails or times out.

The ping command works intermittently.

No error messages are displayed.

Error Symptoms/Conditions/Messages
None recorded.

Fault Insertion
In this exercise, the students insert the fault.

Probable Cause
The following are the probable causes:

Resource shortage

High-priority process controlling the system

Diagnostic Steps
Use the following procedure to determine why your system hangs.
1. Use the vi editor to create the following script in the /usr/bin
directory. Name the file start.
#!/bin/csh -f
clear
rm -f /tmp/guilty_party
cat > /tmp/guilty_party << Done
#!/bin/csh -f
while (1)
end
Done
chmod 777 /tmp/guilty_party
/usr/bin/priocntl -e -c RT /tmp/guilty_party &

2. After you exit the vi editor, use the following commands:
# chmod 775 /usr/bin/start
# /usr/bin/start

3. Attempt to use the ping command from a remote system.

4. Attempt to use the rlogin command from another system.

5. As soon as the system hangs or radically slows down, press Stop-A
(L1-A) to halt the system.
The system takes time to process the keyboard interrupt because it is
busy with a higher-priority process. Keep pressing the key sequence
until the system brings you to the ok prompt.

6. Type the sync command at the ok prompt to force a crash dump.

7. After the system reboots, log in, and type the following command in
a shell window:
# cd <default dump directory>

8. Launch the mdb utility to examine the crash dump:
# mdb unix.n vmcore.n
where n is a value, such as 0, 1, 2, or 3.

Note: The variable n increments each time a system saves a crash dump.

9. Use the ::ps dcmd to examine the processes:
> ::ps
Are there any processes with abnormally large amounts of CPU time
as compared to the other processes? If so, note this process.

Expected Fix
A workaround is to not run the guilty_party program until CPU resources
are available. Determine whether this process usually consumes this much
CPU time. Alternatively, run the guilty_party program as a timesharing
process to see whether the problem still occurs or whether the problem is a
function of real-time scheduling.

Verification of the Fix


Rerun the start command to verify whether this process is the cause of
the problem.

Note: Another way to debug a system hang is to collect several core
dumps and compare the processes in execution for similarities.

Learning
The students learn how to determine the cause of a hung system.

Note: The system crash dump that the preceding steps generate varies
from system to system.

Fault #28 Inappropriate Halts

The following is the description and possible fixes of the fault.

Initial Customer Description


The system powers off, halts, or reboots at inappropriate times.

Error Messages or Symptoms


.....<output truncated>
syncing file system....done

Probable Cause
The following are the probable causes:

A Trojan Horse

A cron job

A faulty rc script

A faulty at script

Fault Insertion
Start an at process that calls the init 5, halt, or reboot command. An
email message is sent with an indication of the problem.
1. Use the following commands to insert the fault:
# cd /bin
# vi tst
#!/bin/csh
at -c -m now + 1minute < /bin/tst
mail root < /bin/tst
sync
init 5
sleep 7
halt
(exit vi session)

2. After you exit the vi editor, use the following commands:
# chmod +x tst
# /bin/tst

Possible Fix
The at -l command lists the pending at jobs. After you locate the job, read
the script that it runs, and then remove the script. If the at -l command
still lists a pending job, remove the job with the at -r command. Also
examine the rc scripts.
Use the following commands to repair the fault:
ok boot -s (remain in single-user mode)
# cd /bin
Remove the tst file.
# rm /bin/tst
# reboot
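
When the offending script is not known in advance, a text search can narrow down the candidates. The following sketch lists files that contain a call to halt, reboot, poweroff, or init 0, 5, or 6. The function name find_shutdown_calls is hypothetical, and the sketch assumes a grep that accepts the -E option (on Solaris, /usr/xpg4/bin/grep rather than /usr/bin/grep).

```shell
#!/bin/sh
# Sketch: print the names of files under the given directories that
# contain a call to halt, reboot, poweroff, or init 0/5/6.
# find_shutdown_calls is a hypothetical helper. Matches inside
# comments are also reported, so review each hit by hand.
find_shutdown_calls() {
    find "$@" -type f -exec grep -l -E \
        '(^|[[:space:]/;])(halt|reboot|poweroff|init[[:space:]]+[056])([[:space:];]|$)' \
        {} + 2>/dev/null
}
```

Likely directories to search include /bin, the /etc/rc?.d directories, and the cron and at spool directories (assumed here to be /var/spool/cron/crontabs and /var/spool/cron/atjobs).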

Learning
Learn to trace rc scripts, cron jobs, and at jobs (at -l). In
addition, read the email messages sent to the root user. (This is an often
overlooked source of information because each at job in this exercise sends
an email message to the root user.)

Fault #29 SunSolve Workshop

The following is the description and possible fixes of the fault.

Initial Customer Description


You receive calls from two customers regarding problems for which
patches might be available. You first search the SunSolve Online service to
locate the patch identification numbers and related bug reports for the
calls:

Customer call #1
The customer complains that the Time-of-Day-Clock checksum value
is destroyed during the process of power cycling the system. The
message Fatal Error Reset and, sometimes, the wrong year is
displayed. The customer has an Ultra 3000 workstation running the
Solaris 7 OE.

Customer call #2
The customer notices problems with the at and cron utilities on the
Solaris 2.6 OE. Audit records are not properly generated, and the
date 2/29/2000, in particular, causes errors with the at utility.

Error Symptoms or Messages


This workshop is slightly different from the others in Appendix D because
the solution is to locate relevant information in the SunSolve Online
service rather than to repair a faulty system in the classroom.

Probable Cause
The probable cause is the need for a flash PROM update (patch numbers
103346-08 and 103346-02) or for patch numbers 105393-07 and 105621-04 or
above.

Possible Fix
Locate the patch and bug report information, and recommend that the
customer install the relevant patches.

Learning
Use available resources to efficiently diagnose problems for which
solutions already exist.

Fault #30 Corrupt File System

The following is the description and possible fixes of the fault.

Initial Customer Description


The boot sequence is incomplete due to an apparent file system
corruption after a system crash.

Error Messages or Symptoms


/dev/rdsk/c0t0d0s7: CANNOT read: BLK 5097440
The following file system(s) had an unexpected
inconsistency: /dev/rdsk/c0t0d0s7 (/export/home).

Probable Causes
The probable cause is a corrupt file system.

Fault Insertion
Complete the following steps to insert the fault:
1. Select a partition to corrupt. Use either the /home or /export/home
directory.

2. Corrupt the superblock and the normal backup block (32).

3. Halt the system, and then restart it.

Complete the following steps to corrupt the /export/home partition:

1. Use the df -a command to determine which partition to use and
which to unmount.
Ensure that slice 0 is larger than slice 7.

2. Corrupt the superblock by using the dd command as follows:
# umount /export/home
# dd if=/dev/rdsk/c0t0d0s0 of=/dev/rdsk/c0t0d0s7 count=35
where c0t0d0s0 is the partition for the root file system and
c0t0d0s7 is the partition for the /export/home directory.

3. Restart the system.

The device-to-device copy corrupts the superblock and the first backup
superblock. You must use the newfs -N command to locate alternative
backup superblocks.

Note: Record the backup superblock locations before you use the dd command:
# newfs -Nv /dev/rdsk/c0t0d0s7

Possible Fixes
1. Type the fsck command to repair the /export/home partition:
# fsck /dev/dsk/c0t0d0s7
where c0t0d0s7 refers to the /export/home file system.

2. If the fsck command cannot use the primary superblock, boot the system
from the CD-ROM, and restore the superblock from an alternative backup
superblock.

Learning
Locate and use an alternative superblock by using the newfs -N command
with the fsck command.
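
The following sketch extracts the backup superblock numbers from newfs -N output so that they can be tried one at a time. It assumes that the output contains a line with the text "super-block backups" followed by lines of comma-separated block numbers. The function name backup_superblocks is hypothetical.

```shell
#!/bin/sh
# Sketch: read newfs -N output on standard input and print each
# backup superblock number on its own line. Assumes the numbers
# follow a line containing "super-block backups".
# backup_superblocks is a hypothetical helper.
backup_superblocks() {
    awk '
        seen { gsub(/,/, " ")
               for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) print $i }
        /super-block backups/ { seen = 1 }
    '
}
```

For example, newfs -N /dev/rdsk/c0t0d0s7 | backup_superblocks lists the candidates. Because the fault also overwrites block 32, pick a later block and pass it to fsck, as in fsck -F ufs -o b=<block> /dev/rdsk/c0t0d0s7, where <block> is one of the listed numbers.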

Fault #31 Insufficient File Permission

The following is the description and possible fixes of the fault.

Initial Customer Description


The user cannot write to the /home directory. The problem appeared while
creating the test directory in the /home directory.

Error Messages or Symptoms


mkdir: Failed to make directory "test"; Operation not
applicable
touch: test cannot create
vi Operation not applicable.

Probable Causes
The following are the probable causes:

Modified permissions (ls -l /home)

Incorrect mount tables

Fault Insertion
Add an entry for the /home directory as an /auto_home mount in the
/etc/auto_master file. To edit the /etc/auto_master file, complete the
following steps:
1. Open the /etc/auto_master file, and add an entry for the /home
directory:
# vi /etc/auto_master
/home auto_home

2. Restart the daemon by typing the following command:
# pkill -HUP automount

Possible Fixes
Remove the entry for the /home directory from the /etc/auto_master
file, and restart the daemon.
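
When a directory rejects mkdir with "Operation not applicable," a quick check is whether the automounter controls it. The following sketch prints the uncommented master-map line, if any, for a given mount point. The function name automount_entry is hypothetical, and awk -v requires a POSIX awk (nawk on Solaris).

```shell
#!/bin/sh
# Sketch: print the uncommented line of a master map whose first
# field is the given mount point. Commented entries (for example,
# "#/home ...") do not match because their first field differs.
# automount_entry is a hypothetical helper.
automount_entry() {
    awk -v mp="$1" '$1 == mp { print }' "$2"
}
```

For example, automount_entry /home /etc/auto_master prints the /home entry on the faulted system and prints nothing after the entry is removed.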

Learning
Learn the types of problems that are generated by the automount entries.

Fault #32 Problem in the Network

The following is the description and possible fixes of the fault.

Initial Customer Description


The user cannot use the telnet command to communicate with the
systems in other networks. However, the local network systems are
accessible.

Error Messages or Symptoms


Network unreachable

Probable Causes
The following are the probable causes:

Corrupt or nonexistent /etc/defaultrouter file

Default route not added

Fault Insertion
Use the route -f command to flush the routing table. Try connecting to
other systems in some other network.

Possible Fixes
Check for the default route entry by using the netstat -r command.
If the entry is not present, add the route by using the
route add default <ipaddress> command.
Create a file, /etc/defaultrouter, if it does not exist, and add the IP
address of the default router.
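
The check for a default route can also be scripted. The following sketch reads routing table output on standard input and succeeds only if a line whose first field is default is present. It assumes the netstat -rn output format, in which the default route appears with default in the destination column. The function name has_default_route is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if routing table output on standard input
# contains a line whose destination field is "default".
# has_default_route is a hypothetical helper.
has_default_route() {
    awk '$1 == "default" { found = 1 } END { exit(found ? 0 : 1) }'
}
```

For example, netstat -rn | has_default_route || echo "no default route" reports the fault after the routing table is flushed.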

Learning
Learn about the files that you must check for configuration errors when
network operations are faulty.

Fault #33 Login Problem

The following is the description and possible fixes of the fault.

Initial Customer Description


The system administrator created a user account. However, the user
cannot log in successfully. The system accepts the login ID and password
and a login seems to begin, but then the system logs out.

Error Messages or Symptoms


The user logs in and is automatically logged out.

Probable Causes
The following are the probable causes:

Operator error

Cracker on the system

File corruption

Fault Insertion
Ensure that the /.dtprofile file of the user exists within the root
directory. Modify this file to insert the fault.
If any student is working in the OpenWindows environment rather than
in CDE, make a similar modification in the /.profile or /.login file,
depending on the shell.
To edit the /.dtprofile file of the user, complete the following steps:
1. Use the vi editor to open the /.dtprofile file for editing.
# vi /.dtprofile

2. Edit the /.dtprofile file to append the exit command to the file.

3. Log out, and attempt to log in as the new user.

Possible Fixes
Log in as the root user, and edit the /.dtprofile file of the user to
remove the exit command. Alternatively, you can use the command-line
login to edit the /.dtprofile file.
If the fault occurs for the root user, you must boot the system from the
CD-ROM and edit the file.
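
A search of the login files can locate the stray command quickly. The following sketch prints the file name, line number, and text of each line whose first word is exit or logout. The function name exit_lines is hypothetical, and the sketch assumes a grep that accepts the -E option (on Solaris, /usr/xpg4/bin/grep).

```shell
#!/bin/sh
# Sketch: print file:line:text for each line in the given files
# whose first word is exit or logout. /dev/null is added to the
# file list so that grep prints the file name even for one file.
# exit_lines is a hypothetical helper.
exit_lines() {
    grep -n -E '^[[:space:]]*(exit|logout)([[:space:];]|$)' /dev/null "$@"
}
```

For example, run exit_lines against the .dtprofile, .profile, and .login files of the affected user.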

Learning
Learn about the files that affect the login sequence, and reinforce the
procedure for examining and fixing problems.

Fault #34 Analyze System Crash Dumps

The following is the description and possible fixes of the fault.

Initial Customer Description


One of the Sun systems that is administered by the customer panics and a
system crash dump is generated.

Error Symptoms or Messages


A system crash dump is generated.

Fault Insertion
The students learn to analyze a system crash dump generated by
Fault #27. This workshop is slightly different from the others in Appendix
D because here you use the mdb utility to examine the offending address
and thread that caused the system to panic.
Use the mdb utility to achieve the following:

Identify the address of the instruction that caused the panic

Identify the address of the thread that was running at the time of the
panic

Identify the name and arguments of the process that was running at
the time of the panic

Diagnostic Steps
Use the following procedure to determine the reason for your hung
system.
1. Launch the mdb utility to examine the crash dump:
# mdb unix.n vmcore.n
where n is a value, such as 0, 1, 2, or 3.

2. Type the $c command to display the stack trace, which enables you
to determine the routines that led to the panic and the source of the
panic.

For example:
> $c
0xf0050c7c(f0066d2c, 2a10007d6e8, f0066d2c, 6, 0, 3000004f270)
prom_enter_mon+0x2c(0, 6, b, 6, 0, 0)
debug_enter+0x158(0, 248fbd5f3c, 248fbd5f40, 0, 0, 0)
kbdinput+0x304(300001d1d68, 4d, 1, 500d4, 4c384, 0)
kbdrput+0x14c(300004f8a28, 30000b976c0, 20, 0, ffbff508, ff33a000)
putnext+0x1d0(300004f8cb0, 30000b976c0, 20, 6, 0, ff)
async_softint+0x56c(2008e, 3000069c08e, 30000b976c0, 3000067c020, 0, 0)
asysoftintr+0x68(300001d1e08, 80d, 1400000, 2a10007dd40, 101a0, 12924a0)
intr_thread+0x12c(17, 475f4, 47674, 4, 50034, 0)
>
3. Type the $r command to display the registers at the time of the
panic.
For example:

> $r
%g0 = 0x0000000000000000                     %l0 = 0x0000000001400000 cpu0
%g1 = 0x000000000103931c prom_enter_mon+0x2c %l1 = 0x000000000142a2c8 cpu
%g2 = 0x0000000000000000                     %l2 = 0x000000000140c000
%g3 = 0x0000000000000001                     %l3 = 0x0000000000000001
%g4 = 0x00000000014ade58 keyindex_s4         %l4 = 0x0000000000000016
%g5 = 0x0000000000007000                     %l5 = 0x000000000000001e
%g6 = 0x0000000000000000                     %l6 = 0x0000000000000016
%g7 = 0x000002a10007dd40                     %l7 = 0x0000000000000000
%o0 = 0x0000000001000000 scb                 %i0 = 0x00000000f0066d2c
%o1 = 0x0000000000000016                     %i1 = 0x000002a10007d6e8
%o2 = 0x00000000f0000000                     %i2 = 0x00000000f0066d2c
%o3 = 0x0000000000000000                     %i3 = 0x0000000000000006
%o4 = 0x000000000142ac00 cp_list_head        %i4 = 0x0000000000000000
%o5 = 0x0000000001437800 p0+0x8c8            %i5 = 0x000003000004f270
%o6 = 0x000002a10007cd81                     %i6 = 0x000002a10007ce31
%o7 = 0x00000000010077cc client_handler+0x2c %i7 = 0x000000000103931c prom_enter_mon+0x2c

%ccr = 0x88 xcc=Nzvc icc=Nzvc
%fprs = 0x00 fef=0 du=0 dl=0
%asi = 0x00
%y = 0x0000000000000000
%pc = 0x00000000f0050c7c
%npc = 0x00000000f0050c80
%sp = 0x000002a10007cd81 unbiased=0x000002a10007d580
%fp = 0x000002a10007ce31
%tick = 0x0000000000000000
%tba = 0x0000000000000000
%tt = 0x17f
%tl = 0x0
%pil = 0xf
%pstate = 0x014 cle=0 tle=0 mm=TSO red=0 pef=1 am=0 priv=1 ie=0 ag=0
%cwp = 0x02 %cansave = 0x00
%canrestore = 0x00 %otherwin = 0x00
%wstate = 0x00 %cleanwin = 0x00
4. From the displayed registers, use the %pc (program counter)
value to display the instruction that caused the system to fail.

5. Issue the ::status dcmd to display a part of the message
that was displayed during the panic.
For example:

> ::status
debugging crash dump vmcore.3 (64-bit) from mako
operating system: 5.9 Generic (sun4u)
panic message: sync initiated
dump content: kernel pages only
>
6. Use the ::ps -lt dcmd to examine the processes and their threads.
For example:

> ::ps -lt
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R      0      0      0      0      0 0x00000019 0000000001436f38 sched
        T     t0 <TS_STOPPED>
        L     lwp0 ID: 1
R      3      0      0      0      0 0x00020019 0000030000576008 fsflush
        T     0x300005737c0 <TS_RUN>
        L     0x300005714a8 ID: 1
R      2      0      0      0      0 0x00020019 0000030000576a20 pageout
        T     0x30000573a60 <TS_SLEEP>
        L     0x30000571818 ID: 1
R      1      0      0      0      0 0x00004008 0000030000577438 init
        T     0x30000573d00 <TS_SLEEP>
        L     0x30000571b88 ID: 1
R    439      1    412    412      0 0x00014008 0000030000ce4a98 guilty_party
        T     0x30000cead20 <TS_ONPROC>
        L     0x30000dc1190 ID: 1
R    426      1    426    426      0 0x10010008 0000030000aa9468 sendmail
        T     0x30000aacfc0 <TS_RUN>
        L     0x30000a8f4e0 ID: 1
R    424      1    424    424     25 0x10010008 0000030000ce4080 sendmail
        T     0x30000dfd7c0 <TS_SLEEP>
        L     0x30000dfb508 ID: 1
..........<Output truncated>

To determine the thread that was running at the time of the panic, look
for the thread whose state is TS_ONPROC. In the preceding output, the
guilty_party program is identified as the thread running during the panic.

7. Determine the exact arguments with which the guilty_party
program was called. Use the thread address from the output of the
::ps -lt command to display the thread structure.
For example:

> 0x30000cead20$<thread
0x30000cead20:  link            stk             startpc
                0               2a10055daf0     0
0x30000cead38:  bound_cpu       affinitycnt     bind_cpu
                0               0               -1
0x30000cead44:  flag            proc_flag       schedflag
                1000            4               3
0x30000cead4a:  preempt         preempt_lk      state
                0               0               4
0x30000cead50:  pri             epri
                100             0
0x30000cead58:  pc              sp
                1007254         2a10055d2f1
0x30000cead68:  wchan0          wchan           sobj_ops
                0               0               0
0x30000cead80:  cid             clfuncs         cldata
                4               1480d08         30000e3e640
0x30000cead98:  ctx             lofault         onfault
                0               0               0
0x30000ceadb0:  ontrap          swap            lock
                0               2a10055a000     ff
0x30000ceadc2:  pil             pi_lock         cpu
                0               0               1400000
0x30000ceadd0:  lpl             intr            did
                142cff0         0               1258
0x30000ceadf0:  tnf_tpdp        tid             waitfor
                30000d08050     1               -1
0x30000ceae00:  sigqueue        sig             hold
                0               0               2000000000000
0x30000ceae18:  forw            back            thlink
                30000cead20     30000cead20     0
0x30000ceae30:  lwp             procp           audit_data
                30000dc1190     30000ce4a98     0
0x30000ceae48:  next            prev            trace
                30000dfcd40     30000ceb260     0
.....<Output truncated>

8. Use the address under the procp field with the proc2u macro to
view the command name and arguments that caused the panic.

Note: The address under the procp field is the address of the proc
structure.
For example:
> 30000ce4a98$<proc2u
                auxv
                30000ce4dd0
0x30000ce4f00:  start.tv_sec    start.tv_nsec
                3cda4a8d        28d191b4
0x30000ce4dc8:  execsw          ticks
                140e228         86b1
0x30000ce4f29:  psargs  /bin/csh -f
/tmp/guilty_party\0\0\0\0\0\0\0\0\0\0\0\0\0\
0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
0\0\0\0\0\0
0x30000ce4f18:  comm    guilty_party\0\0\0\0\0
0x30000ce4f7c:  argc            argv            envp
                3               ffbff6e4        ffbff6f4
0x30000ce4f90:  cdir            rdir            mem
                30000a5eed8     0               23ef7
0x30000ce4fa8:  cmask           acflag          systrap
                022             02              0
                entrymask
                30000ce4fb0
                exitmask
                30000ce4fd4
0x30000ce4ff8:  signodefer      sigonstack      sigresethand
                8000000000000001 0              8000000000000001
0x30000ce5010:  sigrestart
                2000000000000
..........<Output truncated>
You can cross-check the preceding information by displaying the message
buffer that was saved at the time of the panic.
For example:
> $<msgbuf
SunOS Release 5.9 Version Generic 64-bit
0x3000006d8a3: Copyright 1983-2002 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
0x3000006d4e3: Ethernet address = 8:0:20:f8:f:10
0x3000006d120: mem = 262144K (0x10000000)
0x3000006cd60: avail mem = 250093568
0x3000006c9a3: root nexus = Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi
440MHz)
0x3000006c5e3: pcipsy0 at root: UPA 0x1f 0x0
0x3000006c223: pcipsy0 is /pci@1f,0
0x30000401de2: PCI-device: pci@1,1, simba0
0x30000401a23: simba0 is /pci@1f,0/pci@1,1
0x30000401662: PCI-device: pci@1, simba1
0x300004012a3: simba1 is /pci@1f,0/pci@1
0x30000400ee1: PCI-device: ide@3, uata0
0x30000400b23: uata0 is /pci@1f,0/pci@1,1/ide@3
0x30000400760: dad0 at pci1095,6460
0x300004003a0: target 0 lun 0
0x300003f5ea3: dad0 is /pci@1f,0/pci@1,1/ide@3/dad@0,0
0x300003f5ae0: <ST39111A cyl 17660 alt 2 hd 16 sec 63>
0x300003f5727: root on /pci@1f,0/pci@1,1/ide@3/disk@0,0:a fstype ufs
0x300003f5361: PCI-device: ebus@1, ebus0
0x300003f4fa3: ebus0 is /pci@1f,0/pci@1,1/ebus@1
0x300003f4be0: power0 at ebus0: offset 14,724000
0x300003f4823: power0 is /pci@1f,0/pci@1,1/ebus@1/power@14,724000
0x300003f4460: su0 at ebus0: offset 14,3083f8
0x300003f40a3: su0 is /pci@1f,0/pci@1,1/ebus@1/su@14,3083f8
0x30000581c60: su1 at ebus0: offset 14,3062f8
0x300005818a3: su1 is /pci@1f,0/pci@1,1/ebus@1/su@14,3062f8
0x300005814e1: PCI-device: SUNW,m64B@2, m640
0x30000581123: m640 is /pci@1f,0/pci@1,1/SUNW,m64B@2
0x30000580d5f: m64#0: 1152x900, 4M mappable, rev 4750.7c
0x300005809a0: cpu0: SUNW,UltraSPARC-IIi (upaid 0 impl 0x12 ver 0x91
clock 440 MHz)
0x300005805e0: se0 at ebus0: offset 14,400000
0x30000580223: se0 is /pci@1f,0/pci@1,1/ebus@1/se@14,400000
0x300007dfddf: SUNW,hme0 : PCI IO 2.0 (Rev Id = c1) Found
0x300007dfa21: PCI-device: network@1,1, hme0
0x300007df663: hme0 is /pci@1f,0/pci@1,1/network@1,1
0x300007df023: dump on /dev/dsk/c0t0d0s1 size 512 MB
0x300007dec5f: SUNW,hme0 : Internal Transceiver Selected.
0x300007de89f: SUNW,hme0 : 10 Mbps Half-Duplex Link Up
0x300007de4e2: pseudo-device: devinfo0
0x300007de263: devinfo0 is /pseudo/devinfo@0
0x300007de622: pseudo-device: tod0
0x300007deda3: tod0 is /pseudo/tod@0
0x300007de3a2: pseudo-device: pm0
0x300007df7a3: pm0 is /pseudo/pm@0

0x300007de9e2: pseudo-device: vol0
0x300007dff23: vol0 is /pseudo/vol@0
0x300007df160: sd0 at uata0: target 2 lun 0
0x30000580723: sd0 is /pci@1f,0/pci@1,1/ide@3/sd@2,0
0x300007dfb60: fd0 at ebus0: offset 14,3023f0
0x30000580ea3: fd0 is /pci@1f,0/pci@1,1/ebus@1/fdthree@14,3023f0
0x30000580360: se0 at ebus0: offset 14,400000
0x30000581623: se0 is /pci@1f,0/pci@1,1/ebus@1/se@14,400000
0x30000580ae2: pseudo-device: pm0
0x30000581da3: pm0 is /pseudo/pm@0
0x300003f45a0:
0x300003f41e0: panic[cpu0]/thread=2a10007dd40:
0x300003f4d20: sync initiated

0x300003f54a0: sched:
0x300003f5c20: software trap 0x7f
0x300004004e0: pid=0, pc=0xf0050c7c, sp=0x2a10007cd81,
tstate=0x8800001402, context=0x8c0
0x30000400c60: g1-g7: 103931c, 0, 1, 14ade58, 7000, 0, 2a10007dd40
0x300004013e0:
0x30000401b63: 00000000fffa9d00 unix:sync_handler+12c (fff9b840,
1000000, 1412d55, fffe0000, f003bda6, 1437800)
0x3000006c363:
%l0-3: 0000000000000001 000000000103931c
00000000f0000000 00000000fffe0000
%l4-7: 00000000f0050c28 00000000f006729c 00000000fffefd28
00000000fffeef98
0x3000006cae3: 00000000fffa9de0 unix:vx_handler+8c (fff9b840,
2a10007d6e8, f0066d2c, 6, 0, 3000004f270)
0x3000006d263:
%l0-3: 000000000102768c 0000000000000080
00000000014173b8 00000000f0000000
%l4-7: 0000000000000016 000000000000001e 0000000000000016
0000000000000000
0x3000006d9e3: 00000000fffa9e90 unix:callback_handler+20 (fff9b840,
fffde280, 0, 0, 0, 0)
0x300007de123:
%l0-3: 0000000000000016 00000000fffa9741
000000000004a238 00000000ffbff187
%l4-7: 0000000000000000 0000000000000000 0000000000000000
0000000000000000
0x30000e97d60:
0x30000e97ae3: syncing file systems...
0x30000e97863:
done
0x30000e975e3: dumping to /dev/dsk/c0t0d0s1, offset 107479040, content:
kernel
0x30000e97360: WARNING: timeout: reset bus chno = 0 targ = 0

Learning
The students learn to analyze a system dump. This familiarizes them with
the kernel structures that they must examine when analyzing a system
crash or a hung system.

Fault #35 Problem in the Network

The following is the description and possible fixes of the fault.

Initial Customer Description


The user cannot use the ftp command to communicate with system A
from other systems, and the login fails.

Error Messages or Symptoms


Login failed.

Probable Causes
The following are the probable causes:

Blocked ftp service

User blocked from using the ftp service

Fault Insertion
Enter the user name in the /etc/ftpd/ftpusers file.

Note The ftpusers file contains the names of users who are not
authorized to use the ftp service.

Possible Fixes
The following are the possible fixes:
1. Check for the correct entries in the /etc/ftpd/ftpusers file.

2. Remove the user name if it is added in the /etc/ftpd/ftpusers file.
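
The check can be scripted for a single user. The following sketch succeeds if a user name is listed in an ftpusers file, ignoring any text after a # character. The function name ftp_denied is hypothetical, the default path is the one used in this exercise, and awk -v requires a POSIX awk (nawk on Solaris).

```shell
#!/bin/sh
# Sketch: succeed if USER is listed in the ftpusers file, which
# names the users who are not authorized to use the ftp service.
# Usage: ftp_denied USER [FILE]
# ftp_denied is a hypothetical helper.
ftp_denied() {
    file=${2:-/etc/ftpd/ftpusers}
    awk -v u="$1" '
        { sub(/#.*/, "") }                     # drop comments
        $1 == u { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$file"
}
```

For example, ftp_denied user1 && echo "user1 is blocked from ftp" reports whether the account used in the exercise has been added to the file.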

Learning
Learn about the files that you must check for configuration errors when
network operations are faulty.

Fault #36 Faulty CD-ROM

The following is the description and possible fixes of the fault.

Initial Customer Description


The user cannot download any files or directories from the CD-ROM
drive of the server.

Error Messages or Symptoms


The ls command does not display any files or directories, or the
following error message is displayed:
:no such directory

Probable Causes
The following are the probable causes:

Bad server CD-ROM

Bad server sharetab

Bad procedure

Fault Insertion
Complete the following steps to share the CD-ROM in the wrong way:
1. Insert a CD-ROM in the server.

2. Issue the following command on the server:
# share -F nfs -o ro /cdrom

3. Issue the following command on the client:
# mount <server_name>:/cdrom /mnt

4. Use the following commands on the client:
# cd /mnt
# ls
cdrom0 <sunsolve_version>
# cd cdrom0
# ls

Possible Fix
Complete the following steps to share the CD-ROM properly:
1. On the server, eject the CD-ROM.
2. Add the following entry to the /etc/rmmount.conf file:
share cdrom*
3. Ensure that NFS sharing is in place:
# sh /etc/rc3.d/S15nfs.server start
4. Insert the CD-ROM.

Learning
Learn to share CD-ROM files correctly.

Fault #37 Turn the Page


The following is the description and possible fixes of the fault.

Initial Customer Description


The pg and passwd commands do not work. Only users with no
password can log in.

Error Symptoms/Conditions/Messages
The pg command hangs.

Probable Cause
The probable cause is either a bad command or a bad device.

Fault Insertion
Change the major and minor numbers for the tty drivers.
Complete the following steps:
1. Create a user account with no password and root privileges. This
facilitates the analysis of the fault. If students log out, they cannot
log back in as the root user. Ensure that the root user and at least
one other user account require passwords.
a. Edit the /etc/passwd file.
su:x:0:1::/usr/su:/sbin/sh
guest1:x:12:10::/export/home/guest1:/bin/csh
b. Edit the /etc/shadow file.
root:YX4pytcVVZF2k:9555::::::
su::9555::::::
guest1:oWU/elsH4pe6E:::::::
2. In a terminal window, enter the following:
# cd /devices/pseudo
# mv sy@0:tty sy@0:tty.org
# mv ptsl@0:ttyrf sy@0:tty
The preceding modification changes the major and minor numbers behind
the sy@0:tty name, so that the wrong driver is called.

3. Make sure the pg command hangs.
# pg /etc/path_to_inst

Note: Do not reboot now. The students eventually reboot and then notice
that they cannot log in. You can then tell them about the su user and let
them figure out that the su user does not require a password.

Possible Fix
Fix the tty entry in the /devices/pseudo directory. Use the truss command
for this analysis.
A reconfiguration boot fixes the problem. However, do not accept this as
a solution unless students locate the fault and the associated file
specifically.
Complete the following steps to fix the fault:
1. In a terminal window, type the following:
# cd /devices/pseudo
# devfsadm -C
# mv sy@0:tty ptsl@0:ttyrf
# mv sy@0:tty.org sy@0:tty
2. Make sure that the pg command pages.
# pg /etc/group

Learning
What appears to be a minor problem can actually be something quite
disastrous. The truss command is useful on both the passwd and pg
commands.

Fault #38 Login Problem


The following is the description and possible fixes of the fault.

Initial Customer Description


The root user cannot log in successfully. The login prompt and password,
if required, are accepted. It appears a login is starting, but then the system
logs out.

Error Symptoms/Conditions/Messages
None recorded.

Probable Cause
The following are the probable causes:

Hasty operator error

Cracker on the system

File corruption

Fault Insertion
Complete the following steps to insert the fault:
1. Ensure that the .dtlogin file exists within the root directory.
2. Edit the /.dtlogin file using the vi editor by adding a line at the
end of the file that invokes the exit command. There are many
comments in the file. Scroll to the bottom of the file, and type the
following on one line:
exit
3. Quit the text editor, and log out.

If any student is working in the OpenWindows environment rather than
in CDE, the problem requires a similar modification in the /.profile or
/.login file, depending on the shell.

Possible Fix
Press Control-C to stop the login script from executing. Then, boot from
the CD-ROM to edit the correct files.
Because you cannot log in as the root user, you must boot the faulty
system from the CD-ROM or from an available server. Mount the root
partition in this environment, and edit the .dtlogin file in the root
directory by removing the line that invokes the exit command.
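
Locating the offending line can be sketched as a small shell function. The sketch assumes a POSIX awk and treats any uncommented line whose first word is exit or logout as suspicious. The function name find_exit_lines is illustrative.

```shell
#!/bin/sh
# find_exit_lines FILE...
# Prints "file:line:text" for every uncommented line whose first word is
# exit or logout. Prints nothing when no such line is found.
find_exit_lines() {
    for f in "$@"; do
        [ -f "$f" ] || continue               # skip files that do not exist
        awk -v name="$f" '
            /^[ \t]*#/ { next }               # ignore comment lines
            $1 == "exit" || $1 == "logout" { print name ":" NR ":" $0 }
        ' "$f"
    done
}
```

If the root partition of the faulty system is mounted on /a (an assumed mount point), a call such as `find_exit_lines /a/.dtlogin /a/.profile /a/.login` covers both the CDE and the OpenWindows variants of this fault.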

Learning
Students become acquainted with files that affect the login sequence, and
reinforce the procedure for examining and fixing problems by booting the
CD-ROM environment.

Fault #39 Do Not Point at Me


The following is the description and possible fixes of the fault.

Initial Customer Description


The user cannot communicate with other systems on the network from
system C.

Error Symptoms/Conditions/Messages
Systems cannot talk to system C by the system name. System C cannot
talk to other systems on the same subnet.

Probable Cause
The probable cause is an oversight by the system administrator.

Note: Isolate the fault by using the fault analysis techniques.

Fault Insertion
This is fundamentally the same fault as Fault #18.
Edit the /etc/hostname.hme0 file to refer to an incorrect host.
Complete the following steps on system C:
1. Make a copy of the /etc/hosts and /etc/hostname.hme0 files:
# cp /etc/hosts /etc.host.orig
# cp /etc/hostname.hme0 /etc/.hostname.hme0
2. Edit the /etc/hosts and /etc/hostname.hme0 files:
# cd /etc
# vi hostname.hme0
Remove nodename_for_machineC
# vi hosts

Before edit:
127.0.0.1 localhost
129.150.28.39 forward loghost
129.150.182.68 scha
After edit:
127.0.0.1 localhost
129.150.28.39 forward loghost
129.150.183.68 scha
3. Exit the vi editor, and complete the following:
# touch -am 01121234 *
# reboot
4. Log in, and clear the screen.

Possible Fix
Fix the /etc/hosts or /etc/hostname.hme0 file.
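
One way to cross-check the two files is to resolve the name stored in the hostname file against the hosts file. The following is a minimal sketch, assuming the usual "address name alias ..." layout of the hosts file and a POSIX awk. The function names host_address and interface_address are illustrative.

```shell
#!/bin/sh
# host_address NAME [HOSTS_FILE]
# Prints the first address that HOSTS_FILE maps to NAME, whether NAME is
# the canonical name or an alias. Prints nothing if NAME has no entry.
host_address() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                    # strip comments
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# interface_address HOSTNAME_FILE [HOSTS_FILE]
# Resolves the name stored on the first line of a hostname.<interface> file.
interface_address() {
    name=$(sed -n '1p' "$1" | awk '{ print $1 }')
    [ -n "$name" ] && host_address "$name" "${2:-/etc/hosts}"
}
```

For example, `interface_address /etc/hostname.hme0` prints nothing when the node name was removed from the hostname file, and prints an address on the wrong subnet when the hosts entry was altered. Compare the result with the address that the ifconfig command reports for the interface.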

Learning
Isolate problems using fault analysis techniques.

Fault # 40 Problem in the Network


The following is the description and possible fixes of the fault.

Initial Customer Description


The user receives an error message when attempting a telnet, rlogin, or
rsh session or when trying to bring up an xterm login.

Error Symptoms/Conditions/Messages
could not grant slave pty

Probable Cause
The probable cause is that the file permissions or ownership of the
/usr/lib/pt_chmod file are set incorrectly.

Fault Insertion
Complete the following steps to insert the fault:
1. Log in as the root user.
2. Change the owner and group of the /usr/lib/pt_chmod file.
For example:
# chown bin:bin /usr/lib/pt_chmod

Possible Fix
Restore the correct owner and group of the /usr/lib/pt_chmod file.
For example:
# chown root:bin /usr/lib/pt_chmod
The file should have the following permissions and ownership:
# ls -la /usr/lib/pt_chmod
---s--x--x 1 root bin
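
The same comparison can be automated. The following sketch parses the output of the ls -ld command and compares the owner and the mode string with expected values. The function name verify_owner_mode and its argument order are illustrative.

```shell
#!/bin/sh
# verify_owner_mode FILE OWNER MODE
# Compares the owner and the 10-character mode string reported by ls with
# the expected values. Prints "ok" or a description of the problem.
verify_owner_mode() {
    listing=$(ls -ld "$1" 2>/dev/null) || { echo "missing: $1"; return 2; }
    mode=$(printf '%s\n' "$listing" | awk '{ print substr($1, 1, 10) }')
    owner=$(printf '%s\n' "$listing" | awk '{ print $3 }')
    if [ "$owner" = "$2" ] && [ "$mode" = "$3" ]; then
        echo "ok"
    else
        echo "mismatch: owner=$owner mode=$mode"
        return 1
    fi
}
```

For this fault, the call would be `verify_owner_mode /usr/lib/pt_chmod root ---s--x--x`, using the owner and mode shown in the listing above.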

Learning
Learn about the files that pseudo-terminals require and their correct
ownership.

Fault # 41 No Space on the File System


The following is the description and possible fixes of the fault.

Initial Customer Description


The file system has enough free space, but the user cannot create any new
files in it.

Error Messages or Symptoms


No space left on device

Probable Cause
The probable cause is that the file system contains many small files,
exceeding the limit for inodes (file information nodes).

Fault Insertion
Complete the following steps to insert the fault:
1. Create a partition that has a capacity of 20 Mbytes, for example,
the /dev/rdsk/c0t0d0s3 partition.

Note: Make backup copies of the file system on tape devices.

2. Use the following command to construct a file system:
# /usr/sbin/newfs -i 204800 <raw file partition>
The preceding command constructs a file system with a maximum of
192 inodes.
3. Mount the newly constructed file system on the /test mount point.
# mount -F ufs <block device of the partition> /test
4. Copy some files from another file system to the /test directory.
# cp /usr/bin/* /test
You will not be able to store more than 192 files on this file system.
5. Use the df command to verify that there is enough disk space
available on the specified file system.
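
The effect of the -i option can be estimated with simple arithmetic. The following sketch assumes that the option sets the number of bytes per inode, so that the nominal inode count is the partition size divided by that value. The function name nominal_inodes is illustrative. The count that the newfs command actually creates can differ from this estimate (this guide cites 192 for the example), probably because UFS rounds the inode count for each cylinder group, so treat the result only as a rough guide.

```shell
#!/bin/sh
# nominal_inodes SIZE_BYTES BYTES_PER_INODE
# Integer estimate of the inode count requested by
# "newfs -i BYTES_PER_INODE" on a partition of SIZE_BYTES.
nominal_inodes() {
    echo $(( $1 / $2 ))
}
```

For a 20-Mbyte partition (20971520 bytes), `nominal_inodes 20971520 204800` gives an estimate near 100 inodes, while `nominal_inodes 20971520 2048` gives roughly 10,000. The second value illustrates why a smaller -i value avoids the fault.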

Possible Fix
To repair this fault, reconstruct the file system with the newfs -i
command, using a smaller bytes-per-inode value to increase the inode
density, and restore the file system from the backup.

Learning
Learn about the parameters used to construct a new file system and their
importance.

Fault # 42 Cannot Mount a File System


The following is the description and possible fixes of the fault.

Initial Customer Description


The user cannot mount an NFS file system on a particular directory but
can mount it on a different directory.

Error Messages or Symptoms


nfs mount: mount:/ :Device busy

Probable Cause
The following are the probable causes:

File system is already mounted

Another file system is mounted in this directory

User is accessing this directory

Fault Insertion
Complete the following steps to insert the fault:
1. Create two directories, test1 and test2.
For example:
# mkdir /test1
# mkdir /test2
2. Use the vi editor to edit the /etc/dfs/dfstab file, and add the
following entries:
share -F nfs -o rw -d "test NFS" /test1
share -F nfs -o rw -d "test NFS" /test2
3. Reboot the system.
4. Try to mount both directories on the same mount point.
For example:
# mount -F nfs localhost:/test1 /mnt
# mount -F nfs localhost:/test2 /mnt

Possible Fix
To repair this fault, mount the file systems on different mount points.
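
Before choosing a mount point, you can check whether a directory is already in use. The following is a minimal sketch that reads a mount table in the "resource mountpoint fstype ..." layout of the /etc/mnttab file. It assumes a POSIX awk, and the function name mounted_on is illustrative.

```shell
#!/bin/sh
# mounted_on DIR [TABLE]
# Prints the resource that TABLE lists as mounted on DIR.
# Exit status is 0 when DIR is in use as a mount point, 1 otherwise.
# TABLE defaults to /etc/mnttab.
mounted_on() {
    awk -v dir="$1" '
        $2 == dir { print $1; found = 1 }
        END { exit !found }
    ' "${2:-/etc/mnttab}"
}
```

For example, `mounted_on /mnt` prints localhost:/test1 after the first mount command in the fault insertion succeeds, which explains the Device busy message for the second one.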

Learning
Learn how to use the NFS shares.

Fault # 43 Problem in the Network


The following is the description and possible fixes of the fault.

Initial Customer Description


No connection could be made because the target machine actively refused
it.

Error Messages or Symptoms


telnet: Unable to connect to remote host: Connection refused

Probable Cause
The following are the probable causes:

Trying to connect to an inactive service

No service process exists at the requested address

Fault Insertion
To insert the fault, kill the inetd daemon.
For example:
# pkill -9 inetd

Possible Fix
To repair this fault, reboot the system, or use the following command to
restart the inetd daemon:
# /usr/sbin/inetd -s

Learning
Learn about the importance of the inetd daemon.

Fault # 44 User Login Problem


The following is the description and possible fixes of the fault.

Initial Customer Description


An error message appears when a regular user logs in to CDE, and the
login fails.

Error Messages or Symptoms


The DT messaging system could not be started.

Probable Cause
The probable cause is that the dtlogin program could not find or create
files and directories to initiate the CDE for the user.

Fault Insertion
Complete the following steps:
1. Create a user, testusr, and remove the user's home directory.
2. Try to log in as the user testusr.

Possible Fix
To repair this fault, create the user's home directory with the proper
ownership and permissions.
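
A quick way to confirm the cause is to look up the user's home directory and test whether it exists. The following sketch reads a passwd-format file, so it covers local accounts only and not accounts served by a name service. It assumes a POSIX awk, and the function name home_status is illustrative.

```shell
#!/bin/sh
# home_status USER [PASSWD_FILE]
# Looks up the home directory field for USER and reports whether the
# directory exists. PASSWD_FILE defaults to /etc/passwd.
home_status() {
    home=$(awk -F: -v u="$1" '$1 == u { print $6; exit }' "${2:-/etc/passwd}")
    if [ -z "$home" ]; then
        echo "no entry or empty home field for $1"
        return 2
    elif [ -d "$home" ]; then
        echo "ok: $home"
    else
        echo "missing: $home"
        return 1
    fi
}
```

For the account in this fault, `home_status testusr` would report the missing directory.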

Learning
Learn about the files that the user's CDE environment requires, their
locations, and their importance.

Fault # 45 Problem in the Network


The following is the description and possible fixes of the fault.

Initial Customer Description


An error message appears when the user tries to invoke any TCP service.

Error Messages or Symptoms


inetd<int>: string/tcp: unknown service

Probable Cause
The probable cause is that the Master Internet services daemon inetd
could not locate the TCP service specified after the first colon.

Fault Insertion
Complete the following steps to insert the fault:
1. Use the vi editor to open the /etc/nsswitch.conf file.
2. Change the line for the services entry in the /etc/nsswitch.conf
file:
Before edit:
services: files
After edit:
services: dns

Possible Fix
To repair this fault, edit the /etc/nsswitch.conf file to correct the
settings.
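
The current setting can be read with a short shell function. The following sketch prints the lookup sources configured for a database and tests whether files is among them. It assumes the "database: source ..." layout shown in the fault insertion and a POSIX awk. The function names nss_sources and uses_files are illustrative.

```shell
#!/bin/sh
# nss_sources DATABASE [FILE]
# Prints the lookup sources configured for DATABASE (for example, services).
# FILE defaults to /etc/nsswitch.conf.
nss_sources() {
    awk -v db="$1:" '
        { sub(/#.*/, "") }                    # strip comments
        $1 == db { $1 = ""; sub(/^[ \t]+/, ""); print; exit }
    ' "${2:-/etc/nsswitch.conf}"
}

# uses_files DATABASE [FILE]
# Succeeds when "files" is one of the configured sources.
uses_files() {
    case " $(nss_sources "$@") " in
        *" files "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `uses_files services || echo "services does not consult local files"` flags the edited entry from this fault.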

Learning
Learn to ensure proper settings in the /etc/nsswitch.conf file.

Fault # 46 System Displays a Panic Message


The following is the description and possible fixes of the fault.

Initial Customer Description


The system does not boot to the default run level and goes into a reboot
loop.

Error Messages or Symptoms


The machine is rebooting continuously and displays a panic message:
.....<output truncated>
can't invoke /etc/init

Probable Cause
The probable cause is that the init program is missing or corrupted.

Fault Insertion
Complete the following steps to insert the fault:
1. Use the following command to remove the /etc/init symbolic link:
# unlink /etc/init
2. Reboot the system.

Possible Fix
To repair this fault, complete the following steps:
1. Use the Stop-A key sequence to halt the system.
2. Boot the system from the CD-ROM in single-user mode.
3. Run the fsck command to fix the root file system:
# fsck /dev/dsk/c0t0d0s0
where c0t0d0s0 is the root file system.
4. Mount the root file system onto the /a directory:
# mount /dev/dsk/c0t0d0s0 /a

5. Copy the /sbin/init file to the /a/sbin directory:
# cp /sbin/init /a/sbin/init
6. Change the current working directory to the /a/etc directory:
# cd /a/etc
7. Use the following command to restore the /etc/init symbolic link:
# ln -s ../sbin/init init
8. Reboot the system.
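
Before the final reboot, the restored link can be verified. The following is a minimal sketch that parses the output of the ls -l command, so that it does not depend on a separate link-reading utility. The function name link_target is illustrative.

```shell
#!/bin/sh
# link_target PATH
# Prints the target of the symbolic link PATH, parsed from ls -l output.
# Returns 1 and prints nothing if PATH is not a symbolic link.
link_target() {
    [ -h "$1" ] || return 1
    ls -l "$1" | sed 's/^.* -> //'
}
```

With the root file system still mounted on /a, `link_target /a/etc/init` should print ../sbin/init if step 7 succeeded.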

Learning
Learn about the importance of the init program, and know more about
the system initialization files. The /etc/init file is a symbolic link to
the /sbin/init file.

Fault # 47 Corrupt File System


The following is the description and possible fixes of the fault.

Initial Customer Description


The system does not boot. The problem might have occurred when the
system crashed during a power failure.

Error Messages or Symptoms


The file just loaded does not appear to be executable.

Probable Causes
The probable cause is a corrupt boot block, boot file (/ufsboot), or kernel
(/kernel/unix).

Fault Insertion
1. Use the dd command to corrupt the boot block.
For example:
# dd if=/dev/dsk/c0t0d0s7 of=/dev/dsk/c0t0d0s0 count=31
2. Reboot the system.

Possible Fixes
Provide an alternative boot block to students for booting the system.
To repair this fault, complete the following steps:
1. Use the Stop-A key sequence to halt the system.
2. Boot the system from the CD-ROM in single-user mode.
3. Run the fsck command to fix the root file system:
# fsck /dev/dsk/c0t0d0s0
where c0t0d0s0 is the root file system.

4. Use the uname -i command to note the platform name that is
displayed:
# uname -i
SUNW,Ultra-5_10
5. Run the following command to install the boot block:
# installboot /usr/platform/SUNW,Ultra-5_10/lib/fs/ufs/bootblk /dev/rdsk/c0t0d0s0
where c0t0d0s0 is the root file system.
6. Reboot the system.

Learning
Learn about the files related to the boot sequence and how to restore a
corrupt boot block.

Fault # 48 Remote Login Failure


The following is the description and possible fixes of the fault.

Initial Customer Description


Remote login fails.

Error Messages or Symptoms


inetd<PID>: /usr/sbin/in.rlogind: cannot execute:
Permission denied

Probable Cause
The following are the probable causes:

Incorrect file permissions on the in.rlogind daemon

Incorrect entries in the /etc/inetd.conf file

Invalid login shell is substituted in the entry for the login ID in the
/etc/passwd file

Fault Insertion
To insert the fault, change the permissions of the
/usr/sbin/in.rlogind file.
For example:
# chmod 444 /usr/sbin/in.rlogind

Possible Fix
To repair this fault, check the permissions of the in.rlogind daemon, and
restore the default permissions.
For example:
# chmod 555 /usr/sbin/in.rlogind
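
The permission check can be scripted by inspecting the mode string that the ls -ld command prints. The following sketch reports the mode and tests whether owner, group, and other all have execute permission. The function names mode_string and all_can_execute are illustrative.

```shell
#!/bin/sh
# mode_string FILE
# Prints the 10-character file type and permission string for FILE.
mode_string() {
    ls -ld "$1" | awk '{ print substr($1, 1, 10) }'
}

# all_can_execute FILE
# Succeeds when owner, group, and other all have execute permission
# (x, s, or t in the respective position of the mode string).
all_can_execute() {
    case $(mode_string "$1") in
        ???[xs]??[xs]??[xt]) return 0 ;;
        *) return 1 ;;
    esac
}
```

After the chmod 444 command in the fault insertion, `mode_string` prints -r--r--r-- for the daemon and `all_can_execute` fails. After the chmod 555 command, the mode is -r-xr-xr-x and the test succeeds.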

Learning
Learn about the in.rlogind daemon and the significance of the
permissions set on the /usr/sbin/in.rlogind file.

Fault # 49 Corrupt File System


The following is the description and possible fixes of the fault.

Initial Customer Description


The system boots into maintenance mode.

Error Messages or Symptoms


/usr/sbin/fsck not found
cannot mount /usr filesystem

Probable Cause
The following are the probable causes:

The /usr file system is not mounted

The fsck command of the /usr file system in preen mode failed

Fault Insertion
1. Use the dd command to corrupt the boot block of the /usr file
system.
For example:
# dd if=/dev/dsk/c0t0d0s7 of=/dev/dsk/c0t0d0s6 count=31

Note: Record the backup superblock locations before using the dd command
(# newfs -Nv /dev/rdsk/c0t0d0s6).

2. Reboot the system.

Possible Fixes
Provide an alternative boot block to students for booting the system.
To repair this fault, complete the following steps:
1. Use the Stop-A key sequence to halt the system.
2. Boot the system from the CD-ROM in single-user mode.
3. Run the fsck command to fix the /usr file system:
# fsck -o b=<alternate superblock> /dev/dsk/c0t0d0s6
where c0t0d0s6 is the /usr file system.
4. Reboot the system.

Learning
Learn how to restore a corrupt file system and the importance of the
/usr/sbin/fsck utility.

Fault # 50 Student Designed Workshop


The following is the description and possible fixes of the fault.

Initial Customer Description


Create your own.

Error Messages or Symptoms


Students design a workshop for another group in the class to solve. It is
optional and involves working in groups to design a workable problem
with a customer description that can be given to another group for fault
analysis.

Fault Insertion

Probable Cause

Possible Fix

Learning

Glossary/Acronyms
A
Admintool
A system administration utility with a graphical interface that enables
administrators to maintain system database files, printers, serial ports,
user accounts, and hosts.

B
backup
A copy of file system data that is stored separately from the disk drive
on which the data resides.
baud rate
The unit in which the signalling rate of a communication channel is
measured. In addition, baud rate is the measure of the speed at which
the communication channel can transmit and receive information.
boot
The boot command is used to start the system kernel or a standalone
program.
boot block
A 15-sector disk block that contains information used to boot a system.
Block numbers point to the location of the ufsboot program on the
disk. The boot block directly follows the disk label.

C
checksum
A number that is calculated from the binary bytes of the file. You can
use the checksum to determine if the file contents have changed.

console
The console device is the main input/output device that is used to
access a system and display all system messages. A console can either be
a display monitor and keyboard or a shell window.
coreadm command
The coreadm command specifies the name or location of core files
produced during a core dump.
Cumulative residence-length product
The total time for which a process remains in the queue during its
complete life cycle.

D
debugger command (dcmd)
A debugger command or dcmd (pronounced dee-command) is a routine
in the debugger that can access any of the properties of the current
target. The mdb utility parses commands from standard input and
executes the corresponding dcmds. Each dcmd can also accept a list of
string or numerical arguments.
debugger module (dmod)
A debugger module or dmod (pronounced dee-mod) is a dynamically
loaded library containing a set of dcmds and walkers. During
initialization, the mdb utility attempts to load dmods corresponding to the
load objects present in the target. You can subsequently load or unload
dmods at any time while running the mdb utility.
devfsadm
The devfsadm command configures devices and updates the /dev and
/devices directories.
device
A hardware component or a physical device, such as a printer or disk
drive, that acts as a unit to perform a specific function.
device driver
A program that the kernel uses to communicate with devices.
directory
A location for files and other directories. The Solaris OE file system or
directory structure enables you to create files and directories that can be
accessed through a hierarchy of directories.

domain name
The name assigned to a group of systems on a local network that share
administrative files. The domain name is required for the network
information service database to work properly.
dumpadm
The dumpadm utility manages the configuration of the crash dump
facility on the Solaris OE.
dumping
Dumping is the process of copying files and directories for offline
storage.

E
encryption
Encryption is used to protect account passwords, data, and other pieces
of information. When a password is encrypted, it appears as a series of
numerals and uppercase and lowercase letters unrelated to the actual
password. This means that not even the superuser can read the
password; only the system can read the special code.
error checking and correcting (ECC)
ECC logic is used by memory chips and processing units for correction
of single-bit errors and detection of double-bit and multiple-bit errors.
ECC logic uses a part of the system memory to store parity information.
With full parity memory, a memory error alert is sent and the system
halts. With no parity memory, in case of an error, the system experiences
random results, such as system crashes and data corruption. However,
in case of minor memory errors, ECC handles the error without causing
any damage to the system.
Ethernet
A local area network (LAN) that employs a bus topology in which all
the workstations are connected to a single physical medium. Ethernet is
a broadcast network, which means that all of the workstations on the
network receive all transmissions.
Ethernet address
The physical address of an individual Ethernet controller board. It is
called the hardware address or media access control (MAC) address.
The Ethernet address of every Sun workstation is unique and coded into
a chip on the motherboard.

F
firmware
Firmware includes programs that are permanently installed in a chip.
The programmable read-only memory (PROM) and non-volatile
random access memory (NVRAM) chips are examples of firmware.
fsck
The fsck command checks the integrity of a file system and repairs any
damage found.
ftp command
The ftp command is used to transfer files to and from a remote network
site using the File Transfer Protocol (FTP) service.

G
group
A group identifies the users associated with a file. A user group is a set
of users who have access to a common set of files. User groups are
defined in the /etc/group file and are granted the same sets of
permissions.

H
heuristic
Heuristic describes an approach to learning by trying rather than by
following some pre-established formula or organized hypothesis. A
heuristic program is a mathematical program, consisting of a complex
set of functions.
host
A computer system in a network computing environment.
host name
A unique name identifying a host machine connected to a network. The
host name must be unique on the network.

I
inetd
The inetd server daemon listens to service requests and executes the
server program associated with the service.

K
kernel
The master program of the Solaris OE. It manages devices, memory,
swap, processes, and daemons. The kernel also controls the functions
between system programs and the system hardware.
kernel STREAM
A kernel mechanism that supports development of network services
and data communications drivers. STREAMS define interface standards
for character input/output within the kernel and between the kernel
and user level. The STREAMS mechanism includes integral functions,
utility routines, kernel facilities, and a set of structures.

L
login
The login is used to sign on to the system. A login consists of a login ID
or user name and a valid password.

M
man page
Manual pages or man pages are online references that are available as
part of the Solaris OE.
multiuser
A feature of the Solaris OE that enables more than one user to access the
same system resources.

N
network
A connection between machines that enables an exchange of
information between the machines. Two main types of networks are
local area networks (LANs) and wide area networks (WANs).

O
OEM
An original equipment manufacturer (OEM) is a supplier who builds
parts for systems.

P
partition
A logical subdivision of a physical disk drive that is treated as an
individual device. A partition consists of a range of physical disk
cylinders. Partitions are defined in the disk label. Partitions can contain
file systems or can be treated as raw devices, such as swap.
patch
A collection of files and directories that replace or update existing files
and directories that prevent the proper execution of software. The main
purpose of patches is to correct application bugs and provide product
enhancements.
peripheral device
A piece of hardware, such as a mouse or a printer, that performs a
specific function and is connected to a workstation.
port
A pathway used to connect computers. A port can be made up of both
hardware, such as pins and connectors, and software, such as a device
driver. Types of ports include serial, parallel, small computer system
interface (SCSI), network, and Ethernet.
Power-On Self-Test (POST)
A series of diagnostic checks to check the system hardware. POST is
invoked each time the system is powered on.
Programmable read-only memory (PROM)
A chip containing permanent, nonvolatile memory and a limited set of
commands used to test the system and start the boot process.

Q
quad card
A quad card is a card having four Ethernet ports plugged into the
motherboard.

R
remote host
A system other than the local system on which the user is working.
residence time
The time a process is in the queue at any particular instant.

rlogin
A service that enables users of one system to connect to other systems
across the intranet as if they were connected directly.
root
The user name of the superuser account. The superuser is a privileged
user with complete system access. The terms superuser and root can be
used interchangeably.
run control (rc) script
A script that is executed during system initialization and when
changing run levels. Commands executed by the run control scripts
determine which file systems are mounted, which daemon processes are
running, and other environment configuration.
run level
One of the eight initialization states in which a system runs. A system
can run in only one initialization state at a time. The default run level for
each system is specified in the /etc/inittab file.

S
SBus
A proprietary bus system used in many Sun systems.
serial port
A serial port is used to transfer data one bit at a time. It is usually an
RS-232 port, but 25-pin and 9-pin connectors are also used.
single-user
A feature of the Solaris OE that ensures that the system runs minimal
processes and services and regular users cannot log in. The single-user
mode is often referred to as the maintenance mode. You require the root
password to switch to single-user mode on a system.
Small Computer System Interface (SCSI)
A high-speed interface that can connect to computer devices, such as
hard drives, CD-ROM drives, diskette drives, tape drives, scanners, and
printers.
superblock
A block on the disk that contains information about a file system, such
as its name and size in blocks. Each file system has its own superblock.
A block is also defined as space on a physical hard disk where you can
write a unit of data.

superuser
The superuser is a privileged user with total system access. For example,
only the superuser can change the password file and edit system
administration files in the /etc directory. The user name for the
superuser account is root.

T
telnet
A service that enables users of one system to connect to other systems
across the network as if they were connected directly.
typescript file
A file, created by the script command, that records everything displayed
on the terminal during a session.

U
UNIX file system (ufs)
The default disk-based file system for the Solaris OE.

V
vfstab
The file system configuration file (/etc/vfstab) that defines which
file systems are mounted at boot time.
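
To illustrate how this file drives boot-time mounting, the sketch below
lists the mount points whose "mount at boot" field is yes in a
vfstab-format file. The function name boot_mounts and the sample entries
are assumptions made for illustration. Each entry has seven
whitespace-separated fields: device to mount, device to fsck, mount
point, FS type, fsck pass, mount at boot, and mount options.

```shell
#!/bin/sh
# boot_mounts FILE
# List mount points whose "mount at boot" field (field 6) is yes in a
# vfstab-format file. Hypothetical helper, for illustration only.
boot_mounts() {
    # Ignore comment lines and incomplete entries; print the mount point.
    awk '$1 !~ /^#/ && NF >= 7 && $6 == "yes" { print $3 }' "$1"
}

# Sample data (assumed for this sketch, not copied from a real system).
sample=/tmp/vfstab.sample.$$
cat > "$sample" <<'EOF'
#device            device              mount    FS    fsck  mount    mount
#to mount          to fsck             point    type  pass  at boot  options
/dev/dsk/c0t0d0s0  /dev/rdsk/c0t0d0s0  /        ufs   1     no       -
/dev/dsk/c0t0d0s7  /dev/rdsk/c0t0d0s7  /export  ufs   2     yes      -
swap               -                   /tmp     tmpfs -     yes      -
EOF

boot_mounts "$sample"
rm -f "$sample"
```

The root file system typically shows no in this field because it is
mounted earlier in the boot sequence, so a no value does not by itself
mean that a file system is unused.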

W
walker
A set of routines that describe how to walk or iterate through the
elements of a particular program data structure. A walker hides the
implementation of a data structure from dcmds and from the mdb utility.
You can use walkers interactively or use them to build other dcmds or
walkers.

Index
/var/sadm/install/admin
directory 5-17
boot commands 4-30
mdb and adb utilities
relationship 8-9
mdb utility
examining system
dumps 8-14
limitations 8-8

Symbols
.bss section 5-30
.data section 5-30
.enet-addr command 2-20
.locals command 7-9, 7-20
.properties command 4-11
.registers command 7-9, 7-20
.speed command 2-20
.version command 2-19
/ 5-35, 6-20
/dev/cua directory 5-6
/dev/dsk directory 5-6
/dev/kmem file 5-52
/dev/rdsk directory 5-6
/dev/rmt directory 5-6
/dev/term directory 5-6
/devices directory 4-4
/etc directory 1-8
/etc/coreadm.conf file 5-57
/etc/init.d/sysetup
script 7-14
/etc/minor_perm file 5-6
/etc/rc2.d/S50devfsadm
script 5-4
/etc/system file 4-21
/etc/vfstab file 7-4, 7-11
/kernel directory 4-22
/sbin/ifconfig
command 5-35
/sbin/init process 4-22
/sbin/rc2 boot script 5-4
/usr/kernel directory 4-22
/usr/local/man directory 6-6
/usr/sbin/ directory 4-19
/usr/share/man directory 6-4
/usr/ucb/ps command 5-23
/var/adm/messages file 7-7
/var/adm/messages log 1-7
/var/sadm/install/contents
file 5-15
/var/spool/pkg directory 5-16

A
About This Course xvii
adb utility 8-4
Address Resolution Protocol
(ARP) 5-37
admin file 5-17
ALP (Assembly Language
Programming) 5-30
American Standard Code for
Information Interchange
(ASCII) 2-9
analyzing core dumps using the mdb utility 8-1
API (application programming
interface) 8-4
Application 7-5, 7-16
application core dump 7-5
application programming interface
(API) 8-4
ARP (Address Resolution Protocol) 5-37
arp command 5-37
ASCII (American Standard Code for
Information Interchange) 2-9
Assembly Language Programming
(ALP) 5-30
auto-boot? variable 2-10, 4-32
Automate Downloads utility 6-17
automated OBP probing
PCI probing 4-8
UPA probing 4-8

B
bad traps 7-6
banner command 2-19
Basic layers and error types in Sun
systems
identifying 1-17
Berkeley Software Distribution
(BSD) 5-23
BigAdmin portal 6-13
BigAdmin services 6-13
boot command 4-17
Boot Programs phase 4-19
Boot PROM
description 2-5
features 2-6
phase 4-18
boot sequence 4-17
bootblk program 4-18
boot-device variable 3-9
BSD (Berkeley Software
Distribution) 5-23
bus errors 1-20

C
cat command 5-19
catman -w option 6-7
causes of system panics 7-4
cd command 4-10
choosing the test methodology
factual approach 1-13
realistic approach 1-13
result-oriented approach 1-13
CMOS (complementary metal-oxide
semiconductor) 2-9
cmp command 5-21
collecting error messages 1-7
collection documents 6-11
commands
.enet-addr 2-20
.locals 7-9, 7-20
.properties 4-11
.registers 7-9, 7-20
.speed 2-20
.version 2-19
/sbin/ifconfig 5-35
/usr/ucb/ps 5-23
arp 5-37
banner 2-19
boot 4-17
cat 5-19
cd 4-10
cmp 5-21
coreadm 5-56, 7-16
ctrace 7-20
dev 4-10
devalias 4-13
devfsadm 5-4
device-end 4-10
devlinks 5-6
diff 5-21
disks 5-6
dmesg 7-7
drvconfig 5-6
dumpadm 7-12
eeprom 2-8, 2-14
file 5-45
find 5-43
format 5-7
fsck 5-8
fstyp 5-11
ifconfig 5-35
installboot 4-19
iostat 5-11
ls 4-12
modinfo 5-29
mpstat 5-27
netstat 5-39
nm 5-52
nvalias 4-12, 4-16
nvunalias 4-12, 4-16
pgrep 5-31
ping 5-32
pkgadd 5-16
pkgchk 5-14
pkginfo 5-15
pkgrm 5-17
ports 5-6
printenv 2-11
probe 2-16
probe-ide 2-16
probe-scsi 2-16
probe-scsi-all 2-16
prtconf 5-49
prtconf -vp 4-4
prtdiag 3-14
ps 5-23
psrinfo 5-27
reset-all 4-16
savecore 7-14
script 5-44
see 2-21
set-default 2-13
set-defaults 2-13
setenv 2-13
show-devs 4-14
show-disks 4-15
show-nets 4-15
show-post-results 3-22
showrev 5-48
sifting 2-20
snoop 5-42
stop-n 2-13
sum 5-22
swap 5-53
sysdef 5-51
tail 5-45
tapes 5-6
test 2-17, B-4
test floppy 2-17
test net 2-17
test-all 2-17
tip 3-10
traceroute 5-33
truss 5-55
uname 5-46
vi 5-18
vmstat 5-25
watch 2-18
watch-clock 2-18
watch-net 2-18
watch-net-all 2-18
whatis 6-8
words 4-11
common OBP variables 2-10
comparison results 1-8
complementary metal-oxide
semiconductor (CMOS) 2-9
configuring and executing Explorer 6-21
controlled comparisons 1-8
Copying 7-16
coreadm command 5-56, 7-16
course goals xvii
Course Map xviii
CPU and memory management
commands 5-23
CPU watchdog reset 1-21
crash utility 8-7
ctrace command 7-20
custom device aliases 4-12

D
Data Communication Equipment
(DCE) 3-10
Data Terminal Equipment (DTE) 3-10
DCE (Data Communication
Equipment) 3-10
dcmds 8-8
debugger commands (dcmds) 8-8
debugger modules (dmods) 8-8
dev command 4-10
devalias command 4-13
devfsadm command 5-4
device management commands 5-4
device path 4-12
device path name 4-6
device-arguments parameter 4-6
device-end command 4-10
device-name 4-6
devlink.tab file 5-5
devlinks command 5-6
DHCP (Dynamic Host Configuration
Protocol) 4-26
diag-device variable 2-10, 3-8
diag-level variable 3-6
diagnose problems using the SunSolve
Online service 6-9
diagnostic tools 6-10
diag-switch variable 3-5
diag-switch? variable 2-10
diff command 5-21
different levels of diagnostic tests for
POST 3-6
disk and file system management
commands 5-7
disks command 5-6
dmesg command 7-7
dmods 8-8
drvconfig command 5-6
DTE (Data Terminal Equipment) 3-10
dumpadm command 7-12
Dynamic Host Configuration Protocol
(DHCP) 4-26

E
eeprom A-2
eeprom command 2-14
eeprom Command on a Sun4u Enterprise
Server A-2
ELF object file 5-52
enable extended POST diagnostics 3-5
error checking and correcting (ECC) 1-18
errors in a boot sequence 4-25
ex editor 5-18
examining a successful boot sequence 4-24
exclude module 4-21
Explorer 6-21

F
factual approach 1-13
Failed Field Replaceable Units
(FRUs) 3-14
fault analysis and diagnosis
methodology 1-1
fault diagnosis methodology 1-11
FIFO (first-in first-out) 5-45
file command 5-45
file-checking commands 5-18
find command 5-43
first-in first-out (FIFO) 5-45
forceload module 4-21
format command 5-7
formulating hypotheses 1-12
FPROM jumper J2003 2-8
FPROM Upgrades 2-8
FRUs (Field Replaceable Units),
failed 3-14
fsck command 5-8
fstyp command 5-11

G
general-purpose commands 5-43
generating system crash dump 7-10
genunix file 4-20
global core file path 7-17
group file 1-8

H
hardwire argument 3-11
header files 8-11

I
ICMP (Internet Control Message
Protocol) 5-32
Icons xxii
IDE (Integrated Drive Electronics) 2-16
Identifying 4-26, 7-20
identifying error-reporting
mechanisms 1-20
bus errors 1-20
Interrupts 1-20
Watchdog Resets 1-21
identifying magnitude of the fault 1-9
identifying patch support tools 6-17
Checksum file 6-17
Patch Check 6-17
PatchPro 6-17
Recommended and Security
Patches 6-17
Solaris Patches 6-17
Sun Alert Patch Report 6-17
identifying the basic layers and error types
in Sun systems 1-17
IEEE (Institute of Electrical and Electronics
Engineers) 2-4
ifconfig command 5-35
IGMP (Internet Group Management
Protocol) 5-41
impacts of the methodology chosen 1-13
in.ftpd daemon 1-9
information sources 1-6
init phase 4-23
installboot command 4-19
Installing Explorer 6-21
Institute of Electrical and Electronics
Engineers (IEEE) 2-4
Instruction Unit (IU) 7-6
Integrated Drive Electronics (IDE) 2-16
Internet Control Message Protocol
(ICMP) 5-32
Internet Group Management Protocol
(IGMP) 5-41
Internet Protocol version 4 (IPv4)
protocol 5-34
Internet Protocol version 6 (IPv6)
protocol 5-34
Interrupts 1-20
Introducing 3-4
introducing OBP components, features,
and diagnostics 2-1
introducing system panics 7-6
introduction to types of faults in Sun
systems 1-18
critical errors 1-19
fatal errors 1-19
hardware errors 1-18
software errors 1-18
system panics 1-19
iostat command 5-11
IPv4 (Internet Protocol version 4)
protocol 5-34
IPv6 (Internet Protocol version 6)
protocol 5-34
ISCDA script 8-14
IU (Instruction Unit) 7-6

K
kadb utility 8-4
Kernel Initialization phase 4-21
kernel STREAMS 8-6

L
latest security bulletin 6-12
listing facts about the problem 1-6
local-mac-address? variable 2-12
ls command 4-12

M
macro file 8-10
man -k option 6-6
man -l option 6-5
man -M option 6-6
man -s option 6-5
man.cf file 6-4
MANPATH variable 6-4
manual OBP diagnostic commands
preparing 2-15
using 2-15
mdb 8-4
mdb utility
command formats 8-9
features 8-6
identifying register references 8-12
Macros 8-10
Registers 8-10
misc/obpsym kernel module 7-21
moddir module 4-21
modinfo command 5-29
modules 4-21
mounting 5-9
mpstat command 5-27
multicast 5-38

N
netstat command 5-39
Network File System (NFS) 4-26
network management commands 5-32
NFS (Network File System) 4-26
nm command 5-52
Nonvolatile random access memory
(NVRAM) 2-4
nvalias command 4-12, 4-16
NVRAM 2-9
NVRAM (Nonvolatile random access
memory) 2-4
NVRAM chip 4-12
nvramrc variable 2-9, 4-12
nvunalias command 4-12, 4-16

O
object file
.text section 5-30
OBP
components 2-1
diagnostics 2-1
features 2-1
OBP Device tree
examining 4-9
introducing 4-4
navigating 4-9
OBP device tree and the boot sequence 4-1
OBP variables
modifying 2-12
running diagnostics 2-12
Online Support Center 6-11

P
panic() kernel function 1-19, 7-6
passwd file 1-8
Patch Finder 6-17
patchadd command 1-9
PatchDiag cross-reference file 6-19
PatchDiag tool
description 6-11
installation 6-19
sample report A-4
patchdiag.xref file 6-19
patchdiag_setup script 6-20
Patches 6-10
patchk.pl script 6-19
path_to_inst database 5-4
PCI (peripheral component
interconnect) 2-6
PCI probing 4-8
pcia-probe-list variable 2-10
pcib-probe-list variable 2-10
performing search operations in the
SunSolve Online service 6-14
performing Solaris OE diagnostics 5-1
peripheral component interconnect
(PCI) 2-6
per-process core file 5-56, 7-17
pgrep command 5-31
Phases in the boot process
boot programs phase 4-17
boot PROM phase 4-17
init phase 4-17
kernel initialization phase 4-17
ping command 5-32
pkgadd command 5-16
pkgchk command 5-14
pkginfo command 5-15
pkgmap file 5-15
pkgrm command 5-17
ports command 5-6
POST diagnostics
enabling 3-1
monitoring 3-1
Power-on self-test (POST) 3-1
pre-Solaris 8 OE device commands 5-6
printenv command 2-11
prioritizing planned tests 1-12
probe command 2-16
probe-ide command 2-16
probe-ide option B-2
probe-scsi command 2-16
probe-scsi option B-2
probe-scsi-all command 2-16
probe-scsi-all option B-3
process crash dumps 7-12
Product Patches 6-17
program execution management
commands 5-55
PROM revisions
1.x 2-7
2.x 2-7
3.x 2-7
4.x 2-7
prtconf command 5-49
prtconf -vp command 4-4
prtdiag 3-14
prtdiag command 3-14
ps command 5-23
psrinfo command 5-27

R
realistic approach 1-13
Registers 8-12
repairing a corrupt superblock 5-10
reset-all command 4-16
restoring the bootblk or ufsboot
programs 4-29
result-oriented approach 1-13
reviewing Explorer output 6-22
rootdev module 4-21
rootfs module 4-21
run level 4-26

S
savecore command 7-14, 7-16
sbus-probe-list variable 2-10
script command 5-44
SCSI (small computer system
interface) 2-16
security bulletin archive 6-12
security information
latest security bulletin 6-12
Security Pretty Good Privacy (PGP)
key 6-12
security t-patches 6-12
security-mode variable 2-10
see command 2-21
set module 4-21
set-default command 2-13
set-defaults command 2-13
setenv command 2-13
setting up a tip connection 3-12
show-devs command 4-14
show-disks command 4-15
show-nets command 4-15
show-post-results command 3-22
showrev command 5-48
sifting command 2-20
single-user mode 4-30
small computer system interface
(SCSI) 2-16
snoop command 5-42
software package management
commands 5-14
SSP (System Service Processor) 3-14
stating the problem 1-5
Stop-D key sequence 3-7
stop-n command 2-13
sum command 5-22
Sun alert notifications 6-12
Sun Explorer Data Collector utility 6-11,
6-21
Sun Validation Test Suite (SunVTS) 6-11
SunSolve Online database documents 6-11
SunSolve Online service 6-9
swap command 5-53
sysdef command 5-51
syseventd daemon 5-4
syslogd daemon 3-15
system changes 1-8
system hang 7-9
System Service Processor (SSP) 3-14
system watchdog reset 1-21
system-wide core file 5-56

T
tail command 5-45
tapes command 5-6
tar.Z package 6-19
TCP (Transmission Control
Protocol) 5-41
test command 2-17
test floppy command 2-17
test floppy option B-4
test -memory option B-4
test net command 2-17
test net option B-5
test-all command 2-17
test-all option B-4
testing the hypothesis 1-13
Time of Day (TOD) 2-9
tip command 3-10
TOD (Time of Day) 2-9
Topics Not Covered xix
tpe-link-test? variable 2-10
traceroute command 5-33
Transmission Control Protocol
(TCP) 5-41
troubleshooting scripts using shell
options 4-28
truss command 5-55
types of system failures 7-1
typescript file 5-44
Typographical Conventions xxiii

U
ufsboot program 4-20
Ultra 5 and Ultra 10 architecture B-7
Ultra Port Architecture (UPA) 4-8
uname command 5-46
unit-address parameter 4-6
Universal Serial Bus (USB) 2-12
unix file 4-20
use-nvramrc? variable 2-9
using OBP commands to display
information 2-18
using the file-checking commands 5-18
using the online man pages 6-4
Using the SunSolve online database 6-9
Using the system key switch 3-7

V
variables
auto-boot? 2-10, 4-32
boot-device 3-9
diag-device 2-10, 3-7
diag-level 3-6
diag-switch 3-5
diag-switch? 2-10
local-mac-address? 2-12
MANPATH 6-4
nvramrc 2-9, 4-12
pcia-probe-list 2-10
pcib-probe-list 2-10
sbus-probe-list 2-10
security-mode 2-10
tpe-link-test? 2-10
use-nvramrc? 2-9
watchdog-reboot? 2-10
vi command 5-18
Viewing 3-10
viewing extended diagnostics during
POST 3-10
viewing the current patch report 6-18
vmstat command 5-25

W
walkers 8-8
watch B-6
watch command 2-18, B-6
watch-clock command 2-18, B-6
Watchdog Resets 1-21
causes and effects 7-19
CPU watchdog reset 1-21
introducing 7-19
system watchdog reset 1-21
watchdog-reboot? variable 2-10, 7-20
watch-net command 2-18
watch-net option B-6
watch-net-all command 2-18
whatis command 6-8
windex database file 6-7
words command 4-11
Writing 7-11
Writing to the dump device 7-11