
Sun Systems Fault Analysis Workshop

ST-350

Student Guide

Sun Educational Services


SunService Division
Sun Microsystems, Inc.
MS UMIL07-14
2550 Garcia Avenue
Mountain View, CA 94043
U.S.A.

Part Number 802-6162-02


Revision C, May 1996
Copyright 1996 Sun Microsystems, Inc., 2550 Garcia Avenue, Mountain View, California 94043-1100 U.S.A. All rights
reserved.
This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution,
and decompilation. No part of this product or document may be reproduced in any form by any means without prior
written authorization of Sun and its licensors, if any.
Portions of this product may be derived from the UNIX® system, licensed from Novell, Inc., and from the Berkeley 4.3 BSD
system, licensed from the University of California. UNIX is a registered trademark in the United States and other countries
and is exclusively licensed by X/Open Company Ltd. Third-party software, including font technology in this product, is
protected by copyright and licensed from Sun’s suppliers.
RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the government is subject to restrictions as set forth in
subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 and FAR
52.227-19.
Sun, Sun Microsystems, the Sun logo, Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the
United States and other countries. All SPARC trademarks are used under license and are trademarks or registered
trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are
based upon an architecture developed by Sun Microsystems, Inc.
The OPEN LOOK® and Sun™ Graphical User Interfaces were developed by Sun Microsystems, Inc. for its users and
licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical
user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User
Interface, which license also covers Sun’s licensees who implement OPEN LOOK GUIs and otherwise comply with Sun’s
written license agreements.
X Window System is a trademark of X Consortium, Inc.
THIS PUBLICATION IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE, OR NON-INFRINGEMENT.

About This Course

Overview
The primary objective of this course is to teach a systematic fault
analysis technique for troubleshooting intermediate and some advanced
Solaris system faults.

This course is intended for system administrators and system
maintainers who must isolate system faults regardless of the cause.

This course provides guided hands-on lab experience in performing
fault analysis on SPARCstation systems with inserted faults.

Copyright 1997 Sun Microsystems, Inc. All Rights Reserved. SunService May 1996
Course Prerequisites

To succeed in this course, you must already have:

● Completed the Solaris 2.X System Administration (SA-285) or the
Solaris 1.X Solaris 2.X System Administration (SA-271) courses

● Six months of field system administration or system maintenance
experience in Sun environments

Course Objectives

Upon completion of this course, you will be able to:

● Use an organized total system approach for fault analysis.

● Differentiate and repair selected hardware, software, and system
administration problems.

● Isolate and repair selected network problems.

● Use SunSolve SearchTool to determine if the fault is already
known and if a repair or patch has been determined.

● Gather and interpret system error indications to determine the most
likely cause and a repair strategy.

● Use diagnostic tools to verify repairs if needed.

● Use a cookbook approach to analyze a selected class of
kernel core dumps.

Day-to-Day Schedule

Monday

A.M. Course Introduction – Set expectations
Module 1 – Fault Analysis
Workshop
P.M. Module 2 – Error Detection
Workshop
Module 3 – POST Diagnostics
Workshop
Module 4 – OBP Diagnostics
Workshop

Tuesday

A.M. Module 5 – Diagnostic Tools
Module 6 – SunVTS
Workshop
Module 7 – SunSolve
Workshop
P.M. Workshops

Wednesday

A.M. Module 8 – Kernel Core Analysis
Workshop
P.M. Workshops

Thursday

Workshops all day

Friday

Workshops and review

Attention VARs

If you are an SMCC authorized reseller taking this course for
Competency 2000 certification credit, the Drake tests and certification
specifications for this course revision are being upgraded.

● Retain your signed course certificate.

● Have the instructor initial your completed lab projects and fault
forms.

● Schedule yourself to take the appropriate Drake test by contacting:

John Shedaker
Sun Microsystems
2550 Garcia Ave., MS UMIL06-01
Mountain View, CA 94043

Fax Number (408)-945-9476
Phone Number (408)-276-1315

Typographical Conventions and Symbols

The following table describes the type changes and symbols used in
this book.

Typeface or
Symbol       Meaning                        Example

AaBbCc123    The names of commands,         Edit your .login file.
             files, and directories;        Use ls -a to list all files.
             on-screen computer output      system% You have mail.

AaBbCc123    What you type, contrasted      system% su
             with on-screen computer        Password:
             output

AaBbCc123    Command-line placeholder:      To delete a file, type
             replace with a real name       rm filename.
             or value

AaBbCc123    Book titles, new words or      Read Chapter 6 in User’s Guide.
             terms, or words to be          These are called class options.
             emphasized                     You must be root to do this.

Code samples are included in this book and may display the following:

Prompt Type                                 Example

C shell prompt                              system%
Superuser prompt, C shell                   system#
Bourne and Korn shell prompt                $
Superuser prompt, Bourne and Korn shells    #

Contents

About This Course.......................................................................................iii


Course Prerequisites........................................................................... iv
Course Objectives................................................................................. v
Day-to-Day Schedule...........................................................................vi
Attention VARs ................................................................................. vii
Typographical Conventions and Symbols ................................... viii
Fault Analysis and Diagnosis..................................................................1-1
Introduction ....................................................................................... 1-2
Eight Steps of Fault Analysis and Diagnosis ................................ 1-3
Fault Analysis ............................................................................1-3
Diagnosis ....................................................................................1-3
Stating the Problem........................................................................... 1-4
Guidelines for a Problem Statement ......................................1-4
Feedback, and Checking the Problem Statement .................1-4
Describing the Problem.................................................................... 1-5
Listing All Observed Facts.......................................................1-5
Establishing Comparative Facts..............................................1-7
Identifying Differences..................................................................... 1-9
Guidelines for Identifying Differences ..................................1-9
Listing Relevant Changes .............................................................. 1-10
Guidelines to Analyze Relevant Changes ...........................1-10
Feedback to Check Relevant Changes .................................1-10
Generating Likely Causes .............................................................. 1-11
Testing Likely Causes..................................................................... 1-12
Taking Action to Correct the Fault ............................................... 1-14
Fault Analysis Example Worksheet (1 of 3) ................................ 1-15
Likely Causes...........................................................................1-17
Verifying Testing Likely Causes ...........................................1-17
Final Repair..............................................................................1-17
Exercise 1 – Solving Host C ........................................................... 1-18
Exercise 2 – Solving Host B............................................................ 1-19
Exercise 3 – Solving Host A........................................................... 1-20
Skills Checklist................................................................................. 1-40

Error Detection Overview ........................................................................2-1
Introduction ....................................................................................... 2-2
Error Types ........................................................................................ 2-3
Error Reporting Mechanisms .......................................................... 2-4
Bus Errors...................................................................................2-4
Interrupts for Reporting...........................................................2-4
Resets ..........................................................................................2-4
Type of Errors .................................................................................... 2-5
Software Errors..........................................................................2-5
Hardware-Corrected Errors ....................................................2-5
Recoverable Errors....................................................................2-5
Fatal Errors.................................................................................2-5
CPU Watchdog Reset ...............................................................2-6
System Watchdog Reset ...........................................................2-6
Critical Errors ............................................................................2-6
Primary Buses.................................................................................... 2-7
Sun-4u ................................................................................................. 2-8
Memory Management Unit (MMU)............................................... 2-9
Number Base Conversion Chart ................................................... 2-10
Page Table Entry – Sun-4 Architecture ........................................ 2-11
Sun-4 PTE Format ...................................................................2-12
Examples of Valid PTEs .........................................................2-12
Page Table Entry – Sun-4c Architecture ...................................... 2-13
Sun-4c PTE Format .................................................................2-14
Examples of Valid PTEs .........................................................2-14
Page Table Entry – Sun-4m Architecture .................................... 2-15
Access Code .............................................................................2-16
Examples of Valid PTEs .........................................................2-16
Page Table Entry – Sun-4d Architecture...................................... 2-17
Access Code .............................................................................2-18
Example of Valid PTEs...........................................................2-18
Sun-4 Error Detection Workshop ................................................. 2-19
Sun-4c Error Detection Workshop................................................ 2-22
Example 1 .................................................................................2-23
Example 2 .................................................................................2-26
Sun-4m Error Detection Workshop.............................................. 2-27
Example 1 .................................................................................2-28
Example 2 .................................................................................2-31
Sun-4d Error Detection Workshop ............................................... 2-32
Example 1 .................................................................................2-33
Example 2 .................................................................................2-36
Skills Checklist................................................................................. 2-37
System Fault Status Register (sfsr) Format .......................2-41
POST Diagnostics ......................................................................................3-1
Diagnostics Overview ...................................................................... 3-2

Boot PROM POST .....................................................................3-3
SunVTS Diagnostics..................................................................3-3
POST Viewing Methods................................................................... 3-4
Viewing POST From the CPU Board LEDs
(older systems) ..........................................................................3-4
Viewing POST With a Serial Port Terminal ..........................3-4
Viewing POST Using the tip hardwire Command .................. 3-5
POST Example Using tip – SPARCstation 5 .................................. 3-6
Machine Information ................................................................3-6
POST Example Using tip – SS1000................................................ 3-9
Diagnostics Output...................................................................3-9
POST Diagnostic Workshop Using TIP ....................................... 3-20
Using Terminal Interface Protocol (TIP) for Remote
Diagnostics............................................................................3-20
Using tip to Observe POST Diagnostics ............................3-21
POST tip Commands .................................................................... 3-25
OBP Diagnostics and Commands...........................................................4-1
Functions and Capabilities of the OpenBoot PROM (OBP)........ 4-3
Features ......................................................................................4-3
OpenBoot PROM............................................................................... 4-4
NVRAM Contents – System Variable Parameters ....................... 4-5
SPARCstation 20 Workstation ................................................4-5
Diagnostic Overview ........................................................................ 4-6
Default Boot Sequence...................................................................... 4-8
OBP Device Tree Navigation – SPARCstation 1000 System....... 4-9
OBP User Diagnostics and Commands – SS1000 ....................... 4-10
OBP User Diagnostics and Commands – SS20 ........................... 4-13
Lab 1.................................................................................................. 4-15
Lab 2.................................................................................................. 4-17
Resetting to the Defaults ........................................................4-19
Optional....................................................................................4-19
Lab 3.................................................................................................. 4-21
Lab 4.................................................................................................. 4-22
Lab 5 – Optional............................................................................... 4-25
Diagnostic Tools.........................................................................................5-1
Diagnostic Tools, Functions and Uses ........................................... 5-2
Open Discussion ............................................................................... 5-4
SunVTS System Diagnostics ...................................................................6-1
Introduction ....................................................................................... 6-2
Hardware and Software Requirements .................................6-2
The SunVTS Architecture ................................................................ 6-3
User Interfaces ................................................................................... 6-4
Kernel..........................................................................................6-4
Hardware Tests .........................................................................6-4

Additional References ..............................................................6-4
Installing SunVTS Software............................................................. 6-5
The SunVTS Graphical User Interface ........................................... 6-7
Selecting and Setting Up Tests ........................................................ 6-9
SunVTS Testing Options ................................................................ 6-10
Tests Switch ..................................................................................... 6-12
Option Files..............................................................................6-12
Running the SunVTS Tests ............................................................ 6-13
System Status Panel ................................................................6-13
Test Status Panel .....................................................................6-14
Performance Monitor Panel...................................................6-15
Reviewing SunVTS Test Results ................................................... 6-17
System Status Panel ................................................................6-17
Console Window Messages...................................................6-17
Log Files ...................................................................................6-18
Using SunVTS in TTY Mode ......................................................... 6-19
Negotiating the SunVTS TTY Interface ....................................... 6-20
Using SunVTS Remotely................................................................ 6-21
Kernel Interface .......................................................................6-21
User Interface...........................................................................6-21
Lab Overview .................................................................................. 6-24
Lab Objectives..........................................................................6-24
Equipment................................................................................6-24
Lab Tasks...........................................................................................6-25
SunSolve ......................................................................................................7-1
Overview ............................................................................................ 7-3
Distribution........................................................................................ 7-4
SunSolve Online Account ................................................................ 7-5
Installing SunSolve ........................................................................... 7-6
Installing SunSolve Using File Manager ...............................7-7
Installation GUI Window.........................................................7-8
Sharing SunSolve ....................................................................7-10
Starting SunSolve ............................................................................ 7-11
Starting From an Installed Server .........................................7-11
Starting From the CD-ROM...................................................7-12
The SunSolve Window...........................................................7-12
Search Tool....................................................................................... 7-13
Configuring SunSolve ............................................................7-14
SearchTool Properties.............................................................7-15
Troubleshooting Using SearchTool .............................................. 7-16
Setting Up the Search .............................................................7-16
Keyword Logical Connectors................................................7-16
Starting the Search ..................................................................7-17
Datasets and Collections to Search............................................... 7-18
Viewing Documents Found........................................................... 7-19

Patches .............................................................................................. 7-20
Displaying the Current Patch Report...................................7-20
Displaying Installed Patches .................................................7-23
Installing Recommended or Suggested Patches.................7-24
Installing a Specific Patch ......................................................7-25
Removing a Specific Patch.....................................................7-26
SunSolve Labs.................................................................................. 7-27
Optional – Basic Search Techniques............................................. 7-29
Using MultiView............................................................................. 7-35
Document Formats ......................................................................... 7-36
Setting MultiView Properties........................................................ 7-37
Displaying a Document in MultiView......................................... 7-39
MultiView Features ........................................................................ 7-41
The File Menu..........................................................................7-41
Kernel Core Dump Analysis....................................................................8-1
Introduction ....................................................................................... 8-2
Header Files ....................................................................................... 8-3
Debuggers .......................................................................................... 8-4
adb...............................................................................................8-4
crash ..........................................................................................8-4
kadb.............................................................................................8-4
SAVECORE Setup................................................................................. 8-5
Invoking adb/kadb/crash.............................................................. 8-6
adb...............................................................................................8-6
crash ..........................................................................................8-6
kadb.............................................................................................8-6
adb Commands ................................................................................ 8-7
adb Macros and Commands............................................................ 8-8
Display and Control Commands............................................8-9
adb Macros.................................................................................8-9
adb Macros....................................................................................... 8-10
Kernel Core Dump Analysis ......................................................... 8-11
Kernel Dump Analysis – adb – SC2000 Example....................... 8-12
Using adb to Analyze a Kernel Core Dump .......................8-12
Kernel Dump Analysis – adb – SPARC 5 Example.................... 8-22
Using adb to Analyze a Kernel Core Dump .......................8-22
crash Help Menu ........................................................................... 8-32
Commonly Used crash Commands............................................ 8-33
Kernel Dump Analysis – crash – SC2000 ................................... 8-34
Kernel Crash Dump Analysis Workshop 1................................. 8-36
Kernel Crash Dump Analysis Workshop 2A...............................8-37
Kernel Crash Dump Analysis Workshop 2B ...............................8-38
Kernel Crash Dump Analysis Workshop 3................................. 8-39
Kernel Crash Dump Analysis Workshop 4 (Sheet 1 of 8) ......... 8-41
kadb Workshop Introduction................................................8-41

kadb Description .....................................................................8-42
Invoking and Exiting kadb ....................................................8-43
Mapping UNIX Data Structures ...........................................8-44
Related Data Structures..........................................................8-46
Kernel Crash Dump Analysis Workshop 5................................. 8-49
Kernel Crash Dump Analysis Workshop 6................................. 8-51
Kernel Crash Dump Analysis Workshop 7................................. 8-52
Watchdog Reset Workshop 8 (Sheet 1 of 2) Optional................ 8-53
Bug Install ................................................................................8-53
Program Debugging – Optional ................................................... 8-55
ps Command Workshop 9—Optional ......................................... 8-57
Introduction .............................................................................8-57
Sequence of Procedures (Do Not Execute).................................. 8-58
Setting Base Level ...................................................................8-58
Acquiring Base-Level Information .......................................8-58
Tracing OpenWindows Processes ........................................8-58
Setting Base Level ........................................................................... 8-59
Acquiring Base-Level Information ............................................... 8-60
Base-Level Processes (1 of 2) ......................................................... 8-61
Tracing OpenWindows Processes ................................................ 8-63
Workshop Summary Exercise ....................................................... 8-64
Skills Checklist................................................................................. 8-66
Fault Tracker Progress Chart ..................................................................A-1
Fault Worksheets - Student Guide ........................................................ B-1
Requirements............................................................................ B-1
Resources................................................................................... B-1
System Configurations ............................................................ B-1
Fault Worksheet #1 - Blank Monitor ............................................. B-2
Fault Worksheet #2 - Device Error During Boot.......................... B-3
Fault Worksheet #3 - File Errors During Boot ............................. B-4
Fault Worksheet #4 - Incomplete Boot to Solaris
Operating System.......................................................................... B-5
Fault Worksheet #5 - Login Problem............................................. B-6
Fault Worksheet #6 - adb Macro Error.......................................... B-7
Fault Worksheet #7 - Feckless ........................................................ B-8
Fault Worksheet #8 - Incomplete Boot to Solaris
Fault Worksheet #9 - Turn the Page ............................................ B-10
Fault Worksheet #10 - Login Problem......................................... B-11
Fault Worksheet #11 - Network Problem ................................... B-12
Fault Worksheet #12 - OpenWindows Problem ........................ B-13
Fault Worksheet #13 - Shutdown When Opening
Windows ...................................................................................... B-14
Fault Worksheet #14 - Network Printer Problem...................... B-15
Fault Worksheet #15 - Incomplete Boot to Solaris
Operating System........................................................................ B-16

Fault Worksheet #16 - Constant Reboot, Halt, or
Power Off Problem ..................................................................... B-17
Fault Worksheet #17 - The ps Command Returns
Nothing......................................................................................... B-18
Fault Worksheet #18 - NIS or NIS+ Network Problem ............ B-19
Fault Worksheet #19 - Network Problem ................................... B-20
Fault Worksheet #20 - OpenWindows Problem ........................ B-21
Fault Worksheet #21 - Banner Logo Has Been Changed.......... B-22
Fault Worksheet #22 - Do Not Tread on Me .............................. B-23
Fault Worksheet #23 - vi Editor Problem .................................. B-24
Fault Worksheet #24 - “Hacker” Intrudes the System.............. B-25
Fault Worksheet #25 - No OpenWindows Environment ......... B-26
Fault Worksheet #26 - Login Problem......................................... B-27
Fault Worksheet #27 - “Hangs” on Boot..................................... B-28
Fault Worksheet #28 - No Network............................................. B-29
Fault Worksheet #29 - Where It Is At .......................................... B-30
Fault Worksheet #30 - Seedy ROM.............................................. B-31
Fault Worksheet #31 - See It Now ............................................... B-32
Fault Worksheet #32 - Cannot Log In as Root ........................... B-33
Fault Worksheet #33 - No Network or Interface ....................... B-34
Fault Worksheet #34 (Sheet 1 of 4) - Script “Hangs”
System........................................................................................... B-35
Fault Worksheet #35 - No shcat ................................................. B-39
Fault Worksheet #36 - Login Problem......................................... B-40
Fault Worksheet #37 - Noel Two ................................................. B-41
Fault Worksheet #38 - Client-Server ftp Problem .................... B-42
Fault Worksheet #39 - Network Problem ................................... B-43
Fault Worksheet #40 (Sheet 1 of 6) - Slow and Fast
Perceptions................................................................................... B-44
Fault Worksheet #41 - Cannot Boot Diskless Client ................. B-50
Fault Worksheet #42 - Logs Out During OpenWindows
Startup .......................................................................................... B-51
Fault Worksheet #43 - Sorry User ................................................ B-52
Fault Worksheet #44 - No Window, Use SunSolve................... B-53
Fault Worksheet #45 - NIS+ Password ....................................... B-54
Fault Worksheet #46 - Let Me In.................................................. B-55
Fault Worksheet #47 - RPC Not Registered ............................... B-56
Fault Worksheet #48 - Slow Access to Data ............................... B-57
Fault Worksheet #49 - Trust Me................................................... B-58
Fault Worksheet #50 - Cannot Talk to Machine A .................... B-59
Fault Worksheet #51 - Not On This Network ............................ B-60
Fault Worksheet #52 - Do Not Point At Me ............................. B-61
Fault Worksheet #53 - Resource Temporarily
Unavailable .................................................................................. B-62

Contents vii
Copyright 1997 Sun Microsystems, Inc. All Rights Reserved. SunService May 1996
Install Alternate Boot Block....................................................................C-1
Installing an Alternate Boot Block ................................................. C-2

Fault Analysis and Diagnosis 1

Objectives
Upon completion of this module, you will be able to:

● Use an organized total system approach for fault analysis and
diagnosis.

● Use the fault analysis worksheet to gather and document facts.

● Communicate to other technical people the details and status of
system faults.

References
Alamo Learning Systems AdvantEdge Analysis Program


Introduction

Fault analysis and diagnosis is an efficient and reliable method to
isolate and repair Sun™ system faults using a two-stage process:

● Fault analysis - Organizes fact gathering and comparisons.

● Diagnosis - Organizes the actual discovery, testing, repair, and
reporting of the problem.

You may be an expert. With the expert approach, you gather data and
use your experience and the experience of others to determine causes.
Fault analysis and diagnosis provides you with a powerful tool to
analyze data and focus on the likely causes of a complex problem or a
problem outside of your immediate experience.

Keeping notes in the fault analysis format enhances communication
about the status of a problem.


Eight Steps of Fault Analysis and Diagnosis

Fault Analysis
1. State the problem.

2. Describe the problem.

3. Identify differences.

4. List relevant changes.

Diagnosis
5. Generate likely causes.

6. Test likely causes.

7. Verify the most likely cause.

8. Take action to correct the fault.

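Figure: Diagnosis flow. Generate a likely cause and test it. If the
test fails (No), go to the next likely cause. If the test passes (Yes),
verify the likely cause, and then take corrective action.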
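
The diagnosis steps (5 through 8) can be sketched as a simple loop. This is only an illustration of the flow, not part of the course material: the function name diagnose, the test and verify callbacks, and the sample causes are hypothetical. The sketch also assumes that a cause that passes testing but fails verification sends you on to the next likely cause, which the flow itself does not state.

```python
from typing import Callable, Iterable, Optional


def diagnose(
    likely_causes: Iterable[str],
    test: Callable[[str], bool],
    verify: Callable[[str], bool],
) -> Optional[str]:
    """Return the first likely cause that passes both testing and verification."""
    for cause in likely_causes:      # step 5: likely causes, most likely first
        if not test(cause):          # step 6: cause fails to explain the facts
            continue                 # "No" branch: go to the next likely cause
        if verify(cause):            # step 7: confirm it is the actual cause
            return cause             # step 8 (corrective action) follows
    return None                      # nothing verified: generate more causes


# Hypothetical sample data, not taken from the course material.
explains_facts = {"loose cable": False, "bad driver": True}
found = diagnose(
    ["loose cable", "bad driver"],
    test=lambda cause: explains_facts[cause],
    verify=lambda cause: cause == "bad driver",
)
print(found)
```

A result of None corresponds to returning to the Generate Likely Causes step with a new hypothesis.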


Stating the Problem

Given a system problem, identify the object and its defect, and write a
problem statement. A problem statement answers these questions:

● What object, device, or subsystem exhibits the problem?

● What is wrong? What is the defect or deviation from the standard?

The following is an example of a problem statement:

The printer Grumpy will not print.

The object, printer Grumpy, has a defect (deviation from the
standard): it will not print.

Guidelines for a Problem Statement


● Identify the exact object with the exact defect.

● Be certain that the cause is not already known.

● Limit the problem statement to a single object and a single defect.

Feedback to Check the Problem Statement


● Does the statement clearly identify the exact object with the
problem?

● Does the statement state the exact deviation from the norm?

● Is the cause of the problem unknown?

Most bugs that become a disaster happen because the original problem
is not described correctly.


Describing the Problem

The next step in system fault analysis and diagnosis is to describe the
problem in detail.

1. List all observed facts.

2. Establish comparative facts. To help isolate the likely cause, ask
what could be wrong but is not.

3. Identify the unique elements of the problem.

Listing All Observed Facts

Questions to Ask

● Who observed the problem?

● What is the problem?

● Where is the problem observed?

● What is the magnitude or size of the problem?

Expand and customize a question list for your own style and
environment.

List all observed facts here. Do not discard any as irrelevant.
Discarding facts, or not looking for more facts, at this stage is a
common mistake.

_______________________________________________________________

_______________________________________________________________

_______________________________________________________________

_______________________________________________________________


Fact Sources

● Customer complaints

● Customer interviews. Use the questions listed above. Expand and
customize the question list for your own style and environment.

● Interviews of others involved

● Diagnostics, other run levels, changed environments, and
operating system levels

● Dumps

What are other sources from your own experience?

List all additional observed facts here. Do not discard any as
irrelevant. Discarding facts, or not looking for more facts, at this stage
is a common mistake.

_______________________________________________________________

_______________________________________________________________

_______________________________________________________________

_______________________________________________________________


Establishing Comparative Facts

Questions to Ask

● What similar object might have this defect but does not?

● What other defect could you see on the problem object but do not?

● Where else in this system environment or other environments
might you expect to see the defective object but do not?

● Where else on the problem object could you see the defect but do
not?

● When could the defect have been first observed but was not?

● What other time in the object’s life cycle could the defect have
occurred but did not?

● In what other pattern could the defect have occurred but did not?

● How much of the problem object could be defective but is not?

● How many of the objects might have been defective but are not?

● What other trend could have been observed but was not?


Reviewing Comparative Facts

Try to answer “yes” for each of the following questions:

● Are the comparative facts as close and similar to the observed facts
as possible and yet not complete opposites?

● Are the comparative facts problem-free themselves?

● Does the first fact compare the problem to other objects?

● Do all other facts compare the problem to itself?

● Are the facts the most logical and reasonable comparisons?


Identifying Differences

Use the lists of observed facts and comparative facts to analyze and list
the differences.

Guidelines for Identifying Differences


● Focus on one set of observed and comparative facts at a time.

● List only the differences that are unique between the observed and
comparative facts.

For example, what is the difference between System A (problem
object) and System B (operational object)?

System A is running the Solaris™ 2.5 operating environment and
the NIS+ software, and it is installed on the network using a
10BASE-T Ethernet connection.

System B is running the Solaris 2.5 operating environment and the
NIS software, and it is installed on the network using a 10BASE-5
Ethernet connection.

● State the facts and differences, not opinions or conclusions. For
example, stating that the NIS+ software on System A is a flawed
service is an opinion, but to say that the NIS+ software on System
A is a different service is stating a fact.

● Analyze observed and comparative facts for contrasts. For
example, state that System A is an NIS+ software client and that
System B is an NIS software server.

● Many observed and comparative facts show no contrast, so just
note them as no difference.

● Look for distinctions between systems, such as hardware mix,
system load, surrounding temperature, patches applied, and so on.

● Keep completed fault analysis worksheets to compile your own
lists.
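
As an illustration of listing only the unique differences, the System A and System B example above can be written as two small records and compared attribute by attribute. This is a minimal sketch: the function name list_differences and the attribute keys (os, name_service, ethernet) are invented here for illustration, and only the values come from the example.

```python
def list_differences(problem: dict, reference: dict) -> dict:
    """Map each attribute to (problem value, reference value) where they differ."""
    keys = sorted(set(problem) | set(reference))
    return {
        key: (problem.get(key), reference.get(key))
        for key in keys
        if problem.get(key) != reference.get(key)
    }


# Values from the System A / System B example; the key names are illustrative.
system_a = {"os": "Solaris 2.5", "name_service": "NIS+", "ethernet": "10BASE-T"}
system_b = {"os": "Solaris 2.5", "name_service": "NIS", "ethernet": "10BASE-5"}

differences = list_differences(system_a, system_b)
for attribute, (a_value, b_value) in differences.items():
    print(f"{attribute}: System A = {a_value}, System B = {b_value}")
```

The operating environment is the same on both systems, so it is noted as no difference and does not appear in the result.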


Listing Relevant Changes

A change can cause or identify a problem. The differences between
observed and comparative facts can identify changes. Determine if the
changes are relevant to the problem.

Guidelines to Analyze Relevant Changes


● Examine the differences and ask what, if anything, has changed.

● Describe each relevant change and the date or time of its
occurrence. Examples include:

   ● Power supply was upgraded Friday night.

   ● System administrator added four clients Thursday afternoon.

● A relevant change that happened before the problem occurred
could be a likely cause; one that happened after the problem
occurred can be ruled out.
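
The timing rule in the last guideline can be sketched as a filter over dated changes. This is an illustration only. The dates, the third change ("Patch applied"), and the time the problem was first seen are hypothetical, and the sketch assumes that a change recorded at the same moment as the problem is kept as a candidate rather than ruled out.

```python
from datetime import datetime


def split_by_timing(changes, problem_first_seen):
    """Return (candidates, ruled_out) based on when each change happened."""
    candidates = [name for name, when in changes if when <= problem_first_seen]
    ruled_out = [name for name, when in changes if when > problem_first_seen]
    return candidates, ruled_out


# Hypothetical dates chosen so the weekdays match the examples above.
changes = [
    ("Power supply upgraded", datetime(1996, 5, 3, 22, 0)),  # Friday night
    ("Four clients added", datetime(1996, 5, 2, 14, 0)),     # Thursday afternoon
    ("Patch applied", datetime(1996, 5, 6, 9, 0)),           # after the problem
]
problem_first_seen = datetime(1996, 5, 4, 8, 0)

candidates, ruled_out = split_by_timing(changes, problem_first_seen)
print("Could be a likely cause:", candidates)
print("Ruled out:", ruled_out)
```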

Feedback to Check Relevant Changes


● Does each relevant change represent something new or unusual
about a difference?

● What alterations or improvements do the relevant changes reflect?

● Does each relevant change identify the timing relative to the
problem; for example, one month before, two days after, or the
same time?


Generating Likely Causes

First you use differences and relevant changes to discover likely causes
of the problem. Then you form a hypothesis about the cause, and
analyze the problem with facts, differences, and relevant changes.
Then you can diagnose the problem.

State your hypothesis in the form of a question and an answer that can
be tested. For example:

How could the fault analysis element have caused this problem?

The answer (hypothesis) is that changing A may cause B.

For the fault analysis element, insert one of the following possibilities:

● A relevant change

● Two or more relevant changes

● A relevant change and a difference

● A single difference

The following is an example:

The problem is slow system response.

A relevant change is that the system administrator added four
new users last Friday.

The answer (hypothesis) is that additional users can cause swap
space to be insufficient to prevent thrashing and can cause a slow
system response.

You can develop as many hypotheses as you have facts. Use your
experience and judgement to limit, initially, the list to the most logical
and likely cause(s). If your first hypothesis does not prove true, you
can return to this step.


Testing Likely Causes

Using the list of likely causes, test each one to determine the most
likely cause. Testing your likely causes increases the certainty that you
will discover the actual cause of the problem before you embark on or
recommend a potentially costly, time-consuming solution.

To test for the most likely cause, eliminate any cause that fails to
explain the observed and comparative facts.

Eliminate a likely cause only when you are certain it cannot be the true
cause of the problem.

Test each likely cause separately using the fault analysis worksheets.
Ask yourself whether the cause can support the facts, and mark a Y for
yes or N for no on the line under the fact number. For example:

Relevant Fact #2 – Smoke from system

Likely Causes          Facts:  1    2    3    4    5    6

New Power Supply                    Y

Migrate to 2.2                      N

Test each likely cause against each relevant fact and mark it Y or N. If
you must make an assumption or have a doubt about an answer, mark
it with a question mark (?). If you simply cannot make a
determination, leave it blank.

Test your hypothesis aggressively. Try to eliminate a likely cause
logically. Prove that it cannot possibly be the actual cause. Be careful
not to change any observed or comparative facts to support a likely
cause hypothesis, especially in hurried or stressful situations.
Changing facts can make the problem worse rather than solve it.

Closely examine any doubtful answers and investigate assumed facts.
Make sure to fill in any blanks.


Verifying the Most Likely Cause

Now you are ready to verify, test, and prove that the most likely cause
is the actual cause of the problem.

To verify the most likely cause, use the method that is:

● Least disruptive

● Least expensive

● Least time-consuming

● Most conclusive

Verifying the most likely cause should remove all uncertainty about
the cause of a problem. Three methods that verify the most likely
cause of the problem include:

● Factual and logical – This is based on information gathered on the
fault analysis worksheet and on past experience. This is the likely
cause that makes the most sense.

● Reality – The most likely cause must pass an experiment to show
conclusively that it is or is not the cause. For example, try a new
driver without overwriting the old one. This provides a quick,
nondisruptive verification with good, but not complete,
conclusiveness.

● Results – Assume, without proof, that the most likely cause you
choose is the actual cause, and take the indicated corrective action.
This is the least conclusive verification, and it can be disruptive,
expensive, and time-consuming, especially if your assumptions
are not correct.

Note – No one method is presented as better than another for all
problems. Each method has its strengths and weaknesses, and you
must choose the method most appropriate for your situation.


Taking Action to Correct the Fault

1. Complete the repair.

2. Test and verify the repair.

3. Obtain confirmation and acceptance.

Fault Analysis Example Worksheet (1 of 3)

Problem Statement

Window system hangs on systems using the GX+ video frame buffer.

1. What object (system) is defective?
   Observed facts: Six systems using ss2GX+ video frame buffer
   Comparative facts: Not on other Sun machines on this site, but on
   other Sun machines elsewhere
   Differences: Location and environment, temperature, humidity,
   dirt, power, static

2. What exactly is wrong?
   Observed facts: System "hangs" but can remote login
   Comparative facts: System does not crash or freeze; can sometimes
   fix by removing or inserting mouse
   Differences: Operating system is still running; power cycle of
   mouse may be related

3. Where is the object (system) located?
   Observed facts: Acme Industries; factory control units in
   manufacturing plant
   Comparative facts: Not at other Acme sites, other customers, or
   office environment
   Differences: Environment, network, vibration

4. Where on the object (system) does the defect appear?
   Observed facts: Not guaranteed repeatable, but window system most
   often affected
   Comparative facts: No documented hang at OBP monitor or single user
   Differences: Window system uses mouse and full resolution of GX+
   color

5. When was the defect first observed?
   Observed facts: Call logged 1/18, problem has been ongoing for a
   while
   Comparative facts: Not right after delivery of systems using GX+
   video frame buffers on 12/12
   Differences: Happening more often during busy periods

6. When in the life cycle was the defect noticed?
   Observed facts: Five weeks after installation
   Comparative facts: Not when system was brand new
   Differences: New hardware; bedded in

7. What is the pattern of occurrence?
   Observed facts: Random
   Comparative facts: No cyclic or continuous pattern

8. How much of the object (system) is defective?
   Comparative facts: Does not affect bootup or rlogin

9. How many objects (systems) are defective?
   Observed facts: One group of six
   Comparative facts: Not all Sun workstations

10. What is the trend?
   Observed facts: Worse, more frequent
   Comparative facts: Not getting better or stable


Fault Analysis Example Worksheet (2 of 3)

1. What object (system) is defective?
   Relevant change: The systems using the GX+ video frame buffer after
   the last quarterly anti-static treatment was completed.
   Date: 12/12

2. What exactly is wrong?
   Date: 12/13–12/17

3. Where is the object (system) located?
   Relevant change: Installed onto existing network
   Date: 12/1

4. Where on the object (system) does the defect appear?
   Relevant change: Upgraded from OpenWindows Version 2 environment
   to Version 3 environment.

5. When was the defect first observed?
   Date: 12/13–12/17

6. When in the life cycle was the defect noticed?
   Date: 12/27

7. What is the pattern of occurrence?

8. How much of the object (system) is defective?

9. How many objects (systems) are defective?

10. What is the trend?

Fault Analysis Example Worksheet (3 of 3)

Likely Causes

Likely Cause                                    1   2   3   4   5   6   7   8   9   10
1 GX+ video frame buffer design or build fault  Y   N   N   ?   Y   N?  Y   -   Y   Y?
2 Environment (static)                          Y   Y   Y   ?   Y   Y   Y   -   Y   Y
3 Application local to site                     Y   N   Y   Y   Y   N   ?   -   Y   Y?
4 Keyboard and mouse, CPU, UART                 Y   Y   N   Y   N   Y   Y   -   N   Y

Verifying and Testing Likely Causes

Likely Cause Test Results


Environment

Keyboard or mouse UART

Application

Design

Final Repair
Environment static created the problem.
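
The likely-cause matrix in this example worksheet can be checked mechanically. The sketch below is an illustration, not part of the worksheet method itself: it treats only a plain N as a firm contradiction, treats any mark containing a question mark as doubtful, and ignores the dash in column 8. Those interpretations, and the names MATRIX and summarize, are assumptions made here.

```python
# Marks for facts 1 through 10, copied from the example worksheet.
MATRIX = {
    "GX+ video frame buffer design or build fault":
        ["Y", "N", "N", "?", "Y", "N?", "Y", "-", "Y", "Y?"],
    "Environment (static)":
        ["Y", "Y", "Y", "?", "Y", "Y", "Y", "-", "Y", "Y"],
    "Application local to site":
        ["Y", "N", "Y", "Y", "Y", "N", "?", "-", "Y", "Y?"],
    "Keyboard and mouse, CPU, UART":
        ["Y", "Y", "N", "Y", "N", "Y", "Y", "-", "N", "Y"],
}


def summarize(marks):
    """Return (facts firmly contradicted, facts still in doubt) as fact numbers."""
    contradicted = [i + 1 for i, mark in enumerate(marks) if mark == "N"]
    doubtful = [i + 1 for i, mark in enumerate(marks) if "?" in mark]
    return contradicted, doubtful


# A cause survives only if no fact firmly contradicts it.
surviving = [cause for cause, marks in MATRIX.items() if not summarize(marks)[0]]

for cause, marks in MATRIX.items():
    contradicted, doubtful = summarize(marks)
    print(f"{cause}: contradicted by {contradicted}, doubtful on {doubtful}")
print("Not eliminated:", surviving)
```

Under these assumptions, the only cause left standing is the environment (static), which agrees with the final repair recorded on the worksheet. The doubtful mark on fact 4 would still need investigating before verification.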


Exercise 1 – Solving Host C

Use the fault analysis worksheets at the end of this module to
determine the cause of the following problem.

The instructor is the user, and you can ask the instructor questions
about the problem.

The following is the premise:

The user has added three new hosts to the established network. The
hosts were installed after midnight just prior to a three-day holiday.
A matrix was generated that indicated which hosts were
communicating.

Figure: Two communication matrices, one for ping and one for rlogin,
each with rows and columns labeled Host A, Host B, and Host C. The
cell contents are not reproduced here.


Exercise 2 – Solving Host B

Use the fault analysis worksheets at the end of the module to
determine the cause of the following problem.

The instructor is the user, and you can ask the instructor questions
about the problem.

The following is the premise:

The user has added three new hosts to the established network. A
matrix was generated that indicated which hosts were
communicating.

Figure: Two communication matrices, one for ping and one for rlogin,
each with rows and columns labeled Host A, Host B, and Host C. The
cell contents are not reproduced here.


Exercise 3 – Solving Host A

Use the fault analysis worksheets at the end of the module to
determine the cause of the following problem.

The instructor is the user, and you can ask the instructor questions
about the problem.

The following is the premise:

The user has added three new hosts to the established network. A
matrix was generated that indicated which hosts were
communicating.

Figure: Two communication matrices, one for ping and one for rlogin,
each with rows and columns labeled Host A, Host B, and Host C. The
cell contents are not reproduced here.


Fault Analysis Worksheet (1 of 3)

Likely Causes

Likely Cause 1 2 3 4 5 6 7 8 9 10
1

Verifying and Testing Likely Causes

Likely Cause Test Results

Final Repair


Fault Analysis Worksheet (2 of 3)

Problem Statement

_______________________________________________________________

Problem Description     Observed Facts     Comparative Facts     Differences

1. What system is defective?

2. What exactly is wrong?

3. Where is the system located?

4. Where on the system does the defect appear?

5. When was the defect first observed?

6. When in the life cycle was the defect noticed?

7. Pattern of occurrence

8. How much of the system is defective?

9. Number of systems defective

10. Trend


Fault Analysis Worksheet (3 of 3)

Problem Description Relevant Changes Date

1. What system is defective?

2. What exactly is wrong?

3. Where is the system located?

4. Where on the system does the defect appear?

5. When was the defect first observed?

6. When in the life cycle was the defect noticed?

7. Pattern of occurrence

8. How much of the system is defective?

9. Number of systems defective?

10. Trend


System Fault Analysis Workshop (Short Form) - Sample

Initial Customer Complaint


Problem Statement

Error Symptoms/Conditions/Messages
● Observed facts (1)

● What exactly is wrong? (2)

● Where on the system does the defect appear? (4)

● How much of the system is defective? (8)

● Differences

● Relevant changes

● Comparative facts

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification

Skills Checklist

Skill                                                      Student     Instructor
                                                           Initials    Initials

Gather and document observed facts and place them in
the fault analysis worksheet matrix.                       ________    ________

Gather and document obtained information and place it
in the fault analysis matrix.                              ________    ________

Generate a list of likely causes based on facts within
the fault analysis matrix.                                 ________    ________

Develop a course of action to repair based on likely
causes.                                                    ________    ________

Error Detection Overview 2

Objectives
Upon completion of this module, you will be able to:

● Describe how different Sun architectures require different
techniques when attempting to troubleshoot system faults.

● Compare the primary bus architectures of Sun-4™, Sun-4c,
Sun-4d, Sun-4m, and Sun-4u machines.

● Describe the purpose of virtual addresses.

● Perform a manual virtual-address translation.

● Display and interpret internal error registers.

● Describe system error registers and their meanings and relate
them to fault analysis.

● Insert a system error using OpenBoot PROM (OBP) commands.

● Determine the cause of the system error using OBP commands.

References
The SPARC Architecture Manual - Version 8


Introduction

The fundamental error detection mechanisms for all architectures are
basically the same. As architecture design became more complex, the
error detection mechanism became more sophisticated, specifically in
the multiprocessor environment.

This module introduces you to the fundamental error detection
mechanism. You will perform predetermined tasks using OBP
commands.

The lab in this module will bring you back to an early level of
computer understanding and data manipulation – back to the 1’s and
0’s and register-bit mapping. The labs are architecture-dependent.


Error Types

● Error reporting mechanisms

   ● Bus errors

   ● Interrupts

   ● Resets

● Types of errors

   ● Software errors

   ● Hardware-corrected errors

   ● Recoverable errors

   ● Fatal errors

   ● Critical errors


Error Reporting Mechanisms

Bus Errors
Bus errors are issued to the processor when it makes a reference to
virtual or physical address space that cannot be satisfied for hardware
reasons. Some typical bus errors occur:

● During instruction fetch.

● On SBus direct virtual memory access (DVMA) read/write
operation.

● On synchronous/asynchronous data store.

● On memory management unit (MMU) operations.

Interrupts for Reporting
Interrupts are issued to the processor to notify it of external
conditions that are asynchronous to normal operation. Interrupts
indicate:

● Device done or ready.

● Error detected.

● Change in power status.

Resets
A reset attempts to bring the system to a well-known (deterministic)
state. Types of resets include:

● System

● Power on

● Watchdog

● System software


Types of Errors

Software Errors
Errors that do not originate in the hardware are classified as software
errors. All such errors are detected by the processor and are reported.
Examples of software errors are programming errors or bugs in the
system code.

Hardware-Corrected Errors
For error-logging purposes, hardware-corrected errors are always
signaled by an interrupt. No recovery action is normally required. For
example, a single-bit memory error is corrected by the error checking
and correcting (ECC) logic and is reported in the error log.

Recoverable Errors
Recoverable errors caused by hardware are usually signaled by a bus
error indication to the requesting device and a specified interrupt
(which could broadcast the error). Error recovery is normally handled
by the trap routines, while error logging is done by the interrupt
handler. A nonessential device losing power or becoming inaccessible
is an example of a recoverable error.

Fatal Errors
All fatal errors initiate a system-watchdog reset. Fatal errors
correspond to hardware errors in which proper system operation
cannot be guaranteed. Parity errors on backplanes are an example of a
fatal error.


CPU Watchdog Reset
A CPU watchdog reset is initiated when a trap condition occurs while
traps are disabled and the no fault (NF) bit in the MMU control
register is not set. The CPU branches to a reserved physical address.

System Watchdog Reset
When a fatal error is detected, a system watchdog reset is initiated. A
system watchdog reset affects all CPUs and I/O devices. Writes in
progress may be lost, but the state of main memory is not altered and
continues to be refreshed after a system watchdog reset.

Critical Errors
Critical errors require immediate system shutdown and power-off.
They are signaled through a high-level broadcast interrupt whenever
possible. Types of critical errors include:

● AC/DC failure

● Temperature warning

● Fan failure


Primary Buses

The following table lists the major buses supported by each Sun
architecture. Not shown is the onboard I/O (OBIO) bus, present in all
architectures, which connects to chips on the system board such as the
serial controllers.

Architecture    Buses

Sun-4u          UPA, SBus
Sun-4d          XDBus, SBus
Sun-4c          SBus
Sun-4m          VME (600 models only), MBus, SBus
Sun-4           VME

Sun Architecture

Architecture    Model

Sun-4           4/330, 4/370, 4/390, 4/470, 4/490
Sun-4c          SS1, SS1+, SS2, SLC, ELC, IPC, IPX
Sun-4m          SS5, SS10, SS20, 630, 670, 690, Classic, ClassicX, SSLX
Sun-4d          SC2000, SS1000
Sun-4u          140 ultra1, 170 creator, 170e creator3


Sun-4u

[Figure: Sun-4u (Ultra) architecture block diagram. Labeled blocks:
UltraSPARC, DRAM, bus multiplexor, UPA connector, Sysio, SBus, onboard
devices, and serial; labeled buses: address and UPA.]

SBus = Same specification
Address = Same as onboard I/O
UPA = Universal Port Architecture, 128-bit plus ECC packet-switching bus


Memory Management Unit (MMU)

The purpose of the MMU is to translate the virtual address, generated
by the executing code, to a physical address.

Virtual address -> MMU -> Physical address

The MMU contains page table entries (PTEs) that are loaded by kernel
code during normal process execution.

One PTE describes:

● The physical address

● Page referencing, if any

● Page modifications, if any

● The memory page access

● Page caching

One PTE is used for 4096 bytes of physical memory.

A valid PTE indicates that the virtual address has been mapped to a
physical page in memory.


Number Base Conversion Chart

The table below is used to code and decode PTEs:

Base (10) Base (2) Base (16)

0 0000 0
1 0001 1
2 0010 2
3 0011 3
4 0100 4
5 0101 5
6 0110 6
7 0111 7
8 1000 8
9 1001 9
10 1010 a
11 1011 b
12 1100 c
13 1101 d
14 1110 e
15 1111 f
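
The chart can also be applied mechanically. The following is a minimal
Python sketch, not part of the original course material, that expands a
hexadecimal PTE value into 4-bit groups, one group per row of the
chart. The function name hex_to_nibbles is hypothetical.

```python
def hex_to_nibbles(value: str, digits: int = 8) -> str:
    """Expand a hex string into space-separated 4-bit binary groups.

    `digits` is the number of hex digits to show (8 for a 32-bit PTE).
    """
    number = int(value, 16)
    bits = format(number, "0{}b".format(digits * 4))
    return " ".join(bits[i:i + 4] for i in range(0, len(bits), 4))


# A 32-bit value: each hex digit becomes one 4-bit group.
print(hex_to_nibbles("e0000003"))
# A shorter value shown with three hex digits.
print(hex_to_nibbles("39a", digits=3))
```

Reading the groups from left to right gives bit 31 down to bit 00,
which is the order used in the PTE format descriptions.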


Page Table Entry – Sun-4 Architecture

To map a virtual address to a physical address, a valid PTE must be
placed within the MMU by the operating system. The format for PTEs
is architecture-dependent.

The format of a valid PTE for Sun-4 architecture is:

● Bit 31 (PTE valid bit) – When set to one (1), the PTE is valid.

● Bit 30 (Write access bit) – When set to one (1), page has write
access.

● Bit 29 (System access bit) – When set to one (1), system access is
enabled for that page.

● Bit 28 (Do not cache bit) – When set to one (1), caching is disabled.

● Bits 27 and 26 – Define memory type.

● Bit 25 (Access bit) – When set to one (1), indicates page has been
accessed.

● Bit 24 (Modify bit) – When set to one (1), indicates page has been
modified.

● Bits 23–19 – Must be zero bits.

● Bits 18–00 – Physical page number.

Bit fields: 31 | 30 | 29 | 28 | 27–26 | 25 | 24 | 23–19 | 18–00


Sun-4 PTE Format


Bits 27 and 26 define memory type:

● 0 0 – Main memory

● 0 1 – I/O space

● 1 0 – VMEbus 16-bit access

● 1 1 – VMEbus 32-bit access

Examples of Valid PTEs

● e0000003 – Valid PTE, page is read/write, system access only,
cache enabled, main memory, physical page number 3

● 8c000003 – Valid PTE, page is read-only, user/system access,
cache enabled (bit 28 is zero), VMEbus 32-bit access, physical page
number 3

● d0000003 – Valid PTE, page is read/write, user/system access,
do not cache, main memory, physical page number 3
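
The decoding can be checked with a short program. The following Python
sketch is an illustration only, not a Sun tool. It applies the Sun-4
bit definitions listed above to a 32-bit PTE value; the names
decode_sun4_pte and MEMORY_TYPES are hypothetical.

```python
# Memory type encoded in bits 27 and 26 (Sun-4 format).
MEMORY_TYPES = {
    0b00: "main memory",
    0b01: "I/O space",
    0b10: "VMEbus 16-bit access",
    0b11: "VMEbus 32-bit access",
}


def decode_sun4_pte(pte: int) -> dict:
    """Split a Sun-4 PTE into the fields described in this module."""
    return {
        "valid": bool((pte >> 31) & 1),        # bit 31
        "write": bool((pte >> 30) & 1),        # bit 30
        "system_only": bool((pte >> 29) & 1),  # bit 29
        "dont_cache": bool((pte >> 28) & 1),   # bit 28
        "type": MEMORY_TYPES[(pte >> 26) & 0b11],  # bits 27-26
        "accessed": bool((pte >> 25) & 1),     # bit 25
        "modified": bool((pte >> 24) & 1),     # bit 24
        "page": pte & 0x7FFFF,                 # bits 18-00
    }


for value in (0xE0000003, 0xD0000003):
    print(format(value, "08x"), decode_sun4_pte(value))
```

The same function can be used to check the PTE values deposited in the
Sun-4 workshop later in this module.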


Page Table Entry – Sun-4c Architecture

To map a virtual address to a physical address, a valid PTE must be
placed within the MMU by the operating system. The format for PTEs
is architecture-dependent.

The PTE format for Sun-4c architecture is:

● Bit 31 (PTE valid bit) – When set to one (1), the PTE is valid.

● Bit 30 (Write access bit) – When set to one (1), page has write
access.

● Bit 29 (System access bit) – When set to one (1), system access is
enabled for that page.

● Bit 28 (Do not cache bit) – When set to one (1), caching is disabled.

● Bits 27 and 26 – Define memory type.

● Bit 25 (Access bit) – When set to one (1), indicates page has been
accessed.

● Bit 24 (Modify bit) – When set to one (1), indicates page has been
modified.

● Bits 23–19 – Must be zero bits.

● Bits 18–00 – Physical page number.

Bit fields: 31 | 30 | 29 | 28 | 27–26 | 25 | 24 | 23–19 | 18–00


Sun-4c PTE Format


Bits 27 and 26 define memory reference:

● 0 0 – Onboard main memory

● 0 1 – I/O physical

● 1 0 – I/O physical

● 1 1 – I/O physical

Examples of Valid PTEs

● e0000003 – Valid PTE, page is read/write, system access only,
cache enabled, onboard main memory, physical page number 3

● 98000003 – Valid PTE, page is read-only, user/system access,
do not cache, I/O space, physical page number 3


Page Table Entry – Sun-4m Architecture

To map a virtual address to a physical address, a valid PTE must be
placed within the MMU by the operating system. The format for PTEs
is architecture-dependent.

The PTE format for Sun-4m architecture is:

● Bits 00 and 01 (Entry type) – Binary 10 (bit 01 set, bit 00 clear)
is required to indicate a valid PTE.

● Bits 02–04 – Access code.

● Bit 05 – Reference bit indicator.

● Bit 06 – Modify bit indicator.

● Bit 07 (Cache entry) – When set to one (1), caching is enabled.

● Bits 08–31 – Physical page number.

Bit fields: 31–08 | 07 | 06 | 05 | 04–02 | 01–00


Access Code

Access Code    System    User

0 0 0          r - -     r - -
0 0 1          r w -     r w -
0 1 0          r - x     r - x
0 1 1          r w x     r w x
1 0 0          - - x     - - x
1 0 1          r w -     r - -
1 1 0          r - x     - - -
1 1 1          r w x     - - -

Examples of Valid PTEs

● 39a – Physical Page Number 3, System (r - x), User (- - -),
Page is Cached

● 302 – Physical Page Number 3, System (r - -), User (r - -),
Page is not Cached

● 38e – Physical Page Number 3, System (r w x), User (r w x),
Page is Cached
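
These examples can be verified with a few lines of code. The following
Python sketch is an illustration under the bit layout and access code
table given above, not a Sun utility; decode_srmmu_pte and ACCESS_CODES
are hypothetical names.

```python
# Access code (bits 04-02) -> (system permissions, user permissions),
# copied from the Access Code table.
ACCESS_CODES = {
    0b000: ("r--", "r--"),
    0b001: ("rw-", "rw-"),
    0b010: ("r-x", "r-x"),
    0b011: ("rwx", "rwx"),
    0b100: ("--x", "--x"),
    0b101: ("rw-", "r--"),
    0b110: ("r-x", "---"),
    0b111: ("rwx", "---"),
}


def decode_srmmu_pte(pte: int) -> dict:
    """Decode a PTE laid out as described for Sun-4m in this module."""
    system, user = ACCESS_CODES[(pte >> 2) & 0b111]
    return {
        "valid": (pte & 0b11) == 0b10,      # entry type must be binary 10
        "system": system,
        "user": user,
        "referenced": bool((pte >> 5) & 1),
        "modified": bool((pte >> 6) & 1),
        "cached": bool((pte >> 7) & 1),
        "page": pte >> 8,                   # bits 31-08
    }


for value in (0x39A, 0x302, 0x38E):
    print(format(value, "x"), decode_srmmu_pte(value))
```

An entry type other than binary 10, as in the value 304 used later in
the workshop, decodes with valid set to False.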


Page Table Entry – Sun-4d Architecture

To map a virtual address to a physical address, a valid PTE must be
placed within the MMU by the operating system. The format for PTEs
is architecture-dependent.

The PTE format for Sun-4d architecture is:

● Bits 00 and 01 (Entry type) – Binary 10 (bit 01 set, bit 00 clear)
is required to indicate a valid PTE.

● Bits 02–04 – Access code.

● Bit 05 – Reference bit indicator.

● Bit 06 – Modify bit indicator.

● Bit 07 (Cache entry) – When set to one (1), caching is enabled.

● Bits 08–31 – Physical page number.

Bit fields: 31–08 | 07 | 06 | 05 | 04–02 | 01–00


Access Code

Access Code    System    User

0 0 0          r - -     r - -
0 0 1          r w -     r w -
0 1 0          r - x     r - x
0 1 1          r w x     r w x
1 0 0          - - x     - - x
1 0 1          r w -     r - -
1 1 0          r - x     - - -
1 1 1          r w x     - - -

Examples of Valid PTEs

● 396 – Physical Page Number 3, System (r w -), User (r - -),
Page is Cached (bit 07 is one)

● 38a – Physical Page Number 3, System (r - x), User (r - x),
Page is Cached (bit 07 is one)

● 30e – Physical Page Number 3, System (r w x), User (r w x),
Page is not Cached (bit 07 is zero)
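
Going the other way, a PTE value can be assembled from its fields. The
following Python sketch is a hedged illustration of the bit layout
listed for this architecture; build_pte is a hypothetical helper, and
the reference and modify bits are simply left at zero.

```python
def build_pte(page: int, access_code: int, cached: bool) -> int:
    """Assemble a valid PTE from a page number, access code, and cache flag."""
    if not 0 <= access_code <= 0b111:
        raise ValueError("access code must fit in three bits")
    if not 0 <= page < (1 << 24):
        raise ValueError("page number must fit in bits 31-08")
    return (
        (page << 8)             # bits 31-08: physical page number
        | (int(cached) << 7)    # bit 07: cache entry
        | (access_code << 2)    # bits 04-02: access code
        | 0b10                  # bits 01-00: entry type for a valid PTE
    )


# Page 3, access code 000 (read-only for system and user), cached.
print(format(build_pte(3, 0b000, True), "x"))
# Page 3, access code 110 (system r-x, no user access), not cached.
print(format(build_pte(3, 0b110, False), "x"))
```

The two values produced here, 382 and 31a, are the same ones typed with
pgmap! in the Sun-4d workshop.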


Sun-4 Error Detection Workshop

This workshop is performed by the instructor because of the primitive
console command language in the Sun-4 models. You can study it
or even perform it if workstations are available. All Sun-4 commands
are in boldface type.

1. Use the k0 command to reset the CPU.

k0

2. Use the k1 command to reset the MMU (virtual address = physical
address).

k1

3. Use the p command to open a page map for virtual address 1000
and enable it to be modified if needed.

p 1000

Page Map 00000000 [segment: 0000]: D0000000? d0000002
Page Map 00002000 [segment: 0000]: D0000001?<sp> <cr>

The PTE d0000002 is deposited for the page that contains virtual
address 1000.

4. Type the ^t command (the ^ character is produced by pressing Shift
and 6 together) to decode the page table entry for virtual address
1000.

^t 1000

Virtual address 0x00001000 is mapped to Physical
Address 0x00005000. Context=0x0, Segment Map=0x0,
Page Map=0xD0000002.

Page 2 has these attributes.

Valid = 1
Write Allow = 1
Supervisor Protect = 0
Don't Cache = 1
Type = 0
Accessed = 0
Modified = 0


5. Use the l command to open (read) a longword location at virtual
address 1000 and modify it, if necessary. Opening the location is a
read operation, and it succeeds. The value 12345678 is then deposited
(written) into virtual address 1000.

l 1000
00001000: 00000000? 12345678
00001004: 00000000?

Reopen the location to confirm that you wrote to virtual address
1000. No errors were detected.

>l 1000
00001000: 12345678?

6. To create conditions that cause an error to occur, be detected,
and be reported to the user, open the page map for virtual
address 1000 and deposit a PTE that does not enable a write
operation.

p 1000
Page Map 00000000 [segment: 0000]: F0000000? a0000002
Page Map 00002000 [segment: 0000]: F0000001?

7. Verify that there is no write access.

>^t 1000

Virtual Address 0x00001000 is mapped to Physical
Address 0x00005000. Context=0x0, Segment Map=0x0,
Page Map=0xA0000002.

Page 2 has these attributes.

Valid = 1
Write Allow = 0
Supervisor Protect= 1
Don't Cache = 0
Type = 0
Accessed = 0
Modified = 0


8. Open (read) virtual address 1000. There is no problem (none
is expected).

l 1000
00001000: 00000000?

Then deposit 1234 (write). An error is detected.

l 1000
00001000: 00000000? 1234

9. The monitor reports the error:

Bus error at virtual address 0x00001000 (physical address
0x00005000) with PC 0xFFE82108. PTE is WRITE PROTECTED.
Type 0.

The next error forces an invalid PTE for virtual address 1000. As
you will see, not even a read can be performed. Once again, all
commands are highlighted, including the error.

k0
k1
p 1000
Page Map 00000000 [segment: 0000]: D0000000? 20000002
Page Map 00002000 [segment: 0000]: D0000001?
^t 1000
Virtual Address 0x00001000 is mapped to Physical
Address 0x00005000.
Context=0x0, Segment Map=0x0, Page Map=0x20000002.

Page 2 has these attributes.


Valid = 0
Write Allow = 0
Supervisor Protect= 1
Don't Cache = 0
Type = 0
Accessed = 0
Modified = 0

>l 1000
00001000:

10. The read fails, and the monitor reports the error:

Bus Error at virtual address 0x00001000 (physical address
0x00005000) with PC 0xFFE82078. PTE is marked INVALID. Type 0.
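
The physical addresses reported by the monitor can be reproduced by
hand: multiply the physical page number in the PTE by the page size and
add the offset of the virtual address within its page. The following
Python sketch shows the arithmetic. The 8-Kbyte page size used for
Sun-4 is an assumption inferred from the page map entries at 00000000
and 00002000 in the transcript above; the 4096-byte page size is the
one stated earlier in this module. The function name translate is
hypothetical.

```python
def translate(virtual_address: int, physical_page: int, page_size: int) -> int:
    """Return the physical address for a virtual address and its PTE page."""
    offset = virtual_address % page_size   # position within the page
    return physical_page * page_size + offset


# Sun-4 workshop: PTE d0000002 selects page 2; 8-Kbyte pages assumed.
print(hex(translate(0x1000, 2, 0x2000)))

# 4096-byte pages: a PTE with page number 3 maps virtual address 1000.
print(hex(translate(0x1000, 3, 0x1000)))
```

Under these assumptions the first call gives 0x5000, matching the
monitor output, and the second gives 0x3000, matching the mapping
described in the Sun-4c, Sun-4m, and Sun-4d workshops.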


Sun-4c Error Detection Workshop

Reset Procedure
To begin this workshop, you must obtain a Sun-4c workstation.

1. Bring the system safely to the monitor prompt if the operating
system is running (init 0 or halt).

2. At the ok monitor prompt, type reset.

ok reset

(If you see a > monitor prompt, type n, then type reset.)

Within 4 seconds, the pinwheel for booting begins. Press Stop (L1)–a.

Refer to “Page Table Entry – Sun-4c Architecture” for the correct PTE
format for Sun-4c architecture. Console commands are in boldface
type. Use this information not to troubleshoot problems but to
understand the error detection mechanism used by the diagnostics
and operating system software.


Example 1
1. Type the following console command:

e0000003 1000 pgmap!

This command sets up a PTE for virtual address 1000 to be
mapped to physical address 3000.

2. Type the following console command:

1000 map?

What information is displayed? Does it agree with “Page Table
Entry – Sun-4c Architecture”? If not, see the instructor.

3. Type the following console command:

f0000003 1000 pgmap!

4. Type the following console command:

1000 map?

What information is displayed? Does it agree with the student
guide? If not, see the instructor.

5. Type the following console command:

1000 20 ab fill

This writes the value ab to 32 memory locations starting at virtual
address 1000.

6. Type the following console command:

1000 20 dump

This reads the contents of 32 memory locations starting at virtual
address 1000.

At this point, you have set up a known condition (read/write) and
ensured that it worked. Now, you will create an error condition.

The first error condition is a valid PTE that is read-only. You
will attempt to perform a write to the page, thus forcing the error
condition.

7. Type the following console command:

a0000003 1000 pgmap!

8. Type the following console command:

1000 map?

Is the page read-only? If not, contact the instructor.

9. Type the following console command (to prove you can read):

1000 20 dump

10. Type the following console command:

1000 20 11 fill

The previous command should have generated an error. See the
instructor if it did not.

11. Type the following console command:

serr@ .

A hex value is returned indicating the type of error that was
detected. Refer to the table below for verification.

Bit Map for serr Register

Bit     Error

15      Error during read (0) / error during write (1)
14–8    Must be zero
7       PTE not valid
6       Protection violation
5       Time-out
4       SBus error
3       Memory error
2       Must be zero
1       Size error
0       Watchdog reset

Bit:    15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
Value:   1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
Hex:          8           0           1           0

As an example, if the serr register contains 8010, an SBus error
was detected during a write operation. Conversely, if serr
contains 0010, the SBus error was detected during a read
operation.
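
The bit map can be applied to any value returned by serr@ with a short
routine. The following Python sketch is an illustration based only on
the table above, not a Sun utility; decode_serr and SERR_BITS are
hypothetical names.

```python
# Error bits of the serr register, taken from the bit map table.
SERR_BITS = {
    7: "PTE not valid",
    6: "Protection violation",
    5: "Time-out",
    4: "SBus error",
    3: "Memory error",
    1: "Size error",
    0: "Watchdog reset",
}


def decode_serr(value: int):
    """Return the operation (read or write) and the list of errors set."""
    operation = "write" if value & 0x8000 else "read"   # bit 15
    errors = [
        name
        for bit, name in sorted(SERR_BITS.items(), reverse=True)
        if value & (1 << bit)
    ]
    return operation, errors


print(decode_serr(0x8010))   # SBus error during a write
print(decode_serr(0x0010))   # SBus error during a read
```

The must-be-zero bits are ignored here; a stricter version could flag
them if they are set.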


Example 2
1. Perform the reset procedure described under “Reset Procedure” at
the beginning of this workshop. This resets the system after the
error.

2. Type the following console command:

70000003 1000 pgmap!

3. Type the following console command:

1000 map?

Is the PTE invalid? If not, contact the instructor.

4. Type the following console command:

1000 20 dump

The previous command should have generated an error. See the
instructor if it did not.

5. Type the following console command:

serr@ .

A hex value is returned indicating the type of error that was
detected. Refer to the “Bit Map for serr Register” table in
Example 1 for verification.


Sun-4m Error Detection Workshop

Reset Procedure
To begin this workshop, you must obtain a Sun-4m workstation.

1. Bring the system safely to the monitor prompt if the operating
system is running (init 0 or halt).

2. At the ok monitor prompt, type reset.

ok reset

(If you see a > monitor prompt, type n, then type reset.)

Within 4 seconds, the pinwheel for booting begins. Press Stop (L1)–a.

Refer to “Page Table Entry – Sun-4m Architecture” for the correct
PTE format for Sun-4m architecture. Console commands are in
boldface type. Use this information not to troubleshoot problems
but to understand the error detection mechanism used by the
diagnostics and operating system software.


Example 1
1. Type the following console command:

39a 1000 pgmap!

The previous command set up a PTE for virtual address 1000 to be
mapped to physical address 3000.

2. Type the following console command:

1000 map?

What information is displayed? Does it agree with “Page Table
Entry – Sun-4m Architecture”? If not, see the instructor. Focus on
virtual, page, and physical parameters only.

3. Type the following console command:

31a 1000 pgmap!

4. Type the following console command:

1000 map?

What information is displayed? Does it agree with the student
guide? If not, see the instructor.

5. Type the following console command to set up a page to
read/write:

38e 1000 pgmap!

6. Type the following console command:

1000 20 ab fill

This writes the value ab to 32 memory locations starting at virtual
address 1000.

7. Type the following console command:

1000 20 dump

This reads the contents of 32 memory locations starting at virtual
address 1000.

At this point, you have set up a known read/write condition and
ensured that it worked. Now, you will create an error condition.

The first error condition is a valid PTE that will be read-only. You
will attempt to perform a write to the page, thus forcing the error
condition.

8. Type the following console command:

302 1000 pgmap!

9. Type the following console command:

1000 map?

Is the page read-only? If not, contact the instructor.

10. Type the following console command to prove you can read:

1000 20 dump

11. Type the following console command:

1000 20 11 fill

The previous command should have generated an error. See the
instructor if it did not.

12. Type the following console command:

.sfsr

What is the value of the fault type field? Refer to the table below
for verification.

sfsr Fault Types

Fault Type Code    Error

6                  Internal error
5                  Access bus error or time-out
4                  Translation error
3                  Privilege violation
2                  Protection error
1                  Invalid address
0                  No error
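
The lookup in this table is easy to automate. The following Python
sketch is an illustration only. The fault type names come from the
table above; the position of the fault type field (bits 4 through 2 of
the raw register value) is an assumption taken from the SPARC Version 8
reference MMU description and may differ on a particular
implementation. The names fault_type_name and fault_type_from_sfsr are
hypothetical.

```python
FAULT_TYPES = {
    0: "No error",
    1: "Invalid address",
    2: "Protection error",
    3: "Privilege violation",
    4: "Translation error",
    5: "Access bus error or time-out",
    6: "Internal error",
}


def fault_type_name(code: int) -> str:
    """Map a fault type code to its description from the table."""
    return FAULT_TYPES.get(code, "Reserved")


def fault_type_from_sfsr(sfsr: int) -> int:
    """Extract the fault type field (assumed to be bits 4-2) from a raw value."""
    return (sfsr >> 2) & 0b111


# A raw value whose assumed fault type field holds 2.
print(fault_type_name(fault_type_from_sfsr(0b01000)))
```

When the .sfsr command already prints the fault type field separately,
only fault_type_name is needed.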


Example 2
1. Perform the reset procedure described under “Reset Procedure” at
the beginning of this workshop to reset the system after the error.

2. Type the following console command:

304 1000 pgmap!

3. Type the following console command:

1000 map?

If the PTE is valid, contact the instructor.

4. Type the following console command:

1000 20 dump

The previous command should have generated an error. See the
instructor if it did not.

5. Type the following console command:

.sfsr

What is the value of the fault type field? Refer to the “sfsr Fault
Types” table for verification.


Sun-4d Error Detection Workshop

Reset Procedure
To begin this workshop, you must obtain a Sun-4d system. Do
one of the following, depending on the state of your system.

● If the operating system is running, bring the system safely to the
monitor prompt using either the init 0 or halt command. Then run
the reset command.

● If the system is already at the monitor prompt, run the reset
command:

ok reset

Within 4 seconds, the pinwheel for booting begins. Press Stop (L1)–a.

Refer to “Page Table Entry – Sun-4d Architecture” for the correct PTE
format for Sun-4d architecture. Console commands are in boldface
type. Use this information not to troubleshoot problems but to
understand the error detection mechanism used by the diagnostics
and operating system software.


Example 1
1. Type the following console command:

39a 1000 pgmap!

The previous command set up a PTE for virtual address 1000 to be
mapped to physical address 3000.

2. Type the following console command:

1000 map?

What information is displayed? Does it agree with the Student
Guide, Module 2, “Page Table Entry – Sun-4d Architecture”? If
not, see the instructor. Focus on virtual, page, and physical
parameters only.

3. Type the following console command:

31a 1000 pgmap!

4. Type the following console command:

1000 map?

What information is displayed? Does it agree with the student
guide? If not, see the instructor.

5. Type the following console command to set up a page to
read/write:

38e 1000 pgmap!

6. Type the following console command:

1000 20 ab fill

This writes the value ab to 32 memory locations starting at virtual
address 1000.

7. Type the following console command:

1000 20 dump

This reads the contents of 32 memory locations starting at virtual
address 1000.

At this point, you have set up a known condition (read/write) and
ensured that it worked. Now, you will create an error condition.

The first error condition is a valid PTE that will be read-only. You
will attempt to perform a write to the page, thus forcing the error
condition.

8. Type the following console command:

382 1000 pgmap!

9. Type the following console command:

1000 map?

Is the page read-only? If not, contact the instructor.

10. Type the following console command to prove you can read:

1000 20 dump

11. Type the following console command:

1000 20 11 fill

The previous command should have generated an error. See the
instructor if it did not.

12. Type the following console command:

.sfsr

What is the value of the fault type field? Refer to the table below
for verification.

sfsr Fault Types

Fault Type Code    Error

6                  Internal error
5                  Access bus error or time-out
4                  Translation error
3                  Privilege violation
2                  Protection error
1                  Invalid address
0                  No error


Example 2
1. Perform the reset procedure described under “Reset Procedure” at
the beginning of this workshop to reset the system after the error.

2. Type the following console command:

304 1000 pgmap!

3. Type the following console command:

1000 map?

If the PTE is valid, contact the instructor.

4. Type the following console command:

1000 20 dump

The previous command should have generated an error. See the
instructor if it did not.

5. Type the following console command:

.sfsr

What is the value of the fault type field? Refer to the “sfsr Fault
Types” table for verification.


Skills Checklist

No direct skills are associated with this module. This module and
associated workshops are used only to demonstrate the error-detection
mechanism. A field engineer would not be required to troubleshoot
the equipment with the skills used within the workshop.

POST Diagnostics 3

Objectives
Upon completion of this module, you will be able to:

● Describe the importance, capabilities, and limitations of the
power-on self test (POST) in identifying and resolving system
faults.

● Describe the different ways to view the POST.

● Configure a console server using a tip connection, and view the
POST process.

References
Field Engineer Handbook, Volumes 1 and 2, Part Numbers 800-4006 and
800-4247

OpenBoot Command Reference, Part Number 800-6076

OpenBoot 2.x Quick Reference Card, Part Number 802-1958

OpenBoot 3.x Quick Reference Card, Part Number 802-3240

System Answerbook, 2.5


Diagnostics Overview

[Figure: Two groups of diagnostics. Boot PROM-based diagnostics: POST
(power-on self test), extended POST, and user diagnostics. SunVTS
diagnostics: installed as a package; requires the Solaris operating
system.]


This section describes the importance, capabilities, and limitations of
the power-on self test (POST) in identifying and resolving system
faults.

Boot PROM POST
The boot PROM POST diagnostics:

● Are invoked automatically at the power-on sequence

● Differ slightly between workstation models

● Differ slightly between boot PROM revisions

● Conduct error detection and hardware verification for each system
board

● Conduct all hardware bus probes, and save information for the
operating system’s automatic reconfiguration (ok boot -r) and
memory sizing

Note – A deliberate limitation of the boot PROM POST is that the I/O
devices themselves are not tested; only the devices and buses required
to access the boot device are tested.

SunVTS Diagnostics
The SunVTS diagnostics:

● Require the Solaris operating environment to be running

● Are installed as a package

● Are used for system verification

● Run in a window or nonwindow environment


POST Viewing Methods

POST is a set of boot PROM resident firmware programs that run
independently of the Solaris operating environment.

Viewing POST From the CPU Board LEDs (older systems)

[Figure: The POST diagnostics in the boot PROM run at power-on or a
system reset. The CPU chip (IU) executes the machine instructions, and
test numbers are shown on the LEDs. Some desktops only use LEDs on the
keyboard.]

Viewing POST With a Serial Port Terminal

[Figure: The POST diagnostics in the boot PROM run at power-on or a
system reset. Test numbers are shown on the LEDs, and serial port A is
connected through a null modem cable to the modem port of an ASCII
terminal.]

Null modem cable

Serial port A      Pin    Pin    ASCII terminal

Transmit data      2      3      Receive data
Receive data       3      2      Transmit data
Signal ground      7      7      Signal ground

The ASCII terminal lists, in English text, the POST diagnostic that is
currently executing.


Viewing POST Using the tip hardwire Command

You can use the tip hardwire command:

● When a serial port terminal is not available.

● To analyze POST output in a Sun window.

% tip hardwire
connected

[Figure: Serial port A of the broken machine, which is in diagnostic
mode, is connected to serial port A or B of the good machine.]


POST Example Using tip – SPARCstation 5

Machine Information
The information below describes the machine used for this example.

● SPARCstation 5, no keyboard

● ROM Rev. 2.15, 32 Mbytes of memory installed, Serial #7516196

● Ethernet address 8:0:20:72:b0:24, Host ID: 8072b024

Connect the machine to a “good” machine using a null modem cable. On
the good machine, enter the tip hardwire command, and then turn on the
power to the machine being tested (power-on reset):

# tip hardwire
$$$$$ WARNING: No Keyboard Detected! $$$$$
MMU Context Table Reg Test
MMU Context Register Test
MMU TLB Replace Ctrl Reg Tst
MMU Sync Fault Stat Reg Test
MMU Sync Fault Addr Reg Test
MMU TLB RAM NTA Pattern Test
MMU TLB CAM NTA Pattern Test
MMU TLB LCAM NTA Pattern Test
IOMMU SBUS Config Regs Test
IOMMU Control Reg Test
IOMMU Base Address Reg Test
IOMMU TLB Flush Entry Test
IOMMU TLB Flush All Test
SBus Read Timeout Test
EBus Read Timeout Test
D-Cache RAM NTA Test
D-Cache TAG NTA Test
I-Cache RAM NTA Test
I-Cache TAG NTA Test
Memory Address Pattern Test
FPU Register File Test
FPU Misaligned Reg Pair Test

(Multiple lines of output deleted here by editor.)


initializing TLB
initializing cache



Allocating SRMMU Context Table
Setting SRMMU Context Register
Setting SRMMU Context Table Pointer Register
Allocating SRMMU Level 1 Table
Mapping RAM
Mapping ROM
ttya initialized
Probing Memory Bank #0 32 Megabytes
Probing Memory Bank #1 Nothing there

(Multiple lines of output deleted here by editor.)


Probing Memory Bank #7 Nothing there
Probing CPU FMI,MB86904
Probing /iommu@0,10000000/sbus@0,10001000 at 5,0 espdma
esp sd st SUNW,bpp ledma le
Probing /iommu@0,10000000/sbus@0,10001000 at 4,0
SUNW,CS4231 power-management
Probing /iommu@0,10000000/sbus@0,10001000 at 1,0
Nothing there
Probing /iommu@0,10000000/sbus@0,10001000 at 2,0
Nothing there
Probing /iommu@0,10000000/sbus@0,10001000 at 3,0 cgsix
Probing /iommu@0,10000000/sbus@0,10001000 at 0,0
Nothing there
Probing Memory Bank #0 32 Megabytes
Probing Memory Bank #1 Nothing there

(Multiple lines of output deleted here by editor.)


Probing CPU FMI,MB86904
Probing /iommu@0,10000000/sbus@0,10001000 at 5,0 espdma
esp sd st SUNW,bpp ledma le
Probing /iommu@0,10000000/sbus@0,10001000 at 4,0
SUNW,CS4231 power-management
Probing /iommu@0,10000000/sbus@0,10001000 at 1,0
Nothing there
Probing /iommu@0,10000000/sbus@0,10001000 at 2,0
Nothing there
Probing /iommu@0,10000000/sbus@0,10001000 at 3,0 cgsix



Probing /iommu@0,10000000/sbus@0,10001000 at 0,0
Nothing there
Boot device: /iommu/sbus/ledma@5,8400010/le@5,8c00000
File and args:
Automatic network cable selection succeeded : Using TP
Ethernet Interface
Timeout waiting for ARP/RARP packet


POST Example Using tip – SS1000

Note – On the Sun-4d and Sun-4u machines, the operating system keeps a record of the POST results and of any watchdog resets that may have occurred. To display this record for fault isolation, type the command /usr/kvm/prtdiag -v.

The following example shows the output when you use the tip command. The correct response is connected, followed by the POST output.

Diagnostics Output
The diagnostics run on all system boards and test all CPU modules, buses, and memory.

# tip hardwire
connected
0B>
BIST Status = 00000001 Signature - CPU = 6ED695A2
0B>map16 test
0A>
BIST Status = 00000001 Signature - CPU = 6ED695A2
0B>
**** SPARCserver_1000 MP POST Rev 8 ****



0A>map16 test
0B>EPROMs Test
0B> EPROM path Test
0A>
**** SPARCserver_1000 MP POST Rev 8 ****

0B> EPROM checksum Test


0A>EPROMs Test
0A> EPROM path Test
0A> EPROM checksum Test
0B>LEDs Test
0B> WALK LED Test
0B>Serial Ports Test
0B> Port A Register Testx@ !"#$%&'()*+,-
./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
wxyz{|}~0A>Serial Ports Test
0B> Serial Port B Loopback Testz0A> Port A Register Test` !"#$%&'()*+,-
./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
wxyz{|}~0B> Mouse Loopback Test
0A> Serial Port B Loopback Testz0B>NVRAM/TOD Test
0A>Keybd/Mouse Test
0B>Basic CPU Test
0A> Keyboard Loopback Test
0B> FPU Register Test
0A> Mouse Loopback Test
0B> FPU Functional Test .
0A>NVRAM/TOD Test
0A>Basic CPU Test
0B> MMU TLB Test
0A> FPU Register Test
0B> Instruction Cache Tags Test
0A> FPU Functional Test
0A> MMU TLB Test
0A> Instruction Cache Tags Test
0B> Instruction Cache Ram Test



Press the s key to toggle to a 1, indicating you want to stop in the
POST DEMON.
1
*** Toggle Stop POST Flag = 1 ***

0B> Data Cache Tags Test


0B> Data Cache Ram Test
Data Cache Tags Test
0A> Data Cache Ram Test
The h key invokes the help menu.
0A> Store Buffer RAM Test
0A> Store Buffer Functional Test
0A> MXCC Register
Key Action
-------------------------------------------
a Toggle Pause CPU A flag
b Toggle Pause CPU B flag
c Toggle Trace Test Case flag
l Toggle Loop on Subtest flag
e Toggle Loop on Error flag
p Toggle Print all Errors flag
v Toggle Verbose Print Mode flag
s Toggle Stop Flag
t Toggle Timestamp flag
n Skip to Next Subtest
N Skip to Next Test
sp Skip to Next Testcase
h or ? Display this command summary
0A>Pausing ... press any key to continue
0B> Init MXCC Regs
0B>Ecache Test
0B> Setting Cache Size
0B> Ecache Tags Test
0A> Init MXCC Regs
0A>Ecache Test
0A> Setting Cache Size
0A> Ecache Tags Test



0B> Ecache SRAM Test
0A> Ecache SRAM Test
0B> Ecache Enable
0B> Clear CC SRAM
0A> Ecache Enable
0A> Clear CC SRAM
0A>BW0 Regs Test
0A> C_O BW
0A> BW Registers Test
0A> Timers and Interrupts Test
0A> BW Tag RAM 6N Test
0B>BW0 Regs Test
0B> C_O BW
0B> BW Registers Test
0B> Timers and Interrupts Test
0B> BW Tag RAM 6N Test
0A>C0 MQH Test
0A> C_0 BW,MQH
0A> MQH Registers Test
0A> MQH Initialization
0A> Enable ECC
0A> Memory Test
0A> Config Memory Available
0A>Config Board = 64MB, Config Total = 64MB
0A>C0 IOC Test
0A> C_0 BW,IOC
0A> IOC Registers Test
0A> IOC XDBus Tags Test
0A> IOC Sbus Tags Test
0A> IOC Cache RAM Test
0A>C0 SBI Test
0A> SBI Initialization
0A> SBI Registers Test
0A> SBI Initialization
0A> SBus Interrupts Test
0A>C0 SBUS Cards Test
0A> SBI Initialization
0A> Checking for SBUS cards



0A>Board 0 Slot 0 occupied
0A>Board 0 Slot 1 occupied
0A>C0 XDBus Timing Test0A> C_0 BW
0A> Compute XDBus Frequency
0A>Bus frequency = 40 MHz
0A> TOD Delay
0A>C0 XPT Test
0A> C_0 BW,IOC
0A> XPT Read Write Test
0A>C0 BW-MQH Consistency Test
0A> C_0 BW,MQH
0A> BW MQH Cache Consistency Test
0A>C0 IOC-MQH Consistency Test
0A> C_0 BW,IOC,MQH
0A> SBus Loopback Test
0A>Testing slot 0 on board 0
0A>Testing slot 1 on board 0
0A>Testing slot 2 on board 0
0A>Testing slot 3 on board 0
0A> IOC MQH Consistency Test
0A>C0 BW-IOC Consistency Test
0A> C_0 BW,IOC,MQH
0A> Cache States Test
0A> BW IOC Consistency Test
0A>SPARC Module Board Master Test
0A> C_0 BW,MQH
0A> CPU and Cache Test
0A> MMU PTP Cache Invalidation Test
0A> MMU Stuff TLB Hit Test
0A> MMU Table Walk Test
0A> MMU Flush Test
0A> MMU TLB Lock Test
0A> MMU TLB Protection Error Test
0A> MMU Table Walk With Parity Error Test
0A> MMU Table Walk With ECC Error Test
0B>SPARC Module Board Slave Test
0B> Read MQH State
0B> C_0 BW,MQH



0B> CPU and Cache Test
0B> MMU PTP Cache Invalidation Test
0B> MMU Stuff TLB Hit Test0B> MMU Table Walk Test
0B> MMU Flush Test
0B> MMU TLB Lock Test
0B> MMU TLB Protection Error Test
0B> MMU Table Walk With Parity Error Test
0B> MMU Table Walk With ECC Error Test
0A>programming MQH group addr at E0101000 to 00400009
0A>programming MQH group addr at E0101008 to 00000009
0A>programming MQH group addr at E1101000 to 00400049
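
Because both CPUs write to the same serial port, their POST lines are interleaved in the output above. The following Python sketch is one way to separate a captured log into one list per CPU. It assumes that the prefix on each line (for example, 0A> or 0B>) is the board number followed by the CPU letter; the function name split_by_cpu and the sample lines are illustrative and are not part of POST.

```python
import re
from collections import defaultdict

# Assumed prefix shape: board number, CPU letter, ">" (for example "0A>").
PREFIX = re.compile(r"^(\d+)([A-Z])>\s*(.*)$")


def split_by_cpu(lines):
    """Group POST lines by their board/CPU prefix, keeping the original order."""
    streams = defaultdict(list)
    for line in lines:
        match = PREFIX.match(line.rstrip())
        if match is None:
            continue  # banner text, loopback characters, editor notes
        board, cpu, text = match.groups()
        if text:  # skip bare prompts such as "0B>"
            streams[board + cpu].append(text)
    return dict(streams)


# Sample lines copied from the SS1000 example.
sample = [
    "0B>EPROMs Test",
    "0B> EPROM path Test",
    "0A>",
    "**** SPARCserver_1000 MP POST Rev 8 ****",
    "0A>EPROMs Test",
    "0A> EPROM path Test",
]
for cpu, tests in sorted(split_by_cpu(sample).items()):
    print(cpu, tests)
```

Lines where two messages are fused together by the interleaving are filed under the first prefix only, so a real log may still need a manual check.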

The POST results normally scroll past quickly on the display. You can review the results using the DEMON menu.
0A>total pmem 0x00008000 [pages] 0x008000000 [bytes] in 1 chunks
0A>DRAM chunk 0 base 0x00000000 size 0x00008000
0A> (0=failed,1=passed,blank=untested/unavailable)
(sbus 1=card present,0=card not present,x=failed)
0A>------+---------+------+-------+------+----+-----+----+--------+-------+------+-----+
0A> Slot | cpuA | bw0 | cpuB | bw0 | bb | ioc0| sbi| mqh0 | mem |sbus |xd0|
0A>------+---------+------+-------+------+----+-----+----+--------+-------+------+-----+
0A> 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 | 0011| 1 |
0A> 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 | 0011| 1 |
0A>------+--------+------+-------+------+----+-----+----+--------+--------+------+-----+

0A>Memory Group Status


(0=failed,1=passed,m=simm missing,c=simm mismatch,blank=unpopulated/unused)
0A>+-----+-------+------+------+------+
0A> Slot | g0 | g1 | g2 | g3 |
0A>+-----+-------+------+------+------+
0A> 0 | 1 | 1 | | |
0A> 1 | 1 | 1 | | |
0A>+-----+-------+-------+-----+------+
0A>
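
When POST output has been captured to a file, the status table above can be checked by a script. The following Python sketch pairs each header cell with the value below it and reports the components marked 0 (failed). The function names are hypothetical, and skipping the Slot, mem, and sbus columns is an assumption based on the legend, because those columns do not use the 0=failed convention.

```python
def parse_status_row(header_line, row_line):
    """Pair each column name with its value; a list is used because bw0 appears twice."""
    def cells(line):
        body = line.split(">", 1)[1]  # drop the "0A>" prefix
        return [cell.strip() for cell in body.strip().strip("|").split("|")]
    return list(zip(cells(header_line), cells(row_line)))


def failed_components(pairs):
    """Return the names of status columns whose value is 0."""
    not_status = {"Slot", "mem", "sbus"}  # assumption: these are not pass/fail flags
    return [name for name, value in pairs if name not in not_status and value == "0"]


# Header and row copied from the example table.
header = "0A> Slot | cpuA | bw0 | cpuB | bw0 | bb | ioc0| sbi| mqh0 | mem |sbus |xd0|"
row = "0A> 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 | 0011| 1 |"
print(failed_components(parse_status_row(header, row)))
```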


The next area displays the POST DEMON menu and shows the steps necessary to view system parameter information. The menu keys are hot keys: you do not need to press Return after you press one.
DEMON
0A>Select one of the following functions
0A> '0' System Parameters
0A> '1' Read/Write device
0A> '2' Software Reset
0A> '3' NVRAM Management
0A> '4' Error Reporting
0A> '5' Analyze Error Logs
0A> '6' Power Off at Main Breaker
0A> '7' NVRAM SIMM tests
0A> 'r' Return to selftest

Command ==> 0

System Parameters
0A>Select one of the following functions
0A> '0' Set POST Level
0A> '1' Dump Device Table
0A> '2' Display System
0A> '3' Dump Board Registers
0A> '4' Dump Component IDs
0A> '5' Clear Error Logs
0A> '6' Display Simms
0A> '7' Scrub Main Memory
0A> 'r' Return


Command ==> 2
0A> (0=failed,1=passed,blank=untested/unavailable)
(sbus 1=card present,0=card not present,x=failed)
0A>------+-------+-----+-------+------+---+------+----+--------+-------+------+-----+
0A> Slot | cpuA | bw0 | cpuB | bw0 | bb | ioc0| sbi | mqh0 | mem |sbus |xd0|
0A>-----+-------+------+-------+-----+----+------+----+--------+-------+------+-----+
0A> 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 |0011| 1 |
0A> 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 |0011| 1 |
0A>-----+-------+------+-------+-----+----+------+----+--------+-------+------+-----+
0A>Memory Group Status
(0=failed,1=passed,m=simm missing,c=simm
mismatch,blank=unpopulated/unused)
0A>+----+------+------+------+------+
0A> Slot| g0 | g1 | g2 | g3 |
0A>+---+-------+------+------+------+
0A> 0 | 1 | 1 | | |
0A> 1 | 1 | 1 | | |
0A>+---+-----+-------+-------+------+
0A>Hit any key to continue :

The following sequence of steps displays the error logs: return to the DEMON menu, and then select Analyze Error Logs.


System Parameters
0A>Select one of the following functions
0A> '0' Set POST Level
0A> '1' Dump Device Table
0A> '2' Display System
0A> '3' Dump Board Registers
0A> '4' Dump Component IDs
0A> '5' Clear Error Logs
0A> '6' Display Simms
0A> '7' Scrub Main Memory
0A> 'r' Return


Command ==> r

0A>
DEMON
0A>Select one of the following functions
0A> '0' System Parameters
0A> '1' Read/Write device
0A> '2' Software Reset
0A> '3' NVRAM Management
0A> '4' Error Reporting
0A> '5' Analyze Error Logs
0A> '6' Power Off at Main Breaker
0A> '7' NVRAM SIMM tests
0A> 'r' Return to selftest
0A>

Command ==> 5
0A>
-------------- Error Log Analysis for Board 0 --------------
0A>
-------------- Error Log Analysis for Board 1 --------------
0A>
-------------- System Memory Failure Analysis ----------------
0A> No Bad groups found
0A>Hit any key to continue :
0A>

DEMON
0A>Select one of the following functions
0A> '0' System Parameters
0A> '1' Read/Write device
0A> '2' Software Reset
0A> '3' NVRAM Management
0A> '4' Error Reporting
0A> '5' Analyze Error Logs
0A> '6' Power Off at Main Breaker
0A> '7' NVRAM SIMM tests
0A> 'r' Return to selftest
0A>


Command ==>r
0A>
ttya initialized
Probing Memory Bank #0 128 Megabytes
SUNW,SPARCserver-1000
Cpu #0 cpu-unit TI,TMS390Z55
Cpu #1 cpu-unit TI,TMS390Z55
Cpu #2 cpu-unit TI,TMS390Z55
Cpu #3 cpu-unit TI,TMS390Z55
mem-unit mem-unit
bif bif
bootbus zs zs eeprom sram leds bootbus zs zs eeprom sram leds
io-unit sbi
Probing /io-unit@f,e0200000/sbi@0,0 at 0,0 dma esp sd st lebuffer le
Probing /io-unit@f,e0200000/sbi@0,0 at 1,0 cgsix
Probing /io-unit@f,e0200000/sbi@0,0 at 2,0 Nothing there
Probing /io-unit@f,e0200000/sbi@0,0 at 3,0 SUNW,soc SUNW,pln SUNW,ssd
SUNW,pln SUNW,ssd
io-unit sbi
Probing /io-unit@f,e1200000/sbi@0,0 at 0,0 dma esp sd st lebuffer le
Probing /io-unit@f,e1200000/sbi@0,0 at 1,0 dma esp sd st lebuffer le
Probing /io-unit@f,e1200000/sbi@0,0 at 2,0 Nothing there
Probing /io-unit@f,e1200000/sbi@0,0 at 3,0 Nothing there
Probing Memory Bank #0 128 Megabytes
SUNW,SPARCserver-1000
Cpu #0 cpu-unit TI,TMS390Z55
Cpu #1 cpu-unit TI,TMS390Z55
Cpu #2 cpu-unit TI,TMS390Z55
Cpu #3 cpu-unit TI,TMS390Z55
mem-unit mem-unit
bif bif
bootbus zs zs eeprom sram leds bootbus zs zs eeprom sram leds
io-unit sbi
Probing /io-unit@f,e0200000/sbi@0,0 at 0,0 dma esp sd st lebuffer le
Probing /io-unit@f,e0200000/sbi@0,0 at 1,0 cgsix
Probing /io-unit@f,e0200000/sbi@0,0 at 2,0 Nothing there
Probing /io-unit@f,e0200000/sbi@0,0 at 3,0 SUNW,soc SUNW,pln SUNW,ssd
SUNW,pln SUNW,ssd
io-unit sbi
Probing /io-unit@f,e1200000/sbi@0,0 at 0,0 dma esp sd st lebuffer le
Probing /io-unit@f,e1200000/sbi@0,0 at 1,0 dma esp sd st lebuffer le
Probing /io-unit@f,e1200000/sbi@0,0 at 2,0 Nothing there
Probing /io-unit@f,e1200000/sbi@0,0 at 3,0 Nothing there


SPARCserver 1000, No Keyboard


ROM Rev. 2.13, 128 MB memory installed, Serial #5243375.
Ethernet address 8:0:20:18:4a:6f, Host ID: 805001ef.

Boot device: /io-unit@f,e0200000/sbi/lebuffer@0,40000/le@0,60000 File and args:
Timeout waiting for ARP/RARP packet
Timeout waiting for ARP/RARP packet
Timeout waiting for ARP/RARP packet
~
Type help for more information


POST Diagnostic Workshop Using tip

Using Terminal Interface Protocol (TIP) for Remote Diagnostics


You can use a null modem cable or a modem with TIP to remotely
troubleshoot a faulty system.

[Figure: a healthy system connected to a faulty system through a null modem cable or a modem.]


Using tip to Observe POST Diagnostics


Use the following procedure to observe the POST diagnostics of a faulty machine with the tip command:

Note – Before you begin, make sure that the healthy system has the
Solaris operating environment booted to multiuser mode and has a
window system running or available.

1. Connect an RS-232C null modem cable to port B of the functional workstation.

2. Connect the other end to port A of the faulty machine.

3. Halt the faulty machine by pressing the Stop-a (L1-a) key sequence.

4. Set the diag-switch? parameter to true on the faulty machine.

SS1000, SC2000, 600mp, and Sun-4 systems have a hardware diagnostic switch.

Sun-4c and Sun-4m desktop models use the OBP diag-switch? NVRAM parameter, set with the commands below. Alternatively, press Stop-d while turning on the power; this action forces the diag-switch? parameter to true.

ok setenv diag-switch? true


ok reset

5. Turn off the faulty system before you disconnect the keyboard; this prevents blowing the keyboard fuse.

6. Disconnect the keyboard from the back of the system so that output goes to ttya. Remember to power off again before you reconnect the keyboard.



7. Start the OpenWindows™ environment on the functional machine
if not already started, and bring up a Shell Tool from the Programs
menu. (You can run the tip command in a nonwindowed
environment, but there is a danger that if tip hangs, there will be
no way to get into the system to release or kill it.)

# /usr/openwin/bin/openwin

(optional: use Motif)

Note – The hardwire argument tells the tip command to expect 9600 baud, 8 data bits, and 1 stop bit at port B on the CPU board, not on an ALM or SPC. It is not a coincidence that these are the parameters set for port A when a machine powers up without a keyboard.

8. If port A is the only available port on the “good” system, edit the hardwire entry in its /etc/remote file to use port A.

● Before edit:

:dv=/dev/term/b:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D

● After edit:

:dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D
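
The entry uses a termcap-style layout: fields are separated by colons, a string capability is written name=value, and a numeric capability is written name#value. The following Python sketch splits such an entry so that the device and speed can be checked before running tip. The function name parse_remote_caps is hypothetical; the sketch is only an illustration and is not how tip itself reads the file.

```python
def parse_remote_caps(entry):
    """Split one /etc/remote capability line into a dictionary."""
    caps = {}
    for field in entry.strip().strip(":").split(":"):
        if "=" in field:            # string capability, e.g. dv=/dev/term/a
            name, value = field.split("=", 1)
            caps[name] = value
        elif "#" in field:          # numeric capability, e.g. br#9600
            name, value = field.split("#", 1)
            caps[name] = int(value)
        elif field:                 # boolean capability (none in this entry)
            caps[field] = True
    return caps


# The edited line from step 8.
caps = parse_remote_caps(":dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D")
print(caps["dv"], caps["br"])
```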



9. In the Shell Tool window, type the following command:

# tip hardwire

Note – The system should respond with connected. If it does not, some likely causes are:

● The wrong port is selected, either physically or logically in /etc/remote.

● The selected port is already active (bring up admintool and ensure that the port is disabled).

● A /var/spool/locks/LCK file exists from a previous tip or uucp session (often because someone did not properly exit tip with ~^D or ~.).
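
A leftover lock file is easy to check for before starting tip. The following Python sketch lists the files in the lock directory whose names begin with LCK. The directory comes from the note above; the function name find_lock_files is hypothetical, and the sketch only reports files. Confirm that no tip or uucp session is still running before removing anything.

```python
import os


def find_lock_files(lock_dir="/var/spool/locks"):
    """Return the sorted names of LCK files in lock_dir (empty if it is unreadable)."""
    try:
        names = os.listdir(lock_dir)
    except OSError:
        return []
    return sorted(name for name in names if name.startswith("LCK"))


for name in find_lock_files():
    print("possible stale lock:", name)
```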

10. Power on the faulty system.

Note – At this point, you should observe the power-on diagnostic messages in the Shell Tool window of the healthy system. If not, some likely causes are:

● The wrong physical or logical port is selected at either end.

● The null modem cable is faulty.

● The “bad” machine is not in diag mode or still has its keyboard plugged in.



11. After the diagnostics finish, observe any boot errors.

12. Why are you getting an error that looks like a “Net” error?

Notes

13. Type ~ Control-d or ~. to end the tip session. (See “POST tip Commands.”)

You can also display POST tests on nearly any ASCII terminal or
laptop.

14. On the “bad” machine:

● Power off and plug in the keyboard.

● Power on to the ok prompt.

● ok setenv diag-switch? false

● ok reset, and ensure that the machine boots fully again.


POST tip Commands

Warning – Never exit a tip window by killing processes, by quitting the Shell Tool, or by pressing Stop-a (L1-a); these actions disable any future tip functions.

● To send a break through the tip window (the remote equivalent of the Stop-a or L1-a key sequence), type:

~#

● To interrupt a test, press Control-c.

● To exit from tip, type:

~.

Or

~ ^d (tilde Control-d)

● To see a list of tip commands, type:

~?

For more information on the tip command, refer to the on-line man
pages.

OBP Diagnostics and Commands 4

Objectives
Upon completion of this module, you will be able to:

● Using OBP commands, perform the following procedures:

● Gather general information about the system.

● Display and capture the names of the devices in the system device tree, and display their attributes.

● Test devices using the device path, node name, and device
alias.

● Generate and test a PROM device alias.

● Alter any NVRAM setting, display the settings, and reset to the
defaults.

● Optional: Construct, download, and run FORTH macros.


References
Field Engineer Handbook, Volume 1 and 2, Part Numbers 800-4006 and
800-4247

OpenBoot Command Reference, Part Number 800-6076

OpenBoot 2.x Quick Reference Card, Part Number 802-1958

OpenBoot 3.x Quick Reference Card, Part Number 802-3240


Functions and Capabilities of the OpenBoot PROM (OBP)

The OpenBoot PROM consists of two chips on each system board:

● The boot PROM itself

● A nonvolatile random access memory (NVRAM)

The boot PROM has extensive firmware and FORTH-code writing capabilities that allow access to user-written boot drivers and extended diagnostics.

The NVRAM has user-definable system parameters and writable areas for user-controlled diagnostics, macros, or useful settings such as device aliases. The NVRAM also contains system-identification information, so it is removed from a failed system board and installed in the replacement system board.

Features
● Ability to read plug-in device drivers and diagnostics from probed
devices. (Early Sun machines required all boot drivers and
diagnostics to be completely written in the boot PROM.)

● F(ORTH) code interpreter to facilitate writing and downloading drivers, diagnostics, and parameters

● Device tree – A hierarchical data structure, similar to the UNIX file system, for placing and locating device addresses.

● User-callable diagnostics

● Restricted monitor (passworded security to disallow accesses)


OpenBoot PROM

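[Figure: block diagram of the OpenBoot PROM. The OBP block holds the FORTH code: boot program, boot device drivers, POST, extended POST, user diagnostics, and NVRAM defaults. It offers two prompts: > (limited commands) and ok (full FORTH commands). The NVRAM block holds the variable system parameters (changed with setenv, displayed with printenv), the Host ID database, the clock, and the battery.]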

Host ID contains:

● 48-bit hardware Ethernet address

● CPU-type code

● Host serial number
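
The banner lines in the earlier tip examples can be used to check this layout. In both examples, the low 24 bits of the host ID equal the serial number shown in the banner. The following Python sketch splits a host ID on that basis; reading the top byte as the CPU-type code is an assumption drawn from the list above, and the function name split_host_id is hypothetical.

```python
def split_host_id(host_id_hex):
    """Split an 8-digit hexadecimal host ID into (type_code, serial_number)."""
    value = int(host_id_hex, 16)
    type_code = value >> 24        # top byte, assumed to be the CPU-type code
    serial = value & 0xFFFFFF      # low 24 bits, matches the banner serial number
    return type_code, serial


# Host IDs from the SPARCstation 5 and SPARCserver 1000 banners.
for host_id in ("8072b024", "805001ef"):
    type_code, serial = split_host_id(host_id)
    print(host_id, hex(type_code), serial)
```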


NVRAM Contents – System Variable Parameters

SPARCstation 20 Workstation
<#2> ok printenv
Parameter Name Value Default Value

tpe-link-test? true true


output-device screen screen
input-device keyboard keyboard
keyboard-click? false false
keymap
ttyb-rts-dtr-off false false
ttyb-ignore-cd true true
ttya-rts-dtr-off false false
ttya-ignore-cd true true
ttyb-mode 9600,8,n,1,- 9600,8,n,1,-
ttya-mode 9600,8,n,1,- 9600,8,n,1,-
fcode-debug? false false
local-mac-address? false false
screen-#columns 80 80
screen-#rows 34 34
selftest-#megs 1 1
scsi-initiator-id 7 7
sbus-probe-list fe0123 fe0123
auto-boot? true true
watchdog-reboot? false false
diag-file
diag-device net net
boot-file
boot-device disk net disk net
silent-mode? false false
use-nvramrc? false false
nvramrc
sunmon-compat? false false
security-mode none none
security-password
security-#badlogins 0 <no default>
oem-logo default <no default>
oem-logo? false false
oem-banner <no default>
oem-banner? false false
hardware-revision <no default>
last-hardware-update <no default>
testarea 255 0
mfg-switch? false false
diag-switch? true false
<#2> ok
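
The ttya-mode and ttyb-mode values pack five serial settings into one string. The following Python sketch splits such a value, assuming the field order is baud rate, data bits, parity, stop bits, and handshake; the function name parse_tty_mode is hypothetical.

```python
def parse_tty_mode(mode):
    """Split a ttya-mode or ttyb-mode value into named serial settings."""
    baud, data_bits, parity, stop_bits, handshake = mode.split(",")
    return {
        "baud": int(baud),
        "data_bits": int(data_bits),
        "parity": parity,          # "n" in the listing above
        "stop_bits": stop_bits,    # kept as text so a fractional value is not lost
        "handshake": handshake,    # "-" in the listing above
    }


print(parse_tty_mode("9600,8,n,1,-"))
```

The default value gives 9600 baud, 8 data bits, and 1 stop bit, which is what a terminal or tip session attached to the port must be set to.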


Diagnostic Overview

Boot PROM-Based Diagnostics

● POST (power-on self test)

● Extended POST

● User diagnostics

SunVTS Diagnostics

● Installed as a package

● Requires the Solaris operating system


Default Boot Sequence – System Disk

[Figure: flowchart of the default boot sequence. After power on, the flow branches on the diag-switch? NVRAM parameter. If it is false, POST executes. If it is true, output goes to the serial port, POST executes, and extended diagnostics (memory) follow. A failure at any stage produces an error indication. After a pass, the system is initialized: if use-nvramrc? is true, NVRAMRC is read; then come probe all, install console, banner, and create device tree. The flow then tests auto-boot?. If it is true, the boot sequence starts from boot-device and boot-file (normal path) or from diag-device and diag-file (diagnostic path). If it is false, sunmon-compat? and security-mode determine whether the > prompt or the ok prompt appears.]
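
The effect of diag-switch? on the boot source can be sketched in a few lines of Python. This is a simplified reading of the flowchart, not OBP code: it assumes that diag-device and diag-file are used when diag-switch? is true, that boot-device and boot-file are used otherwise, and that the first entry of a device list is tried first. The function name boot_source is hypothetical.

```python
def boot_source(nvram):
    """Return the (device, file) pair that the boot sequence would try first."""
    if nvram.get("diag-switch?") == "true":
        devices = nvram.get("diag-device", "")
        boot_file = nvram.get("diag-file", "")
    else:
        devices = nvram.get("boot-device", "")
        boot_file = nvram.get("boot-file", "")
    candidates = devices.split()   # "disk net" lists two devices
    return (candidates[0] if candidates else ""), boot_file


# Values taken from the SPARCstation 20 printenv listing in this module.
ss20 = {
    "diag-switch?": "true",
    "diag-device": "net",
    "diag-file": "",
    "boot-device": "disk net",
    "boot-file": "",
}
print(boot_source(ss20))
```

With diag-switch? left at true the sketch selects net, which would explain a machine that tries to boot over the network after a diagnostic session.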


Default Boot Sequence

ok boot

The boot command starts the following sequence:

1. Execute primary boot (OBP).

2. Load and start secondary boot (/ufsboot).

3. Load and start kernel (/kernel/unix).

4. Kernel reads /etc/system.

5. Kernel initialized.

6. Kernel starts the init process.

7. Execute rc scripts.


OBP Device Tree Navigation – SPARCserver 1000 System


Use the commands in the following example to navigate the device tree. The cd / command brings you to the root node, and the ls command lists the nodes attached to the current node. Each line of ls output shows a virtual address (left) and a device name with its offset; the Solaris operating system uses these during boot. The example navigates to the sd and st devices.

<#0> ok cd /
<#0> ok ls
ffda476c io-unit@f,e1200000
ffd91c10 io-unit@f,e0200000
ffd8d2f4 mem-unit@f,e1100000
ffd8d210 mem-unit@f,e0100000
ffd8cebc cpu-unit@f,e1800000
ffd8cb68 cpu-unit@f,e1000000
ffd8c814 cpu-unit@f,e0800000
ffd8c4c0 cpu-unit@f,e0000000
ffd839a8 boards
ffd712fc openprom
ffd702bc virtual-memory@0,0
ffd7016c memory@0,0
ffd625cc aliases
ffd6257c options
ffd6252c packages

<#0> ok cd io-unit@f,e1200000
<#0> ok ls
ffda4d20 sbi@0,0

<#0> ok cd sbi
<#0> ok ls
ffdb0ffc lebuffer@1,40000
ffdac1f4 dma@1,81000
ffda9ff4 lebuffer@0,40000
ffda51ec dma@0,81000

<#0> ok cd dma@1,81000
<#0> ok ls
ffdac878 esp@1,80000
<#0> ok cd esp@1,80000
<#0> ok ls
ffdb05b4 st
ffafef4 sd
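
Each cd command in the session above adds one component to a device path. The following Python sketch replays the cd commands from a captured session and prints the resulting path. It is an illustration only: the function name current_node is hypothetical, and the handling of a leading slash and of .. follows UNIX shell conventions as an assumption.

```python
def current_node(transcript):
    """Replay the cd commands in an ok-prompt transcript and return the node path."""
    parts = []
    for line in transcript:
        tokens = line.split()
        if "cd" not in tokens:
            continue                      # ls commands and their output
        position = tokens.index("cd")
        if position + 1 >= len(tokens):
            continue                      # cd with no argument
        arg = tokens[position + 1]
        if arg.startswith("/"):
            parts = [p for p in arg.split("/") if p]   # absolute path
        elif arg == "..":
            if parts:
                parts.pop()
        else:
            parts.extend(p for p in arg.split("/") if p)
    return "/" + "/".join(parts)


# The cd commands from the example session.
session = [
    "<#0> ok cd /",
    "<#0> ok cd io-unit@f,e1200000",
    "<#0> ok cd sbi",
    "<#0> ok cd dma@1,81000",
    "<#0> ok cd esp@1,80000",
]
print(current_node(session))
```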


OBP User Diagnostics and Commands – SS1000

The following example highlights user diagnostics and commands.


<#0> ok help diag
Category: Diag (diagnostic routines)
test device-specifier ( -- ) run selftest method for specified device
Examples:
test /memory - test memory
test /io-unit@f,e0200000/sbi/lebuffer@0,40000/le - test net
test net - test net (device-specifier is an alias)
test scsi - test scsi (device-specifier is an alias)
watch-clock ( -- ) show ticks of real-time clock
watch-net ( -- ) monitor broadcast packets
watch-net-all ( -- ) monitor broadcast packets on all net interfaces
probe-scsi ( -- ) show attached SCSI devices
probe-scsi-all ( -- ) show attached SCSI devices for all host adapters
test-all ( -- ) run test for all devices with selftest method
test-memory ( -- ) test all memory if diag-switch? is true, otherwise
test memory specified by selftest-#megs
<#0> ok

These commands are useful in probing your system.


<#0> ok
<#0> ok probe-scsi
Target 0
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 1
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 2
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 3
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 4
Unit 0 Removable Tape ARCHIVE Python 28454-XXX4.28
Target 6
Unit 0 Removable Read Only device SONY CD-ROM CDU-8012 3.1e
<#0> ok probe-scsi-all
/io-unit@f,e1200000/sbi@0,0/dma@1,81000/esp@1,80000


Target 0
Unit 0 Disk CONNER CP30548 SUN0535AEBX93081BWC
Target 1
Unit 0 Disk CONNER CP30548 SUN0535AEBX93082TZA
Target 2
Unit 0 Disk CONNER CP30548 SUN0535AEBX93082MD4
Target 3
Unit 0 Disk CONNER CP30548 SUN0535AEBX93081BRX

/io-unit@f,e1200000/sbi@0,0/dma@0,81000/esp@0,80000
Target 0
Unit 0 Disk CONNER CP30548 SUN0535AEB793081TGX
Target 1
Unit 0 Disk CONNER CP30548 SUN0535AEB793081WNL
Target 2
Unit 0 Disk CONNER CP30548 SUN0535AEB793081Q8Z
Target 3
Unit 0 Disk CONNER CP30548 SUN0535AEB7930810A0

/io-unit@f,e0200000/sbi@0,0/dma@0,81000/esp@0,80000
Target 0
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 1
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 2
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 3
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 4
Unit 0 Removable Tape ARCHIVE Python 28454-XXX4.28
Target 6
Unit 0 Removable Read Only device SONY CD-ROM CDU-8012 3.1e

<#0> ok
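
When the probe output has been captured through tip, a short script can turn it into an inventory. The following Python sketch collects one record per device from probe-scsi or probe-scsi-all output. The function name parse_probe_scsi is hypothetical, and the sketch assumes that every adapter path starts with a slash and that target and unit numbers are decimal digits, as in the examples above.

```python
import re


def parse_probe_scsi(lines):
    """Return (adapter, target, unit, description) for each device reported."""
    devices = []
    adapter = None    # stays None for plain probe-scsi output
    target = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("/"):
            adapter, target = line, None   # new host adapter section
            continue
        match = re.fullmatch(r"Target (\d+)", line)
        if match:
            target = int(match.group(1))
            continue
        match = re.match(r"Unit (\d+)\s+(.*)", line)
        if match and target is not None:
            devices.append((adapter, target, int(match.group(1)), match.group(2)))
        # wrapped continuation lines of a description are ignored
    return devices


# Lines copied from the probe-scsi-all example.
sample = [
    "/io-unit@f,e1200000/sbi@0,0/dma@1,81000/esp@1,80000",
    "Target 0",
    "Unit 0 Disk CONNER CP30548 SUN0535AEBX93081BWC",
    "Target 1",
    "Unit 0 Disk CONNER CP30548 SUN0535AEBX93082TZA",
]
for device in parse_probe_scsi(sample):
    print(device)
```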

<#0> ok show-sbus
Board# 0 SBus slot 0 lebuffer le dma esp
Board# 0 SBus slot 1 cgsix


Board# 0 SBus slot 2


Board# 0 SBus slot 3 SUNW,soc
Board# 1 SBus slot 0 lebuffer le dma esp
Board# 1 SBus slot 1 lebuffer le dma esp
Board# 1 SBus slot 2
Board# 1 SBus slot 3

<#0> ok module-info
CPU# 0 : 50.0 MHz SuperSPARC / SuperCache
CPU# 1 : 50.0 MHz SuperSPARC / SuperCache
CPU# 2 : 50.0 MHz SuperSPARC / SuperCache
CPU# 3 : 50.0 MHz SuperSPARC / SuperCache

<#0> ok print-nvram-stat
Board#0 -- nvram master, Prom Version 2.13
Board#1 -- nvram slave, Prom Version 2.13+0.08
Board#2 -- no board or no Viking module
Board#3 -- no board or no Viking module


OBP User Diagnostics and Commands – SS20

The following example highlights user diagnostics and commands.


<#0> ok help diag
Category: Diag (diagnostic routines)
test device-specifier ( -- ) run selftest method for specified device
Examples:
test /memory - test memory
test /iommu/sbus/ledma@f,400010/le - test net
test floppy - test floppy disk drive
test net - test net (device-specifier is an alias)
test scsi - test scsi (device-specifier is an alias)
watch-clock ( -- ) show ticks of real-time clock
watch-net ( -- ) monitor broadcast packets using auto-selected
interface
watch-aui ( -- ) monitor broadcast packets using AUI interface
watch-tpe ( -- ) monitor broadcast packets using TPE interface
watch-net-all ( -- ) monitor broadcast packets on all net interfaces
probe-scsi ( -- ) show attached SCSI devices
probe-scsi-all ( -- ) show attached SCSI devices for all host adapters
test-all ( -- ) run test for all devices with selftest method
test-memory ( -- ) test all memory if diag-switch? is true, otherwise
test memory specified by selftest-#megs
<#0> ok

<#0> ok show-sbus
SBus slot f SUNW,bpp ledma le espdma esp
SBus slot e SUNW,DBRIe
SBus slot 0
SBus slot 1
SBus slot 2 cgsix
SBus slot 3

<#0> ok probe-scsi
Target 1
Unit 0 Disk QUANTUM P105SS 910-10-94A.1 08/31/89009030144
GENERIC

Target 3
Unit 0 Disk SEAGATE ST31200W SUN1.05872400795741
Copyright (c) 1994 Seagate
All rights reserved 0000


Target 4
Unit 0 Removable Tape ARCHIVE VIPER 150 21531-003 SUN-03.00.00
Target 6
Unit 0 Removable Read Only device TOSHIBA XM-
4101TASUNSLCD108404/18/94

Note – Refer to the Field Engineer Handbook, Volume 1, module section, “SPARC Processor Revision,” for information on the significance of the following commands and their results. The commands enable you to determine the revision level of the SPARC processor and any required operating system patches.

<#0> ok module-info
MBus : 50.00 MHz
SBus : 25.00 MHz
CPU#0 : 50.00 MHz SuperSPARC
CPU#2 : 50.00 MHz SuperSPARC

<#0> ok 2 switch-cpu
<#2> ok 0 switch-cpu
<#0> ok 2 switch-cpu
IMPL:0
<#2> ok 1 switch-cpu
Processor #1 is not present!
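
The module-info output shows in advance which CPU numbers switch-cpu will accept. The following Python sketch extracts the CPU numbers from captured module-info output; the function name present_cpus is hypothetical, and the pattern is written only against the two output styles shown in this module.

```python
import re

# Matches both "CPU#0 : 50.00 MHz SuperSPARC" and "CPU# 0 : 50.0 MHz SuperSPARC / SuperCache".
CPU_LINE = re.compile(r"CPU#\s*(\d+)\s*:\s*([\d.]+)\s*MHz\s+(.*)")


def present_cpus(lines):
    """Map each reported CPU number to its (clock in MHz, module description)."""
    cpus = {}
    for line in lines:
        match = CPU_LINE.search(line)
        if match:
            cpus[int(match.group(1))] = (float(match.group(2)), match.group(3).strip())
    return cpus


# Output copied from the SS20 example.
sample = [
    "MBus : 50.00 MHz",
    "SBus : 25.00 MHz",
    "CPU#0 : 50.00 MHz SuperSPARC",
    "CPU#2 : 50.00 MHz SuperSPARC",
]
cpus = present_cpus(sample)
print(sorted(cpus))
```

CPU 1 is absent from the result, which matches the "Processor #1 is not present!" message in the example.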


Lab 1

In this lab you will test devices using the device path, node name, and
device alias.

Note – Due to different PROM levels and architectures the syntax for
these labs can vary slightly. Refer back to the OBP reference card if
necessary.

1. Return the machine safely to the ok prompt.

2. Use help to list some PROM level diagnostics, and run them all.
ok help diag
Category: Diag (diagnostic routines)
test device-specifier ( -- ) run selftest method for specified device
Examples:
test /memory - test memory
test /iommu/sbus/ledma@5,8400010/le - test net
te................
...................
ok setenv selftest-#megs 99 (setting up to test 99 megs of memory)
ok test-memory
Testing memory \/
ok test net

Using AUI Ethernet Interface


Internal loopback test -- succeeded.
External loopback test -- Lost Carrier (transceiver cable problem?)
send failed. (Note – This test failed because the machine is hooked only to a twisted pair net.)

Using TP Ethernet Interface


Internal loopback test -- succeeded.
External loopback test -- succeeded.

Run them all.

Notes

3. Run tests on devices listed in devalias (some devices will not
have tests; that is important to know also.)
ok devalias
screen /iommu@0,10000000/sbus@0,10001000/cgsix@3,0
newdisk
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@0,0
ttyb /obio/zs@0,100000:b
ttya /obio/zs@0,100000:a
keyboard! /obio/zs@0,0:forcemode
...................................................
ok test keyboard
Keyboard Present
ok test audio
CS4231 ASIC SelfTest Passed.
L1A7192 DMA Loopback SelfTest Passed.

4. Test ttyb (Connect a tip or terminal at the speed and
characteristics set in NVRAM for ttyb.) If all is fine, you will see
the following:
!"#$%&'()*+,-
./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ab
cdefghijklmnopqrstuvwxyz{|}

5. Test the screen.

Note – If the ok prompt returns with no message, the self test found
no errors.

6. Test the keyboard.

7. Try a few more tests.

Notes


Lab 2

In this lab, you will do the following:

● Gather general information about the system

● Alter any NVRAM setting, display the settings, and reset to the
defaults

Note – Due to the capabilities of different architectures and PROM
revision level differences, some of these suggested commands may not
work on your particular lab machine. For example, switch-cpu
probably will not work on a single CPU system.

Run the selected console commands below and observe the output.
Decide for yourself which results are useful.

OBP Command Results


banner

help diag
help watch-tpe

show boot-device

show-hier

show-ttys

show-tapes

show-nets

show-disks

n switch-cpu (The letter n = 0, 1, 2, and so on)

module-info


OBP Command Results


print-nvram-stat

devalias

show-attrs

show-devs

printenv
printenv diag-switch?

OBP Command Results


setenv diag-switch? true

show diag-switch?

set-default diag-switch?

show diag-switch?

setenv fcode-debug? true

set-defaults

OBP Command Results


old-mode
(> n)

Optional, experiment from OBP reference card


Resetting to the Defaults


Setting the NVRAM parameters to defaults is important in fault
isolation.

ok set-defaults (which you did above)

Or do the following:

During power on, or after a reset from the ok prompt, hold down the
Stop (L1) and n keys simultaneously on the Sun keyboard. (There is no
corresponding simple key sequence for resetting NVRAM to defaults
from a port connection.)

Optional
The NVRAM settings can also be changed by root from the operating
system:

# /usr/sbin/eeprom

(With no parameters, this shows the current settings, similar to the
ok printenv command.)

The syntax for changing a parameter from the operating system is
slightly different from the OBP syntax, mainly in the use of the =
sign, an intolerance of spaces, and the handling of control
characters.


To change the boot device:

# /usr/sbin/eeprom boot-device=disk1

A reset, power cycle, or boot from the ok prompt is required for
most NVRAM changes to become effective.
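
The difference between the OBP form (setenv name value) and the operating system form (name=value, with no spaces around the = sign) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the helper names eeprom_assignment and eeprom_command are hypothetical, and the code only builds the argument list; it does not run eeprom.

```python
def eeprom_assignment(name, value):
    """Build the single name=value argument, with no spaces around '='."""
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("parameter name must be non-empty and contain no spaces")
    return f"{name}={value}"

def eeprom_command(name, value):
    """Return an argument list for /usr/sbin/eeprom (not executed here)."""
    return ["/usr/sbin/eeprom", eeprom_assignment(name, value)]

# OBP: setenv boot-device disk1    OS: eeprom boot-device=disk1
print(eeprom_command("boot-device", "disk1"))
# Names ending in '?' need quoting if typed at a shell prompt, because
# '?' is a shell wildcard; an argument list sidesteps that.
print(eeprom_command("diag-switch?", "true"))
```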

Notes


Lab 3

In this lab you will display and capture the names of the devices in the
system device tree and display their attributes. This is useful for
determining whether a failure of a Sun or third-party device is a
hardware problem or a software problem.

Note – The lab will take you to one device; if you have time, go out
and display some others.

ok cd /
ok ls
ffd3c184 FMI,MB86904
.........
ok cd iommu@0,10000000
ok ls
ffd2c2c8 sbus@0,10001000
ok cd sbus@0,10001000
ok ls
ffd42504 cgsix@3,0
f.......
ok cd cgsix@3,0
ok ls
ok .attributes
character-set ISO8859-1
intr 00000039 00000000
reg 00000003 00000000 01000000
dblbuf 00000000
v0,64125000,108000000,94500000
chiprev 0000000b
device_type display
model SUNW,501-2325 (look at this, the Sun part #!)
name cgsix
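
Each cd in the transcript above descends one node of a full device path. As a hedged illustration of how such a path is put together, the Python sketch below splits a path into node names and unit addresses. The function name split_device_path is an assumption, and any trailing :arguments (as in zs@0,100000:a) are simply left attached to the address.

```python
def split_device_path(path):
    """Split an OBP device path into (node name, unit address) pairs."""
    nodes = []
    for component in path.strip("/").split("/"):
        # The text before '@' is the node name; the rest is the unit address.
        name, _, address = component.partition("@")
        nodes.append((name, address or None))
    return nodes

path = "/iommu@0,10000000/sbus@0,10001000/cgsix@3,0"
for depth, (name, address) in enumerate(split_device_path(path)):
    print("  " * depth + name, "at", address)
```

Abbreviated paths such as /iommu/sbus/le, which appear in the help examples, split the same way but carry no unit addresses.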


Lab 4

In this lab, you will generate and test a PROM device alias.

With the increased use of storage arrays and other variously addressed
devices, it is important to be able to set a simple name for the device
that the customer can boot from or otherwise use.

Note – If you recreate the tip hardwire session, you can cut and paste,
instead of typing a lot of the entries in the lab.

1. ok devalias (to show the format of the device aliases already defined)


tape1 /iommu/sbus/espdma@5,8400000/esp@5,8800000/st@5,0
disk3 /iommu/sbus/espdma@5,8400000/esp@5,8800000/sd@3,0

2. ok show-disks
a) /obio/SUNW,fdtwo@0,400000
b) /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd
q) NO SELECTION
Enter Selection, q to quit: b
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd has
been selected.
Type ^Y ( Control-Y ) to insert it in the command line.
e.g. ok nvalias mydev ^Y
for creating devalias mydev for
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd

3. ok nvalias newdisk ^Y
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@0,0

4. ok devalias
newdisk
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@0,0
screen /iommu@0,10000000/sbus@0,10001000/cgsix@3,0
ttyb /obio/zs@0,100000:b

5. ok boot newdisk


Note – The boot will probably fail here unless a bootblock has
already been placed on the disk. You will be setting up for alternate
boots in a later module.

Option to Steps 2–5

The following option can be used instead of performing steps 2–5
above.

Hand edit the nvramrc file using information from the device tree;
then enable the use of it. (This is required currently for making aliases
for storage array devices or with older PROMs that do not support the
nvalias command.)
ok devalias
ok cd /
ok ls (just to find our way!)
ffd3c184 FMI,MB86904
ffd2d1e0 virtual-memory@0,0
ffd2d124 memory@0,0
ffd2c458 obio
ffd2c184 iommu@0,10000000
ok cd iommu@0,10000000
ok ls
ffd2c2c8 sbus@0,10001000
ok cd sbus@0,10001000
ok ls
ffd4242c cgsix@3,0
ffd423cc power-management@4,a000000
ffd41c80 SUNW,CS4231@4,c000000
ffd40024 ledma@5,8400010
ffd3ff98 SUNW,bpp@5,c800000
ffd3cea4 espdma@5,8400000
ok cd espdma@5,8400000
ok ls
ffd3d280 esp@5,8800000
ok cd esp@5,8800000
ok ls
ffd3f854 st
ffd3f13c sd
ok cd sd
ok pwd
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd
ok nvedit

0: devalias newdisk
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@0,1 (Carriage return)
1:
(Enter Control-c)
ok nvstore
ok setenv use-nvramrc? true
use-nvramrc? = true
ok reset


Lab 5 – Optional

In this lab, you will construct, download, and run FORTH macros.

1. Set up the tip command as you did in the POST lab. That is,
display one machine’s ok prompt in another machine’s “tip
hardwire” session.

2. Do some basic FORTH computations at the ok prompt.


proto2# tip hardwire
connected

ok 4 5 + (Note the reverse Polish notation to add 4 and 5)


ok . (The . here means to pop the last entry in the stack and display it.)
9
ok 2 3 * .
6

3. Make a FORTH macro and run it.


ok : add4
] + + +
] .
] ;
ok 2 3 4 5 add4
e (Notice the hexadecimal output.)
ok decimal (Entering hex or decimal selects that mode for both input and output)
ok 2 3 4 5 add4
14

4. Because the macros you construct do not survive a power on
reset, construct a macro in a file that you can download any time
you want.

You are going to create the file on the machine that is running
the operating system; then download it to the machine that is
at the ok prompt.
proto2# vi /opt/mapping
: mapping
38e 1000 pgmap!
1000 map?
1000 100 ab fill
1000 100 dump
;

5. In the tip window:
ok dl
Ready for download. Send file then type ^D
~C Local command? cat /opt/mapping (That is a tilde followed by an uppercase C)

away for 0 seconds


sift(Enter a control-d)
ok mapping (Run it now)
Virtual : 0000.1000
Context : @ 0.01ff.f000 001f.eec1 # 0
Region : @ 0.01fe.ec00 001f.ee71
Segment : @ 0.01fe.e700 001f.ee61
Page : @ 0.01fe.e604 0000.038e Cached Access : rwxrwx
Physical : 0.0000.3000
\/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef
1000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
1010 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
1020 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
1030 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
1040 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
1050 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
1060 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
1070 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
1080 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
1090 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
10a0 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
10b0 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
10c0 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
10d0 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
10e0 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
10f0 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ++++++++++++++++
ok

6. Option 2 – Use the reference card and make a macro to traverse
the device tree (remember cd / ?) or think of a diagnostic or
special display you would like to try.

Notes

7. Option 3 – Use nvedit to store your macro in NVRAM to survive
the power on reset. Example:
ok nvedit

0 : multiply3 <cr>
1 ] * * <cr>
2 ] . <cr>
3 ] ; <cr>
(Enter Control-c to exit nvedit)
ok nvstore
ok setenv use-nvramrc? true
ok reset
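
The reverse Polish arithmetic and the colon definitions used in this lab can be modeled with a few lines of Python. The sketch below is an illustration only, not how the PROM's FORTH interpreter is implemented: run_forth is a hypothetical name, it handles just numbers, + * . hex decimal, and : name ... ; definitions, and it starts in hexadecimal to mirror the output seen in step 3.

```python
from collections import deque

def run_forth(source, base=16):
    """Evaluate a tiny FORTH subset; return the strings that '.' would print."""
    stack, words, printed = [], {}, []
    queue = deque(source.split())
    while queue:
        tok = queue.popleft()
        if tok == ":":
            # Colon definition: collect the body up to the closing ';'.
            name = queue.popleft()
            body = []
            tok = queue.popleft()
            while tok != ";":
                body.append(tok)
                tok = queue.popleft()
            words[name] = body
        elif tok in words:
            # Run a defined word by pushing its body back onto the input.
            queue.extendleft(reversed(words[tok]))
        elif tok == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif tok == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif tok == ".":
            value = stack.pop()
            printed.append(format(value, "x") if base == 16 else str(value))
        elif tok == "hex":
            base = 16
        elif tok == "decimal":
            base = 10
        else:
            # Numbers are read in the current base, like the ok prompt.
            stack.append(int(tok, base))
    return printed

print(run_forth("4 5 + ."))
print(run_forth(": add4 + + + . ; 2 3 4 5 add4"))
print(run_forth(": add4 + + + . ; decimal 2 3 4 5 add4"))
```

The second call prints e and the third prints 14, matching the lab transcript: the sum is the same, only the number base used for input and output differs.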

Notes

Diagnostic Tools 5

Objectives
Upon completion of this module, you will be able to:

● Describe a selected subset of SunOS commands, files, and utilities
in terms of:

  ● What their customary functions and definitions are.

  ● When, where, and how they can be useful as a fault isolation
  aid.

References
Solaris User and System Administration Answerbooks

Solaris man pages


Diagnostic Tools, Functions and Uses

Command or Tool             Use

adb                         Analyze dumps and a running system.
Answerbook                  Reference information: hardware, user, system
                            administration, and other.
crash                       Analyze crash dumps.
diff                        Compare file contents.
dmesg                       Analyze recent log messages.
eeprom                      Analyze and change boot PROM settings.
/etc/mnttab                 Lists currently mounted file systems.
/etc/nsswitch.conf          Contains name-service configuration information.
/etc/rmtab                  Contains NFS-shared files mounted by other
                            systems.
/etc/sharetab               Contains directories or files shared by NFS.
file                        Determine a file’s type.
find                        Look for specific files in the file system
                            structure.
format                      Analyze or modify disk partition information.
grep                        Analyze file contents, look for specific
                            patterns.
ifconfig                    Analyze the status of network interfaces.
iostat                      Analyze I/O performance issues.
kadb                        Trap kernel and low-level faults.
ls                          Analyze file properties.
netstat                     Analyze network tuning information.
nfsstat                     Analyze NFS performance information.
nm                          Analyze executable file contents (display
                            symbol table).
pkgchk                      Check file integrity and accuracy of
                            installation.
prtconf -v                  Get system device information from POST probe.
prtdiag -v                  Sun-4d, Sun-4u diagnostic, watchdog reset, and
                            configuration information.
ps                          Analyze properties of running processes.
A running system            Compare to a failing system.
sar                         Analyze system performance information.
Shells                      Use the -x and -v options of the shells to see
                            the commands as they are executed.
showrev -p                  List currently installed patches.
snoop                       Display and analyze network traffic.
strings                     Look through files (object and binary) for
                            ASCII strings.
sunsolve                    Look for tips and known problems.
sysdef                      Analyze device and software configuration
                            information.
swap                        List and change swap usage.
truss                       Trace system calls issued and used by a program
                            or command.
/var/adm/messages           Contains records of console and boot messages.
/var/adm/install/contents   Locate installed files and directories.
/var/adm/utmp               User and accounting information.
/var/adm/wtmp
/var/adm/utmpx              Extended user and accounting information.
/var/adm/wtmpx
/var/adm/sulog              History of root logins and su use.
vmstat                      Analyze memory performance statistics.
whoami                      Displays the effective current user name
                            (checks password and group files).


Open Discussion

What other tools have you used or heard about?

1.

2.

3.

4.

5.

6.

7.

8.

9.

SunVTS System Diagnostics 6

Objectives
Upon completion of this module and lab, you will be able to:

● Install the SunVTS™ package on a system.

● Select, set up, and run SunVTS diagnostic tests.

● Run SunVTS over a network.

● Run SunVTS without a frame buffer.

● Analyze SunVTS test results.


Introduction

SunVTS is Sun’s on-line validation test suite. It can verify the
functionality of most Sun hardware devices.

The SunVTS tests can be used to stress certain areas of the system as
needed for diagnostic and troubleshooting purposes.

The SunVTS diagnostic software is the successor to SunDiag™
diagnostics. SunDiag software is shipped with the Solaris™ 2.4
operating system and earlier releases. SunVTS runs on the Solaris 2.5
operating system and later.

Like its SunDiag predecessor, SunVTS software can run concurrently
with customer applications and the Solaris operating system.

Hardware and Software Requirements


These are the requirements to run SunVTS Version 1.0 software
successfully in the OpenWindows™ environment:

● The Solaris 2.5 operating system.

● The SunVTS 1.0 package.

● The operating system kernel must be configured to support all
peripherals that are to be tested.

● OpenWindows (or Motif) should be properly installed on the
system.

● Superuser access is required for both installation and startup of
SunVTS software.


The SunVTS Architecture

The SunVTS architecture is divided into:

● The user interfaces

● The SunVTS kernel

● The hardware tests

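[Figure: SunVTS architecture. The graphical user interface, the TTY
user interface, and the SunVTS utilities sit above the SunVTS
application programming interface. Below it, the SunVTS kernel probes
the configuration, schedules tests, logs messages, and monitors test
results. Below the kernel, the test interface connects to the SunVTS
hardware tests and to user-created custom tests.]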


User Interfaces

SunVTS has a graphical user interface for OpenWindows and a TTY
version of the interface for a terminal.

Kernel
The kernel runs as a background process, a daemon. Upon startup of
the SunVTS software, the SunVTS kernel probes the system kernel for
installed hardware devices. Those devices are displayed on the
SunVTS user interface.

Both the SunVTS kernel and the user interface must be started before
testing can begin.

Hardware Tests
For each supported hardware device, a corresponding hardware test
can validate its operation. Each test is a separate process from the
SunVTS kernel process.

Additional References
For more extensive information and usage of the SunVTS diagnostic
software, see the following publications:

● SunVTS Users Guide, Part Number 801-7271-10

● SunVTS Test Reference Guide, Part Number 802-1448-10


Installing SunVTS Software

The SUNWvts (application) and SUNWvtsmn (man pages) packages require
approximately 16 megabytes of disk space (combined) in /opt, the
default install directory.

The pkgadd command is used to install SunVTS software from the
CD-ROM Updates for Solaris Operating Environment 2.5 (Part Number
704-5104-10).

Insert the CD-ROM into the CD-ROM drive, and type the pkgadd
command as root:

# pkgadd -d /cdrom/upd_sol_2_5_smcc/SMCC

Select SUNWvts and SUNWvtsmn (options 14 and 15, respectively) and
start the install.

View the screen output from the pkgadd application to ensure that the
install completed successfully.


Using the OpenWindows SunVTS Graphical User Interface

The SunVTS OpenWindows graphical user interface can be used to


select and run tests and view test results.

As root, type the following command:

# /opt/SUNWvts/bin/sunvts

[Figure: SunVTS main window, with callouts for the System Status
panel, Performance meter, Control panel, Tests Selection panel, Test
Status panel, Console window, and Test Option panel.]


The SunVTS Graphical User Interface

● The Control panel – A panel that contains the buttons that you
use to control the SunVTS user interface.

● The Test Option panel – A panel where you choose the global
options for all SunVTS tests.

● The Tests Selection panel – A panel where you select the tests and
test groups to run; you can also change the options for each test
and test group.

● The System Status panel – A panel that shows the general testing
status.

● The Test Status panel – A panel that displays pass and error
counts for each test and test group.

● The Performance meter – A meter that displays performance
statistics for the system being tested.

● The Console window – A window that displays operating system
messages and test messages.

The following are the buttons on the control panel and their functions:

Start       Click on the Start button to start all enabled tests. When
            the tests are running, the Start button is dimmed.

Stop        Click on the Stop button to halt all active tests. The test
            results remain on the Test Status panel after testing is
            completed. Click on the Stop button only once. Some tests
            do not stop immediately, so the System Status may slowly
            change from Stop to Idle.

Reset       The Reset button resets system passes, total errors, and
            elapsed time counts to zero for each test. When testing
            starts, the Reset button changes to Suspend.

            Click on the Suspend button to pause all SunVTS tests.
            When you do this, the button label changes to Resume.

            To resume testing, click on the Resume button.

Quit        Using the Quit button, you can terminate the user
            interface, the SunVTS kernel, or both.

            If you want to restart the kernel from the command line
            to connect to another machine on the network to run
            tests, terminate the SunVTS kernel only.

Sys Config  Click on the Sys Config button to display the Sys Config
            menu. The menu choices are to display or print the test
            system configuration information, or to reprobe the test
            system.

Log Files   SunVTS saves the status of its progress in three log
            files. Use the Log Files button to look at the error
            messages, information, or UNIX® messages log files.

Connect to  Click on the Connect to button to connect the user
            interface to another machine on the network. Once you are
            connected to the SunVTS kernel on the test machine, you
            can view and control that system's testing status.

Reprobe     Click on the Reprobe button for the SunVTS kernel to
            reprobe the hardware devices on the system being tested.
            This option could be used if you replace a SCSI device on
            your system.


Selecting and Setting Up Tests

From the Test Selection panel, you can select the tests you want to run,
and specify the testing options.

Options can be set globally for all of the SunVTS tests you select. Click
on the Set Options button for the SunVTS Testing Options menu.

Options can also be set for each test group. Press the button of a test
group or test name for the option menu.


SunVTS Testing Options

The following options can be set to apply to all selected SunVTS tests
or, if applicable, to individual test groups or tests.

sys_override       Supersedes the specific group and test options in
                   favor of the options in this window.

auto_start         Automatically runs the tests selected in a
                   previously saved option file when SunVTS is started.

single_pass        Runs only one pass of each selected test.

send_email         Chooses if and when the test status messages should
                   be sent to you through email.

email_addr         Indicates the email address where the test status
                   messages are sent.

log_period         Specifies, in minutes, the time between test status
                   email messages when the periodically option is
                   selected in the send_email option.

max_sys_pass       States the maximum number of system passes before
                   stopping all tests. (A value of 0 causes the tests
                   to run until you click on the Stop button.)

max_sys_errs       States the maximum number of system errors before
                   the SunVTS stops all tests. (A value of 0 causes the
                   tests to continue regardless of errors.)

max_sys_time       Specifies the maximum number of minutes that SunVTS
                   continues testing. (A value of 0 makes the tests run
                   until you click on the Stop button.)

group_override     Supersedes the specific test options in favor of the
                   group options in this window.

group_concurrency  Sets the number of tests you want to run at the same
                   time in the same group.

group_lock         Protects group options from being changed.

test_mode          Specifies regular mode or quick mode. Regular mode
                   runs full versions of each test. Quick mode runs
                   abbreviated versions of each test.

test_lock          Protects test options from being changed.

verbose            Displays all messages in the SunVTS console window.
                   When disabled, only error messages are displayed.

core_file          Generates a core dump in the current directory if
                   certain abnormal conditions occur. This can be used
                   for software debugging purposes.

run_on_err         Continues testing until the max_errs number is
                   reached.

max_pass           Specifies the maximum number of passes that tests
                   can run.

max_errs           States the maximum number of errors any test allows
                   before stopping. (A value of 0 causes the tests to
                   continue regardless of errors.)

max_time           States, in minutes, the time limit for test runs.
                   (A value of 0 means there is no limit.)

num_instances      Specifies the number of instances to run of each
                   test that is scalable.

p_affinity         Specifies which processor should be used to run all
                   tests. If no processor is specified, the testing is
                   distributed among all the processors. This option is
                   only available on multiprocessor systems.

test_lock          Enables locking on all tests.
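
Several of the limits above share one convention: a value of 0 means "no limit." The Python sketch below models that convention for the three system-level limits. It is an illustration written for this guide, not SunVTS code; the function names are hypothetical, and treating a limit as reached when the count is greater than or equal to it is an assumption.

```python
def limit_reached(count, limit):
    """A limit of 0 means 'no limit', as the option descriptions state."""
    return limit != 0 and count >= limit

def reached_limits(passes, errors, minutes,
                   max_sys_pass=0, max_sys_errs=0, max_sys_time=0):
    """Return the names of the system-level limits that have been reached."""
    checks = [
        ("max_sys_pass", passes, max_sys_pass),
        ("max_sys_errs", errors, max_sys_errs),
        ("max_sys_time", minutes, max_sys_time),
    ]
    return [name for name, count, limit in checks if limit_reached(count, limit)]

# With every limit left at 0, testing continues until Stop is clicked.
print(reached_limits(passes=500, errors=12, minutes=900))
# A 60-minute time limit is reached after 60 minutes of testing.
print(reached_limits(passes=3, errors=0, minutes=60, max_sys_time=60))
```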


Tests Switch
Three settings are available:

● Default enables the default group of tests. This includes all tests
that do not require intervention.

● None deselects all tests.

● All selects all the tests.

Option Files
You can save your SunVTS testing selections to a file. This prevents
you from having to reset these same options again in the future. Test
settings are saved in the /var/adm/sunvtslog/options directory.

To save an option file, type a name for the option file, and click on the
Store button.

Intervention

Certain tests require that you intervene before you can run the test
successfully. These include tests that require media or loopback
connectors.

● Loopback connectors are required to run certain tests, such as
serial port tests, successfully.

Refer to the SunVTS Test Reference Guide for more information
about loopback connectors, and which tests need them.

● Media (tapes, diskettes, or CDs) must be present in the drive(s)
before the system is probed at SunVTS startup. If this is not done,
an error message is reported.

Using old or damaged tapes and diskettes may cause errors in
corresponding tests.

You cannot select these tests until you enable the intervention mode.
This setting does not change the test function; it just serves as a
reminder that you must intervene before the test can be successfully
completed.


Running the SunVTS Tests

To start the tests, click on Start.

System Status Panel


The System Status panel shows the general testing progress.

Test Status Panel


The Test Status panel shows the current status of all devices under test.

Actively running tests are marked with an asterisk.

The icons at the top of the Test Status panel enable you to navigate
through the list of tests in case there are more tests running than can
be displayed on the panel.

Performance Monitor Panel


The performance monitor shows various levels of system activity. It
displays the same information as the operating system Performance
Meter utility.

● cpu – Percentage of CPU used.

● pkts – Ethernet packets per second.

● page – Paging activity in pages per second.

● swap – Jobs swapped per second.

● intr – Number of device interrupts per second.

● disk – Disk use in transfers per second.

● cntxt – Number of context switches per second.

● load – Average number of runnable processes over the last
minute.

● colls – Collisions per second detected on the Ethernet.

● errs – Errors per second on receiving packets.


Reviewing SunVTS Test Results

System Status Panel


When your tests have run, the results are displayed in the system
status panel.

Console Window Messages


If you enabled the verbose option during test selection, all testing
activity is displayed in the SunVTS Console window as the tests run.

Errors are also recorded in a log file that you can view by clicking on
the Log Files button on the Control panel.

Log Files
You can use the Log Files menu to view error, information, and UNIX
message log files that are managed by the system.

1. Click on the Log Files button.

2. Click on the Display option of the Log Files menu to display an
error window.

3. To close this window, click on the pushpin located in the
upper-left corner.

4. Display the Information and UNIX Msgs files, but do not remove
any files.


Using SunVTS in TTY Mode

If you use the SunVTS software in TTY mode, no frame buffer is
required. To run in TTY mode, perform the following steps.

1. Start the SunVTS kernel with the vtsk command:

# /opt/SUNWvts/bin/vtsk

2. Start the SunVTS TTY User Interface with the vtstty command:

# /opt/SUNWvts/bin/vtstty


Negotiating the SunVTS TTY Interface

The SunVTS TTY interface has four different sections: a Console, a
Status panel, a Control panel, and a Tests panel. Messages pertaining
to SunVTS tests are displayed in the Console section.

Only one panel has focus (selected for keyboard input) at a time. Focus
can be shifted between the three panels by pressing the tab key. The
panel with focus is bordered by asterisks (*).

Each panel has various options. A selected option can be changed by
pressing the spacebar. Use the arrow keys to move between the
options in a panel.

The Esc key is used to close pop-up option windows.

The TTY window can be refreshed by pressing Control-l.

The TTY interface is functionally similar to its graphical counterpart.
(Review the earlier section “The SunVTS Graphical User Interface” for
information on each field.)

[Figure: SunVTS TTY interface, with callouts for the selected panel,
the Control panel, the Tests panel, the Status panel, and the Console.]


Using SunVTS Remotely

A testing session can be run across a network or even a modem.


SunVTS consists of two components: the kernel and the user interface.

Kernel Interface
To test a remote system, it must have the kernel process
/opt/SUNWvts/bin/vtsk running.
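
Before connecting, it can help to confirm that the kernel process is
present on the system to be tested. The following is a minimal sketch:
the function name vtsk_running is hypothetical, and it assumes that the
command name is the last field on each line of the ps -e output.

```shell
#!/bin/sh
# Hypothetical helper: reads "ps -e" style output on standard input and
# succeeds (exit status 0) if a process whose command name is vtsk is listed.
vtsk_running() {
    awk '$NF == "vtsk" { found = 1 } END { exit !found }'
}

# Usage on the system to be tested (commented out so the sketch has no
# side effects when it is sourced):
# ps -e | vtsk_running && echo "SunVTS kernel is running"
```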

User Interface
To test a local system, the user interface can be either TTY (teletype)
or graphical.

SunVTS System Diagnostics 6-21



Graphical User Interface

The graphical user interface (GUI) component must have the interface
/opt/SUNWvts/bin/sunvts running as an active process.

Click on the Connect to button in the Control panel. In the Connect to
Machine window that is displayed, type the name of the remote
computer to be tested (/opt/SUNWvts/bin/vtsk must already be
running on the remote computer). Click on the Apply button.

6-22 Sun Systems Fault Analysis Workshop



Connecting Directly to the Remote Computer

You can also connect directly to the remote computer running the
SunVTS kernel when starting the graphical user interface.

# /opt/SUNWvts/bin/sunvts -h remote_hostname

TTY Interface

The TTY interface program /opt/SUNWvts/bin/vtstty is run from a
terminal that is logged in to the computer to be tested.

Testing can now be carried out as described earlier in “The SunVTS
Graphical User Interface” section.

SunVTS System Diagnostics 6-23



Lab Overview

Lab Objectives
● Install the SunVTS package on a system.

● Review SunVTS options.

● Set up SunVTS tests for local and remote testing.

● Run the tests in graphical and TTY mode.

● Monitor the tests as they run.

● Pause testing to change an option.

● Stop the testing session.

● Review the results.

Equipment
To complete this lab, you will need:

● At least two networked SPARC desktop workstations running the
Solaris 2.5 (or above) operating system.

● Access to a copy of the SUNWvts package, either from a server on
the lab network or on the CD-ROM, SMCC Updates for Solaris 2.5.

6-24 Sun Systems Fault Analysis Workshop



Lab Tasks
In this lab, you are going to verify that all hardware on your lab
system is functional. You will need the SunVTS software present on
your system.

1. Log in to the system as superuser.


# su

2. Change to the /opt directory, and list the contents.


# cd /opt
# ls

3. If SUNWvts is already present, run the pkgrm command to remove
the old SUNWvts package.

# pkgrm SUNWvts

4. (Optional step) Install SunVTS using the pkgadd program as
described earlier in this module. (Use pkgrm SUNWvts to remove
the existing package.)

5. Change to the SUNWvts/bin directory, and start the sunvts
application.

6. Display your lab system’s configuration.

7. Set your global options to:

● Enable the verbose option.

● Run in Quick Test mode.

● Send email to your_username@your_hostname when an error
occurs.

8. Save the configuration by choosing Options ➤ Save.

9. Start the session, and observe.

10. Read the manual pages:


# man sunvts
# man vtstty
# man vtsk
# man vtsui
# man vtsprobe

SunVTS System Diagnostics 6-25



Lab Tasks
Now that you have a general idea of how the diagnostics work, here
are some steps to try to get more familiar with the features.

1. Run tests on another machine in the lab.

2. Run an intervention test.

The instructor can supply you with scratch media.

3. Try the TTY interface.

4. Run tests remotely using the Connect to button.

5. Run tests remotely using the following command:


# /opt/SUNWvts/bin/sunvts -h your_neighbours_hostname

6. Kill the SunVTS kernel process (vtsk) and try the previous two steps
again.

7. Deselect all tests.

8. Run the audio test. Observe the different selections that are played
depending on the machine you are testing.

9. Run only the frame buffer test.

10. Select audio test.

a. Auto-start.

b. Save configuration (Options) to the file class-vtstest.

c. Kill SunVTS.

d. Start SunVTS using the following command:


# /opt/SUNWvts/bin/sunvts -o class-vtstest

(Consider using the above command with cron and enabling
email on error.)

11. Find the maximum number of passes allowed for the fputest.

6-26 Sun Systems Fault Analysis Workshop


12. Attempt to force an error.

If your instructor has any failed hardware attached, observe a test
on that. Otherwise, do the following:

a. Run a test that requires intervention, and do not attach the
necessary media or cables.

b. Disconnect a peripheral, and then try to test that peripheral.

c. Disconnect the Ethernet cable.

d. Run nettest on a non-existent machine.

13. View your email.

Read the SunVTS console messages. Peruse the error and
information logs.

SunVTS System Diagnostics 6-27



6-28 Sun Systems Fault Analysis Workshop


SunSolve 7

Objectives
Upon completion of this module, you will be able to:

● Describe how the SunSolve™ system helps resolve system faults.

● Differentiate between the SunSolve CD-ROM™ and SunSolve
Online™ databases.

● Describe how to apply for a SunSolve Online account.

● Install the SunSolve and patches software on a server and share
them correctly to the network.

● Start, configure, and display the SunSolve software from an
installed server or from the CD-ROM (without installing).

● Given a set of symptoms, discover a likely cause using SunSolve.

● Display the installed patches on your system.

● Install all the recommended or suggested patches from the
CD-ROM.

● Install a specific patch from the CD-ROM.

● Remove a specific patch.

● Display the current patch report for a given operating system.

7-1

References
● SunSolve Online User’s Guide

● SunSolve User’s Guide

7-2 Sun Systems Fault Analysis Workshop



Overview

Sun users have requested a mechanism they could use to access
information about systems and system problems. Sun’s solution
centers also needed an organized central database to report, track, and
dispense information about system problems. The SunSolve system
supports these needs, and it is used to distribute operating system
patches, important technical information, and problem workarounds
to both customers and Sun support.

The SunSolve CD-ROM and SunSolve Online systems are intended to
supplement, but not replace, the traditional human support interface.

SunSolve 7-3

Distribution

SunSolve is available and shipped automatically to all Sun customers
(and Sun field offices or VAR accounts) with any level of service
agreement, or as a separate purchase through the 1-800-USA4SUN
phone number.

Additional or lost CD-ROM replacements can be arranged through
your service provider or by using the 1-800-USA4SUN phone service.

Updated CD-ROMs are sent out about ten times a year and have
information regarding all supported software, operating system levels,
and hardware.

SunSolve Online is updated nearly every business day.

To use SunSolve Online, you must establish an account
using your service contract number (from your service provider). Sun
employees use their employee ID number.

The information and search mechanisms for SunSolve Online and
SunSolve CD-ROM are practically identical. The SunSolve Online
information can be more up-to-date because of the logistics of
shipping the CD-ROMs.

7-4 Sun Systems Fault Analysis Workshop



SunSolve Online Account

You can apply for a SunSolve Online account by using a Web browser
to visit one of the following Web sites:

● http://sunsolve.Sun.COM

● http://SunSolve1.Sun.COM

● http://www.Sun.Com

Follow the steps below to apply for a SunSolve Online account:

1. Click on the Register button.

2. Click on the Create new account button and answer the questions.
(You must have a SunService Spectrum Account number to
register for a SunSolve Online account.) There is little or no wait in
receiving an account once you submit the form.

Once you have an account, click on the Patches button.

For more details on a SunSolve Online account, refer to the SunSolve
Online Reference Guide.

SunSolve 7-5

Installing SunSolve

Install the SunSolve software and patches on a server and share them
correctly to the network.

1. Insert the SunSolve CD-ROM in server1, which is running the vold
daemon process and the OpenWindows environment.

2. Verify that the vold process has mounted SunSolve:

# mount
/cdrom/sunsolve_2_8 on /vol/dev/dsk/c0t6/sunsolve_2_8 read only on Fri Mar 2

3. Set the mounted software to be shared.


# share -o ro /cdrom/sunsolve_2_8

a. If this is the first time you have run the share command on
this machine, edit the /etc/dfs/dfstab file:

# vi /etc/dfs/dfstab

Add the line:

share -o ro /cdrom/sunsolve_2_8

b. Start the NFS server:

# /etc/init.d/nfs.server start

c. Check to see if the share command was successful:

# showmount -e
Or
# dfshares -F nfs servername

4. On each client machine, verify that the server is sharing the
CD-ROM, then create a mount point and mount it:

# showmount -e server1
/cdrom/sunsolve_2_8 (everyone)
Or
# dfshares -F nfs server1
# mkdir /cdrom/sunsolve_2_8
# mount server1:/cdrom/sunsolve_2_8 /cdrom/sunsolve_2_8
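
The client-side check can be scripted so that the mount is attempted
only when the share is visible. The following is a minimal sketch: the
function name is_exported is hypothetical, and it assumes that the
shared directory is the first field on a line of the showmount -e
output.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the directory given as $1 appears as
# the first field of a line in "showmount -e" style output on stdin.
is_exported() {
    awk -v dir="$1" '$1 == dir { found = 1 } END { exit !found }'
}

# Usage on a client (commented out so the sketch has no side effects):
# showmount -e server1 | is_exported /cdrom/sunsolve_2_8 || echo "not shared"
```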

7-6 Sun Systems Fault Analysis Workshop



Installing SunSolve Using File Manager


If your CD mounts automatically and you see the File Manager
window below, click on the START icon. The Installation GUI shown
on the following page is displayed.

The START installation program starts the Installation GUI, which
invokes all of the necessary processes for installing the SunSolve
software.

SunSolve 7-7

Installation GUI Window

7-8 Sun Systems Fault Analysis Workshop




After you have clicked on START, the prompt asks if you want to run
the Installation GUI with a Shell Tool. If you choose this option, your
Shell Tool displays any error messages generated by the process.

When the Installation GUI window is displayed, follow the steps
below:

1. Click on the product name SunSolve.

2. Click on Install.

Note – If you have a previous version of SunSolve installed and have
selected SunSolve as the product to install, click on Upgrade, which
updates your previous SunSolve installation (refer to the Installation
GUI window on the previous page).

3. Enter your root password when you are prompted by the
installation script.

4. Respond to the prompts, and follow the on-screen instructions.

SunSolve 7-9

Linking Searches to AnswerBooks

If your machine or the server has AnswerBook™ software installed,
return to the Install page and click on the Link AnswerBook icon.

Sharing SunSolve
To share the SunSolve software installed at /opt/SUNWss from the
server, perform the following steps.

1. Edit the /etc/dfs/dfstab file.


# vi /etc/dfs/dfstab

2. Add the following line:


share -o ro /opt/SUNWss

3. Execute one of the following commands:


# shareall

Or
# /etc/init.d/nfs.server start
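
If the dfstab edit is scripted rather than made in vi, it helps to avoid
adding the same share line twice. The following is a minimal sketch:
the function name add_share_line is hypothetical, and the file argument
exists only so that the function can be tried on a scratch copy first.

```shell
#!/bin/sh
# Hypothetical helper: append "share -o ro <dir>" to a dfstab-style file
# only if that exact line is not already present.
# $1 = directory to share, $2 = file to edit (defaults to /etc/dfs/dfstab)
add_share_line() {
    dir=$1
    file=${2:-/etc/dfs/dfstab}
    line="share -o ro $dir"
    if grep -qxF "$line" "$file" 2>/dev/null; then
        return 0                  # already listed, nothing to do
    fi
    echo "$line" >> "$file"
}
```

For example, add_share_line /opt/SUNWss followed by shareall corresponds
to steps 1 through 3 above.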

7-10 Sun Systems Fault Analysis Workshop



Starting SunSolve

Starting From an Installed Server


From the File Manager window below, click on the sunsolve icon to
display the SunSolve window shown on the following page.

SunSolve 7-11

Starting From the CD-ROM


If you did not have room or time to install the SunSolve software, you
can run directly from the SunSolve CD-ROM (the searches will be
slower). Enter the commands below to display the SunSolve window.
# cd /sunsolve_mount_point_directory
# ./sunsolve

The SunSolve Window


Once you have started the SunSolve software, the following window is
displayed.

Note – If you are asked if you want to run in a Shell Tool, answer yes.

7-12 Sun Systems Fault Analysis Workshop



Search Tool

To start using SunSolve, click on SearchTool in the SunSolve window.
The SearchTool window is displayed.

SunSolve 7-13

Configuring SunSolve
To configure the SunSolve software, click on the Properties button in
the SearchTool window. The SearchTool properties window is
displayed.

7-14 Sun Systems Fault Analysis Workshop



SearchTool Properties
The SearchTool properties window contains a Category menu button
with the following property types:

● Search – You can choose the maximum number of documents to
search, the search time limit, and either basic or extended query
mode.

Notice here that the maximum number of documents to retrieve is set
to 100, the search timeout is set to 60 seconds (make the timeout
longer if searching across a network), and Fuzzy Boolean searching is
on (this helps to find related keywords in searches).

To apply these new settings, click on the Apply button.

● Viewer – You can specify the text viewer, the PostScript viewer, or
the picture (GIF) viewer.

SunSolve 7-15

Troubleshooting Using SearchTool

Suppose you are having printer problems under the Solaris 2.5 operating
environment. You can use the SearchTool window to search for
documents that describe matching symptoms.

Setting Up the Search


In the SearchTool window shown on page 7-13, the search is defined
by selecting or setting the following:

● Collections to search – The following are selected:

● Symptoms and Resolutions

● Patch Descriptions

● Search for – Both of the following keywords are entered and
linked by the chosen logical connector AND:

● printer

● 2.5

Note – Searches are not case sensitive.

Keyword Logical Connectors


The specified keywords are logically connected by one of the
words AND, OR, or NOT.

● AND – A document matches only if it contains all of the keywords
joined by AND.

● OR – A document matches if it contains either of the keywords
joined by OR.

● NOT – A document matches only if it does not contain the keyword
following NOT.
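
The behavior of the three connectors can be illustrated with ordinary
text files. The following is a minimal sketch and is not how SearchTool
is implemented: the function names are hypothetical, keywords are
matched as fixed strings, and matching ignores case, as SearchTool
searches do.

```shell
#!/bin/sh
# Illustrative only: these helpers are hypothetical and are not SearchTool.
# has_kw FILE KEYWORD succeeds if FILE contains KEYWORD (case-insensitive,
# matched as a fixed string rather than a pattern).
has_kw()    { grep -qiF -e "$2" "$1"; }

match_and() { has_kw "$1" "$2" && has_kw "$1" "$3"; }    # A AND B
match_or()  { has_kw "$1" "$2" || has_kw "$1" "$3"; }    # A OR B
match_not() { has_kw "$1" "$2" && ! has_kw "$1" "$3"; }  # A NOT B
```

For example, match_and doc.txt printer 2.5 succeeds only when the
(hypothetical) file doc.txt contains both keywords.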

7-16 Sun Systems Fault Analysis Workshop



Starting the Search


When you have set up the parameters for your search, click on the
Search button. The scrolling list at the bottom of the SearchTool
window contains a list of the documents found in your search. Each
document contains one or more occurrences of the search query. The
documents are listed in descending order of occurrences.
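
Ordering by number of occurrences can be illustrated with plain text
files. The following is a minimal sketch and is not the SearchTool
ranking algorithm: the function names count_hits and rank_docs are
hypothetical, a single keyword is counted as a fixed string, and case
is ignored.

```shell
#!/bin/sh
# Illustrative only: hypothetical helpers, not part of SunSolve.
# count_hits KEYWORD FILE prints how many times KEYWORD occurs in FILE.
count_hits() {
    [ -n "$1" ] || { echo 0; return 0; }
    awk -v kw="$1" '
        BEGIN { kw = tolower(kw) }
        {
            line = tolower($0)
            # Step through the line, counting each occurrence.
            while ((i = index(line, kw)) > 0) {
                n++
                line = substr(line, i + length(kw))
            }
        }
        END { print n + 0 }' "$2"
}

# rank_docs KEYWORD FILE... lists "count file" for each file that
# contains the keyword, highest count first.
rank_docs() {
    kw=$1
    shift
    for f in "$@"; do
        printf '%s %s\n' "$(count_hits "$kw" "$f")" "$f"
    done | sort -rn | awk '$1 > 0'
}
```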

SunSolve 7-17

Datasets and Collections to Search

The following SunSolve collections are available to search:

● Early Notifier

● Symptoms and Resolutions

● Bug Reports

● Patch Descriptions

● Sun Technical Bulletin

● Solaris Q & A

● Info Docs

7-18 Sun Systems Fault Analysis Workshop



Viewing Documents Found

View the first document found in your search by double-clicking on
the first (101941) document.

SunSolve 7-19

Patches

Displaying the Current Patch Report


To display the current patch report for a given operating system,
perform the following steps.

1. From the SearchTool window, select only the Info Docs collection
to search.

2. Type patches in the Search for field.

3. Click on the Search button.

7-20 Sun Systems Fault Analysis Workshop




4. Choose Solaris 2.5 Patch Report Update from the scrolling list of
found documents.

5. From the Display menu, choose In new viewer. The 2.5 patch
report is displayed.

Below is a report from February 16, 1996.

SunSolve 7-21

For lab setup, insert or mount the patches CD-ROM. The File Manager
window displays the following:

7-22 Sun Systems Fault Analysis Workshop



Displaying Installed Patches


To display the installed patches on your system, perform the following
steps.

1. From the File Manager window shown on the previous page, click
on the patchinstall icon.

Note – If you are not running File Manager or OpenWindows, you can
start the patch install script by changing to the directory where the
patch CD-ROM is mounted and typing ./patchinstall as superuser.

During the installation, default answers are provided inside
square brackets ([]).

Press the Return key to select the default provided.

Press Control-c at any time to stop the installation.

2. Type Y for the answer to the following question.


Continue with patch installation? [Y] Y

3. Type /tmp for the answer to the following question.


Where should I store temporary files? [/tmp] /tmp

4. Type Y for the answer to the question below.


Would you like to save the original versions of the
software? [n] Y
Patches already installed:
101753-01 101829-01 101878-01 101879-01
101880-03 101902-01 101905-01 101907-02
101920-01 101921-04

Note – If you answer N (no) to the question to save the original
versions of the software, it can be difficult to safely back out of a patch.

SunSolve 7-23

Installing Recommended or Suggested Patches


You are next prompted for the patch to install from the CD-ROM. To
install all of the suggested patches, perform the following steps.

To start the installation, press Return when prompted for a patchid.

1. Type suggested to answer the following question.


Patch to install (patchid, suggested, ?): suggested
Patch installation setup:
Temporary directory: /tmp
Save old versions of files: TRUE
Patches to install: suggested

2. Type Y to answer the following question.


Is this correct? [y] Y
Installing suggested patches for (Your machine/OS
release level)
list of patches coming.....

You will see each installpatch script run. You might also see
messages such as Patch already installed: continue? |Y|.

After installing these (or any other patches), reboot the system unless
specifically given other instructions from the install script.

7-24 Sun Systems Fault Analysis Workshop



Installing a Specific Patch


You can also install a specific patch such as the following.
Patch-ID# 102979-01
Keywords: memory leak be
Synopsis: SunOS 5.5: memory leakage in be driver
Date: Jan/08/96
Solaris Release: 2.5

To install the above patch, type the patch ID number (102979 here),
instead of typing suggested when prompted for the patch to install.
Patch to install (patchid, suggested, ?): 102979

Note – Do not enter the patch level number (-01).

Answer the remaining prompts as before. After installation, reboot the
system.
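
Because the prompt expects the patch number without its revision, a
full patch ID copied from a patch report has to be shortened first. The
following is a minimal sketch; the function name patch_base is
hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: strip the revision suffix from a full patch ID,
# for example 102979-01 becomes 102979. An ID without a suffix is
# printed unchanged.
patch_base() {
    echo "${1%-*}"
}
```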

SunSolve 7-25

Removing a Specific Patch


To remove a specific patch, perform the following steps:

1. Display a list of installed patches on your system to find the exact
name and revision level of the patch.

Follow the steps listed on page 7-23 or type the following
command.

# showrev -p

2. Find the installed location of the patch (patches are usually
installed in the /var/sadm/patch directory).

# find / -name 102044-01 -print
(output omitted)

3. Change directory to the location of the patch.

# cd /var/sadm/patch/102044-01

4. Run the script to remove the patch.

# ./backoutpatch .

5. Reboot the system.

# reboot
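
Step 1 can be narrowed to a single patch by filtering the showrev
output. The following is a minimal sketch: the function name
installed_rev is hypothetical, and it assumes that each installed patch
is reported on a line of the form "Patch: 102044-01 ...".

```shell
#!/bin/sh
# Hypothetical helper: print the installed revision(s) of a patch.
# Reads "showrev -p" style output on standard input.
# $1 = base patch number without the revision, for example 102044
installed_rev() {
    awk -v base="$1" '
        $1 == "Patch:" {
            split($2, id, "-")          # id[1] = number, id[2] = revision
            if (id[1] == base) print id[2]
        }'
}

# Usage (commented out so the sketch has no side effects):
# showrev -p | installed_rev 102044
```

If nothing is printed, the patch is not listed as installed and there is
nothing to back out.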

7-26 Sun Systems Fault Analysis Workshop



SunSolve Labs

Note – The question text below matches page headers in this module.

1. Describe how the SunSolve system helps resolve system faults.

2. Differentiate between the SunSolve CD-ROM and SunSolve Online
databases.

3. Describe how to apply for a SunSolve Online account.

4. Install the SunSolve software and patches on a server and share
them correctly to the network.

5. Start, configure, and display SunSolve.

6. Given a set of symptoms, discover a likely cause using SunSolve.

7. Display the installed patches on your system.

SunSolve 7-27

8. Install all the recommended or suggested patches.

9. Install a specific patch.

10. Remove a specific patch.

11. Display the current patch report for a given operating system.

7-28 Sun Systems Fault Analysis Workshop



Optional – Basic Search Techniques

This section illustrates the method for conducting some basic searches
of the SunSolve information. It shows how to construct and refine a
search, and displays the results of a sample search.

SearchTool is used to search the SunSolve data collections to locate the
documents you need. You can quickly search the extensive SunSolve
collections for documents that meet your needs by entering keywords
into the query line of the SearchTool window, selecting the collections
where you think you might find the data, and performing the search.

SunSolve 7-29

▼ Conducting a Basic Search

Start the SearchTool by opening the SunSolve icon. The SearchTool
window will open.

[Figure: SearchTool window, with callouts: choose the collections you
want to search; enter keywords (and optional operators); select the
area of the document(s) you want to search; click on the Search
button; the documents matching your search (results) are listed.]

1. Select collections to search. Click on the checkbox(es) of the
document collection(s) you want to search.

You can search any combination of collections, or you can search
them all.

7-30 Sun Systems Fault Analysis Workshop



2. Enter the keyword(s) on the Search for line.

What you enter here is the keyword that SearchTool will look for
in the collections. You can also use the optional operators to
further define your search.

3. Choose the area of the collection you would like to search.

The most commonly used area is entire doc, which looks in all
parts of all of the documents of the collections you have selected.
Each collection allows you to define your search by the areas
available in that collection. In some cases, you may know the
document ID number, and might want to search in the document
ID area of All Collections.

4. Click on the Search button to conduct the search.

SearchTool finds all of the documents (up to the maximum
number you specify) that match your search. The title, ID number,
and collection name of each document found are displayed in
the scroll list at the bottom of the window.

5. View (and print) the desired document(s). Double-click on the
document name to view it.

SunSolve 7-31

Here is an example of a basic search. In the search illustrated, the
SunSolve Dataset has been selected, and within that dataset, the Sun
Technical Bulletins collection is selected. The keyword graphics has
been entered for the entire document. When SearchTool returns the
titles of the documents that matched the entered keyword, the title of
the first document is selected. Choose Display from the Display menu
to view the document.

7-32 Sun Systems Fault Analysis Workshop



If the prompt for document type property is appropriately set, the
document types window is displayed. Select postscript, and click on
the Display button.

SunSolve 7-33

The document is opened and displayed in MultiView.

7-34 Sun Systems Fault Analysis Workshop



Using MultiView

Once you have used SearchTool to locate the documents you want,
you can use MultiView to display or print the document or save the
document to a file. MultiView is the display tool for SearchTool. It is
capable of displaying the full range of document formats available in
the SunSolve collections.

SunSolve 7-35

Document Formats

The following are the types of documents available in the SunSolve
collections:

● Picture - Photos and other high-resolution graphics.

● PostScript - Pictures and documents containing graphics saved in
high-resolution format. They produce high-quality printouts on a
laser printer.

● ASCII - Text-only formatted documents. They can be printed on
any printer.

● Interleaf - Documents available in a format compatible with the
Interleaf desktop publishing software. You must have the Interleaf
software to work with these documents.

● FrameMaker - Documents compatible with the FrameMaker
desktop publishing software. You must have the FrameMaker
software to work with these documents.

7-36 Sun Systems Fault Analysis Workshop



Setting MultiView Properties

The Properties button on the SearchTool main window enables you to
set operating properties for both searching and MultiView. You can
set the type of viewer that displays the different document formats as
well as the printer that prints documents.

▼ To Set MultiView Properties

1. Click on the Properties button at the top of the SearchTool
window.

The Properties window opens, displaying the current search
properties.

2. From the Category menu, choose Viewer.

The MultiView Properties window is displayed.

SunSolve 7-37

Note – You should set the viewer types to Default unless you are
familiar with other tools that you would like to specify as Custom.
Default viewers have been selected to work with the document
collections.

Text Viewer

Default displays ASCII text files in the system text viewer. The Custom
selection displays ASCII files in a TextEdit window, or another text
window specified.

PostScript Viewer

Default displays PostScript files in the system PostScript viewer. The
Custom selection displays PostScript files in a PageView viewer or
another specified viewer.

If you are running on an Xterminal, you should set this to Custom and
enter the name of the PostScript viewer. For example, to use ghostview,
replace the default pageview with ghostview.

Picture Viewer

Default displays picture files in the system graphics viewer. The
Custom selection displays picture files using the Snapshot viewer.

7-38 Sun Systems Fault Analysis Workshop



Displaying a Document in MultiView

The scroll list at the bottom of the SearchTool window lists documents
that match your search. You can use MultiView to display, print, email,
or save these documents to a file.

▼ To Display a Document With MultiView

1. Click on the title of the document in the scroll list at the bottom of
the SearchTool window.

2. From the Display menu, choose Display to view the document
with MultiView.

You can also double-click on the title to open MultiView.

Choose In new viewer to show the document in another viewer
when a document is already displayed. You can display more than
one document at a time.

SunSolve 7-39

3. If appropriate, select one of the file formats shown in MultiView’s
document types window.

If a document exists in a nontext format, you can view a
description of the document before you display it (if you have the
prompt for document type option selected). The document types
window opens, enabling you to select the available options.

The default selection is description. Click on the Display button to
view a text description of the document on the screen.

Click on any other button that is shown next to description for
other file formats (such as postscript) to view the document.

4. Click on one of the display option buttons at the bottom of the
document types window.

Display shows the document on the screen, using the viewer
specified in the Viewer Properties window.

Save opens the MultiView Save window, enabling you to specify a
file in which to save the contents of the document.

Cancel (or just unpinning the document types window) cancels
the operation.

7-40 Sun Systems Fault Analysis Workshop



MultiView Features

If you close MultiView, the viewer remains as an icon.

The File Menu


The File menu in the upper-left corner of the MultiView window has a
pulldown menu that enables you to direct documents to other
locations, as described below.

Print Option

The Print option sends the document you are viewing to a printer. You
can specify which pages to print: All (the entire document), This page
(only the page you are presently viewing), or a Range of pages,
delimited by the From and To fields in the window. You can also
specify the name of the printer in the printer field. Click on the Print
button when all your choices are completed.

SunSolve 7-41

Save Option

The Save option enables you to save the current document in a file.
Specify the location of the file in the window, and type the name of the
file in the Name field. Click on the Save button when all your choices
are completed.

Email Option

The Email option enables you to send the document in an email
message. A mail composition window is displayed, with the
document as preloaded text. Enter the address of your email
destination in the To field, the subject of the message in the Subject
field, and any additional address in the Cc field. You can also edit the
text of the message if you choose. Click on the Deliver button to send
the message, or click on the Cancel button to cancel the operation.

Properties Option

The Properties option enables you to set various properties of the
viewer, including the DPI (dots-per-inch) density of the display. The
default DPI is 85, as shown in the illustration below; you may find that
for certain documents, a setting of 72 is preferable. Click on the Apply
button when your choices are completed.

Note – The standard resolution for on-screen viewing is 72 dpi; 300 and
400 dpi are used for higher-resolution printing.

Kernel Core Dump Analysis 8

Objectives
Upon completion of this module, you will be able to:

● Describe how a process creates a system dump, crash, or “hang.”

● Differentiate between system faults caused by system and user
processes.

● Describe the purpose of the core file in analyzing a failing process
and file.

● Use the adb and crash commands to manipulate core files and to
locate a failing process or file.

● Use the adb and crash commands to isolate the failing processor,
instruction, thread, process, and file on three core dumps and on
one system hang.

References
The SPARC Architecture Manual, SPARC International

The Magic Garden Explained, Goodheart and Cox

Panic! UNIX System Crash Dump Analysis, C. Drake and K. Brown

Using adb and adb Macros (Reference material)

Introduction

The UNIX operating system uses assertion checks throughout the kernel
code. Assertion checks are placed at critical points within the software.
When a call is made to the ASSERT() routine, a check is made. If the
condition is not true and the kernel module is compiled with the
DEBUG flag, the system panics. Also, within the code are data
integrity checks. If a data check fails, it calls upon the cmn_err()
routine.
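
The difference between a debug-only assertion and an unconditional data check can be modeled in a few lines. The following is a rough Python model, not kernel code; DEBUG, KernelPanic, kassert, and check_data are hypothetical names used only for illustration.

```python
# Toy model of the two kinds of kernel checks described above.
# All names here are hypothetical; this is not how the kernel is written.

DEBUG = True  # stands in for compiling a kernel module with the DEBUG flag


class KernelPanic(Exception):
    """Stands in for a system panic."""


def kassert(condition, text):
    # Like ASSERT(): the check only has an effect when DEBUG is set.
    if DEBUG and not condition:
        raise KernelPanic("assertion failed: " + text)


def check_data(value, fatal):
    # Like a data integrity check that reports through cmn_err():
    # a fatal report panics, a nonfatal report only produces a message.
    if value < 0:
        if fatal:
            raise KernelPanic("bad value %d" % value)
        return "WARNING: bad value %d" % value
    return None
```

In this model, a failed kassert() is silent when DEBUG is false, which mirrors why assertions protect only kernels built with the DEBUG flag.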

There are over 400,000 lines of C and assembler code with over:

● 17,000 assertions

● 600 nonfatal calls to cmn_err()

● 250 fatal calls to cmn_err()

If the system code or an electronic interruption misses all of these
checks, the system or process can “hang.” The worst outcome is that the
user or system continues processing corrupted data.

When the kernel panics, it writes the interesting portions of memory to
the dump device (which is usually the swap device). To save a core
dump, there must be enough room in the swap area to contain it. To be
safe, the primary swap area should be at least the size of main
memory (all the information is in main memory, though not all of it is
dumped).

When the system reboots, this core dump must be saved into files that
can then be passed to adb for analysis. savecore(1M) performs this
function. Normally, the system does not examine the swap area for core
dumps when it boots; savecore must be enabled in
/etc/init.d/sysetup.

Header Files

The following header files contain the C structures analyzed in this
module:

● /usr/include/sys/proc.h

● /usr/include/sys/thread.h

● /usr/include/sys/klwp.h

● /usr/include/sys/user.h

● /usr/include/sys/cred.h

● /usr/include/vm/as.h

● /usr/include/vm/seg.h

Debuggers

adb
adb is an interactive, general-purpose debugger. It can be used to
examine files, and it provides a controlled environment for the
execution of programs. adb reads commands from the standard input
and displays responses on the standard output. It does not supply a
prompt.

crash
The crash command is used to examine the system memory image of
a running or a crashed system by formatting and printing control
structures, tables, and other information. Command-line arguments to
crash are dump file, name list, and output file.

kadb
kadb is an interactive debugger with a user interface similar to that of
adb(1), the assembly language debugger. kadb must be loaded prior
to the standalone program it is to debug. It runs in the same address
space as the standalone program, thus sharing many resources with
that program. The debugger is cognizant of and able to control
multiple processors if they are present in a system.

Unlike adb, kadb runs in the same supervisor virtual address space as
the program being debugged, although it maintains a separate context.
The debugger runs as a coprocess that cannot be killed (there is no
`:k' command).

SAVECORE Setup

When savecore(1M) runs, it makes a copy of the kernel symbol table
(/dev/ksyms) of the kernel that was running, called unix.n, and
dumps physical memory to a file called vmcore.n in the specified
directory (normally /var/crash/machine_name). There must be
enough space in /var/crash to contain the core dump or it will be
truncated. The file appears larger than it actually is because it contains
holes, so avoid copying it. adb(1) or crash(1M) can then be used on
the core dump and the saved kernel.
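
To see why a file with holes appears larger than it is, you can compare its apparent size with the space actually allocated to it. The following Python sketch assumes a POSIX system where os.stat() reports st_blocks in 512-byte units; file_sizes and looks_sparse are hypothetical helper names, not part of any Sun tool.

```python
import os


def file_sizes(path):
    """Return (apparent_size, allocated_bytes) for a file.

    Assumes st_blocks counts 512-byte units, as on POSIX systems.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def looks_sparse(path):
    # A file with holes has fewer bytes allocated than its apparent size.
    apparent, allocated = file_sizes(path)
    return allocated < apparent
```

A copy made with a tool that does not preserve holes writes the full apparent size, which is why copying a vmcore.n file can consume much more disk space than the original.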

In the /etc/init.d/sysetup file, change the lines that read:

##
## Default is to not do a savecore
##
#if [ ! -d /var/crash/`uname -n` ]
#then mkdir -p /var/crash/`uname -n`
#fi
# echo 'checking for crash dump...\c '
#savecore /var/crash/`uname -n`
# echo ''

To:

##
## Default is to not do a savecore
##
if [ ! -d /var/crash/`uname -n` ]
then mkdir -p /var/crash/`uname -n`
fi
echo 'checking for crash dump...\c '
savecore /var/crash/`uname -n`
echo ''

Note – A minimum of 32 Mbytes of swap space is required to save
dumps.

Invoking adb/kadb/crash

adb
# cd crash_directory
# adb -k unix.n vmcore.n

Note – n is a value starting at 0. Ensure that vmcore and unix both
have the same n value within the command line.
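
Because the unix.n and vmcore.n files must be used as a matched pair, it can help to list which values of n have both files present before starting adb. The following is a minimal Python sketch; paired_dumps is a hypothetical helper name, and the crash directory path is whatever directory savecore was given.

```python
import os
import re


def paired_dumps(crash_dir):
    """Return the sorted values of n for which both unix.n and vmcore.n exist."""
    found = {"unix": set(), "vmcore": set()}
    for name in os.listdir(crash_dir):
        match = re.fullmatch(r"(unix|vmcore)\.(\d+)", name)
        if match:
            found[match.group(1)].add(int(match.group(2)))
    # Only values present for both file types form a usable pair.
    return sorted(found["unix"] & found["vmcore"])
```

Other files in the directory, such as bounds, are ignored by the pattern.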

To start adb on a live system, run the following command as root:

# adb -kw /dev/ksyms /dev/mem


physmem 272a

/dev/ksyms is a special driver that provides an image of the kernel’s
symbol table. This can be used to examine the debugging information
the driver has left in memory. When adb(1) responds with physmem
nnn, it is ready for a command. If you want to run adb with a prompt,
you can use the -P option as follows:

# adb -kw -P "adb: " /dev/ksyms /dev/mem


physmem 272a
adb:

crash
# crash vmcore.n unix.n
dumpfile = vmcore.0, namelist = unix.0, outfile = stdout
>

Note – n is a value starting at 0. Ensure that vmcore and unix both
have the same n value within the command line.

kadb
ok boot disk kadb

adb Commands

The general form of an adb(1) command is:

[ address ] [ ,count ] command [ ; ]

If address is omitted, the current location is used. (The dot [.] also
stands for the current location.) The address can be a kernel symbol. If
the count is omitted, it defaults to 1.
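
The general form can be illustrated by splitting a command line into its three parts. The following Python sketch is a simplification, not adb's real parser: parse_adb_command is a hypothetical name, the count is assumed to be written in hexadecimal, and address expressions are kept as plain text rather than evaluated.

```python
import re

# address (optional), ",count" (optional), then a verb and its modifiers.
_CMD = re.compile(
    r"^(?P<address>[^,/?=$<>]*)"
    r"(?:,(?P<count>[0-9a-fA-F]+))?"
    r"(?P<command>[/?=$<>].*)$"
)


def parse_adb_command(line):
    """Split an adb command into (address, count, command)."""
    match = _CMD.match(line.strip())
    if match is None:
        raise ValueError("not recognized as an adb command: %r" % line)
    address = match.group("address") or "."  # omitted address means dot
    # Assumption: the count is given in hex; it defaults to 1 when omitted.
    count = int(match.group("count"), 16) if match.group("count") else 1
    return address, count, match.group("command")
```

For example, parse_adb_command("f06ad304/i") yields the address f06ad304, a count of 1, and the command /i.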

Commands to adb consist of a verb followed by a modifier or list of
modifiers. Verbs can be:

?        Used to examine code or variables in the object file
         (executable).

/        Used to examine data from the core file.

=        Prints values in different formats.

$        For miscellaneous commands, including macro invocations.

>        Assigns a value to a variable or register.

<        Reads a value from a variable or register.

Return   Repeats the previous command with a count of 1.
         Increments ‘.’.

adb Macros and Commands

With ?, /, and = symbols, output format specifiers can be used.
Lowercase letters normally print two bytes; uppercase letters print
four bytes:

o or O   Displays in octal (short, long).

q or Q   Displays in signed octal (short, long).

d or D   Displays in signed decimal.

x or X   Displays in hex.

u or U   Displays unsigned decimal.

f or F   Displays as floating point (long or double).

b        Prints as an octal byte.

s        Prints as a null-terminated string.

c        Displays as a single character.

C        Displays a single character using escape conventions
         for nonprinting characters.

i        Displays as a disassembled instruction (mnemonic
         code).

Examples
v+0
v: 100 examine a symbolic location
v+0/D examine a symbolic location - display content decimal
v:
v: 100
v+0/X examine a symbolic location - display content hex
v:
v: 64 e
v+0=X Determine VA of symbolic location v
f017255c
f017255c/X examine content of a VA
64
fc63ecbc/i examine a VA for an instruction (disassemble)
backseat_write: sethi %hi(0xfffffc00), %g1
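
The examples above show the same four bytes printed as 100 with D and as 64 with X. The following Python sketch reproduces that relationship; show_word is a hypothetical helper, and the big-endian byte order is assumed because SPARC stores words most significant byte first.

```python
import struct


def show_word(raw):
    """Render four bytes the way the X, D, and U formats would."""
    (unsigned,) = struct.unpack(">I", raw)  # big-endian unsigned word
    (signed,) = struct.unpack(">i", raw)    # same bytes read as signed
    return {"X": "%x" % unsigned, "D": "%d" % signed, "U": "%d" % unsigned}
```

The formats differ only in interpretation: the bytes ff ff ff 98 print as ffffff98 with X but as -104 with D.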

Display and Control Commands


The following commands are used to display and control the status of
adb(1):

$b       Displays all breakpoints (kadb, adb on user programs).

$d       Changes default radix to value of dot.

$q       Quits adb.

adb Macros
$M       Displays built-in macros (kadb).

cpu0$<cpu (or cpus$<cpu for MP systems)
         Displays the address of threads running on each CPU.

$<msgbuf Displays last several console messages that lead up to
         and include the crash.

$c       Displays the stack trace.

$C       Shows the call trace and arguments at the time of the
         crash as well as the saved frame pointer and the
         saved pc for each stack frame; useful with crash
         dumps. It is also useful in kadb(1M) when a
         breakpoint is reached, but is usually not useful if
         kadb(1M) is entered at a random time.

$r       Displays machine registers; most likely these registers
         are not the ones in use at the time of the panic and so
         may provide very minimal help in debugging the
         system.

adb Macros

as inode proc2u snode


bootobj iocblk procthreads stack
buf iovec procthreads.list stackregs
bufctl itimerval ptbltohme stacktrace
bufctl_audit kmem_cache ptbltopte stacktrace.nxt
cachefsfsc kosyminfo ptetoptbl stat
cachefsmeta ksiginfo putbuf stdata
callout lwp putbuf.wrap strtab
calltrace mblk qinit svcfh
calltrace.nxt mblk.nxt qproc.info sysinfo
cnode memlist qthread.info tcpcb
cpu memlist.list queue tcpip
cpun memlist.nxt regs thread
cpus memseg rlimit thread.trace
cpus.nxt mntinfo rnode threadlist
cred modctl rpctimer threadlist.nxt
ctx modctl_list rwindow tmount
dblk modinfo rwlock tmpnode
devinfo modlinkage seg traceall.nxt
dino module segdev tsdpent
direct modules seglist tsproc
disp modules.nxt seglist.nxt tune
dispq msgbuf segmap u
dispq.nxt msgbuf.wrap segvn u.sizeof
dispqtrace mutex sema ucalltrace
dispqtrace.list netbuf session ucalltrace.nxt
dispqtrace.nxt page setproc ufchunk
dumphdr page2hme setproc.done ufchunk.nxt
exdata page2hme.nxt setproc.nop uio
file pathname setproc.nxt ustack
filsys pcb sigaltstack utsname
hat pid slab v
hme pid.print sleepq v_call
hme.sizeof pid2proc sleepq.nxt v_proc
hmelist pid2proc.chain slpqtrace vattr
hmelist.nxt pollhead slpqtrace.list vfs
hmetoptbl prgregset slpqtrace.nxt vnode
ifnet proc smap

Kernel Core Dump Analysis

During the development of the RAM disk driver, the system crashes
with a data fault when running newfs. The savecore command has
been enabled in the sysetup shell script. This enables copies of the
current kernel and core file to be saved when the system reboots.

The adb/crash utility is used to determine:

● What instruction failed.

● What thread was running when the system panicked.

● What process called or used the instruction.

● What parameters were passed to a failing process.

To become proficient in the subject, attend the Internals course, some C
language courses, and higher-level programming courses.

Also, SunSolve provides good examples of knowledgeable users’
analyses of system core dumps, which can be a helpful learning tool.

Kernel Dump Analysis – adb – SC2000 Example

Using adb to Analyze a Kernel Core Dump


Invoke adb with the following command: adb -k unix.0 vmcore.0.
adb returns the physmem value, indicating the amount of free memory
after the kernel is loaded. adb does not display a prompt. One of the
first things to do is display the message that appeared on the system
console at the time of the panic. To do this, invoke the msgbuf macro
as follows:
# adb -k unix.0 vmcore.0
physmem fd6a
$<msgbuf
msgbuf:
msgbuf: magic size bufx bufr
8724786 1fe8 685 79c

There are times when the msgbuf variable used by the msgbuf macro
may not be loaded in the dynamic kernel symbol table, in which case
you would use the strings command on the vmcore.n file.
# strings vmcore.0
...
ASC = 0x4 (LUN not ready), ASCQ = 0x2, FRU = 0x0
BAD TRAP: cpu_id=2 type=9 <Data fault> addr=30 rw=1 rp=e0922ac4
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load>
level=3
MMU sfsr=0x326<FAV>
BAD TRAP occurred in module "ramd" due to an illegal access to a user
address.
mkfs: Data fault
kernel read fault at addr=0x30, pte=0x0
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load>
level=3
MMU sfsr=0x326<FAV>
...

The output contains a great deal of information, including the panic
message that the $<msgbuf command would return. The rest of the panic
message is shown below.
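
The strings approach can be narrowed to just the lines of interest. The following Python sketch does roughly what piping strings output through a filter would do; find_strings is a hypothetical helper, and it reads the whole file into memory, which may not be practical for a large vmcore.n file.

```python
import re


def find_strings(path, keyword, min_len=4):
    """Return printable ASCII runs in a binary file that contain keyword."""
    with open(path, "rb") as dump:
        data = dump.read()
    # A run of at least min_len printable characters, like strings(1) reports.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    wanted = keyword.encode("ascii")
    return [
        match.group().decode("ascii")
        for match in pattern.finditer(data)
        if wanted in match.group()
    ]
```

Searching for a keyword such as BAD TRAP or panic picks the panic message out of the surrounding boot and device messages.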

The message buffer has been edited, but in the workshop, you will see
the full message buffer. The bold area contains the important
information about the crash. The BAD TRAP message tells you the type
of fault detected (<Data fault>), which CPU detected the fault
(cpu_id=2), the register pointer (rp), and the fault type (ft). The
fault type indicates an <Invalid address error>. The panic message
also includes the CPU ID and the thread (sequence of instructions)
executing at the time of the crash.

The thread pointer 0xf06c16c0 (located in the panic message) points
you to the thread data structure, which in turn leads to the proc data
structure that tells you what process was executing.

Notice also that the pc (0xf06ad304), located at ram_write+0x2c, points
to the instruction that was executing at the time of the crash.

The message buffer alone gives you most of the information you need
about the system crash.

The rest of the crash dump analysis uses adb macros and commands to
navigate through a crash dump to get data that may not be available
through the message buffer (or to use when the message buffer is not
available for whatever reason).
BAD TRAP: cpu_id=2 type=9 <Data fault> addr=30 rw=1 rp=e0922ac4
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load> level=3
MMU sfsr=0x326<FAV>
BAD TRAP occurred in module "ramd" due to an illegal access to a user address.
mkfs: Data fault
kernel read fault at addr=0x30, pte=0x0
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load> level=3
MMU sfsr=0x326<FAV>
ram_write+0x2c, pid=363, pc=0xf06ad304, sp=0xe0922b10, psr=0x400000c4, context=39
g1-g7: ffffff98, 0, e00afac4, 40, f0bb0bd8, 1, f06c16c0
Begin traceback... sp = e0922b10
write+0x190 @ 0xe00afc54, fp=0xe0922b78
args=d80000 e0922bd8 f03b1c18 d8 f0287d48 f06ad2d8

syscall_trap+0x104 @ 0xe0060f14, fp=0xe0922c08
args=5 2a33c 200 f0a5822c 2 f06d32a8
(unknown)+0x12778 @ 0x12778, fp=0xdffffad0
args=1ff 200 2a33c df751c14 200 df751c14
End traceback...
panic[cpu2]/thread=0xf06c16c0: Data fault
syncing file systems... done
3613 static and sysmap kernel pages
176 dynamic kernel data pages
440 kernel-pageable pages
0 segkmap kernel pages
0 segvn kernel pages
209 current user process pages
4438 total pages (4438 chunks)

dumping to vp f070926c, offset
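
The fields of the BAD TRAP line can be pulled apart mechanically, which is handy when you compare several dumps. The following is a minimal Python sketch; parse_bad_trap is a hypothetical helper, and it assumes the key=value layout seen in the message above.

```python
import re


def parse_bad_trap(line):
    """Return the key=value fields of a BAD TRAP line as a dictionary."""
    fields = dict(re.findall(r"(\w+)=(\w+)", line))
    # The fault description is the text between angle brackets.
    described = re.search(r"<([^>]+)>", line)
    fields["fault"] = described.group(1) if described else None
    return fields
```

Applied to the first line of the message above, this returns cpu_id 2, type 9, addr 30, rw 1, rp e0922ac4, and the fault description Data fault. The values are left as text because the message prints them in hexadecimal without a prefix.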

The message buffer has been edited further. adb again displays no
prompt. You will navigate through the dump using adb macros and
commands. The journey begins with the $c macro.

The $c macro displays the stack. Note that the cmn_err() routine is
called: the fault was determined to be a nonrecoverable error, which
ended in a panic. In Solaris 2.5, the stack trace clearly indicates
the reason for the fault through the presence of ram_write(), the
driver routine that caused the system to go down.
$c
complete_panic(0xe024c800,0x1,0xe0241800,0xf05b2ab8,0x5,0xe024c800) + d0
do_panic(?) + 20
vcmn_err(0xe02496b0,0xe092297c,0xe092297c,0x18,0x18,0x3)
cmn_err(0x3,0xe02496b0,0xe0251fa0,0x0,0x12778,0xdffffad0) + 1c
die(0x9,0xe0922ac4,0x30,0x326,0x1,0xe02496b0) + 120
trap(0x0,0xe0922ac4,0x30,0x326,0x1,0x0) + 498
fault(?) + 7c
Syssize(via getminor)(0x0,0x3ffff,0x20,0x7fffffff,0xf0a5829c,0x315c1813)
ram_write(0xd80000,0xe0922bd8,0xf03b1c18,0xd8,0xf0287d48,0xf06ad2d8) + 1c
write(0x5) + 190

The second parameter to trap(), 0xe0922ac4, is a pointer to the
registers associated with the current process at the time of the panic.
The regs macro displays the register data structure. Focus on the pc
field, which contains the address of the instruction that caused the
system to panic.
0xe0922ac4$<regs
0xe0922ac4: psr pc npc
400000c4 f06ad304 f06ad308
0xe0922ad0: y g1 g2 g3
0 ffffff98 0 e00afac4
0xe0922ae0: g4 g5 g6 g7
40 f0bb0bd8 1 f06c16c0
0xe0922af0: o0 o1 o2 o3
0 0 40000 0
0xe0922b00: o4 o5 o6 o7
315c1811 0 e0922b10 0
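
The psr value in the register dump can be broken into its fields. The following Python sketch assumes the SPARC Version 8 processor state register layout as described in The SPARC Architecture Manual (impl, ver, icc, PIL, S, PS, ET, CWP); decode_psr is a hypothetical helper, and only the fields listed are extracted.

```python
def decode_psr(psr):
    """Split a 32-bit SPARC V8 PSR value into selected fields."""
    return {
        "impl": (psr >> 28) & 0xF,  # implementation number
        "ver": (psr >> 24) & 0xF,   # version number
        "icc": (psr >> 20) & 0xF,   # integer condition codes
        "pil": (psr >> 8) & 0xF,    # processor interrupt level
        "s": (psr >> 7) & 1,        # 1 = supervisor mode
        "ps": (psr >> 6) & 1,       # previous supervisor bit
        "et": (psr >> 5) & 1,       # 1 = traps enabled
        "cwp": psr & 0x1F,          # current window pointer
    }
```

For the psr value 400000c4 shown above, the s bit is 1, which agrees with the supv data load text in the panic message.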

Using the value in the pc field, you can determine the instruction that
was executing at the time of the panic with the adb i format. The
result indicates that a load instruction was executing at the address
given by the symbol ram_write+0x2c. With the /i command, you have
determined the assembly instruction that caused the system to go down.

The load instruction reads from the memory address given by the
contents of register %l1 plus 0x30 and tries to load the value at that
address into register %l2. If the address in register %l1 is a bad
address (for example, a user address), this error causes the system to
panic. When a thread is executing device driver code, it is supposed to
be executing in kernel address space, not in user address space, hence
the panic message that says, “BAD TRAP occurred in module ramd due to
an illegal access to a user address.” Note once again that the pc is
also displayed in the message buffer.
f06ad304/i
ram_write+0x2c: ld [%l1 + 0x30], %l2
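
Two small pieces of arithmetic tie this instruction back to the panic message. The following Python sketch uses only values that appear in the dump; symbol_base and base_register_value are hypothetical helper names, and the conclusion about %l1 is an inference from the numbers, not something the dump states directly.

```python
def symbol_base(pc, offset):
    """Return the start address of a routine given pc and symbol+offset."""
    return pc - offset


def base_register_value(fault_addr, displacement):
    """Return the base register value implied by a faulting load address."""
    return fault_addr - displacement


# pc=0xf06ad304 at ram_write+0x2c puts ram_write at 0xf06ad2d8.
ram_write_start = symbol_base(0xF06AD304, 0x2C)

# The fault address was 0x30 and the load used [%l1 + 0x30].
l1_value = base_register_value(0x30, 0x30)
```

The computed start address, f06ad2d8, matches the last argument in the write+0x190 traceback line, and the implied %l1 value of 0 suggests that the driver dereferenced a null pointer.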

You can use the cpu macro to navigate the CPU data structures to
locate the thread (which you already know from the message buffer).
The macro first opens the CPU data structure for the first CPU (id=0).
Since you know from the message buffer that this was not the CPU, use
the content of the next field, which holds the address of the next CPU
data structure. Note also that thread and idle thread (idle_t) are
equal, which indicates that this CPU was idle.
cpu0$<cpu
cpu0:
cpu0: id seqid flags
0 0 1d
cpu0+0xc: thread idle_t pause
e06c1ec0 e06c1ec0 e08a0ec0
cpu0+0x18: lwp callo fpowner
0 0 f06a30c0
cpu0+0x24: next prev next on prev on
f05852d0 f05b2ab8 f05852d0 f05b2d58
cpu0+0x34: lock npri queue limit actmap
0 110 f036e568 f036ea90 f028dbc0
cpu0+0x44: maxrunpri max unb pri nrunnable
-1 -1 0
cpu0+0x50: runrun kprnrn dispthread thread lock
0 0 e06c1ec0 0
cpu0+0x5c: intr_stack on_intr intr_thread intr_actv
e06dffa0 1 e06dcec0 0
cpu0+0x6c: base_spl
0

Follow the boldface type to locate the CPU at the time of the fault.
Note thread and idle thread.
f05852d0$<cpu
0xf05852d0: id seqid flags
1 1 1d
0xf05852dc: thread idle_t pause
e06feec0 e06feec0 e0721ec0
0xf05852e8: lwp callo fpowner
0 0 f06a30c0
0xf05852f4: next prev next on prev on
f0585030 e0251120 f0585030 e0251120
0xf0585304: lock npri queue limit actmap
0 110 f057b580 f057baa8 f028d660
0xf0585314: maxrunpri max unb pri nrunnable
-1 -1 0
0xf0585320: runrun kprnrn dispthread thread lock
0 0 e06feec0 0
0xf058532c: intr_stack on_intr intr_thread intr_actv
e071ffa0 1 e071cec0 0
0xf058533c: base_spl
0

Finally, you have arrived at the correct CPU data structure. Another
key point has been reached. You can display the content of the thread
data structure. You know the thread from the message buffer. Note that
thread and idle_t differ here, so this CPU was not idle.
f0585030$<cpu
0xf0585030: id seqid flags
2 2 1d
0xf058503c: thread idle_t pause
f06c16c0 e0723ec0 e0746ec0
0xf0585048: lwp callo fpowner
0 0 f0a8e810
0xf0585054: next prev next on prev on
f05b2d58 f05852d0 f05b2ab8 f05852d0
0xf0585064: lock npri queue limit actmap
0 110 f057b040 f057b568 f028db10
0xf0585074: maxrunpri max unb pri nrunnable
-1 -1 0
0xf0585080: runrun kprnrn dispthread thread lock
0 0 e0723ec0 0
0xf058508c: intr_stack on_intr intr_thread intr_actv
e0744fa0 1 e0741ec0 0
0xf058509c: base_spl
0

The thread that caused the panic can also be obtained from the
message buffer or from the panic_thread variable that the system
maintains. This variable holds the address of the thread that caused
the system to panic regardless of how many CPUs there are in the
system.
panic_thread/X
panic_thread:
panic_thread: f06c16c0
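
The walk through the CPU structures is a linked-list traversal: start at cpu0, follow next until the id matches, then compare thread with idle_t. The following Python sketch models it with the values from the three cpu displays above; the CPUS table, find_cpu, and is_idle are hypothetical names, and cpu0 is keyed by its symbol because its numeric address is not shown.

```python
# Values copied from the cpu macro output for the three CPUs.
CPUS = {
    "cpu0": {"id": 0, "thread": 0xE06C1EC0, "idle_t": 0xE06C1EC0,
             "next": 0xF05852D0},
    0xF05852D0: {"id": 1, "thread": 0xE06FEEC0, "idle_t": 0xE06FEEC0,
                 "next": 0xF0585030},
    0xF0585030: {"id": 2, "thread": 0xF06C16C0, "idle_t": 0xE0723EC0,
                 "next": 0xF05B2D58},
}


def find_cpu(cpus, start, cpu_id):
    """Follow next pointers from start until a CPU with cpu_id is found."""
    seen = set()
    addr = start
    while addr in cpus and addr not in seen:
        seen.add(addr)  # guard against looping around a circular list
        cpu = cpus[addr]
        if cpu["id"] == cpu_id:
            return addr, cpu
        addr = cpu["next"]
    return None


def is_idle(cpu):
    # A CPU running its own idle thread was doing no other work.
    return cpu["thread"] == cpu["idle_t"]
```

With this table, find_cpu(CPUS, "cpu0", 2) arrives at address f0585030, whose thread field matches the panic_thread value.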

Use the thread macro. Search the structure for the procp (process
pointer) field.
f06c16c0$<thread
adb
0xf06c16c0:
link stk
0 e0922c08
0xf06c16cc:
bound affcnt bind_cpu
0 0 -1
0xf06c16d4:
flag procflag schedflag state
0 0 11 4
0xf06c16e0: pri epri pc sp
0 0 e004c13c e0922480
0xf06c16ec: wchan0 wchan cid clfuncs
0 0 2 f0371960
0xf06c1700:
cldata ctx lofault onfault
f0594700 0 0 0
0xf06c1710:
nofault swap lock cpu
0 e0921000 ff f05b2ab8
0xf06c1720:
intr delay_cv tid alarmid
0 0 1 0
realitimer
0xf06c1734: interval.sec interval.usec value.sec value.usec
0 0 0 0
0xf06c1744:
itimerid sigqueue sig
0 0 0 0
0xf06c1754:
hold forw back
0 0 f06c16c0 f06c16c0
0xf06c1764:
lwp procp next prev
f0bb0bd8 f0bb8cd0 f0aa0920 f06c1ea0
0xf06c16da:
preempt trace whystop whatstop
1 0 0 0
0xf06c17a4:
kpri_req sysnum astflag pollstate cred
11 4 0 0 f03b1c18
0xf06c178c:
lbolt pctcpu trapret pre_sys post_sys sig_check
1b520 ae 0 0 0 0
0xf06c1794:
lockp oldspl disp queue disp time
f05b2b10 de1 f05b2aec 111899
0xf06c17b8:
mstate waitrq rprof

Finally, use the proc2u macro to expand the proc structure. Locate the
psargs field, which shows the command and its arguments executing at
the time of the panic. You have accomplished the last key point
(locating the process).
f0bb8cd0$<proc2u
0xf0bb8e88:
execid execsz tsize
32581 12e 0
0xf0bb8e94:
dsize start ticks cv
0 315c1813 1b513 0
0xf0bb8ea4:
exdata
0xf0bb8ea4:
vp tsize dsize bsize
0 0 0 0
0xf0bb8eb4:
lsize nshlibs mach mag toffset
0 0 0 10b 0
0xf0bb8ec4:
doffset loffset txtorg datorg
0 0 0 0
0xf0bb8ed4:
entloc
df7d43a8
0xf0bb8ed8: aux vector
7d8 dfffffe1 3 10034
4 20 5 5
9 11b54 7 df7d0000
8 0 6 1000
7d0 0 7d1 0
7d2 1 7d3 1
7d9 7 0 0
0 0 0 0
0 0 0 0
0xf0bb8f68: psargs
mkfs /devices/pseudo/ramd@0:0,raw 512 8 1 8192 1024 16 10 60 2048 t 0 -1 8 -1^@^@^@
0xf0bb8fb8: comm
mkfs^@^@^@^@^@^@^@^@^@^@^@^@^@
0xf0bb8fd8:

cdir rdir ttyvp cmask
f0bd8688 0 0 12
0xf0bb8fe8:
mem systrap ttyp ttyd
450 0 0 0
0xf0bb8ff8: entrymask
0 0 0 0
0 0 0
0xf0bb9014: exitmask
0 0 0 0
0 0 0
0xf0bb9030:
signodefer sigonstack
0 0 0 0
0xf0bb9040:
sigresethand sigrestart
0 0 0 0

sigmask
0xf0bb9050: 0 0 0 0

Note – The remaining message buffer was deleted.
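
The psargs and comm fields are fixed-size character arrays, so adb shows the unused tail as ^@ (null) characters. The following Python sketch shows how the readable command line can be recovered from such a buffer; clean_psargs and command_name are hypothetical helpers, and the sample bytes are a shortened stand-in for the real field.

```python
def clean_psargs(raw):
    """Return the text of a null-padded psargs buffer up to the first null."""
    return raw.split(b"\0", 1)[0].decode("ascii")


def command_name(psargs):
    """Return the first word of the command line (the program name)."""
    return psargs.split()[0]


# Shortened sample; the real field holds the full mkfs argument list.
sample = b"mkfs /devices/pseudo/ramd@0:0,raw 512\0\0\0"
```

Here command_name(clean_psargs(sample)) gives mkfs, the same name that appears in the comm field.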

Kernel Dump Analysis – adb – SPARC 5 Example

Using adb to Analyze a Kernel Core Dump


Invoke adb with the following command (1): adb -k unix.0
vmcore.0. adb returns the physmem value (2), indicating the amount of
free memory after the kernel is loaded. adb does not display a prompt.
The /s (string) command (3) is used to locate the start of the message
buffer (msgbuf). Several attempts may be needed to locate the start;
msgbuf+14 normally works.

# cd crash_directory
# ls
bounds unix.0 vmcore.0
# adb -k unix.0 vmcore.0 1
physmem 1e6e 2
msgbuf+14/s 3
symbol not found
$q

Note – If the message symbol not found is returned, exit adb and
use the strings command.

# strings vmcore.0 | more


Generic
Data fault
../devices/pseudo/cn@0:systty
../devices/pseudo/ptc@0:ptyp5

vac: enabled in write through mode


cpu0: FMI,MB86904 (mid 0 impl 0x0 ver 0x4 clock 85 MHz)
mem = 32768K (0x2000000)
avail mem = 27553792
Ethernet address = 8:0:20:22:8f:9d
root nexus = SUNW,SPARCstation-5
iommu0 at root: obio 0x10000000
sbus0 at iommu0: obio 0x10001000
espdma0 at sbus0: SBus slot 5 0x8400000
esp0 at espdma0: SBus slot 5 0x8800000 sparc ipl 4
sd3 at esp0: target 3 lun 0
sd3 is
saving trees
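
The physmem value that adb prints can be compared with the mem line in the boot messages above. The following Python sketch assumes that physmem is a hexadecimal count of pages and that the page size is 4096 bytes; both are assumptions to verify on the system in question (for example, with the pagesize command), and physmem_bytes is a hypothetical helper.

```python
PAGE_SIZE = 4096  # assumption: 4-Kbyte pages on the machine that crashed


def physmem_bytes(physmem_hex, page_size=PAGE_SIZE):
    """Convert adb's physmem value (hex page count) to bytes."""
    return int(physmem_hex, 16) * page_size
```

Under these assumptions, physmem 1e6e corresponds to 31907840 bytes, somewhat less than the 32768K (33554432 bytes) reported on the mem line.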

The message buffer has been edited, but in the workshop, you will see
the full message buffer. The bold area contains the important
information about the crash. The BAD TRAP message tells you the type
of fault detected (<Data fault>) and the name of the module that
caused the system to panic (ramd).

Notice also that the pc (0xfc479dbc) points to the instruction that was
executing at the time of the crash.

The rest of the crash dump analysis uses adb macros and commands to
navigate through a crash dump. This is necessary if the message buffer
does not help or is not available.
BAD TRAP: type=9 rp=f05246f4 addr=30 mmu_fsr=326 rw=1
BAD TRAP: occurred in module "ramd" due to an illegal access to a user address
mkfs: Data fault
kernel read fault at addr=0x30, pme=0x0
MMU sfsr=326: Invalid Address on supv data fetch at level 3
pid=465, pc=0xfc479dbc, sp=0xf0524740, psr=0x40000c2, context=0
g1-g7: ffffff98, 0, ffffff00, 0, f05249e0, 1, fc2dec00
Begin traceback... sp = f0524740
Called from f00df9b4, fp=f05247a8, args=1a40000 f0524808 fc38fc80 f0154664 0 fc479d90
Called from f0070258, fp=f05248b8, args=200 f0524920 2 0 4 fc2d5b04
Called from f0041aa0, fp=f0524938, args=f0160cf8 f0524eb4 0 f0524e90 fffffffc ffffffff
Called from 15cc0, fp=effffae8, args=4 32400 200 0 0 3fe00
End traceback...
panic: Data fault

# cd crash_directory
# adb -k unix.0 vmcore.0
physmem 1e6e

The $c macro (1) displays the stack. Note the value 9 in the initial trap
handler (2) as it is also displayed in the message buffer. Also note, the
cmn_err() routine is called. This fault was determined to be a
nonrecoverable error ending up in a panic.
$c
complete_panic(0xf026b428,0xfbfab98c,0xf0048ec8,0x6a,0xfbfab818,0xf0279800) + 108
do_panic(?) + 1c
vcmn_err(0xf0266600,0xfbfab98c,0xfbfab98c,0x7,0xffeec000,0x3)
cmn_err(0x3,0xf0266600,0x1,0x21,0x21,0xf025c000) + 1c
die(0x9,0xfbfabac4,0x30,0x326,0x1,0xf0266600) + bc
trap(0xf028a1d8,0xfbfabac4,0x0,0x326,0x1,0x0) + 4f8
fault(?) + 84
Syssize(via getminor)(0x0,0x3ffff,0x20,0x7fffffff,0xf5c4b4bc,0x31585486)
ram_write(0xdc0000,0xfbfabbd8,0xf5a8ed38,0xdc,0xf5970d48,0xf5c54d90) + 1c
write(0x5) + 190
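
Each line of the $c output has the form routine(arguments) + offset, so the frames can be split into parts once any wrapped lines are joined. The following Python sketch is illustrative only: parse_frame is a hypothetical helper, and the CE_LEVELS table reflects the severity constants as commonly defined in sys/cmn_err.h, which you should confirm against the header on your release.

```python
import re

# Assumed severity levels for the first argument of cmn_err().
CE_LEVELS = {0: "CE_CONT", 1: "CE_NOTE", 2: "CE_WARN", 3: "CE_PANIC"}

_FRAME = re.compile(
    r"^(?P<name>\w+)\((?P<args>[^)]*)\)"
    r"(?:\s*\+\s*(?P<offset>[0-9a-f]+))?$"
)


def parse_frame(line):
    """Split a $c frame line into (routine, argument values, offset)."""
    match = _FRAME.match(line.strip())
    if match is None:
        raise ValueError("not a simple frame line: %r" % line)
    # Arguments shown as "?" are unknown and are skipped.
    args = [int(a, 16) for a in match.group("args").split(",")
            if a.startswith("0x")]
    offset = int(match.group("offset"), 16) if match.group("offset") else None
    return match.group("name"), args, offset
```

For the cmn_err frame above, the first argument is 3, which under the assumed table is CE_PANIC; for the die frame, the first argument is 9, the same trap type that the BAD TRAP message reports.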

The second parameter to trap(), 0xfbfabac4, is a pointer to the
registers associated with the current thread at the time of the panic.
The regs macro displays the register data structure. Focus on the pc
field, the program counter at the time of the fault, which contains the
address of the instruction that caused the system to go down. This
macro also displays the contents of other machine registers.
fbfabac4$<regs
0xfc020ac4: psr pc npc
110000c4 f5c98dbc f5c98dc0
0xfc020ad0: y g1 g2 g3
50000000 ffffff98 0 f00bb080
0xfc020ae0: g4 g5 g6 g7
40 f5ca0648 1 f5caec60
0xfc020af0: o0 o1 o2 o3
0 3ffff 20 7fffffff
0xfc020b00: o4 o5 o6 o7
f614b084 31505b99 fc020b10 f5c98dac

The next step in the core dump analysis is to get the program counter
and, with the disassemble command (i), display the assembly
instruction that caused the system to panic. This will usually display
the name of the device driver routine as part of the label. With this
information, you can pinpoint precisely what device driver caused the
system to go down.

f5c98dbc/i
ram_write+0x2c: ld [%l1 + 0x30], %l2

You may also want to find out what program or command was
running when the system went down. This is additional information
that will point out the bad device driver as well.

Using adb, you do this in two steps: first, display the thread that was
running when the system went down; the thread structure has a
pointer to the process that holds the name of the command running.
Second, display the user structure of this process that has the
command name.

When the system is brought down because of a panic, the system
variable panic_thread holds the address of the thread structure of the
thread that caused the system to go down (regardless of the number of
CPUs in the system).

panic_thread/X
panic_thread:
panic_thread: f5c66480
f5c66480$<thread
adb
0xf5c66480:
link stk
0 fbfabc08
0xf5c6648c:
bound affcnt bind_cpu
f026b494 0 -1
0xf5c66494:
flag procflag schedflag state
0 0 11 4
0xf5c664a0: pri epri pc sp
14 0 f0048ec8 fbfab818
0xf5c664ac: wchan0 wchan cid clfuncs
0 0 2 f59a0378
0xf5c664c0:
cldata ctx lofault onfault
f5cb6460 0 0 0
0xf5c664d0:
nofault swap lock cpu
0 fbfaa000 ff f026b494
0xf5c664e0:
intr delay_cv tid alarmid
0 0 1 0
realitimer
0xf5c664f4: interval.sec interval.usec value.sec
value.usec
0 0 0 0
0xf5c66504:
itimerid sigqueue sig
0 0 0 0
0xf5c66514:
hold forw back
0 0 f5c66480 f5c66480
0xf5c66524:
lwp procp next prev
f5c11828 f5c0fcc8 f5c665a0 f5c66d80
0xf5c6649a:
preempt trace whystop whatstop
1 0 0 0
0xf5c66564:
kpri_req sysnum astflag pollstate cred
0 4 1 0 f5a8ed38
0xf5c6654c:
lbolt pctcpu trapret pre_sys post_sys sig_check
405b8 fd 0 0 0 0
0xf5c66554:
lockp oldspl disp queue disp time
f026b4ec be1 f026b4c8 263603
0xf5c66578:
mstate waitrq rprof
9 0 0 0
0xf5c66580:
prioinv ts sobj_ops
0 0 0

From the previous thread structure, you can get the address of the
process’s proc structure from the field that is labeled procp. When
you use this in combination with the macro proc2u, you can display
the user structure of the process that has the command or program
name. Take special note also of the arguments that were passed to the
command.
f5c0fcc8$<proc2u
0xf5c0fe80:
execid execsz tsize
32581 12e 0
0xf5c0fe8c:
dsize start ticks cv
0 31585486 405a4 0
0xf5c0fe9c:
exdata
0xf5c0fe9c:
vp tsize dsize bsize
0 0 0 0
0xf5c0feac:
lsize nshlibs mach mag toffset
0 0 0 10b 0
0xf5c0febc:
doffset loffset txtorg datorg
0 0 0 0
0xf5c0fecc:
entloc
ef7d43a8
0xf5c0fed0: aux vector
7d8 efffffe6 3 10034
4 20 5 5
9 11b54 7 ef7d0000
8 0 6 1000
7d0 0 7d1 0
7d2 1 7d3 1
7d9 3 0 0
0 0 0 0
0 0 0 0
0xf5c0ff60: psargs
mkfs /devices/pseudo/ramd@0:0,raw 512 8 1 8192 1024 16 10 60 2048 t 0 -1 8 -1^@^@^@
0xf5c0ffb0: comm
mkfs^@^@^@^@^@^@^@^@^@^@^@^@^@
0xf5c0ffd0:
cdir rdir ttyvp cmask
f5ca82e8 0 0 12
0xf5c0ffe0:
mem systrap ttyp ttyd
4f5 0 0 0
0xf5c0fff0: entrymask
0 0 0 0
0 0 0
0xf5c1000c: exitmask
0 0 0 0
0 0 0
0xf5c10028:
signodefer sigonstack
0 0 0 0
0xf5c10038:
sigresethand sigrestart
0 0 0 0

sigmask
0xf5c10048: 0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0

signal
0xf5c101a8: 0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 1 1 0
0 1 1 0
0 0 0 0
0 0 0 0
0 0 0 0

ru
0xf5c10258:
nshmseg acflag
0 0
0xf5c1025c: rlimit
7fffffff 7fffffff 7fffffff 7fffffff
7ffff000 7ffff000 800000 7ffff000
7fffffff 7fffffff 40 400
7fffffff 7fffffff
flock
0xf5c10294: owner
0
0xf5c10294: lock
0
0xf5c10294: waiters wlock type
0 0 0
0xf5c1029c:
nofiles
24
flist
f5c0c910
0xf5c1029c: ofile pofile refcnt
0xf5c0c910: f5c69758 0 0
0xf5c0c918: f5c69758 0 0
0xf5c0c920: f5c69758 0 0
0xf5c0c928: f5c69218 0 0
0xf5c0c930: f5c695d8 0 0
0xf5c0c938: f5c69188 0 0
0xf5c0c940: f5c697e8 0 0
0xf5c0c948: 0 0 0
0xf5c0c950: 0 1 0
0xf5c0c958: f5c696f8 0 0
0xf5c0c960: f5c69698 0 0
0xf5c0c968: 0 0 0
0xf5c0c970: 0 0 0
0xf5c0c978: 0 0 0
0xf5c0c980: 0 0 0
0xf5c0c988: 0 0 0
0xf5c0c990: 0 0 0
0xf5c0c998: 0 0 0
0xf5c0c9a0: 0 0 0
0xf5c0c9a8: 0 1 0
0xf5c0c9b0: 0 0 0
0xf5c0c9b8: 0 0 0
0xf5c0c9c0: 0 0 0
0xf5c0c9c8: 0 0 0
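
The psargs and comm fields in the proc2u output are fixed-size character arrays, and adb prints their unused trailing bytes as ^@. The following is a minimal sketch, in Python rather than adb, of how you might tidy such a string once you have copied it out of the session, so that the command and its failing argument are easy to read. The helper name clean_adb_string is hypothetical, and the comm padding is shortened for the example.

```python
def clean_adb_string(raw: str) -> str:
    """Drop the NUL padding that adb prints as '^@' after a C string."""
    return raw.split("^@", 1)[0].rstrip()


# psargs as printed by proc2u (wrapped output lines joined into one string).
psargs_raw = ("mkfs /devices/pseudo/ramd@0:0,raw 512 8 1 8192 1024 "
              "16 10 60 2048 t 0 -1 8 -1^@^@^@")
# comm as printed by proc2u (padding shortened for this example).
comm_raw = "mkfs^@^@^@"

argv = clean_adb_string(psargs_raw).split()
command = clean_adb_string(comm_raw)
device = argv[1]  # first argument after the command name

print("command:", command)
print("device argument:", device)
```

Here the device argument names the raw RAM disk node, which is consistent with the ram_write label found from the program counter.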

crash Help Menu

# cd /var/crash/proto2 - change directory to the directory containing dumps
# crash vmcore.0 unix.0
dumpfile = vmcore.0, namelist = unix.0, outfile = stdout
> ? - Enter ‘?’ to display the menu of crash commands
as hment prnode strstat
b (buffer) kfp proc t (trace)
base kmastat pte trace
buf (bufhdr) l (lck) pty thread
buffer lck q (quit) ts
bufhdr linkblk qrun tsdptbl
c (callout) lwp queue tsproc
callout m (vfs) quit tty
class major rd (od) u (user)
cpu map redirect user
ctx mblock rtdptbl ui (uinode)
dblock mode rtproc uinode
defproc mount (vfs) rwlock v (var)
defthread mutex s (stack) var
dispq mutextable search vfs
ds nfsnode sema vfssw
f (file) nm size vnode
file od sment vtop
findaddr p (proc) smgrp ?
findslot page snode !cmd
fs (vfssw) pcb stack
hat pcfsnode status
help pmgrp stream
> help defthread - get help for a specific command
defthread [-p] [-r] [-w filename] [-c address]
set default thread
alias:
acceptable aliases are uniquely identifiable initial
substrings
>

Commonly Used crash Commands

user (alias: u)    Prints the user structure for the designated process.

status             Prints system statistics.

proc (alias: p)    Prints the process table.

cpu                Displays the CPU structure pointed to by start_addr.

stack (alias: s)   Dumps the stack. The -u option prints the user stack.
                   The -k option prints the kernel stack. If no arguments
                   are entered, the kernel stack for the current thread is
                   printed. Otherwise, the kernel stack for the currently
                   running thread is printed.

For more information about crash commands, refer to the man pages.

Kernel Dump Analysis – crash – SC2000

# cd crash_directory
# crash vmcore.0 unix.0
dumpfile = vmcore.0, namelist = unix.0, outfile = stdout
> stat
system name: SunOS
release: 5.5
node name: mustang
version: Generic
machine name: sun4d
time of crash: Fri Mar 29 09:04:19 1996
age of system: 18 min.
panicstr: Data fault
panic registers:
pc: e004c13c sp: e0922808

> u
PER PROCESS USER AREA FOR PROCESS 34
PROCESS MISC:
command: mkfs, psargs: mkfs /devices/pseudo/ramd@0:0,raw 512 8 1 8192 1024
16 10 60 2048 t 0 -1 8 -1
start: Fri Mar 29 09:04:19 1996
mem: 450, type: exec
vnode of current directory: f0bd8688
OPEN FILES, POFILE FLAGS, AND THREAD REFCNT:
[0]: F 0xf06d3db8, 0, 0 [1]: F 0xf06d3db8, 0, 0
[2]: F 0xf06d3db8, 0, 0 [3]: F 0xf06d3938, 0, 0
[4]: F 0xf06d3ae8, 0, 0 [5]: F 0xf06d32a8, 0, 0
[6]: F 0xf06d38d8, 0, 0 [9]: F 0xf06d3878, 0, 0
[10]: F 0xf06d3848, 0, 0
cmask: 0022
RESOURCE LIMITS:
cpu time: unlimited/unlimited
file size: unlimited/unlimited
swap size: 2147479552/2147479552
stack size: 8388608/2147479552
coredump size: unlimited/unlimited
file descriptors: 64/1024
address space: unlimited/unlimited
SIGNAL DISPOSITION:
1: default 2: default 3: default 4: default
5: default 6: default 7: default 8: default
9: default 10: default 11: default 12: default
13: default 14: default 15: default 16: default
17: default 18: default 19: default 20: default
21: default 22: default 23: default 24: default
25: default 26: ignore 27: ignore 28: default
29: default 30: ignore 31: ignore 32: default
33: default 34: default 35: default 36: default
37: default 38: default 39: default 40: default
41: default 42: default 43: default 44: default
> proc
PROC TABLE SIZE = 4058
SLOT ST PID PPID PGID SID UID PRI NAME FLAGS
0 t 0 0 0 0 0 96 sched load sys lock
1 s 1 0 0 0 0 58 init load
2 s 2 0 0 0 0 98 pageout load sys lock nowait
3 s 3 0 0 0 0 60 fsflush load sys lock nowait
4 s 253 1 253 253 0 58 sac load jctl
5 s 147 1 147 147 0 48 inetd load
6 s 257 253 253 253 0 58 ttymon load jctl
7 s 130 1 130 130 0 58 rpcbind load
8 s 138 1 138 138 0 12 kerbd load
9 s 191 1 191 191 0 22 nscd load
10 s 150 1 150 150 0 0 statd load
11 s 132 1 132 132 0 12 keyserv load
12 s 122 1 122 122 0 58 in.routed load
13 s 152 1 152 152 0 12 lockd load
14 s 254 1 254 254 0 48 sh load
15 s 171 1 171 171 0 4 automountd load
16 s 175 1 175 175 0 58 syslogd load nowait
17 s 185 1 185 185 0 56 cron load
18 s 209 201 201 201 0 44 lpNet load nowait jctl
19 s 201 1 201 201 0 33 lpsched load nowait
20 s 229 1 229 229 0 58 vold load jctl
21 s 210 1 210 210 0 0 sendmail load jctl
22 s 220 1 220 220 0 58 utmpd load
23 s 269 254 254 254 0 43 openwin load
24 s 273 269 254 254 0 38 xinit load
25 s 274 273 274 254 0 59 Xsun load
26 s 275 273 275 254 0 55 sh load
27 s 280 1 275 254 0 59 fbconsole load
28 s 361 318 318 318 0 44 newfs load jctl
29 s 286 1 275 254 0 59 vkbd load
30 s 291 147 147 147 0 0 rpc.ttdbserver load jctl
31 s 289 1 275 254 0 59 ttsession load jctl
32 s 293 275 275 254 0 59 olwm load
33 s 294 293 275 254 0 10 olwmslave load
34 p 363 362 318 318 0 0 mkfs load
35 s 298 1 298 298 0 59 cmdtool load jctl
36 s 300 298 300 300 0 60 sh load
37 s 302 1 275 254 0 59 filemgr load
>
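
adb prints most user-structure fields in hexadecimal, whereas crash formats the same fields in decimal or octal. The sketch below, in Python, converts a few values from the adb rlimit and cmask output of the SPARC 5 example and shows that they match what crash prints for the SC2000. The two dumps come from different machines, so the agreement only reflects identical default limits. Treating 7fffffff as "unlimited" is an assumption inferred from comparing the two outputs.

```python
RLIM_UNLIMITED = 0x7fffffff  # assumption: crash prints this value as "unlimited"


def show_limit(value: int) -> str:
    """Format one resource limit the way the crash u command displays it."""
    return "unlimited" if value == RLIM_UNLIMITED else str(value)


def show_pair(current: int, maximum: int) -> str:
    return f"{show_limit(current)}/{show_limit(maximum)}"


# (current, maximum) pairs read from the adb rlimit line, in hexadecimal.
stack_size = (0x800000, 0x7ffff000)
file_descriptors = (0x40, 0x400)
cpu_time = (0x7fffffff, 0x7fffffff)

print("stack size:", show_pair(*stack_size))              # 8388608/2147479552
print("file descriptors:", show_pair(*file_descriptors))  # 64/1024
print("cpu time:", show_pair(*cpu_time))                  # unlimited/unlimited

# adb shows cmask as hex 12; crash shows the same mask in octal.
print("cmask:", format(0x12, "04o"))                      # 0022
```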

Kernel Crash Dump Analysis Workshop 1

This exercise duplicates the classroom example. Take your time; the
exercise is self-paced.

A RAM disk device driver has just been installed in your system by
your resident device driver writer, who has asked you to test the
driver.

1. In a Shell Tool, change directories to /devices/pseudo.

# cd /devices/pseudo

2. Type the ls command.

# ls

If the RAM disk has been installed correctly, two entries are in this
directory: ramd@0:0, and ramd@0:0,raw.

Your resident driver writer brought the system down because of
bugs in the driver and has asked you to assist in debugging the
driver.

3. Verify that savecore is enabled on your system. If it is not, make a
copy of the original /etc/init.d/sysetup file, and then use vi to
uncomment savecore in that file; otherwise no core dump will be
saved when the system goes down. Before invoking newfs, run the
sync command to make sure that the file systems are synchronized
with the disk. This procedure minimizes file system errors due to a
panic.

4. Run newfs(1M) on the raw RAM disk.

# sync; sync; newfs /devices/pseudo/ramd@0:0,raw

The device driver writer brings the system down again.

5. Save the core dump and use adb to analyze the problem following
the classroom exercise template.

● Find the failing instruction.

● Find the failing process (or command).

● Find the failing argument to the process or command.

Kernel Crash Dump Analysis Workshop 2A

The bug in Workshop 2B might render your system useless because a
lot of configuration files can be destroyed during a panic.

You can prevent this from happening if you back up the root partition.
Then when /etc files such as name_to_major, path_to_inst,
driver_classes, and driver_aliases become corrupted, you can
boot from a backup root partition that has these files intact.

1. Make sure that your system has been installed with a backup root
partition that has exactly the same size as the root partition. If
your root partition is /dev/dsk/c0t3d0s0 with 20983 Kbytes,
then your backup partition could be /dev/dsk/c0t1d0s0 with
20983 Kbytes.

2. # umount /backup_root (to unmount the backup partition)

3. # dd if=/dev/dsk/c0t3d0s0 of=/dev/dsk/c0t1d0s0

4. # fsck /dev/dsk/c0t1d0s0 (to make sure that the backup succeeded)

5. # mount /dev/dsk/c0t1d0s0 /backup_root

6. # cd /backup_root

7. Modify the vfstab file on the backup partition (etc/vfstab under
/backup_root, not the /etc/vfstab of the running system) to indicate
this partition as the new root and to comment out the entry that
refers to the real root partition. You can then boot your system from
this file system as an alternate root.

# vi etc/vfstab

8. Halt your system and then try to boot from the backup_root file
system.

9. Do the lab exercise described in Workshop 2B, or corrupt your
original /etc/name_to_major or /etc/path_to_inst files
and observe what effects these corrupted files have on the
system.

10. If your system becomes corrupted, boot from the backup partition,
and then copy intact versions of the corrupted files from the backup
to the original root partition.

Kernel Crash Dump Analysis Workshop 2B

In this exercise, you will install on the system a version of ramdisk
where the bug is in an autoconfiguration routine. Make sure that you
have done Workshop 2A before doing this workshop.

Type the following commands to test the RAM disk:

1. Detach the old driver from the kernel.

# rem_drv ramd
# cd /usr/kernel/drv

2. ramd.bad_attach is the newly assembled driver you just
received. Copy it, and call it ramd.

# cp ramd.bad_attach ramd

3. Attach and link the new driver to the kernel. Use the sync
command several times to minimize the file system damage
because of a panic.

# sync; sync; add_drv ramd

The device driver writer brings the system down again.

4. Save the core dump and use adb to analyze the problem using the
classroom exercise as a template.

● Find the failing instruction.

● Find the failing process (or command).

● Find the failing argument to the process or command (failing file).

Note – You may have to boot with the -a option and not put
/usr/kernel in the module path. This bug may not allow you to save
a core dump because the panic occurs in an auto-configuration routine
that gets called during boot time. When the system panics, the system
will try to reboot; and when it reboots, it will encounter the bad
attach routine and the system will go down again. This is when the
-a option to boot becomes very useful.

Kernel Crash Dump Analysis Workshop 3

After you describe what is wrong with the RAM disk driver, your device
driver writer reports having written another ramd and asks you to
test it. Use adb commands to modify a live kernel.

1. Change directories to /usr/kernel/drv.

# cd /usr/kernel/drv

2. Ensure test1 is executable as a script. Invoke test1.

# test1

Note – test1 is a test program to read and write to the backseat
pseudo device using kernel system calls.

3. Invoke adb.

# adb -kw /dev/ksyms /dev/mem


physmem xxx <= (adb prints this line and then waits for input without displaying a prompt.)

4. Type the following command to display a portion of the machine
code that is used within the backseat program.

backseat_write,10/X

5. Type the following command to display the machine code in
assembler syntax. You are going to replace the top instruction. You
have a choice: FFFFFFFF (an illegal instruction format) or
00000000 (a real instruction in the wrong place).

backseat_write,10/i

6. Insert an error instruction of your choice in the live kernel code
used by the backseat program. /W opens the location for writing
FFFFFFFF or 00000000.

backseat_write+20/W FFFFFFFF or 00000000

7. Press Control-d to exit adb.

8. Use the sync command several times, then invoke test1 again.

The system panics.

9. Analyze the core dump so that you can tell the device driver
writer what was wrong.

Kernel Crash Dump Analysis Workshop 4 (Sheet 1 of 8)

kadb Workshop Introduction


This workshop enables you to become familiar with a limited and
carefully selected set of Solaris 2.x (SunOS 5.x) data structures
related to user processes, using the kadb utility.

You will use the ps (report process status) command and the kadb
(kernel debugger) utility. This procedure is time-consuming but
interesting. You will select one of the active processes in your system
like init, a Command Tool, more, or vi. You are going to trace
through the various structures that the operating system allocates to
processes starting with the output of the ps -le command. Then you
will use kadb to go through the structures.

Use the man pages and .h files to gain insight into the Solaris 2.x
(SunOS 5.x) operating system and to increase your fault analysis
skills with advanced concepts.

kadb Description
kadb is an interactive debugger with a user interface similar to that of
adb(1), the assembly language debugger. kadb must be loaded prior
to the standalone program it is to debug. It runs in the same address
space as the standalone program and therefore shares many resources
with that program, but it cannot use the facilities normally available
from the system (such as the mouse and access to file systems) because
the system is suspended while kadb is running. Because the kernel is
not running while kadb is active, any system structure examined
through kadb shows the state it had at the moment the kernel was
suspended. The debugger is aware of, and able to control, multiple
processors if they are present in a system.

Unlike adb, kadb runs in the same supervisor virtual address space as
the program being debugged (although it maintains a separate
context). The debugger runs as a coprocess that cannot be killed (`:k')
or rerun (`:r'). There is no signal control (`:i', `:t', or `$i'),
although the keyboard facilities (Control-c, Control-s, and Control-q)
are simulated.

In the case of the UNIX kernel, the keyboard abort sequence (Stop-a
[L1-a] for console and BREAK for serial line) suspends kernel
operations and breaks into the debugger. The system will also fall into
kadb when it panics, allowing you to do an immediate analysis as to
why the system went down. You would want to use kadb when it is
not possible to save a coredump or if your dump device (swap device)
is too small to save physical memory. kadb gives the prompt kadb[#]
where # is the CPU it is currently executing on.

Note – Running under kadb has proven to be very valuable when very
bad crashes cause the machine to be so ill that it cannot generate a
dump. The analysis is the same as if running adb on a coredump.

Invoking and Exiting kadb


ok boot disk kadb

Wait until the system is fully booted.

Note – Enter kadb with the Stop-a (L1-a) key sequence.

# (Stop-a)

To exit or make transitions between the operating system and kadb:

● Press Stop-a to invoke kadb.

● Type :c to exit kadb and return to the operating system.

● Type $q to quit kadb and go down to the ok prompt.

Note – To display a list of all kadb macros, type $M at the kadb prompt.

Mapping UNIX Data Structures

[Figure: Simple Process. The process structure holds an as pointer to the
address space structure and a tlist pointer to the thread structure; the
thread structure holds an lwp pointer to the lightweight process.]

Note – The process data structure contains part of the information
needed for proper execution of a process. When you logged in,
a process was created for you. The starting address of its process
structure is displayed in the ADDR field of the ps -el output. That
address is your starting point. Using kadb, you will obtain
selected information from the structure. Refer to the proc.h file in
your reference section for more information or look at the header file
/usr/include/sys/proc.h.

1. Boot the system with kadb. Use the ps command to obtain the
starting address of your process.

2. Invoke kadb.

3. Use the process address with the proc macro, for example,
fc363000$<proc. To control the flow of information, use the
Control-q and Control-s key sequences.

4. Record the following information:

Simplified Process Structure

Process structure
Starting address

as

ppid

pidp

cred

tlist

Related Data Structures


Using the content of the process structure you can trace, if needed,
other data structures to acquire further information about the process.
These are examples; you will have to supply your own pointers.
[field] specifies the content of a specified field.

[as]$<as

[pidp]$<pid

[tlist]$<thread

[cred]$<cred

Refer to the related h files in the reference material.
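
Each entry in the list above has the form address$<macro, where the address is the value you recorded from the process structure. The following minimal Python sketch builds those command strings from a table of recorded values. The addresses are hypothetical placeholders, and the helper name kadb_macro is invented for this illustration; only the address$<macro syntax and the macro names come from this workshop.

```python
def kadb_macro(address: int, macro: str) -> str:
    """Return the text typed at the kadb prompt to run a macro on an address."""
    return f"{address:x}$<{macro}"


# Hypothetical values, as if copied from the $<proc display of your process.
recorded = {
    "as": 0xfc2f1a80,
    "pidp": 0xfc2e4f00,
    "tlist": 0xfc363400,
    "cred": 0xfc2d8e00,
}

# Macro that displays the structure each field points to.
macro_for_field = {"as": "as", "pidp": "pid", "tlist": "thread", "cred": "cred"}

for field, address in recorded.items():
    print(f"{field:5s} -> {kadb_macro(address, macro_for_field[field])}")
```

Keeping the recorded values in one table makes it easier to retrace your steps when a pointer leads somewhere unexpected.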

An interesting project is to find out how many segments make up the
address space.

A process’s address space is composed of segments. A segment is a
contiguous portion of virtual address space. The virtual address space
of a process contains different types of segments: text, data, stack, and
other memory-mapped objects such as regular files and device files.
The as structure describes the virtual address space of the process as a
whole; each seg structure contains the public information about one
segment.

The seglast field contains the address of the segment that was last
used. In most cases, when the kernel needs to search for a segment, it
starts with this last-used segment.

Refer to the Segment Mapping figure on the following page.

The as and seg structures are defined in header files as.h and seg.h,
which are in the directory /usr/include/vm. The thread and cred
structures are defined in the header files thread.h and cred.h in the
directory /usr/include/sys.
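
The idea behind the seglast field can be illustrated with a small model. The following Python sketch is not the kernel's implementation; it is a toy address space, with hypothetical segment bases and sizes, that tries the last-used segment before scanning the whole list. The class and field names other than seglast are invented for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Seg:
    name: str
    base: int  # starting virtual address of the segment
    size: int  # length of the segment in bytes

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


class AddressSpace:
    def __init__(self, segs: List[Seg]) -> None:
        self.segs = segs
        self.seglast: Optional[Seg] = None  # hint: segment used most recently
        self.scans = 0                      # number of full list searches

    def find(self, addr: int) -> Optional[Seg]:
        # Try the last-used segment first; nearby accesses tend to hit it.
        if self.seglast is not None and self.seglast.contains(addr):
            return self.seglast
        self.scans += 1
        for seg in self.segs:
            if seg.contains(addr):
                self.seglast = seg
                return seg
        return None


# Hypothetical layout; real values come from the seg structures you display.
space = AddressSpace([
    Seg("text", 0x10000, 0x8000),
    Seg("data", 0x28000, 0x4000),
    Seg("stack", 0xEFFFC000, 0x4000),
])

print(len(space.segs), "segments")
first = space.find(0x10004)
second = space.find(0x10100)  # answered from seglast, no second scan
print(first.name, second.name, "scans:", space.scans)
```

Counting the seg structures reachable from the as structure answers the project question; the hint only changes where a search starts, not which segment is found.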

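Segment Mapping

[Figure: Segment Mapping. The [as] field of the proc structure points to
the as structure, which leads to a list of seg structures. Each seg
records a BASE and a size for one region of the process image: stack,
data, and text.]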

5. Return to the Solaris operating environment.

6. Perform the Kernel Crash Dump Analysis Workshops 1 or 2 again
using kadb.

7. When kadb is used to trap a panic, is the stack different? If it is,
can it be used to your advantage?

Kernel Crash Dump Analysis Workshop 5

1. This is an additional exercise using kadb. Make sure that you have
booted the system using kadb. Log in as root, start OpenWindows,
and go to the backseat directory. Copy the backseat_hang driver
into the /usr/kernel/drv directory, but rename it as backseat. If
you have installed the working version of backseat before, do a
rem_drv of this working backseat driver before installing this
defective one into your system.

2. # cp backseat_hang /usr/kernel/drv/backseat

3. # cp backseat.conf /usr/kernel/drv

4. # rem_drv backseat

5. # add_drv backseat (Install the new buggy driver.)

6. # ./test1 (Run the test program, this program will hang!)

7. # ps -le (to determine the pid of test1)

8. # kill pid_of_test1 or kill -9 pid_of_test1

9. After determining that kill is not capable of eliminating the
test1 process running the backseat driver, press L1-a (Stop-a).

10. Type the following command to display a stack trace of all threads
in the system.

kadb[0]: $<threadlist

11. You should see a backseat driver routine calling physio() and
physio() calling biowait(); the thread is blocked in biowait(),
which is the reason why test1 is hung. Look at the man pages for
physio() and biowait() to determine what could be wrong
with the device driver. Then look at the source file for
backseat_hang, which should be backseat_hang.c, to find out
what the device driver forgot to do.

The lesson that can be learned here is that a device driver can put
a thread to sleep in such a way that the thread cannot be
awakened by a signal.
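
The following user-level Python sketch is only an analogy for a wait that nobody ends; it is not kernel code and does not use the DDI routines. A waiter blocks on a completion flag that no other code ever sets. The names io_done and wait_for_io are invented for the illustration, and the timeout exists only so that the example terminates; the hung thread in this workshop has no such escape.

```python
import threading

# Stands in for the "I/O complete" indication that a waiter sleeps on.
io_done = threading.Event()


def wait_for_io(timeout_s: float) -> bool:
    """Block until the I/O is marked complete; return False on timeout."""
    return io_done.wait(timeout=timeout_s)


# Nothing has marked the I/O complete, so this wait can only time out.
finished = wait_for_io(0.1)
print("I/O completed" if finished else "still waiting: completion was never signalled")
```

Once some other code marks the I/O complete (io_done.set() in this model), the same wait returns immediately.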

12. After this exercise, you can exit kadb and reboot your system
without kadb. You may want to sync your system first. Then press
Stop-a, and issue the $q command and boot from the ok prompt.

Kernel Crash Dump Analysis Workshop 6

Refer to Fault #53 for this workshop.

Kernel Crash Dump Analysis Workshop 7

Refer to Fault #34 for this workshop.

Watchdog Reset Workshop 8 (Sheet 1 of 2) Optional

This workshop takes a cookbook approach to gathering the correct
information about a watchdog reset. You will install the bug that
creates the watchdog reset.

Bug Install
1. Ensure that NVRAM watchdog-reboot? is false.

2. Boot the system to the Solaris operating environment.

3. Log in as root.

4. Invoke adb with the -kw qualifiers on /dev/mem.

5. Type the location sys_trap+4 0.

6. Wait for the watchdog error.

Note – If a watchdog error does not occur, ask the instructor for
assistance.

7. At the OBP ok prompt, perform the following:

● wd-dump (if you are on a Sun-4d workstation)

● .registers

● .locals

● .psr

● ctrace

8. Boot the system to the Solaris operating environment, and save the
output of the following Solaris commands:

● showrev -p

● prtconf -v

● pkginfo

● /usr/ccs/bin/nm /dev/ksyms

9. You should also copy the following files:

● /etc/system

● /var/adm/messages*

Note – You should set up a tip line to the machine that is expected to
get a watchdog reset, as this is the easiest way to save the OBP
command outputs in a file.

Note – Search the SunSolve software for watchdog reset with error
messages similar to yours.

Note – Search patch reports for watchdog resets.

Program Debugging – Optional

The pc corresponds to the ram_write() routine, which is in the RAM
disk driver. The bug is in the RAM disk write routine and occurs
during an ld (load) instruction. This load instruction dereferences the
address %l1+0x30. The value of %l1 is not in the regs structure, and
the $r command cannot be used to examine it because the registers
are reused in the trap routine. So you cannot look at the value of
%l1, although it most likely holds an invalid address.

Instead, you can find out where the failing instruction is with respect
to the entire routine so that the assembly language can be matched to
the C code. To do this, the routine is disassembled up to the problem
instruction, which occurs 0x2c bytes into the routine. Since each
instruction is 4 bytes, 0x2c/4 = 0xb instructions precede it, so 0xc
instructions must be displayed:
ff4eadbc/i (from Determining What Instruction Failed)
ram_write+0x2c: ld [%l1 + 0x30], %l2
ram_write,c/i
ram_write:
ram_write: sethi %hi(0xfffffc00), %g1
add %g1, 0x398, %g1 ! ffffff98
save %sp, %g1, %sp
st %i0, [%fp + 0x44]
st %i1, [%fp + 0x48]
st %i2, [%fp + 0x4c]
ld [%fp + 0x44], %o0
call getminor
nop
st %o0, [%fp - 0x4]
ld [%fp - 0x8], %l1
ld [%l1 + 0x30], %l2
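
The count used in the ram_write,c/i command follows directly from the offset in the label. As a minimal sketch, the arithmetic can be written out in Python; the helper name adb_count is invented for the illustration.

```python
INSTRUCTION_BYTES = 4  # each SPARC instruction is one 32-bit word


def adb_count(offset: int) -> int:
    """Instructions to display so that the one at `offset` is the last shown."""
    if offset % INSTRUCTION_BYTES:
        raise ValueError("offset is not instruction-aligned")
    return offset // INSTRUCTION_BYTES + 1


offset = 0x2c               # from the label ram_write+0x2c
count = adb_count(offset)   # 0xb preceding instructions plus the failing one
print(f"ram_write,{count:x}/i")  # prints ram_write,c/i
```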

The crash occurs a few instructions after a call to getminor(9F).

After examining the ramd.c source file, these lines stand out in
ram_write:
static int
ram_write(dev_t dev, struct uio *uiop, cred_t *credp)
{
        int instance;
        struct ram_state *rs;

        instance = RAM_DEV_TO_INST(dev); /* a macro to getminor(dev) */

        /* Comment this out in order to pass a pointer that has not been
           initialized, so that you can cause a data fault and a core dump.

        rs = ddi_get_soft_state(statep, instance);
        if (rs == NULL) {
                cmn_err(CE_NOTE,
                    "%s: write: could not get state for instance %d.",
                    RAMDISK_NAME, instance);
                return ENXIO;
        }
        */
        if (uiop->uio_offset >= rs->size)
                return EINVAL;

In the above code, since the call to ddi_get_soft_state() was
commented out, the rs pointer is never initialized. This is the problem
that causes the panic.

Note – Most data fault panics are bad pointer references.

ps Command Workshop 9—Optional

Introduction
This workshop enables you to trace and identify the processes and
files needed to start the OpenWindows environment. You will use the
ps (report process status) command and the man pages in the reference
material to accomplish this task.

This procedure is time-consuming, interesting, and valuable in
gathering data for fault analysis.

Sequence of Procedures (Do Not Execute)

Setting Base Level


1. Boot the system.

2. Log in as root.

3. Use the openwin command to enter the OpenWindows environment.

4. Quit all windows except the Console window.

5. Save the current workspace.

6. Exit the OpenWindows environment.

7. Log out.

Acquiring Base-Level Information


1. Log in as root.

2. Type the ps command to determine the current process ID (PID).

3. Type the ps -ef command to obtain base-level processes.

Tracing OpenWindows Processes


1. Start the OpenWindows environment by typing the openwin
command.

2. Type the ps command to determine the PID for the current
window process.

3. Type the ps -ef command to identify the processes required to
run the OpenWindows environment.

4. Identify the files required to run the OpenWindows environment.

Setting Base Level

1. Boot the system.

2. Log in as root.

3. Use the openwin command to enter the OpenWindows
environment if it does not open up by default.

# /usr/openwin/bin/openwin

4. Quit all windows except the Console window.

5. Save the current workspace.

6. Exit the OpenWindows environment.

7. Log out.

Acquiring Base-Level Information

1. Log in as root. Do not start the OpenWindows environment.

2. Type the ps command to determine the current process ID (PID)
created for you during login.

# ps

3. Record the values returned.

PID TTY Command

The next ps command fills the screen with information concerning all
the processes that have been started including your login process.
Some of the processes vary from system to system.

The table (next two pages) provides some of the processes that will be
on all systems with space for other system-dependent processes.
Check the processes you have that are the same as those listed. Use the
ps -ef command to obtain base-level processes.

4. Acquire a list of the processes at this time.

# ps -ef

Base-Level Processes (1 of 2)

PID PPID Command

sched*
init*
pageout
fsflush
sac
rpcbind
sctserve
sendmai
keyserv
inetd
ypbind
in.route
kerbd
automoun
statd
lockd
lpsched
syslogd
cron
vold


Base-Level Processes (2 of 2)

PID PPID Command


Tracing OpenWindows Processes

1. Start the OpenWindows environment by typing the following
command:

# /usr/openwin/bin/openwin

2. Type the ps command and determine the PID for the current
window process. Record the information in the table below:

PID TTY Command

3. Type the ps -ef command and identify the processes required to
run the OpenWindows environment. List only the new processes
created. Record all required information in the table below:

PID PPID Command
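
Listing only the new processes can also be done mechanically. The following is a minimal sketch, not part of the original exercise. It assumes that the ps -ef output was saved to a file before and after starting OpenWindows, and that a POSIX awk is available (on Solaris 2.x this is probably nawk rather than awk, which is an assumption about your system). The helper name new_procs and the file names are made up for illustration.

```sh
#!/bin/sh
# new_procs BASELINE CURRENT   (hypothetical helper, not a Solaris command)
# Both arguments are files holding saved "ps -ef" output.
# Field 2 of each line is the PID. Print every line of CURRENT whose PID
# does not appear in BASELINE, that is, the processes started since the
# baseline was taken.
new_procs() {
    awk '
        NR == FNR { seen[$2] = 1; next }   # first file: remember baseline PIDs
        !($2 in seen)                      # second file: print lines with new PIDs
    ' "$1" "$2"
}

# Example use (file names are assumptions):
#   ps -ef > /tmp/ps.base        before starting OpenWindows
#   ps -ef > /tmp/ps.openwin     from a window once OpenWindows is running
#   new_procs /tmp/ps.base /tmp/ps.openwin
```

The second ps -ef command itself shows up in the result because it runs under a new PID. Ignore that line when you fill in the table.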


Workshop Summary Exercise

1. Record your login shell PID. ___________

2. Record the parent ID (PPID) of your login PID. ___________

3. Record the cmdtool (CONSOLE) PID. ___________

4. Record the parent ID (PPID) for cmdtool (CONSOLE). ___________

5. What are the PID and PPID for the openwin command?

PID ___________ PPID ___________

6. Who is the parent for xinit (X Server Initialization Program)?
___________

7. Who is the parent for Xsun (Solaris X Server)? ___________

8. Who is the parent for olwm (OPEN LOOK Window Manager)?
___________

9. Who is the parent for the parent of olwm? ___________

10. Who is the parent for olwmslave (OPEN LOOK Window Manager Slave)?
___________

11. Who is the parent for cmdtool (Enhanced Terminal Window
Program)? ___________

12. Is cmdtool a parent? If yes, who is it a parent for? ___________

13. Who is the parent for ttsession (ToolTalk Message Server)?
___________

14. Who is the parent of vkbd (Virtual Keyboard and Function
Displayer)? ___________

15. Open another Shell Tool at this time. Trace the family tree.

16. Open the calculator. Trace the family tree.
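
Tracing a family tree means following the PPID column upward one generation at a time. The following is a minimal sketch of that walk, not part of the original exercise. It assumes the ps -ef output was saved to a file and that a POSIX awk is available (probably nawk on Solaris 2.x, which is an assumption). The helper name ancestry is hypothetical.

```sh
#!/bin/sh
# ancestry PID FILE   (hypothetical helper, not a Solaris command)
# FILE holds saved "ps -ef" output: field 2 is the PID, field 3 the PPID.
# Starting at PID, print the ps line for that process, then for its parent,
# and so on until the chain ends.
ancestry() {
    awk -v pid="$1" '
        FNR > 1 { line[$2] = $0; parent[$2] = $3 }   # skip the header line
        END {
            while (pid in line) {
                print line[pid]
                if (parent[pid] == pid) break        # a process listed as its own parent ends the chain
                pid = parent[pid]
            }
        }
    ' "$2"
}

# Example use (file name and PID are assumptions):
#   ps -ef > /tmp/ps.openwin
#   ancestry 330 /tmp/ps.openwin
```

The walk normally ends at init (PID 1) and then sched (PID 0), which ps is expected to list with a PPID of 0. A PID that is not in the file prints nothing.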


Workshop Summary Exercise

Fill in the following chart relating to files required to open windows.
(Many are in the reference handouts.) The ps -elf command can help
you with the parameters.

OpenWindows Files

File Path File Type Parameters

1. Close windows and type the following command:

# truss -f -t exec -o /tmp/trusstrace /usr/openwin/bin/openwin

2. Exit the OpenWindows environment, and then view the trace file.

# view /tmp/trusstrace

Is truss another tool you can use to trace commands? Remember that
the PIDs displayed differ from those in your original chart because
PIDs are not reused. The grep command can be useful at this time.
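
As one way to pick the exec calls out of the trace, here is a minimal sketch that is not part of the original exercise. It assumes each traced line has the shape PID, colon, white space, then an exec call with the program path in double quotes; check this against your own /tmp/trusstrace before relying on it. The helper name exec_paths is hypothetical.

```sh
#!/bin/sh
# exec_paths FILE   (hypothetical helper, not a Solaris command)
# FILE is a trace written by "truss -f -t exec -o FILE command".
# Assumed line shape:  PID:<white space>execve("path", ...)
# Print "PID path" for each exec call found; other lines are skipped.
exec_paths() {
    sed -n 's/^ *\([0-9][0-9]*\):[^"]*exec[a-z]*("\([^"]*\)".*/\1 \2/p' "$1"
}

# Example use:
#   exec_paths /tmp/trusstrace
#   exec_paths /tmp/trusstrace | grep xinit
```

Failed exec attempts, if any appear in the trace, are listed as well, so the same path may show up more than once.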


Skills Checklist

Skill                                   Student Initials   Instructor Initials
Invoke adb to start a kernel core dump analysis.
Invoke crash to start a kernel core dump analysis.
Use the adb string macro to display the message buffer.
Use selected adb macros to determine the process at the time of
the fault.
Use selected crash commands to determine the process at the
time of the fault.
Use the correct commands to properly exit crash or adb.

Fault Tracker Progress Chart A

Team Members

________________________________________________________

________________________________________________________

_________________________________________________________


Fault Tracker Progress Chart

Fault # Hardware/Software Time


Fault Tracker Progress Chart

Fault # Hardware/Software Time

Fault Worksheets - Student Guide B

Requirements
● Sun-4 systems

● Solaris 2.x operating environment

Resources
● AnswerBook

● SunSolve

● Diagnostics

● OpenBoot PROM (OBP) diagnostics

● SunVTS

● Format

System Configurations
● Standalone

● Network

● Client-server

● NIS or NIS+


Fault Worksheet #1 - Blank Monitor

Initial Customer Description


Monitor is blank after power up.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #2 - Device Error During Boot

Initial Customer Description


Error messages occur when booting the Solaris operating system.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #3 - File Errors During Boot

Initial Customer Description


The boot sequence is incomplete due to apparent file-system
corruption after the last crash.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #4 - Incomplete Boot to Solaris Operating System

Initial Customer Description


The default boot sequence appears to start correctly and then reports
an unknown device. When the customer performs a boot -a and
takes all default parameters, the system boots.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Repair Verification

Likely Cause Test Results

Instructor Initials _________________


Fault Worksheet #5 - Login Problem

Initial Customer Description


When the user logs in as root, an error message complains of an
improper shell, and the system immediately logs the user out.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #6 - adb Macro Error

Initial Customer Description


While using adb to modify the max process field in the kernel, the v
macro (v$<v) returns no parameters.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #7 - Feckless

Initial Customer Description


You cannot write to the directory /feck. You need to use a directory
named feck to make a directory called test and a file called my.test.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #8 - Incomplete Boot to Solaris Operating System

Initial Customer Description


The system administrator was tuning the system over the weekend
and left for another tuning class in Dallas. You have been asked to
restore the system. The problem is that the system “hangs” during
the boot sequence.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #9 - Turn the Page

Initial Customer Description


The pg command does not work.
The passwd command does not work.
Only users with no password can log in.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #10 - Login Problem

Initial Customer Description


The root user cannot log in successfully. The login prompt and
password (if required) are accepted, and it appears a login is starting,
but then the system logs out. The system administrator has just come
back from training, worked the weekend, and left on a holiday.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #11 - Network Problem

Initial Customer Description


The user’s routing information is lost, and netstat -r does not
respond correctly. The ping command has also stopped working.
Some of the people in the department attended a network course last
week. They all say they have not touched the system.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #12 - OpenWindows Problem

Initial Customer Description


A user had been having problems with the shell. The user tried to fix
the problem alone, but the problem has deteriorated, and the
OpenWindows environment does not open.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #13 - Shutdown When Opening Windows

Initial Customer Description


When the superuser opens windows, the workstation shuts down. Not
all users experience the error.

The user account is student1, and the password is student1.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #14 - Network Printer Problem

Initial Customer Description


The network printer has stopped printing.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #15 - Incomplete Boot to Solaris Operating System

Initial Customer Description


The system does not boot after an external power failure during a storm.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #16 - Constant Reboot, Halt, or Power Off Problem

Initial Customer Description


The system administrator has just come back from training, worked
the weekend, left on a holiday to the wilds of Canada, and cannot
be reached. The system appears to reboot, halt, or power off.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #17 - The ps Command Returns Nothing

Initial Customer Description


The system administrator has returned from training, worked the
weekend, and is now on vacation. Now, when the ps command is
used, it returns nothing.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #18 - NIS or NIS+ Network Problem

Initial Customer Description


You cannot gain access to the network.

Two system administrators cannot agree on which name service to use,
and they keep switching back and forth between different services.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #19 - Network Problem

Initial Customer Description


The network was tested all weekend. Since the last boot, you lost
network communications on your system.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #20 - OpenWindows Problem

Initial Customer Description


New users are having problems using the mouse in the OpenWindows
environment. The problem appeared after the system administrator
went to a tuning class. The system administrator is now on vacation
somewhere in Alaska and cannot be reached. The mouse is operational
for root but not for new users.

Use the student1 account. The user name is student1, and the
password is student1.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Repair Verification

Likely Cause Test Results

Instructor Initials _________________


Fault Worksheet #21 - Banner Logo Has Been Changed

Initial Customer Description


An inappropriate logo appeared in the banner after an unhappy
employee left the firm. The standard Sun logo is not an acceptable fix.
Using the Icon Editor is not an acceptable fix. A fix is any logo except the
current logo.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #22 - Do Not Tread on Me

Initial Customer Description


Cannot boot to multi-user.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #23 - vi Editor Problem

Initial Customer Description


The vi editor works in a non-window environment but not in the
OpenWindows environment. Some personnel have attended system
administration training, but they all claim no knowledge of the error.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #24 - “Hacker” Intrudes the System

Initial Customer Description


Using the OpenWindows environment causes the system to reboot
after some time has elapsed. The department believes it is the result of
a “hacker” intruding the system. The user was using standard
commands when the reboot occurred.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #25 - No OpenWindows Environment

Initial Customer Description


After a system crash, the OpenWindows environment no longer
functions. In addition, fsck was executed during the boot sequence.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #26 - Login Problem

Initial Customer Description


The system administrator went to a system security training class and
secured the system before going to Alaska. The user cannot log in to
root from an ASCII terminal now.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #27 - “Hangs” on Boot

Initial Customer Description


While on a service call, the field engineer ran a power-on self test
(POST) on an ASCII terminal. During the process, the monitor went
blank. The resourceful field engineer decided to boot the system by
using the ASCII port as the console.

After disconnecting the keyboard, and using the ASCII terminal, the
system “hangs” during boot.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #28 - No Network

Initial Customer Description


The user lost network communications. The system was moved over
the weekend from the Baker Street location and reinstalled on the
current network.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #29 - Where It Is At

Initial Customer Description


Machine powers off, halts or reboots ever since we promoted and
transferred our old System Administrator.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #30 - Seedy ROM

Initial Customer Description


Cannot download anything from the server’s CD-ROM.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #31 - See It Now

Initial Customer Description


Someone is Stop-a’ing my machine and displaying the /etc/shadow
and /etc/passwd files.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #32 - Cannot Log In as Root

Initial Customer Description


The system administrator went to a system security training class. The
system administrator then secured the system before going on
vacation. The user cannot log in to root from any terminal.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #33 - No Network or Interface

Initial Customer Description


“After reboot, I do not get a network or interface unless I bring it up
manually.”

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #34 (Sheet 1 of 4) - Script “Hangs” System

Initial Customer Description


You will place the fault on the system.

● Keyboard input is not accepted.

● The arrow on the screen does not move.

● LEDs (if available) are in motion.

● rlogin from other machines on the network fails (times out).

● ping may work intermittently.

● No error messages are displayed.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #34 (Sheet 2 of 4)

1. Use the vi editor to put the following script in the /usr/bin
directory. Name the file start.

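#!/bin/csh -f
# Fault-injection script for this worksheet: it starves the system of CPU.
clear
rm -f /tmp/guilty_party
# Create a csh script that does nothing but loop forever.
cat > /tmp/guilty_party << Done
#!/bin/csh -f
while (1)
end
Done
chmod 777 /tmp/guilty_party
# Run the loop in the real-time (RT) scheduling class, in the background.
/usr/bin/priocntl -e -c RT /tmp/guilty_party &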

2. After exiting the vi editor, run the following commands:

# chmod 775 start
# start

The system “hangs.”

3. If you are on a multiprocessor, include the following line after the
chmod command:

/usr/sbin/psradm -f a


Fault Worksheet #34 (Sheet 3 of 4)

Diagnostic Steps
Use the following procedure to determine what is causing your system
to “hang.”

1. Attempt a ping from a remote machine.

2. Attempt to rlogin from another machine.

3. Press Stop-a (L1-a) to halt the machine.

Note – You may need to press Stop-a several times before the
keyboard interrupt is handled.

4. Type n to enter the new command mode of the PROM monitor.

5. Type sync to force a core dump.

6. Reboot the system.

7. Type the following command:

cd directory_with_core_dumps

8. Type the following to reflect current dump files:

crash vmcore.n unix.n

n is a value (0, 1, 2, and so on).

9. Type proc.

Are there any processes that look unusual?

10. Type proc -f.

11. For each process entry, examine the utime and stime fields. The
combined total of these fields is the total CPU time used by the
process.

Are there any processes with an abnormally high amount of CPU
time as compared to the other processes? Note this process.
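
With many process entries, adding utime and stime by hand is tedious. The following is a minimal sketch of the arithmetic, not part of the original procedure. It does not parse crash output. Instead it assumes you have copied the values by hand into a text file with one process per line in the made-up layout PID UTIME STIME, using decimal numbers. The helper name rank_cpu and the file name are hypothetical.

```sh
#!/bin/sh
# rank_cpu FILE   (hypothetical helper, not a Solaris command)
# FILE has one process per line in the made-up layout "PID UTIME STIME",
# copied by hand from the proc -f output as decimal numbers.
# Print "TOTAL PID" for each process, largest combined CPU time first.
rank_cpu() {
    awk 'NF >= 3 { print $2 + $3, $1 }' "$1" | sort -rn
}

# Example use (file name is an assumption):
#   rank_cpu /tmp/cputimes
```

A process whose total is far above the rest sorts to the first line, which makes it the first candidate to note.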


Fault Worksheet #34 (Sheet 4 of 4)

Expected Repair
A workaround is to not run the trouble program (guilty_party) until
CPU resources are available, or to run guilty_party as a timesharing
process (not real time). You also need to determine if it is normal
behavior for this process to use so much CPU time.

Repair Verification
Rerun the start command to verify that this process is the culprit.

Note – Another way to debug “machine hangs” is to collect several
core dumps and compare the processes in execution for similarities.


Fault Worksheet #35 - No shcat

Initial Customer Description


“System gets hung up on rarp and will not finish booting.”

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #36 - Login Problem

Initial Customer Description


The system administrator attended advanced system training, worked
the weekend, and then left on vacation. Now, when the user attempts
to log in, the system returns another login prompt.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #37 - Noel Two

Initial Customer Description


“After I added a user with their own home directory, they cannot
login.”

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #38 - Client-Server ftp Problem

Initial Customer Description


Over the weekend, the entire system was off line to perform network
testing. The system was tested and then restored in the original
configuration as a diskless client and server. Now, the user is unable to
transfer files using the file transfer protocol (ftp) program.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #39 - Network Problem

Initial Customer Description


Over the weekend, the entire system was taken off line to perform
network testing. The system was tested and then restored in the
original configuration as a workstation on the network.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #40 (Sheet 1 of 6) - Slow and Fast Perceptions

Initial Customer Description


After completing the worksheet, determine the user’s complaint, if
any.

Customer claims that applications are running slower.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #40 (Sheet 2 of 6)

1. Log in as root, but do not start the OpenWindows environment.

2. Type the swap -s command, and record the values in Table 1 on
Sheet 6 of 6 (in the “Before OpenWindows” column). These will be the
base values.

# swap -s

3. Type the swap -l command and record the values in Table 2 on
Sheet 6 of 6 (in the “Before OpenWindows” column).

# swap -l

4. Make a directory called test in root to be used as a mount point.

# mkdir /test

5. Mount the server partition /opt to mount point /test.

# mount server_name:/opt /test

6. Start the OpenWindows environment.

# /usr/openwin/bin/openwin

7. Open a shell and perform the swap -s and swap -l commands.
Record these values in Tables 1 and 2 on Sheet 6 of 6 (in the “After
OpenWindows” column).

8. Start the SunDiag program, but do not begin any tests.

# /test/SUNWdiag/bin/sundiag

9. Using the Console window within the SunDiag program, perform
the swap -s and swap -l commands. Compare these values to
the base values.

10. Deselect all tests and then select the kmem test.


Fault Worksheet #40 (Sheet 3 of 6)

11. Record the value of swap space indicated in the kmem option box.

Swap space = _______

12. Record the value of physical memory as indicated in the mem
option box.

The value of physical memory = _______

13. Total physical memory must also take into account the pages
required by the kernel. The total memory minus the memory of
the kernel equals available physical memory. Check the dmesg
output for the size of kernel memory.

Kernel size = _______

14. The total disk swap space minus the available physical memory
equals memory swap space.

15. Run two passes of kmem tests and record the time required to
complete the tests. This will be the base time.

Base time = _______

16. While the test is running, you can monitor the behavior of swap
space, using the swap commands. Record the values from the first
swap commands in Tables 1 and 2 on Sheet 6 of 6 (in the “During
SunDiag” column).

17. If the test passes, add fpu and one device for a fstest.

18. Run two passes of new tests and record the time required to
complete them. This will be the loaded base time.

Loaded base time = _______


Fault Worksheet #40 (Sheet 4 of 6)

If the test is successful and the virtual and physical links are
functional, the system administrator can add the partition to the
/etc/vfstab using a shortcut method:

1. Exit the OpenWindows environment.

2. Type the following commands:

# cp /etc/vfstab /etc/vfstab.orig
# mount -p > /etc/vfstab
# umount /test
# init 6

3. Boot the system, log in as root, and start the OpenWindows
environment.

# /usr/openwin/bin/openwin

4. Ensure that the server partition is mounted to /test.

5. Start the SunDiag program.

# /test/SUNWdiag/bin/sundiag

Caution – Anything can happen when trying to perform the next
steps. If you encounter errors, try to determine the cause and
correct the problem.

6. Run two passes of kmem tests and record the time to complete.
This will be the new base time.

New base time = _______

7. Compare the “new base time” with the original “base time.” Is it
faster, slower, “hung,” stopped, or the same? Why?

___________________________________________________________

___________________________________________________________

8. If the test passes, add fpu and one device for a fstest.


Fault Worksheet #40 (Sheet 5 of 6)

9. Run two passes of new tests and record the time to complete. This
will be the new loaded base time.

New loaded base time = _______

10. Compare the “new loaded base time” with the original “loaded
base time.” Is it faster, slower, stopped, “hung,” or the same?
Why?

___________________________________________________________

___________________________________________________________

11. Justify your findings with supportive facts.

___________________________________________________________

___________________________________________________________

12. Is this performance system-dependent? ________________________
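
The comparisons in steps 7 and 10 can be put on a common scale by expressing the new time as a percentage change from the original. The following is a minimal sketch only: the function name percent_change and the sample timings are hypothetical, and the times are assumed to be recorded in seconds.

```shell
# percent_change BASE NEW
# Print how much NEW differs from BASE, in percent.
# A positive result means the new run took longer (slower);
# a negative result means it finished sooner (faster).
percent_change() {
    awk -v base="$1" -v new="$2" 'BEGIN {
        if (base <= 0) { print "n/a"; exit 1 }
        printf "%.1f\n", (new - base) * 100 / base
    }'
}

# Hypothetical timings in seconds, not measured values.
percent_change 120 150
```

A "hung" or stopped test has no completion time, so it cannot be expressed this way and should be recorded as such on the worksheet.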


Fault Worksheet #40 (Sheet 6 of 6)

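Parameter          Before           After            During SunDiag
                   OpenWindows      OpenWindows
Bytes allocated    ___________      ___________      ______________
Bytes reserved     ___________      ___________      ______________
Total bytes        ___________      ___________      ______________
Bytes available    ___________      ___________      ______________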

Swap disk(s) ___________________, __________________

Parameter          Before           After            During SunDiag
                   OpenWindows      OpenWindows
Current blocks     ___________      ___________      ______________
Free blocks        ___________      ___________      ______________
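
The four byte counts in the first table appear to correspond to the one-line summary printed by swap -s. The sketch below pulls those four numbers out of such a line. It is an illustration only: it assumes the summary has the form shown in the sample variable, the function name parse_swap_s is hypothetical, and the figures are made up rather than taken from a real system.

```shell
# parse_swap_s
# Read a swap summary line on standard input and print the four
# worksheet values in Kbytes. The field positions assume the
# one-line format shown in the sample below.
parse_swap_s() {
    awk '$1 == "total:" {
        sub(/k$/, "", $2); sub(/k$/, "", $6)
        sub(/k$/, "", $9); sub(/k$/, "", $11)
        printf "allocated=%s reserved=%s total=%s available=%s\n", $2, $6, $9, $11
    }'
}

# Hypothetical sample line (illustrative figures only).
sample='total: 31760k bytes allocated + 5952k reserved = 37712k used, 202928k available'
printf '%s\n' "$sample" | parse_swap_s

# On a live system the function would be fed directly:
#   swap -s | parse_swap_s
```

Run it once at each of the three points named in the table columns and copy the values into the matching column.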


Fault Worksheet #41 - Cannot Boot Diskless Client

Initial Customer Description


Over the weekend, the entire system was off line to perform network
testing and reconfiguration. The system was tested and then restored
in the original configuration as a diskless client on the server.

Now, the diskless client cannot be booted.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #42 - Logs Out During OpenWindows Startup

Initial Customer Description


The system administrator went to a system security training class. The
system administrator then secured the system before going on
vacation. The resident programmer gets logged out during
OpenWindows startup.

The user account is student3, and the password is student3.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #43 - Sorry User

Initial Customer Description


“Clients other than root cannot do any work.”

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #44 - No Window, Use SunSolve

Initial Customer Description


Users cannot use openwin since the system administrator added a new
kernel patch.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #45 - NIS+ Password

Initial Customer Description


Users cannot change their own passwords in an NIS+ environment.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #46 - Let Me In

Initial Customer Description


Cannot rlogin, rsh, or telnet into the server.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #47 - RPC Not Registered

Initial Customer Description


“Cannot ftp from the server.”

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #48 - Slow Access to Data

Initial Customer Description


Access to data (mount, ftp, tftp) from the server gets slow when
there is more than a minimum of activity.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #49 - Trust Me

Initial Customer Description


User responses are quite slow, but root is okay!

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #50 - Cannot Talk to Machine A

Initial Customer Description


Cannot talk to machine A from another machine.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #51 - Not On This Network

Initial Customer Description


Can no longer talk to the local network from machine B.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #52 - Do Not Point At Me

Initial Customer Description


Cannot talk to other machines on the network from machine C.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #53 - Resource Temporarily Unavailable

Initial Customer Description


During the workshop, you will insert the fault and monitor the system
up to the time of failure.

The workshop begins on the next page.

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #53 (Sheet 1 of 4)

In the steps below, using adb on the live kernel, you will lower the
maximum number of processes allowed per user. Then you will open
various windows (processes) until the error message
Resource temporarily unavailable appears.

1. Log in as student1 or some other non-root account and start the
OpenWindows environment.

2. Open a Command Tool and a Calculator (used later), and su to
root.

3. Use the following adb command to open the “running kernel.”

# adb -kw /dev/ksyms /dev/mem

4. Wait for the physical memory message. adb does not display a
prompt.

5. Use the following command to open up the var structure.

v$<v

The maxup field indicates the maximum number of processes that
are allowed per user on this workstation. The v macro displays this
value in base(10) notation. This value is calculated by the following
formula: maxup = (max_users x 16) + 5.

6. Record the value of the maxup field.

maxup= _______

7. Examine the nproc field to determine the number of currently
running processes.

Perform the following:

nproc/D

The value returned indicates the current number of processes. The
value is in base(10) notation.


Fault Worksheet #53 (Sheet 2 of 4)

8. Record the value of the nproc field.

nproc = _______

In the next step, you will reduce the value of maxup and of another
variable that controls the maximum number of processes per user
system-wide. The reduced value should be about 5 more than the current
nproc value.

When you deposit the value into the maxup field, the value is
entered in base(16) notation. The command v+1c/W xx, where xx
is your input value, enables you to change a kernel parameter
using adb.

Example:

If nproc = 50(10), then nproc = 32(16), and 32(16) + 5(16) = 37(16).

v+1c/W 37 or, if you want to write in decimal, enter:

v+1c/W 0t55, where the prefix 0t identifies the value as decimal.

Note – Do not change values in the kernel using this method. This is
for an academic learning experience only. But be aware that it can be
done.

9. Calculate the new value (nproc + 5) for your system, using the
Calculator utility, if necessary. Then write the calculated value to
v+1c.

v+1c/W xx or, in decimal, for example, v+1c/W 0t54. Also enter:

v+c/W 0t54, to change the system-wide per-user maximum as well.

The v macro does not display this second field, but you must
change it too. In Solaris 2.5, these two variables together control
the maximum number of processes that a user can create. Root is no
longer subject to any maximum, only to the exhaustion of kernel
resources.

10. Run the following commands:

v$<v
nproc/D


Fault Worksheet #53 (Sheet 3 of 4)

If maxup is not 5 greater than nproc, contact the instructor.

The next steps are repetitive.

11. Open a window or a tool, and monitor the value of nproc.
Monitor the Console window for an error indication. If your
calculations are correct, an error should happen after the fifth
window or tool is opened.

a. Open a window or tool and record the value of nproc.

nproc = _______

b. Monitor the Console window for any errors.

Error Y/N

c. Open a window or tool and record the value of nproc.

nproc = _______

d. Monitor the Console window for any errors.

Error Y/N

e. Open a window or tool and record the value of nproc.

nproc = _______

f. Monitor the Console window for any errors.

Error Y/N

g. Open a window or tool and record the value of nproc.

nproc = _______

h. Monitor the Console window for any errors.

Error Y/N

i. Open a window or tool and record the value of nproc.

nproc = _______


Fault Worksheet #53 (Sheet 4 of 4)

j. Monitor the Console window for any errors.

Error Y/N

k. Open a window or tool and record the value of nproc.

nproc = _______

l. Monitor the Console window for any errors.

Error Y/N

Note – To restore maxup to its original value, convert the original
value to base(16) and enter v+1c/W xx, where xx is the base(16) form
of the original maxup(10) value. Use the Calculator utility, if
necessary. Do the same for maxupttl, which is at v+c: enter
v+c/W 0tdd, where dd is the original value of maxup in decimal.

12. Return maxup and maxupttl to their original values, and exit
adb.
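
The base conversions used in steps 8 through 12 can also be done from a shell instead of the Calculator utility. This is a minimal sketch: the helper names to_hex, to_dec, and new_limit are hypothetical, and the nproc value of 50 is the worksheet's example rather than a reading from your system. The script only prints the text to type at adb; it does not run adb or touch the kernel.

```shell
# to_hex DECIMAL: print the value in base 16, the form adb expects
# after /W when no 0t prefix is given.
to_hex() { printf '%x\n' "$1"; }

# to_dec HEX: print a base(16) value in base(10).
to_dec() { printf '%d\n' "0x$1"; }

# new_limit NPROC: the worksheet's target, 5 more than current nproc.
new_limit() { echo $(( $1 + 5 )); }

# Worked example with nproc = 50 (decimal).
limit=$(new_limit 50)
echo "v+1c/W $(to_hex "$limit")"   # hexadecimal form
echo "v+1c/W 0t$limit"             # decimal form, 0t prefix
```

With nproc = 50 this prints v+1c/W 37 and v+1c/W 0t55, which are two spellings of the same value.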


Fault Worksheet #

Initial Customer Description

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________


Fault Worksheet #

Initial Customer Description

Error Symptoms/Conditions/Messages

Problem Statement

Research Resources

Likely Cause Test Results

Repair Verification
Instructor Initials _________________

Install Alternate Boot Block (Appendix C)

This appendix describes how to install an alternate boot block on
another partition (Solaris 2.x only).


Installing an Alternate Boot Block

Follow the steps below to install an alternate boot block on your
system.

1. Log in as root.

2. Use df -k to find another file system with 20 Mbytes of space. As an
example, the output for /opt below makes it a possible choice:

/dev/dsk/c0t1d0s0 299118 160579 108629 60% /opt

3. Install the boot block (change the device name for your system).

# /usr/sbin/installboot \
/usr/lib/fs/ufs/bootblk /dev/rdsk/c0t1d0s0

4. Dump and restore the important parts of the root file system.
# cd /opt (or the file system you want to use for your alt block)
# ufsdump 0f /opt/rootdump /
Dump messages............
Dump messages.............
# ufsrestore if /opt/rootdump
ufsrestore > add dev
ufsrestore > add devices
ufsrestore > add kernel
ufsrestore > add sbin
ufsrestore > add etc
ufsrestore > add ufsboot
ufsrestore > extract
ufsrestore > quit

5. Halt the system.

# halt

6. Record the original boot device address from the nvram and
devalias.
ok printenv
boot-device disk
ok devalias
disk /sbus/..........:a


7. Test the alternate boot block.

ok boot disk1:a -a

Press the Return key (to accept the default) on all questions until
the last one.

When asked for the address of the root device, enter the original
device address from step 6 above:
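
As a side note on step 2 of this procedure, the search for a file system with enough free space can be scripted. The following is a minimal sketch: the function name candidates is hypothetical, 20 Mbytes is taken as 20480 Kbytes, and each df -k entry is assumed to fit on one line (long device names can wrap, which this filter does not handle).

```shell
# candidates MIN_KB
# Read df -k output on standard input and print the mount point and
# available Kbytes of every file system with at least MIN_KB free.
# Lines whose fourth field is not a number (such as the header line)
# are skipped.
candidates() {
    awk -v min="$1" '$4 ~ /^[0-9]+$/ && $4 + 0 >= min + 0 { print $6, $4 }'
}

# The example line from step 2, checked against 20 Mbytes.
printf '%s\n' '/dev/dsk/c0t1d0s0 299118 160579 108629 60% /opt' | candidates 20480

# On a live system:
#   df -k | candidates 20480
```

Any mount point printed is only a candidate; confirm that the slice is suitable before installing a boot block on it.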
