ST-350
Student Guide
About This Course
Overview
The primary objective of this course is to learn a systematic fault
analysis technique to troubleshoot intermediate and some advanced
Solaris system faults.
Copyright 1997 Sun Microsystems, Inc. All Rights Reserved. SunService May 1996
Course Prerequisites
Monday
Tuesday
Wednesday
Thursday
Friday
● Have the instructor initial your completed lab projects and fault
forms.
John Shedaker
Sun Microsystems
2550 Garcia Ave., MS UMIL06-01
Mountain View, CA 94043
The following table describes the type changes and symbols used in
this book.
Typeface or Symbol    Meaning    Example
Code samples are included in this book and may display the following:
Error Detection Overview ........................................................................2-1
Introduction ....................................................................................... 2-2
Error Types ........................................................................................ 2-3
Error Reporting Mechanisms .......................................................... 2-4
Bus Errors...................................................................................2-4
Interrupts for Reporting...........................................................2-4
Resets ..........................................................................................2-4
Type of Errors .................................................................................... 2-5
Software Errors..........................................................................2-5
Hardware-Corrected Errors ....................................................2-5
Recoverable Errors....................................................................2-5
Fatal Errors.................................................................................2-5
CPU Watchdog Reset ...............................................................2-6
System Watchdog Reset ...........................................................2-6
Critical Errors ............................................................................2-6
Primary Buses.................................................................................... 2-7
Sun-4u ................................................................................................. 2-8
Memory Management Unit (MMU)............................................... 2-9
Number Base Conversion Chart ................................................... 2-10
Page Table Entry – Sun-4 Architecture ........................................ 2-11
Sun-4 PTE Format ...................................................................2-12
Examples of Valid PTEs .........................................................2-12
Page Table Entry – Sun-4c Architecture ...................................... 2-13
Sun-4c PTE Format .................................................................2-14
Examples of Valid PTEs .........................................................2-14
Page Table Entry – Sun-4m Architecture .................................... 2-15
Access Code .............................................................................2-16
Examples of Valid PTEs .........................................................2-16
Page Table Entry – Sun-4d Architecture...................................... 2-17
Access Code .............................................................................2-18
Example of Valid PTEs...........................................................2-18
Sun-4 Error Detection Workshop ................................................. 2-19
Sun-4c Error Detection Workshop................................................ 2-22
Example 1 .................................................................................2-23
Example 2 .................................................................................2-26
Sun-4m Error Detection Workshop.............................................. 2-27
Example 1 .................................................................................2-28
Example 2 .................................................................................2-31
Sun-4d Error Detection Workshop ............................................... 2-32
Example 1 .................................................................................2-33
Example 2 .................................................................................2-36
Skills Checklist................................................................................. 2-37
System Fault Status Register (sfsr) Format .......................2-41
POST Diagnostics ......................................................................................3-1
Diagnostics Overview ...................................................................... 3-2
Additional References ..............................................................6-4
Installing SunVTS Software............................................................. 6-5
The SunVTS Graphical User Interface ........................................... 6-7
Selecting and Setting Up Tests ........................................................ 6-9
SunVTS Testing Options ................................................................ 6-10
Tests Switch ..................................................................................... 6-12
Option Files..............................................................................6-12
Running the SunVTS Tests ............................................................ 6-13
System Status Panel ................................................................6-13
Test Status Panel .....................................................................6-14
Performance Monitor Panel...................................................6-15
Reviewing SunVTS Test Results ................................................... 6-17
System Status Panel ................................................................6-17
Console Window Messages...................................................6-17
Log Files ...................................................................................6-18
Using SunVTS in TTY Mode ......................................................... 6-19
Negotiating the SunVTS TTY Interface ....................................... 6-20
Using SunVTS Remotely................................................................ 6-21
Kernel Interface .......................................................................6-21
User Interface...........................................................................6-21
Lab Overview .................................................................................. 6-24
Lab Objectives..........................................................................6-24
Equipment................................................................................6-24
Lab Tasks...........................................................................................6-25
SunSolve ......................................................................................................7-1
Overview ............................................................................................ 7-3
Distribution........................................................................................ 7-4
SunSolve Online Account ................................................................ 7-5
Installing SunSolve ........................................................................... 7-6
Installing SunSolve Using File Manager ...............................7-7
Installation GUI Window.........................................................7-8
Sharing SunSolve ....................................................................7-10
Starting SunSolve ............................................................................ 7-11
Starting From an Installed Server .........................................7-11
Starting From the CD-ROM...................................................7-12
The SunSolve Window...........................................................7-12
Search Tool....................................................................................... 7-13
Configuring SunSolve ............................................................7-14
SearchTool Properties.............................................................7-15
Troubleshooting Using SearchTool .............................................. 7-16
Setting Up the Search .............................................................7-16
Keyword Logical Connectors................................................7-16
Starting the Search ..................................................................7-17
Datasets and Collections to Search............................................... 7-18
Viewing Documents Found........................................................... 7-19
kadb Description .....................................................................8-42
Invoking and Exiting kadb ....................................................8-43
Mapping UNIX Data Structures ...........................................8-44
Related Data Structures..........................................................8-46
Kernel Crash Dump Analysis Workshop 5................................. 8-49
Kernel Crash Dump Analysis Workshop 6................................. 8-51
Kernel Crash Dump Analysis Workshop 7................................. 8-52
Watchdog Reset Workshop 8 (Sheet 1 of 2) Optional................ 8-53
Bug Install ................................................................................8-53
Program Debugging – Optional ................................................... 8-55
ps Command Workshop 9—Optional ......................................... 8-57
Introduction .............................................................................8-57
Sequence of Procedures (Do Not Execute).................................. 8-58
Setting Base Level ...................................................................8-58
Acquiring Base-Level Information .......................................8-58
Tracing OpenWindows Processes ........................................8-58
Setting Base Level ........................................................................... 8-59
Acquiring Base-Level Information ............................................... 8-60
Base-Level Processes (1 of 2) ......................................................... 8-61
Tracing OpenWindows Processes ................................................ 8-63
Workshop Summary Exercise ....................................................... 8-64
Skills Checklist................................................................................. 8-66
Fault Tracker Progress Chart ..................................................................A-1
Fault Worksheets - Student Guide ........................................................ B-1
Requirements............................................................................ B-1
Resources................................................................................... B-1
System Configurations ............................................................ B-1
Fault Worksheet #1 - Blank Monitor ............................................. B-2
Fault Worksheet #2 - Device Error During Boot.......................... B-3
Fault Worksheet #3 - File Errors During Boot ............................. B-4
Fault Worksheet #4 - Incomplete Boot to Solaris
Operating System.......................................................................... B-5
Fault Worksheet #5 - Login Problem............................................. B-6
Fault Worksheet #6 - adb Macro Error.......................................... B-7
Fault Worksheet #7 - Feckless ........................................................ B-8
Fault Worksheet #8 - Incomplete Boot to Solaris
Fault Worksheet #9 - Turn the Page ............................................ B-10
Fault Worksheet #10 - Login Problem......................................... B-11
Fault Worksheet #11 - Network Problem ................................... B-12
Fault Worksheet #12 - OpenWindows Problem ........................ B-13
Fault Worksheet #13 - Shutdown When Opening
Windows ...................................................................................... B-14
Fault Worksheet #14 - Network Printer Problem...................... B-15
Fault Worksheet #15 - Incomplete Boot to Solaris
Operating System........................................................................ B-16
Install Alternate Boot Block....................................................................C-1
Installing an Alternate Boot Block ................................................. C-2
Objectives
Upon completion of this module, you will be able to:
References
Alamo Learning Systems AdvantEdge Analysis Program
Introduction
You may be an expert. With the expert approach, you gather data and
use your experience and the experience of others to determine causes.
Fault analysis and diagnosis provides you with a powerful tool to
analyze data and focus on the likely causes of a complex problem or a
problem outside of your immediate experience.
Fault Analysis
1. State the problem.
3. Identify differences.
Diagnosis
5. Generate likely causes.
Test
Test the likely cause. If No, go to the next likely cause. If Yes,
verify the likely cause.
Given a system problem, identify the object and its defect, and write a
problem statement. A problem statement answers these questions:
● Does the statement state the exact deviation from the norm?
Most bugs that turn into disasters do so because the original problem
was not described correctly.
The next step in system fault analysis and diagnosis is to describe the
problem in detail.
Questions to Ask
Expand and customize a question list for your own style and
environment.
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
Fact Sources
● Customer complaints
● Dumps
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
Questions to Ask
● What similar object might have this defect but does not?
● What other defect could you see on the problem object but do not?
● Where else on the problem object could you see the defect but do
not?
● When could the defect have been first observed but was not?
● What other time in the object’s life cycle could the defect have
occurred but did not?
● In what other pattern could the defect have occurred but did not?
● How many of the objects might have been defective but are not?
● What other trend could have been observed but was not?
● Are the comparative facts as close and similar to the observed facts
as possible and yet not complete opposites?
Identifying Differences
Use the lists of observed facts and comparative facts to analyze and list
the differences.
● List only the differences that are unique between the observed and
comparative facts.
First you use differences and relevant changes to discover likely causes
of the problem. Then you form a hypothesis about the cause, and
analyze the problem with facts, differences, and relevant changes.
Then you can diagnose the problem.
State your hypothesis in the form of a question and an answer that can
be tested. For example:
How could the fault analysis element have caused this problem?
For the fault analysis element, insert one of the following possibilities:
● A relevant change
● A single difference
You can develop as many hypotheses as you have facts. Use your
experience and judgement to limit, initially, the list to the most logical
and likely cause(s). If your first hypothesis does not prove true, you
can return to this step.
Using the list of likely causes, test each one to determine the most
likely cause. Testing your likely causes increases the certainty that you
will discover the actual cause of the problem before you embark on or
recommend a potentially costly, time-consuming solution.
To test for the most likely cause, eliminate any cause that fails to
explain the observed and comparative facts.
Eliminate a likely cause only when you are certain it cannot be the true
cause of the problem.
Test each likely cause separately using the fault analysis worksheets.
Ask yourself whether the cause can support the facts, and mark a Y for
yes or N for no on the line under the fact number. For example:
Test each likely cause against each relevant fact and mark it Y or N. If
you must make an assumption or have a doubt about an answer, mark
it with a question mark (?). If you simply cannot make a
determination, leave it blank.
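The marking and elimination rules above can be sketched in a few lines of Python. This is an illustration only: the function name surviving_causes, the dictionary layout, and the sample marks are assumptions and are not part of the fault analysis worksheet. The sketch eliminates a cause only on a certain N, and keeps causes that carry a ? or a blank, because those marks do not give you certainty.

```python
# Hypothetical sketch of the worksheet test step.
# Marks: "Y" = the cause explains the fact, "N" = it does not,
# "?" = an assumption or doubt, "" = no determination made.

def surviving_causes(matrix):
    """Return causes with no certain N against any fact, fewest doubts first."""
    survivors = []
    for cause, marks in matrix.items():
        if "N" in marks:          # a certain N eliminates the cause
            continue
        doubts = sum(1 for m in marks if m in ("?", ""))
        survivors.append((doubts, cause))
    survivors.sort()
    return [cause for _, cause in survivors]


# Sample marks for two likely causes (illustrative only).
worksheet = {
    "Frame buffer design or build fault": ["Y", "N", "N", "?", "Y"],
    "Environment (static)":               ["Y", "Y", "Y", "?", "Y"],
}

print(surviving_causes(worksheet))
```

Sorting by the number of doubts is a design choice of this sketch: it puts the cause that needs the fewest assumptions first, which is usually the one to verify first.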
Now you are ready to verify, test, and prove that the most likely cause
is the actual cause of the problem.
To verify the most likely cause, use the method that is:
● Least disruptive
● Least expensive
● Least time-consuming
● Most conclusive
Verifying the most likely cause should remove all uncertainty about
the cause of a problem. Three methods that verify the most likely
cause of the problem include:
● Results – Assume, without proof, that the most likely cause you
choose is the actual cause, and take the indicated corrective action.
This is the least conclusive verification, and it can be disruptive,
expensive, and time-consuming, especially if your assumptions
are not correct.
Problem Statement
Window system hangs on systems using the GX+ video frame buffer.
Problem Description / Observed Facts / Comparative Facts / Differences

1. What object (system) is defective?
   Observed facts: Six systems using ss2GX+ video frame buffer
   Comparative facts: Not on other Sun machines on this site, but on other Sun machines elsewhere
   Differences: Location and environment, temperature, humidity, dirt, power, static

2. What exactly is wrong?
   Observed facts: System "hangs" but can remote login
   Comparative facts: System does not crash or freeze; can sometimes fix by removing or inserting mouse
   Differences: Operating system is still running; power cycle of mouse may be related

3. Where is the object (system) located?
   Observed facts: Acme Industries; factory control units in manufacturing plant
   Comparative facts: Not at other Acme sites, other customers, or office environment
   Differences: Environment, network, vibration

4. Where on the object (system) does the defect appear?
   Observed facts: Not guaranteed repeatable, but window system most often affected
   Comparative facts: No documented hang at OBP monitor or single user
   Differences: Window system uses mouse and full resolution of GX+ color

5. When was the defect first observed?
   Observed facts: Call logged 1/18, problem has been ongoing for a while
   Comparative facts: Not right after delivery of systems using GX+ video frame buffers on 12/12
   Differences: Happening more often during busy periods

6. When in the life cycle was the defect noticed?
   Observed facts: Five weeks after installation
   Comparative facts: Not when system was brand new
   Differences: New hardware; bedded in

9. How many objects (systems) are defective?
   Observed facts: One group of six
   Comparative facts: Not all Sun workstations

10. What is the trend?
   Observed facts: Worse, more frequent
   Comparative facts: Not getting better or stable
1. What object (system) is defective? The systems using the GX+ video frame buffer after 12/12
the last quarterly anti-static treatment was
completed.
Likely Causes
Likely Cause 1 2 3 4 5 6 7 8 9 10
1 GX+ video frame buffer design or build fault Y N N ? Y N? Y - Y Y?
2 Environment (static) Y Y Y ? Y Y Y - Y Y
Application
Design
Final Repair
Environment static created the problem.
The instructor is the user, and you can ask the instructor questions
about the problem.
The user has added three new hosts to the established network. A
matrix was generated that indicated which hosts were
communicating. These were installed after midnight just prior to a
three-day holiday.
Host A
Host B
Host C
ping rlogin
The instructor is the user, and you can ask the instructor questions
about the problem.
The user has added three new hosts to the established network. A
matrix was generated that indicated which hosts were
communicating.
Host A
Host B
Host C
ping rlogin
Likely Causes
Likely Cause 1 2 3 4 5 6 7 8 9 10
1
Final Repair
Problem Statement
_______________________________________________________________
Problem
Observed Facts Comparative Facts Differences
Description
1. What system is
defective?
7. Pattern of occurrence
9. Number of systems
defective
10. Trend
7. Pattern of occurrence
10. Trend
Error Symptoms/Conditions/Messages
● Observed facts (1)
● Differences
● Relevant changes
● Comparative facts
Problem Statement
Research Resources
Repair Verification
Skills Checklist
Skill    Student Initials    Instructor Initials
Gather and document observed facts and place them in the fault
analysis worksheet matrix.
Gather and document obtained information and place it in the
fault analysis matrix.
Generate a list of likely causes based on facts within the fault
analysis matrix.
Develop a course of action to repair based on likely causes.
Objectives
Upon completion of this module, you will be able to:
References
The SPARC Architecture Manual - Version 8
Introduction
The lab in this module will bring you back to an early level of
computer understanding and data manipulation – back to the 1’s and
0’s and register-bit mapping. The labs are architecture-dependent.
Error Types
● Bus Errors
● Interrupts
● Resets
● Types of errors
● Software errors
● Hardware-corrected errors
● Recoverable errors
● Fatal errors
● Critical errors
Bus Errors
Bus errors are issued to the processor when the processor makes a reference to
virtual or physical space that cannot be satisfied for hardware reasons.
Some typical bus errors occur:
● Error detected.
Resets
A reset attempts to bring the system to a well-known (deterministic)
state. Types of resets include:
● System
● Power on
● Watchdog
● System software
Type of Errors
Software Errors
Errors that do not originate in the hardware are classified as software
errors. All such errors are detected by the processor and are reported.
Examples of software errors are programming errors or bugs in the
system code.
Hardware-Corrected Errors
For error-logging purposes, hardware-corrected errors are always
signaled by an interrupt. No recovery action is normally required. For
example, a single-bit memory error is corrected by the error checking and
correcting (ECC) logic and is reported in the error log.
Recoverable Errors
Recoverable errors caused by hardware are usually signaled by a bus
error indication to the requesting device and a specified interrupt
(which could broadcast the error). Error recovery is normally handled
by the trap routines, while error logging is done by the interrupt
handler. A nonessential device losing power or becoming inaccessible
is an example of a recoverable error.
Fatal Errors
All fatal errors initiate a system-watchdog reset. Fatal errors
correspond to hardware errors in which proper system operation
cannot be guaranteed. Parity errors on backplanes are an example of a
fatal error.
Critical Errors
Critical errors require immediate system shutdown and power-off.
They are signaled through a high-level broadcast interrupt if at all
possible. Types of critical errors include:
● An AC/DC failure
● Temperature warning
● Fan failure
Primary Buses
Sun Architecture
Architecture    Model
Sun-4           4/330, 4/370, 4/390, 4/470, 4/490
Sun-4m          SS5, SS10, SS20, 630, 670, 690, Classic, ClassicX, SSLX
Sun-4u
(Figure: Ultra-4u architecture block diagram. Labels: UltraSparc, Sysio, sbus, UPA, UPA connector, Address, Multiplexor, Serial, Onboard, Bus.)
The MMU contains page table entries (PTEs) that are loaded by kernel
code during normal process execution.
● Page caching
A valid PTE indicates that the virtual address has been mapped to a
physical page in memory.
Decimal Binary Hexadecimal
0 0000 0
1 0001 1
2 0010 2
3 0011 3
4 0100 4
5 0101 5
6 0110 6
7 0111 7
8 1000 8
9 1001 9
10 1010 a
11 1011 b
12 1100 c
13 1101 d
14 1110 e
15 1111 f
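If you want to check a hand conversion, the chart can be reproduced with a short Python sketch. The function names chart_row and hex_to_binary are made up for this illustration. The second function expands a hex value one digit at a time into 4-bit groups, which is the same step you perform when you map a register or PTE value onto its bit fields.

```python
# Convert between the three columns of the chart: decimal, binary, hexadecimal.

def chart_row(value):
    """Return (decimal, 4-bit binary, hex digit) for a value from 0 to 15."""
    if not 0 <= value <= 15:
        raise ValueError("the chart covers 0 through 15 only")
    return (value, format(value, "04b"), format(value, "x"))


def hex_to_binary(hex_string):
    """Expand each hex digit into its 4-bit group, as you would by hand."""
    return " ".join(format(int(digit, 16), "04b") for digit in hex_string)


print(chart_row(10))              # (10, '1010', 'a')
print(hex_to_binary("d0000002"))  # 1101 0000 0000 0000 0000 0000 0000 0010
```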
● Bit 31 (PTE valid bit) – When set to one (1), the PTE is valid.
● Bit 30 (Write access bit) – When set to one (1), page has write
access.
● Bit 29 (System access bit) – When set to one (1), system access is
enabled for that page.
● Bit 28 (Do not cache bit) – When set to one (1), caching is disabled.
● Bit 25 (Access bit) – When set to one (1), indicates page has been
accessed.
● Bit 24 (Modify bit) – When set to one (1), indicates page has been
modified.
31 30 29 28 27 26 25 24 23 19 18 00
● 0 0 – Main memory
● 0 1 – I/O space
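The bit definitions above can be applied mechanically. The following Python sketch decodes the listed flag bits of a Sun-4 PTE. The function name decode_sun4_pte is hypothetical, and the position of the two-bit type field (bits 27 and 26) is an assumption read off the bit ruler rather than something the text states. The sample values are PTE values that appear in the Sun-4 Error Detection Workshop in this module.

```python
# Decode the flag bits of a Sun-4 PTE as listed above.
# ASSUMPTION: the two-bit type field sits in bits 27-26 (inferred from the
# bit ruler, not stated in the text); only types 0 and 1 are named there.

FLAGS = {
    31: "valid",
    30: "write",
    29: "system",
    28: "no_cache",
    25: "accessed",
    24: "modified",
}
TYPES = {0: "main memory", 1: "I/O space"}


def decode_sun4_pte(pte):
    """Return a dict of flag names to booleans, plus the assumed type field."""
    fields = {name: bool((pte >> bit) & 1) for bit, name in FLAGS.items()}
    type_code = (pte >> 26) & 0b11
    fields["type"] = TYPES.get(type_code, "not listed")
    return fields


for raw in (0xD0000002, 0xA0000002, 0x20000002):
    print(format(raw, "08x"), decode_sun4_pte(raw))
```

With these values, d0000002 decodes as valid and writable, a0000002 as valid but without write access, and 20000002 as invalid.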
31 30 29 28 27 26 25 24 23 19 18 00
● Bit 31 (PTE valid bit) – When set to one (1), the PTE is valid.
● Bit 30 (Write access bit) – When set to one (1), page has write
access.
● Bit 29 (System access bit) – When set to one (1), system access is
enabled for that page.
● Bit 28 (Do not cache bit) – When set to one (1), caching is disabled.
● Bit 25 (Access bit) – When set to one (1), indicates page has been
accessed.
● Bit 24 (Modify bit) – When set to one (1), indicates page has been
modified.
31 30 29 28 27 26 25 24 23 19 18 00
● 0 1 – I/O physical
● 1 0 – I/O physical
● 1 1 – I/O physical
31 30 29 28 27 26 25 24 23 19 18 00
31 30 08 07 06 05 04 03 02 01 00
31 30 08 07 06 05 04 03 02 01 00
Access Code
Code     Supervisor  User
1 0 0    - - x       - - x
1 0 1    r w -       r - -
1 1 0    r - x       - - -
1 1 1    r w x       - - -
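The four access codes shown above can be treated as a lookup table. The Python sketch below is a minimal illustration, not a description of the MMU hardware. It assumes that the first r/w/x group in each row is supervisor access and the second is user access, which the table does not label; the name allowed is hypothetical.

```python
# Look up the permissions for the four access codes listed above.
# ASSUMPTION: the first r/w/x group is supervisor access and the second is
# user access; the table itself does not label its columns.

ACCESS_CODES = {
    0b100: ("--x", "--x"),
    0b101: ("rw-", "r--"),
    0b110: ("r-x", "---"),
    0b111: ("rwx", "---"),
}


def allowed(acc, operation, supervisor):
    """Return True if operation ('r', 'w', or 'x') is permitted for acc."""
    sup_perms, user_perms = ACCESS_CODES[acc]
    perms = sup_perms if supervisor else user_perms
    return operation in perms


print(allowed(0b101, "w", supervisor=True))   # True
print(allowed(0b101, "w", supervisor=False))  # False
```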
31 30 08 07 06 05 04 03 02 01 00
31 30 08 07 06 05 04 03 02 01 00
Access Code
k0
k1
3. Use the p command to open a page map for virtual address 1000
and enable it to be modified if needed.
p 1000
d0000002 is selected.
^t<6> 1000
l 1000
00001000: 00000000? 12345678
00001004: 00000000?
>l 1000
00001000: 12345678?
This shows that you wrote to virtual address 1000. No errors were detected.
p 1000
Page Map 00000000 [segment: 0000]: F0000000? a0000002
Page Map 00002000 [segment: 0000]: F0000001?
>^t 1000
l 1000
l 1000
00001000: 00000000? 1234
The next error forces an invalid PTE for virtual address 1000. As
you will see, not even a read can be performed. Once again, all
commands are highlighted including the error.
k0
k1
p 1000
Page Map 00000000 [segment: 0000]: D0000000? 20000002
Page Map 00002000 [segment: 0000]: D0000001?
^t 1000
Virtual Address 0x00001000 is mapped to Physical
Address 0x00005000.
Context=0x0, Segment Map=0x0, Page Map=0x20000002.
>l 1000
00001000:
Reset Procedure
To begin this workshop, you must obtain a Sun-4c workstation.
ok reset
(If you see a > monitor prompt, type n, then type reset.)
Within 4 seconds, the pinwheel for booting begins. Press Stop (L1)–a.
Refer to “Page Table Entry – Sun-4c Architecture” for the correct PTE
format for Sun-4c architecture. Console commands are in boldface
type. Use this information not to troubleshoot problems but to
understand the error detection mechanism used by the diagnostics
and operating system software.
Example 1
1. Type the following console command:
1000 map?
1000 map?
1000 20 ab fill
1000 20 dump
Example 1 (Continued)
The first error condition is a valid PTE that will be read-only. You
will attempt to perform a write to the page, thus forcing the error
condition.
1000 map?
9. Type the following console command (to prove you can read):
1000 20 dump
1000 20 11 fill
serr@ .
Example 1 (Continued)
A hex value is returned indicating the type of error that was
detected. Refer to the table below for verification.
Bit Error
15 07 06 05 04 03 02 01 00
1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
8 0 1 0
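The bit pattern above, 1000 0000 0001 0000, reads as hex 8010. The following Python sketch shows the conversion and lists which bit positions are set. The function name set_bits is hypothetical, and the sketch does not assign a meaning to any bit; look up each set bit in the error table.

```python
def set_bits(value, width=16):
    """Return the positions of the bits that are 1, highest first."""
    return [bit for bit in range(width - 1, -1, -1) if (value >> bit) & 1]


pattern = "1000 0000 0001 0000"          # bit 15 on the left, bit 0 on the right
value = int(pattern.replace(" ", ""), 2)

print(format(value, "04x"))  # 8010
print(set_bits(value))       # [15, 4]
```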
Example 2
1. Perform the reset procedure on page 2-22. This resets the system
after the error.
1000 map?
1000 20 dump
serr@ .
Reset Procedure
To begin this workshop, you must obtain a Sun-4m workstation.
ok reset
(If you see a > monitor prompt, type n, then type reset.)
Within 4 seconds, the pinwheel for booting begins. Press Stop (L1)–a.
Example 1
1. Type the following console command:
1000 map?
1000 map?
1000 20 ab fill
1000 20 dump
Example 1 (Continued)
At this point, you have set up a known read/write condition and
ensured that it worked. Now, you will create an error condition.
The first error condition is a valid PTE that will be read-only. You
will attempt to perform a write to the page, thus forcing the error
condition.
1000 map?
10. Type the following console command to prove you can read:
1000 20 dump
1000 20 11 fill
Example 1 (Continued)
12. Type the following console command:
.sfsr
What is the value of the fault type field? Refer to the table below
for verification.
6 Internal error
5 Access bus or time-out
4 Translation error
3 Privilege violation
2 Protection error
1 Invalid address
0 No error
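Reading the fault type out of a register value is a shift and a mask. The Python sketch below uses the fault type names from the table above. It assumes that the fault type field occupies bits 4 through 2 of the sfsr, as in the SPARC V8 reference MMU; the function name fault_type and the sample register value are made up for illustration, so check the sfsr format for your machine before relying on it.

```python
# Names taken from the fault type table above.
FAULT_TYPES = {
    0: "No error",
    1: "Invalid address",
    2: "Protection error",
    3: "Privilege violation",
    4: "Translation error",
    5: "Access bus or time-out",
    6: "Internal error",
}


def fault_type(sfsr):
    """Extract the fault type code and name from an sfsr value.

    ASSUMPTION: the fault type field is bits 4-2, as in the SPARC V8
    reference MMU; check the sfsr format for your machine.
    """
    code = (sfsr >> 2) & 0b111
    return code, FAULT_TYPES.get(code, "not listed")


# Made-up register value for illustration, not captured output.
print(fault_type(0b01010))  # (2, 'Protection error')
```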
Example 2
1. Perform the reset procedure on page 2-27 to reset the system after
the error.
1000 map?
1000 20 dump
.sfsr
What is the value of the fault type field? Refer to the “sfsr Fault
Types” table for verification.
Reset Procedure
To begin this workshop, you must obtain a Sun-4d workstation. Do
one of the following, depending on the state of your system.
ok reset
Within 4 seconds, the pinwheel for booting begins. Press Stop (L1)–a.
Refer to “Page Table Entry – Sun-4d Architecture” for the correct PTE
format for Sun-4d architecture. Console commands are in boldface
type. Use this information not to troubleshoot problems but to
understand the error detection mechanism used by the diagnostics
and operating system software.
Example 1
1. Type the following console command:
1000 map?
1000 map?
1000 20 ab fill
1000 20 dump
Example 1 (Continued)
At this point, you have set up a known condition (read/write) and
ensured that it worked. Now, you will create an error condition.
The first error condition is a valid PTE that will be read-only. You
will attempt to perform a write to the page, thus forcing the error
condition.
1000 map?
10. Type the following console command to prove you can read:
1000 20 dump
1000 20 11 fill
.sfsr
Example 1 (Continued)
What is the value of the fault type field? Refer to the table below
for verification.
4 Translation error
3 Privilege violation
2 Protection error
1 Invalid address
0 No error
Example 2
1. Perform the reset procedure on page 2-32 to reset the system after
the error.
1000 map?
1000 20 dump
.sfsr
What is the value of the fault type field? Refer to the “sfsr Fault
Types” table for verification.
Skills Checklist
No direct skills are associated with this module. This module and
associated workshops are used only to demonstrate the error-detection
mechanism. A field engineer would not be required to troubleshoot
the equipment with the skills used within the workshop.
Objectives
Upon completion of this module, you will be able to:
References
Field Engineer Handbook, Volume 1 and 2, Part Numbers 800-4006 and
800-4247
Diagnostics Overview
[Figure: Diagnostics overview showing POST, extended POST, and user diagnostics (callouts: installed as package; requires Solaris operating system).]
Diagnostics Overview
● Conduct all hardware bus probes, and save information for the
operating system’s automatic reconfiguration (ok boot -r) and
memory sizing
Note – A deliberate limitation of the boot PROM POST is that the I/O
devices themselves are not tested; only the devices and buses required
to access the boot device are tested.
● Installed as a package
[Figure: Boot PROM POST diagnostics. They run at power-on or a system reset; the CPU chip (IU) executes the POST instructions, and test numbers are shown on the machine LEDs (some desktops only use LEDs on the keyboard).]
[Figure: Serial port A cabling to a modem port or an ASCII terminal: transmit data (pin 2), receive data (pin 3), signal ground (pin 7).]
% tip hardwire
connected
[Figure: Serial port A of the broken machine (in diagnostic mode) connected to serial port A or B of the good machine.]
Machine Information
The information below describes the machine used for this example.
● SPARCstation 5, no keyboard
# tip hardwire
$$$$$ WARNING: No Keyboard Detected! $$$$$
MMU Context Table Reg Test
MMU Context Register Test
MMU TLB Replace Ctrl Reg Tst
MMU Sync Fault Stat Reg Test
MMU Sync Fault Addr Reg Test
MMU TLB RAM NTA Pattern Test
MMU TLB CAM NTA Pattern Test
MMU TLB LCAM NTA Pattern Test
IOMMU SBUS Config Regs Test
IOMMU Control Reg Test
IOMMU Base Address Reg Test
IOMMU TLB Flush Entry Test
IOMMU TLB Flush All Test
SBus Read Timeout Test
EBus Read Timeout Test
D-Cache RAM NTA Test
D-Cache TAG NTA Test
I-Cache RAM NTA Test
I-Cache TAG NTA Test
Memory Address Pattern Test
FPU Register File Test
FPU Misaligned Reg Pair Test
The following example shows the output displayed when you use the tip
command. The correct response is connected, followed by the POST
output.
Diagnostics Output
The diagnostics run on all system boards, testing all CPU modules,
buses and memories.
# tip hardwire
connected
0B>
BIST Status = 00000001 Signature - CPU = 6ED695A2
0B>map16 test
0A>
BIST Status = 00000001 Signature - CPU = 6ED695A2
0B>
**** SPARCserver_1000 MP POST Rev 8 ****
The results of the POST normally pass quickly on the display. You can
view the results using the DEMON menu.
0A>total pmem 0x00008000 [pages] 0x008000000 [bytes] in 1 chunks
0A>DRAM chunk 0 base 0x00000000 size 0x00008000
0A> (0=failed,1=passed,blank=untested/unavailable)
(sbus 1=card present,0=card not present,x=failed)
0A>------+---------+------+-------+------+----+-----+----+--------+-------+------+-----+
0A> Slot | cpuA | bw0 | cpuB | bw0 | bb | ioc0| sbi| mqh0 | mem |sbus |xd0|
0A>------+---------+------+-------+------+----+-----+----+--------+-------+------+-----+
0A> 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 | 0011| 1 |
0A> 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 | 0011| 1 |
0A>------+--------+------+-------+------+----+-----+----+--------+--------+------+-----+
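The memory figures in this output can be cross-checked by hand. The sketch below assumes a 4-Kbyte page size, which the output does not state; under that assumption the page count agrees with the two 64-Mbyte boards shown in the mem column.

```python
PAGE_SIZE = 4096  # assumption: 4-Kbyte pages


def pages_to_bytes(pages, page_size=PAGE_SIZE):
    """Convert a page count to a byte count."""
    return pages * page_size


total_pages = 0x00008000  # "total pmem" page count from the POST output
total_bytes = pages_to_bytes(total_pages)
total_mbytes = total_bytes // (1024 * 1024)

print(hex(total_bytes))
print(total_mbytes, "Mbytes")
```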
The next area displays the POST DEMON menu. It shows the steps
necessary to view system parameter information. The keys are
considered hot keys. You do not need to press Return after you press a
hot key.
DEMON
0A>Select one of the following functions
0A> '0' System Parameters
0A> '1' Read/Write device
0A> '2' Software Reset
0A> '3' NVRAM Management
0A> '4' Error Reporting
0A> '5' Analyze Error Logs
0A> '6' Power Off at Main Breaker
0A> '7' NVRAM SIMM tests
0A> 'r' Return to selftest
Command ==> 0
System Parameters
0A>Select one of the following functions
0A> '0' Set POST Level
0A> '1' Dump Device Table
0A> '2' Display System
0A> '3' Dump Board Registers
0A> '4' Dump Component IDs
0A> '5' Clear Error Logs
0A> '6' Display Simms
0A> '7' Scrub Main Memory
0A> 'r' Return
Command ==> 2
0A> (0=failed,1=passed,blank=untested/unavailable)
(sbus 1=card present,0=card not present,x=failed)
0A>------+-------+-----+-------+------+---+------+----+--------+-------+------+-----+
0A> Slot | cpuA | bw0 | cpuB | bw0 | bb | ioc0| sbi | mqh0 | mem |sbus |xd0|
0A>-----+-------+------+-------+-----+----+------+----+--------+-------+------+-----+
0A> 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 |0011| 1 |
0A> 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 64 |0011| 1 |
0A>-----+-------+------+-------+-----+----+------+----+--------+-------+------+-----+
0A>Memory Group Status
(0=failed,1=passed,m=simm missing,c=simm
mismatch,blank=unpopulated/unused)
0A>+----+------+------+------+------+
0A> Slot| g0 | g1 | g2 | g3 |
0A>+---+-------+------+------+------+
0A> 0 | 1 | 1 | | |
0A> 1 | 1 | 1 | | |
0A>+---+-----+-------+-------+------+
0A>Hit any key to continue :
Command ==> r
0A>
DEMON
0A>Select one of the following functions
0A> '0' System Parameters
0A> '1' Read/Write device
0A> '2' Software Reset
0A> '3' NVRAM Management
0A> '4' Error Reporting
0A> '5' Analyze Error Logs
0A> '6' Power Off at Main Breaker
0A> '7' NVRAM SIMM tests
0A> 'r' Return to selftest
0A>
Command ==> 5
0A>
-------------- Error Log Analysis for Board 0 --------------
0A>
-------------- Error Log Analysis for Board 1 --------------
0A>
-------------- System Memory Failure Analysis ----------------
0A> No Bad groups found
0A>Hit any key to continue :
0A>
DEMON
0A>Select one of the following functions
0A> '0' System Parameters
0A> '1' Read/Write device
0A> '2' Software Reset
0A> '3' NVRAM Management
0A> '4' Error Reporting
0A> '5' Analyze Error Logs
0A> '6' Power Off at Main Breaker
0A> '7' NVRAM SIMM tests
0A> 'r' Return to selftest
0A>
Command ==>r
0A>
ttya initialized
Probing Memory Bank #0 128 Megabytes
SUNW,SPARCserver-1000
Cpu #0 cpu-unit TI,TMS390Z55
Cpu #1 cpu-unit TI,TMS390Z55
Cpu #2 cpu-unit TI,TMS390Z55
Cpu #3 cpu-unit TI,TMS390Z55
mem-unit mem-unit
bif bif
bootbus zs zs eeprom sram leds bootbus zs zs eeprom sram leds
io-unit sbi
Probing /io-unit@f,e0200000/sbi@0,0 at 0,0 dma esp sd st lebuffer le
Probing /io-unit@f,e0200000/sbi@0,0 at 1,0 cgsix
Probing /io-unit@f,e0200000/sbi@0,0 at 2,0 Nothing there
Probing /io-unit@f,e0200000/sbi@0,0 at 3,0 SUNW,soc SUNW,pln SUNW,ssd
SUNW,pln SUNW,ssd
io-unit sbi
Probing /io-unit@f,e1200000/sbi@0,0 at 0,0 dma esp sd st lebuffer le
Probing /io-unit@f,e1200000/sbi@0,0 at 1,0 dma esp sd st lebuffer le
Probing /io-unit@f,e1200000/sbi@0,0 at 2,0 Nothing there
Probing /io-unit@f,e1200000/sbi@0,0 at 3,0 Nothing there
[Figure: Healthy system connected to the faulty system by a null modem cable or a modem.]
Note – Before you begin, make sure that the healthy system has the
Solaris operating environment booted to multiuser mode and has a
window system running or available.
5. Turn off the faulty system to prevent blowing the keyboard fuse.
# /usr/openwin/bin/openwin
Note – The hardwire argument says that the tip command expects
9600 baud, 8 data bits, and 1 stop bit at port B on the CPU board, not
an ALM or SPC. It is not a coincidence that these are the parameters
set for Port A when a machine powers up without a keyboard.
8. If port A is the only available port, edit the /etc/remote file on the
“good” system to use port A:
● Before edit:
:dv=/dev/term/b:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D
● After edit:
:dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D
# tip hardwire
12. Why are you getting an error that looks like a “Net” error?
Notes
13. Press ~Control-d or ~ . to end the tip session. (See “POST tip
Commands.”)
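The /etc/remote edit in step 8 changes only the device (dv) capability. The Python sketch below splits the two capability strings from that step and compares them. It is a simplified reader written for this illustration (the function name is hypothetical, and escaped colons are not handled), not the parser that tip itself uses.

```python
def parse_remote_caps(entry):
    """Split a colon-separated /etc/remote capability string into a dict."""
    caps = {}
    for field in entry.strip().strip(":").split(":"):
        if not field:
            continue
        if "=" in field:                 # string capability, name=value
            name, value = field.split("=", 1)
            caps[name] = value
        elif "#" in field:               # numeric capability, name#value
            name, value = field.split("#", 1)
            caps[name] = int(value)
        else:                            # boolean capability
            caps[field] = True
    return caps


before = parse_remote_caps(":dv=/dev/term/b:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D")
after = parse_remote_caps(":dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D")

# List the capabilities whose values differ between the two entries.
changed = sorted(name for name in before if before[name] != after.get(name))
print(changed, after["dv"], after["br"])
```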
You can also display POST tests on nearly any ASCII terminal or
laptop.
~#
~.
Or
~ ^d (tilde Control-d)
~?
For more information on the tip command, refer to the on-line man
pages.
Objectives
Upon completion of this module, you will be able to:
● Test devices using the device path, node name, and device
alias.
● Alter any NVRAM setting, display the settings, and reset to the
defaults.
References
Field Engineer Handbook, Volume 1 and 2, Part Numbers 800-4006 and
800-4247
Features
● Ability to read plug-in device drivers and diagnostics from probed
devices. (Early Sun machines required all boot drivers and
diagnostics to be completely written in the boot PROM.)
● User-callable diagnostics
[Figure: OpenBoot PROM (OBP) and NVRAM. OBP labels: > prompt (limited commands), ok prompt (full FORTH commands), FORTH code, POST, extended POST, user diagnostics, NVRAM defaults. NVRAM labels: variable system parameters (setenv, printenv), battery.]
Host ID contains:
● CPU-type code
SPARCstation 20 Workstation
<#2> ok printenv
Parameter Name Value Default Value
<#2> ok
Diagnostic Overview
[Figure: Diagnostics overview showing POST, extended POST, and user diagnostics (callouts: installed as package; requires Solaris operating system).]
[Figure: Boot flowchart. System initialization passes or fails, with an error indication on failure. The sunmon-compat?, security-mode?, and auto-boot? settings then lead to the > prompt, the ok prompt, or the start of the boot sequence, using boot-device and boot-file or diag-device and diag-file.]
[Figure: Boot sequence after ok boot: execute primary boot (OBP); kernel reads /etc/system; kernel initialized; execute rc scripts.]
<#0> ok cd /
<#0> ok ls
ffda476c io-unit@f,e1200000
ffd91c10 io-unit@f,e0200000
ffd8d2f4 mem-unit@f,e1100000
ffd8d210 mem-unit@f,e0100000
ffd8cebc cpu-unit@f,e1800000
ffd8cb68 cpu-unit@f,e1000000
ffd8c814 cpu-unit@f,e0800000
ffd8c4c0 cpu-unit@f,e0000000
ffd839a8 boards
ffd712fc openprom
ffd702bc virtual-memory@0,0
ffd7016c memory@0,0
ffd625cc aliases
ffd6257c options
ffd6252c packages
<#0> ok cd io-unit@f,e1200000
<#0> ok ls
ffda4d20 sbi@0,0
<#0> ok cd sbi
<#0> ok ls
ffdb0ffc lebuffer@1,40000
ffdac1f4 dma@1,81000
ffda9ff4 lebuffer@0,40000
ffda51ec dma@0,81000
<#0> ok cd dma@1,81000
<#0> ok ls
ffdac878 esp@1,80000
<#0> ok cd esp@1,80000
<#0> ok ls
ffdb05b4 st
ffafef4 sd
Target 0
Unit 0 Disk CONNER CP30548 SUN0535AEBX93081BWC
Target 1
Unit 0 Disk CONNER CP30548 SUN0535AEBX93082TZA
Target 2
Unit 0 Disk CONNER CP30548 SUN0535AEBX93082MD4
Target 3
Unit 0 Disk CONNER CP30548 SUN0535AEBX93081BRX
/io-unit@f,e1200000/sbi@0,0/dma@0,81000/esp@0,80000
Target 0
Unit 0 Disk CONNER CP30548 SUN0535AEB793081TGX
Target 1
Unit 0 Disk CONNER CP30548 SUN0535AEB793081WNL
Target 2
Unit 0 Disk CONNER CP30548 SUN0535AEB793081Q8Z
Target 3
Unit 0 Disk CONNER CP30548 SUN0535AEB7930810A0
/io-unit@f,e0200000/sbi@0,0/dma@0,81000/esp@0,80000
Target 0
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 1
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 2
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 3
Unit 0 Disk SEAGATE ST3610N SUN0535881000000000Copyright (c) 1993
Seagate All rights reserved 0000
Target 4
Unit 0 Removable Tape ARCHIVE Python 28454-XXX4.28
Target 6
Unit 0 Removable Read Only device SONY CD-ROM CDU-8012 3.1e
<#0> ok
<#0> ok show-sbus
Board# 0 SBus slot 0 lebuffer le dma esp
Board# 0 SBus slot 1 cgsix
<#0> ok module-info
CPU# 0 : 50.0 MHz SuperSPARC / SuperCache
CPU# 1 : 50.0 MHz SuperSPARC / SuperCache
CPU# 2 : 50.0 MHz SuperSPARC / SuperCache
CPU# 3 : 50.0 MHz SuperSPARC / SuperCache
<#0> ok print-nvram-stat
Board#0 -- nvram master, Prom Version 2.13
Board#1 -- nvram slave, Prom Version 2.13+0.08
Board#2 -- no board or no Viking module
Board#3 -- no board or no Viking module
<#0> ok show-sbus
SBus slot f SUNW,bpp ledma le espdma esp
SBus slot e SUNW,DBRIe
SBus slot 0
SBus slot 1
SBus slot 2 cgsix
SBus slot 3
<#0> ok probe-scsi
Target 1
Unit 0 Disk QUANTUM P105SS 910-10-94A.1 08/31/89009030144
GENERIC
Target 3
Unit 0 Disk SEAGATE ST31200W SUN1.05872400795741
Copyright (c) 1994 Seagate
All rights reserved 0000
Target 4
Unit 0 Removable Tape ARCHIVE VIPER 150 21531-003 SUN-03.00.00
Target 6
Unit 0 Removable Read Only device TOSHIBA XM-
4101TASUNSLCD108404/18/94
<#0> ok module-info
MBus : 50.00 MHz
SBus : 25.00 MHz
CPU#0 : 50.00 MHz SuperSPARC
CPU#2 : 50.00 MHz SuperSPARC
<#0> ok 2 switch-cpu
<#2> ok 0 switch-cpu
<#0> ok 2 switch-cpu
IMPL:0
<#2> ok 1 switch-cpu
Processor #1 is not present!
Lab 1
In this lab you will test devices using the device path, node name, and
device alias.
Note – Because PROM levels and architectures differ, the syntax for
these labs can vary slightly. Refer to the OBP reference card if
necessary.
2. Use help to list some PROM-level diagnostics, and run them all.
ok help diag
Category: Diag (diagnostic routines)
test device-specifier ( -- ) run selftest method for specified device
Examples:
test /memory - test memory
test /iommu/sbus/ledma@5,8400010/le - test net
te................
...................
ok setenv selftest-#megs 99 (setting up to test 99 megs of memory)
ok test-memory
Testing memory \/
ok test net
Notes
Note – If the ok prompt returns with no message, the self
test found no errors.
Notes
Lab 2
● Alter any NVRAM setting, display the settings, and reset to the
defaults
You are directed to use selected console commands and observe the
output. You can determine if you find the results useful.
help diag
help watch-tpe
show boot-device
show-hier
show-ttys
show-tapes
show-nets
show-disks
module-info
devalias
show-attrs
show-devs
printenv
printenv diag-switch?
show diag-switch?
set-default diag-switch?
show diag-switch?
set-defaults
Or do the following:
During power on or after the ok reset, hold down the Stop (L1) and n
keys simultaneously on the Sun keyboard. (There is no corresponding
simple key hold down to reset NVRAM to defaults from a port
connection.)
Optional
The NVRAM settings can also be changed by root from the operating
system:
# /usr/sbin/eeprom
# /usr/sbin/eeprom boot-device=disk1
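When the eeprom settings need to be compared or saved from a script, the listing can be read into a dictionary. The Python sketch below assumes the listing has one name=value pair per line; the sample text is hypothetical, and lines in any other form are skipped rather than interpreted.

```python
def parse_eeprom_output(text):
    """Return a dict of NVRAM settings from 'name=value' lines."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep:  # skip lines that are not name=value pairs
            settings[name] = value
    return settings


# Hypothetical listing, for illustration only.
sample = """\
diag-switch?=false
boot-device=disk1
auto-boot?=true
"""

settings = parse_eeprom_output(sample)
print(settings["boot-device"])
```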
Notes
Lab 3
In this lab you will display and capture the names of the devices in the
system device tree and display their attributes. This is useful for
determining whether a failure of a Sun or third-party device is a
hardware or a software problem.
Note – The lab will take you to one device; if you have time, go out
and display some others.
ok cd /
ok ls
ffd3c184 FMI,MB86904
.........
ok cd iommu@0,10000000
ok ls
ffd2c2c8 sbus@0,10001000
ok cd sbus@0,10001000
ok ls
ffd42504 cgsix@3,0
f.......
ok cd cgsix@3,0
ok ls
ok .attributes
character-set ISO8859-1
intr 00000039 00000000
reg 00000003 00000000 01000000
dblbuf 00000000
v0,64125000,108000000,94500000
chiprev 0000000b
device_type display
model SUNW,501-2325 (look at this, the Sun part #!)
name cgsix
Lab 4
In this lab, you will generate and test a PROM device alias.
With the increased use of storage arrays and other variously addressed
devices, it is important to be able to set a simple name for the device
that the customer can boot from or otherwise use.
Note – If you recreate the tip hardwire session, you can cut and paste,
instead of typing a lot of the entries in the lab.
2. ok show-disks
a) /obio/SUNW,fdtwo@0,400000
b) /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd
q) NO SELECTION
Enter Selection, q to quit: b
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd has
been selected.
Type ^Y ( Control-Y ) to insert it in the command line.
e.g. ok nvalias mydev ^Y
for creating devalias mydev for
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd
3. ok nvalias newdisk ^Y
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@0,0
4. ok devalias
newdisk
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@0,0
screen /iommu@0,10000000/sbus@0,10001000/cgsix@3,0
ttyb /obio/zs@0,100000:b
5. ok boot newdisk
Note – The boot will probably fail here unless a bootblock has been
placed on the disk. You will set up alternate boots in a later
module.
Hand edit the nvramrc file using information from the device tree;
then enable the use of it. (This is required currently for making aliases
for storage array devices or with older PROMs that do not support the
nvalias command.)
ok devalias
ok cd /
ok ls (just to find our way!)
ffd3c184 FMI,MB86904
ffd2d1e0 virtual-memory@0,0
ffd2d124 memory@0,0
ffd2c458 obio
ffd2c184 iommu@0,10000000
ok cd iommu@0,10000000
ok ls
ffd2c2c8 sbus@0,10001000
ok cd sbus@0,10001000
ok ls
ffd4242c cgsix@3,0
ffd423cc power-management@4,a000000
ffd41c80 SUNW,CS4231@4,c000000
ffd40024 ledma@5,8400010
ffd3ff98 SUNW,bpp@5,c800000
ffd3cea4 espdma@5,8400000
ok cd espdma@5,8400000
ok ls
ffd3d280 esp@5,8800000
ok cd esp@5,8800000
ok ls
ffd3f854 st
ffd3f13c sd
ok cd sd
ok pwd
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd
ok nvedit
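The full path printed by pwd above is a chain of node names, each optionally followed by @ and a unit address. The Python sketch below splits such a path into its nodes, which can help when copying a path into an alias by hand. The function name is hypothetical, and device arguments after a colon (as in the ttyb alias) are not separated out.

```python
def split_device_path(path):
    """Split a device path into (node name, unit address) pairs."""
    nodes = []
    for component in path.strip("/").split("/"):
        name, _, unit_address = component.partition("@")
        nodes.append((name, unit_address or None))
    return nodes


path = "/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd"
for name, unit_address in split_device_path(path):
    print(name, unit_address)
```

The last node, sd, has no unit address until a target and unit such as @0,0 are appended.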
Lab 5- Optional
In this lab, you will construct, download, and run FORTH macros.
1. Set up the tip command as you did in the POST lab. That is, display
the ok prompt of one machine in another machine’s “tip hardwire”
session.
4. Because the macros you construct do not survive a power-on reset,
construct a macro in a file that you can download any time you
want.
You are going to create the file on the machine that is running the
operating system; then download it to the machine that is at the
ok prompt.
proto2# vi /opt/mapping
: mapping
38e 1000 pgmap!
1000 map?
1000 100 ab fill
1000 100 dump
Notes
0 : multilpy3 <cr>
Notes
Objectives
Upon completion of this module, you will be able to:
References
Solaris User and System Administration Answerbooks
Open Discussion
1.
2.
3.
4.
5.
6.
7.
8.
9.
Objectives
Upon completion of this module and lab, you will be able to:
Introduction
The SunVTS tests can be used to stress certain areas of the system as
needed for diagnostic and troubleshooting purposes.
[Figure: SunVTS architecture. Labels: SunVTS application programming interface, logs messages, test interface, SunVTS hardware tests, user-created custom tests.]
User Interfaces
Kernel
The kernel runs as a background process, a daemon. Upon startup of
the SunVTS software, the SunVTS kernel probes the system kernel for
installed hardware devices. Those devices are displayed on the
SunVTS user interface.
Both the SunVTS kernel and the user interface must be started before
testing can begin.
Hardware Tests
For each supported hardware device, a corresponding hardware test
can validate its operation. Each test is a separate process from the
SunVTS kernel process.
Additional References
For more extensive information and usage of the SunVTS diagnostic
software, see the following publications:
The pkgadd command is used to install SunVTS software from the CD-
ROM Updates for Solaris Operating Environment 2.5 (Part Number 704-
5104-10).
Insert the CD-ROM into the CD-ROM drive, and type the pkgadd
command as root:
# pkgadd -d /cdrom/upd_sol_2_5_smcc/SMCC
View the screen output from the pkgadd application to ensure that the
install completed successfully.
# /opt/SUNWvts/bin/sunvts
[Figure: SunVTS window showing the System Status panel, Performance meter, Control panel, and Tests Selection panel.]
● The Control panel – A panel that contains the buttons that you
use to control the SunVTS user interface.
● The Tests Selection panel – A panel where you select the tests and
test groups to run; you can also change the options for each test
and test group.
● The Test Option panel – A panel where you choose the global
options for all SunVTS tests.
● The System Status panel – A panel that shows the general testing
status.
● The Test Status panel – A panel that displays pass and error
counts for each test and test group.
The following are the buttons on the control panel and their functions:
Stop Click on the Stop button to halt all active tests. The
test results remain on the Test Status panel after
testing is completed. Click on the Stop button only
once. Some tests do not stop immediately, so the
System Status may slowly change from Stop to Idle.
Quit Using the Quit button, you can terminate the user
interface, the SunVTS kernel, or both.
Sys Config Click on the Sys Config button to display the Sys
Config menu. Menu choices are display or print test
system configuration information, or reprobe the test
system.
Log Files SunVTS saves the status of its progress in three log
files. Use the Log Files button to look at the error
messages, information, or UNIX® messages log files.
From the Test Selection panel, you can select the tests you want to run,
and specify the testing options.
Options can be set globally for all of the SunVTS tests you select. Click
on the Set Options button for the SunVTS Testing Options menu.
Options can also be set for each test group. Press the button of a test
group or test name for the option menu.
The following options can be set to apply to all selected SunVTS tests
or, if applicable, to individual test groups or tests.
group_override
Supersedes the specific test options in favor of the
group options in this window.
group_concurrency
Sets the number of tests you want to run at the same
time in the same group.
num_instances
Specifies the number of tests to run for all tests that
are scalable.
Tests Switch
Three settings are available:
● Default enables the default group of tests. This includes all tests
that do not require intervention.
Option Files
You can save your SunVTS testing selections to a file. This prevents
you from having to reset these same options again in the future. Test
settings are saved in the /var/adm/sunvtslog/options directory.
To save an option file, type a name for the option file, and click on the
Store button.
Intervention
Certain tests require that you intervene before you can run the test
successfully. These include tests that require media or loopback
connectors.
You cannot select these tests until you enable the intervention mode.
This setting does not change the test function; it just serves as a
reminder that you must intervene before the test can be successfully
completed.
The icons at the top of the Test Status panel enable you to navigate
through the list of tests in case there are more tests running than can
be displayed on the panel.
Errors are also recorded in a log file that you can view by clicking on
the Log Files button on the Control panel.
Log Files
You can use the Log Files menu to view error, information, and UNIX
message log files that are managed by the system.
4. Display the Information and UNIX Msgs files, but do not remove
any files.
# /opt/SUNWvts/bin/vtsk
2. Start the SunVTS TTY User Interface with the vtstty command:
# /opt/SUNWvts/bin/vtstty
Only one panel has focus (selected for keyboard input) at a time. Focus
can be shifted between the three panels by pressing the tab key. The
panel with focus is bordered by asterisks (*).
[Figure: SunVTS TTY interface showing the selected panel, Control panel, Tests panel, Status panel, and Console.]
Kernel Interface
To test a remote system, the kernel process /opt/SUNWvts/bin/vtsk
must be running on that system.
User Interface
To test a local system, the user interface can be either TTY (teletype) or
graphical.
User Interface
The graphical user interface (GUI) component must have the interface
/opt/SUNWvts/bin/sunvts running as an active process.
User Interface
You can also connect directly to the remote computer running the
SunVTS kernel when starting the graphical user interface.
/opt/SUNWvts/bin/sunvts -h remote_hostname
TTY interface
Lab Overview
Lab Objectives
● Install the SunVTS package on a system.
Equipment
To complete this lab, you will need:
Lab Tasks
In this lab, you are going to verify that all hardware on your lab
system is functional. You will need the SunVTS software present on
your system.
# pkgrm SUNWvts
Now that you have a general idea of how the diagnostics work, here
are some steps to try to get more familiar with the features.
6. Kill the SUNWvts kernel process and try the previous two steps
again.
8. Run the audio test. Observe the different selections that are played
depending on the machine you are testing.
a. Auto-start.
c. Kill SunVTS.
11. Find the maximum number of passes allowed for the fputest.
12. Attempt to force an error.
Objectives
Upon completion of this module, you will be able to:
References
● SunSolve Online User’s Guide
Overview
Distribution
Updated CD-ROMs are sent out about ten times a year and have
information regarding all supported software, operating system levels,
and hardware.
● http://sunsolve.Sun.COM
● http://SunSolve1.Sun.COM
● http://www.Sun.Com
2. Click on the Create new account button and answer the questions.
(You must have a SunService Spectrum Account number to
register for a SunSolve Online account.) There is little or no wait in
receiving an account once you submit the form.
Installing SunSolve
Install the SunSolve software and patches on a server and share them
correctly to the network.
a. If this is the first time you have run the share command on
this machine, edit the /etc/dfs/dfstab file and add the
following line:
# vi /etc/dfs/dfstab
share -o ro /cdrom/sunsolve_2_8
# /etc/init.d/nfs.server start
# showmount -e
Or
# dfshares -F nfs servername
2. Click on Install.
Sharing SunSolve
To share the SunSolve server directory, /opt/SUNWss, perform the
following steps.
Or
# /etc/init.d/nfs.server start
Starting SunSolve
Note – If you are asked if you want to run in a Shell Tool, answer yes.
Search Tool
Configuring SunSolve
To configure the SunSolve software, click on the Properties button in
the SearchTool window. The SearchTool properties window is
displayed.
SearchTool Properties
The SearchTool properties window contains a Category menu button
with the following property types:
Notice here that the maximum number of documents to retrieve is set
to 100, the search timeout is set to 60 seconds (make the timeout
longer if searching across a network), and Fuzzy Boolean searching
is on (this helps to find related keywords in searches).
● Viewer – You can specify the text viewer, the PostScript viewer, or
the picture (GIF) viewer.
Suppose you are having printer problems under the Solaris 2.5 operating
environment. You can use the SearchTool window to search for
probable symptoms.
● Patch Descriptions
● printer
● 2.5
● AND – The logical AND means the collections searched must contain
all keywords joined by AND.
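The effect of the AND operator can be sketched in Python. This is only an illustration of logical AND matching on whole words, not SearchTool's actual search algorithm; the function name and the sample titles are hypothetical.

```python
def matches_all(text, keywords):
    """True if every keyword appears as a whole word in text (logical AND)."""
    words = set(text.lower().split())
    return all(keyword.lower() in words for keyword in keywords)


# Hypothetical document titles, for illustration only.
titles = [
    "printer fails after upgrade to 2.5",
    "printer setup notes for 2.4",
    "2.5 install guide",
]

# Keep only the titles that contain both keywords.
hits = [title for title in titles if matches_all(title, ["printer", "2.5"])]
print(hits)
```

Only the first title contains both keywords, so it is the only hit; the other two each match one keyword and are rejected.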
● Early Notifier
● Bug Reports
● Patch Descriptions
● Solaris Q & A
● Info Docs
Patches
1. From the SearchTool window, select only the Info Docs collection
to search.
5. From the Display menu, choose In new viewer. The 2.5 patch
report is displayed.
For lab setup, insert or mount the patches CD-ROM. The File Manager
window displays the following:
1. From the File Manager window shown on the previous page, click
on the patchinstall icon.
Note – If you are not running File Manager or OpenWindows, you can
start the patch install script by changing to the directory where the
patch CD-ROM is mounted and typing ./patchinstall as superuser.
You will see each installpatch script run. You might also see
messages such as Patch already installed: continue? |Y|.
After installing these (or any other patches), reboot the system unless
specifically given other instructions from the install script.
To install the above patch, type the patch ID number (102979 here),
instead of typing suggested when prompted for the patch to install.
Patch to install (patchid, suggested, ?): 102979
2. Find the installed location of the patch (patches are usually installed
in the /var/sadm/patch directory).
# find / -name 102044-01 -print
(output omitted)
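The patch ID in the find command above has two parts, a base number and a revision. The Python sketch below splits an ID and builds the path where the text says patches are usually installed. The function names are hypothetical, and the path is only the expected location, so the find command remains the way to confirm it.

```python
import posixpath

PATCH_DIR = "/var/sadm/patch"  # usual location named in the text


def split_patch_id(patch_id):
    """Split 'base-revision' (for example 102044-01) into (base, revision)."""
    base, sep, revision = patch_id.partition("-")
    if not (sep and base.isdigit() and revision.isdigit()):
        raise ValueError("expected an ID of the form NNNNNN-NN: %r" % patch_id)
    return base, int(revision)


def expected_patch_dir(patch_id):
    """Return the directory where the patch would usually be recorded."""
    return posixpath.join(PATCH_DIR, patch_id)


print(split_patch_id("102044-01"))
print(expected_patch_dir("102044-01"))
```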
SunSolve Labs
Note – The question text below matches page headers in this module.
11. Display the current patch report for a given operating system.
This section illustrates the method for conducting some basic searches
of the SunSolve information. It shows how to construct and refine a
search, and displays the results of a sample search.
[Figure: SearchTool window with callouts: choose the collections you want to search; enter keywords (and optional operators); select the area of the document(s) you want to search; click on the Search button.]
What you enter here is the keyword that SearchTool will look for
in the collections. You can also use the optional operators to
further define your search.
The most commonly used area is entire doc, which looks in all
parts of all of the documents of the collections you have selected.
Each collection allows you to define your search by the areas
available in that collection. In some cases, you may know the
document ID number, and might want to search in the document
ID area of All Collections.
Using MultiView
Once you have used SearchTool to locate the documents you want,
you can use MultiView to display or print the document or save the
document to a file. MultiView is the display tool for SearchTool. It is
capable of displaying the full range of document formats available in
the SunSolve collections.
Document Formats
Note – You should set the viewer types to Default unless you are
familiar with other tools that you would like to specify as Custom.
Default viewers have been selected to work with the document
collections.
Text Viewer
Default displays ASCII text files in the system text viewer. The Custom
selection displays ASCII files in a TextEdit window, or another text
window specified.
PostScript Viewer
If you are running on an X terminal, you should set this to Custom and
enter the name of the PostScript viewer. For example, to use ghostview,
replace the default pageview with ghostview.
Picture Viewer
The scroll list at the bottom of the SearchTool window lists documents
that match your search. You can use MultiView to display, print, email,
or save these documents to a file.
1. Click on the title of the document in the scroll list at the bottom of
the SearchTool window.
MultiView Features
Print Option
The Print option sends the document you are viewing to a printer. You
can specify which pages to print: All (the entire document), This page
(only the page you are presently viewing), or a Range of pages,
delimited by the From and To fields in the window. You can also
specify the name of the printer in the printer field. Click on the Print
button when all your choices are completed.
MultiView Features
Save Option
The Save option enables you to save the current document in a file.
Specify the location of the file in the window, and type the name of the
file in the Name field. Click on the Save button when all your choices
are completed.
Email Option
MultiView Features
Properties Option
Objectives
Upon completion of this module, you will be able to:
● Use the adb and crash commands to manipulate core files and to
locate a failing process or file.
● Use the adb and crash commands to isolate the failing processor,
instruction, thread, process, and file on three core dumps and on
one system hang.
References
The SPARC Architecture Manual, SPARC International
Introduction
The UNIX operating system uses assertion checks throughout the kernel
code. Assertion checks are placed at critical points within the software.
When a call is made to the ASSERT() routine, a check is made. If the
condition is not true and the kernel module is compiled with the
DEBUG flag, the system panics. Also, within the code are data
integrity checks. If a data check fails, it calls upon the cmn_err()
routine.
● 17,000 assertions
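The DEBUG-gated behavior described above can be modeled in a few lines. The following Python sketch is only an analogy, not kernel code: the names KernelPanic, kernel_assert, and try_assert are invented for illustration, and the real ASSERT() is implemented in C inside the kernel.

```python
class KernelPanic(Exception):
    """Stands in for a system panic in this model."""


def kernel_assert(condition, description, debug=True):
    """Model of an assertion check: panic only when built with DEBUG."""
    if debug and not condition:
        raise KernelPanic("assertion failed: " + description)
    # Without DEBUG the check has no effect and execution continues.
    return True


def try_assert(condition, debug):
    """Return 'panic' or 'continue' for a given condition and build type."""
    try:
        kernel_assert(condition, "rs != NULL", debug=debug)
    except KernelPanic:
        return "panic"
    return "continue"
```

A false condition only brings the model down when debug is true, which mirrors the statement that the system panics only if the module was compiled with the DEBUG flag.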
When the system reboots, this core dump must be saved into files that
can then be passed to adb for analysis. The savecore(1M) command
performs this function. By default, the system does not examine the
swap area for core dumps when it boots; savecore must be enabled
in /etc/init.d/sysetup.
Header Files
● /usr/include/sys/proc.h
● /usr/include/sys/thread.h
● /usr/include/sys/klwp.h
● /usr/include/sys/user.h
● /usr/include/sys/cred.h
● /usr/include/vm/as.h
● /usr/include/vm/seg.h
Debuggers
adb
adb is an interactive, general-purpose debugger. It can be used to
examine files, and it provides a controlled environment for the
execution of programs. adb reads commands from the standard input
and displays responses on the standard output. It does not supply a
prompt.
crash
The crash command is used to examine the system memory image of
a running or a crashed system by formatting and printing control
structures, tables, and other information. Command-line arguments to
crash are dump file, name list, and output file.
kadb
kadb is an interactive debugger with a user interface similar to that of
adb(1), the assembly language debugger. kadb must be loaded prior
to the standalone program it is to debug. It runs in the same address
space as the standalone program, thus sharing many resources with
that program. The debugger is cognizant of and able to control
multiple processors if they are present in a system.
Unlike adb, kadb runs in the same supervisor virtual address space as
the program being debugged although it maintains a separate context.
The debugger runs as a coprocess that cannot be killed (`:k').
SAVECORE Setup
In /etc/init.d/sysetup, change:
##
## Default is to not do a savecore
##
#if [ ! -d /var/crash/`uname -n` ]
#then mkdir -p /var/crash/`uname -n`
#fi
# echo 'checking for crash dump...\c '
#savecore /var/crash/`uname -n`
# echo ''
To:
##
## Default is to not do a savecore
##
if [ ! -d /var/crash/`uname -n` ]
then mkdir -p /var/crash/`uname -n`
fi
echo 'checking for crash dump...\c '
savecore /var/crash/`uname -n`
echo ''
Invoking adb/kadb/crash
adb
# cd crash_directory
# adb -k unix.n vmcore.n
crash
# crash vmcore.n unix.n
dumpfile = vmcore.0, namelist = unix.0, outfile = stdout
>
kadb
ok boot disk kadb
adb Commands
If address is omitted, the current location is used. (The dot [.] also
stands for the current location.) The address can be a kernel symbol. If
the count is omitted, it defaults to 1.
x or X Displays in hex.
Examples
v+0/D          Examine a symbolic location; display content in decimal
               v:
               v:              100
v+0/X          Examine a symbolic location; display content in hex
               v:
               v:              64
v+0=X          Determine the VA of symbolic location v
               f017255c
f017255c/X     Examine the content of a VA
               64
fc63ecbc/i     Examine a VA for an instruction (disassemble)
               backseat_write: sethi %hi(0xfffffc00), %g1
$q             Quit
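The /D and /X examples show the same word twice: 100 in decimal is 64 in hex. The following Python sketch mimics that rendering for a 32-bit word. The function name adb_format is hypothetical, and the assumption that /D is a signed 4-byte decimal should be checked against the adb man page.

```python
def adb_format(value, fmt):
    """Render a 32-bit word roughly as the /D and /X formats present it."""
    value &= 0xFFFFFFFF
    if fmt == "X":
        return format(value, "x")
    if fmt == "D":
        # Assumed: /D is signed, so the top bit is treated as the sign.
        signed = value - (1 << 32) if value & 0x80000000 else value
        return str(signed)
    raise ValueError("unsupported format: " + fmt)


decimal_view = adb_format(100, "D")
hex_view = adb_format(100, "X")
```

Here decimal_view and hex_view correspond to the two displays of v in the examples.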
adb Macros
$M Displays built-in macros (kadb).
During the development of the RAM disk driver, the system crashes
with a data fault when running newfs. The savecore command has
been enabled in the sysetup shell script. This enables copies of the
current kernel and core file to be saved when the system reboots.
There are times when the msgbuf variable used by the msgbuf macro
may not be loaded in the dynamic kernel symbol table, in which case
you would use the strings command on the vmcore.n file.
# strings vmcore.0
...
ASC = 0x4 (LUN not ready), ASCQ = 0x2, FRU = 0x0
BAD TRAP: cpu_id=2 type=9 <Data fault> addr=30 rw=1 rp=e0922ac4
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load>
level=3
MMU sfsr=0x326<FAV>
BAD TRAP occurred in module "ramd" due to an illegal access to a user
address.
mkfs: Data fault
kernel read fault at addr=0x30, pte=0x0
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load>
level=3
MMU sfsr=0x326<FAV>
...
The output contains a great deal of information, including the same
panic message that the $<msgbuf command returns. The rest of the
panic message is shown below.
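When the strings output is too long to read comfortably, you can filter it for the panic text. The following Python sketch illustrates the idea on an in-memory byte string. The function names, the keyword list, and the sample bytes are assumptions made for this example, and the scan is a simplification of what strings(1) does.

```python
import re


def printable_strings(data, min_len=4):
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(
        b"[\\x20-\\x7e]{" + str(min_len).encode("ascii") + b",}"
    )
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def panic_lines(data, keywords=("BAD TRAP", "fault")):
    """Keep only the strings that mention one of the panic keywords."""
    return [s for s in printable_strings(data)
            if any(k in s for k in keywords)]


# Invented sample standing in for a small piece of a vmcore file.
image = (b"\x00\x01SunOS Release 5.5\x00"
         b"BAD TRAP: cpu_id=2 type=9 <Data fault>\x00\x7fab\x00"
         b"mkfs: Data fault\n\x00")
hits = panic_lines(image)
```

For a real dump you would read the file in binary mode and pass its contents to panic_lines; a large vmcore file may need to be read in pieces.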
The message buffer has been edited, but in the workshop, you will see
the full message buffer. The BAD TRAP message contains the important
information about the crash: the type of fault detected (<Data fault>),
the CPU that detected the fault (cpu_id=2), the register pointer (rp), and
the fault type (ft), which here indicates an <Invalid address error>.
The panic message also includes the CPU ID and the thread
(sequence of instructions) executing at the time of the crash.
The message buffer alone supplies most of what you need to know
about the system crash.
The rest of the crash dump analysis uses adb macros and commands to
navigate through the crash dump and retrieve data that may not be
available from the message buffer (or to stand in for the message
buffer when it is unavailable).
BAD TRAP: cpu_id=2 type=9 <Data fault> addr=30 rw=1 rp=e0922ac4
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load> level=3
MMU sfsr=0x326<FAV>
BAD TRAP occurred in module "ramd" due to an illegal access to a user address.
mkfs: Data fault
kernel read fault at addr=0x30, pte=0x0
MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load> level=3
MMU sfsr=0x326<FAV>
ram_write+0x2c, pid=363, pc=0xf06ad304, sp=0xe0922b10, psr=0x400000c4, context=39
g1-g7: ffffff98, 0, e00afac4, 40, f0bb0bd8, 1, f06c16c0
Begin traceback... sp = e0922b10
write+0x190 @ 0xe00afc54, fp=0xe0922b78
args=d80000 e0922bd8 f03b1c18 d8 f0287d48 f06ad2d8
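The fields of the BAD TRAP line can be pulled apart mechanically. The following Python sketch splits the line from this message buffer into its key=value pairs and the fault name in angle brackets. The function name parse_bad_trap is hypothetical, and the sketch assumes the single-line layout shown here.

```python
import re


def parse_bad_trap(line):
    """Split a BAD TRAP line into its key=value fields and fault name."""
    fields = dict(re.findall(r"(\w+)=(\w+)", line))
    name = re.search(r"<([^>]+)>", line)
    fields["fault"] = name.group(1) if name else None
    return fields


sample = "BAD TRAP: cpu_id=2 type=9 <Data fault> addr=30 rw=1 rp=e0922ac4"
info = parse_bad_trap(sample)
fault_address = int(info["addr"], 16)
```

The values stay as strings because the message prints them in hex without a 0x prefix; convert them with int(value, 16) when you need numbers.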
The $c command displays the stack. Note that the cmn_err() routine
is called: the fault was determined to be a nonrecoverable error,
which ended in a panic. In Solaris 2.5, the stack trace points directly
to the reason for the fault: it shows ram_write(), the driver routine
that caused the system to go down.
$c
complete_panic(0xe024c800,0x1,0xe0241800,0xf05b2ab8,0x5,0xe024c800) + d0
do_panic(?) + 20
vcmn_err(0xe02496b0,0xe092297c,0xe092297c,0x18,0x18,0x3)
cmn_err(0x3,0xe02496b0,0xe0251fa0,0x0,0x12778,0xdffffad0) + 1c
die(0x9,0xe0922ac4,0x30,0x326,0x1,0xe02496b0) + 120
trap(0x0,0xe0922ac4,0x30,0x326,0x1,0x0) + 498
fault(?) + 7c
Syssize(via getminor)(0x0,0x3ffff,0x20,0x7fffffff,0xf0a5829c,0x315c1813)
ram_write(0xd80000,0xe0922bd8,0xf03b1c18,0xd8,0xf0287d48,0xf06ad2d8) + 1c
write(0x5) + 190
Using the value in the pc field, you can determine the instruction that
was executing at the time of the panic with the adb i command (1).
The results of this command indicate that a load instruction was
executing at an address given by the symbol ram_write+0x2c. With
the /i command, you have determined the assembly instruction that
caused the system to go down.
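The symbol+offset form and the raw pc are two views of the same address. As a rough check, the following Python sketch subtracts the offset from the pc reported in the message buffer (ram_write+0x2c, pc=0xf06ad304) to recover the start of the routine. The function name routine_base is hypothetical.

```python
def routine_base(pc, offset):
    """Return the start address of a routine given pc = symbol + offset."""
    return pc - offset


pc = 0xf06ad304          # pc field from the message buffer
offset = 0x2c            # from ram_write+0x2c
base = routine_base(pc, offset)
base_hex = format(base, "x")
```

The result, f06ad2d8, is the same value that appears as the last entry of the args line in the traceback of this message buffer.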
You can use the cpu macro to navigate the CPU data structures and
locate the thread (which you already know from the message buffer).
Applied to cpu0, the macro displays the CPU data structure for the first
CPU (id=0). Because the message buffer shows that this was not the
failing CPU, use the content of the next field, which holds the address
of the next CPU data structure. Note also that thread and idle thread
(idle_t) are equal, which indicates that this CPU was idle.
cpu0$<cpu
cpu0:
cpu0: id seqid flags
0 0 1d
cpu0+0xc: thread idle_t pause
e06c1ec0 e06c1ec0 e08a0ec0
cpu0+0x18: lwp callo fpowner
0 0 f06a30c0
cpu0+0x24: next prev next on prev on
f05852d0 f05b2ab8 f05852d0 f05b2d58
cpu0+0x34: lock npri queue limit actmap
0 110 f036e568 f036ea90 f028dbc0
cpu0+0x44: maxrunpri max unb pri nrunnable
-1 -1 0
cpu0+0x50: runrun kprnrn dispthread thread lock
0 0 e06c1ec0 0
cpu0+0x5c: intr_stack on_intr intr_thread intr_actv
e06dffa0 1 e06dcec0 0
cpu0+0x6c: base_spl
0
Follow the next field (f05852d0) to continue the search for the CPU that
was running at the time of the fault. Note thread and idle thread: they
are equal again, so this CPU was also idle.
f05852d0$<cpu
0xf05852d0: id seqid flags
1 1 1d
0xf05852dc: thread idle_t pause
e06feec0 e06feec0 e0721ec0
0xf05852e8: lwp callo fpowner
0 0 f06a30c0
0xf05852f4: next prev next on prev on
f0585030 e0251120 f0585030 e0251120
0xf0585304: lock npri queue limit actmap
0 110 f057b580 f057baa8 f028d660
0xf0585314: maxrunpri max unb pri nrunnable
-1 -1 0
0xf0585320: runrun kprnrn dispthread thread lock
0 0 e06feec0 0
0xf058532c: intr_stack on_intr intr_thread intr_actv
e071ffa0 1 e071cec0 0
0xf058533c: base_spl
0
Finally, you have arrived at the correct CPU data structure (id=2).
Another key point has been reached: you can now display the content of
the thread data structure. You know the thread from the message buffer.
Note that on this CPU, thread and idle_t differ.
f0585030$<cpu
0xf0585030: id seqid flags
2 2 1d
0xf058503c: thread idle_t pause
f06c16c0 e0723ec0 e0746ec0
0xf0585048: lwp callo fpowner
0 0 f0a8e810
0xf0585054: next prev next on prev on
f05b2d58 f05852d0 f05b2ab8 f05852d0
0xf0585064: lock npri queue limit actmap
0 110 f057b040 f057b568 f028db10
0xf0585074: maxrunpri max unb pri nrunnable
-1 -1 0
0xf0585080: runrun kprnrn dispthread thread lock
0 0 e0723ec0 0
0xf058508c: intr_stack on_intr intr_thread intr_actv
e0744fa0 1 e0741ec0 0
0xf058509c: base_spl
0
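The search through the CPU structures is a walk along a linked list. The following Python sketch replays it using the id, thread, idle_t, and next values from the three displays above. The dictionary layout and the function name find_cpu are invented for this illustration, and only the three structures shown in this module are included.

```python
# Values copied from the three $<cpu displays; the address of cpu0 itself
# is not shown in the output, so its symbol name is used as the key.
CPUS = {
    "cpu0": {"id": 0, "thread": 0xe06c1ec0, "idle_t": 0xe06c1ec0,
             "next": "f05852d0"},
    "f05852d0": {"id": 1, "thread": 0xe06feec0, "idle_t": 0xe06feec0,
                 "next": "f0585030"},
    "f0585030": {"id": 2, "thread": 0xf06c16c0, "idle_t": 0xe0723ec0,
                 "next": "f05b2d58"},
}


def find_cpu(cpus, start, wanted_id):
    """Follow next pointers from start until the wanted CPU id is found."""
    seen = set()
    addr = start
    while addr in cpus and addr not in seen:
        seen.add(addr)
        cpu = cpus[addr]
        if cpu["id"] == wanted_id:
            return addr, cpu
        addr = cpu["next"]
    return None, None


addr, cpu = find_cpu(CPUS, "cpu0", 2)
was_busy = cpu["thread"] != cpu["idle_t"]
```

The seen set guards against looping forever, since the next pointers of the real structures eventually lead back to the first CPU.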
The thread that caused the panic can also be obtained from the
message buffer or from the panic_thread variable that the system
maintains. This variable holds the address of the thread that caused
the system to panic regardless of how many CPUs there are in the
system.
panic_thread/X
panic_thread:
panic_thread: f06c16c0
Use the thread macro. Search the structure for the procp (process
pointer) field.
f06c16c0$<thread
0xf06c16c0:
link stk
0 e0922c08
0xf06c16cc:
bound affcnt bind_cpu
0 0 -1
0xf06c16d4:
flag procflag schedflag state
0 0 11 4
0xf06c16e0: pri epri pc sp
0 0 e004c13c e0922480
0xf06c16ec: wchan0 wchan cid clfuncs
0 0 2 f0371960
0xf06c1700:
cldata ctx lofault onfault
f0594700 0 0 0
0xf06c1710:
nofault swap lock cpu
0 e0921000 ff f05b2ab8
0xf06c1720:
intr delay_cv tid alarmid
0 0 1 0
realitimer
0xf06c1734: interval.sec interval.usec value.sec value.usec
0 0 0 0
0xf06c1744:
itimerid sigqueue sig
0 0 0 0
0xf06c1754:
hold forw back
0 0 f06c16c0 f06c16c0
0xf06c1764:
lwp procp next prev
f0bb0bd8 f0bb8cd0 f0aa0920 f06c1ea0
0xf06c16da:
preempt trace whystop whatstop
1 0 0 0
0xf06c17a4:
kpri_req sysnum astflag pollstate cred
11 4 0 0 f03b1c18
0xf06c178c:
lbolt pctcpu trapret pre_sys post_sys sig_check
1b520 ae 0 0 0 0
0xf06c1794:
lockp oldspl disp queue disp time
f05b2b10 de1 f05b2aec 111899
0xf06c17b8:
mstate waitrq rprof
Finally, use the proc2u macro to expand the proc structure.
Locate the psargs field, which shows the command and its
arguments executing at the time of the panic. You have accomplished
the last key point (locating the process).
f0bb8cd0$<proc2u
0xf0bb8e88:
execid execsz tsize
32581 12e 0
0xf0bb8e94:
dsize start ticks cv
0 315c1813 1b513 0
0xf0bb8ea4:
exdata
0xf0bb8ea4:
vp tsize dsize bsize
0 0 0 0
0xf0bb8eb4:
lsize nshlibs mach mag toffset
0 0 0 10b 0
0xf0bb8ec4:
doffset loffset txtorg datorg
0 0 0 0
0xf0bb8ed4:
entloc
df7d43a8
0xf0bb8ed8: aux vector
7d8 dfffffe1 3 10034
4 20 5 5
9 11b54 7 df7d0000
8 0 6 1000
7d0 0 7d1 0
7d2 1 7d3 1
7d9 7 0 0
0 0 0 0
0 0 0 0
0xf0bb8f68: psargs
mkfs /devices/pseudo/ramd@0:0,raw 512 8 1 8192 1024 16 10 60 2048 t 0 -1 8 -1^@^@^@
0xf0bb8fb8: comm
mkfs^@^@^@^@^@^@^@^@^@^@^@^@^@
0xf0bb8fd8:
sigmask
0xf0bb9050: 0 0 0 0
# cd crash_directory
# ls
bounds unix.0 vmcore.0
# adb -k unix.0 vmcore.0
physmem 1e6e
msgbuf+14/s
symbol not found
$q
Note – If the message symbol not found is returned, exit adb and
use the strings command.
The message buffer has been edited, but in the workshop, you will see
the full message buffer. The BAD TRAP message contains the important
information about the crash: it gives the type of fault detected
(Data fault) and the name of the module that caused the system
to panic (ramd).
Notice also that the pc (0xfc479dbc) points to the instruction that was
executing at the time of the crash.
The rest of the crash dump analysis will use adb macros and
commands to navigate you through a crash dump. This would be
necessary if the message buffer did not help or if one was not
available.
BAD TRAP: type=9 rp=f05246f4 addr=30 mmu_fsr=326 rw=1
BAD TRAP: occurred in module "ramd" due to an illegal access to a
user address
mkfs: Data fault
kernel read fault at addr=0x30, pme=0x0
MMU sfsr=326: Invalid Address on supv data fetch at level 3
pid=465, pc=0xfc479dbc, sp=0xf0524740, psr=0x40000c2, context=0
g1-g7: ffffff98, 0, ffffff00, 0, f05249e0, 1, fc2dec00
Begin traceback... sp = f0524740
Called from f00df9b4, fp=f05247a8, args=1a40000 f0524808 fc38fc80 f0154664
0 fc479d90
Called from f0070258, fp=f05248b8, args=200 f0524920 2 0 4 fc2d5b04
Called from f0041aa0, fp=f0524938, args=f0160cf8 f0524eb4 0 f0524e90
fffffffc ffffffff
Called from 15cc0, fp=effffae8, args=4 32400 200 0 0 3fe00
End traceback...
panic: Data fault
# cd crash_directory
# adb -k unix.0 vmcore.0
physmem 1e6e
The $c macro (1) displays the stack. Note the value 9 in the initial trap
handler (2) as it is also displayed in the message buffer. Also note, the
cmn_err() routine is called. This fault was determined to be a
nonrecoverable error ending up in a panic.
$c
complete_panic(0xf026b428,0xfbfab98c,0xf0048ec8,0x6a,0xfbfab818,0xf0279800) + 108
do_panic(?) + 1c
vcmn_err(0xf0266600,0xfbfab98c,0xfbfab98c,0x7,0xffeec000,0x3)
cmn_err(0x3,0xf0266600,0x1,0x21,0x21,0xf025c000) + 1c
die(0x9,0xfbfabac4,0x30,0x326,0x1,0xf0266600) + bc
trap(0xf028a1d8,0xfbfabac4,0x0,0x326,0x1,0x0) + 4f8
fault(?) + 84
Syssize(via getminor)(0x0,0x3ffff,0x20,0x7fffffff,0xf5c4b4bc,0x31585486)
ram_write(0xdc0000,0xfbfabbd8,0xf5a8ed38,0xdc,0xf5970d48,0xf5c54d90) + 1c
write(0x5) + 190
The next step in the core dump analysis is to get the program counter
and, with the disassemble command (i), display the assembly
instruction that caused the system to panic. This will usually display
the name of the device driver routine as part of the label. With this
information, you can pinpoint precisely what device driver caused the
system to go down.
f5c98dbc/i
ram_write+0x2c: ld [%l1 + 0x30], %l2
You may also want to find out what program or command was
running when the system went down. This is additional information
that will point out the bad device driver as well.
Using adb, you do this in two steps: first, display the thread that was
running when the system went down; the thread structure has a
pointer to the process that holds the name of the command running.
Second, display the user structure of this process that has the
command name.
panic_thread/X
panic_thread:
panic_thread: f5c66480
f5c66480$<thread
0xf5c66480:
link stk
0 fbfabc08
0xf5c6648c:
bound affcnt bind_cpu
f026b494 0 -1
0xf5c66494:
flag procflag schedflag state
0 0 11 4
0xf5c664a0: pri epri pc sp
14 0 f0048ec8 fbfab818
0xf5c664ac: wchan0 wchan cid clfuncs
0 0 2 f59a0378
0xf5c664c0:
cldata ctx lofault onfault
f5cb6460 0 0 0
0xf5c664d0:
nofault swap lock cpu
0 fbfaa000 ff f026b494
0xf5c664e0:
intr delay_cv tid alarmid
0 0 1 0
realitimer
0xf5c664f4: interval.sec interval.usec value.sec value.usec
0 0 0 0
0xf5c66504:
itimerid sigqueue sig
0 0 0 0
0xf5c66514:
hold forw back
0 0 f5c66480 f5c66480
0xf5c66524:
lwp procp next prev
f5c11828 f5c0fcc8 f5c665a0 f5c66d80
0xf5c6649a:
preempt trace whystop whatstop
1 0 0 0
0xf5c66564:
kpri_req sysnum astflag pollstate cred
0 4 1 0 f5a8ed38
0xf5c6654c:
lbolt pctcpu trapret pre_sys post_sys sig_check
405b8 fd 0 0 0 0
0xf5c66554:
lockp oldspl disp queue disp time
f026b4ec be1 f026b4c8 263603
0xf5c66578:
mstate waitrq rprof
9 0 0 0
0xf5c66580:
prioinv ts sobj_ops
0 0 0
From the previous thread structure, you can get the address of the
process’s proc structure from the field that is labeled procp. When
you use this in combination with the macro proc2u, you can display
the user structure of the process that has the command or program
name. Take special note also of the arguments that were passed to the
command.
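Picking procp out of a long thread display by eye is error-prone. The following Python sketch pairs each row of field names with the row of values beneath it and builds the proc2u command from the result. The function name parse_macro_rows is hypothetical, and the sketch assumes single-word field names, so rows with names such as "disp queue" would need extra handling.

```python
def parse_macro_rows(text):
    """Pair each row of field names with the row of values beneath it."""
    rows = []
    for line in text.strip().splitlines():
        # Drop "0x...:" address labels, whether alone or leading a row.
        tokens = [t for t in line.split() if not t.endswith(":")]
        if tokens:
            rows.append(tokens)
    fields = {}
    for names, values in zip(rows[::2], rows[1::2]):
        fields.update(zip(names, values))
    return fields


# Two groups copied from the thread display for f5c66480.
thread_output = """
0xf5c664a0: pri epri pc sp
14 0 f0048ec8 fbfab818
0xf5c66524:
lwp procp next prev
f5c11828 f5c0fcc8 f5c665a0 f5c66d80
"""
fields = parse_macro_rows(thread_output)
next_command = fields["procp"] + "$<proc2u"
```

The string in next_command is what you would type at adb to expand the user structure of the process.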
f5c0fcc8$<proc2u
0xf5c0fe80:
execid execsz tsize
32581 12e 0
0xf5c0fe8c:
dsize start ticks cv
0 31585486 405a4 0
0xf5c0fe9c:
exdata
0xf5c0fe9c:
vp tsize dsize bsize
0 0 0 0
0xf5c0feac:
lsize nshlibs mach mag toffset
0 0 0 10b 0
0xf5c0febc:
doffset loffset txtorg datorg
0 0 0 0
0xf5c0fecc:
entloc
ef7d43a8
0xf5c0fed0: aux vector
7d8 efffffe6 3 10034
4 20 5 5
9 11b54 7 ef7d0000
8 0 6 1000
7d0 0 7d1 0
7d2 1 7d3 1
7d9 3 0 0
0 0 0 0
0 0 0 0
0xf5c0ff60: psargs
mkfs /devices/pseudo/ramd@0:0,raw 512 8 1 8192 1024 16 10 60 2048 t 0 -1 8 -1^@^@^@
0xf5c0ffb0: comm
mkfs^@^@^@^@^@^@^@^@^@^@^@^@^@
0xf5c0ffd0:
cdir rdir ttyvp cmask
f5ca82e8 0 0 12
0xf5c0ffe0:
mem systrap ttyp ttyd
4f5 0 0 0
0xf5c0fff0: entrymask
0 0 0 0
0 0 0
0xf5c1000c: exitmask
0 0 0 0
0 0 0
0xf5c10028:
signodefer sigonstack
0 0 0 0
0xf5c10038:
sigresethand sigrestart
0 0 0 0
sigmask
0xf5c10048: 0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
signal
0xf5c101a8: 0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 1 1 0
0 1 1 0
0 0 0 0
0 0 0 0
0 0 0 0
ru
0xf5c10258:
nshmseg acflag
0 0
0xf5c1025c: rlimit
7fffffff 7fffffff 7fffffff 7fffffff
7ffff000 7ffff000 800000 7ffff000
7fffffff 7fffffff 40 400
7fffffff 7fffffff
flock
0xf5c10294: owner
0
0xf5c10294: lock
0
0xf5c10294: waiters wlock type
0 0 0
0xf5c1029c:
nofiles
24
flist
f5c0c910
0xf5c1029c: ofile pofile refcnt
0xf5c0c910: f5c69758 0 0
0xf5c0c918: f5c69758 0 0
0xf5c0c920: f5c69758 0 0
0xf5c0c928: f5c69218 0 0
0xf5c0c930: f5c695d8 0 0
0xf5c0c938: f5c69188 0 0
0xf5c0c940: f5c697e8 0 0
0xf5c0c948: 0 0 0
0xf5c0c950: 0 1 0
0xf5c0c958: f5c696f8 0 0
0xf5c0c960: f5c69698 0 0
0xf5c0c968: 0 0 0
0xf5c0c970: 0 0 0
0xf5c0c978: 0 0 0
0xf5c0c980: 0 0 0
0xf5c0c988: 0 0 0
0xf5c0c990: 0 0 0
0xf5c0c998: 0 0 0
0xf5c0c9a0: 0 0 0
0xf5c0c9a8: 0 1 0
0xf5c0c9b0: 0 0 0
0xf5c0c9b8: 0 0 0
0xf5c0c9c0: 0 0 0
0xf5c0c9c8: 0 0 0
user (alias: u) Prints the user structure for the designated process.
stack (alias: s) Dumps the stack. The -u option prints the user stack.
The -k option prints the kernel stack. If no arguments
are entered, the kernel stack for the current thread is
printed. Otherwise, the kernel stack for the currently
running thread is printed.
For more information about crash commands, refer to the man pages.
# cd crash_directory
# crash vmcore.0 unix.0
dumpfile = vmcore.0, namelist = unix.0, outfile = stdout
> stat
system name: SunOS
release: 5.5
node name: mustang
version: Generic
machine name: sun4d
time of crash: Fri Mar 29 09:04:19 1996
age of system: 18 min.
panicstr: Data fault
panic registers:
pc: e004c13c sp: e0922808
> u
PER PROCESS USER AREA FOR PROCESS 34
PROCESS MISC:
command: mkfs, psargs: mkfs /devices/pseudo/ramd@0:0,raw 512 8 1 8192 1024
16 10 60 2048 t 0 -1 8 -1
start: Fri Mar 29 09:04:19 1996
mem: 450, type: exec
vnode of current directory: f0bd8688
OPEN FILES, POFILE FLAGS, AND THREAD REFCNT:
[0]: F 0xf06d3db8, 0, 0 [1]: F 0xf06d3db8, 0, 0
[2]: F 0xf06d3db8, 0, 0 [3]: F 0xf06d3938, 0, 0
[4]: F 0xf06d3ae8, 0, 0 [5]: F 0xf06d32a8, 0, 0
[6]: F 0xf06d38d8, 0, 0 [9]: F 0xf06d3878, 0, 0
[10]: F 0xf06d3848, 0, 0
cmask: 0022
RESOURCE LIMITS:
cpu time: unlimited/unlimited
file size: unlimited/unlimited
swap size: 2147479552/2147479552
stack size: 8388608/2147479552
coredump size: unlimited/unlimited
file descriptors: 64/1024
address space: unlimited/unlimited
SIGNAL DISPOSITION:
1: default 2: default 3: default 4: default
5: default 6: default 7: default 8: default
9: default 10: default 11: default 12: default
13: default 14: default 15: default 16: default
17: default 18: default 19: default 20: default
21: default 22: default 23: default 24: default
25: default 26: ignore 27: ignore 28: default
A RAM disk device driver has just been installed in your system by
your resident device driver writer, who has asked you to test the
driver.
# cd /devices/pseudo
# ls
If the RAM disk has been installed correctly, two entries are in this
directory: ramd@0:0, and ramd@0:0,raw.
5. Save the core dump and use adb to analyze the problem following
the classroom exercise template.
You can prevent this from happening if you back up the root partition.
Then when /etc files such as name_to_major, path_to_inst,
driver_classes, and driver_aliases become corrupted, you can
boot from a backup root partition that has these files intact.
1. Make sure that your system has been installed with a backup root
partition that has exactly the same size as the root partition. If
your root partition is /dev/dsk/c0t3d0s0 with 20983 Kbytes,
then your backup partition could be /dev/dsk/c0t1d0s0 with
20983 Kbytes.
3. # dd if=/dev/dsk/c0t3d0s0 of=/dev/dsk/c0t1d0s0
6. # cd /backup_root
# vi /etc/vfstab
8. Halt your system and then try to boot from the backup_root file
system.
10. If your system becomes corrupted, boot from the backup partition,
and then copy intact versions of the corrupted files from the backup to
the original root partition.
# rem_drv ramd
# cd /usr/kernel/drv
# cp ramd.bad_attach ramd
3. Attach and link the new driver to the kernel. Use the sync
command several times to minimize the file system damage
that a panic would cause.
4. Save the core dump and use adb to analyze the problem using the
classroom exercise as a template.
Note – You may have to boot with the -a option and not put
/usr/kernel in the module path. This bug may not allow you to save
a core dump because the panic occurs in an auto-configuration routine
that gets called during boot time. When the system panics, the system
will try to reboot; and when it reboots, it will encounter the bad
attach routine and the system will go down again. This is when the
-a option to boot becomes very useful.
After you describe what is wrong with the RAM disk driver, your device
driver writer reports having written another ramd driver and asks
you to test it. Use adb commands to modify a live kernel.
# cd /usr/kernel/drv
# test1
3. Invoke adb.
backseat_write,10/X
backseat_write,10/i
8. Use the sync command several times, then invoke test1 again.
9. Analyze the core dump so that you can tell the device driver
writer what was wrong.
You will use the ps (report process status) command and the kadb
(kernel debugger) utility. This procedure is time-consuming but
interesting. You will select one of the active processes in your system
like init, a Command Tool, more, or vi. You are going to trace
through the various structures that the operating system allocates to
processes starting with the output of the ps -le command. Then you
will use kadb to go through the structures.
Use the man pages and .h files to gain insight into the SunOS 5.x
(Solaris 2.x) operating system and to increase your fault analysis
skills with advanced concepts.
kadb Description
kadb is an interactive debugger with a user interface similar to that of
adb(1), the assembly language debugger. kadb must be loaded prior
to the standalone program it is to debug. It runs in the same address
space as the standalone program, thus sharing many resources with
that program but not able to use the facilities available to the system
(such as the mouse, and access to file systems) because the system is
suspended when kadb is running. Because the kernel is not running
when kadb is active, any system structure that you examine
through kadb shows the state it had when the kernel was suspended. The
debugger is cognizant of and able to control multiple processors if
they are present in a system.
Unlike adb, kadb runs in the same supervisor virtual address space as
the program being debugged (although it maintains a separate
context). The debugger runs as a coprocess that cannot be killed (`:k')
or rerun (`:r'). There is no signal control (`:i', `:t', or `$i'),
although the keyboard facilities (Control-c, Control-s, and Control-q)
are simulated.
In the case of the UNIX kernel, the keyboard abort sequence (Stop-a
[L1-a] for console and BREAK for serial line) suspends kernel
operations and breaks into the debugger. The system will also fall into
kadb when it panics, allowing you to do an immediate analysis as to
why the system went down. You would want to use kadb when it is
not possible to save a coredump or if your dump device (swap device)
is too small to save physical memory. kadb gives the prompt kadb[#]
where # is the CPU it is currently executing on.
Note – Running under kadb has proven to be very valuable when very
bad crashes cause the machine to be so ill that it cannot generate a
dump. The analysis is the same as if running adb on a coredump.
# (Stop-a)
Note – To display a list of all kadb macros, type $M at the kadb prompt.
Figure: Simple Process. The process structure holds the as pointer to
the address space structure and the tlist pointer to the thread
structure; the thread structure holds the lwp pointer to the
lightweight process.
1. Boot the system with kadb. Use the ps command to obtain the
starting address of your process.
2. Invoke kadb.
3. Use the process address with the proc macro, for example,
fc363000$<proc. To control the flow of information, use the
Control-q and Control-s key sequences.
Process structure fields to note, beginning at the starting address:
as, ppid, pidp, cred, and tlist.
[as]$<as
[pidp]$<pid
[tlist]$<thread
[cred]$<cred
The seglast field contains the address of the segment that was last
used. In most cases, when the kernel needs to search a segment, it
starts with the last searched segment.
The as and seg structures are defined in header files as.h and seg.h,
which are in the directory /usr/include/vm. The thread and cred
structures are defined in the header files thread.h and cred.h in the
directory /usr/include/sys.
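Figure: Segment Mapping. The as field of the proc structure points to
the as structure, which leads to one seg structure for each region of
the process image (stack, data, and text); each seg structure records
the BASE and size of its region.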
1. This is an additional exercise using kadb. Make sure that you have
booted the system using kadb. Log in as root, start OpenWindows,
and go to the backseat directory. Copy the backseat_hang driver
into the /usr/kernel/drv directory, but rename it as backseat. If
you have installed the working version of backseat before, do a
rem_drv of this working backseat driver before installing this
defective one into your system.
2. # cp backseat_hang /usr/kernel/drv/backseat
3. # cp backseat.conf /usr/kernel/drv
4. # rem_drv backseat
kadb[0]: $<threadlist
11. You should see a backseat driver routine calling physio() and
physio() calling biowait(); the thread is blocked in biowait(),
which is the reason why test1 is hung. Look at the man pages for
physio() and biowait() to determine what could be wrong
with the device driver. Then look at the source file for
backseat_hang, which should be backseat_hang.c, to find out
what the device driver forgot to do.
The lesson that can be learned here is that a device driver can put
a thread to sleep in such a way that the thread cannot be
awakened by a signal.
12. After this exercise, you can exit kadb and reboot your system
without kadb. You may want to sync your system first. Then press
Stop-a, and issue the $q command and boot from the ok prompt.
Bug Install
1. Ensure that NVRAM watchdog-reboot? is false.
3. Log in as root.
Note – If a watchdog error does not occur, ask the instructor for
assistance.
● .registers
● .locals
● .psr
● ctrace
8. Boot the system to the Solaris operating environment, and save the
output of the following Solaris commands and the contents of the listed
files:
● showrev -p
● prtconf -v
● pkginfo
● /usr/ccs/bin/nm /dev/ksyms
● /etc/system
● /var/adm/messages*
Note – You should set up a tip line to the machine that is expected to
get a watchdog reset, as this is the easiest way to save the OBP
command outputs in a file.
Note – Search the SunSolve software for watchdog reset with error
messages similar to yours.
Instead, you can find out where the failing instruction is with respect
to the entire routine so that the assembly language can be matched to
the C code. To do this, disassemble the routine up to the problem
instruction, which occurs 0x2c bytes into the routine. Since each
instruction is 4 bytes, 0x2c/4 = 0xb instructions precede it, so 0xc
instructions must be displayed to include the failing one:
ff4eadbc/i (from Determining What Instruction Failed)
ram_write+0x2c: ld [%l1 + 0x30], %l2
ram_write,c/i
ram_write:
ram_write: sethi %hi(0xfffffc00), %g1
add %g1, 0x398, %g1 ! ffffff98
save %sp, %g1, %sp
st %i0, [%fp + 0x44]
st %i1, [%fp + 0x48]
st %i2, [%fp + 0x4c]
ld [%fp + 0x44], %o0
call getminor
nop
st %o0, [%fp - 0x4]
ld [%fp - 0x8], %l1
ld [%l1 + 0x30], %l2
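The count used in the ram_write,c/i command follows from the offset of the failing instruction. The following Python sketch works through that arithmetic; the function name disassembly_count is hypothetical.

```python
INSTRUCTION_BYTES = 4  # each instruction occupies one 32-bit word


def disassembly_count(offset):
    """Instructions to display from routine start through the one at offset."""
    if offset % INSTRUCTION_BYTES:
        raise ValueError("offset is not on an instruction boundary")
    preceding = offset // INSTRUCTION_BYTES
    return preceding + 1  # include the failing instruction itself


count = disassembly_count(0x2c)
command = "ram_write,%x/i" % count
```

The offset 0x2c gives 0xb preceding instructions plus the failing one, so the count passed to adb is c, and the last line of the listing is the ld instruction at ram_write+0x2c.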
After examining the ramd.c source file, these lines stand out in
ram_write:
static int
ram_write(dev_t dev, struct uio *uiop, cred_t *credp)
{
int instance;
struct ram_state *rs;
/* Comment this out in order to pass a pointer that has not been
initialized, so that you can cause a data fault and a core dump.
rs = ddi_get_soft_state(statep, instance);
if (rs == NULL) {
cmn_err(CE_NOTE,
"%s: write: could not get state for instance %d.",
RAMDISK_NAME, instance);
return ENXIO;
}
*/
if (uiop->uio_offset >= rs->size)
return EINVAL;
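The fault address ties the source back to the disassembly. The failing instruction is ld [%l1 + 0x30], %l2, and the message buffer reports a kernel read fault at addr=0x30, so the value in %l1 can be worked out by subtraction. The following Python sketch shows the arithmetic. The function name is hypothetical, and reading 0x30 as the offset of the size member within struct ram_state is an inference from the source lines above, not something the listing states.

```python
def base_register_value(fault_addr, displacement):
    """Recover the base register from a faulting [base + displacement] load."""
    return (fault_addr - displacement) & 0xFFFFFFFF


fault_addr = 0x30      # "kernel read fault at addr=0x30"
displacement = 0x30    # from ld [%l1 + 0x30], %l2
rs_value = base_register_value(fault_addr, displacement)
rs_is_null = rs_value == 0
```

A base of zero is consistent with rs never having been assigned a valid pointer before rs->size was read.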
Introduction
The workshop will enable you to trace and identify the processes and
files needed to open windows. You will use the ps (report process
status) command and man pages, within the reference material, to
accomplish this task.
2. Log in as root.
7. Log out.
2. Log in as root.
# /usr/openwin/bin/openwin
7. Log out.
# ps
The next ps command fills the screen with information concerning all
the processes that have been started including your login process.
Some of the processes vary from system to system.
The table (next two pages) provides some of the processes that will be
on all systems with space for other system-dependent processes.
Check the processes you have that are the same as those listed. Use the
ps -ef command to obtain base-level processes.
# ps -ef
Base-Level Processes (1 of 2)
sched*
init*
pageout
fsflush
sac
rpcbind
sctserve
sendmai
keyserv
inetd
ypbind
in.route
kerbd
automoun
statd
lockd
lpsched
syslogd
cron
vold
Base-Level Processes (2 of 2)
# /usr/openwin/bin/openwin
2. Type the ps command and determine the PID for the current
window process. Record the information in the table below:
10. Who is the parent for olwmslave (Open Window Manager Slave)?
___________
13. Who is the parent for ttsession (ToolTalk Message Server)?
___________
15. Open another Shell Tool at this time. Trace the family tree.
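Tracing a family tree means following each process's parent PID (the PPID column of ps -ef) until you reach a process whose parent is not listed. The C fragment below is a minimal sketch of that walk over values copied by hand from ps -ef. The structure, the function names, and any PIDs you feed it are hypothetical and are not part of the course software.

```c
#include <stddef.h>

/* One line of ps -ef output, reduced to the fields needed for tracing. */
struct ps_entry {
        int         pid;
        int         ppid;
        const char *comm;
};

/* Return the entry with the given PID, or NULL if it is not in the table. */
const struct ps_entry *
find_pid(const struct ps_entry *table, size_t count, int pid)
{
        size_t i;

        for (i = 0; i < count; i++) {
                if (table[i].pid == pid)
                        return &table[i];
        }
        return NULL;
}

/*
 * Count the generations between a process and its oldest listed ancestor.
 * Returns -1 if the starting PID is not in the table.
 */
int
ancestor_depth(const struct ps_entry *table, size_t count, int pid)
{
        const struct ps_entry *e = find_pid(table, count, pid);
        int depth = 0;

        if (e == NULL)
                return -1;
        /* Stop at a process that is its own parent or whose parent is not listed. */
        while (e->ppid != e->pid && (size_t)depth < count) {
                const struct ps_entry *parent = find_pid(table, count, e->ppid);

                if (parent == NULL)
                        break;
                e = parent;
                depth++;
        }
        return depth;
}
```

For example, with a table holding init (PID 1), a shell started by init, openwin started by that shell, and xinit started by openwin, ancestor_depth for xinit returns 3.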
OpenWindows Files
# view /tmp/trusstrace
Is truss another tool you can use to trace commands? Remember that
the PIDs displayed will differ from those in your original chart,
because PIDs are not reused. The grep command can be useful at this time.
Skills Checklist
Skill                         Student Initials    Instructor Initials
Invoke adb to start a kernel core dump analysis.
Invoke crash to start a kernel core dump analysis.
Use the adb string macro to display the message buffer.
Use selected adb macros to determine the process at the time of
the fault.
Use selected crash commands to determine the process at the
time of the fault.
Use the correct commands to properly exit crash or adb.
Team Members
________________________________________________________
________________________________________________________
_________________________________________________________
A
Requirements
● Sun-4 systems
Resources
● AnswerBook
● SunSolve
● Diagnostics
● SunVTS
● Format
System Configurations
● Standalone
● Network
● Client-server
● NIS or NIS+
B
Error Symptoms/Conditions/Messages
Problem Statement
Research Resources
Repair Verification
Instructor Initials _________________
Use the student1 account. The user name is student1, and the
password is student1.
Error Symptoms/Conditions/Messages
Problem Statement
Research Resources
Repair Verification
After disconnecting the keyboard, and using the ASCII terminal, the
system “hangs” during boot.
Error Symptoms/Conditions/Messages
Problem Statement
Research Resources
Repair Verification
Instructor Initials _________________
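#!/bin/csh -f
# Create a CPU-bound script and run it in the real-time class.
clear
rm -f /tmp/guilty_party
# Write the looping script with a here-document.
cat > /tmp/guilty_party << Done
#!/bin/csh -f
while (1)
end
Done
chmod 777 /tmp/guilty_party
# Start the loop in the real-time (RT) scheduling class, in the background.
/usr/bin/priocntl -e -c RT /tmp/guilty_party &
# Take a processor offline.
/usr/sbin/psradm -f a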
Diagnostic Steps
Use the following procedure to determine what is causing your system
to “hang.”
Note – You may need to press Stop-a several times before the
keyboard interrupt is handled.
cd directory_with_core_dumps
9. Type proc.
11. For each process entry, examine the utime and stime fields. The
sum of these two fields is the total CPU time used by the
process.
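The comparison in step 11 can be sketched in C as follows. This is an illustration only: the structure and function names are hypothetical, the values would be copied by hand from the proc listing, and the times are in whatever units that listing reports.

```c
#include <stddef.h>

/* One row copied by hand from the proc listing (fields named as in step 11). */
struct proc_times {
        int           pid;
        unsigned long utime;   /* user-mode CPU time */
        unsigned long stime;   /* system-mode CPU time */
};

/* Total CPU time for one process: utime plus stime. */
unsigned long
total_cpu_time(const struct proc_times *p)
{
        return p->utime + p->stime;
}

/*
 * Return the index of the entry with the largest combined time,
 * or -1 if the table is empty.
 */
int
busiest_process(const struct proc_times *table, size_t count)
{
        int best = -1;
        size_t i;

        for (i = 0; i < count; i++) {
                if (best < 0 ||
                    total_cpu_time(&table[i]) > total_cpu_time(&table[best]))
                        best = (int)i;
        }
        return best;
}
```

The entry returned by busiest_process is the first candidate to investigate; a process whose combined time is far larger than all the others is the likely cause of the hang.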
Expected Repair
A workaround is to not run the problem program (guilty_party) until
CPU resources are available, or to run guilty_party as a timesharing
process (not real time). You also need to determine if it is normal
behavior for this process to use so much CPU time.
Repair Verification
Rerun the start command to verify that this process is the culprit.
# swap -s
3. Type the swap -l command and record the values in Table 2 on page
49 (in the “Before OpenWindows” column).
# swap -l
# mkdir /test
# /usr/openwin/bin/openwin
# /test/SUNWdiag/bin/sundiag
10. Deselect all tests and then select the kmem test.
11. Record the value of swap space indicated in the kmem option box.
13. Total physical memory must also take into account the pages
required by the kernel. The total memory minus the memory used by
the kernel equals available physical memory. Check the dmesg
output for the size of kernel memory.
14. The total disk swap space minus the available physical equals
memory swap space.
15. Run two passes of kmem tests and record the time required to
complete the tests. This will be the base time.
16. While the test is running, you can monitor the behavior of swap
space, using the swap commands. Record the value of the first
swap commands in Tables 1 and 2 on page 49 (in the “During
SunDiag” column).
17. If the test passes, add fpu and one device for a fstest.
18. Run two passes of new tests and record the time required to
complete them. This will be the loaded base time.
If the test is successful and the virtual and physical links are
functional, the system administrator can add the partition to the
/etc/vfstab using a shortcut method:
# cp /etc/vfstab /etc/vfstab.orig
# mount -p > /etc/vfstab
# umount /test
# init 6
# /usr/openwin/bin/openwin
# /test/SUNWdiag/bin/sundiag
6. Run two passes of kmem tests and record the time to complete.
This will be the new base time.
7. Compare the “new base time” with the original “base time.” Is it
faster, slower, “hung,” stopped, or the same? Why?
___________________________________________________________
___________________________________________________________
8. If the test passes, add fpu and one device for a fstest.
9. Run two passes of new tests and record the time to complete. This
will be the new loaded base time.
10. Compare the “new loaded base time” with the original “loaded
base time.” Is it faster, slower, stopped, “hung,” or the same?
Why?
___________________________________________________________
___________________________________________________________
___________________________________________________________
___________________________________________________________
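                    Before          After           During
Parameter           OpenWindows     OpenWindows     SunDiag
Bytes allocated     ___________     ___________     ___________
Bytes reserved      ___________     ___________     ___________
Total bytes         ___________     ___________     ___________
Bytes available     ___________     ___________     ___________

                    Before          After           During
Parameters          OpenWindows     OpenWindows     SunDiag
Current blocks      ___________     ___________     ___________
Free blocks         ___________     ___________     ___________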
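When comparing the two tables, keep in mind that swap -s reports kilobytes while swap -l reports 512-byte blocks. The C fragment below is a small sketch of the conversions; the function names are hypothetical, and it assumes that the Total bytes row corresponds to the "used" figure in the swap -s output (allocated plus reserved).

```c
/* swap -l reports sizes in 512-byte blocks; swap -s reports kilobytes. */
#define SWAP_BLOCK_BYTES 512UL

/* Convert a block count from swap -l into kilobytes. */
unsigned long
swap_blocks_to_kbytes(unsigned long blocks)
{
        return blocks * SWAP_BLOCK_BYTES / 1024UL;
}

/* In swap -s output, the used figure is the sum of allocated and reserved. */
unsigned long
swap_total_used_kbytes(unsigned long allocated_kb, unsigned long reserved_kb)
{
        return allocated_kb + reserved_kb;
}
```

For example, 2048 free blocks from swap -l correspond to 1024 kilobytes, which can then be compared directly with the values recorded from swap -s.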
In the steps below, using adb on the live kernel, you will lower the
maximum number of processes allowed per user. Then you will open
various windows (processes) until an error occurs with the message
Resource temporarily unavailable.
v$<v
maxup= _______
nproc/D
nproc = _______
In the next step, you will reduce the value of maxup and of maxupttl,
another variable that controls the maximum number of processes per
user system wide. The reduced value should be about 5 more than the
current nproc value.
When you deposit the value into the proc field, the value is
entered in base 16 (hexadecimal) notation. The command v+1c/W xx,
where xx is your input value, enables you to change a kernel
parameter using adb.
Example:
Note – Do not change values in the kernel using this method. This is
for an academic learning experience only. But be aware that it can be
done.
9. Calculate the value of nproc for your system, using the Calculator
utility, if necessary. Then replace v+1c with the calculated value.
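The decimal-to-hexadecimal conversion can also be done in a few lines of C instead of the Calculator utility. The fragment below is a minimal sketch: the function name format_adb_write is hypothetical, and the value 55 in the usage note is only an example, not a recommended setting.

```c
#include <stdio.h>

/*
 * Format an adb write command for a decimal value. As noted above, the
 * value after /W is entered in base 16, so it is printed in hexadecimal.
 * Returns the number of characters written, as snprintf does.
 */
int
format_adb_write(char *buf, size_t buflen, const char *address,
    unsigned long decimal_value)
{
        return snprintf(buf, buflen, "%s/W %lx", address, decimal_value);
}
```

For example, if nproc is 50 and the new limit is to be 55, format_adb_write(buf, sizeof buf, "v+1c", 55) produces the string v+1c/W 37, because 55 decimal is 37 in base 16.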
v$<v
nproc/D
nproc = _______
Error Y/N
nproc = _______
Error Y/N
nproc = _______
Error Y/N
nproc = _______
Error Y/N
nproc = _______
Error Y/N
nproc = _______
Error Y/N
Note – To restore maxup to its original value, convert the original
decimal value to base 16 and enter v+1c/W xx, where xx is the
base 16 value. Use the Calculator utility, if necessary. Do the same
for maxupttl, which is at v+c: enter v+c/W0tdd, where dd is the
original value of maxup in decimal.
12. Return maxup and maxupttl to their original values, and exit
adb.
Fault Worksheet #
Error Symptoms/Conditions/Messages
Problem Statement
Research Resources
Repair Verification
Instructor Initials _________________
C
1. Log in as root.
4. Dump and restore the important parts of the root file system.
# cd /opt (or the file system you want to use for your alt block)
# ufsdump 0f /opt/rootdump /
Dump messages............
Dump messages.............
# ufsrestore if /opt/rootdump
ufsrestore > add dev
ufsrestore > add devices
ufsrestore > add kernel
ufsrestore > add sbin
ufsrestore > add etc
ufsrestore > add ufsboot
ufsrestore > extract
ufsrestore > quit
5. # halt
6. Record the original boot device address from the nvram and
devalias.
ok printenv
boot-device disk
ok devalias
disk /sbus/..........:a
Press the Return key (to accept the default) on all questions until
the last one.
When asked for the address of the root device, enter the original
device address from step 6 above: