

Windows Hang and Crash Dump Analysis

Mark Russinovich
Chief Software Architect
Winternals Software
Copyright © 2006 Mark Russinovich
About The Speaker

Co-author of Windows Internals and Inside Windows 2000 (Microsoft Press)
Senior Contributing Editor
Windows IT Pro Magazine
Author of tools on
Co-founder and chief software
architect of Winternals Software
Microsoft Most Valuable Professional (MVP) 2005, 2006
Teach public and private live classes on Windows
Internals and Advanced Troubleshooting with David
Solomon (
Crash dumps and tools
Analysis basics
Analyzing an “easy” crash
Un-analyzable crashes
Crash transformation
Buffer overrun
Code overwrite
Microsoft Windows Memory Diagnostic
Manual analysis
Stack trashes
Hung Systems
When there is no crash dump
Many systems administrators ignore Windows crash
dump options
“I didn’t know I could analyze crashes”
“Crash analysis is too hard”
“A crash dump won’t tell me anything anyway”
Basic crash dump analysis is actually
pretty straightforward
Even if only 1 out of 5 or 10 dumps tells you what’s wrong, isn’t it worth
spending a few minutes?
More advanced crash dump analysis is much harder
Not well documented
Requires advanced internals, compiler and CPU knowledge
Requires lots of experience
Often difficult to pinpoint cause
More often than not, victim is not the culprit
For example, a driver corrupts an operating system structure; Windows crashes later
Why Does Windows Crash?
KeBugCheckEx is called when something’s wrong in kernel mode:
Unhandled exception (for example, executing invalid instruction)
OS or driver detects severe inconsistency
Referencing paged out memory at interrupt level (famous
A reschedule is attempted at dispatch level IRQL or higher
Hardware error
Why Does Windows Crash?

Microsoft’s analysis of crash root causes indicates:

~70% caused by third-party driver code
~15% caused by unknown (memory is too corrupted to tell)
~10% caused by hardware issues
~5% caused by Microsoft code
There are lots of third-party drivers!
From online crash analysis database:
55,000 unique drivers – 24 new/day (28,000 in 2004)
220,000 total drivers – 98 revised/day (130,000 in 2004)
Many Devices
Over 1,263,300 distinct Plug and Play (PnP) IDs (680,000 in 2004)
1,600 PnP IDs added every day
What Happens at the Crash

When a condition is detected that requires a crash, KeBugCheckEx is called
Takes five arguments:
Stop code (also called bugcheck code)
Four stop-code defined parameters
Turns off interrupts
Tells other CPUs to stop
Paints the blue screen
Notifies registered drivers of the crash
If a dump is configured (and it is safe to do so), writes
dump to disk
Bugcheck Codes

Bugcheck codes are shared by many components and drivers
There are about 150 defined stop codes
Two common ones are:
(DRIVER_) IRQL_NOT_LESS_OR_EQUAL (0x0A) - Usually an
invalid memory access
Generated by executing garbage instructions
It’s usually caused when a stack is trashed
Most are documented in the Debugging Tools help file
Also search Microsoft Knowledge Base (
Often, bugcheck code and parameters are not enough to
solve the crash
Need to examine crash dump
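As a rough illustration of the five KeBugCheckEx arguments (a stop code plus four parameters), the sketch below parses a blue-screen style STOP line and looks up a name for the code. The exact line layout, the small name table, and the helper names (`BugCheck`, `parse_stop_line`) are assumptions made for this example; the Debugging Tools help file remains the reference for stop codes.

```python
import re
from dataclasses import dataclass

# A few well-known stop codes; illustrative only, not the full list.
STOP_CODE_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x7F: "UNEXPECTED_KERNEL_MODE_TRAP",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}

@dataclass
class BugCheck:
    code: int
    params: tuple  # the four stop-code-defined parameters

    @property
    def name(self):
        return STOP_CODE_NAMES.get(self.code, "UNKNOWN_STOP_CODE")

# Assumed layout: "STOP: <code> (<p1>,<p2>,<p3>,<p4>)", all values in hex.
_STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")

def parse_stop_line(line):
    """Parse a line such as '*** STOP: 0x0000000A (0x1,0x2,0x0,0x804E1234)'."""
    match = _STOP_RE.search(line)
    if match is None:
        raise ValueError("no STOP code found")
    code = int(match.group(1), 16)
    params = tuple(int(p.strip(), 16) for p in match.group(2).split(","))
    if len(params) != 4:
        raise ValueError("expected four parameters")
    return BugCheck(code, params)
```

The code and parameters alone rarely identify the culprit, which is why the next step is opening the dump itself.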
Crash Dump Options
Small Memory Dump (aka minidump or Triage Dump)
Default for Microsoft Windows 2000/Windows XP Professional/Home
Only 64 KB (128 KB on 64-bit systems, up to 512 KB on Vista)
Contains minimal crash information
Creates a unique file name in \Windows\Minidump after reboot
Kernel Memory Dump
Writes OS memory and not processes
Most crash debugging doesn’t involve looking at process memory anyway
Useful for large memory systems
Overwrites every time
Default on Windows Vista
Complete Memory Dump
Writes all of RAM
Overwrites every time

On Windows XP, Windows Server 2003, and Windows Vista, a minidump is always created, even if the system is set to full or kernel dump
Can extract a minidump from a kernel or full
dump using the debugger “.dump /m” command
To analyze, requires access to the images on the
system that crashed
Must at least have access to Ntoskrnl.exe
Microsoft Symbol Server now has images for Windows
XP and later
Set image path to same as symbol path (covered later)
Writing a Crash Dump

Crash dumps are written to the paging file

Too risky to try and create a new file
(no guarantee you will get a dump anyway)
How is even this protected?
When the system boots it checks
The boot volume paging file’s on-disk mapping is obtained
Relevant components are checksummed:
Boot disk miniport driver
Crash I/O functions
Page file map
On crash, if checksum doesn’t match, dump is not written
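The boot-time checksum idea can be sketched as follows. This is only a model: CRC32 stands in for whatever checksum Windows actually uses (the slides do not say), and the component names and byte strings are hypothetical.

```python
import zlib

def checksum_components(components):
    """Return one CRC32 per named component, as would be recorded at boot.

    `components` maps a component name to its bytes."""
    return {name: zlib.crc32(data) for name, data in components.items()}

def safe_to_write_dump(boot_checksums, components_at_crash):
    """A dump is written only if every component still matches its boot-time checksum."""
    current = checksum_components(components_at_crash)
    return all(current.get(name) == value for name, value in boot_checksums.items())
```

For example, if the bytes recorded for the crash I/O functions differ at crash time, `safe_to_write_dump` returns False and no dump would be attempted.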
Why Would You Not Get a Dump?

Crash occurred before paging file was open

For example a crash during driver initialization
The crash corrupted components involved in the
dump process
Spontaneous reboot
Paging file on boot volume is too small
Not enough free space for extracted dump
Hung system

We’ll cover how to troubleshoot these problems later

At The Reboot
[Diagram: dump extraction at reboot, showing Session Manager and Memory.dmp across user mode and kernel mode, with steps numbered 1 to 4]
1. Session Manager process (\Windows\system32\smss.exe) initializes the paging file by calling NtCreatePagingFile
2. NtCreatePagingFile determines if the paging file has a crash dump header
Protects the dump from use
Note: crash dump portion of paging file is in use during the copy, so virtual memory can run low while the copy is in progress
3. Winlogon calls NtQuerySystemInformation to tell if there’s a dump to extract
4. If there’s a dump, Winlogon executes SaveDump
Writes an event to the System event log
SaveDump writes contents to appropriate file
On Windows XP or later, checks to see if Windows Error Reporting should be invoked
Online Crash Analysis (OCA)

By default, after a reboot Windows XP/Windows Server 2003 prompts you to send information to
Can be configured with
Computer Properties-
Can be customized with
Group Policies
Do/do not show UI
Send dump to an internal error
reporting server
Windows Error Reporting
Savedump checks if kernel error reporting is enabled
Checks two values under
IncludeKernelFaults and DoReport
If crash reporting is enabled, Savedump:
Extracts a minidump from the dump file (if system set to full or
kernel dumps)
Writes the name of the minidump under
Adds a command to execute Dumprep.exe to
This will cause it to run at the first user log on
Dumprep then:
Generates an XML description of system version, drivers present,
loaded plug and play drivers and depending on the configuration
Displays the message box (if enabled) to send the dump
Submits the dump for automatic analysis
What Gets Sent
1. XML description of system version, drivers present, loaded plug and play drivers
2. Minidump file
[Slide shows a sample of the XML report with elements such as OSNAME (Microsoft Windows XP), OSVER, and per-driver DRIVER entries carrying VERSION, MANUFACTURER, and PRODUCTNAME, for example the Intel(r) Integrated Controller Hub Audio Driver]
What Does OCA Do?

Server farm uses !analyze, but looks up

crash fingerprint in Microsoft’s crash
resolution database
Sometimes OCA will point you at KB articles that
describe the problem
KB articles may tell you to use Windows Update to get newer
drivers, a hotfix, or install a Service Pack
Many times OCA will say “A driver caused a problem”
OCA can’t tell you when it suspects a driver that
hasn’t been conclusively identified as being
responsible by hand analysis
Analyzing a Crash Dump Yourself
There are two kernel-level debuggers that can open crash dumps
WinDbg - Windows program
Kd – command-line program
Both provide same kernel debugger analysis commands
Must first configure to point to symbols
Easiest to use Microsoft Symbol Server for symbol access
Windbg: click on File->Symbol File Path
Enter “srv*c:\symbols*”
If a minidump, must also configure image path to point to location of
images (File->Image File Path)
Use same string as for symbol server (Windows XP and beyond)
To open a crash dump:
WinDbg: File->Open Crash Dump
Kd: kd -z <memory dump file> -y <symbols directory> -i <image directory>
IRQL stands for Interrupt Request Level
Each CPU maintains IRQL independently
Software and hardware interrupts map to IRQLs
When a CPU raises its IRQL to a level all interrupts at that level
and below are masked for that CPU

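[Diagram: IRQL levels. Hardware interrupts occupy the higher levels; the software interrupt levels are DISPATCH_LEVEL and APC_LEVEL. Levels above the current IRQL are unmasked; levels at or below it are masked.]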
PASSIVE_LEVEL
No interrupts are masked
User mode code always executes at PASSIVE_LEVEL
Kernel-mode code executes at PASSIVE_LEVEL most of the time
DISPATCH_LEVEL
Highest software interrupt level
Scheduler is off
Page faults cannot be handled and are illegal
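The masking rule and the page-fault rule can be written down as a small model. The numeric values are the x86 ones for the three software-visible levels; the function names are made up for this sketch.

```python
from enum import IntEnum

class Irql(IntEnum):
    # x86 values for the lowest levels; hardware interrupt levels sit above these.
    PASSIVE_LEVEL = 0
    APC_LEVEL = 1
    DISPATCH_LEVEL = 2

def is_masked(interrupt_irql, current_irql):
    """An interrupt is masked when its level is at or below the CPU's current IRQL."""
    return interrupt_irql <= current_irql

def page_fault_allowed(current_irql):
    """Page faults can only be serviced below DISPATCH_LEVEL."""
    return current_irql < Irql.DISPATCH_LEVEL
```

At DISPATCH_LEVEL, `page_fault_allowed` is False, which is exactly the situation that produces the IRQL_NOT_LESS_OR_EQUAL family of crashes.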

Each thread has a user-mode and kernel-mode stack
The user-mode stack is usually 1 MB on x86
The kernel-mode stack is typically 12 KB (20 KB for
GUI threads) on x86 systems
Stacks allow for nested function invocation
Parameters can be passed on the stack
Stores return address
Serves as storage for local variables
Stack Frames
[Diagram: three nested stack frames, each listed in the order shown on the slide]
Function 1 stack frame: Parameter 1, Return Address, Frame Pointer, Local Variable 1, Local Variable 2
Function 2 stack frame: Parameter 3, Parameter 2, Parameter 1, Return Address, Frame Pointer, Local Variable 1, Local Variable 2
Function 3 stack frame: Parameter 2, Parameter 1, Return Address, Frame Pointer, Local Variable 1
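To make the frame layout concrete, here is a minimal sketch of how a debugger can walk a chain of frame pointers to recover return addresses. It assumes the classic x86 layout (saved frame pointer at the frame pointer address, return address 4 bytes above it) and models memory as a plain dictionary; the addresses in the usage note are invented.

```python
def walk_frames(memory, frame_pointer, max_frames=64):
    """Follow a chain of saved frame pointers and collect return addresses.

    `memory` maps an address to the 32-bit value stored there. At each frame,
    memory[fp] holds the caller's frame pointer and memory[fp + 4] holds the
    return address."""
    return_addresses = []
    while frame_pointer != 0 and len(return_addresses) < max_frames:
        if frame_pointer not in memory or frame_pointer + 4 not in memory:
            break  # chain leads outside known memory: stop rather than guess
        return_addresses.append(memory[frame_pointer + 4])
        frame_pointer = memory[frame_pointer]
    return return_addresses
```

With `memory = {0x1000: 0x1020, 0x1004: 0xAAAA, 0x1020: 0, 0x1024: 0xBBBB}`, calling `walk_frames(memory, 0x1000)` yields the two return addresses in call order, innermost first. When a function omits the frame pointer, this chain breaks, which is the difficulty the calling-conventions discussion describes.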
Calling Conventions

Stacks are easy to interpret if functions use standard calling conventions
Other calling conventions make the stack hard to
figure out
No frame pointer
Register arguments (fast calls)
A debugger requires symbol information to parse
non-standard stack frames
Makes accurate analysis of crashes involving third-
party drivers difficult
In order to demonstrate common crash scenarios, Mark wrote NotMyFault.exe
Download from
It loads MyFault.sys
MyFault.sys has an IOCTL interface that implements different bugs
[Diagram: NotMyFault.exe in user mode calls MyFault.sys in kernel mode through the IOCTL interface]
Generating an Easy Crash
Run NotMyFault and select “High IRQL fault (kernel mode)”
Allocates paged pool buffer
Frees the buffer
Touches the buffer and pages following the buffer
Paged buffers that are marked “not present” but are touched when
IRQL ≥ DISPATCH_LEVEL result in the
Memory Manager calls KeBugCheckEx from page fault handler
The IRQL is not less than or equal to the maximum IRQL at which the
operation is legal (which is < DISPATCH_LEVEL)
Analyzing an Easy Crash

Open crash dump with Windbg

!analyze easily identifies MyFault.sys by looking
at the KeBugCheckEx parameters
The Memory Manager looked at the stack and
determined the address that caused the page fault
!analyze often looks at the stack to determine the
cause of a crash
Automated Analysis

When you open a crash dump with Windbg or Kd

you get a basic crash analysis:
Stop code and parameters
A guess at offending driver
The analysis is the result of the automated
execution of the !analyze debugger command
!Analyze uses heuristics to walk up the stack and
determine what driver is the likely cause of the crash
“Followup” is taken from optional triage.ini file
Don’t trust blame of ntoskrnl, win32k, hal, ntfs or
other core Windows components
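The advice about not trusting blame placed on core components can be sketched as a simple stack filter. This is not the actual !analyze algorithm, only an illustration of the idea; the module list and the frame names in the usage note are assumptions.

```python
# Core components that are usually victims rather than culprits.
CORE_MODULES = {"nt", "ntoskrnl", "win32k", "hal", "ntfs"}

def likely_culprit(stack):
    """Return the first module on the stack (most recent frame first)
    that is not a core Windows component, or None if every frame is core.

    Frames are strings in the debugger's 'module!function' form."""
    for frame in stack:
        module = frame.split("!", 1)[0].lower()
        if module not in CORE_MODULES:
            return module
    return None
```

For a stack such as `["nt!KeBugCheckEx", "nt!KiTrap0E", "myfault!Handler"]` (hypothetical function names), the sketch returns `"myfault"`. When it returns None, the crash is a candidate for the transformation techniques that follow.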
Crash Transformation

Many crashes can’t be analyzed

The “victim” crashed the system, not the criminal
The analyzer may point at Ntoskrnl.exe or Win32K.sys
or other Windows components
Or, you may get many different crash dumps all
pointing at different causes
Your goal isn’t to analyze impossible crashes…
It’s to try to make an “unanalyzable” crash into one that can be analyzed
Using the Driver Verifier
The tool for crash transformation is the Driver Verifier (Verifier.exe, not in Start menu)
Introduced in Windows 2000
Helps developers test their drivers and systems administrators identify faulty drivers
Run Verifier.exe
Choose “Create Custom
Choose “Select Individual
Settings from a List”
Enable all options except Low
Resource Simulation
Selecting Drivers to Verify
Don’t verify all the drivers
Performance hit will make system unusable
Limits effectiveness of the Verifier
Crash Transformation Recipe

The Recipe:
1. First, try any “suspicious” drivers (recently updated, known to
be problematic, etc.)
2. If still un-analyzable crashes, try enabling verification on all
third-party drivers and/or all unsigned drivers
3. As a last resort enable verification on groups of 10-20 drivers
at a time
4. Run the Windows Memory Diagnostic
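The first three steps of the recipe amount to an ordering of verification rounds. A minimal sketch, assuming the driver lists are already known and using a group size of 15 as a middle value from the 10-20 range (function and label names are invented for this example):

```python
def verification_rounds(suspicious, third_party, all_drivers, group_size=15):
    """Yield (label, drivers) rounds in the order the recipe suggests."""
    if suspicious:
        yield "suspicious", list(suspicious)
    if third_party:
        yield "third-party/unsigned", list(third_party)
    # Last resort: work through every driver in fixed-size groups.
    drivers = list(all_drivers)
    for start in range(0, len(drivers), group_size):
        yield "group", drivers[start:start + group_size]
```

Each round would be followed by a reboot and a wait for the crash to recur before moving to the next one.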
The following crash examples demonstrate the Driver
Verifier making “un-analyzable” crashes into ones that
point at the problem
Buffer overflow
System code overwrite
Buffer Overruns
Result when a driver goes past the end (overrun) or the beginning (underrun) of a buffer
Usually detected when overwritten data is referenced
Another driver or the kernel makes the reference
There can be a long delay between corruption and detection
[Diagram: memory layout toward higher addresses: Driver Buffer, then Pool Structures, then Another Driver’s Buffer]
Causing a Buffer Overrun

Run NotMyFault and select “Buffer Overrun”

Allocates a nonpaged pool buffer
Writes a string past the end
Note that you might have to run several times
since a crash will occur only if:
The kernel references the corrupted pool structures
A driver references the corrupted buffer
The crash tells you what happened, but not why
A Buffer Overrun Bluescreen

In this example, where the crash was the result

of the kernel tripping on corrupt pool tracking
structures, the Bluescreen tells you what to do:
What is Special Pool?
Special pool is a kernel buffer area where buffers are sandwiched with invalid pages
Conditions for a driver allocating from special pool:
Driver Verifier is verifying driver
Special pool is enabled and available
Allocation is slightly less than one page (4 KB on x86)
Special pool is a limited resource
When it runs out verified drivers allocate from standard pool
Note: can be enabled without rebooting
[Diagram: toward higher addresses: Page n (invalid), Page n+1 (buffer), Page n+2 (invalid)]
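Why special pool turns a delayed crash into an immediate one can be shown with a toy model. The sketch places the buffer at the very end of a valid page so that the first byte past the buffer lands on the invalid page. It models overruns only, and the class and method names are hypothetical.

```python
PAGE_SIZE = 4096  # 4 KB on x86

class SpecialPoolPage:
    """One valid page followed by an invalid page. The buffer is placed
    at the end of the valid page so any overrun hits the invalid page."""

    def __init__(self, size):
        if not 0 < size <= PAGE_SIZE:
            raise ValueError("allocation must fit in one page")
        self.start = PAGE_SIZE - size  # buffer ends exactly at the page end
        self.page = bytearray(PAGE_SIZE)

    def write(self, offset, data):
        if offset < 0:
            raise ValueError("this sketch models overruns only")
        end = self.start + offset + len(data)
        if end > PAGE_SIZE:
            # In the kernel this would be a page fault on the invalid page,
            # raised at the moment of the bad access.
            raise MemoryError("access to invalid page following the buffer")
        self.page[self.start + offset:end] = data
```

Writing 100 bytes into a 100-byte allocation succeeds; writing 101 bytes raises at once, with the faulting code still on the stack.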
The Verifier Catching Buffer
The Driver Verifier catches the overrun when it
The Bluescreen tells you whose fault it is
!analyze explains the crash and also tells you the
driver name
The stack shows where the driver bug is
Code Overwrites

Caused when a bug results in a wild pointer

A wild pointer that points at invalid memory is easily detected
A wild pointer that points at data is similar to buffer overrun
Might not cause a problem for a long time
Crash makes it look like it’s something else’s fault
System code write protection catches code overwrite, but
it’s not on if:
It’s a Windows 2000 system with > 127 MB memory
It’s a Windows XP or Windows 2003 Server system
with > 255 MB
In other words, it’s off on most systems
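The default rule above can be restated as a small lookup. The thresholds are the ones given on the slide; the function name and the OS strings are assumptions for this sketch, and Driver Verifier or the registry setting can override the default.

```python
# Largest RAM size (MB) at which system code write protection is on by default.
DEFAULT_LIMIT_MB = {
    "Windows 2000": 127,
    "Windows XP": 255,
    "Windows Server 2003": 255,
}

def code_write_protection_on_by_default(os_name, ram_mb):
    """Return True if system code write protection is on without extra configuration."""
    if os_name not in DEFAULT_LIMIT_MB:
        raise ValueError("OS not covered by this sketch: %s" % os_name)
    return ram_mb <= DEFAULT_LIMIT_MB[os_name]
```

A Windows XP system with 512 MB of RAM returns False, which is why a code overwrite there tends to surface as a confusing crash somewhere else.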
Causing a Code Overwrite

Run NotMyFault and select “Code Overwrite”

Overwrites first bytes of nt!ntreadfile
Function is most common entry to I/O system so a
random thread will cause the crash

The crash hints that the fault occurred in

The last user-mode address is ZwReadFile
The ebx register in the exception frame points at
NtReadFile’s start location looks scrambled
(u ntreadfile)
System Code Write Protection
To obtain a more obvious crash, enable system code
write protection by turning on Driver Verifier on one or
more drivers
Can also manually enable by setting
\Session Manager\Memory Management
EnforceWriteProtection REG_DWORD 1
Reboot to take effect
Rerun NotMyFault
Crash occurs immediately and even the blue screen points at

!analyze shows the address of the write and the target

Windows Memory Diagnostic

Memory errors are a significant cause of

hardware-related crashes
Windows Memory Diagnostic checks memory for errors
Free download from
Installs to
floppy or
Built into
Windows Vista
Run at least
one pass
Manual Analysis
Sometimes !analyze isn’t enough
Doesn’t tell you anything useful
You want to know what was happening at the time of the crash
Useful commands:
List loaded drivers: lm kv
Make sure drivers are all recognized and up to date
Look at memory usage: !vm
Make sure memory pools are not full
If full, use !poolused (requires pool tagging to be on)
Examine current thread: !thread
May or may not be related to the crash
List all processes: !process 0 0
Make sure you understand what was running on the system
If a Verifier detected deadlock: !deadlock
Additional commands: !help
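The commands above can be batched into a single Kd invocation. The sketch below only builds the argument list; it assumes Kd's `-z` (dump file), `-y` (symbol path), and `-c` (initial command) options and semicolon-separated commands, and the helper name is invented. Check the Debugging Tools help file if a command in the list does not behave well inside a semicolon chain.

```python
TRIAGE_COMMANDS = ["!analyze -v", "lm kv", "!vm", "!thread", "!process 0 0"]

def kd_command_line(dump_path, symbol_path, commands=TRIAGE_COMMANDS):
    """Build an argument list that opens a dump in kd, runs the
    commands in order, and then quits (q)."""
    script = "; ".join(list(commands) + ["q"])
    return ["kd", "-z", dump_path, "-y", symbol_path, "-c", script]
```

The resulting list can be passed to `subprocess.run` on a machine where the Debugging Tools are installed, with the output redirected to a file for later reading.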
Stack Trashing

An example of a crash requiring manual analysis

is a stack trash
Stack trashes have several possible causes:
A driver pushing things on the stack causes the
stack to overflow
A driver overruns a stack-allocated buffer
Usually results in garbage code being executed
Driver Verifier can’t determine cause
Since the stack is corrupted, analysis is especially difficult
Debugging Stack Trashes

Run NotMyFault and select “Stack Trash”

Allocates a buffer on the stack
Overruns the buffer
Returns to the caller

Crash doesn’t show much offhand

!analyze actually blames Win32K.sys, the Win32 kernel-mode
Stack doesn’t show anything except an exception handler
Look deeper
!thread shows an outstanding IRP
!irp <irp> shows that myfault.sys was the target of the IRP
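Why a stack trash leaves so little evidence can be seen in a toy model of a single frame. The layout (local buffer, then saved frame pointer, then return address) follows the classic x86 arrangement; the addresses are made up, and this is an illustration rather than what MyFault.sys actually does.

```python
import struct

def call_with_overrun(data, buffer_size=16, return_address=0x80501234):
    """Model a frame laid out as [local buffer][saved frame pointer][return address],
    copy `data` into the buffer without a bounds check, and report the
    return address the function would then use."""
    frame = bytearray(buffer_size) + struct.pack("<II", 0x0012FF70, return_address)
    # The unchecked copy: anything past buffer_size spills into the saved
    # frame pointer and then into the return address.
    frame[0:len(data)] = data
    frame = frame[:buffer_size + 8]
    _saved_fp, ret = struct.unpack_from("<II", frame, buffer_size)
    return ret
```

A 16-byte write leaves the return address intact, while a 24-byte write of `b"A"` replaces it with 0x41414141, so the function "returns" into garbage and the original call chain is gone. That is why the IRP, not the stack, is what leads back to the driver.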
Troubleshooting Crashes That Don’t
Generate Crash Dumps
If you are getting crashes with no resulting dump (or
other “spontaneous reboots”), you need to boot in
Debugging Mode:
Press F8 during the boot and choose
“Debugging Mode”
Or, edit the target’s boot.ini file to configure:
/debugport=comX /baudrate=XXX (note: default baud rate in
Debugging Mode is 19200)
Windows XP and Windows 2003 support 1394
Windows Vista supports USB 2.0
In either case, this loads the kernel debugger at boot time
Does not affect performance
On a crash system will wait indefinitely for debugger connection
even if configured to do so!
Connecting to a Crashed System

When system crashes, attach a kernel debugger

and analyze
In Windbg, choose File->Kernel Debug
Configure baud rate and COM port
Click OK
Debugger should connect and display the
bugcheck code
Type !analyze -v and, if necessary, perform additional
analysis commands as described earlier
To save complete memory dump for offline
analysis, use “.dump” (or “.dump /f” to capture a
full dump)
Note: this will be slow over a serial cable
Hung Systems

Sometimes system becomes unresponsive

Keyboard and mouse freeze
Two types of hang:
Instant lockup
Kernel synchronization deadlock
Infinite loop at high IRQL or very high priority thread
Grinding to a halt
Storage stack resource deadlock
Two techniques that both require prior setup and
a reboot:
Manually crash the hung system and hope you get a dump to
analyze offline
Boot the system in debugging mode and when it hangs, break in
with the kernel debugger and analyze system
Initiating a Manual Crash

Crash from keyboard

Requires PS/2 keyboard and right control key
Hold the right CTRL key and press Scroll Lock twice
Must be configured in the Registry:
\CrashOnCtrlScroll (DWORD) – set to value of 1
Documented in Debugging Tools help file
Keyboard interrupts must run for this to work
Use a hardware “dump switch”
Some servers come with an NMI button
You can also make one:
Must be configured in the Registry
\NMICrashDump (DWORD) – set to value of 1
Breaking into a Hung System

Instead of crashing you can boot in debugging

mode and break in when it hangs
After the hang, connect the host debugger
system to the target
Run WinDbg (or KD)
Press Ctrl-C (or click Debug->Break) – this breaks into
target system
Analyzing a Hang
Then attempt to determine reason for hang. (This
is the hard part.)
Use !thread to see what’s running and check the stack
Check each CPU by using the ~ command, for example, ~0, ~1
Use !locks to look at possible deadlocks
Use !irql to see previous IRQL (Windows Server 2003
and later)
If you can’t figure it out but want to save it for
later analysis:
Use .crash to force a crash
Or .dump to save the current state of the system in a
dump file
This can also be done with LiveKD (free from Sysinternals) on
live system
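When reading lock output for a possible deadlock, the thing to look for is a cycle of threads each waiting on a lock owned by the next. A minimal sketch of that check, assuming the owner and waiter relationships have already been copied out of the debugger by hand (the thread names are placeholders):

```python
def find_deadlock(waits_for):
    """`waits_for` maps a thread to the thread that owns the lock it is blocked on.
    Return the list of threads forming a cycle, or None if there is none."""
    for start in waits_for:
        seen = []
        current = start
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            # Drop any lead-in threads that merely wait on the cycle.
            return seen[seen.index(current):]
    return None
```

For `{"T1": "T2", "T2": "T1", "T3": "T1"}` the sketch reports T1 and T2 as the cycle; T3 is blocked by the deadlock but is not part of it.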
Generating a Hung System

Enable keyboard-initiated manual crash

and reboot
Run Notmyfault
Select “Hang” and press Do Bug
On reboot, open dump and look at
current thread
Remember to check each CPU of a SMP: ~0, ~1, etc.
Try to determine reason for hang
Analyzing a “Sick” System

Sometimes a system is still responsive, but you

know that something is wrong with it
You want to look at its kernel state, but…
You don’t want to take it off line by crashing it or
connecting a debugger to it
You can get a “dump” of a live system with
LiveKd (free download from
Use it to run Windbg or Kd
Use .dump to snapshot live system
The Bluescreen Screen Saver

Scare your enemies and fool your friends with

the Sysinternals Bluescreen Screen Saver
Remotely execute it (requires admin privilege on
remote system):
psexec -i -d -c "sysInternals bluescreen.scr" /s
Be careful, your job may be on the line!
More Information

Windows Internals, 4th Edition – Chapter 10: Crash

Dump Analysis
The help file which is installed with Debugging Tools
for Windows
Knowledge Base Articles
Other books:
The debugger team wants your feedback and bug reports
microsoft.public.windbg newsgroup
© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not
be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.