[Slide: example operating systems: Mach, Win9x, Linux/MacOS, NT, Symbian]
4
The success of UNIX…
The Killer App theory
Every successful computer system has a particular
application/use which is so compelling the system
immediately gets established – and grows from there.
7
Effect on OS Design
NT vs UNIX
Although both Windows and Linux have adapted to changes in the
environment, the original design environments (i.e. in 1989 and 1973) heavily
influenced the design choices:
Unit of concurrency: threads vs. processes (address-space cost, uniprocessors)
Process creation: CreateProcess() vs. fork() (address-space cost, swapping)
I/O: asynchronous vs. synchronous (swapping, I/O devices)
Namespace root: virtual vs. filesystem (removable storage)
Security: ACLs vs. uid/gid (user populations)
fork()?
8
Today's Environment
64-bit addresses
Gbytes of physical memory
New memory hierarchies
Hypervisors, virtual processors
Multiprocessors (64-512x-???)
High-speed internet/intranet, Web Services
Single user, but vulnerable to hackers worldwide
TV/PC/Messaging Convergence
Cellphone/Walkman/PDA/PC Convergence
9
Well-understood OS ideas
• Processes
• Threads
• Virtual Memory
• Input/output
• Local file systems
• Client/server computing
• Virtual machines
• Network protocols
• Security technologies (ACLs, authentication)
10
Open problems
Security/Privacy models
– Today’s systems are single-user but insecure
Application model
– What is an app? App lifecycle mgmt?
Configuration/state management
– Registry? “.” files? Migration? What is “policy”?
System Extensibility
– Apps want to modify the system, platform, and user experience
Compatibility/Fragility
– Apps break apps; OS updates break apps
Data Management
– Representing information, finding it, sharing it, expanding it, …
Federated Computing
– How do all my computers/devices work together seamlessly?
Ecosystem
– Successfully aggregating HW/SW/Info from multiple sources
11
Open problems
System complexity
– How to organize functionality. How to find “help”.
Administration
– Scaling to 100k+ systems.
– Should systems reboot? Update mgmt. VHDs.
Fundamental abstractions
– Are threads and processes obsolete?
– Isolation vs sharing? Is the future “managed”?
– Is user/kernel distinction meaningful?
Cloud computing
– How does it change the local OS?
Manycore
– “power wall” ⇒ 10s to 1000s of heterogeneous cores
Memory hierarchy
– Electromechanical no longer the only persistent store
12
Teaching UNIX AND Windows
13
NT – the accidental secret
Historically little information on NT available
– Microsoft focus was end-users and Win9x
– Source code for universities was too encumbered
14
The Windows Research Kernel
15
Windows Research Kernel
WRK Goals
• Make it easier for faculty and students to compare
and contrast Windows to other operating systems
• Enable students to study source, and modify and build
projects
• Provide better support for research & publications
based on Windows internals
• Encourage more OS textbook and university-oriented
internals books on Windows kernel
• Simplify licensing
16
WRK: Kernel Sources
Based on Windows Server 2003 SP1 / Windows x64 NTOS
Processes, threads, LPC, VM, scheduler, object manager, I/O
manager, synchronization, worker threads, kernel memory
manager, …
17
NT Kernel Design Workbook
NT OS/2 Design Workbook: Core OS
File Title Author(s)
dwintro NT OS/2 Design Workbook Introduction Lou Perazzoli
ke NT OS/2 Kernel Specification David N. Cutler, Bryan M. Willman
alerts NT OS/2 Alerts Design Note David N. Cutler
apc NT OS/2 APC Design Note David N. Cutler
ob NT OS/2 Object Management Specification Steven R. Wood
proc NT OS/2 Process Structure Mark Lucovsky
suspend NT OS/2 Suspend/Resume Design Note David N. Cutler
attproc NT OS/2 Attach Process Design Note David N. Cutler
vm NT OS/2 Virtual Memory Specification Lou Perazzoli
vmdesign NT OS/2 Memory Management Design Note Lou Perazzoli
io NT OS/2 I/O System Specification Darryl E. Havens
irp NT OS/2 IRP Language Definition Gary D. Kimura
namepipe NT OS/2 Named Pipe Specification David Cutler & Gary Kimura
mailslot NT OS/2 Mailslot Specification Manny Weiser
rsm Windows NT Session Management and Control Mark Lucovsky
fsdesign NT OS/2 File System Design Note Gary D. Kimura
fsrtl NT OS/2 File System Support Routines Specification Gary D. Kimura
18
NT Kernel Design Workbook
NT OS/2 Design Workbook: Core OS
File Title Author(s)
sem NT OS/2 Event - Semaphore Specification Lou Perazzoli
argument NT OS/2 Argument Validation Specification David N. Cutler
timer NT OS/2 Timer Specification David N. Cutler
coding NT OS/2 Coding Conventions Mark Lucovsky, Helen Custer
ulibcode NT Utilities Coding Conventions David J. Gilman
exceptn NT OS/2 Exception Handling Specification David N. Cutler
os2 OS/2 Emulation Subsystem Specification Steven R. Wood
status NT OS/2 Status Code Specification Darryl E. Havens
ntdesrtl NT OS/2 Subsystem Design Rationale Mark H. Lucovsky
resource NT OS/2 Shared Resource Specification Gary D. Kimura
execsupp NT OS/2 Executive Support Routines Specification David Treadwell
support NT OS/2 Interlocked Support Routines Specification David N. Cutler
dd Windows NT Driver Model Specification Darryl E. Havens
oplock NT OS/2 Opportunistic Locking Design Note Darryl Havens, et al
memio NT OS/2 Memory Management Guide for I/O Lou Perazzoli
time NT OS/2 Time Conversion Specification Gary D. Kimura
mutant NT OS/2 Mutant Specification David N. Cutler
19
NT Kernel Design Workbook
NT OS/2 Design Workbook: Core OS
File Title Author(s)
prefix NT OS/2 Prefix Table Specification Gary D. Kimura
startup NT OS/2 System Startup Design Note Mark Lucovsky
dbg NT OS/2 Debug Architecture Mark Lucovsky
coff NT OS/2 Linker/Librarian/Image Format Spec Michael J. O'Leary
cache NT OS/2 Caching Design Note Tom Miller
ntutil NT OS/2 Utility Design Specification Steven D. Rowe
implan NT OS/2 Product Description and Implementation Plan David N. Cutler
basecont NT OS Base Product Contents Lou Perazzoli
20
Major Kernel Functions
21
Major NT Kernel Components
OB – Object Manager
SE – Security Reference Monitor
PS – Process/Thread management
MM – Memory Manager
CACHE – Cache Manager
KE – Scheduler
IO – I/O manager, PnP, device power mgmt, GUI
Drivers – devices, file systems, volumes, network
LPC – Local Procedure Calls
HAL – Hardware Abstraction Layer
EX – Executive functions
RTL – Run-Time Library
CONFIG – Persistent configuration state (registry)
22
Introduction to the History and
Architecture of the Windows
Kernel
23
History/Overview of NT
24
NT Timeline: the first 20+ years
2/1989 Design/Coding Begins
7/1993 NT 3.1
9/1994 NT 3.5
5/1995 NT 3.51
7/1996 NT 4.0
12/1999 NT 5.0 Windows 2000
8/2001 NT 5.1 Windows XP
3/2003 NT 5.2 Windows Server 2003
8/2004 NT 5.2 Windows XP SP2
4/2005 NT 5.2 Windows XP 64 Bit Edition (& WS03SP1)
10/2006 NT 6.0 Windows Vista (client)
2/2008 NT 6.0 Windows Server 2008 (Vista SP1)
2010? NT 6.1 Windows 7
25
Overview of NTOS Architecture
• Heritage is RSX-11, VMS – not UNIX
• Kernel-based, microkernel-like
– OS personalities in subsystems (i.e. for Posix, OS/2, Win32)
– Kernel focused on memory, processes, threads, IPC, I/O
– Kernel implementation organized around the object manager
– Win32 and other subsystems built on native NT APIs
– System functionality heavily based on client/server computing
• Primary supported programming interfaces: Win32 (and .NET)
• NT APIs
– Generally not documented (except for DDK)
– NT APIs are rich (many parameters)
• NTOS (kernel)
– Implements the NTAPI
– Drivers, file systems, protocol stacks not in NTOS
– Dynamic loading of drivers (.sys DLLs) is extension model
26
NT kernel philosophy
27
Important NT kernel features
28
Important NT kernel features (cont)
• Extensible filter-based I/O model with driver layering,
standard device models, notifications, tracing, journaling,
namespace, services/subsystems
• Virtual address space managed separately from memory
objects
• Advanced VM features for databases (app management
of virtual addresses, physical memory, I/O, dirty bits, and
large pages)
• Plug-and-play, power-management
• System library mapped in every process provides trusted
entrypoints
29
Windows Architecture
Applications
Kernel-mode
NTOS kernel layer
HAL
Firmware, Hardware
30
Windows user-mode
• Subsystems
– OS Personality processes
– Dynamic Link Libraries
– Why NT is mistaken for a microkernel
• System services (smss, lsass, services)
• System Library (ntdll.dll)
• Explorer/GUI (winlogon, explorer)
• Random executables (robocopy, cmd)
31
Windows kernel-mode
• NTOS (aka 'the kernel')
– Kernel layer (abstracts the CPU)
– Executive layer (OS kernel functions)
• Drivers (kernel-mode extension model)
– Interface to devices
– Implement file system, storage, networking
– New kernel services
• HAL (Hardware Abstraction Layer)
– Hides Chipset/BIOS details
– Allows NTOS and drivers to run unchanged
32
Kernel-mode Architecture of Windows
[Diagram: user mode calls NT API stubs (wrapping sysenter) in the system library (ntdll.dll); beneath the kernel-mode layers sit firmware/hardware: CPU, MMU, APIC, BIOS/ACPI, memory, devices.]
© Microsoft Corporation 2008
33
Kernel/Executive layers
• Kernel layer – aka 'ke' (~5% of NTOS source)
– Abstracts the CPU
• Threads, Asynchronous Procedure Calls (APCs)
• Interrupt Service Routines (ISRs)
• Deferred Procedure Calls (DPCs – aka Software Interrupts)
– Provides low-level synchronization
• Executive layer
– OS Services running in a multithreaded environment
– Full virtual memory, heap, handles
• Note: VMS had four layers:
– Kernel / Executive / Supervisor / User
34
NT Kernel Structure
• Object-manager organizes data and handles
• Namespace not rooted in the file system
• Kernel has kernel/executive split internally
– promotes a high degree of multi-threading
• Clean separation of processes and threads
– threads are unit of concurrency
– virtual address space managed separately from
memory objects
– system library mapped in every process provides
trusted entry points for user-mode side of kernel
• I/O uses stacks of device drivers
– highly extensible model (for better or worse)
– IO Request Packets make I/O inherently asynchronous
35
NT (Native) API examples
NtCreateProcess (&ProcHandle, Access, SectionHandle,
DebugPort, ExceptionPort, …)
NtCreateThread (&ThreadHandle, ProcHandle, Access,
ThreadContext, bCreateSuspended, …)
NtAllocateVirtualMemory (ProcHandle, Addr, Size, Type,
Protection, …)
NtMapViewOfSection (SectHandle, ProcHandle, Addr,
Size, Protection, …)
NtReadVirtualMemory (ProcHandle, Addr, Size, …)
NtDuplicateObject (srcProcHandle, srcObjHandle,
dstProcHandle, dstHandle, Access, Attributes, Options)
36
Major NTOS Facilities
Object Manager, Security Reference Monitor, Registry: implement
the system namespace and manage objects representing kernel
abstractions, data structure references, metadata, configuration,
and security.
Components: Object Manager, Security Reference Monitor,
Registry, Handle table
37
Major NT Kernel Areas (cont)
38
Object Manager
Security Reference Monitor
Registry
39
Kernel Abstractions
40
NT Object Manager
41
\ObjectTypes
Object Manager: Directory, SymbolicLink, Type
Processes/Threads: DebugObject, Job, Process, Profile,
Section, Session, Thread, Token
Synchronization:
Event, EventPair, KeyedEvent, Mutant, Semaphore,
ALPC Port, IoCompletion, Timer, TpWorkerFactory
IO: Adapter, Controller, Device, Driver, File, Filter*Port
Kernel Transactions: TmEn, TmRm, TmTm, TmTx
Win32 GUI: Callback, Desktop, WindowStation
System: EtwRegistration, WmiGuid
42
Implementation: Object Methods
43
Naming example
[Diagram: the object-manager directory \Global?? contains a symbolic link named C: whose target string is \Device\HarddiskVolume1.]
44
UNIX namei ~ NT IopParseDevice
NtCreateFile API builds a context descriptor for operation
Then calls ObOpenObjectByName with (path, context)
Object Manager follows path until it finds a Device Object
Then calls the object's Parse routine
45
Object Manager Parsing example
\Global??\C:\foo\bar.txt
<device object>
implemented
by I/O
, “foo\bar.txt”
manager
deviceobject->ParseRoutine == IopParseDevice
File Sys
File System Fills in File object
47
Why not root namespace in filesys?
A few reasons…
• Hard to add new object types
• Device configuration requires filesys modification
• Root partition needed for each remote client
– End up trying to make a tiny root for each client
– Have to check filesystem very early
Windows uses object manager + registry hives
• Fabricates top-level namespace in kernel
• Uses config information from registry hive
• Only needs to modify hive after system stable
48
Security Reference Monitor
49
Vista Kernel Security Changes
Code Integrity (x64) and BitLocker Encryption
• Signature verification of kernel modules
• Drives can be encrypted
Protected Processes
• Secures DRM processes
User Account Control (Allow or Deny?)
Integrity Levels
• Provides a backup for ACLs by limiting write access to
objects (and windows) regardless of permission
• Used by “low-rights” Internet Explorer
50
Handle Table
51
Object referencing
[Diagram: an app passes a name or a handle into NTOS. Name lookup goes through the Object Manager, with access checks by the Security Reference Monitor, and returns a referenced pointer; handle lookup goes through the handle table to the object's data; the referenced pointer is used until dereferenced.]
52
Process/Thread structure
[Diagram: a process binds a handle table (referencing files, events, devices, drivers), Memory Manager structures for its virtual address space, and a set of threads carrying user-mode execution; e.g. read(handle) resolves through the process' handle table.]
53
One level: (to 512 handles)
[Diagram: the handle table's TableCode points directly at a page A of 512 handle-table entries, each referencing an object.]
54
Two levels: (to 512K handles)
[Diagram: TableCode points at a page B of 1024 pointers, each referencing a page of handle-table entries.]
55
Three levels: (to 16M handles)
[Diagram: TableCode points at a page D of 32 pointers to pointer pages, each entry leading to a page C of 512 handle-table entries.]
56
Registry (Configuration Manager)
57
Configuration/State management
Registry vs File system
What registry solved:
• Boot / device configuration
• User accidentally deleting state
• Performance
58
Configuration/State management
Registry vs File system
What went wrong:
• Configuration state obscured
• Doesn't leverage file system technology
– Unfamiliar
– Robustness is separately addressed
• Poor network story and heterogeneous inter-op
• Abuse: used for IPC/synchronization, as a database, for
system extension; same state stored multiple times
• Competing configuration stores
• Bad state separation from system
• Apps must be reinstalled; file-copy installs are broken
59
Process Handle Tables
[Diagram: each EPROCESS points (via pHandleTable) at its own handle table of object references; kernel handles live in the System process' handle table.]
60
OBJECT_HEADER
PointerCount
HandleCount
pObjectType
oNameInfo oHandleInfo oQuotaInfo Flags
pQuotaBlockCharged
pSecurityDescriptor
CreateInfo + NameInfo + HandleInfo + QuotaInfo
OBJECT BODY
[with optional DISPATCHER_HEADER]
61
DISPATCHER_HEADER
Fundamental kernel synchronization mechanism
Equivalent to a KEVENT at front of dispatcher objects
SignalState
WaitListHead.flink
WaitListHead.blink
62
[Diagram: wait data structures. Each thread's WaitBlockList chains one WaitBlock per object being waited on (linked via NextWaitBlock; this is the structure used by WaitMultiple); each dispatcher object's WaitListHead links, via WaitListEntry, the WaitBlocks of all threads waiting on it; the per-processor KPRCB WaitListHead tracks the waiting threads. Signaling an object walks its wait list.]
63
Summary: Object Manager
• Foundation of NT namespace
• Unifies access to kernel data structures
– Outside the filesystem (initialized from the registry)
– Unified access control via Security Ref Monitor
– Unified kernel-mode referencing (ref pointers)
– Unified user-mode referencing (via handles)
– Unified synchronization mechanism (events)
64
Process Manager
Memory Manager
Cache Manager
65
Processes
• An environment for program execution
• Binds
– namespaces
– virtual address mappings
– ports (debug, exceptions)
– threads
– user authentication (token)
– virtual memory data structures
• Abstracts the MMU, not the CPU
66
Process Lifetime
Process created as an empty shell
Address space created with only system DLL and
the main image (unless forked for Posix)
Handle table created empty or populated via
duplication of inheritable handles from parent
Add environment, threads, map executable image
sections (EXE and DLLs)
68
Process/Thread structure
[Diagram: as before, a process binds its handle table (files, events, devices, drivers), Memory Manager structures for its virtual address space, and its threads.]
69
Virtual Address Descriptors
70
Shared-Memory Data Structures
[Diagram: a File Object (reached via a file handle) and a Section Object (via a section handle) both lead to the Control Area and Segment describing a mapped file; Subsections and prototype PTEs describe its pages; the Shared Cache Map, the process page directory, and the process VAD all reference the same mapping.]
71
Physical Frame Management
• Table of PFN data structures
– represent all pageable pages
– synchronize page-ins
– linked to management lists
• Page Tables
– hierarchical index of page directories and tables
– leaf-node is page table entry (PTE)
– PTE states:
• Active/valid
• Transition
• Modified-no-write
• Demand zero
• Page file
• Mapped file
72
Physical Frame State Changes
[Diagram: pages trimmed from a process (or system) working set go to the Modified or Standby lists; the Modified Page Writer moves modified pages to the Standby list; under low memory MM moves standby pages to the Free list; the Zero Thread turns free pages into pages on the Zero list.]
73
Paging Overview
Working Sets: list of valid pages for each process
(and the kernel)
Pages 'trimmed' from a working set are put on lists
Standby list: pages backed by disk
Modified list: dirty pages to push to disk
Free list: pages not associated with disk
Zero list: supply of demand-zero pages
Modified/standby pages can be faulted back into a
working set w/o disk activity (soft fault)
Background system threads trim working sets,
write modified pages and produce zero pages
based on memory state and config parameters
74
Managing Working Sets
Aging pages: Increment age counts for pages
which haven't been accessed
Estimate unused pages: count them in each working set and
keep a global estimate
When getting tight on memory: replace rather
than add pages when a fault occurs in a working
set with significant unused pages
When memory is tight: reduce (trim) working sets
which are above their maximum
Balance Set Manager: periodically runs Working
Set Trimmer, also swaps out kernel stacks of
long-waiting threads
75
Balance Set Manager
NT's swapping
Balance set = sum of all “inswapped” working sets
76
32-bit VA/Memory Management
[Diagram: the Working-set Manager trims the working-set list; the VAD tree maps sections (an executable image mapped copy-on-write, data files such as a SQL data file, and the pagefile) onto physical memory; trimmed pages flow to the Standby, Free, and Modified lists, with the Modified Page Writer writing dirty pages back to their files.]
77
Context-switching Kernel VM
78
Vista Memory Management
Memory Management improvements
• Improved prefetch at app launch/swap-in and
resume from hibernation/sleep
• Kernel Address Space dynamically configured
• Support use of flash as write-through cache
(compressed/encrypted)
• Session 0 is now isolated (runs systemwide
services)
• Address Space Randomization (executables
and stacks) for improved virus resistance
79
Cache Manager
Virtual Block Cache on top of Virtual Memory
• Memory-maps files into kernel virtual memory
• Most work is managing how memory and virtual address
space is used, and dealing with size changes in files
• Lazily flushes data in older views to disk
• Easy coherency with user memory-mapped files
• Heavily used by file systems through extensive API set
– Note: NTFS metadata is conveniently all in files
Most Cache manager work:
• Managing use of physical memory and virtual addresses
• Dealing with size changes in files
80
Thread Manager
Kernel Layer
81
Threads
Unit of concurrency (abstracts the CPU)
Threads created within processes
System threads created within system process (kernel)
System thread examples:
Dedicated threads
Lazy writer, modified page writer, balance set manager,
mapped page writer, other housekeeping functions
General worker threads
Used to move work out of context of user thread
Must be freed before drivers unload
Sometimes used to avoid kernel stack overflows
Driver worker threads
Extends pool of worker threads for heavy hitters, like file server
82
Thread elements
user-mode
• user-mode stack
• Thread Environment Block (TEB)
– most interesting: thread local storage
kernel-mode
• kernel-mode stack
• KTHREAD: scheduling, synchronization, timers, APCs
• ETHREAD: timestamps, I/O list, exec locks, process link
• thread ID
• impersonation token
83
Kernel Thread Attach
Allows a thread in the kernel to temporarily move to a different
process' address space
• Used heavily in Ps and Mm
– PspProcessDelete attaches before calling ObKillProcess() so that
close/delete run in the proper process context
• Used to access the TEB/PEB
• Changes PsGetCurrentProcess() (effective process)
84
Scheduling
Windows schedules threads, not processes
Scheduling is preemptive, priority-based, and round-robin at the
highest-priority
16 real-time priorities above 16 normal priorities
Scheduler tries to keep a thread on its ideal processor/node to
avoid perf degradation of cache/NUMA-memory
Threads can specify affinity mask to run only on certain processors
Each thread has a current & base priority
Base priority initialized from process
Non-realtime threads have priority boost/decay from base
Boosts for GUI foreground, waking for event
Priority decays, particularly if thread is CPU bound (running at
quantum end)
Scheduler runs in response to events: timer ticks, setting
thread priority, thread block/exit, etc.
Priority inversions can lead to starvation
balance manager periodically boosts non-running runnable threads
85
NT thread priorities
[Chart: priorities 0-15 form the dynamic range: 0 is reserved for the zero-page thread, 1 is idle, and the idle, normal, and high classes span the levels up to worker/time-critical threads at 13-15. Priorities 16-31 form the fixed real-time range, from idle (16) up to time-critical (31).]
86
CPU Starvation Avoidance
Balance Set Manager
Once per second finds threads that have been ready for several
seconds
Scans up to 16 ready threads per priority level per second
Boosts up to 10 ready threads per second
Boost thread priority to 15 and double quantum
At quantum end immediately restores original priority and quantum
(no decay)
[Diagram: priority inversion: a priority-11 thread blocks acquiring a resource held by a ready priority-5 thread, which cannot run because a priority-9 thread stays running; the balance set manager's boost breaks the starvation.]
87
CPU Control-flow
Thread scheduling occurs at PASSIVE or APC level
(IRQL < 2)
APCs (Asynchronous Procedure Calls) deliver I/O
completions, thread/process termination, etc (IRQL == 1)
Not a general mechanism like unix signals (user-mode code must
explicitly block pending APC delivery)
Interrupt Service Routines run at IRQL > 2
ISRs defer most processing to run at IRQL==2 (DISPATCH
level) by queuing a DPC to their current processor
A pool of worker threads available for kernel components
to run in a normal thread context when user-mode thread
is unavailable or inappropriate
Normal thread scheduling is round-robin among priority
levels, with priority adjustments (except for fixed priority
real-time threads)
88
Asynchronous Procedure Calls
APCs execute routine in thread context
not as general as UNIX signals
user-mode APCs run when blocked & alertable
kernel-mode APCs used extensively: timers,
notifications, swapping stacks, debugging, set thread
ctx, I/O completion, error reporting, creating &
destroying processes & threads, …
APCs generally blocked in critical sections
e.g. don't want a thread to exit while holding resources
89
Asynchronous Procedure Calls
APCs execute code in context of a particular thread
APCs run only at PASSIVE or APC LEVEL (0 or 1)
Three kinds of APCs
User-mode: deliver notifications, such as I/O done
Kernel-mode: perform O/S work in context of a
process/thread, such as completing IRPs
Special kernel-mode: used for process termination
Multiple 'environments':
Original: The normal process for the thread (ApcState[0])
Attached: The thread as attached (ApcState[1])
Current: The ApcState[ ] as specified by the thread
Insert: The ApcState[ ] as specified by the KAPC block
90
Deferred Procedure Calls
DPCs run a routine on a particular processor
DPCs are higher priority than threads
common usage is deferred interrupt processing
ISR queues DPC to do bulk of work
• long DPCs harm perf, by blocking threads
• Drivers must be careful to flush DPCs before unloading
also used by scheduler & timers (e.g. at quantum end)
High-priority routines use IPI (inter-processor intr)
used by MM to flush TLB in other processors
91
Summary: CPU
92
I/O
93
I/O Model
• Extensible filter-based I/O model with driver layering
• Standard device models for common device classes
• Support for notifications, tracing, journaling
• Configuration store
• File caching is virtual, based on memory mapping
• Completely asynchronous model (with cancellation)
• Multiple completion models:
– wait on the file handle
– wait on an event handle
– specify a routine to be called on IO completion (APC)
– use completion port
– poll status variable
94
IO Request Packet (IRP)
95
IRP Fields
Flags
System Buffer Pointers
96
Layering Drivers
Device objects attach one on top of another using
IoAttachDevice* APIs creating device stacks
– IO manager sends IRP to top of the stack
– drivers store next lower device object in their private
data structure
– stack tear down done using IoDetachDevice and
IoDeleteDevice
Device objects point to driver objects
– driver objects represent driver state, including the dispatch table
File objects point to open files
File systems are drivers which manage file objects for
volumes (described by VolumeParameterBlocks)
97
File System Device Stack
[Diagram: an application calls through Kernel32/ntdll, crossing from user to kernel mode into the NT I/O Manager, which sends the request down the file-system device stack.]
98
NtCreateFile
[Diagram: NtCreateFile calls ObOpenObjectByName; the Object Manager's IopParseDevice has the I/O Manager build an IRP and a File Object, and IoCallDriver passes the IRP down through the FS filter drivers, the file system, and the Volume Manager.]
99
IRP flow of control (synchronous)
IOMgr (e.g. IopParseDevice) creates IRP, fills in top
stack location, calls IoCallDriver to pass to stack
driver determined by the top device object on the device stack
the driver is passed the device object and the IRP
IoCallDriver
copies stack location for next driver
driver routine determined by major function in drvobj
Each driver in turn
does work on IRP, if desired
keeps track in the device object of the next stack device
Calls IoCallDriver on next device
Eventually bottom driver completes IO and returns on callstack
100
IRP flow of control (asynchronous)
Eventually a driver decides to be asynchronous
driver queues IRP for further processing
driver returns STATUS_PENDING up call stack
higher drivers may return all the way to user, or may
wait for IO to complete (synchronizing the stack)
Eventually a driver decides IO is complete
usually due to an interrupt/DPC completing IO
each completion routine in device stack is called,
possibly at DPC or in arbitrary thread context
IRP turned into APC request delivered to original thread
APC runs final completion, accessing process memory
101
Programming Asynchronous I/O
• Applications issue asynchronous I/O requests by opening
files with FILE_FLAG_OVERLAPPED and passing an
LPOVERLAPPED parameter to the I/O API (e.g.,
ReadFile(…))
• Five methods available to wait for IO completion,
– Wait on the file handle
– Wait on an event handle passed in the overlapped
structure (e.g., GetOverlappedResult(…))
– Specify a routine to be called on IO completion
– Use completion ports
– Poll status variable
102
I/O Completions
• Five methods to receive notification of completion for
asynchronous I/O:
– poll a status variable
– wait for the file handle to be signalled
– wait for an explicitly passed event to be signalled
– specify a routine to be called on I/O completion (APC)
– use an I/O completion port
103
I/O Completion
[Diagram: with normal completion, each thread issues its own I/O request and is woken, crossing from kernel back to user mode, when that request completes; with I/O completion ports, the completions of many outstanding requests queue at a single port, drained by a small pool of threads.]
104
Summary: I/O
• Plug-and-Play
• Power management
105
Questions
106