
Windows Academic Program

Windows Kernel Source and Internals


Compare & Contrast
Windows Architecture
Windows Kernel Internals
11 March 2009 – MSRA Beijing Workshop

Dave Probert, Ph.D.


Architect, Windows Kernel Group
Windows Core Operating Systems Division
Microsoft Corporation
© Microsoft Corporation 2009
Dave Probert (Dave.Probert@Microsoft.com)
Day Job: Kernel Architect – working on Windows 8 and Windows 9,
including manycore, services, and core facilities
Hobby: Developed the WRK and ProjectOZ components of the
Windows Academic Program and travel extensively presenting on
kernel internals to faculty at universities around the world
Previous Day Jobs:
o Managed kernel development (Win2000 to Vista alpha)
o Joined Microsoft in 1996 after Ph.D. from UC Santa Barbara
(developing SPACE -- a kernel-less OS framework)
o Acquired and supported a 64-node Meiko parallel computer
o Consultant to various companies on UNIX kernel internals
o Vice President of Software Engineering for Culler Scientific
o Developed/managed laboratory computing facilities for a new
Computer Science department, and various researchers
o Systems Engineer for Burroughs working on B1900/B6700
architecture, memory controller hardware, and microcode
2
Over-simplified OS history
[OS family-tree figure] OS/360, Multics, MCP, Tenex, VM/370, UNIX v6/v7, RSX-11, CP/M, System/38, Accent, VMS, BSD/SVR4, MS/DOS, Mach, Win9x, Linux/MacOS, NT, Symbian

Of all the interesting operating systems…
…only UNIX and NT matter (and maybe Symbian)
3
Kernel Culture Wars:

VMS vs UNIX round two

4
The success of UNIX…
The Killer App theory
Every successful computer system has a particular
application/use which is so compelling the system
immediately gets established – and grows from there.

Examples: The killer app for…


IBM PC? Lotus 1-2-3
Macintosh? Desktop publishing
Apple Lisa? There was none!
UNIX? Computer Science education
5
…in Computer Science
UNIX is by Computer Scientists
for Computer Scientists
Simple, clean, interesting concepts
Pipelines, fork, C, shell, inode-based file system
Ran on popular (& economical) university
hardware
PDP-11 (and later, VAX-11)
Source code available (to universities)
Popularized by Lions' (illegal) commentary
Features relevant to computer science
Uucp, troff, C/f77, & games (spacewars!)
6
NT vs UNIX Design Environments

Environment which influenced


fundamental design decisions
Windows (NT) vs UNIX:
32-bit program address space vs 16-bit program address space
Mbytes of physical memory vs Kbytes of physical memory
Virtual memory vs swapping system with memory mapping
Mbytes of disk, removable disks vs Kbytes of disk, fixed disks
Multiprocessor (4-way) vs uniprocessor
Micro-controller based I/O devices vs state-machine based I/O devices
Client/Server distributed computing vs standalone interactive systems
Large, diverse user populations vs small number of friendly users

7
Effect on OS Design

NT vs UNIX
Although both Windows and Linux have adapted to changes in the
environment, the original design environments (i.e. in 1989 and 1973) heavily
influenced the design choices:
Unit of concurrency: threads vs processes (address space, uniprocessor)
Process creation: CreateProcess() vs fork() (address space, swapping)
I/O: async vs sync (swapping, I/O devices)
Namespace root: virtual vs filesystem (removable storage)
Security: ACLs vs uid/gid (user populations)

fork()?
8
Today's Environment

64-bit addresses
Gbytes of physical memory
New memory hierarchies
Hypervisors, virtual processors
Multiprocessors (64-512x-???)
High-speed internet/intranet, Web Services
Single user, but vulnerable to hackers worldwide

TV/PC/Messaging Convergence
Cellphone/Walkman/PDA/PC Convergence

9
Well-understood OS ideas

• Processes
• Threads
• Virtual Memory
• Input/output
• Local file systems
• Client/server computing
• Virtual machines
• Network protocols
• Security technologies (ACLs, authentication)

10
Open problems
Security/Privacy models
– Today’s systems are single-user but insecure
Application model
– What is an app? App lifecycle mgmt?
Configuration/state management
– Registry? “.” files? Migration? What is “policy”?
System Extensibility
– Apps want to modify the system, platform, and user experience
Compatibility/Fragility
– Apps break apps, OS updates break apps, …
Data Management
– Representing information, finding it, sharing it, expanding it, …
Federated Computing
– How do all my computers/devices work together seamlessly?
Ecosystem
– Successfully aggregating HW/SW/Info from multiple sources
11
Open problems
System complexity
– How to organize functionality. How to find “help”.
Administration
– Scaling to 100k+ systems.
– Should systems reboot? Update mgmt. VHDs.
Fundamental abstractions
– Are threads and processes obsolete?
– Isolation vs sharing? Is the future “managed”?
– Is user/kernel distinction meaningful?
Cloud computing
– How does it change the local OS?
Manycore
– “power wall” ⇒ 10s to 1000s of heterogeneous cores
Memory hierarchy
– Electromechanical no longer the only persistent store

12
Teaching UNIX AND Windows

“Compare & Contrast” drives innovation


• Studying 'foo' is fine
• But if you also study 'bar', students will compare &
contrast
• Result is innovation:
– Students mix & match concepts to create new
ideas
– Realizing there is not a single 'right' solution,
students invent even more approaches
– Learning to think critically is an important skill for
students

13
NT – the accidental secret
Historically little information on NT available
– Microsoft focus was end-users and Win9x
– Source code for universities was too encumbered

Much better internals information today


– Windows Internals, 4th Ed., Russinovich & Solomon
– Windows Academic Program (universities only):
• CRK: Curriculum Resource Kit (NT kernel in PowerPoint)
• WRK: Windows Research Kernel (NT kernel in source)
• Projects! Tools!
– Chapters in leading OS textbooks (Tanenbaum,
Silberschatz, Stallings)

14
The Windows Research Kernel

15
Windows Research Kernel

WRK Goals
• Make it easier for faculty and students to compare
and contrast Windows to other operating systems
• Enable students to study source, and modify and build
projects
• Provide better support for research & publications
based on Windows internals
• Encourage more OS textbook and university-oriented
internals books on Windows kernel
• Simplify licensing

16
WRK: Kernel Sources
Based on Windows Server 2003 SP1 / Windows x64 NTOS
Processes, threads, LPC, VM, scheduler, object manager, I/O
manager, synchronization, worker threads, kernel memory
manager, …

most everything in NTOS except plug-and-play, power-


management, and specialized code such as the driver
verifier, splash screen, branding, timebomb

Over 800K lines of kernel source


Simplified a little, cleaned up comments, improved spelling
Original specs & design docs for NT
CRK w/ source references
Student Projects

17
NT Kernel Design Workbook
NT OS/2 Design Workbook: Core OS
File       Title                                               Author(s)
dwintro    NT OS/2 Design Workbook Introduction                Lou Perazzoli
ke         NT OS/2 Kernel Specification                        David N. Cutler, Bryan M. Willman
alerts     NT OS/2 Alerts Design Note                          David N. Cutler
apc        NT OS/2 APC Design Note                             David N. Cutler
ob         NT OS/2 Object Management Specification             Steven R. Wood
proc       NT OS/2 Process Structure                           Mark Lucovsky
suspend    NT OS/2 Suspend/Resume Design Note                  David N. Cutler
attproc    NT OS/2 Attach Process Design Note                  David N. Cutler
vm         NT OS/2 Virtual Memory Specification                Lou Perazzoli
vmdesign   NT OS/2 Memory Management Design Note               Lou Perazzoli
io         NT OS/2 I/O System Specification                    Darryl E. Havens
irp        NT OS/2 IRP Language Definition                     Gary D. Kimura
namepipe   NT OS/2 Named Pipe Specification                    David Cutler, Gary Kimura
mailslot   NT OS/2 Mailslot Specification                      Manny Weiser
rsm        Windows NT Session Management and Control           Mark Lucovsky
fsdesign   NT OS/2 File System Design Note                     Gary D. Kimura
fsrtl      NT OS/2 File System Support Routines Specification  Gary D. Kimura

18
NT Kernel Design Workbook
NT OS/2 Design Workbook: Core OS
File       Title                                               Author(s)
sem        NT OS/2 Event - Semaphore Specification             Lou Perazzoli
argument   NT OS/2 Argument Validation Specification           David N. Cutler
timer      NT OS/2 Timer Specification                         David N. Cutler
coding     NT OS/2 Coding Conventions                          Mark Lucovsky, Helen Custer
ulibcode   NT Utilities Coding Conventions                     David J. Gilman
exceptn    NT OS/2 Exception Handling Specification            David N. Cutler
os2        OS/2 Emulation Subsystem Specification              Steven R. Wood
status     NT OS/2 Status Code Specification                   Darryl E. Havens
ntdesrtl   NT OS/2 Subsystem Design Rational                   Mark H. Lucovsky
resource   NT OS/2 Shared Resource Specification               Gary D. Kimura
execsupp   NT OS/2 Executive Support Routines Specification    David Treadwell
support    NT OS/2 Interlocked Support Routines Specification  David N. Cutler
dd         Windows NT Driver Model Specification               Darryl E. Havens
oplock     NT OS/2 Opportunistic Locking Design Note           Darryl Havens, et al
memio      NT OS/2 Memory Management Guide for I/O             Lou Perazzoli
time       NT OS/2 Time Conversion Specification               Gary D. Kimura
mutant     NT OS/2 Mutant Specification                        David N. Cutler

19
NT Kernel Design Workbook
NT OS/2 Design Workbook: Core OS
File       Title                                                Author(s)
prefix     NT OS/2 Prefix Table Specification                   Gary D. Kimura
startup    NT OS/2 System Startup Design Note                   Mark Lucovsky
dbg        NT OS/2 Debug Architecture                           Mark Lucovsky
coff       NT OS/2 Linker/Librarian/Image Format Spec           Michael J. O'Leary
cache      NT OS/2 Caching Design Note                          Tom Miller
ntutil     NT OS/2 Utility Design Specification                 Steven D. Rowe
implan     NT OS/2 Product Description and Implementation Plan  David N. Cutler
basecont   NT OS Base Product Contents                          Lou Perazzoli

20
Major Kernel Functions

• Manage naming & security → OB, SE
• Manage address spaces → PS, MM
• Manage physical memory → MM, CACHE
• Manage CPU → KE
• Provide I/O & net abstractions → IO, drivers
• Implement cross-domain calls → LPC
• Abstract low-level hardware → HAL
• Internal support functions → EX, RTL
• Internal configuration mgmt → CONFIG

21
Major NT Kernel Components
• OB – Object Manager
• SE – Security Reference Monitor
• PS – Process/Thread management
• MM – Memory Manager
• CACHE – Cache Manager
• KE – Scheduler
• IO – I/O manager, PnP, device power mgmt, GUI
• Drivers – devices, file systems, volumes, network
• LPC – Local Procedure Calls
• HAL – Hardware Abstraction Layer
• EX – Executive functions
• RTL – Run-Time Library
• CONFIG – Persistent configuration state (registry)

22
Introduction to the History and
Architecture of the Windows
Kernel

23
History/Overview of NT

1988: Bill Gates recruits VMS Architect Dave Cutler


Business goals:
• an advanced commercial OS for desktops/servers
• compatible with OS/2
Technical goals:
• scalable on symmetric multiprocessors
• secure, reliable, performant, portable
NT: New Technology
• Targeted at Intel i860 code-named N10
• NT: N10 (N-Ten)

24
NT Timeline: the first 20+ years
2/1989 Design/Coding Begins
7/1993 NT 3.1
9/1994 NT 3.5
5/1995 NT 3.51
7/1996 NT 4.0
12/1999 NT 5.0 Windows 2000
8/2001 NT 5.1 Windows XP
3/2003 NT 5.2 Windows Server 2003
8/2004 NT 5.2 Windows XP SP2
4/2005 NT 5.2 Windows XP 64 Bit Edition (& WS03SP1)
10/2006 NT 6.0 Windows Vista (client)
2/2008 NT 6.0 Windows Server 2008 (Vista SP1)
2010? NT 6.1 Windows 7
25
Overview of NTOS Architecture
• Heritage is RSX-11, VMS – not UNIX
• Kernel-based, microkernel-like
– OS personalities in subsystems (i.e. for Posix, OS/2, Win32)
– Kernel focused on memory, processes, threads, IPC, I/O
– Kernel implementation organized around the object manager
– Win32 and other subsystems built on native NT APIs
– System functionality heavily based on client/server computing
• Primary supported programming interfaces: Win32 (and .NET)
• NT APIs
– Generally not documented (except for DDK)
– NT APIs are rich (many parameters)
• NTOS (kernel)
– Implements the NTAPI
– Drivers, file systems, protocol stacks not in NTOS
– Dynamic loading of drivers (.sys DLLs) is extension model

26
NT kernel philosophy

• Reliability, Security, Portability, Compatibility are


all paramount
• Performance important
– Multi-threaded, asynchronous
• General facilities that can be re-used
– Support kernel-mode extensibility (for better or worse)
– Provide unified mechanisms that can be shared
– Kernel/executive split provides a clean layering model
– Choose designs with architectural headroom

27
Important NT kernel features

• Highly multi-threaded in a process-like environment


• Completely asynchronous I/O model
• Thread-based scheduling
• Unified management of kernel data structures, kernel
references, user references (handles), namespace,
synchronization objects, resource charging, cross-
process sharing
• Centralized ACL-based security reference monitor
• Configuration store decoupled from file system

28
Important NT kernel features (cont)
• Extensible filter-based I/O model with driver layering,
standard device models, notifications, tracing, journaling,
namespace, services/subsystems
• Virtual address space managed separately from memory
objects
• Advanced VM features for databases (app management
of virtual addresses, physical memory, I/O, dirty bits, and
large pages)
• Plug-and-play, power-management
• System library mapped in every process provides trusted
entrypoints

29
Windows Architecture
[layer diagram]
User mode: Applications; DLLs; Subsystem servers; System services / Critical services; Logon/GINA; Kernel32 / win32; System library (ntdll) / run-time library
Kernel mode: NTOS kernel layer; NTOS executive layer; Drivers; HAL
Firmware, Hardware

30
Windows user-mode

• Subsystems
– OS Personality processes
– Dynamic Link Libraries
– Why NT is mistaken for a microkernel
• System services (smss, lsass, services)
• System Library (ntdll.dll)
• Explorer/GUI (winlogon, explorer)
• Random executables (robocopy, cmd)

31
Windows kernel-mode
• NTOS (aka 'the kernel')
– Kernel layer (abstracts the CPU)
– Executive layer (OS kernel functions)
• Drivers (kernel-mode extension model)
– Interface to devices
– Implement file system, storage, networking
– New kernel services
• HAL (Hardware Abstraction Layer)
– Hides Chipset/BIOS details
– Allows NTOS and drivers to run unchanged
32
Kernel-mode Architecture of Windows
[layer diagram]
user mode: NT API stubs (wrap sysenter) -- system library (ntdll.dll)
NTOS kernel layer: Trap/Exception/Interrupt dispatch; CPU mgmt: scheduling, synchronization, ISRs/DPCs/APCs
NTOS executive layer: Procs/Threads, IPC, Object Mgr, Virtual Memory, Security, Cache Mgr, I/O, Registry, glue; Drivers (devices, filters, volumes, networking, graphics)
Hardware Abstraction Layer (HAL): BIOS/chipset details
firmware/hardware: CPU, MMU, APIC, BIOS/ACPI, memory, devices
33
Kernel/Executive layers
• Kernel layer – aka 'ke' (~5% of NTOS source)
– Abstracts the CPU
• Threads, Asynchronous Procedure Calls (APCs)
• Interrupt Service Routines (ISRs)
• Deferred Procedure Calls (DPCs – aka Software Interrupts)
– Provides low-level synchronization
• Executive layer
– OS Services running in a multithreaded environment
– Full virtual memory, heap, handles
• Note: VMS had four layers:
– Kernel / Executive / Supervisor / User

34
NT Kernel Structure
• Object-manager organizes data and handles
• Namespace not rooted in the file system
• Kernel has kernel/executive split internally
– promotes high-degree of multi-threading
• Clean separation of processes and threads
– threads are unit of concurrency
– virtual address space managed separately from
memory objects
– system library mapped in every process provides
trusted entry points for user-mode side of kernel
• I/O uses stacks of device drivers
– highly extensible model (for better or worse)
– IO Request Packets make I/O inherently asynchronous

35
NT (Native) API examples
NtCreateProcess (&ProcHandle, Access, SectionHandle,
DebugPort, ExceptionPort, …)
NtCreateThread (&ThreadHandle, ProcHandle, Access,
ThreadContext, bCreateSuspended, …)
NtAllocateVirtualMemory (ProcHandle, Addr, Size, Type,
Protection, …)
NtMapViewOfSection (SectHandle, ProcHandle, Addr,
Size, Protection, …)
NtReadVirtualMemory (ProcHandle, Addr, Size, …)
NtDuplicateObject (srcProcHandle, srcObjHandle,
dstProcHandle, dstHandle, Access, Attributes, Options)
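For illustration, here is a minimal user-mode sketch of calling the native API directly from ntdll.dll. The NtAllocateVirtualMemory/NtFreeVirtualMemory prototypes below are assumptions (these routines are exported by ntdll but not declared in the SDK headers), and the program must be linked against ntdll.lib; ordinarily Win32 calls such as VirtualAllocEx wrap these APIs.

#include <windows.h>
#include <winternl.h>
#include <stdio.h>

/* Assumed prototypes: exported by ntdll.dll, not declared in SDK headers. */
NTSTATUS NTAPI NtAllocateVirtualMemory(HANDLE ProcessHandle, PVOID *BaseAddress,
                                       ULONG_PTR ZeroBits, PSIZE_T RegionSize,
                                       ULONG AllocationType, ULONG Protect);
NTSTATUS NTAPI NtFreeVirtualMemory(HANDLE ProcessHandle, PVOID *BaseAddress,
                                   PSIZE_T RegionSize, ULONG FreeType);

int main(void)
{
    PVOID base = NULL;
    SIZE_T size = 0x10000;

    /* The explicit process handle is what makes cross-process operations
       uniform: the same call works on another process given its handle. */
    NTSTATUS status = NtAllocateVirtualMemory(GetCurrentProcess(), &base, 0,
                                              &size, MEM_RESERVE | MEM_COMMIT,
                                              PAGE_READWRITE);
    printf("status=0x%08lx base=%p size=%Iu\n", (unsigned long)status, base, size);

    size = 0;
    NtFreeVirtualMemory(GetCurrentProcess(), &base, &size, MEM_RELEASE);
    return 0;
}

Note how every call takes an explicit handle rather than an implicit "current process", in contrast to most UNIX system calls.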

36
Major NTOS Facilities
Object Manager / Security Ref Monitor / Registry:
Implement the system namespace; manage objects representing kernel abstractions, data structure references, metadata, configuration, and security.
Components: Object Manager, Security Reference Monitor, Registry, Handle table

Process Manager / Memory Manager / Cache Manager:
Implement address spaces and containers (processes); manage memory resources including allocation and paging/caching of file system data.
Components: Processes, Virtual Memory Manager, Cache Manager

Thread Manager / Kernel Layer:
Implement flow-of-control and synchronization: threads, scheduling, APCs, ISRs, DPCs, IPIs, traps/interrupts, events, mutexes, spinlocks, and lock-free lists.
Components: Process Manager, KE (Scheduler, Synchronization)

37
Major NT Kernel Areas (cont)

I/O:
Implement asynchronous flow and layering for device drivers, file systems, and network sockets, protocols, and interfaces. Provide standard interfaces for talking to devices, file systems, and the network.
Components: I/O Manager, File System Run-time (and also plug-and-play and power management)

IPC / KRPC (Vista):
Implement inter-process communication.
Components: LPC (Local interProcess Communication), KRPC (Kernel-mode RPC)

Glue:
Implement various interfaces for communicating between address spaces, user/kernel modes, and to the hardware itself. Provide run-time library support and executive management functions.
Components: Trap interfaces, Lightweight Procedure Calls, Kernel Run-Time Library, and Hardware Abstraction Layer interfaces

38
Object Manager
Security Reference Monitor
Registry

How NT securely names, references, and


manages OS abstractions and data in a
unified way

39
Kernel Abstractions

Kernels implement abstractions


– Processes, threads, semaphores, files, …
Abstractions implemented as data and code
– Need a way of referencing instances
UNIX uses a variety of mechanisms
– File descriptors, Process IDs, SystemV IPC numbers
NT uses handles extensively
– Provides a unified way of referencing instances of
kernel abstractions
– Objects can also be named (independently of the file
system)

40
NT Object Manager

Provides unified management of:


• kernel data structures
• kernel references
• user references (handles)
• namespace
• synchronization objects
• resource charging
• cross-process sharing
• central ACL-based security reference monitor
• configuration (registry)

41
\ObjectTypes
Object Manager: Directory, SymbolicLink, Type
Processes/Threads: DebugObject, Job, Process, Profile,
Section, Session, Thread, Token
Synchronization:
Event, EventPair, KeyedEvent, Mutant, Semaphore,
ALPC Port, IoCompletion, Timer, TpWorkerFactory
IO: Adapter, Controller, Device, Driver, File, Filter*Port
Kernel Transactions: TmEn, TmRm, TmTm, TmTx
Win32 GUI: Callback, Desktop, WindowStation
System: EtwRegistration, WmiGuid

42
Implementation: Object Methods

Note that the methods are unrelated to actual


operations on the underlying objects:
OPEN: Create/Open/Dup/Inherit handle
CLOSE: Called when each handle closed
DELETE: Called on last dereference
PARSE: Called looking up objects by name
SECURITY: Usually SeDefaultObjectMethod
QUERYNAME: Return object-specific name
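As a rough illustration of the idea, a per-type method table might look like the following C sketch. This is a simplification, not the WRK's actual OBJECT_TYPE_INITIALIZER layout; every name and field here is made up for clarity.

/* Simplified sketch only: per-object-type methods invoked by the Object
   Manager itself, not by callers operating on the underlying object. */
typedef long OB_STATUS;

typedef struct _OBJECT_TYPE_METHODS {
    OB_STATUS (*Open)(void *Object, unsigned DesiredAccess); /* handle create/open/dup/inherit */
    void      (*Close)(void *Object);                        /* every handle close             */
    void      (*Delete)(void *Object);                       /* last reference dropped         */
    OB_STATUS (*Parse)(void *Object, const wchar_t *RemainingName,
                       void *Context, void **FoundObject);   /* name lookup continues here     */
    OB_STATUS (*Security)(void *Object, unsigned Operation); /* usually a default handler      */
    OB_STATUS (*QueryName)(void *Object, wchar_t *Buffer, unsigned Length);
} OBJECT_TYPE_METHODS;

typedef struct _OBJECT_TYPE_SKETCH {
    const wchar_t      *TypeName;   /* e.g. L"Device", L"File", L"Event" */
    OBJECT_TYPE_METHODS Methods;    /* Device objects supply a Parse routine
                                       (IopParseDevice) so lookups can
                                       continue into the I/O system       */
} OBJECT_TYPE_SKETCH;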

43
Naming example
\Global??\C:
[namespace diagram] The root directory "\" contains the directory "Global??", which contains the symbolic link "C:" whose target is \Device\HarddiskVolume1. Following that target, the directory "Device" contains "HarddiskVolume1", a device object implemented by the I/O manager.

44
UNIX namei ~ NT IopParseDevice
NtCreateFile API builds a context descriptor for operation
• Then calls ObOpenObjectByName with (path, context)
Object Manager follows path until it finds a Device Object
• Then calls the object's Parse routine

IopParseDevice(device object, context, remaining path)


– Ask Object Manager to create a File Object
– Send open I/O request with File Object to the I/O stack
– File system fills in File Object and returns
– File Object is put in process handle table
– Handle is returned to the user program

45
Object Manager Parsing example
\Global??\C:\foo\bar.txt
[diagram] The lookup resolves \Global??\C: to a device object implemented by the I/O manager, with remaining path "foo\bar.txt"; deviceobject->ParseRoutine == IopParseDevice
Note: namespace rooted in object manager, not FS


46
I/O Support: IopParseDevice
[flow diagram] NtCreateFile() traps from user to kernel mode and asks the Object Manager to look up the name; the lookup reaches a device object and calls IopParseDevice() with the context. The Security Reference Monitor performs access checks, a File object is created and sent with the open request down the device stack, and the file system fills in the File object. The result returned is a handle to the File object.

47
Why not root namespace in filesys?
A few reasons…
• Hard to add new object types
• Device configuration requires filesys modification
• Root partition needed for each remote client
– End up trying to make a tiny root for each client
– Have to check filesystem very early
Windows uses object manager + registry hives
• Fabricates top-level namespace in kernel
• Uses config information from registry hive
• Only needs to modify hive after system stable
48
Security Reference Monitor

– Single module for access checks


– Implements Security Descriptors, System and
Discretionary ACLs, Privileges, and Tokens
– Collaborates with Local Security Authority
Service to obtain authenticated credentials
– Provides auditing and fulfills other Common
Criteria requirements

49
Vista Kernel Security Changes
Code Integrity (x64) and BitLocker Encryption
• Signature verification of kernel modules
• Drives can be encrypted
Protected Processes
• Secures DRM processes
User Account Control (Allow or Deny?)
Integrity Levels
• Provides a backup for ACLs by limiting write access to
objects (and windows) regardless of permission
• Used by “low-rights” Internet Explorer
50
Handle Table

– NT handles allow user code to reference


kernel data structures (similar, but more
general than UNIX file descriptors)
– NT APIs use explicit handles to refer to
objects (simplifying cross-process operations)
– Handles can be used for synchronization,
including WaitMultiple
– Implementation is highly scalable
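A small user-mode example of the "handles for synchronization" point, using documented Win32 calls (CreateEventW, WaitForMultipleObjects); the two anonymous events exist only for illustration.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Two unnamed auto-reset events; the handles index the process handle table. */
    HANDLE events[2];
    events[0] = CreateEventW(NULL, FALSE, FALSE, NULL);
    events[1] = CreateEventW(NULL, FALSE, FALSE, NULL);

    SetEvent(events[1]);    /* signal the second event */

    /* Wait on both handles at once; returns WAIT_OBJECT_0 + index of the
       signaled object (here, 1). Any waitable object type works the same way. */
    DWORD which = WaitForMultipleObjects(2, events, FALSE, INFINITE);
    printf("signaled index: %lu\n", (unsigned long)(which - WAIT_OBJECT_0));

    CloseHandle(events[0]);
    CloseHandle(events[1]);
    return 0;
}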

51
Object referencing
[diagram] An application passes a handle (or a name) into an NTOS API. The handle table maps the handle to the kernel data for the object; name lookups go through the Object Manager, which performs access checks with the Security Reference Monitor and returns a referenced pointer. The referenced pointer is used until it is dereferenced.

52
Process/Thread structure
[diagram] A handle in any process's handle table references, via the Object Manager, a Process object. The Process object points to the process's own handle table (whose entries reference files, events, devices, drivers, ...), to its virtual address descriptors and Memory Manager structures, and to its threads. A call such as read(handle) from user-mode execution enters the kernel on one of the process's threads.

53
One level: (to 512 handles)
[diagram] The handle table's TableCode points directly at A, a page of 512 handle table entries, each referencing an object.

54
Two levels: (to 512K handles)
[diagram] The TableCode points at B, a page of 1024 pointers to leaf pages; each leaf page (A, C, ...) holds 512 handle table entries referencing objects.

55
Three levels: (to 16M handles)
[diagram] The TableCode points at D, a page of 32 pointers to mid-level pages (B, E, ...); each mid-level page holds 1024 pointers to leaf pages (A, C, F, ...) of 512 handle table entries referencing objects.
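A hypothetical sketch of how a three-level table like the one above could be indexed, using the sizes shown on these slides (512-entry leaves, 1024-pointer mid-level pages, 32-pointer top level) and the common convention that a handle value is four times its table index; this is not the WRK code.

#include <stdint.h>
#include <stddef.h>

#define LEAF_ENTRIES  512     /* handle table entries per leaf page (A, C, F) */
#define MID_POINTERS  1024    /* leaf pointers per mid-level page (B, E)      */

typedef struct _HT_ENTRY { void *Object; uint32_t GrantedAccess; } HT_ENTRY;

/* topLevel is the 32-pointer page D; returns NULL if the handle is not mapped. */
HT_ENTRY *LookupHandle(HT_ENTRY **topLevel[32], uint32_t handleValue)
{
    uint32_t index = handleValue >> 2;                  /* handle = index * 4     */
    uint32_t leafIndex = index % LEAF_ENTRIES;          /* entry within leaf page */
    uint32_t midIndex  = (index / LEAF_ENTRIES) % MID_POINTERS;
    uint32_t topIndex  = index / (LEAF_ENTRIES * MID_POINTERS);   /* 0..31 */

    if (topIndex >= 32 || topLevel[topIndex] == NULL) return NULL;
    HT_ENTRY *leaf = topLevel[topIndex][midIndex];
    if (leaf == NULL) return NULL;
    return &leaf[leafIndex];
}

With these sizes, 32 x 1024 x 512 entries gives the 16M handle maximum on the slide.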

56
Registry (Configuration Manager)

– System configuration metadata organized into


files called hives
– System Hive contains configuration needed to
boot system, and is loaded in memory at boot
– Boot configuration includes file system info,
configuration parameters, and list of drivers to
be loaded by boot process
– Backup (last-known-good) hive recovers from
configuration errors

57
Configuration/State management
Registry vs File system
What registry solved:
• Boot / device configuration
• User accidentally deleting state
• Performance

58
Configuration/State management
Registry vs File system
What went wrong:
• Configuration state obscured
• Doesn't leverage file system technology
– Unfamiliar
– Robustness is separately addressed
• Poor network story and heterogeneous inter-op
• Abuse: used for IPC/synchronization, as a database,
system extension, same state stored multiple times
• Competing configuration stores
• Bad state separation from system
• Have to reinstall apps / file copy installs broken

59
Process Handle Tables
[diagram] Each EPROCESS points, via pHandleTable, to its own handle table, whose entries reference objects. Kernel handles live in a separate handle table owned by the System process.

60
OBJECT_HEADER

PointerCount
HandleCount
pObjectType
oNameInfo oHandleInfo oQuotaInfo Flags
pQuotaBlockCharged
pSecurityDescriptor
CreateInfo + NameInfo + HandleInfo + QuotaInfo

OBJECT BODY
[with optional DISPATCHER_HEADER]

61
DISPATCHER_HEADER
Fundamental kernel synchronization mechanism
Equivalent to a KEVENT at front of dispatcher objects

Object Body → Inserted Size Absolute Type

SignalState

WaitListHead.flink

WaitListHead.blink
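The fields above, as a simplified C sketch roughly along the lines of the Server 2003-era layout; the real kernel packs and unions these fields differently, so treat the struct names here as illustrative.

typedef struct _LIST_ENTRY_SKETCH {
    struct _LIST_ENTRY_SKETCH *Flink, *Blink;
} LIST_ENTRY_SKETCH;

typedef struct _DISPATCHER_HEADER_SKETCH {
    unsigned char      Type;          /* event, semaphore, mutant, thread, timer, ... */
    unsigned char      Absolute;      /* timers: absolute vs relative expiration      */
    unsigned char      Size;          /* object size, in 32-bit words                 */
    unsigned char      Inserted;      /* timers: inserted in the timer list           */
    long               SignalState;   /* > 0 means signaled (a count for semaphores)  */
    LIST_ENTRY_SKETCH  WaitListHead;  /* wait blocks of threads waiting on the object */
} DISPATCHER_HEADER_SKETCH;

Because waiting and signaling operate only on these fields, any object that begins with a dispatcher header can be waited on with the same mechanism.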

62
[diagram] Wait data structures: the KPRCB WaitListHead links waiting threads through their WaitListEntry. Each thread's WaitBlockList chains its wait blocks (via NextWaitBlock), one per object being waited on, and each object's header has a WaitListHead linking the wait blocks of every waiter along with the object's Signaled state. This is the structure used by WaitMultiple.

63
Summary: Object Manager

• Foundation of NT namespace
• Unifies access to kernel data structures
– Outside the filesystem (initialized from registry)
– Unified access control via Security Ref Monitor
– Unified kernel-mode referencing (ref pointers)
– Unified user-mode referencing (via handles)
– Unified synchronization mechanism (events)

64
Process Manager
Memory Manager
Cache Manager

How NT manages programs, libraries,


address spaces, virtual/physical memory,
and memory-mapped files

65
Processes
• An environment for program execution
• Binds
– namespaces
– virtual address mappings
– ports (debug, exceptions)
– threads
– user authentication (token)
– virtual memory data structures
• Abstracts the MMU, not the CPU
66
Process Lifetime
Process created as an empty shell
Address space created with only system DLL and
the main image (unless forked for Posix)
Handle table created empty or populated via
duplication of inheritable handles from parent
Add environment, threads, map executable image
sections (EXE and DLLs)

Process is partially destroyed (“rundown”) on last


thread exit
Process totally destroyed on last dereference
67
System DLL
Core user-mode functionality in the system dynamic link
library (DLL) ntdll.dll

Mapped during process address space setup by the kernel

Contains all core system service entry points

User-mode trampoline points for:


– Process/thread startup
– Exception dispatch
– User APC dispatch
– Kernel-user callouts

68
Process/Thread structure
[diagram, repeated from slide 53] A handle in any process's handle table references, via the Object Manager, a Process object, which points to the process's own handle table (files, events, devices, drivers, ...), its virtual address descriptors and Memory Manager structures, and its threads.

69
Virtual Address Descriptors

• Tree representation of an address space


• Types of VAD nodes
– invalid
– reserved
– committed
– committed to backing store
– app-managed (large pages, AWE, physical)
• Backing store represented by section
objects
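The reserved/committed distinction in the VAD tree is visible from user mode through documented Win32 calls; a minimal sketch follows (the sizes are arbitrary).

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Reserve a 1 MB range: this creates a VAD describing the range, but no
       physical memory or pagefile space is charged yet. */
    SIZE_T reserveSize = 1 << 20;
    void *base = VirtualAlloc(NULL, reserveSize, MEM_RESERVE, PAGE_NOACCESS);
    if (base == NULL) return 1;

    /* Commit just the first 64 KB: that sub-range becomes committed (backed
       by the pagefile) while the rest of the reservation stays reserved. */
    void *committed = VirtualAlloc(base, 64 * 1024, MEM_COMMIT, PAGE_READWRITE);
    printf("reserved at %p, committed at %p\n", base, committed);

    ((char *)committed)[0] = 1;        /* first touch: demand-zero page fault */

    VirtualFree(base, 0, MEM_RELEASE); /* releases the whole reservation      */
    return 0;
}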

70
Shared-Memory Data Structures
[diagram] A File Object handle and a Section Object handle both lead to the Control Area and its Segment, which holds the prototype PTEs; Subsections describe ranges of the mapped file, and the Shared Cache Map supports the cache manager. In each mapping process, a VAD describes the view and the process page directory and page tables point at the shared pages.

71
Physical Frame Management
• Table of PFN data structures
– represent all pageable pages
– synchronize page-ins
– linked to management lists
• Page Tables
– hierarchical index of page directories and tables
– leaf-node is page table entry (PTE)
– PTE states:
• Active/valid
• Transition
• Modified-no-write
• Demand zero
• Page file
• Mapped file

72
Physical Frame State Changes
[diagram] Pages trimmed from a process (or system) working set go to the Modified list if dirty or the Standby list if clean; the Modified Page Writer moves modified pages to the Standby list after writing them. Under low memory, MM repurposes standby pages onto the Free list, and the Zero thread converts free pages into Zero list pages, which supply new demand-zero faults back into working sets.

73
Paging Overview
Working Sets: list of valid pages for each process
(and the kernel)
Pages 'trimmed' from a working set are put on lists
Standby list: pages backed by disk
Modified list: dirty pages to push to disk
Free list: pages not associated with disk
Zero list: supply of demand-zero pages
Modify/standby pages can be faulted back into a
working set w/o disk activity (soft fault)
Background system threads trim working sets,
write modified pages and produce zero pages
based on memory state and config parameters

74
Managing Working Sets
Aging pages: Increment age counts for pages
which haven't been accessed
Estimate unused pages: count in working set and
keep a global count of estimate
When getting tight on memory: replace rather
than add pages when a fault occurs in a working
set with significant unused pages
When memory is tight: reduce (trim) working sets
which are above their maximum
Balance Set Manager: periodically runs Working
Set Trimmer, also swaps out kernel stacks of
long-waiting threads

75
Balance Set Manager
NT's swapping
Balance set = sum of all “inswapped” working sets

Balance Set Manager is a system thread


Wakes up every second. If paging activity high or memory
needed:
– trims working sets of processes
– Pageout kernel stacks of threads in long user wait
– “outswaps” processes with no kernel stacks:
• separate thread does the “outswap” by gradually reducing
target process's working set limit to zero

76
32-bit VA/Memory Management
[diagram] The Working-set Manager maintains each process's working-set list. The VAD tree maps sections: an executable image (copy-on-write), data file views, a SQL database file (including physically-managed SQL memory), and the pagefile. Trimmed pages move to the Standby, Free, and Modified lists, and the Modified Page Writer writes dirty pages back to the pagefile or mapped files.

77
Context-switching Kernel VM

Three regions of kernel VM are switched


– Page tables and page directory self-map
– Hyperspace (working set lists)
– Session space
• Session space
– 'Session' is a terminal services session
– Contains data structures for kernel-level GUI
– Only switched when processes are in different TS sessions
• Switched kernel regions not usually needed in other processes
– Thread attach is used to temporarily context switch when they are
– Savings in KVA is substantial, as these are very large data
structures

78
Vista Memory Management
Memory Management improvements
• Improved prefetch at app launch/swap-in and
resume from hibernation/sleep
• Kernel Address Space dynamically configured
• Support use of flash as write-through cache
(compressed/encrypted)
• Session 0 is now isolated (runs systemwide
services)
• Address Space Randomization (executables
and stacks) for improved virus resistance

79
Cache Manager
Virtual Block Cache on top of Virtual Memory
• Memory-maps files into kernel virtual memory
• Most work is managing how memory and virtual address
space is used, and dealing with size changes in files
• Lazily flushes data in older views to disk
• Easy coherency with user memory-mapped files
• Heavily used by file systems through extensive API set
– Note: NTFS metadata is conveniently all in files
Most Cache manager work:
• Managing use of physical memory and virtual addresses
• Dealing with size changes in files

80
Thread Manager
Kernel Layer

How NT manages and abstracts the CPU to


provide a generalized programming
environment for writing kernel-mode
executive services

81
Threads
Unit of concurrency (abstracts the CPU)
Threads created within processes
System threads created within system process (kernel)
System thread examples:
Dedicated threads
Lazy writer, modified page writer, balance set manager,
mapped page writer, other housekeeping functions
General worker threads
Used to move work out of context of user thread
Must be freed before drivers unload
Sometimes used to avoid kernel stack overflows
Driver worker threads
Extends pool of worker threads for heavy hitters, like file server

82
Thread elements

user-mode
• user-mode stack
• Thread Environment Block (TEB)
– most interesting: thread local storage
kernel-mode
• kernel-mode stack
• KTHREAD: scheduling, synchronization, timers, APCs
• ETHREAD: timestamps, I/O list, exec locks, process link
• thread ID
• impersonation token

83
Kernel Thread Attach
Allows a thread in the kernel to temporarily move to a different
process' address space
• Used heavily in Ps and Mm
– PspProcessDelete attaches before calling ObKillProcess() so
close/delete in proper process context
• Used to access the TEB/PEB
• Changes PsGetCurrentProcess() (effective process)

84
Scheduling
Windows schedules threads, not processes
Scheduling is preemptive, priority-based, and round-robin at the
highest-priority
16 real-time priorities above 16 normal priorities
Scheduler tries to keep a thread on its ideal processor/node to
avoid perf degradation of cache/NUMA-memory
Threads can specify affinity mask to run only on certain processors
Each thread has a current & base priority
Base priority initialized from process
Non-realtime threads have priority boost/decay from base
Boosts for GUI foreground, waking for event
Priority decays, particularly if thread is CPU bound (running at
quantum end)
Scheduler is state-driven by timer, setting thread
priority, thread block/exit, etc
Priority inversions can lead to starvation
balance manager periodically boosts non-running runnable threads
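As a rough illustration of "priority-based, round-robin at the highest priority", here is a hypothetical ready-queue sketch; the real dispatcher (ideal processors, affinity, boosting, per-processor queues) is far more involved, so treat all names here as invented.

#include <stddef.h>

#define NUM_PRIORITIES 32

typedef struct _THREAD_SKETCH {
    struct _THREAD_SKETCH *Next;        /* FIFO link within one ready queue */
    int Priority;                       /* 0..31                            */
} THREAD_SKETCH;

typedef struct _READY_QUEUES {
    unsigned int   ReadySummary;              /* bit i set => queue i non-empty */
    THREAD_SKETCH *Head[NUM_PRIORITIES];      /* one FIFO per priority          */
    THREAD_SKETCH *Tail[NUM_PRIORITIES];
} READY_QUEUES;

/* Pick the head of the highest non-empty queue: preemptive priority
   scheduling, round-robin among the threads at that priority. */
THREAD_SKETCH *PickNextThread(READY_QUEUES *rq)
{
    for (int pri = NUM_PRIORITIES - 1; pri >= 0; pri--) {
        if (rq->ReadySummary & (1u << pri)) {
            THREAD_SKETCH *t = rq->Head[pri];
            rq->Head[pri] = t->Next;
            if (rq->Head[pri] == NULL) {
                rq->Tail[pri] = NULL;
                rq->ReadySummary &= ~(1u << pri);
            }
            return t;
        }
    }
    return NULL;    /* nothing ready: run the idle thread */
}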

85
NT thread priorities
[diagram] Priorities 0-31 in two bands. Dynamic (variable) priorities 1-15 cover the idle, normal, and high classes: idle at 1, the normal band around 8-9, and high/worker threads up to 15; priority 0 is reserved for the zero-page thread. Real-time (fixed) priorities 16-31 run from idle at 16-17 through normal around 24-25 up to critical at 31.

86
CPU Starvation Avoidance
Balance Set Manager
Once per second finds threads that have been ready for several
seconds
Scans up to 16 ready threads per priority level per second
Boosts up to 10 ready threads per second
Boost thread priority to 15 and double quantum
At quantum end immediately restores original priority and quantum
(no decay)
[diagram] Priority-inversion example: a ready priority-5 thread holds a resource, a priority-9 thread is running, and a priority-11 thread is blocked acquiring the resource; without a boost, the priority-5 holder never runs to release it.
87
CPU Control-flow
Thread scheduling occurs at PASSIVE or APC level
(IRQL < 2)
APCs (Asynchronous Procedure Calls) deliver I/O
completions, thread/process termination, etc (IRQL == 1)
Not a general mechanism like unix signals (user-mode code must
explicitly block pending APC delivery)
Interrupt Service Routines run at IRQL > 2
ISRs defer most processing to run at IRQL==2 (DISPATCH
level) by queuing a DPC to their current processor
A pool of worker threads available for kernel components
to run in a normal thread context when user-mode thread
is unavailable or inappropriate
Normal thread scheduling is round-robin among priority
levels, with priority adjustments (except for fixed priority
real-time threads)

88
Asynchronous Procedure Calls
APCs execute routine in thread context
not as general as UNIX signals
user-mode APCs run when blocked & alertable
kernel-mode APCs used extensively: timers,
notifications, swapping stacks, debugging, set thread
ctx, I/O completion, error reporting, creating &
destroying processes & threads, …
APCs generally blocked in critical sections
e.g. don't want thread to exit holding resources

89
Asynchronous Procedure Calls
APCs execute code in context of a particular thread
APCs run only at PASSIVE or APC LEVEL (0 or 1)
Three kinds of APCs
User-mode: deliver notifications, such as I/O done
Kernel-mode: perform O/S work in context of a
process/thread, such as completing IRPs
Special kernel-mode: used for process termination
Multiple 'environments':
Original: The normal process for the thread (ApcState[0])
Attached: The thread as attached (ApcState[1])
Current: The ApcState[ ] as specified by the thread
Insert: The ApcState[ ] as specified by the KAPC block
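A small user-mode demonstration of the "run when blocked & alertable" rule, using documented Win32 calls (QueueUserAPC, SleepEx); the timing values are arbitrary.

#include <windows.h>
#include <stdio.h>

static VOID CALLBACK MyApc(ULONG_PTR arg)
{
    printf("APC delivered, arg=%lu\n", (unsigned long)arg);
}

static DWORD WINAPI Worker(LPVOID unused)
{
    (void)unused;
    /* The APC queued below is not delivered until this thread performs an
       alertable wait: user-mode APC delivery is opt-in, unlike UNIX signals. */
    SleepEx(INFINITE, TRUE);    /* returns WAIT_IO_COMPLETION after the APC runs */
    return 0;
}

int main(void)
{
    HANDLE worker = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    Sleep(100);                              /* let the worker reach SleepEx */
    QueueUserAPC(MyApc, worker, 42);         /* queue a user-mode APC to it  */
    WaitForSingleObject(worker, INFINITE);
    CloseHandle(worker);
    return 0;
}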
90
Deferred Procedure Calls
DPCs run a routine on a particular processor
DPCs are higher priority than threads
common usage is deferred interrupt processing
ISR queues DPC to do bulk of work
• long DPCs harm perf, by blocking threads
• Drivers must be careful to flush DPCs before unloading
also used by scheduler & timers (e.g. at quantum end)
kernel-mode APCs used extensively: timers,
notifications, swapping stacks, debugging, set thread
ctx, I/O completion, error reporting, creating &
destroying processes & threads, …
High-priority routines use IPI (inter-processor intr)
used by MM to flush TLB in other processors
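A kernel-mode (WDM) sketch of the ISR-queues-a-DPC pattern described above; the device extension and routine names are hypothetical, and a real driver would also acknowledge its device and complete IRPs.

#include <ntddk.h>

typedef struct _MY_DEVICE_EXTENSION {
    KDPC InterruptDpc;
    /* ... device registers, pending request state, ... */
} MY_DEVICE_EXTENSION, *PMY_DEVICE_EXTENSION;

/* Runs at DISPATCH_LEVEL on the processor the DPC was queued to. */
VOID MyDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    PMY_DEVICE_EXTENSION ext = (PMY_DEVICE_EXTENSION)Context;
    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);
    /* Bulk of the interrupt work goes here; keep it short, since long
       DPCs block thread scheduling on this processor. */
    UNREFERENCED_PARAMETER(ext);
}

/* Runs at device IRQL: silence the device and defer everything else. */
BOOLEAN MyIsr(PKINTERRUPT Interrupt, PVOID Context)
{
    PMY_DEVICE_EXTENSION ext = (PMY_DEVICE_EXTENSION)Context;
    UNREFERENCED_PARAMETER(Interrupt);
    KeInsertQueueDpc(&ext->InterruptDpc, NULL, NULL);
    return TRUE;
}

/* During device start:
   KeInitializeDpc(&ext->InterruptDpc, MyDpcRoutine, ext);  */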

91
Summary: CPU

• Multiple mechanisms for getting CPU


– Integrated with the I/O system
• Thread is basic unit of scheduling
• Highly preemptive kernel environment
• Real-time scheduling priorities
• Interesting part is locking/scalability

92
I/O

How NT provides a composable data-flow


model for dynamically managing,
implementing, and filtering a completely
asynchronous input/output model

93
I/O Model
• Extensible filter-based I/O model with driver layering
• Standard device models for common device classes
• Support for notifications, tracing, journaling
• Configuration store
• File caching is virtual, based on memory mapping
• Completely asynchronous model (with cancellation)
• Multiple completion models:
– wait on the file handle
– wait on an event handle
– specify a routine to be called on IO completion (APC)
– use completion port
– poll status variable

94
IO Request Packet (IRP)

• IO operations encapsulated in IRPs.


• IO requests travel down a driver stack in an IRP.
• Each driver gets a stack location which contains
parameters for that IO request.
• IRP has major and minor codes to describe IO
operations.
• Major codes include create, read, write, PNP,
devioctl, cleanup and close.
• IRPs are associated with the thread that made the
IO request.

95
IRP Fields
[diagram] An IRP carries: Flags; system buffer / user buffer / MDL chain pointers; a link onto the issuing thread's list of IRPs; completion and cancel information; driver queuing and communication fields; an APC block for final completion; and the array of IRP stack locations, one per driver in the stack.

96
Layering Drivers
Device objects attach one on top of another using
IoAttachDevice* APIs creating device stacks
– IO manager sends IRP to top of the stack
– drivers store next lower device object in their private
data structure
– stack tear down done using IoDetachDevice and
IoDeleteDevice
Device objects point to driver objects
– driver objects represent driver state, including the dispatch table
File objects point to open files
File systems are drivers which manage file objects for
volumes (described by VolumeParameterBlocks)
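A kernel-mode (WDM) sketch of the attach-and-pass-down pattern: a filter device attaches with IoAttachDeviceToDeviceStack, remembers the next lower device object, and forwards IRPs with IoCallDriver. Names and error handling are simplified; this is not drawn from any particular driver.

#include <ntddk.h>

typedef struct _FILTER_EXTENSION {
    PDEVICE_OBJECT LowerDevice;      /* next lower device object in the stack */
} FILTER_EXTENSION, *PFILTER_EXTENSION;

NTSTATUS FilterAttach(PDRIVER_OBJECT DriverObject, PDEVICE_OBJECT TargetDevice)
{
    PDEVICE_OBJECT filterDevice;
    NTSTATUS status = IoCreateDevice(DriverObject, sizeof(FILTER_EXTENSION), NULL,
                                     TargetDevice->DeviceType, 0, FALSE, &filterDevice);
    if (!NT_SUCCESS(status)) return status;

    PFILTER_EXTENSION ext = (PFILTER_EXTENSION)filterDevice->DeviceExtension;
    ext->LowerDevice = IoAttachDeviceToDeviceStack(filterDevice, TargetDevice);
    filterDevice->Flags &= ~DO_DEVICE_INITIALIZING;
    return STATUS_SUCCESS;
}

/* Dispatch routine that does nothing itself: skip our stack location and
   hand the IRP to the next lower driver. */
NTSTATUS FilterPassThrough(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PFILTER_EXTENSION ext = (PFILTER_EXTENSION)DeviceObject->DeviceExtension;
    IoSkipCurrentIrpStackLocation(Irp);
    return IoCallDriver(ext->LowerDevice, Irp);
}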

97
File System Device Stack
[diagram] An application calls through Kernel32/ntdll into the NT I/O Manager. Requests flow down through file system filters to the file system driver (which works with the Cache Manager and Virtual Memory Manager), then through the partition/volume storage manager and disk class driver to the disk driver and the disk itself.

98
NtCreateFile
[diagram] NtCreateFile calls ObOpenObjectByName in the Object Manager; the name lookup reaches the device and IopParseDevice builds an IRP and a File Object. The I/O Manager uses IoCallDriver to pass the IRP down through the FS filter drivers to NTFS, the volume manager, and the disk driver (via the HAL). Result: a File Object filled in by NTFS.

99
IRP flow of control (synchronous)
IOMgr (e.g. IopParseDevice) creates IRP, fills in top
stack location, calls IoCallDriver to pass to stack
driver determined by top device object on device stack
driver passed the device object and IRP
IoCallDriver
copies stack location for next driver
driver routine determined by major function in drvobj
Each driver in turn
does work on IRP, if desired
keeps track in the device object of the next stack device
Calls IoCallDriver on next device
Eventually bottom driver completes IO and returns on callstack

100
IRP flow of control (asynchronous)
Eventually a driver decides to be asynchronous
driver queues IRP for further processing
driver returns STATUS_PENDING up call stack
higher drivers may return all the way to user, or may
wait for IO to complete (synchronizing the stack)
Eventually a driver decides IO is complete
usually due to an interrupt/DPC completing IO
each completion routine in device stack is called,
possibly at DPC or in arbitrary thread context
IRP turned into APC request delivered to original thread
APC runs final completion, accessing process memory

101
Programming Asynchronous I/O
• Applications can issue asynchronous IO requests to files
opened with FILE_FLAG_OVERLAPPED by passing
an LPOVERLAPPED parameter to the IO API (e.g.,
ReadFile(…))
• Five methods available to wait for IO completion,
– Wait on the file handle
– Wait on an event handle passed in the overlapped
structure (e.g., GetOverlappedResult(…))
– Specify a routine to be called on IO completion
– Use completion ports
– Poll status variable
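A user-mode sketch of the event-in-OVERLAPPED method from the list above, using documented Win32 calls; the file name is made up for illustration.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hFile = CreateFileW(L"example.dat", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    OVERLAPPED ov = {0};
    ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);   /* manual-reset event */

    char buffer[4096];
    DWORD bytesRead = 0;
    /* The read returns immediately; if it is still in flight the call fails
       with ERROR_IO_PENDING rather than blocking. */
    if (!ReadFile(hFile, buffer, sizeof(buffer), NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING)
        return 1;                                        /* a real failure */

    /* ... do other work while the I/O proceeds asynchronously ... */

    GetOverlappedResult(hFile, &ov, &bytesRead, TRUE);   /* wait and fetch count */
    printf("read %lu bytes\n", (unsigned long)bytesRead);

    CloseHandle(ov.hEvent);
    CloseHandle(hFile);
    return 0;
}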

102
I/O Completions
• Five methods to receive notification of completion for
asynchronous I/O:
– poll status variable
– wait for the file handle to be signalled
– wait for an explicitly passed event to be signalled
– specify a routine to be called on IO completion (APC)
– use an I/O completion port
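A minimal user-mode sketch of the completion-port method, using documented Win32 calls (CreateIoCompletionPort, GetQueuedCompletionStatus). Here a completion is posted by hand with PostQueuedCompletionStatus rather than by a real device, just to show the dequeue side.

#include <windows.h>
#include <stdio.h>

static DWORD WINAPI CompletionWorker(LPVOID portArg)
{
    HANDLE port = (HANDLE)portArg;
    DWORD bytes = 0;
    ULONG_PTR key = 0;
    LPOVERLAPPED ov = NULL;

    /* Block until one completion (real or posted) is available on the port. */
    if (GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE))
        printf("completion: key=%llu bytes=%lu\n",
               (unsigned long long)key, (unsigned long)bytes);
    return 0;
}

int main(void)
{
    /* Create the port; real file handles opened with FILE_FLAG_OVERLAPPED are
       associated by calling CreateIoCompletionPort(fileHandle, port, key, 0). */
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    HANDLE worker = CreateThread(NULL, 0, CompletionWorker, port, 0, NULL);

    PostQueuedCompletionStatus(port, 0, 1234, NULL);   /* hand-posted completion */

    WaitForSingleObject(worker, INFINITE);
    CloseHandle(worker);
    CloseHandle(port);
    return 0;
}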

103
I/O Completion
[diagram] With normal completion, each thread issues its own I/O request and handles the completion itself when it finishes. With I/O completion ports, many threads' requests complete to a single port, and a small pool of worker threads dequeues and processes the completions.
104
Summary: I/O

• Layered, extensible I/O model


• Highly asynchronous
• NTFS uses unification principles extensively

• Plug-and-Play
• Power management

105
Questions

106
