IRP Manual

OSR OPEN SYSTEMS RESOURCES, INC.


105 Route 101A, Suite 19
Amherst, New Hampshire 03031-2277
(603) 595-6500 ♦ FAX: (603) 595-6503
© 2001 OSR Open Systems Resources, Inc.

All rights reserved. No part of this work covered by the copyright hereon may be
reproduced or used in any form or by any means -- graphic, electronic, or mechanical,
including photocopying, recording, taping, or information storage and retrieval systems --
without written permission of OSR Open Systems Resources, Inc., 105 Route 101A, Suite 19,
Amherst, New Hampshire 03031, (603) 595-6500.

OSR, the traditional OSR Logo, the new OSR logo, “OSR Open Systems Resources,
Inc.”, and “The NT Insider” are trademarks of OSR Open Systems Resources, Inc. All
other trademarks mentioned herein are the property of their owners.

Printed in the United States of America

Version PR261-01

LIMITED WARRANTY

OSR Open Systems Resources, Inc. (OSR) expressly disclaims any warranty for the material
presented herein. This material is presented “as is” without warranty of any kind, either express or
implied, including, without limitation, the implied warranties of merchantability or fitness for a
particular purpose. The entire risk arising from the use of this material remains with you. OSR’s
entire liability and your exclusive remedy shall not exceed the price paid for this material. In no
event shall OSR or its suppliers be liable for any damages whatsoever (including, without
limitation, damages for loss of business profit, business interruption, loss of business information,
or any other pecuniary loss) arising out of the use or inability to use this information, even if OSR
has been advised of the possibility of such damages. Because some states/jurisdictions do not
allow the exclusion or limitation of liability for consequential or incidental damages, the above
limitation may not apply to you.

U.S. GOVERNMENT RESTRICTED RIGHTS

This material is provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by
the Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the
Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 or
subparagraphs (c)(1) and (2) of the Commercial Computer Software--Restricted Rights
clause at 48 CFR 52.227-19, as applicable. Manufacturer is OSR Open Systems Resources, Inc.,
Amherst, New Hampshire 03031.

Contents provided by OSR Open Systems Resources, Inc.
1 FILE SYSTEMS ON NT
1.1 DYNAMIC FILE SYSTEMS LOADING
1.2 MEDIA CHANGE
1.3 FAST I/O
2 NT I/O COMPONENTS
2.1 DATA STRUCTURES
2.1.1 File Object
2.1.2 IRP
2.1.3 I/O Stack Location
2.2 I/O OPERATIONS
2.2.1 CREATE
2.2.2 CLOSE
2.2.3 WRITE
2.2.4 READ
2.2.5 SET/QUERY_INFORMATION
2.2.6 SET/QUERY_EA
2.2.7 SET/QUERY_VOLUME_INFORMATION
2.2.8 SET/QUERY_QUOTA
2.2.9 DIRECTORY_CONTROL
2.2.10 CLEANUP
2.2.11 QUERY/SET_SECURITY
2.2.12 FILE_SYSTEM_CONTROL
2.2.13 Sample Rename Code
2.2.14 Sample IRP Code
3 CACHE MANAGER RUNTIME
3.1 CACHE MANAGER OVERVIEW
3.2 CACHE MANAGER DATA STRUCTURES
3.2.1 Buffer Control Block (BCB)
3.2.2 File Size Information
3.3 CACHE MANAGER CALLBACKS
3.4 CCCANIWRITE
3.5 CCCOPYREAD
3.6 CCCOPYWRITE
3.7 CCDEFERWRITE
3.8 CCGETDIRTYPAGES
3.9 CCGETFILEOBJECTFROMBCB
3.10 CCGETFILEOBJECTFROMSECTIONPTRS
3.11 CCGETLSNFORFILEOBJECT
3.12 CCFASTCOPYREAD
3.13 CCFASTCOPYWRITE
3.14 CCFLUSHCACHE
3.15 CCINITIALIZECACHEMAP
3.16 CCISTHEREDIRTYDATA
3.17 CCMAPDATA
3.18 CCMDLREAD
3.19 CCMDLREADCOMPLETE
3.20 CCMDLWRITECOMPLETE
3.21 CCPINMAPPEDDATA
3.22 CCPINREAD
3.23 CCPREPAREMDLWRITE
3.24 CCPREPAREPINWRITE
3.25 CCPURGECACHESECTION
3.26 CCREPINBCB
3.27 CCSETADDITIONALCACHEATTRIBUTES
3.28 CCSETBCBOWNERPOINTER
3.29 CCSETDIRTYPAGETHRESHOLD
3.30 CCSETDIRTYPINNEDDATA
3.31 CCSETFILESIZES
3.32 CCSETLOGHANDLEFORFILE
3.33 CCSETREADAHEADGRANULARITY
3.34 CCUNINITIALIZECACHEMAP
3.35 CCUNPINDATA
3.36 CCUNPINDATAFORTHREAD
3.37 CCUNPINREPINNEDBCB
3.38 CCZERODATA
3.39 CCZEROENDOFLASTPAGE
4 FILE SYSTEM RUNTIME LIBRARY
4.1 BYTE RANGE LOCKS
4.1.1 FILE_LOCK_INFO
4.1.2 PCOMPLETE_LOCK_IRP_ROUTINE
4.1.3 PUNLOCK_ROUTINE
4.1.4 FILE_LOCK
4.1.5 FsRtlInitializeFileLock
4.1.6 FsRtlUninitializeFileLock
4.1.7 FsRtlProcessFileLock
4.1.8 FsRtlCheckLockForReadAccess
4.1.9 FsRtlCheckLockForWriteAccess
4.1.10 FsRtlFastCheckLockForRead
4.1.11 FsRtlFastCheckLockForWrite
4.1.12 FsRtlGetNextFileLock
4.1.13 FsRtlFastUnlockSingle
4.1.14 FsRtlFastUnlockAll
4.1.15 FsRtlFastUnlockAllByKey
4.1.16 FsRtlPrivateLock
4.1.17 FsRtlFastLock
4.2 DIRECTORY CHANGE NOTIFICATION
4.2.1 PCHECK_FOR_TRAVERSE_ACCESS
4.2.2 FsRtlNotifyInitializeSync
4.2.3 FsRtlNotifyUninitializeSync
4.2.4 FsRtlNotifyFullChangeDirectory
4.2.5 FsRtlNotifyFullReportChange
4.2.6 FsRtlNotifyCleanup
4.3 I/O SUPPORT
4.3.1 FsRtlCopyRead
4.3.2 FsRtlCopyWrite
4.3.3 FsRtlMdlReadCompleteDev
4.3.4 FsRtlMdlWriteCompleteDev
4.4 MISCELLANEOUS ROUTINES
4.4.1 FsRtlIsTotalDeviceFailure
4.4.2 FsRtlBalanceReads
4.5 NAME SUPPORT
4.5.1 FsRtlIsUnicodeCharacterWild
4.5.2 FsRtlDissectName
4.5.3 FsRtlDoesNameContainWildCards
4.5.4 FsRtlAreNamesEqual
4.5.5 FsRtlIsNameInExpression
4.6 TUNNEL (NAME) CACHE
4.6.1 FsRtlInitializeTunnelCache
4.6.2 FsRtlAddToTunnelCache
4.6.3 FsRtlFindInTunnelCache
4.6.4 FsRtlDeleteKeyFromTunnelCache
4.6.5 FsRtlDeleteTunnelCache
4.7 UNC SUPPORT
4.7.1 FsRtlRegisterUncProvider
4.7.2 FsRtlDeregisterUncProvider
5 KERNEL RUNTIME
5.1 KERNEL QUEUES
5.1.1 KQUEUE
5.1.2 KeInitializeQueue
5.1.3 KeReadStateQueue
5.1.4 KeInsertQueue
5.1.5 KeRemoveQueue
5.1.6 KeRundownQueue
5.2 PROCESS CONTEXT MANAGEMENT
5.2.1 KeAttachProcess
5.2.2 KeDetachProcess
6 I/O MANAGER RUNTIME
6.1 IOASYNCHRONOUSPAGEWRITE
6.2 IOATTACHDEVICETODEVICESTACK
6.3 IOCHECKEABUFFERVALIDITY
6.4 IOGETBASEFSDEVICEOBJECT
6.5 IOGETREQUESTORPROCESS
6.6 IOISSYSTEMTHREAD
6.7 IOPAGEREAD
6.8 IOQUERYFILEINFORMATION
6.9 IOQUERYVOLUMEINFORMATION
6.10 IOREGISTERFILESYSTEM
6.11 IOREGISTERFSREGISTRATIONCHANGE
6.12 IOSYNCHRONOUSPAGEWRITE
6.13 IOTHREADTOPROCESS
6.14 IOUNREGISTERFILESYSTEM
6.15 IOSETINFORMATION
7 MEMORY MANAGER RUNTIME
7.1 MMCANFILEBETRUNCATED
7.2 MMISRECURSIVEIOFAULT
7.3 MMFORCESECTIONCLOSED
7.4 MMFLUSHIMAGESECTION
7.5 MMSETADDRESSRANGEMODIFIED
8 NT NATIVE API
8.1 DATA STRUCTURES
8.1.1 TOKEN_INFORMATION_CLASS
8.2 NTADJUSTPRIVILEGESTOKEN
8.3 NTDUPLICATEOBJECT
8.4 NTDUPLICATETOKEN
8.5 NTOPENPROCESSTOKEN
8.6 NTQUERYINFORMATIONTOKEN
9 OBJECT MANAGER RUNTIME
9.1 DATA STRUCTURES
9.1.1 OB_DUMP_METHOD
9.1.2 OB_OPEN_REASON
9.1.3 OB_OPEN_METHOD
9.1.4 OB_CLOSE_METHOD
9.1.5 OB_DELETE_METHOD
9.1.6 OB_PARSE_METHOD
9.1.7 SECURITY_OPERATION_CODE
9.1.8 OB_SECURITY_METHOD
9.1.9 OB_QUERYNAME_METHOD
9.1.10 OBJECT_TYPE_INITIALIZER
9.2 OBCREATEOBJECT
9.3 OBGETOBJECTPOINTERCOUNT
9.4 OBINSERTOBJECT
9.5 OBOPENOBJECTBYPOINTER
9.6 OBQUERYNAMESTRING
10 RUNTIME LIBRARY
10.1 MEMORY ACCESS
10.1.1 ProbeForRead
10.1.2 ProbeForWrite
10.2 BITMAP ROUTINES
10.2.1 RtlClearAllBits
10.2.2 RtlSetAllBits
10.2.3 RtlFindClearBits
10.2.4 RtlFindSetBits
10.2.5 RtlFindClearBitsAndSet
10.2.6 RtlFindSetBitsAndClear
10.2.7 RtlClearBits
10.2.8 RtlSetBits
10.2.9 RtlFindLongestRunClear
10.2.10 RtlFindLongestRunSet
10.2.11 RtlFindFirstRunClear
10.2.12 RtlFindFirstRunSet
10.2.13 RtlNumberOfClearBits
10.2.14 RtlNumberOfSetBits
10.2.15 RtlAreBitsClear
10.2.16 RtlAreBitsSet
10.3 PREFIX CACHE
10.4 SPLAY TREE
10.5 GENERIC TABLE
10.6 SHORT NAME SUPPORT
10.6.1 GENERATE_NAME_CONTEXT
10.6.2 RtlGenerate8dot3Name
11 SECURITY REFERENCE MONITOR
11.1 DATA STRUCTURES
11.1.1 SECURITY_DESCRIPTOR
11.1.2 SECURITY_SUBJECT_CONTEXT
11.2 SEACCESSCHECK
11.3 SEAPPENDPRIVILEGES
11.4 SEASSIGNSECURITY
11.5 SEAUDITINGFILEEVENTS
11.6 SEAUDITINGFILEORGLOBALEVENTS
11.7 SECAPTURESUBJECTSECURITYCONTEXT
11.8 SELOCKSUBJECTCONTEXT
11.9 SEMARKLOGONSESSIONFORTERMINATIONNOTIFICATION
11.10 SEOPENOBJECTAUDITALARM
11.11 SEOPENOBJECTFORDELETEAUDITALARM
11.12 SEQUERYSECURITYDESCRIPTORINFO
11.13 SEREGISTERLOGONSESSIONTERMINATEDROUTINE
11.14 SERELEASESUBJECTSECURITYCONTEXT
11.15 SESETACCESSSTATEGENERICMAPPING
11.16 SESETSECURITYDESCRIPTORINFO
11.17 SEUNLOCKSUBJECTCONTEXT
11.18 SEUNREGISTERLOGONSESSIONTERMINATEDROUTINE
12 SUPPLEMENTARY READING LIST

1 File Systems on NT
In this section we turn our attention to characteristics and operations of file systems in Windows NT.
This section provides a basic tutorial on issues with respect to file systems in NT. Specific discussions
about how those issues impacted the project are in later sections of this document.

Windows NT 4.0 BL 1381 ships with three physical file systems: FAT, NTFS, and CDFS. Support for
HPFS has been removed from Windows NT V4.0. Each of these file systems has specific
characteristics, advantages, and disadvantages.

For instance, FAT’s principal advantage is that it allows dual booting between Windows NT and
other operating systems such as DOS, OS/2, and Windows 95. FAT provides a simple, efficient
scheme for managing logical partitions up to 2GB in size. Further, there is performance data
indicating that for certain types of applications FAT is faster than NTFS, so it is sometimes
chosen for its performance characteristics.

NTFS, however, supports Unicode names (which allow full native internationalization without
resorting to tricks such as “code pages”), provides discretionary access controls on files and
directories, and uses logging (journaling) techniques to ensure rapid recovery from a system crash.

CDFS provides support for the CD-ROM file system format, ISO 9660, as well as Microsoft’s own
CD-ROM extensions for long file names. Practically speaking, CDFS only works with removable
CD-ROM media: it does not support writing data, and hence is not suitable for use with WORM
media.

1.1 Dynamic File Systems Loading


Very few Windows NT systems require all file system drivers to be loaded; many require only one
file system driver. In order to minimize the memory demands on the system, file system
drivers are loaded on demand rather than at boot time.

To manage this demand-based loading, NT uses file system recognizers. In NT V4.0 BL 1381, there is
one file system recognizer which handles all recognition for the three NT native file systems. It does
this by creating five device objects and registering each as a file system.1 Then, when new media is
accessed, all registered file systems are called via an IRP_MJ_FILE_SYSTEM_CONTROL
operation with a request to mount the volume (IRP_MN_MOUNT_VOLUME). This request
reaches the file system recognizer, which runs code that looks for the signature of one particular
file system type. If the signature on the media matches that file system type, a special status
code (STATUS_FS_DRIVER_REQUIRED) is returned to the I/O manager. This results in a call
back to the same device object with a different minor request code,
IRP_MN_LOAD_FILE_SYSTEM. The recognizer then uses the undocumented ZwLoadDriver(...)
call to load the particular file system driver.
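
The mount-then-load handshake described above can be modelled in a few lines of ordinary C. This is
only a sketch under stated assumptions: every Sk-prefixed name is hypothetical, the status and minor
codes are stand-ins rather than the real NT values, and the "signature" is a plain string compare
rather than a real boot sector check.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the status and minor codes named in the text. */
typedef enum {
    SK_STATUS_SUCCESS,
    SK_STATUS_UNRECOGNIZED_VOLUME,
    SK_STATUS_FS_DRIVER_REQUIRED
} SkStatus;

typedef enum { SK_MN_MOUNT_VOLUME, SK_MN_LOAD_FILE_SYSTEM } SkMinor;

/* One recognizer device object: it knows a single on-media signature. */
typedef struct {
    const char *signature;    /* illustrative tag, e.g. "NTFS" */
    int         driverLoaded; /* set once the real FSD has been loaded */
} SkRecognizer;

/* Handles the two minor codes a recognizer device object receives. */
static SkStatus SkRecognizerFsControl(SkRecognizer *r, SkMinor minor,
                                      const char *media)
{
    if (minor == SK_MN_MOUNT_VOLUME) {
        /* A match asks the I/O manager to call back with a load request. */
        if (strncmp(media, r->signature, strlen(r->signature)) == 0)
            return SK_STATUS_FS_DRIVER_REQUIRED;
        return SK_STATUS_UNRECOGNIZED_VOLUME;
    }
    /* SK_MN_LOAD_FILE_SYSTEM: a real recognizer would load the driver here. */
    r->driverLoaded = 1;
    return SK_STATUS_SUCCESS;
}

/* Models the I/O manager polling each registered recognizer in turn.
 * Returns the index of the recognizer whose driver was loaded, or -1. */
static int SkMountVolume(SkRecognizer *regs, size_t count, const char *media)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (SkRecognizerFsControl(&regs[i], SK_MN_MOUNT_VOLUME, media) ==
            SK_STATUS_FS_DRIVER_REQUIRED) {
            SkRecognizerFsControl(&regs[i], SK_MN_LOAD_FILE_SYSTEM, media);
            return (int)i;
        }
    }
    return -1;
}
```

Calling SkMountVolume with an array of recognizers and a media tag returns the index of the first
recognizer that claimed the media, after that recognizer has been asked to load its driver.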

This technique is extremely flexible, as additional file system recognizers can be added to Windows
NT using the same approach. Thus, non-Microsoft provided file systems can be implemented using
exactly the same technique.

It is interesting to note that, once loaded, file system drivers are not unloaded.

1
Strictly speaking, however, NT will not register the CDFS recognizer device object if no CD-ROM is present on the
system itself.

1.2 Media Change
Windows NT supports removable media devices. Because of this, NT does not directly associate a
particular file system with a particular drive letter or physical media device: when media is
changed, the file system format on the new media is not necessarily the same as it was on the
previous media in that drive.

For example, an optical disk formatted with the FAT file system could be inserted. That media could
then be removed and replaced with an NTFS formatted media unit.

[Diagram not reproduced. Recoverable labels: I/O Manager; FS Driver; Disk Driver; two unnamed file
system device objects; \Device\CdRom0; \Device\HardDisk0\Partition1; two Volume Parameter Blocks.]

Figure 1: Mapping Devices to File Systems

The NT I/O Manager provides a specific data structure, the volume parameter block (VPB), for
tracking exactly this information. Further, the names registered for drive letters in the Object
Manager correspond to the media device, not the file system device.

Thus, suppose a Win32 application program attempts to open “c:\autoexec.nt”. This call is modified
by Win32 to be “\DosDevices\C:\autoexec.nt” which is handed to the I/O Manager. In turn, the I/O
Manager calls the Object Manager, which traverses the name until it finds a portion of the name
which represents something in its name space. In this case, “\DosDevices\C:” represents a symbolic
link (a logical name) and the name is reconstructed with the symbolic link replaced by its target.
The resultant name would be “\Device\HardDisk0\Partition1\autoexec.nt”.

This name is then re-parsed and the Object Manager recognizes “\Device\HardDisk0\Partition1” as a
device object. Name parsing halts at this point, and the Object Manager returns that device object
to the I/O Manager, along with the portion of the name which was not parsed (“\autoexec.nt”).

The I/O Manager in turn takes this device object and notes that it is a mass storage device. Because
of this, it consults the volume parameter block which is associated with that mass storage device.
The volume parameter block in turn points to two things: the physical media device object (that is,
the device object for “\Device\HardDisk0\Partition1” in our example) and the file system device
object being used for that particular instance. This last portion cannot be stressed enough.

Each unnamed file system instance is associated with exactly one physical media device
object.
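
As an illustration of the last two steps (stopping the parse at a device object, then hopping through
the volume parameter block to the file system instance), consider the following sketch. It is a
user-mode model, not NT code: the Sk-prefixed types and functions are hypothetical, the symbolic link
substitution is assumed to have already happened, and the real VPB carries more fields than the two
shown.

```c
#include <stddef.h>
#include <string.h>

typedef struct SkDevice SkDevice;

/* Reduced volume parameter block: ties a media device to the unnamed
 * file system device object mounted on it. */
typedef struct {
    SkDevice *realDevice;       /* physical media device object */
    SkDevice *fileSystemDevice; /* unnamed FS instance, NULL if unmounted */
} SkVpb;

struct SkDevice {
    const char *name; /* object manager name; NULL for unnamed FS devices */
    SkVpb      *vpb;  /* non-NULL only for mass storage devices */
};

/* Finds the device whose name is a leading component of 'path' and hands
 * back the unparsed remainder through 'rest'. Returns NULL on no match. */
static SkDevice *SkParseName(SkDevice *devs, size_t count,
                             const char *path, const char **rest)
{
    size_t i;
    for (i = 0; i < count; i++) {
        size_t n;
        if (devs[i].name == NULL)
            continue; /* unnamed devices cannot be reached by name */
        n = strlen(devs[i].name);
        if (strncmp(path, devs[i].name, n) == 0 &&
            (path[n] == '\\' || path[n] == '\0')) {
            *rest = path + n;
            return &devs[i];
        }
    }
    return NULL;
}

/* Picks the device that should receive the request: the mounted file
 * system instance when the named device is a mounted mass storage
 * device, otherwise the named device itself. */
static SkDevice *SkTargetDevice(SkDevice *named)
{
    if (named->vpb != NULL && named->vpb->fileSystemDevice != NULL)
        return named->vpb->fileSystemDevice;
    return named;
}
```

The point of the model is that the file system device never appears in the name table; it is only
reachable through the VPB of the media device that was found by name.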

While we have not described it here, the I/O Manager calls file systems as needed to create and delete
these unnamed file system instance objects.2

1.3 Fast I/O


While the I/O Request Packet model is extremely flexible and very general in the Windows NT
environment, for certain types of file system I/O the cost of building and processing the IRP can
be a significant portion of the total overhead of the request.

Windows NT utilizes the cache manager for several fundamental reasons:

• To provide support for memory mapped files which need to be stored in the virtual memory
subsystem. Further, by using the VM system (the NT “memory manager”) exclusively, both standard
read/write calls and memory mapped I/O calls can be used on the same data.

• To allow for dynamic caching of data, so that when data is being frequently accessed it is stored
in memory rather than on disk.

• To provide write-back caching. This substantially improves the appearance of performance for most
applications with minimal risk.3

Because Windows NT caches much of the information in the file system, it is often possible to satisfy
a request without calling a lower level disk driver. Whenever that is the case, building and
managing an IRP adds substantial overhead to the I/O path without any benefit.
Thus, as a performance-optimized path, Windows NT provides support for fast I/O routines.

This set of procedure call entry points is registered by the device driver (and hence is a property
of the driver, not of the file system instance) and can be called at any point in time.

The first entry point, FastIoCheckIfPossible, is called in order to determine quickly if fast I/O is
even possible for a particular file. If it is possible, a subsequent call to one of the other
routines will be made; otherwise the I/O manager will create an I/O request packet as appropriate.

2
This is alluded to in the DDK documentation, Kernel Mode Drivers, Reference, Kernel Mode Support Routines (Part
1), Chapter 4 I/O Manager Routines, IoCreateDevice. This documentation discusses the fact that a name for a
device object is necessary to support I/O operations, except, as it notes: “An unnamed device object is invisible to
other drivers except, possibly, FSDs and to user-mode protected subsystems because a symbolic link cannot be set
up for an unnamed device object. Consequently, higher-level drivers cannot attach their device objects to an
unnamed device object nor can the unnamed object be the target of an IRP.” Further, that same document also
notes: “An unnamed device object is visible only to the driver that created it or to an FSD through a volume
parameter block (VPB).”
3
Note, however, that “minimal risk” still means that data is not necessarily on disk when a write
completes. Typically it takes a few seconds for the I/O system to “catch up” although this time
interval is not guaranteed and is dependent on system activity.

The following code fragment provides the current description of the Fast I/O vector:

typedef struct _FAST_IO_DISPATCH {
    ULONG SizeOfFastIoDispatch;
    PFAST_IO_CHECK_IF_POSSIBLE FastIoCheckIfPossible;
    PFAST_IO_READ FastIoRead;
    PFAST_IO_WRITE FastIoWrite;
    PFAST_IO_QUERY_BASIC_INFO FastIoQueryBasicInfo;
    PFAST_IO_QUERY_STANDARD_INFO FastIoQueryStandardInfo;
    PFAST_IO_LOCK FastIoLock;
    PFAST_IO_UNLOCK_SINGLE FastIoUnlockSingle;
    PFAST_IO_UNLOCK_ALL FastIoUnlockAll;
    PFAST_IO_UNLOCK_ALL_BY_KEY FastIoUnlockAllByKey;
    PFAST_IO_DEVICE_CONTROL FastIoDeviceControl;
    PFAST_IO_ACQUIRE_FILE AcquireFileForNtCreateSection;
    PFAST_IO_RELEASE_FILE ReleaseFileForNtCreateSection;
    PFAST_IO_DETACH_DEVICE FastIoDetachDevice;
} FAST_IO_DISPATCH, *PFAST_IO_DISPATCH;

Once registered, each of these routines may be called by the I/O manager at will. Thus, it is not
possible to “remove” this vector after the fact. Further, if another device object subsequently
attaches to the file system device, the fast I/O entry points of the attaching driver will be
called, rather than those of the underlying file system instance.

Except for the last three calls, all of these routines return either TRUE or FALSE to indicate if the
Fast I/O operation was completed. If FALSE is returned, the I/O manager will then create a regular
IRP and send that on to the file system in question.
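
The TRUE/FALSE convention can be sketched as follows. This is a user-mode model with hypothetical
Sk-prefixed names and a table reduced to two entries; the real entry points take many more
parameters. It shows only the decision the text describes: try the fast path, and fall back to
building an IRP whenever a fast I/O routine is absent or declines the request.

```c
#include <stddef.h>
#include <string.h>

typedef int SkBool;

/* Hypothetical open file: 'cached' is non-NULL when its data is cached. */
typedef struct {
    const char *cached;
    size_t      cachedLen;
} SkFile;

/* Reduced fast I/O table holding only the two entry points used below. */
typedef struct {
    SkBool (*CheckIfPossible)(const SkFile *file);
    SkBool (*Read)(const SkFile *file, void *buf, size_t len);
} SkFastIoDispatch;

typedef enum { SK_PATH_FAST, SK_PATH_IRP } SkPath;

/* Example driver routine: fast I/O is possible only for cached files. */
static SkBool SkCheckIfPossible(const SkFile *file)
{
    return file->cached != NULL;
}

/* Example driver routine: declines (returns 0) if the cache cannot
 * satisfy the whole request. */
static SkBool SkFastRead(const SkFile *file, void *buf, size_t len)
{
    if (len > file->cachedLen)
        return 0;
    memcpy(buf, file->cached, len);
    return 1;
}

/* Models the caller: report which path ended up servicing the read. */
static SkPath SkRead(const SkFastIoDispatch *fast, const SkFile *file,
                     void *buf, size_t len)
{
    if (fast != NULL && fast->CheckIfPossible != NULL && fast->Read != NULL &&
        fast->CheckIfPossible(file) && fast->Read(file, buf, len))
        return SK_PATH_FAST;
    /* A real I/O manager would allocate and send an IRP at this point. */
    return SK_PATH_IRP;
}
```

Note that a FALSE return from the read routine itself, not just from the check routine, also sends
the request down the IRP path.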

The two entry points AcquireFileForNtCreateSection and ReleaseFileForNtCreateSection are only
used by the HPFS file system. No other file system uses these; they are used to allow specialized
locking functionality for memory mapped files.

The entry point FastIoDetachDevice is there for supporting filter drivers. Immediately before deleting
a device object, the I/O manager calls this entry point for any attached devices (e.g., the filter driver) in
order to allow it to clean up and detach itself prior to return. Failure to properly detach causes a
system halt when the system attempts to delete a device object with a non-zero reference count.

With the increasing maturity of Windows NT, uses for the Fast I/O dispatch table have continued to
grow. BL 1381 of Windows NT adds (ready?) no fewer than FOURTEEN new Fast I/O entry points.
These new entry points, described in the Filter Driver source code, are:

• QueryNetworkOpenInfo
• MdlRead
• MdlReadComplete
• PrepareMdlWrite
• MdlWriteComplete
• ReadCompressed
• WriteCompressed
• MdlReadCompleteCompressed
• MdlWriteCompleteCompressed
• QueryOpen
• AcquireForModWrite
• ReleaseForModWrite
• AcquireForCcFlush
• ReleaseForCcFlush

It’s interesting to note that several of these new entry points are not used as of BL 1381, even by native
Microsoft file systems. Also, as of BL 1381, several of these entry points do not appear to actually be
called by the operating system, even when a file system registers support for them.
The Fast I/O routines are only dimly alluded to in the available Microsoft documentation.

2 NT I/O Components
In this section we describe the various components of Windows NT and their interactions.

2.1 Data Structures


In this section we describe the various individual data structures used within the FSD path of the
I/O Manager for Windows NT. While these data structures are described in some detail within the
NT DDK, our focus in this section is how these data structures are used and what the information
within them actually represents.

2.1.1 File Object


The File Object is the fundamental unit used by Windows NT to describe an open file instance.
The critical point here is that there may be multiple open instances of a single file and a single
open instance may even be referenced by multiple user handles. A graphical description of the
File Object and the associated data structures important to file systems is shown in Figure 2.

A handle to a File Object is only valid within the context of the calling process. Handles represent
entries in a per-process index table. The I/O Manager translates handle-based calls (such as
NtCreateFile) into object-based calls prior to transferring control to the underlying FSD (or file
system filter driver.) The File Object itself is valid within any process context and hence can be
used by a kernel-mode driver in a worker thread.

There are a number of fields in the FILE_OBJECT of interest:

• The DeviceObject field represents one of the following:

♦ The physical media device object (disk device) if this is a local on-disk file
system

♦ The named device object if this is a network file system or specialized file
system

• The Vpb field is meaningful only for files stored on local disk. In that case, it points to the
Device Object’s volume parameter block.

• The FsContext pointer is used exclusively by the FSD. The balance of the OS assumes
that this points to a common “File Control Block” (FCB) header. The critical attribute of
this pointer is that it uniquely identifies the file. That is, two File Objects reference the
same file if their FsContext pointers are equal.

• The RelatedFileObject field is used to indicate that the given file has been opened
relative to an already open file. Typically, this indicates the related file is a directory.4

• The ReadAccess, WriteAccess, and DeleteAccess fields are used by the I/O subsystem to
indicate if the given file has been opened for Read, Write, or Delete access. This
information is used when checking the sharing status of the given file
(IoCheckShareAccess, etc.)

4
We say typically here because “stream based” files may be opened relative to an already existing stream of a file.

[Diagram not reproduced. Recoverable labels: three File Objects (Type, Size, DeviceObject, Vpb,
FsContext, FsContext2, SectionObjectPointers, PrivateCacheMap, FinalStatus, RelatedFileObject,
LockOperation, DeletePending, ReadAccess, SharedRead, SharedWrite, SharedDelete, Flags, FileName,
CurrentByteOffset, Waiters, Busy, LastLock, Lock, Event, CompletionContext); a volume parameter
block (Type, Size, Flags, VolumeLabelLength, DeviceObject, RealDevice, SerialNumber, ReferenceCount,
VolumeLabel); and a section object pointers structure (DataSectionObject, SharedCacheMap,
ImageSectionObject).]

Figure 2: Windows NT File Object (with Significant Data Structures)

• The SharedRead, SharedWrite, and SharedDelete fields are used by the I/O subsystem to
indicate if the given file has been opened to allow sharing of the appropriate type. This
information is used when updating the sharing status of the given file
(IoUpdateShareAccess, etc.)

• The FileName field indicates the “name” under which this file was opened. If the
RelatedFileObject field is set, this name is relative to that other opened directory. The
pathological case here is if the file has been opened by ID (FILE_OPEN_BY_FILE_ID
in the create option flags) in which case there is no “name” associated with this file open.
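
The FsContext rule from the field list above (two File Objects reference the same file if their
FsContext pointers are equal) reduces to a pointer comparison. The sketch below uses a hypothetical,
cut-down SkFileObject rather than the real FILE_OBJECT, and treats a NULL FsContext as "not yet
opened by an FSD", which is an assumption made for the example.

```c
#include <stddef.h>

/* Cut-down stand-in for FILE_OBJECT: only the per-file context pointer. */
typedef struct {
    void *FsContext; /* set by the FSD; identifies the underlying file */
} SkFileObject;

/* Returns nonzero when both open instances refer to the same file. */
static int SkSameFile(const SkFileObject *a, const SkFileObject *b)
{
    return a->FsContext != NULL && a->FsContext == b->FsContext;
}
```

Two opens of the same file therefore compare equal even though they are distinct File Objects with
distinct handles.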

2.1.2 IRP
The I/O Request Packet (IRP) contains the context information for a single I/O operation. The NT I/O
model does not inherently restrict any single device from handling multiple I/O operations simultaneously.
Further, since a single file may be “opened” for access multiple times it is quite conceivable that multiple
threads may be accessing a given file simultaneously. This is commonly true for heavily used directories
(such as the “root” directory on a drive) or for databases (where multi-threading is used to improve
performance.)

The IRP thus encapsulates all I/O state necessary to process a given operation. The only “context
dependency” is for the addresses of I/O buffers during the read, write, or device I/O control operations. All
other operations are completely isolated from memory context dependencies.

In the IRP, the following fields deal with data contents:

• The MdlAddress field, if present, points to a Memory Descriptor List – a scatter/gather
description of memory locations making up a single (virtually contiguous) buffer.
• The AssociatedIrp.SystemBuffer field, if present, points to a virtual address which
corresponds to a valid system address. This address can be used from any process context.
• The UserBuffer field, if present, points to a virtual address which corresponds to a valid user
address. This address is only valid in the context of the original calling user process.

Which of these fields is valid depends entirely upon the context of the call.

In addition to describing the user buffer, a given IRP often carries information that identifies the
nature of the particular I/O. The Flags field is a bitfield indicating the status of a
given I/O operation. Of the various values, the following are of interest:

[Diagram not reproduced. Recoverable labels: the IRP (Type, Size, MdlAddress, Flags, AssociatedIrp
with MasterIrp, IrpCount, and SystemBuffer, ThreadListEntry, IoStatus with Status and Information,
RequestorMode, PendingReturned, StackCount, CurrentLocation, Cancel, CancelIrql, ApcEnvironment,
AllocationFlags, UserIosb, UserEvent, Overlay with UserApcRoutine, UserApcContext, and
AllocationSize, CancelRoutine, UserBuffer, Tail with DeviceQueueEntry, Thread, AuxiliaryBuffer,
ListEntry, CurrentStackLocation, OriginalFileObject, Apc, and CompletionKey) and the Memory
Descriptor List (Next, Size, MdlFlags, Process, MappedSystemVa, StartVa, ByteCount, ByteOffset,
Page[1] through Page[N]).]

Figure 3: I/O Request Packet

• IRP_NOCACHE – this entry indicates that the data being read or written is not to be cached.
This is typically used by the Memory Manager to indicate that the operation is against the
cache itself (and hence cannot be “cached”.)
• IRP_PAGING_IO – in combination with IRP_NOCACHE this is sufficient to indicate the
I/O operation in question originated with the Memory Manager. Otherwise, it indicates this is
a mount operation.
• IRP_INPUT_OPERATION – indicates the given operation is a write operation (that is, the
user is providing the input to the operation.)
• IRP_SYNCHRONOUS_PAGING_IO – indicates the I/O operation is from the
IoSynchronousPageWrite call (on behalf of the Memory Manager.)
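
The flag test described in the list above can be written as a simple mask check. In this sketch the
SK_-prefixed constants are hypothetical names; their values are assumed to match IRP_NOCACHE and
IRP_PAGING_IO in ntddk.h and should be checked against the header actually in use.

```c
/* Assumed to match IRP_NOCACHE and IRP_PAGING_IO in ntddk.h. */
#define SK_IRP_NOCACHE   0x00000001UL
#define SK_IRP_PAGING_IO 0x00000002UL

/* Returns nonzero when both flags are set, which the text treats as
 * sufficient to say the request originated with the Memory Manager. */
static int SkIsMemoryManagerIo(unsigned long irpFlags)
{
    const unsigned long both = SK_IRP_NOCACHE | SK_IRP_PAGING_IO;
    return (irpFlags & both) == both;
}
```

An FSD would typically apply a test like this to Irp->Flags early in its read and write paths, since
such requests must not be sent back through the cache.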
2.1.3 I/O Stack Location
The I/O Stack Location is part of the IRP passed to the FSD from the I/O Manager. The contents of this
data structure are of supreme interest. Note further that the I/O Stack Location as declared in ntddk.h has
some missing fields in it. The complete I/O Stack Location is:

typedef struct _IO_STACK_LOCATION {
    UCHAR MajorFunction;
    UCHAR MinorFunction;
    UCHAR Flags;
    UCHAR Control;
    union {
        struct {
            PIO_SECURITY_CONTEXT SecurityContext;
            ULONG Options;
            USHORT FileAttributes;
            USHORT ShareAccess;
            ULONG EaLength;
        } Create;

        struct {
            PIO_SECURITY_CONTEXT SecurityContext;
            ULONG Options;
            USHORT Reserved;
            USHORT ShareAccess;
            PNAMED_PIPE_CREATE_PARAMETERS Parameters;
        } CreatePipe;

        struct {
            PIO_SECURITY_CONTEXT SecurityContext;
            ULONG Options;
            USHORT Reserved;
            USHORT ShareAccess;
            PMAILSLOT_CREATE_PARAMETERS Parameters;
        } CreateMailslot;

        struct {
            ULONG Length;
            ULONG Key;
            LARGE_INTEGER ByteOffset;
        } Read;

        struct {
            ULONG Length;
            ULONG Key;
            LARGE_INTEGER ByteOffset;
        } Write;

        struct {
            ULONG Length;
            PSTRING FileName;
            FILE_INFORMATION_CLASS FileInformationClass;
            ULONG FileIndex;
        } QueryDirectory;

        struct {
            ULONG Length;
            ULONG CompletionFilter;
        } NotifyDirectory;

        struct {
            ULONG Length;
            FILE_INFORMATION_CLASS FileInformationClass;
        } QueryFile;

        struct {
            ULONG Length;
            FILE_INFORMATION_CLASS FileInformationClass;
            PFILE_OBJECT FileObject;
            union {
                struct {
                    BOOLEAN ReplaceIfExists;
                    BOOLEAN AdvanceOnly;
                };
                ULONG ClusterCount;
            };
        } SetFile;

        struct {
            ULONG Length;
            PVOID EaList;
            ULONG EaListLength;
            ULONG EaIndex;
        } QueryEa;

        struct {
            ULONG Length;
        } SetEa;

        struct {
            ULONG Length;
            FS_INFORMATION_CLASS FsInformationClass;
        } QueryVolume;

        struct {
            ULONG Length;
            FS_INFORMATION_CLASS FsInformationClass;
        } SetVolume;

        struct {
            ULONG OutputBufferLength;
            ULONG InputBufferLength;
            ULONG FsControlCode;
            PVOID Type3InputBuffer;
        } FileSystemControl;

        struct {
            PLARGE_INTEGER Length;
            ULONG Key;
            LARGE_INTEGER ByteOffset;
        } LockControl;

        struct {
            ULONG OutputBufferLength;
            ULONG InputBufferLength;
            ULONG IoControlCode;
            PVOID Type3InputBuffer;
        } DeviceIoControl;

        struct {
            SECURITY_INFORMATION SecurityInformation;
            ULONG Length;
        } QuerySecurity;

        struct {
            SECURITY_INFORMATION SecurityInformation;
            PSECURITY_DESCRIPTOR SecurityDescriptor;
        } SetSecurity;

        struct {
            PVPB Vpb;
            PDEVICE_OBJECT DeviceObject;
        } MountVolume;

        struct {
            PVPB Vpb;
            PDEVICE_OBJECT DeviceObject;
        } VerifyVolume;

        struct {
            struct _SCSI_REQUEST_BLOCK *Srb;
        } Scsi;

        struct {
            PVOID Argument1;
            PVOID Argument2;
            PVOID Argument3;
            PVOID Argument4;
        } Others;

    } Parameters;

    PDEVICE_OBJECT DeviceObject;

    PFILE_OBJECT FileObject;

    PIO_COMPLETION_ROUTINE CompletionRoutine;

    PVOID Context;

} IO_STACK_LOCATION, *PIO_STACK_LOCATION;

The I/O Stack location itself contains much of the state information essential to the FSD. In this section
we will describe those fields of the stack location which are both general and of interest to FSDs. In the
following section we will describe the variable 32 byte field which contains per-operation data.
The general fields of interest are:

• MajorFunction – this field indicates which of the “major” I/O operations this IRP represents.
• MinorFunction – this field indicates which of the “minor” I/O operations this IRP represents.
Typically this field is unused. In the next section we will discuss the individual major
functions and describe available minor values as appropriate.
• Flags – per-operation variable flags information regarding the specific I/O operation.
• FileObject – points to the currently active file object for this file (note that the field in the IRP
with a similar name is not the correct one to use.) We use this as a reference to the open
instance and to the specific (per-file) information (via the FsContext pointer.)
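
To make the use of these general fields concrete, the following sketch reads them from a heavily
reduced stand-in for the stack location. The Sk-prefixed types are hypothetical and omit most of the
real structure; the major function values are assumed to match IRP_MJ_CREATE, IRP_MJ_READ, and
IRP_MJ_WRITE in ntddk.h.

```c
#include <stddef.h>

/* Assumed to match the corresponding IRP_MJ_xxx values in ntddk.h. */
#define SK_IRP_MJ_CREATE 0x00
#define SK_IRP_MJ_READ   0x03
#define SK_IRP_MJ_WRITE  0x04

typedef struct {
    void *FsContext; /* per-file context owned by the FSD */
} SkFileObject;

/* Reduced stack location: two general fields, two parameter layouts,
 * and the file object pointer. */
typedef struct {
    unsigned char MajorFunction;
    unsigned char MinorFunction;
    union {
        struct { unsigned long Length; unsigned long Key; long long ByteOffset; } Read;
        struct { unsigned long Length; unsigned long Key; long long ByteOffset; } Write;
    } Parameters;
    SkFileObject *FileObject;
} SkStackLocation;

/* Selects the parameter layout from MajorFunction and returns the
 * requested transfer length; other operations report 0. */
static unsigned long SkTransferLength(const SkStackLocation *sp)
{
    switch (sp->MajorFunction) {
    case SK_IRP_MJ_READ:
        return sp->Parameters.Read.Length;
    case SK_IRP_MJ_WRITE:
        return sp->Parameters.Write.Length;
    default:
        return 0;
    }
}

/* Reaches the per-file information through the stack location's
 * FileObject, as the text recommends. */
static void *SkPerFileContext(const SkStackLocation *sp)
{
    return sp->FileObject != NULL ? sp->FileObject->FsContext : NULL;
}
```

The important habit shown here is that MajorFunction is consulted before any member of the
Parameters union is touched, because the union's contents are only meaningful for that operation.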

The specific details of the I/O Stack location (where important to this discussion) with respect to the
parameters are described in the next section.

2.2 I/O Operations
This section describes I/O operations on the basis of their Major Function. For those functions
which further support Minor Functions, we describe the basic operation and then, as appropriate,
how it is affected by the various minor function codes.

2.2.1 CREATE
In Figure 4 we have provided a graphic description of the Create I/O request. As it turns out, Create itself
has a “complex” representation because various fields essential to the create operation are scattered around
the I/O Request. Thus, to ease the process of finding those fields, we have shaded those fields which are
specifically associated with parameters to the Create operation itself. Most other operations are far simpler
and do not have arguments scattered in such a fashion.
This call is typically made by the I/O Manager due to the creation of a new FileObject. The FSD then, in
turn, is responsible for:
• Validating the existence of the file
• Ensuring the caller has appropriate security
• Verifying sharing semantics are preserved
The actual create operation can be on any one of several different “types” of devices:
• Files
• Directories
• Volumes
• Streams (where supported)
The create parameters in the I/O stack location consist of:
• SecurityContext – a pointer to the calling user’s security context. This is used as an argument
to the SeAccessCheck() call by a file system supporting ACLs (such as NTFS.)

[Diagram not reproduced. Recoverable labels: the IRP (with AssociatedIrp referring to a System
Buffer holding the EA, and Overlay holding AllocationSize); the I/O stack location (MajorFunction,
MinorFunction, Flags, Control, Parameters with SecurityContext, Options, FileAttributes,
ShareAccess, and EaLength, DeviceObject, FileObject, CompletionRoutine, Context); the security
context (SecurityQos, AccessState, DesiredAccess, FullCreateOptions); the security quality of
service (Length, Impersonation, ContextTrackingMode, EffectiveOnly); and the access state
(OperationId, SecurityEvaluated, GenerateAudit, GenerateOnClose, Flags, RemainingDesiredAccess,
PreviouslyGrantedAccess, OriginalDesiredAccess, SubjectSecurityContext, SecurityDescriptor, AuxData,
Privileges, AuditPrivileges, ObjectName, ObjectTypeName).]

Figure 4: Create Irp with Associated Fields

• Options – this field consists of two pieces. The upper eight bits indicate the disposition of the
file, and the lower 24 bits indicate characteristics of the file, or of how it is being opened.
These are described later in this section.
• FileAttributes – this field indicates the attributes to use when opening or creating a new file.
This indicates, for instance, if the file is a system file, etc. We have enumerated the various
options later in this section.
• ShareAccess – this field indicates the share access being requested for this file. This includes
standard I/O sharing (read/write) as well as the ability to delete the file, and the various
transactional sharing modes. These are described later in this section.
• EaLength – this field indicates the size of the Extended Attribute buffer to be applied to this
file if it is being newly created. The buffer containing the EA is
Irp->AssociatedIrp.SystemBuffer. Of course, not all NT file systems support EAs, and hence they
will reject a request indicating an EA is to be included.

In addition to these values, the initial size of the file may be set (via the AllocationSize field in
Irp->Overlay.)

2.2.1.1 Disposition Information
The disposition information of a file (the upper eight bits of the Options field) indicates how the create
operation should handle the creation of this file should it already exist (or not exist.) They include:

• FILE_SUPERSEDE
• FILE_OPEN
• FILE_CREATE
• FILE_OPEN_IF
• FILE_OVERWRITE
• FILE_OVERWRITE_IF
• FILE_MAXIMUM_DISPOSITION (the upper bound of the valid values, not a disposition in its own right)

2.2.1.2 File Options


The lower 24 bits of the Options field indicate characteristics of the file, or indicate how this
particular create operation is to be processed. These are:

• FILE_DIRECTORY_FILE
• FILE_WRITE_THROUGH
• FILE_SEQUENTIAL_ONLY
• FILE_NO_INTERMEDIATE_BUFFERING
• FILE_SYNCHRONOUS_IO_ALERT
• FILE_SYNCHRONOUS_IO_NONALERT
• FILE_NON_DIRECTORY_FILE
• FILE_CREATE_TREE_CONNECTION
• FILE_COMPLETE_IF_OPLOCKED
• FILE_NO_EA_KNOWLEDGE
• FILE_DISABLE_TUNNELING
• FILE_RANDOM_ACCESS
• FILE_DELETE_ON_CLOSE
• FILE_OPEN_BY_FILE_ID
• FILE_OPEN_FOR_BACKUP_INTENT
• FILE_NO_COMPRESSION
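
Splitting the Options field into its two pieces is a matter of a shift and a mask. The sketch below
uses hypothetical SK_-prefixed names; the constant values are assumed to match the like-named
definitions in ntddk.h (FILE_OPEN_IF, FILE_MAXIMUM_DISPOSITION, FILE_DIRECTORY_FILE,
FILE_DELETE_ON_CLOSE) and should be verified against the header in use.

```c
#include <stdint.h>

/* Assumed to match the corresponding FILE_xxx definitions in ntddk.h. */
#define SK_FILE_OPEN_IF             0x00000003u
#define SK_FILE_MAXIMUM_DISPOSITION 0x00000005u
#define SK_FILE_DIRECTORY_FILE      0x00000001u
#define SK_FILE_DELETE_ON_CLOSE     0x00001000u

typedef struct {
    uint32_t disposition;   /* upper eight bits of Options */
    uint32_t createOptions; /* lower 24 bits of Options */
} SkCreateRequest;

/* Separates the disposition from the create options. */
static SkCreateRequest SkSplitOptions(uint32_t options)
{
    SkCreateRequest r;
    r.disposition   = (options >> 24) & 0xFFu;
    r.createOptions = options & 0x00FFFFFFu;
    return r;
}

/* A disposition is acceptable if it does not exceed the maximum. */
static int SkDispositionValid(uint32_t disposition)
{
    return disposition <= SK_FILE_MAXIMUM_DISPOSITION;
}
```

An FSD's create path would perform this split once, up front, and then test individual option bits
(for example SK_FILE_DIRECTORY_FILE) against the lower 24 bits only.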

2.2.1.3 File Attributes


In this section we describe the File Attributes which may be indicated in a create operation. Each FSD can
choose not to implement individual operations, or to reject calls with those attributes specified. The first
five attributes are consistent with “DOS” style file attributes:

• FILE_ATTRIBUTE_READONLY – this attribute value indicates the file does not have write
access enabled.
• FILE_ATTRIBUTE_HIDDEN – this attribute indicates the file is “hidden” from normal
views of the directory.
• FILE_ATTRIBUTE_SYSTEM – this attribute indicates the file belongs to the operating
system.
• FILE_ATTRIBUTE_DIRECTORY – this attribute indicates the file is a directory. Note that
if the file is not a directory, this will generate an error.
• FILE_ATTRIBUTE_ARCHIVE – this indicates the “archive” bit should be set for the given
file. This can be used to implement a basic incremental backup scheme.
• FILE_ATTRIBUTE_NORMAL – this indicates the attribute bits are one of the first five
values (the “normal” file attributes.)

• FILE_ATTRIBUTE_TEMPORARY – this indicates the file is being opened for temporary
access – its contents need not be written to disk. This attribute is incompatible with
FILE_ATTRIBUTE_READONLY.

The remaining attributes are not currently implemented by NT file systems, but are reserved for future use:

• FILE_ATTRIBUTE_ATOMIC_WRITE
• FILE_ATTRIBUTE_XACTION_WRITE
• FILE_ATTRIBUTE_COMPRESSED
• FILE_ATTRIBUTE_HAS_EMBEDDING
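
Two of the constraints noted in this section (FILE_ATTRIBUTE_TEMPORARY is incompatible with
FILE_ATTRIBUTE_READONLY, and FILE_ATTRIBUTE_DIRECTORY is an error when the file is not a directory)
lend themselves to a simple validation routine. The sketch below is one possible check, not the
logic of any particular NT file system; the SK_-prefixed names are hypothetical and their values are
assumed to match the like-named FILE_ATTRIBUTE_xxx definitions.

```c
#include <stdint.h>

/* Assumed to match the corresponding FILE_ATTRIBUTE_xxx definitions. */
#define SK_FILE_ATTRIBUTE_READONLY  0x00000001u
#define SK_FILE_ATTRIBUTE_DIRECTORY 0x00000010u
#define SK_FILE_ATTRIBUTE_TEMPORARY 0x00000100u

/* Returns 1 if the attribute combination is acceptable for the target,
 * 0 if the create request should be rejected. */
static int SkAttributesValid(uint32_t attrs, int targetIsDirectory)
{
    /* A temporary file cannot also be read-only. */
    if ((attrs & SK_FILE_ATTRIBUTE_TEMPORARY) &&
        (attrs & SK_FILE_ATTRIBUTE_READONLY))
        return 0;
    /* The directory attribute is an error on a non-directory. */
    if ((attrs & SK_FILE_ATTRIBUTE_DIRECTORY) && !targetIsDirectory)
        return 0;
    return 1;
}
```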

2.2.1.4 Share Access


The ShareAccess field indicates what rights the caller (attempting the create) is willing to share with
respect to the given file. If this sharing activity is incompatible with existing opens on the given file, this
call will fail. Otherwise, future callers must abide by the sharing attributes of this caller.
Sharing is implemented by using the standard I/O Manager calls (documented in the DDK):

• IoCheckShareAccess
• IoRemoveShareAccess
• IoSetShareAccess
• IoUpdateShareAccess

These calls manipulate fields within the File Object and are stateful (that is, they must be called in
the proper order to ensure correct sharing semantics.) The supported sharing modes (all reasonably
self-explanatory) at present are:

• FILE_SHARE_READ
• FILE_SHARE_WRITE
• FILE_SHARE_DELETE

In addition, the following values are presently defined but not used:

• FILE_SHARE_TRANSACTED
• FILE_SHARE_DIRECT
• FILE_SHARE_PRIORITY
• FILE_SHARE_JOINWITHPARENT
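
The compatibility rule described above (a new open must be compatible with existing opens, and
future opens must abide by its sharing attributes) can be modelled with a handful of counters. The
sketch below is a simplified model written for this discussion, not the implementation of
IoCheckShareAccess; the SkShareAccess structure and function are hypothetical, and the share flag
values are assumed to match FILE_SHARE_READ, FILE_SHARE_WRITE, and FILE_SHARE_DELETE.

```c
/* Assumed to match FILE_SHARE_READ, FILE_SHARE_WRITE, FILE_SHARE_DELETE. */
#define SK_FILE_SHARE_READ   0x1u
#define SK_FILE_SHARE_WRITE  0x2u
#define SK_FILE_SHARE_DELETE 0x4u

/* Per-file sharing state accumulated over all current opens. */
typedef struct {
    unsigned openCount;
    unsigned readers, writers, deleters;            /* access in use */
    unsigned sharedRead, sharedWrite, sharedDelete; /* access permitted */
} SkShareAccess;

/* Returns 1 and records the open if it is compatible, else returns 0
 * and leaves the state untouched. */
static int SkCheckShareAccess(int wantRead, int wantWrite, int wantDelete,
                              unsigned share, SkShareAccess *sa)
{
    int sr = (share & SK_FILE_SHARE_READ) != 0;
    int sw = (share & SK_FILE_SHARE_WRITE) != 0;
    int sd = (share & SK_FILE_SHARE_DELETE) != 0;

    /* Every existing open must have agreed to share what we request. */
    if ((wantRead && sa->sharedRead < sa->openCount) ||
        (wantWrite && sa->sharedWrite < sa->openCount) ||
        (wantDelete && sa->sharedDelete < sa->openCount))
        return 0;

    /* We must agree to share whatever existing opens already use. */
    if ((sa->readers && !sr) || (sa->writers && !sw) || (sa->deleters && !sd))
        return 0;

    sa->openCount++;
    sa->readers += wantRead ? 1u : 0u;
    sa->writers += wantWrite ? 1u : 0u;
    sa->deleters += wantDelete ? 1u : 0u;
    sa->sharedRead += sr ? 1u : 0u;
    sa->sharedWrite += sw ? 1u : 0u;
    sa->sharedDelete += sd ? 1u : 0u;
    return 1;
}
```

Because the state is only updated on success, a failed open leaves the sharing record exactly as it
was, which mirrors the stateful, order-sensitive behavior the text attributes to the real calls.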

2.2.1.5 Sample Create Code


The following sample program demonstrates creating a new file via the Native NT I/O API, setting its
read-only attribute and then writing to the file. This demonstrates that the read-only bit only
becomes active on the next access to the file, not on this access to the file.
//
// (C) Copyright 1996 OSR Open Systems Resources, Inc.
// All Rights Reserved.
//

//
// This sample program demonstrates opening a file for write access that is
// (a) new and (b) being created for the first time.
//
// It uses the Native NT API:
// NtCreateFile - allows opening the file object directly.
// NtClose - close the object

//

#include <ntddk.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <memory.h>
#include <wrapper-ioctl.h>

//
// The NATIVE NT API definitions.
//

NTSYSAPI
NTSTATUS
NTAPI
NtCreateFile(
OUT PHANDLE FileHandle,
IN ACCESS_MASK DesiredAccess,
IN POBJECT_ATTRIBUTES ObjectAttributes,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN PLARGE_INTEGER AllocationSize OPTIONAL,
IN ULONG FileAttributes,
IN ULONG ShareAccess,
IN ULONG CreateDisposition,
IN ULONG CreateOptions,
IN PVOID EaBuffer OPTIONAL,
IN ULONG EaLength
);

NTSYSAPI
NTSTATUS
NTAPI
NtWriteFile(
IN HANDLE FileHandle,
IN HANDLE Event OPTIONAL,
IN PIO_APC_ROUTINE ApcRoutine OPTIONAL,
IN PVOID ApcContext OPTIONAL,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN PVOID Buffer,
IN ULONG Length,
IN PLARGE_INTEGER ByteOffset OPTIONAL,
IN PULONG Key OPTIONAL
);

NTSYSAPI
NTSTATUS
NTAPI
NtClose(
IN HANDLE Handle
);

int __cdecl main(int argc, char **argv)
{
    UNICODE_STRING fileName;
    ANSI_STRING fileAnsiName;
    OBJECT_ATTRIBUTES fileAttributes;
    NTSTATUS status;
    HANDLE fileHandle;
    IO_STATUS_BLOCK iosb;
    char buffer[512];

    //
    // The usage pattern for this program is:
    // > ctest [NT name to create]
    //

    if (argc != 2) {
        printf("Usage: ctest [NT Name to create]\n");
        exit(1);
    }

    //
    // First, set up an ANSI string version of the file name specified
    // on the command line.
    //

    fileAnsiName.Length = (USHORT) strlen(argv[1]);
    fileAnsiName.MaximumLength = (USHORT) (fileAnsiName.Length + sizeof(CHAR));
    fileAnsiName.Buffer = malloc(fileAnsiName.MaximumLength);

    if (fileAnsiName.Buffer == 0) {
        printf("malloc failed\n");
        exit(1);
    }

    memset(fileAnsiName.Buffer, 0, fileAnsiName.MaximumLength);
    memcpy(fileAnsiName.Buffer, argv[1], fileAnsiName.Length);

    //
    // Allocate space for the unicode string version.
    //

    fileName.Length = (USHORT) (sizeof(WCHAR) * fileAnsiName.Length);
    fileName.MaximumLength = (USHORT) (fileName.Length + sizeof(WCHAR));
    fileName.Buffer = malloc(fileName.MaximumLength);

    if (fileName.Buffer == 0) {
        printf("malloc failed\n");
        exit(1);
    }

    //
    // Convert the ANSI name to UNICODE. FALSE means the destination
    // buffer is the one we allocated above.
    //

    status = RtlAnsiStringToUnicodeString(&fileName, &fileAnsiName, FALSE);

    if (!NT_SUCCESS(status)) {
        //
        // Conversion error
        //
        printf("RtlAnsiStringToUnicodeString failed (0x%x)\n", status);
        exit(1);
    }

    //
    // Now create the new file. Note that we've chosen to mark the file as
    // read only, with requested access including read, write, and execute.
    //
    // If this works, we'll try to write to this read-only file.
    //

    InitializeObjectAttributes(&fileAttributes,
                               &fileName,
                               OBJ_CASE_INSENSITIVE,
                               0,
                               0);

    status = NtCreateFile(&fileHandle,
                          FILE_ALL_ACCESS,
                          &fileAttributes,
                          &iosb,
                          0,
                          FILE_ATTRIBUTE_READONLY,
                          0, // no sharing
                          FILE_CREATE,
                          FILE_NON_DIRECTORY_FILE|FILE_SYNCHRONOUS_IO_NONALERT,
                          0,
                          0);

    if (!NT_SUCCESS(status)) {
        printf("NtCreateFile failed (0x%x)\n", status);
        exit(1);
    }

    //
    // Now, write contents to this READ ONLY file. The data itself is
    // irrelevant, so we just fill the buffer with a known pattern.
    //

    memset(buffer, 'A', sizeof(buffer));

    status = NtWriteFile(fileHandle,
                         0, // optional event
                         0, // optional APC
                         0, // optional APC context
                         &iosb,
                         buffer, // data to write
                         sizeof(buffer), // amount of data
                         0, // optional byte offset
                         0); // optional key value

    if (!NT_SUCCESS(status)) {
        //
        // Write failed
        //
        printf("NtWriteFile failed (0x%x)\n", status);
        exit(1);
    }

    //
    // Cleanup
    //

    status = NtClose(fileHandle);

    if (!NT_SUCCESS(status)) {
        printf("NtClose failed (0x%x)\n", status);
        exit(1);
    }

    return (0);
}

2.2.2 CLOSE
The close major function is used to indicate that all outstanding references on a given file have been
dropped. This call indicates all interest in a given file has been exhausted. Typically, this call results only
in internal FSD state being “cleaned up”.
See Cleanup for much of the “real” activity.

2.2.3 WRITE
The write major function indicates the caller wishes to write a portion of the file’s contents. This operation
has three minor function codes:

• IRP_MN_MDL – indicates that the caller wishes to have the FSD build an MDL for the
portion of the file being written. The FSD is responsible for allocating and building the MDL.
The IoStatus.Information field is used to indicate the number of bytes successfully written to
the MDL.
• IRP_MN_MDL_COMPLETE – indicates that the caller is done with an FSD-created MDL.
The FSD is responsible for deallocating and deleting the MDL.
• IRP_MN_DPC – indicates that the caller is at a DPC context. In this case, the FSD must
return STATUS_PENDING for the I/O operation and complete it in a worker thread context.

The actual parameters to the write call are (from the I/O stack location):

• Length – the total number of bytes to be written.
• Key – indicates the “lock key” against which this request should be checked. This is used by
file servers and subsystems.
• ByteOffset – indicates the location where the operation should begin.

The actual mechanics of a write operation can further be modified if any of the following are true:

• The write is against the paging file. In this case the file system is restricted in the operations
it can perform.
• The write is marked as being non-cached. This either indicates the file was opened with
caching explicitly disabled or the I/O is from the Memory Management system, which can be
determined by the other bits set in the IRP. In this case, data must be written directly to the
disk from the provided buffer. The FSD enforces alignment constraints on non-cached I/O
operations.
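
As an illustration of the alignment constraint on non-cached I/O, the following sketch checks a request before it would be sent to the disk. It assumes the constraint is simply that both the byte offset and the length are multiples of the sector size; the actual rules are up to the individual FSD and the underlying device. The function name noncached_io_is_aligned is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when a non-cached transfer starts and ends on a sector
 * boundary. A sector size of zero is treated as an invalid request.
 * Assumption: sector granularity is the only constraint being checked.
 */
bool noncached_io_is_aligned(uint64_t byte_offset,
                             uint32_t length,
                             uint32_t sector_size)
{
    if (sector_size == 0) {
        return false;
    }
    return (byte_offset % sector_size == 0) && (length % sector_size == 0);
}
```

An FSD following this model would reject a misaligned non-cached request rather than silently rounding it.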

The actual location of the data is dependent upon the characteristics of the FSD (or filter) device object.
Typically, an NT file system will use “neither” I/O. Thus, the data buffer will be Irp->UserBuffer – even if
the call originated from the system address space.

The exception to this is paging I/O which is done from the VM system and utilizes MDLs to describe the
memory to be written to disk.

2.2.4 READ
For most purposes the IRP_MJ_READ function is the same as the IRP_MJ_WRITE function described
above, except of course user data is moved from the device to the user’s data buffer.

2.2.5 SET/QUERY_INFORMATION


The query and set information operations can be used against open directories or files. There are a large
number of file attributes which can be set or queried by way of this call (see footnote 5). Each of these attributes has an
associated data structure which describes the specific layout of the data provided to or to be returned by the
FSD. In the remainder of this section we describe the various data structure layouts which can be used.

2.2.5.1 FileBasicInformation
The “basic information” regarding a file (in ntddk.h) contains information about the creation, access, and
modification times of the file as well as the attributes associated with that file. For more information see
FILE_BASIC_INFORMATION.

2.2.5.2 FileStandardInformation
Standard information consists of:

• AllocationSize – the amount of disk space consumed by the file
• EndOfFile – the size of the valid data within the file. Typically this is not the same as the
AllocationSize field.
• NumberOfLinks – the number of distinct names associated with the file within the file system
directory tree. Only NTFS supports files with more than one link.
• DeletePending – indicates if this file will be deleted when the last reference to it is released.
• Directory – indicates if this file is really a “directory” rather than a normal data file.

The actual data structure FILE_STANDARD_INFORMATION can be found in ntddk.h.

Footnote 5: From ntioapi.h in the October ’94 IFS DDK.

2.2.5.3 FileInternalInformation
The FILE_INTERNAL_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_INTERNAL_INFORMATION {
    LARGE_INTEGER IndexNumber;
} FILE_INTERNAL_INFORMATION, *PFILE_INTERNAL_INFORMATION;

The IndexNumber field can be used with either NTFS or CDFS to open a file by its “ID” rather than by
name. FAT and HPFS provide this number but do not support opening a file by ID.

2.2.5.4 FileEaInformation
The FILE_EA_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_EA_INFORMATION {
    ULONG EaSize;
} FILE_EA_INFORMATION, *PFILE_EA_INFORMATION;

The single field, EaSize, is self-explanatory.

2.2.5.5 FileAccessInformation
The FILE_ACCESS_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_ACCESS_INFORMATION {
    ACCESS_MASK AccessFlags;
} FILE_ACCESS_INFORMATION, *PFILE_ACCESS_INFORMATION;

This is used to retrieve the access mask used for the given file object when it was created. It is not
supported by all file systems.

2.2.5.6 FileNameInformation
The FILE_NAME_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_NAME_INFORMATION {
    ULONG FileNameLength;
    WCHAR FileName[1];
} FILE_NAME_INFORMATION, *PFILE_NAME_INFORMATION;

The name returned is the full name of the file (the “long” file name.)

2.2.5.7 FileRenameInformation
The FILE_RENAME_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_RENAME_INFORMATION {
    BOOLEAN ReplaceIfExists;
    HANDLE RootDirectory;
    ULONG FileNameLength;
    WCHAR FileName[1];
} FILE_RENAME_INFORMATION, *PFILE_RENAME_INFORMATION;

This is used to rename the given file. Our analysis indicates that the RootDirectory field is not used by the
physical file systems. Rather, they rely upon the FileName argument exclusively.

The ReplaceIfExists field indicates whether or not the rename should continue if the specified target file
exists. This option is not supported by all physical file systems.
Rename is only valid for SET_INFORMATION.

2.2.5.8 FileLinkInformation
The FILE_LINK_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_LINK_INFORMATION {
    BOOLEAN ReplaceIfExists;
    HANDLE RootDirectory;
    ULONG FileNameLength;
    WCHAR FileName[1];
} FILE_LINK_INFORMATION, *PFILE_LINK_INFORMATION;

Link information is used only by file systems that support multiply linked files. Among the standard NT
local file systems, that means NTFS.

2.2.5.9 FilePositionInformation
This structure returns the current offset position for the given FILE_OBJECT.

2.2.5.10 FileModeInformation
The FILE_MODE_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_MODE_INFORMATION {
    ULONG Mode;
} FILE_MODE_INFORMATION, *PFILE_MODE_INFORMATION;

2.2.5.11 FileAlignmentInformation
This structure is used to indicate the alignment consideration of the underlying file system. Typically, this
will be the “cluster size” being used by the underlying file system.

2.2.5.12 FileAllInformation
The FILE_ALL_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_ALL_INFORMATION {
    FILE_BASIC_INFORMATION BasicInformation;
    FILE_STANDARD_INFORMATION StandardInformation;
    FILE_INTERNAL_INFORMATION InternalInformation;
    FILE_EA_INFORMATION EaInformation;
    FILE_ACCESS_INFORMATION AccessInformation;
    FILE_POSITION_INFORMATION PositionInformation;
    FILE_MODE_INFORMATION ModeInformation;
    FILE_ALIGNMENT_INFORMATION AlignmentInformation;
    FILE_NAME_INFORMATION NameInformation;
} FILE_ALL_INFORMATION, *PFILE_ALL_INFORMATION;

This is simply an accumulation of various other informational fields regarding the given file.

2.2.5.13 FileAllocationInformation
The FILE_ALLOCATION_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_ALLOCATION_INFORMATION {
    LARGE_INTEGER AllocationSize;
} FILE_ALLOCATION_INFORMATION, *PFILE_ALLOCATION_INFORMATION;

This can be used to set the disk-based allocation of the file. This can have either the effect of increasing or
decreasing the size of the file. Disk space is actually allocated at the time this call is made which
maximizes the likelihood that the new allocation units will be contiguous on the disk.

2.2.5.14 FileDispositionInformation
The FILE_DISPOSITION_INFORMATION structure is used to indicate the delete state of a file on close.
Setting the DeleteFile BOOLEAN to TRUE causes the file to be deleted on close.

typedef struct _FILE_DISPOSITION_INFORMATION {
    BOOLEAN DeleteFile;
} FILE_DISPOSITION_INFORMATION, *PFILE_DISPOSITION_INFORMATION;

2.2.5.15 FileEndOfFileInformation
This can be used to adjust the size of the active data portion of a file. It adjusts the allocation as needed to
ensure data storage is available for the new EOF information.
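
To make the relationship between the end-of-file value and the allocation concrete, the sketch below computes the smallest allocation that can hold a given EOF. It assumes space is handed out in whole clusters of a fixed size, which is typical but is a property of the individual file system. The function name allocation_for_eof is hypothetical.

```c
#include <stdint.h>

/*
 * Rounds an end-of-file value up to the next whole cluster.
 * Returns 0 for a zero cluster size, which is treated as invalid input.
 * Assumption: the file system allocates space in fixed-size clusters.
 */
uint64_t allocation_for_eof(uint64_t end_of_file, uint32_t cluster_size)
{
    uint64_t clusters;

    if (cluster_size == 0) {
        return 0;
    }

    /* Whole clusters, plus one more if there is a partial cluster. */
    clusters = end_of_file / cluster_size;
    if (end_of_file % cluster_size != 0) {
        clusters++;
    }
    return clusters * cluster_size;
}
```

With 4096-byte clusters, an EOF of 4097 bytes needs 8192 bytes of allocation, which is why EndOfFile and AllocationSize usually differ.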

2.2.5.16 FileAlternateNameInformation
This call uses the FILE_NAME_INFORMATION structure to retrieve the “short name” alternative (8.3)
for the given file.

2.2.5.17 FileStreamInformation
The FILE_STREAM_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_STREAM_INFORMATION {
    ULONG NextEntryOffset;
    ULONG StreamNameLength;
    LARGE_INTEGER StreamSize;
    LARGE_INTEGER StreamAllocationSize;
    WCHAR StreamName[1];
} FILE_STREAM_INFORMATION, *PFILE_STREAM_INFORMATION;

This structure is used to enumerate the user-defined streams associated with a given file. It can be used
only via the query interface – streams cannot be “created” using this mechanism. It fills the user-provided
buffer with information about as many streams as possible.

2.2.5.18 FileCompressionInformation
The FILE_COMPRESSION_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_COMPRESSION_INFORMATION {
    LARGE_INTEGER CompressedFileSize;
    USHORT CompressionFormat;
} FILE_COMPRESSION_INFORMATION, *PFILE_COMPRESSION_INFORMATION;

This can be used to retrieve the “compression state” of the given file. The CompressedFileSize is
indicative of the actual amount of disk space consumed by the file, rather than the data represented by the
file.

2.2.6 SET/QUERY_EA
The query/set EA functions are used to retrieve and manipulate information about the extended attributes
associated with a particular file or directory. The EA contents of the individual file or directory can be
enumerated by using a combination of calls with the following flags:

• SL_RESTART_SCAN - indicates the EA “scan” should be started from the beginning
• SL_RETURN_SINGLE_ENTRY - indicates the next EA entry should be returned
• SL_INDEX_SPECIFIED - indicates a specific EA entry should be returned, based upon its
position.

The query interface can be used to enumerate the various extended attributes associated with the file. The
set interface can be used to add or modify the extended attributes. The individual file system may set a
limit on the size of the extended attributes it supports. The NTFS limit on EA size is identical to the limit
on user-data stream sizes (that is, 2^64 bytes.)

The query operation parameters from the I/O stack location are (from ntifs.h):

struct {
    ULONG Length;
    PVOID EaList;
    ULONG EaListLength;
    ULONG EaIndex;
} QueryEa;

• Length indicates the total length of the user-provided buffer in which the EA information is to
be returned.
• EaList points to an array of FILE_GET_EA_INFORMATION structures indicating which EA
information should be returned to the caller in the caller-supplied buffer. The
FILE_GET_EA_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_GET_EA_INFORMATION {
    ULONG NextEntryOffset;
    UCHAR EaNameLength;
    CHAR EaName[1];
} FILE_GET_EA_INFORMATION, *PFILE_GET_EA_INFORMATION;

• EaListLength indicates the size of the EaList buffer.
• EaIndex is used if the SL_INDEX_SPECIFIED flag is set, indicating that a particular entry,
requested by its index location is to be returned. In this case the EaList argument is ignored.

The information returned to the user is stored in the FILE_FULL_EA_INFORMATION structure. That
structure (which is also used when setting the EA information) is (from ntioapi.h):

typedef struct _FILE_FULL_EA_INFORMATION {
    ULONG NextEntryOffset;
    UCHAR Flags;
    UCHAR EaNameLength;
    USHORT EaValueLength;
    CHAR EaName[1];
} FILE_FULL_EA_INFORMATION, *PFILE_FULL_EA_INFORMATION;

This structure is used to examine (or set) information about the extended attribute. The EA value begins at
EaName[EaNameLength+1].
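
The layout rule above (the value follows the name and its terminating null) can be sketched in portable C. The ea_entry type below is a stand-in that mirrors the fields of FILE_FULL_EA_INFORMATION using fixed-width types so that the example compiles outside the DDK. The helper names ea_value and ea_build are hypothetical, and the caller is assumed to supply a buffer suitably aligned for a 32-bit field.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in mirroring FILE_FULL_EA_INFORMATION (assumption: same layout). */
typedef struct {
    uint32_t NextEntryOffset;
    uint8_t  Flags;
    uint8_t  EaNameLength;   /* length of the name, excluding the null */
    uint16_t EaValueLength;
    char     EaName[1];      /* name, null terminator, then the value */
} ea_entry;

/* Returns a pointer to the value, i.e. EaName[EaNameLength + 1]. */
const unsigned char *ea_value(const ea_entry *ea)
{
    const unsigned char *bytes = (const unsigned char *)ea;
    return bytes + offsetof(ea_entry, EaName) + ea->EaNameLength + 1;
}

/*
 * Builds a single EA entry in buf and returns the number of bytes used,
 * or 0 if the name is too long or the buffer is too small.
 */
size_t ea_build(void *buf, size_t cap, const char *name,
                const void *value, uint16_t value_len)
{
    size_t name_len = strlen(name);
    size_t total = offsetof(ea_entry, EaName) + name_len + 1 + value_len;
    unsigned char *bytes = (unsigned char *)buf;
    ea_entry *ea = (ea_entry *)buf;

    if (name_len > UINT8_MAX || total > cap) {
        return 0;
    }

    ea->NextEntryOffset = 0;               /* only entry in the buffer */
    ea->Flags = 0;
    ea->EaNameLength = (uint8_t)name_len;
    ea->EaValueLength = value_len;

    /* Copy the name with its null, then the value right after it. */
    memcpy(bytes + offsetof(ea_entry, EaName), name, name_len + 1);
    memcpy(bytes + offsetof(ea_entry, EaName) + name_len + 1, value, value_len);
    return total;
}
```

For a three-character name and a two-byte value the entry occupies 8 bytes of header, 4 bytes of name and terminator, and 2 bytes of value.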

Note that not all file systems support EAs.

2.2.7 SET/QUERY_VOLUME_INFORMATION
These operations manage volume level information rather than file or directory information. For instance,
volume labels can be manipulated via this interface. These operations are only valid against an open
volume and cannot be performed against a file or directory. Individual data elements are described in the
following sections.
The parameters used for query are included in ntddk.h. The parameters for set are identical (although they
are omitted from ntddk.h and are only included in ntifs.h.) The two fields used for these operations are:

• Length is used to indicate the available space in the caller-provided buffer.
• FsInformationClass is used to indicate what information regarding the volume is being
queried or set.

In the following sections we discuss the various possible values for the FsInformationClass element.

2.2.7.1 FileFsVolumeInformation
The basic volume information is described by the FILE_FS_VOLUME_INFORMATION structure (from
ntioapi.h):

typedef struct _FILE_FS_VOLUME_INFORMATION {
    LARGE_INTEGER VolumeCreationTime;
    ULONG VolumeSerialNumber;
    ULONG VolumeLabelLength;
    BOOLEAN SupportsObjects;
    WCHAR VolumeLabel[1];
} FILE_FS_VOLUME_INFORMATION, *PFILE_FS_VOLUME_INFORMATION;

This basic structure can be used either to query or to set these basic volume attributes. The
SupportsObjects argument is reserved for future use (presumably in the Cairo release of Windows NT.)

2.2.7.2 FileFsLabelInformation
The label information is described by the FILE_FS_LABEL_INFORMATION structure (from ntioapi.h):

typedef struct _FILE_FS_LABEL_INFORMATION {
    ULONG VolumeLabelLength;
    WCHAR VolumeLabel[1];
} FILE_FS_LABEL_INFORMATION, *PFILE_FS_LABEL_INFORMATION;

This structure can be used to read or set the volume label on the given volume.

2.2.7.3 FileFsSizeInformation
Information about the physical disk, including the available disk space, total disk size, cluster size, and
underlying media sector size can be obtained using the FILE_FS_SIZE_INFORMATION structure (from
ntioapi.h):

typedef struct _FILE_FS_SIZE_INFORMATION {
    LARGE_INTEGER TotalAllocationUnits;
    LARGE_INTEGER AvailableAllocationUnits;
    ULONG SectorsPerAllocationUnit;
    ULONG BytesPerSector;
} FILE_FS_SIZE_INFORMATION, *PFILE_FS_SIZE_INFORMATION;
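
The fields of this structure combine by simple multiplication to give sizes in bytes. The sketch below uses a stand-in type, fs_size_info, with fixed-width members in place of the DDK types so that it compiles on its own. The type and function names are hypothetical.

```c
#include <stdint.h>

/* Stand-in mirroring FILE_FS_SIZE_INFORMATION with portable types. */
typedef struct {
    int64_t  TotalAllocationUnits;
    int64_t  AvailableAllocationUnits;
    uint32_t SectorsPerAllocationUnit;
    uint32_t BytesPerSector;
} fs_size_info;

/* Cluster ("allocation unit") size in bytes. */
uint64_t bytes_per_allocation_unit(const fs_size_info *info)
{
    return (uint64_t)info->SectorsPerAllocationUnit * info->BytesPerSector;
}

/* Total volume size in bytes. */
uint64_t total_bytes(const fs_size_info *info)
{
    return (uint64_t)info->TotalAllocationUnits * bytes_per_allocation_unit(info);
}

/* Free space in bytes. */
uint64_t free_bytes(const fs_size_info *info)
{
    return (uint64_t)info->AvailableAllocationUnits * bytes_per_allocation_unit(info);
}
```

For example, a volume reporting 1000 total units, 250 available units, 8 sectors per unit and 512 bytes per sector has 4096-byte clusters, 4,096,000 bytes in total and 1,024,000 bytes free.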

2.2.7.4 FileFsDeviceInformation
Typically, this operation (which returns the FILE_FS_DEVICE_INFORMATION structure) is
implemented by the underlying media device, rather than the file system itself. This structure is declared in
ntddk.h.

2.2.7.5 FileFsAttributeInformation
This allows the caller to ascertain basic information about the nature of the given FSD. The basic structure
used (from ntioapi.h) is:

typedef struct _FILE_FS_ATTRIBUTE_INFORMATION {
    ULONG FileSystemAttributes;
    LONG MaximumComponentNameLength;
    ULONG FileSystemNameLength;
    WCHAR FileSystemName[1];
} FILE_FS_ATTRIBUTE_INFORMATION, *PFILE_FS_ATTRIBUTE_INFORMATION;

The FileSystemAttributes field is a bit-field of the following values (the values are in winnt.h):

• FILE_CASE_SENSITIVE_SEARCH – indicates the FSD supports case-sensitive naming
• FILE_CASE_PRESERVED_NAMES – indicates the FSD preserves the case of names
• FILE_UNICODE_ON_DISK – indicates the FSD stores wide character (UNICODE) names
on disk
• FILE_PERSISTENT_ACLS – indicates the FSD stores access control lists on disk
• FILE_FILE_COMPRESSION – indicates the FSD does per-file compression
• FILE_VOLUME_IS_COMPRESSED – indicates the volume is a compressed volume (like
“double-space”)

The MaximumComponentNameLength field indicates the maximum size for any single path name
component.

The FileSystemNameLength indicates the size of the FileSystemName field.

The FileSystemName field indicates the internal name for the particular FSD (e.g., “NTFS”).

2.2.8 SET/QUERY_QUOTA
At present, there is no support for quotas in Windows NT 3.51. In NT 4.0 two new major function values
were added to manage quotas and we expect the final NT 4.0 release will include quota support.
A quota is a limit on the amount of disk space on a given volume that can be consumed by a particular
security entity. Quotas are stored based on the Security Identifier (SID).
The structure used for the quota information is:

typedef struct _FILE_FS_QUOTA_INFORMATION {
    ULONG NextEntryOffset;
    ULONG SidLength;
    LARGE_INTEGER ChangeTime;
    LARGE_INTEGER QuotaUsed;
    LARGE_INTEGER QuotaThreshold;
    LARGE_INTEGER QuotaLimit;
    SID Sid; // Variable length field
} FILE_FS_QUOTA_INFORMATION, *PFILE_FS_QUOTA_INFORMATION;

In addition to the new quota information structure, there is also a set of new control structures, used for
tracking both free disk space and quota management. These are declared in the
FILE_FS_CONTROL_INFORMATION structure (from ntioapi.h):

typedef struct _FILE_FS_CONTROL_INFORMATION {
    LARGE_INTEGER FreeSpaceStartFiltering;
    LARGE_INTEGER FreeSpaceThreshold;
    LARGE_INTEGER FreeSpaceStopFiltering;
    LARGE_INTEGER DefaultQuotaThreshold;
    LARGE_INTEGER DefaultQuotaLimit;
    LARGE_INTEGER DeletionLogQuotaLimit;
    ULONG FileSystemControlFlags;
} FILE_FS_CONTROL_INFORMATION, *PFILE_FS_CONTROL_INFORMATION;

The following control flags have thus far been defined. We provide them here simply for documentation
purposes and advise they are subject to change:

• #define VOLUME_CONTROL_QUOTA_NONE 0x00000000
• #define VOLUME_CONTROL_QUOTA_TRACK 0x00000001
• #define VOLUME_CONTROL_QUOTA_ENFORCE 0x00000002
• #define VOLUME_CONTROL_QUOTA_MASK 0x00000003
• #define VOLUME_CONTROL_QUOTAS_INCOMPLETE 0x00000004
• #define VOLUME_CONTROL_CONTENT_INDEX_ENABLED 0x00000008
• #define VOLUME_CONTROL_LOG_QUOTA_THRESHOLD 0x00000010
• #define VOLUME_CONTROL_LOG_QUOTA_LIMIT 0x00000020
• #define VOLUME_CONTROL_LOG_VOLUME_THRESHOLD 0x00000040
• #define VOLUME_CONTROL_LOG_VOLUME_LIMIT 0x00000080
• #define VOLUME_CONTROL_VALID_MASK 0x000000ff
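
The values listed above suggest that the low two bits form a quota "mode" field selected by VOLUME_CONTROL_QUOTA_MASK, with the remaining bits acting as independent flags. The sketch below decodes FileSystemControlFlags on that assumption. The constants are copied from the list above, and the quota_mode enumeration and the two function names are hypothetical. Because these definitions are subject to change, treat this purely as an illustration of the masking technique.

```c
/* Constants copied from the list above so the sketch is self-contained. */
#define VOLUME_CONTROL_QUOTA_NONE    0x00000000UL
#define VOLUME_CONTROL_QUOTA_TRACK   0x00000001UL
#define VOLUME_CONTROL_QUOTA_ENFORCE 0x00000002UL
#define VOLUME_CONTROL_QUOTA_MASK    0x00000003UL
#define VOLUME_CONTROL_VALID_MASK    0x000000ffUL

/* Hypothetical decoded form of the two-bit quota mode field. */
typedef enum {
    QUOTA_MODE_NONE,
    QUOTA_MODE_TRACK,
    QUOTA_MODE_ENFORCE,
    QUOTA_MODE_UNRECOGNIZED   /* both mode bits set; meaning not documented */
} quota_mode;

/* Extracts the quota mode from the control flags. */
quota_mode decode_quota_mode(unsigned long flags)
{
    switch (flags & VOLUME_CONTROL_QUOTA_MASK) {
    case VOLUME_CONTROL_QUOTA_NONE:    return QUOTA_MODE_NONE;
    case VOLUME_CONTROL_QUOTA_TRACK:   return QUOTA_MODE_TRACK;
    case VOLUME_CONTROL_QUOTA_ENFORCE: return QUOTA_MODE_ENFORCE;
    default:                           return QUOTA_MODE_UNRECOGNIZED;
    }
}

/* Returns nonzero when no bits outside VOLUME_CONTROL_VALID_MASK are set. */
int control_flags_are_valid(unsigned long flags)
{
    return (flags & ~VOLUME_CONTROL_VALID_MASK) == 0;
}
```

Masking first means that the logging flags in the upper bits do not disturb the mode test.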

2.2.9 DIRECTORY_CONTROL
Directory Control operations are further divided into two sub-categories, based upon the minor IRP
function code:

• IRP_MN_QUERY_DIRECTORY
• IRP_MN_NOTIFY_CHANGE_DIRECTORY

The query directory operation is used to enumerate the contents of the given directory, while the notify
change directory is used to implement change notification (cf. the Win32 FindFirstChangeNotification).
The following sections describe the information obtained via the query directory minor operation.

2.2.9.1 FileNamesInformation
The FILE_NAMES_INFORMATION structure (from ntioapi.h) is:

typedef struct _FILE_NAMES_INFORMATION {
    ULONG NextEntryOffset;
    ULONG FileIndex;
    ULONG FileNameLength;
    WCHAR FileName[1];
} FILE_NAMES_INFORMATION, *PFILE_NAMES_INFORMATION;

This is used to enumerate the entries within a directory.
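
Entries of this kind are packed into the caller's buffer and chained by NextEntryOffset, with an offset of zero marking the last entry. The sketch below walks such a buffer defensively, checking every offset and name length against the number of valid bytes before touching an entry. The names_entry type is a stand-in for FILE_NAMES_INFORMATION built from fixed-width types, and the function name count_entries is hypothetical. The sketch assumes FileNameLength is a byte count and that entries start at offsets suitably aligned for a 32-bit field.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in mirroring FILE_NAMES_INFORMATION with portable types. */
typedef struct {
    uint32_t NextEntryOffset;  /* bytes to the next entry, 0 if last */
    uint32_t FileIndex;
    uint32_t FileNameLength;   /* assumed to be in bytes */
    uint16_t FileName[1];
} names_entry;

/* Counts the well-formed entries in the first valid_bytes of buffer. */
size_t count_entries(const void *buffer, size_t valid_bytes)
{
    const unsigned char *base = (const unsigned char *)buffer;
    const size_t header = offsetof(names_entry, FileName);
    size_t offset = 0;
    size_t count = 0;

    for (;;) {
        const names_entry *entry;

        /* The fixed part of the entry must fit in what remains. */
        if (valid_bytes - offset < header) {
            break;
        }
        entry = (const names_entry *)(base + offset);

        /* So must the variable-length name that follows it. */
        if (entry->FileNameLength > valid_bytes - offset - header) {
            break;
        }
        count++;

        /* Zero marks the last entry; a jump past the end is malformed. */
        if (entry->NextEntryOffset == 0 ||
            entry->NextEntryOffset > valid_bytes - offset) {
            break;
        }
        offset += entry->NextEntryOffset;
    }
    return count;
}
```

The same walk applies to the other chained structures in this section, with only the header layout changing.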


2.2.9.2 FileDirectoryInformation
This call, valid only against a directory, provides detailed information about each entry in the
given directory. This data is returned in the FILE_DIRECTORY_INFORMATION data structure:

typedef struct _FILE_DIRECTORY_INFORMATION {
    ULONG NextEntryOffset;
    ULONG FileIndex;
    LARGE_INTEGER CreationTime;
    LARGE_INTEGER LastAccessTime;
    LARGE_INTEGER LastWriteTime;
    LARGE_INTEGER ChangeTime;
    LARGE_INTEGER EndOfFile;
    LARGE_INTEGER AllocationSize;
    ULONG FileAttributes;
    ULONG FileNameLength;
    WCHAR FileName[1];
} FILE_DIRECTORY_INFORMATION, *PFILE_DIRECTORY_INFORMATION;

2.2.9.3 FileFullDirectoryInformation
This call appears to return information virtually identical to FileDirectoryInformation except that it also
describes the size of any associated EA for the file:

typedef struct _FILE_FULL_DIR_INFORMATION {
    ULONG NextEntryOffset;
    ULONG FileIndex;
    LARGE_INTEGER CreationTime;
    LARGE_INTEGER LastAccessTime;
    LARGE_INTEGER LastWriteTime;
    LARGE_INTEGER ChangeTime;
    LARGE_INTEGER EndOfFile;
    LARGE_INTEGER AllocationSize;
    ULONG FileAttributes;
    ULONG FileNameLength;
    ULONG EaSize;
    WCHAR FileName[1];
} FILE_FULL_DIR_INFORMATION, *PFILE_FULL_DIR_INFORMATION;

2.2.9.4 FileBothDirectoryInformation
This call is an extension of FileFullDirectoryInformation in that it describes both the “short” name and
the full (long) name of each directory entry. The “short” name is the 8.3 compliant name
required for DOS/Win16 compatibility:

typedef struct _FILE_BOTH_DIR_INFORMATION {
    ULONG NextEntryOffset;
    ULONG FileIndex;
    LARGE_INTEGER CreationTime;
    LARGE_INTEGER LastAccessTime;
    LARGE_INTEGER LastWriteTime;
    LARGE_INTEGER ChangeTime;
    LARGE_INTEGER EndOfFile;
    LARGE_INTEGER AllocationSize;
    ULONG FileAttributes;
    ULONG FileNameLength;
    ULONG EaSize;
    CCHAR ShortNameLength;
    WCHAR ShortName[12];
    WCHAR FileName[1];
} FILE_BOTH_DIR_INFORMATION, *PFILE_BOTH_DIR_INFORMATION;

2.2.9.5 Sample Directory Enumeration Code


The following sample program demonstrates using the NT API for directory enumeration. We have
included the SOURCES file to build this (in the DDK environment) immediately following this sample.

#include <ntddk.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <memory.h>
#include <wrapper-ioctl.h>

//
// The NATIVE NT API definitions.
//

NTSYSAPI
NTSTATUS
NTAPI
NtCreateFile(
OUT PHANDLE FileHandle,
IN ACCESS_MASK DesiredAccess,
IN POBJECT_ATTRIBUTES ObjectAttributes,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN PLARGE_INTEGER AllocationSize OPTIONAL,
IN ULONG FileAttributes,
IN ULONG ShareAccess,
IN ULONG CreateDisposition,
IN ULONG CreateOptions,
IN PVOID EaBuffer OPTIONAL,
IN ULONG EaLength
);

NTSYSAPI
NTSTATUS
NTAPI
NtQueryDirectoryFile(
IN HANDLE FileHandle,
IN HANDLE Event OPTIONAL,
IN PIO_APC_ROUTINE ApcRoutine OPTIONAL,
IN PVOID ApcContext OPTIONAL,
OUT PIO_STATUS_BLOCK IoStatusBlock,
OUT PVOID FileInformation,
IN ULONG Length,
IN FILE_INFORMATION_CLASS FileInformationClass,
IN BOOLEAN ReturnSingleEntry,
IN PUNICODE_STRING FileName OPTIONAL,
IN BOOLEAN RestartScan
);

NTSYSAPI
NTSTATUS
NTAPI
NtCreateEvent (
OUT PHANDLE EventHandle,
IN ACCESS_MASK DesiredAccess,
IN POBJECT_ATTRIBUTES ObjectAttributes OPTIONAL,
IN EVENT_TYPE EventType,
IN BOOLEAN InitialState
);

NTSYSAPI
NTSTATUS
NTAPI
NtWaitForSingleObject(
IN HANDLE Handle,
IN BOOLEAN Alertable,
IN PLARGE_INTEGER Timeout OPTIONAL
);

typedef struct _FILE_BOTH_DIR_INFORMATION {
    ULONG NextEntryOffset;
    ULONG FileIndex;
    LARGE_INTEGER CreationTime;
    LARGE_INTEGER LastAccessTime;
    LARGE_INTEGER LastWriteTime;
    LARGE_INTEGER ChangeTime;
    LARGE_INTEGER EndOfFile;
    LARGE_INTEGER AllocationSize;
    ULONG FileAttributes;
    ULONG FileNameLength;
    ULONG EaSize;
    CCHAR ShortNameLength;
    WCHAR ShortName[12];
    WCHAR FileName[1];
} FILE_BOTH_DIR_INFORMATION, *PFILE_BOTH_DIR_INFORMATION;

//
// This simple test program demonstrates opening the root directory of the
// C:\ volume and enumerating its contents.
//

int __cdecl main(int argc, char **argv)
{
    UNICODE_STRING RootDirectoryName;
    OBJECT_ATTRIBUTES RootDirectoryAttributes;
    NTSTATUS Status;
    HANDLE RootDirectoryHandle;
    IO_STATUS_BLOCK Iosb;
    HANDLE Event;
    UCHAR Buffer[65536]; // byte buffer that receives the directory entries
    PFILE_BOTH_DIR_INFORMATION DirInformation;

    //
    // We use the name DosDevices rather than ?? so that it works on
    // NT 3.51 as well as NT 4.0.
    //

    RtlInitUnicodeString(&RootDirectoryName, L"\\DosDevices\\C:\\");

    //
    // Now open it
    //

    InitializeObjectAttributes(&RootDirectoryAttributes,
                               &RootDirectoryName,
                               OBJ_CASE_INSENSITIVE,
                               0, // absolute open, no relative directory handle
                               0); // no security descriptor necessary

    Status = NtCreateFile(&RootDirectoryHandle,
                          GENERIC_READ,
                          &RootDirectoryAttributes,
                          &Iosb,
                          0, // no meaning for allocation
                          FILE_ATTRIBUTE_DIRECTORY, // MUST be a directory
                          FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE, // share all
                          FILE_OPEN, // must already exist
                          FILE_DIRECTORY_FILE, // MUST be a directory
                          0,
                          0);

    if (!NT_SUCCESS(Status)) {
        printf("Unable to open %wZ, error = 0x%x\n", &RootDirectoryName, Status);
        return Status;
    }

    //
    // Create an event
    //

    Status = NtCreateEvent(&Event,
                           GENERIC_ALL,
                           0, // no object attributes
                           NotificationEvent,
                           FALSE);

    if (!NT_SUCCESS(Status)) {
        printf("Event creation failed with error 0x%x\n", Status);
        return Status;
    }

    //
    // We pass NO NAME which is the same as *.*
    //

    Status = NtQueryDirectoryFile(RootDirectoryHandle,
                                  Event,
                                  0, // No APC routine
                                  0, // No APC context
                                  &Iosb,
                                  Buffer,
                                  sizeof(Buffer),
                                  FileBothDirectoryInformation,
                                  FALSE,
                                  NULL,
                                  FALSE);

    //
    // If the directory operation is in progress, wait for it to finish.
    // The final result of the operation is then in the I/O status block.
    //

    if (Status == STATUS_PENDING) {
        Status = NtWaitForSingleObject(Event, TRUE, 0);
        if (NT_SUCCESS(Status)) {
            Status = Iosb.Status;
        }
    }

    //
    // Check for errors.
    //

    if (!NT_SUCCESS(Status)) {
        printf("Unable to query directory contents, error 0x%x\n", Status);
        return Status;
    }

    //
    // Note that as this is an example we're not ITERATING over the directory. To
    // do so we should use a loop and query the directory AGAIN until we get back
    // STATUS_NO_MORE_FILES. If the directory was TOTALLY EMPTY we'd get back
    // STATUS_NO_SUCH_FILE - but only the ROOT directory can ever be TOTALLY EMPTY.
    //

    DirInformation = (PFILE_BOTH_DIR_INFORMATION) Buffer;

    while (1) {
        UNICODE_STRING EntryName;

        EntryName.MaximumLength = EntryName.Length = (USHORT) DirInformation->FileNameLength;
        EntryName.Buffer = &DirInformation->FileName[0];

        //
        // Dump the full name of the file. We could dump the other information
        // here as well, but we'll keep the example shorter instead.
        //

        printf("%wZ:\n", &EntryName);

        //
        // If there is no offset in the entry, the buffer has been exhausted.
        //

        if (0 == DirInformation->NextEntryOffset) {
            break;
        } else {
            //
            // Advance to the next entry.
            //
            DirInformation = (PFILE_BOTH_DIR_INFORMATION) (((PUCHAR)DirInformation) +
                                                           DirInformation->NextEntryOffset);
        }

        //
        // Skip a line
        //

        printf("\n");
    }

    //
    // Note that we skip closing our handles. The process death will do it for us.
    //

    return STATUS_SUCCESS;
}

The SOURCES file we used to build this program:

BLDCRT=1

MAJORCOMP=fsdk
MINORCOMP=dt

TARGETNAME=dt
TARGETPATH=obj
TARGETTYPE=UMAPPL_NOLIB

MSC_WARNING_LEVEL=/W3 /WX

SOURCES=dt.c

INCLUDES=$(BASEDIR)\inc

UMLIBS= $(BASEDIR)\lib\*\$(DDKBUILDENV)\ntdll.lib
UMBASE=0x400000
UMAPPL=dt
UMTYPE=console

2.2.10 CLEANUP
Cleanup corresponds to the last user reference to a given file being released. At this stage, an FSD
typically performs all “background” activities such as processing any pending deletes, deleting any lock
state, etc. The VM system may still hold references and once they have been released and cleanup
processing completed, Close will be called.
There are no arguments for cleanup.

2.2.11 QUERY/SET_SECURITY
These routines are self-explanatory in nature. The data structures used by them are documented. The size
of an opaque security descriptor can be obtained using RtlLengthSecurityDescriptor, which is documented
in the Windows NT DDK.

2.2.12 FILE_SYSTEM_CONTROL
A file system control is a special file-system targeted operation. Instead of being a device control which is
typically passed through the file system onto the underlying media device, these control codes must be
processed by the file system itself.

2.2.12.1 Mount
A mount call is handled by the named device object of a file system. In this case, the call indicates a
particular physical media volume which the system is attempting to load.

The FSD is responsible for examining the volume in question and determining whether it belongs to the
given FSD. If it does, the FSD creates the necessary state for subsequent file system operations directed
at that volume.

The FSD must create a device object (typically unnamed so that the only access to that device object is via
the VPB) and initialize the VPB in the media volume’s device object to point to the new FSD device
object. Any other internal FSD state must be established at this point and upon completion of this
processing, STATUS_SUCCESS is returned by the FSD.

If the media is not recognized by a file system, then the call must be failed with the
STATUS_UNRECOGNIZED_VOLUME status code. This allows the I/O Manager to continue looking
for a candidate file system to mount the given media.

One special case here is if the FSD is in fact a “mini-file system” or a “file system recognizer”. In that
case, rather than returning STATUS_SUCCESS, the recognizer will return
STATUS_FS_DRIVER_REQUIRED which indicates to the I/O Manager that a subsequent call to load the
driver must be made. See Section 2.2.12.3 for more information about the I/O Manager’s loading of file
system drivers.
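The three outcomes described above can be summarized in a small decision function. This is a user-mode sketch, not driver code: the sketch_status enum, the mount_request structure and handle_mount are hypothetical stand-ins for the NTSTATUS values and for an FSD's mount logic, and volume recognition is reduced to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the NTSTATUS codes named in the text. */
typedef enum {
    SKETCH_SUCCESS,              /* models STATUS_SUCCESS             */
    SKETCH_UNRECOGNIZED_VOLUME,  /* models STATUS_UNRECOGNIZED_VOLUME */
    SKETCH_FS_DRIVER_REQUIRED    /* models STATUS_FS_DRIVER_REQUIRED  */
} sketch_status;

typedef struct {
    bool volume_is_ours;   /* on-disk structures match this file system      */
    bool is_recognizer;    /* true for a "file system recognizer"            */
    bool vpb_initialized;  /* set once the new device object is tied to VPB  */
} mount_request;

sketch_status handle_mount(mount_request *req)
{
    assert(req != NULL);

    if (!req->volume_is_ours) {
        /* Let the I/O Manager keep looking for another file system. */
        return SKETCH_UNRECOGNIZED_VOLUME;
    }
    if (req->is_recognizer) {
        /* The media is ours, but the full driver is not loaded yet. */
        return SKETCH_FS_DRIVER_REQUIRED;
    }
    /* Full FSD: create the volume state and hook it up to the VPB. */
    req->vpb_initialized = true;
    return SKETCH_SUCCESS;
}
```

Only the full FSD path touches the VPB; the recognizer path leaves all volume state alone so the real driver can build it after it is loaded.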

2.2.12.2 Verify
A verify call is made to an FSD whenever the DO_VERIFY_VOLUME bit is set in the media device
object’s Flags field. Typically, this occurs because the device has signaled a possible media change
condition requiring that the FSD validate the state of the media in the drive.

Normally, this is done by the FSD during a CREATE operation. In this case, the FSD checks to determine
if this bit is set. If not, the operation continues normally. If it is set, however, the FSD calls
IoVerifyVolume – a routine provided by the I/O Manager that in turn calls back into the FSD to verify
that the volume’s state is valid.

If the verification determines that the media has changed, the FSD must then indicate this fact to the I/O
Manager by returning STATUS_WRONG_VOLUME. This will cause the I/O Manager to
initiate mount processing on the volume again – for all file systems, not just the current file system.

If the verification determines that the media has not changed, the FSD simply clears the
DO_VERIFY_VOLUME bit and returns STATUS_SUCCESS to the I/O Manager to indicate the volume is
OK.
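The verify decision can be sketched as follows. This is a simplified user-mode model: volume_state, verify_result and verify_volume are hypothetical names, the verify bit is modeled as a boolean, and comparing a volume serial number is only one assumed way an FSD might decide whether the media has changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool verify_pending;           /* models the verify bit on the device object */
    unsigned long mounted_serial;  /* serial number recorded at mount time       */
} volume_state;

typedef enum {
    VERIFY_NOT_NEEDED,     /* bit was clear, continue normally            */
    VERIFY_VOLUME_OK,      /* same media, bit cleared                     */
    VERIFY_MEDIA_CHANGED   /* different media, I/O Manager must remount   */
} verify_result;

verify_result verify_volume(volume_state *vol, unsigned long serial_on_media)
{
    assert(vol != NULL);

    if (!vol->verify_pending) {
        return VERIFY_NOT_NEEDED;
    }
    if (serial_on_media != vol->mounted_serial) {
        /* Leave the bit set; the volume we mounted is no longer in the drive. */
        return VERIFY_MEDIA_CHANGED;
    }
    vol->verify_pending = false;
    return VERIFY_VOLUME_OK;
}
```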

2.2.12.3 LoadFileSystem
This call is used by the I/O Manager to request that a given file system recognizer actually load the
full file system driver.

Upon receipt of this request, a recognizer will load the file system driver (via a call to ZwLoadDriver),
deregister its own file system device object (so that it does not receive further mount calls) and return
STATUS_SUCCESS to the I/O Manager.

2.2.12.4 UserRequest
A User Request via the FS Control path is a “catch all” for all other requests. Any of these requests can be
made from user mode. The pre-defined control codes are described in the remainder of this section.

2.2.12.4.1 Oplock

An opportunistic lock is used by the Windows NT LanManager components to implement a simple cache
consistency protocol for network accessed files.

[Diagram not reproduced. Labels: File Server, Application, SRV, NT Executive, I/O Manager, RDR, NTFS,
FAT, TDI, Disk Drivers, NDIS, Network Client.]

Figure 5: Client/Server Communications (Oplocks)

As it turns out, the Windows NT implementation of oplocks relies very heavily upon the network
implementation of oplocks in Microsoft’s LanManager protocol as implemented by Windows NT. In
Figure 5 we provide our basic reference diagram for this discussion of oplocks. Oplocks are granted by
SRV to instances of RDR running on systems across the network – possibly even including the same
system on which SRV is running.

When a Network Client opens a file across the network, it is typically the only user accessing that file. In
this very common case, the network client need not store data back to the server immediately, nor need it
fetch data repeatedly from the server. Allowing this optimization minimizes unnecessary network traffic
which in turn provides overall better perceived performance for both the network client and all other clients
using the network.

“Cache consistency” requires that any two clients on the network must see the same information in the file
at the same point in time. Thus, if one client is not writing data back to the file server on a regular basis, a
second client reading data from the server would receive stale data. This would violate the requirement
that two clients on the network see the same information in the file at a given point in time.

To allow client-side caching without suffering from cache consistency problems requires a cache
consistency protocol – a mechanism whereby the client keeping data locally rather than writing it back to
disk or refetching it from the server each time it needs it can be informed when it must write the data back
to disk or reread it from the file server.

On Windows NT this is done via the “opportunistic locking” protocol, or oplock. In the balance of this
section we describe the various types of oplocks, their uses, and how an FSD should deal with them.

There are three types of oplocks: level 1, batch, and level 2. Both the level 1 and batch oplocks are
“exclusive access” opens. They are used slightly differently, however, and hence have somewhat different
semantics. A level 2 oplock is a “shared access” grant on the file.

Level 1 is used by a remote client that wishes to modify the data. Once granted a Level 1 oplock, the
remote client may cache the data, modify the data in its cache and need not write it back to the server
immediately.

Batch oplocks are used by remote clients for accessing script files where the file is opened, read or written,
and then closed – repeatedly. Thus, a batch oplock corresponds not to a particular application opening the
file, but rather to a remote client’s network file system caching the file because it knows something about the
semantics of the given file access. The name “batch” comes from the fact that this behavior was observed
by Microsoft with “batch files” being processed by command line utilities. Log files especially exhibit this
behavior – when a script is being processed, each command is executed in turn. If the output of the script is
redirected to a log file, the file fits the pattern described earlier, namely open/write/close. With many lines
in a file this pattern can be repeated hundreds of times.

Level 2 is used by a remote client that merely wishes to read the data. Once granted a Level 2 oplock, the
remote client may cache the data and need not worry that the data on the remote file server will change
without it being advised of that change.
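One way to keep the three oplock types straight is to write down what each permits the client to do. The following sketch encodes the descriptions above as predicates; the oplock_type enum and the function names are our own illustrative inventions, not NT definitions.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    OPLOCK_TYPE_NONE,
    OPLOCK_TYPE_LEVEL_2,
    OPLOCK_TYPE_LEVEL_1,
    OPLOCK_TYPE_BATCH
} oplock_type;

/* Level 1 and batch are "exclusive access"; level 2 is "shared access". */
bool oplock_is_exclusive(oplock_type t)
{
    return t == OPLOCK_TYPE_LEVEL_1 || t == OPLOCK_TYPE_BATCH;
}

/* Any granted oplock lets the client cache data it has read. */
bool oplock_allows_read_caching(oplock_type t)
{
    return t != OPLOCK_TYPE_NONE;
}

/* Only an exclusive holder may modify data in its cache without writing back. */
bool oplock_allows_write_caching(oplock_type t)
{
    return oplock_is_exclusive(t);
}

/* Only a batch oplock stays useful across repeated application open/close. */
bool oplock_spans_application_closes(oplock_type t)
{
    return t == OPLOCK_TYPE_BATCH;
}
```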

An oplock must be broken whenever the cache consistency guarantee provided by the oplock can no longer
be provided. Thus, whenever a second network client attempts to access data in the same file across the
network, the file server is responsible for “breaking” the oplocks and only then allowing the remote client
to access the file. This ensures that the data is guaranteed to be consistent and hence we have preserved the
consistency guarantees essential to proper operation.

An oplock break occurs whenever SRV detects that some condition necessary to maintaining the oplock
has ceased to be correct. In that case, SRV begins breaking the oplock. Depending upon the type of oplock
being broken, SRV may have to engage in a multi-message protocol to complete the oplock break.

The simplest oplock break is for a level 2 oplock. In this case, SRV merely advises the remote client that it
must invalidate any cached data it has and reread it from the file server.

[Diagram not reproduced. Labels: File Server, Application, SRV, NT Executive, I/O Manager, RDR, NTFS,
FAT, TDI, Disk Drivers, NDIS, Network Client. Message flows: Break Notification; I/O (Write Back) + ACK;
Break Acknowledgement.]

Figure 6: Level 1 Oplock Break

Breaking a level 1 oplock, however, is a bit more complicated. In that case the client may have in memory
data that must be written back to the file server before the oplock break should be considered complete. A
graphical description of the control flow between SRV and RDR is shown in Figure 6. It demonstrates the
call from SRV indicating that an oplock break is in progress. In that case, the remote client initiates a series
of write operations back to the server. The write back process can consist of many operations between the
server and client. Once all data has been written back to the server, the client then acknowledges the
oplock break. Microsoft’s protocol allows the server to grant a Level 2 oplock to the client if the client so
desires. This would allow the client to retain the data in its cache (as it is valid) minimizing unnecessary
network traffic.

Breaking a batch oplock is initiated by the file server (SRV) which indicates to the client that an oplock
break is in progress. The client (RDR) then writes any dirty cached data back to the file server. When that
is completed, the client then closes the file. This causes the file to be reread from the file server on a
subsequent access.

In fact, closing a file always releases an oplock on the given file. A client is no longer interested in cache
consistency once the file has been closed – no data may be cached by the client if the file is not open.

The oplock protocol itself is sufficient to ensure cache consistency between clients anywhere on the
network. There is one case, however, that is not covered by this mechanism – the case of local file system
access, perhaps from a local application program. In this case, the application will call directly into the
FSD without using either the server (SRV) or client (RDR) components.

Handling this case is essential to our fundamental requirement for cache consistency. Because NT must
keep local access consistent with data cached by remote clients, oplocks must be implemented in the FSD,
where both local and remote requests are seen. Thus, an inherently network activity (remote caching of
data) has an important impact on file systems.

The balance of this section describes specific control codes that are used by SRV (or any other system
component using oplocks) to obtain and release oplocks from the file system.

2.2.12.4.1.1 FSCTL_REQUEST_OPLOCK_LEVEL_1

A level 1 oplock is an exclusive oplock on the file. That is, it gives the holder of the lock the right to cache
the data and to modify the data in its cache. Essentially, no other process (on any system in the network)
may be accessing the file.

An FSD will grant such an oplock when the file is only opened by a single process. Thus, if the file is
already opened by two or more clients when a request for a level 1 lock is made, the request will be denied.

Similarly, if a level 1 lock is already held by the remote client and a second client opens the file, the level 1
lock previously granted must be revoked. This will trigger a write-back of any dirty data stored by the first
client before the oplock break is completed.

An interesting requirement of the oplock protocol is that the interface be implemented
asynchronously. The oplock is granted when STATUS_PENDING is returned for the IRP containing the
oplock request. Thus, an FSD must decide whether to grant the oplock synchronously, while processing the
original IRP. It cannot return STATUS_PENDING merely to defer that decision, because
STATUS_PENDING indicates to the caller that the oplock was granted.

Once an oplock has been granted, the IRP representing that oplock is queued and held. The oplock break
processing is implemented by completing the original IRP that requested the oplock. The IRP must be
completed by setting the Information field of the IoStatus field to either
FILE_OPLOCK_BROKEN_TO_LEVEL_2 or FILE_OPLOCK_BROKEN_TO_NONE.

However, the oplock break at this stage has not completed. Instead, the owner of the oplock must do any
internal processing required. Once that processing has completed, the oplock owner must acknowledge the
oplock break. If FILE_OPLOCK_BROKEN_TO_LEVEL_2 was returned, the owner of the oplock may
indicate FSCTL_OPLOCK_BREAK_ACKNOWLEDGE, in which case the acknowledgment IRP is
treated as a request for a level 2 oplock (See Section 2.2.12.4.1.4.) Alternatively, the oplock owner may
acknowledge the break but decline the offer of a level 2 oplock (See Section 2.2.12.4.1.7) by indicating
FSCTL_OPLOCK_BREAK_ACK_NO_2.
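The grant, break and acknowledge steps for a level 1 oplock form a small state machine, sketched below in ordinary user-mode C. Everything here (file_oplock, the state and acknowledgment enums, the function names) is hypothetical and exists only to illustrate the sequence; a real FSD tracks this state per file and drives it by pending and completing IRPs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    OPLOCK_STATE_NONE,
    OPLOCK_STATE_LEVEL_1,              /* request IRP is pended (granted)         */
    OPLOCK_STATE_BREAKING_TO_LEVEL_2,  /* IRP completed, BROKEN_TO_LEVEL_2 given  */
    OPLOCK_STATE_BREAKING_TO_NONE,     /* IRP completed, BROKEN_TO_NONE given     */
    OPLOCK_STATE_LEVEL_2
} oplock_state;

typedef enum {
    ACK_ACCEPT_LEVEL_2,   /* models FSCTL_OPLOCK_BREAK_ACKNOWLEDGE */
    ACK_DECLINE_LEVEL_2   /* models FSCTL_OPLOCK_BREAK_ACK_NO_2    */
} break_ack;

typedef struct {
    unsigned open_count;
    oplock_state state;
} file_oplock;

/* Returns true when the oplock is granted (the IRP would be pended). */
bool request_level_1(file_oplock *f)
{
    assert(f != NULL);
    if (f->open_count != 1 || f->state != OPLOCK_STATE_NONE) {
        return false;
    }
    f->state = OPLOCK_STATE_LEVEL_1;
    return true;
}

/* A second opener arrives: the pended IRP is completed to start the break. */
void second_open_arrives(file_oplock *f, bool offer_level_2)
{
    assert(f != NULL);
    f->open_count++;
    if (f->state == OPLOCK_STATE_LEVEL_1) {
        f->state = offer_level_2 ? OPLOCK_STATE_BREAKING_TO_LEVEL_2
                                 : OPLOCK_STATE_BREAKING_TO_NONE;
    }
}

/* The owner has finished its write-back and acknowledges the break. */
void acknowledge_break(file_oplock *f, break_ack ack)
{
    assert(f != NULL);
    if (f->state == OPLOCK_STATE_BREAKING_TO_LEVEL_2 && ack == ACK_ACCEPT_LEVEL_2) {
        f->state = OPLOCK_STATE_LEVEL_2;
    } else if (f->state == OPLOCK_STATE_BREAKING_TO_LEVEL_2 ||
               f->state == OPLOCK_STATE_BREAKING_TO_NONE) {
        f->state = OPLOCK_STATE_NONE;
    }
}
```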

The principal reason a level 1 oplock is broken is because another caller opens the file. Normally a caller
who wishes to open the file will block until the oplock break is completed. However, SRV (the
LanManager file server) requires, for internal deadlock prevention reasons, that a create be completed
before the oplock break is completed. This is done by setting (in the create request) the
FILE_COMPLETE_IF_OPLOCKED bit in the option flags.

However, before SRV can use the file thus created, it must later verify that the oplock break has really
completed. It does this by making a subsequent call to the FSD to wait until the oplock break on the given
file is completed (See Section 2.2.12.4.1.5).

2.2.12.4.1.2 FSCTL_REQUEST_OPLOCK_LEVEL_2

A level 2 oplock is a shared oplock on the contents of the file. It allows a network client (RDR) to cache
the data in memory without fear that the data will change.

As with a level 1 oplock, the oplock is requested via an IRP and the oplock is granted when the FSD
returns STATUS_PENDING. Unlike a level 1 oplock, however, a level 2 oplock may be granted when a
file has previously been opened. Further, a level 2 oplock may be granted even when other opens of the file
allow write access. This point is really very important. As it turns out, many applications will open a file
for write access, even if they never intend on modifying the contents of the file.

Thus, an FSD must check when a write is done to a file to ensure that no level 2 oplocks have been granted
against the file – and hence need to be invalidated. This ensures that if the remote client did cache data that
it will be properly invalidated.

The oplock is broken by completing the pending IRP. In the case of a level 2 oplock nothing is set in the
Information field - the IRP is simply completed with STATUS_SUCCESS. This ensures that the oplock
holder has received notification that their cached data is now stale and must be refreshed prior to
subsequent use.
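The write-path check for level 2 oplocks can be modeled very simply. In this sketch, shared_oplocks and both functions are hypothetical; "completing the pending IRP" is represented by nothing more than a counter of notifications delivered.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned level2_holders;    /* number of pended level 2 request IRPs     */
    unsigned breaks_delivered;  /* how many of those IRPs have been completed */
} shared_oplocks;

/* A level 2 request is granted by pending its IRP. */
void grant_level_2(shared_oplocks *s)
{
    assert(s != NULL);
    s->level2_holders++;
}

/*
 * Called on the write path before the data is modified. Every outstanding
 * level 2 oplock is broken so that no holder keeps stale cached data.
 * Returns the number of oplocks broken by this write.
 */
unsigned break_level_2_for_write(shared_oplocks *s)
{
    unsigned broken;

    assert(s != NULL);
    broken = s->level2_holders;
    s->breaks_delivered += broken;
    s->level2_holders = 0;
    return broken;
}
```

A second write finds no holders and breaks nothing, which is why the check is cheap enough to make on every write.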

2.2.12.4.1.3 FSCTL_REQUEST_BATCH_OPLOCK

A batch oplock is an exclusive oplock against a file’s contents and against changes in the attributes of the
file (notably, but not exclusively, its name.) It allows a network client to keep a file “oplocked” even
though the application on the remote client is opening and closing the file repeatedly (as is the case for a
batch file and hence the name of the oplock.)

A batch oplock can only be granted under the same circumstances as a level 1 oplock (See Section
2.2.12.4.1.1.) The oplock itself is requested via an IRP. Returning STATUS_PENDING for that IRP
indicates the oplock itself has been granted.

Breaking a batch oplock is a superset of the cases where a level 1 oplock must be broken. In addition to
breaking a batch oplock whenever the data itself has changed, a batch oplock must also be broken whenever
the name of the file changes. This is because a batch oplock covers the file even though the client may be
opening and closing the file repeatedly. Were that the case and a rename occurred, the client needs to be
advised that the file handle it is using no longer represents the file it used to represent.

One interesting side-effect to using batch oplocks is that certain CREATE operations may fail with the
Information field set to FILE_OPBATCH_BREAK_UNDERWAY. This occurs when the caller indicated
they were unwilling to wait for the oplock break to complete by setting the
FILE_COMPLETE_IF_OPLOCKED options flag, as is typically the case for SRV, the LanManager file
server. In this case the create operation will fail (typically with STATUS_SHARING_VIOLATION) to
indicate to the caller that the problem is with a batch oplock presently held on the file and that a blocking
call to CREATE would not necessarily fail.

2.2.12.4.1.4 FSCTL_OPLOCK_BREAK_ACKNOWLEDGE

Once an exclusive (level 1 or batch) oplock has been broken, other file system requests cannot continue
until the oplock break is acknowledged. This can be done one of two ways – either by a subsequent call to
the FSD indicating a control code of FSCTL_OPLOCK_BREAK_ACKNOWLEDGE or by closing the file
handle.

A batch oplock break is normally acknowledged by the file object being closed. A level 1 oplock is
normally acknowledged by way of this call.

2.2.12.4.1.5 FSCTL_OPLOCK_BREAK_NOTIFY

When SRV opens a new file, indicating that it does not wish to wait for the oplock break to complete (See
Section 2.2.12.4.1.1) it must subsequently make a call to the underlying FSD to ensure that the oplock
break has successfully completed.

This is accomplished by indicating FSCTL_OPLOCK_BREAK_NOTIFY as the control code in the IRP.
This IRP will then block waiting for any oplock break activity to complete on the file. Once this call
returns (STATUS_SUCCESS) the caller may use the file object safely.

For SRV, proper implementation of these semantics by the FSD is essential to correct behavior. If a
normally asynchronous CREATE operation by SRV is forced to be synchronous (perhaps by a filter driver)
SRV will experience internal deadlock conditions.

2.2.12.4.1.6 FSCTL_OPBATCH_ACK_CLOSE_PENDING

Section 2.2.12.4.1.4 mentioned that an oplock break could be acknowledged by closing the file. This
control code is used by the oplock owner to indicate the oplock break has been acknowledged and a close
of the file is imminent.

In this case, a level 2 oplock is not necessary. No further use should be made of this file object except to
close the file.

2.2.12.4.1.7 FSCTL_OPLOCK_BREAK_ACK_NO_2

This control code is a variation on the general acknowledgment operation. In this instance, the owner of
the oplock is declining the offer (by the FSD) of a level 2 oplock. This is typically because the owner of
the oplock does not use or support level 2 oplocks.

2.2.12.4.2 Volume

The volume operations all operate on a given piece of media and are used to manipulate the media volume
itself. The Win32 Programmer’s Reference describes these operations under the DeviceIoControl
documentation section.

2.2.12.4.2.1 LOCK_VOLUME

This operation is used by an application to obtain exclusive access to the given volume. In this way, it can
ensure the volume will not be modified except by the given application. To grant this request there must be
no open file handles against the volume. Thus, for instance, volumes containing paging files cannot be
locked because the paging file is always open. Similarly the volume containing the registry cannot be
locked.

A file system need not wait to determine if all the files will be closed, nor does it need to attempt to force
any of them to close. A file merely being open is sufficient reason to deny the request. However, for a file
system tightly integrated with the VM system it is straightforward to walk the internal table of open files,
purge the cache for each one and then determine if the volume can be locked, since that will ensure there
are no VM references to the files. Typical NT file systems do not do this, however.
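The basic lock test reduces to a check of the open handle count. The sketch below uses a hypothetical volume structure and lock_volume function; it models only the "no open handles" rule stated above and makes no attempt at the optional cache purge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned open_handles;  /* files currently open on the volume */
    bool locked;            /* true once exclusive access is held */
} volume;

/* Returns true if the lock is granted. No waiting, no forced closes. */
bool lock_volume(volume *v)
{
    assert(v != NULL);
    if (v->locked || v->open_handles != 0) {
        return false;
    }
    v->locked = true;
    return true;
}
```

A volume holding a paging file always has at least one open handle in this model, so the request is always denied for it.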

2.2.12.4.2.2 UNLOCK_VOLUME

This operation is the inverse of LOCK_VOLUME and releases a previously locked volume.

2.2.12.4.2.3 DISMOUNT_VOLUME

This operation dismounts the volume. Internally, the requirements for dismounting the volume are
identical to those required for locking the volume (See Section 2.2.12.4.2.1.) Thus, the volume must be
quiescent. Upon completion of this call the volume itself is no longer mounted.

Note that any access to the volume at this point will cause it to be remounted. Typically this operation is
used only by utilities.

2.2.12.4.2.4 MARK_VOLUME_DIRTY

This entry point may be used to mark the volume dirty. Typically this is done to ensure that the next time
the system boots the volume will be checked. Thus, this can be used by a utility which is going to perform
extended operations (such as chkdsk) which might be interrupted. In that case, the utility marks the volume
dirty and then begins its operations.

2.2.12.4.3 Compression

These operations control the compression state of the given file or directory. Since compression itself is
not required, these operations need not be supported by the file system if compression is, itself, not
supported.

More information about how these operations can be used from Win32 programs is included in the Win32
Programmer’s Reference documentation.

2.2.12.4.3.1 GET_COMPRESSION

This operation queries the compression state of the given file. The return value is a USHORT which
indicates that compression is enabled and what type of compression is enabled (see note 6).

2.2.12.4.3.2 SET_COMPRESSION

This operation sets the compression state of the given file. The input value is a USHORT which indicates
the type of compression to be used for this file. A value of zero indicates that compression is to be disabled
for this file.
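As an illustration of how a caller might interpret the USHORT compression state, consider the following sketch. The SKETCH_ constants are defined locally; we believe they match the COMPRESSION_FORMAT_NONE, COMPRESSION_FORMAT_DEFAULT and COMPRESSION_FORMAT_LZNT1 values in the NT headers, but treat them, and both helper functions, as assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Locally defined; assumed to mirror the NT COMPRESSION_FORMAT_* values. */
#define SKETCH_COMPRESSION_FORMAT_NONE    0x0000
#define SKETCH_COMPRESSION_FORMAT_DEFAULT 0x0001
#define SKETCH_COMPRESSION_FORMAT_LZNT1   0x0002

/* A SET_COMPRESSION input of zero asks for compression to be disabled. */
bool set_request_disables_compression(uint16_t requested_state)
{
    return requested_state == SKETCH_COMPRESSION_FORMAT_NONE;
}

/* Turn a compression state value into a short description. */
const char *describe_compression_state(uint16_t state)
{
    switch (state) {
    case SKETCH_COMPRESSION_FORMAT_NONE:    return "not compressed";
    case SKETCH_COMPRESSION_FORMAT_DEFAULT: return "compressed (default format)";
    case SKETCH_COMPRESSION_FORMAT_LZNT1:   return "compressed (LZNT1)";
    default:                                return "compressed (unknown format)";
    }
}
```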

2.2.12.4.3.3 READ_COMPRESSION

This operation is not used in NT 3.51.

2.2.12.4.3.4 WRITE_COMPRESSION

This operation is not used in NT 3.51.

2.2.12.4.4 Others

This section describes other valid user requests against the given file.

Note 6: The value returned will be plus one if compression was enabled.

2.2.12.4.4.1 IS_PATHNAME_VALID

This call may be used by a caller to determine if the representative path name is in fact valid. The FSD
examines the path name to determine if the form of the name is valid – not if the object actually exists.
Thus, if the name is syntactically correct, this routine should return success. Otherwise, it should return
STATUS_OBJECT_NAME_INVALID.
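A purely syntactic check of this kind might look like the following. The rules used here (leading backslash, no empty components, a small set of forbidden characters, a 255 character component limit) are assumptions chosen for illustration; each FSD defines its own naming rules, and is_pathname_syntactically_valid is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_COMPONENT 255  /* assumed per-component length limit */

/*
 * Returns true when the name is well formed. The check never touches the
 * disk: a well formed name for a file that does not exist is still valid.
 */
bool is_pathname_syntactically_valid(const char *path)
{
    size_t component_len = 0;
    const char *p;

    if (path == NULL || path[0] != '\\') {
        return false;                      /* must be rooted */
    }
    for (p = path + 1; *p != '\0'; ++p) {
        if (*p == '\\') {
            if (component_len == 0) {
                return false;              /* empty component, e.g. "\a\\b" */
            }
            component_len = 0;
        } else {
            if (strchr("\"*<>?|:/", *p) != NULL) {
                return false;              /* assumed illegal characters */
            }
            if (++component_len > SKETCH_MAX_COMPONENT) {
                return false;
            }
        }
    }
    return true;
}
```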

2.2.12.4.4.2 QUERY_RETRIEVAL_POINTERS

This entry point is only usable by kernel mode drivers in order to obtain a physical mapping of locations
within the paging file on the disk. This is returned as a series of value pairs indicating the number of
sectors covered and the logical block offset on the drive.
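To show how such a series of value pairs can be consumed, the sketch below walks a run list to translate a file-relative sector into a location on the drive. The retrieval_run layout and map_file_sector are hypothetical and simplified (one logical block per sector is assumed); the actual buffer layout returned by the FSD is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One value pair: how many sectors the run covers and where it starts. */
typedef struct {
    uint32_t sector_count;
    uint64_t logical_block;
} retrieval_run;

/*
 * Map a file-relative sector number to a logical block on the drive.
 * Returns false if the sector lies beyond the last run.
 */
bool map_file_sector(const retrieval_run *runs, size_t run_count,
                     uint64_t file_sector, uint64_t *disk_block)
{
    uint64_t run_start = 0;  /* file-relative sector where the current run begins */
    size_t i;

    assert(runs != NULL && disk_block != NULL);

    for (i = 0; i < run_count; ++i) {
        if (file_sector < run_start + runs[i].sector_count) {
            *disk_block = runs[i].logical_block + (file_sector - run_start);
            return true;
        }
        run_start += runs[i].sector_count;
    }
    return false;
}
```

A driver that performs its own I/O to the paging file would make this translation once per request and then issue the transfer directly to the disk.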

2.2.12.4.4.3 MARK_AS_SYSTEM_HIVE

This entry point is used by the system to indicate that the specified file is a registry file. This operation
may be used to manage the behavior of this file in some “special” manner. For example, NTFS might use
this attribute to actually journal the updates to this file in a manner that ensures transactional correctness in
the case of a system crash.

2.2.12.5 Sample Code


The following code demonstrates calling an FSD via the NtFsControlFile API. This mechanism allows
sending a broad range of file system controls – which is not possible using the Win32 DeviceIoControl call.
This example is intended to provide a basic skeleton from which the reader can work and is derived from a
simple test program to probe the IOCTL and FSCTL interfaces.

//
// This utility demonstrates opening and sending an IOCTL to a file system.
//
// It uses the Native NT API:
// NtCreateFile - to allow opening the FSD device object (which might not be
// in \DosDevices, for instance.)
// NtDeviceIoControlFile - used to send IOCTLs to the FSD
// NtFsControlFile - used to send FSCTL calls to the FSD
//
// This skeleton file demonstrates HOW to connect to your FSD. The rest is left
// as an exercise for the reader.
//

#include <ntddk.h>
#include <ntdddisk.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <memory.h>

//
// The NATIVE NT API definitions.
//

NTSYSAPI
NTSTATUS
NTAPI
NtCreateEvent (
OUT PHANDLE EventHandle,
IN ACCESS_MASK DesiredAccess,
IN POBJECT_ATTRIBUTES ObjectAttributes OPTIONAL,

IN EVENT_TYPE EventType,
IN BOOLEAN InitialState
);

NTSYSAPI
NTSTATUS
NTAPI
NtWaitForSingleObject(
IN HANDLE Handle,
IN BOOLEAN Alertable,
IN PLARGE_INTEGER Timeout OPTIONAL
);

NTSYSAPI
NTSTATUS
NTAPI
NtCreateFile(
OUT PHANDLE FileHandle,
IN ACCESS_MASK DesiredAccess,
IN POBJECT_ATTRIBUTES ObjectAttributes,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN PLARGE_INTEGER AllocationSize OPTIONAL,
IN ULONG FileAttributes,
IN ULONG ShareAccess,
IN ULONG CreateDisposition,
IN ULONG CreateOptions,
IN PVOID EaBuffer OPTIONAL,
IN ULONG EaLength
);

NTSYSAPI
NTSTATUS
NTAPI
NtDeviceIoControlFile(
IN HANDLE FileHandle,
IN HANDLE Event OPTIONAL,
IN PIO_APC_ROUTINE ApcRoutine OPTIONAL,
IN PVOID ApcContext OPTIONAL,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN ULONG IoControlCode,
IN PVOID InputBuffer OPTIONAL,
IN ULONG InputBufferLength,
OUT PVOID OutputBuffer OPTIONAL,
IN ULONG OutputBufferLength
);

NTSYSAPI
NTSTATUS
NTAPI
NtFsControlFile(
IN HANDLE FileHandle,
IN HANDLE Event OPTIONAL,
IN PIO_APC_ROUTINE ApcRoutine OPTIONAL,
IN PVOID ApcContext OPTIONAL,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN ULONG IoControlCode,
IN PVOID InputBuffer OPTIONAL,
IN ULONG InputBufferLength,
OUT PVOID OutputBuffer OPTIONAL,

IN ULONG OutputBufferLength
);

NTSYSAPI
NTSTATUS
NTAPI
NtClose(
IN HANDLE Handle
);

//
// The following is a name for the Device directory.
//

CHAR *DeviceDirName = "\\Device\\";

//
// A single global event handle
//

HANDLE TestEvent;

#define TestDeviceControl(h, i, in, out, sz) TestControl(h, TRUE, i, in, out, sz)
#define TestFsControl(h, i, in, out, sz) TestControl(h, FALSE, i, in, out, sz)

VOID TestControl(HANDLE Handle, BOOLEAN DeviceCtrl, ULONG Ioctl, PVOID Input,
                 PVOID Output, ULONG Size)
{
    NTSTATUS code;
    IO_STATUS_BLOCK iosb;

    //
    // First, set the input buffer to be one value, and the output buffer
    // to be a different value. Ignore any exceptions, etc. Caller might
    // want to test invalid addresses.
    //

    try {

        memset(Input, 1, Size);

    } except (EXCEPTION_EXECUTE_HANDLER) {

        // NOTHING
    }

    try {

        memset(Output, 0, Size);

    } except (EXCEPTION_EXECUTE_HANDLER) {

        // NOTHING
    }

    //
    // Issue the request as either a device control or an FS control.
    //

    try {

        if (DeviceCtrl) {

            code = NtDeviceIoControlFile(Handle,
                                         TestEvent,
                                         0, // no apc
                                         0, // no apc context
                                         &iosb,
                                         Ioctl,
                                         Input,
                                         Size, // input size
                                         Output,
                                         Size); // output size -> same as input size!
        } else {

            code = NtFsControlFile(Handle,
                                   TestEvent,
                                   0, // no apc
                                   0, // no apc context
                                   &iosb,
                                   Ioctl,
                                   Input,
                                   Size, // input size
                                   Output,
                                   Size); // output size -> same as input size!
        }

    } except (EXCEPTION_EXECUTE_HANDLER) {

        printf ("Exception (0x%x)\n", GetExceptionCode());

        return;
    }

    if (code == STATUS_PENDING) {

        //
        // Must wait for the I/O to complete. We are alertable
        //

        code = NtWaitForSingleObject(TestEvent, TRUE, 0);

        //
        // If the wait itself succeeded, the final status of the I/O
        // is in the I/O status block.
        //

        if (code == STATUS_SUCCESS) {

            code = iosb.Status;
        }
    }

    //
    // First, did we get back a zero return code? If so, the operation
    // should have worked. If not, something is wrong.
    //

    if (code) {

        printf("Failure (0x%x)\n", code);

        return;
    }

    //
    // Now, check to make sure we got back enough data.
    //

    if (iosb.Information != Size) {

        printf("Information wrong (%d)\n", iosb.Information);

        return;
    }

    //
    // Compare the data we did get back. Should be the same.
    //

    if (Input && Output) {

        if (memcmp(Input, Output, Size)) {

            //
            // Wrong!
            //

            printf("Input and output mismatch\n");

            return;
        }
    }

    return;
}

//
// The actual main function.
//

int __cdecl main(ULONG argc, LPSTR *argv)
{
HANDLE handle;
NTSTATUS code;
OBJECT_ATTRIBUTES fsdAttributes;
IO_STATUS_BLOCK iosb;
ANSI_STRING fsdAnsiName;
UNICODE_STRING fsdUnicodeName;
ULONG ioctlCode;
ULONG inputBuffer[64]; // must be same as output buffer size.
ULONG outputBuffer[64];

//
// The usage pattern for this utility is:
// > fsctl [drivername]
//
// For your example you might want to change this so it is tied
// specifically to YOUR driver.
//

if (argc != 2) {

printf("Usage: fsctl [DriverName]\n");

exit(1);
}

//
// There are really only two places to look:
// "\[DriverName]"
// and
// "\Device\[DriverName]"
//
// The first instance is where physical file system device objects
// are created, the second is where network file system device
// objects are created.
//
// Why? Because this is consistent with existing naming structure
// on Windows NT.
//

//
// Figure out how much space we need for the longer of the two:
// - Size of the prefix
// - Size of the FSD name
// - Null terminator
//
// UNICODE version is * sizeof(WCHAR)
//

fsdAnsiName.MaximumLength = strlen(DeviceDirName) + strlen(argv[1]) + sizeof(CHAR);

fsdUnicodeName.MaximumLength = sizeof(WCHAR) * fsdAnsiName.MaximumLength;

fsdAnsiName.Buffer = malloc(fsdAnsiName.MaximumLength);

fsdUnicodeName.Buffer = malloc(fsdUnicodeName.MaximumLength);

if (!fsdAnsiName.Buffer || !fsdUnicodeName.Buffer) {

    printf("malloc failure\n");

    exit(2);
}

//
// Zero out the newly allocated buffers
//

memset(fsdAnsiName.Buffer, 0, fsdAnsiName.MaximumLength);

memset(fsdUnicodeName.Buffer, 0, fsdUnicodeName.MaximumLength);

//
// Format the ANSI string.
//

sprintf(fsdAnsiName.Buffer, "%s%s", DeviceDirName, argv[1]);

//
// Set the final size.
//

fsdAnsiName.Length = strlen(fsdAnsiName.Buffer);

fsdUnicodeName.Length = 0;

//
// Convert to UNICODE.
//

code = RtlAnsiStringToUnicodeString(&fsdUnicodeName, &fsdAnsiName, FALSE);

if (code) {

    //
    // Conversion error
    //

    printf("RtlAnsiStringToUnicodeString failed (0x%x)\n", code);
}

//
// Now, build an object attributes structure. Note that the name is
// case insensitive (could be case sensitive if this is important for

// YOUR FSD.)
//

InitializeObjectAttributes(&fsdAttributes,
&fsdUnicodeName,
OBJ_CASE_INSENSITIVE,
0,
0);

printf("#1: %S\n", fsdUnicodeName.Buffer);

//
// Call NATIVE NT function to open the file. This avoids the need to
// have "\DosDevices" prepended to the name of the thing being opened.
//

code = NtCreateFile(&handle,
GENERIC_READ,
&fsdAttributes,
&iosb,
0,
FILE_ATTRIBUTE_NORMAL,
FILE_SHARE_WRITE,
FILE_OPEN,
0,
0,
0);

if (code) {

//
// Perhaps the name was just wrong. We'll try the second alternative.
// First, zero out the buffers.
//

memset(fsdAnsiName.Buffer, 0, fsdAnsiName.MaximumLength);

memset(fsdUnicodeName.Buffer, 0, fsdUnicodeName.MaximumLength);

//
// Format the ANSI string.
//

sprintf(fsdAnsiName.Buffer, "\\%s", argv[1]);

//
// Set the final size.
//

fsdAnsiName.Length = strlen(fsdAnsiName.Buffer);

fsdUnicodeName.Length = 0;

//
// Convert to UNICODE.
//

code = RtlAnsiStringToUnicodeString(&fsdUnicodeName, &fsdAnsiName, FALSE);

if (code) {

    //
    // Conversion error
    //

    printf("RtlAnsiStringToUnicodeString failed (0x%x)\n", code);
}

//
// Rebuild the object attributes for the second name.
//

InitializeObjectAttributes(&fsdAttributes,
&fsdUnicodeName,
OBJ_CASE_INSENSITIVE,
0,
0);

printf("#2: %S\n", fsdUnicodeName.Buffer);

//
// Try again
//

code = NtCreateFile(&handle,
GENERIC_READ,
&fsdAttributes,
&iosb,
0,
FILE_ATTRIBUTE_NORMAL,
FILE_SHARE_WRITE,
FILE_OPEN,
0,
0,
0);

//
// If it failed again, we're going to bail out.
//

if (code) {

    fprintf(stderr, "Cannot open FSD (0x%x)\n", code);

    exit(3);
}

} // end of the second attempt

//
// Now, create an event for use by the test routines.

//

code = NtCreateEvent(&TestEvent,
GENERIC_ALL,
0, // no object attributes
NotificationEvent,
FALSE);

if (code) {
    //
    // Bummer.
    //

    fprintf(stderr, "Cannot create event (0x%x)\n", code);

    exit (93);
}

//
// Begin testing.
//

printf("Begin testing\n");

//
// Phase 1 testing: all properly formatted requests.
//

printf("Phase 1 testing: proper requests\n");

//
// If we make it to this point, the device itself is opened. We send
// an I/O control code.
//

printf("Device test #1\n");

TestDeviceControl(handle, IOCTL_DISK_GET_DRIVE_GEOMETRY, inputBuffer, outputBuffer,
sizeof(inputBuffer));

//
// Now do the FS controls.
//

printf("FS test #1\n");


#define FSCTL_IS_VOLUME_MOUNTED CTL_CODE(FILE_DEVICE_FILE_SYSTEM, 10, \
METHOD_BUFFERED, FILE_ANY_ACCESS)

TestFsControl(handle, FSCTL_IS_VOLUME_MOUNTED, inputBuffer, outputBuffer,
sizeof(inputBuffer));

//
// Done testing
//

printf("End testing\n");

//
// At this stage we are done. We need to do cleanup and then finish.
//

free(fsdAnsiName.Buffer);

free(fsdUnicodeName.Buffer);

fsdAnsiName.Buffer = 0;

fsdUnicodeName.Buffer = 0;

if (NtClose(handle)) {

printf("Close failed!\n");

}

//
// Done!
//

printf("Success\n");

return (0);
}

2.2.13 Sample Rename Code


The following code demonstrates a “relative” rename (that is, a rename of a file relative to an open
directory handle).
//
// This sample program demonstrates performing a relative rename operation.
//
// It uses the Native NT API:
// NtCreateFile - allows opening the file object directly.
// NtClose - close the object
// NtSetInformationFile - perform the rename operation.
//

#include <ntddk.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <memory.h>
#include <wrapper-ioctl.h>

//
// The NATIVE NT API definitions.
//

NTSYSAPI
NTSTATUS
NTAPI
NtCreateFile(
OUT PHANDLE FileHandle,

IN ACCESS_MASK DesiredAccess,
IN POBJECT_ATTRIBUTES ObjectAttributes,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN PLARGE_INTEGER AllocationSize OPTIONAL,
IN ULONG FileAttributes,
IN ULONG ShareAccess,
IN ULONG CreateDisposition,
IN ULONG CreateOptions,
IN PVOID EaBuffer OPTIONAL,
IN ULONG EaLength
);

NTSYSAPI
NTSTATUS
NTAPI
NtSetInformationFile(
IN HANDLE FileHandle,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN PVOID FileInformation,
IN ULONG Length,
IN FILE_INFORMATION_CLASS FileInformationClass
);

typedef struct _FILE_RENAME_INFORMATION {


BOOLEAN ReplaceIfExists;
HANDLE RootDirectory;
ULONG FileNameLength;
WCHAR FileName[1];
} FILE_RENAME_INFORMATION, *PFILE_RENAME_INFORMATION;

int __cdecl main(ULONG argc, LPSTR *argv)
{

UNICODE_STRING directoryName;
UNICODE_STRING subDir1Name, subDir2Name;
UNICODE_STRING fileName;
OBJECT_ATTRIBUTES directoryAttributes;
OBJECT_ATTRIBUTES subDir1Attributes, subDir2Attributes;
OBJECT_ATTRIBUTES fileAttributes;
NTSTATUS status;
HANDLE directoryHandle;
HANDLE subDir1Handle, subDir2Handle;
HANDLE fileHandle;
IO_STATUS_BLOCK iosb;
UCHAR buffer[128];
PFILE_RENAME_INFORMATION renameInfo;

//
// This program creates a couple of objects and then does a relative rename using
// those objects. It isn't general, but it demonstrates the essence of the
// relative rename.
//

RtlInitUnicodeString(&directoryName, L"\\??\\C:\\RRTest");

RtlInitUnicodeString(&subDir1Name, L"Subdir1");

RtlInitUnicodeString(&subDir2Name, L"Subdir2");

RtlInitUnicodeString(&fileName, L"Foo"); // classically named file

//
// Open the test directory
//

InitializeObjectAttributes(&directoryAttributes,
&directoryName,
OBJ_CASE_INSENSITIVE,
0,
0);

status = NtCreateFile(&directoryHandle,
GENERIC_READ|GENERIC_WRITE,
&directoryAttributes,
&iosb,
0, // no allocation
FILE_ATTRIBUTE_DIRECTORY, // we're creating a directory here!
FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
FILE_CREATE,
FILE_DIRECTORY_FILE,
0, // no ea
0); // no ea

if (status) {
printf("NtCreateFile (directory) failed (0x%x)\n", status);
exit(1);
}

//
// Open the subdirectories
//

InitializeObjectAttributes(&subDir1Attributes,
&subDir1Name,
OBJ_CASE_INSENSITIVE,
directoryHandle,
0); // no security descriptor

status = NtCreateFile(&subDir1Handle,
GENERIC_READ|GENERIC_WRITE|FILE_DELETE_CHILD,
&subDir1Attributes,
&iosb,
0,
FILE_ATTRIBUTE_DIRECTORY,
FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
FILE_CREATE,
FILE_DIRECTORY_FILE,
0,
0);

if (status) {
printf("NtCreateFile (subdir 1) failed (0x%x)\n", status);
exit(1);
}

InitializeObjectAttributes(&subDir2Attributes,
&subDir2Name,
OBJ_CASE_INSENSITIVE,
directoryHandle,
0); // no security descriptor

status = NtCreateFile(&subDir2Handle,
GENERIC_READ|GENERIC_WRITE,
&subDir2Attributes,
&iosb,
0,
FILE_ATTRIBUTE_DIRECTORY,
FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
FILE_CREATE,
FILE_DIRECTORY_FILE,
0,
0);

if (status) {
printf("NtCreateFile (subdir 2) failed (0x%x)\n", status);
exit(1);
}

//
// Now create a dummy file in subdir 1
//

InitializeObjectAttributes(&fileAttributes,
&fileName,
OBJ_CASE_INSENSITIVE,
subDir1Handle,
0);

status = NtCreateFile(&fileHandle,
FILE_ALL_ACCESS,
&fileAttributes,
&iosb,
0,
FILE_ATTRIBUTE_NORMAL,
0, // no sharing
FILE_CREATE,
FILE_NON_DIRECTORY_FILE,
0,
0);

if (status) {
printf("NtCreateFile (file) failed (0x%x)\n", status);
exit(1);
}

//
// OK. Now rename the file into subdir 2. This is all done relative to the open
// handle against subdir 2.
//

memset(buffer, 0, sizeof(buffer));

renameInfo = (PFILE_RENAME_INFORMATION) buffer;

renameInfo->ReplaceIfExists = FALSE;
renameInfo->RootDirectory = subDir2Handle;
renameInfo->FileNameLength = fileName.Length;
RtlCopyMemory(&renameInfo->FileName[0],
fileName.Buffer,
fileName.Length);

status = NtSetInformationFile(fileHandle,
&iosb,
renameInfo,
sizeof(buffer),
FileRenameInformation);

if (status) {
printf("NtSetInformationFile failed (0x%x)\n", status);
}

//
// We should clean up here, but we'll let the runtime handle it.
//

return (0);
}

2.2.14 Sample IRP Code


The following code is from the OSR article “Roll Your Own IRPs”. It is intended to serve as a skeleton
from which you can build your own routines, but be warned that building IRPs and manipulating MDLs is
a perilous task, and you should be prepared to debug many blue screens along the way.
2.2.14.1 Asynchronous Example (async.c)
//
// (C) Copyright 1997 OSR Open Systems Resources, Inc.
// All Rights Reserved
//

#include <ntddk.h>

//
// Forward declaration
//

NTSTATUS AsyncCompletion(PDEVICE_OBJECT, PIRP, PVOID);

//
// This demonstrates how to build an asynchronous I/O request. Note
// that the caller (in this case) does NOT need to process the request.
//

BOOLEAN SendAsyncIrp(PDEVICE_OBJECT DeviceObject,
ULONG MajorFunction, // read, write, flush, etc.
PFILE_OBJECT FileObject,
PVOID Buffer,
ULONG BufferLength,
ULONG Offset,
PIO_STATUS_BLOCK IoStatusBlock)

{
PIRP Irp;
LARGE_INTEGER StartingOffset;

//
// The starting offset is a quad. For this example we restrict it to a ULONG
// so we do the conversion here. For YOUR work, you can use a quad if that's
// what's needed.
//

StartingOffset.QuadPart = (LONGLONG) Offset;

//
// We're going to build the I/O request. Note that this is an ASYNCHRONOUS
// request, so we don't provide it with an Event to set when the I/O itself
// is done.
//

Irp = IoBuildAsynchronousFsdRequest(MajorFunction,
DeviceObject,
Buffer,
BufferLength,
&StartingOffset,
IoStatusBlock);

if (!Irp) {

//
// Creation of the IRP failed. This may happen when there is insufficient memory
// to create the IRP (non-paged pool exhaustion.)
//

return FALSE;

}

//
// If you are going to call your own driver with this IRP, you should advance
// the I/O stack location (c.f. floppy.c for an example of this) using the
// IoSetNextIrpStackLocation() call.
//

// IoSetNextIrpStackLocation(Irp);

//
// Set a completion routine for processing this IRP after it has completed. This
// will allow us to free the IRP and avoid completion processing.
//

IoSetCompletionRoutine(Irp, AsyncCompletion, NULL, TRUE, TRUE, TRUE);

//
// Pass along to the underlying device
//

IoCallDriver(DeviceObject, Irp);

//
// Done!
//

return TRUE;
}

//
// AsyncCompletion
//

NTSTATUS AsyncCompletion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)


{

//
// For the asynchronous example, cleanup of the MDLs attached to the IRP
// is done here in the completion routine. This allows us to free the
// IRP below.
//

if (Irp->MdlAddress) {

MmUnmapLockedPages(MmGetSystemAddressForMdl(Irp->MdlAddress), Irp->MdlAddress);

MmUnlockPages(Irp->MdlAddress);

IoFreeMdl(Irp->MdlAddress);

}

IoFreeIrp(Irp);

//
// This advises the I/O Manager to STOP PROCESSING this request - that the
// DRIVER is going to perform (and may, in fact, have already performed)
// additional processing on the I/O request.
//
// In fact, in this case, note that we've already freed the IRP - it might be
// a VERY BAD idea for the I/O Manager to do ANYTHING with that IRP at this
// point (since it isn't even an IRP anymore...)
//

return STATUS_MORE_PROCESSING_REQUIRED;
}

2.2.14.2 Roll An Irp (roll.c)


//
// (C) Copyright 1997 OSR Open Systems Resources, Inc.
// All Rights Reserved
//

#include "roll.h"

//
// This demonstrates how to build your own IRPs completely from scratch, using
// your own memory, etc.
//
// Please keep in mind as you read this example that it was written with the intent
// to demonstrate a specific technique. Thus, there are numerous ways this could
// be implemented, some of them far more brief. Feel free to modify this code to use
// in your OWN programs...
//

//
// Data structures local to this module.
//

LIST_ENTRY MyIrpFreeList; // contains freed IRPs, not currently in use
KSPIN_LOCK MyIrpFreeListLock; // protection for the list
USHORT MyIrpFreeListMax; // maximum # of IRPs to create for my free list
USHORT MyIrpFreeListCount; // current # of IRPs created for my free list
KEVENT MyIrpFreeListEvent; // allows caller to wait for an IRP
USHORT MyIrpSize; // # of bytes in the IRPs we allocate
CCHAR MyNumberOfIrpStackLocations; // # of IRP stack locations allocated in each IRP

//
// NOTE: In NT 4.0 you can use "look aside" lists to track free IRPs (they
// are really "free lists".) These examples were written for NT 3.51 or
// 4.0 so you should modify the code to fit your specific requirements.
//

//
// For our purposes we hard-code the size of the IRPs we'll be creating. We
// use five because that's safe, even with a single filter driver in the stack.
//

#define MAXIMUM_IRP_STACK_LOCATIONS (5)

//
// We'll limit the # of IRPs we create in this package to an arbitrary number. Again,
// this can be changed to fit your requirements - or don't limit it at all!
//

#define MAXIMUM_IRP_COUNT (64)

//
// AllocateNewIrp
//
// This internal helper routine is used to allocate a new IRP from non-paged pool.
//
// Inputs:
// None.
//
// Outputs:
// None.
//
// Returns:
// A pointer to the newly created IRP.
//
// Notes:
// None.

static PIRP AllocateNewIrp(VOID)
{
PIRP newIrp;

//
// First, this routine fails whenever we've hit the maximum number.
//

if (MyIrpFreeListCount >= MyIrpFreeListMax) {

//
// Return null - refuse this request.
//

return (PIRP) 0;

}

//
// We use a special tag value to allow us to detect the IRPs we are creating (using
// the kernel debugger or "poolmon").
//

newIrp = ExAllocatePoolWithTag(NonPagedPool, MyIrpSize, 'rIyM');

if (!newIrp) {

//
// The creation of a new IRP failed. The only time this really happens is when
// there is no more non-paged pool.
//
// We choose the simpler solution here - we just return a null pointer.
// If your driver requires it, you can allocate from NonPagedPoolMustSucceed - but
// remember that the amount of such memory is VERY limited and if it cannot be
// granted the system crashes ("blue screen"). Use sparingly.
//

return 0;
}

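//
// At this point, the IRP is allocated. Count it against the maximum and
// return it to the caller. (The increment is not interlocked; add locking
// if allocations can race in your driver.)
//

MyIrpFreeListCount++;

return newIrp;
}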

//
// MyInitIrpPackage
//
// This routine is called to initialize the IRP management package.
//
// Inputs:
// MaximumNumberOfIrpStackLocations - the # of IRP stack locations to allocate for all
// IRPs in the package
// MaximumNumberOfIrps - the maximum # of IRPs to be allocated (on demand) by this
// package
//
// Outputs:
// None.
//
// Returns:
// None.
//
// Notes:
// This package uses defaults if the passed-in values are zero.
//

VOID MyInitIrpPackage(CCHAR MaximumNumberOfIrpStackLocations, USHORT MaximumNumberOfIrps)
{

if (MaximumNumberOfIrpStackLocations == 0) {

//
// We'll use the "default" number in this case
//

MyNumberOfIrpStackLocations = MAXIMUM_IRP_STACK_LOCATIONS;

} else {

//
// We'll use the number passed in by the caller
//

MyNumberOfIrpStackLocations = MaximumNumberOfIrpStackLocations;

}

if (MaximumNumberOfIrps == 0) {

//
// We'll fall back on the backup value.
// Note: this could be turned into a registry lookup for your particular driver.
//

MyIrpFreeListMax = MAXIMUM_IRP_COUNT;

} else {

//
// We'll use the number given us by the caller
//

MyIrpFreeListMax = MaximumNumberOfIrps;

}

//
// Now, compute the size of an IRP with the requisite # of stack
// locations.
//

MyIrpSize = IoSizeOfIrp(MyNumberOfIrpStackLocations);

//
// Initialize our free list, spin lock, and free list event...
//

InitializeListHead(&MyIrpFreeList);

KeInitializeSpinLock(&MyIrpFreeListLock);

MyIrpFreeListCount = 0;

KeInitializeEvent(&MyIrpFreeListEvent, SynchronizationEvent, FALSE);

//
// Now, let's start the list off with the first IRP. We could defer this to first
// reference but we decided to add this code early because it ensures we exercise
// most of the logic we use in this package.
//

MyFreeIrp(AllocateNewIrp());

//
// We believe there will be something on the list. We assert this at this point.
//

ASSERT(!IsListEmpty(&MyIrpFreeList));

//
// Done with the initialization. Of course, feel free to add more
// code here for YOUR specific project!
//

return;
}

//
// MyAllocateIrp
//
// This routine is called by your driver to allocate a new I/O request packet.
//
// Inputs:
// StackSize - this is the # of stack locations required. It must be <= the # of
// stack locations this package was initialized with
// Wait - indicates if the caller is willing to wait for the IRP allocation
//
// Outputs:
// None.
//
// Returns:
// The new IRP if successful. Otherwise, null
//
// Notes:
// Any caller already "owning" an IRP created by this package MUST indicate
// Wait = FALSE to avoid deadlock conditions. Callers of this routine which
// indicate Wait == TRUE must be dispatchable (IRQL < DISPATCH_LEVEL).
//

PIRP MyAllocateIrp(CCHAR StackSize, BOOLEAN Wait)


{
KIRQL oldIrql;
PIRP newIrp = 0;

//
// The caller should never ask for an IRP with more stack locations than those we
// have created. If they do, there is something wrong - we'll assert out at this point
// so we can catch this problem.
//
// (If you ignore this failure here, later you will see a blue screen with
// NO_MORE_IRP_STACK_LOCATIONS as the stop code - eventually.)
//

ASSERT(StackSize <= MyNumberOfIrpStackLocations);

while (1) {

KeAcquireSpinLock(&MyIrpFreeListLock, &oldIrql);

//
// We must check for an empty list before calling RemoveHeadList, because it does
// not make that check itself.
//

if (!IsListEmpty(&MyIrpFreeList)) {
PLIST_ENTRY listEntry;

//
// Remove the head of the list.
//

listEntry = RemoveHeadList(&MyIrpFreeList);

//
// Convert into an IRP pointer.
//

newIrp = CONTAINING_RECORD(listEntry, IRP, Tail.Overlay.ListEntry);

}

KeReleaseSpinLock(&MyIrpFreeListLock, oldIrql);

if (newIrp) {

//
// Jump out of the loop.
//

break;

}

//
// We'll try to create a new entry.
//

newIrp = AllocateNewIrp();

if (!newIrp) {

//
// The allocation was unsuccessful. If the caller indicated they could wait
// then we'll put them to sleep waiting for the synchronization event
//
// NOTE: We assume here we seeded the list, so there is AT LEAST ONE IRP already
// created for the pool. If there were zero, we might never wake up, because
// we're going to count on some other thread releasing an IRP.
//

if (Wait) {

ASSERT(KeGetCurrentIrql() < DISPATCH_LEVEL);

KeWaitForSingleObject(&MyIrpFreeListEvent,
Spare6, // reason - we use this to ease debugging.
KernelMode, // mode
FALSE, // not alertable
0); // no timeout
} else {
//
// The caller indicated they could not wait. We'll break from the loop and
// handle this condition below.
//

break;

}

} // if (!newIrp)

} // while (1)

//
// At this point we're done with the processing. If this is null, it is because the
// caller was unwilling to wait for allocation.
//

return newIrp;
}

//
// MyFreeIrp
//
// This routine is called to free up an IRP previously allocated by a call to
// MyAllocateIrp
//
// Inputs:
// Irp - the IRP to be freed
//
// Outputs:
// None.
//
// Returns:
// None.
//
// Notes:
// NEVER call this package with I/O manager allocated IRPs - doing so may cause
// catastrophic system failure (since we're going to reinitialize the IRPs and if
// they aren't of the correct size we could trash things...)
//

VOID MyFreeIrp(PIRP Irp)


{
//
// If a null pointer is passed we just return. Not much we can do with
// a null pointer.
//

if (!Irp) {

return;

}

//
// Before we re-queue the IRP, we will initialize it. It will then be ready to hand
// off to the new caller when they allocate it.
//
// NOTE: This is the ONLY time you would call IoInitializeIrp - with your own
// allocated memory. Do NOT call it after calling IoAllocateIrp as that will change
// flags within the IRP which can lead to a system crash (memory arena corruption.)
//

IoInitializeIrp(Irp,
MyIrpSize,
MyNumberOfIrpStackLocations);

//
// Stick it on the free list
//

ExInterlockedInsertTailList(&MyIrpFreeList,
&Irp->Tail.Overlay.ListEntry,
&MyIrpFreeListLock);

//
// And set the event - this awakens anyone who might be waiting for an IRP.
//

KeSetEvent(&MyIrpFreeListEvent, 1, FALSE);

return;
}

2.2.14.3 Synchronous Example (sync.c)


//
// (C) Copyright 1997 OSR Open Systems Resources, Inc.
// All Rights Reserved
//

#include <ntddk.h>

//
// SendSyncIrp
//
// This is a demonstration routine on how to build your own IRP
// using the IoBuildSynchronousFsdRequest.
//
// Inputs:
// DeviceObject - the device to call when passing the IRP
// MajorFunction - IRP_MJ_* where * is READ, WRITE, FLUSH_BUFFERS or SHUTDOWN
// FileObject - the file object to use when calling the underlying driver.
// This is only useful for the case where you are working
// with a file system. Can be NULL
// Buffer - pointer to the buffer where the data is located. Can be NULL
// BufferLength - zero if Buffer is null, otherwise the # of bytes to
// be transferred as part of this operation.
// Offset - the file/device offset for this operation.
// Event - the event to be signaled when the I/O is done.
//
// Outputs:
// IoStatusBlock - the results of the I/O operation.
//
// Returns:
// TRUE if the IRP was sent
// FALSE otherwise.
//
// Notes:
// The restriction of using READ, WRITE, FLUSH, or SHUTDOWN is one imposed
// by this routine.
//
// The I/O Manager adds the IRP created to the thread's outstanding I/O list
// so you MUST complete this via the I/O Manager.
//
// The I/O Manager will do any IRP cleanup processing necessary - no need for
// your driver to do it.
//

BOOLEAN SendSyncIrp(PDEVICE_OBJECT DeviceObject,
ULONG MajorFunction, // read, write, flush, etc.
PFILE_OBJECT FileObject,
PVOID Buffer,
ULONG BufferLength,
ULONG Offset,
PKEVENT Event,
PIO_STATUS_BLOCK IoStatusBlock)

{
PIRP Irp;
LARGE_INTEGER StartingOffset;

//
// The starting offset is a quad. For this example we restrict it to a ULONG
// so we do the conversion here. For YOUR work, you can use a quad if that's
// what's needed.
//

StartingOffset.QuadPart = (LONGLONG) Offset;

//
// We're going to build the I/O request. Note that this is a SYNCHRONOUS
// request, so we provide it with an Event to be set when the I/O itself
// is done. The results go to the caller's I/O status block, which must
// remain valid until the Event is signaled.
//

Irp = IoBuildSynchronousFsdRequest(MajorFunction,
DeviceObject,
Buffer,
BufferLength,
&StartingOffset,
Event,
IoStatusBlock);

if (!Irp) {

//
// Creation of the IRP failed. This may happen when there is insufficient memory
// to create the IRP (non-paged pool exhaustion.)
//

return FALSE;

}

//
// If you are going to call your own driver with this IRP, you should advance
// the I/O stack location (c.f. floppy.c for an example of this) using the
// IoSetNextIrpStackLocation() call.
//

// IoSetNextIrpStackLocation(Irp);

//
// Pass along to the underlying device
//

(void) IoCallDriver(DeviceObject, Irp);

//
// Done!
//

return TRUE;
}

2.2.14.4 Device Control Example (devctrl.c)


//
// (C) Copyright 1997 OSR Open Systems Resources, Inc.
// All Rights Reserved
//

#include <ntddk.h>

//
// This demonstrates how to build a device control and send it
// to a lower level device.
//

NTSTATUS MakeDeviceControl(PDEVICE_OBJECT DeviceObject,
ULONG IoctlCode,
PVOID InputBuffer,
ULONG InputBufferSize,
PVOID OutputBuffer,
ULONG OutputBufferSize)
{
PIRP Irp;
NTSTATUS Status;
KEVENT Event;
IO_STATUS_BLOCK Iosb;

//
// First, start by initializing the event
//

KeInitializeEvent(&Event, SynchronizationEvent, FALSE);

//
// Build the request, using the I/O Manager routine...
//

Irp = IoBuildDeviceIoControlRequest(IoctlCode,
DeviceObject,
InputBuffer,
InputBufferSize,
OutputBuffer,
OutputBufferSize,
FALSE,
&Event,
&Iosb);

if (!Irp) {

//
// The I/O Manager could not build the request.
//

return STATUS_INSUFFICIENT_RESOURCES;

}

//
// Send the request to the lower layer driver.
//

Status = IoCallDriver(DeviceObject, Irp);

//
// Wait, if necessary
//

if (Status == STATUS_PENDING) {

//
// We must wait here in non-interruptible mode. Why? Because our
// event is on the stack. If we were to return out of here (because
// of an APC, for example) we might return from this function BEFORE
// the event is set. When the event is set later, at a minimum we'll
// trash the stack. Worse yet, the stack might be paged out and the
// system will die.
//

(void) KeWaitForSingleObject(&Event, Executive, KernelMode, FALSE, 0);

//
// The final status of the request is in the I/O status block.
//

Status = Iosb.Status;

}

return (Status);
}

2.2.14.5 Header file (roll.h)


//
// (C) Copyright 1997 OSR Open Systems Resources, Inc.
// All Rights Reserved
//

#ifndef __OSR_ROLL_H__
#define __OSR_ROLL_H__ (1)

#include <ntddk.h>

VOID MyInitIrpPackage(CCHAR MaximumNumberOfIrpStackLocations,
USHORT MaximumNumberOfIrps);
PIRP MyAllocateIrp(CCHAR StackSize, BOOLEAN Wait);
VOID MyFreeIrp(PIRP);

#endif // __OSR_ROLL_H__

2.2.14.6 Kernel File Copy Example


This sample demonstrates a technique for copying files in kernel mode. The code presented was used on
the NTFS file system without any problems. Again, the purpose of this example is to demonstrate the basic
technique and allow the reader to customize it for their particular needs.

2.2.14.6.1 Driver (kfc.c)
//
// (C) Copyright 1996 OSR Open Systems Resources, Inc.
// All Rights Reserved
//
// Permission to use this code is granted provided that the OSR
// copyright is retained in the original source code and that
// the OSR copyright is displayed anywhere a copyright notice is
// displayed for the product in which this code is used.
//

//
// This sample program demonstrates how to copy data between two
// files.
//

#include "kfc.h"

//
// Forward declarations
//

static NTSTATUS KfcCreateClose(PDEVICE_OBJECT, PIRP);


static NTSTATUS KfcDeviceControl(PDEVICE_OBJECT, PIRP);
static NTSTATUS KfcCopyFile(PFILE_OBJECT, PFILE_OBJECT);
static VOID KfcGetFileStandardInformation(PFILE_OBJECT, PFILE_STANDARD_INFORMATION,
PIO_STATUS_BLOCK);
static VOID KfcRead(PFILE_OBJECT, PLARGE_INTEGER, ULONG, PMDL, PIO_STATUS_BLOCK);
static VOID KfcWrite(PFILE_OBJECT, PLARGE_INTEGER, ULONG, PMDL, PIO_STATUS_BLOCK);
static VOID KfcSetFileAllocation(PFILE_OBJECT, PLARGE_INTEGER, PIO_STATUS_BLOCK);
static VOID KfcUnload(PDRIVER_OBJECT);
static NTSTATUS KfcIoCompletion(PDEVICE_OBJECT, PIRP, PVOID);

//
// Internal constants
//

#define KFC_MAX_TRANSFER_SIZE (0x10000)


#define KFC_DEVICE_NAME L"\\Device\\OsrKfc"
#define KFC_DOS_NAME L"\\DosDevices\\OsrKfc"

//
// DriverEntry
//
// This is the entry point for the driver, responsible for initializing the
// data structures used by the driver, setting up the driver object and
// preparing for normal operation.
//
// Inputs:
// DriverObject - the driver object representing THIS driver. This is
// created by the I/O Manager when our driver loads.
//
// RegistryPath - A pointer to the registry path to the key for this driver.
//
//
// Outputs:
// None.
//

// Context:
// Like all driver entry routines, this routine is called in the context
// of the system process.
//

NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)


{
UNICODE_STRING deviceName;
UNICODE_STRING dosDeviceName;
PDEVICE_OBJECT deviceObject;
NTSTATUS status;

DbgPrint("KFC Driver Entry called\n");

RtlInitUnicodeString(&deviceName, KFC_DEVICE_NAME);

//
// Create a named device object for the KFC device. This will allow
// communications with the KFC driver.
//

status = IoCreateDevice(DriverObject,
0,
&deviceName,
FILE_DEVICE_UNKNOWN,
0,
FALSE,
&deviceObject);

if (!NT_SUCCESS(status)) {
//
// Indicate the error
//

DbgPrint("IoCreateDevice failed 0x%x\n", status);

return status;

}

//
// Create a symbolic link into the DosDevices area to allow access
// to Win32 applications.
//

RtlInitUnicodeString(&dosDeviceName, KFC_DOS_NAME);

status = IoCreateSymbolicLink(&dosDeviceName, &deviceName);

if (!NT_SUCCESS(status)) {
//
// Indicate the error
//

DbgPrint("IoCreateSymbolicLink failed 0x%x\n", status);

IoDeleteDevice(deviceObject);

return status;

}

//
// Overwrite the driver dispatch entry points this driver will
// implement. Remember that the others all point to the "trivial"
// function that returns STATUS_INVALID_DEVICE_REQUEST.
//

DriverObject->MajorFunction[IRP_MJ_CREATE] = KfcCreateClose;

DriverObject->MajorFunction[IRP_MJ_CLOSE] = KfcCreateClose;

DriverObject->MajorFunction[IRP_MJ_DEVICE_CONTROL] = KfcDeviceControl;

//
// Set up the unload function
//

DriverObject->DriverUnload = KfcUnload;

//
// We are done. Note that the symbolic link created above already exports
// the functionality of this driver to Win32 applications; the sample
// kfcopy application uses the native NT interface instead.
//

return STATUS_SUCCESS;
}

//
// KfcUnload
//
// This routine is called to unload the driver.
//
// Inputs:
// DriverObject - the driver object for this driver
//
// Outputs:
// None.
//
// Returns:
// None.
//
// Notes:
// None.

static VOID KfcUnload(PDRIVER_OBJECT DriverObject)


{
UNICODE_STRING symLinkName;

//
// First, delete the symbolic link.
//

RtlInitUnicodeString(&symLinkName, KFC_DOS_NAME);

IoDeleteSymbolicLink(&symLinkName);

//
// Now, delete the device object.
//

IoDeleteDevice(DriverObject->DeviceObject);

//
// Done!
//

return;
}

//
// KfcCreateClose
//
//

static NTSTATUS KfcCreateClose(PDEVICE_OBJECT DeviceObject, PIRP Irp)


{

Irp->IoStatus.Status = STATUS_SUCCESS;

Irp->IoStatus.Information = 0;

IoCompleteRequest(Irp, IO_NO_INCREMENT);

return STATUS_SUCCESS;
}

//
// KfcDeviceControl
//

static NTSTATUS KfcDeviceControl(PDEVICE_OBJECT DeviceObject, PIRP Irp)


{
PIO_STACK_LOCATION irpSp = IoGetCurrentIrpStackLocation(Irp);
PHANDLE fileHandles;
PFILE_OBJECT source, target;
NTSTATUS status;

switch(irpSp->Parameters.DeviceIoControl.IoControlCode) {

case KFC_COPY_FILE:
//
// Extract the file handles
//

if (irpSp->Parameters.DeviceIoControl.InputBufferLength < 2 * sizeof(HANDLE)) {

//
// The buffer is not large enough to contain the file handles.
//

Irp->IoStatus.Status = STATUS_INVALID_PARAMETER;

Irp->IoStatus.Information = 0;

break;
}

fileHandles = (PHANDLE)Irp->AssociatedIrp.SystemBuffer;

status = ObReferenceObjectByHandle(fileHandles[0],
FILE_READ_ACCESS, // ACCESS_MASK
*IoFileObjectType, // POBJECT_TYPE
UserMode, // Access mode
&source, // output file object
0);

if (!NT_SUCCESS(status)) {

Irp->IoStatus.Status = STATUS_INVALID_PARAMETER;

Irp->IoStatus.Information = 0;

break;

}

status = ObReferenceObjectByHandle(fileHandles[1],
FILE_WRITE_ACCESS, // ACCESS_MASK
*IoFileObjectType,
UserMode,
&target,
0);

if (!NT_SUCCESS(status)) {

//
// Failure to dereference the object will cause file object leakage in
// the system.
//

ObDereferenceObject(source);

//
// Fail the request.
//

Irp->IoStatus.Status = STATUS_INVALID_PARAMETER;

Irp->IoStatus.Information = 0;

break;

}

//
// If we get to this point, we perform the copy.
//

Irp->IoStatus.Status = KfcCopyFile(target, source);

Irp->IoStatus.Information = 0;

//
// Now release the file object references
//

ObDereferenceObject(source);

ObDereferenceObject(target);

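break;

default:
Irp->IoStatus.Status = STATUS_NOT_IMPLEMENTED;

Irp->IoStatus.Information = 0;

break;

}

//
// Save the status first: the IRP must not be touched once it is completed.
//

status = Irp->IoStatus.Status;

IoCompleteRequest(Irp, IO_NO_INCREMENT);

return status;
}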

//
// KfcCopyFile
//
// This routine implements the fast file copy code.
//
// Inputs:
// TargetFileObject - copying TO
// SourceFileObject - copying FROM
//
// Outputs:
// None.
//
// Returns:
// SUCCESS when it works, otherwise an appropriate error
//
// Notes:
// None.
//

static NTSTATUS KfcCopyFile(PFILE_OBJECT TargetFileObject, PFILE_OBJECT SourceFileObject)


{
PVOID buffer;
PMDL mdl;
IO_STATUS_BLOCK iosb;
FILE_STANDARD_INFORMATION standardInformation;
LARGE_INTEGER currentOffset;
LONGLONG bytesToTransfer;

//
// The algorithm used by this routine is straightforward: read 64k chunks from the
// source file and write them to the target file, until the entire file has been
// copied.
//

buffer = ExAllocatePoolWithTag(NonPagedPool,
KFC_MAX_TRANSFER_SIZE,
'BcfK');

if (!buffer) {
//
// Allocation must have failed.
//

return STATUS_INSUFFICIENT_RESOURCES;

}

//
// Build an MDL describing the buffer. We'll use THAT to do the
// I/O (rather than a direct buffer address.)
//

mdl = IoAllocateMdl(buffer, KFC_MAX_TRANSFER_SIZE, FALSE, TRUE, 0);

if (!mdl) {

ExFreePool(buffer);

return STATUS_INSUFFICIENT_RESOURCES;

}

MmBuildMdlForNonPagedPool(mdl);

//
// Set up the current offset information
//

currentOffset.QuadPart = 0;

//
// Get the size of the input file.
//

KfcGetFileStandardInformation(SourceFileObject, &standardInformation, &iosb);

if (!NT_SUCCESS(iosb.Status)) {
//
// This is a failure condition. Release the MDL and the buffer first.
//

IoFreeMdl(mdl);

ExFreePool(buffer);

return (iosb.Status);

}

//
// Set the allocation size of the output file.
//

KfcSetFileAllocation(TargetFileObject,
&standardInformation.AllocationSize,
&iosb);

if (!NT_SUCCESS(iosb.Status)) {

//
// Failure. Release the MDL and the buffer first.
//

IoFreeMdl(mdl);

ExFreePool(buffer);

return (iosb.Status);

}

//
// Save away the information about the # of bytes to transfer.
//

bytesToTransfer = standardInformation.EndOfFile.QuadPart;

//
// Now copy the source to the target until we run out...
//

while (bytesToTransfer) {
ULONG nextTransferSize;

//
// The # of bytes to copy in the next operation is the smaller of the
// balance left in the file and KFC_MAX_TRANSFER_SIZE
//

nextTransferSize = (bytesToTransfer < KFC_MAX_TRANSFER_SIZE) ?
(ULONG) bytesToTransfer : KFC_MAX_TRANSFER_SIZE;

KfcRead(SourceFileObject, &currentOffset, nextTransferSize, mdl, &iosb);

if (!NT_SUCCESS(iosb.Status)) {

//
// An error condition occurred.
//

IoFreeMdl(mdl);

ExFreePool(buffer);

return (iosb.Status);
}

KfcWrite(TargetFileObject, &currentOffset, nextTransferSize, mdl, &iosb);

if (!NT_SUCCESS(iosb.Status)) {

//
// An error condition occurred.
//

IoFreeMdl(mdl);

ExFreePool(buffer);

return (iosb.Status);
}

//
// Now, update the offset/bytes to transfer information
//

currentOffset.QuadPart += nextTransferSize;

bytesToTransfer -= nextTransferSize;
}

//
// At this point, we're done with the copy operation. Release the MDL and
// the buffer, then return success.
//

IoFreeMdl(mdl);

ExFreePool(buffer);

return (STATUS_SUCCESS);
}

//
// KfcGetFileStandardInformation
//
// This function retrieves the "standard" information for the underlying file system.
//
// Inputs:
// FileObject - the file to retrieve information about
//
// Outputs:
// StandardInformation - the buffer where the data should be stored
// IoStatusBlock - information about what actually happened.
//
// Returns:
// None.
//
// Notes:
// This is equivalent to ZwQueryInformationFile, for FILE_STANDARD_INFORMATION
//

static VOID KfcGetFileStandardInformation(PFILE_OBJECT FileObject,
PFILE_STANDARD_INFORMATION StandardInformation,
PIO_STATUS_BLOCK IoStatusBlock)
{
PIRP irp;
PDEVICE_OBJECT fsdDevice = IoGetRelatedDeviceObject(FileObject);
KEVENT event;
PIO_STACK_LOCATION ioStackLocation;

//
// Start off on the right foot - zero the information block.
//

RtlZeroMemory(StandardInformation, sizeof(FILE_STANDARD_INFORMATION));

//
// Allocate an irp for this request. This could also come from a
// private pool, for instance.
//

irp = IoAllocateIrp(fsdDevice->StackSize, FALSE);

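if (!irp) {
//
// Failure! Report it through the I/O status block, since that is
// what the caller examines.
//

IoStatusBlock->Status = STATUS_INSUFFICIENT_RESOURCES;

IoStatusBlock->Information = 0;

return;
}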

irp->AssociatedIrp.SystemBuffer = StandardInformation;

irp->UserEvent = &event;

irp->UserIosb = IoStatusBlock;

irp->Tail.Overlay.Thread = PsGetCurrentThread();

irp->Tail.Overlay.OriginalFileObject = FileObject;

irp->RequestorMode = KernelMode;

//
// Initialize the event
//

KeInitializeEvent(&event, SynchronizationEvent, FALSE);

//
// Set up the I/O stack location.
//

ioStackLocation = IoGetNextIrpStackLocation(irp);

ioStackLocation->MajorFunction = IRP_MJ_QUERY_INFORMATION;

ioStackLocation->DeviceObject = fsdDevice;

ioStackLocation->FileObject = FileObject;

ioStackLocation->Parameters.QueryFile.Length = sizeof(FILE_STANDARD_INFORMATION);

ioStackLocation->Parameters.QueryFile.FileInformationClass = FileStandardInformation;

//
// Set the completion routine.
//

IoSetCompletionRoutine(irp, KfcIoCompletion, 0, TRUE, TRUE, TRUE);

//
// Send it to the FSD
//

(void) IoCallDriver(fsdDevice, irp);

//
// Wait for the I/O
//

KeWaitForSingleObject(&event, Executive, KernelMode, TRUE, 0);

//
// Done!
//

return;
}

//
// KfcIoCompletion
//
// This routine is used to handle I/O (read OR write) completion
//
// Inputs:
// DeviceObject - not used
// Irp - the I/O operation being completed
// Context - not used
//
// Outputs:
// None.
//
// Returns:
// STATUS_MORE_PROCESSING_REQUIRED
//
// Notes:
// The purpose of this routine is to do "cleanup" on I/O operations
// so we don't constantly throw away perfectly good MDLs as part of
// completion processing.
//

static NTSTATUS KfcIoCompletion(PDEVICE_OBJECT DeviceObject,
PIRP Irp,
PVOID Context)
{
//
// Copy the status information back into the "user" IOSB.
//

*Irp->UserIosb = Irp->IoStatus;

//
// Set the user event - wakes up the mainline code doing this.
//

KeSetEvent(Irp->UserEvent, 0, FALSE);

//
// Free the IRP now that we are done with it.
//

IoFreeIrp(Irp);

//
// We return STATUS_MORE_PROCESSING_REQUIRED because this "magic" return value
// tells the I/O Manager that additional processing will be done by this driver
// to the IRP - in fact, it might (as it is in this case) already BE done - and
// the IRP cannot be completed.
//

return STATUS_MORE_PROCESSING_REQUIRED;
}

static VOID KfcRead(PFILE_OBJECT FileObject,
PLARGE_INTEGER Offset,
ULONG Length,
PMDL Mdl,
PIO_STATUS_BLOCK IoStatusBlock)
{
PIRP irp;
KEVENT event;
PIO_STACK_LOCATION ioStackLocation;
PDEVICE_OBJECT fsdDevice = IoGetRelatedDeviceObject(FileObject);

//
// Set up the event we'll use.
//

KeInitializeEvent(&event, SynchronizationEvent, FALSE);

//
// Allocate and build the IRP we'll be sending to the FSD.
//

irp = IoAllocateIrp(fsdDevice->StackSize, FALSE);

if (!irp) {

//
// Allocation failed, presumably due to memory allocation failure.
//

IoStatusBlock->Status = STATUS_INSUFFICIENT_RESOURCES;

IoStatusBlock->Information = 0;

return;
}

irp->MdlAddress = Mdl;

irp->UserEvent = &event;

irp->UserIosb = IoStatusBlock;

irp->Tail.Overlay.Thread = PsGetCurrentThread();

irp->Tail.Overlay.OriginalFileObject= FileObject;

irp->RequestorMode = KernelMode;

//
// Indicate that this is a READ operation.
//

irp->Flags = IRP_READ_OPERATION;

//
// Set up the next I/O stack location. These are the parameters

// that will be passed to the underlying driver.
//

ioStackLocation = IoGetNextIrpStackLocation(irp);

ioStackLocation->MajorFunction = IRP_MJ_READ;

ioStackLocation->MinorFunction = 0;

ioStackLocation->DeviceObject = fsdDevice;

ioStackLocation->FileObject = FileObject;

//
// We use a completion routine to keep the I/O Manager from doing
// "cleanup" on our IRP - like freeing our MDL.
//

IoSetCompletionRoutine(irp, KfcIoCompletion, 0, TRUE, TRUE, TRUE);

ioStackLocation->Parameters.Read.Length = Length;

ioStackLocation->Parameters.Read.ByteOffset = *Offset;

//
// Send it on. Ignore the return code.
//

(void) IoCallDriver(fsdDevice, irp);

//
// Wait for the I/O to complete.
//

KeWaitForSingleObject(&event, Executive, KernelMode, TRUE, 0);

//
// Done. Return results are in the io status block.
//

return;
}

//
// KfcSetFileAllocation
//
// This routine sets a file's ALLOCATION size to the specified value.
// Note that this DOES NOT extend the file's EOF.
//
// Inputs:
// FileObject - the file on which to set the allocation size
// AllocationSize - the new allocation size
//
// Outputs:

// IoStatusBlock - the results of this operation
//
// Returns:
// None.
//
// Notes:
// None.

static VOID KfcSetFileAllocation(PFILE_OBJECT FileObject,
PLARGE_INTEGER AllocationSize,
PIO_STATUS_BLOCK IoStatusBlock)
{
PIRP irp;
PDEVICE_OBJECT fsdDevice = IoGetRelatedDeviceObject(FileObject);
KEVENT event;
PIO_STACK_LOCATION ioStackLocation;

//
// Allocate an irp for this request. This could also come from a
// private pool, for instance.
//

irp = IoAllocateIrp(fsdDevice->StackSize, FALSE);

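if (!irp) {
//
// Failure! Report it through the I/O status block, since that is
// what the caller examines.
//

IoStatusBlock->Status = STATUS_INSUFFICIENT_RESOURCES;

IoStatusBlock->Information = 0;

return;
}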

irp->AssociatedIrp.SystemBuffer = AllocationSize;

irp->UserEvent = &event;

irp->UserIosb = IoStatusBlock;

irp->Tail.Overlay.Thread = PsGetCurrentThread();

irp->Tail.Overlay.OriginalFileObject = FileObject;

irp->RequestorMode = KernelMode;

//
// Initialize the event
//

KeInitializeEvent(&event, SynchronizationEvent, FALSE);

//
// Set up the I/O stack location.
//

ioStackLocation = IoGetNextIrpStackLocation(irp);

ioStackLocation->MajorFunction = IRP_MJ_SET_INFORMATION;

ioStackLocation->DeviceObject = fsdDevice;

ioStackLocation->FileObject = FileObject;

ioStackLocation->Parameters.SetFile.Length = sizeof(LARGE_INTEGER);

ioStackLocation->Parameters.SetFile.FileInformationClass = FileAllocationInformation;

ioStackLocation->Parameters.SetFile.FileObject = 0; // not used for allocation

ioStackLocation->Parameters.SetFile.AdvanceOnly = FALSE;

//
// Set the completion routine.
//

IoSetCompletionRoutine(irp, KfcIoCompletion, 0, TRUE, TRUE, TRUE);

//
// Send it to the FSD
//

(void) IoCallDriver(fsdDevice, irp);

//
// Wait for the I/O
//

KeWaitForSingleObject(&event, Executive, KernelMode, TRUE, 0);

//
// Done!
//

return;
}

static VOID KfcWrite(PFILE_OBJECT FileObject,
PLARGE_INTEGER Offset,
ULONG Length,
PMDL Mdl,
PIO_STATUS_BLOCK IoStatusBlock)
{
PIRP irp;
KEVENT event;
PIO_STACK_LOCATION ioStackLocation;
PDEVICE_OBJECT fsdDevice = IoGetRelatedDeviceObject(FileObject);

//
// Set up the event we'll use.
//

KeInitializeEvent(&event, SynchronizationEvent, FALSE);

//
// Allocate and build the IRP we'll be sending to the FSD.
//

irp = IoAllocateIrp(fsdDevice->StackSize, FALSE);

if (!irp) {

//
// Allocation failed, presumably due to memory allocation failure.
//

IoStatusBlock->Status = STATUS_INSUFFICIENT_RESOURCES;

IoStatusBlock->Information = 0;

return;
}

irp->MdlAddress = Mdl;

irp->UserEvent = &event;

irp->UserIosb = IoStatusBlock;

irp->Tail.Overlay.Thread = PsGetCurrentThread();

irp->Tail.Overlay.OriginalFileObject= FileObject;

irp->RequestorMode = KernelMode;

//
// Indicate that this is a WRITE operation.
//

irp->Flags = IRP_WRITE_OPERATION;

//
// Set up the next I/O stack location. These are the parameters
// that will be passed to the underlying driver.
//

ioStackLocation = IoGetNextIrpStackLocation(irp);

ioStackLocation->MajorFunction = IRP_MJ_WRITE;

ioStackLocation->MinorFunction = 0;

ioStackLocation->DeviceObject = fsdDevice;

ioStackLocation->FileObject = FileObject;

//
// We use a completion routine to keep the I/O Manager from doing
// "cleanup" on our IRP - like freeing our MDL.

//

IoSetCompletionRoutine(irp, KfcIoCompletion, 0, TRUE, TRUE, TRUE);

ioStackLocation->Parameters.Write.Length = Length;

ioStackLocation->Parameters.Write.ByteOffset = *Offset;

//
// Send it on. Ignore the return code.
//

(void) IoCallDriver(fsdDevice, irp);

//
// Wait for the I/O to complete.
//

KeWaitForSingleObject(&event, Executive, KernelMode, TRUE, 0);

//
// Done. Return results are in the io status block.
//

return;
}
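
The chunking arithmetic used by KfcCopyFile can be exercised on its own in user mode, without any I/O. The following is a minimal sketch and is not part of the driver. The names KFC_SKETCH_MAX_TRANSFER, NextTransferSize and CountTransfers are hypothetical, and the 64 KB chunk size is an assumption taken from the comment in KfcCopyFile, since the definition of KFC_MAX_TRANSFER_SIZE does not appear in this listing.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed chunk size: the driver comment says "64k chunks". */
#define KFC_SKETCH_MAX_TRANSFER (64u * 1024u)

/* Size of the next transfer: the smaller of what remains and the chunk size. */
static uint32_t NextTransferSize(int64_t bytesRemaining)
{
    assert(bytesRemaining >= 0);

    return (bytesRemaining < (int64_t) KFC_SKETCH_MAX_TRANSFER)
               ? (uint32_t) bytesRemaining
               : KFC_SKETCH_MAX_TRANSFER;
}

/* Walks the same loop as KfcCopyFile, without doing I/O, and reports how
   many transfers are issued and the offset reached at the end. */
static uint32_t CountTransfers(int64_t fileSize, int64_t *finalOffset)
{
    int64_t offset = 0;
    int64_t remaining = fileSize;
    uint32_t transfers = 0;

    while (remaining > 0) {
        uint32_t next = NextTransferSize(remaining);

        offset += next;
        remaining -= next;
        transfers++;
    }

    *finalOffset = offset;
    return transfers;
}
```

Under the 64 KB assumption, a 150,000-byte file is copied in three transfers (65,536 + 65,536 + 18,928 bytes), and the final offset equals the file size.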

2.2.14.6.2 Application (kfcopy.c)


//
// (C) Copyright 1997 OSR Open Systems Resources, Inc.
// All Rights Reserved.
//
//
// This program is intended for use with the KFC kernel mode driver.
//

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <memory.h>
#include "../drv/kfc.h"

#define KFC_DRIVER_NAME "osr-kfc"


#define KFC_DRIVER_BINARY "\\osr-kfc.sys"

//
// Forward declarations.
//

BOOL
InstallDriver( IN SC_HANDLE SchSCManager,
IN LPCTSTR DriverName,
IN LPCTSTR ServiceExe );

BOOL
RemoveDriver( IN SC_HANDLE SchSCManager,
IN LPCTSTR DriverName );

BOOL
StartDriver( IN SC_HANDLE SchSCManager,
IN LPCTSTR DriverName );

BOOL
StopDriver( IN SC_HANDLE SchSCManager,
IN LPCTSTR DriverName );

//
// Main:
//
// This is the entry point for this function. Caller provides the input and output
// names on the command line.
//
// Inputs:
// argc - the argument count (# of arguments). Minimum is one (the name of the
// program)
// argv - the array of pointers to the arguments on the input line.
//
// Outputs:
// None.
//
// Returns:
// Various.

int _cdecl main(ULONG argc, LPSTR *argv)


{
LPSTR sourceFileName = 0, targetFileName = 0;
HANDLE sourceFileHandle, targetFileHandle;
HANDLE kfcDriverHandle;
HANDLE handles[2];
DWORD outputSize;
SC_HANDLE scmHandle;
PVOID driverBinaryName;
DWORD driverBinaryNameLength;
LPVOID lpMsgBuf;

if (argc != 3) {
//
// This version requires that both names be specified
// on the command line.
//

fprintf(stderr, "Usage: kfcopy <source> <target>\n");

return 1;
}

sourceFileName = argv[1];

targetFileName = argv[2];

//
// Open the service control manager
//

scmHandle = OpenSCManager(NULL, // use local system's SCM


NULL, // default database
SC_MANAGER_ALL_ACCESS); // we want to do EVERYTHING

if (NULL == scmHandle) {
//
// Could not open SCM.
//

fprintf(stderr, "Unable to contact the Service Control Manager\n");

return 2;
}

//
// First, try to open the KFC driver.
//

kfcDriverHandle = CreateFile("\\\\.\\OsrKfc",
0, // any access
FILE_SHARE_READ|FILE_SHARE_WRITE,
0, // no security
OPEN_EXISTING, // must already exist
0,
0);

if (INVALID_HANDLE_VALUE == kfcDriverHandle) {
//
// Open failed. Try to install and load the KFC driver.
//

//
// Compute the path to the binary (this directory)
//

driverBinaryNameLength = GetCurrentDirectory(0, NULL);

driverBinaryNameLength += strlen(KFC_DRIVER_BINARY);

driverBinaryName = malloc(driverBinaryNameLength);

if (!driverBinaryName) {
//
// malloc failed
//

fprintf(stderr, "Malloc failed\n");

return 3;
}

GetCurrentDirectory(driverBinaryNameLength, driverBinaryName);

strcat(driverBinaryName, KFC_DRIVER_BINARY);

printf("driverBinaryName = %s\n", driverBinaryName);

//
// First, remove it but ignore the error value. That's to
// ensure we start with a clean registry key.
//

(void) RemoveDriver(scmHandle, KFC_DRIVER_NAME);

//
// Now, install the driver.
//

if (!InstallDriver(scmHandle, KFC_DRIVER_NAME, driverBinaryName)) {

//
// Driver installation failed
//

fprintf(stderr, "KFC driver could not be installed\n");

return 4;
}

if (!StartDriver(scmHandle, KFC_DRIVER_NAME)) {
//
// Driver start failed.
//

fprintf(stderr, "KFC driver could not be started\n");

return 5;
}

//
// Try opening the KFC driver again.
//

kfcDriverHandle = CreateFile("\\\\.\\OsrKfc",
0, // any access
FILE_SHARE_READ|FILE_SHARE_WRITE,
0, // no security
OPEN_EXISTING, // must already exist
0,
0);

if (INVALID_HANDLE_VALUE == kfcDriverHandle) {
//
// Everything failed
//

fprintf(stderr, "Unable to load/open driver\n");

return 6;
}

}

//
// Now, open the source file.
//

sourceFileHandle = CreateFile(sourceFileName,
GENERIC_READ,
FILE_SHARE_READ,
0,
OPEN_EXISTING,
0,
0);

if (INVALID_HANDLE_VALUE == sourceFileHandle) {
//
// Open failed.
//

fprintf(stderr, "Unable to open source file %s\n", sourceFileName);

// XXX: for testing purposes ignore errors here.


// return 7;
}

//
// Finally, create the target file.
//

targetFileHandle = CreateFile(targetFileName,
GENERIC_WRITE,
0, // no sharing
0,
CREATE_ALWAYS,
FILE_ATTRIBUTE_ARCHIVE,
0);

if (INVALID_HANDLE_VALUE == targetFileHandle) {
//
// Open failed.
//

fprintf(stderr, "Unable to open target file %s\n", targetFileName);

// XXX: for testing purposes ignore errors here.


// return 8;
}

//
// OK. Build the two handle array we need for the call to KFC.
//

handles[0] = sourceFileHandle;

handles[1] = targetFileHandle;

printf("handles[0] = %p\n", (void *) handles[0]);

printf("handles[1] = %p\n", (void *) handles[1]);

if (!DeviceIoControl(kfcDriverHandle,
KFC_COPY_FILE,
handles,
sizeof(handles),
0, // no output
0,
&outputSize,
0)) {
DWORD lastError = GetLastError();

fprintf(stderr, "Copy failed (0x%x)\n", lastError);

FormatMessage( FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM,


NULL,
lastError,
MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), // Default language
(LPTSTR) &lpMsgBuf,
0,
NULL);

fprintf(stderr, "*** %s\n", lpMsgBuf);

LocalFree( lpMsgBuf );

return 9;
}

//
// If we make it to this point, we've succeeded.
//

CloseHandle(sourceFileHandle);

FlushFileBuffers(targetFileHandle);

CloseHandle(targetFileHandle);

//
// Stop the driver
//

(void) StopDriver(scmHandle, KFC_DRIVER_NAME);

//
// Uninstall the driver
//

(void) RemoveDriver(scmHandle, KFC_DRIVER_NAME);

//
// Close the SCM
//

CloseServiceHandle(scmHandle);

return 0;
}

//
// Code following this point is taken from the instdrv.c example in the
// NT DDK verbatim.
//

/*++

Copyright (c) 1993 Microsoft Corporation

Module Name:

Instdrv.c

Abstract:

A simple Win32 app that installs a device driver

Environment:

user mode only

Notes:

See readme.txt

Revision History:

06-25-93 : created
--*/

BOOL
InstallDriver(
IN SC_HANDLE SchSCManager,
IN LPCTSTR DriverName,
IN LPCTSTR ServiceExe
)
/*++

Routine Description:

Arguments:

Return Value:

--*/
{
SC_HANDLE schService;
DWORD err;

//
// NOTE: This creates an entry for a standalone driver. If this
// is modified for use with a driver that requires a Tag,
// Group, and/or Dependencies, it may be necessary to
// query the registry for existing driver information
// (in order to determine a unique Tag, etc.).
//

schService = CreateService (SchSCManager, // SCManager database


DriverName, // name of service
DriverName, // name to display
SERVICE_ALL_ACCESS, // desired access
SERVICE_KERNEL_DRIVER, // service type
SERVICE_DEMAND_START, // start type
SERVICE_ERROR_NORMAL, // error control type
ServiceExe, // service's binary
NULL, // no load ordering group
NULL, // no tag identifier
NULL, // no dependencies
NULL, // LocalSystem account
NULL // no password
);

if (schService == NULL)
{
err = GetLastError();

if (err == ERROR_SERVICE_EXISTS)
{
//
// A common cause of failure (easier to read than an error code)
//

printf ("failure: CreateService, ERROR_SERVICE_EXISTS\n");


}
else
{
printf ("failure: CreateService (0x%02x)\n",
err
);
}

return FALSE;
}
else
{
printf ("CreateService SUCCESS\n");
}

CloseServiceHandle (schService);

return TRUE;
}

BOOL
RemoveDriver(

IN SC_HANDLE SchSCManager,
IN LPCTSTR DriverName
)
/*++

Routine Description:

Arguments:

Return Value:

--*/
{
SC_HANDLE schService;
BOOL ret;

schService = OpenService (SchSCManager,


DriverName,
SERVICE_ALL_ACCESS
);

if (schService == NULL)
{
printf ("failure: OpenService (0x%02x)\n", GetLastError());
return FALSE;
}

ret = DeleteService (schService);

if (ret)
{
printf ("DeleteService SUCCESS\n");
}
else
{
printf ("failure: DeleteService (0x%02x)\n",
GetLastError()
);
}

CloseServiceHandle (schService);

return ret;
}

BOOL
StartDriver(
IN SC_HANDLE SchSCManager,
IN LPCTSTR DriverName
)
{
SC_HANDLE schService;
BOOL ret;
DWORD err;

schService = OpenService (SchSCManager,


DriverName,

SERVICE_ALL_ACCESS
);

if (schService == NULL)
{
printf ("failure: OpenService (0x%02x)\n", GetLastError());
return FALSE;
}

ret = StartService (schService, // service identifier


0, // number of arguments
NULL // pointer to arguments
);
if (ret)
{
printf ("StartService SUCCESS\n");
}
else
{
err = GetLastError();

if (err == ERROR_SERVICE_ALREADY_RUNNING)
{
//
// A common cause of failure (easier to read than an error code)
//

printf ("failure: StartService, ERROR_SERVICE_ALREADY_RUNNING\n");


}
else
{
printf ("failure: StartService (0x%02x)\n",
err
);
}
}

CloseServiceHandle (schService);

return ret;
}

BOOL
StopDriver(
IN SC_HANDLE SchSCManager,
IN LPCTSTR DriverName
)
{
SC_HANDLE schService;
BOOL ret;
SERVICE_STATUS serviceStatus;

schService = OpenService (SchSCManager,


DriverName,
SERVICE_ALL_ACCESS
);

if (schService == NULL)
{
printf ("failure: OpenService (0x%02x)\n", GetLastError());
return FALSE;
}

ret = ControlService (schService,


SERVICE_CONTROL_STOP,
&serviceStatus
);
if (ret)
{
printf ("ControlService SUCCESS\n");
}
else
{
printf ("failure: ControlService (0x%02x)\n",
GetLastError()
);
}

CloseServiceHandle (schService);

return ret;
}

BOOL
OpenDevice(
IN LPCTSTR DriverName
)
{
char completeDeviceName[64] = "";
LPCTSTR dosDeviceName = DriverName;
HANDLE hDevice;
BOOL ret;

//
// Create a \\.\XXX device name that CreateFile can use
//
// NOTE: We're making an assumption here that the driver
// has created a symbolic link using its own name
// (i.e. if the driver has the name "XXX" we assume
// that it used IoCreateSymbolicLink to create a
// symbolic link "\DosDevices\XXX". Usually, there
// is this understanding between related apps/drivers.
//
// An application might also peruse the DEVICEMAP
// section of the registry, or use the QueryDosDevice
// API to enumerate the existing symbolic links in the
// system.
//

strcat (completeDeviceName,
"\\\\.\\"
);

strcat (completeDeviceName,
dosDeviceName
);

hDevice = CreateFile (completeDeviceName,


GENERIC_READ | GENERIC_WRITE,
0,
NULL,
OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL,
NULL
);

if (hDevice == ((HANDLE)-1))
{
printf ("Can't get a handle to %s\n",
completeDeviceName
);

ret = FALSE;
}
else
{
printf ("CreateFile SUCCESS\n");

CloseHandle (hDevice);

ret = TRUE;
}

return ret;
}
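
The driver path computed in main (the current directory followed by the driver file name) depends on getting the buffer size right, including the terminating NUL. The following is a minimal portable sketch of that concatenation, not the application's code. JoinDriverPath is a hypothetical helper, and it takes the directory as a parameter instead of calling GetCurrentDirectory so that it can be compiled and checked on any platform.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated "<directory><leaf>" string, or NULL on failure.
   The caller releases the result with free(). */
static char *JoinDriverPath(const char *directory, const char *leaf)
{
    size_t dirLen;
    size_t leafLen;
    char *path;

    assert(directory != NULL && leaf != NULL);

    dirLen = strlen(directory);
    leafLen = strlen(leaf);

    /* One extra byte for the terminating NUL. */
    path = malloc(dirLen + leafLen + 1);
    if (path == NULL) {
        return NULL;
    }

    memcpy(path, directory, dirLen);
    memcpy(path + dirLen, leaf, leafLen + 1); /* copies the NUL as well */

    return path;
}
```

Sizing the allocation from the two string lengths, rather than from a separately computed count, keeps the buffer and its contents in agreement even if the directory changes between calls.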

2.2.14.6.3 Header file (kfc.h)


//
//
// (C) Copyright 1996 OSR Open Systems Resources, Inc.
// All Rights Reserved
//
// Permission to use this code is granted provided that the OSR
// copyright is retained in the original source code and that
// the OSR copyright is displayed anywhere a copyright notice is
// displayed for the product in which this code is used.
//

//
// This header file describes io control codes implemented by the kfc
// driver.
//

#ifndef __OSR_KFC_H__
#define __OSR_KFC_H__ 1
#ifdef KFC
#include <ntddk.h>
#else
#include <winioctl.h>

#endif // KFC

#define KFC_COPY_FILE CTL_CODE(FILE_DEVICE_UNKNOWN, 3121, METHOD_BUFFERED, FILE_ANY_ACCESS)

#endif // __OSR_KFC_H__
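
It can be useful to see the numeric value that KFC_COPY_FILE expands to, for example when reading an I/O trace. The sketch below reproduces the CTL_CODE bit layout (device type in bits 16 and up, access in bits 14 and 15, function in bits 2 through 13, method in bits 0 and 1) using local stand-in names so that it compiles without the Windows headers. All SKETCH_ and Sketch names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the values defined in the Windows headers. */
#define SKETCH_FILE_DEVICE_UNKNOWN 0x00000022u
#define SKETCH_METHOD_BUFFERED     0u
#define SKETCH_FILE_ANY_ACCESS     0u

/* Same bit layout as CTL_CODE: device type, access, function, method. */
#define SKETCH_CTL_CODE(DeviceType, Function, Method, Access) \
    (((uint32_t)(DeviceType) << 16) | ((uint32_t)(Access) << 14) | \
     ((uint32_t)(Function) << 2) | (uint32_t)(Method))

/* Extracts the function number (bits 2..13) from a control code. */
static uint32_t SketchFunctionFromCode(uint32_t code)
{
    return (code >> 2) & 0xFFFu;
}

/* Builds the control code with the same arguments kfc.h passes to CTL_CODE. */
static uint32_t SketchKfcCopyFileCode(void)
{
    const uint32_t function = 3121u;

    assert(function <= 0xFFFu); /* the function field is 12 bits wide */

    return SKETCH_CTL_CODE(SKETCH_FILE_DEVICE_UNKNOWN, function,
                           SKETCH_METHOD_BUFFERED, SKETCH_FILE_ANY_ACCESS);
}
```

With these inputs the control code works out to 0x002230C4.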

3 Cache Manager Runtime
In this section, we provide a basic description of the runtime routines used by the cache manager.
Additionally, there are samples for using several of the routines as well as cross references to code that can
be found within the Microsoft IFS Kit itself.

3.1 Cache Manager Overview


The Cache Manager is a software-only component that works closely with the Windows NT Memory Manager
to integrate the caching of file system data with the Virtual Memory System. Some operating
systems implement their file systems so they have a distinct data cache. However, because such caches
must be carved out of physical memory they are limited in size, and memory used for such a cache is not
available for use elsewhere in the system.

Thus, one key advantage of using the Windows NT Cache Manager is that it balances the use of physical
memory between file caching and the programs running on the system. When an application is
I/O intensive the balance can be “tipped” towards caching data. When an application is consuming
memory the amount of memory used for caching data can be reduced to practically zero. The net
result is that the system makes better use of physical memory and ultimately provides better performance.

The other key reason for the file system to use the Cache Manager is that a file can be accessed either via
the standard file system interface, such as read and write, or via the Memory Manager as a
“memory mapped” file. When both access methods are used on the same file, the Cache Manager
provides a mechanism for bridging between the two to ensure consistency of the data.

3.2 Cache Manager Data Structures


The file system interacts with the Cache Manager through a procedural interface.
Essentially all of the data structures within the Cache Manager are associated with a file, but the
internal layout of those data structures is opaque to the file system. In this section we describe the
key data structures that are shared between the file system and the Cache Manager.
3.2.1 Buffer Control Block (BCB)
A buffer control block is used internally by the Cache Manager to track when a portion of a file is mapped
into the system address space. This is exposed to the file system because sometimes it is necessary for the
file system to pin the data in memory while it is performing some critical operation.

Most of the buffer control block (BCB) is opaque. The first portion of the BCB is exposed to file systems:

typedef struct _PUBLIC_BCB {


CSHORT NodeTypeCode;
CSHORT NodeByteSize;
ULONG MappedLength;
LARGE_INTEGER MappedFileOffset;
} PUBLIC_BCB, *PPUBLIC_BCB;

The first two fields of the buffer control block are standard for Windows NT data structures – they
uniquely identify both the type and size of the data structure itself. The last two fields are of interest to
the file system as they identify the range of the file managed by this particular buffer control block.

3.2.2 File Size Information


The file system and Memory Manager each maintain information about the size of the file. Whenever the
file system establishes mapping for a file it indicates the current size of the file. Any subsequent changes to
the size of the file are similarly indicated to the Cache Manager.

There are three values used by the Cache Manager to indicate the current size of the file:

typedef struct _CC_FILE_SIZES {
LARGE_INTEGER AllocationSize;
LARGE_INTEGER FileSize;
LARGE_INTEGER ValidDataLength;
} CC_FILE_SIZES, *PCC_FILE_SIZES;

The names of these fields can be confusing. For instance, the AllocationSize field is used not to identify
the actual physical space allocated for the file, but rather the amount of data that can fit in the presently
allocated space. For some file systems these are the same value. However, for a file system
that supports compression or expansion of the actual data, this value represents the amount of data that
could fit.

The AllocationSize of the file is used by the Memory Manager to represent the size of the “section object.”
Since a section object is then used to determine how a file is mapped into memory, it is essential that the
AllocationSize always be at least as large as the file. The Cache Manager and Memory Manager do not
detect the case when the file system sets the AllocationSize to be smaller than the file size; instead, the
system crashes due to the inconsistency in the data structures.

The FileSize of the file represents the last valid byte of data in the file – logically it is the “End of File”
marker.

The ValidDataLength of the file represents the last valid byte of data in memory. Thus, a file can be
extended in memory prior to the data being written to disk.

Note that the CC_FILE_SIZES structure has precisely the same layout of the size fields as the
FSRTL_COMMON_FCB_HEADER structure has. Typically, a file system does not maintain a separate
CC_FILE_SIZES data structure but instead passes the address of the overlapping fields.

3.3 Cache Manager Callbacks


Interactions between the file system and the Cache Manager are mediated by a series of callback
functions. These callback functions are registered on a per-file basis with the Cache Manager, which
uses them to ensure that the file system's data structures are “locked” before it performs a file
system operation.

Windows NT assumes there is a strict ordering in how resources are acquired between the file system,
Cache Manager, and Memory Manager. If followed, this ordering will ensure that deadlocks do not occur.
Of course, if it is not followed, deadlocks can (and will) occur. Specifically, file system resources are
acquired first. Then Cache Manager resources are acquired. Finally, Memory Manager resources are
acquired.

Thus, these callbacks are used by the Cache Manager to honor this hierarchy. The callbacks required by
the Cache Manager are:

typedef BOOLEAN (*PACQUIRE_FOR_LAZY_WRITE) (


IN PVOID Context,
IN BOOLEAN Wait
);
typedef VOID (*PRELEASE_FROM_LAZY_WRITE) (
IN PVOID Context
);
typedef BOOLEAN (*PACQUIRE_FOR_READ_AHEAD) (
IN PVOID Context,
IN BOOLEAN Wait
);
typedef VOID (*PRELEASE_FROM_READ_AHEAD) (
IN PVOID Context
);

typedef struct _CACHE_MANAGER_CALLBACKS {
PACQUIRE_FOR_LAZY_WRITE AcquireForLazyWrite;
PRELEASE_FROM_LAZY_WRITE ReleaseFromLazyWrite;
PACQUIRE_FOR_READ_AHEAD AcquireForReadAhead;
PRELEASE_FROM_READ_AHEAD ReleaseFromReadAhead;
} CACHE_MANAGER_CALLBACKS, *PCACHE_MANAGER_CALLBACKS;

Note that the callbacks are used for two distinct parts of the Cache Manager. The first, the lazy writer, is
responsible for writing dirty cached data back to the file system. The second is for read ahead handling –
reading data prior to an actual call from the user to obtain that information.

First, in designing these it is important to note what you are protecting your file system against (and what
you aren’t.) There is no reason to serialize cached I/O operations from applications with I/O from the
Cache Manager’s lazy writer. However, you do need to protect against non-cached user I/O operations and
user operations that modify the size of the file.

The NT file systems do this by using two ERESOURCE structures. Both of these can also be used (and
located) by other components within the operating system by walking through the common header –
specifically the Resource and PagingIoResource fields within the common header. The Cache Manager
does not directly acquire these resources – instead it calls into the file system to acquire any necessary
resources (typically these resources).

Note that these routines must be provided by your file system – they are not optional and the system will
crash if you fail to provide them.

The following sample, taken from an older version of the OSR FSDK, shows one implementation of such a
callback routine:

static BOOLEAN OwAcquireForLazyWrite(PVOID Context, BOOLEAN Wait)


{
POW_FCB fcb = (POW_FCB) Context;
BOOLEAN result;
// Take out the lock on the file.
result = OwAcquireResourceExclusiveExp(&fcb->Resource, Wait);
if (!result) {
// We did not acquire the resource.
return (result);
}
// We did acquire the resource. We need to:
// (1) Store away the thread id of this thread (for the release)
// (2) Set top level irp to a pseudo value
// In both cases, the previous value should be zero.
OwAssert(!fcb->ResourceThread);
fcb->ResourceThread = OwGetCurrentResourceThread();
return (TRUE);
}
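
The sample above shows only the acquire side. The following user-mode sketch models an acquire and release pair so that the bookkeeping (lock state plus the saved owner) can be followed end to end. It is a single-threaded model under stated assumptions: SketchFcb and both functions are hypothetical, a boolean stands in for the ERESOURCE, and the model reports failure where a real resource would block when Wait is TRUE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-file context standing in for an FCB and its lock. */
typedef struct SketchFcb {
    bool        locked;       /* models the ERESOURCE being held */
    const void *ownerThread;  /* models the saved resource thread */
} SketchFcb;

/* Acquire callback: take the lock and remember who owns it. */
static bool SketchAcquireForLazyWrite(void *context, bool wait)
{
    SketchFcb *fcb = context;

    if (fcb->locked) {
        /* A real resource would block here when wait is true; this
           single-threaded model can only report failure. */
        (void) wait;
        return false;
    }

    fcb->locked = true;
    fcb->ownerThread = fcb;   /* placeholder for a thread identifier */

    return true;
}

/* Release callback: clear the saved owner, then drop the lock. */
static void SketchReleaseFromLazyWrite(void *context)
{
    SketchFcb *fcb = context;

    assert(fcb->locked && fcb->ownerThread != NULL);

    fcb->ownerThread = NULL;
    fcb->locked = false;
}
```

Clearing the saved owner in the release routine is what allows the acquire routine to assert, as the OSR sample does, that no owner is recorded when the lock is next taken.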

Each of the file systems in the Microsoft IFS kit also contains examples of routines like these. For the Lazy
Writer these routines are located in the following locations:

File System    File               Routine
FAT            resrcsup.c         FatAcquireFcbForLazyWrite
CDFS           resrcsup.c         CdAcquireForCache
RDR2           rxce\resrcsup.c    RxAcquireFcbForLazyWrite

Other similar routines (for read ahead for example) can be located in the same file.

3.4 CcCanIWrite
Because an application program can modify data in memory at a rate that exceeds the ability to
write the data to disk, the Virtual Memory system can "fill up" with data. This in turn can then
cause fatal out-of-memory conditions to occur within the VM system. To avoid this, the file
system must cooperate with the VM system to detect these conditions. One of the key operations
provided by the Cache Manager for this support is CcCanIWrite. The prototype for this call is:
NTKERNELAPI
BOOLEAN
CcCanIWrite (
IN PFILE_OBJECT FileObject,
IN ULONG BytesToWrite,
IN BOOLEAN Wait,
IN BOOLEAN Retrying
);

If this call returns FALSE then the FSD needs to delay actually writing dirty data into the cache in order to
avoid an out-of-memory condition. The typical symptom of such out of memory conditions is a stop code
of NO_PAGES_AVAILABLE.

The FSD must handle posting and subsequent retrying of the write operation. The FSD can post the write
either via an internal posting mechanism or by using the routine CcDeferWrite.

The routine FsRtlCopyWrite can be used by your FSD instead of accessing the cache directly. In this case,
deferring the I/O operation is handled internally within this function.
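
The check, post and retry flow described above can be modeled in user mode. The sketch below is not the kernel interface: the way CcCanIWrite actually reaches its decision is internal to the Cache Manager and is not described here, so the model simply assumes a fixed ceiling on dirty bytes. SketchCache, SketchCanIWrite, SketchCachedWrite and SketchFlush are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the cache: a dirty-byte count and a ceiling. */
typedef struct SketchCache {
    uint32_t dirtyBytes;
    uint32_t dirtyLimit;
} SketchCache;

typedef enum SketchWriteResult {
    SketchWriteDone,
    SketchWritePosted
} SketchWriteResult;

/* Stand-in for the CcCanIWrite decision: would this write exceed the limit? */
static bool SketchCanIWrite(const SketchCache *cache, uint32_t bytesToWrite)
{
    return bytesToWrite <= cache->dirtyLimit - cache->dirtyBytes;
}

/* Write path: copy into the cache when allowed, otherwise post for retry. */
static SketchWriteResult SketchCachedWrite(SketchCache *cache,
                                           uint32_t bytesToWrite)
{
    assert(cache->dirtyBytes <= cache->dirtyLimit);

    if (!SketchCanIWrite(cache, bytesToWrite)) {
        return SketchWritePosted; /* caller queues the request and retries */
    }

    cache->dirtyBytes += bytesToWrite;
    return SketchWriteDone;
}

/* Models dirty data being written back, which makes room for retries. */
static void SketchFlush(SketchCache *cache, uint32_t bytes)
{
    cache->dirtyBytes -= (bytes < cache->dirtyBytes) ? bytes : cache->dirtyBytes;
}
```

In this model a posted write succeeds on retry once enough dirty data has been flushed, which is the behavior an FSD's own posting mechanism (or CcDeferWrite) is meant to provide.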

3.5 CcCopyRead
Once a file system has established caching (via the CcInitializeCacheMap call) it uses either the FsRtl
routines (such as FsRtlCopyRead) or this routine. Typically, FsRtlCopyRead is used to implement the fast
I/O path for read and this routine is used to implement IRP_MJ_READ. The prototype for this call is:
NTKERNELAPI BOOLEAN CcCopyRead (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN ULONG Length,
IN BOOLEAN Wait,
OUT PVOID Buffer,
OUT PIO_STATUS_BLOCK IoStatus
);

The FileObject contains a pointer to the SectionObjectPointer structure that is to be used by the Cache
Manager when copying data from the cache into the user buffer (the Buffer argument provided here.) Thus,
there is an assumption here that caching has been previously initialized.

The Length indicates the length of the read operation. The Buffer is assumed to be large enough to contain
the amount of data being read.

The Wait parameter indicates if the caller is willing to block for an indeterminate period of time, such as
might be required if a lock must be acquired. This parameter should be viewed as a "hint" however, rather
than a guarantee. For example, if disk I/O is necessary to complete this operation the operation might
proceed, even if Wait is FALSE.

The Buffer refers to the caller-provided buffer. It need not be valid and in such a case this routine will raise
an exception. An FSD should trap that exception and return an error to the user application.

The IoStatus block will be set to indicate the completion status of the operation and the total number of
bytes read.

Note that the Cache Manager may be required to page-fault the data into the cache. In that case an FSD
will be re-entered in order to process the actual paging I/O operation.

3.6 CcCopyWrite
Once a file system has established caching (via the CcInitializeCacheMap call) it uses either the
FsRtl routines (such as FsRtlCopyWrite) or this routine. Typically, FsRtlCopyWrite is used to
implement the fast I/O path for write and this routine is used to implement IRP_MJ_WRITE. The
prototype for this call is:
NTKERNELAPI BOOLEAN CcCopyWrite (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN ULONG Length,
IN BOOLEAN Wait,
IN PVOID Buffer
);

The FileObject contains a pointer to the SectionObjectPointer structure that is to be used by the Cache
Manager when copying data from the user buffer (the Buffer argument provided here) into the cache. Thus,
there is an assumption here that caching has been previously initialized.

The Length indicates the length of the write operation. The Buffer is assumed to contain at least that many
bytes of data to be written.

The Wait parameter indicates if the caller is willing to block for an indeterminate period of time, such as
might be required if a lock must be acquired. This parameter should be viewed as a "hint" however, rather
than a guarantee. For example, if disk I/O is necessary to complete this operation the operation might
proceed, even if Wait is FALSE.

The Buffer refers to the caller-provided buffer. It need not be valid and in such a case this routine will raise
an exception. An FSD should trap that exception and return an error to the user application.

Note that this operation may require that a portion of a cached page be written. If that is the case, then the
contents of that page will be read from disk first and then the portion of the page that is being written will
be modified. Thus, it is possible for this call to cause re-entry into an FSD to process read page faults
against the file being modified.

Because data can be written to the VM system at a rate considerably faster than the rate at which it can be
written to disk, an FSD must implement a write-throttling mechanism, typically by using CcCanIWrite in
combination with CcDeferWrite. A failure to implement write throttling can cause the system to crash
with a stop code of NO_PAGES_AVAILABLE.

3.7 CcDeferWrite
In order to simplify the process of implementing write throttling within your file system, the Cache
Manager provides a simple mechanism for queuing write operations until the VM system can accommodate
them. This is done via a Deferred Write callback which your FSD registers with the Cache Manager when
the call CcCanIWrite returns FALSE.

The prototype for this callback function is:
typedef VOID (*PCC_POST_DEFERRED_WRITE) (
IN PVOID Context1,
IN PVOID Context2
);

The context pointers are typically specified by your file system as part of establishing the deferred write
processing. The prototype for CcDeferWrite is:
NTKERNELAPI VOID CcDeferWrite (
IN PFILE_OBJECT FileObject,
IN PCC_POST_DEFERRED_WRITE PostRoutine,
IN PVOID Context1,
IN PVOID Context2,
IN ULONG BytesToWrite,
IN BOOLEAN Retrying
);

The FileObject indicates the file to which the caller is attempting to write.

The PostRoutine is the FSD-provided callback function which will be called by the Cache Manager when
the VM state has changed so that additional writes can be allowed.

The Context1 and Context2 pointers are FSD-defined and will be passed to the FSD-provided callback
function once writing to the file is allowed.

The BytesToWrite argument indicates the number of bytes that are to be written to the file by this operation.
The VM system uses this information to determine if it has become "safe" to write (based upon the number
of available pages.)

The Retrying argument indicates whether or not this is the first attempt (Retrying is FALSE) or a subsequent
attempt (Retrying is TRUE.)
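
The interplay between CcCanIWrite and CcDeferWrite described above can be illustrated with a small user-mode model. This is only a sketch: every name beginning with toy_ is hypothetical, the dirty-page threshold and page size are assumptions, and the Wait and Retrying arguments are left out. It is not the Cache Manager's implementation; it only shows the control flow an FSD follows (ask, write or defer, retry from the callback).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096ul   /* assumed page size */
#define TOY_MAX_DEFERRED 8     /* arbitrary queue depth for the sketch */

/* Hypothetical callback type, shaped like PCC_POST_DEFERRED_WRITE. */
typedef void (*toy_post_routine)(void *context1, void *context2);

typedef struct {
    toy_post_routine post;
    void *context1;
    void *context2;
    unsigned long bytes_to_write;
} toy_deferred_write;

typedef struct {
    unsigned long dirty_pages;      /* pages currently dirty */
    unsigned long dirty_threshold;  /* assumed limit on dirty pages */
    toy_deferred_write queue[TOY_MAX_DEFERRED];
    size_t queued;
} toy_cache;

typedef struct {
    unsigned long bytes;
    bool completed;
} toy_write_request;

static unsigned long toy_pages_for(unsigned long bytes)
{
    return (bytes + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}

/* Stand-in for the CcCanIWrite decision: does the write fit under the limit? */
bool toy_can_i_write(const toy_cache *cache, unsigned long bytes_to_write)
{
    return cache->dirty_pages + toy_pages_for(bytes_to_write)
           <= cache->dirty_threshold;
}

/* Stand-in for CcDeferWrite: remember the write until there is room. */
void toy_defer_write(toy_cache *cache, toy_post_routine post,
                     void *context1, void *context2,
                     unsigned long bytes_to_write)
{
    assert(cache->queued < TOY_MAX_DEFERRED);
    toy_deferred_write *entry = &cache->queue[cache->queued++];
    entry->post = post;
    entry->context1 = context1;
    entry->context2 = context2;
    entry->bytes_to_write = bytes_to_write;
}

/* FSD side: the cached write itself (here it only dirties pages). */
static void toy_do_write(toy_cache *cache, toy_write_request *request)
{
    cache->dirty_pages += toy_pages_for(request->bytes);
    request->completed = true;
}

/* FSD side: post routine that retries the write when called back. */
static void toy_post_write(void *context1, void *context2)
{
    toy_do_write((toy_cache *)context1, (toy_write_request *)context2);
}

/* FSD side: write now if allowed, otherwise defer. Returns true if written. */
bool toy_write(toy_cache *cache, toy_write_request *request)
{
    if (toy_can_i_write(cache, request->bytes)) {
        toy_do_write(cache, request);
        return true;
    }
    toy_defer_write(cache, toy_post_write, cache, request, request->bytes);
    return false;
}

/* Stand-in for pages being flushed: post queued writes, in order, that fit. */
void toy_pages_cleaned(toy_cache *cache, unsigned long pages)
{
    size_t posted = 0;
    cache->dirty_pages -= (pages < cache->dirty_pages) ? pages
                                                       : cache->dirty_pages;
    while (posted < cache->queued &&
           toy_can_i_write(cache, cache->queue[posted].bytes_to_write)) {
        toy_deferred_write entry = cache->queue[posted++];
        entry.post(entry.context1, entry.context2);
    }
    for (size_t i = posted; i < cache->queued; i++) {
        cache->queue[i - posted] = cache->queue[i];
    }
    cache->queued -= posted;
}
```

In this model a write that would push the dirty-page count past the threshold is queued rather than performed, and it runs from its callback only after enough pages have been cleaned. A real FSD would pass its own IRP context through Context1 and Context2 in the same way.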

3.8 CcGetDirtyPages
This routine is listed here for completeness. It is used within file systems that take advantage of the
internal logging mechanism within Windows NT. It is not generally useful for file systems. The prototype
for this function is:
NTKERNELAPI LARGE_INTEGER CcGetDirtyPages (
IN PVOID LogHandle,
IN PDIRTY_PAGE_ROUTINE DirtyPageRoutine,
IN PVOID Context1,
IN PVOID Context2
);

3.9 CcGetFileObjectFromBcb
An individual Buffer Control Block includes within it a pointer to the file object that is being used by the
VM system to track the file cache information. The file object can thus be extracted from a given BCB,
should that be necessary. The prototype for this routine is:
NTKERNELAPI PFILE_OBJECT CcGetFileObjectFromBcb (
IN PVOID Bcb
);

3.10 CcGetFileObjectFromSectionPtrs
When caching is first established for the file, the Cache Manager uses the FileObject argument to
CcInitializeCacheMap to create the new section object that is used for caching the file data. So long as
cached data is maintained by the Cache Manager for that file, that original FileObject is used by the VM
System for all the various necessary I/O operations.

Given a SectionObjectPointer structure from an arbitrary FileObject, this routine can thus tell the file
system about the actual file object that is used by the VM system for the various necessary I/O operations.
The prototype for this call is:
NTKERNELAPI
PFILE_OBJECT
CcGetFileObjectFromSectionPtrs (
IN PSECTION_OBJECT_POINTERS SectionObjectPointer
);

An interesting side-effect of this implementation model (where the SectionObjectPointer field refers to a
particular section object that in turn refers to a particular file object) is that a given FileObject may remain
valid for a considerable period of time - far beyond the point when the file has been closed by the
application program.

3.11 CcGetLsnForFileObject
This routine is listed here for completeness. It is used within file systems that take advantage of the
internal logging mechanism within Windows NT. It is not generally useful for file systems. The prototype
for this function is:
NTKERNELAPI LARGE_INTEGER CcGetLsnForFileObject(
IN PFILE_OBJECT FileObject,
OUT PLARGE_INTEGER OldestLsn OPTIONAL
);

3.12 CcFastCopyRead
This routine can be used by file systems that do not support file offsets larger than 4GB. It is a
"replacement" call for CcCopyRead and can be used in an essentially identical fashion. The prototype for
this function is:
NTKERNELAPI VOID CcFastCopyRead (
IN PFILE_OBJECT FileObject,
IN ULONG FileOffset,
IN ULONG Length,
IN ULONG PageCount,
OUT PVOID Buffer,
OUT PIO_STATUS_BLOCK IoStatus
);

• The FileObject indicates the file being read.
• The FileOffset indicates the byte offset within the file at which the read begins.
• The Length indicates the number of bytes to be copied to the caller-supplied Buffer.
• The PageCount indicates the number of physical pages that are spanned by the caller-supplied
Buffer.
• The IoStatus contains the completion status of the read operation as well as the number of
bytes read.
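
As an illustration of what "pages spanned by the Buffer" means, the following user-mode sketch counts the pages touched by a byte range. The function name and the 4096-byte page size are assumptions made for the example; kernel-mode code would normally rely on the WDK macro ADDRESS_AND_SIZE_TO_SPAN_PAGES rather than a hand-written helper.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size */

/* Number of pages touched by the byte range [address, address + length). */
unsigned long toy_pages_spanned(uintptr_t address, unsigned long length)
{
    uintptr_t first_page;
    uintptr_t last_page;

    if (length == 0) {
        return 0;
    }
    /* The range must not wrap around the end of the address space. */
    assert(length - 1 <= UINTPTR_MAX - address);

    first_page = address / TOY_PAGE_SIZE;
    last_page = (address + length - 1) / TOY_PAGE_SIZE;
    return (unsigned long)(last_page - first_page + 1);
}
```

A page-sized buffer that is not page aligned spans two pages, which is why the count cannot be derived from Length alone.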

3.13 CcFastCopyWrite
This routine can be used by file systems that do not support file offsets larger than 4GB. It is a
"replacement" call for CcCopyWrite and is used in essentially identical fashion. The prototype for this
function is:
NTKERNELAPI VOID CcFastCopyWrite (
IN PFILE_OBJECT FileObject,
IN ULONG FileOffset,
IN ULONG Length,
IN PVOID Buffer
);

• The FileObject indicates the file being written.
• The FileOffset indicates the byte offset within the file at which the write begins.
• The Length indicates the number of bytes to be copied from the caller-supplied Buffer.
• As with CcCopyWrite, a file system using this routine must also implement write throttling
using CcCanIWrite.

3.14 CcFlushCache
This routine is used by an FSD to ensure that any dirty data presently being cached for the given file is
written back to disk. The prototype for this call is:
NTKERNELAPI VOID CcFlushCache (
IN PSECTION_OBJECT_POINTERS SectionObjectPointer,
IN PLARGE_INTEGER FileOffset OPTIONAL,
IN ULONG Length,
OUT PIO_STATUS_BLOCK IoStatus OPTIONAL
);

• This routine is used by an FSD to ensure that all dirty data is committed to disk.
• If the FileOffset parameter is NULL, the whole file is flushed.
• If the FileOffset parameter is set, the portion of the file from that offset and for Length bytes is
flushed.
• Note that this call can cause I/O operations and hence reenter the FSD. This call is typically
used by an FSD as part of its implementation of IRP_MJ_FLUSH_BUFFERS.

3.15 CcInitializeCacheMap
A cache map of a file is maintained by the Cache Manager to track the activities being performed for the
file. The first open instance of a file causes the generation of a public cache map (information shared
between the various open instances of the file.) In addition, each open instance of the file also has a private
cache map which tracks information specific to operations that are ongoing for that particular file object.
Normally, the creation of the cache maps is deferred until the first I/O. This ensures that the underlying file
system does not create and delete the cache maps for operations which entail no I/O operations, as such
operations are quite common for Win32 applications.

Once I/O is being performed on the given file, however, the underlying file system establishes the cache
map for the file in question. This is done via the CcInitializeCacheMap call:

NTKERNELAPI VOID CcInitializeCacheMap (
IN PFILE_OBJECT FileObject,
IN PCC_FILE_SIZES FileSizes,
IN BOOLEAN PinAccess,
IN PCACHE_MANAGER_CALLBACKS Callbacks,
IN PVOID LazyWriteContext
);

Most of these parameters are reasonably self-explanatory – the file sizes normally coming from the
common header, the Callbacks being a consistent set of functions defined by your file system. The
LazyWriteContext is the argument passed to the callback functions (the Context argument each of them
takes) so it allows you to specify what information will be passed back to your callback function for further
processing.

The PinAccess argument tells the Cache Manager whether the data represented by this memory region will
be locked (or pinned) in memory by the file system. The NT file systems use pinning when they memory
map their own data structures: to ensure that a particular critical data structure is resident in memory and
ineligible to be released, the Cache Manager allows a file system to pin the data in memory for the
duration of the critical operation. Buffer Control Blocks (BCBs) describe such pinned sections.

Thus, normally user data is not held for pinned access.

Note that the cache map is not normally initialized for files being accessed using unbuffered I/O operations
– files that were opened with the FILE_NO_INTERMEDIATE_BUFFERING bit specified. Once
CcInitializeCacheMap has been called by your file system, it is possible for you to receive fast I/O
operations for the file. The Cache Manager sets the PrivateCacheMap field in the file object to point to a
Cache Manager allocated data structure and the I/O Manager decides if the fast I/O path can be taken based
on the value in this field.

The following code sample is from an older version of the OSR FSDK:

NTSTATUS OwInitializeCacheMap(POW_IRP_CONTEXT IrpContext)
{
// Make sure that this thing can even be cached.
OwAssert(IrpContext->IrpSp->FileObject->SectionObjectPointer);
OwAssert(OwIsResourceAcquiredExclusive(&IrpContext->Fcb->Resource));
OwAssert(!IrpContext->FileObject->PrivateCacheMap);
OwAssert(IrpContext->Fcb->CommonHeader.AllocationSize.QuadPart >=
IrpContext->Fcb->CommonHeader.FileSize.QuadPart);

CcInitializeCacheMap(IrpContext->FileObject,
(PCC_FILE_SIZES) &IrpContext->Fcb->CommonHeader.AllocationSize,
FALSE, // access for pinning?
&OwCallbacks,
IrpContext->Fcb);
OSR_TRACE1(IrpContext);
return (STATUS_SUCCESS);
}

Each of the file systems in the Microsoft IFS kit also contains examples of cache map initialization:

File System    File               Routine

FAT            read.c             FatCommonRead
CDFS           write.c            CdCommonWrite
RDR2           rdbss\fileinfo.c   RxSetAllocationInfo

Other similar routines (for read ahead for example) can be located in the same file.

3.16 CcIsThereDirtyData
This routine is used to determine if there is any dirty data on the given physical media volume, as specified
by its VPB structure. The prototype for this routine is:
NTKERNELAPI
BOOLEAN
CcIsThereDirtyData (
IN PVPB Vpb
);

Data cached by the Cache Manager is described using Section Objects. In turn, a Section Object refers to
some File Object that backs it. That File Object indicates what volume it is located on (for physical media
file systems.) Thus, the Cache Manager can ascertain if a given physical media volume has any dirty data
stored on it by calling this routine.

File Systems that do not maintain a VPB structure, such as network file systems, cannot use this call.

3.17 CcMapData
This routine is used by a file system to build a mapping for data in such a fashion that it can be controlled
(via a BCB) by the file system. Typically, this is used by a file system that memory maps its own file
system data structures. The prototype for this call is:
NTKERNELAPI BOOLEAN CcMapData (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN ULONG Length,
IN BOOLEAN Wait,
OUT PVOID *Bcb,
OUT PVOID *Buffer
);

After calling this routine, the file system must then pin the buffer prior to actually accessing the data.
Accessing data mapped but not pinned in the cache may lead to unpredictable results.

3.18 CcMdlRead
This routine is typically used by a file system to obtain an MDL describing the cache buffer. Since an
MDL can only be used by a kernel-mode component, this is typically only used by kernel-mode
applications, such as a file server. By using an MDL describing the cache, however, the kernel-resident
code can avoid a data copy between a buffer and the cache. The prototype for this function is:
NTKERNELAPI VOID CcMdlRead (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN ULONG Length,
OUT PMDL *MdlChain,
OUT PIO_STATUS_BLOCK IoStatus
);

Typically, an FSD will use this routine to satisfy an IRP_MJ_READ request with a minor function of
IRP_MN_MDL. The MDL returned by this routine can then be subsequently released using either
CcMdlReadComplete or FsRtlMdlReadCompleteDev. Because the choice between these functions has
potential ramifications, developers should carefully consider their requirements before choosing one
function over the other.

The MDL is returned to the caller in the IRP (as the MdlAddress field.)

3.19 CcMdlReadComplete
This routine is used as the complement of CcMdlRead. The prototype for this function is:
NTKERNELAPI VOID CcMdlReadComplete (
IN PFILE_OBJECT FileObject,
IN PMDL MdlChain
);

An FSD uses this in response to an IRP_MJ_READ request with a minor function code of
IRP_MN_MDL_COMPLETE.

Note that in Windows NT 4.0 through Service Pack 3, this call is implemented by calling the Fast I/O entry
point MdlReadComplete. If this routine is not implemented, then FsRtlMdlReadCompleteDev is called. If
this routine is implemented its return value is ignored. This can cause problems with layered filter drivers,
such as the examples included in the Microsoft IFS Kit.

The MDL to be released is normally the one provided to the FSD as the MdlAddress field of the IRP.

3.20 CcMdlWriteComplete
This routine is used as the complement of CcMdlWrite. The prototype for this function is:
NTKERNELAPI VOID CcMdlWriteComplete (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN PMDL MdlChain
);

An FSD uses this in response to an IRP_MJ_WRITE request with a minor function code of
IRP_MN_MDL_COMPLETE.

Note that in Windows NT 4.0 through Service Pack 3, this call is implemented by calling the Fast I/O entry
point MdlWriteComplete. If this routine is not implemented, then FsRtlMdlWriteCompleteDev is called. If
this routine is implemented its return value is ignored. This can cause problems with layered filter drivers,
such as the examples included in the Microsoft IFS Kit.

The MDL to be released is normally the one provided to the FSD as the MdlAddress field of the IRP.

3.21 CcPinMappedData
This call is used to ensure that data mapped into memory via a call to CcMapData is pinned in memory so
that it can be used by the file system. The prototype for this call is:
NTKERNELAPI BOOLEAN CcPinMappedData (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN ULONG Length,
IN BOOLEAN Wait,
IN OUT PVOID *Bcb
);

The FileObject, FileOffset, and Length arguments identify the specifics of what is being pinned in memory.
The Wait parameter indicates if the caller is willing to block (for synchronization objects) while making
this call. If Wait is FALSE and locks cannot be immediately acquired, then this call will return FALSE to
the caller, who may attempt the call at a later time and/or in a different context.

The Bcb argument is the BCB pointer returned to the FSD via the earlier call to CcMapData. The FSD is
responsible for releasing this BCB once it has finished accessing the pinned data.

3.22 CcPinRead
This call is used to read, map, and pin data into the cache in a single operation. Its prototype is:
NTKERNELAPI BOOLEAN CcPinRead (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN ULONG Length,
IN BOOLEAN Wait,
OUT PVOID *Bcb,
OUT PVOID *Buffer
);

Functionally, this is equivalent to calling CcMapData and CcPinMappedData via a single operation.

The FSD must unpin the data once the buffer is no longer needed.

3.23 CcPrepareMdlWrite
This routine is typically used by a file system to obtain an MDL describing the cache buffer. Since an
MDL can only be used by a kernel-mode component, this is typically only used by kernel-mode
applications, such as a file server. By using an MDL describing the cache, however, the kernel-resident
code can avoid a data copy between a buffer and the cache. The prototype for this function is:
NTKERNELAPI VOID CcPrepareMdlWrite (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN ULONG Length,
OUT PMDL *MdlChain,
OUT PIO_STATUS_BLOCK IoStatus
);

Typically, an FSD will use this routine to satisfy an IRP_MJ_WRITE request with a minor function of
IRP_MN_MDL. Note that because the data is to be written, it is only read from disk if necessary. It would
be necessary if, for instance, the offset and length indicate that only a portion of a physical memory page
will be modified. The data currently in that area of the file is fetched from disk. Because of this, data need
not be present in the buffer when this call returns.

3.24 CcPreparePinWrite
This call is used to map and pin, in a single operation, data in the cache that is subsequently going to be
modified. Its prototype is:
NTKERNELAPI BOOLEAN CcPreparePinWrite (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN ULONG Length,
IN BOOLEAN Zero,
IN BOOLEAN Wait,
OUT PVOID *Bcb,
OUT PVOID *Buffer
);

Because the pages are going to be modified, they need not be read from disk, except when a portion of a
page is being modified. The caller is responsible for releasing the Bcb once the modifications have been
made to the data.

3.25 CcPurgeCacheSection
This routine is used by an FSD to attempt to purge any mappings of the pages within the cache. Data being
purged is discarded from memory. If the data had been modified prior to the purge operation, the updates
to the data are lost. The prototype for this function is:
NTKERNELAPI BOOLEAN CcPurgeCacheSection (
IN PSECTION_OBJECT_POINTERS SectionObjectPointer,
IN PLARGE_INTEGER FileOffset OPTIONAL,
IN ULONG Length,
IN BOOLEAN UninitializeCacheMaps
);

The SectionObjectPointer structure identifies the cache data structures to use as part of this operation. The
FileOffset is a pointer to a variable containing the file offset where the purge should begin. The Length
indicates the number of bytes to purge, beginning with the FileOffset. The UninitializeCacheMaps
argument indicates that all file objects that are maintaining private cache map information must be
uninitialized prior to the actual purge operation taking place.

If the section objects specified by the SectionObjectPointer structure are in use to map the file in anything
other than the cache itself, this call will fail. Typically, this is the case when a file has been memory
mapped by an application program and hence cannot be purged so long as those mappings persist.

The FileOffset and Length parameters interact to advise the Cache Manager what should be purged.

FileOffset    Length       Effect

NULL          Any Value    Length is ignored and the whole file is purged.
Any Value     0            The file is purged from the byte indicated by FileOffset through the end of file.
Any Value     Any Value    The file is purged beginning with the byte indicated by FileOffset for Length bytes.

Note that the FSD must be able to handle the case where this routine returns FALSE, as this will be the
case under certain circumstances. In such a case, it is not possible to purge the data in the cache.
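
The FileOffset and Length rules in the table above can be sketched as a small user-mode helper that works out which bytes a purge request covers. The names toy_range and toy_purge_range are hypothetical, a zero Length is taken to mean "through the end of file", and clamping the range to the file size is an assumption of this sketch rather than documented Cache Manager behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int64_t start;  /* first byte purged */
    int64_t end;    /* one past the last byte purged */
} toy_range;

/* file_offset == NULL models a NULL FileOffset argument. */
toy_range toy_purge_range(const int64_t *file_offset, uint32_t length,
                          int64_t file_size)
{
    toy_range range;

    assert(file_size >= 0);
    if (file_offset == NULL) {
        /* Length is ignored: the whole file is purged. */
        range.start = 0;
        range.end = file_size;
        return range;
    }

    assert(*file_offset >= 0);
    range.start = *file_offset;
    /* A zero length means "from FileOffset through the end of file". */
    range.end = (length == 0) ? file_size : *file_offset + (int64_t)length;

    /* Assumption: nothing beyond the end of file is purged. */
    if (range.end > file_size) {
        range.end = file_size;
    }
    if (range.start > range.end) {
        range.start = range.end;
    }
    return range;
}
```

Whatever range is requested, the caller still has to check the BOOLEAN result of the real call, since the purge can fail while the file is mapped elsewhere.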

3.26 CcRepinBcb
This routine is used by an FSD to increment the reference count on a previously created BCB. The
prototype for this call is:
NTKERNELAPI VOID CcRepinBcb (
IN PVOID Bcb
);

An FSD may find that it is necessary to use a previously created buffer control block. In such
circumstances this routine is used to ensure that the Bcb remains valid for the duration of the operation.
The FSD is responsible for releasing that reference count using the CcUnpinRepinnedBcb call.

3.27 CcSetAdditionalCacheAttributes
This routine is used by an FSD to enable or disable read ahead and write behind for a given file. The
prototype for this call is:
NTKERNELAPI
VOID
CcSetAdditionalCacheAttributes (
IN PFILE_OBJECT FileObject,
IN BOOLEAN DisableReadAhead,
IN BOOLEAN DisableWriteBehind
);

The FileObject is the file for which the additional attributes are to be established. The Cache Manager uses
this information to determine the behavior of the cache when the file is being accessed. Thus, if
DisableReadAhead is TRUE the Cache Manager will not perform read ahead for I/O operations on
this particular file. Similarly, if DisableWriteBehind is TRUE the Cache Manager will not defer the writing
of dirty data. Instead, writes will be done through the cache (so the data is available for subsequent reads)
but the write does not complete until such time as the data is on the disk.

This impacts the behavior of the Cache Manager calls CcCopyRead and CcCopyWrite.

3.28 CcSetBcbOwnerPointer
This routine sets the "owner" of the ERESOURCE embedded within a buffer control block. The Cache
Manager uses this information to determine which thread owns that resource. While not generally useful,
there are odd circumstances under which that ERESOURCE might be obtained on behalf of one thread and
be released by a different thread. The prototype for this call is:
NTKERNELAPI VOID CcSetBcbOwnerPointer (
IN PVOID Bcb,
IN PVOID OwnerPointer
);

The Bcb argument indicates the BCB containing the ERESOURCE in question. The OwnerPointer is a
pointer to the ETHREAD structure of the thread that is the new owner.

3.29 CcSetDirtyPageThreshold
An FSD may limit the total amount of dirty data the Cache Manager will maintain for a given file. The
prototype for this call is:
NTKERNELAPI VOID CcSetDirtyPageThreshold (
IN PFILE_OBJECT FileObject,
IN ULONG DirtyPageThreshold
);

Once the number of dirty pages being cached for a particular file exceeds the DirtyPageThreshold
subsequent writes will block as data is flushed from the cache to disk. Once the number of dirty pages has
dropped below the threshold, new writes are allowed to proceed.

There is no requirement that an FSD set this limit. The default is to allow the Cache Manager and Memory
Manager to control the write-behind policy.

3.30 CcSetDirtyPinnedData
This call is used to indicate that data in a cache memory region described by a previously pinned BCB
should be marked dirty, whether or not any changes were made to that data. The prototype for this call is:
NTKERNELAPI VOID CcSetDirtyPinnedData (
IN PVOID Bcb,
IN PLARGE_INTEGER Lsn OPTIONAL
);

The Lsn parameter should be passed as NULL for file systems not taking advantage of the Windows NT
log mechanism.

This routine could be used by an FSD to force a data region to be written to disk even though it had not
been modified.

3.31 CcSetFileSizes
The Cache Manager relies upon the FSD to advise it whenever the size of a file actually changes. This
routine is used by an FSD to indicate to the Cache Manager that a file size is changing. The prototype for
this call is:
NTKERNELAPI VOID CcSetFileSizes (
IN PFILE_OBJECT FileObject,
IN PCC_FILE_SIZES FileSizes
);

The FileObject identifies the specific file that is changing size. The FileSizes indicate the new file sizes.
The CC_FILE_SIZES data structure is related to the file size information contained within the common
header structure used by all Windows NT file systems.

Of these sizes provided by an FSD, the two critical sizes are the AllocationSize and FileSize of the file. The
AllocationSize is the maximum amount of data that may be stored in the allocated space and is used by the
VM system to indicate the size of the section describing that file. The FileSize indicates the amount of data
currently present within the file. This is used by the VM system to indicate the size of the mapped view
describing that file.

The AllocationSize must be at least as large as the FileSize.
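
The relationship between the two sizes can be illustrated with a user-mode sketch. The toy_file_sizes structure is a stand-in that mirrors the three fields of CC_FILE_SIZES using plain 64-bit integers, and the cluster-size rounding is an assumption about how a disk-based FSD might derive an allocation size; neither is taken from a particular file system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for CC_FILE_SIZES, using plain 64-bit integers. */
typedef struct {
    int64_t allocation_size;    /* space reserved for the file */
    int64_t file_size;          /* bytes currently in the file */
    int64_t valid_data_length;  /* carried along, not checked here */
} toy_file_sizes;

/* Round a file size up to a whole number of clusters (assumed policy). */
int64_t toy_allocation_for(int64_t file_size, int64_t cluster_size)
{
    assert(file_size >= 0 && cluster_size > 0);
    return ((file_size + cluster_size - 1) / cluster_size) * cluster_size;
}

/* The check an FSD might assert before handing sizes to the Cache Manager. */
bool toy_file_sizes_consistent(const toy_file_sizes *sizes)
{
    return sizes->file_size >= 0 &&
           sizes->allocation_size >= sizes->file_size;
}
```

A check of this kind is cheap to run in a debug build before each size update.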

3.32 CcSetLogHandleForFile
This routine is listed here for completeness. It is used within file systems that take advantage of the
internal logging mechanism within Windows NT. It is not generally useful for file systems. The prototype
for this function is:
NTKERNELAPI VOID CcSetLogHandleForFile (
IN PFILE_OBJECT FileObject,
IN PVOID LogHandle,
IN PFLUSH_TO_LSN FlushToLsnRoutine
);

3.33 CcSetReadAheadGranularity
This routine is used by an FSD to control the read-ahead policy of the Cache Manager. The prototype for
this function is:
NTKERNELAPI VOID CcSetReadAheadGranularity (
IN PFILE_OBJECT FileObject,
IN ULONG Granularity
);

The default read-ahead size used by the Windows NT Cache Manager is 4K, although it appears that all the
Windows NT file systems set their own default to be 64K.

Granularity must be 2^N * PAGE_SIZE, for N ≥ 0. Otherwise, your results will be unpredictable.

Note that the Memory Manager has a hard-coded limitation of 64KB when reading from disk drives. Thus,
even if your FSD establishes a read-ahead size larger than 64KB it will be satisfied via a series of 64KB
read-ahead units.
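
The granularity rule and the 64KB limit can be checked with a short user-mode sketch. The function names and the 4096-byte page size are assumptions for illustration; an FSD could apply a similar test before passing a value to CcSetReadAheadGranularity.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096ul           /* assumed page size */
#define TOY_MAX_READ  (64ul * 1024ul)  /* the 64KB limit described above */

/* True when granularity is PAGE_SIZE multiplied by a power of two. */
bool toy_granularity_is_valid(unsigned long granularity)
{
    unsigned long pages;

    if (granularity < TOY_PAGE_SIZE || granularity % TOY_PAGE_SIZE != 0) {
        return false;
    }
    pages = granularity / TOY_PAGE_SIZE;
    return (pages & (pages - 1)) == 0;  /* exactly one bit set */
}

/* Number of reads of at most 64KB needed to cover one read-ahead. */
unsigned long toy_read_units(unsigned long granularity)
{
    assert(toy_granularity_is_valid(granularity));
    return (granularity + TOY_MAX_READ - 1) / TOY_MAX_READ;
}
```

For example, a 128KB granularity is valid but would be satisfied by two 64KB reads, while 12KB (three pages) is rejected because three is not a power of two.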

3.34 CcUninitializeCacheMap
NTKERNELAPI BOOLEAN CcUninitializeCacheMap (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER TruncateSize OPTIONAL,
IN PCACHE_UNINITIALIZE_EVENT UninitializeCompleteEvent OPTIONAL
);

Uninitializing the cache map is normally done when the file object has been closed by the user application
– as the result of an IRP_MJ_CLEANUP request arriving in the underlying file system. At that time there
are certain operations which must be performed in order to make sure the cache is torn down properly.

For a normal file, the two optional parameters are omitted. The code for this would look like:

CcUninitializeCacheMap(FileObject, NULL, NULL);

Literally, this is a request to cease caching on behalf of this file. If this function returns TRUE, this is the
last open instance of this file and the Cache Manager has deleted the shared cache map (otherwise, the
shared cache map is still in use by other open instances of the file.)

There are a few interesting side effects for this function:

If the file is being deleted, the TruncateSize parameter will point to a LARGE_INTEGER containing the
truncated size of the file (typically zero.) This will tell the Cache Manager that any dirty data associated
with this file need not be written back to disk – although there is no guarantee, since the Memory Manager
might decide to write it back independently of the Cache Manager.

If the file system wishes to block for the cache map to be torn down, it can optionally provide an event that
can be used to wait for the final destruction of the shared cache map.

This routine may be safely called for all file objects, even for those file objects for which the file system did
not call CcInitializeCacheMap.

3.35 CcUnpinData
This routine is used to release a previously pinned BCB. The prototype for this call is:
NTKERNELAPI VOID CcUnpinData (
IN PVOID Bcb
);

For each call made by an FSD to the routines CcPinRead, CcPreparePinWrite, and CcPinMappedData, this
call releases the pinning done on the given Bcb. When the reference count on the Bcb drops to zero, it can
be freed by the Cache Manager so that the range in the Cache Manager's address space can be reused.

3.36 CcUnpinDataForThread
This routine is used to allow a thread, other than the one that initially acquired the BCB, to release the
ERESOURCE within the BCB. The prototype for this call is:
NTKERNELAPI
VOID
CcUnpinDataForThread (
IN PVOID Bcb,
IN ERESOURCE_THREAD ResourceThreadId
);

3.37 CcUnpinRepinnedBcb
This call is used by an FSD to release a BCB previously pinned by a call to CcRepinBcb. The prototype
for this call is:
NTKERNELAPI VOID CcUnpinRepinnedBcb (
IN PVOID Bcb,
IN BOOLEAN WriteThrough,
OUT PIO_STATUS_BLOCK IoStatus
);

The WriteThrough option indicates if the FSD wishes to ensure that any dirty data in the region of the file
described by the BCB be committed to disk prior to completion of this call. If WriteThrough is TRUE
then upon completion of this routine the IoStatus will be set to indicate the results of any write operations.
If WriteThrough is FALSE the dirty data will be written at a later time by the Cache Manager's Lazy
Writer.

3.38 CcZeroData
This routine is used by an FSD to ensure that a range within memory is set to zero. The prototype for this
call is:
NTKERNELAPI BOOLEAN CcZeroData (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER StartOffset,
IN PLARGE_INTEGER EndOffset,
IN BOOLEAN Wait
);

Typically, an FSD uses this routine to zero new data areas within a file so that any detritus left from
previous uses of that memory is obliterated and not available to application programs.

The FileObject indicates which file is to be zeroed, while the StartOffset and EndOffset indicate the range
within the file that is to be zeroed. The Wait parameter indicates if the caller is willing to wait while any
necessary synchronization objects are acquired.

In general, this call does not result in disk I/O. If the offsets specified are not on even page boundaries, a
page fault will be triggered to fetch the page so that the data not being modified will be preserved properly.
Additionally, if the FileObject indicates the file was opened with write-through semantics, then as the
pages are zeroed they will be written back to disk.
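
The page-boundary behavior described above comes down to simple arithmetic. The following sketch assumes a 4096-byte page and treats the range as [start, end); both are assumptions made for illustration, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size for this sketch */

/* True when zeroing [start, end) touches a page only partially, so the
 * untouched bytes of that page must be faulted in and preserved. */
static bool zero_range_needs_fault(uint64_t start, uint64_t end)
{
    return (start % TOY_PAGE_SIZE) != 0 || (end % TOY_PAGE_SIZE) != 0;
}

/* Number of whole pages inside [start, end) that can be zeroed outright. */
static uint64_t whole_pages_in_range(uint64_t start, uint64_t end)
{
    uint64_t first = (start + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE; /* round up */
    uint64_t last  = end / TOY_PAGE_SIZE;                         /* round down */
    return last > first ? last - first : 0;
}
```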

3.39 CcZeroEndOfLastPage
This routine is used to zero the portion of the last page of the file between the last valid byte (the end of
file) and the end of the physical page. The prototype for this call is:
NTKERNELAPI VOID CcZeroEndOfLastPage(
IN PFILE_OBJECT FileObject
);

This ensures that if the file is extended in size, no data from previous usage of the memory becomes
accessible. This call is not normally used by an FSD, but is included here for completeness.
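
As an illustration of the region involved, the number of bytes between end of file and the end of its last page can be computed as below. The 4096-byte page size and the function name are assumptions for this sketch.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size for this sketch */

/* Bytes between the end of file and the end of the page that holds it.
 * A file that ends exactly on a page boundary has no tail to zero. */
static uint32_t tail_bytes_to_zero(uint64_t file_size)
{
    uint32_t used = (uint32_t)(file_size % TOY_PAGE_SIZE);
    return used == 0 ? 0 : TOY_PAGE_SIZE - used;
}
```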

4 File System Runtime Library
The File System Run-time Library is actually part of the standard Windows NT operating system image
(ntoskrnl.exe or ntkrnlmp.exe) and provides a base library package of functions that can be used by
Windows NT file systems to implement their functionality. Of course, individual file systems might
choose to implement these routines “on their own.”

There are two key reasons why it is better to use this package directly:

• Using it ensures your file system is “bug for bug” compatible.

• Using it is far easier than building it yourself.

4.1 Byte Range Locks


Windows NT supports a comprehensive byte range lock mechanism that is used by application programs to
arbitrate access between them to a shared file. Byte range locks are acquired in either a shared or exclusive
mode, and are acquired on a range of bytes within the file.

Windows NT file systems are responsible for enforcing such byte range locks for user I/O. It is inadvisable
to enforce them for paging I/O, since I/O from the lazy writer might be disallowed – and that would cause
data loss.

4.1.1 FILE_LOCK_INFO
The following data structure is used by the lock package to track an individual byte range lock. Since it is
possible for an FSD to enumerate these locks, this otherwise internal structure is exposed. Normally, it is
not used by an FSD.

typedef struct {
LARGE_INTEGER StartingByte;
LARGE_INTEGER Length;
BOOLEAN ExclusiveLock;
ULONG Key;
PFILE_OBJECT FileObject;
PVOID ProcessId;
LARGE_INTEGER EndingByte;
} FILE_LOCK_INFO, *PFILE_LOCK_INFO;

4.1.2 PCOMPLETE_LOCK_IRP_ROUTINE
In order to allow support for file systems performing distributed locking, such as is the case for a network
file system (a redirector) the lock package provides two callback routines. These routines can be used to
pass the lock request call onwards once it can be locally granted.

This routine is optionally specified by an FSD so that once the lock is compatible with current local lock
state it can be passed (by the FSD) to a secondary lock mechanism, such as a distributed lock manager.
typedef NTSTATUS (*PCOMPLETE_LOCK_IRP_ROUTINE) (
IN PVOID Context,
IN PIRP Irp);

The Irp is the lock request Irp in question. Note that it is only a lock request – not an unlock request
(unlock requests are handled by the PUNLOCK_ROUTINE).

The Context value allows the FSD to identify what precisely is being locked. Note that the FileObject for
the file being locked is located in the IRP, as well, so essentially everything needed is provided here.

4.1.3 PUNLOCK_ROUTINE
For file systems that require they be advised once the lock has been released, this callback function
provides the information needed to complete the release of the lock.
typedef VOID (*PUNLOCK_ROUTINE) (
IN PVOID Context,
IN PFILE_LOCK_INFO FileLockInfo);

Note that the Context argument here is typically used by the FSD to identify what is being unlocked. The
FileLockInfo argument describes the lock that is being released.

See also Section 4.1.2, as it describes the complementary function here for handling the lock request.

4.1.4 FILE_LOCK
This data structure describes information about lock state for a given file. The storage for this structure is
provided by an FSD, although it need only be allocated once byte range locking is started on the file in
question.

Since byte range locking is relatively rare, deferring allocation of this structure until needed can save
considerable memory, with the precise amount depending upon how many files are opened at any given
time. This structure can be allocated from paged memory, as the lock package consists of pageable code.
typedef struct _FILE_LOCK {
PCOMPLETE_LOCK_IRP_ROUTINE CompleteLockIrpRoutine;
PUNLOCK_ROUTINE UnlockRoutine;
BOOLEAN FastIoIsQuestionable;
BOOLEAN Spare[3];
PVOID LockInformation;
FILE_LOCK_INFO LastReturnedLock;
LIST_ENTRY GrantedLocks;
LIST_ENTRY WaitingLocks;
LONG LocksOutstanding;
} FILE_LOCK, *PFILE_LOCK;

4.1.5 FsRtlInitializeFileLock
This routine is used to initialize a FILE_LOCK data structure (See Section 4.1.4 for information about this
structure.) Normally, it is called for the first request to allocate a byte range lock against a file.
NTKERNELAPI VOID FsRtlInitializeFileLock (
IN PFILE_LOCK FileLock,
IN PCOMPLETE_LOCK_IRP_ROUTINE CompleteLockIrpRoutine OPTIONAL,
IN PUNLOCK_ROUTINE UnlockRoutine OPTIONAL
);

The FSD is responsible for allocating the storage for the FileLock structure.

If provided, the CompleteLockIrpRoutine is used to pass any lock IRPs back to the file system for further
processing. See Section 4.1.2 for more information.

If provided, the UnlockRoutine is used to advise the FSD when a byte range lock is being released. See
Section 4.1.3 for additional information. Note that this routine is called multiple times in cases where a
single IRP releases multiple locks.

4.1.6 FsRtlUninitializeFileLock
This routine is called to clean up any remaining lock state. It frees all file locks and completes any
outstanding queued lock requests.

NTKERNELAPI VOID FsRtlUninitializeFileLock (
IN PFILE_LOCK FileLock
);

The FileLock argument is the file lock information for the given file.

4.1.7 FsRtlProcessFileLock
Normally, this routine is used by an FSD to process a file lock.

NTKERNELAPI NTSTATUS FsRtlProcessFileLock (
IN PFILE_LOCK FileLock,
IN PIRP Irp,
IN PVOID Context OPTIONAL
);

The FileLock argument represents the file lock state for this file.

The Irp argument represents the lock control request.

The Context argument is passed to the CompleteLockIrpRoutine or UnlockRoutine specified when the
FileLock structure was initialized. See Sections 4.1.2 and 4.1.3 for additional information about these
routines. This argument is not needed if those routines were not provided.

The FSD passes all lock control requests, regardless of whether they are requests to lock or unlock the file,
to this routine. This makes the implementation of byte range lock handling straightforward for an NT
FSD.

4.1.8 FsRtlCheckLockForReadAccess
Since byte range locks on Windows NT are enforced locks, the FSD must actually validate that a user level
read request is compatible with the existing lock state. This routine is used by the FSD to check a user read
request to ensure it is compatible with existing locks:

NTKERNELAPI BOOLEAN FsRtlCheckLockForReadAccess (
IN PFILE_LOCK FileLock,
IN PIRP Irp
);

Note that an FSD should not use this to enforce byte range locking on paging I/O operations, as such
operations are generated on behalf of the Cache Manager (where the file data is cached). The disadvantage
to this is that updates made by processes mapping the file do not obey byte range locks.

• The FileLock argument represents the file lock state for the given file being read.
• The Irp argument represents the IRP_MJ_READ request being processed by the file system.

4.1.9 FsRtlCheckLockForWriteAccess
Since byte range locks on Windows NT are enforced locks, the FSD must actually validate that a user level
write request is compatible with the existing lock state. This routine is used by the FSD to check a user
write request to ensure it is compatible with existing locks:

NTKERNELAPI BOOLEAN FsRtlCheckLockForWriteAccess (
IN PFILE_LOCK FileLock,
IN PIRP Irp
);

Note that an FSD should not use this to enforce byte range locking on paging I/O operations, as such
operations are generated on behalf of the Cache Manager (where the file data is cached). The disadvantage
to this is that updates made by processes mapping the file do not obey byte range locks.

• The FileLock argument represents the file lock state for the given file being written.
• The Irp argument represents the IRP_MJ_WRITE request being processed by the file system.
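
To make "compatible with the existing lock state" concrete, here is a user-mode sketch of the read and write checks. It is not the lock package's code. It assumes the usual Win32 rules: a read conflicts only with another owner's exclusive lock, while a write conflicts with any shared lock and with another owner's exclusive lock. The toy_lock type and the integer owner (standing in for FileObject, ProcessId and Key) are hypothetical, and offset overflow is ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one granted lock; not the real FILE_LOCK_INFO. */
typedef struct {
    uint64_t start;      /* first locked byte */
    uint64_t length;     /* number of locked bytes (non-zero here) */
    bool     exclusive;  /* true = exclusive, false = shared */
    int      owner;      /* stands in for FileObject + ProcessId + Key */
} toy_lock;

static bool ranges_overlap(uint64_t s1, uint64_t l1, uint64_t s2, uint64_t l2)
{
    return s1 < s2 + l2 && s2 < s1 + l1;
}

/* A read conflicts only with an exclusive lock held by someone else. */
static bool toy_check_read(const toy_lock *locks, size_t n,
                           uint64_t start, uint64_t length, int owner)
{
    for (size_t i = 0; i < n; i++) {
        if (locks[i].exclusive && locks[i].owner != owner &&
            ranges_overlap(locks[i].start, locks[i].length, start, length)) {
            return false;
        }
    }
    return true;
}

/* A write conflicts with any overlapping shared lock, and with an
 * overlapping exclusive lock held by someone else. */
static bool toy_check_write(const toy_lock *locks, size_t n,
                            uint64_t start, uint64_t length, int owner)
{
    for (size_t i = 0; i < n; i++) {
        if (!ranges_overlap(locks[i].start, locks[i].length, start, length)) {
            continue;
        }
        if (!locks[i].exclusive || locks[i].owner != owner) {
            return false;
        }
    }
    return true;
}
```

An FSD would fail the user request when the corresponding check returns FALSE, and would skip the check entirely for paging I/O.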

4.1.10 FsRtlFastCheckLockForRead
This routine is normally used by an FSD to validate that the locking state for a file is consistent when the
FSD is processing a fast I/O read operation.

NTKERNELAPI BOOLEAN FsRtlFastCheckLockForRead (
IN PFILE_LOCK FileLock,
IN PLARGE_INTEGER StartingByte,
IN PLARGE_INTEGER Length,
IN ULONG Key,
IN PFILE_OBJECT FileObject,
IN PVOID ProcessId
);

• The FileLock argument represents the file lock state for the given file being read.
• The StartingByte information indicates the first byte being read.
• The Length information indicates the total number of bytes being read.
• The Key identifies the matching lock key (specified when the lock was first acquired.)
• The FileObject identifies the file being read.
• The ProcessId identifies the process performing the I/O operation.
• Typically, this is handled by the FSD, either in its Fast I/O read routine, or in its Fast I/O
check if possible routine, depending upon how the FSD is written to handle fast I/O
operations.
• The Key is not FSD generated – it is provided (optionally) by applications that perform locking
on behalf of multiple clients. This would be the case for a subsystem or a file server.
Normally, in Windows NT, we observe this used by the LAN Manager file server driver (SRV).

4.1.11 FsRtlFastCheckLockForWrite
This routine is normally used by an FSD to validate that the locking state for a file is consistent when the
FSD is processing a fast I/O write operation.

NTKERNELAPI BOOLEAN FsRtlFastCheckLockForWrite (
IN PFILE_LOCK FileLock,
IN PLARGE_INTEGER StartingByte,
IN PLARGE_INTEGER Length,
IN ULONG Key,
IN PVOID FileObject,
IN PVOID ProcessId
);

• The FileLock argument represents the file lock state for the given file being written.
• The StartingByte information indicates the first byte being written.
• The Length information indicates the total number of bytes being written.
• The Key identifies the matching lock key (specified when the lock was first acquired.)
• The FileObject identifies the file being written.
• The ProcessId identifies the process performing the I/O operation.
• Typically, this is handled by the FSD, either in its Fast I/O write routine, or in its Fast I/O
check if possible routine, depending upon how the FSD is written to handle fast I/O
operations.
• The Key is not FSD generated – it is provided (optionally) by applications that perform locking
on behalf of multiple clients. This would be the case for a subsystem or a file server.
Normally, in Windows NT, we observe this used by the LAN Manager file server driver (SRV).

4.1.12 FsRtlGetNextFileLock
Normally, this routine is not used by an FSD, although it might be used during debugging.

NTKERNELAPI PFILE_LOCK_INFO FsRtlGetNextFileLock (
IN PFILE_LOCK FileLock,
IN BOOLEAN Restart
);

• The FileLock argument represents the file lock state for the given file.
• The Restart argument indicates whether the enumeration starts from the first lock (TRUE) or continues
from the previously returned lock (FALSE).

This would be used by an FSD attempting to enumerate all of the locks held for a file. An example of an
enumeration would be:

for (lock = FsRtlGetNextFileLock( FileLock, TRUE );
     lock != NULL;
     lock = FsRtlGetNextFileLock( FileLock, FALSE )) {
    // lock processing code goes here
}

Keep in mind that this routine maintains state across calls, so the FSD is responsible for
synchronization of access!

4.1.13 FsRtlFastUnlockSingle
This routine is used to release a single byte range lock, as a result of a fast I/O call.

NTKERNELAPI NTSTATUS FsRtlFastUnlockSingle (
IN PFILE_LOCK FileLock,
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN PLARGE_INTEGER Length,
IN PEPROCESS ProcessId,
IN ULONG Key,
IN PVOID Context OPTIONAL,
IN BOOLEAN AlreadySynchronized
);

• The FileLock argument represents the file lock state for the given file.
• The FileObject identifies the file being unlocked.
• The FileOffset information indicates the first byte of the lock range being released.
• The Length information indicates the total number of bytes in the range to unlock.
• The ProcessId identifies the process performing the unlock operation.
• The Key is a unique 32 bit value used by the process to identify the true holder of the lock.
This is used by processes that perform proxy lock operations, such as subsystems and file
servers. The FSD does not generate this value – it is created by the caller.
• The Context argument is passed to the PUNLOCK_ROUTINE if one was specified. See
Section 4.1.3 and Section 4.1.5 for additional information about this optional unlock routine.
• The AlreadySynchronized argument is not used. The FSD is always responsible for
synchronization.

This routine is used to release a single byte range lock, and can be used to implement the Fast I/O unlock
routine. Note that the range being unlocked must exactly fit the description of the byte range lock provided
by these arguments – it is not possible to release a portion of a locked range.
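
The exact-match rule can be sketched in user mode as follows. The toy_lock type, the integer owner (standing in for FileObject, ProcessId and Key) and the function name are hypothetical; the real routine reports failure through its NTSTATUS return value, not a boolean.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one granted lock. */
typedef struct {
    uint64_t start;
    uint64_t length;
    int      owner;   /* stands in for FileObject + ProcessId + Key */
} toy_lock;

/* Releases the one lock whose range and owner match exactly.
 * A request that covers only part of a locked range matches nothing. */
static bool toy_unlock_single(toy_lock *locks, size_t *count,
                              uint64_t start, uint64_t length, int owner)
{
    for (size_t i = 0; i < *count; i++) {
        if (locks[i].start == start && locks[i].length == length &&
            locks[i].owner == owner) {
            locks[i] = locks[*count - 1];  /* fill the hole; order is irrelevant */
            (*count)--;
            return true;
        }
    }
    return false;  /* no exact match, so nothing is released */
}
```

For example, with a lock on bytes 0 through 99, a request to unlock bytes 0 through 49 fails and leaves the lock in place.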

4.1.14 FsRtlFastUnlockAll
This routine is called to release all locks held by a given process.

NTKERNELAPI NTSTATUS FsRtlFastUnlockAll (
IN PFILE_LOCK FileLock,
IN PFILE_OBJECT FileObject,
IN PEPROCESS ProcessId,
IN PVOID Context OPTIONAL
);

• The FileLock argument represents the file lock state for the given file.
• The FileObject identifies the file being unlocked.
• The ProcessId identifies the process performing the unlock operation.
• The Context argument is passed to the PUNLOCK_ROUTINE if one was specified. See
Section 4.1.3 and Section 4.1.5 for additional information about this optional unlock routine.

This routine is used to release all locks associated with the given ProcessId value. Typically, this is
generated as part of process-death handling when the I/O Manager is closing all open files owned by the
process.

An FSD can use this to implement the Fast I/O unlock all operation.

4.1.15 FsRtlFastUnlockAllByKey
This routine is called to release all locks held by a given process under a given key.

NTKERNELAPI NTSTATUS FsRtlFastUnlockAllByKey (
IN PFILE_LOCK FileLock,
IN PFILE_OBJECT FileObject,
IN PEPROCESS ProcessId,
IN ULONG Key,
IN PVOID Context OPTIONAL
);

• The FileLock argument represents the file lock state for the given file.
• The FileObject identifies the file being unlocked.
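• The ProcessId identifies the process performing the unlock operation.
• The Key selects which of the process's locks are released; only locks acquired under this key are affected.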
• The Context argument is passed to the PUNLOCK_ROUTINE if one was specified. See
Section 4.1.3 and Section 4.1.5 for additional information about this optional unlock routine.

This routine is used by an FSD to implement the Fast I/O unlock all by key routine. It is normally called by
subsystems or file servers upon the death of a particular client. Hence, it can be used to release a subset of
the locks held by that particular process.
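
How a key selects a subset of one process's locks can be shown with a small model. This is a sketch under assumptions: the toy_lock layout and the integer process identifier are hypothetical, and the real package keeps its own internal lock lists.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one granted lock. */
typedef struct {
    uint64_t start;
    uint64_t length;
    int      process;  /* stands in for the ProcessId */
    uint32_t key;      /* caller-chosen key, e.g. one per remote client */
} toy_lock;

/* Drops every lock held by 'process' under 'key', compacting the array.
 * Returns the number of locks that remain. */
static size_t toy_unlock_all_by_key(toy_lock *locks, size_t count,
                                    int process, uint32_t key)
{
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (locks[i].process == process && locks[i].key == key) {
            continue;  /* released */
        }
        locks[kept++] = locks[i];
    }
    return kept;
}
```

A file server that uses one key per client could call such a routine when a single client disconnects, leaving the locks it holds for other clients intact.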

4.1.16 FsRtlPrivateLock
Despite its name, this is the primary routine used by an FSD to process requests to acquire a byte range lock.

NTKERNELAPI BOOLEAN FsRtlPrivateLock (
IN PFILE_LOCK FileLock,
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN PLARGE_INTEGER Length,
IN PEPROCESS ProcessId,
IN ULONG Key,
IN BOOLEAN FailImmediately,
IN BOOLEAN ExclusiveLock,
OUT PIO_STATUS_BLOCK Iosb,
IN PIRP Irp,
IN PVOID Context,
IN BOOLEAN AlreadySynchronized
);

• The FileLock argument represents the file lock state for the given file.
• The FileObject identifies the file being locked.
• The FileOffset information indicates the first byte of the range being locked.
• The Length information indicates the total number of bytes in the range to lock.
• The ProcessId identifies the process performing the lock operation.
• The Key identifies the unique value assigned by the caller to identify the true holder of the
lock.
• The FailImmediately value indicates if the caller is willing to block waiting for the lock to be
granted, or if the caller wishes the operation to fail immediately if the lock cannot be granted.
• The ExclusiveLock value indicates if this is a shared (FALSE) or exclusive (TRUE) lock.
• The Iosb value sets the output information correctly so that the I/O operation results can be
returned to the ultimate caller.
• The Irp value is an optional lock IRP. If provided, the values in the IRP are used to process
the lock request.
• The Context value is passed to the FSD specified PCOMPLETE_LOCK_IRP_ROUTINE if
one was specified (see Sections 4.1.2 and 4.1.5 for additional information).
• The AlreadySynchronized value is ignored. The FSD is responsible for serializing access for
the given file information.

An FSD can use this routine directly to implement both the IRP and fast I/O paths for acquiring a byte
range lock. Note that there is a separate “interface” for handling fast I/O routines (FsRtlFastLock,
described in Section 4.1.17) but this is merely #define-ed in terms of this function.

4.1.17 FsRtlFastLock
This routine is in fact a wrapper around FsRtlPrivateLock, described in Section 4.1.16.

BOOLEAN FsRtlFastLock (
IN PFILE_LOCK FileLock,
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN PLARGE_INTEGER Length,
IN PEPROCESS ProcessId,
IN ULONG Key,
IN BOOLEAN FailImmediately,
IN BOOLEAN ExclusiveLock,
OUT PIO_STATUS_BLOCK Iosb,
IN PVOID Context OPTIONAL,
IN BOOLEAN AlreadySynchronized
);

• The FileLock argument represents the file lock state for the given file.
• The FileObject identifies the file being locked.
• The FileOffset information indicates the first byte of the range being locked.
• The Length information indicates the total number of bytes in the range to lock.
• The ProcessId identifies the process performing the lock operation.
• The Key identifies the unique value assigned by the caller to identify the true holder of the
lock.
• The FailImmediately value indicates if the caller is willing to block waiting for the lock to be
granted, or if the caller wishes the operation to fail immediately if the lock cannot be granted.
• The ExclusiveLock value indicates if this is a shared (FALSE) or exclusive (TRUE) lock.
• The Iosb value sets the output information correctly so that the I/O operation results can be
returned to the ultimate caller.
• The Context value is passed to the FSD specified PCOMPLETE_LOCK_IRP_ROUTINE if
one was specified (see Sections 4.1.2 and 4.1.5 for additional information).
• The AlreadySynchronized value is ignored. The FSD is responsible for serializing access for
the given file information.

This call is identical to FsRtlPrivateLock except that the IRP parameter is absent. It is normally used by
an FSD to implement the fast I/O lock routine.

4.2 Directory Change Notification


4.2.1 PCHECK_FOR_TRAVERSE_ACCESS
typedef PVOID PNOTIFY_SYNC;

The actual declaration of PNOTIFY_SYNC is private to the notification package.

typedef BOOLEAN (*PCHECK_FOR_TRAVERSE_ACCESS) (
IN PVOID NotifyContext,
IN PVOID TargetContext,
IN PSECURITY_SUBJECT_CONTEXT SubjectContext
);

An FSD registers a callback with the Notify package. This routine is called when verifying a caller has
traverse access for watching a subdirectory.

4.2.2 FsRtlNotifyInitializeSync
NTKERNELAPI VOID FsRtlNotifyInitializeSync (
IN PNOTIFY_SYNC *NotifySync
);

An FSD calls this routine to initialize the synchronization object used by the notification package.

This routine allocates storage for the synchronization object.

Normally, this call is made as part of volume initialization, although it could be deferred to first use.

4.2.3 FsRtlNotifyUninitializeSync
NTKERNELAPI VOID FsRtlNotifyUninitializeSync (
IN PNOTIFY_SYNC *NotifySync
);

This routine is used to clean up the notification routine synchronization object created in the call to
FsRtlNotifyInitializeSync.

Normally, this is called as part of dismount or shutdown processing.


4.2.4 FsRtlNotifyFullChangeDirectory
NTKERNELAPI VOID FsRtlNotifyFullChangeDirectory (
IN PNOTIFY_SYNC NotifySync,
IN PLIST_ENTRY NotifyList,
IN PVOID FsContext,
IN PSTRING FullDirectoryName,
IN BOOLEAN WatchTree,
IN BOOLEAN IgnoreBuffer,
IN ULONG CompletionFilter,
IN PIRP NotifyIrp,
IN PCHECK_FOR_TRAVERSE_ACCESS TraverseCallback OPTIONAL,
IN PSECURITY_SUBJECT_CONTEXT SubjectContext OPTIONAL
);

This routine has two uses:

• The FSD calls this to queue an IRP. In this case NotifyIrp is non-null.
• The FSD calls this to process deletions. In this case NotifyIrp is null, and the call triggers
completion of any pending IRPs.

Remember, those IRPs are on an outstanding OPEN file!

4.2.5 FsRtlNotifyFullReportChange
NTKERNELAPI VOID FsRtlNotifyFullReportChange (
IN PNOTIFY_SYNC NotifySync,
IN PLIST_ENTRY NotifyList,
IN PSTRING FullTargetName,
IN USHORT TargetNameOffset,
IN PSTRING StreamName OPTIONAL,
IN PSTRING NormalizedParentName OPTIONAL,
IN ULONG FilterMatch,
IN ULONG Action,
IN PVOID TargetContext
);

This routine is similar to FsRtlNotifyReportChange, but provides more detailed information in the
caller’s buffer.

All pending change notifications are scanned and any matching notifications are completed.
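
The scan over pending notifications can be pictured as a filter comparison. The sketch below is a user-mode model, not the notify package's code: toy_notify is a hypothetical type, and it assumes a pending request matches when its completion filter shares at least one bit with the reported FilterMatch. The two flag values mirror FILE_NOTIFY_CHANGE_FILE_NAME and FILE_NOTIFY_CHANGE_SIZE. Name and subtree matching are omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NOTIFY_CHANGE_FILE_NAME 0x00000001u
#define TOY_NOTIFY_CHANGE_SIZE      0x00000008u

/* Hypothetical model of one pending change-notify request. */
typedef struct {
    uint32_t completion_filter;  /* changes this watcher asked about */
    bool     pending;            /* true until the request is completed */
} toy_notify;

/* Completes every pending request whose filter intersects filter_match.
 * Returns the number of requests completed. */
static size_t toy_report_change(toy_notify *list, size_t n,
                                uint32_t filter_match)
{
    size_t completed = 0;
    for (size_t i = 0; i < n; i++) {
        if (list[i].pending &&
            (list[i].completion_filter & filter_match) != 0) {
            list[i].pending = false;
            completed++;
        }
    }
    return completed;
}
```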

4.2.6 FsRtlNotifyCleanup
NTKERNELAPI VOID FsRtlNotifyCleanup (
IN PNOTIFY_SYNC NotifySync,
IN PLIST_ENTRY NotifyList,
IN PVOID FsContext
);

This routine is called by an FSD’s cleanup routine to ensure that there are no remaining references to a
particular FsContext structure.

Any pending IRPs are completed.

4.3 I/O Support


4.3.1 FsRtlCopyRead
NTKERNELAPI BOOLEAN FsRtlCopyRead (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN ULONG Length,
IN BOOLEAN Wait,
IN ULONG LockKey,
OUT PVOID Buffer,
OUT PIO_STATUS_BLOCK IoStatus,
IN PDEVICE_OBJECT DeviceObject
);

4.3.2 FsRtlCopyWrite
NTKERNELAPI BOOLEAN FsRtlCopyWrite (
IN PFILE_OBJECT FileObject,
IN PLARGE_INTEGER FileOffset,
IN ULONG Length,
IN BOOLEAN Wait,
IN ULONG LockKey,
IN PVOID Buffer,
OUT PIO_STATUS_BLOCK IoStatus,
IN PDEVICE_OBJECT DeviceObject
);

4.3.3 FsRtlMdlReadCompleteDev
NTKERNELAPI BOOLEAN FsRtlMdlReadCompleteDev( PFILE_OBJECT FileObject,
PMDL MdlChain,
PDEVICE_OBJECT DeviceObject);

• FSD or filter uses this instead of CcMdlReadComplete:
• Eliminates call to the filter driver
• Ensures that the MDL is torn down

Note that this call is not present in the NT 4.0 IFS Kit.

4.3.4 FsRtlMdlWriteCompleteDev
NTKERNELAPI BOOLEAN FsRtlMdlWriteCompleteDev( PFILE_OBJECT FileObject,
PLARGE_INTEGER FileOffset,
PMDL MdlChain,
PDEVICE_OBJECT DeviceObject);

• FSD or filter uses this instead of CcMdlWriteComplete:
• Eliminates call to the filter driver
• Ensures that the MDL is torn down

Note that this call is not present in the NT 4.0 IFS Kit.

4.4 Miscellaneous Routines


4.4.1 FsRtlIsTotalDeviceFailure
NTKERNELAPI BOOLEAN FsRtlIsTotalDeviceFailure(
IN NTSTATUS Status
);

An FSD can call this routine to determine if an I/O error indicates a transient failure or a catastrophic failure.

4.4.2 FsRtlBalanceReads
NTKERNELAPI NTSTATUS FsRtlBalanceReads (
IN PDEVICE_OBJECT TargetDevice
);

This routine is used to inform the fault tolerant driver that the FSD has been initialized and it may begin
balancing reads across the mirrored partitions.

If the underlying device is not an FT volume, this routine will return STATUS_INVALID_DEVICE_REQUEST.
An FSD can simply ignore this.

4.5 Name Support


4.5.1 FsRtlIsUnicodeCharacterWild
extern PUCHAR FsRtlLegalAnsiCharacterArray;
#define FsRtlIsUnicodeCharacterWild(C) ( \
(((C) >= 0x40) ? FALSE: FlagOn(LEGAL_ANSI_CHARACTER_ARRAY[(C)], \
FSRTL_WILD_CHARACTER ) ) )

This macro can be used to ascertain whether a particular character is a wildcard. Your FSD can use it if
you build your own name routines.

4.5.2 FsRtlDissectName
NTKERNELAPI VOID FsRtlDissectName (
IN UNICODE_STRING Path,
OUT PUNICODE_STRING FirstName,
OUT PUNICODE_STRING RemainingName
);

This routine takes an arbitrary path name and returns the first element of the name and the balance of
the name. The output strings use the input string buffer, so they are not null terminated. For instance,
“foo\bar.exe” is separated into “foo” and “bar.exe”; the separator is removed. The routine handles any
possible name: empty strings, single element names, and multiple element names.
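
The splitting behavior can be modeled in user mode with counted strings. This is a sketch, not the FsRtl implementation: toy_str and toy_dissect are hypothetical names, narrow characters stand in for WCHAR, and skipping one leading separator is an assumption of this model.

```c
#include <stddef.h>

/* Counted string in the spirit of UNICODE_STRING: not NUL-terminated. */
typedef struct {
    const char *buf;
    size_t      len;
} toy_str;

/* Splits 'path' at its first separator. Both outputs point into the
 * caller's buffer; nothing is copied and the separator is dropped. */
static void toy_dissect(toy_str path, toy_str *first, toy_str *rest)
{
    size_t i = 0, start;

    first->buf = rest->buf = NULL;
    first->len = rest->len = 0;
    if (path.len == 0) {
        return;                          /* empty input: both outputs empty */
    }

    if (path.buf[i] == '\\') {
        i++;                             /* skip one leading separator */
    }
    start = i;
    while (i < path.len && path.buf[i] != '\\') {
        i++;
    }

    first->buf = path.buf + start;
    first->len = i - start;

    if (i < path.len) {                  /* stopped on a separator: drop it */
        rest->buf = path.buf + i + 1;
        rest->len = path.len - i - 1;
    }
}
```

Calling the routine repeatedly on the remaining name walks a path one component at a time, which is how a lookup loop would typically consume it.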

4.5.3 FsRtlDoesNameContainWildCards
NTKERNELAPI BOOLEAN FsRtlDoesNameContainWildCards (
IN PUNICODE_STRING Name
);

Given a name this routine determines if the name contains a wildcard.

The individual FSD determines whether wildcards are valid. The system will pass names containing
wildcards to the FSD, which typically rejects them as “not there.” Wildcards are typically only useful for
search operations (directories, EAs, etc.).

Typically, the NT FSDs do this check early and reject the name if it does contain wildcards.
Your FSD may simply accept the wildcard if that fits its semantics.

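
An FSD that builds its own name routines might perform the check along these lines. The sketch is a simplified user-mode model with hypothetical names. It assumes the wildcard set is '*' and '?' plus the DOS-compatible '<', '>' and '"' characters, uses narrow characters, and scans the whole string.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed wildcard set for this sketch. */
static bool toy_is_wild(char c)
{
    return c == '*' || c == '?' || c == '<' || c == '>' || c == '"';
}

/* True if any character of the counted name is a wildcard. */
static bool toy_name_has_wildcards(const char *name, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (toy_is_wild(name[i])) {
            return true;
        }
    }
    return false;
}
```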
4.5.4 FsRtlAreNamesEqual
NTKERNELAPI BOOLEAN FsRtlAreNamesEqual (
PUNICODE_STRING ConstantNameA,
PUNICODE_STRING ConstantNameB,
IN BOOLEAN IgnoreCase,
IN PWCH UpcaseTable OPTIONAL
);

Given two names, this routine determines if they are equal or not.

See also RtlCompareUnicodeString, which:
• Doesn’t allow a private upcase table
• Is not the same implementation (NT 3.51)

See also FsRtlIsNameInExpression (Section 4.5.5) for a more expensive option.


4.5.5 FsRtlIsNameInExpression
NTKERNELAPI BOOLEAN FsRtlIsNameInExpression (
IN PUNICODE_STRING Expression,
IN PUNICODE_STRING Name,
IN BOOLEAN IgnoreCase,
IN PWCH UpcaseTable OPTIONAL
);

Given a name and an expression, this routine determines if the name matches the wildcard
expression.

The Name may not include any wildcards; the expression may include wildcards. If neither includes
wildcards then this routine functions identically to FsRtlAreNamesEqual.

An FSD uses this in the DIRECTORY_CONTROL operations (for enumerating a directory).

4.6 Tunnel (Name) Cache


Tunnel is a per volume cache of information.
typedef struct {
FAST_MUTEX Mutex;
PRTL_SPLAY_LINKS Cache;
LIST_ENTRY TimerQueue;
USHORT NumEntries;
} TUNNEL, *PTUNNEL;

• Tunneling information is tracked on a “per directory” basis


• Each directory in the cache must have a unique key
• Tunnel cache is typically per-volume

4.6.1 FsRtlInitializeTunnelCache
When first initializing your tunneling cache:

NTKERNELAPI
VOID
FsRtlInitializeTunnelCache (
IN TUNNEL *Cache);

• Note that your file system provides storage for the tunnel cache
• You are responsible for synchronizing access to the cache
• REMEMBER - tunnel caches are SPLAY TREES
• Read operations on a splay tree cause the tree to change shape
• Must always acquire the tunnel cache for exclusive access, even for lookups

4.6.2 FsRtlAddToTunnelCache
NTKERNELAPI VOID FsRtlAddToTunnelCache (
IN TUNNEL *Cache,
IN ULONGLONG DirectoryKey,
IN UNICODE_STRING *ShortName,
IN UNICODE_STRING *LongName,
IN BOOLEAN KeyByShortName,
IN ULONG DataLength,
IN VOID *Data);

• Use this anytime an entry in a directory is:


• Deleted
• Renamed

• Notice that both the short AND long name are given
• Can be used to preserve “short name” as well as long name

4.6.3 FsRtlFindInTunnelCache
NTKERNELAPI BOOLEAN FsRtlFindInTunnelCache (
IN TUNNEL *Cache,
IN ULONGLONG DirectoryKey,
IN UNICODE_STRING *Name,
OUT UNICODE_STRING *ShortName,
OUT UNICODE_STRING *LongName,
IN OUT ULONG *DataLength,
OUT VOID *Data);

• Used during create to find an “old name”


• Returns both the long name (for the Win16 case) and short name (for the Win32 case).
• Successful lookup
• Use old name, if possible
• Unsuccessful lookup
• Generate new short name, check for collisions!
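
The add-on-delete and find-on-create pattern can be sketched with a tiny fixed-size table. This is a user-mode model, not the tunnel package: every name below is hypothetical, narrow strings replace UNICODE_STRING, and the splay tree, per-entry data and timed expiry are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_TUNNEL_SLOTS 8
#define TOY_NAME_MAX     32

typedef struct {
    bool     in_use;
    uint64_t dir_key;                   /* unique key of the parent directory */
    char     short_name[TOY_NAME_MAX];
    char     long_name[TOY_NAME_MAX];
} toy_tunnel_entry;

typedef struct {
    toy_tunnel_entry slot[TOY_TUNNEL_SLOTS];
} toy_tunnel;

static void toy_copy_name(char *dst, const char *src)
{
    strncpy(dst, src, TOY_NAME_MAX - 1);
    dst[TOY_NAME_MAX - 1] = '\0';
}

/* Called when a directory entry is deleted or renamed. */
static bool toy_tunnel_add(toy_tunnel *t, uint64_t dir_key,
                           const char *short_name, const char *long_name)
{
    for (size_t i = 0; i < TOY_TUNNEL_SLOTS; i++) {
        toy_tunnel_entry *e = &t->slot[i];
        if (!e->in_use) {
            e->in_use  = true;
            e->dir_key = dir_key;
            toy_copy_name(e->short_name, short_name);
            toy_copy_name(e->long_name, long_name);
            return true;
        }
    }
    return false;  /* table full; this toy model never evicts */
}

/* Called during create: look the new name up under either of its forms. */
static const toy_tunnel_entry *toy_tunnel_find(const toy_tunnel *t,
                                               uint64_t dir_key,
                                               const char *name)
{
    for (size_t i = 0; i < TOY_TUNNEL_SLOTS; i++) {
        const toy_tunnel_entry *e = &t->slot[i];
        if (e->in_use && e->dir_key == dir_key &&
            (strcmp(e->short_name, name) == 0 ||
             strcmp(e->long_name, name) == 0)) {
            return e;
        }
    }
    return NULL;
}
```

On a successful lookup the FSD can reuse the remembered names for the new entry; on a miss it generates a fresh short name and checks it for collisions.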

4.6.4 FsRtlDeleteKeyFromTunnelCache
NTKERNELAPI VOID FsRtlDeleteKeyFromTunnelCache (
IN TUNNEL *Cache,
IN ULONGLONG DirectoryKey);

• Used to cleanup directory cache information


• Normally used when directory is deleted

4.6.5 FsRtlDeleteTunnelCache
NTKERNELAPI VOID FsRtlDeleteTunnelCache (
IN TUNNEL *Cache);

• Used to “tear down” the existing tunnel cache, typically at:
• Volume dismount
• System shutdown

4.7 UNC Support


4.7.1 FsRtlRegisterUncProvider
NTKERNELAPI NTSTATUS FsRtlRegisterUncProvider(
IN OUT PHANDLE MupHandle,
IN PUNICODE_STRING RedirectorDeviceName,
IN BOOLEAN MailslotsSupported
);

• This routine is used by a network redirector to register with MUP


• Redirector’s device object must be valid (initialized)
• It will be opened by MUP

• The last argument indicates if this redirector supports mailslots


• If so, messages to mailslots will be sent to this redirector
• All redirectors supporting mailslots will receive these messages
• Mup replicates certain mailslot messages to all mailslot capable redirectors.

4.7.2 FsRtlDeregisterUncProvider
NTKERNELAPI VOID
FsRtlDeregisterUncProvider(
IN HANDLE Handle
);

• This indicates the given redirector is no longer serving as a redirector


• This would be used when stopping a redirector

• Not required to support unload (stop) and hence not required to use this call

Note that the Handle must be used in the same context in which it was obtained.

5 Kernel Runtime
5.1 Kernel Queues
5.1.1 KQUEUE
typedef struct _KQUEUE {
DISPATCHER_HEADER Header;
LIST_ENTRY EntryListHead;
ULONG CurrentCount;
ULONG MaximumCount;
LIST_ENTRY ThreadListHead;
} KQUEUE, *PKQUEUE;

5.1.2 KeInitializeQueue
NTKERNELAPI VOID KeInitializeQueue (
IN PKQUEUE Queue,
IN ULONG Count OPTIONAL
);

This routine is used to initialize a new kernel queue.

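If provided, Count indicates the maximum number of threads that may concurrently be running while
processing entries removed from the queue. If it is omitted (zero), the number of processors in the system
is used.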

5.1.3 KeReadStateQueue
NTKERNELAPI LONG KeReadStateQueue (
IN PKQUEUE Queue
);

Indicates the current signal state of the queue (signaled or not-signaled)

5.1.4 KeInsertQueue
NTKERNELAPI LONG KeInsertQueue (
IN PKQUEUE Queue,
IN PLIST_ENTRY Entry
);

This routine is used to insert a new entry in the queue

As a side-effect, this routine signals the next waiter on this queue

5.1.5 KeRemoveQueue
NTKERNELAPI PLIST_ENTRY KeRemoveQueue (
IN PKQUEUE Queue,
IN KPROCESSOR_MODE WaitMode,
IN PLARGE_INTEGER Timeout OPTIONAL
);

This routine is used to fetch an entry from a queue.

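If there is nothing on the queue at present, the caller sleeps until an entry is inserted or the optional
Timeout expires.

The return value is either a pointer to the entry removed from the queue OR one of the following status
values (cast to PLIST_ENTRY):
• STATUS_TIMEOUT
• STATUS_USER_APC

5.1.6 KeRundownQueue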
PLIST_ENTRY KeRundownQueue (
IN PKQUEUE Queue
);

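This routine runs down (cleans up) a queue that is no longer needed. It returns a pointer to the first entry
still on the queue, so that the caller can dispose of any remaining entries.

Returns NULL if the queue is empty.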

5.2 Process Context Management


5.2.1 KeAttachProcess
An alternative solution to getting back to a specific thread context:
NTKERNELAPI VOID KeAttachProcess (
IN PKPROCESS Process
);

• This attaches to the specified process context


• All process-specific data structures are adjusted
• VM page tables
• Object handles

• Establishes a specific process context from an arbitrary thread context


• Note that the thread remains the same - but the associated process is temporarily adjusted
• You must clean up from this when done!

5.2.2 KeDetachProcess
NTKERNELAPI VOID KeDetachProcess (
VOID
);

• This disassociates the thread from the process


• Used to release a previously attached process

Use these routines with care! (Not documented in the DDK)

6 I/O Manager Runtime
6.1 IoAsynchronousPageWrite
NTKERNELAPI NTSTATUS IoAsynchronousPageWrite(
IN PFILE_OBJECT FileObject,
IN PMDL MemoryDescriptorList,
IN PLARGE_INTEGER StartingOffset,
IN PIO_APC_ROUTINE ApcRoutine,
IN PVOID ApcContext,
OUT PIO_STATUS_BLOCK IoStatusBlock,
OUT PIRP *Irp OPTIONAL
);

This routine is used by the Memory Manager to do a paging write (from the Modified Page Daemon).

• IRP Flags are IRP_PAGING_IO|IRP_NOCACHE


• PAGE_SIZE increments
• Not more than 64k at a time

Note: this call is here for INFORMATION only. Do not use it; unexpected I/O Manager behavior will result.

6.2 IoAttachDeviceToDeviceStack
PDEVICE_OBJECT
IoAttachDeviceToDeviceStack(
IN PDEVICE_OBJECT SourceDevice,
IN PDEVICE_OBJECT TargetDevice
);

• This may be used by a filter driver rather than IoAttachDeviceByPointer.


• Either one works, though!
• Error handling is different (one returns a status, the other the device pointer of the top device.)

• The advantage of this routine is that it fixes the attachment race condition
• This is a real problem with multiple filter drivers in the system.

6.3 IoCheckEaBufferValidity
NTKERNELAPI NTSTATUS IoCheckEaBufferValidity(
IN PFILE_FULL_EA_INFORMATION EaBuffer,
IN ULONG EaLength,
OUT PULONG ErrorOffset
);

This routine is used by an FSD to verify that an EA buffer is well formed. On failure, ErrorOffset
identifies the offending entry within the buffer.

Returns:
• STATUS_SUCCESS if the EaBuffer is valid
• STATUS_EA_LIST_INCONSISTENT otherwise

6.4 IoGetBaseFileSystemDeviceObject
NTKERNELAPI PDEVICE_OBJECT IoGetBaseFileSystemDeviceObject(
IN PFILE_OBJECT FileObject
);

Used to retrieve the file system device object. Ignores any attached device objects ABOVE the file system.

6.5 IoGetRequestorProcess
This call is used to retrieve the process pointer from an IRP

NTKERNELAPI PEPROCESS IoGetRequestorProcess(
IN PIRP Irp
);

• Remember - a PKPROCESS is identical to a PEPROCESS


• Based on DDK definition, no less!

6.6 IoIsSystemThread
NTKERNELAPI BOOLEAN IoIsSystemThread(
IN PETHREAD Thread
);

• Might prove useful when examining a specific thread’s context


• System threads run in the system context

6.7 IoPageRead
NTKERNELAPI NTSTATUS IoPageRead(
IN PFILE_OBJECT FileObject,
IN PMDL MemoryDescriptorList,
IN PLARGE_INTEGER StartingOffset,
IN PKEVENT Event,
OUT PIO_STATUS_BLOCK IoStatusBlock
);

This routine is called by the VM system to read a page from the FSD

IRP Flags are IRP_PAGING_IO | IRP_SYNCHRONOUS_PAGING_IO | IRP_NOCACHE | IRP_INPUT_OPERATION

The MDL provided by the VM system will be passed to the FSD as part of the Irp.

Note: This call is here for informational purposes only. Do not use it; the I/O Manager provides special
treatment for these IRPs.

6.8 IoQueryFileInformation
NTKERNELAPI NTSTATUS IoQueryFileInformation(
IN PFILE_OBJECT FileObject,
IN FILE_INFORMATION_CLASS FileInformationClass,
IN ULONG Length,
OUT PVOID FileInformation,
OUT PULONG ReturnedLength
);

Standard question: how does my filter driver figure out if this is a file or directory?

Answer - you call to the FSD. This call allows you to do it with a file object rather than a file handle
(which is what the Zw* routines want)

6.9 IoQueryVolumeInformation
NTKERNELAPI NTSTATUS IoQueryVolumeInformation(
IN PFILE_OBJECT FileObject,
IN FS_INFORMATION_CLASS FsInformationClass,
IN ULONG Length,
OUT PVOID FsInformation,
OUT PULONG ReturnedLength
);

Need we say more?

6.10 IoRegisterFileSystem
NTKERNELAPI VOID IoRegisterFileSystem(
IN OUT PDEVICE_OBJECT DeviceObject
);

• This entry point is used by a physical media FSD to register with the I/O Manager
• Only one device object for the FSD need be registered using this operation

6.11 IoRegisterFsRegistrationChange
typedef VOID (*PDRIVER_FS_NOTIFICATION) (
IN struct _DEVICE_OBJECT *DeviceObject,
IN BOOLEAN FsActive
);

NTKERNELAPI NTSTATUS IoRegisterFsRegistrationChange(
IN PDRIVER_OBJECT DriverObject,
IN PDRIVER_FS_NOTIFICATION DriverNotificationRoutine
);

This call allows your filter driver to register for notification whenever a file system actually loads or
unloads!

6.12 IoSynchronousPageWrite
NTKERNELAPI NTSTATUS IoSynchronousPageWrite(
IN PFILE_OBJECT FileObject,
IN PMDL MemoryDescriptorList,
IN PLARGE_INTEGER StartingOffset,
IN PKEVENT Event,
OUT PIO_STATUS_BLOCK IoStatusBlock
);

This routine is called by the VM system to write a page back to the FSD - synchronously

• IRP Flags are IRP_PAGING_IO | IRP_SYNCHRONOUS_PAGING_IO | IRP_NOCACHE

• The FSD will receive an IRP with the MDL passed to this routine!

Note: this call is here for informational purposes only. Do not use it; the I/O Manager treats these IRPs
specially.

6.13 IoThreadToProcess
NTKERNELAPI PEPROCESS IoThreadToProcess(
IN PETHREAD Thread
);

• Given a specific thread, this will retrieve its process context.

• This can be useful in reconstructing specific context for a thread

• Note that there is a PETHREAD pointer in the IRP...

6.14 IoUnregisterFileSystem
NTKERNELAPI VOID IoUnregisterFileSystem(
IN OUT PDEVICE_OBJECT DeviceObject
);

This is the counterpart of IoRegisterFileSystem. Note, however, that media-based FSDs typically do not
unload, and hence never unregister.

6.15 IoSetInformation
NTKERNELAPI NTSTATUS IoSetInformation(
IN PFILE_OBJECT FileObject,
IN FILE_INFORMATION_CLASS FileInformationClass,
IN ULONG Length,
IN PVOID FileInformation
);

This allows your filter driver to modify file information using a file object rather than a file handle. It is
the counterpart of IoQueryFileInformation.

7 Memory Manager Runtime
7.1 MmCanFileBeTruncated
BOOLEAN MmCanFileBeTruncated (
IN PSECTION_OBJECT_POINTERS SectionPointer,
IN PLARGE_INTEGER NewFileSize
);

• This routine is used to determine if the VM sections of the file are in use
• Of course, if they are it cannot be truncated
• This call does not require the file be truncated subsequently!

• A file may be truncated whenever:


• It is not mapped (executable or user data)

• An FSD may use this prior to truncating


• Overwrite
• Supersede
• Set EOF
• Set Allocation

7.2 MmIsRecursiveIoFault
NTKERNELAPI BOOLEAN MmIsRecursiveIoFault(
VOID
);

This routine may be used by an FSD to determine if the current thread is processing a page fault that
occurred during an I/O operation.

This may be used by an FSD to determine if it is processing a paging operation in the context of an in-
progress I/O operation.

When a page fault is being recursively processed:


I/O must be completed in thread context - locks are already held

7.3 MmForceSectionClosed
BOOLEAN MmForceSectionClosed (
IN PSECTION_OBJECT_POINTERS SectionObjectPointer,
IN BOOLEAN DelayClose
);

This routine is used by an FSD to forcibly close an open section if there are no further references to that
section.

The DelayClose parameter may be used by the caller to indicate that the caller wishes to block waiting for
the close to occur.

If the section is closed, this routine returns TRUE - otherwise FALSE

7.4 MmFlushImageSection
typedef enum _MMFLUSH_TYPE {
MmFlushForDelete,
MmFlushForWrite
} MMFLUSH_TYPE;

BOOLEAN MmFlushImageSection (
IN PSECTION_OBJECT_POINTERS SectionObjectPointer,
IN MMFLUSH_TYPE FlushType
);

• This routine is used by an FSD to flush the image portion of the section objects
• Image is the “shared” portion of the section object
• Typically used when an executable is being overwritten
• Also used as part of dismounting (or shutting down) a volume

• Flush Type indicates which operation the MM system should perform.

7.5 MmSetAddressRangeModified
VOID MmSetAddressRangeModified (
IN PVOID Address,
IN ULONG Length
);

This is used to advise the VM system it should mark the range as “dirty”.
FSDs typically do not use this (they use the Cc routine instead.)

8 NT Native API
8.1 Data Structures
8.1.1 TOKEN_INFORMATION_CLASS
typedef enum _TOKEN_INFORMATION_CLASS {
TokenUser = 1,
TokenGroups,
TokenPrivileges,
TokenOwner,
TokenPrimaryGroup,
TokenDefaultDacl,
TokenSource,
TokenType,
TokenImpersonationLevel,
TokenStatistics
} TOKEN_INFORMATION_CLASS, *PTOKEN_INFORMATION_CLASS;

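Sample Code
/* Assumes these declarations: NTSTATUS Status; HANDLE handle; PISID sid;   */
/* ULONG tokenInfoLength; and UCHAR buffer[] large enough for a TOKEN_USER  */
/* followed by its SID.                                                      */
Status = ZwOpenThreadToken(NtCurrentThread(), TOKEN_READ, TRUE, &handle);
if (Status == STATUS_NO_TOKEN) {
    /* The thread is not impersonating, so fall back to the process token. */
    Status = ZwOpenProcessToken(NtCurrentProcess(), TOKEN_READ, &handle);
}
ASSERT(NT_SUCCESS(Status));
Status = ZwQueryInformationToken(handle, TokenUser, buffer,
                                 sizeof(buffer), &tokenInfoLength);
ASSERT(NT_SUCCESS(Status));
/* TokenUser data is a TOKEN_USER structure whose User.Sid points at the SID. */
sid = (PISID)((PTOKEN_USER)buffer)->User.Sid;
DbgPrint("*** BEGIN SID Dump ***\n");
DbgPrint("Caller's SID (Revision %u, SubAuthorityCount %u):\n",
         sid->Revision, sid->SubAuthorityCount);
DbgPrint("\tIdentifierAuthority = %u-%u-%u-%u-%u-%u\n",
         sid->IdentifierAuthority.Value[0], sid->IdentifierAuthority.Value[1],
         sid->IdentifierAuthority.Value[2], sid->IdentifierAuthority.Value[3],
         sid->IdentifierAuthority.Value[4], sid->IdentifierAuthority.Value[5]);
DbgPrint("*** END SID Dump ***\n");
ZwClose(handle);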

8.2 NtAdjustPrivilegesToken
NTSYSAPI NTSTATUS NTAPI NtAdjustPrivilegesToken (
IN HANDLE TokenHandle,
IN BOOLEAN DisableAllPrivileges,
IN PTOKEN_PRIVILEGES NewState OPTIONAL,
IN ULONG BufferLength OPTIONAL,
OUT PTOKEN_PRIVILEGES PreviousState OPTIONAL,
OUT PULONG ReturnLength
);

• Modify the privileges of a token


• Just because a user has specific privileges does not mean they are “activated”
• Can add/remove and enable/disable privileges for a given thread or process.

8.3 NtDuplicateObject
NTSYSAPI NTSTATUS NTAPI ZwDuplicateObject(
IN HANDLE SourceProcessHandle,
IN HANDLE SourceHandle,
IN HANDLE TargetProcessHandle OPTIONAL,
OUT PHANDLE TargetHandle OPTIONAL,
IN ACCESS_MASK DesiredAccess,
IN ULONG HandleAttributes,
IN ULONG Options
);

Creates a new handle in the target process that refers to the same object as an existing handle in the
source process. The source and target may be the same process.

8.4 NtDuplicateToken
NTSYSAPI NTSTATUS NTAPI NtDuplicateToken(
IN HANDLE ExistingTokenHandle,
IN ACCESS_MASK DesiredAccess,
IN POBJECT_ATTRIBUTES ObjectAttributes,
IN BOOLEAN EffectiveOnly,
IN TOKEN_TYPE TokenType,
OUT PHANDLE NewTokenHandle
);

• Duplicate an existing token


• Gives you a second, independent token handle
• Can specify different access to the token than granted to the original handle

8.5 NtOpenProcessToken
NTSYSAPI NTSTATUS NTAPI
NtOpenProcessToken(
IN HANDLE ProcessHandle,
IN ACCESS_MASK DesiredAccess,
OUT PHANDLE TokenHandle
);

• This is used to open the token of the specified process (typically the current process).


• Handle to a token is essential for extracting various pieces of information about the caller.

Typically, try to open the thread token and if this fails, open the process token.

8.6 NtQueryInformationToken
NTSYSAPI NTSTATUS NTAPI
NtQueryInformationToken (
IN HANDLE TokenHandle,
IN TOKEN_INFORMATION_CLASS TokenInformationClass,
OUT PVOID TokenInformation,
IN ULONG TokenInformationLength,
OUT PULONG ReturnLength
);

Retrieve information about the specified token.

9 Object Manager Runtime
9.1 Data Structures
9.1.1 OB_DUMP_METHOD
typedef struct _OBJECT_DUMP_CONTROL {
PVOID Stream;
ULONG Detail;
} OB_DUMP_CONTROL, *POB_DUMP_CONTROL;

typedef VOID (*OB_DUMP_METHOD)(
IN PVOID Object,
IN POB_DUMP_CONTROL Control OPTIONAL
);

Used for debugging objects - dumps out object specific information

9.1.2 OB_OPEN_REASON
typedef enum _OB_OPEN_REASON {
ObCreateHandle,
ObOpenHandle,
ObDuplicateHandle,
ObInheritHandle,
ObMaxOpenReason
} OB_OPEN_REASON;

• The open reason describes why a particular object is being opened


• This is passed to the object’s open method

9.1.3 OB_OPEN_METHOD
typedef VOID (*OB_OPEN_METHOD)(
IN OB_OPEN_REASON OpenReason,
IN PEPROCESS Process OPTIONAL,
IN PVOID Object,
IN ACCESS_MASK GrantedAccess,
IN ULONG HandleCount
);

• This is called when an object is opened


• This includes the first time when it is created
• This includes each time it is “opened”
• Such as “handle duplication”
9.1.4 OB_CLOSE_METHOD
typedef VOID (*OB_CLOSE_METHOD)(
IN PEPROCESS Process OPTIONAL,
IN PVOID Object,
IN ACCESS_MASK GrantedAccess,
IN ULONG ProcessHandleCount,
IN ULONG SystemHandleCount
);

• This routine is called from the object manager each time a given object is closed.
• Note that close is not identical to delete
• Typically, this method implements some form of reference count management.

• Note that there are TWO reference counts!

• The ProcessHandleCount (user level references)
• The SystemHandleCount (OS internal references)

9.1.5 OB_DELETE_METHOD
typedef VOID (*OB_DELETE_METHOD)(
IN PVOID Object
);

• This method is called by the Object Manager when the object is being deleted
• Typically, this is associated with the reference count dropping to zero

Not much activity on this one - the object is DEAD and will go away very soon.

9.1.6 OB_PARSE_METHOD
typedef NTSTATUS (*OB_PARSE_METHOD)(
IN PVOID ParseObject,
IN PVOID ObjectType,
IN OUT PACCESS_STATE AccessState,
IN KPROCESSOR_MODE AccessMode,
IN ULONG Attributes,
IN OUT PUNICODE_STRING CompleteName,
IN OUT PUNICODE_STRING RemainingName,
IN OUT PVOID Context OPTIONAL,
IN PSECURITY_QUALITY_OF_SERVICE SecurityQos OPTIONAL,
OUT PVOID *Object
);

• This method is used to find objects in “containers”


• A media device (really its FSD) may contain objects
• A directory object (NOT in the FSD) may contain objects
• Note the IN/OUT for the names - allows symbolic links!

9.1.7 SECURITY_OPERATION_CODE
typedef enum _SECURITY_OPERATION_CODE {
SetSecurityDescriptor,
QuerySecurityDescriptor,
DeleteSecurityDescriptor,
AssignSecurityDescriptor
} SECURITY_OPERATION_CODE, *PSECURITY_OPERATION_CODE;

These values indicate the type of security operation being performed on an object

9.1.8 OB_SECURITY_METHOD
typedef NTSTATUS (*OB_SECURITY_METHOD)(
IN PVOID Object,
IN SECURITY_OPERATION_CODE OperationCode,
IN PSECURITY_INFORMATION SecurityInformation,
IN OUT PSECURITY_DESCRIPTOR SecurityDescriptor,
IN OUT PULONG CapturedLength,
IN OUT PSECURITY_DESCRIPTOR *ObjectsSecurityDescriptor,
IN POOL_TYPE PoolType,
IN PGENERIC_MAPPING GenericMapping
);

This method is called to query or modify the security descriptor on a given object. Note that the actual
operation is specified by the OperationCode.

• There are a number of important rules to follow:
• The caller of this routine must pass in a kernel address
• The data pointed to must not change during the call
• The SecurityDescriptor must be properly probed

The method must use a try/except clause when writing the security descriptor (it might be deallocated
during the call!)

For set:

• SecurityDescriptor must be the return from SeCaptureSecurityDescriptor

More on security later...

9.1.9 OB_QUERYNAME_METHOD
typedef NTSTATUS (*OB_QUERYNAME_METHOD)(
IN PVOID Object,
IN BOOLEAN HasObjectName,
OUT POBJECT_NAME_INFORMATION ObjectNameInfo,
IN ULONG Length,
OUT PULONG ReturnLength
);

This method is used to obtain the name of a given object. Not supported by all objects - notably file objects

9.1.10 OBJECT_TYPE_INITIALIZER
typedef struct _OBJECT_TYPE_INITIALIZER {
USHORT Length;
BOOLEAN UseDefaultObject;
BOOLEAN Reserved;
ULONG InvalidAttributes;
GENERIC_MAPPING GenericMapping;
ULONG ValidAccessMask;
BOOLEAN SecurityRequired;
BOOLEAN MaintainHandleCount;
BOOLEAN MaintainTypeList;
POOL_TYPE PoolType;
ULONG DefaultPagedPoolCharge;
ULONG DefaultNonPagedPoolCharge;
OB_DUMP_METHOD DumpProcedure;
OB_OPEN_METHOD OpenProcedure;
OB_CLOSE_METHOD CloseProcedure;
OB_DELETE_METHOD DeleteProcedure;
OB_PARSE_METHOD ParseProcedure;
OB_SECURITY_METHOD SecurityProcedure;
OB_QUERYNAME_METHOD QueryNameProcedure;
} OBJECT_TYPE_INITIALIZER, *POBJECT_TYPE_INITIALIZER;

9.2 ObCreateObject
NTKERNELAPI NTSTATUS ObCreateObject(
IN KPROCESSOR_MODE ProbeMode,
IN POBJECT_TYPE ObjectType,
IN POBJECT_ATTRIBUTES ObjectAttributes OPTIONAL,
IN KPROCESSOR_MODE OwnershipMode,
IN OUT PVOID ParseContext OPTIONAL,
IN ULONG ObjectBodySize,
IN ULONG PagedPoolCharge,
IN ULONG NonPagedPoolCharge,
OUT PVOID *Object
);

This routine is used when creating a new object. Note that the Object Type is defined by the creator of the
object itself. The methods are associated with the Object Type.

Note that two object types are declared in NTDDK.H: IoFileObjectType, ExEventObjectType. Look for
others in ntoskrnl.exe exports!

9.3 ObGetObjectPointerCount
NTKERNELAPI ULONG ObGetObjectPointerCount(
IN PVOID Object
);

This returns the current reference count to the given object.

9.4 ObInsertObject
NTKERNELAPI NTSTATUS ObInsertObject(
IN PVOID Object,
IN PACCESS_STATE PassedAccessState OPTIONAL,
IN ACCESS_MASK DesiredAccess OPTIONAL,
IN ULONG ObjectPointerBias,
OUT PVOID *NewObject OPTIONAL,
OUT PHANDLE Handle
);

This can be used to take an existing object and acquire a handle in a process context for that object.
The ObjectPointerBias is the initial reference count on the object.

The returned handle is ONLY valid in the context of the process where the object was inserted!

9.5 ObOpenObjectByPointer
NTKERNELAPI NTSTATUS ObOpenObjectByPointer(
IN PVOID Object,
IN ULONG HandleAttributes,
IN PACCESS_STATE PassedAccessState OPTIONAL,
IN ACCESS_MASK DesiredAccess OPTIONAL,
IN POBJECT_TYPE ObjectType OPTIONAL,
IN KPROCESSOR_MODE AccessMode,
OUT PHANDLE Handle
);

This is used to open an existing object based upon its pointer. Upon completion, the handle representing
this open instance is returned. This handle is process specific!

9.6 ObQueryNameString
NTKERNELAPI NTSTATUS ObQueryNameString(
IN PVOID Object,
OUT POBJECT_NAME_INFORMATION ObjectNameInfo,
IN ULONG Length,
OUT PULONG ReturnLength
);

• This is the Object Manager's generic call to retrieve an object's name

• Not all objects have names (e.g., FSD instances)
• Not all object names are stored as part of the object information
• Notably, file objects!

• Typically called twice


• Retrieve the length of the buffer needed
• Retrieve the name
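
The two-call pattern above can be sketched in portable C. This is a user-mode model only, not kernel code: model_query_name is a hypothetical stand-in for a routine such as ObQueryNameString, and the status names and the use of malloc are assumptions made so that the sketch can be compiled anywhere.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes for the model; not real NTSTATUS values. */
typedef enum {
    MODEL_SUCCESS,
    MODEL_INFO_LENGTH_MISMATCH
} model_status;

/* Hypothetical query routine: always reports the size it needs, and
   copies the name only when the caller's buffer is large enough. */
model_status model_query_name(const char *object_name, char *buffer,
                              size_t length, size_t *return_length)
{
    size_t needed = strlen(object_name) + 1;

    *return_length = needed;
    if (buffer == NULL || length < needed) {
        return MODEL_INFO_LENGTH_MISMATCH;
    }
    memcpy(buffer, object_name, needed);
    return MODEL_SUCCESS;
}

/* First call retrieves the required length, second call retrieves the
   name. The caller owns the returned buffer and must free() it. */
char *model_get_name(const char *object_name)
{
    size_t needed = 0;
    char *buffer;

    if (model_query_name(object_name, NULL, 0, &needed)
            != MODEL_INFO_LENGTH_MISMATCH) {
        return NULL;
    }
    buffer = malloc(needed);
    if (buffer == NULL) {
        return NULL;
    }
    if (model_query_name(object_name, buffer, needed, &needed)
            != MODEL_SUCCESS) {
        free(buffer);
        return NULL;
    }
    return buffer;
}
```

In a driver the allocation would come from pool rather than malloc, and the name could change between the two calls, so a real caller should be prepared to retry.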

10 Runtime Library
10.1 Memory Access
10.1.1 ProbeForRead
#define ProbeForRead(Address, Length, Alignment) \
    ASSERT(((Alignment) == 1) || ((Alignment) == 2) || \
           ((Alignment) == 4) || ((Alignment) == 8)); \
    \
    if ((Length) != 0) { \
        if (((ULONG)(Address) & ((Alignment) - 1)) != 0) { \
            ExRaiseDatatypeMisalignment(); \
        } else if ((((ULONG)(Address) + (Length)) < (ULONG)(Address)) || \
                   (((ULONG)(Address) + (Length)) > (ULONG)MM_USER_PROBE_ADDRESS)) { \
            ExRaiseAccessViolation(); \
        } \
    }
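
The checks performed by the macro can be modeled in portable C to make the three outcomes explicit. This is only a sketch: the model returns a result code where the real macro raises an exception, and both the function name and the MODEL_USER_PROBE_ADDRESS value (a 32-bit user/kernel boundary) are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result codes standing in for the exceptions raised
   by the real macro. */
typedef enum {
    PROBE_OK,
    PROBE_MISALIGNED,       /* models ExRaiseDatatypeMisalignment() */
    PROBE_ACCESS_VIOLATION  /* models ExRaiseAccessViolation()      */
} probe_result;

/* Assumed boundary between user and system addresses; illustrative. */
#define MODEL_USER_PROBE_ADDRESS 0x7FFF0000u

probe_result model_probe_for_read(uint32_t address, uint32_t length,
                                  uint32_t alignment)
{
    uint32_t end;

    assert(alignment == 1 || alignment == 2 ||
           alignment == 4 || alignment == 8);

    if (length == 0) {
        return PROBE_OK;              /* zero-length buffers are not checked */
    }
    if ((address & (alignment - 1)) != 0) {
        return PROBE_MISALIGNED;      /* low bits set: not aligned */
    }
    end = address + length;           /* may wrap around in 32 bits */
    if (end < address || end > MODEL_USER_PROBE_ADDRESS) {
        return PROBE_ACCESS_VIOLATION; /* wrapped, or reaches system space */
    }
    return PROBE_OK;
}
```

Note that the probe validates only the address range. It does not touch the memory, so accesses to a probed user buffer still belong inside a try/except block.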

10.1.2 ProbeForWrite
//
// Common probe for write functions.
//

NTKERNELAPI
VOID
NTAPI
ProbeForWrite (
IN PVOID Address,
IN ULONG Length,
IN ULONG Alignment
);

10.2 Bitmap Routines


typedef struct _RTL_BITMAP {
ULONG SizeOfBitMap; // Number of bits in bit map
PULONG Buffer; // Pointer to the bit map itself
} RTL_BITMAP, *PRTL_BITMAP;

The bitmap routines may be used to ease the burden of building allocation routines.

The RTL_BITMAP structure is used throughout the bitmap routines

• Bitmap requirements for these routines:


• They must be quadword aligned
• They must be a multiple of quadwords in size

Clever file systems will actually set up the bitmap to be a file backed section in memory.

10.2.1 RtlClearAllBits
NTSYSAPI VOID NTAPI RtlClearAllBits (
PRTL_BITMAP BitMapHeader
);

• This routine is used to clear all the elements of a bitmap


• Typically used when initializing the bitmap

10.2.2 RtlSetAllBits
NTSYSAPI VOID NTAPI RtlSetAllBits (
PRTL_BITMAP BitMapHeader
);

• This routine is used to set all the bits in the bitmap.


• Typically used during initialization

10.2.3 RtlFindClearBits
NTSYSAPI ULONG NTAPI RtlFindClearBits (
PRTL_BITMAP BitMapHeader,
ULONG NumberToFind,
ULONG HintIndex
);

This routine is used to locate a contiguous region of cleared bits within the bitmap. The NumberToFind
indicates the number of bits required to fulfill this request and the HintIndex indicates the first bit to be
used in this search.

Returns:

• 0xFFFFFFFF if a region was not located


• The index of the first bit otherwise

Indexes are zero-based

10.2.4 RtlFindSetBits
NTSYSAPI ULONG NTAPI RtlFindSetBits (
PRTL_BITMAP BitMapHeader,
ULONG NumberToFind,
ULONG HintIndex
);

This operates much like RtlFindClearBits, except that it searches for set bits.

10.2.5 RtlFindClearBitsAndSet
NTSYSAPI ULONG NTAPI RtlFindClearBitsAndSet (
PRTL_BITMAP BitMapHeader,
ULONG NumberToFind,
ULONG HintIndex
);

This routine is used to locate a contiguous region of cleared bits within the bitmap. The NumberToFind
indicates the number of bits required to fulfill this request and the HintIndex indicates the first bit to be
used in this search.

Once the region has been located it will be set.

Returns:

• 0xFFFFFFFF if a region was not located


• The index of the first bit otherwise

Indexes are zero-based
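
The find-and-set behavior described above is the core of a simple allocator. The following is a portable model of it, not the Rtl implementation: the model_ names are hypothetical, and the search order (from the hint to the end of the bitmap, then again from the beginning) is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NOT_FOUND 0xFFFFFFFFu

/* Simplified stand-in for RTL_BITMAP. */
typedef struct {
    uint32_t  size_in_bits;
    uint32_t *buffer;
} model_bitmap;

int model_test_bit(const model_bitmap *bm, uint32_t index)
{
    return (int)((bm->buffer[index / 32] >> (index % 32)) & 1u);
}

void model_set_bits(model_bitmap *bm, uint32_t start, uint32_t count)
{
    for (uint32_t i = start; i < start + count; i++) {
        bm->buffer[i / 32] |= (1u << (i % 32));
    }
}

/* Locate `count` contiguous clear bits, set them, and return the
   zero-based index of the first one, or MODEL_NOT_FOUND. */
uint32_t model_find_clear_bits_and_set(model_bitmap *bm, uint32_t count,
                                       uint32_t hint)
{
    uint32_t starts[2];

    if (count == 0 || count > bm->size_in_bits) {
        return MODEL_NOT_FOUND;
    }
    if (hint >= bm->size_in_bits) {
        hint = 0;
    }
    starts[0] = hint;   /* first pass: from the hint to the end */
    starts[1] = 0;      /* second pass: the whole bitmap        */

    for (int pass = 0; pass < 2; pass++) {
        uint32_t run = 0;   /* length of the current run of clear bits */
        for (uint32_t i = starts[pass]; i < bm->size_in_bits; i++) {
            run = model_test_bit(bm, i) ? 0 : run + 1;
            if (run == count) {
                uint32_t first = i + 1 - count;
                model_set_bits(bm, first, count);
                return first;
            }
        }
    }
    return MODEL_NOT_FOUND;
}
```

An FSD using the real routines for cluster allocation would hold a lock across the call so that the search and the set appear atomic to other threads.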

10.2.6 RtlFindSetBitsAndClear
NTSYSAPI ULONG NTAPI RtlFindSetBitsAndClear (
PRTL_BITMAP BitMapHeader,
ULONG NumberToFind,
ULONG HintIndex
);

This routine is the complement of RtlFindClearBitsAndSet: it locates a contiguous region of set bits and
clears it.

10.2.7 RtlClearBits
NTSYSAPI VOID NTAPI RtlClearBits (
PRTL_BITMAP BitMapHeader,
ULONG StartingIndex,
ULONG NumberToClear
);

This routine is used to clear a specified range of bits within the bitmap.

Note that the bits in the specified range need not be set for this call.

10.2.8 RtlSetBits
NTSYSAPI VOID NTAPI RtlSetBits (
PRTL_BITMAP BitMapHeader,
ULONG StartingIndex,
ULONG NumberToSet
);

This is the complement of RtlClearBits: it sets the specified range of bits.

10.2.9 RtlFindLongestRunClear
NTSYSAPI ULONG NTAPI RtlFindLongestRunClear (
PRTL_BITMAP BitMapHeader,
PULONG StartingIndex
);

This routine is used to search the bitmap for the longest range of clear bits available.

The return value is the number of bits in the longest run of clear bits found.

Upon return, StartingIndex points to the index (zero-based) of the first bit of that run.

10.2.10 RtlFindLongestRunSet
NTSYSAPI ULONG NTAPI RtlFindLongestRunSet (
PRTL_BITMAP BitMapHeader,
PULONG StartingIndex
);

This routine is the complement of RtlFindLongestRunClear.

10.2.11 RtlFindFirstRunClear
NTSYSAPI ULONG NTAPI RtlFindFirstRunClear (
PRTL_BITMAP BitMapHeader,
PULONG StartingIndex
);

This routine is used to locate the first run of clear bits in the bitmap.

The return value is the number of bits in the run.

StartingIndex represents the first bit in the run.


10.2.12 RtlFindFirstRunSet
NTSYSAPI ULONG NTAPI RtlFindFirstRunSet (
PRTL_BITMAP BitMapHeader,
PULONG StartingIndex
);

This is the complement of RtlFindFirstRunClear.

10.2.13 RtlNumberOfClearBits
NTSYSAPI ULONG NTAPI RtlNumberOfClearBits (
PRTL_BITMAP BitMapHeader
);

This routine returns the total number of bits in the bitmap that are clear.

10.2.14 RtlNumberOfSetBits
NTSYSAPI ULONG NTAPI RtlNumberOfSetBits (
PRTL_BITMAP BitMapHeader
);

This routine returns the total number of bits in the bitmap that are set.

10.2.15 RtlAreBitsClear
NTSYSAPI BOOLEAN NTAPI RtlAreBitsClear (
PRTL_BITMAP BitMapHeader,
ULONG StartingIndex,
ULONG Length
);

This routine is used to determine if the bits in the given range are all clear.

• Returns TRUE if they are all clear.


• Returns FALSE otherwise.

10.2.16 RtlAreBitsSet
NTSYSAPI BOOLEAN NTAPI RtlAreBitsSet (
PRTL_BITMAP BitMapHeader,
ULONG StartingIndex,
ULONG Length
);

This is the complement of RtlAreBitsClear.

• It returns TRUE if the bits in the specified range are all set.
• It returns FALSE otherwise.

10.3 Prefix Cache

10.4 Splay Tree

10.5 Generic Table

10.6 Short Name Support


10.6.1 GENERATE_NAME_CONTEXT
The Rtl routine for 8.3 name generation requires a private context structure:

typedef struct {
USHORT Checksum;
BOOLEAN ChecksumInserted;
UCHAR NameLength;
WCHAR NameBuffer[8];
ULONG ExtensionLength; // including dot
WCHAR ExtensionBuffer[4];
ULONG LastIndexValue;
} GENERATE_NAME_CONTEXT;
typedef GENERATE_NAME_CONTEXT *PGENERATE_NAME_CONTEXT;

Note:
• NameLength is in wide characters, not bytes
• NameBuffer is the name BEFORE the dot
• ExtensionBuffer is the name starting with the dot (INCLUDING the dot)

This structure is used when calling the Rtl routine repeatedly (for state)
10.6.2 RtlGenerate8dot3Name
NTSYSAPI VOID NTAPI RtlGenerate8dot3Name (
IN PUNICODE_STRING Name,
IN BOOLEAN AllowExtendedCharacters,
IN OUT PGENERATE_NAME_CONTEXT Context,
OUT PUNICODE_STRING Name8dot3
);

This routine is used by an FSD to generate a short file name.

• An FSD must implement this in a loop:


• Generate the name
• Check to determine if the name already exists
• If it does exist, do it again

This mechanism ensures that an 8.3 name is unique.
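
The generate, check, and retry loop can be sketched in portable C. Everything here is a model: model_generate_8dot3 is a hypothetical generator that simply produces BASIS~N.EXT and is not the algorithm used by RtlGenerate8dot3Name, model_name_context stands in for GENERATE_NAME_CONTEXT, and the directory lookup is a plain array search.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-name state carried from one attempt to the next. */
typedef struct {
    unsigned last_index;
} model_name_context;

/* Hypothetical generator: upper-cases the name, drops spaces, keeps up
   to 6 base and 3 extension characters, and appends ~N. */
void model_generate_8dot3(const char *long_name, model_name_context *ctx,
                          char out[13])
{
    char base[7] = { 0 };
    char ext[4] = { 0 };
    const char *dot = strrchr(long_name, '.');
    size_t base_len = dot ? (size_t)(dot - long_name) : strlen(long_name);
    size_t n = 0;
    unsigned digits = 1;
    unsigned v;
    int keep;

    for (size_t i = 0; i < base_len && n < 6; i++) {
        unsigned char c = (unsigned char)long_name[i];
        if (c != ' ' && c != '.') {
            base[n++] = (char)toupper(c);
        }
    }
    n = 0;
    if (dot != NULL) {
        for (const char *p = dot + 1; *p != '\0' && n < 3; p++) {
            if (*p != ' ') {
                ext[n++] = (char)toupper((unsigned char)*p);
            }
        }
    }

    ctx->last_index++;                  /* each retry yields a new name */
    for (v = ctx->last_index; v >= 10; v /= 10) {
        digits++;
    }
    keep = 7 - (int)digits;             /* base + '~' + digits <= 8 */
    if (keep < 1) {
        keep = 1;
    }
    snprintf(out, 13, "%.*s~%u%s%s", keep, base, ctx->last_index,
             ext[0] != '\0' ? "." : "", ext);
}

/* Hypothetical directory lookup. */
int model_name_exists(const char *const *dir, size_t count, const char *name)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(dir[i], name) == 0) {
            return 1;
        }
    }
    return 0;
}

/* The loop from the text: generate, check, and repeat on collision. */
void model_unique_short_name(const char *long_name, const char *const *dir,
                             size_t count, char out[13])
{
    model_name_context ctx = { 0 };

    do {
        model_generate_8dot3(long_name, &ctx, out);
    } while (model_name_exists(dir, count, out));
}
```

The important point is that the context persists across iterations, so that each call produces a different candidate. Reinitializing it inside the loop would generate the same colliding name forever.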

11 Security Reference Monitor
The Security Reference Monitor is the component within Windows NT that is responsible for making
essentially all security decisions. However, because the file systems participate in the initialization of
new file objects, and are directly responsible for storing the security information associated with a file
object, it is the responsibility of the file system to use the Security Reference Monitor to make these
security decisions.

This section describes the basic routines necessary to implement your own security checking mechanisms.
Unfortunately, because the file system examples contained within the Microsoft IFS Kit do not implement
Windows NT compatible security, it is not possible to provide any code references in this section.
Wherever possible, code samples have been provided from earlier versions of the OSR FSDK.

11.1 Data Structures


In this section we describe the security objects which are used by the Security Reference Monitor. Most of
these are simply in-memory data structures that are used by the Security Reference Monitor as part of
processing specific security operations. Some, such as the SECURITY_DESCRIPTOR, are stored by the file
system, and hence must be retrieved as needed from the file system backing store (for a physical file
system, the disk drive and for a network file system, the remote file system.)

Particularly important to the entire security discussion is the concept of the subject context which identifies
the security entity attempting to perform the operation. This context information is normally only provided
to the file system as part of the initialization of a new file object. Thus, if a file system (or file system filter
driver) requires this information for subsequent operations, it is responsible for capturing and storing it.

11.1.1 SECURITY_DESCRIPTOR
Typically, the security descriptor is stored by the file system and associated in some fashion with the file.
Thus, whenever a file is opened its security descriptor is also fetched so that it can be analyzed as part of
the IRP_MJ_CREATE processing. The security descriptor can also be fetched via the programmatic
interface (IRP_MJ_QUERY_SECURITY.)

The security descriptor is changed when requested by an application program (via IRP_MJ_SET_SECURITY)
or when a file is first created (IRP_MJ_CREATE.) When the file is first created, the initial security
descriptor, like the initial file attributes, is not validated during the create; rather, it is the security
on the directory containing the new file that is used. Thus, if a file is newly created with a security
descriptor that would normally disallow the requested access, that access is granted despite the
information contained within the security descriptor. Of course, these semantics may vary depending upon
the particular file system implementation.

typedef struct _SECURITY_DESCRIPTOR {
UCHAR Revision;
UCHAR Sbz1;
SECURITY_DESCRIPTOR_CONTROL Control;
PSID Owner;
PSID Group;
PACL Sacl;
PACL Dacl;
} SECURITY_DESCRIPTOR, *PISECURITY_DESCRIPTOR;

For versions of Windows NT through Version 4.0, the Revision is always 1. The Control field indicates
how the security descriptor itself was constructed:

• SE_OWNER_DEFAULTED (0x1) - owner was obtained via a default mechanism, not by the
original descriptor created. Affects inheritance.
• SE_GROUP_DEFAULTED (0x2) - group was obtained via a default mechanism, not by the
original descriptor created. Affects inheritance.
• SE_DACL_PRESENT (0x4) - the DACL field is a valid ACL. If the field is null, this is
considered deliberate (a null DACL, which places no access restrictions on the object, is being
set; this is not the same as an empty ACL.)
• SE_DACL_DEFAULTED (0x8) - the DACL field was obtained via a default mechanism, not
by the original descriptor created. Affects inheritance. Only valid when
SE_DACL_PRESENT is also set.
• SE_SACL_PRESENT (0x10) - the SACL field is a valid ACL. If the field is null, this is
considered deliberate (a null SACL is being set.)
• SE_SACL_DEFAULTED (0x20) - the SACL field was obtained via a default mechanism, not
by the original descriptor created. Affects inheritance. Only valid when
SE_SACL_PRESENT is also set.
• SE_SELF_RELATIVE (0x8000) - this security descriptor is in self-relative form. All fields
of the descriptor are contiguous in memory and pointer fields are offsets from the beginning
of the security descriptor.
• SE_DACL_UNTRUSTED (0x0040) - this security descriptor indicates that server SIDs found
in ACEs should be substituted with known valid SIDs.
• SE_SERVER_SECURITY (0x0080) - used when impersonating to indicate the new object
should use the passed-in security attributes and in addition grant explicit access to the current
server.
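
The way the PRESENT and DEFAULTED bits combine can be modeled with a few lines of portable C. The bit values are those listed above; the model_ functions and the three-state classification are assumptions of this sketch, not part of any NT interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Control bit values as listed in the text above. */
#define SE_DACL_PRESENT   0x0004u
#define SE_DACL_DEFAULTED 0x0008u
#define SE_SACL_PRESENT   0x0010u
#define SE_SACL_DEFAULTED 0x0020u
#define SE_SELF_RELATIVE  0x8000u

typedef enum {
    MODEL_ACL_ABSENT,    /* PRESENT bit clear: the ACL field is ignored  */
    MODEL_ACL_NULL,      /* PRESENT bit set, field null: deliberate      */
    MODEL_ACL_SUPPLIED   /* PRESENT bit set, field refers to an ACL      */
} model_acl_state;

/* Classify one ACL field given the control word and its PRESENT bit. */
model_acl_state model_classify_acl(uint16_t control, uint16_t present_bit,
                                   const void *acl)
{
    if ((control & present_bit) == 0) {
        return MODEL_ACL_ABSENT;
    }
    return acl == NULL ? MODEL_ACL_NULL : MODEL_ACL_SUPPLIED;
}

/* A DEFAULTED bit is only meaningful with the matching PRESENT bit. */
int model_control_is_consistent(uint16_t control)
{
    if ((control & SE_DACL_DEFAULTED) && !(control & SE_DACL_PRESENT)) {
        return 0;
    }
    if ((control & SE_SACL_DEFAULTED) && !(control & SE_SACL_PRESENT)) {
        return 0;
    }
    return 1;
}

/* Nonzero when the descriptor's pointer fields are really offsets. */
int model_is_self_relative(uint16_t control)
{
    return (control & SE_SELF_RELATIVE) != 0;
}
```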

Of these, the most important from the perspective of the file system is the SE_SELF_RELATIVE bit, as
this indicates whether or not the security descriptor is contained within a single buffer. When the bit is
set, the pointer fields are not pointers at all but offsets from the beginning of the buffer. This form is
ideal for storage by the file system to the backing store.

Of course, a file system may need to convert a security descriptor to its self-relative form. This can be
done via SeSetSecurityDescriptorInfo (Section 11.16) or SeQuerySecurityDescriptorInfo (Section 11.12.)
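
To illustrate why the self-relative form suits on-disk storage, the following portable sketch packs a pointer-based structure into a single buffer of offsets. The two structures are deliberately simplified, hypothetical stand-ins (two fields only), not the real SECURITY_DESCRIPTOR layout, and the function is not how the Se routines are implemented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "absolute" form: fields are pointers to separate data. */
typedef struct {
    const void *owner;
    uint32_t    owner_len;
    const void *dacl;
    uint32_t    dacl_len;
} model_absolute_sd;

/* Hypothetical self-relative header: fields are byte offsets from the
   start of the buffer; 0 means the field is absent. */
typedef struct {
    uint32_t owner_offset;
    uint32_t dacl_offset;
} model_relative_sd;

/* Returns the number of bytes required. The buffer is filled in only
   when it is non-NULL and at least that large. */
uint32_t model_make_self_relative(const model_absolute_sd *abs_sd,
                                  uint8_t *buffer, uint32_t buffer_len)
{
    uint32_t owner_len = abs_sd->owner ? abs_sd->owner_len : 0;
    uint32_t dacl_len = abs_sd->dacl ? abs_sd->dacl_len : 0;
    uint32_t needed = (uint32_t)sizeof(model_relative_sd)
                      + owner_len + dacl_len;
    model_relative_sd header = { 0, 0 };
    uint32_t next = (uint32_t)sizeof header;

    if (buffer == NULL || buffer_len < needed) {
        return needed;
    }
    if (owner_len != 0) {
        header.owner_offset = next;
        memcpy(buffer + next, abs_sd->owner, owner_len);
        next += owner_len;
    }
    if (dacl_len != 0) {
        header.dacl_offset = next;
        memcpy(buffer + next, abs_sd->dacl, dacl_len);
        next += dacl_len;
    }
    memcpy(buffer, &header, sizeof header);
    return needed;
}

/* Resolve an offset against whatever address the buffer now has. */
const void *model_relative_field(const uint8_t *buffer, uint32_t offset)
{
    return offset != 0 ? buffer + offset : NULL;
}
```

Because every field is resolved relative to the start of the buffer, the packed form can be copied, written to disk, and read back at a different address without any fix-ups.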

11.1.2 SECURITY_SUBJECT_CONTEXT
In order to uniquely identify the security identity for a particular operation (typically as part of an
IRP_MJ_CREATE operation) the I/O Manager will “capture” the necessary security context. This is
described as the SECURITY_SUBJECT_CONTEXT and provides a single encapsulation of the necessary
security information, independent of the currently executing thread.

typedef struct _SECURITY_SUBJECT_CONTEXT {
PACCESS_TOKEN ClientToken;
SECURITY_IMPERSONATION_LEVEL ImpersonationLevel;
PACCESS_TOKEN PrimaryToken;
PVOID ProcessAuditId;
} SECURITY_SUBJECT_CONTEXT, *PSECURITY_SUBJECT_CONTEXT;

This context information indicates not only the security credentials of the process requesting the operation
(as indicated by the PrimaryToken) but may also optionally indicate the security credentials of the original
requestor (the ClientToken.) The ClientToken is optional – only the PrimaryToken is required.

The fields within this data structure are opaque to the file system.

11.2 SeAccessCheck
A file system may validate that a particular caller has the necessary access to perform the given operation.
For most operations, this is done as part of opening or creating a new file, although it is sometimes
necessary for other operations, such as a rename, where a target file might be deleted as a byproduct of the
rename.

This function is also documented in the Windows NT DDK Reference Manual (Part 1, Chapter 10.) The
prototype for this function is:
BOOLEAN SeAccessCheck(
IN PSECURITY_DESCRIPTOR SecurityDescriptor,
IN PSECURITY_SUBJECT_CONTEXT SubjectSecurityContext,
IN BOOLEAN SubjectContextLocked,
IN ACCESS_MASK DesiredAccess,
IN ACCESS_MASK PreviouslyGrantedAccess,
OUT PPRIVILEGE_SET *Privileges OPTIONAL,
IN PGENERIC_MAPPING GenericMapping,
IN KPROCESSOR_MODE AccessMode,
OUT PACCESS_MASK GrantedAccess,
OUT PNTSTATUS AccessStatus
);

The SecurityDescriptor argument is normally provided by the file system itself as the file system is
responsible for storing (and hence retrieving) this security descriptor from the backing store.

The SubjectSecurityContext argument is normally provided as part of the IRP, although it can also be
obtained by the file system using SeCaptureSubjectContext (Section 11.7.)

The SubjectContextLocked argument indicates if the security context has been locked. Locking may be
done more than once, so a file system can lock the context, perform this call, and then unlock the context.
Normally the context itself is taken from the AccessState supplied with the IRP_MJ_CREATE
parameters.

The DesiredAccess argument indicates those rights being requested by the caller. Typically, this is
computed by the file system during IRP_MJ_CREATE processing from the DesiredAccess field of the
security context passed with the create parameters (the same structure that carries the AccessState.)

The PreviouslyGrantedAccess argument indicates those rights which have already been granted during the
access checking for this file. This could be done because the file system has checked for certain rights on
the directory, or the operating system itself has decided to grant certain rights based on privileges the caller
holds.

The Privileges parameter, if provided, is set to a memory block containing the privilege information for the
given caller. This allows your file system to determine if the caller has particular privileges (e.g., the
backup privilege which indicates the caller can “circumvent” normal security to read the data from the file
system.)

The GenericMapping defines the file object specific mappings from the generic rights (e.g.,
GENERIC_READ) to the standard and specific rights appropriate for file objects (keep in mind that file
objects represent not only files, but also directories and volumes.) For a file system, this is normally
provided via a call to IoGetFileObjectGenericMapping() which defines the standard I/O Manager
mappings for the generic rights.

The AccessMode defines the access mode of the caller to be used when performing the access check. Note
that if the access mode is KernelMode then this call will always return TRUE (indicating that access should
be granted.) During IRP_MJ_CREATE processing it is normal for a file system to pass in the value from
the IRP except when the SL_FORCE_ACCESS_CHECK bit is set in the I/O stack location. In that case
UserMode should be used, as this will ensure that a true access check is performed. This is essential for
correct behavior of kernel-resident services, such as the LanManager file server.
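
The mode selection described above reduces to a single test. The sketch below uses hypothetical stand-in names (TOY_PROCESSOR_MODE, TOY_SL_FORCE_ACCESS_CHECK, toy_effective_mode) so that it compiles outside the kernel; in a real FSD the two inputs would presumably be the requestor mode recorded in the IRP and the Flags field of the current I/O stack location.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for KernelMode / UserMode. */
typedef enum { TOY_KERNEL_MODE = 0, TOY_USER_MODE = 1 } TOY_PROCESSOR_MODE;

/* Stand-in for the SL_FORCE_ACCESS_CHECK stack-location flag. */
#define TOY_SL_FORCE_ACCESS_CHECK 0x01u

/* Pick the mode to hand to the access check: the requestor's own mode,
   unless the stack location asks for a forced (user-mode) check. */
static TOY_PROCESSOR_MODE toy_effective_mode(TOY_PROCESSOR_MODE requestor_mode,
                                             uint8_t stack_flags)
{
    if (stack_flags & TOY_SL_FORCE_ACCESS_CHECK)
        return TOY_USER_MODE;
    return requestor_mode;
}
```

A kernel-resident server issuing a create on behalf of a remote client would set the flag, so the check is evaluated as UserMode even though the request originates in kernel mode.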

The GrantedAccess defines the access that was granted by the Security Reference Monitor. The contents of
this field are undefined if SeAccessCheck returns FALSE.

The AccessStatus defines the precise reason the access check failed. Presently it only returns
STATUS_ACCESS_DENIED but is provided for future security enhancements.

11.3 SeAppendPrivileges
This call is used to add additional privileges to an existing AccessState data structure provided as part of the
IRP_MJ_CREATE arguments.
NTKERNELAPI NTSTATUS SeAppendPrivileges(
PACCESS_STATE AccessState,
PPRIVILEGE_SET Privileges
);

Typically, this call is used to update the set of Privileges associated with this data structure as a result of a
call to SeAccessCheck (Section 11.2.)

11.4 SeAssignSecurity
This routine is used to build a new security descriptor. It is described more fully in the Windows NT DDK
Reference Manual (Part 1, Chapter 10.)
NTSTATUS SeAssignSecurity(
IN PSECURITY_DESCRIPTOR ParentDescriptor,
IN PSECURITY_DESCRIPTOR ExplicitDescriptor,
OUT PSECURITY_DESCRIPTOR *NewDescriptor,
IN BOOLEAN IsDirectoryObject,
IN PSECURITY_SUBJECT_CONTEXT SubjectContext,
IN PGENERIC_MAPPING GenericMapping,
IN POOL_TYPE PoolType
);

The mechanism this routine uses to build the new security descriptor is based upon the information
provided. If an ExplicitDescriptor is provided it will be used and combined with information from the
ParentDescriptor and the results are stored in the NewDescriptor.

A code sample for this call is from the Windows NT DDK (src\network\tdi\address.c):
//
// Assign the security descriptor (need to do this with
// the spinlock released because the descriptor is not
// mapped. BUGBUG: Need to synchronize Assign and Access).
//

AccessState = IrpSp->Parameters.Create.SecurityContext->AccessState;

status = SeAssignSecurity(
NULL, // parent descriptor
AccessState->SecurityDescriptor,
&address->SecurityDescriptor,
FALSE, // is directory
&AccessState->SubjectSecurityContext,
&AddressGenericMapping,
PagedPool);

Additional information about the owner, primary group, and default process security information is
extracted from the SubjectContext argument.

11.5 SeAuditingFileEvents
This routine is used by a file system to determine if auditing is necessary at all, based on auditing policy
established on the current system. Typically, this is used to avoid the expensive operations required
to actually perform an auditing operation when policy does not call for one.

The prototype for this function is:
NTKERNELAPI BOOLEAN SeAuditingFileEvents(
IN BOOLEAN AccessGranted,
IN PSECURITY_DESCRIPTOR SecurityDescriptor
);

• The AccessGranted parameter indicates if the requested access was granted.
• The SecurityDescriptor is the security descriptor associated with the file (or directory.)

If this routine returns true, full auditing is required for this particular operation. The file system should
build all information necessary and call SeOpenObjectAuditAlarm (Section 11.10.)

11.6 SeAuditingFileOrGlobalEvents
This routine is provided to allow future modifications to Windows NT that will enable per-user auditing
(as would be required for more secure systems.) At present, while implemented, it does not provide any
additional functionality to that provided by the routine SeAuditingFileEvents (Section 11.5.) Either
routine may be used but this routine has additional processing overhead due to the additional context
information (which identifies the particular user in this case.) The prototype for this call is:
NTKERNELAPI
BOOLEAN
SeAuditingFileOrGlobalEvents(
IN BOOLEAN AccessGranted,
IN PSECURITY_DESCRIPTOR SecurityDescriptor,
IN PSECURITY_SUBJECT_CONTEXT SubjectSecurityContext
);

The AccessGranted parameter indicates if the requested access was granted.

The SecurityDescriptor is the security descriptor associated with the file (or directory.)

The SubjectSecurityContext argument is normally provided as part of the IRP, although it can also be
obtained by the file system using SeCaptureSubjectContext (Section 11.7.)

If this routine returns true, full auditing is required for this particular operation. The file system should
build all information necessary and call SeOpenObjectAuditAlarm (Section 11.10.)

11.7 SeCaptureSubjectContext
This routine can be used to “capture” the security information (the “context”) for a given caller.
NTKERNELAPI
VOID
SeCaptureSubjectContext (
OUT PSECURITY_SUBJECT_CONTEXT SubjectContext
);

The SubjectContext describes the security context for the current security credentials at the time the context
is captured. The Security Reference Monitor allocates the storage necessary for the captured security
context. This context information must be released by the FSD using the SeReleaseSubjectContext
call (see Section 11.14.)

Once captured, the SubjectContext may be used in an arbitrary process context.

11.8 SeLockSubjectContext
The prototype for this function is:
NTKERNELAPI VOID SeLockSubjectContext(
IN PSECURITY_SUBJECT_CONTEXT SubjectContext
);

The SubjectContext argument is the context to be locked.

This can be used by a file system to ensure that the SubjectContext is locked and will remain valid for the
duration of the file system operation. Typically, a file system will call this prior to passing the
SubjectContext parameter to any of the security routines.

The SubjectContext must be unlocked when no longer needed by the file system. This can be done via a
call to SeUnlockSubjectContext (Section 11.17.)

11.9 SeMarkLogonSessionForTerminationNotification
This routine is used to indicate that the FSD should be notified whenever the specified logon session
terminates. An FSD learns of such termination via its previously registered session termination routine
(see SeRegisterLogonSessionTerminatedRoutine Section 11.13.) The prototype for this function is:
NTSTATUS
SeMarkLogonSessionForTerminationNotification(
IN PLUID LogonId
);

The LogonId uniquely identifies the logon session for which the FSD should be notified.

11.10 SeOpenObjectAuditAlarm
This routine is used by a file system to generate any necessary audit events during IRP_MJ_CREATE
processing. The Security Reference Monitor makes the actual decision on whether or not to perform
auditing operations. The file system is merely responsible for actually calling this function.
NTKERNELAPI VOID SeOpenObjectAuditAlarm (
IN PUNICODE_STRING ObjectTypeName,
IN PVOID Object OPTIONAL,
IN PUNICODE_STRING AbsoluteObjectName OPTIONAL,
IN PSECURITY_DESCRIPTOR SecurityDescriptor,
IN PACCESS_STATE AccessState,
IN BOOLEAN ObjectCreated,
IN BOOLEAN AccessGranted,
IN KPROCESSOR_MODE AccessMode,
OUT PBOOLEAN GenerateOnClose
);

The ObjectTypeName argument is used to identify the type of object. For a file object this must be a
UNICODE_STRING that indicates the word “File”.

The Object argument identifies the object itself. This parameter need not be provided by your file system.

The AbsoluteObjectName argument identifies the name of the object itself. This parameter should be
provided by the file system if at all possible (this might not be possible for some file systems if they
support opening of a file by ID.)

Normally, either the Object or the AbsoluteObjectName is provided as an argument to this call. Since the
file system maintains the name (rather than being directly associated with the object by the object manager)
it is normal for the file system to provide the name directly.

The SecurityDescriptor is the security descriptor that applies to the file being opened.

The AccessState argument is the access information provided as part of the IRP_MJ_CREATE call's
parameters.

The ObjectCreated argument indicates if this is a creation of a new object.

The AccessGranted argument indicates if access to this object was granted (presumably by the Security
Reference Monitor.)

The AccessMode argument is either UserMode or KernelMode. KernelMode operations are not audited,
but your code need not be aware of this distinction.

The GenerateOnClose value is normally the field of the same name in the AccessState parameter (part of
the IRP_MJ_CREATE arguments.) This advises the object manager to generate auditing when the file
object itself is closed – no action is required by the file system to ensure this is done, aside from ensuring
that this field is set as appropriate when the file object is first created.

This routine is used by a file system that supports auditing. Note that this operation can take considerable
time.

This call should not be used if the object is being opened for delete access. In that case, the file system
should instead use SeOpenObjectForDeleteAuditAlarm (Section 11.11.)

11.11 SeOpenObjectForDeleteAuditAlarm
This routine is used when an object is opened for delete access. Otherwise, SeOpenObjectAuditAlarm
(Section 11.10) should be used.

The prototype for this call is:
NTKERNELAPI VOID SeOpenObjectForDeleteAuditAlarm (
IN PUNICODE_STRING ObjectTypeName,
IN PVOID Object OPTIONAL,
IN PUNICODE_STRING AbsoluteObjectName OPTIONAL,
IN PSECURITY_DESCRIPTOR SecurityDescriptor,
IN PACCESS_STATE AccessState,
IN BOOLEAN ObjectCreated,
IN BOOLEAN AccessGranted,
IN KPROCESSOR_MODE AccessMode,
OUT PBOOLEAN GenerateOnClose
);

The ObjectTypeName argument is used to identify the type of object. For a file object this must be a
UNICODE_STRING that indicates the word “File”.

The Object argument identifies the object itself. This parameter need not be provided by your file system.

The AbsoluteObjectName argument identifies the name of the object itself. This parameter should be
provided by the file system if at all possible (this might not be possible for some file systems if they
support opening of a file by ID.)

Normally, either the Object or the AbsoluteObjectName is provided as an argument to this call. Since the
file system maintains the name (rather than being directly associated with the object by the object manager)
it is normal for the file system to provide the name directly.

The SecurityDescriptor is the security descriptor that applies to the file being opened.

The AccessState argument is the access information provided as part of the IRP_MJ_CREATE call's
parameters.

The ObjectCreated argument indicates if this is a creation of a new object.

The AccessGranted argument indicates if access to this object was granted (presumably by the Security
Reference Monitor.)

The AccessMode argument is either UserMode or KernelMode. KernelMode operations are not audited,
but your code need not be aware of this distinction.

The GenerateOnClose value is normally the field of the same name in the AccessState parameter (part of
the IRP_MJ_CREATE arguments.) This advises the object manager to generate auditing when the file
object itself is closed – no action is required by the file system to ensure this is done, aside from ensuring
that this field is set as appropriate when the file object is first created.

This call is used to audit deletion events rather than normal “open” events. This should only be used when
the file is being opened for DELETE access. Otherwise, the FSD should use SeOpenObjectAuditAlarm
(Section 11.10.)

11.12 SeQuerySecurityDescriptorInfo
This routine is used to extract specific information from a security descriptor. The prototype for this call is:
NTKERNELAPI NTSTATUS SeQuerySecurityDescriptorInfo (
IN PSECURITY_INFORMATION SecurityInformation,
OUT PSECURITY_DESCRIPTOR SecurityDescriptor,
IN OUT PULONG Length,
IN PSECURITY_DESCRIPTOR *ObjectsSecurityDescriptor
);

The SecurityInformation argument identifies which security information should be extracted from the
given security descriptor.

The SecurityDescriptor argument provides a pointer to a buffer where the output security descriptor should
be built (in self-relative form.)

The Length argument indicates the size of the buffer provided as the SecurityDescriptor. If the buffer is not
large enough this routine will return STATUS_BUFFER_TOO_SMALL.
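
A caller that does not know the required size in advance can use a size-then-retry pattern. The sketch below demonstrates the pattern against a hypothetical stand-in (toy_query) rather than the real routine, and it assumes, which the text above does not state, that the Length value is updated with the required size when the buffer is too small. All names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_STATUS_SUCCESS           0x00000000u
#define TOY_STATUS_BUFFER_TOO_SMALL  0xC0000023u

/* Hypothetical stand-in for the query routine. ASSUMPTION: when the buffer
   is too small, *length is updated with the size that would be required. */
static uint32_t toy_query(const void *source, uint32_t source_len,
                          void *buffer, uint32_t *length)
{
    assert(source != NULL && length != NULL);
    if (buffer == NULL || *length < source_len) {
        *length = source_len;
        return TOY_STATUS_BUFFER_TOO_SMALL;
    }
    memcpy(buffer, source, source_len);
    *length = source_len;
    return TOY_STATUS_SUCCESS;
}

/* Try an initial guess, then retry once with the size reported back.
   Returns a malloc'd buffer (caller frees) or NULL on failure. */
static void *toy_query_alloc(const void *source, uint32_t source_len,
                             uint32_t initial_len, uint32_t *out_len)
{
    uint32_t length = initial_len;
    uint32_t status;
    void *buffer = malloc(length ? length : 1);

    assert(out_len != NULL);
    if (buffer == NULL)
        return NULL;

    status = toy_query(source, source_len, buffer, &length);
    if (status == TOY_STATUS_BUFFER_TOO_SMALL) {
        free(buffer);
        buffer = malloc(length);     /* length now holds the required size */
        if (buffer == NULL)
            return NULL;
        status = toy_query(source, source_len, buffer, &length);
    }
    if (status != TOY_STATUS_SUCCESS) {
        free(buffer);
        return NULL;
    }
    *out_len = length;
    return buffer;
}
```

When servicing a query on behalf of a user, a file system would more likely pass the caller's own buffer straight through and return the status unchanged; the retry shown here suits internal callers that own their allocation.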

The ObjectsSecurityDescriptor argument provides the input security descriptor from which the requisite
information is extracted.

This routine is normally used by a file system to return information as a result of an
IRP_MJ_QUERY_SECURITY operation, with the SecurityInformation field being specified in the
Parameters field of the I/O stack location.

This routine may also be used to build a self-relative version of an existing security descriptor.

11.13 SeRegisterLogonSessionTerminatedRoutine
For some file systems it is necessary to track when logon sessions are terminated so that any security state
being managed by the file system on behalf of a user who has logged onto the system can be discarded.
This is a normal security precaution with network file systems, for instance, so that when a user logs off, a
new user does not “inherit” access to files because of this residual security information.

Note that the logon processing is normally done as part of a security support provider (SSP) and
documentation on constructing such a provider is included in the Win32 SDK (in the file spk.mvb.) The
details of building such a provider are beyond the scope of this discussion, however.

File systems are interested only in when a session terminates, so that they can destroy whatever security
information (i.e., security tokens such as Kerberos tickets) is associated with that particular session. For
example, a network file system might “log off” with every server that had been previously connected.

An FSD does this by registering a session termination function with the Security Reference Monitor. The
prototype for the FSD provided session termination routine is:
typedef NTSTATUS
(*PSE_LOGON_SESSION_TERMINATED_ROUTINE)(
IN PLUID LogonId);

The LogonId uniquely identifies the previously logged on entity that is now logging off.

This callback routine is registered via the SeRegisterLogonSessionTerminatedRoutine call. The prototype
for this call is:
NTSTATUS
SeRegisterLogonSessionTerminatedRoutine(
IN PSE_LOGON_SESSION_TERMINATED_ROUTINE CallbackRoutine
);

The CallbackRoutine is a pointer to the FSD-provided function. This function will be called anytime a user
logs off the system.
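
What an FSD might do inside such a callback can be sketched with a small table of per-session state keyed by logon ID. All types and names here (TOY_LUID, toy_sessions, toy_session_terminated and so on) are hypothetical, the callback signature is only approximated, and a real driver would need proper synchronization around the table; this is an illustration of the bookkeeping, not of the registration API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for LUID. */
typedef struct { uint32_t LowPart; int32_t HighPart; } TOY_LUID;

#define TOY_MAX_SESSIONS 8

typedef struct {
    bool     in_use;
    TOY_LUID logon_id;
    int      cached_credential;   /* placeholder for per-session security state */
} TOY_SESSION_ENTRY;

static TOY_SESSION_ENTRY toy_sessions[TOY_MAX_SESSIONS];

static bool toy_luid_equal(const TOY_LUID *a, const TOY_LUID *b)
{
    return a->LowPart == b->LowPart && a->HighPart == b->HighPart;
}

/* Remember some state for a logon session; false if the table is full. */
static bool toy_remember_session(TOY_LUID id, int credential)
{
    size_t i;
    for (i = 0; i < TOY_MAX_SESSIONS; i++) {
        if (!toy_sessions[i].in_use) {
            toy_sessions[i].in_use = true;
            toy_sessions[i].logon_id = id;
            toy_sessions[i].cached_credential = credential;
            return true;
        }
    }
    return false;
}

/* Shape of the FSD-provided callback: discard everything held for the
   logon ID that is going away. Returns 0 to indicate success. */
static int32_t toy_session_terminated(const TOY_LUID *logon_id)
{
    size_t i;
    assert(logon_id != NULL);
    for (i = 0; i < TOY_MAX_SESSIONS; i++) {
        if (toy_sessions[i].in_use &&
            toy_luid_equal(&toy_sessions[i].logon_id, logon_id)) {
            toy_sessions[i].cached_credential = 0;
            toy_sessions[i].in_use = false;
        }
    }
    return 0;
}

/* Number of sessions for which state is still held. */
static size_t toy_session_count(void)
{
    size_t i, count = 0;
    for (i = 0; i < TOY_MAX_SESSIONS; i++)
        if (toy_sessions[i].in_use)
            count++;
    return count;
}
```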

11.14 SeReleaseSubjectContext
This routine is used to release a previously captured security context (see SeCaptureSubjectContext,
Section 11.7.) The prototype for this operation is:
NTKERNELAPI
VOID
SeReleaseSubjectContext (
IN PSECURITY_SUBJECT_CONTEXT SubjectContext
);

This routine may be called in an arbitrary process context.

11.15 SeSetAccessStateGenericMapping
This routine is used to set the generic mapping information for a given access state data structure.
Normally, this call is needed only by file systems, file system filter drivers, and kernel-resident applications
(such as a file server) which construct their own access state data structures.

The prototype for this function is:
VOID
SeSetAccessStateGenericMapping (
PACCESS_STATE AccessState,
PGENERIC_MAPPING GenericMapping
);

For a normal IRP_MJ_CREATE operation, the establishment of this mapping is done by Windows NT
and is not done by the file system.

It is possible to obtain the standard file object mappings by using the I/O Manager function
IoGetFileObjectGenericMapping (documented in the Windows NT DDK Reference Manual, Part 1:
Kernel-Mode Support Routines, Chapter 4, I/O Manager Routines.)

11.16 SeSetSecurityDescriptorInfo
This routine is used to set the security descriptor on an existing object. It is not normally used by file
systems, but can be used to (for instance) modify the security descriptor on a device object. Note, however,
that because the file system itself calls the Security Reference Monitor to make security decisions (rather
than relying upon the object manager) security information is not normally applied to file objects.

The prototype for this call is:
NTKERNELAPI NTSTATUS SeSetSecurityDescriptorInfo (
IN PVOID Object OPTIONAL,
IN PSECURITY_INFORMATION SecurityInformation,
IN PSECURITY_DESCRIPTOR SecurityDescriptor,
IN OUT PSECURITY_DESCRIPTOR *ObjectsSecurityDescriptor,
IN POOL_TYPE PoolType,
IN PGENERIC_MAPPING GenericMapping
);

The Object specifies the (optional) object to which the updated security descriptor should be applied.

The SecurityInformation specifies which components of the security descriptor should be modified using
the input SecurityDescriptor. This could be, for instance, the security information provided as part of the
IRP_MJ_SET_SECURITY call (Parameters.SetSecurity.SecurityInformation.) It indicates some
combination of the four components that make up the security descriptor: the owner, group, discretionary
ACL or system ACL.

The ObjectsSecurityDescriptor is the output security descriptor. Note that it is allocated from pool (with
the type specified by the next argument, PoolType) and the caller is responsible for freeing this memory.

The PoolType indicates what type of memory to allocate for the ObjectsSecurityDescriptor being returned.

The GenericMapping defines the mapping from generic rights (GENERIC_READ, GENERIC_WRITE,
GENERIC_EXECUTE, GENERIC_ALL) to standard and specific rights. For a file system, normally this
is provided by using the I/O Manager function IoGetFileObjectGenericMapping.

This routine can also be used by a file system to build a “self relative” version of a security descriptor. In
this case, it is sufficient to indicate in the SecurityInformation argument that all four values are being
“modified” without specifying an Object.
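
Indicating that "all four values" are being modified amounts to setting four bits in the SecurityInformation mask. The sketch below shows that mask being built and inspected; the bit values are the *_SECURITY_INFORMATION definitions as I understand them from the Windows headers, while the TOY_ type and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TOY_SECURITY_INFORMATION;   /* stand-in for SECURITY_INFORMATION */

#define OWNER_SECURITY_INFORMATION  0x00000001u
#define GROUP_SECURITY_INFORMATION  0x00000002u
#define DACL_SECURITY_INFORMATION   0x00000004u
#define SACL_SECURITY_INFORMATION   0x00000008u

/* The mask that names all four components of a security descriptor. */
static TOY_SECURITY_INFORMATION toy_all_components(void)
{
    return OWNER_SECURITY_INFORMATION | GROUP_SECURITY_INFORMATION |
           DACL_SECURITY_INFORMATION  | SACL_SECURITY_INFORMATION;
}

/* How many of the four components a given mask selects. */
static unsigned toy_component_count(TOY_SECURITY_INFORMATION info)
{
    unsigned count = 0;
    if (info & OWNER_SECURITY_INFORMATION) count++;
    if (info & GROUP_SECURITY_INFORMATION) count++;
    if (info & DACL_SECURITY_INFORMATION)  count++;
    if (info & SACL_SECURITY_INFORMATION)  count++;
    return count;
}
```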

11.17 SeUnlockSubjectContext
The prototype for this call is:
NTKERNELAPI VOID SeUnlockSubjectContext(
IN PSECURITY_SUBJECT_CONTEXT SubjectContext
);

This routine is used to release a previously locked SubjectContext.

See also SeLockSubjectContext (Section 11.8.)

11.18 SeUnregisterLogonSessionTerminatedRoutine
This routine is provided to allow an FSD to deregister a previously registered session termination function
(see SeRegisterLogonSessionTerminatedRoutine Section 11.13.)
NTSTATUS
SeUnregisterLogonSessionTerminatedRoutine(
IN PSE_LOGON_SESSION_TERMINATED_ROUTINE CallbackRoutine
);

The CallbackRoutine must be the same routine provided to a previous call to
SeRegisterLogonSessionTerminatedRoutine.

12 Supplementary Reading List
In this section we provide a supplementary reading list on file systems development.

Adve, Sarita V. and Kourosh Gharachorloo, Shared Memory Consistency Models: A Tutorial, Digital
Equipment Corporation, Western Research Laboratory, Research Report 95/7, September, 1995.
Agrawal, Divyakant, and Amr El Abbadi, An Efficient and Fault-Tolerant Solution for Distributed Mutual
Exclusion, ACM Transactions on Computer Systems, Vol 9, No. 1, February 1991, pp. 1-20.
Akyürek, Sedat and Kenneth Salem, Adaptive Block Rearrangement, ACM Transactions on Computer
Systems, Vol. 13, No. 2, May 1995, Pages 89-121.
Anderson, David P., Yoshitomo Osawa and Ramesh Govindan, A File System for Continuous Media, ACM
Transactions on Computer Systems, Vol. 10, No. 4, November 1992, pp. 311-337.
Anderson, David P., Device Reservation in Audio/Video Editing Systems, ACM Transactions on Computer
Systems, Vol. 15, No. 2, May 1997, pp. 111-133.
Anderson, Thomas E., Michael D. Dahlin, Jeanna M. Neefe, David A. Patterson, Drew S. Roselli, and
Randolph Y. Wang, Serverless Network File Systems, ACM Transactions on Computer Systems, Vol.
14, No. 1, February 1996, pp. 41-79.
Bach, Maurice J., The Design of the UNIX Operating System, Prentice-Hall Software Series, 1986.
Bacon, Jean, Ken Moody, Sue Thompson, and Tim Wilson, A Multi-Service Storage Architecture,
Operating Systems Review, Vol. 25, No. 4, October 1991, pp. 47-65.
Bershad, Brian, David Black, David DeWitt, Garth Gibson, Kai Li, Larry Peterson, Marc Snir, Scalable I/O
Initiative Working Paper No. 4, Operating Systems Working Group of the Scalable I/O Initiative.
Undated.
Borg, Anita, Wolfgang Blau, Wolfgang Graetsch, Ferdinand Herrmann, and Wolfgang Oberle, Fault
Tolerance Under UNIX, ACM Transactions on Computer Systems, Vol. 7, No. 1, February 1989, pp.
1-24.
Carretero, J., F. Pérez, P. de Miguel, F. García, and L. Alonso, ParFiSys: A Parallel File System for MPP,
Operating Systems Review, Vol. 30, No. 2, April 1996, pp. 74-80.
Chang, Ye-In and Yao-Jen Chang, A Fault-Tolerant Dynamic Triangular Mesh Protocol for Distributed
Mutual Exclusion, Operating Systems Review, July 1997, pp. 29-44.
Chao, Chia, Robert English, David Jacobson, Alexander Stepanov, and John Wilkes, Mime: a high
performance parallel storage device with strong recovery guarantees, Hewlett-Packard Laboratories
Concurrent Systems Project, Technical Report HP-CSP-92-9 rev 1, 18 March 1992 revised 6
November 1992.
Cristian, Flaviu, Understanding Fault-Tolerant Distributed Systems, Communications of the ACM,
February 1991, Vol. 34, No. 2, pp. 57-78.
Deconinck, Geert, Johan Vounckx, Rudi Cuyvers, Rudy Lauwereins, Survey of Checkpointing and
Rollback Techniques, ESPRIT Project 6731 (FTMPS), Technical Reports O3.1.8 and O3.1.12, June
1993.
De Jonge, Wiebren, The Logical Disk: A New Approach to Improving File Systems, Proceedings of the
Fourteenth ACM Symposium on Operating Systems Principles, December, 1993, pp. 15-28.
Faber, Theodore, Optimizing Throughput in a Workstation-based Network File System over a High
Bandwidth Local Area Network, Operating Systems Review, Vol 32, No. 1, January, 1998, pp. 29-40.

Fox, Armando, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier, Cluster-Based
Scalable Network Services, Proceedings of the Sixteenth ACM Symposium on Operating Systems
Principles, October 1997, pp. 78-91.
Franklin, Michael J., Michael J. Carey, and Miron Livny, Transactional Client-Server Cache Consistency:
Alternatives and Performance, ACM Transactions on Database Systems, Vol. 22, No. 3, September,
1997, pp. 315-363.
Ganger, Gregory R. and Yale N. Patt, Metadata Update Performance in File Systems, First Symposium on
Operating Systems Design and Implementation, November, 1994, pp. 49-60.
Gifford, David K., Pierre Jouvelot, Mark A. Sheldon, and James W. O’Toole, Jr., Semantic File Systems,
Thirteenth ACM Symposium on Operating Systems Principles, October, 1991, pp. 16-25.
Goldstein, Andrew C., The Design and Implementation of a Distributed File System, Digital Technical
Journal, No. 5, September 1987, pp. 45-55.
Goscinski, A., Distributed Operating Systems: The Logical Design, Addison-Wesley Publishing Company,
1991.
Hartman, John H. and John K. Ousterhout, The Zebra Striped Network File System, ACM Transactions on
Computer Systems, Vol. 13, No. 3, August 1995, pp. 274-310. Also, Fourteenth ACM Symposium on
Operating Systems Principles, December, 1993, pp. 29-43.
Haskin, Roger L., The Shark Continuous-Media File Server, IBM Almaden Research Center, Technical
Report, Undated.
Haskin, Roger L. and Frank B. Schmuck, The Tiger Shark File System, IBM Almaden Research Center,
Technical Report, Undated.
Haskin, Roger L. and Frank L. Stein, A System for the Delivery of Interactive Television Programming,
IBM Almaden Research Center, Technical Report, Undated.
Helme, Arne, and Tage Stabell-Kulø, Security Functions for a File Repository, Operating Systems Review,
April 1997, pp. 3-8.
Huang, Y. and C. M. R. Kintala, Software Fault Tolerance: Technologies and Experience, AT&T Bell
Laboratories, Technical Report, 1993.
Kotz, David, Disk Directed I/O for MIMD Multiprocessors, First Symposium on Operating Systems
Design and Implementation, November, 1994, pp. 61-74.
Kotz, David and Nils Nieuwejaar, Flexibility and Performance of Parallel File Systems, Operating Systems
Review, Vol. 30, No. 2, April 1996, pp. 63-73.
Krieger, Orran, Michael Stumm, Ron Unrau, and Jonathan Hanna, A Fair Fast Scalable Reader-Writer
Lock, Proc. Intl. Conf. On Parallel Processing, 1993.
Krieger, Orran and Michael Stumm, HFS: A Flexible File System for large-scale Multiprocessors,
Proceedings of the 1993 DAGS/PC Symposium.
Lampson, Butler and David Lomet, A New Presumed Commit Optimization for Two Phase Commit, Digital
Equipment Corporation, Cambridge Research Laboratory, Technical Report 93/1, February, 1993.
Leach, Paul and Salz, Rich, UUIDs and GUIDs, IETF Draft RFC, The Open Group, February, 1997. Text
available on-line as http://www.camb.opengroup.org/dce/info/ietf-draft.txt.
Lee, Edward K., Highly-Available, Scalable Network Storage, Digital Equipment Corporation, Systems
Research Center, Technical Report, Undated.
Lee, Edward K. and Chandramohan A. Thekkath, Petal: Distributed Virtual Disks, Digital Equipment
Corporation, Systems Research Center, Technical Report, Undated.
Levi, Shem-Tov and Ashok K. Agrawala, Fault Tolerant System Design, McGraw-Hill Series on Computer
Engineering, 1994.

Li, Qun, Jie Jing and Li Xie, BFXM: A Parallel File System Model Based on the Mechanism of Distributed
Shared Memory, Operating Systems Review, Vol. 31, No. 4, October, 1997, pp. 30-40.
Liskov, Barbara, Sanjay Ghemawat, Robert Gruber, Paul Johnson, Liuba Shrira, and Michael Williams,
Replication in the Harp File System, Thirteenth ACM Symposium on Operating Systems Principles,
October, 1991, pp. 226-238.
Litwin, Witold, Marie-Anne Neimat, and Donovan A. Schneider, LH* - A Scalable, Distributed Data
Structure, ACM Transactions on Database Systems, Vol. 21, No. 4, December 1996, pp. 480-525.
Lomet, David B., Recovery for Shared Disk Systems Using Multiple Redo Logs, Digital Equipment
Corporation, Cambridge Research Laboratory, Technical Report # 90/4, October 1, 1990.
Lomet, David, Consistent Timestamping for Transactions in Distributed Systems, Digital Equipment
Corporation, Cambridge Research Laboratory, Technical Report 90/3, September, 1990.
Lyu, Michael R. (Editor), Software Fault Tolerance, John Wiley & Sons, 1995.
Mahoney, Bill, An “Open” Oriented File System, Operating Systems Review, Vol. 28, No. 1, January
1994, pp. 48-54.
Mann, Timothy, Andrew Birrell, Andy Hisgen, Charles Jerian, and Garret Swart, A Coherent Distributed
File Cache With Directory Write-Behind, ACM Transactions on Computer Systems, Vol. 12, No. 2,
May 1994, pp. 123-164.
Matthews, Jeanna Neefe, Drew Roselli, Adam M. Costello, Randolph Y. Wang, and Thomas E. Anderson,
Improving the Performance of Log-Structured File Systems with Adaptive Methods, Proceedings of the
Sixteenth ACM Symposium on Operating Systems Principles, October, 1997, pp. 238-251.
Mogul, Jeffrey C., A Better Update Policy, Digital Equipment Corporation, Western Research Laboratory,
Research Report, to be published in the Summer 1994 USENIX Conference Proceedings, April, 1994.
Mohindra, Ajay and Umakishore Ramachandran, Fault-tolerant Transactions Using Distributed Shared
Memory, School of Information and Computer Science, Georgia Institute of Technology, Technical
Report GIT-ICS-89/41, November 1989.
Mohindra, Ajay and Umakishore Ramachandran, Implementing Fault-tolerant Atomic Transactions Using
Distributed Shared Memory, School of Information and Computer Science, Georgia Institute of
Technology, Technical Report GIT-CC-91/13, January 1991.
Muller, Keith and Joseph Pasquale, A High Performance Multi-Structured File System Design, Thirteenth
ACM Symposium on Operating Systems Principles, October 1991, pp. 56-67.
Nelson, Michael M., Virtual Memory vs. The File System, Digital Equipment Corporation, Western
Research Laboratory, Technical Report, Undated.
O’Toole, James and Liuba Shrira, Opportunistic Log: Efficient Installation Reads in a Reliable Storage Server,
First Symposium on Operating Systems Design and Implementation, November, 1994, pp. 39-48.
Ousterhout, John and Fred Douglis, Beating the I/O Bottleneck: A Case for Log-Structured File Systems,
Computer Science Division, Electrical Engineering and Computer Sciences, University of California at
Berkeley, Technical Report, Undated.
Patterson, R.H., G.A. Gibson, E. Ginting, D. Stodolsky, Informed Prefetching and Caching, Fifteenth ACM
Symposium on Operating Systems Principles, December, 1995, pp. 79-95.
Plank, James S. and Kai Li, ickp: A Consistent Checkpointer for Multicomputers, IEEE Parallel &
Distributed Technology, Summer 1994, pp. 62-67.
Rahm, Erhard, Concurrency and Coherency Control in Database Sharing Systems, University of
Kaiserslautern, Dept. of Computer Science, Technical Report ZRI 3/91, December 1991, Revised
March 1993.
Rangan, P. Venkat and Harrick M. Vin, Designing File Systems For Digital Video and Audio, Thirteenth
ACM Symposium on Operating Systems Principles, October 1991, pp. 81-94.

Reed, Benjamin, and Darrell D.E. Long, Analysis of Caching Algorithms for Distributed File Systems,
Operating Systems Review, Vol. 30, No. 3, July 1996, pp. 12-21.
Robinson, John T., Analysis of Steady-State Segment Storage Utilizations in a Log-Structured File System
with Least-Utilized Segment Cleaning, Operating Systems Review, October 1996, pp. 29-32.
Rosenblum, Mendel and John K. Ousterhout, The Design and Implementation of a Log-Structured File
System, ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992, pp. 26-52. Also in
the Thirteenth ACM Symposium on Operating Systems Principles, October, 1991, pp. 1-15.
Rosenblum, Mendel and John K. Ousterhout, The LFS Storage Manager, USENIX Technical Conference,
Anaheim, CA., June 1990.
Shirriff, Ken and John Ousterhout, Sawmill: A High-Bandwidth Logging File System, University of
California Berkeley, Technical Report, Undated.
Stabell-Kulø, Security and Log Structured File Systems, Operating Systems Review, April 1997, pp. 9-10.
Steere, David C., Exploiting the Non-Determinism and Asynchrony of Set Iterators to Reduce Aggregate
File I/O Latency, Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles,
October 1997, pp. 252-263.
Tanenbaum, Andrew S., Distributed Operating Systems, Prentice-Hall, 1995.
Tewari, Renu, Rajat Mukherjee and Daniel M. Dias, Real-Time Issues for Clustered Multimedia Servers,
IBM Research Report, RC 20020, April 6, 1995.
Thekkath, Chandramohan A., Timothy Mann and Edward K. Lee, Frangipani: A Scalable Distributed File
System, Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, October,
1997, pp. 224-237.
Tridgell, Andrew and David Walsh, The HiDIOS filesystem, Australian National University, Technical
Report, Undated.
Van Meter, Rodney, A Brief Survey of Current Work on Network Attached Peripherals, Operating Systems
Review, Vol. 30, No. 1, January 1996, pp. 63-70.
