Linux Kernel Hackers' Guide

The Linux Kernel Hackers' Guide

The HyperNews Linux KHG Discussion Pages

Because nearly every recent post to this site has been either a question from rude cracker-wannabes about how to break into other people's systems or a request for basic technical support, posting to the KHG has been disabled, probably permanently. For now, you can read old posts, but you cannot send replies. In any case, there are now far better resources available.
Go get the real thing!


Published by: Srinivasa Kumar Gullapalli on Nov 07, 2010
Copyright:Attribution Non-commercial


Alessandro Rubini wrote Linux Device Drivers, which is what the KHG could have been (maybe) but isn't. If you have a question and can't find the answer here, go get a copy of Linux Device Drivers and read it--chances are that when you are done, you will not need to ask a question here. Run, don't walk, to get a copy of this book.

The Linux Kernel
Go read The Linux Kernel if you want an introduction to the Linux kernel that is better than the KHG. It is a great complement to Linux Device Drivers. Read it.

Table of Contents
Tour of the Linux Kernel
This is a somewhat incomplete tour of the Linux Kernel, based on Linux 1.0.9 and the 1.1.x development series. Most of it is still relevant.

Device Drivers
The most common Linux kernel programming task is writing a new device driver. The great majority of the code in the kernel is new device drivers; between 1.2.13 and 2.0 the size of the source code more than doubled, and most of that was from adding device drivers.

http://ldp.iol.it/LDP/khg/HyperNews/get/khg.html (1 di 13) [08/03/2001 10.07.54]


Filesystems
Adding a filesystem to Linux doesn't have to involve magic...

Linux Memory Management
A few outdated documents, and one completely new one by David Miller on the Linux cache flush architecture.

How System Calls Work on Linux/i86
Although this was written while Linux 0.99.2 was current, it still applies. A few filenames may need updating. find is your friend--just respond with the changes and they will be added.

Other Sources of Information
The KHG is just one collection of information about the Linux kernel. There are others!

Membership and Subscription
At the bottom of the page, you will notice two hyperlinks (among several others): Subscribe and Members. Using the KHG to its fullest involves these two hyperlinks, even though you are not required to be a member to read these pages and post responses.

HyperNews membership is site-wide. That is, you only need to sign up and become a member once for the entire KHG. It doesn't take much to be a member. Each member is identified by a unique name, which can either be a nickname or an email address. We suggest using your email address; that way it will be unique and easy to remember. On the other hand, you may want to choose a nickname if you expect to be changing your email address at any time. We also want your real name, email address, and home page (if you have one). You can give us your phone and address if you want. You will be asked to choose a password. You can change any of these items at any time by clicking on the Membership hyperlink again.

Subscribing to a page puts you on a mailing list to be sent notification of any new responses to the page to which you are subscribed. You subscribe separately to each page in which you are interested by clicking the Subscription link on the page to which you want to subscribe. You are also subscribed, by default, to pages that you write. When you subscribe to a page, you subscribe to that page and all of its responses.

Please respond to these pages if you have something to add. Think of posting a response rather like posting to an email list, except that an editor might occasionally come along to clean things up and/or put them in the main documents' bodies. So if you would post it to an email list in a similar discussion, it is probably appropriate to post here. In order to make reading these pages a pleasure for everyone, any incomprehensible, unrelated, outdated, abusive, or other completely unnecessary post may be removed by an administrator. So if you have a message that would be inappropriate on a mailing list, it's probably also inappropriate here. The administrators have the final say on what's appropriate. We don't expect this to become an issue...

About the new KHG
The Linux Kernel Hackers' Guide has changed quite a bit since its original conception four years ago. I struggled along with the help of many other hackers to produce a document that lived primarily on paper, and was intended to document the kernel in much the same way that a program's user guide is intended to document the program for users. It was less successful than most user guides, for a number of reasons:
- I was working on it part time, and was otherwise busy.
- The Linux kernel is a moving target.
- I am not personally capable of documenting the entire Linux kernel.
- I became far too concerned with making the typesetting pretty, getting bogged down in details and making the document typographically noisy at the same time.

I floundered around, trying to be helpful, and made at least one right decision: most of the people who needed to read the old KHG needed to write device drivers, and the most fully-developed part of the KHG was the device driver section. There is a clear need for further development of the KHG, and it's clear that my making it a monolithic document stood in the way of progress. The KHG is now a series of more or less independent web pages, with places for readers to leave comments and corrections that can be incorporated in the document at the maintainer's leisure--and are available to readers before they are incorporated.

The KHG is now completely web-based. There will be no official paper version. You need kernel source code nearby to read the KHG anyway, and I want to shift the emphasis from officially documenting the Linux kernel to being a learning resource about the Linux kernel--one that may well be useful to other people who want to document one part or another of the Linux kernel more fully, as well as to people who just want to hack the kernel.

Enjoy!

Copyright (C) 1996,1997 Michael K. Johnson, johnsonm@redhat.com

Messages
Loading shared objects - How? by Wesley Terpstra
349. 342. 1. 340. 338. 335. 333. 331.
How can I see the current kernel configuration? by Melwin My mouse no work in X windows by alfonso santana The crash(1M) command in Linux? by Dmitry Where can I gen detailed info on VM86 by Sebastien Plante How to print floating point numbers from the kernel? by pkunisetty@hotmail.com PS/2 Mouse Operating in Remote Mode by Andrei Racz basic module by vano0023@tc.umn.edu


329. 328. 326. 323. 322. 319. 314. 1. 313. 310. 308. 300. 1. 297. 293. 290. 1. 289. 288. 1. 286. 1. 283. 1.

How to check if the user is local? by jb@nicol.ml.org Ldt & Privileges by Ganesh skb queues by Rahul Singh Page locking (for DMA) and process termination? by Espen Skoglund SMP code by 97yadavm@scar.utoronto.ca Porting GC: Difficulties with pthreads by Talin Linux for "Besta - 88"? by Dmitry MVME147 Linux by Edward Tulupnikov /proc/locks by Marco Morandini syscall by ppappu@lrc.di.epfl.ch How to run a bigger kernel ? by Kyung D. Ryu Linux Terminal Device Driver by Nils Appeldoorn Terminal DD by Doug McNash DMA to user allocated buffer ? by Chris Read 1. 1. allocator-example in A.Rubini's book by Thomas Sefzick Untitled by welch@mcmail.com Ethernet collisions by Juha Laine Segmentation in Linux by Andrew Sampson How can the kernel copy directly data from one process to another process? by Jürgen Zeller Use the /Proc file system by marty@twsu.campus.mci.net Remapping Memory Buffer using vmalloc/vma_nopage by Brian W. Taylor Fixed.... strncpy to blame by Brian W. Taylor Does memory area assigned by "vmalloc()" get swapped to disk? by Saurabh Desai Lock the pages in memory by balaji@ittc.ukans.edu How about assigning a fixed size array...does it get swapped too? by saurabh desai Creative Lab's DVD Encore by Brandon TCP sliding window by Olivier Packets and default route versus direct route by Steve Resnick IPv6 description - QoS Implementation - 2 IP Queues by wehrle 2. See the kernel IPv4 implementation documentation by Juha Laine writing to user file directly from kernel space, How can it be done? by Johan how can i increase the number of processes running? by ElmerFudd How do I change the amount of time a process is allowed before it is pre-empted? by -> Patching problems by Maryam Ethernet Collision by jerome bonnet

282. 274. 273. 269. 268. 267. 261.


Escher@dn101aw.cse.eng.auburn.edu Network device stops after a while by Andrew Ordin 260. 1. 259. 1. Untitled by Andrew Does MMAP work with Redhat 4.2? by Guy Yes, it works just fine. by Michael K. Johnson 3. 2. 256. 1. -> -> 247. 241. 1. 240. 237. 1. 235. 234. 1. 231. 230. 225. 1. -> 221. 220. 2. 1. 217. 216. 1. 213. 1. 1. What about mprotect? by Sengan Baring-Gould It Works! Thanks! by Guy

multitasking by Dennis J Perkins Answer by David Welch multitasking by Dennis J Perkins answer by David Welch

linux on sparc by darrin hodges How to call a function in user space from inside the kernel ? by Ronald Tonn How to call a user routine from kernel mode by David Welch Can I map kernel (device driver) memory into user space ? by Ronald Tonn driver x_open,x_release work, x_ioctl,x_write don't by Carl Schwartz Depmod Unresolved symbols? by Carl Schwartz How to sleep for x jiffies? by Trent Piepho Use add_timer/del_timer (in kernel/sched.c) by Amos Shapira /dev/random by Simon Green MSG_WAITALL flag by Leonard Mosescu possible bug in ipc/msg.c by Michael Adda scheduler Question by Arne Spetzler Untitled by Ovsov thanks by arne spetzler File Descriptor Passing? by The Llamatron Linux SMP Scheduling by Angela Finding definitions in the source by Felix Rauch Re: Linux SMP Scheduling by Franky Difference between ELF and kernel file by Thomas Prokosch How kernel communicates with outside when it's started? by Xuan Cao Printing to the kernel log by Thomas Prokosch The way from a kernel hackers' idea to an "official" kernel? by Roger Schreiter linux-kernel@vger.rutgers.edu by Michael K. Johnson Adding code to the Linux Kernel by Patrick


212. 208. 1.

Curious about sleep_on_interruptible() in ancient kernels. by Colin Howell Server crashes using 2.0.32 and SMP by Steve Resnick Debugging server crash by Balaji Srinivasan More Information by Steve Resnick it should not have happenned... by Balaji Srinivasan -> ->

207. 206. 205. 1. 203. 1. 200. 197. 193. 190. 186. 1. 185. 184. 183. 1. 181. 179. 178.

Signals ID definitions by Franky the segment D000 is not visible by martinv2@ctima.uma.es ICMP - Supressing Messages in 2.1.82 Kernel by Brent Johnson Change /etc/syslog.conf by Balaji Srinivasan Modem bits by Franky Untitled by Kostya I need some way to measure the time a process spend in READY QUEUE by Leandro Gelasi How to make sockets work whilst my process is in kernel mode? by Mikhail Kourinny Realtime Problem by Uwe Gaethke 1. SCHED_FIFO scheduling by Balaji Srinivasan inodes by Ovsov Difference between SOCK_RAW SOCK_PACKET by Chris Leung SOCK_PACKET by Eddie Leung Need additional termcap entries for TERM=linux by Karl Bullock Question on Umount or sys_umount by teddy Passing file descriptors to the kernel by Pradeep Gore A way to "transform" a file descriptor into a struct file* in a user process by Lorenzo Cavallaro Dead Man Timer by Jody Winston raw sockets by lightman a kernel-hacking newbie by Bradley Lawrence 2. 1. A place to start. Modems in general by Ian Carr-de Avelon

176. 174. 173. 1. 172. 1. 171. 170.

How to write CD-ROM Driver ? Any Source Code ? by Madhura Upadhya Measuring the scheduler overhead by Jasleen Kaur Where can I find the tcpdump or snoop in linux? by wangc@taurus man which by trajek@j00nix.org Timers don't work?? by Joshua Liew Timers Work... by Balaji Srinivasan problem of Linux's bridge code by wangc@taurus Documention on writing kernel modules by Erik Nygren


168. 167.

How to display a clock on my console? by keco Difference between SCO and Linux drivers. by M COTE

165. Changing the scheduler from round robin to shortest job first for kernel 2.0 and up by royal_and_mary_harrell@msn.com Improving the Scheduer by Lee Ingram 2. 1. 1. 164. 162. 161. 160. 159. 158. 3. 2. 1. 157. 154. 153. 151. 1. 150. 149. 148. 3. 1. 143. 140. 1. 139. 138. 136. 1. Improving the Scheduler : use QNX-like by Leandro Gelasi Re: Changing the sched. from round robin to shortest job first for kernel 2.0 and up. by Pirasenna V.T. meanings of file->private_data by ncuandre@ms14.hinet.net /dev/signalprocess by flatmax how to track VM page access sequence? by shawn Whats the difference between dev_tint(dev) and mark_bh(NET_BH)? by Jaspreet Singh PCI by mullerc@iname.com 1. RE: PCI by Armin A. Arbinger Re: Can I make syscall from inside a kernel module? by Massoud Asgharifard 1. Make a syscall despite of wrong fs!! by Mikhail Kourinny code snip to make a sys_* call from a module by Pradeep Gore Dont use system calls within kernel...(esp sys_mlock) by Balaji Srinivasan Untitled by Steve Durst RAW Sockets (Art) use phy mem by WYB HyperNews for RH Linux ? by Eigil Krogh Sorensen Not really needed by Cameron about raw ethernet frame: how to do it ? by crbild@smc.it process table by Blaz Novak Stream drivers by Nick Egorov Streams drivers Stream in Solaris by cai.yu@rdc.etc.ericsson.se Xircom External Ethernet driver anywhere? by mike head interruptible_sleep_on() too slow! by Bill Blackwell wrong functions by Michael K. Johnson creating a kernel relocatable module by Simon Kittle Up to date serial console patches by Simon Green Kernel-Level Support for Checkpointing on Linux? by Argenis R. Fernandez Working on it. by Jan Rychter Can I make syscall from inside a kernel module? by Shawn Chang


135. 3. 2.

Problem creating a new system call by sauru How did the file /arch/i386/kernel/entry.S do its job by Wang Ju system call returns "Bad Address". Why? by sauru 1. Re:return values by C.H.Gopinath 2. 1. 1. 2. 1. Re:return values by Sameer Shah possible reason for segmentation fault

Creating a new sytem call: solution by C.H.Gopinath problem with system call slot 167 by Todd Medlock Kernel Debuggers for Linux by sauru

133. 130.

Resetting interface counters by Keith Dart writing/accessing modules by Jones MB 1. -> -> Use a device driver and read()/write()/ioctl() by Michael K. Johnson getting to the kernel's memory by Jones MB use buffers! by Rubens Response to "Help with CPU scheduler!" by Jeremy Impson Response to "Help with CPU scheduler!" (Redux) by Jeremy Impson You can't by Michael K. Johnson Calling BIOS interrupts from Linux kernel by Ian Collier Possible, but takes work by Michael K. Johnson VBE video driver by Ian Collier VM86 mode at which abstraction level? by Michael K. Johnson DVD-ROM and linux by Yuqing_Deng@brown.edu Response to DVD and Mpeg in Linux by Mike Corrieri 1. -> -> DVD Encryption by Mark Treiber Untitled by Tim DVD? calling interupts from linux by John J. Binder 1. -> -> -> ->

124. 1.

Help with CPU scheduler! by Lerris ->


116. 3.

DVD-ROM and Linux? (sorry if it's off topic...) by Joel Hardy 2.

115. 2. 1. 113. 1.

Kernel Makefile Configuration: how? by Simon Green How to add a driver to the kernel ? by jacek Radajewski See include/linux/autoconf.h by Balaji Srinivasan Multiprocessor Linux by Davis Terrell Building an SMP kernel by Michael K. Johnson SMP and module versions by linux@catlimited.com ->


111. 109. 94. 93.

Improving event timers? by bodomo@hotmail.com measuring time to load a virtual mem page from disk by kandr using cli/sti() and save_flags/restore_flags() by george Protected Mode by ac 2. 1. 'Developers manual' from Intel(download)... by Mats Odman Advanced 80386 Programming Techniques by Michael K. Johnson DMA limits by Albert Cahalan <acahalan at cs.uml.edu> Not page size, page order by Michael K. Johnson Check it's the right file, zImage not vmlinux by Cameron Usually easy, but.... by Ian Carr-de Avelon


DMA buffer sizes by nomrom@hotmail.com 2. 1.


Problem Getting the Kernel small enough by afish@tdcdesigncorps.com 2. 1.

89. 88. 87. 86.

How to create /proc/sys variables? by Orlando Cantieni Linux for NeXT black? by Dale Amon vremap() in kernel modules? by Liam Wickins giveing compatiblity to win95 for ext2 partitions (for programmers forced to deal with both) by pharos 2. 1. Well, What's the status of the Windows / Dos driver for Ext2? by Brock Lynn Working on it! by ibaird 1. -> revision by ibarid Untitled by Olaf

84. 83. 77. 76.

setsockopt() error when triying to use ipfwadm for masquerading by omaq@encomix.es 1. 1. 1. 5. 4. 3. Re: masquerading by Charles Barrasso Re: fixed, patch for kernel 2.0.30 by Dong Chen Untitled by lolley Transparent Proxy by Zygo Blaxell Untitled by qwzhang@public2.bta.net.cn Untitled by navin97@hotmail.com 2. 1. 2. 1. Changing your IP address is easy, but... by Zygo Blaxell You have to know a bit of C (if u wanna learn) ;) by Lorenzo Cavallaro reset the irq 0 timer after APM suspend by Dong Chen Source Code in C for make Linux partitions. by Limbert Sanabria How can I "cheat" and change the IP address (src,dest) in the sent socket? by Rami

Untitled Do it in the kernel by Michael K. Johnson


Where is the source file for accept() by kaixu@hocpa.ho.lucent.com


1. 72. 69. 1. 2.

Here, in /usr/src/linux/net/socket.c by wumin@netchina.co.cn Re: Raw sockets by genie@risq.belcaf.minsk.by Si tenga preguntas, quisa yo pueda ayudarte. by KernelJock 3. 2. Tengo una pregunta by riderghost@rocketmail.com Español by LL2

How can I use RAW SOCKETS in UNIX? by Rami the KHG in spanish? by Jorge Alvarado Revatta

1. 67. 1. -> 66. 65. 64. 62. 61. 59. 1. 1. 1. 1. 1. 2. 1. 58. 57. 56. 55.

No esta aqui! Pero... by Michael K. Johnson Why not to get a memory snapshot? by Jukka Santala Why you would want to get a memory snapshot by Dave M. Setting resource limits by Jukka Santala Read the rest of the KHG! by Michael K. Johnson Kernel tunable parameters by Jukka Santala Forced Cast data type by Wang Ju Problem with ICMP echo-request/reply by Raghavendra Bhat Increasing number of files in system by Simon Cooper 1. 1. Increasing number of open files parameter by Simon Cooper Setting and getting kernel vars by kbrown@csuhayward.edu sysctl in Linux by Jukka Santala

How to get a Memory snapshot ? by Manuel Porras Brand

resources hard limits by castejon@kbw350.chem.yale.edu How to invalidate a chache page by Gerhard Uttenthaler Where are the tunable parameters of the kernel? by demiguel@robot3.cps.unizar.es How can my device driver access data structures in user space? by Stephan Theil Problem in doing RAW SOCKET Programming by anjali sharma Tunable Kernel Parameters? by dennyf@bmn.net

ELF matters by Carlos Munoz 1. 1. 1. 4. Information about ELF Internals by Pat Ekman [Selectively] Droping Packets by Jose R. cordones readprofile systool by Jukka Santala ICMP send rate limit / ignoring by Jukka Santala 1. Omission in earlier rate-limit... by Jukka Santala Droping Packets by Charles Barrasso The /proc/profile by Charles Barrasso Can you block or ignore ICMP packets? by HackOps@nutnet.com


-> 3. 1. 52. 51. 49. 47. 38. 1.

Patch worked... by Jukka Santala ipfwadm configuration utility by Sonny Parlin

Using ipfwadm by Charles Barrasso Icmp.c and kernal ping replies by Don Thomas

encaps documentation by Kuang-chun Cheng Mounting Caldrea OpenDOS formatted fs's by Trey Childs finding the address that caused a SIGSEGV. by Ben Shelef sti() called too late. by Erik Thiele 1. 1. 2. sti() called too late. by Gadi Oxman Needed here too by ajay Help needed here too! by ajay 10 ms timer patch by Reinhold J. Gerharz 2. 1. please send me 10 ms timer patch by Tolga Ayav Please send me the patch by Jin Hwang 1. UTIME: Microsecond Resolution Timers by BalajiSrinivasan Module Development Info? by Mark S. Mathews


Need quicker timer than 100 ms in kernel-module by Erik Thiele 1.

34. 31. 30. 29. 27. 25.

Need help with finding the linked list of unacked sk_buffs in TCP by Vijay Gupta Partition Type by Suman Ball New document on exception handling by Michael K. Johnson How to make paralelism in to the kernel? by Delian Dlechev readv/writev & other sock funcs by Dave Wreski I'd like to see the scheduler chapter by Tim Bird 1. 3. Untitled by Vijay Gupta Go ahead! by Michael K. Johnson Get a proxy by Michael K. Johnson Examples code as documentation by Jeremy Impson What raw sockets are for. by Cameron MacKinnon GDB for Linux by David Grothe 2. Another kernel debugging tool by David Hinds 2. Kernel debugging with breakpoints by Keith Owens

21. 20. 18. 15.

Unable to access KHG, port 8080 giving problem. by Srihari Nelakuditi 1. 1. 1. 2. proc fs docs? by David Woodruff What is SOCK_RAW and how do I use it? by arkane Linux kernel debugging by yylai@hk.net


-> 1. 1. 9. 7. 6.

Need help for debugging by C.H.Gopinath

gdb debugging of kernel now available by David Grothe

Device debugging by alombard©iiic.ethz.ch Summary of Linux Real-Time Status by Markus Kuhn Hard real-time now available by Michael K. Johnson 2. 1. Shortcomings of RT-Linux by Balaji Srinivasan Firm Realtime available by Balaji Srinivasan I want to know how to hack Red Hat Linux Release 5.0 by Kevin cli()/sti() latency, hard numbers by Ingo Molnar

Realtime mods anyone? by bill duncan

5. 4. 2. 1. 7. 3.

found some hacks ?!? by Mayk Langer 2. 1. POSIX.4 scheduler by Peter Monta Realtime is already done(!) by Kai Harrekilde-Petersen 100 ms real time should be easy by jeff millar 1. Real-Time Applications with Linux POSIX.4 Scheduling by P. Woolley

Why can't we incorporate new changes in linux kernel in KHG ? by Praveen Kumar Dwivedi 1. 1. You can! by Michael K. Johnson The sounds of silence... by Gabor J.Toth 1. 2. 2. Breaking the silence :) by Kyle Ferrio 1. Scribbling in the margins by Michael K. Johnson It requires thought... by Michael K. Johnson Kernel source code by Gabor J.Toth

Kernel source is already browsable online by Axel Boldt KHG being mirrored nightly for download! by Michael K. Johnson postscript version of these documents? by Michael Stiller 1. -> -> Sure! by Michael K. Johnson Not so Sure! by jeff millar Enough already! by Michael K. Johnson Mirror whole KHG package, off line reading and Post to this site by Kim In-Sung 2. 1. Untitled by Jim Van Zandt That works. (using it now). Two tips: by Richard Braakman 2. -> Appears to be a bug in getwww, though... by Michael K. Johnson Sucking up to the wrong site... ;) by Jukka Santala


Need easy way to download whole KHG 5. 2.


Mirror packages are available, but that's not really enough by Michael K. Johnson 4.



Help make the new KHG a success by Michael K. Johnson


The Linux Kernel

Table of Contents
- Title Page
- Preface
- Hardware Basics
- Software Basics
- Memory Management
- Processes
- Interprocess Communication Mechanisms
- PCI
- Interrupts and Interrupt Handling
- Device Drivers
- The File System
- Networks
- Kernel Mechanisms
- Modules
- Processors
- The Linux Kernel Sources
- Linux Data Structures
- Useful Web and FTP Sites
- The LPD Manifesto
- The GNU General Public License
- Glossary

© 1996-1999 David A Rusling


This book is for Linux enthusiasts who want to know how the Linux kernel works. It is not an internals manual. Rather, it describes the principles and mechanisms that Linux uses: how and why the Linux kernel works the way that it does. Linux is a moving target; this book is based upon the current, stable, 2.0.33 sources, as those are what most individuals and companies are now using. This book is freely distributable; you may copy and redistribute it under certain conditions. Please refer to the copyright and distribution statement.

Version 0.8-3
David A Rusling (david.rusling@arm.com)



David A Rusling 3 Foxglove Close, Wokingham, Berkshire RG41 3NF, United Kingdom

http://ldp.iol.it/LDP/tlk/tlk.html (1 di 2) [08/03/2001 10.08.04]


The Linux Kernel (Copyright Notice)

Legal Notice
UNIX is a trademark of Univel. Linux is a trademark of Linus Torvalds, and has no connection to UNIX(TM) or Univel.

Copyright © 1996,1997,1998,1999 David A Rusling
3 Foxglove Close, Wokingham, Berkshire RG41 3NF, UK
david.rusling@arm.com

This book (``The Linux Kernel'') may be reproduced and distributed in whole or in part, without fee, subject to the following conditions:
- The copyright notice above and this permission notice must be preserved complete on all complete or partial copies.
- Any translation or derived work must be approved by the author in writing before distribution.
- If you distribute this work in part, instructions for obtaining the complete version of this manual must be included, and a means for obtaining a complete version provided.
- Small portions may be reproduced as illustrations for reviews or quotes in other works without this permission notice if proper citation is given.

Exceptions to these rules may be granted for academic purposes: write to the author and ask. These restrictions are here to protect us as authors, not to restrict you as learners and educators.

All source code in this document is placed under the GNU General Public License, available via anonymous FTP from prep.ai.mit.edu:/pub/gnu/COPYING. It is also reproduced in the appendix ``The GNU General Public License''.

http://ldp.iol.it/LDP/tlk/misc/copyright.html [08/03/2001 10.08.06]


http://ldp.iol.it/LDP/tlk/intro/preface.html (1 di 6) [08/03/2001 10.08.12]

Preface

Linux is a phenomenon of the Internet. Born out of the hobby project of a student, it has grown to become more popular than any other freely available operating system. To many, Linux is an enigma. How can something that is free be worthwhile? In a world dominated by a handful of large software corporations, how can something that has been written by a bunch of ``hackers'' (sic) hope to compete? How can software contributed to by many different people in many different countries around the world have a hope of being stable and effective? Yet stable and effective it is and compete it does. Many Universities and research establishments use it for their everyday computing needs. People are running it on their home PCs and I would wager that most companies are using it somewhere even if they do not always realize that they do. Linux is used to browse the web, host web sites, write theses, send electronic mail and, as always with computers, to play games. Linux is emphatically not a toy; it is a fully developed and professionally written operating system used by enthusiasts all over the world.

The roots of Linux can be traced back to the origins of Unix(TM). In 1969, Ken Thompson of the Research Group at Bell Laboratories began experimenting on a multi-user, multi-tasking operating system using an otherwise idle PDP-7. He was soon joined by Dennis Ritchie and the two of them, along with other members of the Research Group, produced the early versions of Unix(TM). Ritchie was strongly influenced by an earlier project, MULTICS, and the name Unix(TM) is itself a pun on the name MULTICS. Early versions were written in assembly code, but the third version was rewritten in a new programming language, C. C was designed and written by Ritchie expressly as a programming language for writing operating systems. This rewrite allowed Unix(TM) to move onto the more powerful PDP-11/45 and 11/70 computers then being produced by DIGITAL. The rest, as they say, is history. Unix(TM) moved out of the laboratory and into mainstream computing and soon most major computer manufacturers were producing their own versions.

Linux was the solution to a simple need. The only software that Linus Torvalds, Linux's author and principal maintainer, was able to afford was Minix. Minix is a simple, Unix(TM)-like operating system widely used as a teaching aid. Linus was less than impressed with its features; his solution was to write his own software. He took Unix(TM) as his model as that was an operating system that he was familiar with in his day to day student life. He started with an Intel 386 based PC and started to write. Progress was rapid and, excited by this, Linus offered his efforts to other students via the emerging world wide computer networks, then mainly used by the academic community. Others saw the software and started contributing. Much of this new software was itself the solution to a problem that one of the contributors had. Before long, Linux had become an operating system. It is important to note that Linux contains no Unix(TM) code; it is a rewrite based on published POSIX standards. Linux is built with and uses a lot of the GNU (GNU's Not Unix(TM)) software produced by the Free Software Foundation in Cambridge, Massachusetts.

Most people use Linux as a simple tool, often just installing one of the many good CD ROM-based distributions. A lot of Linux users use it to write applications or to run applications written by others. Many Linux users read the HOWTOs(1) avidly and feel both the thrill of success when some part of the system has been correctly configured and the frustration of failure when it has not. A minority are bold enough to write device drivers and offer kernel patches to Linus Torvalds, the creator and maintainer of the Linux kernel. Linus accepts additions and modifications to the kernel sources from anyone, anywhere. This might sound like a recipe for anarchy but Linus exercises strict quality control and merges all new code into the kernel himself. At any one time though, there are only a handful of people contributing sources to the Linux kernel.

The majority of Linux users do not look at how the operating system works, how it fits together. This is a shame because looking at Linux is a very good way to learn more about how an operating system functions. Not only is it well written, all the sources are freely available for you to look at. This is because although the authors retain the copyrights to their software, they allow the sources to be freely redistributable under the Free Software Foundation's GNU General Public License. At first glance though, the sources can be confusing; you will see directories called kernel, mm and net but what do they contain and how does that code work? What is needed is a broader understanding of the overall structure and aims of Linux. This, in short, is the aim of this book: to promote a clear understanding of how Linux, the operating system, works, and to provide a mental model that allows you to picture what is happening within the system as you copy a file from one place to another or read electronic mail. I well remember the excitement that I felt when I first realized just how an operating system actually worked. It is that excitement that I want to pass on to the readers of this book.

My involvement with Linux started late in 1994 when I visited Jim Paradis who was working on a port of Linux to the Alpha AXP processor-based systems. I had worked for Digital Equipment Co. Limited since 1984, mostly in networks and communications, and in 1992 I started working for the newly formed Digital Semiconductor division. This division's goal was to enter fully into the merchant chip vendor market and sell chips, and in particular the Alpha AXP range of microprocessors but also Alpha AXP system boards, outside of Digital. When I first heard about Linux I immediately saw an opportunity to have fun. Jim's enthusiasm was catching and I started to help on the port. As I worked on this, I began more and more to appreciate not only the operating system but also the community of engineers that produces it.

However, Alpha AXP is only one of the many hardware platforms that Linux runs on. Most Linux kernels are running on Intel processor based systems but a growing number of non-Intel Linux systems are becoming more commonly available. Amongst these are Alpha AXP, ARM, MIPS, Sparc and PowerPC. I could have written this book using any one of those platforms but my background and technical experiences with Linux are with Linux on the Alpha AXP and, to a lesser extent, on the ARM. This is why this book sometimes uses non-Intel hardware as an example to illustrate some key point. It must be noted that around 95% of the Linux kernel sources are common to all of the hardware platforms that it runs on. Likewise, around 95% of this book is about the machine independent parts of the Linux kernel.

Reader Profile

This book does not make any assumptions about the knowledge or experience of the reader. I believe that interest in the subject matter will encourage a process of self education where necessary. That said, a degree of familiarity with computers, preferably the PC, will help the reader derive real benefit from the material, as will some knowledge of the C programming language.

Organisation of this Book

This book is not intended to be used as an internals manual for Linux. Instead it is an introduction to operating systems in general and to Linux in particular. The chapters each follow my rule of "working from the general to the particular". They first give an overview of the kernel subsystem that they are describing before launching into its gory details.

I have deliberately not described the kernel's algorithms, its methods of doing things, in terms of routine_X() calls routine_Y() which increments the foo field of the bar data structure. You can read the code to find these things out. Whenever I need to understand a piece of code or describe it to someone else I often start with drawing its data structures on the white-board. So, I have described many of the relevant kernel data structures and their interrelationships in a fair amount of detail.

Each chapter is fairly independent, like the Linux kernel subsystem that they each describe. Sometimes, though, there are linkages; for example you cannot describe a process without understanding how virtual memory works.

The Hardware Basics chapter gives a brief introduction to the modern PC. An operating system has to work closely with the hardware system that acts as its foundations. The operating system needs certain services that can only be provided by the hardware. In order to fully understand the Linux operating system, you need to understand the basics of the underlying hardware.

The Software Basics chapter introduces basic software principles and looks at assembly and C programming languages. It looks at the tools that are used to build an operating system like Linux and it gives an overview of the aims and functions of an operating system.

The Memory Management chapter describes the way that Linux handles the physical and virtual memory in the system.

The Processes chapter describes what a process is and how the Linux kernel creates, manages and deletes the processes in the system.

Processes communicate with each other and with the kernel to coordinate their activities. Linux supports a number of Inter-Process Communication (IPC) mechanisms. Signals and pipes are two of them but Linux also supports the System V IPC mechanisms named after the Unix release in which they first appeared. These interprocess communications mechanisms are described in the IPC chapter.

The Peripheral Component Interconnect (PCI) standard is now firmly established as the low cost, high performance data bus for PCs. The PCI chapter describes how the Linux kernel initializes and uses PCI buses and devices in the system.

The Interrupts and Interrupt Handling chapter looks at how the Linux kernel handles interrupts. Whilst the kernel has generic mechanisms and interfaces for handling interrupts, some of the interrupt handling details are hardware and architecture specific.

One of Linux's strengths is its support for the many available hardware devices for the modern PC. The Device Drivers chapter describes how the Linux kernel controls the physical devices in the system.

The File System chapter describes how the Linux kernel maintains the files in the file systems that it supports. It describes the Virtual File System (VFS) and how the Linux kernel's real file systems are supported.

Networking and Linux are terms that are almost synonymous. In a very real sense Linux is a product of the Internet or World Wide Web (WWW). Its developers and users use the web to exchange information, ideas and code, and Linux itself is often used to support the networking needs of organizations. The Networks chapter describes how Linux supports the network protocols known collectively as TCP/IP.

The Kernel Mechanisms chapter looks at some of the general tasks and mechanisms that the Linux kernel needs to supply so that other parts of the kernel work effectively together.

The Modules chapter describes how the Linux kernel can dynamically load functions, for example file systems, only when they are needed.

The Processors chapter gives a brief description of some of the processors that Linux has been ported to.

The Sources chapter describes where in the Linux kernel sources you should start looking for particular kernel functions.

Conventions used in this Book

The following is a list of the typographical conventions used in this book.

serif font: identifies commands or other text that is to be typed literally by the user.
type font: refers to data structures or fields within data structures.

Throughout the text there are references to pieces of code within the Linux kernel source tree (for example the boxed margin note adjacent to this text). These are given in case you wish to look at the source code itself and all of the file references are relative to /usr/src/linux. Taking foo/bar.c as an example, the full filename would be /usr/src/linux/foo/bar.c. If you are running Linux (and you should), then looking at the code is a worthwhile experience and you can use this book as an aid to understanding the code and as a guide to its many data structures.

Trademarks

ARM is a trademark of ARM Holdings PLC.
Caldera, OpenLinux and the "C" logo are trademarks of Caldera, Inc.
Caldera OpenDOS 1997 Caldera, Inc.

Inc. Linux suits my needs perfectly.12] . Red Hat. Motif is a trademark of The Open System Foundation. UNIX is a registered trademark of X/Open. People often ask me about Linux at work and at home and I am only too happy to oblige. DIGITAL is a trademark of Digital Equipment Corporation. I first met Unix at University. Ultrix). I loved using the newly delivered PDP-11 for my final year project.it/LDP/tlk/intro/preface. I define a Linux zealot to be an enthusiast that recognizes that there are other operating systems but prefers not to use them. MSDOS is a trademark of Microsoft Corporation.html (5 di 6) [08/03/2001 10. glint and the Red Hat logo are trademarks of Red Hat Software.iol. in the north of England. Inc. Gill. a few weeks before Sputnik was launched. Most freely available software easily builds on Linux and I can often simply download pre-built executable files or install them from a CD ROM. I have attempted to incorporated those comments in each new version that I have produced and I am more than happy to receive comments. I worked for the semiconductor group on Alpha and StrongARM evaluation boards. My children (Esther and Stephen) describe me as a geek. After graduating (in 1982 with a First Class Honours degree in Computer Science) I worked for Prime Computers (Primos) and then after a couple of years for Digital (VMS. The more that I use Linux in both my professional and personal life the more that I become a Linux zealot. where a lecturer used it as an example when teaching the notions of kernels. as an engineer. You may note that I use the term `zealot' and not `bigot'. At Digital I worked on many things but for the last 5 years there. For me. In 1998 I moved to ARM where I have a small group of engineers writing low level firmware and porting operating systems. What else could I use to learn to program in C++. XFree86 is a trademark of XFree86 Project.08. 
X Window System is a trademark of the X Consortium and the Massachusetts Institute of Technology. Linux is a trademark of Linus Torvalds. A number of lecturers have written to me asking if they can use some or parts of this book in order to http://ldp. however please note my new e-mail address.DEC is a trademark of Digital Equipment Corporation. It is a superb. Inc. scheduling and other operating systems goodies. flexible and adaptable engineering tool that I use at work and at home. As my wife. The Author I was born in 1957. Perl or learn about Java for free? Acknowledgements I must thank the many people who have been kind enough to take the time to e-mail me with comments about this book. who uses Windows 95 once remarked ``I never realized that we would have his and her operating systems''.

a document describing how to do something. File translated from TEX by TTH. My answer is an emphatic yes.12] . Many have been written for Linux and all are very useful. version 1.0. there may be another Linus Torvalds sat in the class. Finally. No Frames © 1996-1999 David A Rusling copyright notice. Show Frames.iol.teach computing. Footnotes: 1 A HOWTO is just what it sounds like.html (6 di 6) [08/03/2001 10.08. Special thanks must go to John Rigby and Michael Bauer who gave me full. Table of Contents. Top of Chapter.it/LDP/tlk/intro/preface. Alan Cox and Stephen Tweedie have patiently answered my questions thanks. Not an easy task. this is one use of the book that I particularly wanted. thank you to Greg Hankins for accepting this book into the Linux Documentation Project and onto their web site. http://ldp. I used Larry Ewing's penguins to brighten up the chapters a bit. detailed review notes of the whole book. Who knows.

Chapter 1

Hardware Basics

An operating system has to work closely with the hardware system that acts as its foundations. The operating system needs certain services that can only be provided by the hardware. In order to fully understand the Linux operating system, you need to understand the basics of the underlying hardware. This chapter gives a brief introduction to that hardware: the modern PC.

When the "Popular Electronics" magazine for January 1975 was printed with an illustration of the Altair 8800 on its front cover, a revolution started. The Altair 8800, named after the destination of an early Star Trek episode, could be assembled by home electronics enthusiasts for a mere $397. With its Intel 8080 processor and 256 bytes of memory but no screen or keyboard it was puny by today's standards. Its inventor, Ed Roberts, coined the term "personal computer" to describe his new invention, but the term PC is now used to refer to almost any computer that you can pick up without needing help. By this definition, even some of the very powerful Alpha AXP systems are PCs.

Enthusiastic hackers saw the Altair's potential and started to write software and build hardware for it. To these early pioneers it represented freedom: the freedom from huge batch processing mainframe systems run and guarded by an elite priesthood. Overnight fortunes were made by college dropouts fascinated by this new phenomenon, a computer that you could have at home on your kitchen table. A lot of hardware appeared, all different to some degree, and software hackers were happy to write software for these new machines. Paradoxically it was IBM who firmly cast the mould of the modern PC by announcing the IBM PC in 1981 and shipping it to customers early in 1982. With its Intel 8088 processor, 64K of memory (expandable to 256K), two floppy disks and an 80 character by 25 lines Colour Graphics Adapter (CGA) it was not very powerful by today's standards but it sold well. It was followed, in 1983, by the IBM PC-XT which had the luxury of a 10Mbyte hard drive. It was not long before IBM PC clones were being produced by a host of companies such as Compaq and the architecture of the PC became a de-facto standard. This de-facto standard helped a multitude of hardware companies to compete together in a growing market which, happily for consumers, kept prices low. Many of the system architectural features of these early PCs have carried over into the modern PC. For example, even the most powerful Intel Pentium Pro based system starts running in the Intel 8086's addressing mode. When Linus Torvalds started writing what was to become Linux, he picked the most plentiful and reasonably priced hardware, an Intel 80386 PC.

Figure 1.1: A typical PC motherboard.

Looking at a PC from the outside, the most obvious components are a system box, a keyboard, a mouse and a video monitor. On the front of the system box are some buttons, a little display showing some numbers and a floppy drive. Most systems these days have a CD ROM and if you feel that you have to protect your data, then there will also be a tape drive for backups. These devices are collectively known as the peripherals.

Although the CPU is in overall control of the system, it is not the only intelligent device. All of the peripheral controllers, for example the IDE controller, have some level of intelligence. Inside the PC (Figure 1.1) you will see a motherboard containing the CPU or microprocessor, the memory and a number of slots for the ISA or PCI peripheral controllers. Some of the controllers, for example the IDE disk controller, may be built directly onto the system board.

1.1 The CPU

The CPU, or rather microprocessor, is the heart of any computer system. The microprocessor calculates, performs logical operations and manages data flows by reading instructions from memory and then executing them. In the early days of computing the functional components of the microprocessor were separate (and physically large) units. This is when the term Central Processing Unit was coined. The modern microprocessor combines these components onto an integrated circuit etched onto a very small piece of silicon. The terms CPU, microprocessor and processor are all used interchangeably in this book.

Microprocessors operate on binary data; that is, data composed of ones and zeros. These ones and zeros correspond to electrical switches being either on or off. Just as 42 is a decimal number meaning "4 10s and 2 units", a binary number is a series of binary digits each one representing a power of 2. In this context, a power means the number of times that a number is multiplied by itself. 10 to the power 1 (10^1) is 10, 10 to the power 2 (10^2) is 10x10, 10^3 is 10x10x10 and so on. Binary 0001 is decimal 1, binary 0010 is decimal 2, binary 0011 is 3, binary 0100 is 4 and so on. So, 42 decimal is 101010 binary (2 + 8 + 32, or 2^1 + 2^3 + 2^5). Rather than using binary to represent numbers in computer programs, another base, hexadecimal, is usually used. In this base, each digit represents a power of 16. As decimal digits only go from 0 to 9, the numbers 10 to 15 are represented as a single digit by the letters A, B, C, D, E and F. For example, hexadecimal E is decimal 14 and hexadecimal 2A is decimal 42 (two 16s + 10). Using the C programming language notation (as I do throughout this book) hexadecimal numbers are prefaced by "0x"; hexadecimal 2A is written as 0x2A.

Microprocessors can perform arithmetic operations such as add, multiply and divide and logical operations such as "is X greater than Y?".

The processor's execution is governed by an external clock. This clock, the system clock, generates regular clock pulses to the processor and, at each clock pulse, the processor does some work. For example, a processor could execute an instruction every clock pulse. A processor's speed is described in terms of the rate of the system clock ticks. A 100MHz processor will receive 100,000,000 clock ticks every second. It is misleading to describe the power of a CPU by its clock rate as different processors perform different amounts of work per clock tick. However, all things being equal, a faster clock speed means a more powerful processor. The instructions executed by the processor are very simple; for example "read the contents of memory at location X into register Y". Registers are the microprocessor's internal storage, used for storing data and performing operations on it. The operations performed may cause the processor to stop what it is doing and jump to another instruction somewhere else in memory. These tiny building blocks give the modern microprocessor almost limitless power as it can execute millions or even billions of instructions a second.

The instructions have to be fetched from memory as they are executed. Instructions may themselves reference data within memory and that data must be fetched from memory and saved there when appropriate.

The size, number and type of register within a microprocessor is entirely dependent on its type. An Intel 80486 processor has a different register set to an Alpha AXP processor; for a start, the Intel's registers are 32 bits wide and the Alpha AXP's are 64 bits wide. In general, though, any given processor will have a number of general purpose registers and a smaller number of dedicated registers. Most processors have the following special purpose, dedicated, registers:

Program Counter (PC)
This register contains the address of the next instruction to be executed. The contents of the PC are automatically incremented each time an instruction is fetched.

Stack Pointer (SP)
Processors have to have access to large amounts of external read/write random access memory (RAM) which facilitates temporary storage of data. The stack is a way of easily saving and restoring temporary values in external memory. Usually, processors have special instructions which allow you to push values onto the stack and to pop them off again later. The stack works on a last in first out (LIFO) basis. In other words, if you push two values, x and y, onto a stack and then pop a value off of the stack then you will get back the value of y. Some processors' stacks grow upwards towards the top of memory whilst others grow downwards towards the bottom, or base, of memory. Some processors support both types, for example ARM.

Processor Status (PS)
Instructions may yield results; for example "is the content of register X greater than the content of register Y?" will yield true or false as a result. The PS register holds this and other information about the current state of the processor. For example, most processors have at least two modes of operation, kernel (or supervisor) and user. The PS register would hold information identifying the current mode.

1.2 Memory

All systems have a memory hierarchy with memory at different speeds and sizes at different points in the hierarchy. The fastest memory is known as cache memory and is what it sounds like: memory that is used to temporarily hold, or cache, contents of the main memory. This sort of memory is very fast but expensive, therefore most processors have a small amount of on-chip cache memory and more system based (on-board) cache memory. Some processors have one cache to contain both instructions and data, but others have two, one for instructions and the other for data. The Alpha AXP processor has two internal memory caches: one for data (the D-Cache) and one for instructions (the I-Cache). The external cache (or B-Cache) mixes the two together. Finally there is the main memory which, relative to the external cache memory, is very slow. Relative to the on-CPU cache, main memory is positively crawling.

The cache and main memories must be kept in step (coherent). In other words, if a word of main memory is held in one or more locations in cache, then the system must make sure that the contents of cache and memory are the same. The job of cache coherency is done partially by the hardware and partially by the operating system. This is also true for a number of major system tasks where the hardware and software must cooperate closely to achieve their aims.
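
The binary and hexadecimal arithmetic described in section 1.1 can be checked with a few lines of C. The sketch below is only an illustration: to_binary and hex_digit_value are made-up helper names, not standard library functions, and the buffer handling is one possible choice among many.

```c
#include <stddef.h>

/*
 * Write the binary digits of `value` into `out` as a NUL-terminated
 * string, most significant digit first, with no leading zeros ("0" for
 * zero). Returns the number of digits written, or 0 if `out_size` is too
 * small to hold the digits plus the terminator.
 * (to_binary is a hypothetical helper invented for this example.)
 */
size_t to_binary(unsigned int value, char *out, size_t out_size)
{
    size_t ndigits = 1;
    unsigned int v = value;

    /* Count how many binary digits the value needs. */
    while (v > 1) {
        v >>= 1;
        ndigits++;
    }
    if (out_size < ndigits + 1)
        return 0;

    /* Fill the buffer starting from the least significant digit. */
    for (size_t i = 0; i < ndigits; i++) {
        out[ndigits - 1 - i] = (value & 1u) ? '1' : '0';
        value >>= 1;
    }
    out[ndigits] = '\0';
    return ndigits;
}

/*
 * Value of one hexadecimal digit character (0-9, a-f, A-F),
 * or -1 if the character is not a hexadecimal digit.
 * (hex_digit_value is likewise a hypothetical helper.)
 */
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}
```

Calling to_binary(42, buf, sizeof buf) fills buf with the six digits 101010, and hex_digit_value('2') * 16 + hex_digit_value('A') gives 42, which is the same value the C compiler assigns to the constant 0x2A.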

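The last in first out behaviour of the stack described under Stack Pointer (SP) in section 1.1 can be modelled in C. This is a minimal sketch under stated assumptions: a small array stands in for external memory, the stack grows downwards, and push and pop are ordinary functions written for this example rather than real processor instructions.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16  /* size chosen only for this illustration */

/* A small array standing in for the RAM that holds the stack. */
static uint32_t stack_memory[STACK_WORDS];

/* The "stack pointer": an empty downward-growing stack starts at the top. */
static uint32_t *sp = stack_memory + STACK_WORDS;

/* Push: move SP down one word, then store the value there. */
void push(uint32_t value)
{
    assert(sp > stack_memory);               /* otherwise: stack overflow */
    *--sp = value;
}

/* Pop: read the word SP points at, then move SP back up one word. */
uint32_t pop(void)
{
    assert(sp < stack_memory + STACK_WORDS); /* otherwise: nothing to pop */
    return *sp++;
}
```

Pushing x and then y and popping once returns y, as the text says; a second pop returns x. An upward-growing stack would simply start sp at the bottom of the array and reverse the direction of the pointer updates.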
1.3 Buses

The individual components of the system board are interconnected by multiple connection systems known as buses. The system bus is divided into three logical functions: the address bus, the data bus and the control bus. The address bus specifies the memory locations (addresses) for the data transfers. The data bus holds the data transferred. The data bus is bidirectional; it allows data to be read into the CPU and written from the CPU. The control bus contains various lines used to route timing and control signals throughout the system. Many flavours of bus exist; for example ISA and PCI buses are popular ways of connecting peripherals to the system.

1.4 Controllers and Peripherals

Peripherals are real devices, such as graphics cards or disks, controlled by controller chips on the system board or on cards plugged into it. The IDE disks are controlled by the IDE controller chip and the SCSI disks by the SCSI disk controller chips and so on. These controllers are connected to the CPU and to each other by a variety of buses. Most systems built now use PCI and ISA buses to connect together the main system components. The controllers are processors like the CPU itself; they can be viewed as intelligent helpers to the CPU. The CPU is in overall control of the system.

All controllers are different, but they usually have registers which control them. Software running on the CPU must be able to read and write those controlling registers. One register might contain status describing an error. Another might be used for control purposes, changing the mode of the controller. Each controller on a bus can be individually addressed by the CPU; this is so that the software device driver can write to its registers and thus control it. The IDE ribbon is a good example, as it gives you the ability to access each drive on the bus separately. Another good example is the PCI bus which allows each device (for example a graphics card) to be accessed independently.

1.5 Address Spaces

The system bus connects the CPU with the main memory and is separate from the buses connecting the CPU with the system's hardware peripherals. Collectively the memory space that the hardware peripherals exist in is known as I/O space. I/O space may itself be further subdivided, but we will not worry too much about that for the moment. The CPU can access both the system space memory and the I/O space memory, whereas the controllers themselves can only access system memory indirectly and then only with the help of the CPU. From the point of view of the device, say the floppy disk controller, it will see only the address space that its control registers are in (ISA), and not the system memory. Typically a CPU will have separate instructions for accessing the memory and I/O space. For example, there might be an instruction that means "read a byte from I/O address 0x3f8 into register X". This is exactly how the CPU controls the system's hardware peripherals, by reading and writing to their registers in I/O space. Where in I/O space the common peripherals (IDE controller, serial port, floppy disk controller and so on) have their registers has been set by convention over the years as the PC architecture has developed. The I/O space address 0x3f8 just happens to be the address of one of the serial port's (COM1) registers.
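
The idea of a device driver controlling a controller by reading and writing registers in I/O space can be sketched in C. Real port I/O needs special instructions and privileges, so this example simulates I/O space with an array. Everything named here is an assumption made for illustration: the register addresses, the bit values and the function names do not describe any real controller.

```c
#include <stdint.h>

#define IO_SPACE_SIZE 0x400u  /* assumed size of the simulated I/O space */
#define DEV_STATUS    0x300u  /* hypothetical status register address */
#define DEV_CONTROL   0x301u  /* hypothetical control register address */
#define STATUS_ERROR  0x01u   /* hypothetical "error" bit in the status register */
#define CONTROL_FAST  0x80u   /* hypothetical "fast mode" bit in the control register */

/* Simulated I/O space: one byte per I/O address, separate from main memory. */
static uint8_t io_space[IO_SPACE_SIZE];

/* Stand-in for "read a byte from I/O address `port`". */
uint8_t io_read8(uint16_t port)
{
    return io_space[port % IO_SPACE_SIZE];
}

/* Stand-in for "write a byte to I/O address `port`". */
void io_write8(uint16_t port, uint8_t value)
{
    io_space[port % IO_SPACE_SIZE] = value;
}

/*
 * What a driver routine might do: check the status register for an
 * error, then change the controller's mode through its control register.
 * Returns 0 on success, -1 if the controller reports an error.
 */
int controller_set_fast_mode(void)
{
    if (io_read8(DEV_STATUS) & STATUS_ERROR)
        return -1;
    io_write8(DEV_CONTROL, (uint8_t)(io_read8(DEV_CONTROL) | CONTROL_FAST));
    return 0;
}
```

On real hardware the two accessor functions would be replaced by the processor's own I/O instructions; the driver logic above them would keep the same shape.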

There are times when controllers need to read or write large amounts of data directly to or from system memory, for example when user data is being written to the hard disk. In this case, Direct Memory Access (DMA) controllers are used to allow hardware peripherals to directly access system memory but this access is under strict control and supervision of the CPU.

1.6 Timers

All operating systems need to know the time and so the modern PC includes a special peripheral called the Real Time Clock (RTC). This provides two things: a reliable time of day and an accurate timing interval. The RTC has its own battery so that it continues to run even when the PC is not powered on; this is how your PC always "knows" the correct date and time. The interval timer allows the operating system to accurately schedule essential work.
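
One common way for software to use a regular timing interval is to count ticks and compare the count against deadlines. The sketch below assumes a timer that ticks 100 times a second; the rate and all of the names are assumptions made for this illustration and are not taken from the Linux sources.

```c
#define TICKS_PER_SECOND 100UL  /* assumed interval timer rate */

/* Number of timer ticks seen since start-up. */
static unsigned long ticks;

/* Would be called once for every interval timer tick. */
void timer_tick(void)
{
    ticks++;
}

/* Elapsed time since start-up in milliseconds. */
unsigned long uptime_ms(void)
{
    return ticks * (1000UL / TICKS_PER_SECOND);
}

/* Non-zero once the tick count has reached `deadline_tick`. */
int work_is_due(unsigned long deadline_tick)
{
    return ticks >= deadline_tick;
}
```

After 250 ticks at this rate, uptime_ms() reports 2500 milliseconds, and work scheduled for tick 250 is due while work scheduled for tick 251 is not.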

One of the first software tools invented for the earliest computers was an assembler. The hexadecimal number 0x89E5 is an Intel 80486 instruction which copies the contents of the ESP register to the EBP register. . That program can be written in assembler. Assembly languages explicitly handle registers and operations on data and they are specific to a particular microprocessor. They are machine codes which tell the computer precisely what to do. (r15) r17.r17. . a very low level computer language. Line 3 compares the contents of register 16 with that of register 17 and. (r15) . branches to label 100.iol.08. An operating system is a special program which allows the user to run applications such as spreadsheets and word processors. The following Alpha AXP assembly code shows the sort of operations that a program can perform: ldr ldr beq str 100: r16.1. Show Frames. a program which takes a human readable source file and assembles it into machine code. or in a high level. This chapter introduces basic programming principles and gives an overview of the aims and functions of an operating system.it/LDP/tlk/basics/sw.html (1 di 7) [08/03/2001 10. 2. machine independent language such as the C programming language. 4(r15) r16. Assembly level programs are tedious and tricky to http://ldp. if they are equal. If the registers do contain the same value then no data needs to be saved. . Line Line Line Line Line 1 2 3 4 5 The first statement (on line 1) loads register 16 from the address held in register 15. .1 Computer Languages 2. The next instruction loads register 17 from the next location in memory.100 r17. If the registers do not contain the same value then the program continues to line 4 where the contents of r17 are saved into memory. The assembly language for an Intel X86 microprocessor is very different to the assembly language for an Alpha AXP microprocessor.23] . 
No Frames Chapter 2 Software Basics A program is a set of computer instructions that perform a particular task.Table of Contents.1 Assembly Languages The instructions that a CPU fetches from memory and executes are not at all understandable to human beings.

... write and prone to errors. Very little of the Linux kernel is written in assembly language, and those parts that are are written only for efficiency and are specific to particular microprocessors.

2.1.2 The C Programming Language and Compiler

Writing large programs in assembly language is a difficult and time consuming task. It is prone to error and the resulting program is not portable, being tied to one particular processor family. It is far better to use a machine independent language like C. C allows you to describe programs in terms of their logical algorithms and the data that they operate on. Special programs called compilers read the C program and translate it into assembly language, generating machine specific code from it. A good compiler can generate assembly instructions that are very nearly as efficient as those written by a good assembly programmer. Most of the Linux kernel is written in the C language. The following C fragment:

    if (x != y)
        x = y ;

performs exactly the same operations as the previous example assembly code. If the contents of the variable x are not the same as the contents of variable y then the contents of y will be copied to x. C code is organized into routines, each of which performs a task. Routines may return any value or data type supported by C. Large programs like the Linux kernel comprise many separate C source modules, each with its own routines and data structures. These C source code modules group together logical functions such as filesystem handling code.

C supports many types of variables; a variable is a location in memory which can be referenced by a symbolic name. In the above C fragment x and y refer to locations in memory. The programmer does not care where in memory the variables are put; it is the linker (see below) that has to worry about that. Some variables contain different sorts of data, integer and floating point, and others are pointers.

Pointers are variables that contain the address, the location in memory, of other data. Consider a variable called x; it might live in memory at address 0x80010000. You could have a pointer, called px, which points at x. px might live at address 0x80010030. The value of px would be 0x80010000: the address of the variable x.

C allows you to bundle together related variables into data structures. For example,

    struct {
        int i ;
        char b ;
    } my_struct ;

is a data structure called my_struct which contains two elements, an integer (32 bits of data storage) called i and a character (8 bits of data) called b.
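The pointer and structure ideas above can be tried out with a few lines of C. This is only a minimal sketch: the names copy_if_different, record and record_init are made up for illustration, and the addresses that a real compiler and linker pick for x and px will not be the 0x80010000 and 0x80010030 used in the text.

```c
#include <assert.h>
#include <stddef.h>

/* The structure from the text, given a tag (an assumption) so it can be reused. */
struct record {
    int  i;   /* an integer */
    char b;   /* a character */
};

/* Same logic as "if (x != y) x = y;", but x is reached through a pointer. */
void copy_if_different(int *px, int y)
{
    assert(px != NULL);     /* px must hold the address of a real variable */
    if (*px != y)
        *px = y;
}

/* Fills in both elements of a structure through a pointer to it. */
void record_init(struct record *r, int i, char b)
{
    assert(r != NULL);
    r->i = i;
    r->b = b;
}
```

Calling copy_if_different(&x, y) changes x itself, because the routine is handed the address of x rather than a copy of its contents.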

2.1.3 Linkers

Linkers are programs that link together several object modules and libraries to form a single, coherent, program. Object modules are the machine code output from an assembler or compiler and contain executable machine code and data together with information that allows the linker to combine the modules together to form a program. For example one module might contain all of a program's database functions and another module its command line argument handling functions. Linkers fix up references between these object modules, where a routine or data structure referenced in one module actually exists in another module. The Linux kernel is a single, large program linked together from its many constituent object modules.

2.2 What is an Operating System?

Without software a computer is just a pile of electronics that gives off heat. If the hardware is the heart of a computer then the software is its soul. An operating system is a collection of system programs which allow the user to run application software. The operating system abstracts the real hardware of the system and presents the system's users and its applications with a virtual machine. In a very real sense the software provides the character of the system. Most PCs can run one or more operating systems and each one can have a very different look and feel. Linux is made up of a number of functionally separate pieces that, together, comprise the operating system. One obvious part of Linux is the kernel itself; but even that would be useless without libraries or shells.

In order to start understanding what an operating system is, consider what happens when you type an apparently simple command:

    $ ls
    Mail    c       images  perl
    docs    tcl
    $

The $ is a prompt put out by a login shell (in this case bash). This means that it is waiting for you, the user, to type some command. Typing ls causes the keyboard driver to recognize that characters have been typed. The keyboard driver passes them to the shell which processes that command by looking for an executable image of the same name. It finds that image, in /bin/ls. Kernel services are called to pull the ls executable image into virtual memory and start executing it. The ls image makes calls to the file subsystem of the kernel to find out what files are available. The filesystem might make use of cached filesystem information or use the disk device driver to read this information from the disk. It might even cause a network driver to exchange information with a remote machine to find out details of remote files that this system has access to (filesystems can be remotely mounted via the Networked File System or NFS). Whichever way the information is located, ls writes that information out and the video driver displays it on the screen.

All of the above seems rather complicated but it shows that even the most simple commands reveal that an operating system is in fact a co-operating set of functions that together give you, the user, a coherent view of the system.
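The step where the shell looks for an executable image of the same name can be sketched in C. This is a simplified illustration and not how bash is actually written: the function name find_executable and its parameters are assumptions, and a real shell also handles built-in commands, aliases and caching. The only system call used is the standard access() with X_OK.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Looks for an executable file called `name` in each directory of the
 * colon-separated list `dirs`.  On success the full path is written to
 * `out` and 0 is returned; otherwise -1 is returned.
 */
int find_executable(const char *name, const char *dirs, char *out, size_t outlen)
{
    assert(name != NULL && dirs != NULL && out != NULL);

    const char *start = dirs;
    while (*start != '\0') {
        const char *end = strchr(start, ':');
        size_t len = end ? (size_t)(end - start) : strlen(start);

        /* Build "<directory>/<name>" and check that it is executable. */
        int n = snprintf(out, outlen, "%.*s/%s", (int)len, start, name);
        if (len > 0 && n > 0 && (size_t)n < outlen && access(out, X_OK) == 0)
            return 0;

        if (end == NULL)
            break;
        start = end + 1;
    }
    return -1;
}
```

Given a directory list that contains /bin, a call such as find_executable("ls", dirs, path, sizeof path) would be expected to fill path with /bin/ls on a system laid out like the one in the text.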

2.2.1 Memory management

With infinite resources, for example memory, many of the things that an operating system has to do would be redundant. One of the basic tricks of any operating system is the ability to make a small amount of physical memory behave like rather more memory. This apparently large memory is known as virtual memory. The idea is that the software running in the system is fooled into believing that it is running in a lot of memory. The system divides the memory into easily handled pages and swaps these pages onto a hard disk as the system runs. The software does not notice because of another trick, multi-processing.

2.2.2 Processes

A process could be thought of as a program in action; each process is a separate entity that is running a particular program. If you look at the processes on your Linux system, you will see that there are rather a lot. For example, typing ps shows the following processes on my system:

    $ ps
      PID TTY STAT  TIME COMMAND
      158 pRe 1     0:00 -bash
      174 pRe 1     0:00 sh /usr/X11R6/bin/startx
      175 pRe 1     0:00 xinit /usr/X11R6/lib/X11/xinit/xinitrc --
      178 pRe 1 N   0:00 bowman
      182 pRe 1 N   0:01 rxvt -geometry 120x35 -fg white -bg black
      184 pRe 1 <   0:00 xclock -bg grey -geometry -1500-1500 -padding 0
      185 pRe 1 <   0:00 xload -bg grey -geometry -0-0 -label xload
      187 pp6 1     9:26 /bin/bash
      202 pRe 1 N   0:00 rxvt -geometry 120x35 -fg white -bg black
      203 ppc 2     0:00 /bin/bash
     1796 pRe 1 N   0:00 rxvt -geometry 120x35 -fg white -bg black
     1797 v06 1     0:00 /bin/bash
     3056 pp6 3 <   0:02 emacs intro/introduction.tex
     3270 pp6 3     0:00 ps
    $

If my system had many CPUs then each process could (theoretically at least) run on a different CPU. Unfortunately, there is only one so again the operating system resorts to trickery by running each process in turn for a short period. This period of time is known as a time-slice. This trick is known as multi-processing or scheduling and it fools each process into thinking that it is the only process. Processes are protected from one another so that if one process crashes or malfunctions then it will not affect any others. The operating system achieves this by giving each process a separate address space which only they have access to.
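The separate address spaces described above can be observed from an ordinary program. The sketch below uses the standard fork() and waitpid() calls; the function name value_seen_by_parent is made up for illustration. The child process overwrites a variable, and the parent then checks its own copy.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Creates a child process that overwrites its own copy of `value`,
 * then returns what the parent still sees.  Returns -1 on error.
 */
int value_seen_by_parent(void)
{
    int value = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;               /* could not create a new process */

    if (pid == 0) {
        value = 99;              /* changes only the child's address space */
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    assert(WIFEXITED(status));   /* the child ended normally */

    return value;                /* still 1: the parent's copy is untouched */
}
```

The parent's value stays at 1 because, after fork(), parent and child each have their own address space, so a write in one is invisible to the other.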

2.2.3 Device drivers

Device drivers make up the major part of the Linux kernel. Like other parts of the operating system, they operate in a highly privileged environment and can cause disaster if they get things wrong. Device drivers control the interaction between the operating system and the hardware device that they are controlling. For example, the filesystem makes use of a general block device interface when writing blocks to an IDE disk. The driver takes care of the details and makes device specific things happen. Device drivers are specific to the controller chip that they are driving which is why, for example, you need the NCR810 SCSI driver if your system has an NCR810 SCSI controller.

2.2.4 The Filesystems

In Linux, as it is for Unix™, the separate filesystems that the system may use are not accessed by device identifiers (such as a drive number or a drive name) but instead they are combined into a single hierarchical tree structure that represents the filesystem as a single entity. Linux adds each new filesystem into this single filesystem tree as they are mounted onto a mount directory, for example /mnt/cdrom. One of the most important features of Linux is its support for many different filesystems. This makes it very flexible and well able to coexist with other operating systems. The most popular filesystem for Linux is the EXT2 filesystem and this is the filesystem supported by most of the Linux distributions.

A filesystem gives the user a sensible view of files and directories held on the hard disks of the system regardless of the filesystem type or the characteristics of the underlying physical device. Linux transparently supports many different filesystems (for example MS-DOS and EXT2) and presents all of the mounted files and filesystems as one integrated virtual filesystem. So, in general, users and processes do not need to know what sort of filesystem that any file is part of, they just use them.

The block device drivers hide the differences between the physical block device types (for example, IDE and SCSI) and, so far as each filesystem is concerned, the physical devices are just linear collections of blocks of data. The block sizes may vary between devices, for example 512 bytes is common for floppy devices whereas 1024 bytes is common for IDE devices and, again, this is hidden from the users of the system. An EXT2 filesystem looks the same no matter what device holds it.

2.3 Kernel Data Structures

The operating system must keep a lot of information about the current state of the system. As things happen within the system these data structures must be changed to reflect the current reality. For example, a new process might be created when a user logs onto the system. The kernel must create a data structure representing the new process and link it with the data structures representing all of the other processes in the system.

Mostly these data structures exist in physical memory and are accessible only by the kernel and its subsystems. Data structures contain data and pointers; addresses of other data structures or the addresses of routines. Taken all together, the data structures used by the Linux kernel can look very confusing. Every data structure has a purpose and although some are used by several kernel subsystems, they are more simple than they appear at first sight.
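As a rough illustration of a data structure that holds data, the address of another data structure and the address of a routine, consider the sketch below. The names struct proc, proc_link, proc_count, proc_priority and fixed_priority are all hypothetical; the kernel's real process descriptor is far larger and is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* A made-up process descriptor. */
struct proc {
    int          pid;                             /* data */
    struct proc *next;                            /* address of another structure */
    int        (*priority)(const struct proc *);  /* address of a routine */
};

/* Root pointer for the list of all processes; NULL means the list is empty. */
static struct proc *all_procs = NULL;

/* Links a new descriptor in at the front of the list. */
void proc_link(struct proc *p)
{
    assert(p != NULL);
    p->next = all_procs;
    all_procs = p;
}

/* Walks the list to count the processes that have been linked in. */
int proc_count(void)
{
    int n = 0;
    for (const struct proc *p = all_procs; p != NULL; p = p->next)
        n++;
    return n;
}

/* Calls the routine whose address is stored in the descriptor, if any. */
int proc_priority(const struct proc *p)
{
    assert(p != NULL);
    return p->priority != NULL ? p->priority(p) : 0;
}

/* An example routine that a descriptor might point at. */
int fixed_priority(const struct proc *p)
{
    (void)p;
    return 10;
}
```

Creating a process then amounts to filling in a new struct proc and passing it to proc_link, which is the "create a data structure and link it with all of the others" step from the text in miniature.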

Understanding the Linux kernel hinges on understanding its data structures and the use that the various functions within the Linux kernel make of them. This book bases its description of the Linux kernel on its data structures. It talks about each kernel subsystem in terms of its algorithms, its methods of getting things done, and their usage of the kernel's data structures.

2.3.1 Linked Lists

Linux uses a number of software engineering techniques to link together its data structures. On a lot of occasions it uses linked or chained data structures. If each data structure describes a single instance or occurrence of something, for example a process or a network device, the kernel must be able to find all of the instances. In a linked list a root pointer contains the address of the first data structure, or element, in the list and each data structure contains a pointer to the next element in the list. The last element's next pointer would be 0 or NULL to show that it is the end of the list. In a doubly linked list each element contains both a pointer to the next element in the list and a pointer to the previous element in the list. Using doubly linked lists makes it easier to add or remove elements from the middle of the list although you do need more memory accesses. This is a typical operating system trade off: memory accesses versus CPU cycles.

2.3.2 Hash Tables

Linked lists are handy ways of tying data structures together but navigating linked lists can be inefficient. If you were searching for a particular element, you might easily have to look at the whole list before you find the one that you need. Linux uses another technique, hashing, to get around this restriction. A hash table is an array or vector of pointers. An array, or vector, is simply a set of things coming one after another in memory. A bookshelf could be said to be an array of books. Arrays are accessed by an index; the index is an offset into the array. Taking the bookshelf analogy a little further, you could describe each book by its position on the shelf; you might ask for the 5th book.

A hash table is an array of pointers to data structures and its index is derived from information in those data structures. If you had data structures describing the population of a village then you could use a person's age as an index. To find a particular person's data you could use their age as an index into the population hash table and then follow the pointer to the data structure containing the person's details. Unfortunately many people in the village are likely to have the same age and so the hash table pointer becomes a pointer to a chain or list of data structures each describing people of the same age. However, searching these shorter chains is still faster than searching all of the data structures.

As a hash table speeds up access to commonly used data structures, Linux often uses hash tables to implement caches. Caches are handy information that needs to be accessed quickly and are usually a subset of the full set of information available. Data structures are put into a cache and kept there because the kernel often accesses them. There is a drawback to caches in that they are more complex to use and maintain than simple linked lists or hash tables. If the data structure can be found in the cache (this is known as a cache hit), then all well and good. If it cannot then all of the relevant data structures must be searched and, if the data structure exists at all, it must be added into the cache. In adding new data structures into the cache an old cache entry may need discarding. Linux must decide which one to discard, the danger being that the discarded data structure may be the next one that Linux needs.
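The village example from the hash table discussion can be written down directly: an array of pointers indexed by age, where each pointer is the root of a singly linked chain of people of that age. This is a minimal sketch; the names struct person, village_add and village_find and the table size MAX_AGE are assumptions made for illustration, not anything taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_AGE 128   /* table size; an assumption for this sketch */

struct person {
    const char    *name;
    int            age;
    struct person *next;   /* next person of the same age, or NULL */
};

/* One chain per age; every pointer starts out NULL (an empty chain). */
static struct person *population[MAX_AGE];

/* Adds a person to the front of the chain for their age. */
void village_add(struct person *p)
{
    assert(p != NULL && p->age >= 0 && p->age < MAX_AGE);
    p->next = population[p->age];
    population[p->age] = p;
}

/* Uses the age as the index, then searches only that (short) chain. */
struct person *village_find(const char *name, int age)
{
    if (age < 0 || age >= MAX_AGE)
        return NULL;
    for (struct person *p = population[age]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}
```

A lookup only ever walks the chain for one age, so two villagers of the same age share a chain while everyone else is skipped entirely.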

2.3.3 Abstract Interfaces

The Linux kernel often abstracts its interfaces. An interface is a collection of routines and data structures which operate in a particular way. For example all network device drivers have to provide certain routines in which particular data structures are operated on. This way there can be generic layers of code using the services (interfaces) of lower layers of specific code. The network layer is generic and it is supported by device specific code that conforms to a standard interface.

Often these lower layers register themselves with the upper layer at boot time. This registration usually involves adding a data structure to a linked list. For example each filesystem built into the kernel registers itself with the kernel at boot time or, if you are using modules, when the filesystem is first used. You can see which filesystems have registered themselves by looking at the file /proc/filesystems. The registration data structure often includes pointers to functions. These are the addresses of software functions that perform particular tasks. Again, using filesystem registration as an example, the data structure that each filesystem passes to the Linux kernel as it registers includes the address of a filesystem specific routine which must be called whenever that filesystem is mounted.

File translated from TEX by TTH, version 1.0.
© 1996-1999 David A Rusling

Chapter 3 Memory Management

The memory management subsystem is one of the most important parts of the operating system. Since the early days of computing, there has been a need for more memory than exists physically in a system. Strategies have been developed to overcome this limitation and the most successful of these is virtual memory. Virtual memory makes the system appear to have more memory than it actually has by sharing it between competing processes as they need it.

Virtual memory does more than just make your computer's memory go further. The memory management subsystem provides:

Large Address Spaces
    The operating system makes the system appear as if it has a larger amount of memory than it actually has. The virtual memory can be many times larger than the physical memory in the system.

Protection
    Each process in the system has its own virtual address space. These virtual address spaces are completely separate from each other and so a process running one application cannot affect another. Also, the hardware virtual memory mechanisms allow areas of memory to be protected against writing. This protects code and data from being overwritten by rogue applications.

Memory Mapping
    Memory mapping is used to map image and data files into a process's address space. In memory mapping, the contents of a file are linked directly into the virtual address space of a process.

Fair Physical Memory Allocation
    The memory management subsystem allows each running process in the system a fair share of the physical memory of the system.

Shared Virtual Memory
    Although virtual memory allows processes to have separate (virtual) address spaces, there are times when you need processes to share memory. For example there could be several processes in the system running the bash command shell. Rather than have several copies of bash, one in each process's virtual address space, it is better to have only one copy in physical memory and all of the processes running bash share it. Dynamic libraries are another common example of executing code shared between several processes.

    Shared memory can also be used as an Inter Process Communication (IPC) mechanism, with two or more processes exchanging information via memory common to all of them. Linux supports the Unix™ System V shared memory IPC.

3.1 An Abstract Model of Virtual Memory

Figure 3.1: Abstract model of Virtual to Physical address mapping

Before considering the methods that Linux uses to support virtual memory it is useful to consider an abstract model that is not cluttered by too much detail.

As the processor executes a program it reads an instruction from memory and decodes it. In decoding the instruction it may need to fetch or store the contents of a location in memory. The processor then executes the instruction and moves onto the next instruction in the program. In this way the processor is always accessing memory either to fetch instructions or to fetch and store data.

In a virtual memory system all of these addresses are virtual addresses and not physical addresses. These virtual addresses are converted into physical addresses by the processor based on information held in a set of tables maintained by the operating system.

To make this translation easier, virtual and physical memory are divided into handy sized chunks called pages. These pages are all the same size; they need not be but if they were not, the system would be very hard to administer. Linux on Alpha AXP systems uses 8 Kbyte pages and on Intel x86 systems it uses 4 Kbyte pages. Each of these pages is given a unique number, the page frame number (PFN).

In this paged model, a virtual address is composed of two parts: an offset and a virtual page frame number. If the page size is 4 Kbytes, bits 11:0 of the virtual address contain the offset and bits 12 and above are the virtual page frame number. Each time the processor encounters a virtual address it must extract the offset and the virtual page frame number. The processor must translate the virtual page frame number into a physical one and then access the location at the correct offset into that physical page. To do this the processor uses page tables.

Figure 3.1 shows the virtual address spaces of two processes, process X and process Y, each with their own page tables. These page tables map each process's virtual pages into physical pages in memory. This shows that process X's virtual page frame number 0 is mapped into memory in physical page frame number 1 and that process Y's virtual page frame number 1 is mapped into physical page frame number 4. Each entry in the theoretical page table contains the following information:

- Valid flag. This indicates if this page table entry is valid.
- The physical page frame number that this entry is describing.
- Access control information. This describes how the page may be used. Can it be written to? Does it contain executable code?
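Splitting a virtual address into its two parts takes one shift and one mask when the page size is a power of 2. The sketch below assumes the 4 Kbyte page size mentioned above; the macro and function names (PAGE_SHIFT, OFFSET_MASK, vpfn_of, offset_of, join_address) are illustrative choices rather than definitions taken from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12                           /* 4 Kbyte pages: 2^12 bytes */
#define PAGE_SIZE   ((uint64_t)1 << PAGE_SHIFT)
#define OFFSET_MASK (PAGE_SIZE - 1)              /* selects bits 11:0 */

/* Bits 12 and above: the virtual page frame number. */
uint64_t vpfn_of(uint64_t vaddr)
{
    return vaddr >> PAGE_SHIFT;
}

/* Bits 11:0: the offset within the page. */
uint64_t offset_of(uint64_t vaddr)
{
    return vaddr & OFFSET_MASK;
}

/* Rebuilds an address from a page frame number and an offset. */
uint64_t join_address(uint64_t pfn, uint64_t offset)
{
    assert(offset < PAGE_SIZE);   /* the offset must fit inside one page */
    return (pfn << PAGE_SHIFT) | offset;
}
```

With these definitions, address 0x2194 splits into virtual page frame number 2 and offset 0x194, since 0x2194 is 2 x 0x1000 + 0x194.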

The page table is accessed using the virtual page frame number as an offset. Virtual page frame 5 would be the 6th element of the table (0 is the first element).

To translate a virtual address into a physical one, the processor must first work out the virtual address's page frame number and the offset within that virtual page. By making the page size a power of 2 this can be easily done by masking and shifting. Looking again at Figure 3.1 and assuming a page size of 0x2000 bytes (which is decimal 8192) and an address of 0x2194 in process Y's virtual address space then the processor would translate that address into offset 0x194 into virtual page frame number 1.

The processor uses the virtual page frame number as an index into the process's page table to retrieve its page table entry. If the page table entry at that offset is valid, the processor takes the physical page frame number from this entry. If the entry is invalid, the process has accessed a non-existent area of its virtual memory. In this case, the processor cannot resolve the address and must pass control to the operating system so that it can fix things up.

Just how the processor notifies the operating system that the current process has attempted to access a virtual address for which there is no valid translation is specific to the processor. However the processor delivers it, this is known as a page fault and the operating system is notified of the faulting virtual address and the reason for the page fault.

Assuming that this is a valid page table entry, the processor takes that physical page frame number and multiplies it by the page size to get the address of the base of the page in physical memory. Finally, the processor adds in the offset to the instruction or data that it needs.

Using the above example again, process Y's virtual page frame number 1 is mapped to physical page frame number 4 which starts at 0x8000 (4 x 0x2000). Adding in the 0x194 byte offset gives us a final physical address of 0x8194.

By mapping virtual to physical addresses this way, the virtual memory can be mapped into the system's physical pages in any order. For example, in Figure 3.1 process X's virtual page frame number 0 is mapped to physical page frame number 1 whereas virtual page frame number 7 is mapped to physical page frame number 0 even though it is higher in virtual memory than virtual page frame number 0. This demonstrates an interesting byproduct of virtual memory: the pages of virtual memory do not have to be present in physical memory in any particular order.

3.1.1 Demand Paging

As there is much less physical memory than virtual memory the operating system must be careful that it does not use the physical memory inefficiently. One way to save physical memory is to only load virtual pages that are currently being used by the executing program. For example, a database program may be run to query a database. In this case not all of the database needs to be loaded into memory, just those data records that are being examined. If the database query is a search query then it does not make sense to load the code from the database program that deals with adding new records. This technique of only loading virtual pages into memory as they are accessed is known as demand paging.

When a process attempts to access a virtual address that is not currently in memory the processor cannot find a page table entry for the virtual page referenced. For example, in Figure 3.1 there is no entry in process X's page table for virtual page frame number 2 and so if process X attempts to read from an address within virtual page frame number 2 the processor cannot translate the address into a physical one. At this point the processor notifies the operating system that a page fault has occurred.

If the faulting virtual address is invalid this means that the process has attempted to access a virtual address that it should not have. Maybe the application has gone wrong in some way, for example writing to random addresses in memory. In this case the operating system will terminate it, protecting the other processes in the system from this rogue process.

If the faulting virtual address was valid but the page that it refers to is not currently in memory, the operating system must bring the appropriate page into memory from the image on disk. Disk access takes a long time, relatively speaking, and so the process must wait quite a while until the page has been fetched. If there are other processes that could run then the operating system will select one of them to run. The fetched page is written into a free physical page frame and an entry for the virtual page frame number is added to the process's page table. The process is then restarted at the machine instruction where the memory fault occurred. This time the virtual memory access is made, the processor can make the virtual to physical address translation and so the process continues to run.

Linux uses demand paging to load executable images into a process's virtual memory. Whenever a command is executed, the file containing it is opened and its contents are mapped into the process's virtual memory. This is done by modifying the data structures describing this process's memory map and is known as memory mapping.

However, only the first part of the image is actually brought into physical memory. The rest of the image is left on disk. As the image executes, it generates page faults and Linux uses the process's memory map in order to determine which parts of the image to bring into memory for execution.

3.1.2 Swapping

If a process needs to bring a virtual page into physical memory and there are no free physical pages available, the operating system must make room for this page by discarding another page from physical memory.

If the page to be discarded from physical memory came from an image or data file and has not been written to then the page does not need to be saved. Instead it can be discarded and if the process needs that page again it can be brought back into memory from the image or data file.

However, if the page has been modified, the operating system must preserve the contents of that page so that it can be accessed at a later time. This type of page is known as a dirty page and when it is removed from memory it is saved in a special sort of file called the swap file. Accesses to the swap file are very long relative to the speed of the processor and physical memory and the operating system must juggle the need to write pages to disk with the need to retain them in memory to be used again.

If the algorithm used to decide which pages to discard or swap (the swap algorithm) is not efficient then a condition known as thrashing occurs. In this case, pages are constantly being written to disk and then being read back and the operating system is too busy to allow much real work to be performed. If, for example, physical page frame number 1 in Figure 3.1 is being regularly accessed then it is not a good candidate for swapping to hard disk. The set of pages that a process is currently using is called the working set. An efficient swap scheme would make sure that all processes have their working set in physical memory.

Linux uses a Least Recently Used (LRU) page aging technique to fairly choose pages which might be removed from the system. This scheme involves every page in the system having an age which changes as the page is accessed. The more that a page is accessed, the younger it is; the less that it is accessed the older and more stale it becomes. Old pages are good candidates for swapping.

3.1.3 Shared Virtual Memory

Virtual memory makes it easy for several processes to share memory. All memory accesses are made via page tables and each process has its own separate page table. For two processes sharing a physical page of memory, its physical page frame number must appear in a page table entry in both of their page tables.

Figure 3.1 shows two processes that each share physical page frame number 4. For process X this is virtual page frame number 4 whereas for process Y this is virtual page frame number 6. This illustrates an interesting point about sharing pages: the shared physical page does not have to exist at the same place in virtual memory for any or all of the processes sharing it.

3.1.4 Physical and Virtual Addressing Modes

It does not make much sense for the operating system itself to run in virtual memory. This would be a nightmare situation where the operating system must maintain page tables for itself. Most multi-purpose processors support the notion of a physical address mode as well as a virtual address mode. Physical addressing mode requires no page tables and the processor does not attempt to perform any address translations in this mode. The Linux kernel is linked to run in physical address space.

The Alpha AXP processor does not have a special physical addressing mode. Instead, it divides up the memory space into several areas and designates two of them as physically mapped addresses. This kernel address space is known as KSEG address space and it encompasses all addresses upwards from 0xfffffc0000000000. In order to execute from code linked in KSEG (by definition, kernel code) or access data there, the code must be executing in kernel mode. The Linux kernel on Alpha is linked to execute from address 0xfffffc0000310000.

3.1.5 Access Control

The page table entries also contain access control information. As the processor is already using the page table entry to map a process's virtual address to a physical one, it can easily use the access control information to check that the process is not accessing memory in a way that it should not.
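Putting the translation steps and the access check together, a software model of what the processor does might look like the sketch below. It assumes the 0x2000 byte page size of the earlier worked example and a theoretical page table entry holding only a valid flag, a write permission and a page frame number. The names struct pte, translate and the fault codes are made up for illustration; real processors do this in hardware and real entries are processor specific.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 0x2000u   /* 8192 bytes, as in the worked example */

/* Theoretical page table entry: valid flag, access control, frame number. */
struct pte {
    bool     valid;
    bool     writable;
    uint32_t pfn;
};

/* Outcome of a translation, including the reason for any page fault. */
enum xlate_result { XLATE_OK, FAULT_NOT_VALID, FAULT_WRITE_DENIED };

/*
 * Translates vaddr using `table`, which is indexed by virtual page frame
 * number.  On XLATE_OK the physical address is stored in *paddr.
 */
enum xlate_result translate(const struct pte *table, size_t entries,
                            uint32_t vaddr, bool is_write, uint32_t *paddr)
{
    assert(table != NULL && paddr != NULL);

    uint32_t vpfn   = vaddr / PAGE_SIZE;   /* power of 2, so really a shift */
    uint32_t offset = vaddr % PAGE_SIZE;   /* ...and this is really a mask */

    if (vpfn >= entries || !table[vpfn].valid)
        return FAULT_NOT_VALID;            /* no valid translation */
    if (is_write && !table[vpfn].writable)
        return FAULT_WRITE_DENIED;         /* access control forbids it */

    *paddr = table[vpfn].pfn * PAGE_SIZE + offset;
    return XLATE_OK;
}
```

With process Y's virtual page frame number 1 mapped to physical page frame number 4, a read of 0x2194 yields 0x8194, matching the arithmetic in the text, while a write to the same address fails if the entry is not marked writable.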

There are many reasons why you would want to restrict access to areas of memory. Some memory, such as that containing executable code, is naturally read only memory; the operating system should not allow a process to write data over its executable code. By contrast, pages containing data can be written to but attempts to execute that memory as instructions should fail. Most processors have at least two modes of execution: kernel and user. You would not want kernel code to be executed by a user or kernel data structures to be accessible except when the processor is running in kernel mode.

Figure 3.2: Alpha AXP Page Table Entry

The access control information is held in the PTE and is processor specific; Figure 3.2 shows the PTE for Alpha AXP. The bit fields have the following meanings:

V
    Valid, if set this PTE is valid.
FOE
    "Fault on Execute". Whenever an attempt to execute instructions in this page occurs, the processor reports a page fault and passes control to the operating system.
FOW
    "Fault on Write", as above but page fault on an attempt to write to this page.
FOR
    "Fault on Read", as above but page fault on an attempt to read from this page.
ASM
    Address Space Match. This is used when the operating system wishes to clear only some of the entries from the Translation Buffer.
KRE
    Code running in kernel mode can read this page.
URE
    Code running in user mode can read this page.
GH
    Granularity hint used when mapping an entire block with a single Translation Buffer entry rather than many.
KWE
    Code running in kernel mode can write to this page.
UWE
    Code running in user mode can write to this page.

html (6 di 16) [08/03/2001 10.iol. http://ldp. it can directly translate the virtual address into a physical one and perform the correct operation on the data. This time it will work because there is now a valid entry in the TLB for that address. So long as these pages are not modified after they have been written to the swap file then the next time the page is swapped out there is no need to write it to the swap file as the page is already in the swap file. Swap Cache Only modified (or dirty) pages are saved in the swap file. for example a hard disk. if this field is not zero. A block device is one that can only be accessed by reading and writing fixed sized blocks of data. Instead the page can simply be discarded.29] . These are the Translation Look-aside Buffers and contain cached copies of the page table entries from one or more processes in the system.08. the processor does not always read the page table directly but instead caches translations for pages as it needs them. it contains information about where the page is in the swap file. if the caches become corrupted. When the exception has been cleared. The buffer cache is indexed via the device identifier and the desired block number and is used to quickly find a block of data. and access to it is much faster. is that in order to save effort Linux must use more time and space maintaining these caches and. the processor will make another attempt to translate the virtual address. the page needs to be written out to the swap file. _PAGE_ACCESSED Used by Linux to mark a page as having been accessed. the processor will attempt to find a matching TLB entry. the system will crash. All hard disks are block devices. Page Cache This is used to speed up access to images and data on disk. Block devices are only ever accessed via the buffer cache. hardware or otherwise. 3. It does this by signalling the operating system that a TLB miss has occurred. 
memory and so on faster the best approach is to maintain caches of useful information and data that make some operations faster. they are cached in the page cache. Hardware Caches One commonly implemented hardware cache is in the processor. As pages are read into memory from disk. Apart from making the processors. The operating system generates a new TLB entry for the address mapping.2 Caches If you were to implement a system using the above theoretical model then it would work. In a heavily swapping system this saves many unnecessary and costly disk operations. a cache of Page Table Entries. In this case. If the processor cannot find a matching TLB entry then it must get the operating system to help. The drawback of using caches. For invalid PTEs. If it finds one. Both operating system and processor designers try hard to extract more performance from the system. but not particularly efficiently.it/LDP/tlk/mm/memory. If data can be found in the buffer cache then it does not need to be read from the physical block device. this field contains the physical Page Frame Number (page frame number) for this PTE. When the reference to the virtual address is made. A system specific mechanism is used to deliver that exception to the operating system code that can fix things up. Linux uses a number of memory management related caches: Buffer Cache The buffer cache contains data buffers that are used by the block device drivers. These buffers are of fixed sizes (for example 512 bytes) and contain blocks of information that have either been read from a block device or are being written to it. The following two bits are defined and used by Linux: _PAGE_DIRTY if set.page frame number For PTEs with the V bit set. It is used to cache the logical contents of a file a page at a time and is accessed via the file and offset within the file.
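The TLB lookup and miss handling described above can be sketched in a few lines of Python. This is only an illustration: the TLB class, the dictionary standing in for the page table and the 8 KB page size are assumptions of the sketch, not kernel code.

```python
PAGE_SIZE = 8192  # assumed page size for this sketch


class TLB:
    """Toy translation look-aside buffer backed by a page table dictionary."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> page frame number
        self.entries = {}             # cached translations
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.entries:
            # TLB miss: the "operating system" builds a new entry from the
            # page table, after which the translation is attempted again.
            self.misses += 1
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset


tlb = TLB({0: 7, 1: 3})
first = tlb.translate(PAGE_SIZE + 0x10)   # miss, entry is created
second = tlb.translate(PAGE_SIZE + 0x20)  # hit, same page
```

The second translation touches the same virtual page, so it is served from the cached entry without consulting the page table again.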

3.3 Linux Page Tables
Figure 3.3: Three Level Page Tables

Linux assumes that there are three levels of page tables. Each Page Table accessed contains the page frame number of the next level of Page Table. Figure 3.3 shows how a virtual address can be broken into a number of fields, each field providing an offset into a particular Page Table. To translate a virtual address into a physical one, the processor must take the contents of each level field, convert it into an offset into the physical page containing the Page Table and read the page frame number of the next level of Page Table. This is repeated three times until the page frame number of the physical page containing the virtual address is found. Now the final field in the virtual address, the byte offset, is used to find the data inside the page.

Each platform that Linux runs on must provide translation macros that allow the kernel to traverse the page tables for a particular process. This way, the kernel does not need to know the format of the page table entries or how they are arranged. This is so successful that Linux uses the same page table manipulation code for the Alpha processor, which has three levels of page tables, and for Intel x86 processors, which have two levels of page tables.

3.4 Page Allocation and Deallocation
There are many demands on the physical pages in the system. For example, when an image is loaded into memory the operating system needs to allocate pages. These will be freed when the image has finished executing and is unloaded. Another use for physical pages is to hold kernel specific data structures such as the page tables themselves. The mechanisms and data structures used for page allocation and deallocation are perhaps the most critical in maintaining the efficiency of the virtual memory subsystem.

All of the physical pages in the system are described by the mem_map data structure, which is a list of mem_map_t [1] structures that is initialized at boot time. Each mem_map_t describes a single physical page in the system. Important fields (so far as memory management is concerned) are:

count: This is a count of the number of users of this page. The count is greater than one when the page is shared between many processes.
age: This field describes the age of the page and is used to decide if the page is a good candidate for discarding or swapping.
map_nr: This is the physical page frame number that this mem_map_t describes.

The free_area vector is used by the page allocation code to find and free pages. The whole buffer management scheme is supported by this mechanism and so far as the code is concerned, the size of the page and physical paging mechanisms used by the processor are irrelevant.

Each element of free_area contains information about blocks of pages. The first element in the array describes single pages, the next blocks of 2 pages, the next blocks of 4 pages and so on upwards in powers of two. The list element is used as a queue head and has pointers to the page data structures in the mem_map array. Free blocks of pages are queued here. map is a pointer to a bitmap which keeps track of allocated groups of pages of this size. Bit N of the bitmap is set if the Nth block of pages is free.

Figure 3.4 shows the free_area structure. Element 0 has one free page (page frame number 0) and element 2 has 2 free blocks of 4 pages, the first starting at page frame number 4 and the second at page frame number 56.

3.4.1 Page Allocation
Linux uses the Buddy algorithm [2] to effectively allocate and deallocate blocks of pages. The page allocation code attempts to allocate a block of one or more physical pages. Pages are allocated in blocks which are powers of 2 in size. That means that it can allocate a block of 1 page, 2 pages, 4 pages and so on. So long as there are enough free pages in the system to grant this request (nr_free_pages > min_free_pages) the allocation code will search the free_area for a block of pages of the size requested. Each element of the free_area has a map of the allocated and free blocks of pages for that sized block. For example, element 2 of the array has a memory map that describes free and allocated blocks each 4 pages long.

The allocation algorithm first searches for blocks of pages of the size requested. It follows the chain of free pages that is queued on the list element of the free_area data structure. If no blocks of pages of the requested size are free, blocks of the next size (which is twice that of the size requested) are looked for. This process continues until all of the free_area has been searched or until a block of pages has been found. If the block of pages found is larger than that requested it must be broken down until there is a block of the right size. Because the blocks are each a power of 2 pages big, this breaking down process is easy as you simply break the blocks in half. The free blocks are queued on the appropriate queue and the allocated block of pages is returned to the caller.
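The search-and-split behaviour of the allocator can be illustrated with a small Python model. It is a sketch under stated assumptions rather than the kernel's implementation: the BuddyAllocator class, the sets standing in for the free_area queues and the value of MAX_ORDER are all hypothetical, and the bitmap is omitted. The starting state is the one the text gives for the free_area figure. A free method is included so that the inverse operation (section 3.4.2) can be tried as well.

```python
MAX_ORDER = 6  # assumed number of free_area elements for this sketch


class BuddyAllocator:
    def __init__(self):
        # free_area[order] holds the first page frame number of every
        # free block of 2**order pages.
        self.free_area = [set() for _ in range(MAX_ORDER)]

    def alloc(self, order):
        """Allocate a block of 2**order pages, splitting larger blocks."""
        for current in range(order, MAX_ORDER):
            if not self.free_area[current]:
                continue  # nothing free at this size, try the next size up
            pfn = min(self.free_area[current])
            self.free_area[current].remove(pfn)
            while current > order:
                # Break the block in half and queue the upper half.
                current -= 1
                self.free_area[current].add(pfn + (1 << current))
            return pfn
        return None  # no block large enough

    def free(self, pfn, order):
        """Free a block, recombining it with its buddy while possible."""
        while order < MAX_ORDER - 1:
            buddy = pfn ^ (1 << order)
            if buddy not in self.free_area[order]:
                break
            self.free_area[order].remove(buddy)
            pfn = min(pfn, buddy)
            order += 1
        self.free_area[order].add(pfn)


buddy = BuddyAllocator()
buddy.free_area[0].add(0)           # one free page: page frame number 0
buddy.free_area[2].update({4, 56})  # two free blocks of 4 pages
allocated = buddy.alloc(1)          # request a block of 2 pages
```

Finding a buddy with an exclusive-or of the block size relies on blocks being aligned to their own size, which holds for the page frame numbers used here.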

Figure 3.4: The free_area data structure

For example, in Figure 3.4 if a block of 2 pages was requested, the first block of 4 pages (starting at page frame number 4) would be broken into two 2 page blocks. The first, starting at page frame number 4, would be returned to the caller as the allocated pages and the second block, starting at page frame number 6, would be queued as a free block of 2 pages onto element 1 of the free_area array.

3.4.2 Page Deallocation
Allocating blocks of pages tends to fragment memory with larger blocks of free pages being broken down into smaller ones. The page deallocation code recombines pages into larger blocks of free pages whenever it can. In fact the page block size is important as it allows for easy combination of blocks into larger blocks.

Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if it is free. If it is, then it is combined with the newly freed block of pages to form a new free block of pages for the next size block of pages. Each time two blocks of pages are recombined into a bigger block of free pages the page deallocation code attempts to recombine that block into a yet larger one. In this way the blocks of free pages are as large as memory usage will allow.

For example, in Figure 3.4, if page frame number 1 were to be freed, then that would be combined with the already free page frame number 0 and queued onto element 1 of the free_area as a free block of size 2 pages.

3.5 Memory Mapping
When an image is executed, the contents of the executable image must be brought into the process's virtual address space. The same is also true of any shared libraries that the executable image has been linked to use. The executable file is not actually brought into physical memory; instead it is merely linked into the process's virtual memory. Then, as the parts of the program are referenced by the running application, the image is brought into memory from the executable image. This linking of an image into a process's virtual address space is known as memory mapping.

Figure 3.5: Areas of Virtual Memory

Every process's virtual memory is represented by an mm_struct data structure. This contains information about the image that it is currently executing (for example bash) and also has pointers to a number of vm_area_struct data structures. Each vm_area_struct data structure describes the start and end of the area of virtual memory, the process's access rights to that memory and a set of operations for that memory. These operations are a set of routines that Linux must use when manipulating this area of virtual memory. For example, one of the virtual memory operations performs the correct actions when the process has attempted to access this virtual memory but finds (via a page fault) that the memory is not actually in physical memory. This operation is the nopage operation. The nopage operation is used when Linux demand pages the pages of an executable image into memory.

When an executable image is mapped into a process's virtual address space a set of vm_area_struct data structures is generated. Each vm_area_struct data structure represents a part of the executable image: the executable code, initialized data (variables), uninitialized data and so on. Linux supports a number of standard virtual memory operations and as the vm_area_struct data structures are created, the correct set of virtual memory operations are associated with them.

3.6 Demand Paging
Once an executable image has been memory mapped into a process's virtual memory it can start to execute. As only the very start of the image is physically pulled into memory it will soon access an area of virtual memory that is not yet in physical memory. When a process accesses a virtual address that does not have a valid page table entry, the processor will report a page fault to Linux. The page fault describes the virtual address where the page fault occurred and the type of memory access that caused it.

Linux must find the vm_area_struct that represents the area of memory that the page fault occurred in. As searching through the vm_area_struct data structures is critical to the efficient handling of page faults, these are linked together in an AVL (Adelson-Velskii and Landis) tree structure. If there is no vm_area_struct data structure for this faulting virtual address, this process has accessed an illegal virtual address. Linux will signal the process, sending a SIGSEGV signal, and if the process does not have a handler for that signal it will be terminated.

Linux next checks the type of page fault that occurred against the types of accesses allowed for this area of virtual memory. If the process is accessing the memory in an illegal way, say writing to an area that it is only allowed to read from, it is also signalled with a memory error.

Now that Linux has determined that the page fault is legal, it must deal with it. Linux must differentiate between pages that are in the swap file and those that are part of an executable image on a disk somewhere. It does this by using the page table entry for this faulting virtual address. If the page's page table entry is invalid but not empty, the page fault is for a page currently being held in the swap file. For Alpha AXP page table entries, these are entries which do not have their valid bit set but which have a non-zero value in their PFN field. In this case the PFN field holds information about where in the swap file (and which swap file) the page is being held. How pages in the swap file are handled is described later in this chapter.

Not all vm_area_struct data structures have a set of virtual memory operations and even those that do may not have a nopage operation. This is because by default Linux will fix up the access by allocating a new physical page and creating a valid page table entry for it. If there is a nopage operation for this area of virtual memory, Linux will use it. The generic Linux nopage operation is used for memory mapped executable images and it uses the page cache to bring the required image page into physical memory.

However the required page is brought into physical memory, the process's page tables are updated. It may be necessary for hardware specific actions to update those entries, particularly if the processor uses translation look-aside buffers. Now that the page fault has been handled it can be dismissed and the process is restarted at the instruction that made the faulting virtual memory access.
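The order of the checks described in this section can be summarised as a small decision function. This is a sketch only: handle_fault, its arguments and the returned labels are hypothetical, and the area is reduced to a dictionary holding just the two properties the decision needs.

```python
def handle_fault(area, is_write, swap_info):
    """Decide how a page fault on an invalid page table entry is resolved.

    area      -- None, or a dict with "writable" and "has_nopage" keys
    is_write  -- True if the faulting access was a write
    swap_info -- contents of the PFN field of the invalid entry (0 if empty)
    """
    if area is None:
        return "signal: no such area"       # illegal virtual address
    if is_write and not area["writable"]:
        return "signal: illegal access"     # e.g. write to read only memory
    if swap_info != 0:
        return "swap in"                    # invalid but not empty entry
    if area["has_nopage"]:
        return "nopage"                     # e.g. page in from the image
    return "allocate new page"              # default fix-up


code_area = {"writable": False, "has_nopage": True}
data_area = {"writable": True, "has_nopage": False}
```

For example, handle_fault(data_area, True, 0) falls through every check and ends in the default action of allocating a fresh page.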

3.7 The Linux Page Cache

Figure 3.6: The Linux Page Cache

The role of the Linux page cache is to speed up access to files on disk. Memory mapped files are read a page at a time and these pages are stored in the page cache. Figure 3.6 shows that the page cache consists of the page_hash_table, a vector of pointers to mem_map_t data structures. Each file in Linux is identified by a VFS inode data structure (described in Chapter filesystem-chapter) and each VFS inode is unique and fully describes one and only one file. The index into the page hash table is derived from the file's VFS inode and the offset into the file.

Whenever a page is read from a memory mapped file, for example when it needs to be brought back into memory during demand paging, the page is read through the page cache. If the page is present in the cache, a pointer to the mem_map_t data structure representing it is returned to the page fault handling code. Otherwise the page must be brought into memory from the file system that holds the image. Linux allocates a physical page and reads the page from the file on disk. If it is possible, Linux will initiate a read of the next page in the file. This single page read ahead means that if the process is accessing the pages in the file serially, the next page will be waiting in memory for the process.

Over time the page cache grows as images are read and executed. Pages will be removed from the cache as they are no longer needed, say as an image is no longer being used by any process. As Linux uses memory it can start to run low on physical pages. In this case Linux will reduce the size of the page cache.
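A minimal Python sketch of the lookup and the single page read ahead described above follows. The PageCache class, the dictionary that stands in for page_hash_table, the read_page callback and the use of page-sized offsets are assumptions of the sketch, not the kernel's data structures.

```python
class PageCache:
    def __init__(self, read_page):
        self.read_page = read_page  # hypothetical disk reader: (inode, offset) -> bytes
        self.table = {}             # (inode, offset) -> page; stands in for page_hash_table
        self.disk_reads = 0

    def _load(self, inode, offset):
        self.disk_reads += 1
        self.table[(inode, offset)] = self.read_page(inode, offset)

    def get(self, inode, offset, file_pages):
        """Return the page at a page-sized offset within the file."""
        if (inode, offset) not in self.table:
            self._load(inode, offset)
            # Single page read ahead: fetch the next page too, if there is one.
            following = offset + 1
            if following < file_pages and (inode, following) not in self.table:
                self._load(inode, following)
        return self.table[(inode, offset)]


cache = PageCache(lambda inode, offset: f"{inode}:{offset}".encode())
page0 = cache.get(12, 0, file_pages=3)   # miss: reads pages 0 and 1
reads_after_first = cache.disk_reads
page1 = cache.get(12, 1, file_pages=3)   # hit thanks to read ahead
reads_after_second = cache.disk_reads
```

A process reading the file serially finds page 1 already in the cache, so its second access causes no disk read.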

3.8 Swapping Out and Discarding Pages
When physical memory becomes scarce the Linux memory management subsystem must attempt to free physical pages. This task falls to the kernel swap daemon (kswapd). The kernel swap daemon is a special type of process, a kernel thread. Kernel threads are processes that have no virtual memory; instead they run in kernel mode in the physical address space. The kernel swap daemon is slightly misnamed in that it does more than merely swap pages out to the system's swap files. Its role is to make sure that there are enough free pages in the system to keep the memory management system operating efficiently. The kernel swap daemon (kswapd) is started by the kernel init process at startup time and sits waiting for the kernel swap timer to periodically expire.

Every time the timer expires, the swap daemon looks to see if the number of free pages in the system is getting too low. It uses two variables, free_pages_high and free_pages_low, to decide if it should free some pages. So long as the number of free pages in the system remains above free_pages_high, the kernel swap daemon does nothing; it sleeps again until its timer next expires. For the purposes of this check the kernel swap daemon takes into account the number of pages currently being written out to the swap file. It keeps a count of these in nr_async_pages; this is incremented each time a page is queued waiting to be written out to the swap file and decremented when the write to the swap device has completed. free_pages_low and free_pages_high are set at system startup time and are related to the number of physical pages in the system.

If the number of free pages in the system has fallen below free_pages_high or, worse still, free_pages_low, the kernel swap daemon will try three ways to reduce the number of physical pages being used by the system:

Reducing the size of the buffer and page caches,
Swapping out System V shared memory pages,
Swapping out and discarding pages.

If the number of free pages in the system has fallen below free_pages_low, the kernel swap daemon will try to free 6 pages before it next runs. Otherwise it will try to free 3 pages. Each of the above methods is tried in turn until enough pages have been freed. The kernel swap daemon remembers which method it was using the last time that it attempted to free physical pages. Each time it runs it will start trying to free pages using this last successful method.

After it has freed sufficient pages, the swap daemon sleeps again until its timer expires. If the reason that the kernel swap daemon freed pages was that the number of free pages in the system had fallen below free_pages_low, it only sleeps for half its usual time. Once the number of free pages is more than free_pages_low the kernel swap daemon goes back to sleeping longer between checks.
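The threshold logic can be written down as a short function. This is a sketch: the function name and return convention are invented, and adding nr_async_pages to the free count is one possible reading of "takes into account", assumed here rather than confirmed by the text.

```python
def swap_daemon_check(nr_free_pages, nr_async_pages,
                      free_pages_low, free_pages_high):
    """Return (pages_to_free, sleep_factor) for one run of the swap daemon.

    sleep_factor is the fraction of the usual sleep time before the next run.
    """
    # Assumption: pages already queued for writing count as soon-to-be-free.
    available = nr_free_pages + nr_async_pages
    if available > free_pages_high:
        return 0, 1.0   # plenty of memory: do nothing
    if available < free_pages_low:
        return 6, 0.5   # dangerously low: free more, wake up sooner
    return 3, 1.0       # between the two thresholds
```

With free_pages_low at 20 and free_pages_high at 40, a system with 30 free pages asks for 3 pages to be freed, while one with 10 free pages asks for 6 and halves the sleep time.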

3.8.1 Reducing the Size of the Page and Buffer Caches
The pages held in the page and buffer caches are good candidates for being freed into the free_area vector. The Page Cache, which contains pages of memory mapped files, may contain unnecessary pages that are filling up the system's memory. Likewise the Buffer Cache, which contains buffers read from or being written to physical devices, may also contain unneeded buffers. When the physical pages in the system start to run out, discarding pages from these caches is relatively easy as it requires no writing to physical devices (unlike swapping pages out of memory). Discarding these pages does not have too many harmful side effects other than making access to physical devices and memory mapped files slower. However, if the discarding of pages from these caches is done fairly, all processes will suffer equally.

Every time the kernel swap daemon tries to shrink these caches it examines a block of pages in the mem_map page vector to see if any can be discarded from physical memory. The size of the block of pages examined is higher if the kernel swap daemon is intensively swapping; that is, if the number of free pages in the system has fallen dangerously low. The blocks of pages are examined in a cyclical manner; a different block of pages is examined each time an attempt is made to shrink the memory map. This is known as the clock algorithm as, rather like the minute hand of a clock, the whole mem_map page vector is examined a few pages at a time.

Each page being examined is checked to see if it is cached in either the page cache or the buffer cache. You should note that shared pages are not considered for discarding at this time and that a page cannot be in both caches at the same time. If the page is not in either cache then the next page in the mem_map page vector is examined.

Pages are cached in the buffer cache (or rather the buffers within the pages are cached) to make buffer allocation and deallocation more efficient. The memory map shrinking code tries to free the buffers that are contained within the page being examined. If all the buffers are freed, then the pages that contain them are also freed. If the examined page is in the Linux page cache, it is removed from the page cache and freed.

When enough pages have been freed on this attempt then the kernel swap daemon will wait until the next time it is periodically woken. As none of the freed pages were part of any process's virtual memory (they were cached pages), no page tables need updating. If there were not enough cached pages discarded then the swap daemon will try to swap out some shared pages.
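The cyclical examination of mem_map can be sketched as follows. The ClockScanner class and the two boolean flags per page are simplifications assumed for the example; in particular "cached" stands for membership of either cache and the freeing of individual buffers is not modelled.

```python
class ClockScanner:
    def __init__(self, mem_map):
        self.mem_map = mem_map  # list of dicts with "cached" and "shared" flags
        self.hand = 0           # where the previous scan stopped

    def shrink(self, block_size):
        """Examine the next block of pages and free the discardable ones."""
        freed = []
        for _ in range(min(block_size, len(self.mem_map))):
            index = self.hand
            self.hand = (self.hand + 1) % len(self.mem_map)
            page = self.mem_map[index]
            if page["cached"] and not page["shared"]:
                page["cached"] = False  # discard from the cache
                freed.append(index)
        return freed


pages = [
    {"cached": True, "shared": False},
    {"cached": False, "shared": False},
    {"cached": True, "shared": True},   # shared pages are skipped
    {"cached": True, "shared": False},
    {"cached": False, "shared": False},
    {"cached": True, "shared": False},
]
scanner = ClockScanner(pages)
first_pass = scanner.shrink(4)
second_pass = scanner.shrink(4)
```

Because the hand position is remembered, the second call starts at page 4 and wraps around to the start of the vector instead of rescanning the same block.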

3.8.2 Swapping Out System V Shared Memory Pages
System V shared memory is an inter-process communication mechanism which allows two or more processes to share virtual memory in order to pass information amongst themselves. How processes share memory in this way is described in more detail in Chapter IPC-chapter. For now it is enough to say that each area of System V shared memory is described by a shmid_ds data structure. This contains a pointer to a list of vm_area_struct data structures, one for each process sharing this area of virtual memory. The vm_area_struct data structures describe where in each process's virtual memory this area of System V shared memory goes. Each vm_area_struct data structure for this System V shared memory is linked together using the vm_next_shared and vm_prev_shared pointers. Each shmid_ds data structure also contains a list of page table entries each of which describes the physical page that a shared virtual page maps to.

The kernel swap daemon also uses a clock algorithm when swapping out System V shared memory pages. Each time it runs it remembers which page of which shared virtual memory area it last swapped out. It does this by keeping two indices: the first is an index into the set of shmid_ds data structures, the second into the list of page table entries for this area of System V shared memory. This makes sure that it fairly victimizes the areas of System V shared memory.

As the physical page frame number for a given virtual page of System V shared memory is contained in the page tables of all of the processes sharing this area of virtual memory, the kernel swap daemon must modify all of these page tables to show that the page is no longer in memory but is now held in the swap file. For each shared page it is swapping out, the kernel swap daemon finds the page table entry in each of the sharing processes' page tables (by following a pointer from each vm_area_struct data structure). If this process's page table entry for this page of System V shared memory is valid, it converts it into an invalid but swapped out page table entry and reduces this (shared) page's count of users by one. The format of a swapped out System V shared page table entry contains an index into the set of shmid_ds data structures and an index into the page table entries for this area of System V shared memory.

If the page's count is zero after the page tables of the sharing processes have all been modified, the shared page can be written out to the swap file. The page table entry in the list pointed at by the shmid_ds data structure for this area of System V shared memory is replaced by a swapped out page table entry. A swapped out page table entry is invalid but contains an index into the set of open swap files and the offset in that file where the swapped out page can be found. This information will be used when the page has to be brought back into physical memory.

3.8.3 Swapping Out and Discarding Pages
The swap daemon looks at each process in the system in turn to see if it is a good candidate for swapping. Good candidates are processes that can be swapped (some cannot) and that have one or more pages which can be swapped or discarded from memory. Pages are swapped out of physical memory into the system's swap files only if the data in them cannot be retrieved another way.

A lot of the contents of an executable image come from the image's file and can easily be re-read from that file. For example, the executable instructions of an image will never be modified by the image and so will never be written to the swap file. These pages can simply be discarded; when they are again referenced by the process, they will be brought back into memory from the executable image.

Once the process to swap has been located, the swap daemon looks through all of its virtual memory regions looking for areas which are not shared or locked. Linux does not swap out all of the swappable pages of the process that it has selected; instead it removes only a small number of pages. Pages cannot be swapped or discarded if they are locked in memory.

The Linux swap algorithm uses page aging. Each page has a counter (held in the mem_map_t data structure) that gives the kernel swap daemon some idea whether or not a page is worth swapping. Pages age when they are unused and rejuvenate on access; the swap daemon only swaps out old pages. The default action when a page is first allocated is to give it an initial age of 3. Each time it is touched, its age is increased by 3 to a maximum of 20. Every time the kernel swap daemon runs it ages pages, decrementing their age by 1. These default actions can be changed and for this reason they (and other swap related information) are stored in the swap_control data structure.
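The aging arithmetic is easy to check with a few lines of Python. The constant and function names below are made up for the sketch; only the numbers (initial age 3, increase of 3 per touch, maximum of 20, decrease of 1 per swap daemon run) come from the text.

```python
INITIAL_AGE = 3   # age given to a newly allocated page
AGE_ADVANCE = 3   # added each time the page is touched
MAX_AGE = 20      # upper limit on a page's age
AGE_DECLINE = 1   # subtracted each time the swap daemon runs


def touch(age):
    """Rejuvenate a page that has just been accessed."""
    return min(age + AGE_ADVANCE, MAX_AGE)


def age_page(age):
    """Age a page on one run of the swap daemon."""
    return max(age - AGE_DECLINE, 0)


def is_old(age):
    """Only pages whose age has reached zero are considered for swapping."""
    return age == 0


age = INITIAL_AGE
for _ in range(6):           # six accesses: 6, 9, 12, 15, 18, then capped at 20
    age = touch(age)
age_after_touches = age
for _ in range(MAX_AGE):     # twenty swap daemon runs with no further access
    age = age_page(age)
```

A page that has reached the maximum age therefore survives twenty runs of the swap daemon without being touched before it becomes a candidate for swapping.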

If the page is old (age = 0), the swap daemon will process it further. Dirty pages are pages which can be swapped out. Linux uses an architecture specific bit in the PTE to describe pages this way (see Figure 3.2). However, not all dirty pages are necessarily written to the swap file. Every virtual memory region of a process may have its own swap operation (pointed at by the vm_ops pointer in the vm_area_struct); if it does, that method is used. Otherwise, the swap daemon will allocate a page in the swap file and write the page out to that device.

The page's page table entry is replaced by one which is marked as invalid but which contains information about where the page is in the swap file. This is an offset into the swap file where the page is held and an indication of which swap file is being used. Whatever the swap method used, the original physical page is made free by putting it back into the free_area. Clean (or rather not dirty) pages can be discarded and put back into the free_area for re-use.

If enough of the swappable process's pages have been swapped out or discarded, the swap daemon will again sleep. The next time it wakes it will consider the next process in the system. In this way, the swap daemon nibbles away at each process's physical pages until the system is again in balance. This is much fairer than swapping out whole processes.

3.9 The Swap Cache
When swapping pages out to the swap files, Linux avoids writing pages if it does not have to. There are times when a page is both in a swap file and in physical memory. This happens when a page that was swapped out of memory was then brought back into memory when it was again accessed by a process. So long as the page in memory is not written to, the copy in the swap file remains valid.

Linux uses the swap cache to track these pages. The swap cache is a list of page table entries, one per physical page in the system. Each is a page table entry for a swapped out page and describes which swap file the page is being held in together with its location in the swap file. If a swap cache entry is non-zero, it represents a page which is being held in a swap file and which has not been modified. If the page is subsequently modified (by being written to), its entry is removed from the swap cache.

When Linux needs to swap a physical page out to a swap file it consults the swap cache and, if there is a valid entry for this page, it does not need to write the page out to the swap file. This is because the page in memory has not been modified since it was last read from the swap file.

The entries in the swap cache are page table entries for swapped out pages. They are marked as invalid but contain information which allows Linux to find the right swap file and the right page within that swap file.
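The bookkeeping described above amounts to one entry per physical page, with zero meaning "no valid copy in a swap file". A minimal sketch, assuming a hypothetical SwapCache class and treating the swapped out page table entry as an opaque non-zero integer:

```python
class SwapCache:
    def __init__(self, nr_pages):
        # One entry per physical page in the system; 0 means no entry.
        self.entries = [0] * nr_pages

    def add(self, pfn, swap_entry):
        """Record that page pfn has an unmodified copy in a swap file."""
        self.entries[pfn] = swap_entry

    def page_written(self, pfn):
        """The page was modified, so the swap file copy is stale."""
        self.entries[pfn] = 0

    def needs_write(self, pfn):
        """True if swapping this page out requires writing it to disk."""
        return self.entries[pfn] == 0


swap_cache = SwapCache(4)
swap_cache.add(2, 0x105)                      # page 2 was read back from swap
clean_needs_write = swap_cache.needs_write(2)
swap_cache.page_written(2)                    # a process writes to page 2
dirty_needs_write = swap_cache.needs_write(2)
```

Page 2 can be swapped out again for free while its entry is present, but must be written out once it has been modified.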

3.10 Swapping Pages In
The dirty pages saved in the swap files may be needed again, for example when an application writes to an area of virtual memory whose contents are held in a swapped out physical page. Accessing a page of virtual memory that is not held in physical memory causes a page fault to occur. The page fault is the processor signalling the operating system that it cannot translate a virtual address into a physical one. In this case this is because the page table entry describing this page of virtual memory was marked as invalid when the page was swapped out. The processor cannot handle the virtual to physical address translation and so hands control back to the operating system, describing as it does so the virtual address that faulted and the reason for the fault. The format of this information and how the processor passes control to the operating system is processor specific.

The processor specific page fault handling code must locate the vm_area_struct data structure that describes the area of virtual memory that contains the faulting virtual address. It does this by searching the vm_area_struct data structures for this process until it finds the one containing the faulting virtual address. This is very time critical code and a process's vm_area_struct data structures are so arranged as to make this search take as little time as possible.

Having carried out the appropriate processor specific actions and found that the faulting virtual address is for a valid area of virtual memory, the page fault processing becomes generic and applicable to all processors that Linux runs on. The generic page fault handling code looks for the page table entry for the faulting virtual address. If the page table entry it finds is for a swapped out page, Linux must swap the page back into physical memory. The format of the page table entry for a swapped out page is processor specific but all processors mark these pages as invalid and put the information necessary to locate the page within the swap file into the page table entry. Linux needs this information in order to bring the page back into physical memory.

At this point, Linux knows the faulting virtual address and has a page table entry containing information about where this page has been swapped to. The vm_area_struct data structure may contain a pointer to a routine which will swap any page of the area of virtual memory that it describes back into physical memory. This is its swapin operation. If there is a swapin operation for this area of virtual memory then Linux will use it. This is, in fact, how swapped out System V shared memory pages are handled; they require special handling because the format of a swapped out System V shared page is a little different from that of an ordinary swapped out page. There may not be a swapin operation, in which case Linux will assume that this is an ordinary page that does not need to be specially handled.

It allocates a free physical page and reads the swapped out page back from the swap file. The information telling it where in the swap file (and which swap file) the page is held is taken from the invalid page table entry.

If the access that caused the page fault was not a write access then the page is left in the swap cache and its page table entry is not marked as writable. If the page is subsequently written to, another page fault will occur and, at that point, the page is marked as dirty and its entry is removed from the swap cache. If the page is not written to and it needs to be swapped out again, Linux can avoid the write of the page to its swap file because the page is already in the swap file.

If the access that caused the page to be brought in from the swap file was a write operation, this page is removed from the swap cache and its page table entry is marked as both dirty and writable.
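The difference between a page brought in by a read and one brought in by a write can be sketched as below. The function names, the dictionary used as the swap cache and the dictionary of page table entry flags are all assumptions of this illustration; reading the page from the swap file is not modelled.

```python
def swap_in(pfn, swap_entry, is_write, swap_cache):
    """Return the flags of the new page table entry for physical page pfn."""
    if is_write:
        # The copy in the swap file is about to become stale.
        swap_cache.pop(pfn, None)
        return {"valid": True, "writable": True, "dirty": True}
    # Read access: keep the swap cache entry and write-protect the page.
    swap_cache[pfn] = swap_entry
    return {"valid": True, "writable": False, "dirty": False}


def write_fault(pfn, pte, swap_cache):
    """Handle the later write to a page that was swapped in by a read."""
    swap_cache.pop(pfn, None)
    pte["writable"] = True
    pte["dirty"] = True
    return pte


swap_cache = {}
read_pte = swap_in(5, 0x301, is_write=False, swap_cache=swap_cache)
in_cache_after_read = 5 in swap_cache
write_fault(5, read_pte, swap_cache)
write_pte = swap_in(6, 0x302, is_write=True, swap_cache=swap_cache)
```

Leaving the entry read only after a read fault is what lets the kernel notice the first write and drop the page from the swap cache at that moment.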

1 Confusingly the structure is also known as the page structure.
2 Bibliography reference here

© 1996-1999 David A Rusling

Chapter 5 Interprocess Communication Mechanisms

Processes communicate with each other and with the kernel to coordinate their activities. Linux supports a number of Inter-Process Communication (IPC) mechanisms. Signals and pipes are two of them, but Linux also supports the System V IPC mechanisms, named after the Unix(TM) release in which they first appeared.

5.1 Signals
Signals are one of the oldest inter-process communication methods used by Unix(TM) systems. They are used to signal asynchronous events to one or more processes. A signal could be generated by a keyboard interrupt or an error condition such as the process attempting to access a non-existent location in its virtual memory. Signals are also used by the shells to signal job control commands to their child processes. There is a set of defined signals that the kernel can generate or that can be generated by other processes in the system, provided that they have the correct privileges. You can list a system's set of signals using the kill command (kill -l). On my Intel Linux box this gives:

 1) SIGHUP      2) SIGINT      3) SIGQUIT     4) SIGILL
 5) SIGTRAP     6) SIGIOT      7) SIGBUS      8) SIGFPE
 9) SIGKILL    10) SIGUSR1    11) SIGSEGV    12) SIGUSR2
13) SIGPIPE    14) SIGALRM    15) SIGTERM    17) SIGCHLD
18) SIGCONT    19) SIGSTOP    20) SIGTSTP    21) SIGTTIN
22) SIGTTOU    23) SIGURG     24) SIGXCPU    25) SIGXFSZ
26) SIGVTALRM  27) SIGPROF    28) SIGWINCH   29) SIGIO
30) SIGPWR

The numbers are different for an Alpha AXP Linux box. Processes can choose to ignore most of the signals that are generated, with two notable exceptions: neither the SIGSTOP signal which causes a process to halt its execution nor the SIGKILL signal which causes a process to exit can be ignored. Otherwise though, a process can choose just how it wants to handle the various signals. Processes can


block the signals and, if they do not block them, they can either choose to handle them themselves or allow the kernel to handle them. If the kernel handles the signals, it will do the default actions required for this signal. For example, the default action when a process receives the SIGFPE (floating point exception) signal is to core dump and then exit.

Signals have no inherent relative priorities. If two signals are generated for a process at the same time then they may be presented to the process or handled in any order. Also there is no mechanism for handling multiple signals of the same kind. There is no way that a process can tell if it received 1 or 42 SIGCONT signals.

Linux implements signals using information stored in the task_struct for the process. The number of supported signals is limited to the word size of the processor. Processors with a word size of 32 bits can have 32 signals whereas 64 bit processors like the Alpha AXP may have up to 64 signals. The currently pending signals are kept in the signal field with a mask of blocked signals held in blocked. With the exception of SIGSTOP and SIGKILL, all signals can be blocked. If a blocked signal is generated, it remains pending until it is unblocked. Linux also holds information about how each process handles every possible signal and this is held in an array of sigaction data structures pointed at by the task_struct for each process. Amongst other things each sigaction contains either the address of a routine that will handle the signal or a flag which tells Linux that the process either wishes to ignore this signal or let the kernel handle the signal for it. The process modifies the default signal handling by making system calls and these calls alter the sigaction for the appropriate signal as well as the blocked mask.

Not every process in the system can send signals to every other process; only the kernel and super users can.
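The signal and blocked fields described above behave like two bit masks. The following Python sketch models them; the SignalState class and its methods are hypothetical, and the choice of bit (n - 1) for signal n is an assumption made for the example rather than something stated in the text. The signal numbers are taken from the kill -l listing.

```python
SIGKILL, SIGUSR1, SIGCONT, SIGSTOP = 9, 10, 18, 19  # from the kill -l listing
WORD_SIZE = 32  # a 32 bit processor can represent 32 signals


class SignalState:
    """Toy model of the signal and blocked words in a task_struct."""

    def __init__(self):
        self.signal = 0   # one bit per pending signal
        self.blocked = 0  # one bit per blocked signal

    @staticmethod
    def _bit(signum):
        return 1 << (signum - 1)

    def block(self, signum):
        """Block a signal; SIGSTOP and SIGKILL can never be blocked."""
        if signum in (SIGKILL, SIGSTOP):
            return False
        self.blocked |= self._bit(signum)
        return True

    def unblock(self, signum):
        self.blocked &= ~self._bit(signum)

    def generate(self, signum):
        """Setting an already set bit changes nothing, so counts are lost."""
        self.signal |= self._bit(signum)

    def deliverable(self):
        """Signals that are pending and not blocked."""
        ready = self.signal & ~self.blocked
        return [n + 1 for n in range(WORD_SIZE) if (ready >> n) & 1]

    def deliver(self, signum):
        self.signal &= ~self._bit(signum)
```

Because a pending signal is a single bit, calling generate(SIGCONT) once or 42 times leaves the model in the same state, which is why a process cannot count how many signals of one kind it was sent.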
Normal processes can only send signals to processes with the same uid and gid or to processes in the same process group [1]. Signals are generated by setting the appropriate bit in the task_struct's signal field. If the process has not blocked the signal and is waiting but interruptible (in state Interruptible) then it is woken up by changing its state to Running and making sure that it is in the run queue. That way the scheduler will consider it a candidate for running when the system next schedules. If the default handling is needed, then Linux can optimize the handling of the signal. For example, if the signal is SIGWINCH (the window changed size) and the default handler is being used then there is nothing to be done.

Signals are not presented to the process as soon as they are generated; they must wait until the process is running again. Every time a process exits from a system call its signal and blocked fields are checked and, if there are any unblocked signals, they can now be delivered. This might seem a very unreliable method but every process in the system is making system calls, for example to write a character to the terminal, all of the time. Processes can elect to wait for signals if they wish; they are suspended in state Interruptible until a signal is presented. The Linux signal processing code looks at the sigaction structure for each of the current unblocked signals.

If a signal's handler is set to the default action then the kernel will handle it. The SIGSTOP signal's default handler will change the current process's state to Stopped and then run the scheduler to select a new process to run. The default action for the SIGFPE signal will core dump the process and then cause it to exit. Alternatively, the process may have specified its own signal handler. This is a routine which will be called whenever the signal is generated and the sigaction structure holds the address of this routine.
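From user space, specifying a handler looks roughly like the sketch below, which uses Python's standard signal module. It assumes a Unix system, Python 3.8 or later (for signal.raise_signal) and that it runs in the main thread; the helper name deliver_to_self is made up for the example.

```python
import signal


def deliver_to_self(signum=signal.SIGUSR1):
    """Install a handler, send the signal to ourselves, restore the old one."""
    received = []

    def handler(sig, frame):
        # Runs instead of the kernel's default action for this signal.
        received.append(sig)

    previous = signal.signal(signum, handler)  # replaces the old disposition
    try:
        signal.raise_signal(signum)  # generate the signal for this process
    finally:
        # Put back whatever handling was in place before the example ran.
        if previous is None:
            previous = signal.SIG_DFL
        signal.signal(signum, previous)
    return received


print(deliver_to_self())
```

The call to signal.signal() plays the part of the system call that alters the sigaction entry for SIGUSR1; without it, the default action for SIGUSR1 would terminate the process.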
The kernel must call the process's signal handling routine and how this happens is processor specific, but all CPUs must cope with the fact that the current process is running in kernel mode and is just about to return to the process that called the kernel or system routine in user mode. The problem is solved by manipulating the stack and registers of the process. The process's program counter is set to the address of its signal handling routine and the parameters to the routine are added to the call frame or passed in registers. When the process resumes operation it appears as if the signal handling routine were called normally.

Linux is POSIX compatible and so the process can specify which signals are blocked when a particular signal handling routine is called. This means changing the blocked mask during the call to the process's signal handler. The blocked mask must be returned to its original value when the signal handling routine has finished. Therefore Linux adds a call to a tidy up routine, which will restore the original blocked mask, onto the call stack of the signalled process. Linux also optimizes the case where several signal handling routines need to be called by stacking them so that each time one handling routine exits, the next one is called until the tidy up routine is called.

5.2 Pipes
The common Linux shells all allow redirection. For example

$ ls | pr | lpr

pipes the output from the ls command listing the directory's files into the standard input of the pr command which paginates them. Finally the standard output from the pr command is piped into the standard input of the lpr command which prints the results on the default printer. Pipes then are unidirectional byte streams which connect the standard output from one process into the standard input of another process. Neither process is aware of this redirection and behaves just as it would normally. It is the shell which sets up these temporary pipes between the processes.
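What the shell does for a command line such as ls | pr | lpr can be approximated with Python's subprocess module. This is a sketch under stated assumptions: run_pipeline is a hypothetical helper, it is only suitable for small inputs, and the two stages in the usage example are tiny Python one-liners standing in for real commands so that the example does not depend on a printer.

```python
import subprocess
import sys


def run_pipeline(stages, data=b""):
    """Run argv lists connected stdout to stdin; return the last output."""
    procs = []
    prev_stdout = None
    for index, argv in enumerate(stages):
        # The first stage reads what we feed it; later stages read the
        # previous stage's standard output through a pipe.
        stdin = subprocess.PIPE if index == 0 else prev_stdout
        proc = subprocess.Popen(argv, stdin=stdin, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            prev_stdout.close()  # drop our copy so end-of-file propagates
        prev_stdout = proc.stdout
        procs.append(proc)
    procs[0].stdin.write(data)
    procs[0].stdin.close()
    output = procs[-1].stdout.read()
    for proc in procs:
        proc.wait()
    return output


upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
sort = [sys.executable, "-c",
        "import sys; sys.stdout.write(''.join(sorted(sys.stdin.readlines())))"]
print(run_pipeline([upper, sort], b"pear\napple\n"))
```

Neither stage knows that its standard input or output is a pipe; each simply reads and writes as it would with a terminal or a file.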


Figure 5.1: Pipes

In Linux, a pipe is implemented using two file data structures which both point at the same temporary VFS inode which itself points at a physical page within memory. Figure 5.1 shows that each file data structure contains pointers to different file operation routine vectors; one for writing to the pipe, the other for reading from the pipe. This hides the underlying differences from the generic system calls which read and write to ordinary files. As the writing process writes to the pipe, bytes are copied into the shared data page and when the reading process reads from the pipe, bytes are copied from the shared data page. Linux must synchronize access to the pipe. It must make sure that the reader and the writer of the pipe are in step and to do this it uses locks, wait queues and signals.

When the writer wants to write to the pipe it uses the standard write library functions. These all pass file descriptors that are indices into the process's set of file data structures, each one representing an open file or, as in this case, an open pipe. The Linux system call uses the write routine pointed at by the file data structure describing this pipe. That write routine uses information held in the VFS inode representing the pipe to manage the write request. If there is enough room to write all of the bytes into the pipe and the pipe is not locked by its reader, Linux locks it for the writer and copies the bytes to be written from the process's address space into the shared data page. If the pipe is locked by the reader or if there is not enough room for the data then the current process is made to sleep on the pipe inode's wait queue and the scheduler is called so that another process can run. The sleep is interruptible, so the process can receive signals, and it will be woken by the reader when there is enough room for the write data or when the pipe is unlocked. When the data has been written, the pipe's VFS inode is unlocked and any waiting readers sleeping on the inode's wait queue will themselves be woken up.

Reading data from the pipe is a very similar process to writing to it. Processes are allowed to do non-blocking reads (it depends on the mode in which they opened the file or pipe) and, in this case, if there is no data to be read or if the pipe is locked, an error will be returned. This means that the process can continue to run. The alternative is to wait on the pipe inode's wait queue until the write process has finished. When both processes have finished with the pipe, the pipe inode is discarded along with the shared data page.

Linux also supports named pipes, also known as FIFOs because pipes operate on a First In, First Out principle. The first data written into the pipe is the first data read from the pipe. Unlike pipes, FIFOs are not temporary objects; they are entities in the file system and can be created using the mkfifo command. Processes are free to use a FIFO so long as they have appropriate access rights to it. The way that FIFOs are opened is a little different from pipes.
A pipe (its two file data structures, its VFS inode and the shared data page) is created in one go whereas a FIFO already exists and is opened and closed by its users. Linux must handle readers opening the FIFO before writers open it as well as readers reading before any writers have written to it. That aside, FIFOs are handled almost exactly the same way as pipes and they use the same data structures and operations.
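The behaviour of pipes and FIFOs described above can be observed from user space. The sketch below uses Python's os module on a Unix system; the three function names are invented for the example, and the FIFO is created in a temporary directory that is assumed to live on a file system supporting FIFOs.

```python
import os
import tempfile


def pipe_roundtrip(chunks):
    """Write chunks into an anonymous pipe and read them back in order."""
    read_fd, write_fd = os.pipe()  # both ends are created in one go
    try:
        for chunk in chunks:
            os.write(write_fd, chunk)
        return os.read(read_fd, sum(len(c) for c in chunks))
    finally:
        os.close(read_fd)
        os.close(write_fd)


def empty_nonblocking_read():
    """A non-blocking read of an empty pipe fails instead of sleeping."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)
    try:
        os.read(read_fd, 1)
        return "data"
    except BlockingIOError:
        return "would block"
    finally:
        os.close(read_fd)
        os.close(write_fd)


def fifo_roundtrip(message):
    """A FIFO is a file system entry that is opened by name."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "demo_fifo")
        os.mkfifo(path)
        # Open the reader first, non-blocking, so that the open does not
        # wait for a writer; the writer's open then succeeds at once.
        read_fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
        write_fd = os.open(path, os.O_WRONLY)
        try:
            os.write(write_fd, message)
            return os.read(read_fd, len(message))
        finally:
            os.close(write_fd)
            os.close(read_fd)
```

The anonymous pipe exists only as a pair of file descriptors, whereas the FIFO has a name in the file system and each end is opened separately, which is the difference in opening that the text refers to.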

5.3 Sockets
REVIEW NOTE: Add when networking chapter written.

5.3.1 System V IPC Mechanisms
Linux supports three types of interprocess communication mechanisms that first appeared in Unix(TM) System V (1983). These are message queues, semaphores and shared memory. These System V IPC mechanisms all share common authentication methods. Processes may access these resources only by passing a unique reference identifier to the kernel via system calls. Access to these System V IPC objects is checked using access permissions, much like accesses to files are checked. The access rights to the System V IPC object are set by the creator of the object via system calls. The object's reference identifier is used by each mechanism as an index into a table of resources. It is not a straightforward index but requires some manipulation to generate the index.

All Linux data structures representing System V IPC objects in the system include an ipc_perm structure which contains the owner and creator process's user and group identifiers, the access mode for this object (owner, group and other) and the IPC object's key. The key is used as a way of locating the System V IPC object's reference identifier. Two sets of keys are supported: public and private. If the key is public then any process in the system, subject to rights checking, can find the reference identifier for the System V IPC object. System V IPC objects can never be referenced with a key, only by their reference identifier.
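The file-like rights check on an ipc_perm structure can be sketched as follows. This is a simplified Python model, not the kernel's routine: the IpcPerm class, the may_access function and the treatment of user id 0 as the super user are assumptions made for illustration.

```python
from dataclasses import dataclass

READ, WRITE = 0o4, 0o2  # request bits, in the style of file permissions


@dataclass
class IpcPerm:
    """Toy version of the fields the text lists for ipc_perm."""
    key: int
    uid: int    # owner's user identifier
    gid: int    # owner's group identifier
    cuid: int   # creator's user identifier
    cgid: int   # creator's group identifier
    mode: int   # owner, group and other access bits, e.g. 0o640


def may_access(perm, euid, egid, want):
    """Return True if the caller holds every access bit in `want`."""
    if euid == 0:
        return True  # assumption: the super user bypasses the check
    if euid in (perm.uid, perm.cuid):
        granted = (perm.mode >> 6) & 0o7   # owner bits
    elif egid in (perm.gid, perm.cgid):
        granted = (perm.mode >> 3) & 0o7   # group bits
    else:
        granted = perm.mode & 0o7          # other bits
    return (granted & want) == want


perm = IpcPerm(key=0x1234, uid=1000, gid=100, cuid=1000, cgid=100, mode=0o640)
print(may_access(perm, 1000, 100, WRITE), may_access(perm, 2000, 100, WRITE))
```

With mode 0o640 the owner may read and write, members of the owning group may only read, and everyone else is refused.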

5.3.2 Message Queues
Message queues allow one or more processes to write messages, which will be read by one or more reading processes. Linux maintains a list of message queues, the msgque vector; each element of which points to a msqid_ds data structure that fully describes the message queue. When message queues are created a new msqid_ds data structure is allocated from system memory and inserted into the vector.

Figure 5.2: System V IPC Message Queues

Each msqid_ds data structure contains an ipc_perm data structure and pointers to the messages entered onto this queue. In addition, Linux keeps queue modification times such as the last time that this queue was written to and so on. The msqid_ds also contains two wait queues; one for the writers to the queue and one for the readers of the message queue.

Each time a process attempts to write a message to the write queue its effective user and group identifiers are compared with the mode in this queue's ipc_perm data structure. If the process can write to the queue then the message may be copied from the process's address space into a msg data structure and put at the end of this message queue. Each message is tagged with an application specific type, agreed between the cooperating processes. However, there may be no room for the message as Linux restricts the number and length of messages that can be written. In this case the process will be added to this message queue's write wait queue and the scheduler will be called to select a new process to run. It will be woken up when one or more messages have been read from this message queue.

Reading from the queue is a similar process. Again, the process's access rights to the write queue are checked. A reading process may choose to either get the first message in the queue regardless of its type or select messages with particular types. If no messages match these criteria the reading process will be added to the message queue's read wait queue and the scheduler run. When a new message is written to the queue this process will be woken up and run again.
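Typed messages and selective reads can be illustrated with a small in-memory model. The Python class below is hypothetical: it returns False or None where a real process would be put on the write or read wait queue, the byte limit is an arbitrary value, and using type 0 to mean "first message of any type" follows the usual msgrcv convention rather than anything stated in the text.

```python
from collections import deque


class MessageQueue:
    """Toy message queue holding (type, text) pairs in arrival order."""

    def __init__(self, max_bytes=64):
        self.messages = deque()
        self.max_bytes = max_bytes  # Linux restricts how much may be queued
        self.used = 0

    def send(self, mtype, text):
        """Append a message; False means the writer would have to sleep."""
        if self.used + len(text) > self.max_bytes:
            return False
        self.messages.append((mtype, text))
        self.used += len(text)
        return True

    def receive(self, mtype=0):
        """Take the first message, or the first one of the given type."""
        for index, (found_type, text) in enumerate(self.messages):
            if mtype == 0 or found_type == mtype:
                del self.messages[index]
                self.used -= len(text)
                return found_type, text
        return None  # the reader would sleep on the read wait queue


queue = MessageQueue()
queue.send(1, b"status")
queue.send(2, b"urgent")
print(queue.receive(2), queue.receive())
```

Selecting by type lets a reader take the second message before the first, while a reader that asks for any type sees the messages in the order they were written.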

5.3.3 Semaphores
In its simplest form a semaphore is a location in memory whose value can be tested and set by more than one process. The test and set operation is, so far as each process is concerned, uninterruptible or atomic; once started nothing can stop it. The result of the test and set operation is the addition of the current value of the semaphore and the set value, which can be positive or negative. Depending on the result of the test and set operation one process may have to sleep until the semaphore's value is changed by another process. Semaphores can be used to implement critical regions, areas of critical code that only one process at a time should be executing.

Say you had many cooperating processes reading records from and writing records to a single data file. You would want that file access to be strictly coordinated. You could use a semaphore with an initial value of 1 and, around the file operating code, put two semaphore operations, the first to test and decrement the semaphore's value and the second to test and increment it. The first process to access the file would try to decrement the semaphore's value and it would succeed, the semaphore's value now being 0. This process can now go ahead and use the data file but if another process wishing to use it now tries to decrement the semaphore's value it would fail as the result would be -1. That process will be suspended until the first process has finished with the data file. When the first process has finished with the data file it will increment the semaphore's value, making it 1 again. Now the waiting process can be woken and this time its attempt to decrement the semaphore will succeed.
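The data file example can be imitated with threads standing in for the cooperating processes. This sketch uses Python's threading.Semaphore, which is not a System V semaphore but follows the same decrement, work, increment pattern; the run_workers function and the shared counter are invented for the example.

```python
import threading


def run_workers(n_workers=4, n_updates=1000):
    """Update a shared record from several threads inside a critical region."""
    semaphore = threading.Semaphore(1)  # initial value of 1
    records = {"count": 0}              # stands in for the shared data file

    def worker():
        for _ in range(n_updates):
            semaphore.acquire()         # test and decrement; sleeps at 0
            try:
                records["count"] += 1   # the critical region
            finally:
                semaphore.release()     # increment; a sleeper may now run

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return records["count"]


print(run_workers())
```

Because only one worker at a time can hold the semaphore, every update is applied and the final count equals the number of workers multiplied by the number of updates each performs.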


Figure 5.3: System V IPC Semaphores

System V IPC semaphore objects each describe a semaphore array and Linux uses the semid_ds data structure to represent this. All of the semid_ds data structures in the system are pointed at by the semary, a vector of pointers. There are sem_nsems semaphores in each semaphore array, each one described by a sem data structure pointed at by sem_base. All of the processes that are allowed to manipulate the semaphore array of a System V IPC semaphore object may make system calls that perform operations on them. The system call can specify many operations and each operation is described by three inputs; the semaphore index, the operation value and a set of flags. The semaphore index is an index into the semaphore array and the operation value is a numerical value that will be added to the current value of the semaphore.

First Linux tests whether or not all of the operations would succeed. A non-zero operation will succeed if the operation value added to the semaphore's current value would be greater than or equal to zero; an operation value of zero will succeed only if the semaphore's current value is also zero. If any of the semaphore operations would fail Linux may suspend the process but only if the operation flags have not requested that the system call is non-blocking. If the process is to be suspended then Linux must save the state of the semaphore operations to be performed and put the current process onto a wait queue. It does this by building a sem_queue data structure on the stack and filling it out. The new sem_queue data structure is put at the end of this semaphore object's wait queue (using the sem_pending and sem_pending_last pointers). The current process is put on the wait queue in the sem_queue data structure (sleeper) and the scheduler called to choose another process to run.

If all of the semaphore operations would have succeeded and the current process does not need to be suspended, Linux goes ahead and applies the operations to the appropriate members of the semaphore array. Now Linux must check whether any waiting, suspended processes may now apply their semaphore operations. It looks at each member of the operations pending queue (sem_pending) in turn, testing to see if the semaphore operations will succeed this time. If they will then it removes the sem_queue data structure from the operations pending list and applies the semaphore operations to the semaphore array. It wakes up the sleeping process making it available to be restarted the next time the scheduler runs. Linux keeps looking through the pending list from the start until there is a pass where no semaphore operations can be applied and so no more processes can be woken.

There is a problem with semaphores: deadlocks. These occur when one process has altered the semaphore's value as it enters a critical region but then fails to leave the critical region because it crashed or was killed. Linux protects against this by maintaining lists of adjustments to the semaphore arrays. The idea is that when these adjustments are applied, the semaphores will be put back to the state that they were in before a process's set of semaphore operations was applied. These adjustments are kept in sem_undo data structures queued both on the semid_ds data structure and on the task_struct data structure for the processes using these semaphore arrays.

Each individual semaphore operation may request that an adjustment be maintained. Linux will maintain at most one sem_undo data structure per process for each semaphore array. If the requesting process does not have one, then one is created when it is needed. The new sem_undo data structure is queued both onto this process's task_struct data structure and onto the semaphore array's semid_ds data structure. As operations are applied to the semaphores in the semaphore array the negation of the operation value is added to this semaphore's entry in the adjustment array of this process's sem_undo data structure. So, if the operation value is 2, then -2 is added to the adjustment entry for this semaphore.

When processes are deleted, as they exit Linux works through their set of sem_undo data structures applying the adjustments to the semaphore arrays. If a semaphore set is deleted, the sem_undo data structures are left queued on the process's task_struct but the semaphore array identifier is made invalid. In this case the semaphore clean up code simply discards the sem_undo data structure.
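The all-or-nothing test and the undo adjustments can be modelled in a few lines. The Python class below is a toy, with invented names; it keeps one adjustment list per process, takes the undo request per call rather than per operation, and returns False where a real process would sleep on the pending queue. It uses the actual semop rule that a non-zero operation succeeds when the resulting value is not negative.

```python
class SemArray:
    """Toy semaphore array with undo adjustments per process."""

    def __init__(self, values):
        self.values = list(values)
        self.undo = {}  # pid -> list of adjustments, one per semaphore

    def would_succeed(self, ops):
        """Test every (index, value) operation without changing anything."""
        trial = list(self.values)
        for index, op in ops:
            if op == 0:
                if trial[index] != 0:
                    return False
            elif trial[index] + op < 0:
                return False
            else:
                trial[index] += op
        return True

    def semop(self, pid, ops, undo=False):
        """Apply all operations or none; False means the caller would sleep."""
        if not self.would_succeed(ops):
            return False
        for index, op in ops:
            self.values[index] += op
            if undo:
                adjust = self.undo.setdefault(pid, [0] * len(self.values))
                adjust[index] -= op  # record the negation of the operation
        return True

    def process_exit(self, pid):
        """Apply a dead process's adjustments so that others can proceed."""
        for index, adjust in enumerate(self.undo.pop(pid, [])):
            self.values[index] += adjust


sems = SemArray([1])
sems.semop(100, [(0, -1)], undo=True)  # process 100 enters the region
sems.process_exit(100)                 # it dies without leaving
print(sems.values)
```

After process 100 exits, its recorded adjustment of +1 is applied, so the semaphore returns to 1 and a waiting process is no longer deadlocked.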

5.3.4 Shared Memory
Shared memory allows one or more processes to communicate via memory that appears in all of their virtual address spaces. The pages of the virtual memory are referenced by page table entries in each of the sharing processes' page tables. It does not have to be at the same address in all of the processes' virtual memory. As with all System V IPC objects, access to shared memory areas is controlled via keys and access rights checking. Once the memory is being shared, there are no checks on how the processes are using it. They must rely on other mechanisms, for example System V semaphores, to synchronize access to the memory.


Figure 5.4: System V IPC Shared Memory

Each newly created shared memory area is represented by a shmid_ds data structure. These are kept in the shm_segs vector. The shmid_ds data structure describes how big the area of shared memory is, how many processes are using it and information about how that shared memory is mapped into their address spaces. It is the creator of the shared memory that controls the access permissions to that memory and whether its key is public or private. If it has enough access rights it may also lock the shared memory into physical memory.

Each process that wishes to share the memory must attach to that virtual memory via a system call. This creates a new vm_area_struct data structure describing the shared memory for this process. The process can choose where in its virtual address space the shared memory goes or it can let Linux choose a free area large enough. The new vm_area_struct structure is put into the list of vm_area_struct pointed at by the shmid_ds. The vm_next_shared and vm_prev_shared pointers are used to link them together. The virtual memory is not actually created during the attach; it happens when the first process attempts to access it.

The first time that a process accesses one of the pages of the shared virtual memory, a page fault will occur. When Linux fixes up that page fault it finds the vm_area_struct data structure describing it. This contains pointers to handler routines for this type of shared virtual memory. The shared memory page fault handling code looks in the list of page table entries for this shmid_ds to see if one exists for this page of the shared virtual memory. If it does not exist, it will allocate a physical page and create a page table entry for it. As well as going into the current process's page tables, this entry is saved in the shmid_ds. This means that when the next process that attempts to access this memory gets a page fault, the shared memory fault handling code will use this newly created physical page for that process too. So, the first process that accesses a page of the shared memory causes it to be created and thereafter access by the other processes causes that page to be added into their virtual address spaces.

When processes no longer wish to share the virtual memory, they detach from it. So long as other processes are still using the memory the detach only affects the current process. Its vm_area_struct is removed from the shmid_ds data structure and deallocated. The current process's page tables are updated to invalidate the area of virtual memory that it used to share. When the last process sharing the memory detaches from it, the pages of the shared memory currently in physical memory are freed, as is the shmid_ds data structure for this shared memory.

Further complications arise when shared virtual memory is not locked into physical memory. In this case the pages of the shared memory may be swapped out to the system's swap disk during periods of high memory usage. How shared memory is swapped into and out of physical memory is described in the memory management chapter.
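The attach, first-fault and detach sequence can be followed with a toy model. Everything in this Python sketch is hypothetical: a dictionary stands in for each process's page tables, a bytearray stands in for a physical page, and PAGE_SIZE is an arbitrary value.

```python
PAGE_SIZE = 4096  # illustrative page size


class SharedSegment:
    """Toy shmid_ds: remembers attached processes and created pages."""

    def __init__(self, n_pages):
        self.pages = [None] * n_pages  # no physical pages exist yet
        self.attached = set()

    def attach(self, pid):
        """Attach a process; nothing is mapped until it faults."""
        self.attached.add(pid)
        return {}  # this process's (empty) page table for the area

    def fault(self, page_table, index):
        """First access to a page: create it if needed, then map it."""
        if self.pages[index] is None:
            self.pages[index] = bytearray(PAGE_SIZE)  # first toucher creates
        page_table[index] = self.pages[index]
        return page_table[index]

    def detach(self, pid, page_table):
        """Unmap for this process; free the pages when the last one leaves."""
        page_table.clear()
        self.attached.discard(pid)
        if not self.attached:
            self.pages = [None] * len(self.pages)


segment = SharedSegment(n_pages=2)
table_a = segment.attach(1)
table_b = segment.attach(2)
segment.fault(table_a, 0)[0] = 42       # process 1 creates and writes page 0
print(segment.fault(table_b, 0)[0])     # process 2 maps the same page
```

Both page tables end up referring to the same page object, so the value written through the first mapping is visible through the second.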


REVIEW NOTE: Explain process groups.


1 is a logical diagram of an example PCI based system. No Frames Chapter 6 PCI Peripheral Component Interconnect (PCI). Connected to the secondary PCI bus are the SCSI and ethernet devices for the system.1 PCI Address Spaces The CPU and the PCI devices need to access memory that is shared between them. as its name implies is a standard that describes how to connect the peripheral components of a system together in a structured and controlled way.html (1 di 12) [08/03/2001 10. In the jargon of the PCI specification.1: Example PCI Based System Figure 6. the PCI SCSI device driver would read its status register to find out if the SCSI device was ready to write a block of information to the SCSI disk.40] . The PCI buses and PCI-PCI bridges are the glue connecting the system components together. This chapter looks at how the Linux kernel initializes the system's PCI buses and devices. The PCI-ISA bridge in the system supports older. These registers are used to control the device and to read its status. The CPU's system memory could be used for this shared memory but if it were. A special PCI device. mouse and floppy. The standard describes the way that the system components are electrically connected and the way that they should behave. the CPU is connected to PCI bus 0. Typically the shared memory contains control and status registers for the device. Physically the bridge. PCI bus 1 is described as being downstream of the PCI-PCI bridge and PCI bus 0 is up-stream of the bridge. For example. legacy ISA devices and the diagram shows a super I/O controller chip. the primary PCI bus as is the video device.it/LDP/tlk/dd/pci. 1 6. secondary PCI bus and two devices would all be contained on the same combination PCI card. Or it might write to the control register to start the device running after it has been turned on. This memory is used by device drivers to control the PCI devices and to pass information between them. which controls the keyboard.Table of Contents. Figure 6. 
Show Frames. then every time a PCI device accessed memory. the http://ldp.08. PCI bus 1. a PCI-PCI bridge connects the primary bus to the secondary PCI bus.iol.

2 PCI Configuration Headers Figure 6. This would slow the system down. waiting for the PCI device to finish.it/LDP/tlk/dd/pci.40] . Peripheral devices have their own memory spaces. a rogue device could make the system very unstable. a PCI video card plugged into one PCI slot on the PC motherboard will have its configuration header at one location and if it is plugged into another PCI slot then its header will appear in another location in PCI Configuration memory. PCI I/O. It uses support chipsets to access other address spaces such as PCI Configuration space. PCI has three. Exactly where the header is in the PCI Configuration address space depends on where in the PCI topology that device is. including the PCI-PCI bridges has a configuration data structure that is somewhere in the PCI configuration address space. The CPU can access these spaces but access by the devices into the system's memory is very strictly controlled using DMA (Direct Memory Access) channels. This does not matter. 6. For example. All of these address spaces are also accessible by the CPU with the the PCI I/O and PCI Memory address spaces being used by the device drivers and the PCI Configuration space being used by the PCI initialization code within the Linux kernel. for wherever the PCI devices and bridges are the system will find and configure them using the status and configuration registers in their configuration headers. PCI Memory and PCI Configuration space. It is also not a good idea to allow the system's peripheral devices to access main memory in an uncontrolled way. http://ldp.CPU would have to stall. ISA I/O (Input/Output) and ISA memory. The Alpha AXP processor does not have natural access to addresses spaces other than the system address space.08. This would be very dangerous.2: The PCI Configuration Header Every PCI device in the system.iol. It uses a sparse address mapping scheme which steals part of the large virtual address space and maps it to the PCI address spaces. 
ISA devices have access to two address spaces.html (2 di 12) [08/03/2001 10. Access to memory is generally limited to one system component at a time. The PCI Configuration header allows the system to identify and control the device.

Typically, systems are designed so that every PCI slot has its PCI Configuration Header at an offset that is related to its slot on the board. So, for example, the first slot on the board might have its PCI Configuration at offset 0 and the second slot at offset 256 (all headers are the same length, 256 bytes) and so on. A system specific hardware mechanism is defined so that the PCI configuration code can attempt to examine all possible PCI Configuration Headers for a given PCI bus and know which devices are present and which devices are absent simply by trying to read one of the fields in the header (usually the Vendor Identification field) and getting some sort of error. The PCI specification describes one possible error message as returning 0xFFFFFFFF when attempting to read the Vendor Identification and Device Identification fields for an empty PCI slot.

Figure 6.2 shows the layout of the 256 byte PCI configuration header. It contains the following fields:

Vendor Identification
    A unique number describing the originator of the PCI device. Digital's PCI Vendor Identification is 0x1011 and Intel's is 0x8086.

Device Identification
    A unique number describing the device itself. For example, Digital's 21141 fast ethernet device has a device identification of 0x0009.

Status
    This field gives the status of the device with the meaning of the bits of this field set by the standard.

Command
    By writing to this field the system controls the device, for example allowing the device to access PCI I/O memory.

Class Code
    This identifies the type of device that this is. There are standard classes for every sort of device: video, SCSI and so on. The class code for SCSI is 0x0100.

Base Address Registers
    These registers are used to determine and allocate the type, amount and location of PCI I/O and PCI memory space that the device can use.

Interrupt Pin
    Four of the physical pins on the PCI card carry interrupts from the card to the PCI bus. The standard labels these as A, B, C and D. The Interrupt Pin field describes which of these pins this PCI device uses. Generally it is hardwired for a particular device. That is, every time the system boots, the device uses the same interrupt pin. This information allows the interrupt handling subsystem to manage interrupts from this device.

Interrupt Line
    The Interrupt Line field of the device's PCI Configuration header is used to pass an interrupt handle between the PCI initialisation code, the device's driver and Linux's interrupt handling subsystem. The number written there is meaningless to the device driver but it allows the interrupt handler to correctly route an interrupt from the PCI device to the correct device driver's interrupt handling code within the Linux operating system. See Chapter 7 for details on how Linux handles interrupts.

6.3 PCI I/O and PCI Memory Addresses

These two address spaces are used by the devices to communicate with their device drivers running in the Linux kernel on the CPU. For example, the DECchip 21141 fast ethernet device maps its internal registers into PCI I/O space. Its Linux device driver then reads and writes those registers to control the device. Video drivers typically use large amounts of PCI memory space to contain video information.

Until the PCI system has been set up and the device's access to these address spaces has been turned on using the Command field in the PCI Configuration header, nothing can access them. It should be noted that only the PCI configuration code reads and writes PCI configuration addresses; the Linux device drivers only read and write PCI I/O and PCI memory addresses.
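As an illustration of the configuration header fields listed in Section 6.2, the following Python sketch decodes them from a raw 256 byte header. It is a model for experimenting, not kernel code: the byte offsets are the standard PCI header offsets, while `decode_header`, `make_header` and the dictionary keys are names invented for this example.

```python
import struct

# Byte offsets of the fields discussed above within the configuration header.
VENDOR_ID = 0x00
DEVICE_ID = 0x02
COMMAND = 0x04
STATUS = 0x06
CLASS_CODE = 0x09        # three bytes: programming interface, sub-class, base class
BASE_ADDRESS_0 = 0x10    # six 32-bit Base Address Registers
INTERRUPT_LINE = 0x3C
INTERRUPT_PIN = 0x3D


def decode_header(header):
    """Decode a 256 byte configuration header; return None for an empty slot."""
    vendor, device = struct.unpack_from("<HH", header, VENDOR_ID)
    if vendor == 0xFFFF and device == 0xFFFF:
        return None                      # the 0xFFFFFFFF "nothing here" read
    command, status = struct.unpack_from("<HH", header, COMMAND)
    _prog_if, sub_class, base_class = struct.unpack_from("<BBB", header, CLASS_CODE)
    pin = header[INTERRUPT_PIN]
    return {
        "vendor": vendor,
        "device": device,
        "command": command,
        "status": status,
        "class": (base_class << 8) | sub_class,
        "bars": list(struct.unpack_from("<6I", header, BASE_ADDRESS_0)),
        "interrupt_line": header[INTERRUPT_LINE],
        "interrupt_pin": "ABCD"[pin - 1] if 1 <= pin <= 4 else None,
    }


def make_header(vendor, device, class_code, pin):
    """Build a mostly empty header for experiments (illustrative helper)."""
    header = bytearray(256)
    struct.pack_into("<HH", header, VENDOR_ID, vendor, device)
    struct.pack_into("<BB", header, CLASS_CODE + 1, class_code & 0xFF, class_code >> 8)
    header[INTERRUPT_PIN] = pin
    return bytes(header)
```

Calling `decode_header` on a buffer filled with 0xFF bytes models probing an empty slot, and calling it on a header built with Digital's vendor and device numbers from the text gives those numbers back.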

6.4 PCI-ISA Bridges

These bridges support legacy ISA devices by translating PCI I/O and PCI Memory space accesses into ISA I/O and ISA Memory accesses. A lot of systems now sold contain several ISA bus slots and several PCI bus slots. Over time the need for this backwards compatibility will dwindle and PCI only systems will be sold. Where in the ISA address spaces (I/O and Memory) the ISA devices of the system have their registers was fixed in the dim mists of time by the early Intel 8088 based PCs. Even a $5000 Alpha AXP based computer system will have its ISA floppy controller at the same place in ISA I/O space as the first IBM PC. The PCI specification copes with this by reserving the lower regions of the PCI I/O and PCI Memory address spaces for use by the ISA peripherals in the system and using a single PCI-ISA bridge to translate any PCI memory accesses to those regions into ISA accesses.

6.5 PCI-PCI Bridges

PCI-PCI bridges are special PCI devices that glue the PCI buses of the system together. Simple systems have a single PCI bus but there is an electrical limit on the number of PCI devices that a single PCI bus can support. Using PCI-PCI bridges to add more PCI buses allows the system to support many more PCI devices. This is particularly important for a high performance server. Of course, Linux fully supports the use of PCI-PCI bridges.

6.5.1 PCI-PCI Bridges: PCI I/O and PCI Memory Windows

PCI-PCI bridges only pass a subset of PCI I/O and PCI memory read and write requests downstream. For example, in Figure 6.1, the PCI-PCI bridge will only pass read and write addresses from PCI bus 0 to PCI bus 1 if they are for PCI I/O or PCI memory addresses owned by either the SCSI or ethernet device; all other PCI I/O and memory addresses are ignored. This filtering stops addresses propagating needlessly throughout the system. To do this, the PCI-PCI bridges must be programmed with a base and limit for PCI I/O and PCI Memory space access that they have to pass from their primary bus onto their secondary bus. Once the PCI-PCI Bridges in a system have been configured then so long as the Linux device drivers only access PCI I/O and PCI Memory space via these windows, the PCI-PCI Bridges are invisible. This is an important feature that makes life easier for Linux PCI device driver writers. However, it also makes PCI-PCI bridges somewhat tricky for Linux to configure as we shall see later on.

6.5.2 PCI-PCI Bridges: PCI Configuration Cycles and PCI Bus Numbering

Figure 6.3: Type 0 PCI Configuration Cycle

Figure 6.4: Type 1 PCI Configuration Cycle

So that the CPU's PCI initialization code can address devices that are not on the main PCI bus, there has to be a mechanism that allows bridges to decide whether or not to pass Configuration cycles from their primary interface to their secondary interface. A cycle is just an address as it appears on the PCI bus. The PCI specification defines two formats for the PCI Configuration addresses, Type 0 and Type 1; these are shown in Figure 6.3 and Figure 6.4 respectively.

Type 0 PCI Configuration cycles do not contain a bus number and these are interpreted by all devices as being for PCI configuration addresses on this PCI bus. Bits 31:11 of the Type 0 configuration cycles are treated as the device select field. One way to design a system is to have each bit select a different device. In this case bit 11 would select the PCI device in slot 0, bit 12 would select the PCI device in slot 1 and so on. Another way is to write the device's slot number directly into bits 31:11. Which mechanism is used in a system depends on the system's PCI memory controller.

Type 1 PCI Configuration cycles contain a PCI bus number and this type of configuration cycle is ignored by all PCI devices except the PCI-PCI bridges. All of the PCI-PCI Bridges seeing Type 1 configuration cycles may choose to pass them to the PCI buses downstream of themselves.

Whether the PCI-PCI Bridge ignores the Type 1 configuration cycle or passes it onto the downstream PCI bus depends on how the PCI-PCI Bridge has been configured. Every PCI-PCI bridge has a primary bus interface number and a secondary bus interface number. The primary bus interface is the one nearest the CPU and the secondary bus interface is the one furthest away. Each PCI-PCI Bridge also has a subordinate bus number and this is the maximum bus number of all the PCI buses that are bridged beyond the secondary bus interface. Or to put it another way, the subordinate bus number is the highest numbered PCI bus downstream of the PCI-PCI bridge. When the PCI-PCI bridge sees a Type 1 PCI configuration cycle it does one of the following things:

- Ignore it if the bus number specified is not in between the bridge's secondary bus number and subordinate bus number (inclusive).
- Convert it to a Type 0 configuration command if the bus number specified matches the secondary bus number of the bridge.
- Pass it onto the secondary bus interface unchanged if the bus number specified is greater than the secondary bus number and less than or equal to the subordinate bus number.

So, if we want to address Device 1 on bus 3 of the topology in Figure 6.9 we must generate a Type 1 Configuration command from the CPU. Bridge1 passes this unchanged onto Bus 1. Bridge2 ignores it but Bridge3 converts it into a Type 0 Configuration command and sends it out on Bus 3 where Device 1 responds to it.

It is up to each individual operating system to allocate bus numbers during PCI configuration but whatever the numbering scheme used the following statement must be true for all of the PCI-PCI bridges in the system:

"All PCI buses located behind a PCI-PCI bridge must reside between the secondary bus number and the subordinate bus number (inclusive)."

If this rule is broken then the PCI-PCI Bridges will not pass and translate Type 1 PCI configuration cycles correctly and the system will fail to find and initialise the PCI devices in the system. To achieve this numbering scheme, Linux configures these special devices in a particular order. The section "Configuring PCI-PCI Bridges - Assigning PCI Bus Numbers" describes Linux's PCI bridge and bus numbering scheme in detail together with a worked example.

6.6 Linux PCI Initialization

The PCI initialisation code in Linux is broken into three logical parts:

PCI Device Driver
    This pseudo-device driver searches the PCI system starting at Bus 0 and locates all PCI devices and bridges in the system. It builds a linked list of data structures describing the topology of the system. Additionally, it numbers all of the bridges that it finds.

PCI BIOS
    This software layer provides the services described in the PCI BIOS specification. Even though Alpha AXP does not have BIOS services, there is equivalent code in the Linux kernel providing the same functions.

PCI Fixup
    System specific fixup code tidies up the system specific loose ends of PCI initialization.

6.6.1 The Linux Kernel PCI Data Structures

Figure 6.5: Linux Kernel PCI Data Structures

As the Linux kernel initialises the PCI system it builds data structures mirroring the real PCI topology of the system. Figure 6.5 shows the relationships of the data structures that it would build for the example PCI system in Figure 6.1.

Each PCI device (including the PCI-PCI Bridges) is described by a pci_dev data structure. Each PCI bus is described by a pci_bus data structure. The result is a tree structure of PCI buses each of which has a number of child PCI devices attached to it. As a PCI bus can only be reached using a PCI-PCI Bridge (except the primary PCI bus, bus 0), each pci_bus contains a pointer to the PCI device (the PCI-PCI Bridge) that it is accessed through. That PCI device is a child of the PCI bus's parent PCI bus.

Not shown in Figure 6.5 is a pointer to all of the PCI devices in the system, pci_devices. All of the PCI devices in the system have their pci_dev data structures queued onto this queue. This queue is used by the Linux kernel to quickly find all of the PCI devices in the system.
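The relationships described above can be modelled in a few lines of Python. This is only a sketch of the shape of the data: the class and field names (`PciBus`, `PciDev`, `self_dev`, `children`) are chosen for the example and are not the kernel's actual pci_bus and pci_dev definitions, and the flat list stands in for the pci_devices queue.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class PciDev:
    name: str
    bus: "PciBus"                                # the bus this device sits on


@dataclass(eq=False)
class PciBus:
    number: int
    self_dev: Optional[PciDev] = None            # bridge this bus is reached through
    parent: Optional["PciBus"] = None            # None for the primary bus, bus 0
    devices: List[PciDev] = field(default_factory=list)
    children: List["PciBus"] = field(default_factory=list)


def add_device(bus, name, all_devices):
    """Attach a device to a bus and queue it on the flat list of all devices."""
    dev = PciDev(name, bus)
    bus.devices.append(dev)
    all_devices.append(dev)
    return dev


def add_bridge(bus, name, number, all_devices):
    """A bridge is a device on its upstream bus plus a new child bus behind it."""
    bridge = add_device(bus, name, all_devices)
    child = PciBus(number, self_dev=bridge, parent=bus)
    bus.children.append(child)
    return child


def build_example():
    """Video on bus 0; SCSI and ethernet behind a PCI-PCI bridge on bus 1."""
    pci_devices = []
    pci_root = PciBus(0)
    add_device(pci_root, "video", pci_devices)
    bus1 = add_bridge(pci_root, "pci-pci bridge", 1, pci_devices)
    add_device(bus1, "scsi", pci_devices)
    add_device(bus1, "ethernet", pci_devices)
    return pci_root, pci_devices
```

Walking `pci_root.children` follows the bus tree, while iterating the returned list visits every device without caring where it sits, which is the purpose the text gives for pci_devices.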

6.6.2 The PCI Device Driver

The PCI device driver is not really a device driver at all but a function of the operating system called at system initialisation time. The PCI initialisation code must scan all of the PCI buses in the system looking for all PCI devices in the system (including PCI-PCI bridge devices).

It uses the PCI BIOS code to find out if every possible slot in the current PCI bus that it is scanning is occupied. If the PCI slot is occupied, it builds a pci_dev data structure describing the device and links it into the list of known PCI devices (pointed at by pci_devices).

The PCI initialisation code starts by scanning PCI Bus 0. It tries to read the Vendor Identification and Device Identification fields for every possible PCI device in every possible PCI slot. When it finds an occupied slot it builds a pci_dev data structure describing the device. All of the pci_dev data structures built by the PCI initialisation code (including all of the PCI-PCI Bridges) are linked into a singly linked list, pci_devices.

If the PCI device that was found was a PCI-PCI bridge then a pci_bus data structure is built and linked into the tree of pci_bus and pci_dev data structures pointed at by pci_root. The PCI initialisation code can tell if the PCI device is a PCI-PCI Bridge because it has a class code of 0x060400. The Linux kernel then configures the PCI bus on the other (downstream) side of the PCI-PCI Bridge that it has just found. If more PCI-PCI Bridges are found then these are also configured. This process is known as a depthwise algorithm; the system's PCI topology is fully mapped depthwise before searching breadthwise. Looking at Figure 6.1, Linux would configure PCI Bus 1 with its Ethernet and SCSI device before it configured the video device on PCI Bus 0.

As Linux searches for downstream PCI buses it must also configure the intervening PCI-PCI bridges' secondary and subordinate bus numbers. This is described in detail in the section "Configuring PCI-PCI Bridges - Assigning PCI Bus Numbers" below.

Configuring PCI-PCI Bridges - Assigning PCI Bus Numbers

Figure 6.6: Configuring a PCI System: Part 1

For PCI-PCI bridges to pass PCI I/O, PCI Memory or PCI Configuration address space reads and writes across them, they need to know the following:

Primary Bus Number
    The bus number immediately upstream of the PCI-PCI Bridge.

Secondary Bus Number
    The bus number immediately downstream of the PCI-PCI Bridge.

Subordinate Bus Number
    The highest bus number of all of the buses that can be reached downstream of the bridge.

PCI I/O and PCI Memory Windows
    The window base and size for PCI I/O address space and PCI Memory address space for all addresses downstream of the PCI-PCI Bridge.

The problem is that at the time when you wish to configure any given PCI-PCI bridge you do not know the subordinate bus number for that bridge. You do not know if there are further PCI-PCI bridges downstream and if you did, you do not know what numbers will be assigned to them. The answer is to use a depthwise recursive algorithm and scan each bus for any PCI-PCI bridges, assigning them numbers as they are found. As each PCI-PCI bridge is found and its secondary bus numbered, assign it a temporary subordinate number of 0xFF and scan and assign numbers to all PCI-PCI bridges downstream of it. This all seems complicated but the worked example below makes this process clearer.

PCI-PCI Bridge Numbering: Step 1

Taking the topology in Figure 6.6, the first bridge the scan would find is Bridge1. The PCI bus downstream of Bridge1 would be numbered as 1 and Bridge1 assigned a secondary bus number of 1 and a temporary subordinate bus number of 0xFF. This means that all Type 1 PCI Configuration addresses specifying a PCI bus number of 1 or higher would be passed across Bridge1 and onto PCI Bus 1. They would be translated into Type 0 Configuration cycles if they have a bus number of 1 but left untranslated for all other bus numbers. This is exactly what the Linux PCI initialisation code needs to do in order to go and scan PCI Bus 1.
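The depthwise numbering scheme and the three Type 1 forwarding rules from Section 6.5.2 can be sketched together in Python. This is a model under stated assumptions rather than the kernel's code: `Bridge`, `number_bridges`, `type1_action` and `example_topology` are names made up for the illustration, and the topology is the four bridge example used in the numbering steps (Bridge1 on bus 0, Bridge2 and Bridge3 behind it, Bridge4 behind Bridge3).

```python
from dataclasses import dataclass, field
from typing import List

TEMP_SUBORDINATE = 0xFF     # placeholder used while the downstream scan is running


@dataclass(eq=False)
class Bridge:
    name: str
    downstream: List["Bridge"] = field(default_factory=list)  # bridges on its secondary bus
    primary: int = 0
    secondary: int = 0
    subordinate: int = 0


def number_bridges(bridges, bus, highest):
    """Number the bridges found on `bus`, depth first.

    `highest` is the highest bus number handed out so far; the updated
    value is returned so the caller can fix up its own subordinate number.
    """
    for bridge in bridges:
        highest += 1
        bridge.primary = bus
        bridge.secondary = highest
        bridge.subordinate = TEMP_SUBORDINATE
        highest = number_bridges(bridge.downstream, bridge.secondary, highest)
        bridge.subordinate = highest        # now the real value is known
    return highest


def type1_action(bridge, bus):
    """What a bridge does with a Type 1 configuration cycle addressed to `bus`."""
    if bus == bridge.secondary:
        return "convert to Type 0"
    if bridge.secondary < bus <= bridge.subordinate:
        return "pass unchanged"
    return "ignore"


def example_topology():
    b4 = Bridge("Bridge4")
    b3 = Bridge("Bridge3", [b4])
    b2 = Bridge("Bridge2")
    b1 = Bridge("Bridge1", [b2, b3])
    return b1, b2, b3, b4
```

Running `number_bridges([b1], 0, 0)` on the example gives the same primary, secondary and subordinate numbers as the four steps in the text, and `type1_action` then reproduces the "Device 1 on bus 3" walk: Bridge1 passes the cycle on, Bridge2 ignores it and Bridge3 converts it.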

Figure 6.7: Configuring a PCI System: Part 2

PCI-PCI Bridge Numbering: Step 2

Linux uses a depthwise algorithm and so the initialisation code goes on to scan PCI Bus 1. Here it finds PCI-PCI Bridge2. There are no further PCI-PCI bridges beyond PCI-PCI Bridge2, so it is assigned a subordinate bus number of 2 which matches the number assigned to its secondary interface. Figure 6.7 shows how the buses and PCI-PCI bridges are numbered at this point.

Figure 6.8: Configuring a PCI System: Part 3

PCI-PCI Bridge Numbering: Step 3

The PCI initialisation code returns to scanning PCI Bus 1 and finds another PCI-PCI bridge, Bridge3. It is assigned 1 as its primary bus interface number, 3 as its secondary bus interface number and 0xFF as its subordinate bus number. Figure 6.8 shows how the system is configured now. Type 1 PCI configuration cycles with a bus number of 1, 2 or 3 will be correctly delivered to the appropriate PCI buses.

Figure 6.9: Configuring a PCI System: Part 4

PCI-PCI Bridge Numbering: Step 4

Linux starts scanning PCI Bus 3, downstream of PCI-PCI Bridge3. PCI Bus 3 has another PCI-PCI bridge (Bridge4) on it; it is assigned 3 as its primary bus number and 4 as its secondary bus number. It is the last bridge on this branch and so it is assigned a subordinate bus interface number of 4. The initialisation code returns to PCI-PCI Bridge3 and assigns it a subordinate bus number of 4. Finally, the PCI initialisation code can assign 4 as the subordinate bus number for PCI-PCI Bridge1. Figure 6.9 shows the final bus numbers.

6.6.3 PCI BIOS Functions

The PCI BIOS functions are a series of standard routines which are common across all platforms. For example, they are the same for both Intel and Alpha AXP based systems. They allow the CPU controlled access to all of the PCI address spaces. Only Linux kernel code and device drivers may use them.

6.6.4 PCI Fixup

The PCI fixup code for Alpha AXP does rather more than that for Intel (which basically does nothing). For Intel based systems the system BIOS, which ran at boot time, has already fully configured the PCI system. This leaves Linux with little to do other than map that configuration. For non-Intel based systems further configuration needs to happen to:

- Allocate PCI I/O and PCI Memory space to each device.
- Configure the PCI I/O and PCI Memory address windows for each PCI-PCI bridge in the system.
- Generate Interrupt Line values for the devices; these control interrupt handling for the device.

The next subsections describe how that code works.

Finding Out How Much PCI I/O and PCI Memory Space a Device Needs

Each PCI device found is queried to find out how much PCI I/O and PCI Memory address space it requires. To do this, each Base Address Register has all 1's written to it and then read. The device will return 0's in the don't-care address bits, effectively specifying the address space required.

Figure 6.10: PCI Configuration Header: Base Address Registers

There are two basic types of Base Address Register; the first indicates within which address space the device's registers must reside, either PCI I/O or PCI Memory space. This is indicated by Bit 0 of the register. Figure 6.10 shows the two forms of the Base Address Register for PCI Memory and for PCI I/O.

To find out just how much of each address space a given Base Address Register is requesting, you write all 1s into the register and then read it back. The device will specify zeros in the don't care address bits, effectively specifying the address space required. This design implies that all address spaces used are a power of two and are naturally aligned.

For example when you initialize the DECChip 21142 PCI Fast Ethernet device, it tells you that it needs 0x100 bytes of space of either PCI I/O or PCI Memory. The initialization code allocates it space. The moment that it allocates space, the 21142's control and status registers can be seen at those addresses.

Allocating PCI I/O and PCI Memory to PCI-PCI Bridges and Devices

Like all memory the PCI I/O and PCI memory spaces are finite, and to some extent scarce. The PCI Fixup code for non-Intel systems (and the BIOS code for Intel systems) has to allocate each device the amount of memory that it is requesting in an efficient manner. Both PCI I/O and PCI Memory must be allocated to a device in a naturally aligned way. For example, if a device asks for 0xB0 of PCI I/O space then it must be aligned on an address that is a multiple of 0xB0. In addition to this, the PCI I/O and PCI Memory bases for any given bridge must be aligned on 4K and on 1Mbyte boundaries respectively. Given that the address spaces for downstream devices must lie within all of the upstream PCI-PCI Bridge's memory ranges for any given device, it is a somewhat difficult problem to allocate space efficiently.

The algorithm that Linux uses relies on each device described by the bus/device tree built by the PCI Device Driver being allocated address space in ascending PCI I/O memory order. Again a recursive algorithm is used to walk the pci_bus and pci_dev data structures built by the PCI initialisation code. Starting at the root PCI bus (pointed at by pci_root) the BIOS fixup code:

- Aligns the current global PCI I/O and Memory bases on 4K and 1 Mbyte boundaries respectively.
- For every device on the current bus (in ascending PCI I/O memory needs),
    - allocates it space in PCI I/O and/or PCI Memory,
    - moves on the global PCI I/O and Memory bases by the appropriate amounts,
    - enables the device's use of PCI I/O and PCI Memory.
- Allocates space recursively to all of the buses downstream of the current bus. Note that this will change the global PCI I/O and Memory bases.
- Aligns the current global PCI I/O and Memory bases on 4K and 1 Mbyte boundaries respectively and in doing so figures out the size and base of PCI I/O and PCI Memory windows required by the current PCI-PCI bridge.
- Programs the PCI-PCI bridge that links to this bus with its PCI I/O and PCI Memory bases and limits.
- Turns on bridging of PCI I/O and PCI Memory accesses in the PCI-PCI Bridge. This means that any PCI I/O or PCI Memory addresses seen on the Bridge's primary PCI bus that are within its PCI I/O and PCI Memory address windows will be bridged onto its secondary PCI bus.

Taking the PCI system in Figure 6.1 as our example the PCI Fixup code would set up the system in the following way:

Align the PCI bases
    PCI I/O is 0x4000 and PCI Memory is 0x100000. This allows the PCI-ISA bridges to translate all addresses below these into ISA address cycles.

The Video Device
    This is asking for 0x200000 of PCI Memory and so we allocate it that amount starting at the current PCI Memory base of 0x200000 as it has to be naturally aligned to the size requested. The PCI Memory base is moved to 0x400000 and the PCI I/O base remains at 0x4000.

The PCI-PCI Bridge
    We now cross the PCI-PCI Bridge and allocate PCI memory there; note that we do not need to align the bases as they are already correctly aligned:

    The Ethernet Device
        This is asking for 0xB0 bytes of both PCI I/O and PCI Memory space. It gets allocated PCI I/O at 0x4000 and PCI Memory at 0x400000. The PCI Memory base is moved to 0x4000B0 and the PCI I/O base to 0x40B0.

    The SCSI Device
        This is asking for 0x1000 PCI Memory and so it is allocated it at 0x401000 after it has been naturally aligned. The PCI I/O base is still 0x40B0 and the PCI Memory base has been moved to 0x402000.

The PCI-PCI Bridge's PCI I/O and Memory Windows
    We now return to the bridge and set its PCI I/O window at between 0x4000 and 0x40B0 and its PCI Memory window at between 0x400000 and 0x402000. This means that the PCI-PCI Bridge will ignore the PCI Memory accesses for the video device and pass them on if they are for the ethernet or SCSI devices.

Footnotes:
1 For example?

File translated from TeX by TTH, version 1.0. © 1996-1999 David A Rusling.
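The Base Address Register sizing trick and the allocation walk-through from Section 6.6.4 can be reproduced with a short Python sketch. It rests on two assumptions that are not in the text. First, the low flag bits of a Base Address Register (two for PCI I/O, four for PCI Memory) are masked off before the size is computed. Second, "naturally aligned" is read as aligning to the next power of two at or above the request while advancing the base by the requested size; this is the reading that reproduces the 0xB0 ethernet figures in the example. All function names are illustrative.

```python
IO_ALIGN = 0x1000        # bridges need 4K aligned PCI I/O windows
MEM_ALIGN = 0x100000     # and 1 Mbyte aligned PCI Memory windows


def bar_size(readback, is_io):
    """Size requested by a Base Address Register, from the value read back
    after all 1s were written to it."""
    flags = 0x3 if is_io else 0xF            # low bits carry type information
    address_bits = readback & 0xFFFFFFFF & ~flags
    if address_bits == 0:
        return 0                             # register not implemented
    return ((~address_bits) & 0xFFFFFFFF) + 1


def align_up(value, alignment):
    """Round value up to a multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def allocate(base, size):
    """Place `size` bytes at or above `base`; return (address, new base)."""
    natural = 1 << (size - 1).bit_length()   # next power of two >= size (assumption)
    address = align_up(base, natural)
    return address, address + size


def fixup_example():
    """Replay the worked example: video on bus 0, ethernet and SCSI behind a bridge."""
    io_base, mem_base = 0x4000, 0x100000
    layout = {}
    layout["video_mem"], mem_base = allocate(mem_base, 0x200000)
    # Crossing the bridge: window bases must be 4K / 1 Mbyte aligned.
    io_base = align_up(io_base, IO_ALIGN)
    mem_base = align_up(mem_base, MEM_ALIGN)
    window_io, window_mem = io_base, mem_base
    layout["ethernet_io"], io_base = allocate(io_base, 0xB0)
    layout["ethernet_mem"], mem_base = allocate(mem_base, 0xB0)
    layout["scsi_mem"], mem_base = allocate(mem_base, 0x1000)
    layout["bridge_io_window"] = (window_io, io_base)
    layout["bridge_mem_window"] = (window_mem, mem_base)
    return layout
```

A read-back value of 0xFFFFFF00 from a memory register decodes to the 0x100 bytes quoted for the 21142, and `fixup_example()` returns the same addresses and bridge windows as the walk-through. Note that the sketch follows the worked example in leaving the window limits at the final bases rather than rounding them up again.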

Chapter 7 Interrupts and Interrupt Handling

This chapter looks at how interrupts are handled by the Linux kernel. Whilst the kernel has generic mechanisms and interfaces for handling interrupts, most of the interrupt handling details are architecture specific.

Figure 7.1: A Logical Diagram of Interrupt Routing

Linux uses a lot of different pieces of hardware to perform many different tasks. The video device drives the monitor, the IDE device drives the disks and so on. You could drive these devices synchronously, that is you could send a request for some operation (say writing a block of memory out to disk) and then wait for the operation to complete. That method, although it would work, is very inefficient and the operating system would spend a lot of time "busy doing nothing" as it waited for each operation to complete. A better, more efficient, way is to make the request and then do other, more useful work and later be interrupted by the device when it has finished the request. With this scheme, there may be many outstanding requests to the devices in the system all happening at the same time.

There has to be some hardware support for the devices to interrupt whatever the CPU is doing. Most, if not all, general purpose processors such as the Alpha AXP use a similar method. Some of the physical pins of the CPU are wired such that changing the voltage (for example changing it from +5v to -5v) causes the CPU to stop what it is doing and to start executing special code to handle the interruption, the interrupt handling code. One of these pins might be connected to an interval timer and receive an interrupt every 1000th of a second, others may be connected to the other devices in the system, such as the SCSI controller.

Systems often use an interrupt controller to group the device interrupts together before passing on the signal to a single interrupt pin on the CPU. This saves interrupt pins on the CPU and also gives flexibility when designing systems. The interrupt controller has mask and status registers that control the interrupts. Setting the bits in the mask register enables and disables interrupts and the status register returns the currently active interrupts in the system.

Some of the interrupts in the system may be hard-wired, for example, the real time clock's interval timer may be permanently connected to pin 3 on the interrupt controller. However, what some of the pins are connected to may be determined by what controller card is plugged into a particular ISA or PCI slot. For example, pin 4 on the interrupt controller may be connected to PCI slot number 0 which might one day have an ethernet card in it but the next have a SCSI controller in it. The bottom line is that each system has its own interrupt routing mechanisms and the operating system must be flexible enough to cope.

Most modern general purpose microprocessors handle the interrupts the same way. When a hardware interrupt occurs the CPU stops executing the instructions that it was executing and jumps to a location in memory that either contains the interrupt handling code or an instruction branching to the interrupt handling code. This code usually operates in a special mode for the CPU, interrupt mode, and, normally, no other interrupts can happen in this mode. There are exceptions though; some CPUs rank the interrupts in priority and higher level interrupts may happen. This means that the first level interrupt handling code must be very carefully written and it often has its own stack, which it uses to store the CPU's execution state (all of the CPU's normal registers and context) before it goes off and handles the interrupt. Some CPUs have a special set of registers that only exist in interrupt mode, and the interrupt code can use these registers to do most of the context saving it needs to do.

When the interrupt has been handled, the CPU's state is restored and the interrupt is dismissed. The CPU will then continue doing whatever it was doing before being interrupted. It is important that the interrupt processing code is as efficient as possible and that the operating system does not block interrupts too often or for too long.
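The mask and status registers described above lend themselves to a small bit manipulation sketch. It assumes, as an illustration only, that a set bit in the mask register disables the corresponding pin and that a set bit in the status register marks an active interrupt; a real controller may use the opposite polarity, and the function names are invented for the example.

```python
def disable_pin(mask, pin):
    """Return a mask register value with `pin` disabled (bit set)."""
    return mask | (1 << pin)


def enable_pin(mask, pin):
    """Return a mask register value with `pin` enabled (bit cleared)."""
    return mask & ~(1 << pin)


def deliverable(status, mask, pins=8):
    """Pins that are active in `status` and not disabled in `mask`, lowest first."""
    active = status & ~mask
    return [pin for pin in range(pins) if active & (1 << pin)]
```

With the interval timer on pin 3 disabled and both pin 3 and pin 6 active, only pin 6 would be passed on to the CPU.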

7.1 Programmable Interrupt Controllers

Systems designers are free to use whatever interrupt architecture they wish but IBM PCs use the Intel 82C59A-2 CMOS Programmable Interrupt Controller or its derivatives. This controller has been around since the dawn of the PC and it is programmable with its registers being at well known locations in the ISA address space. Even very modern support logic chip sets keep equivalent registers in the same place in ISA memory. Non-Intel based systems such as Alpha AXP based PCs are free from these architectural constraints and so often use different interrupt controllers.

If the ISA device driver has successfully found its IRQ number then it can now request control of it as normal.

PCI based systems are much more dynamic than ISA based systems. The interrupt pin that an ISA device uses is often set using jumpers on the hardware device and fixed in the device driver. On the other hand, PCI devices have their interrupts allocated by the PCI BIOS or the PCI subsystem as PCI is initialized when the system boots. Each PCI device may use one of four interrupt pins, A, B, C or D. This was fixed when the device was built and most devices default to interrupt on pin A. The PCI interrupt lines A, B, C and D for each PCI slot are routed to the interrupt controller. So, Pin A from PCI slot 4 might be routed to pin 6 of the interrupt controller, pin B of PCI slot 4 to pin 7 of the interrupt controller and so on.

How the PCI interrupts are routed is entirely system specific and there must be some set up code which understands this PCI interrupt routing topology. On Intel based PCs this is the system BIOS code that runs at boot time but for systems without BIOS (for example Alpha AXP based systems) the Linux kernel does this setup. The PCI set up code writes the pin number of the interrupt controller into the PCI configuration header for each device. It determines the interrupt pin (or IRQ) number using its knowledge of the PCI interrupt routing topology together with the device's PCI slot number and which PCI interrupt pin it is using. The interrupt pin that a device uses is fixed and is kept in a field in the PCI configuration header for this device. The set up code writes the IRQ number into the interrupt line field that is reserved for this purpose. When the device driver runs, it reads this information and uses it to request control of the interrupt from the Linux kernel.

There may be many PCI interrupt sources in the system, for example when PCI-PCI bridges are used. The number of interrupt sources may exceed the number of pins on the system's programmable interrupt controllers. In this case, PCI devices may share interrupts, one pin on the interrupt controller taking interrupts from more than one PCI device. Linux supports this by allowing the first requestor of an interrupt source to declare whether it may be shared. Sharing interrupts results in several irqaction data structures being pointed at by one entry in the irq_action vector. When a shared interrupt happens, Linux will call all of the interrupt handlers for that source. Any device driver that can share interrupts (which should be all PCI device drivers) must be prepared to have its interrupt handler called when there is no interrupt to be serviced.
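The hand-off through the Interrupt Line field can be sketched as a table lookup. The routing table below is hypothetical and simply encodes the "slot 4, pin A to controller pin 6" example from the text; real routing is board specific, and the dictionary based device records and function names are assumptions made for the illustration.

```python
# Hypothetical routing for one board: (slot, interrupt pin) -> interrupt controller pin.
EXAMPLE_ROUTING = {(4, "A"): 6, (4, "B"): 7}


def set_interrupt_lines(devices, routing):
    """What the PCI set up code does: fill in each device's Interrupt Line
    from its slot number and its fixed Interrupt Pin."""
    for dev in devices:
        dev["interrupt_line"] = routing[(dev["slot"], dev["interrupt_pin"])]
    return devices


def irq_for_driver(dev):
    """What a driver does later: read the Interrupt Line back, without needing
    to know anything about the routing that produced it."""
    return dev["interrupt_line"]
```

The driver side never sees the routing table, which is the point the text makes about the number being an opaque handle passed between the set up code, the driver and the interrupt subsystem.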

Figure 7.2: Linux Interrupt Handling Data Structures

One of the principal tasks of Linux's interrupt handling subsystem is to route the interrupts to the right pieces of interrupt handling code. This code must understand the interrupt topology of the system. If, for example, the floppy controller interrupts on pin 6 [1] of the interrupt controller then it must recognize the interrupt as from the floppy and route it to the floppy device driver's interrupt handling code. Linux uses a set of pointers to data structures containing the addresses of the routines that handle the system's interrupts. These routines belong to the device drivers for the devices in the system and it is the responsibility of each device driver to request the interrupt that it wants when the driver is initialized. Figure 7.2 shows that irq_action is a vector of pointers to the irqaction data structure. Each irqaction data structure contains information about the handler for this interrupt, including the address of the interrupt handling routine. As the number of interrupts and how they are handled varies between architectures and, sometimes, between systems, the Linux interrupt handling code is architecture specific. This means that the size of the irq_action vector varies depending on the number of interrupt sources that there are.

When the interrupt happens, Linux must first determine its source by reading the interrupt status register of the system's programmable interrupt controllers. It then translates that source into an offset into the irq_action vector. So, for example, an interrupt on pin 6 of the interrupt controller from the floppy controller would be translated into the seventh pointer in the vector of interrupt handlers. If there is not an interrupt handler for the interrupt that occurred then the Linux kernel will log an error, otherwise it will call into the interrupt handling routines for all of the irqaction data structures for this interrupt source.

When the device driver's interrupt handling routine is called by the Linux kernel it must efficiently work out why it was interrupted and respond. To find the cause of the interrupt the device driver would read the status register of the device that interrupted. The device may be reporting an error or that a requested operation has completed. For example the floppy controller may be reporting that it has completed the positioning of the floppy's read head over the correct sector on the floppy disk. Once the reason for the interrupt has been determined, the device driver may need to do more work. If it does, the Linux kernel has mechanisms that allow it to postpone that work until later. This avoids the CPU spending too much time in interrupt mode. See the Device Driver chapter (Chapter dd-chapter) for more details.

REVIEW NOTE: Fast and slow interrupts, are these an Intel thing?

Footnotes:
1 Actually, the floppy controller is one of the fixed interrupts in a PC system as, by convention, the floppy controller is always wired to interrupt 6.

File translated from TEX by TTH, version 1.0.
© 1996-1999 David A Rusling
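The routing described in this section, from interrupt source to an offset in the irq_action vector and then to every registered handler, can be modelled in a few lines of user-space C. This is a simplified sketch and not the kernel's code: the field layout of struct irqaction is cut down, NR_IRQS is assumed to be 16, and request_irq_sketch, do_irq_sketch and floppy_handler are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NR_IRQS 16   /* assumption: a PC with two cascaded 8-input controllers */

/* Cut-down stand-in for the irqaction data structure. */
struct irqaction {
    void (*handler)(int irq, void *dev_id);
    void *dev_id;
    const char *name;
    struct irqaction *next;   /* further handlers for the same source */
};

/* The vector of pointers: entry n serves interrupt source n. */
static struct irqaction *irq_action[NR_IRQS];

/* Done by a driver at initialization: claim an interrupt source. */
static int request_irq_sketch(int irq, struct irqaction *action)
{
    assert(action != NULL && action->handler != NULL);
    if (irq < 0 || irq >= NR_IRQS)
        return -1;
    action->next = irq_action[irq];
    irq_action[irq] = action;
    return 0;
}

/* Done on each interrupt: the source number is the offset into the vector.
 * Returns the number of handlers called, or -1 if nobody owns the source. */
static int do_irq_sketch(int irq)
{
    int called = 0;

    if (irq < 0 || irq >= NR_IRQS || irq_action[irq] == NULL) {
        fprintf(stderr, "unexpected interrupt %d\n", irq);  /* log an error */
        return -1;
    }
    for (struct irqaction *a = irq_action[irq]; a != NULL; a = a->next) {
        a->handler(irq, a->dev_id);
        called++;
    }
    return called;
}

/* Toy floppy driver handler; it only counts how often it was called. */
static int floppy_interrupts;

static void floppy_handler(int irq, void *dev_id)
{
    (void)irq;
    (void)dev_id;
    floppy_interrupts++;
}
```

Registering floppy_handler on source 6 fills irq_action[6], the seventh pointer in the vector. An interrupt on source 7, which nobody requested, only produces the logged error.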

Chapter 8
Device Drivers

One of the purposes of an operating system is to hide the peculiarities of the system's hardware devices from its users. For example the Virtual File System presents a uniform view of the mounted filesystems irrespective of the underlying physical devices. This chapter describes how the Linux kernel manages the physical devices in the system.

The CPU is not the only intelligent device in the system, every physical device has its own hardware controller. The keyboard, mouse and serial ports are controlled by a SuperIO chip, the IDE disks by an IDE controller, SCSI disks by a SCSI controller and so on. Each hardware controller has its own control and status registers (CSRs) and these differ between devices. The CSRs for an Adaptec 2940 SCSI controller are completely different from those of an NCR 810 SCSI controller. The CSRs are used to start and stop the device, to initialize it and to diagnose any problems with it. Instead of putting code to manage the hardware controllers in the system into every application, the code is kept in the Linux kernel. The software that handles or manages a hardware controller is known as a device driver. The Linux kernel device drivers are, essentially, a shared library of privileged, memory resident, low level hardware handling routines. It is Linux's device drivers that handle the peculiarities of the devices they are managing.

One of the basic features of Unix is that it abstracts the handling of devices. All hardware devices look like regular files; they can be opened, closed, read and written using the same, standard, system calls that are used to manipulate files. Every device in the system is represented by a device special file, for example the first IDE disk in the system is represented by /dev/hda. For block (disk) and character devices, these device special files are created by the mknod command and they describe the device using major and minor device numbers. Network devices are also represented by device special files but they are created by Linux as it finds and initializes the network controllers in the system. All devices controlled by the same device driver have a common major device number. The minor device numbers are used to distinguish between different devices and their controllers, for example each partition on the primary IDE disk has a different minor device number. So, /dev/hda2, the second partition of the primary IDE disk has a major number of 3 and a minor number of 2. Linux maps the device special file passed in system calls (say to mount a file system on a block device) to the device's device driver using the major device number and a number of system tables, for example the character device table, chrdevs.

Linux supports three types of hardware device: character, block and network. Character devices are read and written directly without buffering, for example the system's serial ports /dev/cua0 and /dev/cua1. Block devices can only be written to and read from in multiples of the block size, typically 512 or 1024 bytes. Block devices are accessed via the buffer cache and may be randomly accessed, that is to say, any block can be read or written no matter where it is on the device. Block devices can be accessed via their device special file but more commonly they are accessed via the file system. Only a block device can support a mounted file system. Network devices are accessed via the BSD socket interface and the networking subsystems described in the Networking chapter (Chapter network-chapter).

There are many different device drivers in the Linux kernel (that is one of Linux's strengths) but they all share some common attributes:

kernel code: Device drivers are part of the kernel and, like other code within the kernel, if they go wrong they can seriously
damage the system. A badly written driver may even crash the system, possibly corrupting file systems and losing data.

Kernel interfaces: Device drivers must provide a standard interface to the Linux kernel or to the subsystem that they are part of. For example, the terminal driver provides a file I/O interface to the Linux kernel and a SCSI device driver provides a SCSI device interface to the SCSI subsystem which, in turn, provides both file I/O and buffer cache interfaces to the kernel.

Kernel mechanisms and services: Device drivers make use of standard kernel services such as memory allocation, interrupt delivery and wait queues to operate.

Loadable: Most of the Linux device drivers can be loaded on demand as kernel modules when they are needed and unloaded when they are no longer being used. This makes the kernel very adaptable and efficient with the system's resources.

Configurable: Linux device drivers can be built into the kernel. Which devices are built is configurable when the kernel is compiled.

Dynamic: As the system boots and each device driver is initialized it looks for the hardware devices that it is controlling. It does not matter if the device being controlled by a particular device driver does not exist. In this case the device driver is simply redundant and causes no harm apart from occupying a little of the system's memory.

8.1 Polling and Interrupts

Each time the device is given a command, for example ``move the read head to sector 42 of the floppy disk'' the device driver has a choice as to how it finds out that the command has completed. The device drivers can either poll the device or they can use interrupts.

Polling the device usually means reading its status register every so often until the device's status changes to indicate that it has completed the request. As a device driver is part of the kernel it would be disastrous if a driver were to poll continuously, as nothing else in the kernel would run until the device had completed the request. Instead polling device drivers use system timers to have the kernel call a routine within the device driver at some later time. This timer routine would check the status of the command and this is exactly how Linux's floppy driver works. Polling by means of timers is at best approximate; a much more efficient method is to use interrupts.

An interrupt driven device driver is one where the hardware device being controlled will raise a hardware interrupt whenever it needs to be serviced. For example, an ethernet device would interrupt whenever it receives an ethernet packet from the network. The Linux kernel needs to be able to deliver the interrupt from the hardware device to the correct device driver. This is achieved by the device driver registering its usage of the interrupt with the kernel. It registers the address of an interrupt handling routine and the interrupt number that it wishes to own. You can see which interrupts are being used by the device drivers, as well as how many of each type of interrupts there have been, by looking at /proc/interrupts:

 0:     727432   timer
 1:      20534   keyboard
 2:          0   cascade
 3:      79691 + serial
 4:      28258 + serial
 5:          1   sound blaster
11:      20868 + aic7xxx
13:          1   math error
14:        247 + ide0
15:        170 + ide1

This requesting of interrupt resources is done at driver initialization time. Some of the interrupts in the system are fixed, this is a legacy of the IBM PC's architecture. So, for example, the floppy disk controller always uses interrupt 6. Other interrupts, for example the interrupts from PCI devices, are dynamically allocated at boot time. In this case the device driver must first discover the interrupt number (IRQ) of the device that it is controlling before it requests ownership of that interrupt. For PCI interrupts Linux supports standard PCI BIOS callbacks to determine information about the devices in the system, including their IRQ numbers.

How an interrupt is delivered to the CPU itself is architecture dependent but on most architectures the interrupt is delivered in a special mode that stops other interrupts from happening in the system. A device driver should do as little as possible in its interrupt handling routine so that the Linux kernel can dismiss the interrupt and return to what it was doing before it was interrupted. Device drivers that need to do a lot of work as a result of receiving an interrupt can use the kernel's bottom half handlers or task queues to queue routines to be called later on.

8.2 Direct Memory Access (DMA)

Using interrupt driven device drivers to transfer data to or from hardware devices works well when the amount of data is reasonably low. For example a 9600 baud modem can transfer approximately one character every millisecond (1/1000'th second). If the interrupt latency, the amount of time that it takes between the hardware device raising the interrupt and the device driver's interrupt handling routine being called, is low (say 2 milliseconds) then the overall system impact of the data transfer is very low. The 9600 baud modem data transfer would only take 0.002% of the CPU's processing time. For high speed devices, such as hard disk controllers or ethernet devices, the data transfer rate is a lot higher. A SCSI device can transfer up to 40 Mbytes of information per second.

Direct Memory Access, or DMA, was invented to solve this problem. A DMA controller allows devices to transfer data to or from the system's memory without the intervention of the processor. A PC's ISA DMA controller has 8 DMA channels of which 7 are available for use by the device drivers. Each DMA channel has associated with it a 16 bit address register and a 16 bit count register. To initiate a data transfer the device driver sets up the DMA channel's address and count registers together with the direction of the data transfer, read or write. It then tells the device that it may start the DMA when it wishes. When the transfer is complete the device interrupts the PC. Whilst the transfer is taking place the CPU is free to do other things.

Device drivers have to be careful when using DMA. First of all the DMA controller knows nothing of virtual memory, it only has access to the physical memory in the system. Therefore the memory that is being DMA'd to or from must be a contiguous block of physical memory. This means that you cannot DMA directly into the virtual address space of a process. You can however lock the process's physical pages into memory, preventing them from being swapped out to the swap device during a DMA operation. Secondly, the DMA controller cannot access the whole of physical memory. The DMA channel's address register represents the first 16 bits of the DMA address, the next 8 bits come from the page register. This means that DMA requests are limited to the bottom 16 Mbytes of memory.

DMA channels are scarce resources, there are only 7 of them, and they cannot be shared between device drivers. Just like interrupts, the device driver must be able to work out which DMA channel it should use. Like interrupts, some devices have a fixed DMA channel. The floppy device, for example, always uses DMA channel 2. Sometimes the DMA channel for a device can be set by jumpers, a number of ethernet devices use this technique. The more flexible devices can be told (via their CSRs) which DMA channels to use and, in this case, the device driver can simply pick a free DMA channel to use.

Linux tracks the usage of the DMA channels using a vector of dma_chan data structures (one per DMA channel). The dma_chan data structure contains just two fields, one of which is a pointer to a string describing the owner of the DMA channel.
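The 16 Mbyte limit mentioned above follows directly from the register widths: 16 address bits plus 8 page bits give 24 bits, and 2^24 bytes is 16 Mbytes. The sketch below shows how a physical address might be split across the two registers. It is an illustration only: struct isa_dma_regs and isa_dma_split are hypothetical names, the code models a byte-wide channel, and two details are assumptions that the text does not state, namely that the count register holds the length minus one and that a transfer may not cross a 64 Kbyte page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ISA_DMA_LIMIT (UINT32_C(1) << 24)  /* 16 + 8 address bits = 16 Mbytes */

/* Hypothetical image of one channel's registers. */
struct isa_dma_regs {
    uint16_t address;  /* first (low) 16 bits of the physical address */
    uint8_t  page;     /* next 8 bits, from the page register */
    uint16_t count;    /* assumption: programmed as length - 1 */
};

/* Split a physical buffer into register values; -1 if it cannot be DMA'd. */
static int isa_dma_split(uint32_t phys, uint32_t len, struct isa_dma_regs *regs)
{
    assert(regs != NULL);
    if (len == 0 || len > UINT32_C(0x10000))
        return -1;                      /* does not fit a 16 bit count */
    if (phys >= ISA_DMA_LIMIT || len > ISA_DMA_LIMIT - phys)
        return -1;                      /* beyond the bottom 16 Mbytes */
    if ((phys >> 16) != ((phys + len - 1) >> 16))
        return -1;                      /* assumption: no 64K page crossing */
    regs->address = (uint16_t)(phys & 0xFFFF);
    regs->page    = (uint8_t)(phys >> 16);
    regs->count   = (uint16_t)(len - 1);
    return 0;
}
```

A buffer at physical address 0x123456 yields page 0x12 and address 0x3456, while any buffer at or above 0x1000000 is rejected, which is why a driver has to ask the kernel for memory that is known to be DMA'able.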

The other field of the dma_chan data structure is a flag indicating if the DMA channel is allocated or not. It is this vector of dma_chan data structures that is printed when you cat /proc/dma.

8.3 Memory

Device drivers have to be careful when using memory. As they are part of the Linux kernel they cannot use virtual memory. Each time a device driver runs, maybe as an interrupt is received or as a bottom half or task queue handler is scheduled, the current process may change. The device driver cannot rely on a particular process running even if it is doing work on its behalf. Like the rest of the kernel, device drivers use data structures to keep track of the devices that they are controlling. These data structures can be statically allocated, part of the device driver's code, but that would be wasteful as it makes the kernel larger than it need be. Most device drivers allocate kernel, non-paged, memory to hold their data.

Linux provides kernel memory allocation and deallocation routines and it is these that the device drivers use. Kernel memory is allocated in chunks that are powers of 2, for example 128 or 512 bytes, even if the device driver asks for less. The number of bytes that the device driver requests is rounded up to the next block size boundary. This makes kernel memory deallocation easier as the smaller free blocks can be recombined into bigger blocks.

It may be that Linux needs to do quite a lot of extra work when the kernel memory is requested. If the amount of free memory is low, physical pages may need to be discarded or written to the swap device. Normally, Linux would suspend the requester, putting the process onto a wait queue until there is enough physical memory. Not all device drivers (or indeed Linux kernel code) may want this to happen and so the kernel memory allocation routines can be requested to fail if they cannot immediately allocate memory. If the device driver wishes to DMA to or from the allocated memory it can also specify that the memory is DMA'able. This way it is the Linux kernel that needs to understand what constitutes DMA'able memory for this system, and not the device driver.

8.4 Interfacing Device Drivers with the Kernel

The Linux kernel must be able to interact with device drivers in standard ways. Each class of device driver, character, block and network, provides common interfaces that the kernel uses when requesting services from them. These common interfaces mean that the kernel can treat often very different devices and their device drivers absolutely the same. For example, SCSI and IDE disks behave very differently but the Linux kernel uses the same interface to both of them.

Linux is very dynamic, every time a Linux kernel boots it may encounter different physical devices and thus need different device drivers. Linux allows you to include device drivers at kernel build time via its configuration scripts. When these drivers are initialized at boot time they may not discover any hardware to control. Other drivers can be loaded as kernel modules when they are needed. To cope with this dynamic nature of device drivers, device drivers register themselves with the kernel as they are initialized. Linux maintains tables of registered device drivers as part of its interfaces with them. These tables include pointers to routines and information that support the interface with that class of devices.

8.4.1 Character Devices

Figure 8.1: Character Devices

Character devices, the simplest of Linux's devices, are accessed as files, applications use standard system calls to open them, read from them, write to them and close them exactly as if the device were a file. This is true even if the device is a modem being used by the PPP daemon to connect a Linux system onto a network. As a character device is initialized its device driver registers itself with the Linux kernel by adding an entry into the chrdevs vector of device_struct data structures. The device's major device identifier (for example 4 for the tty device) is used as an index into this vector. The major device identifier for a device is fixed. Each entry in the chrdevs vector, a device_struct data structure, contains two elements, a pointer to the name of the registered device driver and a pointer to a block of file operations. This block of file operations is itself the addresses of routines within the character device driver, each of which handles specific file operations such as open, read, write and close. The contents of /proc/devices for character devices is taken from the chrdevs vector.

When a character special file representing a character device (for example /dev/cua0) is opened the kernel must set things up so that the correct character device driver's file operation routines will be called. Just like an ordinary file or directory, each device special file is represented by a VFS inode. The VFS inode for a character special file, indeed for all device special files, contains both the major and minor identifiers for the device. This VFS inode was created by the underlying filesystem, for example EXT2, from information in the real filesystem when the device special file's name was looked up.

Each VFS inode has associated with it a set of file operations and these are different depending on the filesystem object that the inode represents. Whenever a VFS inode representing a character special file is created, its file operations are set to the default character device operations. This has only one file operation, the open file operation. When the character special file is opened by an application the generic open file operation uses the device's major identifier as an index into the chrdevs vector to retrieve the file operations block for this particular device. It also sets up the file data structure describing this character special file, making its file operations pointer point to those of the device driver. Thereafter all of the application's file operations will be mapped to calls to the character device's set of file operations.
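The registration and open sequence described above can be sketched as a small user-space model. The names chrdevs and device_struct come from the text. Everything else is an assumption made for illustration: the table size MAX_CHRDEV, the cut-down struct file and struct file_operations, the two _sketch functions and the toy tty driver are not the kernel's real definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHRDEV 128   /* assumption: size of the chrdevs vector */

struct file;

/* Cut-down block of file operations: addresses of driver routines. */
struct file_operations {
    int  (*open)(struct file *f);
    long (*read)(struct file *f, char *buf, size_t n);
};

/* Cut-down open file: device identifiers plus the operations in use. */
struct file {
    unsigned major, minor;
    const struct file_operations *f_op;
};

/* One chrdevs entry: the driver's name and its file operations block. */
struct device_struct {
    const char *name;
    const struct file_operations *fops;
};

static struct device_struct chrdevs[MAX_CHRDEV];

/* Driver initialization: add an entry indexed by the major identifier. */
static int register_chrdev_sketch(unsigned major, const char *name,
                                  const struct file_operations *fops)
{
    assert(fops != NULL);
    if (major == 0 || major >= MAX_CHRDEV || chrdevs[major].fops != NULL)
        return -1;                      /* bad or already claimed major */
    chrdevs[major].name = name;
    chrdevs[major].fops = fops;
    return 0;
}

/* Generic open: look up the driver by major and switch f_op over to it. */
static int chrdev_open_sketch(struct file *f)
{
    if (f->major >= MAX_CHRDEV || chrdevs[f->major].fops == NULL)
        return -1;                      /* no driver registered */
    f->f_op = chrdevs[f->major].fops;
    return f->f_op->open ? f->f_op->open(f) : 0;
}

/* Toy driver standing in for the tty device (major 4 in the text). */
static int tty_open(struct file *f) { (void)f; return 0; }

static long tty_read(struct file *f, char *buf, size_t n)
{
    (void)f;
    if (n == 0)
        return 0;
    buf[0] = 'x';                       /* pretend one character arrived */
    return 1;
}

static const struct file_operations tty_fops = { tty_open, tty_read };
```

Once chrdev_open_sketch has run, every later call made through f_op goes straight to the driver's routines. The generic code is involved only at open time, which matches the single-operation default described above.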

8.4.2 Block Devices

Block devices also support being accessed like files. The mechanisms used to provide the correct set of file operations for the opened block special file are very much the same as for character devices. Linux maintains the set of registered block devices as the blkdevs vector. It, like the chrdevs vector, is indexed using the device's major device number. Its entries are also device_struct data structures. Unlike character devices, there are classes of block devices. SCSI devices are one such class and IDE devices are another. It is the class that registers itself with the Linux kernel and provides file operations to the kernel. The device drivers for a class of block device provide class specific interfaces to the class. So, for example, a SCSI device driver has to provide interfaces to the SCSI subsystem which the SCSI subsystem uses to provide file operations for this device to the kernel.

Every block device driver must provide an interface to the buffer cache as well as the normal file operations interface. Each block device driver fills in its entry in the blk_dev vector of blk_dev_struct data structures. The index into this vector is, again, the device's major number. The blk_dev_struct data structure consists of the address of a request routine and a pointer to a list of request data structures, each one representing a request from the buffer cache for the driver to read or write a block of data.

Figure 8.2: Buffer Cache Block Device Requests

Each time the buffer cache wishes to read or write a block of data to or from a registered device it adds a request data structure onto its blk_dev_struct. Figure 8.2 shows that each request has a pointer to one or more buffer_head data structures, each one a request to read or write a block of data. The buffer_head structures are locked (by the buffer cache) and there may be a process waiting on the block operation to this buffer to complete. Each request structure is allocated from a static list, the all_requests list. If the request is being added to an empty request list, the driver's request function is called to start processing the request queue. Otherwise the driver will simply process every request on the request list.

Once the device driver has completed a request it must remove each of the buffer_head structures from the request structure, mark them as up to date and unlock them. This unlocking of the buffer_head will wake up any process waiting on it.
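A minimal model of this request handling is sketched below. The structure names follow the text, but their fields are heavily simplified assumptions, and add_request and toy_request_fn are hypothetical helpers. Real drivers complete requests asynchronously from an interrupt handler, whereas this toy driver finishes each request immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified buffer_head: just the state the text talks about. */
struct buffer_head {
    bool locked;                   /* locked by the buffer cache */
    bool uptodate;                 /* set once the data is valid */
    struct buffer_head *b_reqnext; /* next buffer in the same request */
};

/* Simplified request: a chain of buffers to read or write. */
struct request {
    int cmd;                       /* 0 = read, 1 = write (assumption) */
    unsigned long sector;
    struct buffer_head *bh;
    struct request *next;
};

/* Per-driver entry: a request routine and the list of pending requests. */
struct blk_dev_struct {
    void (*request_fn)(struct blk_dev_struct *dev);
    struct request *current_request;
};

static int requests_completed;

/* Queue a request; kick the driver only if the list was empty. */
static void add_request(struct blk_dev_struct *dev, struct request *req)
{
    assert(dev->request_fn != NULL);
    req->next = NULL;
    if (dev->current_request == NULL) {
        dev->current_request = req;
        dev->request_fn(dev);
        return;
    }
    struct request *tail = dev->current_request;
    while (tail->next != NULL)
        tail = tail->next;
    tail->next = req;              /* driver is busy: it will get to it */
}

/* Toy driver: process every request on the list, then stop. */
static void toy_request_fn(struct blk_dev_struct *dev)
{
    while (dev->current_request != NULL) {
        struct request *req = dev->current_request;

        while (req->bh != NULL) {  /* detach, mark up to date, unlock */
            struct buffer_head *bh = req->bh;
            req->bh = bh->b_reqnext;
            bh->b_reqnext = NULL;
            bh->uptodate = true;
            bh->locked = false;    /* a sleeper on bh would wake here */
        }
        dev->current_request = req->next;
        req->next = NULL;          /* request is free for reuse */
        requests_completed++;
    }
}
```

The empty-list test in add_request is the design point. The driver is started once, and while it is working further requests are simply appended, so it finds them when it walks the list.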

iol. keeping it on spinning disk platters.it/LDP/tlk/dd/drivers. A disk is usually described by its geometry. For DOS formatted disks. 510 cylinders Units = cylinders of 2048 * 512 bytes http://ldp. The read/write heads do not physically touch the surface of the platters. 8. another an EXT2 filesystem and a third for the swap partition. A disk drive consists of one or more platters.08. The partitions of a hard disk are described by a partition table. Hard disks can be further subdivided into partitions. Track 0 is the outermost track and the highest numbered track is the track closest to the central spindle. 516MB w/64kB Cache. Extended and logical partitions were invented as a way around the limit of four primary partitions. The disk's read/write heads are responsible for reading and writing data and there is a pair for each platter. Compare this to a floppy disk which only spins at 360 RPM. extended and logical. or block. instead they float on a very thin (10 millionths of an inch) cushion of air. a tiny head magnetizes minute particles on the platter's surface.5 Hard Disks Disk drives provide a more permanent method for storing data. at boot time Linux describes one of my IDE disks as: hdb: Conner Peripherals 540MB . So all of the 5th tracks from each side of every platter in the disk is known as cylinder 5. each made of finely polished glass or ceramic composites and coated with a fine layer of iron oxide. As the number of cylinders is the same as the number of tracks. those formatted by fdisk. A common sector size is 512 bytes and the sector size was set when the disk was formatted. With a sector.html (7 di 15) [08/03/2001 10.CFS540A. The process sleeps on the buffer_head that will contain the directory entry until the device driver wakes it up. sectors and cylinder numbers. A sector is the smallest unit of data that can be written to or read from a hard disk and it is also the disk's block size. All of the read/write heads are attached together. 
Not all four entries in the partition table have to be used. A partition is a large group of sectors allocated for a particular purpose.000 RPM depending on the model. This does not match the disk's stated capacity of 516 Mbytes as some of the sectors are used for disk partitioning information. CHS=1050/16/63 This means that it has 1050 cylinders (tracks). the number of cylinders. The request data structure is marked as free so that it can be used in another block request. To write data. usually when the disk is manufactured. they contain any number of logical parititions. The read/write heads are moved across the surface of the platters by an actuator. The following is the output from fdisk for a disk containing two primary partitions: Disk /dev/sda: 64 heads. Extended partitions are not real partitions at all. concentric circles called tracks. For example. they all move across the surfaces of the platters together. Each track is divided into sectors. 16 heads (8 platters) and 63 sectors per track. size of 512 bytes this gives the disk a storage capacity of 529200 bytes. Some disks automatically find bad sectors and re-index the disk to work around them. The data is read by a head. you often see disk geometries described in terms of cylinders. A cylinder is the set of all tracks with the same number. there are four primary disk partitions. each entry describing where the partition starts and ends in terms of heads. Partitioning a disk allows the disk to be used by several operating system or for several purposes. 32 sectors. one head for each surface. one containing a DOS filesystem. Each surface of the platter is divided into narrow. heads and sectors. which can detect whether a particular minute particle is magnetized. An example of this would be where a file name is being resolved and the EXT2 filesystem must read the block of data that contains the next EXT2 directory entry from the block device that holds the filesystem. 
The platters are attached to a central spindle and spin at a constant speed that can vary between 3000 and 10.any process that has been sleeping waiting for the block operation to complete. There are three types of partition supported by fdisk. primary.53] . A lot of Linux systems have a single disk with three partitions.

   Device Boot   Begin    Start      End   Blocks   Id  System
/dev/sda1            1        1      478   489456   83  Linux native
/dev/sda2          479      479      510    32768   82  Linux swap

Expert command (m for help): p

Disk /dev/sda: 64 heads, 32 sectors, 510 cylinders

Nr AF  Hd Sec  Cyl  Hd Sec  Cyl    Start     Size ID
 1 00   1   1    0  63  32  477       32   978912 83
 2 00   0   1  478  63  32  509   978944    65536 82
 3 00   0   0    0   0   0    0        0        0 00
 4 00   0   0    0   0   0    0        0        0 00

This shows that the first partition starts at cylinder or track 0, head 1 and sector 1 and extends to include cylinder 477, sector 32 and head 63. As there are 32 sectors in a track and 64 read/write heads, this partition is a whole number of cylinders in size. fdisk aligns partitions on cylinder boundaries by default. It starts at the outermost cylinder (0) and extends inwards, towards the spindle, for 478 cylinders. The second partition, the swap partition, starts at the next cylinder (478) and extends to the innermost cylinder of the disk.

Figure 8.3: Linked list of disks

During initialization Linux maps the topology of the hard disks in the system. It finds out how many hard disks there are and of what type. Additionally, Linux discovers how the individual disks have been partitioned. This is all represented by a list of gendisk data structures pointed at by the gendisk_head list pointer. As each disk subsystem, for example IDE, is initialized it generates gendisk data structures representing the disks that it finds. It does this at the same time as it registers its file operations and adds its entry into the blk_dev data structure. Each gendisk data structure has a unique major device number and these match the major numbers of the block special devices. For example, the SCSI disk subsystem creates a single gendisk entry (``sd'') with a major number of 8, the major number of all SCSI disk devices. Figure 8.3 shows two gendisk entries, the first one for the SCSI disk subsystem and the second for an IDE disk controller. This is ide0, the primary IDE controller.

Although the disk subsystems build the gendisk entries during their initialization they are only used by Linux during partition checking. Instead, each disk subsystem maintains its own data structures which allow it to map device special major and minor device numbers to partitions within physical disks.
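The numbers in the fdisk expert listing, and the capacity quoted earlier for the CHS=1050/16/63 disk, can be checked with the usual cylinder/head/sector arithmetic. The sketch below is an illustration: struct chs_geometry and the three functions are names invented here, and the conversion assumes the common convention that heads and cylinders count from 0 while sectors count from 1.

```c
#include <assert.h>
#include <stdint.h>

/* Disk geometry as reported at boot or by fdisk. */
struct chs_geometry {
    uint32_t cylinders;
    uint32_t heads;
    uint32_t sectors;   /* sectors per track */
};

/* Total number of sectors on the disk. */
static uint64_t chs_total_sectors(const struct chs_geometry *g)
{
    return (uint64_t)g->cylinders * g->heads * g->sectors;
}

/* Capacity in Kbytes for 512 byte sectors (two sectors per Kbyte). */
static uint64_t chs_capacity_kbytes(const struct chs_geometry *g)
{
    return chs_total_sectors(g) / 2;
}

/* Absolute sector number of (cylinder, head, sector); sector is 1-based. */
static uint64_t chs_to_lba(const struct chs_geometry *g,
                           uint32_t cyl, uint32_t head, uint32_t sector)
{
    assert(cyl < g->cylinders && head < g->heads);
    assert(sector >= 1 && sector <= g->sectors);
    return ((uint64_t)cyl * g->heads + head) * g->sectors + (sector - 1);
}
```

For the 64 head, 32 sector, 510 cylinder disk, cylinder 0, head 1, sector 1 is absolute sector 32, which is the Start value of the first partition. The partition ends at cylinder 477, head 63, sector 32, giving 978912 sectors, the listed Size. The second partition begins at cylinder 478, head 0, sector 1, which is sector 978944.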

major and minor device numbers to partitions within physical disks. Whenever a block device is read from or written to, either via the buffer cache or file operations, the kernel directs the operation to the appropriate device using the major device number found in its block special device file (for example /dev/sda2). It is the individual device driver or subsystem that maps the minor device number to the real physical device.

8.5.1 IDE Disks

The most common disks used in Linux systems today are Integrated Drive Electronics or IDE disks. IDE is a disk interface rather than an I/O bus like SCSI. Each IDE controller can support up to two disks, one the master disk and the other the slave disk. The master and slave functions are usually set by jumpers on the disk. The first IDE controller in the system is known as the primary IDE controller, the next the secondary controller and so on. IDE can manage about 3.3 Mbytes per second of data transfer to or from the disk and the maximum IDE disk size is 538 Mbytes. Extended IDE, or EIDE, has raised the disk size to a maximum of 8.6 Gbytes and the data transfer rate up to 16.6 Mbytes per second. IDE and EIDE disks are cheaper than SCSI disks and most modern PCs contain one or more on board IDE controllers.

Linux names IDE disks in the order in which it finds their controllers. The master disk on the primary controller is /dev/hda and the slave disk is /dev/hdb. /dev/hdc is the master disk on the secondary IDE controller. The IDE subsystem registers IDE controllers and not disks with the Linux kernel. The major identifier for the primary IDE controller is 3 and is 22 for the secondary IDE controller. This means that if a system has two IDE controllers there will be entries for the IDE subsystem at indices 3 and 22 in the blk_dev and blkdevs vectors. The block special files for IDE disks reflect this numbering: disks /dev/hda and /dev/hdb, both connected to the primary IDE controller, have a major identifier of 3. Any file or buffer cache operations on these block special files will be directed to the IDE subsystem, as the kernel uses the major identifier as an index. When the request is made, it is up to the IDE subsystem to work out which IDE disk the request is for. To do this the IDE subsystem uses the minor device number from the device special identifier; this contains information that allows it to direct the request to the correct partition of the correct disk. The device identifier for /dev/hdb, the slave IDE drive on the primary IDE controller, is (3,64). The device identifier for the first partition of that disk (/dev/hdb1) is (3,65).

8.5.2 Initializing the IDE Subsystem

IDE disks have been around for much of the IBM PC's history. Throughout this time the interface to these devices has changed. This makes the initialization of the IDE subsystem more complex than it might at first appear.

The maximum number of IDE controllers that Linux can support is 4. Each controller is represented by an ide_hwif_t data structure in the ide_hwifs vector. Each ide_hwif_t data structure contains two ide_drive_t data structures, one per possible supported master and slave IDE drive. During the initializing of the IDE subsystem, Linux first looks to see if there is information about the disks present in the system's CMOS memory. This is battery backed memory that does not lose its contents when the PC is powered off. This CMOS memory is actually in the system's real time clock device, which always runs no matter if your PC is on or off. The CMOS memory locations are set up by the system's BIOS and tell Linux what IDE controllers and drives have been found. Linux retrieves the found disk's geometry from BIOS and uses the information to set up the ide_hwif_t data structure for this drive. More modern PCs use PCI chipsets such as Intel's 82430 VX chipset which includes a PCI EIDE controller. The IDE subsystem uses PCI BIOS callbacks to locate the PCI (E)IDE controllers in the system. It then calls PCI specific interrogation routines for those chipsets that are present.

Once each IDE interface or controller has been discovered, its ide_hwif_t is set up to reflect the controllers and attached disks. During operation the IDE driver writes commands to IDE command registers that exist in the I/O memory space. The default I/O address for the primary IDE controller's control and status registers is 0x1F0 - 0x1F7. These addresses were set by convention in the early days of the IBM PC. The IDE driver registers each controller with the Linux block buffer cache and VFS, adding it to the blk_dev and blkdevs vectors respectively. The IDE driver will also request control of the appropriate interrupt. Again these interrupts are set by convention to be 14 for the primary IDE controller and 15 for the secondary IDE controller. However, they, like all IDE details, can be overridden by command line options to the kernel.
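The device identifiers quoted in section 8.5.1 can be reproduced with a few lines of arithmetic. The following is only a sketch: the helper name ide_device_id is hypothetical, and the figure of 64 minor numbers per drive is an assumption inferred from the (3,64) and (3,65) examples in the text. Only the two controllers whose major identifiers the text gives are handled.

```python
# Major identifiers stated in the text for the primary and secondary controllers.
IDE_MAJORS = {0: 3, 1: 22}

# Assumption: each drive owns 64 minor numbers, inferred from /dev/hdb = (3,64).
MINORS_PER_DRIVE = 64


def ide_device_id(name):
    """Return (major, minor) for an IDE name such as 'hdb' or 'hdb1'.

    Hypothetical helper for illustration; it is not a kernel routine.
    """
    if not name.startswith("hd") or len(name) < 3:
        raise ValueError("expected a name like 'hda' or 'hdb1'")
    drive = ord(name[2]) - ord("a")          # hda=0, hdb=1, hdc=2, hdd=3
    controller, unit = divmod(drive, 2)      # unit 0 is the master, 1 the slave
    if controller not in IDE_MAJORS:
        raise ValueError("only the primary and secondary controllers are modelled")
    partition = int(name[3:]) if name[3:] else 0   # 0 means the whole disk
    if not 0 <= partition < MINORS_PER_DRIVE:
        raise ValueError("partition number out of range")
    return IDE_MAJORS[controller], unit * MINORS_PER_DRIVE + partition
```

Under these assumptions the whole slave disk on the primary controller and its first partition come out as (3,64) and (3,65), matching the identifiers given above.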

The IDE driver also adds a gendisk entry into the list of gendisks discovered during boot for each IDE controller found. This list will later be used to discover the partition tables of all of the hard disks found at boot time. The partition checking code understands that IDE controllers may each control two IDE disks.

8.5.3 SCSI Disks

The SCSI (Small Computer System Interface) bus is an efficient peer-to-peer data bus that supports up to eight devices per bus, including one or more hosts. Each device has to have a unique identifier and this is usually set by jumpers on the disks. Data can be transferred synchronously or asynchronously between any two devices on the bus and with 32 bit wide data transfers up to 40 Mbytes per second are possible. The SCSI bus transfers both data and state information between devices, and a single transaction between an initiator and a target can involve up to eight distinct phases. You can tell the current phase of a SCSI bus from five signals from the bus. The eight phases are:

BUS FREE: No device has control of the bus and there are no transactions currently happening.
ARBITRATION: A SCSI device has attempted to get control of the SCSI bus; it does this by asserting its SCSI identifier onto the address pins. The highest number SCSI identifier wins.
SELECTION: When a device has succeeded in getting control of the SCSI bus through arbitration it must now signal the target of this SCSI request that it wants to send a command to it. It does this by asserting the SCSI identifier of the target on the address pins.
RESELECTION: SCSI devices may disconnect during the processing of a request. The target may then reselect the initiator. Not all SCSI devices support this phase.
COMMAND: 6, 10 or 12 bytes of command can be transferred from the initiator to the target.
DATA IN, DATA OUT: During these phases data is transferred between the initiator and the target.
STATUS: This phase is entered after completion of all commands and allows the target to send a status byte indicating success or failure to the initiator.
MESSAGE IN, MESSAGE OUT: Additional information is transferred between the initiator and the target.

The Linux SCSI subsystem is made up of two basic elements, each of which is represented by data structures:

host: A SCSI host is a physical piece of hardware, a SCSI controller. The NCR810 PCI SCSI controller is an example of a SCSI host. If a Linux system has more than one SCSI controller of the same type, each instance will be represented by a separate SCSI host. This means that a SCSI device driver may control more than one instance of its controller. SCSI hosts are almost always the initiator of SCSI commands.
device: The most common type of SCSI device is a SCSI disk but the SCSI standard supports several more types: tape, CD-ROM and also a generic SCSI device. SCSI devices are almost always the targets of SCSI commands. These devices must be treated differently; for example, with removable media such as CD-ROMs or tapes, Linux needs to detect if the media was removed. The different disk types have different major device numbers, allowing Linux to direct block device requests to the appropriate SCSI type.
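The arbitration rule described for the SCSI bus (the highest numbered identifier wins) is simple enough to model directly. This is only a toy model of the rule as stated in the text, not of the electrical signalling; the function name arbitrate is hypothetical.

```python
def arbitrate(contending_ids):
    """Return the SCSI identifier that wins arbitration, or None.

    Toy model: every contender asserts its identifier and the highest
    identifier gains control of the bus.
    """
    ids = set(contending_ids)
    if not ids:
        return None  # nobody is contending, so the bus stays in BUS FREE
    if any(i not in range(8) for i in ids):
        raise ValueError("identifiers on an eight device bus are 0 to 7")
    return max(ids)
```

With devices 2, 5 and 7 contending, the device with identifier 7 would take the bus and move on to the SELECTION phase.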

Initializing the SCSI Subsystem

Initializing the SCSI subsystem is quite complex, reflecting the dynamic nature of SCSI buses and their devices. Linux initializes the SCSI subsystem at boot time; it finds the SCSI controllers (known as SCSI hosts) in the system and then probes each of their SCSI buses finding all of their devices. It then initializes those devices and makes them available to the rest of the Linux kernel via the normal file and buffer cache block device operations. This initialization is done in four phases:

First, Linux finds out which of the SCSI host adapters, or controllers, that were built into the kernel at kernel build time have hardware to control. Each built in SCSI host has a Scsi_Host_Template entry in the builtin_scsi_hosts vector. The Scsi_Host_Template data structure contains pointers to routines that carry out SCSI host specific actions such as detecting what SCSI devices are attached to this SCSI host. These routines are called by the SCSI subsystem as it configures itself and they are part of the SCSI device driver supporting this host type. Each detected SCSI host, those for which there are real SCSI devices attached, has its Scsi_Host_Template data structure added to the scsi_hosts list of active SCSI hosts. Each instance of a detected host type is represented by a Scsi_Host data structure held in the scsi_hostlist list. For example a system with two NCR810 PCI SCSI controllers would have two Scsi_Host entries in the list, one per controller. Each Scsi_Host points at the Scsi_Host_Template representing its device driver.

Figure 8.4: SCSI Data Structures

Now that every SCSI host has been discovered, the SCSI subsystem must find out what SCSI devices are attached to each host's bus. SCSI devices are numbered between 0 and 7 inclusively, each device's number or SCSI identifier being unique on the SCSI bus to which it is attached. SCSI identifiers are usually set by jumpers on the device. The SCSI initialization code finds each SCSI device on a SCSI bus by sending it a TEST_UNIT_READY command. When a device responds, its identification is read by sending it an INQUIRY command. This gives Linux the vendor's name and the device's model and revision names. SCSI commands are represented by a Scsi_Cmnd data structure and these are passed to the device driver for this SCSI host by calling the device driver routines within its Scsi_Host_Template data structure. Every SCSI device that is found is represented by a Scsi_Device data structure, each of which points to its parent Scsi_Host. All of the Scsi_Device data structures are added to the scsi_devices list. Figure 8.4 shows how the main data structures relate to one another.

There are four SCSI device types: disk, tape, CD and generic. Each of these SCSI types is individually registered with the kernel as a different major block device type. However they will only register themselves if one or more of a given SCSI device type has been found. Each SCSI type, for example SCSI disk, maintains its own tables of devices. It uses these tables to direct kernel block operations (file or buffer cache) to the correct device driver or SCSI host. Each SCSI type is represented by a Scsi_Device_Template data structure. This contains information about this type of SCSI device and the addresses of routines to perform various tasks. The SCSI subsystem uses these templates to call the SCSI type routines for each type of SCSI device. In other words, if the SCSI subsystem wishes to attach a SCSI disk device it will call the SCSI disk type attach routine. The Scsi_Device_Template data structures are added to the scsi_devicelist list if one or more SCSI devices of that type have been detected.

The final phase of the SCSI subsystem initialization is to call the finish functions for each registered Scsi_Device_Template. For the SCSI disk type this spins up all of the SCSI disks that were found and then records their disk geometry. It also adds the gendisk data structure representing all SCSI disks to the linked list of disks shown in Figure 8.3.

Delivering Block Device Requests

Once Linux has initialized the SCSI subsystem, the SCSI devices may be used. Each active SCSI device type registers itself with the kernel so that Linux can direct block device requests to it. There can be buffer cache requests via blk_dev or file operations via blkdevs. Taking a SCSI disk driver that has one or more EXT2 filesystem partitions as an example, how do kernel buffer requests get directed to the right SCSI disk when one of its EXT2 partitions is mounted?

Each request to read or write a block of data to or from a SCSI disk partition results in a new request structure being added to the SCSI disk's current_request list in the blk_dev vector. If the request list is being processed, the buffer cache need not do anything else; otherwise it must nudge the SCSI disk subsystem to go and process its request queue. Each SCSI disk in the system is represented by a Scsi_Disk data structure. These are kept in the rscsi_disks vector that is indexed using part of the SCSI disk partition's minor device number. For example, /dev/sdb1 has a major number of 8 and a minor number of 17; this generates an index of 1. Each Scsi_Disk data structure contains a pointer to the Scsi_Device data structure representing this device. That in turn points at the Scsi_Host data structure which ``owns'' it. The request data structures from the buffer cache are translated into Scsi_Cmnd structures describing the SCSI command that needs to be sent to the SCSI device and this is queued onto the Scsi_Host structure representing this device. These will be processed by the individual SCSI device driver once the appropriate data blocks have been read or written.
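The /dev/sdb1 example, where minor number 17 generates an index of 1 into the rscsi_disks vector, can be worked through as follows. This is a sketch under one assumption: that each SCSI disk owns 16 consecutive minor numbers, which is consistent with the example but is not stated in the text. The helper names are hypothetical.

```python
SCSI_DISK_MAJOR = 8      # from the text: /dev/sdb1 has a major number of 8
MINORS_PER_DISK = 16     # assumption inferred from minor 17 giving index 1


def split_sd_minor(minor):
    """Split a SCSI disk minor number into (disk index, partition number)."""
    if not 0 <= minor < 256:
        raise ValueError("minor numbers are modelled here as 0 to 255")
    return divmod(minor, MINORS_PER_DISK)


def sd_name(minor):
    """Return the conventional name, for example 'sdb1', for a minor number."""
    disk_index, partition = split_sd_minor(minor)
    name = "sd" + chr(ord("a") + disk_index)
    # Partition 0 stands for the whole disk, so it gets no numeric suffix.
    return name + (str(partition) if partition else "")
```

The disk index is the part of the minor number that selects the Scsi_Disk entry; the remainder selects the partition on that disk.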

8.6 Network Devices

A network device is, so far as Linux's network subsystem is concerned, an entity that sends and receives packets of data. This is normally a physical device such as an ethernet card. Some network devices though are software only, such as the loopback device which is used for sending data to yourself. Each network device is represented by a device data structure. Network device drivers register the devices that they control with Linux during network initialization at kernel boot time. The device data structure contains information about the device and the addresses of functions that allow the various supported network protocols to use the device's services. These functions are mostly concerned with transmitting data using the network device. The device uses standard networking support mechanisms to pass received data up to the appropriate protocol layer. All network data (packets) transmitted and received are represented by sk_buff data structures; these are flexible data structures that allow network protocol headers to be easily added and removed. How the network protocol layers use the network devices, and how they pass data back and forth using sk_buff data structures, is described in detail in the Networks chapter. This chapter concentrates on the device data structure and on how network devices are discovered and initialized.

The device data structure contains information about the network device:

Name: Unlike block and character devices which have their device special files created using the mknod command, network device special files appear spontaneously as the system's network devices are discovered and initialized. Their names are standard, each name representing the type of device that it is. Multiple devices of the same type are numbered upwards from 0. Thus the ethernet devices are known as /dev/eth0, /dev/eth1, /dev/eth2 and so on. Some common network devices are:
    /dev/ethN: Ethernet devices
    /dev/slN: SLIP devices
    /dev/pppN: PPP devices
    /dev/lo: Loopback devices

Bus Information: This is information that the device driver needs in order to control the device. The irq number is the interrupt that this device is using. The base address is the address of any of the device's control and status registers in I/O memory. The DMA channel is the DMA channel number that this network device is using. All of this information is set at boot time as the device is initialized.

Interface Flags: These describe the characteristics and abilities of the network device:
    IFF_UP: Interface is up and running
    IFF_BROADCAST: Broadcast address in device is valid
    IFF_DEBUG: Device debugging turned on
    IFF_LOOPBACK: This is a loopback device
    IFF_POINTOPOINT: This is a point to point link (SLIP and PPP)
    IFF_NOTRAILERS: No network trailers
    IFF_RUNNING: Resources allocated
    IFF_NOARP: Does not support ARP protocol
    IFF_PROMISC: Device in promiscuous receive mode, it will receive all packets no matter who they are addressed to
    IFF_ALLMULTI: Receive all IP multicast frames
    IFF_MULTICAST: Can receive IP multicast frames

Protocol Information: Each device describes how it may be used by the network protocol layers:
    mtu: The size of the largest packet that this network can transmit, not including any link layer headers that it needs to add. This maximum is used by the protocol layers, for example IP, to select suitable packet sizes to send.
    Family: The family indicates the protocol family that the device can support. The family for all Linux network devices is AF_INET, the Internet address family.
    Type: The hardware interface type describes the media that this network device is attached to. There are many different types of media that Linux network devices support. These include Ethernet, X.25, Token Ring, Slip, PPP and Apple Localtalk.
    Addresses: The device data structure holds a number of addresses that are relevant to this network device, including its IP addresses.

Packet Queue: This is the queue of sk_buff packets queued waiting to be transmitted on this network device.

Support Functions: Each device provides a standard set of routines that protocol layers call as part of their interface to this device's link layer. These include setup and frame transmit routines as well as routines to add standard frame headers and collect statistics. These statistics can be seen using the ifconfig command.

8.6.1 Initializing Network Devices

Network device drivers can, like other Linux device drivers, be built into the Linux kernel. Each potential network device is represented by a device data structure within the network device list pointed at by the dev_base list pointer. The network layers call one of a number of network device service routines whose addresses are held in the device data structure if they need device specific work performing. Initially though, each device data structure holds only the address of an initialization or probe routine.

There are two problems to be solved for network device drivers. Firstly, not all of the network device drivers built into the Linux kernel will have devices to control. Secondly, the ethernet devices in the system are always called /dev/eth0, /dev/eth1 and so on, no matter what their underlying device drivers are. The problem of ``missing'' network devices is easily solved. As the initialization routine for each network device is called, it returns a status indicating whether or not it located an instance of the controller that it is driving. If the driver could not find any devices, its entry in the device list pointed at by dev_base is removed. If the driver could find a device it fills out the rest of the device data structure with information about the device and the addresses of the support functions within the network device driver.

The second problem, that of dynamically assigning ethernet devices to the standard /dev/ethN device special files, is solved more elegantly. There are eight standard entries in the devices list: one for eth0, eth1 and so on to eth7. The initialization routine is the same for all of them; it tries each ethernet device driver built into the kernel in turn until one finds a device. When the driver finds its ethernet device it fills out the ethN device data structure, which it now owns. It is also at this time that the network device driver initializes the physical hardware that it is controlling and works out which IRQ it is using, which DMA channel (if any) and so on. A driver may find several instances of the network device that it is controlling and, in this case, it will take over several of the /dev/ethN device data structures. Once all eight standard /dev/ethN have been allocated, no more ethernet devices will be probed for.
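The way the eight standard ethN entries are handed out can be illustrated with a small simulation. It is a loose model of the behaviour described in section 8.6.1, not of the kernel's probe code: the driver names are made up, and each "driver" is reduced to a count of the cards it would find, which is an assumption made purely for illustration.

```python
def assign_eth_slots(drivers, slots=8):
    """Simulate the shared ethN initialization routine.

    drivers is a list of (driver_name, cards_present) pairs given in the
    order in which the drivers are tried. Returns {"ethN": driver_name}.
    """
    remaining = {name: count for name, count in drivers}
    assigned = {}
    for n in range(slots):
        for name, _ in drivers:          # try each built-in driver in turn
            if remaining[name] > 0:      # this driver's probe found a card
                remaining[name] -= 1
                assigned["eth%d" % n] = name   # the driver now owns ethN
                break
        # If no driver found a device, this ethN entry is simply not used,
        # mirroring the removal of the entry from the device list.
    return assigned
```

For example, with a driver that finds nothing, one that finds two cards and one that finds a single card, the two-card driver takes eth0 and eth1 and the last driver takes eth2, whatever the drivers are called.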

Table of Contents.53] . Show Frames.it/LDP/tlk/dd/drivers. http://ldp.Top of Chapter.html (15 di 15) [08/03/2001 10.iol.08. No Frames © 1996-1999 David A Rusling copyright notice.

Chapter 9
The File system

This chapter describes how the Linux kernel maintains the files in the file systems that it supports. It describes the Virtual File System (VFS) and explains how the Linux kernel's real file systems are supported.

One of the most important features of Linux is its support for many different file systems. This makes it very flexible and well able to coexist with many other operating systems. At the time of writing, Linux supports 15 file systems: ext, ext2, xia, minix, umsdos, msdos, vfat, proc, smb, ncp, iso9660, sysv, hpfs, affs and ufs, and no doubt, over time more will be added.

In Linux, as it is for Unix, the separate file systems the system may use are not accessed by device identifiers (such as a drive number or a drive name) but instead they are combined into a single hierarchical tree structure that represents the file system as one whole single entity. Linux adds each new file system into this single file system tree as it is mounted. All file systems, of whatever type, are mounted onto a directory and the files of the mounted file system cover up the existing contents of that directory. This directory is known as the mount directory or mount point. When the file system is unmounted, the mount directory's own files are once again revealed.

When disks are initialized (using fdisk, say) they have a partition structure imposed on them that divides the physical disk into a number of logical partitions. Each partition may hold a single file system, for example an EXT2 file system. File systems organize files into logical hierarchical structures with directories, soft links and so on held in blocks on physical devices. Devices that can contain file systems are known as block devices. The IDE disk partition /dev/hda1, the first partition of the first IDE disk drive in the system, is a block device. The Linux file systems regard these block devices as simply linear collections of blocks; they do not know or care about the underlying physical disk's geometry. It is the task of each block device driver to map a request to read a particular block of its device into terms meaningful to its device: the particular track, sector and cylinder of its hard disk where the block is kept. A file system has to look, feel and operate in the same way no matter what device is holding it. Moreover, using Linux's file systems, it does not matter (at least to the system user) that these different file systems are on different physical media controlled by different hardware controllers. The file system might not even be on the local system; it could just as well be a disk remotely mounted over a network link. Consider the following example where a Linux system has its root file system on a SCSI disk:

A       C       D       E       F       bin
boot    cdrom   dev     etc     fd      home
lib     proc    mnt     opt     tmp     root
var     lost+found      usr     sbin

Neither the users nor the programs that operate on the files themselves need know that /C is in fact a mounted VFAT file system that is on the first IDE disk in the system. In the example (which is actually my home Linux system), /E is the master IDE disk on the second IDE controller. It does not matter either that the first IDE controller is a PCI controller and that the second is an ISA controller which also controls the IDE CDROM. I can dial into the network where I work using a modem and the PPP network protocol, and in this case I can remotely mount my Alpha AXP Linux system's file systems on /mnt/remote.

The files in a file system are collections of data; the file holding the sources to this chapter is an ASCII file called filesystems.tex. A file system not only holds the data that is contained within the files of the file system but also the structure of the file system. It holds all of the information that Linux users and processes see as files, directories, soft links, file protection information and so on. Moreover it must hold that information safely and securely; the basic integrity of the operating system depends on its file systems. Nobody would use an operating system that randomly lost data and files.

Minix, the first file system that Linux had, is rather restrictive and lacking in performance. Its filenames cannot be longer than 14 characters (which is still better than 8.3 filenames) and the maximum file size is 64 Mbytes. 64 Mbytes might at first glance seem large enough but large file sizes are necessary to hold even modest databases. The first file system designed specifically for Linux, the Extended File system, or EXT, was introduced in April 1992 and cured a lot of the problems but it was still felt to lack performance. So, in 1993, the Second Extended File system, or EXT2, was added. It is this file system that is described in detail later on in this chapter.

An important development took place when the EXT file system was added into Linux. The real file systems were separated from the operating system and system services by an interface layer known as the Virtual File system, or VFS. VFS allows Linux to support many, often very different, file systems, each presenting a common software interface to the VFS. All of the details of the Linux file systems are translated by software so that all file systems appear identical to the rest of the Linux kernel and to programs running in the system. Linux's Virtual File system layer allows you to transparently mount the many different file systems at the same time.

The Linux Virtual File system is implemented so that access to its files is as fast and efficient as possible. It must also make sure that the files and their data are kept correctly. These two requirements can be at odds with each other. The Linux VFS caches information in memory from each file system as it is mounted and used. A lot of care must be taken to update the file system correctly as data within these caches is modified as files and directories are created, written to and deleted. If you could see the file system's data structures within the running kernel, you would be able to see data blocks being read and written by the file system. Data structures describing the files and directories being accessed would be created and destroyed, and all the time the device drivers would be working away, fetching and saving data. The most important of these caches is the Buffer Cache, which is integrated into the way that the individual file systems access their underlying block devices. As blocks are accessed they are put into the Buffer Cache and kept on various queues depending on their states. The Buffer Cache not only caches data buffers, it also helps manage the asynchronous interface with the block device drivers.

9.1 The Second Extended File system (EXT2)

Figure 9.1: Physical Layout of the EXT2 File system

The Second Extended File system was devised (by Rémy Card) as an extensible and powerful file system for Linux. It is also the most successful file system so far in the Linux community and is the basis for all of the currently shipping Linux distributions.

The EXT2 file system, like a lot of the file systems, is built on the premise that the data held in files is kept in data blocks. These data blocks are all of the same length and, although that length can vary between different EXT2 file systems, the block size of a particular EXT2 file system is set when it is created (using mke2fs). Every file's size is rounded up to an integral number of blocks. If the block size is 1024 bytes, then a file of 1025 bytes will occupy two 1024 byte blocks. Unfortunately this means that on average you waste half a block per file. Usually in computing you trade off CPU usage for memory and disk space utilisation. In this case Linux, along with most operating systems, trades off a relatively inefficient disk usage in order to reduce the workload on the CPU. Not all of the blocks in the file system hold data; some must be used to contain the information that describes the structure of the file system. EXT2 defines the file system topology by describing each file in the system with an inode data structure. An inode describes which blocks the data within a file occupies as well as the access rights of the file, the file's modification times and the type of the file. Every file in the EXT2 file system is described by a single inode and each inode has a single unique number identifying it. The inodes for the file system are all kept together in inode tables. EXT2 directories are simply special files (themselves described by inodes) which contain pointers to the inodes of their directory entries.

Figure 9.1 shows the layout of the EXT2 file system as occupying a series of blocks in a block structured device. So far as each file system is concerned, block devices are just a series of blocks that can be read and written. A file system does not need to concern itself with where on the physical media a block should be put; that is the job of the device's driver. Whenever a file system needs to read information or data from the block device containing it, it requests that its supporting device driver reads an integral number of blocks. The EXT2 file system divides the logical partition that it occupies into Block Groups. Each group duplicates information critical to the integrity of the file system as well as holding real files and directories as blocks of information and data. This duplication is necessary should a disaster occur and the file system need recovering. The subsections describe in more detail the contents of each Block Group.
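The rounding of file sizes to whole blocks is easy to check by hand. The two helper functions below are hypothetical names chosen for this sketch; the 1024 byte default simply mirrors the example block size used in the text.

```python
def blocks_needed(file_size, block_size=1024):
    """Number of whole blocks occupied by a file of file_size bytes."""
    return -(-file_size // block_size)   # ceiling division on integers


def wasted_bytes(file_size, block_size=1024):
    """Bytes allocated to the file but not used by its data."""
    return blocks_needed(file_size, block_size) * block_size - file_size
```

A 1025 byte file needs two 1024 byte blocks and leaves 1023 bytes of the second block unused, which is the worst case behind the "half a block per file on average" observation.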

9.1.1 The EXT2 Inode

Figure 9.2: EXT2 Inode

In the EXT2 file system, the inode is the basic building block; every file and directory in the file system is described by one and only one inode. The EXT2 inodes for each Block Group are kept in the inode table together with a bitmap that allows the system to keep track of allocated and unallocated inodes. Figure 9.2 shows the format of an EXT2 inode; amongst other information, it contains the following fields:

mode: This holds two pieces of information: what this inode describes and the permissions that users have to it. For EXT2, an inode can describe one of file, directory, symbolic link, block device, character device or FIFO.
Owner Information: The user and group identifiers of the owners of this file or directory. This allows the file system to correctly allow the right sort of accesses.
Size: The size of the file in bytes.
Timestamps: The time that the inode was created and the last time that it was modified.
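The two pieces of information packed into the mode field can be separated with Python's standard stat module. This sketch assumes that the EXT2 mode field uses the same bit layout as the mode reported by the stat system call on Linux; describe_mode is a hypothetical helper name.

```python
import stat

# The six kinds of object the text says an EXT2 inode can describe.
KINDS = [
    (stat.S_ISREG, "file"),
    (stat.S_ISDIR, "directory"),
    (stat.S_ISLNK, "symbolic link"),
    (stat.S_ISBLK, "block device"),
    (stat.S_ISCHR, "character device"),
    (stat.S_ISFIFO, "FIFO"),
]


def describe_mode(mode):
    """Return (kind of object, permission bits) for an inode mode value."""
    kind = next((label for test, label in KINDS if test(mode)), "unknown")
    return kind, stat.S_IMODE(mode)
```

For instance the octal mode 040755 would be reported as a directory with permission bits 755.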

Datablocks: Pointers to the blocks that contain the data that this inode is describing. The first twelve are pointers to the physical blocks containing the data described by this inode and the last three pointers contain more and more levels of indirection. For example, the double indirect blocks pointer points at a block of pointers to blocks of pointers to data blocks. This means that files less than or equal to twelve data blocks in length are more quickly accessed than larger files.

You should note that EXT2 inodes can describe special device files. These are not real files but handles that programs can use to access devices. All of the device files in /dev are there to allow programs to access Linux's devices. For example the mount program takes as an argument the device file that it wishes to mount.

9.1.2 The EXT2 Superblock

The Superblock contains a description of the basic size and shape of this file system. The information within it allows the file system manager to use and maintain the file system. Usually only the Superblock in Block Group 0 is read when the file system is mounted but each Block Group contains a duplicate copy in case of file system corruption. Amongst other information it holds the:

Magic Number: This allows the mounting software to check that this is indeed the Superblock for an EXT2 file system. For the current version of EXT2 this is 0xEF53.
Revision Level: The major and minor revision levels allow the mounting code to determine whether or not this file system supports features that are only available in particular revisions of the file system. There are also feature compatibility fields which help the mounting code to determine which new features can safely be used on this file system.
Mount Count and Maximum Mount Count: Together these allow the system to determine if the file system should be fully checked. The mount count is incremented each time the file system is mounted and when it equals the maximum mount count the warning message ``maximal mount count reached, running e2fsck is recommended'' is displayed.
Block Group Number: The Block Group number that holds this copy of the Superblock.
Block Size: The size of the block for this file system in bytes, for example 1024 bytes.
Blocks per Group: The number of blocks in a group. Like the block size this is fixed when the file system is created.
Free Blocks: The number of free blocks in the file system.
Free Inodes: The number of free Inodes in the file system.
First Inode: This is the inode number of the first inode in the file system. The first inode in an EXT2 root file system would be the directory entry for the '/' directory.
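The magic number and mount count checks described for the Superblock might look roughly like this when applied to the raw bytes of a superblock. Treat it as a sketch: the byte offsets (52, 54 and 56, little-endian 16 bit values) are assumptions about the on-disk layout that should be verified against the kernel's EXT2 header file, and check_superblock is a hypothetical name. The text says the warning appears when the mount count equals the maximum; the sketch also flags counts that have gone past it.

```python
import struct

EXT2_MAGIC = 0xEF53          # magic number given in the text

# Assumed offsets of the fields inside the superblock; verify before relying on them.
MNT_COUNT_OFFSET = 52        # mount count, unsigned 16 bit
                             # maximum mount count follows at 54, signed 16 bit
                             # magic number follows at 56, unsigned 16 bit


def check_superblock(raw):
    """Validate the magic number and report whether a full check is due."""
    mnt_count, max_mnt_count, magic = struct.unpack_from(
        "<HhH", raw, MNT_COUNT_OFFSET)
    if magic != EXT2_MAGIC:
        raise ValueError("this is not an EXT2 superblock")
    # A negative maximum is treated here as "never force a check" (assumption).
    needs_check = max_mnt_count >= 0 and mnt_count >= max_mnt_count
    return {
        "mount_count": mnt_count,
        "max_mount_count": max_mnt_count,
        "needs_check": needs_check,
    }
```

The function takes the superblock as a bytes-like object, so it can be tried on a synthetic buffer without touching a real disk.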

9.1.3 The EXT2 Group Descriptor

Each Block Group has a data structure describing it. Like the Superblock, all the group descriptors for all of the Block Groups are duplicated in each Block Group in case of file system corruption.

Each Group Descriptor contains the following information:

Blocks Bitmap: The block number of the block allocation bitmap for this Block Group. This is used during block allocation and deallocation.
Inode Bitmap: The block number of the inode allocation bitmap for this Block Group. This is used during inode allocation and deallocation.
Inode Table: The block number of the starting block for the inode table for this Block Group. Each inode is represented by the EXT2 inode data structure described above.
Free blocks count, Free Inodes count, Used directory count

The group descriptors are placed one after another and together they make the group descriptor table. Each Block Group contains the entire table of group descriptors after its copy of the Superblock. Only the first copy (in Block Group 0) is actually used by the EXT2 file system. The other copies are there, like the copies of the Superblock, in case the main copy is corrupted.

9.1.4 EXT2 Directories
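How an allocation bitmap is used during allocation can be sketched with a bytearray standing in for the bitmap block. Two things are assumptions here: that bit 0 is the least significant bit of the first byte, and the first-fit policy, which is chosen only for brevity and is not claimed to be the EXT2 allocator's actual strategy. The function names are hypothetical.

```python
def first_free(bitmap):
    """Return the index of the first clear bit, or None if all are set."""
    for byte_index, byte in enumerate(bitmap):
        if byte != 0xFF:                     # at least one free bit in this byte
            for bit in range(8):
                if not byte & (1 << bit):
                    return byte_index * 8 + bit
    return None


def allocate(bitmap):
    """Mark the first free entry as allocated and return its index."""
    index = first_free(bitmap)
    if index is not None:
        bitmap[index // 8] |= 1 << (index % 8)
    return index
```

The same routine serves for both the blocks bitmap and the inode bitmap, since each is just one bit per block or inode in the group.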

08. name length The length of this directory entry in bytes.3 shows the layout of a directory entry in memory. http://ldp.3.3: EXT2 Directory In the EXT2 file system.'' entries meaning ``this directory'' and ``the parent directory'' respectively. A directory file is a list of directory entries. the directory entry for the file called file has a reference to inode number i1.it/LDP/tlk/fs/filesystem.html (7 di 20) [08/03/2001 10.'' and ``. The first two entries for every directory are always the standard ``.. name The name of this directory entry.iol.58] .Figure 9. Figure 9. In figure 9. directories are special files that are used to create and hold access paths to the files in the file system. This is an index into the array of inodes held in the Inode Table of the Block Group. each one containing the following information: inode The inode for this directory entry.
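As a rough illustration of how a list of directory entries can be walked, here is a minimal Python sketch. It assumes the on-disk layout used by later EXT2 revisions (a 4-byte inode number, a 2-byte record length, a 1-byte name length and a 1-byte file type, followed by the name), which carries a record length field in addition to the three fields listed above. The names HEADER, pack_entry and parse_directory are made up for this example and are not kernel identifiers.

```python
import struct

# Fixed-size header of one on-disk entry, little-endian:
# inode number (4 bytes), record length (2), name length (1), file type (1).
HEADER = struct.Struct("<IHBB")


def pack_entry(inode, name):
    """Build one entry, padded to a 4-byte boundary (helper for the demo only)."""
    raw = name.encode("ascii")
    rec_len = (HEADER.size + len(raw) + 3) & ~3
    return (HEADER.pack(inode, rec_len, len(raw), 0) + raw).ljust(rec_len, b"\0")


def parse_directory(block):
    """Return (name, inode number) pairs found in a directory data block."""
    entries = []
    offset = 0
    while offset + HEADER.size <= len(block):
        inode, rec_len, name_len, _file_type = HEADER.unpack_from(block, offset)
        if rec_len < HEADER.size:
            break  # a record shorter than its own header means the block is corrupt
        if inode != 0:  # inode 0 marks an unused slot
            start = offset + HEADER.size
            entries.append((block[start:start + name_len].decode("ascii"), inode))
        offset += rec_len  # the record length leads to the next entry
    return entries


# A tiny directory: ".", ".." and one file, using made-up inode numbers.
block = pack_entry(2, ".") + pack_entry(2, "..") + pack_entry(11, "file")
listing = parse_directory(block)
print(listing)
```

The record length, not the name length, is what moves the parser from one entry to the next, which is why entries can be padded or left unused without breaking the walk.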

9.1.5 Finding a File in an EXT2 File System

A Linux filename has the same format as all Unix TM filenames have. It is a series of directory names separated by forward slashes (``/'') and ending in the file's name. One example filename would be /home/rusling/.cshrc where /home and /rusling are directory names and the file's name is .cshrc. Like all other Unix TM systems, Linux does not care about the format of the filename itself; it can be any length and consist of any of the printable characters. To find the inode representing this file within an EXT2 file system the system must parse the filename a directory at a time until we get to the file itself.

The first inode we need is the inode for the root of the file system and we find its number in the file system's superblock. To read an EXT2 inode we must look for it in the inode table of the appropriate Block Group. If, for example, the root inode number is 42, then we need the 42nd inode from the inode table of Block Group 0. The root inode is for an EXT2 directory, in other words the mode of the root inode describes it as a directory and its data blocks contain EXT2 directory entries.

home is just one of the many directory entries and this directory entry gives us the number of the inode describing the /home directory. We have to read this directory (by first reading its inode and then reading the directory entries from the data blocks described by its inode) to find the rusling entry which gives us the number of the inode describing the /home/rusling directory. Finally we read the directory entries pointed at by the inode describing the /home/rusling directory to find the inode number of the .cshrc file and from this we get the data blocks containing the information in the file.

9.1.6 Changing the Size of a File in an EXT2 File System

One common problem with a file system is its tendency to fragment. The blocks that hold the file's data get spread all over the file system and this makes sequentially accessing the data blocks of a file more and more inefficient the further apart the data blocks are. The EXT2 file system tries to overcome this by allocating the new blocks for a file physically close to its current data blocks or at least in the same Block Group as its current data blocks. Only when this fails does it allocate data blocks in another Block Group.

Whenever a process attempts to write data into a file the Linux file system checks to see if the data has gone off the end of the file's last allocated block. If it has, then it must allocate a new data block for this file. Until the allocation is complete, the process cannot run; it must wait for the file system to allocate a new data block and write the rest of the data to it before it can continue. The first thing that the EXT2 block allocation routines do is to lock the EXT2 Superblock for this file system. Allocating and deallocating changes fields within the superblock, and the Linux file system cannot allow more than one process to do this at the same time. If another process needs to allocate more data blocks, it will have to wait until this process has finished. Processes waiting for the superblock are suspended, unable to run, until control of the superblock is relinquished by its current user. Access to the superblock is granted on a first come, first served basis and once a process has control of the superblock, it keeps control until it has finished. Having locked the superblock, the process checks that there are enough free blocks left in this file system. If there are not enough free blocks, then this attempt to allocate more will fail and the process will relinquish control of this file system's superblock.

If there are enough free blocks in the file system, the process tries to allocate one. If the EXT2 file system has been built to preallocate data blocks then we may be able to take one of those. The preallocated blocks do not actually exist, they are just reserved within the allocated block bitmap. The VFS inode representing the file that we are trying to allocate a new data block for has two EXT2 specific fields, prealloc_block and prealloc_count, which are the block number of the first preallocated data block and how many of them there are, respectively. If there were no preallocated blocks or block preallocation is not enabled, the EXT2 file system must allocate a new block. The EXT2 file system first looks to see if the data block after the last data block in the file is free.

The block immediately after the file's last data block is, logically, the most efficient block to allocate as it makes sequential accesses much quicker. If this block is not free, then the search widens and it looks for a data block within 64 blocks of the ideal block. This block, although not ideal, is at least fairly close and within the same Block Group as the other data blocks belonging to this file.

If even that block is not free, the process starts looking in all of the other Block Groups in turn until it finds some free blocks. The block allocation code looks for a cluster of eight free data blocks somewhere in one of the Block Groups. If it cannot find eight together, it will settle for less. If block preallocation is wanted and enabled it will update prealloc_block and prealloc_count accordingly.

Wherever it finds the free block, the block allocation code updates the Block Group's block bitmap and allocates a data buffer in the buffer cache. That data buffer is uniquely identified by the file system's supporting device identifier and the block number of the allocated block. The data in the buffer is zero'd and the buffer is marked as ``dirty'' to show that its contents have not been written to the physical disk. Finally, the superblock itself is marked as ``dirty'' to show that it has been changed and it is unlocked. If there were any processes waiting for the superblock, the first one in the queue is allowed to run again and will gain exclusive control of the superblock for its file operations. The process's data is written to the new data block and, if that data block is filled, the entire process is repeated and another data block allocated.
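The search order described in this section (the ideal block, then a nearby block in the same Block Group, then a cluster of free blocks in the other Block Groups) can be sketched in Python as follows. This is a simplified model, not the kernel's allocator: the bitmap is a plain list of booleans covering the whole file system, the "within 64 blocks" search only looks forward from the ideal block, and the function names are invented for the example.

```python
def find_run(bitmap, start, end, length):
    """Return the first block in [start, end) that begins `length` free blocks."""
    run = 0
    for i in range(start, end):
        run = run + 1 if not bitmap[i] else 0
        if run == length:
            return i - length + 1
    return None


def choose_block(bitmap, goal, blocks_per_group, near=64, cluster=8):
    """Pick a block to allocate. bitmap[i] is True when block i is in use;
    goal is the block that follows the file's last data block."""
    total = len(bitmap)
    if not 0 <= goal < total:
        raise ValueError("goal must be a block number inside the file system")
    group = goal // blocks_per_group
    group_end = min((group + 1) * blocks_per_group, total)

    # 1. The ideal block: directly after the file's last data block.
    if not bitmap[goal]:
        return goal

    # 2. A block close to the ideal one, still inside the same Block Group.
    for i in range(goal + 1, min(goal + near + 1, group_end)):
        if not bitmap[i]:
            return i

    # 3. The other Block Groups in turn (the file's own group is tried last),
    #    looking for a cluster of free blocks and settling for shorter runs.
    n_groups = -(-total // blocks_per_group)  # ceiling division
    for want in range(cluster, 0, -1):
        for step in range(1, n_groups + 1):
            g = (group + step) % n_groups
            start = g * blocks_per_group
            end = min(start + blocks_per_group, total)
            hit = find_run(bitmap, start, end, want)
            if hit is not None:
                return hit
    return None  # the file system is full


# One 16-block group in which only block 10 is free.
one_group = [True] * 10 + [False] + [True] * 5
print(choose_block(one_group, goal=6, blocks_per_group=16))
```

Locking the superblock, preallocation and the bitmap update are left out so that only the search order is visible.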

9.2 The Virtual File System (VFS)

Figure 9.4: A Logical Diagram of the Virtual File System

Figure 9.4 shows the relationship between the Linux kernel's Virtual File System and its real file systems. The virtual file system must manage all of the different file systems that are mounted at any given time. To do this it maintains data structures that describe the whole (virtual) file system and the real, mounted, file systems.

Rather confusingly, the VFS describes the system's files in terms of superblocks and inodes in much the same way as the EXT2 file system uses superblocks and inodes. Like the EXT2 inodes, the VFS inodes describe files and directories within the system; the contents and topology of the Virtual File System. From now on, to avoid confusion, I will write about VFS inodes and VFS superblocks to distinguish them from EXT2 inodes and superblocks.

As each file system is initialised, it registers itself with the VFS. This happens as the operating system initialises itself at system boot time. The real file systems are either built into the kernel itself or are built as loadable modules. File System modules are loaded as the system needs them, so, for example, if the VFAT file system is implemented as a kernel module, then it is only loaded when a VFAT file system is mounted. When a block device based file system is mounted, and this includes the root file system, the VFS must read its superblock. Each file system type's superblock read routine must work out the file system's topology and map that information onto a VFS superblock data structure. The VFS keeps a list of the mounted file systems in the system together with their VFS superblocks. Each VFS superblock contains information and pointers to routines that perform particular functions. So, for example, the superblock representing a mounted EXT2 file system contains a pointer to the EXT2 specific inode reading routine. This EXT2 inode read routine, like all of the file system specific inode read routines, fills out the fields in a VFS inode. Each VFS superblock contains a pointer to the first VFS inode on the file system. For the root file system, this is the inode that represents the ``/'' directory. This mapping of information is very efficient for the EXT2 file system but moderately less so for other file systems.

As the system's processes access directories and files, system routines are called that traverse the VFS inodes in the system. For example, typing ls for a directory or cat for a file causes the Virtual File System to search through the VFS inodes that represent the file system. As every file and directory on the system is represented by a VFS inode, then a number of inodes will be being repeatedly accessed. These inodes are kept in the inode cache which makes access to them quicker. If an inode is not in the inode cache, then a file system specific routine must be called in order to read the appropriate inode. The action of reading the inode causes it to be put into the inode cache and further accesses to the inode keep it in the cache. The less used VFS inodes get removed from the cache.

All of the Linux file systems use a common buffer cache to cache data buffers from the underlying devices to help speed up access by all of the file systems to the physical devices holding the file systems. This buffer cache is independent of the file systems and is integrated into the mechanisms that the Linux kernel uses to allocate and read and write data buffers. It has the distinct advantage of making the Linux file systems independent from the underlying media and from the device drivers that support them. All block structured devices register themselves with the Linux kernel and present a uniform, block based, usually asynchronous interface. Even relatively complex block devices such as SCSI devices do this. As the real file systems read data from the underlying physical disks, this results in requests to the block device drivers to read physical blocks from the device that they control. Integrated into this block device interface is the buffer cache. As blocks are read by the file systems they are saved in the global buffer cache shared by all of the file systems and the Linux kernel.

iol. For example. amongst other information. As an experiment. 9.html (11 di 20) [08/03/2001 10. these routines are used by the VFS to read and write inodes and superblocks. Buffers within it are identified by their block number and a unique identifier for the device that read it. Amongst other things. every file. File System type A pointer to the mounted file system's file_system_type data structure. for example 1024 bytes. the first IDE hard disk in the system has a device identifier of 0x301. Inode pointers The mounted inode pointer points at the first inode in this file system.kernel.58] . 9.08. it will be retrieved from the buffer cache rather than read from the disk. The information in each VFS inode is built from information in the underlying file system by file system specific routines. The directory cache does not store the inodes for the directories itself. the directory cache simply stores the mapping between the full directory names and their inode numbers. Amongst other information. these should be in the inode cache. The first time you list it. /dev/hda1. The VFS also keeps a cache of directory lookups so that the inodes for frequently used directories can be quickly found.it/LDP/tlk/fs/filesystem. directory and so on in the VFS is represented by one and only one VFS inode. File System specific A pointer to information needed by this file system. inode number This is the number of the inode and is unique within this file system.2 The VFS Inode Like the EXT2 file system. VFS inodes contain the following fields: device This is the device identifer of the device holding the file or whatever that this VFS inode represents. Superblock operations A pointer to a set of superblock routines for this file system.2. the VFS superblock contains the: Device This is the device identifier for the block device that this file system is contained in. The covered inode pointer points at the inode representing the directory that this file system is mounted on. 
The combination of device and http://ldp. VFS inodes exist only in the kernel's memory and are kept in the VFS inode cache as long as they are useful to the system. if the same data is needed often.2. The root file system's VFS superblock does not have a covered pointer. Some devices support read ahead where data blocks are speculatively read just in case they are needed. So. which would take somewhat longer.1 The VFS Superblock Every mounted file system is represented by a VFS superblock. try listing a directory that you have not listed recently. you may notice a slight pause but the second time you list its contents the result is immediate. Blocksize The block size in bytes of this file system.

mode
    Like EXT2 this field describes what this VFS inode represents as well as access rights to it.
user ids
    The owner identifiers.
times
    The creation, modification and write times.
block size
    The size of a block for this file in bytes, for example 1024 bytes.
inode operations
    A pointer to a block of routine addresses. These routines are specific to the file system and they perform operations for this inode, for example, truncate the file that is represented by this inode.
count
    The number of system components currently using this VFS inode. A count of zero means that the inode is free to be discarded or reused.
lock
    This field is used to lock the VFS inode, for example, when it is being read from the file system.
dirty
    Indicates whether this VFS inode has been written to, if so the underlying file system will need modifying.
file system specific information

9.2.3 Registering the File Systems

Figure 9.5: Registered File Systems

When you build the Linux kernel you are asked if you want each of the supported file systems. When the kernel is built, the file system startup code contains calls to the initialisation routines of all of the built in file systems. Linux file systems may also be built as modules and, in this case, they may be demand loaded as they are needed or loaded by hand using insmod. Whenever a file system module is loaded it registers itself with the kernel and unregisters itself when it is unloaded. Each file system's initialisation routine registers itself with the Virtual File System and is represented by a file_system_type data structure which contains the name of the file system and a pointer to its VFS superblock read routine. Figure 9.5 shows that the file_system_type data structures are put into a list pointed at by the file_systems pointer. Each file_system_type data structure contains the following information:

Superblock read routine

    This routine is called by the VFS when an instance of the file system is mounted.
File System name
    The name of this file system, for example ext2.
Device needed
    Does this file system need a device to support? Not all file systems need a device to hold them. The /proc file system, for example, does not require a block device.

You can see which file systems are registered by looking at /proc/filesystems. For example:

      ext2
nodev proc
      iso9660

9.2.4 Mounting a File System

When the superuser attempts to mount a file system, the Linux kernel must first validate the arguments passed in the system call. Although mount does some basic checking, it does not know which file systems this kernel has been built to support or that the proposed mount point actually exists. Consider the following mount command:

$ mount -t iso9660 -o ro /dev/cdrom /mnt/cdrom

This mount command will pass the kernel three pieces of information; the name of the file system, the physical block device that contains the file system and, thirdly, where in the existing file system topology the new file system is to be mounted.

The first thing that the Virtual File System must do is to find the file system. To do this it searches through the list of known file systems by looking at each file_system_type data structure in the list pointed at by file_systems. If it finds a matching name it now knows that this file system type is supported by this kernel and it has the address of the file system specific routine for reading this file system's superblock. If it cannot find a matching file system name then all is not lost if the kernel is built to demand load kernel modules (see Chapter modules-chapter). In this case the kernel will request that the kernel daemon loads the appropriate file system module before continuing as before.

Next if the physical device passed by mount is not already mounted, it must find the VFS inode of the directory that is to be the new file system's mount point. This VFS inode may be in the inode cache or it might have to be read from the block device supporting the file system of the mount point. Once the inode has been found it is checked to see that it is a directory and that there is not already some other file system mounted there. The same directory cannot be used as a mount point for more than one file system.

At this point the VFS mount code must allocate a VFS superblock and pass it, together with the mount information, to the superblock read routine for this file system. All of the system's VFS superblocks are kept in the super_blocks vector of super_block data structures and one must be allocated for this mount. The superblock read routine must fill out the VFS superblock fields based on information that it reads from the physical device. For the EXT2 file system this mapping or translation of information is quite easy, it simply reads the EXT2 superblock and fills out the VFS superblock from there. For other file systems, such as the MS DOS file system, it is not quite such an easy task. Whatever the file system, filling out the VFS superblock means that the file system must read whatever describes it from the block device that supports it. If the block device cannot be read from or if it does not contain this type of file system then the mount command will fail.
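To make the registration list and the name lookup performed at mount time more concrete, here is a small Python model. It is only a sketch under stated assumptions: the class FileSystemType and the functions register_filesystem, unregister_filesystem, proc_filesystems and mount are illustrative names chosen to echo the file_system_type structure and file_systems list in the text, superblocks are plain dictionaries, and the mount point check is reduced to a dictionary lookup.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FileSystemType:
    name: str                                    # for example "ext2"
    read_super: Callable[[str], Optional[dict]]  # superblock read routine
    requires_dev: bool = True                    # False for /proc-like file systems


file_systems = []  # the registered types, in registration order


def register_filesystem(fs):
    if any(known.name == fs.name for known in file_systems):
        raise ValueError("already registered: " + fs.name)
    file_systems.append(fs)


def unregister_filesystem(name):
    file_systems[:] = [fs for fs in file_systems if fs.name != name]


def proc_filesystems():
    """Produce a listing in the style of /proc/filesystems."""
    return "\n".join(("" if fs.requires_dev else "nodev") + "\t" + fs.name
                     for fs in file_systems)


def mount(fs_name, device, mount_point, mounted):
    """Find the type by name, check the mount point, read the superblock."""
    fs = next((known for known in file_systems if known.name == fs_name), None)
    if fs is None:
        raise LookupError("unknown file system type: " + fs_name)
    if mount_point in mounted:
        raise ValueError("something is already mounted on " + mount_point)
    superblock = fs.read_super(device)
    if superblock is None:  # unreadable device or wrong kind of file system
        raise OSError("cannot read a " + fs_name + " superblock from " + device)
    mounted[mount_point] = superblock
    return superblock


def read_iso9660_super(device):
    # Pretend that only /dev/cdrom holds an iso9660 file system.
    return {"type": "iso9660", "device": device} if device == "/dev/cdrom" else None


register_filesystem(FileSystemType("ext2", lambda dev: {"type": "ext2", "device": dev}))
register_filesystem(FileSystemType("proc", lambda dev: {"type": "proc"}, requires_dev=False))
register_filesystem(FileSystemType("iso9660", read_iso9660_super))

mounted = {}
mount("iso9660", "/dev/cdrom", "/mnt/cdrom", mounted)
print(proc_filesystems())
```

Demand loading of a missing file system module is not modelled; in this sketch an unknown name simply fails.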

Figure 9.6: A Mounted File System

Each mounted file system is described by a vfsmount data structure; see figure 9.6. These are queued on a list pointed at by vfsmntlist. Another pointer, vfsmnttail points at the last entry in the list and the mru_vfsmnt pointer points at the most recently used file system. Each vfsmount structure contains the device number of the block device holding the file system, the directory where this file system is mounted and a pointer to the VFS superblock allocated when this file system was mounted. In turn the VFS superblock points at the file_system_type data structure for this sort of file system and to the root inode for this file system. This inode is kept resident in the VFS inode cache all of the time that this file system is loaded.

9.2.5 Finding a File in the Virtual File System

To find the VFS inode of a file in the Virtual File System, VFS must resolve the name a directory at a time, looking up the VFS inode representing each of the intermediate directories in the name. Each directory lookup involves calling the file system specific lookup whose address is held in the VFS inode representing the parent directory. This works because we always have the VFS inode of the root of each file system available and pointed at by the VFS superblock for that system. Each time an inode is looked up by the real file system it checks the directory cache for the directory. If there is no entry in the directory cache, the real file system gets the VFS inode either from the underlying file system or from the inode cache.
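The name resolution described in this section can be modelled in a few lines of Python. This is a sketch rather than the kernel's code: the Inode class, the find_inode function and the cache key (device, parent inode number, name) are assumptions made for the example, and the directory cache here holds the inode object itself instead of an inode number to keep the model short.

```python
class Inode:
    def __init__(self, dev, ino, entries=None):
        self.dev = dev
        self.ino = ino
        self.entries = entries  # dict of name -> Inode for a directory, None for a file

    def lookup(self, name):
        """Stand-in for the file system specific lookup routine."""
        if name not in self.entries:
            raise FileNotFoundError(name)
        return self.entries[name]


def find_inode(root, path, dcache, stats):
    """Resolve path one directory at a time, starting from the root inode."""
    inode = root
    for name in filter(None, path.split("/")):
        if inode.entries is None:
            raise NotADirectoryError(name)
        key = (inode.dev, inode.ino, name)
        child = dcache.get(key)
        if child is None:
            stats["fs_lookups"] += 1      # the real file system has to be asked
            child = inode.lookup(name)
            # Only directories with short names go into the directory cache.
            if child.entries is not None and len(name) <= 15:
                dcache[key] = child
        inode = child
    return inode


DEV = 0x301  # device identifier used for every inode in this demo
cshrc = Inode(DEV, 30)
rusling = Inode(DEV, 20, {".cshrc": cshrc})
home = Inode(DEV, 10, {"rusling": rusling})
root = Inode(DEV, 2, {"home": home})

dcache, stats = {}, {"fs_lookups": 0}
first = find_inode(root, "/home/rusling/.cshrc", dcache, stats)
second = find_inode(root, "/home/rusling/.cshrc", dcache, stats)
print(first.ino, stats["fs_lookups"])
```

The first walk asks the file system three times; the second only needs it for the final component, because both intermediate directories are found in the directory cache.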

9.2.6 Creating a File in the Virtual File System

9.2.7 Unmounting a File System

The workshop manual for my MG usually describes assembly as the reverse of disassembly and the reverse is more or less true for unmounting a file system. A file system cannot be unmounted if something in the system is using one of its files. So, for example, you cannot umount /mnt/cdrom if a process is using that directory or any of its children. If anything is using the file system to be unmounted there may be VFS inodes from it in the VFS inode cache, and the code checks for this by looking through the list of inodes looking for inodes owned by the device that this file system occupies. If the VFS superblock for the mounted file system is dirty, that is it has been modified, then it must be written back to the file system on disk. Once it has been written to disk, the memory occupied by the VFS superblock is returned to the kernel's free pool of memory. Finally the vfsmount data structure for this mount is unlinked from vfsmntlist and freed.

9.2.8 The VFS Inode Cache

As the mounted file systems are navigated, their VFS inodes are being continually read and, in some cases, written. The Virtual File System maintains an inode cache to speed up accesses to all of the mounted file systems. Every time a VFS inode is read from the inode cache the system saves an access to a physical device.

The VFS inode cache is implemented as a hash table whose entries are pointers to lists of VFS inodes that have the same hash value. The hash value of an inode is calculated from its inode number and from the device identifier for the underlying physical device containing the file system. Whenever the Virtual File System needs to access an inode, it first looks in the VFS inode cache. To find an inode in the cache, the system first calculates its hash value and then uses it as an index into the inode hash table. This gives it a pointer to a list of inodes with the same hash value. It then reads each inode in turn until it finds one with both the same inode number and the same device identifier as the one that it is searching for.

If it can find the inode in the cache, its count is incremented to show that it has another user and the file system access continues. Otherwise a free VFS inode must be found so that the file system can read the inode from memory. VFS has a number of choices about how to get a free inode. If the system may allocate more VFS inodes then this is what it does; it allocates kernel pages and breaks them up into new, free, inodes and puts them into the inode list. All of the system's VFS inodes are in a list pointed at by first_inode as well as in the inode hash table. If the system already has all of the inodes that it is allowed to have, it must find an inode that is a good candidate to be reused. Good candidates are inodes with a usage count of zero; this indicates that the system is not currently using them. Really important VFS inodes, for example the root inodes of file systems, always have a usage count greater than zero and so are never candidates for reuse. Once a candidate for reuse has been located it is cleaned up. The VFS inode might be dirty and in this case it needs to be written back to the file system or it might be locked and in this case the system must wait for it to be unlocked before continuing. The candidate VFS inode must be cleaned up before it can be reused.

However the new VFS inode is found, a file system specific routine must be called to fill it out from information read from the underlying real file system. Whilst it is being filled out, the new VFS inode has a usage count of one and is locked so that nothing else accesses it until it contains valid information.

To get the VFS inode that is actually needed, the file system may need to access several other inodes. This happens when you read a directory; only the inode for the final directory is needed but the inodes for the intermediate directories must also be read. As the VFS inode cache is used and filled up, the less used inodes will be discarded and the more used inodes will remain in the cache.
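The lookup and reuse policy of the inode cache can be sketched as a small Python class. Several things here are assumptions made for illustration: the hash function (device identifier XOR inode number, modulo the table size), the names InodeCache, iget and iput, and the way a write-back is recorded in a list. Inode locking is left out entirely.

```python
class VfsInode:
    def __init__(self):
        self.dev = None
        self.ino = None
        self.count = 0      # number of current users
        self.dirty = False  # True when it must be written back before reuse


class InodeCache:
    def __init__(self, buckets=8, max_inodes=2):
        self.table = [[] for _ in range(buckets)]  # hash table of inode lists
        self.max_inodes = max_inodes
        self.total = 0
        self.written_back = []  # (dev, ino) pairs "written" during reuse

    def _bucket(self, dev, ino):
        return self.table[(dev ^ ino) % len(self.table)]

    def iget(self, dev, ino, read_inode):
        bucket = self._bucket(dev, ino)
        for inode in bucket:  # walk the list of inodes sharing this hash value
            if inode.dev == dev and inode.ino == ino:
                inode.count += 1  # cache hit: one more user
                return inode
        inode = self._get_free_inode()
        inode.dev, inode.ino, inode.count, inode.dirty = dev, ino, 1, False
        read_inode(inode)  # file system specific routine fills it out
        bucket.append(inode)
        return inode

    def iput(self, inode):
        inode.count -= 1  # a count of zero makes the inode reusable

    def _get_free_inode(self):
        if self.total < self.max_inodes:  # still allowed to allocate
            self.total += 1
            return VfsInode()
        for bucket in self.table:  # otherwise reuse an inode nobody is using
            for inode in bucket:
                if inode.count == 0:
                    if inode.dirty:
                        self.written_back.append((inode.dev, inode.ino))
                    bucket.remove(inode)
                    return inode
        raise MemoryError("every cached inode is in use")


reads = []
cache = InodeCache(max_inodes=2)
root_inode = cache.iget(0x301, 2, lambda inode: reads.append(inode.ino))
other = cache.iget(0x301, 11, lambda inode: reads.append(inode.ino))
other.dirty = True
cache.iput(other)  # nobody is using inode 11 any more
reused = cache.iget(0x301, 12, lambda inode: reads.append(inode.ino))
print(reused is other, cache.written_back, reads)
```

Because the root inode's count never drops to zero in this demo, it is never chosen for reuse, which mirrors the remark about really important VFS inodes.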

9.2.9 The Directory Cache

To speed up accesses to commonly used directories, the VFS maintains a cache of directory entries. As directories are looked up by the real file systems their details are added into the directory cache. The next time the same directory is looked up, for example to list it or open a file within it, then it will be found in the directory cache. Only short directory entries (up to 15 characters long) are cached but this is reasonable as the shorter directory names are the most commonly used ones. For example, /usr/X11R6/bin is very commonly accessed when the X server is running.

The directory cache consists of a hash table, each entry of which points at a list of directory cache entries that have the same hash value. The hash function uses the device number of the device holding the file system and the directory's name to calculate the offset, or index, into the hash table. It allows cached directory entries to be quickly found. It is no use having a cache when lookups within the cache take too long to find entries, or even not to find them.

In an effort to keep the caches valid and up to date the VFS keeps lists of Least Recently Used (LRU) directory cache entries. When a directory entry is first put into the cache, which is when it is first looked up, it is added onto the end of the first level LRU list. In a full cache this will displace an existing entry from the front of the LRU list. As the directory entry is accessed again it is promoted to the back of the second LRU cache list. Again, this may displace a cached level two directory entry at the front of the level two LRU cache list. This displacing of entries at the front of the level one and level two LRU lists is fine. The only reason that entries are at the front of the lists is that they have not been recently accessed. If they had, they would be nearer the back of the lists. The entries in the second level LRU cache list are safer than entries in the level one LRU cache list. This is the intention as these entries have not only been looked up but also they have been repeatedly referenced.

REVIEW NOTE: Do we need a diagram for this?
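The two level LRU scheme can be illustrated with Python's OrderedDict, which keeps entries in insertion order and so can play the part of an LRU list (front is the oldest entry, back is the newest). This is a sketch under assumptions: the class name DirCache, the per-level size limit and the key (device number, name) are chosen for the example, and Python's dictionary hashing stands in for the hash table described above.

```python
from collections import OrderedDict


class DirCache:
    MAX_NAME = 15  # only short names are cached

    def __init__(self, size):
        self.size = size             # capacity of each LRU level
        self.level1 = OrderedDict()  # entries seen once
        self.level2 = OrderedDict()  # entries that were referenced again

    def add(self, dev, name, ino):
        """Cache a freshly looked up entry at the back of the level one list."""
        if len(name) > self.MAX_NAME:
            return False
        key = (dev, name)
        if key in self.level1 or key in self.level2:
            return True
        if len(self.level1) >= self.size:
            self.level1.popitem(last=False)  # displace the front (oldest) entry
        self.level1[key] = ino
        return True

    def lookup(self, dev, name):
        key = (dev, name)
        if key in self.level2:
            self.level2.move_to_end(key)     # referenced again: back of level two
            return self.level2[key]
        if key in self.level1:
            ino = self.level1.pop(key)       # promote from level one to level two
            if len(self.level2) >= self.size:
                self.level2.popitem(last=False)
            self.level2[key] = ino
            return ino
        return None                          # not cached


cache = DirCache(size=2)
cache.add(0x301, "bin", 5)
cache.add(0x301, "etc", 6)
cache.lookup(0x301, "bin")   # second reference: "bin" moves to level two
cache.add(0x301, "usr", 7)
cache.add(0x301, "var", 8)   # level one is full, so "etc" is displaced
print(cache.lookup(0x301, "bin"), cache.lookup(0x301, "etc"))
```

An entry that has been referenced twice survives a burst of one-off lookups, which is the sense in which level two entries are safer.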

9.3 The Buffer Cache

Figure 9.7: The Buffer Cache

As the mounted file systems are used they generate a lot of requests to the block devices to read and write data blocks. All block data read and write requests are given to the device drivers in the form of buffer_head data structures via standard kernel routine calls. These give all of the information that the block device drivers need; the device identifier uniquely identifies the device and the block number tells the driver which block to read. All block devices are viewed as linear collections of blocks of the same size. To speed up access to the physical block devices, Linux maintains a cache of block buffers. All of the block buffers in the system are kept somewhere in this buffer cache, even the new, unused buffers. This cache is shared between all of the physical block devices; at any one time there are many block buffers in the cache, belonging to any one of the system's block devices and often in many different states. If valid data is available from the buffer cache this saves the system an access to a physical device. Any block buffer that has been used to read data from a block device or to write data to it goes into the buffer cache. Over time it may be removed from the cache to make way for a more deserving buffer or it may remain in the cache as it is frequently accessed.

Block buffers within the cache are uniquely identified by the owning device identifier and the block number of the buffer. The buffer cache is composed of two functional parts. The first part is the lists of free block buffers. There is one list per supported buffer size and the system's free block buffers are queued onto these lists when they are first created or when they have been discarded. The currently supported buffer sizes are 512, 1024, 2048, 4096 and 8192 bytes. The second functional part is the cache itself. This is a hash table which is a vector of pointers to chains of buffers that have the same hash index. The hash index is generated from the owning device identifier and the block number of the data block. Figure 9.7 shows the hash table together with a few entries. Block buffers are either in one of the free lists or they are in the buffer cache. When they are in the buffer cache they are also queued onto Least Recently Used (LRU) lists. There is an LRU list for each buffer type and these are used by the system to perform work on buffers of a type, for example, writing buffers with new data in them out to disk.
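The two functional parts of the buffer cache, the per-size free lists and the hash table of buffer chains, can be modelled roughly as follows. This is an illustrative Python sketch, not kernel code: the hash function (device identifier XOR block number, modulo the table size) is an assumption, the names Buffer, BufferCache, getblk and discard are chosen for the example, and the LRU lists are omitted.

```python
BUFFER_SIZES = (512, 1024, 2048, 4096, 8192)  # sizes listed in the text


class Buffer:
    def __init__(self, size):
        self.size = size
        self.dev = None
        self.block = None
        self.data = bytearray(size)


class BufferCache:
    def __init__(self, buckets=16, buffers_per_size=2):
        # Part one: one free list per supported buffer size.
        self.free = {size: [Buffer(size) for _ in range(buffers_per_size)]
                     for size in BUFFER_SIZES}
        # Part two: a vector of chains of buffers sharing a hash index.
        self.table = [[] for _ in range(buckets)]

    def _chain(self, dev, block):
        return self.table[(dev ^ block) % len(self.table)]

    def getblk(self, dev, block, size):
        """Return (buffer, hit); a miss takes a buffer from the free list."""
        if size not in self.free:
            raise ValueError("unsupported buffer size")
        chain = self._chain(dev, block)
        for buf in chain:
            if buf.dev == dev and buf.block == block:
                return buf, True
        if not self.free[size]:
            raise MemoryError("no free buffers of this size")
        buf = self.free[size].pop()
        buf.dev, buf.block = dev, block
        chain.append(buf)
        return buf, False

    def discard(self, buf):
        """Take a buffer out of the cache and queue it on its free list."""
        self._chain(buf.dev, buf.block).remove(buf)
        buf.dev = buf.block = None
        self.free[buf.size].append(buf)


bcache = BufferCache()
b1, hit1 = bcache.getblk(0x301, 7, 1024)  # first request: taken from the free list
b2, hit2 = bcache.getblk(0x301, 7, 1024)  # same device and block: found in the cache
print(hit1, hit2, b1 is b2)
```

A buffer is always in exactly one place in this model, either on a free list or on a hash chain, matching the statement that block buffers are either in one of the free lists or in the buffer cache.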

The buffer's type reflects its state and Linux currently supports the following types:

clean
    Unused, new buffers.
locked
    Buffers that are locked, waiting to be written.
dirty
    Dirty buffers. These contain new, valid data, and will be written but so far have not been scheduled to write.
shared
    Shared buffers.
unshared
    Buffers that were once shared but which are now not shared.

Whenever a file system needs to read a buffer from its underlying physical device, it tries to get a block from the buffer cache. If it cannot get a buffer from the buffer cache, then it will get a clean one from the appropriate sized free list and this new buffer will go into the buffer cache. If the buffer that it needed is in the buffer cache, then it may or may not be up to date. If it is not up to date or if it is a new block buffer, the file system must request that the device driver read the appropriate block of data from the disk.

Like all caches, the buffer cache must be maintained so that it runs efficiently and fairly allocates cache entries between the block devices using the buffer cache. Linux uses the bdflush kernel daemon to perform a lot of housekeeping duties on the cache but some happen automatically as a result of the cache being used.

9.3.1 The bdflush Kernel Daemon

The bdflush kernel daemon is a simple kernel daemon that provides a dynamic response to the system having too many dirty buffers; buffers that contain data that must be written out to disk at some time. It is started as a kernel thread at system startup time and, rather confusingly, it calls itself ``kflushd'' and that is the name that you will see if you use the ps command to show the processes in the system. Mostly this daemon sleeps waiting for the number of dirty buffers in the system to grow too large. As buffers are allocated and discarded the number of dirty buffers in the system is checked. If there are too many as a percentage of the total number of buffers in the system then bdflush is woken up. The default threshold is 60% but, if the system is desperate for buffers, bdflush will be woken up anyway. This value can be seen and changed using the update command:

# update -d

bdflush version 1.4
0:    60 Max fraction of LRU list to examine for dirty blocks
1:   500 Max number of dirty blocks to write each time bdflush activated
2:    64 Num of clean buffers to be loaded onto free list by refill_freelist
3:   256 Dirty block threshold for activating bdflush in refill_freelist
4:    15 Percentage of cache to scan for free clusters
5:  3000 Time for data buffers to age before flushing
6:   500 Time for non-data (dir, bitmap, etc) buffers to age before flushing

or instance of that major type. For example. Character devices allow I/O operations in character mode and block devices require that all I/O is via the buffer cache.3.59] . 9.7: 8: 1884 Time buffer cache load average constant 2 LAV ratio (used to determine threshold for buffer fratricide). http://ldp. There are two types of device file. 1 Nov 24 15:09 /dev/hda1 Within the kernel. Device files are referenced by a major number. when the VFS makes calls to it requesting inodes as its files and directories are opened.08. Again this number can be seen and controlled by the update command and the default is 500 (see above). neither the /proc directory nor its subdirectories and its files actually exist. When an I/O request is made to a device file. So how can you cat /proc/devices? The /proc file system. Within the kernel itself. Every time that update runs it looks at all of the dirty buffers in the system looking for ones with an expired flush time. this is two bytes long. close them and so on.html (19 di 20) [08/03/2001 10. every device is uniquely described by a kdev_t data type. create entries in the the /proc file system. Every expired buffer is written out to disk. The /proc file system presents a user readable window into the kernel's inner workings. 9. the IDE disks on the first IDE controller in the system have a major number of 3 and the first partition of an IDE disk would have a minor number of 1. It does this by calling a system service routine that does more or less the same thing as bdflush. the first byte containing the minor device number and the second byte holding the major device number. All of the dirty buffers are linked into the BUF_DIRTY LRU list whenever they are made dirty by having data written to them and bdflush tries to write a reasonable number of them out to their owning disks. character and block special files.2 The update Process The update command is more than just a command. 
The EXT2 file system and the Linux VFS both implement device files as special types of inode.it/LDP/tlk/fs/filesystem. Often this is not a real device driver but a pseudo-device driver for some subsystem such as the SCSI device driver layer. A device file does not use any data space in the file system. the device drivers implement file semantices: you can open them. registers itself with the Virtual File System. For example. So. /dev/null is the null device. it is forwarded to the appropriate device driver within the system. like a real file system. Several Linux subsystems. such as Linux kernel modules described in chapter modules-chapter.5 Device Special Files Linux. which identifies the device type. However. the kernel's /proc/devices file is generated from the kernel's data structures describing its devices. it is also a daemon. for example. So. When run as superuser (during system initialisation) it will periodically flush all of the older dirty buffers out to disk. Whenever a dirty buffer is finished with. like all versions of Unix TM presents its hardware devices as special files. it is tagged with the system time that it should be written out to its owning disk.4 The /proc File System The /proc file system really shows the power of the Linux Virtual File System. the /proc file system creates those files and directories from information within the kernel. 9.iol. it is only an access point to the device driver. It does not really exist (yet another of Linux's conjuring tricks). which identifies the unit. ls -l of /dev/hda1 gives: $ brw-rw---1 root disk 3. and a minor type.
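As a rough illustration of the two-byte device identifier just described, the sketch below packs a major and a minor number into a 16-bit value and unpacks them again. The type name `dev16_t` and the helper names are invented for this example, and the layout (major in the high byte, minor in the low byte) is an assumption read off the text; the kernel's own kdev_t definitions are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's two-byte device identifier. */
typedef uint16_t dev16_t;

/* Pack a major and minor number: major in the high byte, minor in the low byte. */
static dev16_t make_dev(unsigned major, unsigned minor)
{
    assert(major <= 0xff && minor <= 0xff);   /* each number must fit in one byte */
    return (dev16_t)((major << 8) | minor);
}

/* Recover the device type (major number) from an identifier. */
static unsigned dev_major(dev16_t dev)
{
    return dev >> 8;
}

/* Recover the unit (minor number) from an identifier. */
static unsigned dev_minor(dev16_t dev)
{
    return dev & 0xff;
}
```

For the IDE partition in the listing (major 3, minor 1), `make_dev(3, 1)` evaluates to 0x0301 under this layout.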

The IDE device above is held within the kernel as 0x0301. An EXT2 inode that represents a block or character device keeps the device's major and minor numbers in its first direct block pointer. When it is read by the VFS, the VFS inode data structure representing it has its i_rdev field set to the correct device identifier.

Footnotes:
1 Well, not knowingly, although I have been bitten by operating systems with more lawyers than Linux has developers.

File translated from TEX by TTH, version 1.0.
© 1996-1999 David A Rusling copyright notice.

Chapter 11
Kernel Mechanisms

This chapter describes some of the general tasks and mechanisms that the Linux kernel needs to supply so that other parts of the kernel work effectively together.

11.1 Bottom Half Handling

Figure 11.1: Bottom Half Handling Data Structures

There are often times in a kernel when you do not want to do work at this moment. A good example of this is during interrupt processing. When the interrupt was asserted, the processor stopped what it was doing and the operating system delivered the interrupt to the appropriate device driver. Device drivers should not spend too much time handling interrupts as, during this time, nothing else in the system can run. There is often some work that could just as well be done later on. Linux's bottom half handlers were invented so that device drivers and other parts of the Linux kernel could queue work to be done later on. Figure 11.1 shows the kernel data structures associated with bottom half handling.

There can be up to 32 different bottom half handlers; bh_base is a vector of pointers to each of the kernel's bottom half handling routines. bh_active and bh_mask have their bits set according to what handlers have been installed and are active. If bit N of bh_mask is set then the Nth element of bh_base contains the address of a bottom half routine. If bit N of bh_active is set then the Nth bottom half handler routine should be called as soon as the scheduler deems reasonable. These indices are statically defined; the timer bottom half handler is the highest priority (index 0), the console bottom half handler is next in priority (index 1) and so on. Typically the bottom half handling routines have lists of tasks associated with them. For example, the immediate bottom half handler works its way through the immediate tasks queue (tq_immediate) which contains tasks that need to be performed immediately.

Some of the kernel's bottom half handlers are device specific, but others are more generic:

TIMER
    This handler is marked as active each time the system's periodic timer interrupts and is used to drive the kernel's timer queue mechanisms.
CONSOLE
    This handler is used to process console messages.
TQUEUE
    This handler is used to process tty messages.
NET
    This handler handles general network processing.
IMMEDIATE
    This is a generic handler used by several device drivers to queue work to be done later.

Whenever a device driver, or some other part of the kernel, needs to schedule work to be done later, it adds work to the appropriate system queue, for example the timer queue, and then signals the kernel that some bottom half handling needs to be done. It does this by setting the appropriate bit in bh_active. Bit 8 is set if the driver has queued something on the immediate queue and wishes the immediate bottom half handler to run and process it. The bh_active bitmask is checked at the end of each system call, just before control is returned to the calling process. If it has any bits set, the bottom half handler routines that are active are called. Bit 0 is checked first, then 1 and so on until bit 31. The bit in bh_active is cleared as each bottom half handling routine is called. bh_active is transient; it only has meaning between calls to the scheduler and is a way of not calling bottom half handling routines when there is no work for them to do.

11.2 Task Queues

Figure 11.2: A Task Queue

Task queues are the kernel's way of deferring work until later. Linux has a generic mechanism for queuing work on queues and for processing them later. Task queues are often used in conjunction with bottom half handlers; the timer task queue is processed when the timer queue bottom half handler runs. A task queue is a simple data structure (see figure 11.2) which consists of a singly linked list of tq_struct data structures, each of which contains the address of a routine and a pointer to some data. The routine will be called when the element on the task queue is processed and it will be passed a pointer to the data.

Anything in the kernel, for example a device driver, can create and use task queues but there are three task queues created and managed by the kernel:

timer
    This queue is used to queue work that will be done as soon after the next system clock tick as is possible. Each clock tick, this queue is checked to see if it contains any entries and, if it does, the timer queue bottom half handler is made active. The timer queue bottom half handler is processed, along with all the other bottom half handlers, when the scheduler next runs. This queue should not be confused with system timers, which are a much more sophisticated mechanism.
immediate
    This queue is also processed when the scheduler processes the active bottom half handlers. The immediate bottom half handler is not as high in priority as the timer queue bottom half handler and so these tasks will be run later.
scheduler
    This task queue is processed directly by the scheduler. It is used to support other task queues in the system and, in this case, the task to be run will be a routine that processes a task queue, say for a device driver.

When task queues are processed, the pointer to the first element in the queue is removed from the queue and replaced with a null pointer. In fact, this removal is an atomic operation, one that cannot be interrupted. Then each element in the queue has its handling routine called in turn. The elements in the queue are often statically allocated data. However there is no inherent mechanism for discarding allocated memory. The task queue processing routine simply moves onto the next element in the list. It is the job of the task itself to ensure that it properly cleans up any allocated kernel memory.

11.3 Timers

Figure 11.3: System Timers

An operating system needs to be able to schedule an activity sometime in the future. A mechanism is needed whereby activities can be scheduled to run at some relatively precise time. Any microprocessor that wishes to support an operating system must have a programmable interval timer that periodically interrupts the processor. This periodic interrupt is known as a system clock tick and it acts like a metronome, orchestrating the system's activities. Linux has a very simple view of what time it is; it measures time in clock ticks since the system booted. All system times are based on this measurement, which is known as jiffies after the globally available variable of the same name.

Linux has two types of system timers; both queue routines to be called at some system time but they are slightly different in their implementations. Figure 11.3 shows both mechanisms. The first, the old timer mechanism, has a static array of 32 pointers to timer_struct data structures and a mask of active timers, timer_active. Where the timers go in the timer table is statically defined (rather like the bottom half handler table bh_base). Entries are added into this table mostly at system initialization time. The second, newer, mechanism uses a linked list of timer_list data structures held in ascending expiry time order.
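The ordered list used by the newer mechanism can be pictured with a small sketch. The structure below is a simplified stand-in for timer_list: its name, its fields and the helper `demo_add_timer` are assumptions made for illustration and do not reproduce the kernel's definitions. It only shows how an insertion keeps the list in ascending expiry order, so that the earliest timer is always at the head.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for the kernel's timer_list. */
struct demo_timer {
    struct demo_timer *next;
    unsigned long expires;             /* expiry time, in jiffies */
    void (*function)(unsigned long);   /* routine to call on expiry */
    unsigned long data;                /* argument handed to the routine */
};

/* Insert a timer so that the list stays in ascending expiry order. */
static void demo_add_timer(struct demo_timer **head, struct demo_timer *t)
{
    assert(head != NULL && t != NULL);

    /* Walk past every timer that expires no later than the new one. */
    while (*head != NULL && (*head)->expires <= t->expires)
        head = &(*head)->next;

    /* Link the new timer in front of the first later-expiring timer. */
    t->next = *head;
    *head = t;
}
```

Keeping the list sorted means that a check for expired timers can stop at the first entry whose expiry time is still in the future.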

Both methods use the time in jiffies as an expiry time so that a timer that wished to run in 5s would have to convert 5s to units of jiffies and add that to the current system time to get the system time in jiffies when the timer should expire. Every system clock tick the timer bottom half handler is marked as active so that when the scheduler next runs, the timer queues will be processed. The timer bottom half handler processes both types of system timer. For the old system timers the timer_active bit mask is checked for bits that are set. If the expiry time for an active timer has expired (expiry time is less than the current system jiffies), its timer routine is called and its active bit is cleared. For new system timers, the entries in the linked list of timer_list data structures are checked. Every expired timer is removed from the list and its routine is called. The new timer mechanism has the advantage of being able to pass an argument to the timer routine.

11.4 Wait Queues

There are many times when a process must wait for a system resource. For example a process may need the VFS inode describing a directory in the file system and that inode may not be in the buffer cache. In this case the process must wait for that inode to be fetched from the physical media containing the file system before it can carry on.

Figure 11.4: Wait Queue (a wait_queue element holding a task pointer and a next pointer)

The Linux kernel uses a simple data structure, a wait queue (see figure 11.4), which consists of a pointer to the process's task_struct and a pointer to the next element in the wait queue.

When processes are added to the end of a wait queue they can either be interruptible or uninterruptible. Interruptible processes may be interrupted by events such as timers expiring or signals being delivered whilst they are waiting on a wait queue. The waiting process's state will reflect this and either be INTERRUPTIBLE or UNINTERRUPTIBLE. As this process cannot now continue to run, the scheduler is run and, when it selects a new process to run, the waiting process will be suspended (see footnote 1).

When the wait queue is processed, the state of every process in the wait queue is set to RUNNING. If the process has been removed from the run queue, it is put back onto the run queue. The next time the scheduler runs, the processes that are on the wait queue are now candidates to be run as they are now no longer waiting. When a process on the wait queue is scheduled the first thing that it will do is remove itself from the wait queue. Wait queues can be used to synchronize access to system resources and they are used by Linux in its implementation of semaphores (see below).
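The wait queue behaviour described above can be sketched in a few lines. Everything here is a simplified model: the names `demo_task`, `demo_wait_queue`, `demo_sleep_on` and `demo_wake_up` are invented for the example, and the scheduler and run queue are left out entirely. The sketch only shows a process being appended to the end of the queue in a waiting state, and every queued process being marked RUNNING when the queue is processed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process states, mirroring the names used in the text. */
enum demo_state { DEMO_RUNNING, DEMO_INTERRUPTIBLE, DEMO_UNINTERRUPTIBLE };

/* Minimal stand-in for a task_struct: only the state matters here. */
struct demo_task {
    enum demo_state state;
};

/* One wait queue element: a task pointer and a link to the next element. */
struct demo_wait_queue {
    struct demo_task *task;
    struct demo_wait_queue *next;
};

/* Append an element to the end of the queue and mark its task as waiting. */
static void demo_sleep_on(struct demo_wait_queue **queue,
                          struct demo_wait_queue *entry,
                          enum demo_state how)
{
    assert(how != DEMO_RUNNING);   /* a waiter is never left RUNNING */
    entry->task->state = how;
    entry->next = NULL;
    while (*queue != NULL)
        queue = &(*queue)->next;
    *queue = entry;
}

/* Process the queue: every waiting task becomes a candidate to run again.
 * Elements stay queued; in the text each process removes itself once it
 * is scheduled. Returns the number of tasks made RUNNING. */
static int demo_wake_up(struct demo_wait_queue *queue)
{
    int woken = 0;
    for (; queue != NULL; queue = queue->next) {
        queue->task->state = DEMO_RUNNING;
        woken++;
    }
    return woken;
}
```

A caller would create one `demo_wait_queue` element per waiting task, pass it to `demo_sleep_on`, and later call `demo_wake_up` on the queue head when the resource becomes available.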

11.5 Buzz Locks

These are better known as spin locks and they are a primitive way of protecting a data structure or piece of code. They only allow one process at a time to be within a critical region of code. They are used in Linux to restrict access to fields in data structures, using a single integer field as a lock. Each process wishing to enter the region attempts to change the lock's initial value from 0 to 1. If its current value is 1, the process tries again, spinning in a tight loop of code. The access to the memory location holding the lock must be atomic; the action of reading its value, checking that it is 0 and then changing it to 1 cannot be interrupted by any other process. Most CPU architectures provide support for this via special instructions but you can also implement buzz locks using uncached main memory.

When the owning process leaves the critical region of code it decrements the buzz lock, returning its value to 0. Any processes spinning on the lock will now read it as 0; the first one to do this will increment it to 1 and enter the critical region.

11.6 Semaphores

Semaphores are used to protect critical regions of code or data structures. Remember that each access of a critical piece of data such as a VFS inode describing a directory is made by kernel code running on behalf of a process. It would be very dangerous to allow one process to alter a critical data structure that is being used by another process. One way to achieve this would be to hold a buzz lock while the critical piece of data is being accessed but this is a simplistic approach that would not give very good system performance. Instead Linux uses semaphores to allow just one process at a time to access critical regions of code and data; all other processes wishing to access this resource will be made to wait until it becomes free. The waiting processes are suspended; other processes in the system can continue to run as normal.

A Linux semaphore data structure contains the following information:

count
    This field keeps track of the count of processes wishing to use this resource. A positive value means that the resource is available. A negative or zero value means that processes are waiting for it. An initial value of 1 means that one and only one process at a time can use this resource. When processes want this resource they decrement the count and when they have finished with this resource they increment the count.
waking
    This is the count of processes waiting for this resource which is also the number of processes waiting to be woken up when this resource becomes free.
wait queue
    When processes are waiting for this resource they are put onto this wait queue.
lock
    A buzz lock used when accessing the waking field.

Suppose the initial count for a semaphore is 1; the first process to come along will see that the count is positive and decrement it by 1, making it 0. The process now ``owns'' the critical piece of code or resource that is being protected by the semaphore. When the process leaves the critical region it increments the semaphore's count. The optimal case is where there are no other processes contending for ownership of the critical region. Linux has implemented semaphores to work efficiently for this, the most common, case.

If another process wishes to enter the critical region whilst it is owned by a process it too will decrement the count. As the count is now negative (-1) the process cannot enter the critical region. Instead it must wait until the owning process exits it. Linux makes the waiting process sleep until the owning process wakes it on exiting the critical region. The waiting process adds itself to the semaphore's wait queue and sits in a loop checking the value of the waking field and calling the scheduler until waking is non-zero.

The owner of the critical region increments the semaphore's count and if it is less than or equal to zero then there are processes sleeping, waiting for this resource. In the optimal case the semaphore's count would have been returned to its initial value of 1 and no further work would be necessary. The owning process increments the waking counter and wakes up the process sleeping on the semaphore's wait queue. When the waiting process wakes up, the waking counter is now 1 and it knows that it may now enter the critical region. It decrements the waking counter, returning it to a value of zero, and continues. All accesses to the waking field of the semaphore are protected by a buzz lock using the semaphore's lock.

Footnotes:
1 REVIEW NOTE: What is to stop a task in state INTERRUPTIBLE being made to run the next time the scheduler runs? Processes in a wait queue should never run until they are woken up.
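Looking back at sections 11.5 and 11.6, the bookkeeping of the count and waking fields can be made concrete with a small model. This is a sketch under stated assumptions, not the kernel's implementation: all names (`demo_buzz_lock`, `demo_semaphore`, `demo_down`, `demo_up`, `demo_try_wake`) are invented, C11 atomics stand in for the CPU's special instructions, sleeping and the wait queue are omitted, and the count is updated without atomicity, which a real multi-process implementation would need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Buzz (spin) lock: 0 means free, 1 means held. */
typedef atomic_int demo_buzz_lock;

static void demo_buzz_acquire(demo_buzz_lock *lock)
{
    int expected = 0;
    /* Atomically change 0 to 1; on failure, reset 'expected' and spin. */
    while (!atomic_compare_exchange_weak(lock, &expected, 1))
        expected = 0;
}

static void demo_buzz_release(demo_buzz_lock *lock)
{
    assert(atomic_load(lock) == 1);   /* only a held lock can be released */
    atomic_store(lock, 0);
}

/* Hypothetical model of the semaphore fields named in the text. */
struct demo_semaphore {
    int count;             /* 1 = free, 0 = owned, negative = waiters */
    int waking;            /* waiters that have been told to proceed */
    demo_buzz_lock lock;   /* protects 'waking' */
};

/* Try to enter: returns true if the caller owns the region at once,
 * false if it would have to join the wait queue and sleep. */
static bool demo_down(struct demo_semaphore *sem)
{
    sem->count--;
    return sem->count >= 0;
}

/* One turn of a waiter's loop: consume a pending wake-up, if any. */
static bool demo_try_wake(struct demo_semaphore *sem)
{
    bool may_enter = false;
    demo_buzz_acquire(&sem->lock);
    if (sem->waking > 0) {
        sem->waking--;
        may_enter = true;
    }
    demo_buzz_release(&sem->lock);
    return may_enter;
}

/* Leave the region: returns true if a sleeping process must be woken. */
static bool demo_up(struct demo_semaphore *sem)
{
    sem->count++;
    if (sem->count > 0)
        return false;      /* optimal case: nobody was waiting */
    demo_buzz_acquire(&sem->lock);
    sem->waking++;
    demo_buzz_release(&sem->lock);
    return true;
}
```

Starting from a count of 1, a first `demo_down` succeeds and leaves the count at 0, a second leaves it at -1 and reports that the caller must wait, and the owner's `demo_up` then raises waking to 1 so that the waiter's next `demo_try_wake` lets it in.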

Chapter 12
Modules

This chapter describes how the Linux kernel can dynamically load functions, for example filesystems, only when they are needed.

Linux is a monolithic kernel; that is, it is one, single, large program where all the functional components of the kernel have access to all of its internal data structures and routines. The alternative is to have a micro-kernel structure where the functional pieces of the kernel are broken out into separate units with strict communication mechanisms between them. The monolithic design makes adding new components into the kernel via the configuration process rather time consuming. Say you wanted to use a SCSI driver for an NCR 810 SCSI and you had not built it into the kernel. You would have to configure and then build a new kernel before you could use the NCR 810. There is an alternative: Linux allows you to dynamically load and unload components of the operating system as you need them. Linux modules are lumps of code that can be dynamically linked into the kernel at any point after the system has booted. They can be unlinked from the kernel and removed when they are no longer needed. Mostly Linux kernel modules are device drivers, pseudo-device drivers such as network drivers, or file-systems.

You can either load and unload Linux kernel modules explicitly using the insmod and rmmod commands or the kernel itself can demand that the kernel daemon (kerneld) loads and unloads the modules as they are needed. Dynamically loading code as it is needed is attractive as it keeps the kernel size to a minimum and makes the kernel very flexible. My current Intel kernel uses modules extensively and is only 406Kbytes long. I only occasionally use VFAT file systems and so I build my Linux kernel to automatically load the VFAT file system module as I mount a VFAT partition. When I have unmounted the VFAT partition the system detects that I no longer need the VFAT file system module and removes it from the system. Modules can also be useful for trying out new kernel code without having to rebuild and reboot the kernel every time you try it out. Nothing, though, is for free and there is a slight performance and memory penalty associated with kernel modules. There is a little more code that a loadable module must provide and this and the extra data structures take a little more memory. There is also a level of indirection introduced that makes accesses of kernel resources slightly less efficient for modules.

Once a Linux module has been loaded it is as much a part of the kernel as any normal kernel code. It has the same rights and responsibilities as any kernel code; in other words, Linux kernel modules can crash the kernel just like all kernel code or device drivers can.

So that modules can use the kernel resources that they need, they must be able to find them. Say a module needs to call kmalloc(), the kernel memory allocation routine. At the time that it is built, a module does not know where in memory kmalloc() is, so when the module is loaded, the kernel must fix up all of the module's references to kmalloc() before the module can work. The kernel keeps a list of all of the kernel's resources in the kernel symbol table so that it can resolve references to those resources from the modules as they are loaded. Linux allows module stacking; this is where one module requires the services of another module. For example, the VFAT file system module requires the services of the FAT file system module as the VFAT file system is more or less a set of extensions to the FAT file system. One module requiring services or resources from another module is very similar to the situation where a module requires services and resources from the kernel itself. Only here the required services are in another, previously loaded module. As each module is loaded, the kernel modifies the kernel symbol table, adding to it all of the resources or symbols exported by the newly loaded module. This means that, when the next module is loaded, it has access to the services of the already loaded modules.

When an attempt is made to unload a module, the kernel needs to know that the module is unused and it needs some way of notifying the module that it is about to be unloaded. That way the module will be able to free up any system resources that it has allocated, for example kernel memory or interrupts, before it is removed from the kernel. When the module is unloaded, the kernel removes any symbols that that module exported into the kernel symbol table.

Apart from the ability of a loaded module to crash the operating system by being badly written, it presents another danger. What happens if you load a module built for an earlier or later kernel than the one that you are now running? This may cause a problem if, say, the module makes a call to a kernel routine and supplies the wrong arguments. The kernel can optionally protect against this by making rigorous version checks on the module as it is loaded.

12.1 Loading a Module

Figure 12.1: The List of Kernel Modules

There are two ways that a kernel module can be loaded. The first way is to use the insmod command to manually insert it into the kernel. The second, and much more clever way, is to load the module as it is needed; this is known as demand loading. When the kernel discovers the need for a module, for example when the user mounts a file system that is not in the kernel, the kernel will request that the kernel daemon (kerneld) attempts to load the appropriate module.

The kernel daemon is a normal user process albeit with super user privileges. When it is started up, usually at system boot time, it opens up an Inter-Process Communication (IPC) channel to the kernel. This link is used by the kernel to send messages to the kerneld asking for various tasks to be performed.

Kerneld's major function is to load and unload kernel modules but it is also capable of other tasks such as starting up the PPP link over serial line when it is needed and closing it down when it is not. Kerneld does not perform these tasks itself; it runs the necessary programs such as insmod to do the work. Kerneld is just an agent of the kernel, scheduling work on its behalf.

The insmod utility must find the requested kernel module that it is to load. Demand loaded kernel modules are normally kept in /lib/modules/kernel-version. The kernel modules are linked object files just like other programs in the system except that they are linked as relocatable images; that is, images that are not linked to run from a particular address. They can be either a.out or elf format object files. insmod makes a privileged system call to find the kernel's exported symbols. These are kept in pairs containing the symbol's name and its value, for example its address. The kernel's exported symbol table is held in the first module data structure in the list of modules maintained by the kernel and pointed at by the module_list pointer. Only specifically entered symbols are added into the table, which is built when the kernel is compiled and linked; not every symbol in the kernel is exported to its modules. An example symbol is ``request_irq'' which is the kernel routine that must be called when a driver wishes to take control of a particular system interrupt. In my current kernel, this has a value of 0x0010cd30. You can easily see the exported kernel symbols and their values by looking at /proc/ksyms or by using the ksyms utility. The ksyms utility can either show you all of the exported kernel symbols or only those symbols exported by loaded modules. insmod reads the module into its virtual memory and fixes up its unresolved references to kernel routines and resources using the exported symbols from the kernel. This fixing up takes the form of patching the module image in memory. insmod physically writes the address of the symbol into the appropriate place in the module.

When insmod has fixed up the module's references to exported kernel symbols, it asks the kernel for enough space to hold the new module, again using a privileged system call. The kernel allocates a new module data structure and enough kernel memory to hold the new module and puts it at the end of the kernel modules list. The new module is marked as UNINITIALIZED. Figure 12.1 shows the list of kernel modules after two modules, FAT and VFAT, have been loaded into the kernel. Not shown in the diagram is the first module on the list, which is a pseudo-module that is only there to hold the kernel's exported symbol table. You can use the command lsmod to list all of the loaded kernel modules and their interdependencies. lsmod simply reformats /proc/modules which is built from the list of kernel module data structures. The memory that the kernel allocates for it is mapped into the insmod process's address space so that it can access it. insmod copies the module into the allocated space and relocates it so that it will run from the kernel address that it has been allocated. This must happen as the module cannot expect to be loaded at the same address twice let alone into the same address in two different Linux systems. Again, this relocation involves patching the module image with the appropriate addresses.

The new module also exports symbols to the kernel and insmod builds a table of these exported symbols. Every kernel module must contain module initialization and module cleanup routines and these symbols are deliberately not exported but insmod must know the addresses of them so that it can pass them to the kernel. All being well, insmod is now ready to initialize the module and it makes a privileged system call passing the kernel the addresses of the module's initialization and cleanup routines.
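The symbol fix-up that insmod performs can be pictured as a lookup in a table of name and value pairs followed by a write into the module image. The sketch below is only a model of that idea: the structures and the functions `demo_lookup` and `demo_fix_up` are invented for illustration, a plain `unsigned long` slot stands in for a location in the module image, and none of this reflects the real object file formats or insmod's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One exported symbol: its name and its value (for a routine, its address). */
struct demo_symbol {
    const char *name;
    unsigned long value;
};

/* An unresolved reference in the module image: the symbol it needs and
 * the place where that symbol's value has to be written. */
struct demo_reference {
    const char *name;
    unsigned long *patch_site;
};

/* Look a name up in an exported symbol table; NULL if it is not exported. */
static const struct demo_symbol *demo_lookup(const struct demo_symbol *table,
                                             size_t count, const char *name)
{
    assert(table != NULL && name != NULL);
    for (size_t i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* Patch every reference that can be resolved against the table.
 * Returns the number of references left unresolved. */
static int demo_fix_up(const struct demo_symbol *table, size_t count,
                       struct demo_reference *refs, size_t nrefs)
{
    int unresolved = 0;
    for (size_t i = 0; i < nrefs; i++) {
        const struct demo_symbol *sym = demo_lookup(table, count, refs[i].name);
        if (sym != NULL)
            *refs[i].patch_site = sym->value;   /* write the address in place */
        else
            unresolved++;
    }
    return unresolved;
}
```

With a table entry pairing ``request_irq'' with 0x0010cd30, a reference to that name would have 0x0010cd30 written into its slot, while a reference to a symbol that is not exported would be counted as unresolved.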

When a new module is added into the kernel, it must update the kernel's set of symbols and modify the modules that are being used by the new module. Modules that have other modules dependent on them must maintain a list of references at the end of their symbol table and pointed at by their module data structure. Figure 12.1 shows that the VFAT file system module is dependent on the FAT file system module. So, the FAT module contains a reference to the VFAT module; the reference was added when the VFAT module was loaded. The kernel calls the module's initialization routine and, if it is successful, it carries on installing the module. The module's cleanup routine address is stored in its module data structure and it will be called by the kernel when that module is unloaded. Finally, the module's state is set to RUNNING.

12.2 Unloading a Module

Modules can be removed using the rmmod command but demand loaded modules are automatically removed from the system by kerneld when they are no longer being used. Every time its idle timer expires, kerneld makes a system call requesting that all unused demand loaded modules are removed from the system. The timer's value is set when you start kerneld; my kerneld checks every 180 seconds. So, for example, if you mount an iso9660 CD ROM and your iso9660 filesystem is a loadable module, then shortly after the CD ROM is unmounted, the iso9660 module will be removed from the kernel.

A module cannot be unloaded so long as other components of the kernel are depending on it. For example, you cannot unload the VFAT module if you have one or more VFAT file systems mounted. If you look at the output of lsmod, you will see that each module has a count associated with it. For example:

    Module:   #pages:  Used by:
    msdos        5     1
    vfat         4     1 (autoclean)
    fat          6     2 (autoclean) [vfat msdos]

The count is the number of kernel entities that are dependent on this module. In the above example, the vfat and msdos modules are both dependent on the fat module and so it has a count of 2. Both the vfat and msdos modules have 1 dependent, which is a mounted file system. If I were to load another VFAT file system then the vfat module's count would become 2. A module's count is held in the first longword of its image.

This field is slightly overloaded as it also holds the AUTOCLEAN and VISITED flags. Both of these flags are used for demand loaded modules. These modules are marked as AUTOCLEAN so that the system can recognize which ones it may automatically unload. The VISITED flag marks the module as in use by one or more other system components; it is set whenever another component makes use of the module. Each time the system is asked by kerneld to remove unused demand loaded modules it looks through all of the modules in the system for likely candidates. It only looks at modules marked as AUTOCLEAN and in the state RUNNING. If the candidate has its VISITED flag cleared then it will remove the module, otherwise it will clear the VISITED flag and go on to look at the next module in the system.

Assuming that a module can be unloaded, its cleanup routine is called to allow it to free up the kernel resources that it has allocated. The module data structure is marked as DELETED and it is unlinked from the list of kernel modules. Any other modules that it is dependent on have their reference lists modified so that they no longer have it as a dependent. All of the kernel memory that the module needed is deallocated.

© 1996-1999 David A Rusling
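The automatic unloading pass described in section 12.2 can be pictured with a small user-space model. This is only a sketch in Python, not kernel code: the Module class, its field names and the sweep() function are invented for illustration, and the kernel really packs the count and both flags into a single longword.

```python
from dataclasses import dataclass

RUNNING, DELETED = "RUNNING", "DELETED"


@dataclass
class Module:
    name: str
    count: int = 0           # kernel entities that depend on this module
    autoclean: bool = False  # demand loaded, so it may be removed automatically
    visited: bool = False    # set whenever another component uses the module
    state: str = RUNNING


def sweep(modules):
    """One pass over the module list, as requested each time kerneld's
    idle timer expires. Returns the names of the modules removed."""
    removed = []
    for mod in modules:
        # Only AUTOCLEAN modules in the RUNNING state are candidates.
        if not mod.autoclean or mod.state != RUNNING:
            continue
        # A module that something still depends on cannot be unloaded.
        if mod.count > 0:
            continue
        if mod.visited:
            # Recently used: clear the flag and look again next time.
            mod.visited = False
            continue
        mod.state = DELETED
        removed.append(mod.name)
    return removed


if __name__ == "__main__":
    modules = [
        Module("iso9660", count=0, autoclean=True, visited=True),
        Module("fat", count=2, autoclean=True),
        Module("msdos", count=1),
    ]
    print(sweep(modules))  # first pass only clears iso9660's VISITED flag
    print(sweep(modules))  # second pass removes the now idle iso9660 module
```

The two passes show why a recently unmounted file system's module survives one timer period before it disappears, while fat stays loaded for as long as vfat and msdos depend on it.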

Chapter 13
Processors

Linux runs on a number of processors; this chapter gives a brief outline of each of them.

13.1 X86

TBD

13.2 ARM

The ARM processor implements a low power, high performance 32 bit RISC architecture. It is being widely used in embedded devices such as mobile phones and PDAs (Personal Data Assistants). It has 31 32-bit registers with 16 visible in any mode. Its instructions are simple load and store instructions (load a value from memory, perform an operation and store the result back into memory). One interesting feature it has is that every instruction is conditional. For example, you can test the value of a register and, until you next test for the same condition, you can conditionally execute instructions as and when you like. Another interesting feature is that you can perform arithmetic and shift operations on values as you load them. It operates in several modes, including a system mode that can be entered from user mode via a SWI (software interrupt).

It is a synthesizable core and ARM (the company) does not itself manufacture processors. Instead the ARM partners (companies such as Intel or LSI for example) implement the ARM architecture in silicon. It allows other processors to be tightly coupled via a co-processor interface and it has several memory management unit variations. These range from simple memory protection schemes to complex page hierarchies.

13.3 Alpha AXP Processor

The Alpha AXP architecture is a 64-bit load/store RISC architecture designed with speed in mind. All registers are 64 bits in length; there are 32 integer registers and 32 floating point registers. Integer register 31 and floating point register 31 are used for null operations. A read from them generates a zero value and a write to them has no effect. All instructions are 32 bits long and memory operations are either reads or writes. The architecture allows different implementations so long as the implementations follow the architecture.

There are no instructions that operate directly on values stored in memory; all data manipulation is done between registers. So, if you want to increment a counter in memory, you first read it into a register, then modify it and write it out. The instructions only interact with each other by one instruction writing to a register or memory location and another instruction reading that register or memory location. One interesting feature of Alpha AXP is that when an instruction generates a flag, such as testing if two registers are equal, the result is not stored in a processor status register, but is instead stored in a third register. This may seem strange at first, but removing this dependency from a status register means that it is much easier to build a CPU which can issue multiple instructions every cycle. Instructions on unrelated registers do not have to wait for each other to execute as they would if there were a single status register. The lack of direct operations on memory and the large number of registers also help issue multiple instructions.

The Alpha AXP architecture uses a set of subroutines, called privileged architecture library code (PALcode). PALcode is specific to the operating system, the CPU implementation of the Alpha AXP architecture and to the system hardware. These subroutines provide operating system primitives for context switching, interrupts, exceptions and memory management. These subroutines can be invoked by hardware or by CALL_PAL instructions. PALcode is written in standard Alpha AXP assembler with some implementation specific extensions to provide direct access to low level hardware functions, for example internal processor registers. PALcode is executed in PALmode, a privileged mode that stops some system events happening and allows the PALcode complete control of the physical system hardware.
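Three of the Alpha AXP properties described in this chapter (register 31 reading as zero, load/modify/store as the only way to change memory, and comparison results landing in an ordinary register) can be modelled in a few lines. The sketch below is a Python toy, not an emulator: the RegisterFile class and the increment_in_memory() helper are invented for illustration, and cmpeq() only borrows the name of the Alpha compare instruction.

```python
class RegisterFile:
    """32 integer registers of 64 bits; register 31 is the null register."""

    MASK = (1 << 64) - 1

    def __init__(self):
        self._regs = [0] * 32

    def read(self, n):
        # A read from register 31 always generates zero.
        return 0 if n == 31 else self._regs[n]

    def write(self, n, value):
        # A write to register 31 has no effect.
        if n != 31:
            self._regs[n] = value & self.MASK


def cmpeq(regs, ra, rb, rc):
    """Compare two registers; the flag goes into a third register
    rather than into a processor status register."""
    regs.write(rc, 1 if regs.read(ra) == regs.read(rb) else 0)


def increment_in_memory(regs, memory, address, scratch=1):
    """No instruction works on memory directly: load, modify, store."""
    regs.write(scratch, memory[address])         # read it into a register
    regs.write(scratch, regs.read(scratch) + 1)  # modify it
    memory[address] = regs.read(scratch)         # write it out


if __name__ == "__main__":
    regs = RegisterFile()
    memory = {0x100: 41}
    increment_in_memory(regs, memory, 0x100)
    regs.write(2, 42)
    cmpeq(regs, 1, 2, 3)
    print(memory[0x100], regs.read(3))
```

Because the result of cmpeq() lives in a register chosen by the programmer, two unrelated comparisons never compete for one shared status register, which is the point the text makes about issuing several instructions per cycle.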

Chapter 14
The Linux Kernel Sources

This chapter describes where in the Linux kernel sources you should start looking for particular kernel functions.

This book does not depend on a knowledge of the 'C' programming language or require that you have the Linux kernel sources available in order to understand how the Linux kernel works. That said, it is a fruitful exercise to look at the kernel sources to get an in-depth understanding of the Linux operating system. This chapter gives an overview of the kernel sources; how they are arranged and where you might start to look for particular code.

Where to Get The Linux Kernel Sources

All of the major Linux distributions (Craftworks, Debian, Slackware, Red Hat etcetera) include the kernel sources in them. Usually the Linux kernel that got installed on your Linux system was built from those sources. By their very nature these sources tend to be a little out of date so you may want to get the latest sources from one of the web sites mentioned in chapter www-appendix. They are kept on ftp://ftp.cs.helsinki.fi and all of the other web sites shadow them. This makes the Helsinki web site the most up to date, but sites like MIT and Sunsite are never very far behind.

If you do not have access to the web, there are many CD ROM vendors who offer snapshots of the world's major web sites at a very reasonable cost. Some even offer a subscription service with quarterly or even monthly updates. Your local Linux User Group is also a good source of sources.

The Linux kernel sources have a very simple numbering system. Any even numbered kernel (for example 2.0.30, where the second number is even) is a stable, released, kernel and any odd numbered kernel (for example 2.1.42) is a development kernel. This book is based on the stable 2.0.30 source tree. Development kernels have all of the latest features and support all of the latest devices. Although they can be unstable, which may not be exactly what you want, it is important that the Linux community tries the latest kernels. That way they are tested for the whole community. Remember that it is always worth backing up your system thoroughly if you do try out non-production kernels.

Changes to the kernel sources are distributed as patch files. The patch utility is used to apply a series of edits to a set of source files. So, for example, if you have the 2.0.29 kernel source tree and you wanted to move to the 2.0.30 source tree, you would obtain the 2.0.30 patch file and apply the patches (edits) to that source tree.
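The even/odd numbering rule is easy to state as code. The following Python sketch assumes a three part version string of the form used in this chapter; the function names kernel_series() and patch_file_name() are invented for illustration, and the patch file name simply follows the patch-2.0.30 pattern used in this chapter.

```python
def kernel_series(version):
    """Classify a kernel version such as '2.0.30' or '2.1.42'.

    The second number decides: even means a stable, released kernel,
    odd means a development kernel.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    return "stable" if minor % 2 == 0 else "development"


def patch_file_name(target_version):
    """Name of the patch file that moves a tree up to target_version."""
    return "patch-" + target_version


if __name__ == "__main__":
    for version in ("2.0.30", "2.1.42"):
        print(version, kernel_series(version))
    print(patch_file_name("2.0.30"))
```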

A patch file is applied from the top of the source tree:

    $ cd /usr/src/linux
    $ patch -p1 < patch-2.0.30

This saves copying whole source trees, perhaps over slow serial connections. A good source of kernel patches (official and unofficial) is the http://www.linuxhq.com web site.

How The Kernel Sources Are Arranged

At the very top level of the source tree /usr/src/linux you will see a number of directories:

arch
    The arch subdirectory contains all of the architecture specific kernel code. It has further subdirectories, one per supported architecture, for example i386 and alpha.

include
    The include subdirectory contains most of the include files needed to build the kernel code. It too has further subdirectories including one for every architecture supported. The include/asm subdirectory is a soft link to the real include directory needed for this architecture, for example include/asm-i386. To change architectures you need to edit the kernel makefile and rerun the Linux kernel configuration program.

init
    This directory contains the initialization code for the kernel and it is a very good place to start looking at how the kernel works.

mm
    This directory contains all of the memory management code. The architecture specific memory management code lives down in arch/*/mm/, for example arch/i386/mm/fault.c.

drivers
    All of the system's device drivers live in this directory. They are further sub-divided into classes of device driver, for example block.

ipc
    This directory contains the kernel's inter-process communications code.

modules
    This is simply a directory used to hold built modules.

fs
    All of the file system code. This is further sub-divided into directories, one per supported file system, for example vfat and ext2.

kernel
    The main kernel code. Again, the architecture specific kernel code is in arch/*/kernel.

net
    The kernel's networking code.

lib
    This directory contains the kernel's library code. The architecture specific library code can be found in arch/*/lib/.

scripts
    This directory contains the scripts (for example awk and tk scripts) that are used when the kernel is configured.

Where to Start Looking

A large complex program like the Linux kernel can be rather daunting to look at. It is rather like a large ball of string with no end showing. Looking at one part of the kernel often leads to looking at several other related files and before long you have forgotten what you were looking for. The next subsections give you a hint as to where in the source tree the best place to look is for a given subject.

System Startup and Initialization

On an Intel based system, the kernel starts when either loadlin.exe or LILO has loaded the kernel into memory and passed control to it. Look in arch/i386/kernel/head.S for this part. Head.S does some architecture specific setup and then jumps to the main() routine in init/main.c.

Memory Management

This code is mostly in mm but the architecture specific code is in arch/*/mm. The page fault handling code is in mm/memory.c and the memory mapping and page cache code is in mm/filemap.c. The buffer cache is implemented in fs/buffer.c and the swap cache in mm/swap_state.c and mm/swapfile.c.

Kernel

Most of the relevant generic code is in kernel with the architecture specific code in arch/*/kernel. The scheduler is in kernel/sched.c and the fork code is in kernel/fork.c. The bottom half handling code is in include/linux/interrupt.h. The task_struct data structure can be found in include/linux/sched.h.

PCI

The PCI pseudo driver is in drivers/pci/pci.c with the system wide definitions in include/linux/pci.h. Each architecture has some specific PCI BIOS code; Alpha AXP's is in arch/alpha/kernel/bios32.c.

c.c in drivers/block and that the SCSI CD driver is in scsi.h. Device Drivers Most of the lines of the Linux kernel's source code are in its device drivers. All System V IPC objects include an ipc_perm data structure and this can be found in include/linux/ipc. The Intel interrupt handling code is in arch/i386/kernel/irq. shared memory in ipc/shm.c and its definitions in include/asm-i386/irq. Note that the ide CD driver is ide-cd.c and semaphores in ipc/sem.iol.c).h. A good place to look at how the PCI subsystem is mapped and initialized.c.c in drivers/scsi. serial ports and mice. /pci This are the sources for the PCI pseudo-driver. /sound This is where all of the sound card drivers are. It not only initializes the hard disks but also the network as you need a network to mount nfs file systems.09. /cdrom All of the CDROM code for Linux.c. /char This the place to look for character based devices such as ttys.it/LDP/tlk/sources/sources. /net This is where to look to find the network device drivers such as the DECChip 21040 PCI ethernet driver which is in tulip. If you want to look at how all of the devices that could possibly contain file systems are initialized then you should look at device_setup() in drivers/block/genhd.c. It is here that the special CDROM devices (such as Soundblaster CDROM) can be found. Block devices include both IDE and SCSI based devices. All of Linux's device driver sources are held in drivers but these are further broken out by type: /block block device drivers such as ide (in ide. /scsi This is where to find all of the SCSI code as well as all of the drivers for the scsi devices supported by Linux.c.04] . http://ldp. System V messages are implemented in ipc/msg. Pipes are implemented in ipc/pipe.html (4 di 5) [08/03/2001 10.c.Interprocess Communication This is all in ipc. The Alpha AXP PCI fixup code is also worth looking at in arch/alpha/kernel/bios32. 
Interrupt Handling The kernel's interrupt handling code is almost all microprocessor (and often platform) specific.

Modules The kernel module code is partially in the kernel and partially in the modules package. http://ldp. ext2_fs_i.09. You may want to look at the structure of an ELF object file in include/linux/elf.h and include/linux/kerneld. The buffer cache is implemented in fs/buffer.iol.html (5 di 5) [08/03/2001 10.h.h and ext2_fs_sb. The kernel code is all in kernel/modules. Network The networking code is kept in net with most of the include files in include/net. Table of Contents.it/LDP/tlk/sources/sources.h and the code is in fs/*. No Frames © 1996-1999 David A Rusling copyright notice.h. The network device drivers are in drivers/net. Top of Chapter.c.c with the data structures and kernel demon kerneld messages in include/linux/module.c along with the update kernel daemon.h.c and the IP version 4 INET socket code is in net/ipv4/af_inet. File translated from TEX by TTH.0. The BSD socket code is in net/socket. The Virtual File System data structures are described in include/linux/fs. Show Frames.File Systems The sources for the EXT2 file system are all in the fs/ext2/ directory with data structure definitions in include/linux/ext2_fs. The generic protocol support code (including the sk_buff handling routines) is in net/core with the TCP/IP networking code in net/ipv4.h respectively.04] . version 1.

09] . it may run again. the rest of the processes must wait before a CPU becomes free until they can be run. 4. Linux must keep track of the process itself and of the system resources that it has so that it can manage it and the other processes in the system fairly. the current. a new task_struct is allocated from system memory and added into the task vector. It would not be fair to the other processes in the system if one process monopolized most of the system's physical memory or its CPUs. The task vector is an array of pointers to every task_struct data structure in the system. process is pointed to by the current pointer. running. The current executing program. As well as the normal type of process. constantly changing as the machine code instructions are executed by the processor. To make it easy to find.iol. Processes carry out tasks within the operating system. Linux supports real time processes. a passive entity. by default it has 512 entries. interruptible http://ldp.1 Linux Processes So that Linux can manage the processes in the system. or process. manages and deletes the processes in the system. A program is a set of machine code instructions and data stored in an executable image on disk and is.html (1 di 11) [08/03/2001 10. As well as the program's instructions and data. Multiprocessing is a simple idea. Linux differentiates between two types of waiting process. a process can be thought of as a computer program in action. Java is another and these must be managed transparently as must the processes use of the system's shared libraries. to maximize CPU utilization.Table of Contents. During the lifetime of a process it will use many system resources. the process also includes the program counter and all of the CPU's registers as well as the process stacks containing temporary data such as routine parameters. Linux supports a number of different executable file formats. It is a dynamic entity. 
If one process crashes it will not cause another process in the system to crash. kernel managed mechanisms. a process is executed until it must wait. ELF is one. As processes are created. but its fields can be divided into a number of functional areas: State As a process executes it changes state according to its circumstances. It is the scheduler which chooses which is the most appropriate process to run next and Linux uses a number of scheduling strategies to ensure fairness.it/LDP/tlk/kernel/processes. Waiting The process is waiting for an event or for a resource. It will use the CPUs in the system to run its instructions and the system's physical memory to hold it and its data. In a multiprocessing system many processes are kept in memory at the same time. These processes have to react very quickly to external events (hence the term ``real time'') and they are treated differently from normal user processes by the scheduler. Whenever a process has to wait the operating system takes the CPU away from that process and gives it to another. Show Frames.09. This means that the maximum number of processes in the system is limited by the size of the task vector. more deserving process. the CPU would simply sit idle and the waiting time would be wasted. It will open and use files within the filesystems and may directly or indirectly use the physical devices in the system. when it has this resource. each process is represented by a task_struct data structure (task and process are terms that Linux uses interchangeably). Processes are separate tasks each with their own rights and responsibilities. usually for some system resource. No Frames Chapter 4 Processes This chapter describes what a process is and how the Linux kernel creates. as such. Although the task_struct data structure is quite large and complex. for example DOS. includes all of the current activity in the microprocessor. return addresses and saved variables. 
The most precious resource in the system is the CPU. Linux is a multiprocessing operating system. Each individual process runs in its own virtual address space and is not capable of interacting with another process except through secure. its objective is to have a process running on each CPU in the system at all times. In a uniprocessing system. usually there is only one. Linux is a multiprocessing operating system. If there are more processes than CPUs (and there usually are). Linux processes have the following states: 1 Running The process is either running (it is the current process in the system) or it is ready to run (it is waiting to be assigned to one of the system's CPUs).

File system Processes can open and close files as they wish and the processes task_struct contains pointers to descriptors for each open file as well as pointers to two VFS inodes. pipes and semaphores and also the System V IPC mechanisms of shared memory. for some reason. it is simply a number. pwd is derived from the Unix TM command pwd. Every process in the system. Inter-Process Communication Linux supports the classic Unix TM IPC mechanisms of signals. except the initial process has a parent process. Each process also has User and group identifiers. still has a task_struct data structure in the task vector. http://ldp.09. these are used to control this processes access to the files and devices in the system. Each clock tick. A process that is being debugged can be in a stopped state. they are copied.and uninterruptible. This list allows the Linux kernel to look at every process in the system. Zombie This is a halted process which. The IPC mechanisms supported by Linux are described in Chapter IPC-chapter. You can see the family relationship between the running processes in a Linux system using the pstree command: init(1)-+-crond(98) |-emacs(387) |-gpm(146) |-inetd(110) |-kerneld(18) |-kflushd(2) |-klogd(87) |-kswapd(3) |-login(160)---bash(192)---emacs(225) |-lpd(121) |-mingetty(161) |-mingetty(162) |-mingetty(163) |-mingetty(164) |-login(403)---bash(404)---pstree(594) |-sendmail(134) |-syslogd(78) `-update(166) Additionally all of the processes in the system are held in a doubly linked list whose root is the init processes task_struct data structure. Each VFS inode uniquely describes a file or directory within a file system and also provides a uniform interface to the underlying file systems. Links In a Linux system no process is independent of any other process. Interruptible waiting processes can be interrupted by signals whereas uninterruptible waiting processes are waiting directly on hardware conditions and cannot be interrupted under any circumstances. 
The process identifier is not an index into the task vector.iol. or rather cloned from previous processes. Every task_struct representing a process keeps pointers to its parent process and to its siblings (those processes with the same parent process) as well as to its own child processes. The first is to the root of the process (its home directory) and the second is to its current or pwd directory.it/LDP/tlk/kernel/processes. These two VFS inodes have their count fields incremented to show that one or more processes are referencing them. Linux also supports process specific interval timers.09] . Stopped The process has been stopped. It needs to do this to provide support for commands such as ps or kill. usually by receiving a signal. Scheduling Information The scheduler needs this information in order to fairly decide which process in the system most deserves to run. the kernel updates the amount of time in jiffies that the current process has spent in system and in user mode. or for that matter one of its sub-directories.html (2 di 11) [08/03/2001 10. This is why you cannot delete the directory that a process has as its pwd directory set to. How file systems are supported under Linux is described in Chapter filesystem-chapter. Times and Timers The kernel keeps track of a processes creation time as well as the CPU time that it consumes during its lifetime. a dead process. processes can use system calls to set up timers to send signals to themselves when the timers expire. New processes are not created. Identifiers Every process in the system has a process identifier. It is what it sounds like. print working directory. These timers can be single-shot or periodic timers. semaphores and message queues.

Virtual memory
    Most processes have some virtual memory (kernel threads and daemons do not) and the Linux kernel must track how that virtual memory is mapped onto the system's physical memory.

Processor Specific Context
    A process could be thought of as the sum total of the system's current state. Whenever a process is running it is using the processor's registers, stacks and so on. This is the process's context and, when a process is suspended, all of that CPU specific context must be saved in the task_struct for the process. When a process is restarted by the scheduler its context is restored from here.

4.2 Identifiers

Linux, like all Unix(TM), uses user and group identifiers to check for access rights to files and images in the system. All of the files in a Linux system have ownerships and permissions; these permissions describe what access the system's users have to that file or directory. Basic permissions are read, write and execute and are assigned to three classes of user; the owner of the file, processes belonging to a particular group and all of the processes in the system. Each class of user can have different permissions, for example a file could have permissions which allow its owner to read and write it, the file's group to read it and for all other processes in the system to have no access at all.

REVIEW NOTE: Expand and give the bit assignments (777).

Groups are Linux's way of assigning privileges to files and directories for a group of users rather than to a single user or to all processes in the system. You might, for example, create a group for all of the users in a software project and arrange it so that only they could read and write the source code for the project. A process can belong to several groups (a maximum of 32 is the default) and these are held in the groups vector in the task_struct for each process. So long as a file has access rights for one of the groups that a process belongs to then that process will have appropriate group access rights to that file.

There are four pairs of user and group identifiers held in a process's task_struct:

uid, gid
    The user identifier and group identifier of the user that the process is running on behalf of.

effective uid and gid
    There are some programs which change the uid and gid from that of the executing process into their own (held as attributes in the VFS inode describing the executable image). These programs are known as setuid programs and they are useful because it is a way of restricting accesses to services, particularly those that run on behalf of someone else, for example a network daemon. The effective uid and gid are those from the setuid program and the uid and gid remain as they were. The kernel checks the effective uid and gid whenever it checks for privilege rights.

file system uid and gid
    These are normally the same as the effective uid and gid and are used when checking file system access rights. They are needed for NFS mounted filesystems where the user mode NFS server needs to access files as if it were a particular process. In this case only the file system uid and gid are changed (not the effective uid and gid). This avoids a situation where malicious users could send a kill signal to the NFS server. Kill signals are delivered to processes with a particular effective uid and gid.

saved uid and gid
    These are mandated by the POSIX standard and are used by programs which change the process's uid and gid via system calls. They are used to save the real uid and gid during the time that the original uid and gid have been changed.

4.3 Scheduling

All processes run partially in user mode and partially in system mode. How these modes are supported by the underlying hardware differs but generally there is a secure mechanism for getting from user mode into system mode and back again. User mode has far fewer privileges than system mode. Each time a process makes a system call it swaps from user mode to system mode and continues executing. At this point the kernel is executing on behalf of the process. In Linux, processes do not preempt the current, running process; they cannot stop it from running so that they can run. Each process decides to relinquish the CPU that it is running on when it has to wait for some system event. For example, a process may have to wait for a character to be read from a file. This waiting happens within the system call, in system mode; the process used a library function to open and read the file and it, in turn, made system calls to read bytes from the open file. In this case the waiting process will be suspended and another, more deserving process will be chosen to run.

Processes are always making system calls and so may often need to wait. Even so, if a process executes until it waits then it still might use a disproportionate amount of CPU time and so Linux uses pre-emptive scheduling. In this scheme, each process is allowed to run for a small amount of time, 200ms, and, when this time has expired another process is selected to run and the original process is made to wait for a little while until it can run again. This small amount of time is known as a time-slice.

It is the scheduler that must select the most deserving process to run out of all of the runnable processes in the system.
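The effect of time-slices can be seen in a very small simulation. This is a sketch only, written in Python rather than taken from the kernel: it assumes that every process is always runnable, that they simply take turns, and that the only reason to stop running is an expired time-slice. The function name run_with_time_slices() and the work_ms argument are invented for illustration; the 200ms figure is the one quoted above.

```python
from collections import deque

TIME_SLICE_MS = 200  # the time-slice quoted in the text


def run_with_time_slices(work_ms):
    """Simulate processes sharing one CPU.

    work_ms maps a process name to the CPU time (in ms) it still needs.
    Returns the order in which processes ran as (name, ms_run) pairs.
    """
    queue = deque(work_ms.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(TIME_SLICE_MS, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Time-slice expired: wait at the back until it can run again.
            queue.append((name, remaining - ran))
    return timeline


if __name__ == "__main__":
    for name, ran in run_with_time_slices({"compile": 500, "shell": 100}):
        print(name, ran)
```

Without the time-slice, the process needing 500ms would hold the CPU for all of it before the 100ms process got a turn; with it, the short process finishes after the first slice of the long one.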

A runnable process is one which is waiting only for a CPU to run on. Linux uses a reasonably simple priority based scheduling algorithm to choose between the current processes in the system. When it has chosen a new process to run it saves the state of the current process, the processor specific registers and other context being saved in the process's task_struct data structure. It then restores the state of the new process (again this is processor specific) to run and gives control of the system to that process. For the scheduler to fairly allocate CPU time between the runnable processes in the system it keeps information in the task_struct for each process:

policy: This is the scheduling policy that will be applied to this process. There are two types of Linux process, normal and real time. Real time processes have a higher priority than all of the other processes. If there is a real time process ready to run, it will always run first. Real time processes may have two types of policy, round robin and first in first out. In round robin scheduling, each runnable real time process is run in turn, and in first in, first out scheduling each runnable process is run in the order that it is in on the run queue and that order is never changed.

priority: This is the priority that the scheduler will give to this process. It is also the amount of time (in jiffies) that this process will run for when it is allowed to run. You can alter the priority of a process by means of system calls and the renice command.

rt_priority: Linux supports real time processes and these are scheduled to have a higher priority than all of the other non-real time processes in the system. This field allows the scheduler to give each real time process a relative priority. The priority of a real time process can be altered using system calls.

counter: This is the amount of time (in jiffies) that this process is allowed to run for. It is set to priority when the process is first run and is decremented each clock tick.

The scheduler is run from several places within the kernel. It is run after putting the current process onto a wait queue and it may also be run at the end of a system call, just before a process is returned to process mode from system mode. One reason that it might need to run is because the system timer has just set the current process's counter to zero. Each time the scheduler is run it does the following:

kernel work: The scheduler runs the bottom half handlers and processes the scheduler task queue. These lightweight kernel threads are described in detail in chapter kernel-chapter.

Current process: The current process must be processed before another process can be selected to run. If the scheduling policy of the current process is round robin then it is put onto the back of the run queue. If the task is INTERRUPTIBLE and it has received a signal since the last time it was scheduled then its state becomes RUNNING. If the current process has timed out, then its state becomes RUNNING. If the current process is RUNNING then it will remain in that state. Processes that were neither RUNNING nor INTERRUPTIBLE are removed from the run queue. This means that they will not be considered for running when the scheduler looks for the most deserving process to run.

Process selection: The scheduler looks through the processes on the run queue looking for the most deserving process to run. If there are any real time processes (those with a real time scheduling policy) then those will get a higher weighting than ordinary processes. The weight for a normal process is its counter but for a real time process it is counter plus 1000. This means that if there are any runnable real time processes in the system then these will always be run before any normal runnable processes. The current process, which has consumed some of its time-slice (its counter has been decremented), is at a disadvantage if there are other processes with equal priority in the system; that is as it should be. If several processes have the same priority, the one nearest the front of the run queue is chosen. The current process will get put onto the back of the run queue. In a balanced system with many processes of the same priority, each one will run in turn. This is known as Round Robin scheduling. However, as processes wait for resources, their run order tends to get moved around.

Swap processes: If the most deserving process to run is not the current process, then the current process must be suspended and the new one made to run. When a process is running it is using the registers and physical memory of the CPU and of the system. Each time it calls a routine it passes its arguments in registers and may stack saved values such as the address to return to in the calling routine. So, when the scheduler is running it is running in the context of the current process. It will be in a privileged mode, kernel mode, but it is still the current process that is running. When that process comes to be suspended, all of its machine state, including the program counter (PC) and all of the processor's registers, must be saved in the process's task_struct data structure. Then, all of the machine state for the new process must be loaded. This is a system dependent operation; no CPUs do this in quite the same way, but there is usually some hardware assistance for this act.
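The process selection rule can be illustrated with a small Python model. This is only a sketch, not kernel code: the Task class and the weight and pick_next helpers are hypothetical names invented for the example. The only rules taken from the text are that a normal process is weighted by its counter, a real time process by its counter plus 1000, and that ties go to the process nearest the front of the run queue.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for the scheduling fields of a task_struct."""
    name: str
    counter: int            # remaining time-slice, in jiffies
    realtime: bool = False  # True if the task has a real time policy


def weight(task):
    """Weight as described in the text: counter, plus 1000 if real time."""
    return task.counter + 1000 if task.realtime else task.counter


def pick_next(run_queue):
    """Return the most deserving task on the run queue."""
    best = None
    for task in run_queue:
        # Strict '>' keeps the earlier task when two weights are equal,
        # so the task nearest the front of the queue wins a tie.
        if best is None or weight(task) > weight(best):
            best = task
    return best


queue = [
    Task("editor", 12),
    Task("compiler", 20),
    Task("audio", 3, realtime=True),
    Task("shell", 20),
]
chosen = pick_next(queue)
print(chosen.name)
```

With this queue the real time task is chosen even though its counter is the smallest. Without it, "compiler" would win over "shell" because it is nearer the front of the queue.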

This swapping of process context takes place at the end of the scheduler. The saved context for the previous process is, therefore, a snapshot of the hardware context of the system as it was for this process at the end of the scheduler. Equally, when the context of the new process is loaded, it too will be a snapshot of the way things were at the end of the scheduler, including this process's program counter and register contents.

If the previous process or the new current process uses virtual memory then the system's page table entries may need to be updated. Again, this action is architecture specific. Processors like the Alpha AXP, which use Translation Look-aside Tables or cached Page Table Entries, must flush those cached table entries that belonged to the previous process.

4.3.1 Scheduling in Multiprocessor Systems

Systems with multiple CPUs are reasonably rare in the Linux world but a lot of work has already gone into making Linux an SMP (Symmetric Multi-Processing) operating system. That is, one that is capable of evenly balancing work between the CPUs in the system. Nowhere is this balancing of work more apparent than in the scheduler.

In a multiprocessor system, hopefully, all of the processors are busily running processes. Each will run the scheduler separately as its current process exhausts its time-slice or has to wait for a system resource. The first thing to notice about an SMP system is that there is not just one idle process in the system. In a single processor system the idle process is the first task in the task vector; in an SMP system there is one idle process per CPU, and you could have more than one idle CPU. Additionally there is one current process per CPU, so SMP systems must keep track of the current and idle processes for each processor.

In an SMP system each process's task_struct contains the number of the processor that it is currently running on (processor) and the number of the last processor that it ran on (last_processor). There is no reason why a process should not run on a different CPU each time it is selected to run but Linux can restrict a process to one or more processors in the system using the processor_mask. If bit N is set, then this process can run on processor N. When the scheduler is choosing a new process to run it will not consider one that does not have the appropriate bit set for the current processor's number in its processor_mask. The scheduler also gives a slight advantage to a process that last ran on the current processor because there is often a performance overhead when moving a process to a different processor.

4.4 Files

Figure 4.1: A Process's Files

Figure 4.1 shows that there are two data structures that describe file system specific information for each process in the system. The first, the fs_struct, contains pointers to this process's VFS inodes and its umask. The umask masks permission bits out of the mode that new files are created with, and it can be changed via system calls.

The second data structure, the files_struct, contains information about all of the files that this process is currently using. Programs read from standard input and write to standard output. Any error messages should go to standard error. These may be files, terminal input/output or a real device but so far as the program is concerned they are all treated as files. Every file has its own descriptor and the files_struct contains pointers to up to 256 file data structures, each one describing a file being used by this process. The f_mode field describes what mode the file has been created in: read only, read and write or write only. f_pos holds the position in the file where the next read or write operation will occur. f_inode points at the VFS inode describing the file and f_ops is a pointer to a vector of routine addresses, one for each function that you might wish to perform on a file. There is, for example, a write data function. This abstraction of the interface is very powerful and allows Linux to support a wide variety of file types. In Linux, pipes are implemented using this mechanism as we shall see later.

Every time a file is opened, one of the free file pointers in the files_struct is used to point to the new file structure. Linux processes expect three file descriptors to be open when they start. These are known as standard input, standard output and standard error and they are usually inherited from the creating parent process. All accesses to files are via standard system calls which pass or return file descriptors. These descriptors are indices into the process's fd vector, so standard input, standard output and standard error have file descriptors 0, 1 and 2. Each access to the file uses the file data structure's file operation routines together with the VFS inode to achieve its needs.

4.5 Virtual Memory

A process's virtual memory contains executable code and data from many sources. First there is the program image that is loaded, for example a command like ls. This command, like all executable images, is composed of both executable code and data. The image file contains all of the information necessary to load the executable code and associated program data into the virtual memory of the process. Secondly, processes can allocate (virtual) memory to use during their processing, say to hold the contents of files that they are reading. This newly allocated, virtual, memory needs to be linked into the process's existing virtual memory so that it can be used. Thirdly, Linux processes use libraries of commonly useful code, for example file handling routines. It does not make sense that each process has its own copy of the library, so Linux uses shared libraries that can be used by several running processes at the same time. The code and the data from these shared libraries must be linked into this process's virtual address space and also into the virtual address space of the other processes sharing the library.

In any given time period a process will not have used all of the code and data contained within its virtual memory. It could contain code that is only used during certain situations, such as during initialization or to process a particular event. It may only have used some of the routines from its shared libraries. It would be wasteful to load all of this code and data into physical memory where it would lie unused. Multiply this wastage by the number of processes in the system and the system would run very inefficiently. Instead, Linux uses a technique called demand paging where the virtual memory of a process is brought into physical memory only when a process attempts to use it. So, instead of loading the code and data into physical memory straight away, the Linux kernel alters the process's page table, marking the virtual areas as existing but not in memory. When the process attempts to access the code or data the system hardware will generate a page fault and hand control to the Linux kernel to fix things up. Therefore, for every area of virtual memory in the process's address space Linux needs to know where that virtual memory comes from and how to get it into memory so that it can fix up these page faults.
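The idea behind demand paging can be shown with a toy model in Python. It is a sketch under simplifying assumptions, not a description of the kernel's data structures: the DemandPagedMemory class, its tiny page size and its byte-string "backing store" are all invented for illustration. What it does show is the behaviour described above: nothing is loaded up front, the first touch of a page causes a "page fault" that brings just that page in, and untouched pages never occupy memory.

```python
class DemandPagedMemory:
    """Toy model: pages are loaded from a backing store on first access."""

    def __init__(self, backing_store, page_size=4):
        self.backing_store = backing_store  # stands in for the image file
        self.page_size = page_size          # unrealistically small, for clarity
        self.resident = {}                  # page number -> bytes in "physical memory"
        self.page_faults = 0

    def read(self, address):
        page, offset = divmod(address, self.page_size)
        if page not in self.resident:
            # "Page fault": the page exists but is not in memory yet,
            # so fetch only this page from the backing store.
            self.page_faults += 1
            start = page * self.page_size
            self.resident[page] = self.backing_store[start:start + self.page_size]
        return self.resident[page][offset]


memory = DemandPagedMemory(b"abcdefghijkl")
first = memory.read(5)    # faults page 1 in
second = memory.read(6)   # same page, no new fault
third = memory.read(0)    # faults page 0 in
print(memory.page_faults, sorted(memory.resident))
```

Page 2 of the backing store is never read, so it is never loaded, which is the saving that demand paging is after.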

Figure 4.2: A Process's Virtual Memory

The Linux kernel needs to manage all of these areas of virtual memory and the contents of each process's virtual memory is described by a mm_struct data structure pointed at from its task_struct. The process's mm_struct data structure also contains information about the loaded executable image and a pointer to the process's page tables. It contains pointers to a list of vm_area_struct data structures, each representing an area of virtual memory within this process.

This linked list is in ascending virtual memory order; figure 4.2 shows the layout in virtual memory of a simple process together with the kernel data structures managing it. As those areas of virtual memory are from several sources, Linux abstracts the interface by having the vm_area_struct point to a set of virtual memory handling routines (via vm_ops). This way all of the process's virtual memory can be handled in a consistent way no matter how the underlying services managing that memory differ. For example there is a routine that will be called when the process attempts to access the memory and it does not exist; this is how page faults are handled.

The process's set of vm_area_struct data structures is accessed repeatedly by the Linux kernel as it creates new areas of virtual memory for the process and as it fixes up references to virtual memory not in the system's physical memory. This makes the time that it takes to find the correct vm_area_struct critical to the performance of the system. To speed up this access, Linux also arranges the vm_area_struct data structures into an AVL (Adelson-Velskii and Landis) tree. This tree is arranged so that each vm_area_struct (or node) has a left and a right pointer to its neighbouring vm_area_struct structure. The left pointer points to a node with a lower starting virtual address and the right pointer points to a node with a higher starting virtual address. To find the correct node, Linux goes to the root of the tree and follows each node's left and right pointers until it finds the right vm_area_struct. Of course, nothing is for free and inserting a new vm_area_struct into this tree takes additional processing time.

When a process allocates virtual memory, Linux does not actually reserve physical memory for the process. Instead, it describes the virtual memory by creating a new vm_area_struct data structure. This is linked into the process's list of virtual memory. When the process attempts to write to a virtual address within that new virtual memory region then the system will page fault. The processor will attempt to decode the virtual address, but as there are no Page Table Entries for any of this memory, it will give up and raise a page fault exception, leaving the Linux kernel to fix things up. Linux looks to see if the virtual address referenced is in the current process's virtual address space.
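The tree walk used to find the area containing a faulting address can be sketched as follows. This is a minimal Python illustration, assuming non-overlapping areas. The VmArea class, the insert and find_area helpers and the example addresses are hypothetical, and the tree here is a plain binary search tree: the rebalancing that makes a real AVL tree is left out to keep the lookup itself visible.

```python
class VmArea:
    """Hypothetical stand-in for a vm_area_struct covering [start, end)."""

    def __init__(self, start, end, name):
        self.start = start
        self.end = end
        self.name = name
        self.left = None    # subtree of areas with lower start addresses
        self.right = None   # subtree of areas with higher start addresses


def insert(root, area):
    """Insert an area keyed by its start address (no rebalancing)."""
    if root is None:
        return area
    if area.start < root.start:
        root.left = insert(root.left, area)
    else:
        root.right = insert(root.right, area)
    return root


def find_area(root, address):
    """Walk left or right from the root until an area contains the address."""
    node = root
    while node is not None:
        if address < node.start:
            node = node.left
        elif address >= node.end:
            node = node.right
        else:
            return node
    return None  # the address is not in the process's address space


root = None
for area in (
    VmArea(0x08059000, 0x0805B000, "data"),
    VmArea(0x08048000, 0x08058000, "text"),
    VmArea(0xBFFFE000, 0xC0000000, "stack"),
):
    root = insert(root, area)

hit = find_area(root, 0x08048090)
miss = find_area(root, 0x08058800)
print(hit.name, miss)
```

A lookup that returns None corresponds to a reference outside the process's virtual address space. A lookup that returns an area is the case where the fault can be fixed up.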

If it is, Linux creates the appropriate PTEs and allocates a physical page of memory for this process. The code or data may need to be brought into that physical page from the filesystem or from the swap disk. The process can then be restarted at the instruction that caused the page fault and, this time as the memory physically exists, it may continue.

4.6 Creating a Process

When the system starts up it is running in kernel mode and there is, in a sense, only one process, the initial process. Like all processes, the initial process has a machine state represented by stacks, registers and so on. These will be saved in the initial process's task_struct data structure when other processes in the system are created and run. At the end of system initialization, the initial process starts up a kernel thread (called init) and then sits in an idle loop doing nothing. Whenever there is nothing else to do the scheduler will run this, idle, process. The idle process's task_struct is the only one that is not dynamically allocated; it is statically defined at kernel build time and is, rather confusingly, called init_task.

The init kernel thread or process has a process identifier of 1 as it is the system's first real process. It does some initial setting up of the system (such as opening the system console and mounting the root file system) and then executes the system initialization program. This is one of /etc/init, /bin/init or /sbin/init depending on your system. The init program uses /etc/inittab as a script file to create new processes within the system. These new processes may themselves go on to create new processes. For example the getty process may create a login process when a user attempts to login. All of the processes in the system are descended from the init kernel thread.

New processes are created by cloning old processes, or rather by cloning the current process. A new task is created by a system call (fork or clone) and the cloning happens within the kernel in kernel mode. At the end of the system call there is a new process waiting to run once the scheduler chooses it. A new task_struct data structure is allocated from the system's physical memory with one or more physical pages for the cloned process's stacks (user and kernel). A new process identifier may be created, one that is unique within the set of process identifiers in the system. However, it is perfectly reasonable for the cloned process to keep its parent's process identifier. The new task_struct is entered into the task vector and the contents of the old (current) process's task_struct are copied into the cloned task_struct.

When cloning processes Linux allows the two processes to share resources rather than have two separate copies. This applies to the process's files, signal handlers and virtual memory. When the resources are to be shared their respective count fields are incremented so that Linux will not deallocate these resources until both processes have finished using them. So, for example, if the cloned process is to share virtual memory, its task_struct will contain a pointer to the mm_struct of the original process and that mm_struct has its count field incremented to show the number of current processes sharing it.

Cloning a process's virtual memory is rather tricky. A new set of vm_area_struct data structures must be generated together with their owning mm_struct data structure and the cloned process's page tables. None of the process's virtual memory is copied at this point. That would be a rather difficult and lengthy task for some of that virtual memory would be in physical memory, some in the executable image that the process is currently executing and possibly some would be in the swap file. Instead Linux uses a technique called "copy on write" which means that virtual memory will only be copied when one of the two processes tries to write to it. Any virtual memory that is not written to, even if it can be, will be shared between the two processes without any harm occurring. The read only memory, for example the executable code, will always be shared. For "copy on write" to work, the writeable areas have their page table entries marked as read only and the vm_area_struct data structures describing them are marked as "copy on write". When one of the processes attempts to write to this virtual memory a page fault will occur. It is at this point that Linux will make a copy of the memory and fix up the two processes' page tables and virtual memory data structures.

4.7 Times and Timers

The kernel keeps track of a process's creation time as well as the CPU time that it consumes during its lifetime. Each clock tick, the kernel updates the amount of time in jiffies that the current process has spent in system and in user mode.

In addition to these accounting timers, Linux supports process specific interval timers. A process can use these timers to send itself various signals each time that they expire. Three sorts of interval timers are supported:

Real: The timer ticks in real time, and when the timer has expired, the process is sent a SIGALRM signal.

Virtual: This timer only ticks when the process is running and when it expires it sends a SIGVTALRM signal.

Profile: This timer ticks both when the process is running and when the system is executing on behalf of the process itself. SIGPROF is signalled when it expires.
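The difference between the three timers is easiest to see in a small simulation. The Python sketch below is a model only: the IntervalTimers class and its tick method are hypothetical, "running" is taken to mean running in user mode, and each timer is assumed to reload its interval after it expires. It does not call the real system interface.

```python
class IntervalTimers:
    """Toy model of the real, virtual and profile interval timers."""

    def __init__(self, real, virtual, profile):
        # Remaining ticks, keyed by the signal each timer raises on expiry.
        self.remaining = {"SIGALRM": real, "SIGVTALRM": virtual, "SIGPROF": profile}
        self.interval = dict(self.remaining)  # reload values (an assumption)

    def tick(self, mode):
        """Advance one clock tick and return the signals that fire.

        mode is 'user' (the process is running), 'system' (the kernel is
        executing on behalf of the process) or 'other' (not running at all).
        """
        ticking = ["SIGALRM"]                 # the real timer always ticks
        if mode == "user":
            ticking += ["SIGVTALRM", "SIGPROF"]
        elif mode == "system":
            ticking.append("SIGPROF")
        fired = []
        for signal_name in ticking:
            self.remaining[signal_name] -= 1
            if self.remaining[signal_name] == 0:
                fired.append(signal_name)
                self.remaining[signal_name] = self.interval[signal_name]
        return fired


timers = IntervalTimers(real=3, virtual=2, profile=2)
history = [timers.tick(mode) for mode in ("user", "system", "other", "user")]
print(history)
```

In this run the profile timer expires first, because it counts both the user and the system tick. The real timer expires on the third tick regardless of what the process is doing, and the virtual timer expires only once the process has had two ticks of its own.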

One or all of the interval timers may be running and Linux keeps all of the necessary information in the process's task_struct data structure. System calls can be made to set up these interval timers and to start them, stop them and read their current values. The virtual and profile timers are handled the same way. Every clock tick the current process's interval timers are decremented and, if they have expired, the appropriate signal is sent.

Real time interval timers are a little different and for these Linux uses the timer mechanism described in Chapter kernel-chapter. Each process has its own timer_list data structure and, when the real interval timer is running, this is queued on the system timer list. When the timer expires the timer bottom half handler removes it from the queue and calls the interval timer handler. This generates the SIGALRM signal and restarts the interval timer, adding it back into the system timer queue.

4.8 Executing Programs

In Linux, as in Unix(TM), programs and commands are normally executed by a command interpreter. A command interpreter is a user process like any other process and is called a shell (see footnote 2). There are many shells in Linux; some of the most popular are sh, bash and tcsh. With the exception of a few built in commands, such as cd and pwd, a command is an executable binary file. For each command entered, the shell searches the directories in the process's search path, held in the PATH environment variable, for an executable image with a matching name. If the file is found it is loaded and executed. The shell clones itself using the fork mechanism described above and then the new child process replaces the binary image that it was executing, the shell, with the contents of the executable image file just found. Normally the shell waits for the command to complete, or rather for the child process to exit. You can cause the shell to run again by typing control-Z, which causes a SIGTSTP signal to be sent to the child process, stopping it. You then use the shell command bg to push it into the background; the shell sends it a SIGCONT signal to restart it, and there it will stay until either it ends or it needs to do terminal input or output.

An executable file can have many formats or even be a script file. Script files have to be recognized and the appropriate interpreter run to handle them; for example /bin/sh interprets shell scripts. Executable object files contain executable code and data together with enough information to allow the operating system to load them into memory and execute them. The most commonly used object file format used by Linux is ELF but, in theory, Linux is flexible enough to handle almost any object file format.

Figure 4.3: Registered Binary Formats

As with file systems, the binary formats supported by Linux are either built into the kernel at kernel build time or available to be loaded as modules. The kernel keeps a list of supported binary formats (see figure 4.3) and when an attempt is made to execute a file, each binary format is tried in turn until one works. Commonly supported Linux binary formats are a.out and ELF. Executable files do not have to be read completely into memory; a technique known as demand loading is used. As each part of the executable image is used by a process it is brought into memory. Unused parts of the image may be discarded from memory.

4.8.1 ELF

The ELF (Executable and Linkable Format) object file format, designed by the Unix System Laboratories, is now firmly established as the most commonly used format in Linux. Whilst there is a slight performance overhead when compared with other object file formats such as ECOFF and a.out, ELF is felt to be more flexible. ELF executable files contain executable code, sometimes referred to as text, and data. Tables within the executable image describe how the program should be placed into the process's virtual memory. Statically linked images are built by the linker (ld), or link editor, into one single image containing all of the code and data needed to run this image. The image also specifies the layout in memory of this image and the address in the image of the first code to execute.
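To make the idea of "tables within the executable image" concrete, here is a short Python sketch that decodes the fixed-size header at the front of a 32-bit little-endian ELF file. The field names (e_entry, e_phoff, e_phnum and so on) follow the published ELF specification. The parse_elf32_header helper and the sample header bytes are made up for the example, and the sketch ignores 64-bit and big-endian images.

```python
import struct

# ELF32 header layout: e_ident[16], e_type, e_machine, e_version, e_entry,
# e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize,
# e_shnum, e_shstrndx. Little-endian, no padding: 52 bytes in total.
ELF32_HEADER = struct.Struct("<16sHHIIIIIHHHHHH")


def parse_elf32_header(data):
    """Return a few header fields from the start of a 32-bit ELF image."""
    fields = ELF32_HEADER.unpack_from(data)
    ident = fields[0]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    return {
        "e_type": fields[1],    # 2 means an executable file
        "e_entry": fields[4],   # virtual address of the first instruction
        "e_phoff": fields[5],   # file offset of the program header table
        "e_phnum": fields[10],  # number of program headers
    }


# A hand-built sample header (hypothetical values, not a real program).
ident = b"\x7fELF\x01\x01\x01" + bytes(9)
sample = ELF32_HEADER.pack(ident, 2, 3, 1, 0x8048090, 52, 0, 0, 52, 32, 2, 40, 0, 0)
header = parse_elf32_header(sample)
print(hex(header["e_entry"]), header["e_phoff"], header["e_phnum"])
```

The same function can be pointed at the first 52 bytes of a real 32-bit executable opened in binary mode. A loader uses e_phoff and e_phnum to find the program headers and e_entry to know where execution starts.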

Figure 4.4: ELF Executable File Format

Figure 4.4 shows the layout of a statically linked ELF executable image. It is a simple C program that prints "hello world" and then exits. The header describes it as an ELF image with two physical headers (e_phnum is 2) starting 52 bytes (e_phoff) from the start of the image file. The first physical header describes the executable code in the image. It goes at virtual address 0x8048000 and there are 65532 bytes of it. This is because it is a statically linked image which contains all of the library code for the printf() call to output "hello world". The entry point for the image, the first instruction for the program, is not at the start of the image but at virtual address 0x8048090 (e_entry). The code starts immediately after the second physical header. This physical header describes the data for the program and is to be loaded into virtual memory at address 0x8059BB8. This data is both readable and writeable. You will notice that the size of the data in the file is 2200 bytes (p_filesz) whereas its size in memory is 4248 bytes. This is because the first 2200 bytes contain pre-initialized data and the next 2048 bytes contain data that will be initialized by the executing code.

When Linux loads an ELF executable image into the process's virtual address space, it does not actually load the image. It sets up the virtual memory data structures, the process's vm_area_struct tree and its page tables. When the program is executed page faults will cause the program's code and data to be fetched into physical memory. Unused portions of the program will never be loaded into memory. Once the ELF binary format loader is satisfied that the image is a valid ELF executable image it flushes the process's current executable image from its virtual memory. As this process is a cloned image (all processes are) this, old, image is the program that the parent process was executing, for example the command interpreter shell such as bash. This flushing of the old executable image discards the old virtual memory data structures and resets the process's page tables. It also clears away any signal handlers that were set up and closes any files that are open. At the end of the flush the process is ready for the new executable image. No matter what format the executable image is, the same information gets set up in the process's mm_struct. There are pointers to the start and end of the image's code and data. These values are found as the ELF executable image's physical headers are read and the sections of the program that they describe are mapped into the process's virtual address space. That is also when the vm_area_struct data structures are set up and the process's page tables are modified. The mm_struct data structure also contains pointers to the parameters to be passed to the program and to this process's environment variables.

ELF Shared Libraries

A dynamically linked image, on the other hand, does not contain all of the code and data required to run. Some of it is held in shared libraries that are linked into the image at run time. The ELF shared library's tables are also used by the dynamic linker when the shared library is linked into the image at run time. Linux uses several dynamic linkers, ld.so.1, libc.so.1 and ld-linux.so.1, all to be found in /lib. The libraries contain commonly used code such as language subroutines. Without dynamic linking, all programs would need their own copy of these libraries and would need far more disk space and virtual memory. In dynamic linking, information is included in the ELF image's tables for every library routine referenced. The information indicates to the dynamic linker how to locate the library routine and link it into the program's address space.

REVIEW NOTE: Do I need more detail here, worked example?

4.8.2 Script Files

Script files are executables that need an interpreter to run them. There are a wide variety of interpreters available for Linux, for example wish, perl and command shells such as tcsh. Linux uses the standard Unix(TM) convention of having the first line of a script file contain the name of the interpreter. So, a typical script file would start:

#!/usr/bin/wish

The script binary loader tries to find the interpreter for the script. It does this by attempting to open the executable file that is named in the first line of the script. If it can open it, it has a pointer to its VFS inode and it can go ahead and have it interpret the script file. The name of the script file becomes argument zero (the first argument) and all of the other arguments move up one place (the original first argument becomes the new second argument and so on). Loading the interpreter is done in the same way as Linux loads all of its executable files. Linux tries each binary format in turn until one works. This means that you could in theory stack several interpreters and binary formats, making the Linux binary format handler a very flexible piece of software.

Footnotes:

1 REVIEW NOTE: I left out SWAPPING because it does not appear to be used.

2 Think of a nut: the kernel is the edible bit in the middle and the shell goes around it, providing an interface.

© 1996-1999 David A Rusling

Chapter 10

Networks

Networking and Linux are terms that are almost synonymous. In a very real sense Linux is a product of the Internet or World Wide Web (WWW). Its developers and users use the web to exchange information, ideas and code, and Linux itself is often used to support the networking needs of organizations. This chapter describes how Linux supports the network protocols known collectively as TCP/IP.

The TCP/IP protocols were designed to support communications between computers connected to the ARPANET, an American research network funded by the US government. The ARPANET pioneered networking concepts such as packet switching and protocol layering where one protocol uses the services of another. ARPANET was retired in 1988 but its successors (NSF NET, see footnote 1, and the Internet) have grown even larger. What is now known as the World Wide Web grew from the ARPANET and is itself supported by the TCP/IP protocols. Unix(TM) was extensively used on the ARPANET and the first released networking version of Unix(TM) was 4.3 BSD. Linux's networking implementation is modeled on 4.3 BSD in that it supports BSD sockets (with some extensions) and the full range of TCP/IP networking. This programming interface was chosen because of its popularity and to help applications be portable between Linux and other Unix(TM) platforms.

10.1 An Overview of TCP/IP Networking

This section gives an overview of the main principles of TCP/IP networking. It is not meant to be an exhaustive description; for that I suggest that you read [...].

In an IP network every machine is assigned an IP address; this is a 32 bit number that uniquely identifies the machine. The WWW is a very large, and growing, IP network and every machine that is connected to it has to have a unique IP address assigned to it. IP addresses are represented by four numbers separated by dots, for example, 16.42.0.9. This IP address is actually in two parts, the network address and the host address. The sizes of these parts may vary (there are several classes of IP addresses) but using 16.42.0.9 as an example, the network address would be 16.42 and the host address 0.9. The host address is further subdivided into a subnetwork and a host address. Again, using 16.42.0.9 as an example, the subnetwork address would be 16.42.0 and the host address 16.42.0.9. This subdivision of the IP address allows organizations to subdivide their networks. For example, 16.42 could be the network address of the ACME Computer Company; 16.42.0 would be subnet 0 and 16.42.1 would be subnet 1. These subnets might be in separate buildings, perhaps connected by leased telephone lines or even microwave links. IP addresses are assigned by the network administrator and having IP subnetworks is a good way of distributing the administration of the network. IP subnet administrators are free to allocate IP addresses within their IP subnetworks.

Generally though, IP addresses are somewhat hard to remember. Names are much easier. linux.acme.com is much easier to remember than 16.42.0.9 but there must be some mechanism to convert the network names into an IP address. These names can be statically specified in the /etc/hosts file or Linux can ask a Domain Name System (DNS) server to resolve the name for it. In this case the local host must know the IP address of one or more DNS servers and these are specified in /etc/resolv.conf.

Whenever you connect to another machine, say when reading a web page, its IP address is used to exchange data with that machine. This data is contained in IP packets, each of which has an IP header containing the IP addresses of the source and destination machines, a checksum and other useful information. The checksum is derived from the data in the IP packet and allows the receiver of IP packets to tell if the IP packet was corrupted during transmission, perhaps by a noisy telephone line. The data transmitted by an application may have been broken down into smaller packets which are easier to handle. The size of the IP data packets varies depending on the connection media; for example, ethernet packets are generally bigger than PPP packets. The destination host must reassemble the data packets before giving the data to the receiving application. You can see this fragmentation and reassembly of data graphically if you access a web page containing a lot of graphical images via a moderately slow serial link.

Hosts connected to the same IP subnet can send IP packets directly to each other; all other IP packets will be sent to a special host, a gateway. Gateways (or routers) are connected to more than one IP subnet and they will resend IP packets received on one subnet, but destined for another, onwards. For example, if subnets 16.42.1.0 and 16.42.0.0 are connected together by a gateway then any packets sent from subnet 0 to subnet 1 would have to be directed to the gateway so that it could route them. The local host builds up routing tables which allow it to route IP packets to the correct machine. For every IP destination there is an entry in the routing tables which tells Linux which host to send IP packets to in order that they reach their destination.
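The "same subnet or via the gateway" decision can be tried out with Python's standard ipaddress module. The sketch below is illustrative only: the next_hop helper, the 24 bit subnet prefix and the gateway address 16.42.0.1 are assumptions made for the example (the text names the subnets but not a gateway address), and a real routing table holds many entries rather than a single rule.

```python
import ipaddress


def next_hop(destination, local_subnet, gateway):
    """Return the address a packet for destination is handed to first."""
    dest = ipaddress.ip_address(destination)
    if dest in ipaddress.ip_network(local_subnet):
        return str(dest)   # same subnet: deliver directly to the host
    return gateway         # different subnet: send it to the gateway


# An IP address is a 32 bit number written as four dotted decimal bytes.
as_number = int(ipaddress.ip_address("16.42.0.9"))

# Subnet 0 from the text, assuming the first three bytes select the subnet.
LOCAL_SUBNET = "16.42.0.0/24"
GATEWAY = "16.42.0.1"      # hypothetical gateway on subnet 0

direct = next_hop("16.42.0.9", LOCAL_SUBNET, GATEWAY)
routed = next_hop("16.42.1.5", LOCAL_SUBNET, GATEWAY)
print(as_number, direct, routed)
```

A host on subnet 0 sends to 16.42.0.9 directly, while a packet for 16.42.1.5 on subnet 1 goes to the gateway first.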

These routing tables are dynamic and change over time as applications use the network and as the network topology changes.

Figure 10.1: TCP/IP Protocol Layers

The IP protocol is a network layer protocol that is used by other protocols to carry their data. The Transmission Control Protocol (TCP) is a reliable end to end protocol that uses IP to transmit and receive its own packets. Just as IP packets have their own header, TCP has its own header. TCP is a connection based protocol where two networking applications are connected by a single, virtual connection even though there may be many subnetworks, gateways and routers between them. TCP reliably transmits and receives data between the two applications and guarantees that there will be no lost or duplicated data. When TCP transmits its packet using IP, the data contained within the IP packet is the TCP packet itself. The IP layer on each communicating host is responsible for transmitting and receiving IP packets. The User Datagram Protocol (UDP) also uses the IP layer to transport its packets; unlike TCP, UDP is not a reliable protocol but offers a datagram service.

This use of IP by other protocols means that when IP packets are received the receiving IP layer must know which upper protocol layer to give the data contained in this IP packet to. To facilitate this every IP packet header has a byte containing a protocol identifier. When TCP asks the IP layer to transmit an IP packet, that IP packet's header states that it contains a TCP packet. The receiving IP layer uses that protocol identifier to decide which layer to pass the received data up to, in this case the TCP layer. When applications communicate via TCP/IP they must specify not only the target's IP address but also the port address of the application. A port address uniquely identifies an application and standard network applications use standard port addresses; for example, web servers use port 80. These registered port addresses can be seen in /etc/services.

This layering of protocols does not stop with TCP, UDP and IP. The IP protocol layer itself uses many different physical media to transport IP packets to other IP hosts. These media may themselves add their own protocol headers. One such example is the ethernet layer, but PPP and SLIP are others. An ethernet network allows many hosts to be simultaneously connected to a single physical cable. Every transmitted ethernet frame can be seen by all connected hosts and so every ethernet device has a unique address. Any ethernet frame transmitted to that address will be received by the addressed host but ignored by all the other hosts connected to the network. These unique addresses are built into each ethernet device when it is manufactured and are usually kept in an SROM on the ethernet card. Ethernet addresses are 6 bytes long; an example would be 08-00-2b-00-49-A4. Some ethernet addresses are reserved for multicast purposes and ethernet frames sent with these destination addresses will be received by all hosts on the network. As ethernet frames can carry many different protocols (as data) they, like IP packets, contain a protocol identifier in their headers. This allows the ethernet layer to correctly receive IP packets and to pass them on to the IP layer.

In order to send an IP packet via a multi-connection protocol such as ethernet, the IP layer must find the ethernet address of the IP host. This is because IP addresses are simply an addressing concept; the ethernet devices themselves have their own physical addresses. IP addresses can be assigned and reassigned by network administrators at will, but the network hardware responds only to ethernet frames with its own physical address or to special multicast addresses which all machines must receive. Linux uses the Address Resolution Protocol (or ARP) to allow machines to translate IP addresses into real hardware addresses such as ethernet addresses.
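The way each layer's header carries a protocol identifier for the layer above can be sketched as follows. This is a simplified model, not Linux code: build_frame and demultiplex are hypothetical names, the IP checksum is left at zero, and the source and destination IP addresses are just the example addresses used earlier in this chapter. The constants are the standard ethernet type for IP (0x0800) and the standard IP protocol numbers for TCP (6) and UDP (17).

```python
import struct

ETH_P_IP = 0x0800   # ethernet protocol identifier for IP
IPPROTO_TCP = 6     # IP protocol identifier for TCP
IPPROTO_UDP = 17    # IP protocol identifier for UDP


def build_frame(dst_mac, src_mac, protocol, payload):
    """Wrap a payload in a minimal IPv4 header, then in an ethernet header."""
    ip_header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(payload),  # version and header length, TOS, total length
        0, 0,                        # identification, flags and fragment offset
        64, protocol, 0,             # TTL, protocol identifier, checksum (not computed)
        bytes([16, 42, 0, 9]),       # source IP address (example value)
        bytes([16, 42, 1, 5]),       # destination IP address (example value)
    )
    eth_header = dst_mac + src_mac + struct.pack("!H", ETH_P_IP)
    return eth_header + ip_header + payload


def demultiplex(frame):
    """Name the upper layer protocol that should receive this frame's data."""
    (eth_type,) = struct.unpack("!H", frame[12:14])
    if eth_type != ETH_P_IP:
        return "not IP"
    protocol = frame[14 + 9]  # the protocol byte sits at offset 9 of the IP header
    return {IPPROTO_TCP: "TCP", IPPROTO_UDP: "UDP"}.get(protocol, "unknown")
```

A frame built with IPPROTO_TCP is handed to "TCP" by demultiplex; the receiver never has to guess, because every header names the protocol it is carrying.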

A host wishing to know the hardware address associated with an IP address sends an ARP request packet containing the IP address that it wishes translating to all nodes on the network by sending it to a multicast address. The target host that owns the IP address responds with an ARP reply that contains its physical hardware address. ARP is not restricted to ethernet devices; it can resolve IP addresses for other physical media, for example FDDI. Those network devices that cannot ARP are marked so that Linux does not attempt to ARP. There is also the reverse function, Reverse ARP or RARP, which translates physical network addresses into IP addresses. This is used by gateways, which respond to ARP requests on behalf of IP addresses that are in the remote network.

10.2 The Linux TCP/IP Networking Layers

Figure 10.2: Linux Networking Layers

Figure 10.2 shows that, just like the network protocols themselves, Linux implements the internet protocol address family as a series of connected layers of software. BSD sockets are supported by generic socket management software concerned only with BSD sockets. Supporting this is the INET socket layer, which manages the communication end points for the IP based protocols TCP and UDP. UDP (User Datagram Protocol) is a connectionless protocol whereas TCP (Transmission Control Protocol) is a reliable end to end protocol. When UDP packets are transmitted, Linux neither knows nor cares if they arrive safely at their destination. TCP packets are numbered and both ends of the TCP connection make sure that transmitted data is received correctly. The IP layer contains code implementing the Internet Protocol. This code prepends IP headers to transmitted data and understands how to route incoming IP packets to either the TCP or UDP layers.

Underneath the IP layer, supporting all of Linux's networking, are the network devices, for example PPP and ethernet. Network devices do not always represent physical devices; some, like the loopback device, are purely software devices. Unlike standard Linux devices that are created via the mknod command, network devices appear only if the underlying software has found and initialized them. You will only see /dev/eth0 when you have built a kernel with the appropriate ethernet device driver in it. The ARP protocol sits between the IP layer and the protocols that support ARPing for addresses.

10.3 The BSD Socket Interface

This is a general interface which not only supports various forms of networking but is also an inter-process communications mechanism. A socket describes one end of a communications link; two communicating processes would each have a socket describing their end of the communication link between them. Sockets could be thought of as a special case of pipes but, unlike pipes, sockets have no limit on the amount of data that they can contain. Linux supports several classes of socket and these are known as address families. This is because each class has its own method of addressing its communications. Linux supports the following socket address families or domains:

UNIX    Unix domain sockets

Care is taken to ensure that data packets in transit are correctly dealt with.

The exact meaning of operations on a BSD socket depends on its underlying address family. Setting up TCP/IP connections is very different from setting up an amateur radio X.25 connection. Like the virtual filesystem, Linux abstracts the socket interface, with the BSD socket layer being concerned with the BSD socket interface to the application programs, which is in turn supported by independent address family specific software. At kernel initialization time, the address families built into the kernel register themselves with the BSD socket interface. Later on, as applications create and use BSD sockets, an association is made between the BSD socket and its supporting address family. This association is made via cross-linking data structures and tables of address family specific support routines. For example there is an address family specific socket creation routine which the BSD socket interface uses when an application creates a new socket.

When the kernel is configured, a number of address families and protocols are built into the protocols vector. Each is represented by its name, for example "INET", and the address of its initialization routine. When the socket interface is initialized at boot time each protocol's initialization routine is called. For the socket address families this results in them registering a set of protocol operations. This is a set of routines, each of which performs a particular operation specific to that address family. The registered protocol operations are kept in the pops vector, a vector of pointers to proto_ops data structures. The proto_ops data structure consists of the address family type and a set of pointers to socket operation routines specific to a particular address family. The pops vector is indexed by the address family identifier, for example the Internet address family identifier (AF_INET is 2).
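The registration scheme just described can be modelled with a small table of operation records. The sketch below is a Python illustration of the idea, not the kernel's C code: the vector size NPROTO, the reduced ProtoOps record (only a create routine) and the function bodies are assumptions made for the example. Only the names pops, proto_ops, protocols and the value AF_INET = 2 come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

AF_UNIX = 1   # listed in the text as an address family; value assumed
AF_INET = 2   # the text gives AF_INET as 2
NPROTO = 16   # assumed size of the pops vector for this sketch


@dataclass
class ProtoOps:
    """Cut-down stand-in for proto_ops: the family plus one operation."""
    family: int
    create: Callable[[int, int], str]


# The pops vector: indexed by address family identifier.
pops: List[Optional[ProtoOps]] = [None] * NPROTO


def sock_register(ops):
    """Called by an address family to register its operations."""
    if pops[ops.family] is not None:
        raise ValueError("address family already registered")
    pops[ops.family] = ops


def inet_create(sock_type, protocol):
    """Address family specific creation routine (illustrative only)."""
    return "INET socket type=%d protocol=%d" % (sock_type, protocol)


def inet_proto_init():
    """Initialization routine for the INET family."""
    sock_register(ProtoOps(AF_INET, inet_create))


# The protocols vector: name and initialization routine of each built-in family.
protocols = [("INET", inet_proto_init)]


def sock_init():
    """Boot time initialization: call each protocol's initialization routine."""
    for _name, init in protocols:
        init()


def socket_create(family, sock_type, protocol):
    """Generic layer: look the family up in pops and delegate to it."""
    ops = pops[family] if 0 <= family < NPROTO else None
    if ops is None:
        raise OSError("address family not supported")
    return ops.create(sock_type, protocol)
```

After sock_init() has run once, socket_create(AF_INET, 1, 0) is routed to inet_create, while a family that never registered raises an error. The generic layer never needs to know what an INET socket looks like.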

Figure 10.3: Linux BSD Socket Data Structures

10.4 The INET Socket Layer

The INET socket layer supports the internet address family which contains the TCP/IP protocols. As discussed above, these protocols are layered, one protocol using the services of another. Linux's TCP/IP code and data structures reflect this layering. Its interface with the BSD socket layer is through the set of Internet address family socket operations which it registers with the BSD socket layer during network initialization. These are kept in the pops vector along with the other registered address families. The BSD socket layer calls the INET layer socket support routines from the registered INET proto_ops data structure to perform work for it. For example a BSD socket create request that gives the address family as INET will use the underlying INET socket create function. The BSD socket layer passes the socket data structure representing the BSD socket to the INET layer in each of these operations. Rather than clutter the BSD socket with TCP/IP specific information, the INET socket layer uses its own data structure, the sock, which it links to the BSD socket data structure. This linkage can be seen in Figure 10.3. It links the sock data structure to the BSD socket data structure using the data pointer in the BSD socket. This means that subsequent INET socket calls can easily retrieve the sock data structure. The sock data structure's protocol operations pointer is also set up at creation time and it depends on the protocol requested. If TCP is requested, then the sock data structure's protocol operations pointer will point to the set of TCP protocol operations needed for a TCP connection.

10.4.1 Creating a BSD Socket

The system call to create a new socket passes identifiers for its address family, socket type and protocol. Firstly the requested address family is used to search the pops vector for a matching address family. It may be that a particular address family is implemented as a kernel module and, in this case, the kerneld daemon must load the module before we can continue. A new socket data structure is allocated to represent the BSD socket. Actually the socket data structure is physically part of the VFS inode data structure and allocating a socket really means allocating a VFS inode. This may seem strange unless you consider that sockets can be operated on in just the same way that ordinary files can. As all files are represented by a VFS inode data structure, then in order to support file operations, BSD sockets must also be represented by a VFS inode data structure.

The newly created BSD socket data structure contains a pointer to the address family specific socket routines and this is set to the proto_ops data structure retrieved from the pops vector. Its type is set to the socket type requested, one of SOCK_STREAM, SOCK_DGRAM and so on. The address family specific creation routine is called using the address kept in the proto_ops data structure.

A free file descriptor is allocated from the current process's fd vector and the file data structure that it points at is initialized. This includes setting the file operations pointer to point to the set of BSD socket file operations supported by the BSD socket interface. Any future operations will be directed to the socket interface and it will in turn pass them to the supporting address family by calling its address family operation routines.

10.4.2 Binding an Address to an INET BSD Socket

In order to be able to listen for incoming internet connection requests, each server must create an INET BSD socket and bind its address to it. The bind operation is mostly handled within the INET socket layer with some support from the underlying TCP and UDP protocol layers. A socket that is having an address bound to it cannot be in use for any other communication. This means that the socket's state must be TCP_CLOSE. The sockaddr passed to the bind operation contains the IP address to be bound to and, optionally, a port number. Normally the IP address bound to would be one that has been assigned to a network device that supports the INET address family and whose interface is up and able to be used. You can see which network interfaces are currently active in the system by using the ifconfig command. The IP address may also be the IP broadcast address of either all 1's or all 0's. These are special addresses that mean "send to everybody". The IP address could also be specified as any IP address if the machine is acting as a transparent proxy or firewall, but only processes with superuser privileges can bind to any IP address. The IP address bound to is saved in the sock data structure in the recv_addr and saddr fields. These are used in hash lookups and as the sending IP address respectively. The port number is optional and if it is not specified the supporting network is asked for a free one. By convention, port numbers less than 1024 cannot be used by processes without superuser privileges. If the underlying network does allocate a port number it always allocates ones greater than 1024.
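From the application's side, creating a socket and binding it without naming a port looks like the following Python sketch. It uses only the standard socket module; bind_to_free_port is a hypothetical helper name, and binding to the loopback address 127.0.0.1 is an assumption made so that the example needs no particular network interface.

```python
import socket


def bind_to_free_port(host="127.0.0.1"):
    """Create an INET stream socket and let the system choose its port.

    Passing port 0 to bind() leaves the port unspecified, so the
    underlying network layer is asked for a free one.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    return sock
```

Calling getsockname() on the returned socket shows the port that was allocated, and fileno() shows the file descriptor that represents the socket to the process, which is why ordinary file operations can be applied to it.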

As packets are being received by the underlying network devices they must be routed to the correct INET and BSD sockets so that they can be processed. For this reason UDP and TCP maintain hash tables which are used to look up the addresses within incoming IP messages and direct them to the correct socket/sock pair. TCP is a connection oriented protocol and so there is more information involved in processing TCP packets than there is in processing UDP packets.

UDP maintains a hash table of allocated UDP ports, the udp_hash table. This consists of pointers to sock data structures indexed by a hash function based on the port number. As the UDP hash table is much smaller than the number of permissible port numbers (udp_hash is only 128 or UDP_HTABLE_SIZE entries long) some entries in the table point to a chain of sock data structures linked together using each sock's next pointer.

TCP is much more complex as it maintains several hash tables. However, TCP does not actually add the binding sock data structure into its hash tables during the bind operation; it merely checks that the port number requested is not currently being used. The sock data structure is added to TCP's hash tables during the listen operation.

REVIEW NOTE: What about the route entered?

10.4.3 Making a Connection on an INET BSD Socket

Once a socket has been created and, provided it has not been used to listen for inbound connection requests, it can be used to make outbound connection requests. For connectionless protocols like UDP this socket operation does not do a whole lot but for connection oriented protocols like TCP it involves building a virtual circuit between two applications.

An outbound connection can only be made on an INET BSD socket that is in the right state; that is to say one that does not already have a connection established and one that is not being used for listening for inbound connections. This means that the BSD socket data structure must be in state SS_UNCONNECTED. The UDP protocol does not establish virtual connections between applications; any messages sent are datagrams, one off messages that may or may not reach their destinations. It does, however, support the connect BSD socket operation. A connection operation on a UDP INET BSD socket simply sets up the addresses of the remote application: its IP address and its IP port number. Additionally it sets up a cache of the routing table entry so that UDP packets sent on this BSD socket do not need to check the routing database again (unless this route becomes invalid). The cached routing information is pointed at from the ip_route_cache pointer in the INET sock data structure. If no addressing information is given, this cached routing and IP addressing information will automatically be used for messages sent using this BSD socket. UDP moves the sock's state to TCP_ESTABLISHED.

For a connect operation on a TCP BSD socket, TCP must build a TCP message containing the connection information and send it to the IP destination given. The TCP message contains information about the connection: a unique starting message sequence number, the maximum sized message that can be managed by the initiating host, the transmit and receive window size and so on. Within TCP all messages are numbered and the initial sequence number is used as the first message number. Linux chooses a reasonably random value to avoid malicious protocol attacks. Every message transmitted by one end of the TCP connection and successfully received by the other is acknowledged to say that it arrived successfully and uncorrupted. Unacknowledged messages will be retransmitted. The transmit and receive window size is the number of outstanding messages that there can be without an acknowledgement being sent. The maximum message size is based on the network device that is being used at the initiating end of the request. If the receiving end's network device supports smaller maximum message sizes then the connection will use the minimum of the two. The application making the outbound TCP connection request must now wait for a response from the target application to accept or reject the connection request. As the TCP sock is now expecting incoming messages, it is added to the tcp_listening_hash so that incoming TCP messages can be directed to this sock data structure. TCP also starts timers so that the outbound connection request can be timed out if the target application does not respond to the request.

10.4.4 Listening on an INET BSD Socket

Once a socket has had an address bound to it, it may listen for incoming connection requests specifying the bound addresses. A network application can listen on a socket without first binding an address to it; in this case the INET socket layer finds an unused port number (for this protocol) and automatically binds it to the socket. The listen socket function moves the socket into state TCP_LISTEN and does any network specific work needed to allow incoming connections.

For UDP sockets, changing the socket's state is enough but TCP now adds the socket's sock data structure into two hash tables as it is now active. These are the tcp_bound_hash table and the tcp_listening_hash. Both are indexed via a hash function based on the IP port number.

Whenever an incoming TCP connection request is received for an active listening socket, TCP builds a new sock data structure to represent it. This sock data structure will become the bottom half of the TCP connection when it is eventually accepted. It also clones the incoming sk_buff containing the connection request and queues it onto the receive_queue for the listening sock data structure.
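The udp_hash table with its chains of sock structures, described earlier in this section, can be sketched as below. This is a Python model of the idea, not the kernel's code: the Sock class is cut down to a port number and a next pointer, and the hash function (the port number masked to the table size) is an assumption for illustration. Only the names udp_hash and UDP_HTABLE_SIZE and the size of 128 come from the text.

```python
UDP_HTABLE_SIZE = 128  # size given in the text


class Sock:
    """Cut-down stand-in for the sock data structure."""

    def __init__(self, port):
        self.port = port
        self.next = None  # next sock in the same hash chain


udp_hash = [None] * UDP_HTABLE_SIZE


def udp_hashfn(port):
    """Assumed hash function: keep the low bits of the port number."""
    return port & (UDP_HTABLE_SIZE - 1)


def udp_lookup(port):
    """Walk the chain for this port's slot until the port matches."""
    sk = udp_hash[udp_hashfn(port)]
    while sk is not None and sk.port != port:
        sk = sk.next
    return sk


def udp_bind(port):
    """Add a sock for `port` at the head of its chain; refuse ports in use."""
    if udp_lookup(port) is not None:
        raise OSError("port already in use")
    sk = Sock(port)
    index = udp_hashfn(port)
    sk.next = udp_hash[index]
    udp_hash[index] = sk
    return sk
```

Ports 53 and 181 differ by exactly 128, so under this assumed hash they share a slot and the second sock bound is chained in front of the first. A lookup for either port walks the same chain and stops at the matching entry.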

The clone sk_buff contains a pointer to the newly created sock data structure.

10.4.5 Accepting Connection Requests

UDP does not support the concept of connections; accepting INET socket connection requests only applies to the TCP protocol, as an accept operation on a listening socket causes a new socket data structure to be cloned from the original listening socket. The accept operation is then passed to the supporting protocol layer, in this case INET, to accept any incoming connection requests. The INET protocol layer will fail the accept operation if the underlying protocol, say UDP, does not support connections. Otherwise the accept operation is passed through to the real protocol, in this case TCP. The accept operation can be either blocking or non-blocking. In the non-blocking case, if there are no incoming connections to accept, the accept operation will fail and the newly created socket data structure will be thrown away. In the blocking case the network application performing the accept operation will be added to a wait queue and then suspended until a TCP connection request is received. Once a connection request has been received the sk_buff containing the request is discarded and the sock data structure is returned to the INET socket layer where it is linked to the new socket data structure created earlier. The file descriptor (fd) number of the new socket is returned to the network application, and the application can then use that file descriptor in socket operations on the newly created INET BSD socket.

10.5 The IP Layer

10.5.1 Socket Buffers

One of the problems of having many layers of network protocols, each one using the services of another, is that each protocol needs to add protocol headers and tails to data as it is transmitted and to remove them as it processes received data. This makes passing data buffers between the protocols difficult as each layer needs to find where its particular protocol headers and tails are. One solution is to copy buffers at each layer but that would be inefficient. Instead, Linux uses socket buffers or sk_buffs to pass data between the protocol layers and the network device drivers. sk_buffs contain pointer and length fields that allow each protocol layer to manipulate the application data via standard functions or "methods".

Figure 10.4: The Socket Buffer (sk_buff)

Figure 10.4 shows the sk_buff data structure; each sk_buff has a block of data associated with it. The sk_buff has four data pointers, which are used to manipulate and manage the socket buffer's data:

head    points to the start of the data area in memory. This is fixed when the sk_buff and its associated data block is allocated.
data    points at the current start of the protocol data. This pointer varies depending on the protocol layer that currently owns the sk_buff.
tail    points at the current end of the protocol data. Again, this pointer varies depending on the owning protocol layer.
end     points at the end of the data area in memory. This is fixed when the sk_buff is allocated.

There are two length fields, len and truesize, which describe the length of the current protocol packet and the total size of the data buffer respectively. The sk_buff handling code provides standard mechanisms for adding and removing protocol headers and tails to the application data. These safely manipulate the data, tail and len fields in the sk_buff:

push    This moves the data pointer towards the start of the data area and increments the len field. This is used when adding data or protocol headers to the start of the data to be transmitted.
pull    This moves the data pointer away from the start, towards the end of the data area, and decrements the len field. This is used when removing data or protocol headers from the start of the data that has been received.
put     This moves the tail pointer towards the end of the data area and increments the len field. This is used when adding data or protocol information to the end of the data to be transmitted.
trim    This moves the tail pointer towards the start of the data area and decrements the len field. This is used when removing data or protocol tails from the received packet.

The sk_buff data structure also contains pointers that are used as it is stored in doubly linked circular lists of sk_buff's during processing. There are generic sk_buff routines for adding sk_buffs to the front and back of these lists and for removing them.

10.5.2 Receiving IP Packets

The chapter on device drivers described how Linux's network drivers are built into the kernel and initialized. This results in a series of device data structures linked together in the dev_base list. Each device data structure describes its device and provides a set of callback routines that the network protocol layers call when they need the network driver to perform work. These functions are mostly concerned with transmitting data and with the network device's addresses. When a network device receives packets from its network it must convert the received data into sk_buff data structures. These received sk_buff's are added onto the backlog queue by the network drivers as they are received. If the backlog queue grows too large, then the received sk_buff's are discarded. The network bottom half is flagged as ready to run as there is work to do.

When the network bottom half handler is run by the scheduler it processes any network packets waiting to be transmitted before processing the backlog queue of sk_buff's, determining which protocol layer to pass the received packets to. As the Linux networking layers were initialized, each protocol registered itself by adding a packet_type data structure onto either the ptype_all list or into the ptype_base hash table. The packet_type data structure contains the protocol type, a pointer to a network device, a pointer to the protocol's receive data processing routine and, finally, a pointer to the next packet_type data structure in the list or hash chain. The ptype_all chain is used to snoop all packets being received from any network device and is not normally used. The ptype_base hash table is hashed by protocol identifier and is used to decide which protocol should receive the incoming network packet. The network bottom half matches the protocol types of incoming sk_buff's against one or more of the packet_type entries in either table. The protocol may match more than one entry, for example when snooping all network traffic, and in this case the sk_buff will be cloned. The sk_buff is passed to the matching protocol's handling routine.
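The push, pull, put and trim operations of section 10.5.1 can be modelled with four indices into a fixed buffer. The Python sketch below is a simplification for illustration and not the kernel's sk_buff: the SkBuff class and its headroom argument are hypothetical, trim here takes the number of bytes to remove, and truesize simply records the size of the data area.

```python
class SkBuff:
    """Toy model of a socket buffer: one data area and four pointers."""

    def __init__(self, size, headroom):
        self.buf = bytearray(size)
        self.head = 0          # start of the data area, fixed
        self.data = headroom   # current start of the protocol data
        self.tail = headroom   # current end of the protocol data
        self.end = size        # end of the data area, fixed
        self.len = 0           # length of the current protocol packet
        self.truesize = size   # total size of the data buffer

    def put(self, payload):
        """Append data: move tail towards end and grow len."""
        if self.tail + len(payload) > self.end:
            raise ValueError("no room at the tail")
        self.buf[self.tail:self.tail + len(payload)] = payload
        self.tail += len(payload)
        self.len += len(payload)

    def push(self, header):
        """Prepend a header: move data towards head and grow len."""
        if self.data - len(header) < self.head:
            raise ValueError("no headroom left")
        self.data -= len(header)
        self.buf[self.data:self.data + len(header)] = header
        self.len += len(header)

    def pull(self, count):
        """Strip `count` bytes from the front: move data towards end."""
        if count > self.len:
            raise ValueError("pulling more than the packet holds")
        removed = bytes(self.buf[self.data:self.data + count])
        self.data += count
        self.len -= count
        return removed

    def trim(self, count):
        """Strip `count` bytes from the back: move tail towards head."""
        if count > self.len:
            raise ValueError("trimming more than the packet holds")
        self.tail -= count
        self.len -= count

    def contents(self):
        """Return the bytes currently between data and tail."""
        return bytes(self.buf[self.data:self.tail])
```

On transmit, an application payload is placed with put and each layer then uses push to add its header in the reserved headroom; on receive, each layer uses pull to strip its own header before handing the buffer upwards. No copying between layers is needed, only pointer movement.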

10.5.3 Sending IP Packets

Packets are transmitted by applications exchanging data or else they are generated by the network protocols as they support established connections or connections being established. Whichever way the data is generated, an sk_buff is built to contain the data and various headers are added by the protocol layers as it passes through them.

The sk_buff needs to be passed to a network device to be transmitted. First though the protocol, for example IP, needs to decide which network device to use. This depends on the best route for the packet. For computers connected by modem to a single network, say via the PPP protocol, the routing choice is easy. The packet should either be sent to the local host via the loopback device or to the gateway at the end of the PPP modem connection. For computers connected to an ethernet the choices are harder as there are many computers connected to the network.

For every IP packet transmitted, IP uses the routing tables to resolve the route for the destination IP address. Each IP destination successfully looked up in the routing tables returns a rtable data structure describing the route to use. This includes the source IP address to use, the address of the network device data structure and, sometimes, a prebuilt hardware header. This hardware header is network device specific and contains the source and destination physical addresses and other media specific information. If the network device is an ethernet device, the hardware header would be as shown in Figure 10.1 and the source and destination addresses would be physical ethernet addresses. The hardware header is cached with the route because it must be prepended to each IP packet transmitted on this route and constructing it takes time. The hardware header may contain physical addresses that have to be resolved using the ARP protocol. In this case the outgoing packet is stalled until the address has been resolved. Once it has been resolved and the hardware header built, the hardware header is cached so that future IP packets sent using this interface do not have to ARP.

10.5.4 Data Fragmentation

Every network device has a maximum packet size and it cannot transmit or receive a data packet bigger than this. The IP protocol allows for this and will fragment data into smaller units to fit into the packet size that the network device can handle. The IP protocol header includes a fragment field which contains a flag and the fragment offset.

When an IP packet is ready to be transmitted, IP finds the network device to send the IP packet out on. This device is found from the IP routing tables. Each device has a field describing its maximum transfer unit (in bytes); this is the mtu field. If the device's mtu is smaller than the packet size of the IP packet that is waiting to be transmitted, then the IP packet must be broken down into smaller (mtu sized) fragments. Each fragment is represented by an sk_buff, its IP header marked to show that it is a fragment and what offset into the data this IP packet contains. The last packet is marked as being the last IP fragment. If, during the fragmentation, IP cannot allocate an sk_buff, the transmit will fail.

Receiving IP fragments is a little more difficult than sending them because the IP fragments can be received in any order and they must all be received before they can be reassembled. Each time an IP packet is received it is checked to see if it is an IP fragment. The first time that the fragment of a message is received, IP creates a new ipq data structure, and this is linked into the ipqueue list of IP fragments awaiting recombination. As more IP fragments are received, the correct ipq data structure is found and a new ipfrag data structure is created to describe this fragment. Each ipq data structure uniquely describes a fragmented IP receive frame with its source and destination IP addresses, the upper layer protocol identifier and the identifier for this IP frame. When all of the fragments have been received, they are combined into a single sk_buff and passed up to the next protocol level to be processed. Each ipq contains a timer that is restarted each time a valid fragment is received. If this timer expires, the ipq data structure and its ipfrag's are dismantled and the message is presumed to have been lost in transit. It is then up to the higher level protocols to retransmit the message.

10.6 The Address Resolution Protocol (ARP)

The Address Resolution Protocol's role is to provide translations of IP addresses into physical hardware addresses such as ethernet addresses. IP needs this translation just before it passes the data (in the form of an sk_buff) to the device driver for transmission. It performs various checks to see if this device needs a hardware header and, if it does, if the hardware header for the packet needs to be rebuilt. Linux caches hardware headers to avoid frequent rebuilding of them. If the hardware header needs rebuilding, it calls the device specific hardware header rebuilding routine. All ethernet devices use the same generic header rebuilding routine which in turn uses the ARP services to translate the destination IP address into a physical address.

The ARP protocol itself is very simple and consists of two message types, an ARP request and an ARP reply. The ARP request contains the IP address that needs translating and the reply (hopefully) contains the translated IP address, the hardware address.
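Returning to the fragmentation scheme of section 10.5.4, the splitting and out-of-order reassembly can be sketched as follows. This is a hedged Python model, not the kernel's code: fragment and the Ipq class are hypothetical simplifications, each fragment is a plain (offset, more_fragments, data) tuple rather than an sk_buff, the 20 byte header length is an assumed default, and the per-ipq timer is left out. Rounding the fragment size down to a multiple of 8 reflects the fact that IP expresses fragment offsets in 8 byte units.

```python
def fragment(payload, mtu, header_len=20):
    """Split `payload` into (offset, more_fragments, data) tuples that fit `mtu`."""
    chunk = (mtu - header_len) // 8 * 8  # usable bytes per fragment
    if chunk <= 0:
        raise ValueError("mtu too small to carry any data")
    fragments = []
    for offset in range(0, len(payload), chunk):
        data = payload[offset:offset + chunk]
        more = offset + chunk < len(payload)  # False only on the last fragment
        fragments.append((offset, more, data))
    return fragments


class Ipq:
    """Collects the fragments of one IP frame until all have arrived."""

    def __init__(self):
        self.frags = {}    # offset -> data, standing in for the ipfrag list
        self.total = None  # full length, known once the last fragment is seen

    def add(self, offset, more, data):
        """Record a fragment; return the whole payload once it is complete."""
        self.frags[offset] = data
        if not more:
            self.total = offset + len(data)
        return self._reassemble()

    def _reassemble(self):
        if self.total is None:
            return None
        combined = bytearray()
        for offset in sorted(self.frags):
            if offset != len(combined):
                return None  # a fragment in the middle is still missing
            combined += self.frags[offset]
        return bytes(combined) if len(combined) == self.total else None
```

Feeding the fragments to an Ipq in reverse order returns None until the final missing piece arrives, at which point the original payload is returned in one piece, mirroring the way the fragments are combined into a single sk_buff.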

The ARP request is broadcast to all hosts connected to the network, so, for an ethernet network, all of the machines connected to the ethernet will see the ARP request. The machine that owns the IP address in the request will respond to the ARP request with an ARP reply containing its own physical address.

The ARP protocol layer in Linux is built around a table of arp_table data structures which each describe an IP to physical address translation. These entries are created as IP addresses need to be translated and removed as they become stale over time. Each arp_table data structure has the following fields:

last used          the time that this ARP entry was last used
last updated       the time that this ARP entry was last updated
flags              these describe this entry's state, if it is complete and so on
IP address         the IP address that this entry describes
hardware address   the translated hardware address
hardware header    a pointer to a cached hardware header
timer              a timer_list entry used to time out ARP requests that do not get a response
retries            the number of times that this ARP request has been retried
sk_buff queue      list of sk_buff entries waiting for this IP address to be resolved

The ARP table consists of a table of pointers (the arp_tables vector) to chains of arp_table entries. The entries are cached to speed up access to them; each entry is found by taking the last two bytes of its IP address to generate an index into the table and then following the chain of entries until the correct one is found. Linux also caches prebuilt hardware headers off the arp_table entries in the form of hh_cache data structures.

When an IP address translation is requested and there is no corresponding arp_table entry, ARP must send an ARP request message. It creates a new arp_table entry in the table and queues the sk_buff containing the network packet that needs the address translation on the sk_buff queue of the new entry. It sends out an ARP request and sets the ARP expiry timer running. If there is no response then ARP will retry the request a number of times and if there is still no response ARP will remove the arp_table entry. Any sk_buff data structures queued waiting for the IP address to be translated will be notified and it is up to the protocol layer that is transmitting them to cope with this failure. UDP does not care about lost packets but TCP will attempt to retransmit on an established TCP link. If the owner of the IP address responds with its hardware address, the arp_table entry is marked as complete and any queued sk_buff's will be removed from the queue and will go on to be transmitted. The hardware address is written into the hardware header of each sk_buff.

The ARP protocol layer must also respond to ARP requests that specify its IP address. It registers its protocol type (ETH_P_ARP), generating a packet_type data structure. This means that it will be passed all ARP packets that are received by the network devices. As well as ARP replies, this includes ARP requests. It generates an ARP reply using the hardware address kept in the receiving device's device data structure.

Network topologies can change over time and IP addresses can be reassigned to different hardware addresses. For example, some dial up services assign an IP address as each connection is established. In order that the ARP table contains up to date entries, ARP runs a periodic timer which looks through all of the arp_table entries to see which have timed out. It is very careful not to remove entries that contain one or more cached hardware headers. Removing these entries is dangerous as other data structures rely on them. Some arp_table entries are permanent and these are marked so that they will not be deallocated. The ARP table cannot be allowed to grow too large; each arp_table entry consumes some kernel memory. Whenever a new entry needs to be allocated and the ARP table has reached its maximum size the table is pruned by searching out the oldest entries and removing them.

10.7 IP Routing

The IP routing function determines where to send IP packets destined for a particular IP address. There are many choices to be made when transmitting IP packets. Can the destination be reached at all? If it can be reached, which network device should be used to transmit it? If there is more than one network device that could be used to reach the destination, which is the better one? The IP routing database maintains information that gives answers to these questions. There are two databases, the most important being the Forwarding Information Database. This is an exhaustive list of known IP destinations and their best routes. A smaller and much faster database, the route cache, is used for quick lookups of routes for IP destinations. Like all caches, it must contain only the frequently accessed routes; its contents are derived from the Forwarding Information Database.
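Looking back at the ARP table described before the routing overview, the way an entry is found (an index made from the last two bytes of the IP address, then a walk along a chain) can be sketched as follows. Everything here is hypothetical and simplified: the arp_sketch_ names, the table size of 16, the XOR fold of the two bytes and the use of host byte order are assumptions made for the illustration, and the entry carries only a few of the fields listed in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARP_SKETCH_TABLE_SIZE 16u   /* assumption: a small power of two */

/* Cut-down stand-in for an arp_table entry. */
struct arp_entry_sketch {
    uint32_t ip;                     /* IP address, host byte order */
    unsigned char hw_addr[6];        /* translated hardware address */
    int complete;                    /* non-zero once a reply has arrived */
    struct arp_entry_sketch *next;   /* next entry in the same chain */
};

/* Table of pointers to chains, in the spirit of the arp_tables vector. */
static struct arp_entry_sketch *arp_sketch_table[ARP_SKETCH_TABLE_SIZE];

/* Build an index from the last two bytes of the address. */
unsigned int arp_sketch_hash(uint32_t ip)
{
    uint32_t last_two = ip & 0xffffu;
    return (unsigned int)((last_two ^ (last_two >> 8)) &
                          (ARP_SKETCH_TABLE_SIZE - 1u));
}

/* Link a new entry at the head of its chain. */
void arp_sketch_add(struct arp_entry_sketch *entry)
{
    unsigned int h;

    assert(entry != NULL);
    h = arp_sketch_hash(entry->ip);
    entry->next = arp_sketch_table[h];
    arp_sketch_table[h] = entry;
}

/* Follow the chain until the entry for ip is found; NULL if there is none. */
struct arp_entry_sketch *arp_sketch_lookup(uint32_t ip)
{
    struct arp_entry_sketch *e;

    for (e = arp_sketch_table[arp_sketch_hash(ip)]; e != NULL; e = e->next)
        if (e->ip == ip)
            return e;
    return NULL;
}
```

A NULL result from the lookup corresponds to the case in the text where ARP must create a new entry, queue the waiting sk_buff on it and send an ARP request. The timer, retry count and sk_buff queue are left out of the sketch.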

Routes are added and deleted via IOCTL requests to the BSD socket interface. These are passed onto the protocol to process. The INET protocol layer only allows processes with superuser privileges to add and delete IP routes. These routes can be fixed or they can be dynamic and change over time. Most systems use fixed routes unless they themselves are routers. Routers run routing protocols which constantly check on the availability of routes to all known IP destinations. Systems that are not routers are known as end systems. The routing protocols are implemented as daemons, for example GATED, and they also add and delete routes via the IOCTL BSD socket interface.

10.7.1 The Route Cache

Whenever an IP route is looked up, the route cache is first checked for a matching route. If there is no matching route in the route cache the Forwarding Information Database is searched for a route. If no route can be found there, the IP packet will fail to be sent and the application notified. If a route is in the Forwarding Information Database and not in the route cache, then a new entry is generated and added into the route cache for this route.

The route cache is a table (ip_rt_hash_table) that contains pointers to chains of rtable data structures. The index into the route table is a hash function based on the least significant two bytes of the IP address. These are the two bytes most likely to be different between destinations and provide the best spread of hash values. Each rtable entry contains information about the route: the destination IP address, the network device to use to reach that IP address, the maximum size of message that can be used and so on. It also has a reference count, a usage count and a timestamp of the last time that it was used (in jiffies). The reference count is incremented each time the route is used to show the number of network connections using this route. It is decremented as applications stop using the route. The usage count is incremented each time the route is looked up and is used to order the rtable entry in its chain of hash entries.

The last used timestamp for all of the entries in the route cache is periodically checked to see if the rtable is too old. If the route has not been recently used, it is discarded from the route cache. If routes are kept in the route cache they are ordered so that the most used entries are at the front of the hash chains. This means that finding them will be quicker when routes are looked up.
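The route cache behaviour described above (hashing on the two least significant bytes, counting uses, keeping busy routes near the front of their chain and discarding stale ones) can be sketched as below. This is a simplified model rather than the kernel's implementation: the rt_sketch_ names, the table size of 256, the XOR fold and the single-step reordering are assumptions, and the reference count and the fallback to the Forwarding Information Database are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RT_SKETCH_HASH_SIZE 256u   /* assumption: a power of two */

/* Cut-down stand-in for an rtable entry. */
struct rt_sketch {
    struct rt_sketch *next;
    uint32_t dst;            /* destination IP address, host byte order */
    unsigned int use;        /* usage count */
    unsigned long lastuse;   /* time of last use, in jiffies */
};

static struct rt_sketch *rt_sketch_table[RT_SKETCH_HASH_SIZE];

/* Hash on the least significant two bytes of the destination. */
unsigned int rt_sketch_hash(uint32_t dst)
{
    uint32_t low = dst & 0xffffu;
    return (unsigned int)((low ^ (low >> 8)) & (RT_SKETCH_HASH_SIZE - 1u));
}

/* Add a freshly generated entry at the head of its chain. */
void rt_sketch_add(struct rt_sketch *rt, unsigned long now)
{
    unsigned int h;

    assert(rt != NULL);
    h = rt_sketch_hash(rt->dst);
    rt->use = 0;
    rt->lastuse = now;
    rt->next = rt_sketch_table[h];
    rt_sketch_table[h] = rt;
}

/* Look up dst; on a hit, count the use and move the entry one place
 * forward if it is now used more than the entry in front of it. */
struct rt_sketch *rt_sketch_lookup(uint32_t dst, unsigned long now)
{
    struct rt_sketch **prev_link = NULL;
    struct rt_sketch **link = &rt_sketch_table[rt_sketch_hash(dst)];

    while (*link != NULL) {
        struct rt_sketch *rt = *link;

        if (rt->dst == dst) {
            rt->use++;
            rt->lastuse = now;
            if (prev_link != NULL && (*prev_link)->use < rt->use) {
                struct rt_sketch *pred = *prev_link;

                pred->next = rt->next;   /* unlink rt from behind pred */
                rt->next = pred;         /* and relink it in front */
                *prev_link = rt;
            }
            return rt;
        }
        prev_link = link;
        link = &rt->next;
    }
    return NULL;   /* caller would now search the forwarding database */
}

/* Discard every entry not used within max_age; returns how many went. */
size_t rt_sketch_expire(unsigned long now, unsigned long max_age)
{
    size_t removed = 0;
    unsigned int h;

    for (h = 0; h < RT_SKETCH_HASH_SIZE; h++) {
        struct rt_sketch **link = &rt_sketch_table[h];

        while (*link != NULL) {
            struct rt_sketch *rt = *link;

            if (now - rt->lastuse > max_age) {
                *link = rt->next;
                rt->next = NULL;
                removed++;
            } else {
                link = &rt->next;
            }
        }
    }
    return removed;
}
```

Moving a hit one place forward is only one possible way to keep the most used entries near the front; it was chosen here because it keeps the sketch short.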

10.7.2 The Forwarding Information Database

Figure 10.5: The Forwarding Information Database

The forwarding information database (shown in Figure 10.5) contains IP's view of the routes available to this system at this time. It is quite a complicated data structure and, although it is reasonably efficiently arranged, it is not a quick database to consult. In particular it would be very slow to look up destinations in this database for every IP packet transmitted. This is the reason that the route cache exists: to speed up IP packet transmission using known good routes. The route cache is derived from the forwarding database and represents its commonly used entries.

Each IP subnet is represented by a fib_zone data structure. All of these are pointed at from the fib_zones hash table. The hash index is derived from the IP subnet mask. All routes to the same subnet are described by pairs of fib_node and fib_info data structures queued onto the fz_list of each fib_zone data structure. If the number of routes in this subnet grows large, a hash table is generated to make finding the fib_node data structures easier.

Several routes may exist to the same IP subnet and these routes can go through one of several gateways. The IP routing layer does not allow more than one route to a subnet using the same gateway. In other words, if there are several routes to a subnet, then each route is guaranteed to use a different gateway. Associated with each route is its metric. This is a measure of how advantageous this route is. A route's metric is, essentially, the number of IP subnets that it must hop across before it reaches the destination subnet. The higher the metric, the worse the route.

Footnotes:
1 National Science Foundation
2 Synchronous Read Only Memory
3 duh? What used for?

File translated from TEX by TTH, version 1.0.
© 1996-1999 David A Rusling copyright notice.

Chapter 16 Useful Web and FTP Sites

The following World Wide Web and ftp sites are useful:

http://www.azstarnet.com/~axplinux
    This is David Mosberger-Tang's Alpha AXP Linux web site and it is the place to go for all of the Alpha AXP HOWTOs. It also has a large number of pointers to Linux and Alpha AXP specific information such as CPU data sheets.

http://www.redhat.com/
    Red Hat's web site. This has a lot of useful pointers.

ftp://sunsite.unc.edu
    This is the major site for a lot of free software. The Linux specific software is held in pub/Linux.

http://www.intel.com
    Intel's web site and a good place to look for Intel chip information.

http://www.ssc.com/lj/index.html
    The Linux Journal is a very good Linux magazine and well worth the yearly subscription for its excellent articles.

http://www.blackdown.org/java-linux.html
    This is the primary site for information on Java on Linux.

ftp://tsx-11.mit.edu/~ftp/pub/linux
    MIT's Linux ftp site.

ftp://ftp.cs.helsinki.fi/pub/Software/Linux/Kernel
    Linus's kernel sources.

http://www.linux.org.uk
    The UK Linux User Group.

http://sunsite.unc.edu/mdw/linux.html
    Home page for the Linux Documentation Project.

http://www.digital.com
    Digital Equipment Corporation's main web page.

http://altavista.digital.com
    DIGITAL's Altavista search engine. A very good place to search for information within the web and news groups.

http://www.linuxhq.com
    The Linux HQ web site holds up to date official and unofficial patches as well as advice and web pointers that help you get the best set of kernel sources possible for your system.

http://www.amd.com
    The AMD web site.

http://www.cyrix.com
    Cyrix's web site.

http://www.arm.com
    ARM's web site.

Chapter 15 Linux Data Structures

This appendix lists the major data structures that Linux uses and which are described in this book. They have been edited slightly to fit the paper.

block_dev_struct

block_dev_struct data structures are used to register block devices as available for use by the buffer cache. They are held together in the blk_dev vector.

struct blk_dev_struct {
    void (*request_fn)(void);
    struct request * current_request;
    struct request   plug;
    struct tq_struct plug_tq;
};

buffer_head

The buffer_head data structure holds information about a block buffer in the buffer cache.

/* bh state bits */
#define BH_Uptodate  0   /* 1 if the buffer contains valid data      */
#define BH_Dirty     1   /* 1 if the buffer is dirty                 */
#define BH_Lock      2   /* 1 if the buffer is locked                */
#define BH_Req       3   /* 0 if the buffer has been invalidated     */
#define BH_Touched   4   /* 1 if the buffer has been touched (aging) */
#define BH_Has_aged  5   /* 1 if the buffer has been aged (aging)    */
#define BH_Protected 6   /* 1 if the buffer is protected             */
#define BH_FreeOnIO  7   /* 1 to discard the buffer_head after IO    */

struct buffer_head {
    /* First cache line: */
    unsigned long      b_blocknr;    /* block number                   */
    kdev_t             b_dev;        /* device (B_FREE = free)         */
    kdev_t             b_rdev;       /* Real device                    */
    unsigned long      b_rsector;    /* Real buffer location on disk   */
    struct buffer_head *b_next;      /* Hash queue list                */
    struct buffer_head *b_this_page; /* circular list of buffers in one
                                        page                           */

    /* Second cache line: */
    unsigned long      b_state;      /* buffer state bitmap (above)    */
    struct buffer_head *b_next_free;
    unsigned int       b_count;      /* users using this block         */
    unsigned long      b_size;       /* block size                     */

    /* Non-performance-critical data follows. */
    char               *b_data;      /* pointer to data block          */
    unsigned int       b_list;       /* List that this buffer appears  */
    unsigned long      b_flushtime;  /* Time when this (dirty) buffer
                                      * should be written              */
    unsigned long      b_lru_time;   /* Time when this buffer was
                                      * last used.                     */
    struct wait_queue  *b_wait;
    struct buffer_head *b_prev;      /* doubly linked hash list        */
    struct buffer_head *b_prev_free; /* doubly linked list of buffers  */
    struct buffer_head *b_reqnext;   /* request queue                  */
};

device

Every network device in the system is represented by a device data structure.

struct device
{
    /*
     * This is the first field of the "visible" part of this structure
     * (i.e. as seen by users in the "Space.c" file).  It is the name
     * the interface.
     */
    char                    *name;

    /* I/O specific fields                                           */
    unsigned long           rmem_end;        /* shmem "recv" end     */
    unsigned long           rmem_start;      /* shmem "recv" start   */
    unsigned long           mem_end;         /* shared mem end       */
    unsigned long           mem_start;       /* shared mem start     */
    unsigned long           base_addr;       /* device I/O address   */
    unsigned char           irq;             /* device IRQ number    */

    /* Low-level status flags. */
    volatile unsigned char  start,           /* start an operation   */
                            interrupt;       /* interrupt arrived    */
    unsigned long           tbusy;           /* transmitter busy     */
    struct device           *next;

    /* The device initialization function. Called only once.         */
    int                     (*init)(struct device *dev);

    /* Some hardware also needs these fields, but they are not part of
       the usual set specified in Space.c. */
    unsigned char           if_port;         /* Selectable AUI,TP,   */
    unsigned char           dma;             /* DMA channel          */

    struct enet_statistics* (*get_stats)(struct device *dev);

    /*
     * This marks the end of the "visible" part of the structure. All
     * fields hereafter are internal to the system, and may change at
     * will (read: may be cleaned up at will).
     */

    /* These may be needed for future network-power-down code.       */
    unsigned long           trans_start;     /* Time (jiffies) of
                                                last transmit        */
    unsigned long           last_rx;         /* Time of last Rx      */
    unsigned short          flags;           /* interface flags (BSD)*/
    unsigned short          family;          /* address family ID    */
    unsigned short          metric;          /* routing metric       */
    unsigned short          mtu;             /* MTU value            */
    unsigned short          type;            /* hardware type        */
    unsigned short          hard_header_len; /* hardware hdr len     */
    void                    *priv;           /* private data         */

    /* Interface address info. */
    unsigned char           broadcast[MAX_ADDR_LEN];
    unsigned char           pad;
    unsigned char           dev_addr[MAX_ADDR_LEN];
    unsigned char           addr_len;        /* hardware addr len    */
    unsigned long           pa_addr;         /* protocol address     */
    unsigned long           pa_brdaddr;      /* protocol broadcast addr*/
    unsigned long           pa_dstaddr;      /* protocol P-P other addr*/
    unsigned long           pa_mask;         /* protocol netmask     */
    unsigned short          pa_alen;         /* protocol address len */

    struct dev_mc_list      *mc_list;        /* M'cast mac addrs     */
    int                     mc_count;        /* No installed mcasts  */

    struct ip_mc_list       *ip_mc_list;     /* IP m'cast filter chain */
    __u32                   tx_queue_len;    /* Max frames per queue   */

    /* For load balancing driver pair support */
    unsigned long           pkt_queue;       /* Packets queued       */
    struct device           *slave;          /* Slave device         */
    struct net_alias_info   *alias_info;     /* main dev alias info  */
    struct net_alias        *my_alias;       /* alias devs           */

    /* Pointer to the interface buffers. */
    struct sk_buff_head     buffs[DEV_NUMBUFFS];

    /* Pointers to interface service routines. */
    int                     (*open)(struct device *dev);
    int                     (*stop)(struct device *dev);
    int                     (*hard_start_xmit) (struct sk_buff *skb,
                                                struct device *dev);
    int                     (*hard_header) (struct sk_buff *skb,
                                            struct device *dev,
                                            unsigned short type,
                                            void *daddr,
                                            void *saddr,
                                            unsigned len);
    int                     (*rebuild_header)(void *eth,
                                            struct device *dev,
                                            unsigned long raddr,
                                            struct sk_buff *skb);
    void                    (*set_multicast_list)(struct device *dev);
    int                     (*set_mac_address)(struct device *dev,
                                            void *addr);
    int                     (*do_ioctl)(struct device *dev,
                                            struct ifreq *ifr,
                                            int cmd);
    int                     (*set_config)(struct device *dev,
                                            struct ifmap *map);
    void                    (*header_cache_bind)(struct hh_cache **hhp,
                                            struct device *dev,
                                            unsigned short htype,
                                            __u32 daddr);
    void                    (*header_cache_update)(struct hh_cache *hh,
                                            struct device *dev,
                                            unsigned char * haddr);
    int                     (*change_mtu)(struct device *dev,
                                            int new_mtu);
    struct iw_statistics*   (*get_wireless_stats)(struct device *dev);
};

device_struct

device_struct data structures are used to register character and block devices (they hold its name and the set of file operations that can be used for this device). Each valid member of the chrdevs and blkdevs vectors represents a character or block device respectively.

struct device_struct {
    const char * name;
    struct file_operations * fops;
};

file

Each open file, socket etcetera is represented by a file data structure.

struct file {
    mode_t f_mode;
    loff_t f_pos;
    unsigned short f_flags;
    unsigned short f_count;
    unsigned long f_reada, f_ramax, f_raend, f_ralen, f_rawin;
    struct file *f_next, *f_prev;
    int f_owner;         /* pid or -pgrp where SIGIO should be sent */
    struct inode * f_inode;
    struct file_operations * f_op;
    unsigned long f_version;
    void *private_data;  /* needed for tty driver, and maybe others */
};

files_struct

The files_struct data structure describes the files that a process has open.

struct files_struct {
    int count;
    fd_set close_on_exec;
    fd_set open_fds;
    struct file * fd[NR_OPEN];
};

fs_struct

struct fs_struct {
    int count;
    unsigned short umask;
    struct inode * root, * pwd;
};

gendisk

The gendisk data structure holds information about a hard disk. They are used during initialization when the disks are found and then probed for partitions.

struct hd_struct {
    long start_sect;
    long nr_sects;
};

struct gendisk {
    int major;               /* major number of driver */
    const char *major_name;  /* name of major driver */
    int minor_shift;         /* number of times minor is shifted to
                                get real minor */
    int max_p;               /* maximum partitions per device */
    int max_nr;              /* maximum number of real devices */

    void (*init)(struct gendisk *);
                             /* Initialization called before we
                                do our thing */
    struct hd_struct *part;  /* partition table */
    int *sizes;              /* device size in blocks, copied to
                                blk_size[] */
    int nr_real;             /* number of real devices */

    void *real_devices;      /* internal use */
    struct gendisk *next;
};

inode

The VFS inode data structure holds information about a file or directory on disk.

struct inode {
    kdev_t                       i_dev;
    unsigned long                i_ino;
    umode_t                      i_mode;
    nlink_t                      i_nlink;
    uid_t                        i_uid;
    gid_t                        i_gid;
    kdev_t                       i_rdev;
    off_t                        i_size;
    time_t                       i_atime;
    time_t                       i_mtime;
    time_t                       i_ctime;
    unsigned long                i_blksize;
    unsigned long                i_blocks;
    unsigned long                i_version;
    unsigned long                i_nrpages;
    struct semaphore             i_sem;
    struct inode_operations      *i_op;
    struct super_block           *i_sb;
    struct wait_queue            *i_wait;
    struct file_lock             *i_flock;
    struct vm_area_struct        *i_mmap;
    struct page                  *i_pages;
    struct dquot                 *i_dquot[MAXQUOTAS];
    struct inode                 *i_next, *i_prev;
    struct inode                 *i_hash_next, *i_hash_prev;
    struct inode                 *i_bound_to, *i_bound_by;
    struct inode                 *i_mount;
    unsigned short               i_count;
    unsigned short               i_flags;
    unsigned char                i_lock;
    unsigned char                i_dirt;
    unsigned char                i_pipe;
    unsigned char                i_sock;
    unsigned char                i_seek;
    unsigned char                i_update;
    unsigned short               i_writecount;
    union {
        struct pipe_inode_info   pipe_i;
        struct minix_inode_info  minix_i;
        struct ext_inode_info    ext_i;
        struct ext2_inode_info   ext2_i;
        struct hpfs_inode_info   hpfs_i;
        struct msdos_inode_info  msdos_i;
        struct umsdos_inode_info umsdos_i;
        struct iso_inode_info    isofs_i;
        struct nfs_inode_info    nfs_i;
        struct xiafs_inode_info  xiafs_i;
        struct sysv_inode_info   sysv_i;
        struct affs_inode_info   affs_i;
        struct ufs_inode_info    ufs_i;
        struct socket            socket_i;
        void                     *generic_ip;
    } u;
};

ipc_perm

The ipc_perm data structure describes the access permissions of a System V IPC object.

struct ipc_perm
{
    key_t  key;
    ushort uid;   /* owner euid and egid */
    ushort gid;
    ushort cuid;  /* creator euid and egid */
    ushort cgid;
    ushort mode;  /* access modes see mode flags below */
    ushort seq;   /* sequence number */
};

irqaction

The irqaction data structure is used to describe the system's interrupt handlers.

struct irqaction {
    void (*handler)(int, void *, struct pt_regs *);
    unsigned long flags;
    unsigned long mask;
    const char *name;
    void *dev_id;
    struct irqaction *next;
};

linux_binfmt

Each binary file format that Linux understands is represented by a linux_binfmt data structure.

struct linux_binfmt {
    struct linux_binfmt * next;
    long *use_count;
    int (*load_binary)(struct linux_binprm *, struct pt_regs * regs);
    int (*load_shlib)(int fd);
    int (*core_dump)(long signr, struct pt_regs * regs);
};

mem_map_t

The mem_map_t data structure (also known as page) is used to hold information about each page of physical memory.

typedef struct page {
    /* these must be first (free area handling) */
    struct page        *next;
    struct page        *prev;
    struct inode       *inode;
    unsigned long      offset;
    struct page        *next_hash;
    atomic_t           count;
    unsigned           flags;     /* atomic flags, some possibly
                                     updated asynchronously */
    unsigned           dirty:16,
                       age:8;
    struct wait_queue  *wait;
    struct page        *prev_hash;
    struct buffer_head *buffers;
    unsigned long      swap_unlock_entry;
    unsigned long      map_nr;    /* page->map_nr == page - mem_map */
} mem_map_t;

mm_struct

The mm_struct data structure is used to describe the virtual memory of a task or process.

struct mm_struct {
    int count;
    pgd_t * pgd;
    unsigned long context;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack, start_mmap;
    unsigned long arg_start, arg_end, env_start, env_end;
    unsigned long rss, total_vm, locked_vm;
    unsigned long def_flags;
    struct vm_area_struct * mmap;
    struct vm_area_struct * mmap_avl;
    struct semaphore mmap_sem;
};

pci_bus

Every PCI bus in the system is represented by a pci_bus data structure.

struct pci_bus {
    struct pci_bus  *parent;     /* parent bus this bridge is on */
    struct pci_bus  *children;   /* chain of P2P bridges on this bus */
    struct pci_bus  *next;       /* chain of all PCI buses */

    struct pci_dev  *self;       /* bridge device as seen by parent */
    struct pci_dev  *devices;    /* devices behind this bridge */

    void    *sysdata;            /* hook for sys-specific extension */

    unsigned char  number;       /* bus number */
    unsigned char  primary;      /* number of primary bridge */
    unsigned char  secondary;    /* number of secondary bridge */
    unsigned char  subordinate;  /* max number of subordinate buses */
};

pci_dev

Every PCI device in the system, including PCI-PCI and PCI-ISA bridge devices is represented by a pci_dev data structure.

/*
 * There is one pci_dev structure for each slot-number/function-number
 * combination:
 */
struct pci_dev {
    struct pci_bus  *bus;      /* bus this device is on */
    struct pci_dev  *sibling;  /* next device on this bus */
    struct pci_dev  *next;     /* chain of all devices */

    void    *sysdata;          /* hook for sys-specific extension */

    unsigned int  devfn;       /* encoded device & function index */
    unsigned short  vendor;
    unsigned short  device;
    unsigned int  class;       /* 3 bytes: (base,sub,prog-if) */
    unsigned int  master : 1;  /* set if device is master capable */
    /*
     * In theory, the irq level can be read from configuration
     * space and all would be fine.  However, old PCI chips don't
     * support these registers and return 0 instead.  For example,
     * the Vision864-P rev 0 chip can use INTA, but returns 0 in
     * the interrupt line and pin registers.  pci_init()
     * initializes this field with the value at PCI_INTERRUPT_LINE
     * and it is the job of pcibios_fixup() to change it if
     * necessary.  The field must not be 0 unless the device
     * cannot generate interrupts at all.
     */
    unsigned char  irq;        /* irq generated by this device */
};

request

request data structures are used to make requests to the block devices in the system. The requests are always to read or write blocks of data to or from the buffer cache.

struct request {
    volatile int rq_status;
#define RQ_INACTIVE            (-1)
#define RQ_ACTIVE              1
#define RQ_SCSI_BUSY           0xffff
#define RQ_SCSI_DONE           0xfffe
#define RQ_SCSI_DISCONNECTING  0xffe0

    kdev_t rq_dev;
    int cmd;        /* READ or WRITE */
    int errors;
    unsigned long sector;
    unsigned long nr_sectors;
    unsigned long current_nr_sectors;
    char * buffer;
    struct semaphore * sem;
    struct buffer_head * bh;
    struct buffer_head * bhtail;
    struct request * next;
};

rtable

Each rtable data structure holds information about the route to take in order to send packets to an IP host. rtable data structures are used within the IP route cache.

struct rtable
{
    struct rtable     *rt_next;
    __u32             rt_dst;
    __u32             rt_src;
    __u32             rt_gateway;
    atomic_t          rt_refcnt;
    atomic_t          rt_use;
    unsigned long     rt_window;
    atomic_t          rt_lastuse;
    struct hh_cache   *rt_hh;
    struct device     *rt_dev;
    unsigned short    rt_flags;
    unsigned short    rt_mtu;
    unsigned short    rt_irtt;
    unsigned char     rt_tos;
};

semaphore

Semaphores are used to protect critical data structures and regions of code.

struct semaphore {
    int count;
    int waking;
    int lock;                /* to make waking testing atomic */
    struct wait_queue *wait;
};

/* used. /* Device we arrived on/are leaving by */ union { struct tcphdr *th. /* Link for IP protocol level buffer chains */ struct sock *sk. /* pkt_bridged. /* saddr. /* Socket we are owned by */ unsigned long when. /* lock. /* ack_seq. /* daddr. /* len. /* raddr. /* tries.it/LDP/tlk/ds/ds. struct sk_buff { struct sk_buff *next. } mac. struct iphdr *iph. /* arp. /* proto_priv[16]. union { /* As yet incomplete physical layer views */ unsigned char *raw. /* free.sk_buff The sk_buff data structure is used to describe network data as it moves between the layers of protocol.html (11 di 18) [08/03/2001 10. /* Next buffer in list */ struct sk_buff *prev. /* List we are on */ int magic_debug_cookie. /* seq.09. struct ethhdr *eth. /* For IPPROTO_RAW Length of actual data Checksum IP source address IP target address IP next hop address TCP sequence number seq [+ fin] [+ syn] + datalen TCP ack sequence number */ */ */ */ */ */ */ */ */ unsigned char Are we acked ? */ Are we in use ? */ How to free this buffer */ Has IP/ARP resolution finished */ Times tried */ Are we locked ? */ Local routing asserted for this frame */ Packet class */ Tracker for bridging */ Driver fed us an IP checksum */ http://ldp. /* Previous buffer in list */ struct sk_buff_head *list. struct udphdr *uh. /* pkt_type.18] . struct sk_buff *link3. /* localroute.iol. /* ip_summed. /* Time we arrived */ struct device *dev. unsigned char *raw. /* csum. } h. struct iphdr unsigned long unsigned long __u32 __u32 __u32 __u32 __u32 __u32 unsigned char volatile char *ip_hdr. /* used to compute rtt's */ struct timeval stamp. /* for passing file handles in a unix domain socket */ void *filp. /* end_seq. struct ethhdr *ethernet. acked.

allocation.18] . rcv_ack_seq.it/LDP/tlk/ds/ds. /* Link to the actual data skb unsigned char *head. */ */ */ */ */ */ */ */ */ */ */ */ */ */ */ sock Each sock data structure holds protocol specific information about a BSD socket. struct sock *sklist_prev. done. */ volatile char dead. /* Destruct function __u16 redirport. acked_seq. urginline. fin_seq. for an INET (Internet Address Domain) socket this data structure would hold all of the TCP/IP and UDP/IP specific information. copied_seq. /* Allocation mode */ /* count of same ack */ /* user count */ /* * Not all are volatile. users. /* Head of buffer unsigned char *data. For example.c unsigned short protocol. */ struct sock *sklist_next. window_seq. unsigned int truesize. but some are. rmem_alloc. sent_seq. /* reference count struct sk_buff *data_skb. intr. urg_seq. write_seq.html (12 di 18) [08/03/2001 10. /* Data head pointer unsigned char *tail. struct sock { /* This must be first. struct options atomic_t atomic_t unsigned long __u32 __u32 __u32 __u32 __u32 unsigned short __u32 __u32 __u32 __u32 __u32 int *opt. blog.see datagram.iol. /* Buffer size atomic_t count. http://ldp. so we * might as well say they all are. urg_data.c. /* Packet protocol from driver.09. /* Tail pointer unsigned char *end. rcv_ack_cnt. /* Redirect port }. /* End pointer void (*destructor)(struct sk_buff *). wmem_alloc. syn_seq.tcp. /* User count .#define PACKET_HOST 0 /* To us #define PACKET_BROADCAST 1 /* To all #define PACKET_MULTICAST 2 /* To group #define PACKET_OTHERHOST 3 /* To someone else unsigned short users.

                            reuse,
                            keepopen,
                            linger,
                            delay_acks,
                            destroy,
                            ack_timed,
                            no_check,
                            zapped,
                            broadcast,
                            nonagle,
                            bsdism;
    unsigned long           lingertime;
    int                     proc;
    struct sock             *next;
    struct sock             **pprev;
    struct sock             *bind_next;
    struct sock             **bind_pprev;
    struct sock             *pair;
    int                     hashent;
    struct sock             *prev;
    struct sk_buff          *volatile send_head;
    struct sk_buff          *volatile send_next;
    struct sk_buff          *volatile send_tail;
    struct sk_buff_head     back_log;
    struct sk_buff          *partial;
    struct timer_list       partial_timer;
    long                    retransmits;
    struct sk_buff_head     write_queue,
                            receive_queue;
    struct proto            *prot;
    struct wait_queue       **sleep;
    __u32                   daddr;
    __u32                   saddr;            /* Sending source */
    __u32                   rcv_saddr;        /* Bound address */
    unsigned short          max_unacked;
    unsigned short          window;
    __u32                   lastwin_seq;      /* sequence number when we last
                                                 updated the window we offer */
    __u32                   high_seq;         /* sequence number when we did
                                                 current fast retransmit */
    volatile unsigned long  ato;              /* ack timeout */
    volatile unsigned long  lrcvtime;         /* jiffies at last data rcv */
    volatile unsigned long  idletime;         /* jiffies at last rcv */
    unsigned int            bytes_rcv;
/*
 *    mss is min(mtu, max_window)
 */
    unsigned short          mtu;              /* mss negotiated in the syn's */
    volatile unsigned short mss;              /* current eff. mss - can change */
    volatile unsigned short user_mss;         /* mss requested by user in ioctl */
    volatile unsigned short max_window;
    unsigned long           window_clamp;
    unsigned int            ssthresh;
    unsigned short          num;
    volatile unsigned short cong_window;
    volatile unsigned short cong_count;
    volatile unsigned short packets_out;
    volatile unsigned short shutdown;
    volatile unsigned long  rtt;
    volatile unsigned long  mdev;
    volatile unsigned long  rto;

    volatile unsigned short backoff;
    int                     err, err_soft;    /* Soft holds errors that don't
                                                 cause failure but are the cause
                                                 of a persistent failure not
                                                 just 'timed out' */
    unsigned char           protocol;
    volatile unsigned char  state;
    unsigned char           ack_backlog;
    unsigned char           max_ack_backlog;
    unsigned char           priority;
    unsigned char           debug;
    int                     rcvbuf;
    int                     sndbuf;
    unsigned short          type;
    unsigned char           localroute;       /* Route locally only */
/*
 *    This is where all the private (optional) areas that don't
 *    overlap will eventually live.
 */
    union
    {
        struct unix_opt         af_unix;
#if defined(CONFIG_ATALK) || defined(CONFIG_ATALK_MODULE)
        struct atalk_sock       af_at;
#endif
#if defined(CONFIG_IPX) || defined(CONFIG_IPX_MODULE)
        struct ipx_opt          af_ipx;
#endif
#ifdef CONFIG_INET
        struct inet_packet_opt  af_packet;
#ifdef CONFIG_NUTCP
        struct tcp_opt          af_tcp;
#endif
#endif
    } protinfo;
/*
 *    IP 'private area'
 */
    int                     ip_ttl;           /* TTL setting */
    int                     ip_tos;           /* TOS */
    struct tcphdr           dummy_th;
    struct timer_list       keepalive_timer;  /* TCP keepalive hack */
    struct timer_list       retransmit_timer; /* TCP retransmit timer */
    struct timer_list       delack_timer;     /* TCP delayed ack timer */
    int                     ip_xmit_timeout;  /* Why the timeout is running */
    struct rtable           *ip_route_cache;  /* Cached output route */
    unsigned char           ip_hdrincl;       /* Include headers ? */
#ifdef CONFIG_IP_MULTICAST
    int                     ip_mc_ttl;        /* Multicasting TTL */
    int                     ip_mc_loop;       /* Loopback */
    char                    ip_mc_name[MAX_ADDR_LEN]; /* Multicast device name */
    struct ip_mc_socklist   *ip_mc_list;      /* Group array */
#endif

/*
 *    This part is used for the timeout functions (timer.c).
 */
    int                     timeout;          /* What are we waiting for? */
    struct timer_list       timer;            /* This is the TIME_WAIT/receive
                                               * timer when we are doing IP
                                               */
    struct timeval          stamp;
/*
 *    Identd
 */
    struct socket           *socket;
/*
 *    Callbacks
 */
    void                    (*state_change)(struct sock *sk);
    void                    (*data_ready)(struct sock *sk, int bytes);
    void                    (*write_space)(struct sock *sk);
    void                    (*error_report)(struct sock *sk);
};

socket

Each socket data structure holds information about a BSD socket. It does not exist independently; it is, instead, part of the VFS inode data structure.

struct socket {
    short                type;         /* SOCK_STREAM, ...             */
    socket_state         state;
    long                 flags;
    struct proto_ops     *ops;         /* protocols do most everything */
    void                 *data;        /* protocol data                */
    struct socket        *conn;        /* server socket connected to   */
    struct socket        *iconn;       /* incomplete client conn.s     */
    struct socket        *next;
    struct wait_queue    **wait;       /* ptr to place to wait on      */
    struct inode         *inode;
    struct fasync_struct *fasync_list; /* Asynchronous wake up list    */
    struct file          *file;        /* File back pointer for gc     */
};
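The statement that a socket lives inside the VFS inode can be pictured with a small user-space sketch. Every name below (toy_inode, toy_socket, the socket_i member and the i_sock flag) is a simplified stand-in chosen for illustration and is an assumption, not the kernel's real definition; the real structures carry many more fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
typedef enum {
    TOY_SS_FREE,
    TOY_SS_UNCONNECTED,
    TOY_SS_CONNECTED
} toy_socket_state;

struct toy_inode;

struct toy_socket {
    short             type;    /* e.g. a stream or datagram socket */
    toy_socket_state  state;
    struct toy_inode *inode;   /* back pointer to the owning inode */
};

struct toy_inode {
    unsigned long i_ino;
    int           i_sock;      /* non-zero when the union holds a socket */
    union {
        struct toy_socket socket_i;  /* the socket stored inside the inode */
        char              other[64]; /* placeholder for other per-type data */
    } u;
};

/* Return the socket embedded in an inode, or NULL if it is not a socket. */
static struct toy_socket *toy_inode_to_socket(struct toy_inode *inode)
{
    assert(inode != NULL);
    return inode->i_sock ? &inode->u.socket_i : NULL;
}

/* Mark an inode as a socket and wire up the back pointer. */
static struct toy_socket *toy_socket_init(struct toy_inode *inode, short type)
{
    struct toy_socket *sock;

    assert(inode != NULL);
    inode->i_sock = 1;
    sock = &inode->u.socket_i;
    sock->type = type;
    sock->state = TOY_SS_UNCONNECTED;
    sock->inode = inode;
    return sock;
}
```

Because the socket is a member of the inode rather than a separate allocation, it shares the inode's lifetime, which is one way to read "it does not exist independently".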

task_struct

Each task_struct data structure describes a process or task in the system.

struct task_struct {
/* these are hardcoded - don't touch */
    volatile long        state;          /* -1 unrunnable, 0 runnable, >0 stopped */
    long                 counter;
    long                 priority;
    unsigned long        signal;
    unsigned long        blocked;        /* bitmap of masked signals */
    unsigned long        flags;          /* per process flags, defined below */
    int                  errno;
    long                 debugreg[8];    /* Hardware debugging registers */
    struct exec_domain   *exec_domain;
/* various fields */
    struct linux_binfmt  *binfmt;
    struct task_struct   *next_task, *prev_task;
    struct task_struct   *next_run,  *prev_run;
    unsigned long        saved_kernel_stack;
    unsigned long        kernel_stack_page;
    int                  exit_code, exit_signal;
    /* ??? */
    unsigned long        personality;
    int                  dumpable:1;
    int                  did_exec:1;
    int                  pid;
    int                  pgrp;
    int                  tty_old_pgrp;
    int                  session;
    /* boolean value for session group leader */
    int                  leader;
    int                  groups[NGROUPS];
    /*
     * pointers to (original) parent process, youngest child, younger sibling,
     * older sibling, respectively.  (p->father can be replaced with
     * p->p_pptr->pid)
     */
    struct task_struct   *p_opptr, *p_pptr, *p_cptr,
                         *p_ysptr, *p_osptr;
    struct wait_queue    *wait_chldexit;
    unsigned short       uid,euid,suid,fsuid;
    unsigned short       gid,egid,sgid,fsgid;
    unsigned long        timeout, policy, rt_priority;
    unsigned long        it_real_value, it_prof_value, it_virt_value;
    unsigned long        it_real_incr, it_prof_incr, it_virt_incr;
    struct timer_list    real_timer;
    long                 utime, stime, cutime, cstime, start_time;
/* mm fault and swap info: this can arguably be seen as either
   mm-specific or thread-specific */
    unsigned long        min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
    int                  swappable:1;
    unsigned long        swap_address;
    unsigned long        old_maj_flt;    /* old value of maj_flt */
    unsigned long        dec_flt;        /* page fault count of the last time */
    unsigned long        swap_cnt;       /* number of pages to swap on next pass */
/* limits */
    struct rlimit        rlim[RLIM_NLIMITS];
    unsigned short       used_math;
    char                 comm[16];
/* file system info */
    int                  link_count;
    struct tty_struct    *tty;           /* NULL if no tty */
/* ipc stuff */
    struct sem_undo      *semundo;
    struct sem_queue     *semsleeping;
/* ldt for this task - used by Wine.  If NULL, default_ldt is used */
    struct desc_struct   *ldt;
/* tss for this task */
    struct thread_struct tss;
/* filesystem information */
    struct fs_struct     *fs;
/* open file information */
    struct files_struct  *files;
/* memory management info */
    struct mm_struct     *mm;
/* signal handlers */
    struct signal_struct *sig;
#ifdef __SMP__
    int                  processor;
    int                  last_processor;
    int                  lock_depth;     /* Lock depth.
                                            We can context switch in and out
                                            of holding a syscall kernel lock... */
#endif
};

timer_list

timer_list data structures are used to implement real time timers for processes.

struct timer_list {
    struct timer_list *next;
    struct timer_list *prev;
    unsigned long expires;
    unsigned long data;
    void (*function)(unsigned long);
};

tq_struct

Each task queue (tq_struct) data structure holds information about work that has been queued. This is usually a task needed by a device driver but which does not have to be done immediately.

struct tq_struct {
    struct tq_struct *next;   /* linked list of active bh's */
    int sync;                 /* must be initialized to zero */
    void (*routine)(void *);  /* function to call */
    void *data;               /* argument to function */
};

vm_area_struct

Each vm_area_struct data structure describes an area of virtual memory for a process.

struct vm_area_struct {
    struct mm_struct * vm_mm;  /* VM area parameters */
    unsigned long vm_start;
    unsigned long vm_end;
    pgprot_t vm_page_prot;
    unsigned short vm_flags;
/* AVL tree of VM areas per task, sorted by address */
    short vm_avl_height;
    struct vm_area_struct * vm_avl_left;
    struct vm_area_struct * vm_avl_right;
/* linked list of VM areas per task, sorted by address */
    struct vm_area_struct * vm_next;
/* for areas with inode, the circular list inode->i_mmap */
/* for shm areas, the circular list of attaches */
/* otherwise unused */
    struct vm_area_struct * vm_next_share;
    struct vm_area_struct * vm_prev_share;
/* more */
    struct vm_operations_struct * vm_ops;
    unsigned long vm_offset;
    struct inode * vm_inode;
    unsigned long vm_pte;      /* shared mem */
};
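The vm_area_struct comments say the areas of a task are kept on a linked list sorted by address. A minimal user-space sketch of what that ordering buys is given below. The toy_vma type and the toy_* functions are hypothetical names invented for this illustration; the sketch uses only the list, not the AVL tree, and it assumes vm_end is the first address past the area.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down version of vm_area_struct: only the fields the
 * lookup needs. */
struct toy_vma {
    unsigned long   vm_start;  /* first address inside the area */
    unsigned long   vm_end;    /* assumed: first address past the area */
    struct toy_vma *vm_next;   /* next area, in ascending address order */
};

/* Insert an area so that the list stays sorted by vm_start.
 * Returns the (possibly new) head of the list. */
static struct toy_vma *toy_insert_vma(struct toy_vma *head, struct toy_vma *vma)
{
    struct toy_vma **link = &head;

    assert(vma != NULL && vma->vm_start < vma->vm_end);
    while (*link != NULL && (*link)->vm_start < vma->vm_start)
        link = &(*link)->vm_next;
    vma->vm_next = *link;
    *link = vma;
    return head;
}

/* Walk the sorted list and return the first area whose vm_end lies above
 * addr, or NULL if there is none. The caller still has to check
 * vm_start <= addr to know whether addr is really inside that area. */
static struct toy_vma *toy_find_vma(struct toy_vma *head, unsigned long addr)
{
    struct toy_vma *vma;

    for (vma = head; vma != NULL; vma = vma->vm_next) {
        if (addr < vma->vm_end)
            return vma;
    }
    return NULL;
}

/* Return non-zero when addr falls inside one of the areas on the list. */
static int toy_addr_is_mapped(struct toy_vma *head, unsigned long addr)
{
    struct toy_vma *vma = toy_find_vma(head, addr);

    return vma != NULL && vma->vm_start <= addr;
}
```

Because the list is sorted, the walk can stop at the first area that ends above the address; a balanced tree over the same keys would presumably serve the same lookup with fewer comparisons when a task has many areas.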

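The tq_struct description above (work that is queued now and run later) can be sketched in user space as follows. The toy_tq type and the toy_* functions are hypothetical names for this illustration only, and appending at the tail is a simplification chosen here; it is not meant to mirror how the kernel orders its queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space copy of the tq_struct layout. */
struct toy_tq {
    struct toy_tq *next;             /* next queued element */
    int            sync;             /* non-zero while the element is queued */
    void         (*routine)(void *); /* function to call */
    void          *data;             /* argument to function */
};

/* Append an element to the queue unless it is already queued.
 * Returns 1 if it was queued, 0 if it was already pending. */
static int toy_queue_task(struct toy_tq *item, struct toy_tq **queue)
{
    struct toy_tq **link = queue;

    assert(item != NULL && queue != NULL);
    if (item->sync)
        return 0;
    item->sync = 1;
    item->next = NULL;
    while (*link != NULL)
        link = &(*link)->next;
    *link = item;
    return 1;
}

/* Detach the whole queue, then call each routine in order.
 * Returns the number of routines that were run. */
static int toy_run_task_queue(struct toy_tq **queue)
{
    struct toy_tq *item;
    int count = 0;

    assert(queue != NULL);
    item = *queue;
    *queue = NULL;
    while (item != NULL) {
        struct toy_tq *next = item->next;

        item->sync = 0;              /* the element may be queued again */
        item->routine(item->data);
        item = next;
        count++;
    }
    return count;
}

/* Example routine: add one to the int that data points at. */
static void toy_bump(void *data)
{
    *(int *)data += 1;
}
```

The sync flag is why the listing asks for it to be initialized to zero: in this sketch a non-zero value means "already pending", so a second request to queue the same element is ignored.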
Chapter 17 Linux Documentation Project Manifesto

This is the Linux Documentation Project ``Manifesto''

Last Revision 21 September 1998, by Michael K. Johnson

This file describes the goals and current status of the Linux Documentation Project, including names of projects, volunteers, and so on.

17.1 Overview

The Linux Documentation Project is working on developing good, reliable docs for the Linux operating system. The overall goal of the LDP is to collaborate in taking care of all of the issues of Linux documentation, ranging from online docs (man pages, texinfo docs, and so on) to printed manuals covering topics such as installing, using, and running Linux. The LDP is essentially a loose team of volunteers with little central organization; anyone who is interested in helping is welcome to join in the effort. We feel that working together and agreeing on the direction and scope of Linux documentation is the best way to go, to reduce problems with conflicting efforts--two people writing two books on the same aspect of Linux wastes someone's time along the way.

The LDP is set out to produce the canonical set of Linux online and printed documentation. Because our docs will be freely available (like software licensed under the terms of the GNU GPL) and distributed on the net, we are able to easily update the documentation to stay on top of the many changes in the Linux world. If you are interested in publishing any of the LDP works, see the section ``Publishing LDP Manuals'', below.

17.2 Getting Involved

Send mail to linux-howto@metalab.unc.edu

Of course, you'll also need to get in touch with the coordinator of whatever LDP projects you're interested in working on; see the next section.

17.3 Current Projects

For a list of current projects, see the LDP Homepage at http://sunsite.unc.edu/LDP/ldp.html. The best way to get involved with one of these projects is to pick up the current version of the manual and send revisions, editions, or suggestions to the coordinator. You probably want to coordinate with the author before sending revisions so that you know you are working together.

17.4 FTP sites for LDP works

LDP works can be found on sunsite.unc.edu in the directory /pub/Linux/docs. LDP manuals are found in /pub/Linux/docs/LDP, HOWTOs and other documentation found in /pub/Linux/docs/HOWTO.

17.5 Documentation Conventions

Here are the conventions that are currently used by LDP manuals. If you are interested in writing another manual using different conventions, please let us know of your plans first.

The man pages - the Unix standard for online manuals - are created with the Unix standard nroff man (or BSD mdoc) macros.

The guides - full books produced by the LDP - have historically been done in LaTeX, as their primary goal has been printed documentation. However, guide authors have been moving towards SGML with the DocBook DTD, because it allows them to create more different kinds of output, both printed and on-line. If you use LaTeX, we have a style file you can use to keep your printed look consistent with other LDP documents, and we suggest that you use it.

The HOWTO documents are all required to be in SGML format. Currently, they use the linuxdoc DTD, which is quite simple. There is a move afoot to switch to the DocBook DTD over time.

LDP documents must be freely redistributable without fees paid to the authors. It is not required that the text be modifiable, but it is encouraged. You can come up with your own license terms that satisfy this constraint, or you can use a previously prepared license. The LDP provides a boilerplate license that you can use, some people like to use the GPL, and others write their own.

The copyright for each manual should be in the name of the head writer or coordinator for the project. ``The Linux Documentation Project'' isn't a formal entity and shouldn't be used to copyright the docs.

17.6 Copyright and License

Here is a ``boilerplate'' license you may apply to your work. It has not been reviewed by a lawyer; feel free to have your own lawyer review it (or your modification of it) for its applicability to your own desires. Remember that in order for your document to be part of the LDP, you must allow unlimited reproduction and distribution without fee.

This manual may be reproduced and distributed in whole or in part, without fee, subject to the following conditions:

- The copyright notice above and this permission notice must be preserved complete on all complete or partial copies.
- Any translation or derived work must be approved by the author in writing before distribution.
- If you distribute this work in part, instructions for obtaining the complete version of this manual must be included, and a means for obtaining a complete version provided.
- Small portions may be reproduced as illustrations for reviews or quotes in other works without this permission notice if proper citation is given.

Exceptions to these rules may be granted for academic purposes: Write to the author and ask. These restrictions are here to protect us as authors, not to restrict you as learners and educators.

All source code in this document is placed under the GNU General Public License, available via anonymous FTP from prep.ai.mit.edu:/pub/gnu/COPYING.

17.7 Publishing LDP Manuals

If you're a publishing company interested in distributing any of the LDP manuals, read on.

By the license requirements given previously, anyone is allowed to publish and distribute verbatim copies of the Linux Documentation Project manuals. You don't need our explicit permission for this. However, if you would like to distribute a translation or derivative work based on any of the LDP manuals, you may need to obtain permission from the author, in writing, before doing so, if the license requires that.

You may, of course, sell the LDP manuals for profit. We encourage you to do so. Keep in mind, however, that because the LDP manuals are freely distributable, anyone may photocopy or distribute printed copies free of charge, if they wish to do so.

We do not require to be paid royalties for any profit earned from selling LDP manuals. However, we would like to suggest that if you do sell LDP manuals for profit, that you either offer the author royalties, or donate a portion of your earnings to the author, the LDP as a whole, or to the Linux development community. You may also wish to send one or more free copies of the LDP manuals that you are distributing to the authors. Your show of support for the LDP and the Linux community will be very much appreciated.

We would like to be informed of any plans to publish or distribute LDP manuals, just so we know how they're becoming available. If you are publishing or planning to publish any LDP manuals, please send mail to ldp-l@linux.org.au. It's nice to know who's doing what.

We encourage Linux software distributors to distribute the LDP manuals (such as the Installation and Getting Started Guide) with their software. The LDP manuals are intended to be used as the ``official'' Linux documentation, and we are glad to see mail-order distributors bundling the LDP manuals with the software. As the LDP manuals mature, hopefully they will fulfill this goal more and more adequately.

© 1996-1999 David A Rusling

Chapter 18 The GNU General Public License

Printed below is the GNU General Public License (the GPL or copyleft), under which Linux is licensed. It is reproduced here to clear up some of the confusion about Linux's copyright status--Linux is not shareware, and it is not in the public domain. The bulk of the Linux kernel is copyright © 1993 by Linus Torvalds, and other software and parts of the kernel are copyrighted by their authors. Thus, Linux is copyrighted, however, you may redistribute it under the terms of the GPL printed below.

GNU GENERAL PUBLIC LICENSE
Version 2, June 1991

Copyright (C) 1989, 1991 Free Software Foundation, Inc. 675 Mass Ave, Cambridge, MA 02139, USA
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

18.1 Preamble

The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.

Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.

Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.

The precise terms and conditions for copying, distribution and modification follow.

18.2 Terms and Conditions for Copying, Distribution, and Modification

0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The ``Program'', below, refers to any such program or work, and a ``work based on the Program'' means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term ``modification''.) Each licensee is addressed as ``you''.

Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does.

1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.

You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.

2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:

    a. You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change.

    b. You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.

    c. If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program.

In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License.

3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:

    a. Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

    b. Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

    c. Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)

The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.

If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code.

4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.

5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it.

6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License.

7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances.

It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice.

This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License.

8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.

9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and ``any later version'', you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation.

10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.

NO WARRANTY

11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.


18.3 Appendix: How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the ``copyright'' line and a pointer to where the full notice is found.

    <one line to give the program's name and a brief idea of what it does.>
    Copyright © 19yy <name of author>

    This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this when it starts in an interactive mode:

    Gnomovision version 69, Copyright (C) 19yy name of author
    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your school, if any, to sign a ``copyright disclaimer'' for the program, if necessary. Here is a sample; alter the names:

    Yoyodyne, Inc., hereby disclaims all copyright interest in the program
    `Gnomovision' (which makes passes at compilers) written by James Hacker.

    <signature of Ty Coon>, 1 April 1989
    Ty Coon, President of Vice

This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public License instead of this License.
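As a rough illustration of the two pieces of advice above (a notice at the start of each source file, and a short notice plus `show w' / `show c' commands in an interactive program), here is a minimal C sketch. The file name, the one-line description, the function names and the wording returned for the two commands are all placeholders made up for this example; a real program would print the relevant parts of the license itself.

```c
/*
 * gnomovision.c - placeholder one-line description of the program.
 * Copyright (C) 19yy name of author
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Text to print once when the program starts in interactive mode. */
static const char *gnomo_startup_notice(void)
{
    return "Gnomovision version 69, Copyright (C) 19yy name of author\n"
           "Gnomovision comes with ABSOLUTELY NO WARRANTY; "
           "for details type `show w'.\n"
           "This is free software, and you are welcome to redistribute it\n"
           "under certain conditions; type `show c' for details.\n";
}

/* Map an interactive command to the text it should display.
 * Returns NULL for commands this helper does not handle. The returned
 * strings are placeholders that only point at the relevant sections. */
static const char *gnomo_show(const char *command)
{
    assert(command != NULL);
    if (strcmp(command, "show w") == 0)
        return "See the NO WARRANTY section of the GNU General Public "
               "License.\n";
    if (strcmp(command, "show c") == 0)
        return "See Sections 1 to 3 of the GNU General Public License for "
               "the conditions on copying and distribution.\n";
    return NULL;
}
```

A program would print gnomo_startup_notice() once at startup and pass each line the user types to gnomo_show() before its normal command handling.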
File translated from TEX by TTH, version 1.0.

Top of Chapter, Table of Contents, Show Frames, No Frames © 1996-1999 David A Rusling copyright notice.


Chapter 19 Glossary

Argument

DMA: Direct Memory Access.
ELF: Executable and Linkable Format. This object file format, designed by the Unix System Laboratories, is now firmly established as the most commonly used format in Linux.
EIDE: Extended IDE.
Executable image: A structured file containing machine instructions and data. This file can be loaded into a process's virtual memory and executed. See also program.
Function: A piece of software that performs an action. For example, returning the bigger of two numbers.
IDE: Integrated Disk Electronics.
Image: See executable image.
IP: Internet Protocol.
IPC: Interprocess Communication.
Interface: A standard way of calling routines and passing data structures. For example, the interface between two layers of code might be expressed in terms of routines that pass and return a particular data structure. Linux's VFS is a good example of an interface.
IRQ: Interrupt Request.
ISA: Industry Standard Architecture. This is a standard, although now rather dated, data bus interface for system components such as floppy disk drives.
Kernel Module: A dynamically loaded kernel function such as a filesystem or a device driver.
Kilobyte: A thousand bytes of data, often written as Kbyte.
Megabyte: A million bytes of data, often written as Mbyte.
Microprocessor: A highly integrated CPU. Most modern CPUs are microprocessors.
Module: A file containing CPU instructions in the form of either assembly language instructions or a high level language like C.
Object file: A file containing machine code and data that has not yet been linked with other object files or libraries to become an executable image.
Page: Physical memory is divided up into equal sized pages.
Pointer: A location in memory that contains the address of another location in memory.
Process: This is an entity which can execute programs. A process could be thought of as a program in action.
Processor: Short for Microprocessor, equivalent to CPU.
PCI: Peripheral Component Interconnect. A standard describing how the peripheral components of a computer system may be connected together.
Peripheral: An intelligent processor that does work on behalf of the system's CPU. For example, an IDE controller chip.
Program: A coherent set of CPU instructions that performs a task, such as printing ``hello world''. See also executable image.
Protocol: A protocol is a networking language used to transfer application data between two cooperating processes or network layers.
Register: A location within a chip, used to store information or instructions.
Register File: The set of registers in a processor.
RISC: Reduced Instruction Set Computer. The opposite of CISC; that is, a processor with a small number of assembly instructions, each of which performs simple operations. The ARM and Alpha processors are both RISC architectures.
Routine: Similar to a function except that, strictly speaking, routines do not return values.
SCSI: Small Computer Systems Interface.
Shell: This is a program which acts as an interface between the operating system and a human user. Also called a command shell. The most commonly used shell in Linux is the bash shell.
SMP: Symmetrical multiprocessing. Systems with more than one processor which fairly share the work amongst those processors.
Socket: A socket represents one end of a network connection. Linux supports the BSD Socket interface.
Software: CPU instructions (both assembler and high level languages like C) and data. Mostly interchangeable with Program.
System V: A variant of Unix (TM) produced in 1983, which included, amongst other things, System V IPC mechanisms.
TCP: Transmission Control Protocol.
Task Queue: A mechanism for deferring work in the Linux kernel.
UDP: User Datagram Protocol.
Virtual memory: A hardware and software mechanism for making the physical memory in a system appear larger than it actually is.



Tour of the Linux kernel source

By Alessandro Rubini, rubini@pop.systemy.it

This chapter tries to explain the Linux source code in an orderly manner, to help the reader understand how the source code is laid out and how the most relevant Unix features are implemented. It is aimed at the experienced C programmer who is not accustomed to Linux and wants to get familiar with the overall Linux design. That's why the chosen entry point for the kernel tour is the kernel's own entry point: system boot.

A good understanding of the C language is required to understand this material, as well as some familiarity with both Unix concepts and the PC architecture. However, no kernel C code is reproduced in this chapter; it gives pointers to the actual code instead. The finest issues of kernel design are explained in other chapters of this guide, while this chapter tends to remain an informal overview.

Any pathname for files referenced in this chapter is relative to the main source-tree directory, usually /usr/src/linux.

Most of the information reported here is taken from the source code of Linux release 1.0. Nonetheless, references to later versions are provided at times. Any paragraph within the tour with the image in front of it is meant to underline changes the kernel has undergone after the 1.0 release. If no such paragraph is present, then no changes occurred up to release 1.0.9-1.1.76.

Sometimes a paragraph like this occurs in the text. It is a pointer to the right sources to get more information on the subject just covered. Needless to say, the source is the primary source.

Booting the system

When the PC is powered up, the 80x86 processor finds itself in real mode and executes the code at address 0xFFFF0, which corresponds to a ROM-BIOS address. The PC BIOS performs some tests on the system and initializes the interrupt vector at physical address 0. After that it loads the first sector of a bootable device to 0x7C00, and jumps to it. The device is usually the floppy or the hard drive.
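
The addresses quoted above are physical addresses as seen in real mode, where the processor forms an address from a 16-bit segment and a 16-bit offset. As a small illustration of that arithmetic (not kernel code), the sketch below computes the physical address for a segment:offset pair. The function name and the particular pairs tried in the usage note are assumptions chosen for the example; the text itself only gives the resulting addresses.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Real-mode address formation: physical = segment * 16 + offset.
 * real_mode_phys() is a hypothetical helper name used for illustration.
 */
uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    uint32_t addr = ((uint32_t)segment << 4) + offset;

    /* 0xFFFF:0xFFFF is the largest value the formula can produce. */
    assert(addr <= 0x10FFEFu);
    return addr;
}
```

For instance, real_mode_phys(0xF000, 0xFFF0) gives 0xFFFF0 and real_mode_phys(0x07C0, 0x0000) gives 0x7C00, the two addresses mentioned in this section.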
This description of the BIOS stage is quite simplified, but it's all that's needed to understand the kernel's initial workings.

The very first part of the Linux kernel is written in 8086 assembly language (boot/bootsect.S). When run, it moves itself to absolute address 0x90000, loads the next 2 kBytes of code from the boot device to address 0x90200, and the rest of the kernel to address 0x10000. The message ``Loading...'' is displayed during system load. Control is then passed to the code in boot/Setup.S, another real-mode assembly source.

The setup portion identifies some features of the host system and the type of VGA board. If requested to, it asks the user to choose the video mode for the console. It then moves the whole system from address 0x10000 to address 0x1000, enters protected mode and jumps to the rest of the system (at 0x1000).

The next step is kernel decompression. The code at 0x1000 comes from zBoot/head.S, which initializes registers and invokes decompress_kernel(), which in turn is made up of zBoot/inflate.c, zBoot/unzip.c and zBoot/misc.c. The decompressed data goes to address 0x100000 (1 Meg), and this is the main reason why Linux can't run with less than 2 megs of RAM. [It's been done in 1 MB with uncompressed kernels; see Memory Savers--ED]

Encapsulation of the kernel in a gzip file is accomplished by the Makefile and utilities in the zBoot directory. They are interesting files to look at.

Kernel release 1.1.75 moved the boot and zBoot directories down to arch/i386/boot. This change is meant to allow true kernel builds for different architectures. Nonetheless, I'll stick to i386-specific information.

Decompressed code is executed at address 0x1010000 [Maybe I've lost track of physical addresses, here, as I don't know very well gas source code], where all the 32-bit setup is accomplished: IDT, GDT and LDT are loaded, the processor and coprocessor are identified, and paging is set up; eventually, the routine start_kernel is invoked. The source for the above operations is in boot/head.S. It is probably the trickiest code in the whole kernel.

Note that if an error occurs during any of the preceding steps, the computer will lock up. The OS can't deal with errors when it isn't yet fully operative.

start_kernel() resides in init/main.c, and never returns. Anything from now on is coded in C, apart from interrupt management and system call enter/leave (well, most of the macros embed assembly code, too).

Spinning the wheel

After dealing with all the tricky questions, start_kernel() initializes all the parts of the kernel, specifically:

- Sets the memory bounds and calls paging_init().
- Initializes the traps, IRQ channels and scheduling.
- Parses the command line.
- If requested to, allocates a profiling buffer.
- Initializes all the device drivers and disk buffering, as well as other minor parts.
- Calibrates the delay loop (computes the ``BogoMips'' number).
- Checks if interrupt 16 works with the coprocessor.

Finally, the kernel is ready to move_to_user_mode(), in order to fork the init process, whose code is in the same source file. Process number 0, the so-called idle task, then keeps running in an infinite idle loop.

The init process tries to execute /etc/init, or /bin/init, or /sbin/init.

If none of them succeeds, code is provided to execute ``/bin/sh /etc/rc'' and fork a root shell on the first terminal. This code dates back to Linux 0.01, when the OS was made by the kernel alone, and no login process was available.
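
The "try one path after another" pattern used to start init can be sketched in user space as follows. This is only an illustration under stated assumptions, not the code in init/main.c: the function names first_executable() and exec_first() are hypothetical, and the sketch relies on the POSIX calls access() and execv() rather than on anything kernel-internal.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Return the index of the first path that exists and is executable, or -1. */
int first_executable(const char *const paths[], size_t n)
{
    assert(paths != NULL);
    for (size_t i = 0; i < n; i++)
        if (access(paths[i], X_OK) == 0)
            return (int)i;
    return -1;
}

/*
 * Try each candidate in order. execv() only returns when it fails,
 * so falling out of the loop means that no candidate could be started
 * and the caller must fall back to something else.
 */
void exec_first(const char *const paths[], size_t n)
{
    assert(paths != NULL);
    for (size_t i = 0; i < n; i++) {
        char *const argv[] = { (char *)paths[i], NULL };
        execv(paths[i], argv);
    }
}
```

A caller would pass the candidate list in priority order, for example the three init locations named above, and run its fallback only after exec_first() returns.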


After exec()ing the init program from one of the standard places (let's assume we have one of them), the kernel has no direct control over the program flow. Its role, from now on, is to provide processes with system calls, as well as to service asynchronous events (such as hardware interrupts). Multitasking has been set up, and it is now init which manages multiuser access by fork()ing system daemons and login processes.

Since the kernel is in charge of providing services, the tour will proceed by looking at those services (the ``system calls''), as well as by providing general ideas about the underlying data structures and code organization.

How the kernel sees a process

From the kernel point of view, a process is an entry in the process table. Nothing more.

The process table, then, is one of the most important data structures within the system, together with the memory-management tables and the buffer cache. The individual item in the process table is the task_struct structure, quite a huge one, defined in include/linux/sched.h. Within the task_struct both low-level and high-level information is kept--ranging from the copy of some hardware registers to the inode of the working directory for the process.

The process table is both an array and a doubly-linked list, as well as a tree. The physical implementation is a static array of pointers, whose length is NR_TASKS, a constant defined in include/linux/tasks.h, and each structure resides in a reserved memory page. The list structure is achieved through the pointers next_task and prev_task, while the tree structure is quite complex and will not be described here. You may wish to change NR_TASKS from the default value of 128, but be sure to have proper dependency files to force recompilation of all the source files involved.

After booting is over, the kernel is always working on behalf of one of the processes, and the global variable current, a pointer to a task_struct item, is used to record the running one.
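
The "array plus doubly-linked list" layout described above can be sketched in a few lines. This is a simplified user-space model, not the kernel's definitions: struct toy_task, add_task() and count_tasks() are hypothetical, the tree links are left out, and the traversal macro is only written in the spirit of the one in include/linux/sched.h.

```c
#include <assert.h>
#include <stddef.h>

#define NR_TASKS 128

/* Toy stand-in for task_struct: only the list links are modelled. */
struct toy_task {
    int pid;
    struct toy_task *next_task, *prev_task;
};

/* The first task heads a circular list and occupies slot 0 of the array. */
static struct toy_task init_task = { 0, &init_task, &init_task };
static struct toy_task *task[NR_TASKS] = { &init_task };

/* Visit every task except init_task by following the list links. */
#define for_each_task(p) \
    for ((p) = init_task.next_task; (p) != &init_task; (p) = (p)->next_task)

/* Store t in the first free slot and link it at the tail of the list. */
int add_task(struct toy_task *t)
{
    assert(t != NULL);
    for (int i = 1; i < NR_TASKS; i++) {
        if (task[i] == NULL) {
            task[i] = t;
            t->next_task = &init_task;
            t->prev_task = init_task.prev_task;
            init_task.prev_task->next_task = t;
            init_task.prev_task = t;
            return i;
        }
    }
    return -1; /* table full */
}

/* Count tasks by walking the list rather than scanning all slots. */
int count_tasks(void)
{
    struct toy_task *p;
    int n = 0;

    for_each_task(p)
        n++;
    return n;
}
```

Walking the list touches only the tasks that exist, while scanning the array always visits all NR_TASKS slots; that is the trade-off the two views of the table offer.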
current is only changed by the scheduler, in kernel/sched.c. When, however, all processes must be looked at, the macro for_each_task is used. It is considerably faster than a sequential scan of the array when the system is lightly loaded.

A process is always running in either ``user mode'' or ``kernel mode''. The main body of a user program is executed in user mode and system calls are executed in kernel mode. The stack used by the process in the two execution modes is different--a conventional stack segment is used for user mode, while a fixed-size stack (one page, owned by the process) is used in kernel mode. The kernel stack page is never swapped out, because it must be available whenever a system call is entered.

System calls, within the kernel, exist as C language functions, their `official' name being prefixed by `sys_'. A system call named, for example, burnout invokes the kernel function sys_burnout().

The system call mechanism is described in chapter 3 of this guide. Looking at for_each_task and SET_LINKS in include/linux/sched.h can help in understanding the list and tree structures in the process table.

Creating and destroying processes


A Unix system creates a process through the fork() system call, and process termination is performed either by exit() or by receiving a signal. The Linux implementation for them resides in kernel/fork.c and kernel/exit.c.

Forking is easy, and fork.c is short and readily understandable. Its main task is filling the data structure for the new process. Relevant steps, apart from filling fields, are:

- getting a free page to hold the task_struct
- finding an empty process slot (find_empty_process())
- getting another free page for the kernel_stack_page
- copying the father's LDT to the child
- duplicating mmap information of the father

sys_fork() also manages file descriptors and inodes.

The 1.0 kernel offers some vestigial support for threading, and the fork() system call shows some hints of that. Kernel threads are a work in progress outside the mainstream kernel.

Exiting from a process is trickier, because the parent process must be notified about any child who exits. Moreover, a process can exit by being kill()ed by another process (these are Unix features). The file exit.c is therefore the home of sys_kill() and the various flavours of sys_wait(), in addition to sys_exit().

The code belonging to exit.c is not described here--it is not that interesting. It deals with a lot of details in order to leave the system in a consistent state. The POSIX standard, then, is quite demanding about signals, and it must be dealt with.

Executing programs

After fork()ing, two copies of the same program are running. One of them usually exec()s another program. The exec() system call must locate the binary image of the executable file, load and run it. The word `load' doesn't necessarily mean ``copy in memory the binary image'', as Linux supports demand loading.

The Linux implementation of exec() supports different binary formats.
This is accomplished through the linux_binfmt structure, which embeds two pointers to functions--one to load the executable and the other to load the library, each binary format representing both the executable and the library. Loading of shared libraries is implemented in the same source file as exec() is, but let's stick to exec() itself.

The Unix systems provide the programmer with six flavours of the exec() function. All but one of them can be implemented as library functions, and the Linux kernel implements sys_execve() alone. It performs quite a simple task: loading the head of the executable, and trying to execute it. If the first two bytes are ``#!'', then the first line is parsed and an interpreter is invoked, otherwise the registered binary formats are sequentially tried.

The native Linux format is supported directly within fs/exec.c, and the relevant functions are load_aout_binary and load_aout_library. As for the binaries, the function loading an ``a.out'' executable ends up either mmap()ing the disk file or calling read_exec(). The former way uses the Linux demand loading mechanism to fault in program pages when they're accessed, while the latter way is used when memory mapping is not supported by the host filesystem (for example the ``msdos'' filesystem).

Late 1.1 kernels embed a revised msdos filesystem, which supports mmap(). Moreover, the struct linux_binfmt is a linked list rather than an array, to allow loading a new binary format as a kernel module. Finally, the structure itself has been extended to access format-related core-dump routines.

Accessing filesystems

It is well known that the filesystem is the most basic resource in a Unix system, so basic and ubiquitous that it needs a handier name--I'll stick to the standard practice of calling it simply ``fs''.

I'll assume the reader already knows the basic Unix fs ideas--access permissions, inodes, the superblock, mounting and umounting. Those concepts are well explained by smarter authors than me within the standard Unix literature, so I won't duplicate their efforts and I'll stick to Linux specific issues.

While the first Unices used to support a single fs type, whose structure was spread throughout the whole kernel, today's practice is to use a standardized interface between the kernel and the fs, in order to ease data interchange across architectures. Linux itself provides a standardized layer to pass information between the kernel and each fs module. This interface layer is called VFS, for ``virtual filesystem''.

Filesystem code is therefore split into two layers: the upper layer is concerned with the management of kernel tables and data structures, while the lower layer is made up of the set of fs-dependent functions, and is invoked through the VFS data structures.

All the fs-independent material resides in the fs/*.c files.
They address the following issues:

- Managing the buffer cache (buffer.c);
- Responding to the fcntl() and ioctl() system calls (fcntl.c and ioctl.c);
- Mapping pipes and fifos on inodes and buffers (fifo.c, pipe.c);
- Managing file- and inode-tables (file_table.c, inode.c);
- Locking and unlocking files and records (locks.c);
- Mapping names to inodes (namei.c, open.c);
- Implementing the tricky select() function (select.c);
- Providing information (stat.c);
- Mounting and umounting filesystems (super.c);
- exec()ing executables and dumping cores (exec.c);
- Loading the various binary formats (bin_fmt*.c, as outlined above).

The VFS interface, then, consists of a set of relatively high-level operations which are invoked from the fs-independent code and are actually performed by each filesystem type. The most relevant structures are inode_operations and file_operations, though they're not alone: other structures exist as well. All of them are defined within include/linux/fs.h.

The kernel entry point to the actual file system is the structure file_system_type. An array of file_system_types is embodied within fs/filesystems.c and it is referenced whenever a mount is issued. The function read_super for the relevant fs type is then in charge of filling a struct super_block item, which in turn embeds a struct super_operations and a struct type_sb_info. The former provides pointers to generic fs operations for the current fs-type, the latter embeds specific information for the fs-type.

The array of filesystem types has been turned into a linked list, to allow loading new fs types as kernel modules. The function (un-)register_filesystem is coded within fs/super.c.

Quick Anatomy of a Filesystem Type

The role of a filesystem type is to perform the low-level tasks used to map the relatively high level VFS operations on the physical media (disks, network or whatever). The VFS interface is flexible enough to allow support for both conventional Unix filesystems and exotic situations such as the msdos and umsdos types.

Each fs-type is made up of the following items, in addition to its own directory:

- An entry in the file_systems[] array (fs/filesystems.c);
- The superblock include file (include/linux/type_fs_sb.h);
- The inode include file (include/linux/type_fs_i.h);
- The generic own include file (include/linux/type_fs.h);
- Two #include lines within include/linux/fs.h, as well as the entries in struct super_block and struct inode.

The own directory for the fs type contains all the real code, responsible for inode and data management.

The chapter about procfs in this guide uncovers all the details about low-level code and VFS interface for that fs type. Source code in fs/procfs is quite understandable after reading the chapter.

We'll now look at the internal workings of the VFS mechanism, and the minix filesystem source is used as a working example. I chose the minix type because it is small but complete; moreover, any other fs type in Linux derives from the minix one. The ext2 type, the de-facto standard in recent Linux installations, is much more complex than that and its exploration is left as an exercise for the smart reader.
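
Before turning to minix, the registry of filesystem types mentioned earlier (a linked list that is searched when a mount is issued, with each entry supplying its own read_super function) can be sketched as follows. This is a toy model under stated assumptions: every name prefixed with toy_ is hypothetical, and the structures carry only the fields needed for the illustration, not those declared in include/linux/fs.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy superblock: only records which fs type filled it in. */
struct toy_super_block {
    const char *fs_name;
};

/* Toy counterpart of file_system_type: a name, a hook, a list link. */
struct toy_fs_type {
    const char *name;
    int (*read_super)(struct toy_super_block *sb);
    struct toy_fs_type *next;
};

static struct toy_fs_type *fs_types; /* head of the registry list */

/* Add a type to the list; refuse duplicates by name. */
int toy_register_filesystem(struct toy_fs_type *fs)
{
    assert(fs != NULL && fs->name != NULL);
    for (struct toy_fs_type *p = fs_types; p != NULL; p = p->next)
        if (strcmp(p->name, fs->name) == 0)
            return -1;
    fs->next = fs_types;
    fs_types = fs;
    return 0;
}

/* "Mount": look the type up by name and let it fill the superblock. */
int toy_mount(const char *type, struct toy_super_block *sb)
{
    for (struct toy_fs_type *p = fs_types; p != NULL; p = p->next)
        if (strcmp(p->name, type) == 0)
            return p->read_super(sb);
    return -1; /* unknown filesystem type */
}

static int toy_minix_read_super(struct toy_super_block *sb)
{
    sb->fs_name = "minix";
    return 0;
}

struct toy_fs_type toy_minix = { "minix", toy_minix_read_super, NULL };
```

Because the generic code only ever calls through the read_super pointer, a new type can be added at run time simply by linking another entry into the list, which is the property that makes loadable filesystem modules possible.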
When a minix-fs is mounted, minix_read_super fills the super_block structure with data read from the mounted device. The s_op field of the structure will then hold a pointer to minix_sops, which is used by the generic filesystem code to dispatch superblock operations.

Chaining the newly mounted fs in the global system tree relies on the following data items (assuming sb is the super_block structure and dir_i points to the inode for the mount point):

- sb->s_mounted points to the root-dir inode of the mounted filesystem (MINIX_ROOT_INO);
- dir_i->i_mount holds sb->s_mounted;
- sb->s_covered holds dir_i

Umounting will eventually be performed by do_umount, which in turn invokes minix_put_super.

Whenever a file is accessed, minix_read_inode comes into play; it fills the system-wide inode structure with fields coming from minix_inode. The inode->i_op field is filled according to inode->i_mode and it is responsible for any further operation on the file. The source for the minix functions just described is to be found in fs/minix/inode.c.

The inode_operations structure is used to dispatch inode operations (you guessed it) to the fs-type specific kernel functions; the first entry in the structure is a pointer to a file_operations item, which is the data-management equivalent of i_op. The minix fs-type allows three instances of inode-operation sets (for directories, for files and for symbolic links) and two instances of file-operation sets (symlinks don't need one).

Directory operations (minix_readdir alone) are to be found in fs/minix/dir.c; file operations (read and write) appear within fs/minix/file.c and symlink operations (reading and following the link) in fs/minix/symlink.c.

The rest of the minix directory implements the following tasks:

- bitmap.c manages allocation and freeing of inodes and blocks (the ext2 fs, by contrast, has two different source files);
- fsync.c is responsible for the fsync() system calls--it manages direct, indirect and double indirect blocks (I assume you know about them, it's common Unix knowledge);
- namei.c embeds all the name-related inode operations, such as creating and destroying nodes, renaming and linking;
- truncate.c performs truncation of files.

The console driver

Being the main I/O device on most Linux boxes, the console driver deserves some attention. The source code related to the console, as well as the other character drivers, is to be found in drivers/char, and we'll use this very directory as our reference point when naming files.

Console initialization is performed by the function tty_init(), in tty_io.c. This function is only concerned with getting major device numbers and calling the init function for each device set. con_init(), then, is the one related to the console, and resides in console.c.
Initialization of the console has changed quite a lot during 1.1 evolution. console_init() has been detached from tty_init(), and is called directly by ../../init/main.c. The virtual consoles are now dynamically allocated, and quite a good deal of code has changed. So, I'll skip the details of initialization, allocation and such.
How file operations are dispatched to the console

This paragraph is quite low-level, and can be happily skipped over.

Needless to say, a Unix device is accessed through the filesystem. This paragraph details all steps from the device file to the actual console functions. Moreover, the following information is extracted from the 1.1.73 source code, and it may be slightly different from the 1.0 source.

When a device inode is opened, the function chrdev_open() (or blkdev_open(), but we'll stick to character devices) in ../../fs/devices.c gets executed. This function is reached by means of the structure def_chr_fops, which in turn is referenced by chrdev_inode_operations, used by all the filesystem types (see the previous section about filesystems).

chrdev_open takes care of specifying the device operations by substituting the device specific file_operations table in the current filp and calls the specific open(). Device specific tables are kept in the array chrdevs[], indexed by the major device number, and filled by the same ../../fs/devices.c.

If the device is a tty one (aren't we aiming at the console?), we come to the tty drivers, whose functions are in tty_io.c, indexed by tty_fops. Thus, tty_open() calls init_dev(), which allocates any data structure needed by the device, based on the minor device number.

The minor number is also used to retrieve the actual driver for the device, which has been registered through tty_register_driver(). The driver, then, is still another structure used to dispatch computation, just like file_operations; it is concerned with writing and controlling the device. The last data structure used in managing a tty is the line discipline, described later. The line discipline for the console (and any other tty device) is set by initialize_tty_struct(), invoked by init_dev.

Everything we touched in this paragraph is device-independent. The only console-specific detail is that console.c has registered its own driver during con_init(). The line discipline, on the contrary, is independent of the device.

The tty_driver structure is fully explained within <linux/tty_driver.h>.

The above information has been extracted from 1.1.73 source code. It isn't unlikely for your kernel to be somewhat different (``This information is subject to change without notice'').
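
The dispatch step described above, where a generic open routine looks up a table by major number, installs it in the file structure and then calls the device's own open(), can be modelled in a few lines. This is a sketch under stated assumptions rather than the code in fs/devices.c: all toy_ names and MAX_CHRDEV are hypothetical, and the structures keep only the fields needed to show the substitution.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHRDEV 32 /* illustrative table size */

struct toy_file;

/* Toy operations table: only the open hook is modelled. */
struct toy_file_operations {
    int (*open)(struct toy_file *filp);
};

struct toy_file {
    unsigned int major, minor;
    const struct toy_file_operations *f_op;
};

/* One slot per major number, filled at registration time. */
static const struct toy_file_operations *chrdevs[MAX_CHRDEV];

int toy_register_chrdev(unsigned int major,
                        const struct toy_file_operations *fops)
{
    if (major >= MAX_CHRDEV || chrdevs[major] != NULL)
        return -1;
    chrdevs[major] = fops;
    return 0;
}

/* Generic open: substitute the device's table, then call its open(). */
int toy_chrdev_open(struct toy_file *filp)
{
    assert(filp != NULL);
    if (filp->major >= MAX_CHRDEV || chrdevs[filp->major] == NULL)
        return -1; /* no driver registered for this major number */
    filp->f_op = chrdevs[filp->major];
    return filp->f_op->open ? filp->f_op->open(filp) : 0;
}

/* Example device: its open() just reports the minor number. */
static int toy_tty_open(struct toy_file *filp)
{
    return (int)filp->minor;
}

const struct toy_file_operations toy_tty_fops = { toy_tty_open };
```

After the substitution, every later operation on the file goes straight to the device's table, so the generic code is involved only once, at open time.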
Writing to the console

When a console device is written to, the function con_write gets invoked. This function manages all the control characters and escape sequences used to provide applications with complete screen management. The escape sequences implemented are those of the vt102 terminal. This means that your environment should say TERM=vt102 when you are telnetting to a non-Linux host; the best choice for local activities, however, is TERM=console because the Linux console offers a superset of vt102 functionality.

con_write(), thus, is mostly made up of nested switch statements, used to handle a finite state automaton interpreting escape sequences one character at a time. When in normal mode, the character being printed is written directly to the video memory, using the current attribute. Within console.c, all the fields of struct vc are made accessible through macros, so any reference to (for example) attr does actually refer to the field in the structure vc_cons[currcons], as long as currcons is the number of the console being referred to.

Actually, vc_cons in newer kernels is no longer an array of structures; it is now an array of pointers whose contents are kmalloc()ed. The use of macros greatly simplified changing the approach, because much of the code didn't need to be rewritten.

Actual mapping and unmapping of the console memory to screen is performed by the functions set_scrmem() (which copies data from the console buffer to video memory) and get_scrmem() (which copies back data to the console buffer). The private buffer of the current console is physically mapped on the actual video RAM, in order to minimize the number of data transfers. This means that get_scrmem() and set_scrmem() are static to console.c and are called only during a console switch.
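
The finite state automaton that con_write() is described as implementing can be illustrated with a much smaller one. The sketch below is an assumption-laden toy, not the code in console.c: the state names, struct toy_console and both functions are hypothetical, and it recognizes only sequences of the form ESC [ digits letter with a single numeric parameter, where the real driver handles far more.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { ST_NORMAL, ST_ESC, ST_CSI };

struct toy_console {
    enum toy_state state;
    char screen[64];  /* stands in for video memory */
    size_t len;
    int param;        /* numeric parameter of the last sequence */
    char last_cmd;    /* final letter of the last sequence */
};

/* Feed one character to the automaton. */
void toy_con_putc(struct toy_console *c, char ch)
{
    assert(c != NULL);
    switch (c->state) {
    case ST_NORMAL:
        if (ch == '\033')
            c->state = ST_ESC;             /* start of an escape sequence */
        else if (c->len < sizeof c->screen - 1)
            c->screen[c->len++] = ch;      /* ordinary character: print it */
        break;
    case ST_ESC:
        if (ch == '[') {
            c->state = ST_CSI;
            c->param = 0;
        } else {
            c->state = ST_NORMAL;          /* unsupported sequence: drop it */
        }
        break;
    case ST_CSI:
        if (ch >= '0' && ch <= '9') {
            c->param = c->param * 10 + (ch - '0');
        } else {
            c->last_cmd = ch;              /* final letter ends the sequence */
            c->state = ST_NORMAL;
        }
        break;
    }
}

/* Write a whole string, one character at a time. */
void toy_con_write(struct toy_console *c, const char *s)
{
    assert(c != NULL && s != NULL);
    while (*s)
        toy_con_putc(c, *s++);
    c->screen[c->len] = '\0';
}
```

Keeping the state inside the console structure is what lets a sequence be split across several write calls: the automaton simply resumes where the previous call stopped.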
Reading the console

Reading the console is accomplished through the line discipline. The default (and unique) line discipline in Linux is called tty_ldisc_N_TTY. The line discipline is what ``disciplines input through a line''. It is another function table (we're used to the approach, aren't we?), which is concerned with reading the device. With the help of termios flags, the line discipline is what controls input from the tty: raw, cbreak and cooked mode; select(); ioctl() and so on.

The read function in the line discipline is called read_chan(), which reads the tty buffer regardless of where the data came from. The reason is that character arrival through a tty is managed by asynchronous hardware interrupts.

The line discipline N_TTY is to be found in the same tty_io.c, though later kernels use a different n_tty.c source file.

The lowest level of console input is part of keyboard management, and thus it is handled within keyboard.c, in the function keyboard_interrupt().
Keyboard management

Keyboard management is quite a nightmare. It is confined to the file keyboard.c, which is full of hexadecimal numbers to represent the various keycodes appearing in keyboards of different manufacturers.

I won't dig into keyboard.c, because it holds no information relevant to the kernel hacker.

For those readers who are really interested in the Linux keyboard, the best approach to keyboard.c is from the last line upward. Lowest level details occur mainly in the first half of the file.
Switching the current console

The current console is switched through invocation of the function change_console(), which resides in tty_io.c and is invoked by both keyboard.c and vt.c (the former switches console in response to keypresses, the latter when a program requests it by invoking an ioctl() call).

The actual switching process is performed in two steps, and the function complete_change_console() takes care of the second part of it. Splitting the switch is meant to complete the task after a possible handshake with the process controlling the tty we're leaving. If the console is not under process control, change_console() calls complete_change_console() by itself. Process intervention is needed to successfully switch from a graphic console to a text one and vice versa, and the X server (for example) is the controlling process of its own graphic console.
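
The two-step switch can be sketched as a request that either completes at once or waits for the controlling process to acknowledge it. This is a toy model, not the kernel's logic: the toy_ names, NR_CONSOLES, the process_controlled flag and toy_release_ack() are hypothetical, and how the real kernel notifies the process and receives its answer is not modelled.

```c
#include <assert.h>

#define NR_CONSOLES 8 /* illustrative number of consoles */

struct toy_vt {
    int process_controlled; /* nonzero: a process owns this console */
};

static struct toy_vt vt[NR_CONSOLES];
static int fg_console;          /* console currently shown */
static int want_console = -1;   /* pending switch target, or -1 */

/* Second step: actually make new_console the foreground one. */
void toy_complete_change_console(int new_console)
{
    assert(new_console >= 0 && new_console < NR_CONSOLES);
    fg_console = new_console;
    want_console = -1;
}

/*
 * First step: returns 1 if the switch was completed immediately,
 * 0 if it was deferred until the controlling process agrees,
 * and -1 for an invalid console number.
 */
int toy_change_console(int new_console)
{
    if (new_console < 0 || new_console >= NR_CONSOLES)
        return -1;
    if (new_console == fg_console)
        return 1;
    if (vt[fg_console].process_controlled) {
        want_console = new_console; /* remember the request and wait */
        return 0;
    }
    toy_complete_change_console(new_console);
    return 1;
}

/* Called when the controlling process releases the console. */
void toy_release_ack(void)
{
    if (want_console >= 0)
        toy_complete_change_console(want_console);
}
```

The deferral gives a process such as a graphics server the chance to save its own state before the screen is taken away from it.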
The selection mechanism

``selection'' is the cut and paste facility for the Linux text consoles. The mechanism is mainly handled by a user-level process, which can be instantiated by either selection or gpm. The user-level program uses ioctl() on the console to tell the kernel to highlight a region of the screen. The selected text, then, is copied to a selection buffer. The buffer is a static entity in console.c. Pasting text is accomplished by `manually' pushing characters in the tty input queue. The whole selection mechanism is protected by #ifdef so users can disable it during kernel configuration to save a few kilobytes of RAM.

Selection is a very-low-level facility, and its workings are hidden from any other kernel activity. This means that most #ifdef's simply deal with removing the highlight before the screen is modified in any way.

Newer kernels feature improved code for selection, and the mouse pointer can be highlighted independently of the selected text (1.1.32 and later). Moreover, from 1.1.73 onward a dynamic buffer is used for selected text rather than a static one, making the kernel 4kB smaller.
ioctl()ling the device

The ioctl() system call is the entry point for user processes to control the behaviour of device files. Ioctl handling starts in ../../fs/ioctl.c, where the real sys_ioctl() resides. The standard ioctl requests are performed right there, other file-related requests are processed by file_ioctl() (same source file), while any other request is dispatched to the device-specific ioctl() function. The ioctl material for console devices resides in vt.c, because the console driver dispatches ioctl requests to vt_ioctl(). The information above refers to 1.1.7x. The 1.0 kernel doesn't have the ``driver'' table, and vt_ioctl() is pointed to directly by the file_operations table. Ioctl material is quite confused, indeed. Some requests are related to the device, and some are related to the line discipline. I'll try to summarize things for the 1.0 and the 1.1.7x kernels. Anything may have happened in between. The 1.1.7x series features the following approach: tty_ioctl.c implements only line discipline requests (namely n_tty_ioctl(), which is the only n_tty function outside of n_tty.c), while the file_operations field points to tty_ioctl() in tty_io.c. If the request number is not resolved by tty_ioctl(), it is passed along to tty->driver.ioctl or, if it fails, to tty->ldisc.ioctl. Driver-related stuff for the console is to be found in vt.c, while line discipline material is in tty_ioctl.c. In the 1.0 kernel, tty_ioctl() is in tty_ioctl.c and is pointed to by generic tty file_operations. Unresolved requests are passed along to the specific ioctl function or to the line-discipline code, in a way similar to 1.1.7x. Note that in both cases, the TIOCLINUX request is in the device-independent code. This implies that the console selection can be set by ioctlling any tty (set_selection() always operates on the foreground console), and this is a security hole.
It is also a good reason to switch to a newer kernel, where the problem is fixed by only allowing the superuser to handle the selection. A variety of requests can be issued to the console device, and the best way to know about them is to browse the source file vt.c.
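The 1.1.7x dispatch order described above can be sketched in a few lines of ordinary user-space C. This is not kernel source: every name and request number below is hypothetical, and the sentinel SKETCH_NOT_HANDLED is an assumption standing in for whatever "the driver did not resolve the request" looks like in the real code. It only shows the order of the fallbacks: generic tty code first, then the driver, then the line discipline.

```c
/* Hypothetical sentinel: "this layer does not know the request". */
#define SKETCH_NOT_HANDLED (-1000)
#define SKETCH_EINVAL      (-22)

/* Hypothetical request numbers, one owned by each layer. */
enum { REQ_GENERIC = 1, REQ_DRIVER = 2, REQ_LDISC = 3, REQ_UNKNOWN = 99 };

typedef int (*ioctl_fn)(unsigned int cmd, unsigned long arg);

struct sketch_tty {
    ioctl_fn driver_ioctl;  /* stands in for tty->driver.ioctl (vt_ioctl() on the console) */
    ioctl_fn ldisc_ioctl;   /* stands in for tty->ldisc.ioctl (n_tty_ioctl()) */
};

/* Driver layer: resolves only its own request. */
int sketch_vt_ioctl(unsigned int cmd, unsigned long arg)
{
    (void)arg;
    return cmd == REQ_DRIVER ? 200 : SKETCH_NOT_HANDLED;
}

/* Line discipline layer: resolves only its own request. */
int sketch_n_tty_ioctl(unsigned int cmd, unsigned long arg)
{
    (void)arg;
    return cmd == REQ_LDISC ? 300 : SKETCH_NOT_HANDLED;
}

/* Generic layer: try here first, then the driver, then the line discipline. */
int sketch_tty_ioctl(const struct sketch_tty *tty, unsigned int cmd, unsigned long arg)
{
    int ret;

    if (cmd == REQ_GENERIC)
        return 100;                      /* resolved by the generic tty code */

    if (tty->driver_ioctl) {
        ret = tty->driver_ioctl(cmd, arg);
        if (ret != SKETCH_NOT_HANDLED)
            return ret;                  /* the driver resolved it */
    }
    if (tty->ldisc_ioctl) {
        ret = tty->ldisc_ioctl(cmd, arg);
        if (ret != SKETCH_NOT_HANDLED)
            return ret;                  /* the line discipline resolved it */
    }
    return SKETCH_EINVAL;                /* nobody knows this request */
}
```

Filling a struct sketch_tty with the two handlers and calling sketch_tty_ioctl() with each request number shows which layer answers; a request nobody owns falls through to the error return.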
Copyright (C) 1994 Alessandro Rubini, rubini@pop.systemy.it

Messages
8. access a file from module by proy018@avellano.usal.es
7. Which head.S? by Johnie Stafford
   1. Untitled by benschop@eb.ele.tue.nl
6. STREAMS and Linux by Venkatesha Murthy G.
   1. Re: STREAMS and LINUX by Vineet Sharma
5. Do you still need to run update ? by Chris Ebenezer
4. Do you still need to run bdflush? by Steve Dunham
   1. Already answered... by Michael K. Johnson
3. Kernel Configuration and Makefile Structure by Steffen Moeller
   1. Editing services available... by Michael K. Johnson
   2. Kernel configuration by Venkatesha Murthy G.
2. Re: Kernel threads by Paul Gortmaker
   1. More on usage of kernel threads. by David S. Miller
1. kernel startup code by Alan Cox
   1. Untitled by Karapetyants Vladimir Vladimirovitch

access a file from module
Forum: Tour of the Linux kernel source
Date: Thu, 08 May 1997 12:06:47 GMT
From: <proy018@avellano.usal.es>

I need to access a file from a module

Which head.S?
Forum: Tour of the Linux kernel source
Keywords: head.S
Date: Sat, 20 Jul 1996 00:57:09 GMT
From: Johnie Stafford <jms@pobox.com>

In the "Tour of the Linux kernel source" section there is reference to boot/head.S. I did a find on the source; this is a list of the "head.S"'s in the source:

./arch/alpha/boot/head.S
./arch/alpha/kernel/head.S
./arch/i386/boot/compressed/head.S
./arch/i386/kernel/head.S
./arch/m68k/kernel/head.S
./arch/mips/kernel/head.S
./arch/ppc/boot/compressed/head.S
./arch/ppc/kernel/head.S
./arch/sparc/kernel/head.S

Obviously, there is a different one for each architecture. But which version for the i386 architecture is being referred to here, and what's the difference?

Johnie

Messages
1. Untitled by benschop@eb.ele.tue.nl

Untitled
Forum: Tour of the Linux kernel source
Re: Which head.S? (Johnie Stafford)
Keywords: head.S
Date: Tue, 23 Jul 1996 07:38:08 GMT
From: <benschop@eb.ele.tue.nl>

The file arch/i386/kernel/head.S is linked with the uncompressed kernel. If the kernel is not compressed this is the only head.S used. In a compressed kernel, all 32 bit objects from the kernel, including the above mentioned head.o, are compressed and the compressed data is lumped together in the file piggy.o. Now the file arch/i386/boot/compressed/head.S comes into play. This and the decompressor and piggy.o form a new 32-bit object.

STREAMS and Linux
Forum: Tour of the Linux kernel source
Keywords: STREAMS devices drivers
Date: Mon, 15 Jul 1996 12:01:50 GMT
From: Venkatesha Murthy G. <gvmt@csa.iisc.ernet.in>

Hi all,

Correct me if i am wrong, but Linux doesn't have any STREAMS devices or drivers as of now. But as Ritchie's paper explains, they are flexible and can find use in a lot of places where pipelined processing is involved, net drivers for instance. Anything being done/planned in that direction?

Venkatesha Murthy (gvmt@csa.iisc.ernet.in)

Messages
1. Re: STREAMS and LINUX by Vineet Sharma

Re: STREAMS and LINUX
Forum: Tour of the Linux kernel source
Re: STREAMS and Linux (Venkatesha Murthy G.)
Keywords: STREAMS devices drivers
Date: Thu, 10 Apr 1997 15:49:04 GMT
From: Vineet Sharma <vsharma@hss.hns.com>

Go to ftp.gcom.com/pub/linux/src/ and pick up the Lis-2.0.25.tar.gz package.

Do you still need to run update ?
Forum: Tour of the Linux kernel source
Date: Tue, 25 Jun 1996 13:59:41 GMT
From: Chris Ebenezer <ncfee@wmin.ac.uk>

The docs on this daemon state that it is one of a pair of two daemons - bdflush/update - which manage disk buffers. In the latest (2.0.x) kernels starting up update does not have the effect of also starting up bdflush. So is update still needed ?

Do you still need to run bdflush?
Forum: Tour of the Linux kernel source
Date: Mon, 27 May 1996 18:55:44 GMT
From: Steve Dunham <dunham@gdl.msu.edu>

The recent 1.3.x kernels add a kernel thread named (kflushd). What does this do? Does it replace the functionality of the user program 'bdflush'?

Messages
1. Already answered... by Michael K. Johnson

Already answered...
Forum: Tour of the Linux kernel source
Re: Do you still need to run bdflush? (Steve Dunham)
Keywords: kflushd, searching
Date: Mon, 27 May 1996 19:42:00 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

It looks like I'll eventually have to add search capability to the KHG. It will be a while before I have time, though. Your question is already answered in a response elsewhere in this document, posted by Paul Gortmaker; see Kernel threads.

Re: Kernel threads
Forum: Tour of the Linux kernel source
Keywords: kernel threads
Date: Wed, 22 May 1996 16:51:58 GMT
From: Paul Gortmaker <gpg109@rsphy1.anu.edu.au>

The above mentions that v1.0 has "early" support for threads, which can use a bit of an update. They are fully functional and in use in late v1.3.x kernels. For example the internal bdflush daemon used to be started by a non-returning syscall in all the v1.2.x kernels, but as of around v1.3.4x or so, I made it into an internal thread, and dispensed with the reliance on the user space syscall to launch the thing. This is now what is seen as "kflushd" or process #2 on all recent kernels. Since then, other threads such as "kflushd" and multiple "nfsiod" processes have taken advantage of the same functionality.

Paul.

Messages
1. More on usage of kernel threads. by David S. Miller

Kernel Configuration and Makefile Structure
Forum: Tour of the Linux kernel source
Keywords: configuration makefile
Date: Wed, 22 May 1996 17:34:39 GMT
From: Steffen Moeller <steffen@di.unito.it>

I'm missing a description of the Makefile mechanism and the principle of the configuration. Or is this too trivial for a Hacker's Guide? I do not think so since
- it's a nice introduction,
- all hackers have to understand it and
- it's a good place to put hyperlinks to the real stuff in this guide.

If there's some positive feedback I'd like to start on this myself, but I'd need some help... at least for the language.

Steffen

Messages
1. Editing services available... by Michael K. Johnson
2. Kernel configuration by Venkatesha Murthy G.

Editing services available...
Forum: Tour of the Linux kernel source
Re: Kernel Configuration and Makefile Structure (Steffen Moeller)
Keywords: configuration makefile
Date: Thu, 23 May 1996 17:00:42 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

This is certainly not too trivial a topic for the KHG. If you are willing to tackle it, feel free. If someone else wants to work on it, that's fine too. If by "...but I'd need some help... at least for the language" you mean that you would like someone to edit your piece, you can send it to me for editing. If I feel that it needs more work before being added, I'll send it back for revision, hopefully with helpful comments. :-)

Kernel configuration
Forum: Tour of the Linux kernel source
Re: Kernel Configuration and Makefile Structure (Steffen Moeller)
Keywords: configuration
Date: Thu, 11 Jul 1996 12:30:00 GMT
From: Venkatesha Murthy G. <gvmt@csa.iisc.ernet.in>

I really haven't *understood* kernel configuration but i can tell you what i do when i want to add a config option. I first edit arch/i386/config.in and add a line that looks like

    bool 'whatever explanation' CONFIG_WHATEVER default

this is supposed to mean that CONFIG_WHATEVER is a boolean taking values y or n. When you 'make config' you'll get something like

    'whatever explanation (CONFIG_WHATEVER) [default]'

and you type in y or n. Now this automagically #defines CONFIG_WHATEVER in <linux/autoconf.h>. Code that is specific to the configuration can now be enclosed in #ifdef CONFIG_WHATEVER ... #endif so it will be compiled in only when configured.

If you want any more explanation than can be given on one line, you can have a set of 'comment ...' lines before the 'bool ...' line and that will be displayed for you during configuration.

I don't know if you'll find it useful but still ...

Venkatesha Murthy (gvmt@csa.iisc.ernet.in)
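To make the #ifdef step in the post above concrete, here is a minimal sketch in plain C. It assumes nothing about the real kernel tree: CONFIG_WHATEVER is the poster's placeholder name, whatever_init() is a hypothetical function invented for illustration, and the commented-out #define stands in for what <linux/autoconf.h> would provide after answering y during 'make config'.

```c
#include <stdio.h>

/* In a kernel build this symbol would come from <linux/autoconf.h>.
 * Uncomment the next line to simulate answering 'y' at configuration time. */
/* #define CONFIG_WHATEVER 1 */

/* Hypothetical init hook: returns 1 if the optional code was compiled in. */
int whatever_init(void)
{
#ifdef CONFIG_WHATEVER
    /* This branch exists in the object file only when the option is set. */
    printf("whatever: support compiled in\n");
    return 1;
#else
    /* Option not configured: the code above is never even compiled. */
    return 0;
#endif
}
```

Compiling the same file with -DCONFIG_WHATEVER on the compiler command line has the same effect as uncommenting the #define.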

More on usage of kernel threads.
Forum: Tour of the Linux kernel source
Re: Re: Kernel threads (Paul Gortmaker)
Keywords: kernel threads asynchronous faults
Date: Fri, 24 May 1996 05:33:03 GMT
From: David S. Miller <davem@caip.rutgers.edu>

As another addendum, the AP+ multicomputer port (actually it is a part of the generic Sparc kernel sources) uses kernel threads to solve the problem of servicing a true fault from interrupt space. The kernel thread is called asyncd(). The AP+ multicomputer takes interrupts when one cell on the machine does a DMA access to another cell and the page is not present or otherwise needs to be faulted in or whatever. The interrupt handler adds this fault to a queue of faults to service and wakes up the async daemon, which runs with real time priority much like the other linux kernel daemons. Poof, solution to the classic interrupt context limitation problem. ;-)

Later,
David S. Miller (davem@caip.rutgers.edu)

kernel startup code
Forum: Tour of the Linux kernel source
Keywords: SMP start_kernel()
Date: Fri, 17 May 1996 10:48:00 GMT
From: Alan Cox <unknown>

The intel startup code and start_kernel() is partly used for SMP startup as the intel MP design starts the secondary CPU's in real mode. In addition to make it more fun you can only pass one piece of information - the address (page boundary) that the processor is made to boot at. The SMP kernel writes a trampoline routine at the base of a page it allocates for the stack of each CPU. The secondary processors (or AP's as Intel calls them for Application Processors) load their SS:SP based on the code segment, enter protected mode and jump into the 32bit kernel startup. The kernel startup for the SMP kernel in start_kernel() calls a few startup routines for the architecture and then waits for the boot processor to complete initialisation. At this point it starts running an idle thread and is schedulable.

Messages
1. Untitled by Karapetyants Vladimir Vladimirovitch

Device Drivers

If you choose to write a device driver, you must take everything written here as a guide, and no more. I cannot guarantee that this chapter will be free of errors, and I cannot guarantee that you will not damage your computer, even if you follow these instructions exactly. It is highly unlikely that you will damage it, but I cannot guarantee against it. There is only one ``infallible'' direction I can give you: Back up! Back up before you test your new device driver, or you may regret it later.

What is a Device Driver?
What is this ``device driver'' stuff anyway? Here's a very short introduction to the concept.

User-space device drivers
It's not always necessary to write a ``real'' device driver. Sometimes you just need to know how to write code that runs as a normal user process and still accesses hardware.

Device Driver Basics
Assuming that you need to write a ``real'' device driver, there are some things that you need to know regardless of what type of driver you are writing. In fact, you may need to learn what type of driver you ought to write...

Character Device Drivers
This section includes details specific to character device drivers, and assumes that you know everything in the previous section.

TTY drivers
This section hasn't been written yet. TTY drivers are character devices that interface with the kernel's generic TTY support, and they require more than just a standard character device interface. I'd appreciate it if someone would write up how to attach a character device driver to the generic TTY layer and submit it to me for inclusion in this guide.

Block Device Drivers
This section includes details specific to block device drivers (surprise!)

Writing a SCSI Device Driver
This is a technical paper written by Rik Faith at the University of North Carolina.

Network Device Drivers
Alan Cox gives an introduction to the network layer, including device drivers.

Supporting Functions
Many functions are useful to all sorts of drivers. Here is a summary of quite a few of them.

Translating Addresses in Kernel Space
An edited version of a post of Linus Torvalds to the linux-kernel mailing list about how to correctly deal with translating memory references when writing kernel source code such as device drivers.

Kernel-Level Exception Handling
An edited version of a post of Joerg Pommnitz to the linux-kernel mailing list about how the new (Linux 2.1.8) exception mechanism works.

Other sources of information
Quite a few other references are also available on the topic of writing Linux device drivers by now. I put up some (slightly outdated by now, but still worth reading, I think) notes for a talk I gave in May 1995 entitled Writing Linux Device Drivers, which is specifically oriented at character devices implemented as kernel runtime-loadable modules.

Linux Journal has had a long-running series of articles called Kernel Korner which, despite the wacky name, has had quite a bit of useful information on it. Some of the articles from that column may be available on the web; most of them are available for purchase as back issues. One particularly useful series of articles, which focussed in far more detail than my 30 minute talk on the subject of kernel runtime-loadable modules, was in issues 23, 24, 25, 26, and 28. They were written by Alessandro Rubini and Georg v. Zezschwitz. Issue 29 is slated (as of this writing) to have an article on writing network device drivers, written by Alan Cox. Issues 9, 10, and 11 have a series that I wrote on block device drivers.

Copyright (C) 1992, 1993, 1994, 1996 Michael K. Johnson, johnsonm@redhat.com.

Messages
(The numbering and reply nesting of this index did not survive extraction; the titles are listed flat.)
Untitled by Praveen Dwivedi
3D Acceleration by jamesbat@innotts.co.uk
Device Drivers: /dev/radio... by Matthew Kirkwood
Does anybody know why kernel wakes my driver up without apparant reasons? by David van Leeuwen
Getting a DMA buffer aligned with 64k boundaries by Juan de La Figuera Bayon
Hardware Interface I/O Access by Terry Moore
You are somewhat confused... by Michael K. Johnson
Is Anybody know something about SIS 496 IDE chipset? by Alexander
Vertical Retrace Interrupt - I need to use it by Brynn Rogers
Your choice... by Michael K. Johnson
Untitled
memcpy error? by Edgar Vonk
Unable to handle kernel paging request - error by Edgar Vonk
_syscallX() Macros by Tom Howley
MediaMagic Sound Card DSP-16. How to run in Linux. by Robert Hinson
What does mark_bh() do? by Erik Petersen
DMA to user space by Marcel Boosten
How a device driver can driver his device by Kim yeonseop
help working with skb structures by arkane
Interrupt sharing - How to do with Network Drivers? by Frieder Löffler
Interrupt sharing 101 by Christophe Beauregard
Through application which has opened the device by Michael K. Johnson
There is linux-2.0/drivers/pci/pci.c by Hasdi
Re: Network Device Drivers by Neal Tucker
Device Driver notification of "Linux going down" by Stan Troeh
Is waitv honored? by Michael K. Johnson
Device Driver notification of "Linux going down" by Marko Kohtala
PCI Driver by Flavia Donno
Interrupt Sharing ? by Frieder Löffler
Interrupt sharing-possible by Vladimir Myslik
network driver info by Neal Tucker
Network Driver Desprately Needed by Paul Atkinson
Re: Transmit function by Paul Gortmaker
Skbuff by Joerg Schorr
Re: Network Device Drivers by Paul Gortmaker
Transmit function by Joerg Schorr

What is a Device Driver?

Making hardware work is tedious. To write to a hard disk, for example, requires that you write magic numbers in magic places, wait for the hard drive to say that it is ready to receive data, and then feed it the data it wants, very carefully. To write to a floppy disk is even harder, and requires that the program supervise the floppy disk drive almost constantly while it is running.

Instead of putting code in each application you write to control each device, you share the code between applications. To make sure that that code is not compromised, you protect it from users and normal programs that use it. If you do it right, you will be able to add and remove devices from your system without changing your applications at all. Furthermore, you need to be able to load your program into memory and run it, which the operating system also does. So an operating system is essentially a privileged, general, sharable library of low-level hardware and memory and process control functions and routines.

All versions of Unix have an abstract way of reading and writing devices. By making the devices act as much as possible like regular files, the same calls (read(), write(), etc.) can be used for devices and files. Within the kernel, there is a set of functions, registered with the filesystem, which are called to handle requests to do I/O on ``device special files,'' which are those which represent devices. (See mknod(1,2) for an explanation of how to make these files.)

All devices controlled by the same device driver are given the same major number, and of those with the same major number, different devices are distinguished by different minor numbers. (This is not strictly true, but it is close enough. If you understand where it is not true, you don't need to read this section, and if you don't but want to learn, read the code for the tty devices, which uses up 2 major numbers, and may use a third and possibly fourth by the time you read this. Also, the ``misc'' major device supports many minor devices that only need a few minor numbers; we'll get to that later.)

This chapter explains how to write any type of Linux device driver that you might need to, including character, block, SCSI, and network drivers. It explains what functions you need to write, how to initialize your drivers and obtain memory for them efficiently, and what functions are built in to Linux to make your job easier.

Creating device drivers for Linux is easier than you might think. It merely involves writing a few functions and registering them with the Virtual Filesystem Switch (VFS), so that when the proper device special files are accessed, the VFS can call your functions.

However, a word of warning is due here: Writing a device driver is writing a part of the Linux kernel. This means that your driver runs with kernel permissions, and can do anything it wants to: write to any memory, reformat your hard drive, damage your monitor or video card, or even break your dishes, if your dishwasher is controlled by your computer. Be careful.

Also, your driver will run in kernel mode, and the Linux kernel, like most Unix kernels, is non-pre-emptible. This means that if your driver takes a long time to work without giving other programs a chance to work, your computer will appear to ``freeze'' when your driver is running. Normal user-mode pre-emptive scheduling does not apply to your driver.

Copyright (C) 1992, 1993, 1994, 1996 Michael K. Johnson, johnsonm@redhat.com.

Messages
1. Question ? by Rose Merone
   -> Not yet... by Michael K. Johnson
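The major and minor numbers discussed in this section can be looked at from user space. The sketch below is an illustration only, assuming a Linux system with glibc (or a compatible libc) where the major() and minor() macros come from <sys/sysmacros.h>; device_numbers() is a hypothetical helper name, not something from the guide.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() on Linux libcs */

/* Hypothetical helper: fetch the major/minor pair of a device special file.
 * Returns 0 on success, -1 if the path cannot be examined or is not a
 * character or block device. */
int device_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;                       /* no such file, no permission, ... */
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -1;                       /* a regular file or directory */

    *maj = major(st.st_rdev);            /* selects the driver */
    *min = minor(st.st_rdev);            /* selects the device within the driver */
    return 0;
}
```

Calling it on /dev/null and then on /dev/zero should give the same major number with different minors, since both are served by the same memory-device driver on Linux.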

Question ?
Forum: What is a Device Driver?
Date: Mon, 24 Mar 1997 08:39:09 GMT
From: Rose Merone <unknown>

D'ya have a book that covers all about device driver management in Linux ?

Messages
1. Not yet... by Michael K. Johnson

Not yet...
Forum: What is a Device Driver?
Re: Question ? (Rose Merone)
Date: Mon, 21 Apr 1997 14:00:19 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

Alessandro Rubini is writing a book about writing device drivers for O'Reilly. See http://www.ora.com/catalog/linuxdrive/ and http://www.ora.com/catalog/linuxdrive/desc.html

User-space device drivers

It is not always necessary to write a device driver for a device, especially in applications where no two applications will compete for the device. The most useful example of this is a memory-mapped device, but you can also do this with devices in I/O space (devices accessed with inb() and outb(), etc.). If your process is running as superuser (root), you can use the mmap() call to map some of your process memory to actual memory locations, by mmap()'ing a section of /dev/mem. When you have done this mapping, it is pretty easy to write and read from real memory addresses just as you would read and write any variables.

If your driver needs to respond to interrupts, then you really need to be working in kernel space, and need to write a real device driver, as there is no good way at this time to deliver interrupts to user processes. Although the DOSEMU project has created something called the SIG (Silly Interrupt Generator) which allows interrupts to be posted to user processes (I believe through the use of signals), the SIG is not particularly fast, and should be thought of as a last resort for things like DOSEMU.

An interrupt is an asynchronous notification posted by the hardware to alert the device driver of some condition. You have likely dealt with `IRQ's when setting up your hardware; an IRQ is an ``Interrupt ReQuest line,'' which is triggered when the device wants to talk to the driver. This may be because it has data to give to the driver, or because it is now ready to receive data, or because of some other ``exceptional condition'' that the driver needs to know about. It is similar to user-level processes receiving a signal, so similar that the same sigaction structure is used in the kernel to deal with interrupts as is used in user-level programs to deal with signals. Where the user-level has its signals delivered to it by the kernel, the kernel has interrupts delivered to it by hardware.

If your driver must be accessible to multiple processes at once, and/or manage contention for a resource, then you also need to write a real device driver at the kernel level, and a user-space device driver will not be sufficient or even possible.

Example: vgalib

A good example of a user-space driver is the vgalib library. The standard read() and write() calls are really inadequate for writing a really fast graphics driver, and so instead there is a library which acts conceptually like a device driver, but runs in user space. Any processes which use it must run setuid root, because it uses the ioperm() system call. It is possible for a process that is not setuid root to write to /dev/mem if you have a group mem or kmem which is allowed write permission to /dev/mem and the process is properly setgid, but only a process running as root can execute the ioperm() call.

There are several I/O ports associated with VGA graphics. vgalib creates symbolic names for this with #define statements, and then issues the ioperm() call like this to make it possible for the process to read and write directly from and to those ports:

    if (ioperm(CRT_IC, 1, 1)) {
        printf("VGAlib: can't get I/O permissions \n");
        exit (-1);
    }
    ioperm(CRT_IM, 1, 1);
    ioperm(ATT_IW, 1, 1);
    [...]

It only needs to do error checking once, because the only reason for the ioperm() call to fail is that it is not being called by the superuser, and this status is not going to change.

After making this call, the process is allowed to use inb and outb machine instructions, but only on the specified ports. These instructions can be accessed without writing directly in assembly by including <linux/asm.h>, but will only work if you compile with optimization on, by giving the -O? to gcc. Read <linux/asm.h> for details.

After arranging for port I/O, vgalib arranges for writing directly to kernel memory with the following code:

    /* open /dev/mem */
    if ((mem_fd = open("/dev/mem", O_RDWR) ) < 0) {
        printf("VGAlib: can't open /dev/mem \n");
        exit (-1);
    }

    /* mmap graphics memory */
    if ((graph_mem = malloc(GRAPH_SIZE + (PAGE_SIZE-1))) == NULL) {
        printf("VGAlib: allocation error \n");
        exit (-1);
    }
    if ((unsigned long)graph_mem % PAGE_SIZE)
        graph_mem += PAGE_SIZE - ((unsigned long)graph_mem % PAGE_SIZE);
    graph_mem = (unsigned char *)mmap(
        (caddr_t)graph_mem,
        GRAPH_SIZE,
        PROT_READ|PROT_WRITE,
        MAP_SHARED|MAP_FIXED,
        mem_fd,
        GRAPH_BASE
    );
    if ((long)graph_mem < 0) {
        printf("VGAlib: mmap error \n");
        exit (-1);
    }

It first opens /dev/mem, then allocates memory enough so that the mapping can be done on a page (4 KB) boundary, and then attempts the map. GRAPH_SIZE is the size of VGA memory, and GRAPH_BASE is the first address of VGA memory in /dev/mem. Then by writing to the address that is returned by mmap(), the process is actually writing to screen memory.

Example: mouse conversion

If you want a driver that acts a bit more like a kernel-level driver, but does not live in kernel space, you can also make a fifo, or named pipe. This usually lives in the /dev/ directory (although it doesn't need to) and acts substantially like a device once set up. However, fifo's are one-directional only--they have one reader and one writer.

For instance, it used to be that if you had a PS/2-style mouse, and wanted to run XFree86, you had to create a fifo called /dev/mouse, and run a program called mconv which read PS/2 mouse ``droppings'' from /dev/psaux, and wrote the equivalent microsoft-style ``droppings'' to /dev/mouse. Then XFree86 would read the ``droppings'' from /dev/mouse, and it would be as if there were a microsoft mouse connected to /dev/mouse. Even though XFree86 is now able to read PS/2 style ``droppings'', the concepts in this example still stand. (If you have a better example, I'd be glad to see it.)

The evil instruction

Don't use the cli() instruction. It's possible to use it as root to disable interrupts, and one particular program used to use it--the clock program. However, this kills SMP machines. If you need to use cli(), you need a kernel-space driver, and a user-space driver will only cause grief as more and more Linux users use SMP machines.

Copyright (C) 1992, 1993, 1994, 1995, 1996 Michael K. Johnson, johnsonm@redhat.com.

Messages
1. What is SMP?
   -> SMP: Two Definitions? by Reinhold J. Gerharz
      -> Only one definition for Linux... by Michael K. Johnson
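The mouse conversion example can be sketched as a small filter. This is not the source of mconv: the function names are hypothetical, and the packet layouts are my understanding of the common 3-byte PS/2 and Microsoft serial mouse formats, which you should check against a protocol reference before relying on them. The point is the shape of a fifo-based user-space ``driver'': read raw packets from one descriptor, translate, and write to another.

```c
#include <unistd.h>

/* Translate one 3-byte PS/2 packet into one 3-byte Microsoft-style packet.
 * Assumed layouts: PS/2 byte 0 carries buttons (bit 0 left, bit 1 right) and
 * the sign bits of dx (bit 4) and dy (bit 5); Microsoft byte 0 is
 * 0x40 | L<<5 | R<<4 | dy[7:6]<<2 | dx[7:6], followed by the low 6 bits of
 * dx and dy. */
void ps2_to_ms(const unsigned char ps2[3], unsigned char ms[3])
{
    int left  = ps2[0] & 0x01;
    int right = (ps2[0] >> 1) & 0x01;
    int dx = ps2[1] - ((ps2[0] & 0x10) ? 256 : 0);   /* 9-bit two's complement */
    int dy = ps2[2] - ((ps2[0] & 0x20) ? 256 : 0);
    unsigned char ux, uy;

    dy = -dy;                        /* PS/2 counts upward motion as positive */
    if (dx > 127)  dx = 127;         /* clamp to what 8 bits can carry */
    if (dx < -128) dx = -128;
    if (dy > 127)  dy = 127;
    if (dy < -128) dy = -128;
    ux = (unsigned char)dx;          /* wraps modulo 256, keeping the bit pattern */
    uy = (unsigned char)dy;

    ms[0] = (unsigned char)(0x40 | (left << 5) | (right << 4) |
                            ((uy >> 6) << 2) | (ux >> 6));
    ms[1] = ux & 0x3f;
    ms[2] = uy & 0x3f;
}

/* Copy packets from in_fd to out_fd until end of file.
 * Returns the number of packets converted, or -1 on a write error. */
long convert_stream(int in_fd, int out_fd)
{
    unsigned char ps2[3], ms[3];
    size_t have = 0;
    long packets = 0;

    for (;;) {
        ssize_t n = read(in_fd, ps2 + have, sizeof ps2 - have);
        if (n <= 0)
            break;                   /* EOF or read error: stop converting */
        have += (size_t)n;
        if (have < sizeof ps2)
            continue;                /* partial packet, keep reading */
        have = 0;
        ps2_to_ms(ps2, ms);
        if (write(out_fd, ms, sizeof ms) != (ssize_t)sizeof ms)
            return -1;
        packets++;
    }
    return packets;
}
```

To use it the way the text describes, one would create the fifo (with mkfifo(1) or mkfifo(3)), open /dev/psaux for reading and the fifo for writing, and pass the two descriptors to convert_stream(). Opening a fifo for writing blocks until a reader has opened the other end.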

What is SMP?
Forum: User-space device drivers
Keywords: SMP
Date: Mon, 16 Dec 1996 00:22:27 GMT
From: <unknown>

It might not be appropriate to ask, but it'd be real nice to know what SMP means. I never saw cli() instruction do any harm to any Linux machine I've met.

Messages
1. SMP: Two Definitions? by Reinhold J. Gerharz
   -> Only one definition for Linux... by Michael K. Johnson

SMP: Two Definitions?

Forum: User-space device drivers
Re: What is SMP?
Keywords: SMP
Date: Thu, 09 Jan 1997 03:18:21 GMT
From: Reinhold J. Gerharz <rgerharz@erols.com>

I thought SMP meant "symmetric multi-processing," a technology where two or more processors share equal access to memory, device I/O, and interrupts. Ideally one would expect a 100 percent improvement in processing performance for each additional processor, but in reality only 80-90 percent is achieved.

However, I have discovered that to some people, SMP means "shared-memory multi-processing." This technology allows multiple processors to run user programs, but one processor reserves interrupt and I/O handling for itself. This is traditionally called "asymmetric multi-processing," and I have tentatively concluded that only "marketing types" would use this terminology to confuse potential customers.

Messages

1. Only one definition for Linux. by Michael K. Johnson

Only one definition for Linux.

Forum: User-space device drivers
Re: What is SMP?
Re: SMP: Two Definitions? (Reinhold J. Gerharz)
Keywords: SMP
Date: Mon, 13 Jan 1997 14:26:44 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

In the Linux world, SMP really does mean symmetric multi-processing. Currently, there's a lock around the whole kernel so that only one CPU can be in kernel mode at once, but all the CPUs can run in kernel mode at different times.

As you add more CPUs to an SMP system, the amount of extra performance you get out of each additional CPU decreases, until at some point it actually decreases performance to add another CPU. Most systems simply don't support enough CPUs to get a negative marginal performance gain, so that usually isn't an issue.

Also, because Linux uses a single lock, the current kernels degrade more quickly as you add more CPUs than a multiple-lock system would for I/O-bound tasks. CPU-bound tasks, on the other hand, work very well with a single lock around the kernel.

Device Driver Basics

We will assume that you decide that you do not wish to write a user-space device, and would rather implement your device in the kernel. You will probably be writing two files, a .c file and a .h file, and possibly modifying other files as well, as will be described below. We will refer to your files as foo.c and foo.h, and your driver will be the foo driver.

Namespace

One of the first things you will need to do, before writing any code, is to name your device. This name should be a short (probably two or three character) string. For instance, the parallel device is the ``lp'' device, the floppies are the ``fd'' devices, and SCSI disks are the ``sd'' devices. As you write your driver, you will give your functions names prefixed with your chosen string to avoid any namespace confusion. We will call your prefix foo, and give your functions names like foo_read(), foo_write(), etc.

Allocating memory

Memory allocation in the kernel is a little different from memory allocation in normal user-level programs. Instead of having a malloc() capable of delivering almost unlimited amounts of memory, there is a kmalloc() function that is a bit different:

  - Memory is provided in pieces whose size is a power of 2, except that pieces larger than 128 bytes are allocated in blocks whose size is a power of 2 minus some small amount for overhead. You can request any odd size, but memory will not be used any more efficiently if you request a 31-byte piece than it will if you request a 32-byte piece. Also, there is a limit to the amount of memory that can be allocated, which is currently 131056 bytes.
  - kmalloc() takes a second argument, the priority. This is used as an argument to the get_free_page() function, where it is used to determine when to return. The usual priority is GFP_KERNEL. If it may be called from within an interrupt, use GFP_ATOMIC and be truly prepared for it to fail (don't panic). This is because if you specify GFP_KERNEL, kmalloc() may sleep, which cannot be done on an interrupt. The other option is GFP_BUFFER, which is used only when the kernel is allocating buffer space, and never in device drivers.

To free memory allocated with kmalloc(), use one of two functions: kfree() or kfree_s(). These differ from free() in a few ways as well:

  - kfree() is a macro which calls kfree_s() and acts like the standard free() outside the kernel.
  - If you know what size object you are freeing, you can speed things up by calling kfree_s() directly. It takes two arguments: the first is the pointer that you are freeing, as in the single argument to kfree(), and the second is the size of the object being freed.

See Supporting Functions for more information on kmalloc(), kfree(), and other useful functions.

Be gentle when you use kmalloc(). Use only what you have to. Remember that kernel memory is unswappable, and thus allocating extra memory is a far worse thing to do in the kernel than in a user-level program. Take only what you need, and free it when you are done, unless you are going to use it right away again.

Character vs. block devices

There are two main types of devices under all Unix systems, character and block devices. Character devices are those for which no buffering is performed, and block devices are those which are accessed through a cache. Block devices must be random access, but character devices are not required to be, though some are. Filesystems can only be mounted if they are on block devices.
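
The character/block distinction is also visible from user space, because the type is recorded in the device special file itself. As a small illustration (a sketch only; device_kind() is a name invented here, and it relies on the standard POSIX stat() call rather than on anything in the kernel sources discussed in this guide), the following function reports which of the two types a path refers to:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: returns 'c' for a character device, 'b' for a
 * block device, '-' for any other kind of file, '?' if stat() fails. */
char device_kind(const char *path)
{
    struct stat st;

    assert(path != NULL);

    if (stat(path, &st) != 0)
        return '?';
    if (S_ISCHR(st.st_mode))      /* e.g. terminals, /dev/null */
        return 'c';
    if (S_ISBLK(st.st_mode))      /* e.g. disks */
        return 'b';
    return '-';
}
```

On a typical Linux system, device_kind("/dev/null") reports a character device, while a disk node would report a block device.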

Character devices are read from and written to with two functions: foo_read() and foo_write(). The read() and write() calls do not return until the operation is complete. By contrast, block devices do not even implement the read() and write() functions, and instead have a function which has historically been called the ``strategy routine.'' Reads and writes are done through the buffer cache mechanism by the generic functions bread(), breada(), and bwrite(). These functions go through the buffer cache, and so may or may not actually call the strategy routine, depending on whether or not the block requested is in the buffer cache (for reads) or on whether or not the buffer cache is full (for writes). A request may be asynchronous: breada() can request the strategy routine to schedule reads that have not been asked for, and to do it asynchronously, in the background, in the hopes that they will be needed later.

The sources for character devices are kept in drivers/char/, and the sources for block devices are kept in drivers/block/. They have similar interfaces, and are very much alike, except for reading and writing. Because of the difference in reading and writing, initialization is different, as block devices have to register a strategy routine, which is registered in a different way than the foo_read() and foo_write() routines of a character device driver. Specifics are dealt with in Character Device Initialization and Block Device Initialization.

Interrupts vs. Polling

Hardware is slow. That is, in the time it takes to get information from your average device, the CPU could be off doing something far more useful than waiting for a busy but slow device. So to keep from having to busy-wait all the time, interrupts are provided which can interrupt whatever is happening so that the operating system can do some task and return to what it was doing without losing information. In an ideal world, all devices would probably work by using interrupts. However, on a PC or clone, there are only a few interrupts available for use by your peripherals, so some drivers have to poll the hardware: ask the hardware if it is ready to transfer data yet. This unfortunately wastes time, but it sometimes needs to be done.

Some hardware (like memory-mapped displays) is as fast as the rest of the machine, and does not generate output asynchronously, so an interrupt-driven driver would be rather silly, even if interrupts were provided.

In Linux, many of the drivers are interrupt-driven, but some are not, and at least one can be either, and can be switched back and forth at runtime. For instance, the lp device (the parallel port driver) normally polls the printer to see if the printer is ready to accept output, and if the printer stays in a not ready phase for too long, the driver will sleep for a while, and try again later. This improves system performance. However, if you have a parallel card that supplies an interrupt, the driver will utilize that, which will usually make performance even better.

There are some important programming differences between interrupt-driven drivers and polling drivers. To understand this difference, you have to understand a little bit of how system calls work under Unix. The kernel is not a separate task under Unix. Rather, it is as if each process has a copy of the kernel. When a process executes a system call, it does not transfer control to another process, but rather, the process changes execution modes, and is said to be ``in kernel mode.'' In this mode, it executes kernel code which is trusted to be safe.

In kernel mode, the process can still access the user-space memory that it was previously executing in, which is done through a set of macros: get_fs_*() and memcpy_fromfs() read user-space memory, and put_fs_*() and memcpy_tofs() write to user-space memory. Because the process is still running, but in a different mode, there is no question of where in memory to put the data, or where to get it from. However, when an interrupt occurs, any process might currently be running, so these macros cannot be used--if they are, they will either write over random memory space of the running process or cause the kernel to panic.

Instead, when scheduling the interrupt, a driver must also provide temporary space in which to put the information, and then sleep. When the interrupt-driven part of the driver has filled up that temporary space, it wakes up the process, which copies the information from that temporary space into the process' user space and returns. In a block device driver, this temporary space is automatically provided by the buffer cache mechanism, but in a character device driver, the driver is responsible for allocating it itself.
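
The hand-off between the interrupt half and the process half of a character driver can be modelled in ordinary user-space C. What follows is a sketch under stated assumptions, not kernel code: struct foo_buf, foo_irq_put(), foo_read_buffered() and the buffer size are all invented for illustration, the sleeping and waking are left out, and a plain array copy stands in for the user-space copy macros. It shows only the temporary space itself: one side deposits bytes, the other later copies them out.

```c
#include <assert.h>
#include <stddef.h>

#define FOO_BUF_SIZE 8          /* hypothetical size; must be a power of two */

/* Temporary space shared by the two halves of the driver. */
struct foo_buf {
    unsigned char data[FOO_BUF_SIZE];
    unsigned int head;          /* next slot the "interrupt" half writes */
    unsigned int tail;          /* next slot the reader consumes */
};

/* Stand-in for the interrupt half: store one byte.
 * Returns 1 on success, 0 if the buffer is full and the byte is dropped. */
int foo_irq_put(struct foo_buf *b, unsigned char c)
{
    assert(b != NULL);
    if (b->head - b->tail == FOO_BUF_SIZE)
        return 0;
    b->data[b->head++ % FOO_BUF_SIZE] = c;
    return 1;
}

/* Stand-in for the process half (what foo_read() would do after being
 * woken): copy up to count buffered bytes to dst, return how many. */
size_t foo_read_buffered(struct foo_buf *b, unsigned char *dst, size_t count)
{
    size_t n = 0;

    assert(b != NULL && dst != NULL);
    while (n < count && b->tail != b->head)
        dst[n++] = b->data[b->tail++ % FOO_BUF_SIZE];
    return n;
}
```

A zero-initialized struct foo_buf is an empty buffer. In a real driver the reader would sleep while the buffer is empty and the interrupt handler would wake it after depositing data.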

The sleep-wakeup mechanism

[Begin by giving a general description of how sleeping is used and what it does. This should mention things like all processes sleeping on an event are woken at once, and then they contend for the event again, etc...]

Perhaps the best way to try to understand the Linux sleep-wakeup mechanism is to read the source for the __sleep_on() function, used to implement both the sleep_on() and interruptible_sleep_on() calls.

    static inline void __sleep_on(struct wait_queue **p, int state)
    {
        unsigned long flags;
        struct wait_queue wait = { current, NULL };

        if (!p)
            return;
        if (current == task[0])
            panic("task[0] trying to sleep");
        current->state = state;
        add_wait_queue(p, &wait);
        save_flags(flags);
        sti();
        schedule();
        remove_wait_queue(p, &wait);
        restore_flags(flags);
    }

A wait_queue is a circular list of pointers to task structures, defined in <linux/wait.h> to be

    struct wait_queue {
        struct task_struct * task;
        struct wait_queue * next;
    };

state is either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, depending on whether or not the sleep should be interruptible by such things as system calls. In general, the sleep should be interruptible if the device is a slow one; one which can block indefinitely, including terminals and network devices or pseudodevices.

add_wait_queue() turns off interrupts, if they were enabled, and adds the new struct wait_queue declared at the beginning of the function to the list p. It then recovers the original interrupt state (enabled or disabled), and returns.

save_flags() is a macro which saves the process flags in its argument. This is done to preserve the previous state of the interrupt enable flag. This way, the restore_flags() later can restore the interrupt state, whether it was enabled or disabled. sti() then allows interrupts to occur, and schedule() finds a new process to run, and switches to it. schedule() will not choose this process to run again until the state is changed to TASK_RUNNING by wake_up() called on the same wait queue, p, or conceivably by something else.

The process then removes itself from the wait_queue, restores the original interrupt condition with restore_flags(), and returns.

Whenever contention for a resource might occur, there needs to be a pointer to a wait_queue associated with that resource. Then, whenever contention does occur, each process that finds itself locked out of access to the resource sleeps on that resource's wait_queue. When any process is finished using a resource for which there is a wait_queue, it should wake up any processes that might be sleeping on that wait_queue, probably by calling

wake_up(), or possibly wake_up_interruptible().

If you don't understand why a process might want to sleep, or want more details on when and how to structure this sleeping, I urge you to buy one of the operating systems textbooks listed in the Annotated Bibliography and look up mutual exclusion and deadlock.

More advanced sleeping

If the sleep_on()/wake_up() mechanism in Linux does not satisfy your device driver needs, you can code your own versions of sleep_on() and wake_up() that fit your needs. For an example of this, look at the serial device driver (drivers/char/serial.c) in function block_til_ready(), where quite a bit has to be done between the add_wait_queue() and the schedule().

The VFS

The Virtual Filesystem Switch, or VFS, is the mechanism which allows Linux to mount many different filesystems at the same time. In the first versions of Linux, all filesystem access went straight into routines which understood the minix filesystem. To make it possible for other filesystems to be written, filesystem calls had to pass through a layer of indirection which would switch the call to the routine for the correct filesystem. This was done by some generic code which can handle generic cases and a structure of pointers to functions which handle specific cases. One structure is of interest to the device driver writer: the file_operations structure.

From /usr/include/linux/fs.h:

    struct file_operations {
        int (*lseek) (struct inode *, struct file *, off_t, int);
        int (*read) (struct inode *, struct file *, char *, int);
        int (*write) (struct inode *, struct file *, char *, int);
        int (*readdir) (struct inode *, struct file *, struct dirent *, int count);
        int (*select) (struct inode *, struct file *, int, select_table *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned int);
        int (*mmap) (struct inode *, struct file *, unsigned long, size_t, int, unsigned long);
        int (*open) (struct inode *, struct file *);
        void (*release) (struct inode *, struct file *);
    };

Essentially, this structure constitutes a partial list of the functions that you may have to write to create your driver.

This section details the actions and requirements of the functions in the file_operations structure. It documents all the arguments that these functions take. [It should also detail all the defaults, and cover more carefully the possible return values.]

The lseek() function

This function is called when the system call lseek() is called on the device special file representing your device. An understanding of what the system call lseek() does should be sufficient to explain this function, which moves to the desired offset. It takes these four arguments:

struct inode * inode
    Pointer to the inode structure for this device.
struct file * file
    Pointer to the file structure for this device.
off_t offset

    Offset from origin to move to.
int origin
    0 = take the offset from absolute offset 0 (the beginning).
    1 = take the offset from the current position.
    2 = take the offset from the end.

lseek() returns -errno on error, or the absolute position (>= 0) after the lseek.

If there is no lseek(), the kernel will take the default action, which is to modify the file->f_pos element. For an origin of 2, the default action is to return -EINVAL if file->f_inode is NULL, otherwise it sets file->f_pos to file->f_inode->i_size + offset. Because of this, if lseek() should return an error for your device, you must write an lseek() function which returns that error.

The read() and write() functions

The read and write functions read and write a character string to the device. If there is no read() or write() function in the file_operations structure registered with the kernel, and the device is a character device, read() or write() system calls, respectively, will return -EINVAL. If the device is a block device, these functions should not be implemented, as the VFS will route requests through the buffer cache, which will call your strategy routine. The read and write functions take these arguments:

struct inode * inode
    This is a pointer to the inode of the device special file which was accessed. From this, you can do several things, based on the struct inode declaration about 100 lines into /usr/include/linux/fs.h. For instance, you can find the minor number of the file by this construction:
        unsigned int minor = MINOR(inode->i_rdev);
    The definition of the MINOR macro is in <linux/fs.h>, as are many other useful definitions. Read fs.h and a few device drivers for more details, and see Supporting Functions for a short description. inode->i_mode can be used to find the mode of the file, and there are macros available for this, as well.
struct file * file
    Pointer to file structure for this device.
char * buf
    This is a buffer of characters to read or write. It is located in user-space memory, and therefore must be accessed using the get_fs*(), put_fs*(), and memcpy*fs() macros detailed in Supporting Functions. User-space memory is inaccessible during an interrupt, so if your driver is interrupt driven, you will have to copy the contents of your buffer into a queue.
int count
    This is a count of characters in buf to be read or written. It is the size of buf, and is how you know that you have reached the end of buf, as buf is not guaranteed to be null-terminated.

The readdir() function

This function is another artifact of file_operations being used for implementing filesystems as well as device drivers. Do not implement it. The kernel will return -ENOTDIR if the system call readdir() is called on your device special file.

The select() function

The select() function is generally most useful with character devices. It is usually used to multiplex reads without polling--the application calls the select() system call, giving it a list of file descriptors to watch, and the kernel reports back to the program on which file descriptor has woken it up. It is also used as a timer. However, the select() function in your device driver is not directly called by the system call select(), and so the

file_operations select() only needs to do a few things. Its arguments are:

struct inode * inode
    Pointer to the inode structure for this device.
struct file * file
    Pointer to the file structure for this device.
int sel_type
    The select type to perform:
        SEL_IN   read
        SEL_OUT  write
        SEL_EX   exception
select_table * wait
    If wait is not NULL and there is no error condition caused by the select, select() should put the process to sleep, and arrange to be woken up when the device becomes ready, usually through an interrupt. If wait is NULL, then the driver should quickly see if the device is ready, and return even if it is not. The select_wait() function does this already.

If the calling program wants to wait until one of the devices upon which it is selecting becomes available for the operation it is interested in, the process will have to be put to sleep until one of those operations becomes available. This does not require use of a sleep_on*() function, however. Instead the select_wait() function is used. (See Supporting Functions for the definition of the select_wait() function.) The sleep state that select_wait() will cause is the same as that of interruptible_sleep_on(), and, in fact, wake_up_interruptible() is used to wake up the process.

However, select_wait() will not make the process go to sleep right away. It returns directly, and the select() function you wrote should then return. The process isn't put to sleep until the system call sys_select(), which originally called your select() function, uses the information given to it by the select_wait() function to put the process to sleep. select_wait() adds the process to the wait queue, but do_select() (called from sys_select()) actually puts the process to sleep by changing the process state to TASK_INTERRUPTIBLE and calling schedule().

The first argument to select_wait() is the same wait_queue that should be used for a sleep_on(), and the second is the select_table that was passed to your select() function.

After having explained all this in excruciating detail, here are two rules to follow:

1. Call select_wait() if the device is not ready, and return 0.
2. Return 1 if the device is ready.

If you provide a select() function, do not provide timeouts by setting current->timeout, as the select() mechanism uses current->timeout, and the two methods cannot co-exist, as there is only one timeout for each process. Instead, consider using a timer to provide timeouts. See the description of the add_timer() function in Supporting Functions for details.

The ioctl() function

The ioctl() function processes ioctl calls. The structure of your ioctl() function will be: first error checking, then one giant (possibly nested) switch statement to handle all possible ioctls. The ioctl number is passed as cmd, and the argument to the ioctl is passed as arg. It is good to have an understanding of how ioctls ought to work before making them up. If you are not sure about your ioctls, do not feel ashamed to ask someone knowledgeable about it, for a few reasons: you may not even need an ioctl for your purpose, and if you do need an ioctl, there may be a better way

to do it than what you have thought of. Since ioctls are the least regular part of the device interface, it takes perhaps the most work to get this part right. Take the time and energy you need to get it right.

The first thing you need to do is look in Documentation/ioctl-number.txt, read it, and pick an unused number. Then go from there.

struct inode * inode
    Pointer to the inode structure for this device.
struct file * file
    Pointer to the file structure for this device.
unsigned int cmd
    This is the ioctl command. It is generally used as the switch variable for a case statement.
unsigned int arg
    This is the argument to the command. This is user defined. Since this is the same size as a (void *), this can be used as a pointer to user space, accessed through the fs register as usual.
Returns:
    -errno on error
    Every other return is user-defined.

If the ioctl() slot in the file_operations structure is not filled in, the VFS will return -EINVAL. However, in all cases, if cmd is one of FIOCLEX, FIONCLEX, FIONBIO, or FIOASYNC, default processing will be done:

FIOCLEX (0x5451)
    Sets the close-on-exec bit.
FIONCLEX (0x5450)
    Clears the close-on-exec bit.
FIONBIO (0x5421)
    If arg is non-zero, set O_NONBLOCK, otherwise clear O_NONBLOCK.
FIOASYNC (0x5452)
    If arg is non-zero, set O_SYNC, otherwise clear O_SYNC. O_SYNC is not yet implemented, but it is documented here and parsed in the kernel for completeness.

Note that you have to avoid these four numbers when creating your own ioctls, since if they conflict, the VFS ioctl code will interpret them as being one of these four, and act appropriately, causing a very hard-to-track-down bug.

The mmap() function

struct inode * inode
    Pointer to inode structure for device.
struct file * file
    Pointer to file structure for device.
unsigned long addr
    Beginning of address in main memory to mmap() into.
size_t len
    Length of memory to mmap().
int prot
    One of:
        PROT_READ   region can be read.

        PROT_WRITE  region can be written.
        PROT_EXEC   region can be executed.
        PROT_NONE   region cannot be accessed.
unsigned long off
    Offset in the file to mmap() from. This address in the file will be mapped to address addr.

The open() and release() functions

struct inode * inode
    Pointer to inode structure for device.
struct file * file
    Pointer to file structure for device.

open() is called when a device special file is opened. It is the policy mechanism responsible for ensuring consistency. If only one process is allowed to open the device at once, open() should lock the device, using whatever locking mechanism is appropriate, usually setting a bit in some state variable to mark it as busy. If a process already is using the device (if the busy bit is already set) then open() should return -EBUSY. If more than one process may open the device, this function is responsible to set up any necessary queues that would not be set up in write(). If no such device exists, open() should return -ENODEV to indicate this. Return 0 on success.

release() is called only when the process closes its last open file descriptor on the file. [I am not sure this is true; it might be called on every close.] If devices have been marked as busy, release() should unset the busy bits if appropriate. If you need to clean up kmalloc()'ed queues or reset devices to preserve their sanity, this is the place to do it. If no release() function is defined, none is called.

The init() function

This function is not actually included in the file_operations structure, but you are required to implement it, because it is this function that registers the file_operations structure with the VFS in the first place--without this function, the VFS could not route any requests to the driver. This function is called when the kernel first boots and is configuring itself. The init function then detects all devices. You will have to call your init() function from the correct place: for a character device, this is chr_dev_init() in drivers/char/mem.c.

While the init() function runs, it registers your driver by calling the proper registration function. For character devices, this is register_chrdev(). (See Supporting Functions for more information on the registration functions.) register_chrdev() takes three arguments: the major device number (an int), the ``name'' of the device (a string), and the address of the device_fops file_operations structure.

When this is done, and a character or block special file is accessed, the VFS filesystem switch automagically routes the call, whatever it is, to the proper function, if a function exists. If the function does not exist, the VFS routines take some default action.

The init() function usually displays some information about the driver, and usually reports all hardware found. All reporting is done via the printk() function.

Copyright (C) 1992, 1993, 1994, 1996 Michael K. Johnson, johnsonm@redhat.com.

Messages

1. using XX_select() for device without interrupts by Elwood Downey
2. found reason for select() problem

3. Why do VFS functions get both structs inode and file? by Reinhold J. Gerharz
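
To tie the file_operations discussion together, here is a user-space model of the dispatch idea: a table of function pointers, a caller that falls back to a default when a slot is empty, and an open()/release() pair that guards a single-open device with a busy flag. This is a sketch only. struct toy_fops, the toy_*() dispatchers and the simplified argument lists are invented for illustration and are not the kernel's definitions; the -EINVAL defaults for read and ioctl and the -EBUSY return follow the descriptions in the text, while returning 0 for a missing open slot is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct file_operations. */
struct toy_fops {
    int  (*read)(char *buf, int count);
    int  (*ioctl)(unsigned int cmd, unsigned int arg);
    int  (*open)(void);
    void (*release)(void);
};

/* "VFS" side: call the driver's function if the slot is filled in,
 * otherwise take a default action. */
int toy_read(const struct toy_fops *f, char *buf, int count)
{
    assert(f != NULL);
    return f->read ? f->read(buf, count) : -EINVAL;
}

int toy_ioctl(const struct toy_fops *f, unsigned int cmd, unsigned int arg)
{
    assert(f != NULL);
    return f->ioctl ? f->ioctl(cmd, arg) : -EINVAL;
}

int toy_open(const struct toy_fops *f)
{
    assert(f != NULL);
    return f->open ? f->open() : 0;   /* assumption: no open slot succeeds */
}

void toy_release(const struct toy_fops *f)
{
    assert(f != NULL);
    if (f->release)                   /* no release slot: nothing is called */
        f->release();
}

/* Driver side: a device that only one opener may use at a time. */
static int foo_busy;

static int foo_open(void)
{
    if (foo_busy)
        return -EBUSY;
    foo_busy = 1;
    return 0;
}

static void foo_release(void)
{
    foo_busy = 0;
}

static int foo_read(char *buf, int count)
{
    int i;

    for (i = 0; i < count; i++)
        buf[i] = 'x';                 /* pretend the device produced data */
    return count;
}

/* The ioctl slot is left empty, so toy_ioctl() returns -EINVAL. */
const struct toy_fops foo_fops = { foo_read, NULL, foo_open, foo_release };
```

Calling toy_open(&foo_fops) twice in a row shows the busy flag at work: the first call succeeds and the second is refused until toy_release(&foo_fops) has run.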

Supporting Functions

Here is a list of many of the most common supporting functions available to the device driver writer. If you find other supporting functions that are useful, please point them out to me. I know this is not a complete list, but I hope it is a helpful one.

add_request()

    static void add_request(struct blk_dev_struct *dev, struct request * req)

This is a static function in ll_rw_block.c, and cannot be called by other code. However, an understanding of this function, as well as an understanding of ll_rw_block(), may help you understand the strategy routine.

If the device that the request is for has an empty request queue, the request is put on the queue and the strategy routine is called. Otherwise, the proper place in the queue is chosen and the request is inserted in the queue, maintaining proper order by insertion sort.

Proper order (the elevator algorithm) is defined as:

1. Reads come before writes.
2. Lower minor numbers come before higher minor numbers.
3. Lower block numbers come before higher block numbers.

The elevator algorithm is implemented by the macro IN_ORDER(), which is defined in drivers/block/blk.h [This may have changed somewhat recently, but it shouldn't matter to the driver writer anyway...]

Defined in: drivers/block/ll_rw_block.c
See also: make_request(), ll_rw_block().

add_timer()

    void add_timer(struct timer_list * timer)
    #include <linux/timer.h>

Installs the timer structures in the list timer in the timer list.

The timer_list structure is defined by:

    struct timer_list {
        struct timer_list *next;
        struct timer_list *prev;
        unsigned long expires;
        unsigned long data;

        void (*function)(unsigned long);
    };

In order to call add_timer(), you need to allocate a timer_list structure, and then call init_timer(), passing it a pointer to your timer_list. It will nullify the next and prev elements, which is the correct initialization. If necessary, you can allocate multiple timer_list structures, and link them into a list. Do make sure that you properly initialize all the unused pointers to NULL, or the timer code may get very confused.

For each struct in your list, you set three variables:

expires
    The number of jiffies (100ths of a second in Linux/86; thousandths or so in Linux/Alpha) after which to time out.
function
    Kernel-space function to run after timeout has occurred.
data
    Passed as the argument to function when function is called.

Having created this list, you give a pointer to the first (usually the only) element of the list as the argument to add_timer(). Having passed that pointer, keep a copy of the pointer handy, because you will need to use it to modify the elements of the list (to set a new timeout when you need a function called again, to change the function to be called, or to change the data that is passed to the function) and to delete the timer, if necessary.

Note: This is not process-specific. Therefore, if you want to wake a certain process at a timeout, you will have to use the sleep and wake primitives. The functions that you install through this mechanism will run in the same context that interrupt handlers run in.

Defined in: kernel/sched.c
See also: timer_table in include/linux/timer.h, init_timer(), del_timer().

cli()

    #define cli() __asm__ __volatile__ ("cli"::)
    #include <asm/system.h>

Prevents interrupts from being acknowledged. cli stands for ``CLear Interrupt enable''.

See also: sti()

del_timer()

    void del_timer(struct timer_list * timer)
    #include <linux/timer.h>

Deletes the timer structures in the list timer in the timer list.

The timer list that you delete must be the address of a timer list you have earlier installed with

add_timer(). Once you have called del_timer() to delete the timer from the kernel timer list, you may deallocate the memory used in the timer_list structures, as it is no longer referenced by the kernel timer list.

Defined in: kernel/sched.c
See also: timer_table in include/linux/timer.h, init_timer(), add_timer().

end_request()

static void end_request(int uptodate)
#include "blk.h"

Called when a request has been satisfied or aborted. Takes one argument:

uptodate
    If not equal to 0, means that the request has been satisfied. If equal to 0, means that the request has not been satisfied.

If the request was satisfied (uptodate != 0), end_request() maintains the request list, unlocks the buffer, and may arrange for the scheduler to be run at the next convenient time (need_resched = 1; this is implicit in wake_up(), and is not explicitly part of end_request()), before waking up all processes sleeping on the wait_for_request event, which is slept on in make_request(), ll_rw_page(), and ll_rw_swap_file().

Note: This function is a static function, defined in drivers/block/blk.h for every non-SCSI device that includes blk.h. (SCSI devices do this differently; the high-level SCSI code itself provides this functionality to the low-level device-specific SCSI device drivers.) It includes several defines dependent on static device information, such as the device number. This is marginally faster than a more generic normal C function.

Defined in: kernel/blk_drv/blk.h
See also: ll_rw_block(), add_request(), make_request().

free_irq()

void free_irq(unsigned int irq)
#include <linux/sched.h>

Frees an irq previously acquired with request_irq() or irqaction(). Takes one argument:

irq
    Interrupt level to free.

Defined in: kernel/irq.c
See also: request_irq(), irqaction().

get_user()

#define get_user(ptr) ((__typeof__(*(ptr)))__get_user((ptr),sizeof(*(ptr))))

#include <asm/segment.h>

Allows a driver to access data in user space, which is in a different segment than the kernel. Derives the type of the argument and the return type automatically. This means that you have to use types correctly. Shoddy typing will simply fail to work.

Note: these functions may cause implicit I/O, if the memory being accessed has been swapped out, and therefore pre-emption may occur at this point. Do not include these functions in critical sections of your code even if the critical sections are protected by cli()/sti() pairs, because that implicit I/O will violate the integrity of your cli()/sti() pair. If you need to get at user-space memory, copy it to kernel-space memory before you enter your critical section.

These functions take one argument:

addr
    Address to get data from.

Returns: Data at that offset in user space.

Defined in: include/asm/segment.h
See also: memcpy_*fs(), put_user(), cli(), sti().

inb(), inb_p()

inline unsigned int inb(unsigned short port)
inline unsigned int inb_p(unsigned short port)
#include <asm/io.h>

Reads a byte from a port. inb() goes as fast as it can, while inb_p() pauses before returning. Some devices are happier if you don't read from them as fast as possible. Both functions take one argument:

port
    Port to read byte from.

Returns: The byte is returned in the low byte of the 32-bit integer, and the 3 high bytes are unused, and may be garbage.

Defined in: include/asm/io.h
See also: outb(), outb_p().

init_timer()

Inline function for initializing timer_list structures for use with add_timer().

Defined in: include/linux/timer.h
See also: add_timer().

irqaction()

int irqaction(unsigned int irq, struct sigaction *new)
#include <linux/sched.h>

Hardware interrupts are really a lot like signals. Therefore, it makes sense to be able to register an interrupt like a signal. The sa_restorer() field of the struct sigaction is not used, but otherwise it is the same.

The int argument to the sa.handler() function may mean different things, depending on whether or not the IRQ is installed with the SA_INTERRUPT flag. If it is not installed with the SA_INTERRUPT flag, then the argument passed to the handler is a pointer to a register structure, and if it is installed with the SA_INTERRUPT flag, then the argument passed is the number of the IRQ. For an example of a handler set to use the SA_INTERRUPT flag, look at how rs_interrupt() is installed in drivers/char/serial.c

The SA_INTERRUPT flag is used to determine whether or not the interrupt should be a ``fast'' interrupt. Normally, upon return from the interrupt, need_resched, a global flag, is checked. If it is set (!= 0), then schedule() is run, which may schedule another process to run. Normal interrupts are also run with all other interrupts still enabled. However, by setting the sigaction structure member sa_flags to SA_INTERRUPT, ``fast'' interrupts are chosen, which leave out some processing, and very specifically do not call schedule().

irqaction() takes two arguments:

irq
    The number of the IRQ the driver wishes to acquire.
new
    A pointer to a sigaction struct.

Returns: -EBUSY if the interrupt has already been acquired, -EINVAL if sa.handler() is NULL, 0 on success.

Defined in: kernel/irq.c
See also: request_irq(), free_irq()

IS_*(inode)

IS_RDONLY(inode) ((inode)->i_flags & MS_RDONLY)
IS_NOSUID(inode) ((inode)->i_flags & MS_NOSUID)
IS_NODEV(inode)  ((inode)->i_flags & MS_NODEV)
IS_NOEXEC(inode) ((inode)->i_flags & MS_NOEXEC)
IS_SYNC(inode)   ((inode)->i_flags & MS_SYNC)
#include <linux/fs.h>

These five test to see if the inode is on a filesystem mounted with the corresponding flag.

kfree*()

#define kfree(x) kfree_s((x), 0)

void kfree_s(void * obj, int size)
#include <linux/malloc.h>

Free memory previously allocated with kmalloc(). There are two possible arguments:

obj
    Pointer to kernel memory to free.
size
    To speed this up, if you know the size, use kfree_s() and provide the correct size. This way, the kernel memory allocator knows which bucket cache the object belongs to, and doesn't have to search all of the buckets. (For more details on this terminology, read mm/kmalloc.c.)

[kfree_s() may be obsolete now.]

Defined in: mm/kmalloc.c, include/linux/malloc.h
See also: kmalloc().

kmalloc()

void * kmalloc(unsigned int len, int priority)
#include <linux/kernel.h>

kmalloc() used to be limited to 4096 bytes. It is now limited to 131056 bytes ((32*4096)-16) on Linux/Intel, and twice that on platforms such as Alpha with 8Kb pages. Buckets, which used to be all exact powers of 2, are now a power of 2 minus some small number, except for numbers less than or equal to 128. For more details, see the implementation in mm/kmalloc.c.

kmalloc() takes two arguments:

len
    Length of memory to allocate. If the maximum is exceeded, kmalloc will log an error message of ``kmalloc of too large a block (%d bytes).'' and return NULL.
priority
    GFP_KERNEL or GFP_ATOMIC. If GFP_KERNEL is chosen, kmalloc() may sleep, allowing pre-emption to occur. This is the normal way of calling kmalloc(). However, there are cases where it is better to return immediately if no pages are available, without attempting to sleep to find one. One of the places in which this is true is in the swapping code, because sleeping there could cause race conditions, and another is in the networking code, where things can happen much faster than they could be handled by swapping to disk to make space for giving the networking code more memory. The most important reason for using GFP_ATOMIC is if kmalloc() is being called from an interrupt, when you cannot sleep, and cannot receive other interrupts.

Returns: NULL on failure. Pointer to allocated memory on success.

Defined in: mm/kmalloc.c
See also: kfree()

ll_rw_block()

void ll_rw_block(int rw, int nr, struct buffer_head *bh[])
#include <linux/fs.h>

No device driver will ever call this code: it is called only through the buffer cache. However, an understanding of this function may help you understand the function of the strategy routine.

After sanity checking, if there are no pending requests on the device's request queue, ll_rw_block() ``plugs'' the queue so that the requests don't go out until all the requests are in the queue, sorted by the elevator algorithm. make_request() is then called for each request. If the queue had to be plugged, then the strategy routine for that device is not active, and it is called, with interrupts disabled. It is the responsibility of the strategy routine to re-enable interrupts.

Defined in: devices/block/ll_rw_block.c
See also: make_request(), add_request().

make_request()

static void make_request(int major, int rw, struct buffer_head *bh)

This is a static function in ll_rw_block.c, and cannot be called by other code. However, an understanding of this function, as well as an understanding of ll_rw_block(), may help you understand the strategy routine.

make_request() first checks to see if the request is readahead or writeahead and the buffer is locked. If so, it simply ignores the request and returns. Otherwise, it locks the buffer and, except for SCSI devices, checks to make sure that write requests don't fill the queue, as read requests should take precedence.

If no spaces are available in the queue, and the request is neither readahead nor writeahead, make_request() sleeps on the event wait_for_request, and tries again when woken. When a space in the queue is found, the request information is filled in and add_request() is called to actually add the request to the queue.

Defined in: devices/block/ll_rw_block.c
See also: add_request(), ll_rw_block().

MAJOR()

#define MAJOR(a) (((unsigned)(a))>>8)
#include <linux/fs.h>

This takes a 16 bit device number and gives the associated major number by shifting off the minor number.

See also: MINOR().

MINOR()

#define MINOR(a) ((a)&0xff)

#include <linux/fs.h>

This takes a 16 bit device number and gives the associated minor number by masking off the major number.

See also: MAJOR().

memcpy_*fs()

inline void memcpy_tofs(void * to, const void * from, unsigned long n)
inline void memcpy_fromfs(void * to, const void * from, unsigned long n)
#include <asm/segment.h>

Copies memory between user space and kernel space in chunks larger than one byte, word, or long. Be very careful to get the order of the arguments right!

Note: these functions may cause implicit I/O, if the memory being accessed has been swapped out, and therefore pre-emption may occur at this point. Do not include these functions in critical sections of your code, even if the critical sections are protected by cli()/sti() pairs, because implicit I/O will violate the cli() protection. If you need to get at user-space memory, copy it to kernel-space memory before you enter your critical section.

These functions take three arguments:

to
    Address to copy data to.
from
    Address to copy data from.
n
    Number of bytes to copy.

Defined in: include/asm/segment.h
See also: get_user(), put_user(), cli(), sti().

outb(), outb_p()

inline void outb(char value, unsigned short port)
inline void outb_p(char value, unsigned short port)
#include <asm/io.h>

Writes a byte to a port. outb() goes as fast as it can, while outb_p() pauses before returning. Some devices are happier if you don't write to them as fast as possible. Both functions take two arguments:

value
    The byte to write.
port
    Port to write byte to.

Defined in: include/asm/io.h
See also: inb(), inb_p().

printk()

int printk(const char* fmt, ...)
#include <linux/kernel.h>

printk() is a version of printf() for the kernel, with some restrictions. It cannot handle floats, and has a few other limitations, which are documented in kernel/vsprintf.c. It takes a variable number of arguments:

fmt
    Format string, printf() style.
...
    The rest of the arguments, printf() style.

Returns: Number of bytes written.

Note: printk() may cause implicit I/O, if the memory being accessed has been swapped out, and therefore pre-emption may occur at this point. Also, printk() will set the interrupt enable flag, so never use it in code protected by cli(). Because it causes I/O, it is not safe to use in protected code anyway, even if it didn't set the interrupt enable flag.

Defined in: kernel/printk.c.

put_user()

#define put_user(x,ptr) __put_user((unsigned long)(x),(ptr),sizeof(*(ptr)))
#include <asm/segment.h>

Allows a driver to write data in user space, which is in a different segment than the kernel. Derives the type of the arguments and the storage size automatically. This means that you have to use types correctly. Shoddy typing will simply fail to work.

Note: these functions may cause implicit I/O, if the memory being accessed has been swapped out, and therefore pre-emption may occur at this point. Do not include these functions in critical sections of your code even if the critical sections are protected by cli()/sti() pairs, because that implicit I/O will violate the integrity of your cli()/sti() pair. If you need to get at user-space memory, copy it to kernel-space memory before you enter your critical section.

These functions take two arguments:

val
    Value to write.
addr

    Address to write data to.

Defined in: asm/segment.h
See also: memcpy_*fs(), get_user(), cli(), sti().

register_*dev()

int register_chrdev(unsigned int major, const char *name, struct file_operations *fops)
int register_blkdev(unsigned int major, const char *name, struct file_operations *fops)
#include <linux/fs.h>
#include <linux/errno.h>

Registers a device with the kernel, letting the kernel check to make sure that no other driver has already grabbed the same major number. Takes three arguments:

major
    Major number of device being registered.
name
    Unique string identifying driver. Used in the output for the /proc/devices file.
fops
    Pointer to a file_operations structure for that device. This must not be NULL, or the kernel will panic later.

Returns: -EINVAL if major is >= MAX_CHRDEV or MAX_BLKDEV (defined in <linux/fs.h>), for character or block devices, respectively. -EBUSY if major device number has already been allocated. 0 on success.

Defined in: fs/devices.c
See also: unregister_*dev()

request_irq()

int request_irq(unsigned int irq, void (*handler)(int), unsigned long flags, const char *device)
#include <linux/sched.h>
#include <linux/errno.h>

Request an IRQ from the kernel, and install an IRQ interrupt handler if successful. Takes four arguments:

irq
    The IRQ being requested.
handler
    The handler to be called when the IRQ occurs. The argument to the handler function will be the

    number of the IRQ that it was invoked to handle.
flags
    Set to SA_INTERRUPT to request a ``fast'' interrupt or 0 to request a normal, ``slow'' one.
device
    A string containing the name of the device driver.

Returns: -EINVAL if irq > 15 or handler = NULL. -EBUSY if irq is already allocated. 0 on success.

If you need more functionality in your interrupt handling, use the irqaction() function. This uses most of the capabilities of the sigaction structure to provide interrupt services similar to the signal services provided by sigaction() to user-level programs.

Defined in: kernel/irq.c
See also: free_irq(), irqaction().

select_wait()

inline void select_wait(struct wait_queue **wait_address, select_table *p)
#include <linux/sched.h>

Add a process to the proper select_wait queue. This function takes two arguments:

wait_address
    Address of a wait_queue pointer to add to the circular list of waits.
p
    If p is NULL, select_wait does nothing, otherwise the current process is put to sleep. This should be the select_table *wait variable that was passed to your select() function.

Defined in: linux/sched.h
See also: *sleep_on(), wake_up*()

*sleep_on()

void sleep_on(struct wait_queue ** p)
void interruptible_sleep_on(struct wait_queue ** p)
#include <linux/sched.h>

Sleep on an event, putting a wait_queue entry in the list so that the process can be woken on that event. sleep_on() goes into an uninterruptible sleep: The only way the process can run is to be woken by wake_up(). interruptible_sleep_on() goes into an interruptible sleep: signals and process timeouts will also cause the process to wake up. A call to wake_up_interruptible() is necessary to wake up the process and allow it to continue running where it left off. Both take one argument:

p
    Pointer to a proper wait_queue structure that records the information needed to wake the process.

Defined in: kernel/sched.c
See also: select_wait(), wake_up*().

sti()

#define sti() __asm__ __volatile__ ("sti"::)
#include <asm/system.h>

Allows interrupts to be acknowledged. sti stands for ``SeT Interrupt enable''.

Defined in: asm/system.h
See also: cli().

sys_get*()

int sys_getpid(void)
int sys_getuid(void)
int sys_getgid(void)
int sys_geteuid(void)
int sys_getegid(void)
int sys_getppid(void)
int sys_getpgrp(void)

These system calls may be used to get the information described in the table below, or the information can be extracted directly from the process table, like this:

    foo = current->pid;

pid     Process ID
uid     User ID
gid     Group ID
euid    Effective user ID
egid    Effective group ID
ppid    Process ID of process' parent process
pgid    Process group ID

The system calls should not be used because they are slower and take more space. Because of this, they are no longer exported as symbols throughout the whole kernel.

Defined in: kernel/sched.c

unregister_*dev()

int unregister_chrdev(unsigned int major, const char *name)
int unregister_blkdev(unsigned int major, const char *name)
#include <linux/fs.h>
#include <linux/errno.h>

Removes the registration for a device with the kernel, letting the kernel give the major number to some other device. Takes two arguments:

major
    Major number of device being unregistered. Must be the same number given to register_*dev().
name
    Unique string identifying driver. Must be the same name given to register_*dev().

Returns: -EINVAL if major is >= MAX_CHRDEV or MAX_BLKDEV (defined in <linux/fs.h>), for character or block devices, respectively, or if there have not been file operations registered for major device major, or if name is not the same name that the device was registered with. 0 on success.

Defined in: fs/devices.c
See also: register_*dev()

wake_up*()

void wake_up(struct wait_queue ** p)
void wake_up_interruptible(struct wait_queue ** p)
#include <linux/sched.h>

Wakes up a process that has been put to sleep by the matching *sleep_on() function. wake_up() can be used to wake up tasks in a queue where the tasks may be in a TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state, while wake_up_interruptible() will only wake up tasks in a TASK_INTERRUPTIBLE state, and will be insignificantly faster than wake_up() on queues that have only interruptible tasks. These take one argument:

p
    Pointer to the wait_queue structure of the process to be woken.

Note that wake_up() does not switch tasks, it only makes processes that are woken up runnable, so that the next time schedule() is called, they will be candidates to run.

Defined in: kernel/sched.c
See also: select_wait(), *sleep_on()

Copyright (C) 1992, 1993, 1994, 1996 Michael K. Johnson, johnsonm@redhat.com.

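The MAJOR() and MINOR() entries above are plain bit arithmetic, so they can be tried outside the kernel. The following is a minimal user-space sketch, assuming the 16-bit device number layout described in those entries; make_dev() and show_dev() are hypothetical helpers added for illustration and do not come from the kernel sources quoted here.

```c
#include <stdio.h>

/* Macros exactly as quoted in the MAJOR() and MINOR() entries. */
#define MAJOR(a) (((unsigned)(a)) >> 8)
#define MINOR(a) ((a) & 0xff)

/* Hypothetical helper (an assumption, not a kernel function):
 * packs a major/minor pair into a 16-bit device number. */
unsigned short make_dev(unsigned major, unsigned minor)
{
    return (unsigned short)(((major & 0xff) << 8) | (minor & 0xff));
}

/* Hypothetical helper: prints a device number as "major:minor". */
void show_dev(unsigned short dev)
{
    printf("%u:%u\n", MAJOR(dev), (unsigned)MINOR(dev));
}
```

The upper byte carries the major number and the lower byte the minor number, which is why one shift and one mask are enough to split a device number.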

Character Device Drivers

http://ldp.iol.it/LDP/khg/HyperNews/get/devices/char.html [08/03/2001]

Initialization

Besides functions defined by the file_operations structure, there is at least one other function that you will have to write, the foo_init() function. You will have to change chr_dev_init() in drivers/char/mem.c to call your foo_init() function.

foo_init() should first call register_chrdev() to register itself and avoid device number contention. register_chrdev() takes three arguments:

int major
    This is the major number which the driver wishes to allocate.
char *name
    This is the symbolic name of the driver. This is used, among other things, to report the driver's name in the /proc filesystem.
struct file_operations *f_ops
    This is the address of your file_operations structure.

Returns: 0 if no other character device has registered with the same major number. non-0 if the call fails, presumably because another character device has already allocated that major number.

Generally, the foo_init() routine will then attempt to detect the hardware that it is supposed to be driving. It should make sure that all necessary data structures are filled out for all present hardware, and have some way of ensuring that non-present hardware does not get accessed. [Detail different ways of doing this. In particular, document the request_* and related functions.]

Interrupts vs. Polling

In a polling driver, the foo_read() and foo_write() functions are pretty easy to write. Here is an example of foo_write():

    static int foo_write(struct inode * inode, struct file * file, char * buf, int count)
    {
        unsigned int minor = MINOR(inode->i_rdev);
        int ret;            /* int, so that negative error codes survive */
        int written = 0;

        while (count > 0) {
            /* Assumption: foo_write_byte() is pseudocode; here it is
               handed the next byte, fetched from user space with
               get_user(). */
            ret = foo_write_byte(minor, get_user(buf));
            if (ret < 0) {
                foo_handle_error(WRITE, ret, minor);
                continue;   /* poll again for the same byte */
            }
            buf++; count--; written++;
        }
        return written;
    }

foo_write_byte() and foo_handle_error() are either functions defined elsewhere in foo.c or pseudocode. WRITE would be a constant or #define. It should be clear from this example how to code the foo_read() function as well.
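The polling loop described above can be tried out in user space. What follows is only a sketch under stated assumptions: struct fake_dev, fake_write_byte() and polled_write() are hypothetical names standing in for real hardware and for the guide's foo_write(); the "device" simply reports busy for a fixed number of polls after each accepted byte.

```c
#include <stddef.h>

#define FAKE_BUSY (-1)   /* device not ready yet: poll again */
#define FAKE_FULL (-2)   /* device cannot take more data: give up */

/* Hypothetical stand-in for a slow character device. */
struct fake_dev {
    int busy_polls;       /* polls that will still report busy */
    int busy_between;     /* busy polls to impose after each accepted byte */
    char out[64];         /* bytes the device has accepted */
    size_t len;           /* number of accepted bytes */
    unsigned long polls;  /* total number of polls made */
};

/* One polling attempt: accept the byte or report why not. */
int fake_write_byte(struct fake_dev *dev, char c)
{
    dev->polls++;
    if (dev->busy_polls > 0) {
        dev->busy_polls--;
        return FAKE_BUSY;
    }
    if (dev->len >= sizeof dev->out)
        return FAKE_FULL;
    dev->out[dev->len++] = c;
    dev->busy_polls = dev->busy_between;
    return 0;
}

/* Polled write: spin on the device until each byte is accepted.
 * Returns the number of bytes actually written. */
int polled_write(struct fake_dev *dev, const char *buf, int count)
{
    int written = 0;

    while (written < count) {
        int ret = fake_write_byte(dev, buf[written]);
        if (ret == FAKE_BUSY)
            continue;     /* busy: try the same byte again */
        if (ret < 0)
            break;        /* permanent failure: stop early */
        written++;
    }
    return written;
}
```

The poll counter makes the cost of polling visible: every busy poll is processor time spent waiting, which is the motivation for interrupt-driven drivers.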

Interrupt-driven drivers are a little more difficult. Here is an example of a foo_write() that is interrupt-driven:

    static int foo_write(struct inode * inode, struct file * file, char * buf, int count)
    {
        unsigned int minor = MINOR(inode->i_rdev);
        unsigned long copy_size;
        unsigned long total_bytes_written = 0;
        unsigned long bytes_written;
        struct foo_struct *foo = &foo_table[minor];

        do {
            copy_size = (count <= FOO_BUFFER_SIZE ? count : FOO_BUFFER_SIZE);
            memcpy_fromfs(foo->foo_buffer, buf, copy_size);

            while (copy_size) {
                /* initiate interrupts */

                if (some_error_has_occurred) {
                    /* handle error condition */
                }

                /* set timeout in case an interrupt has been missed */
                current->timeout = jiffies + FOO_INTERRUPT_TIMEOUT;
                interruptible_sleep_on(&foo->foo_wait_queue);

                /* collect what the interrupt handler transferred, so
                   that the inner loop makes progress */
                bytes_written = foo->bytes_xfered;
                foo->bytes_xfered = 0;
                total_bytes_written += bytes_written;
                buf += bytes_written;
                count -= bytes_written;
                copy_size -= bytes_written;

                if (current->signal & ~current->blocked) {
                    if (total_bytes_written)
                        return total_bytes_written;
                    else
                        return -EINTR; /* nothing was written, system
                                          call was interrupted, try again */
                }
            }
        } while (count > 0);

        return total_bytes_written;
    }

    static void foo_interrupt(int irq)
    {
        struct foo_struct *foo = &foo_table[foo_irq[irq]];

        /* Here, do whatever actions ought to be taken on an interrupt.
           Look at a flag in foo_table to know whether you ought to be
           reading or writing. */

        /* Increment foo->bytes_xfered by however many characters were
           read or written */

        if (buffer too full/empty)
            wake_up_interruptible(&foo->foo_wait_queue);
    }

Again, a foo_read() function is written analogously. foo_table[] is an array of structures, each of which has several members, some of which are foo_wait_queue and bytes_xfered, which can be used for both reading and writing. foo_irq[] is an array of 16 integers, and is used for looking up which entry in foo_table[] is associated with the irq generated and reported to the foo_interrupt() function.

To tell the interrupt-handling code to call foo_interrupt(), you need to use either request_irq() or irqaction(). This is either done when foo_open() is called, or if you want to keep things simple, when foo_init() is called. request_irq() is the simpler of the two, and works rather like an old-style signal handler. As described here it takes two arguments: the first is the number of the irq you are requesting, and the second is a pointer to your interrupt handler, which must take an integer argument (the irq that was generated) and have a return type of void. (The request_irq() entry in Supporting Functions documents two further arguments, flags and device.) request_irq() returns -EINVAL if irq > 15 or if the pointer to the interrupt handler is NULL, -EBUSY if that interrupt has already been taken, or 0 on success.

irqaction() works rather like the user-level sigaction(), and in fact reuses the sigaction structure. The sa_restorer() field of the sigaction structure is not used, but everything else is the same. See the entry for irqaction() in Supporting Functions, for further information about irqaction().

Copyright (C) 1992, 1993, 1994, 1996 Michael K. Johnson, johnsonm@redhat.com.
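The foo_irq[] lookup described above can be modelled in user space. This is a hedged sketch, not driver code: foo_irq_init(), foo_claim_irq() and foo_from_irq() are hypothetical helpers, FOO_MAX is an assumed table size, and foo_interrupt() takes an extra byte count here so that the model can be exercised without hardware. The error values imitate the ones the text gives for request_irq().

```c
#include <errno.h>
#include <stddef.h>

#define FOO_NR_IRQS 16   /* the text describes foo_irq[] as 16 integers */
#define FOO_MAX      4   /* assumed number of foo devices */

struct foo_struct {
    unsigned long bytes_xfered;   /* updated by the interrupt handler */
};

struct foo_struct foo_table[FOO_MAX];
int foo_irq[FOO_NR_IRQS];         /* irq -> index into foo_table, or -1 */

/* Mark every irq as unclaimed. */
void foo_irq_init(void)
{
    int i;

    for (i = 0; i < FOO_NR_IRQS; i++)
        foo_irq[i] = -1;
}

/* Hypothetical bookkeeping done when a device claims an irq. */
int foo_claim_irq(int minor, int irq)
{
    if (irq < 0 || irq >= FOO_NR_IRQS || minor < 0 || minor >= FOO_MAX)
        return -EINVAL;
    if (foo_irq[irq] != -1)
        return -EBUSY;
    foo_irq[irq] = minor;
    return 0;
}

/* Find the device an irq belongs to, or NULL if it is unclaimed. */
struct foo_struct *foo_from_irq(int irq)
{
    if (irq < 0 || irq >= FOO_NR_IRQS || foo_irq[irq] == -1)
        return NULL;
    return &foo_table[foo_irq[irq]];
}

/* Model of the handler: account for n bytes moved on this irq. */
void foo_interrupt(int irq, unsigned long n)
{
    struct foo_struct *foo = foo_from_irq(irq);

    if (foo != NULL)
        foo->bytes_xfered += n;
}
```

Keeping the mapping in a small array indexed by irq number means the handler finds its device in constant time, which matters because it runs in interrupt context.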

Block Device Drivers

http://ldp.iol.it/LDP/khg/HyperNews/get/devices/block.html [08/03/2001]

[Note: This has not been updated since changes were made in the block device interface to support block device loadable modules. The changes shouldn't make it impossible for you to apply any of this...]

To mount a filesystem on a device, it must be a block device driven by a block device driver. This means that the device must be a random access device, not a stream device. In other words, you must be able to seek to any location on the physical device at any time.

You do not provide read() and write() routines for a block device. Instead, your driver uses block_read() and block_write(), which are generic functions, provided by the VFS, which will call the strategy routine, or request() function, which you write in place of read() and write() for your driver. This strategy routine is also called by the buffer cache, which is called by the VFS routines, which is how normal files on normal filesystems are read and written.

Requests for I/O are given by the buffer cache to a routine called ll_rw_block(), which constructs lists of requests ordered by an elevator algorithm, which sorts the lists to make accesses faster and more efficient. It, in turn, calls your request() function to actually do the I/O.

Note that although SCSI disks and CDROMs are considered block devices, they are handled specially (as are all SCSI devices). Refer to Writing a SCSI Driver for details. (Although SCSI disks and CDROMs are block devices, SCSI tapes, like other tapes, are generally character devices.)

Initialization

Initialization of block devices is a bit more complex than initialization of character devices, especially as some ``initialization'' has to be done at compile time. There is also a register_blkdev() call that corresponds to the character device register_chrdev() call, which the driver must call to say that it is present, working, and active.

The file blk.h

At the top of your driver code, after all other included header files, you need to write two lines of code:

    #define MAJOR_NR DEVICE_MAJOR
    #include "blk.h"

where DEVICE_MAJOR is the major number of your device. drivers/block/blk.h requires the use of the MAJOR_NR define to set up many other defines and macros for your driver.

Now you need to edit blk.h. Under #ifdef MAJOR_NR, there is a section of defines that are conditionally included for certain major numbers, protected by #elif (MAJOR_NR == DEVICE_MAJOR). At the end of this list, you will add another section for your driver. In that section, the following lines are required:

    #define DEVICE_NAME          "device"
    #define DEVICE_REQUEST       do_dev_request
    #define DEVICE_ON(device)    /* usually blank, see below */
    #define DEVICE_OFF(device)   /* usually blank, see below */
    #define DEVICE_NR(device)    (MINOR(device))

DEVICE_NAME is simply the device name. See the other entries in blk.h for examples.

DEVICE_REQUEST is your strategy routine, which will do all the I/O on the device. See The Strategy Routine for more details on the strategy routine.

DEVICE_ON and DEVICE_OFF are for devices that need to be turned on and off, like floppies. In fact, the floppy driver is currently the only device driver which uses these defines.

DEVICE_NR(device) is used to determine the number of the physical device from the minor device number. For instance, in the hd driver, since the second hard drive starts at minor 64, DEVICE_NR(device) is defined to be (MINOR(device)>>6).

If your driver is interrupt-driven, you will also set

    #define DEVICE_INTR do_dev

which will become a variable automatically defined and used by the remainder of blk.h, specifically by the SET_INTR() and CLEAR_INTR macros.

You might also consider setting these defines:

    #define DEVICE_TIMEOUT DEV_TIMER
    #define TIMEOUT_VALUE n

where n is the number of jiffies (clock ticks; hundredths of a second on Linux/386; thousandths or so on Linux/Alpha) to time out after if no interrupt is received. These are used if your device can become ``stuck'': a condition where the driver waits indefinitely for an interrupt that will never arrive. If you define these, they will automatically be used in SET_INTR to make your driver time out. Of course, your driver will have to be able to handle the possibility of being timed out by a timer.

Recognizing PC standard partitions

[Inspect the routines in genhd.c and include detailed, correct instructions on how to use them to allow your device to use the standard dos partitioning scheme. By now, bsd disklabel and sun's SMD labelling are also supported, and I still haven't gotten around to documenting this. Shame on me--but people seem to have been able to figure it out anyway :-)]

The Buffer Cache

[Here, it should be explained briefly how ll_rw_block() is called, about getblk() and bread() and breada() and bwrite(), etc. A real explanation of the buffer cache is reserved for the VFS reference section. Jean-Marc Lugrin wrote one, but I can't find him now.]
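To make the DEVICE_NR() arithmetic for the hd driver concrete, here is a small user-space sketch. MINOR() and the >>6 shift are taken from the text; hd_drive() and hd_partition() are hypothetical helpers, and treating the low six bits of the minor number as the partition index is an assumption made for illustration.

```c
/* MINOR() as quoted in Supporting Functions. */
#define MINOR(a) ((a) & 0xff)

/* DEVICE_NR() as the text gives it for the hd driver: the second
 * drive starts at minor 64, so the drive number is minor >> 6. */
#define DEVICE_NR(device) (MINOR(device) >> 6)

/* Hypothetical helper: which physical drive a device number names. */
unsigned hd_drive(unsigned dev)
{
    return DEVICE_NR(dev);
}

/* Hypothetical helper, assuming the low six bits of the minor number
 * select the partition within the drive. */
unsigned hd_partition(unsigned dev)
{
    return MINOR(dev) & 0x3f;
}
```

With this layout, minors 0 to 63 all map to drive 0 and minors 64 to 127 to drive 1, which is what the shift by six expresses.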

The Strategy Routine

All reading and writing of blocks is done through the strategy routine. This routine takes no arguments and returns nothing, but it knows where to find a list of requests for I/O (CURRENT, defined by default as blk_dev[MAJOR_NR].current_request), and knows how to get data from the device into the blocks. It is called with interrupts disabled so as to avoid race conditions, and is responsible for turning on interrupts with a call to sti() before returning.

The strategy routine first calls the INIT_REQUEST macro, which makes sure that requests are really on the request list and does some other sanity checking. add_request() will have already sorted the requests in the proper order according to the elevator algorithm (using an insertion sort, as it is called once for every request), so the strategy routine ``merely'' has to satisfy the request, call end_request(1), which will take the request off the list, and then if there is still another request on the list, satisfy it and call end_request(1), until there are no more requests on the list, at which time it returns.

If the driver is interrupt-driven, the strategy routine need only schedule the first request to occur, and have the interrupt handler call end_request(1) and then call the strategy routine again, in order to schedule the next request. If the driver is not interrupt-driven, the strategy routine may not return until all I/O is complete.

If for some reason I/O fails permanently on the current request, end_request(0) must be called to destroy the request.

A request may be for a read or write. The driver determines whether a request is for a read or write by examining CURRENT->cmd. If CURRENT->cmd == READ, the request is for a read, and if CURRENT->cmd == WRITE, the request is for a write. If the device has separate interrupt routines for handling reads and writes, SET_INTR(n) must be called to assure that the proper interrupt routine will be called.

[Here I need to include samples of both a polled strategy routine and an interrupt-driven one. The interrupt-driven one should provide separate read and write interrupt routines to show the use of SET_INTR.]

Copyright (C) 1992, 1993, 1994, 1996 Michael K. Johnson, johnsonm@redhat.com.

Messages

1. non-block-cached block device? by Neal Tucker
2. Shall I explain elevator algorithm (+sawtooth etc) by Michael De La Rue
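The control flow of a polled (non-interrupt-driven) strategy routine, as described in The Strategy Routine above, can be modeled in user space. This is a sketch under stated assumptions, not the sample the guide says is still missing: struct request here is a simplified stand-in for the kernel structure, current_request stands in for CURRENT, the in-memory fake_disk replaces real hardware, and end_request() only mimics the take-it-off-the-list behavior the text describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define READ        0
#define WRITE       1
#define SECTOR_SIZE 512
#define NR_SECTORS  8

/* Simplified, hypothetical stand-in for the kernel's struct request. */
struct request {
    int cmd;                /* READ or WRITE */
    unsigned int sector;    /* sector to transfer */
    char *buffer;           /* SECTOR_SIZE bytes of data */
    int errors;             /* set when the request failed */
    struct request *next;   /* next request; list is already sorted */
};

static struct request *current_request;  /* models CURRENT */
static char fake_disk[NR_SECTORS][SECTOR_SIZE];
static int completed, failed;

/* Models end_request(): record the outcome, take the request off the list. */
static void end_request(int uptodate)
{
    struct request *req = current_request;

    assert(req != NULL);
    if (uptodate) {
        completed++;
    } else {
        req->errors = 1;
        failed++;
    }
    current_request = req->next;
}

/* Polled strategy routine: does not return until the list is empty. */
static void do_dev_request(void)
{
    while (current_request != NULL) {
        struct request *req = current_request;

        if (req->sector >= NR_SECTORS) {
            end_request(0);          /* permanent failure */
            continue;
        }
        if (req->cmd == READ) {
            memcpy(req->buffer, fake_disk[req->sector], SECTOR_SIZE);
        } else if (req->cmd == WRITE) {
            memcpy(fake_disk[req->sector], req->buffer, SECTOR_SIZE);
        } else {
            end_request(0);          /* unknown command */
            continue;
        }
        end_request(1);              /* satisfied; move to the next one */
    }
}
```

An interrupt-driven driver would instead start only the first request here and let its interrupt handler call end_request(1) and re-enter the routine; that variant is not modeled.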

Annotated Bibliography

This annotated bibliography covers books on operating system theory as well as different kinds of programming in a Unix environment. The price marked may or may not be an exact price, but should be close enough for government work. If you have a book that you think should go in the bibliography, please write a short review of it and send all the necessary information (title, author, publisher, ISBN, and approximate price) and the review to johnsonm@redhat.com

The Design of the UNIX Operating System
Author: Maurice J. Bach
Publisher: Prentice Hall, 1986
ISBN: 0-13-201799-7
Price: $65.00

This is one of the books that Linus used to design Linux. It is a description of the data structures used in the System V kernel. Many of the names of the important functions in the Linux source come from this book, and are named after the algorithms presented here. For instance, if you can't quite figure out what exactly getblk(), brelse(), bread(), breada(), and bwrite() are, chapter 3 explains very well.

While most of the algorithms are similar or the same, a few differences are worth noting:

  o The Linux buffer cache is dynamically resized, so the algorithm for dealing with getting new buffers is a bit different. Therefore the above referenced explanation of getblk() is a little different than the getblk() in Linux.
  o Linux does not currently use streams, and if/when streams are implemented for Linux, they are likely to have somewhat different semantics.
  o The semantics and calling structure for device drivers is different. The concept is similar, and the chapter on device drivers is still worth reading, but for details on the device driver structures, the KHG is the proper reference.
  o The memory management algorithms are somewhat different.

There are other small differences as well, but a good understanding of this text will help you understand the Linux source.

Advanced Programming in the UNIX Environment
Author: W. Richard Stevens
Publisher: Addison Wesley, 1992
ISBN: 0-201-56317-7
Price: $50.00

This excellent tome covers the stuff you really have to know to write real Unix programs. It includes a

discussion of the various standards for Unix implementations, including POSIX, X/Open XPG3, and FIPS, and concentrates on two implementations, SVR4 and pre-release 4.4 BSD, which it refers to as 4.3+BSD. The book concentrates heavily on application and fairly complete specification, and notes which features relate to which standards and releases.

The chapters include: Unix Standardization and Implementations, File I/O, Files and Directories, Standard I/O Library, System Data Files and Information, The Environment of a Unix Process, Process Control, Process Relationships, Signals, Terminal I/O, Advanced I/O (non-blocking, streams, async, memory-mapped, etc.), Daemon Processes, Interprocess Communication, Advanced Interprocess Communication, and some example applications, including chapters on A Database Library, Communicating with a PostScript Printer, A Modem Dialer, and then a seemingly misplaced final chapter on Pseudo Terminals.

I have found that this book makes it possible for me to write usable programs for Unix. It will help you achieve POSIX compliance in ways that won't break SVR4 or BSD, as a general rule. This book will save you ten times its cost in frustration.

Advanced 80386 Programming Techniques
Author: James L. Turley
Publisher: Osborne McGraw-Hill, 1988
ISBN: 0-07-881342-5
Price: $22.95

This book covers the 80386 quite well, without touching on any other hardware. Some code samples are included. All major features are covered, as are many of the concepts needed. The chapters of this book are: Basics, Memory Segmentation, Privilege Levels, Paging, Multitasking, Communicating Among Tasks, Handling Faults and Interrupts, 80286 Emulation, 8086 Emulation, Debugging, The 80387 Numeric Processor Extension, Programming for Performance, Reset and Real Mode, Hardware, and a few appendices, including tables of the memory management structures as a handy reference.

The author has a good writing style: If you are technically minded, you will find yourself caught up just reading this book. One strong feature of this book for Linux is that the author is very careful not to explain how to do things under DOS, nor how to deal with particular hardware. In fact, the only times he mentions DOS and PC-compatible hardware are in the introduction, where he promises never to mention them again.

The C Programming Language, second edition
Author: Brian W. Kernighan and Dennis M. Ritchie
Publisher: Prentice Hall, 1988
ISBN: 0-13-110362-8 (paper) 0-13-110370-9 (hard)
Price: $35.00

The C programming bible. Includes a C tutorial, Unix interface reference, C reference, and standard library reference.

You program in C, you buy this book. It's that simple.

Operating Systems: Design and Implementation
Author: Andrew S. Tanenbaum
Publisher: Prentice Hall, 1987
ISBN: 0-13-637406-9
Price: $50.00

This book, while a little simplistic in spots, and missing some important ideas, is a fairly clear exposition of what it takes to write an operating system. Half the book is taken up with the source code to a Unix clone called Minix, which is based on a microkernel, unlike Linux, which sports a monolithic design. It has been said that Minix shows that it is possible to write a microkernel-based Unix, but does not adequately explain why one would do so.

Linux was originally intended to be a free Minix replacement (Linus' Minix, Linus tells us). In fact, it was originally to be binary-compatible with Minix-386. Minix-386 was the development environment under which Linux was bootstrapped. No Minix code is in Linux, but vestiges of this heritage live on in such things as the minix filesystem in Linux.

However, this book might still prove worthwhile for those who want a basic explanation of OS concepts, as Tanenbaum's explanations of the basic concepts remain some of the clearer (and more entertaining, if you like to be entertained) available. Unfortunately, basic is the key word here, as many things such as virtual memory are not covered at all.

Modern Operating Systems
Author: Andrew S. Tanenbaum
Publisher: Prentice Hall, 1992
ISBN: 0-13-588187-0
Price: $51.75

The first half of this book is a rewrite of Tanenbaum's earlier Operating Systems, but this book covers several things that the earlier book missed, including such things as virtual memory. Minix is not included, but overviews of MS-DOS and several distributed systems are. This book is probably more useful to someone who wants to do something with his or her knowledge than Tanenbaum's earlier Operating Systems: Design and Implementation. Some clue as to the reason may be found in the title... However, what DOS is doing in a book on modern operating systems, many have failed to discover.

Operating Systems
Author: William Stallings
Publisher: Macmillan, 1992 (800-548-9939)
ISBN: 0-02-415481-4
Price: No one at Macmillan could find one.

A very thorough text on operating systems, this book gives more in-depth coverage of the topics covered in Tanenbaum's books, and covers more topics, in a much brisker style. This book covers all the major topics that you would need to know to build an operating system, and does so in a clear way. The author uses examples from three major systems, comparing and contrasting them: Unix, OS/2, and MVS. With each topic covered, these example systems are used to clarify the points and provide an example of an implementation.

Topics covered in Operating Systems include threads, real-time systems, multiprocessor scheduling, distributed systems, process migration, and security, as well as the standard topics like memory management and scheduling. The section on distributed processing appears to be up-to-date, and I found it very helpful.

UNIX Network Programming
Author: W. Richard Stevens
Publisher: Prentice Hall, 1990
ISBN: 0-13-949876-1
Price: $48.75

This book covers several kinds of networking under Unix, and provides very thorough references to the forms of networking that it does not cover directly. It covers TCP/IP and XNS most heavily, and fairly exhaustively describes how all the calls work. It also has a description and sample code using System V's TLI, and pretty complete coverage of System V IPC. This book contains a lot of source code examples to get you started, and many useful procedures. One example is code to provide usable semaphores, based on the partially broken implementation that System V provides.

The UNIX Programming Environment
Author: Brian W. Kernighan and Robert Pike
Publisher: Prentice Hall, 1984
ISBN: 0-13-937699 (hardcover) 0-13-937681-X (paperback)
Price: ?

Writing UNIX Device Drivers
Author: George Pajari
Publisher: Addison Wesley, 1992
ISBN: 0-201-52374-4
Price: $32.95

This book is written by the President and founder of Driver Design Labs, a company which specializes in the development of Unix device drivers. This book is an excellent introduction to the sometimes wacky world of device driver design. The four basic types of drivers (character, block, tty, STREAMS) are first discussed briefly. Many full examples of device drivers of all types are given, starting with the simplest

and progressing in complexity. All examples are of drivers which deal with Unix on PC-compatible hardware.

Chapters include:

  Character Drivers I: A Test Data Generator
  Character Drivers II: An A/D Converter
  Character Drivers III: A Line Printer
  Block Drivers I: A Test Data Generator
  Block Drivers II: A RAM Disk Driver
  Block Drivers III: A SCSI Disk Driver
  Character Drivers IV: The Raw Disk Driver
  Terminal Drivers I: The COM1 Port
  Character Drivers V: A Tape Drive
  STREAMS Drivers I: A Loop-Back Driver
  STREAMS Drivers II: The COM1 Port (Revisited)
  Driver Installation
  Zen and the Art of Device Driver Writing

Although many of the calls used in the book are not Linux-compatible, the general idea is there, and many of the ideas map directly into Linux.

Copyright (C) 1992, 1993, 1996 Michael K. Johnson, johnsonm@redhat.com.

Messages

1. Please replace K&R reference by Harbison/Steele by Markus Kuhn
   1. Replace, no; supplement, yes by Michael K. Johnson
      -> Right you are Mike! by rohit patil
2. 80386 book is apparently out of print now by Austin Donnelly
   1. Very unfortunate by Michael K. Johnson
3. Linux Kernel Internals-> Kernel MM IPC fs drivers net modules by Alex Stewart

using XX_select() for device without interrupts

Forum: Device Driver Basics
Keywords: select interrupts polling sleeping
Date: Thu, 25 Jul 1996 14:59:48 GMT
From: Elwood Downey <ecdowney@noao.edu>

Hello. I have need for a select() entry point in my driver but my device is not using interrupts so I'm not sure how to have the os call my select() to let me poll the device. I think I must use a timer, with select_wait(), but the device is not using interrupts, so the system will call my select() until my device becomes active.

Below is my select() code. I use wake_up_interruptible() as the function the timer will call in the future to just make this process runnable again, in lieu of calling it from an interrupt service routine. The trouble is I can not seem to get the timer to work. The entire system hangs _solid_ whenever it gets activated. Remember, the only real goal here is some way to get the os to call us occasionally to let us poll the device.

A few specific questions:

1) my driver permits several processes to have the device open at once. Am I correct in assuming that if this general approach works I will need a separate timer_list and wait_queue for each open process instance?

2) In no examples do I ever see the wait_queue pointer ever _set_ to point at an actual wait_queue instance. Is this correct?

Any comments would be greatly appreciated. Thank you in advance.

Elwood Downey

    static int
    pc39_select (struct inode *inode, struct file *file, int sel_type,
                 select_table *wait)
    {
        static struct timer_list pc39_tl;
        static struct wait_queue *pc39_wq;

        switch (sel_type) {
        case SEL_EX:
            return (0);     /* never any exceptions */
        case SEL_IN:
            if (IBF())
                return (1);
            break;
        case SEL_OUT:
            if (TBE())
                return (1);
            break;
        }

        /* nothing ready -- set timer to try again later if necessary */
        if (wait) {
            init_timer (&pc39_tl);
            pc39_tl.expires = PC39_SELTO;
            pc39_tl.function = (void(*)(unsigned long))wake_up_interruptible;
            pc39_tl.data = (unsigned long) &pc39_wq;
            add_timer (&pc39_tl);
            select_wait (&pc39_wq, wait);
        }

        return (0);
    }

found reason for select() problem

Forum: Device Driver Basics
Keywords: select add_timer() del_timer()
Date: Wed, 13 Nov 1996 14:45:59 GMT
From: <unknown>

Hello again. Evidently not many folks read this -- no responses after 4 months -- so I'll answer my own question :-)

There were several problems with the original approach.

1) call del_timer(&pc39_tl) before starting a new one.

2) always call select_wait (&pc39_wq, wait), not just when wait != 0.

3) pc39_tl.expires is the jiffy to wake up on, not the number of elapsed jiffies as it says in the KHG. so, it should be: pc39_tl.expires = jiffies + PC39_SELTO.

These were all discovered through trial-and-error so I suppose there might still be other theoretical problems but at least now everything seems to work. If this is getting too hard to follow, I'll be happy to send you the whole driver. Hope this helps someone else someday.

Elwood Downey
ecdowney@noao.edu
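Points 1) and 3) of the post above come down to simple arithmetic on the jiffies counter, which can be illustrated outside the kernel. The following is a user-space model only: struct model_timer, rearm_timer() and ticks_until_due() are hypothetical names invented for this sketch, and the jiffies variable is an ordinary global rather than the kernel's counter. It assumes, as the poster reports, that expires holds an absolute tick value.

```c
#include <assert.h>

/* User-space stand-in for the kernel's tick counter (assumption). */
static unsigned long jiffies;

/* Hypothetical, reduced model of a kernel timer. */
struct model_timer {
    unsigned long expires;  /* absolute jiffy at which to fire */
    int pending;            /* nonzero while the timer is queued */
};

/* Models "del_timer() then add_timer()": rearm with an absolute time. */
static void rearm_timer(struct model_timer *t, unsigned long timeout)
{
    assert(timeout > 0);
    t->pending = 0;                   /* drop any previous instance */
    t->expires = jiffies + timeout;   /* absolute, not elapsed, ticks */
    t->pending = 1;
}

/* Ticks left before a queued timer fires; 0 means it is already due. */
static unsigned long ticks_until_due(const struct model_timer *t)
{
    assert(t->pending);
    if ((long)(t->expires - jiffies) <= 0)
        return 0;
    return t->expires - jiffies;
}
```

With jiffies at 100000, a timer whose expires field was set to a bare timeout of 10 is already due, while rearm_timer(&t, 10) yields an expiry of 100010, ten ticks in the future.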

Why do VFS functions get both structs inode and file?

Forum: Device Driver Basics
Date: Thu, 09 Jan 1997 05:47:10 GMT
From: Reinhold J. Gerharz <rgerharz@erols.com>

It appears that "struct file" contains a "struct inode *", yet both are passed to the VFS functions. Why not simply pass "struct file *" alone?

release() method called when close is called

Forum: Character Device Drivers
Keywords: release method close fclose
Date: Sat, 26 Apr 1997 03:07:04 GMT
From: <unknown>

I just finished a character device driver and it appears that when fclose() is called on the device the release() method is called as well even if the device has been opened multiple times.

return value of foo_write(...)

Forum: Character Device Drivers
Keywords: return values
Date: Fri, 25 Apr 1997 21:42:46 GMT
From: My name here <wicksr@swami.indy.tce.com>

In this section I noticed the example foo_write function returns 0 all the time. If I do this as well with my driver and do this:

    echo "test" > /dev/foo_drv

the foo_write() function gets called indefinitely. Furthermore, I have noticed that the source of serial.c (from /usr/src/linux-2.0/drivers/char) always returns the number of characters transmitted. Do you have a typo?

Also, why isn't there a DEFINITIVE list of return values for all functions? This is a bit confusing, but still much better than programming under NT :).

Thanks for the documentation anyhow!

-Rich
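One plausible reading of the behavior reported above is that the writing program keeps retrying until every byte has been accepted, so a write method that always reports 0 bytes consumed never lets it finish. The sketch below models that in user space; it is an illustration, not kernel or shell source. The write_fn type, the two stand-in write methods, and write_all() with its max_calls cap are all hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical driver write methods, reduced to their return values. */
typedef long (*write_fn)(const char *buf, size_t count);

static long write_returns_zero(const char *buf, size_t count)
{
    (void)buf;
    (void)count;
    return 0;               /* reports that no bytes were consumed */
}

static long write_returns_count(const char *buf, size_t count)
{
    (void)buf;
    return (long)count;     /* reports that every byte was consumed */
}

/* Models a caller that retries until all bytes are written.
 * Returns the number of calls made, or -1 if it gave up. */
static int write_all(write_fn fn, const char *buf, size_t count,
                     int max_calls)
{
    int calls = 0;

    assert(fn != NULL && max_calls > 0);
    while (count > 0) {
        long n;

        if (calls == max_calls)
            return -1;      /* no progress: a real caller might spin */
        n = fn(buf, count);
        calls++;
        if (n < 0)
            return -1;      /* error reported by the write method */
        buf += n;
        count -= (size_t)n;
    }
    return calls;
}
```

With write_returns_count the loop finishes after one call; with write_returns_zero it only stops because of the artificial max_calls cap, which mirrors the endless calls described in the post.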

TTY drivers

Forum: Character Device Drivers
Keywords: serial tty section
Date: Fri, 27 Sep 1996 18:48:12 GMT
From: Daniel Taylor <danielt@dgii.com>

It is noted in several places that there is no section for serial drivers. As the number of these drivers is increasing, even a bodiless section of the KHG would be useful, it can be entirely filled online, and yet in this new medium there is not even a pointer to get started from.

Messages

1. Is anything in the works? If not ... by Andrew Manison

Is anything in the works? If not ...

Forum: Character Device Drivers
Re: TTY drivers (Daniel Taylor)
Keywords: serial tty section
Date: Fri, 13 Dec 1996 04:28:33 GMT
From: Andrew Manison <amanison@america.net>

I am in the process of writing a device driver for an intelligent multiport serial I/O controller. I am willing to write a section on tty drivers for the KHG if no-one else is. Let me know!

Writing a SCSI Device Driver

Copyright (C) 1993 Rickard E. Faith (faith@cs.unc.edu). Written at the University of North Carolina, 1993, for COMP-291. The information contained herein comes with ABSOLUTELY NO WARRANTY. All rights reserved. Permission is granted to make and distribute verbatim copies of this paper provided the copyright notice and this permission notice are preserved on all copies.

This is (with the author's explicit permission) a modified copy of the original document. If you wish to reproduce this document, you are advised to get the original version by ftp from ftp://ftp.cs.unc.edu/pub/users/faith/papers/scsi.paper.tar.gz

[Note that this document has not been revised since its copyright date of 1993. Most things still apply, but some of the facts like the list of currently supported SCSI host adaptors are rather out of date by now.]

Why You Want to Write a SCSI Driver

Currently, the Linux kernel contains drivers for the following SCSI host adapters: Adaptec 1542, Adaptec 1740, Future Domain TMC-1660/TMC-1680, Seagate ST-01/ST-02, UltraStor 14F, and Western Digital WD-7000. You may want to write your own driver for an unsupported host adapter. You may also want to re-write or update one of the existing drivers.

What is SCSI?

The foreword to the SCSI-2 standard draft [ANS] gives a succinct definition of the Small Computer System Interface and briefly explains how SCSI-2 is related to SCSI-1 and CCS:

The SCSI protocol is designed to provide an efficient peer-to-peer I/O bus with up to 8 devices, including one or more hosts. Data may be transferred asynchronously at rates that only depend on device implementation and cable length. Synchronous data transfers are supported at rates up to 10 mega-transfers per second. With the 32 bit wide data transfer option, data rates of up to 40 megabytes per second are possible.

SCSI-2 includes command sets for magnetic and optical disks, tapes, printers, processors, CD-ROMs, scanners, medium changers, and communications devices.

In 1985, when the first SCSI standard was being finalized as an American National Standard, several manufacturers approached the X3T9.2 Task Group. They wanted to increase the mandatory requirements of SCSI and to define further features for direct-access devices. Rather than delay the SCSI standard, X3T9.2 formed an ad hoc group to develop a working paper that was eventually called the Common Command Set (CCS). Many disk products were designed using this working paper in conjunction with the SCSI standard.

In parallel with the development of the CCS working paper, X3T9.2 began work on an enhanced SCSI standard which was named SCSI-2. SCSI-2 included the results of the CCS working paper and extended them to all device types. It also added caching commands,

performance enhancement features, and other functions that X3T9.2 deemed worthwhile. While SCSI-2 has gone well beyond the original SCSI standard (now referred to as SCSI-1), it retains a high degree of compatibility with SCSI-1 devices.

SCSI phases

The ``SCSI bus'' transfers data and state information between interconnected SCSI devices. A single transaction between an ``initiator'' and a ``target'' can involve up to 8 distinct ``phases.'' These phases are almost entirely determined by the target (e.g., the hard disk drive). The current phase can be determined from an examination of five SCSI bus signals, as shown in this table [LXT91, p. 57].

    -SEL   -BSY   -MSG   -C/D   -I/O   PHASE
    HI     HI     ?      ?      ?      BUS FREE
    HI     LO     ?      ?      ?      ARBITRATION
    I      I&T    ?      ?      ?      SELECTION
    T      I&T    ?      ?      ?      RESELECTION
    HI     LO     HI     HI     HI     DATA OUT
    HI     LO     HI     HI     LO     DATA IN
    HI     LO     HI     LO     HI     COMMAND
    HI     LO     HI     LO     LO     STATUS
    HI     LO     LO     LO     HI     MESSAGE OUT
    HI     LO     LO     LO     LO     MESSAGE IN

    I = Initiator Asserts, T = Target Asserts, ? = HI or LO

Some controllers (notably the inexpensive Seagate controller) require direct manipulation of the SCSI bus--other controllers automatically handle these low-level details. Each of the eight phases will be described in detail.

BUS FREE Phase

The BUS FREE phase indicates that the SCSI bus is idle and is not currently being used.

ARBITRATION Phase

The ARBITRATION phase is entered when a SCSI device attempts to gain control of the SCSI bus. Arbitration can start only if the bus was previously in the BUS FREE phase. During arbitration, the arbitrating device asserts its SCSI ID on the DATA BUS. For example, if the arbitrating device's SCSI ID is 2, then the device will assert 0x04. If multiple devices attempt simultaneous arbitration, the device with the highest SCSI ID will win. Although ARBITRATION is optional in the SCSI-1 standard, it is a required phase in the SCSI-2 standard.

SELECTION Phase

After ARBITRATION, the arbitrating device (now called the initiator) asserts the SCSI ID of the target on the DATA BUS. The target, if present, will acknowledge the selection by raising the -BSY line. This line remains active as long as the target is connected to the initiator.

RESELECTION Phase

The SCSI protocol allows a device to disconnect from the bus while processing a request. When the device is ready, it reconnects to the host adapter. The RESELECTION phase is identical to the SELECTION phase, with the exception that it is used by the disconnected target to reconnect to the original initiator. Drivers which do not currently support RESELECTION do not allow the SCSI target to disconnect. RESELECTION should be supported by all drivers, however, so that multiple SCSI devices can simultaneously process commands. This allows dramatically increased throughput due to interleaved I/O requests.

COMMAND Phase

During this phase, 6, 10, or 12 bytes of command information are transferred from the initiator to the target.

DATA OUT and DATA IN Phases

During these phases, data are transferred between the initiator and the target. For example, the DATA OUT phase transfers data from the host adapter to the disk drive. The DATA IN phase transfers data from the disk drive to the host adapter. If the SCSI command does not require data transfer, then neither phase is entered.

STATUS Phase

This phase is entered after completion of all commands, and allows the target to send a status byte to the initiator. There are nine valid status bytes, as shown in the table below [ANS, p. 77]. Note that since bits 1-5 (bit 0 is the least significant bit) are used for the status code (the other bits are reserved), the status byte should be masked with 0x3e before being examined.

    Value*   Status
    0x00     GOOD
    0x02     CHECK CONDITION
    0x04     CONDITION MET
    0x08     BUSY
    0x10     INTERMEDIATE
    0x14     INTERMEDIATE-CONDITION MET
    0x18     RESERVATION CONFLICT
    0x22     COMMAND TERMINATED
    0x28     QUEUE FULL

    *After masking with 0x3e

The meanings of the three most important status codes are outlined below:

GOOD
    The operation completed successfully.

CHECK CONDITION
    An error occurred. The REQUEST SENSE command should be used to find out more information about the error (see SCSI Commands).

BUSY

    The device was unable to accept a command. This may occur during a self-test or shortly after power-up.

MESSAGE OUT and MESSAGE IN Phases

Additional information is transferred between the target and the initiator. This information may regard the status of an outstanding command, or may be a request for a change of protocol. Multiple MESSAGE IN and MESSAGE OUT phases may occur during a single SCSI transaction. If RESELECTION is supported, the driver must be able to correctly process the SAVE DATA POINTERS, RESTORE POINTERS, and DISCONNECT messages. Although required by the SCSI-2 standard, some devices do not automatically send a SAVE DATA POINTERS message prior to a DISCONNECT message.

SCSI Commands

Each SCSI command is 6, 10, or 12 bytes long. The following commands must be well understood by a SCSI driver developer.

REQUEST SENSE

Whenever a command returns a CHECK CONDITION status, the high-level Linux SCSI code automatically obtains more information about the error by executing the REQUEST SENSE. This command returns a sense key and a sense code (called the ``additional sense code,'' or ASC, in the SCSI-2 standard [ANS]). Some SCSI devices may also report an ``additional sense code qualifier'' (ASCQ). The 16 possible sense keys are described in the next table. For information on the ASC and ASCQ, please refer to the SCSI standard [ANS] or to a SCSI device technical manual.

    Sense Key   Description
    0x00        NO SENSE
    0x01        RECOVERED ERROR
    0x02        NOT READY
    0x03        MEDIUM ERROR
    0x04        HARDWARE ERROR
    0x05        ILLEGAL REQUEST
    0x06        UNIT ATTENTION
    0x07        DATA PROTECT
    0x08        BLANK CHECK
    0x09        (Vendor specific error)
    0x0a        COPY ABORTED
    0x0b        ABORTED COMMAND
    0x0c        EQUAL
    0x0d        VOLUME OVERFLOW
    0x0e        MISCOMPARE
    0x0f        RESERVED
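The phase table, the arbitration example, the status byte mask and the sense key table from this document can be turned into small lookup helpers. The sketch below is illustrative only and is not taken from the Linux SCSI code: all function names are made up for this example, and scsi_phase_name() assumes its caller passes a nonzero argument for each signal that reads LO in the phase table, with -SEL at HI and -BSY at LO (the information-transfer rows).

```c
#include <assert.h>
#include <string.h>

#define STATUS_MASK 0x3e  /* bits 1-5 carry the status code */

/* Name of a status byte, after masking off the reserved bits. */
static const char *scsi_status_name(unsigned char status_byte)
{
    switch (status_byte & STATUS_MASK) {
    case 0x00: return "GOOD";
    case 0x02: return "CHECK CONDITION";
    case 0x04: return "CONDITION MET";
    case 0x08: return "BUSY";
    case 0x10: return "INTERMEDIATE";
    case 0x14: return "INTERMEDIATE-CONDITION MET";
    case 0x18: return "RESERVATION CONFLICT";
    case 0x22: return "COMMAND TERMINATED";
    case 0x28: return "QUEUE FULL";
    default:   return "(not a valid status)";
    }
}

/* Sense key names, indexed by the 4-bit sense key. */
static const char *const sense_key_names[16] = {
    "NO SENSE", "RECOVERED ERROR", "NOT READY", "MEDIUM ERROR",
    "HARDWARE ERROR", "ILLEGAL REQUEST", "UNIT ATTENTION", "DATA PROTECT",
    "BLANK CHECK", "(Vendor specific error)", "COPY ABORTED",
    "ABORTED COMMAND", "EQUAL", "VOLUME OVERFLOW", "MISCOMPARE", "RESERVED"
};

static const char *scsi_sense_key_name(unsigned int sense_key)
{
    assert(sense_key < 16);
    return sense_key_names[sense_key];
}

/* Phase from -MSG, -C/D, -I/O; each argument is nonzero when LO. */
static const char *scsi_phase_name(int msg_lo, int cd_lo, int io_lo)
{
    static const char *const names[8] = {
        "DATA OUT", "DATA IN", "COMMAND", "STATUS",
        "(not in table)", "(not in table)", "MESSAGE OUT", "MESSAGE IN"
    };
    int index = (msg_lo ? 4 : 0) | (cd_lo ? 2 : 0) | (io_lo ? 1 : 0);

    return names[index];
}

/* Bit a device asserts on the DATA BUS during ARBITRATION. */
static unsigned int scsi_id_bit(unsigned int scsi_id)
{
    assert(scsi_id < 8);
    return 1u << scsi_id;
}
```

For instance, scsi_id_bit(2) gives 0x04, matching the arbitration example, and a status byte of 0x03 still decodes as CHECK CONDITION because the reserved bit 0 is masked away.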

TEST UNIT READY

This command is used to test the target's status. If the target can accept a medium-access command (e.g., a READ or a WRITE), the command returns with a GOOD status. Otherwise, the command returns with a CHECK CONDITION status and a sense key of NOT READY. This response usually indicates that the target is completing power-on self-tests.

INQUIRY

This command returns the target's make, model, and device type. The high-level Linux code uses this command to differentiate among magnetic disks, optical disks, and tape drives (the high-level code currently does not support printers, processors, or juke boxes).

READ and WRITE

These commands are used to transfer data from and to the target. You should be sure your driver can support simpler commands, such as TEST UNIT READY and INQUIRY, before attempting to use the READ and WRITE commands.

Getting Started

The author of a low-level device driver will need to have an understanding of how interrupts are handled by the kernel. At minimum, the kernel functions that disable (cli()) and enable (sti()) interrupts should be understood. The scheduling functions (e.g., schedule(), sleep_on(), and wake_up()) may also be needed by some drivers. A detailed explanation of these functions can be found in Supporting Functions.

Before You Begin: Gathering Tools

Before you begin to write a SCSI driver for Linux, you will need to obtain several resources.

The most important is a bootable Linux system--preferably one which boots from an IDE, RLL, or MFM hard disk. During the development of your new SCSI driver, you will rebuild the kernel and reboot your system many times. Programming errors may result in the destruction of data on your SCSI drive and on your non-SCSI drive. Back up your system before you begin.

The installed Linux system can be quite minimal: the GCC compiler distribution (including libraries and the binary utilities), an editor, and the kernel source are all you need. Additional tools like od, hexdump, and less will be quite helpful. All of these tools will fit on an inexpensive 20-30 MB hard disk. (A used 20 MB MFM hard disk and controller should cost less than US$100.)

Documentation is essential. At minimum, you will need a technical manual for your host adapter. Since Linux is freely distributable, and since you (ideally) want to distribute your source code freely, avoid non-disclosure agreements (NDA). Most NDA's will prohibit you from releasing your source code--you might be allowed to release an object file containing your driver, but this is simply not acceptable in the Linux community at this time.

A manual that explains the SCSI standard will be helpful. Usually the technical manual for your disk drive will be sufficient, but a copy of the SCSI standard will often be helpful. (The October 17, 1991, draft of the SCSI-2 standard document is available via anonymous ftp from sunsite.unc.edu in /pub/Linux/development/scsi-2.tar.Z, and is available for purchase from Global

Engineering Documents (2805 McGaw, Irvine, CA 92714), (800)-854-7179 or (714)-261-1455. Please refer to document X3.131-199X. In early 1993, the manual cost US$60--70.)

Before you start, make hard copies of hosts.h, scsi.h, and one of the existing drivers in the Linux kernel. These will prove to be useful references while you write your driver.

The Linux SCSI Interface
The high-level SCSI interface in the Linux kernel manages all of the interaction between the kernel and the low-level SCSI device driver. Because of this layered design, a low-level SCSI driver need only provide a few basic services to the high-level code. The author of a low-level driver does not need to understand the intricacies of the kernel I/O system and, hence, can write a low-level driver in a relatively short amount of time.

Two main structures (Scsi_Host and Scsi_Cmnd) are used to communicate between the high-level code and the low-level code. The next two sections provide detailed information about these structures and the requirements of the low-level driver.

The Scsi_Host Structure
The Scsi_Host structure serves to describe the low-level driver to the high-level code. Usually, this description is placed in the device driver's header file in a C preprocessor definition:

    #define FDOMAIN_16X0  { "Future Domain TMC-16x0",  \
                            fdomain_16x0_detect,       \
                            fdomain_16x0_info,         \
                            fdomain_16x0_command,      \
                            fdomain_16x0_queue,        \
                            fdomain_16x0_abort,        \
                            fdomain_16x0_reset,        \
                            NULL,                      \
                            fdomain_16x0_biosparam,    \
                            1, 6, 64, 1, 0, 0 }

The Scsi_Host structure is presented next. Each of the fields will be explained in detail later in this section. The initializer above is positional, so its entries must appear in the same order as the fields below.

    typedef struct
    {
      char               *name;
      int                (* detect)(int);
      const char         *(* info)(void);
      int                (* command)(Scsi_Cmnd *);
      int                (* queuecommand)(Scsi_Cmnd *,
                                          void (*done)(Scsi_Cmnd *));
      int                (* abort)(Scsi_Cmnd *, int);
      int                (* reset)(void);
      int                (* slave_attach)(int, int);
      int                (* bios_param)(int, int, int []);
      int                can_queue;
      int                this_id;
      short unsigned int sg_tablesize;
      short              cmd_per_lun;
      unsigned           present:1;
      unsigned           unchecked_isa_dma:1;
    } Scsi_Host;

Variables in the Scsi_Host structure
In general, the variables in the Scsi_Host structure are not used until after the detect() function (see section detect()) is called. Therefore, any variables which cannot be assigned before host adapter detection should be assigned during detection. This situation might occur, for example, if a single driver provided support for several host adapters with very similar characteristics. Some of the parameters in the Scsi_Host structure might then depend on the specific host adapter detected.

name
name holds a pointer to a short description of the SCSI host adapter.

can_queue
can_queue holds the number of outstanding commands the host adapter can process. Unless RESELECTION is supported by the driver and the driver is interrupt-driven, this variable should be set to 1. (Some of the early Linux drivers were not interrupt driven and, consequently, had very poor performance.)

this_id
Most host adapters have a specific SCSI ID assigned to them. This SCSI ID, usually 6 or 7, is used for RESELECTION. The this_id variable holds the host adapter's SCSI ID. If the host adapter does not have an assigned SCSI ID, this variable should be set to -1 (in this case, RESELECTION cannot be supported).

sg_tablesize
The high-level code supports ``scatter-gather,'' a method of increasing SCSI throughput by combining many small SCSI requests into a few large SCSI requests. Most SCSI disk drives are formatted with 1:1 interleave (``1:1 interleave'' means that all of the sectors in a single track appear consecutively on the disk surface), so the time required to perform the SCSI ARBITRATION and SELECTION phases is longer than the rotational latency time between sectors. (This may be an over-simplification. On older devices, the actual command processing can be significant. Further, there is a great deal of layered overhead in the kernel: the high-level SCSI code, the buffering code, and the file-system code all contribute to poor SCSI performance.) Therefore, only one SCSI request can be processed per disk
revolution, resulting in a throughput of about 50 kilobytes per second. When scatter-gather is supported, however, average throughput is usually over 500 kilobytes per second.

The sg_tablesize variable holds the maximum allowable number of requests in the scatter-gather list. If the driver does not support scatter-gather, this variable should be set to SG_NONE. If the driver can support an unlimited number of grouped requests, this variable should be set to SG_ALL. Some drivers will use the host adapter to manage the scatter-gather list and may need to limit sg_tablesize to the number that the host adapter hardware supports. For example, some Adaptec host adapters require a limit of 16.

cmd_per_lun
The SCSI standard supports the notion of ``linked commands.'' Linked commands allow several commands to be queued consecutively to a single SCSI device. The cmd_per_lun variable specifies the number of linked commands allowed. This variable should be set to 1 if command linking is not supported. At this time, however, the high-level SCSI code will not take advantage of this feature.

Linked commands are fundamentally different from multiple outstanding commands (as described by the can_queue variable). Linked commands always go to the same SCSI target and do not necessarily involve a RESELECTION phase. Further, linked commands eliminate the ARBITRATION, SELECTION, and MESSAGE OUT phases on all commands after the first one in the set. In contrast, multiple outstanding commands may be sent to an arbitrary SCSI target, and require the ARBITRATION, SELECTION, MESSAGE OUT, and RESELECTION phases.

present
The present bit is set (by the high-level code) if the host adapter is detected.

unchecked_isa_dma
Some host adapters use Direct Memory Access (DMA) to read and write blocks of data directly from or to the computer's main memory. Linux is a virtual memory operating system that can use more than 16 MB of physical memory. Unfortunately, on machines using the ISA bus (the so-called ``Industry Standard Architecture'' bus was introduced with the IBM PC/XT and IBM PC/AT computers), DMA is limited to the low 16 MB of physical memory.

If the unchecked_isa_dma bit is set, the high-level code will provide data buffers which are guaranteed to be in the low 16 MB of the physical address space. Drivers written for host adapters that do not use DMA should set this bit to zero. Drivers specific to EISA bus (the ``Extended Industry Standard Architecture'' bus is a non-proprietary 32-bit bus for 386 and i486 machines) machines should also set this bit to zero, since EISA bus machines allow unrestricted DMA access.

Functions in the Scsi_Host Structure
detect()
The detect() function's only argument is the ``host number,'' an index into the scsi_hosts variable (an array of type struct Scsi_Host). The detect() function should return a non-zero
value if the host adapter is detected, and should return zero otherwise.

Host adapter detection must be done carefully. Usually the process begins by looking in the ROM area for the ``BIOS signature'' of the host adapter. On PC/AT-compatible computers, the use of the address space between 0xc0000 and 0xfffff is fairly well defined. For example, the video BIOS on most machines starts at 0xc0000 and the hard disk BIOS, if present, starts at 0xc8000. When a PC/AT-compatible computer boots, every 2-kilobyte block from 0xc0000 to 0xf8000 is examined for the 2-byte signature (0x55aa) which indicates that a valid BIOS extension is present [Nor85].

The BIOS signature usually consists of a series of bytes that uniquely identifies the BIOS. For example, one Future Domain BIOS signature is the string

    FUTURE DOMAIN CORP. (C) 1986-1990 1800-V2.07/28/89

found exactly five bytes from the start of the BIOS block.

After the BIOS signature is found, it is safe to test for the presence of a functioning host adapter in more specific ways. Since the BIOS signatures are hard-coded in the kernel, the release of a new BIOS can cause the driver to mysteriously fail. Further, people who use the SCSI adapter exclusively for Linux may want to disable the BIOS to speed boot time. For these reasons, if the adapter can be detected safely without examining the BIOS, then that alternative method should be used.

Usually, each host adapter has a series of I/O port addresses which are used for communications. Sometimes these addresses will be hard coded into the driver, forcing all Linux users who have this host adapter to use a specific set of I/O port addresses. Other drivers are more flexible, and find the current I/O port address by scanning all possible port addresses. Usually each host adapter will allow 3 or 4 sets of addresses, which are selectable via hardware jumpers on the host adapter card.

After the I/O port addresses are found, the host adapter can be interrogated to confirm that it is, indeed, the expected host adapter. These tests are host adapter specific, but commonly include methods to determine the BIOS base address (which can then be compared to the BIOS address found during the BIOS signature search) or to verify a unique identification number associated with the board. For MCA bus (the ``Micro-Channel Architecture'' bus is IBM's proprietary 32 bit bus for 386 and i486 machines) machines, each type of board is given a unique identification number which no other manufacturer can use--several Future Domain host adapters, for example, also use this number as a unique identifier on ISA bus machines. Other methods of verifying the host adapter existence and function will be available to the programmer.

Requesting the IRQ
After detection, the detect() routine must request any needed interrupt or DMA channels from the kernel. There are 16 interrupt channels, labeled IRQ 0 through IRQ 15. The kernel provides two methods for setting up an IRQ handler: irqaction() and request_irq().

The request_irq() function takes two parameters, the IRQ number and a pointer to the handler routine. It then sets up a default sigaction structure and calls irqaction(). The code (Linux 0.99.7 kernel source code, linux/kernel/irq.c) for the request_irq() function is shown below. I will limit my discussion to the more general irqaction() function.
    int request_irq( unsigned int irq, void (*handler)( int ) )
    {
        struct sigaction sa;

        sa.sa_handler  = handler;
        sa.sa_flags    = 0;
        sa.sa_mask     = 0;
        sa.sa_restorer = NULL;
        return irqaction( irq, &sa );
    }

The declaration (Linux 0.99.5 kernel source code, linux/kernel/irq.c) for the irqaction() function is

    int irqaction( unsigned int irq, struct sigaction *new )

where the first parameter, irq, is the number of the IRQ that is being requested, and the second parameter, new, is a pointer to a structure with the definition (Linux 0.99.5 kernel source code, linux/include/linux/signal.h) shown here:

    struct sigaction
    {
        __sighandler_t sa_handler;
        sigset_t       sa_mask;
        int            sa_flags;
        void           (*sa_restorer)(void);
    };

In this structure, sa_handler should point to your interrupt handler routine, which should have a definition similar to the following:

    void fdomain_16x0_intr( int irq )

where irq will be the number of the IRQ which caused the interrupt handler routine to be invoked.

The sa_mask variable is used as an internal flag by the irqaction() routine. Traditionally, this variable is set to zero prior to calling irqaction().

The sa_flags variable can be set to zero or to SA_INTERRUPT. If zero is selected, the interrupt handler will run with other interrupts enabled, and will return via the signal-handling return functions. This option is recommended for relatively slow IRQ's, such as those associated with the keyboard and timer interrupts. If SA_INTERRUPT is selected, the handler will be called with interrupts disabled and return will avoid the signal-handling return functions. SA_INTERRUPT selects ``fast'' IRQ handler invocation routines, and is recommended for interrupt driven hard disk routines. The interrupt handler should turn interrupts on as soon as possible, however, so that other interrupts can be processed.

The sa_restorer variable is not currently used, and is traditionally set to NULL.
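
As a rough illustration of the sigaction fields just described, the sketch below shows how a detect() routine might ask for a ``fast'' handler. It is a user-space mock, not kernel code. The names mock_sigaction, mock_irqaction(), mock_irq_table, MOCK_SA_INTERRUPT, demo_intr(), and demo_request_fast_irq() are hypothetical stand-ins, and the mock only imitates the argument checks that this chapter attributes to irqaction() (an out-of-range IRQ or a NULL handler is rejected, and an IRQ can be claimed only once).

```c
#include <errno.h>
#include <stddef.h>

#define MOCK_NR_IRQS      16  /* IRQ 0 through IRQ 15, as in the text */
#define MOCK_SA_INTERRUPT 1   /* hypothetical value standing in for SA_INTERRUPT */

/* Hypothetical stand-in for the kernel's struct sigaction. */
struct mock_sigaction {
    void (*sa_handler)(int);
    int  sa_mask;
    int  sa_flags;
    void (*sa_restorer)(void);
};

/* One slot per IRQ; a NULL sa_handler means the IRQ is still free. */
static struct mock_sigaction mock_irq_table[MOCK_NR_IRQS];

/* Imitates only the checks described in the text; it installs nothing real. */
int mock_irqaction(unsigned int irq, const struct mock_sigaction *new_sa)
{
    if (irq >= MOCK_NR_IRQS || new_sa == NULL || new_sa->sa_handler == NULL)
        return -EINVAL;               /* bad IRQ number or missing handler */
    if (mock_irq_table[irq].sa_handler != NULL)
        return -EBUSY;                /* IRQ already claimed */
    mock_irq_table[irq] = *new_sa;
    return 0;
}

static int demo_interrupts_seen = 0;

/* Same shape as fdomain_16x0_intr( int irq ): it receives the IRQ number. */
static void demo_intr(int irq)
{
    (void)irq;
    ++demo_interrupts_seen;
}

/* Fill in every field explicitly, then hand the structure over. */
int demo_request_fast_irq(unsigned int irq)
{
    struct mock_sigaction sa;

    sa.sa_handler  = demo_intr;
    sa.sa_mask     = 0;                  /* traditionally zero */
    sa.sa_flags    = MOCK_SA_INTERRUPT;  /* "fast" handler, interrupts disabled */
    sa.sa_restorer = NULL;               /* not currently used */
    return mock_irqaction(irq, &sa);
}
```

In this mock, the first call to demo_request_fast_irq() for a given IRQ succeeds and a second call for the same IRQ is refused, which is the case a real driver would have to treat as a serious error.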

The request_irq() and irqaction() functions will return zero if the IRQ was successfully assigned to the specified interrupt handler routine. Non-zero result codes may be interpreted as follows:

-EINVAL
Either the IRQ requested was larger than 15, or a NULL pointer was passed instead of a valid pointer to the interrupt handler routine.

-EBUSY
The IRQ requested has already been allocated to another interrupt handler. This situation should never occur, and is reasonable cause for a call to panic().

The kernel uses an Intel ``interrupt gate'' to set up IRQ handler routines requested via the irqaction() function. The Intel i486 manual [Int90, p. 9-11] explains the interrupt gate as follows:

    Interrupts using... interrupt gates... cause the TF flag [trap flag] to be cleared after its current value is saved on the stack as part of the saved contents of the EFLAGS register. In so doing, the processor prevents instruction tracing from affecting interrupt response. A subsequent IRET [interrupt return] instruction restores the TF flag to the value in the saved contents of the EFLAGS register on the stack.

    ... An interrupt which uses an interrupt gate clears the IF flag [interrupt-enable flag], which prevents other interrupts from interfering with the current interrupt handler. A subsequent IRET instruction restores the IF flag to the value in the saved contents of the EFLAGS register on the stack.

Requesting the DMA channel
Some SCSI host adapters use DMA to access large blocks of data in memory. Since the CPU does not have to deal with the individual DMA requests, data transfers are faster than CPU-mediated transfers and allow the CPU to do other useful work during a block transfer (assuming interrupts are enabled).

The host adapter will use a specific DMA channel. This DMA channel will be determined by the detect() function and requested from the kernel with the request_dma() function. This function takes the DMA channel number as its only parameter and returns zero if the DMA channel was successfully allocated. Non-zero results may be interpreted as follows:

-EINVAL
The DMA channel number requested was larger than 7.

-EBUSY
The requested DMA channel has already been allocated. This is a very serious situation, and will probably cause any SCSI requests to fail. It is worthy of a call to panic().

info()
The info() function merely returns a pointer to a static area containing a brief description of the low-level driver. This description, which is similar to that pointed to by the name variable, will be printed at boot time.

queuecommand()
The queuecommand() function sets up the host adapter for processing a SCSI command and then returns. When the command is finished, the done() function is called with the Scsi_Cmnd structure pointer as a parameter. This allows the SCSI command to be executed in an interrupt-driven fashion. Before returning, the queuecommand() function must do several things:

1. Save the pointer to the Scsi_Cmnd structure.
2. Save the pointer to the done() function in the scsi_done() function pointer in the Scsi_Cmnd structure. See section done() for more information.
3. Set up the special Scsi_Cmnd variables required by the driver. See section The Scsi_Cmnd Structure for detailed information on the Scsi_Cmnd structure.
4. Start the SCSI command. For an advanced host adapter, this may be as simple as sending the command to a host adapter ``mailbox.'' For less advanced host adapters, the ARBITRATION phase is manually started.

The queuecommand() function is called only if the can_queue variable (see section can_queue) is non-zero. Otherwise the command() function is used for all SCSI requests.
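
The four steps listed for queuecommand() can be sketched in ordinary user-space C. This is only a sketch under stated assumptions: Demo_Cmnd is a hypothetical, cut-down stand-in for Scsi_Cmnd, demo_mailbox pretends to be a host adapter mailbox, and demo_complete() plays the part of the interrupt handler that a real driver would supply. None of these names come from the kernel.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-in for Scsi_Cmnd: only the fields used here. */
typedef struct demo_cmnd {
    unsigned char cmnd[10];                  /* SCSI command bytes */
    int           result;                    /* result code for the request */
    void        (*scsi_done)(struct demo_cmnd *);
} Demo_Cmnd;

static Demo_Cmnd    *demo_current = NULL;    /* command currently in progress */
static unsigned char demo_mailbox[10];       /* pretend host adapter mailbox */
static int           demo_last_result = -1;  /* filled in by demo_record_done() */

/* A done() callback for the demonstration: it just records the result. */
void demo_record_done(Demo_Cmnd *SCpnt)
{
    demo_last_result = SCpnt->result;
}

/* Sets the command going and returns at once; completion is reported later. */
int demo_queuecommand(Demo_Cmnd *SCpnt, void (*done)(Demo_Cmnd *))
{
    size_t i;

    demo_current     = SCpnt;                /* 1. save the command pointer */
    SCpnt->scsi_done = done;                 /* 2. save the done() pointer */
    SCpnt->result    = 0;                    /* 3. set up driver variables */
    for (i = 0; i < sizeof demo_mailbox; i++)
        demo_mailbox[i] = SCpnt->cmnd[i];    /* 4. start the command */
    return 0;
}

/* Stands in for the interrupt handler that runs when the command finishes. */
void demo_complete(int result)
{
    Demo_Cmnd *SCpnt = demo_current;

    if (SCpnt == NULL)
        return;                              /* nothing outstanding */
    demo_current  = NULL;
    SCpnt->result = result;
    SCpnt->scsi_done(SCpnt);                 /* report completion upward */
}
```

The point of the split is that demo_queuecommand() never waits: the caller regains control immediately, and the saved scsi_done pointer is the only route by which completion is reported.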

DID_BAD_TARGET
The SCSI ID of the target was the same as the SCSI ID of the host adapter.

DID_ABORT
The high-level code called the low-level abort() function (see section abort()).

DID_PARITY
A SCSI PARITY error was detected.

DID_ERROR
An error occurred which lacks a more appropriate error code (for example, an internal host adapter error).

DID_RESET
The high-level code called the low-level reset() function (see section reset()).

DID_BAD_INTR
An unexpected interrupt occurred and there is no appropriate way to handle this interrupt.

Note that returning DID_BUS_BUSY will force the command to be retried, whereas returning DID_NO_CONNECT will abort the command.

Byte 3 (MSB)
This byte is for a high-level return code, and should be left as zero by the low-level code.

Current low-level drivers do not uniformly (or correctly) implement error reporting, so it may be better to consult scsi.c to determine exactly how errors should be reported, rather than exploring existing drivers.

command()
The command() function processes a SCSI command and returns when the command is finished. When the original SCSI code was written, interrupt-driven drivers were not supported. The old drivers are much less efficient (in terms of response time and latency) than the current interrupt-driven drivers, but are also much easier to write. For new drivers, this command can be replaced with a call to the queuecommand() function, as demonstrated here. (Linux 0.99.5 kernel, linux/kernel/blk_drv/scsi/aha1542.c, written by Tommy Thorn.)

    static volatile int internal_done_flag    = 0;
    static volatile int internal_done_errcode = 0;

    /* Completion callback: record the result and signal the waiter. */
    static void internal_done( Scsi_Cmnd *SCpnt )
    {
        internal_done_errcode = SCpnt->result;
        ++internal_done_flag;
    }

    int aha1542_command( Scsi_Cmnd *SCpnt )
    {
        aha1542_queuecommand( SCpnt, internal_done );

        while (!internal_done_flag);   /* busy-wait until internal_done() runs */
        internal_done_flag = 0;
        return internal_done_errcode;
    }

The return value is the same as the result variable in the Scsi_Cmnd structure. Please see sections done() and The Scsi_Cmnd Structure for more details.

abort()
The high-level SCSI code handles all timeouts. This frees the low-level driver from having to do timing, and permits different timeout periods to be used for different devices (e.g., the timeout for a SCSI tape drive is nearly infinite, whereas the timeout for a SCSI disk drive is relatively short).

The abort() function is used to request that the currently outstanding SCSI command, indicated by the Scsi_Cmnd pointer, be aborted. After setting the result variable in the Scsi_Cmnd structure, the abort() function returns zero. If code, the second parameter to the abort() function, is zero, then result should be set to DID_ABORT. Otherwise, result should be set equal to code. If code is not zero, it is usually DID_TIME_OUT or DID_RESET.

Currently, none of the low-level drivers is able to correctly abort a SCSI command. The initiator should request (by asserting the -ATN line) that the target enter a MESSAGE OUT phase. Then, the initiator should send an ABORT message to the target.

reset()
The reset() function is used to reset the SCSI bus. After a SCSI bus reset, any executing command should fail with a DID_RESET result code (see section done()).

Currently, none of the low-level drivers handles resets correctly. To correctly reset a SCSI command, the initiator should request (by asserting the -ATN line) that the target enter a MESSAGE OUT phase. Then, the initiator should send a BUS DEVICE RESET message to the target. It may also be necessary to initiate a SCSI RESET by asserting the -RST line, which will cause all target devices to be reset. After a reset, it may be necessary to renegotiate a synchronous communications protocol with the targets.

slave_attach()
The slave_attach() function is not currently implemented. This function would be used to negotiate synchronous communications between the host adapter and the target drive. This negotiation requires an exchange of a pair of SYNCHRONOUS DATA TRANSFER REQUEST messages between the initiator and the target. This exchange should occur under the following conditions [LXT91]:

A SCSI device that supports synchronous data transfer recognizes it has not communicated with the other SCSI device since receiving the last ``hard'' RESET.

A SCSI device that supports synchronous data transfer recognizes it has not communicated with the other SCSI device since receiving a BUS DEVICE RESET message.

bios_param()
Linux supports the MS-DOS (MS-DOS is a registered trademark of Microsoft Corporation) hard disk partitioning system. Each disk contains a ``partition table'' which defines how the disk is divided into logical sections. Interpretation of this partition table requires information about the size of the disk in terms of cylinders, heads, and sectors per track. SCSI disks, however, hide their physical geometry and are accessed logically as a contiguous list of sectors. Therefore, in order to be compatible with MS-DOS, the SCSI host adapter will ``lie'' about its geometry. The physical geometry of the SCSI disk, while available, is seldom used as the ``logical geometry.'' (The reasons for this involve archaic and arbitrary limitations imposed by MS-DOS.)

Linux needs to determine the ``logical geometry'' so that it can correctly modify and interpret the partition table. Unfortunately, there is no standard method for converting between physical and logical geometry. Hence, the bios_param() function was introduced in an attempt to provide access to the host adapter geometry information.

The size parameter is the size of the disk in sectors. Some host adapters use a deterministic formula based on this number to calculate the logical geometry of the drive. Other host adapters store geometry information in tables which the driver can access. To facilitate this access, the dev parameter contains the drive's device number. Two macros are defined in linux/fs.h which will help to interpret this value: MAJOR(dev) is the device's major number, and MINOR(dev) is the device's minor number. These are the same major and minor device numbers used by the standard Linux mknod command to create the device in the /dev directory. The info parameter points to an array of three integers that the bios_param() function will fill in before returning:

info[0]  Number of heads
info[1]  Number of sectors per track
info[2]  Number of cylinders

The information in info is not the physical geometry of the drive, but only a logical geometry that is identical to the logical geometry used by MS-DOS to access the drive. The distinction between physical and logical geometry cannot be overstressed.

The Scsi_Cmnd Structure
The Scsi_Cmnd structure (Linux 0.99.7 kernel, linux/kernel/blk_drv/scsi/scsi.h), as shown below, is used by the high-level code to specify a SCSI command for execution by the low-level code. Many variables in the Scsi_Cmnd structure can be ignored by the low-level device driver--other variables, however, are extremely important.

    typedef struct scsi_cmnd
    {
      int              host;
      unsigned char    target,
                       lun,
                       index;
      struct scsi_cmnd *next,
                       *prev;

      unsigned char    cmnd[10];
      unsigned         request_bufflen;
      void             *request_buffer;

      unsigned char    data_cmnd[10];
      unsigned short   use_sg;
      unsigned short   sglist_len;
      unsigned         bufflen;
      void             *buffer;

      struct request   request;
      unsigned char    sense_buffer[16];
      int              retries;
      int              allowed;
      int              timeout_per_command,
                       timeout_total,
                       timeout;
      unsigned char    internal_timeout;
      unsigned         flags;

      void (*scsi_done)(struct scsi_cmnd *);
      void (*done)(struct scsi_cmnd *);

      Scsi_Pointer     SCp;
      unsigned char    *host_scribble;
      int              result;
    } Scsi_Cmnd;

Reserved Areas
Informative Variables
host is an index into the scsi_hosts array.

target stores the SCSI ID of the target of the SCSI command. This information is important if multiple outstanding commands or multiple commands per target are supported.

cmnd is an array of bytes which hold the actual SCSI command. These bytes should be sent to the SCSI target during the COMMAND phase. cmnd[0] is the SCSI command code. The COMMAND_SIZE macro, defined in scsi.h, can be used to determine the length of the current SCSI command.

result is used to store the result code from the SCSI request. Please see section done() for more
information about this variable. This variable must be correctly set before the low-level routines return.

The Scatter-Gather List
use_sg contains a count of the number of pieces in the scatter-gather chain. If use_sg is zero, then request_buffer points to the data buffer for the SCSI command, and request_bufflen is the length of this buffer in bytes. Otherwise, request_buffer points to an array of scatterlist structures, and use_sg will indicate how many such structures are in the array. The use of request_buffer is non-intuitive and confusing.

Each element of the scatterlist array contains an address and a length component. If the unchecked_isa_dma flag in the Scsi_Host structure is set to 1 (see section unchecked_isa_dma for more information on DMA transfers), the address is guaranteed to be within the first 16 MB of physical memory. Large amounts of data will be processed by a single SCSI command. The length of these data will be equal to the sum of the lengths of all the buffers pointed to by the scatterlist array.

Scratch Areas
Depending on the capabilities and requirements of the host adapter, the scatter-gather list can be handled in a variety of ways. To support multiple methods, several scratch areas are provided for the exclusive use of the low-level driver.

The scsi_done() Pointer
This pointer should be set to the done() function pointer in the queuecommand() function (see section queuecommand() for more information). There are no other uses for this pointer.

The host_scribble Pointer
The high-level code supplies a pair of memory allocation functions, scsi_malloc() and scsi_free(), which are guaranteed to return memory in the first 16 MB of physical memory. This memory is, therefore, suitable for use with DMA. The amount of memory allocated per request must be a multiple of 512 bytes, and must be less than or equal to 4096 bytes. The total amount of memory available via scsi_malloc() is a complex function of the Scsi_Host structure variables sg_tablesize, cmd_per_lun, and unchecked_isa_dma.

The host_scribble pointer is available to point to a region of memory allocated with scsi_malloc(). The low-level SCSI driver is responsible for managing this pointer and its associated memory, and should free the area when it is no longer needed.

The Scsi_Pointer Structure
The SCp variable, a structure of type Scsi_Pointer, is described here:

    typedef struct scsi_pointer
    {
      char               *ptr;              /* data pointer */
      int                this_residual;     /* left in this buffer */
      struct scatterlist *buffer;           /* which buffer */
      int                buffers_residual;  /* how many buffers left */

      volatile int       Status;
      volatile int       Message;
      volatile int       have_data_in;
      volatile int       sent_command;
      volatile int       phase;
    } Scsi_Pointer;

The variables in this structure can be used in any way necessary in the low-level driver. Typically, buffer points to the current entry in the scatterlist, buffers_residual counts the number of entries remaining in the scatterlist, ptr is used as a pointer into the buffer, and this_residual counts the characters remaining in the transfer. Some host adapters require support of this detail of interaction--others can completely ignore this structure.

The second set of variables provides convenient locations to store SCSI status information and various pointers and flags.

Acknowledgements
Thanks to Drew Eckhardt, Michael K. Johnson, Karin Boes, Devesh Bhatnagar, and Doug Hoffman for reading early versions of this paper and for providing many helpful comments. Special thanks to my official COMP-291 (Professional Writing in Computer Science) ``readers,'' Professors Peter Calingaert and Raj Kumar Singh.

Bibliography
[ANS] Draft Proposed American National Standard for Information Systems: Small Computer System Interface-2 (SCSI-2). (X3T9.2/86-109, revision 10h, October 17, 1991).
[Int90] Intel. i486 Processor Programmer's Reference Manual. Intel/McGraw-Hill, 1990.
[LXT91] LXT SCSI Products: Specification and OEM Technical Manual, 1991.
[Nor85] Peter Norton. The Peter Norton Programmer's Guide to the IBM PC. Bellevue, Washington: Microsoft Press, 1985.

Messages
1. Writing a SCSI Device Driver by rohit patil


non-block-cached block device?
Forum: Block Device Drivers
Keywords: block device cache
Date: Thu, 30 May 1996 11:26:41 GMT
From: Neal Tucker <ntucker@adobe.com>

I have a question/idea regarding the block device interface. First, some premises upon which my idea relies:

1) All block device access goes through the block cache.
2) Filesystems must be mounted from block devices.
3) A block device read which is not a cache hit always puts the calling process to sleep, which means that even if the IO completes quickly (ie with a RAM disk), the process still has to wait to be scheduled again.

It seems to me that these three things could lead to very poor RAM disk performance, which leads me to suggest that it might be advantageous to allow block devices which do not go through the block cache.

So... I can see three possible reasons this isn't a good idea:

1) With the current design, it would be really hard to do.
2) It doesn't make enough of a difference that people care.
3) I'm completely wrong.

What do people think?

Shall I explain elevator algorithm (+sawtooth etc)
Forum: Block Device Drivers
Keywords: block device elevator sawtooth minimum algorithm
Date: Sat, 10 Aug 1996 11:12:11 GMT
From: Michael De La Rue <miked@ed.ac.uk>

I just wrote a response about it to the kernel list, so would a discussion of the elevator algorithm, and sawtooth algorithm (plus mention of minimum movement) be appreciated if I get it checked over by `someone who knows?'

Writing a SCSI Device Driver
Forum: Writing a SCSI Device Driver
Keywords: Good work!
Date: Thu, 02 Jan 1997 04:10:44 GMT
From: rohit patil <rohit@techie.com>

hi! this is superb stuff. will let you know more after i go thro' it. thanks. good work :)
-rohit.

Network Buffers And Memory Management

Reprinted with permission of Linux Journal, from issue 29, September 1996. Some changes have been made to accommodate the web. This article was originally written for the Kernel Korner column. The Kernel Korner series has included many other articles of interest to Linux kernel hackers, as well.

http://ldp.iol.it/LDP/khg/HyperNews/get/net/net-intro.html

by Alan Cox

The Linux operating system implements the industry-standard Berkeley socket API, which has its origins in the BSD unix developments (4.2/4.3/4.4 BSD). In this article, we will look at the way the memory management and buffering is implemented for network layers and network device drivers under the existing Linux kernel, as well as explain how and why some things have changed over time.

Core Concepts

The networking layer tries to be fairly object-oriented in its design, as indeed is much of the Linux kernel. The core structure of the networking code goes back to the initial networking and socket implementations by Ross Biro and Orest Zborowski respectively. The key objects are:

Device or Interface:
    A network interface represents a thing which sends and receives packets. This is normally interface code for a physical device like an ethernet card. However, some devices are software only, such as the loopback device which is used for sending data to yourself.

Protocol:
    Each protocol is effectively a different language of networking. Some protocols exist purely because vendors chose to use proprietary networking schemes, others are designed for special purposes. Within the Linux kernel each protocol is a separate module of code which provides services to the socket layer.

Socket:
    So called from the notion of plugs and sockets. A socket is a connection in the networking that provides unix file I/O and exists to the user program as a file descriptor. In the kernel each socket is a pair of structures that represent the high level socket interface and low level protocol interface.

sk_buff:
    All the buffers used by the networking layers are sk_buffs. The control for these is provided by core low-level library routines available to the whole of the networking. sk_buffs provide the general buffering and flow control facilities needed by network protocols.

Implementation of sk_buffs

The primary goal of the sk_buff routines is to provide a consistent and efficient buffer handling method for all of the network layers, and by being consistent to make it possible to provide higher level sk_buff and socket handling facilities to all the protocols.

An sk_buff is a control structure with a block of memory attached. There are two primary sets of functions provided in the sk_buff library. Firstly routines to manipulate doubly linked lists of sk_buffs, secondly functions for controlling the attached memory. The buffers are held on linked lists optimised for the common network operations of append to end and remove from start. As so much of the networking functionality occurs during interrupts these routines are written to be atomic. The small extra overhead this causes is well worth the pain it saves in bug hunting.

We use the list operations to manage groups of packets as they arrive from the network, and as we send them to the physical interfaces. We use the memory manipulation routines for handling the contents of packets in a standardised and efficient manner.

At its most basic level, a list of buffers is managed using functions like this:

    void append_frame(char *buf, int len)
    {
        struct sk_buff *skb=alloc_skb(len, GFP_ATOMIC);
        if(skb==NULL)
            my_dropped++;
        else
        {
            skb_put(skb,len);             /* claim len bytes of data area */
            memcpy(skb->data,buf,len);    /* copy the frame into it */
            skb_queue_tail(&my_list, skb); /* add to the end of the list */
        }
    }

    void process_queue(void)
    {
        struct sk_buff *skb;
        while((skb=skb_dequeue(&my_list))!=NULL)
        {
            process_data(skb);
            kfree_skb(skb, FREE_READ);
        }
    }

These two fairly simplistic pieces of code actually demonstrate the receive packet mechanism quite accurately. The append_frame() function is similar to the code called from an interrupt by a device driver receiving a packet, and process_queue() is similar to the code called to feed data into the protocols. If you go and look in net/core/dev.c at netif_rx() and net_bh(), you will see that they manage buffers similarly. They are far more complex, as they have to feed packets to the right protocol and manage flow control, but the basic operations are the same. This is just as true if you look at buffers going from the protocol code to a user application.

The example also shows the use of one of the data control functions, skb_put(). Here it is used to reserve space in the buffer for the data we wish to pass down.

Let's look at append_frame(). The alloc_skb() function obtains a buffer of len bytes (Figure 1), which consists of:

- 0 bytes of room at the head of the buffer
- 0 bytes of data, and
- len bytes of room at the end of the data.

The skb_put() function (Figure 4) grows the data area upwards in memory through the free space at the buffer end and thus reserves space for the memcpy(). Many network operations used in sending add to the start of the frame each time in order to add headers to packets, so the skb_push() function (Figure 5) is provided to allow you to move the start of the data frame down through memory, providing enough space has been reserved to leave room for doing this.

Immediately after a buffer has been allocated, all the available room is at the end. A further function named skb_reserve() (Figure 2), which can be called before data is added, allows you to specify that some of the room should be at the beginning. Thus, many sending routines start with something like:

    skb=alloc_skb(len+headspace, GFP_KERNEL);
    skb_reserve(skb, headspace);
    skb_put(skb,len);
    memcpy_fromfs(skb->data,data,len);
    pass_to_m_protocol(skb);

In systems such as BSD unix you don't need to know in advance how much space you will need as it uses chains of small buffers (mbufs) for its network buffers. Linux chooses to use linear buffers and save space in advance (often wasting a few bytes to allow for the worst case) because linear buffers make many other things much faster.

Now to return to the list functions. Linux provides the following operations:

- skb_dequeue() takes the first buffer from a list. If the list is empty a NULL pointer is returned. This is used to pull buffers off queues. The buffers are added with the routines skb_queue_head() and skb_queue_tail().
- skb_queue_head() places a buffer at the start of a list. As with all the list operations, it is atomic.
- skb_queue_tail() places a buffer at the end of a list, which is the most commonly used function. Almost all the queues are handled with one set of routines queueing data with this function and another set removing items from the same queues with skb_dequeue().
- skb_unlink() removes a buffer from whatever list it was on. The buffer is not freed, merely removed from the list. To make some operations easier, you need not know what list the buffer is on, and you can always call skb_unlink() on a buffer which is not in a list. This enables network code to pull a buffer out of use even when the network protocol has no idea who is currently using it. A separate locking mechanism is provided so device drivers do not find someone removing a buffer they are using at that moment.
- Some more complex protocols like TCP keep frames in order and re-order their input as data is received. Two functions, skb_insert() and skb_append(), exist to allow users to place sk_buffs before or after a specific buffer in a list.
- alloc_skb() creates a new sk_buff and initialises it. The returned buffer is ready to use but does assume you will fill in a few fields to indicate how the buffer should be freed. Normally this is skb->free=1. A buffer can be told not to be freed when kfree_skb() (see below) is called.
- kfree_skb() releases a buffer, and if skb->sk is set it lowers the memory use counts of the socket (sk). It is up to the socket and protocol-level routines to have incremented these counts and to avoid freeing a socket with outstanding buffers. The memory counts are very important, as the kernel networking layers need to know how much memory is tied up by each connection in order to prevent remote machines or local processes from using too much memory.
- skb_clone() makes a copy of an sk_buff but does not copy the data area, which must be considered read only.
- For some things a copy of the data is needed for editing, and skb_copy() provides the same facilities but also copies the data (and thus has a much higher overhead).

Figure 1: After alloc_skb
Figure 2: After skb_reserve
Figure 3: An sk_buff containing data
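The buffer states named in Figures 1 to 5 can be modelled in user space with four pointers into one linear block. What follows is only a sketch for following the arithmetic: struct toy_skb and every toy_* function are invented for illustration and are not the kernel's struct sk_buff or its routines, which carry far more state and are written to be interrupt safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy user-space model of a linear network buffer (hypothetical names). */
struct toy_skb {
    unsigned char *head;  /* start of the allocated block */
    unsigned char *data;  /* start of the valid data      */
    unsigned char *tail;  /* end of the valid data        */
    unsigned char *end;   /* end of the allocated block   */
};

/* Like Figure 1: no headroom, no data, all room at the end. */
static struct toy_skb *toy_alloc(size_t size)
{
    struct toy_skb *skb = malloc(sizeof(*skb));
    if (skb == NULL)
        return NULL;
    skb->head = malloc(size ? size : 1);
    if (skb->head == NULL) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    return skb;
}

static void toy_free(struct toy_skb *skb)
{
    free(skb->head);
    free(skb);
}

static size_t toy_headroom(const struct toy_skb *skb) { return (size_t)(skb->data - skb->head); }
static size_t toy_len(const struct toy_skb *skb)      { return (size_t)(skb->tail - skb->data); }
static size_t toy_tailroom(const struct toy_skb *skb) { return (size_t)(skb->end - skb->tail); }

/* Like Figure 2: move the empty data area up to leave room for headers. */
static void toy_reserve(struct toy_skb *skb, size_t len)
{
    assert(skb->data == skb->tail);      /* only sensible on an empty buffer */
    assert(len <= toy_tailroom(skb));
    skb->data += len;
    skb->tail += len;
}

/* Like Figure 4: grow the data area upwards; returns where the new bytes go. */
static unsigned char *toy_put(struct toy_skb *skb, size_t len)
{
    unsigned char *old_tail = skb->tail;
    assert(len <= toy_tailroom(skb));
    skb->tail += len;
    return old_tail;
}

/* Like Figure 5: grow the data area downwards into the reserved headroom. */
static unsigned char *toy_push(struct toy_skb *skb, size_t len)
{
    assert(len <= toy_headroom(skb));
    skb->data -= len;
    return skb->data;
}
```

A sender in this model would call toy_alloc(len + headspace), then toy_reserve(), then toy_put() for the payload, and finally toy_push() once per header, which mirrors the order of calls in the sending fragment shown earlier.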

Figure 4: After skb_put has been called on the buffer
Figure 5: After an skb_push has occurred on the previous buffer
Figure 6: Network device data flow

Higher Level Support Routines

The semantics of allocating and queueing buffers for sockets also involve flow control rules and for sending a whole list of interactions with signals and optional settings such as non blocking. Two routines are designed to make this easy for most protocols.

The sock_queue_rcv_skb() function is used to handle incoming data flow control and is normally used in the form:

    sk=my_find_socket(whatever);
    if(sock_queue_rcv_skb(sk,skb)==-1)
    {
        myproto_stats.dropped++;
        kfree_skb(skb,FREE_READ);
        return;
    }

This function uses the socket read queue counters to prevent vast amounts of data being queued to a socket. After a limit is hit, data is discarded. It is up to the application to read fast enough, or as in TCP, for the protocol to do flow control over the network. TCP actually tells the sending machine to shut up when it can no longer queue data.

On the sending side, sock_alloc_send_skb() handles signal handling, the non blocking flag, and all the semantics of blocking until there is space in the send queue so you cannot tie up all of memory with data queued for a slow interface. Many protocol send routines have this function doing almost all the work:

    skb=sock_alloc_send_skb(sk,....)
    if(skb==NULL)
        return -err;
    skb->sk=sk;
    skb_reserve(skb, headroom);
    skb_put(skb,len);
    memcpy(skb->data, data, len);
    protocol_do_something(skb);

Most of this we have met before. The very important line is skb->sk=sk. The sock_alloc_send_skb() has charged the memory for the buffer to the socket. By setting skb->sk we tell the kernel that whoever does a kfree_skb() on the buffer should cause the socket to be credited the memory for the buffer. Thus when a device has sent a buffer and frees it the user will be able to send more.

Network Devices

All Linux network devices follow the same interface although many functions available in that interface will not be needed for all devices. An object oriented mentality is used and each device is an object with a series of methods that are filled into a structure. Each method is called with the device itself as the first argument. This is done to get around the lack of the C++ concept of this within the C language.

The file drivers/net/skeleton.c contains the skeleton of a network device driver. View or print a copy from a recent kernel and follow along throughout the rest of the article.

Basic Structure

Each network device deals entirely in the transmission of network buffers from the protocols to the physical media, and in receiving and decoding the responses the hardware generates. Incoming frames are turned into network buffers, identified by protocol and delivered to netif_rx(). This function then passes the frames off to the protocol layer for further processing.

Each device provides a set of additional methods for the handling of stopping, starting, control and physical encapsulation of packets. These and all the other control information are collected together in the device structures that are used to manage each device.

Naming

All Linux network devices have a unique name. This is not in any way related to the file system names devices may have, and indeed network devices do not normally have a filesystem representation, although you may create a device which is tied to device drivers. Traditionally the name indicates only the type of a device rather than its maker. Multiple devices of the same type are numbered upwards from 0. Thus ethernet devices are known as ``eth0'', ``eth1'', ``eth2'' etc. The naming scheme is important as it allows users to write programs or system configuration in terms of ``an ethernet card'' rather than worrying about the manufacturer of the board and forcing reconfiguration if a board is changed.

The following names are currently used for generic devices:

    ethn    Ethernet controllers, both 10 and 100Mb/second
    trn     Token ring devices.
    sln     SLIP devices. Also used in AX.25 KISS mode.
    pppn    PPP devices both asynchronous and synchronous.
    plipn   PLIP units. The number matches the printer port.
    tunln   IPIP encapsulated tunnels
    nrn     NetROM virtual devices
    isdnn   ISDN interfaces handled by isdn4linux. (*)
    dummyn  Null devices
    lo      The loopback device

(*) At least one ISDN interface is an ethernet impersonator, that is the Sonix PC/Volante driver. Therefore, it uses an ``eth'' device name as it behaves in all aspects as if it was ethernet rather than ISDN.

If possible, a new device should pick a name that reflects existing practice. When you are adding a whole new physical layer type you should look for other people working on such a project and use a common naming scheme.

Certain physical layers present multiple logical interfaces over one media. Both ATM and Frame Relay have this property, as does multi-drop KISS in the amateur radio environment. Under such circumstances a driver needs to exist for each active channel. The Linux networking code is structured in such a way as to make this manageable without excessive additional code, and the name registration scheme allows you to create and remove interfaces almost at will as channels come into and out of existence. The proposed convention for such names is still under some discussion, as the simple scheme of ``sl0a'', ``sl0b'', ``sl0c'' works for basic devices like multidrop KISS, but does not cope with multiple frame relay connections where a virtual channel may be moved across physical boards.

Registering A Device

Each device is created by filling in a struct device object and passing it to the register_netdev(struct device *) call. This links your device structure into the kernel network device tables. As the structure you pass in is used by the kernel, you must not free this until you have unloaded the device with void unregister_netdev(struct device *) calls. These calls are normally done at boot time, or module load and unload.

The kernel will not object if you create multiple devices with the same name, but it will break. Therefore, if your driver is a loadable module you should use the struct device *dev_get(const char *name) call to ensure the name is not already in use. If it is in use, you should fail or pick another name. You may not use unregister_netdev() to unregister the other device with the name if you discover a clash!

A typical code sequence for registration is:

    int register_my_device(void)
    {
        int i=0;
        for(i=0;i<100;i++)
        {
            sprintf(mydevice.name,"mydev%d",i);
            if(dev_get(mydevice.name)==NULL)
            {
                if(register_netdev(&mydevice)!=0)
                    return -EIO;
                return 0;
            }
        }
        printk("100 mydevs loaded. Unable to load more.\n");
        return -ENFILE;
    }

The Device Structure

All the generic information and methods for each network device are kept in the device structure. To create a device you need to fill most of these in. This section covers how they should be set up.

Naming

First, the name field holds the device name. This is a string pointer to a name in the formats discussed previously. It may also be "    " (four spaces), in which case the kernel will automatically assign an ethn name to it. This is a special feature that is best not used. After Linux 2.0, we intend to change to a simple support function of the form dev_make_name("eth").

Bus Interface Parameters

The next block of parameters are used to maintain the location of a device within the device address spaces of the architecture.

The irq field holds the interrupt (IRQ) the device is using. This is normally set at boot, or by the initialization function. If an interrupt is not used, not currently known, or not assigned, the value zero should be used. The interrupt can be set in a variety of fashions. The auto-irq facilities of the kernel may be used to probe for the device interrupt, or the interrupt may be set when loading the network module. Network drivers normally use a global int called irq for this so that users can load the module with insmod mydevice irq=5 style commands. Finally, the IRQ may be set dynamically from the ifconfig command. This causes a call to your device that will be discussed later on.

The base_addr field is the base I/O space address the device resides at. If the device uses no I/O locations or is running on a system with no I/O space concept this field should be zero. When this is user settable, it is normally set by a global variable called io. The interface I/O address may also be set with ifconfig.

Two hardware shared memory ranges are defined for things like ISA bus shared memory ethernet cards. For current purposes, the rmem_start and rmem_end fields are obsolete and should be loaded with 0. The mem_start and mem_end addresses should be loaded with the start and end of the shared memory block used by this device. If no shared memory block is used, then the value 0 should be stored. Those devices that allow the user to specify this parameter use a global variable called mem to set the memory base, and set the mem_end appropriately themselves.

The dma variable holds the DMA channel in use by the device. Linux allows DMA (like interrupts) to be automatically probed. If no DMA channel is used, or the DMA channel is not yet set, the value 0 is used. This may have to change, since the latest PC boards allow ISA bus DMA channel 0 to be used by hardware boards and do not just tie it to memory refresh. If the user can set the DMA channel the global variable dma is used.

It is important to realise that the physical information is provided for control and user viewing (as well as the driver's internal functions), and does not register these areas to prevent them being reused. Thus the device driver must also allocate and register the I/O, DMA and interrupt lines it wishes to use, using the same kernel functions as any other device driver. [See the recent Kernel Korner articles on writing a character device driver in issues 23, 24, 25, 26, and 28 of Linux Journal.]

The if_port field holds the physical media type for multi-media devices such as combo ethernet boards.

Protocol Layer Variables

In order for the network protocol layers to perform in a sensible manner, the device has to provide a set of capability flags and variables. These are also maintained in the device structure.

The mtu is the largest payload that can be sent over this interface (that is, the largest packet size not including any bottom layer headers that the device itself will provide). This is used by the protocol layers such as IP to select suitable packet sizes to send. There are minimums imposed by each protocol. A device is not usable for IPX without a 576 byte frame size or higher. IP needs at least 72 bytes, and does not perform sensibly below about 200 bytes. It is up to the protocol layers to decide whether to co-operate with your device.

The family is always set to AF_INET and indicates the protocol family the device is using. Linux allows a device to be using multiple protocol families at once, and maintains this information solely to look more like the standard BSD networking API.

The interface hardware type (type) field is taken from a table of physical media types. The values used by the ARP protocol (see RFC1700) are used for those media supporting ARP and additional values are assigned for other physical layers. New values are added when necessary both to the kernel and to net-tools, which is the package containing programs like ifconfig that need to be able to decode this field.

The fields defined as of Linux pre2.0.5 are:

From RFC1700:

    ARPHRD_NETROM    NET/ROM(tm) devices.
    ARPHRD_ETHER     10 and 100Mbit/second ethernet.
    ARPHRD_EETHER    Experimental Ethernet (not used).
    ARPHRD_AX25      AX.25 level 2 interfaces.
    ARPHRD_PRONET    PROnet token ring (not used).
    ARPHRD_CHAOS     ChaosNET (not used).
    ARPHRD_IEEE802   802.2 networks notably token ring.
    ARPHRD_ARCNET    ARCnet interfaces.
    ARPHRD_DLCI      Frame Relay DLCI.

Defined by Linux:

    ARPHRD_SLIP      Serial Line IP protocol
    ARPHRD_CSLIP     SLIP with VJ header compression
    ARPHRD_SLIP6     6bit encoded SLIP
    ARPHRD_CSLIP6    6bit encoded header compressed SLIP
    ARPHRD_ADAPT     SLIP interface in adaptive mode
    ARPHRD_PPP       PPP interfaces (async and sync)
    ARPHRD_TUNNEL    IPIP tunnels
    ARPHRD_TUNNEL6   IPv6 over IP tunnels
    ARPHRD_FRAD      Frame Relay Access Device.
    ARPHRD_SKIP      SKIP encryption tunnel.
    ARPHRD_LOOPBACK  Loopback device.
    ARPHRD_LOCALTLK  Localtalk apple networking device.
    ARPHRD_METRICOM  Metricom Radio Network.

Those interfaces marked unused are defined types but without any current support on the existing net-tools. The Linux kernel provides additional generic support routines for devices using ethernet and token ring.

The pa_addr field is used to hold the IP address when the interface is up. Interfaces should start down with this variable clear. pa_brdaddr is used to hold the configured broadcast address, pa_dstaddr the target of a point to point link and pa_mask the IP netmask of the interface. All of these can be initialised to zero. The pa_alen field holds the length of an address (in our case an IP address); this should be initialised to 4.

Link Layer Variables

The hard_header_len is the number of bytes the device desires at the start of a network buffer it is passed. It does not have to be the number of bytes of physical header that will be added, although this is normal. A device can use this to provide itself a scratchpad at the start of each buffer.

In the 1.2.x series kernels, the skb->data pointer will point to the buffer start and you must avoid sending your scratchpad yourself. This also means for devices with variable length headers you will need to allocate max_size+1 bytes and keep a length byte at the start so you know where the header really begins (the header should be contiguous with the data). Linux 1.3.x makes life much simpler and ensures you will have at least as much room as you asked free at the start of the buffer. It is up to you to use skb_push() appropriately as was discussed in the section on networking buffers.

The physical media addresses (if any) are maintained in dev_addr and broadcast respectively. These are byte arrays and addresses smaller than the size of the array are stored starting from the left. The addr_len field is used to hold the length of a hardware address. With many media there is no hardware address, and this should be set to zero. For some other interfaces the address must be set by a user program. The ifconfig tool permits the setting of an interface hardware address. In this case it need not be set initially, but the open code should take care not to allow a device to start transmitting without an address being set.

Flags

A set of flags are used to maintain the interface properties. Some of these are ``compatibility'' items and as such not directly useful. The flags are:

IFF_UP
    The interface is currently active. In Linux, the IFF_RUNNING and IFF_UP flags are basically handled as a pair. They exist as two items for compatibility reasons. When an interface is not marked as IFF_UP it may be removed. Unlike BSD, an interface that does not have IFF_UP set will never receive packets.

IFF_BROADCAST
    The interface has broadcast capability. There will be a valid IP address stored in the device addresses.

IFF_DEBUG
    Available to indicate debugging is desired. Not currently used.

IFF_LOOPBACK
    The loopback interface (lo) is the only interface that has this flag set. Setting it on other interfaces is neither defined nor a very good idea.

IFF_POINTOPOINT
    The interface is a point to point link (such as SLIP or PPP). There is no broadcast capability as such. The remote point to point address in the device structure is valid. A point to point link has no netmask or broadcast normally, but this can be enabled if needed.

IFF_NOTRAILERS
    More of a prehistoric than a historic compatibility flag. Not used.

IFF_RUNNING
    See IFF_UP

IFF_NOARP
    The interface does not perform ARP queries. Such an interface must have either a static table of address conversions or no need to perform mappings. The NetROM interface is a good example of this. Here all entries are hand configured as the NetROM protocol cannot do ARP queries.

IFF_PROMISC
    The interface if it is possible will hear all packets on the network. This is typically used for network monitoring although it may also be used for bridging. One or two interfaces like the AX.25 interfaces are always in promiscuous mode.

IFF_ALLMULTI
    Receive all multicast packets. An interface that cannot perform this operation but can receive all packets will go into promiscuous mode when asked to perform this task.

IFF_MULTICAST
    Indicate that the interface supports multicast IP traffic. This is not the same as supporting a physical multicast. AX.25 for example supports IP multicast using physical broadcast. Point to point protocols such as SLIP generally support IP multicast.
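Since the flags are single bits in one word, the rules above reduce to a few mask operations. The sketch below is a user-space illustration only: the TOY_IFF_* names and toy_* helpers are invented here, and the numeric values are assumed to match those usually found in the Linux headers rather than taken from this article.

```c
#include <assert.h>

/* Hypothetical stand-ins for the interface flags; values are assumptions. */
enum {
    TOY_IFF_UP          = 0x0001,
    TOY_IFF_BROADCAST   = 0x0002,
    TOY_IFF_DEBUG       = 0x0004,
    TOY_IFF_LOOPBACK    = 0x0008,
    TOY_IFF_POINTOPOINT = 0x0010,
    TOY_IFF_NOTRAILERS  = 0x0020,
    TOY_IFF_RUNNING     = 0x0040,
    TOY_IFF_NOARP       = 0x0080,
    TOY_IFF_PROMISC     = 0x0100,
    TOY_IFF_ALLMULTI    = 0x0200,
    TOY_IFF_MULTICAST   = 0x1000
};

/* IFF_UP and IFF_RUNNING are handled as a pair, so set them together. */
static unsigned toy_bring_up(unsigned flags)
{
    return flags | TOY_IFF_UP | TOY_IFF_RUNNING;
}

/* ...and clear them together, leaving the capability bits alone. */
static unsigned toy_bring_down(unsigned flags)
{
    return flags & ~(unsigned)(TOY_IFF_UP | TOY_IFF_RUNNING);
}

/* An interface without IFF_UP never receives packets. */
static int toy_may_receive(unsigned flags)
{
    return (flags & TOY_IFF_UP) != 0;
}

/* IFF_NOARP means address mappings come from a static table instead. */
static int toy_uses_arp(unsigned flags)
{
    return (flags & TOY_IFF_NOARP) == 0;
}
```

An ethernet-like interface in this model would start with TOY_IFF_BROADCAST | TOY_IFF_MULTICAST and gain the UP/RUNNING pair only when it is opened; a NetROM-like one would carry TOY_IFF_NOARP from the start.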

The Packet Queue

Packets are queued for an interface by the kernel protocol code. Within each device, buffs[] is an array of packet queues for each kernel priority level. These are maintained entirely by the kernel code, but must be initialised by the device itself on boot up. The initialisation code used is:

    int ct=0;
    while(ct<DEV_NUMBUFFS)
    {
        skb_queue_head_init(&dev->buffs[ct]);
        ct++;
    }

All other fields should be initialised to 0.

The device gets to select the queue length it wants by setting the field dev->tx_queue_len to the maximum number of frames the kernel should queue for the device. Typically this is around 100 for ethernet and 10 for serial lines. A device can modify this dynamically, although its effect will lag the change slightly.

Network Device Methods

Each network device has to provide a set of actual functions (methods) for the basic low level operations. It should also provide a set of support functions that interface the protocol layer to the protocol requirements of the link layer it is providing.

Setup

The init method is called when the device is initialised and registered with the system. It should perform any low level verification and checking needed, and return an error code if the device is not present, areas cannot be registered or it is otherwise unable to proceed. If the init method returns an error the register_netdev() call returns the error code and the device is not created.

Frame Transmission

All devices must provide a transmit function. It is possible for a device to exist that cannot transmit. In this case the device needs a transmit function that simply frees the buffer it is passed. The dummy device has exactly this functionality on transmit.

The dev->hard_start_xmit() function is called and provides the driver with its own device pointer and network buffer (an sk_buff) to transmit. If your device is unable to accept the buffer, it should return 1 and set dev->tbusy to a non-zero value. This will queue the buffer and it may be retried again later, although there is no guarantee that the buffer will be retried. If the protocol layer decides to free the buffer the driver has rejected, then it will not be offered back to the device. If the device knows the buffer cannot be transmitted in the near future, for example due to bad congestion, it can call dev_kfree_skb() to dump the buffer and return 0 indicating the buffer is processed.
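The return convention of the transmit method can be traced with a small user-space model. This is a sketch under stated assumptions: struct toy_dev, its ring_used, ring_size, congested and tx_dropped fields, and both functions are hypothetical; only the meaning of the return values and of tbusy is taken from the text.

```c
#include <stddef.h>

/* Hypothetical model of a device with a small transmit ring. */
struct toy_dev {
    int tbusy;                /* non-zero: the device cannot take more frames */
    int ring_used;            /* frames currently loaded into the "hardware"  */
    int ring_size;            /* how many frames the "hardware" can hold      */
    int congested;            /* non-zero: frame cannot go out any time soon  */
    unsigned long tx_dropped; /* frames the driver chose to dump              */
};

/*
 * Returns 0 when the frame has been dealt with (loaded or dumped) and
 * 1 when the caller should keep it queued and possibly retry later.
 */
static int toy_hard_start_xmit(struct toy_dev *dev, const void *frame)
{
    (void)frame;                  /* the model does not look at the contents */

    if (dev->congested) {
        dev->tx_dropped++;        /* the real driver would free the buffer */
        return 0;
    }
    if (dev->ring_used == dev->ring_size) {
        dev->tbusy = 1;           /* refuse the frame: no room */
        return 1;
    }
    dev->ring_used++;             /* frame accepted */
    return 0;
}

/* Called when the "hardware" reports that one frame has left the ring. */
static void toy_tx_done(struct toy_dev *dev)
{
    if (dev->ring_used > 0)
        dev->ring_used--;
    dev->tbusy = 0;               /* room again, frames may be offered */
}
```

With a ring of two entries, the third call in a row returns 1 and leaves tbusy set; after toy_tx_done() the next frame is accepted again.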

If there is room the buffer should be processed. The buffer handed down already contains all the headers necessary, including link layer headers, and need only be actually loaded into the hardware for transmission. In addition, the buffer is locked. This means that the device driver has absolute ownership of the buffer until it chooses to relinquish it. The contents of an sk_buff remain read-only, except that you are guaranteed that the next/previous pointers are free so you can use the sk_buff list primitives to build internal chains of buffers.

When the buffer has been loaded into the hardware, or in the case of some DMA driven devices, when the hardware has indicated transmission complete, the driver must release the buffer. This is done by calling dev_kfree_skb(skb, FREE_WRITE). As soon as this call is made, the sk_buff in question may spontaneously disappear and the device driver thus should not reference it again.

Frame Headers

It is necessary for the high level protocols to append low level headers to each frame before queueing it for transmission. It is also clearly undesirable that the protocol know in advance how to append low level headers for all possible frame types. Thus the protocol layer calls down to the device with a buffer that has at least dev->hard_header_len bytes free at the start of the buffer. It is then up to the network device to correctly call skb_push() and to put the header on the packet in its dev->hard_header() method. Devices with no link layer header, such as SLIP, may have this method specified as NULL.

The method is invoked giving the buffer concerned, the device's own pointers, its protocol identity, pointers to the source and destination hardware addresses, and the length of the packet to be sent. As the routine may be called before the protocol layers are fully assembled, it is vital that the method use the length parameter, not the buffer length.

The source address may be NULL to mean ``use the default address of this device'', and the destination may be NULL to mean ``unknown''. If as a result of an unknown destination the header may not be completed, the space should be allocated and any bytes that can be filled in should be filled in. This facility is currently only used by IP when ARP processing must take place. The function must then return the negative of the bytes of header added. If the header is completely built it must return the number of bytes of header added.

When a header cannot be completed the protocol layers will attempt to resolve the address necessary. When this occurs, the dev->rebuild_header() method is called with the address at which the header is located, the device in question, the destination IP address, and the network buffer pointer. If the device is able to resolve the address by whatever means available (normally ARP), then it fills in the physical address and returns 1. If the header cannot be resolved, it returns 0 and the buffer will be retried the next time the protocol layer has reason to believe resolution will be possible.

Reception

There is no receive method in a network device, because it is the device that invokes processing of such events. With a typical device, an interrupt notifies the handler that a completed packet is ready for reception. The device allocates a buffer of suitable size with dev_alloc_skb() and places the bytes from the hardware into the buffer. Next, the device driver analyses the frame to decide the packet type.

The driver sets skb->dev to the device that received the frame. It sets skb->protocol to the protocol the frame represents so that the frame can be given to the correct protocol layer. The link layer header pointer is stored in skb->mac.raw and the link layer header removed with skb_pull() so that the protocols need not be aware of it. Finally, to keep the link and protocol isolated, the device driver must set skb->pkt_type to one of the following:

PACKET_BROADCAST   Link layer broadcast.
PACKET_MULTICAST   Link layer multicast.
PACKET_SELF        Frame to us.
PACKET_OTHERHOST   Frame to another single host.

This last type is normally reported as a result of an interface running in promiscuous mode.

Finally, the device driver invokes netif_rx() to pass the buffer up to the protocol layer. The buffer is queued for processing by the networking protocols after the interrupt handler returns. Deferring the processing in this fashion dramatically reduces the time interrupts are disabled and improves overall responsiveness. Once netif_rx() is called, the buffer ceases to be property of the device driver and may not be altered or referred to again.

Flow control on received packets is applied at two levels by the protocols. Firstly a maximum amount of data may be outstanding for netif_rx() to process. Secondly each socket on the system has a queue which limits the amount of pending data. Thus all flow control is applied by the protocol layers. On the transmit side a per device variable dev->tx_queue_len is used as a queue length limiter. The size of the queue is normally 100 frames, which is enough that the queue will be kept well filled when sending a lot of data over fast links. On a slow link such as a SLIP link, the queue is normally set to about 10 frames, as sending even 10 frames is several seconds of queued data.

One piece of magic that is done for reception with most existing devices, and one you should implement if possible, is to reserve the necessary bytes at the head of the buffer to land the IP header on a long word boundary. The existing ethernet drivers thus do:

    skb = dev_alloc_skb(length + 2);
    if (skb == NULL)
        return;
    skb_reserve(skb, 2);
    /* then 14 bytes of ethernet hardware header */

to align IP headers on a 16 byte boundary, which is also the start of a cache line and helps give performance improvements. On the Sparc or DEC Alpha these improvements are very noticeable.
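The arithmetic behind the two reserved bytes can be checked in user space. The sketch below is not driver code: ETH_HEADER_LEN and both helper names are made up for illustration, and it only models offsets from the start of the buffer, assuming the 14 byte ethernet header mentioned above.

```c
#include <stddef.h>

/* Length of an ethernet link layer header: 6 + 6 + 2 bytes.
 * (Illustrative constant, not taken from a kernel header.) */
#define ETH_HEADER_LEN 14

/* Offset of the IP header from the start of the buffer when
 * `reserve` bytes are skipped before the link layer header. */
static size_t ip_header_offset(size_t reserve)
{
    return reserve + ETH_HEADER_LEN;
}

/* Nonzero when `offset` falls on an `alignment` byte boundary. */
static int is_aligned(size_t offset, size_t alignment)
{
    return offset % alignment == 0;
}
```

With no reservation the IP header would start at offset 14, which is not even long word aligned; reserving 2 bytes moves it to offset 16.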

Optional Functionality

Each device has the option of providing additional functions and facilities to the protocol layers. Not implementing these functions will cause a degradation in service available via the interface but not prevent operation. These operations split into two categories--configuration and activation/shutdown.

Activation And Shutdown

When a device is activated (that is, the flag IFF_UP is set) the dev->open() method is invoked if the device has provided one. This permits the device to take any action, such as enabling the interface, that is needed when the interface is to be used. An error return from this function causes the device to stay down and causes the user request to activate the device to fail with the error returned by dev->open().

The second use of this function is with any device loaded as a module. Here it is necessary to prevent a device being unloaded while it is open. Thus the MOD_INC_USE_COUNT macro must be used within the open method.

The dev->close() method is invoked when the device is configured down and should shut off the hardware in such a way as to minimise machine load (for example by disabling the interface or its ability to generate interrupts). It can also be used to allow a module device to be unloaded now that it is down. The rest of the kernel is structured in such a way that when a device is closed, all references to it by pointer are removed. This ensures that the device may safely be unloaded from a running system. The close method is not permitted to fail.

Configuration And Statistics

A set of functions provide the ability to query and to set operating parameters. The first and most basic of these is a get_stats routine which when called returns a struct enet_statistics block for the interface. This allows user programs such as ifconfig to see the loading on the interface and any problem frames logged. Not providing this will lead to no statistics being available.

The dev->set_mac_address() function is called whenever a superuser process issues an ioctl of type SIOCSIFHWADDR to change the physical address of a device. For many devices this is not meaningful and for others not supported. If so leave this function pointer as NULL. Some devices can only perform a physical address change if the interface is taken down. For these check IFF_UP and if set then return -EBUSY.

The dev->set_config() function is called by the SIOCSIFMAP function when a user enters a command like ifconfig eth0 irq 11. It passes an ifmap structure containing the desired I/O and other interface parameters. For most interfaces this is not useful and you can return NULL.

Finally, the dev->do_ioctl() call is invoked whenever an ioctl in the range SIOCDEVPRIVATE to SIOCDEVPRIVATE+15 is used on your interface. All these ioctl calls take a struct ifreq. This is copied into kernel space before your handler is called and copied back at the end. For maximum flexibility any user may make these calls and it is up to your code to check for superuser status when appropriate. For example the PLIP driver uses these to set parallel port time out speeds to allow a user to tune the plip device for their machine.

Multicasting

Certain physical media types such as ethernet support multicast frames at the physical layer. A multicast frame is heard by a group of hosts on the network, but not all of them, rather than going from one host to another.

The capabilities of ethernet cards are fairly variable. Most fall into one of three categories:

1. No multicast filters. The card either receives all multicasts or none of them. Such cards can be a nuisance on a network with a lot of multicast traffic such as group video conferences.
2. Hash filters. A table is loaded onto the card giving a mask of entries that we wish to hear multicast for. This filters out some of the unwanted multicasts but not all.
3. Perfect filters. Most cards that support perfect filters combine this option with 1 or 2 above. This is done because the perfect filter often has a length limit of 8 or 16 entries.

It is especially important that ethernet interfaces are programmed to support multicasting. Several ethernet protocols (notably Appletalk and IP multicast) rely on ethernet multicasting. Fortunately, most of the work is done by the kernel for you (see net/core/dev_mcast.c).

The kernel support code maintains lists of physical addresses your interface should be allowing for multicast. The device driver may return frames matching more than the requested list of multicasts if it is not able to do perfect filtering.

Whenever the list of multicast addresses changes the device driver's dev->set_multicast_list() function is invoked. The driver can then reload its physical tables. Typically this looks something like:

    if (dev->flags & IFF_PROMISC)
        SetToHearAllPackets();
    else if (dev->flags & IFF_ALLMULTI)
        SetToHearAllMulticasts();
    else
    {
        if (dev->mc_count < 16)
        {
            LoadAddressList(dev->mc_list);
            SetToHearList();
        }
        else
            SetToHearAllMulticasts();
    }

There are a small number of cards that can only do unicast or promiscuous mode. In this case the driver, when presented with a request for multicasts, has to go promiscuous. If this is done, the driver must itself also set the IFF_PROMISC flag in dev->flags.

In order to aid the driver writer the multicast list is kept valid at all times. This simplifies many drivers, as a reset from an error condition in a driver often has to reload the multicast address lists.
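The decision made in the set_multicast_list() fragment above can be modelled as a small pure function, which makes the fallback for an overfull filter easy to see. This is a user space sketch, not kernel code: the DEMO_ flag values, the enum, the function name and the 16 entry limit are all assumptions chosen to mirror the fragment.

```c
/* Illustrative stand-ins for the interface flags tested by the driver. */
#define DEMO_IFF_PROMISC   0x100u
#define DEMO_IFF_ALLMULTI  0x200u

/* Hypothetical size of the card's perfect filter. */
#define PERFECT_FILTER_SLOTS 16

enum rx_mode {
    RX_PROMISCUOUS,      /* hear every packet */
    RX_ALL_MULTICAST,    /* hear every multicast frame */
    RX_FILTER_LIST       /* load the address list into the card */
};

/* Choose a receive mode from the interface flags and the number of
 * multicast addresses the kernel wants the interface to hear. */
static enum rx_mode choose_rx_mode(unsigned int flags, int mc_count)
{
    if (flags & DEMO_IFF_PROMISC)
        return RX_PROMISCUOUS;
    if (flags & DEMO_IFF_ALLMULTI)
        return RX_ALL_MULTICAST;
    if (mc_count < PERFECT_FILTER_SLOTS)
        return RX_FILTER_LIST;
    /* Too many addresses for the filter: accept more than was asked for. */
    return RX_ALL_MULTICAST;
}
```

Falling back to all multicasts is allowed because, as noted above, a driver may return frames matching more than the requested list.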

Ethernet Support Routines

Ethernet is probably the most common physical interface type that is handled. The kernel provides a set of general purpose ethernet support routines that such drivers can use.

eth_header() is the standard ethernet handler for the dev->hard_header routine, and can be used in any ethernet driver. Combined with eth_rebuild_header() for the rebuild routine it provides all the ARP lookup required to put ethernet headers on IP packets.

The eth_type_trans() routine expects to be fed a raw ethernet packet. It analyses the headers and sets skb->pkt_type and skb->mac itself as well as returning the suggested value for skb->protocol. This routine is normally called from the ethernet driver receive interrupt handler to classify packets.

eth_copy_and_sum(), the final ethernet support routine, is quite internally complex but offers significant performance improvements for memory mapped cards. It provides the support to copy and checksum data from the card into an sk_buff in a single pass. This single pass through memory almost eliminates the cost of checksum computation when used and can really help IP throughput.

Alan Cox has been working on Linux since version 0.95, when he installed it in order to do further work on the AberMUD game. He now manages the Linux Networking, SMP, and Linux/8086 projects and hasn't done any work on AberMUD since November 1993.

Messages
1. Question on network interfaces by Vijay Gupta
   1. Untitled
      1. Finding net info by Alan Cox
   2. Re: Question on network interfaces by Pedro Roque
2. Question on alloc_skb() by Joern Wohlrab
   1. Re: Question on alloc_skb() by Erik Petersen

Question on alloc_skb()
Forum: Network Buffers And Memory Management
Keywords: network interface
Date: Mon, 31 Mar 1997 18:16:47 GMT
From: Joern Wohlrab <unknown>

Hi,

As I read in that introduction a network device driver has to alloc a sk_buff in its ISR and this has to happen atomically, isn't it? Well my experiences are that unfortunately it often happens that alloc_skb returns NULL. So my idea was I alloc a few sk_buffs (with GFP_KERNEL flag) in the device driver's open function. The device driver would organize these sk_buffs as a ring. Else the device driver must forbid the higher layer to free the sk_buffs. Is this possible just by setting the lock flag inside sk_buff? The only problem with this scheme is when the user process doesn't read frequently from the socket: the device driver overwrites unread sk_buff data again and the packet order would be destroyed. But let's assume we don't care. So is this plan possible at all?

Thank you very much.
--
Joern Wohlrab

Messages
1. Re: Question on alloc_skb() by Erik Petersen

Re: Question on alloc_skb()
Forum: Network Buffers And Memory Management
Re: Question on alloc_skb() (Joern Wohlrab)
Keywords: network interface
Date: Tue, 08 Apr 1997 04:42:56 GMT
From: Erik Petersen <erik@spellcast.com>

Hmm... I'm not sure why you would want to do this personally. When you pass the skb to netif_rx you are essentially saying "here you go". You can't expect to reclaim the buffer as it will eventually be freed.

If alloc_skb() returns NULL, there is no memory to allocate the block you want. Normally you would report the squeeze and drop the data. If you must make a best effort to deliver the data regardless of the memory situation at the time the data is received (the interrupt handler), I would create an skb list when the driver loads. Then during the interrupt, try to alloc_skb and if it fails, stuff the data in one of the pre-alloced buffers. Then the next time an interrupt occurs, try to replenish the buffer pool.

If you're smart about it, you could balance the buffer pool at interrupt time to ensure you have enough to do the job. If you're using a good number of the buffers continuously, you might want to dynamically increase the number of buffers in the pool. If you aren't, you could reduce the number dynamically. If you get to a point where both the alloc_skb fails and the buffer pool is empty, you might want to just printk a message saying "Get more memory cheapskate" or words to that effect.

This would solve short-term squeeze situations but if you are that tight for memory, you're pretty much screwed anyways.

Just a thought...
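The fallback scheme described in this reply can be modelled in user space to see how the pool drains and refills. The sketch below is only an illustration under stated assumptions: it uses plain malloc() style allocators instead of alloc_skb(), refills on the next successful allocation rather than on the next interrupt, and every name in it (rx_pool, pool_refill, rx_get_buffer, failing_alloc, POOL_SIZE) is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_SIZE 4   /* hypothetical number of spare buffers */

/* Allocator signature, so that a failing allocator can be swapped in. */
typedef void *(*alloc_fn)(size_t);

struct rx_pool {
    void  *spare[POOL_SIZE];  /* buffers set aside at load time */
    int    count;             /* how many spares are available */
    size_t buf_size;          /* size of each buffer */
};

/* Stand-in for a memory squeeze: every allocation fails. */
static void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

/* Top the pool up as far as the allocator allows. */
static void pool_refill(struct rx_pool *p, alloc_fn alloc)
{
    while (p->count < POOL_SIZE) {
        void *b = alloc(p->buf_size);
        if (b == NULL)
            break;
        p->spare[p->count++] = b;
    }
}

/* Get a receive buffer: try a fresh allocation first, fall back to a
 * spare, and return NULL (drop the packet) when both are exhausted. */
static void *rx_get_buffer(struct rx_pool *p, alloc_fn alloc)
{
    void *b = alloc(p->buf_size);
    if (b != NULL) {
        pool_refill(p, alloc);   /* memory is available again */
        return b;
    }
    if (p->count > 0)
        return p->spare[--p->count];
    return NULL;
}
```

As the reply points out, a buffer handed to netif_rx() is gone for good, which is why the model replaces used spares with new allocations instead of trying to reclaim them.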

Question on network interfaces
Forum: Network Buffers And Memory Management
Keywords: network interface
Date: Sun, 19 May 1996 17:46:31 GMT
From: Vijay Gupta <vijay@crhc.uiuc.edu>

Hi,

I am working on some networking code at the kernel level using version 1.3.71. I had the following problems:

How to find:
(1) the interfaces which are available (e.g. in the case of a router, you will have many interfaces, one interface for each of its IP addresses). Many times you have both a SLIP as well as an Ethernet interface to your router computer.
(2) the status of the interface: whether the interface is up or down.

There is a structure called "struct ifnet" which is used in include/linux/route.h. But I could not find the definition of this structure anywhere in version 1.3.71. Older versions of linux had the definition of this structure available (struct ifnet also occurs in BSD code). struct ifnet has information like the name of the interface (e.g. le0 or sl0), as well as the status of the interface (whether up or down). But now I am unable to find its definition or use.

Is there a substitute for that structure? If there is no substitute, is it the case that the information about the available interfaces cannot be obtained?

Thank you very much,
Vijay Gupta (Email: vijay@crhc.uiuc.edu)

Messages
1. Untitled
   1. Finding net info by Alan Cox
2. Re: Question on network interfaces by Pedro Roque

Re: Question on network interfaces
Forum: Network Buffers And Memory Management
Re: Question on network interfaces (Vijay Gupta)
Keywords: network interface
Date: Wed, 28 Aug 1996 15:33:15 GMT
From: Pedro Roque <roque@di.fc.ul.pt>

If you want to scan the interface list from kernel space you can do something like:

    {
        struct device *dev;

        for (dev = dev_base; dev != NULL; dev = dev->next) {
            /* your code here */

            /* example */
            if (dev->family == AF_INET) {
                /* this is an inet device */
            }
            if (dev->type == ARPHRD_ETHER) {
                /* this is ethernet */
            }
        }
    }

./Pedro.

Untitled
Forum: Network Buffers And Memory Management
Re: Question on network interfaces (Vijay Gupta)
Keywords: network interface
Date: Tue, 21 May 1996 23:00:03 GMT
From: <unknown>

I more or less managed to get the answers to the above questions by winding through start_kernel -> init -> ifconfig -> ...

Thanks,
Vijay

Messages
1. Finding net info by Alan Cox

Finding net info
Forum: Network Buffers And Memory Management
Re: Question on network interfaces (Vijay Gupta)
Keywords: network interface
Date: Thu, 30 May 1996 16:12:51 GMT
From: Alan Cox <alan.cox@linux.org>

For general scanning there is both the BSD ioctl (which is a pain as you must guess the largest size), or /proc/net/dev (just cat it). For the state of an interface you use a struct ifreq filled in and do SIOCGIFFLAGS and test IFF_RUNNING and IFF_UP.
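From user space, the SIOCGIFFLAGS query mentioned in this reply might look roughly like the sketch below. The two helper names are made up for illustration, the _DEFAULT_SOURCE line is a glibc-specific assumption needed only under strict -std= modes, and the code assumes a Linux system with IPv4 sockets available.

```c
#define _DEFAULT_SOURCE   /* glibc: expose struct ifreq and IFF_* */
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <net/if.h>

/* Fetch the flag word of the named interface into *flags.
 * Returns 0 on success, -1 if the socket or the ioctl fails. */
static int get_if_flags(const char *name, short *flags)
{
    struct ifreq ifr;
    int fd, rc;

    /* Any socket will do as a handle for interface ioctls. */
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

    rc = ioctl(fd, SIOCGIFFLAGS, &ifr);
    close(fd);
    if (rc < 0)
        return -1;

    *flags = ifr.ifr_flags;
    return 0;
}

/* Nonzero when both IFF_UP and IFF_RUNNING are set. */
static int if_is_up_and_running(short flags)
{
    return (flags & IFF_UP) && (flags & IFF_RUNNING);
}
```

A caller would pass an interface name such as "eth0" and test the returned flags; a name that does not exist makes the ioctl fail, so the helper returns -1.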

down/up() - semaphores, set/clear/test_bit()
Forum: Supporting Functions
Date: Tue, 25 Mar 1997 17:38:15 GMT
From: Erez Strauss <unknown>

The following features are almost not documented (AFAIK): semaphore locking with the down() and up() functions, and the usage of them. The bit operations set_bit(), clear_bit() and test_bit() are also missing usage information. Those functions are important for driver programmers who have to take care of SMP/resource locking.

The KHG is missing an example section. Each function in the Linux kernel should have an example page in the KHG.

Please email me <erez@newplaces.com> references if you know of any.

Bug in printk description!
Forum: Supporting Functions
Date: Wed, 19 Feb 1997 01:43:48 GMT
From: Theodore Ts'o <tytso@mit.edu>

The printk description states that (and I quote):

``printk() may cause implicit I/O, if the memory being accessed has been swapped out, and therefore pre-emption may occur at this point. Also, printk() will set the interrupt enable flag, so never use it in code protected by cli(). Because it causes I/O, it is not safe to use in protected code anyway, even it if didn't set the interrupt enable flag.''

This is wrong! First of all, printk accesses kernel memory, which is never swapped out. Hence, there is no risk of causing implicit I/O.

Secondly, printk doesn't use sti(); it uses save_flags()/restore_flags(), so it's safe to use it in an interrupt routine (although it will do horrible things to your interrupt latency, so you obviously only use it for debugging).

File access within a device driver?
Forum: Supporting Functions
Keywords: file access device driver
Date: Wed, 22 Jan 1997 10:51:25 GMT
From: Paul Osborn <pao20@cam.ac.uk>

I have a device driver which locates a custom ISA card in I/O space, and then needs to download a 6kb configuration file to an FPGA on the card. Which functions should I use to read the datafile? Can stdio.h functions be used, or must special functions be used within the kernel?

man pages for request_region() and release_region() (?)
Forum: Supporting Functions
Keywords: release request
Date: Mon, 20 Jan 1997 16:15:26 GMT
From: <mharrison@i-55.com>

Hello,

Recently I read a series of articles on writing device drivers in the Linux Journal. The author mentions two functions: request_region() and release_region(). So far I have been unlucky in finding man-pages for these functions. Any clues or hints would be most appreciated.

cheers
Mike

Can register_*dev() assign an unused major number?
Forum: Supporting Functions
Date: Thu, 09 Jan 1997 06:32:55 GMT
From: <rgerharz@erols.com>

If you call register_*dev() with major=0, will it return and allocate an unused major number? If so, will it do this for modules, also?

Messages
1. Register_*dev() can assign an unused major number. by Reinhold J. Gerharz

Register_*dev() can assign an unused major number.
Forum: Supporting Functions
Re: Can register_*dev() assign an unused major number?
Keywords: register_chrdev major device
Date: Mon, 03 Feb 1997 17:48:13 GMT
From: Reinhold J. Gerharz <rgerharz@erols.com>

If the first parameter to register_chrdev() is zero (0), register_chrdev() will attempt to return an unused major device number. If it returns <0, then the return value is an error code.

(Moderator: Please delete this paragraph and replace my previous message, above, with this one.)

memcpy_*fs(): which way is "fs"?
Forum: Supporting Functions
Keywords: USER KERNEL SPACE MEMORY COPY
Date: Thu, 09 Jan 1997 05:00:55 GMT
From: Reinhold J. Gerharz <rgerharz@erols.com>

memcpy_*fs()

    inline void memcpy_tofs(void * to, const void * from, unsigned long n)
    inline void memcpy_fromfs(void * to, const void * from, unsigned long n)

It is not clear which way the copy occurs. Does "from" mean user space, or kernel space? Contrarily, does "to" mean kernel space or user space?

Assuming the "tofs" and "fromfs" refer to the Frame Segment register, can one assume it always points to user space? How does this carry over to other architectures? Do they have Frame Segment registers?

Messages
1. memcpy_tofs() and memcpy_fromfs() by David Hinds

memcpy_tofs() and memcpy_fromfs()
Forum: Supporting Functions
Re: memcpy_*fs(): which way is "fs"? (Reinhold J. Gerharz)
Keywords: USER KERNEL SPACE MEMORY COPY
Date: Mon, 13 Jan 1997 22:35:53 GMT
From: David Hinds <dhinds@hyper.stanford.edu>

In older versions of the Linux kernel, the i386 FS segment register pointed to user space. So, memcpy_tofs meant to user space, and memcpy_fromfs meant from user space. On other platforms, these did the right thing despite the non-existence of an FS register. These calls are deprecated in current kernels, however, and new code should use copy_from_user() and copy_to_user().

init_wait_queue()
Forum: Supporting Functions
Date: Tue, 19 Nov 1996 17:14:17 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

Before calling sleep_on() or wake_up() on a wait queue, you must initialize it with the init_wait_queue() function.

request_irq(...,void *dev_id)
Forum: Supporting Functions
Keywords: request_irq
Date: Tue, 29 Oct 1996 14:54:25 GMT
From: Robert Wilhelm <robert@physiol.med.tu-muenchen.de>

request_irq() and free_irq() seem to take a new parameter in Linux 2.0.x. What is the magic behind this?

Messages
1. dev_id seems to be for IRQ sharing by Steven Hunyady

dev_id seems to be for IRQ sharing
Forum: Supporting Functions
Re: request_irq(...,void *dev_id) (Robert Wilhelm)
Keywords: request_irq dev_id IRQ-sharing
Date: Tue, 08 Apr 1997 02:11:34 GMT
From: Steven Hunyady <hunyady@kestrel.nmt.edu>

Look in Don Becker's 3c59x.c net driver. Apparently, IRQ sharing amongst like (or dissimilar?) cards developed progressively in the kernel, and this driver, usable in several major kernel versions, shows this ongoing adaptation. Most other device drivers have not yet allowed for multiple use of IRQ lines, hence they simply put "NULL" for this fifth parameter in the function request_irq() and the second in free_irq().

udelay should be mentioned
Forum: Supporting Functions
Keywords: udelay
Date: Tue, 22 Oct 1996 13:45:41 GMT
From: Klaus Lindemann <lindeman@nbi.dk>

Hi

I think that the function udelay() should be mentioned in this section, since it is not possible to use delay in kernel modules (or at least that is how I understood it).

Regards
Klaus Lindemann

vprintk would be nice...
Forum: Supporting Functions
Keywords: printk
Date: Mon, 21 Oct 1996 18:58:25 GMT
From: Robert Baruch <baruch@oramp.com>

I wish there were a function analogous to vprintf except for the kernel -- vprintk.

Messages
1. RE: vprintk would be nice...

RE: vprintk would be nice...
Forum: Supporting Functions
Re: vprintk would be nice... (Robert Baruch)
Keywords: printk
Date: Thu, 09 Jan 1997 05:19:03 GMT
From: <unknown>

What's wrong with using sprintf()? I do.

add_timer function errata?
Forum: Supporting Functions
Date: Mon, 07 Oct 1996 09:45:17 GMT
From: Tim Ferguson <timf@dgs.monash.edu.au>

It seems that when using the add_timer function in newer versions of the kernel (2.0+), the `expires' variable in the timer_list struct is the time rather than the length of time before the timer will be processed. To be backward compatible with older versions of linux, you need to do something like:

if the old version was:

    timer.expires = TIME_LENGTH;

new version would be:

    timer.expires = jiffies + TIME_LENGTH;

where TIME_LENGTH is the time in 1/100ths of a second.

Could anyone tell me if they found this also to be the case, and if so, could the Linux hackers guide please be updated.

thanks,
Tim.

Messages
1. add_timer function errata by Tom Bjorkholm

add_timer function errata
Forum: Supporting Functions
Re: add_timer function errata? (Tim Ferguson)
Date: Mon, 17 Feb 1997 17:42:33 GMT
From: Tom Bjorkholm <tomb@mydata.se>

Tim,

You are correct... or at least I have the same experience as you have. The time you should give is "jiffies + TIMEOUT".

Could someone fix this in the original documentation...

/Tom Bjorkholm
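The difference between a relative delay and an absolute expiry time can be shown with a small user space model of the tick counter. This is a sketch only: TICKS_PER_SECOND and both helper names are illustrative, the tick rate of 100 per second is taken from the "1/100ths of a second" remark in the thread, and the signed-difference comparison is a common idiom for counters that wrap (it relies on the usual two's complement conversion).

```c
/* Assumed tick rate: 100 ticks per second, so one tick is 1/100 s. */
#define TICKS_PER_SECOND 100UL

/* Absolute expiry time for a timer that should fire `delay_ticks`
 * after `now`; this is the "jiffies + TIMEOUT" form. */
static unsigned long timer_expiry(unsigned long now, unsigned long delay_ticks)
{
    return now + delay_ticks;
}

/* Nonzero once `now` has reached `expires`. The subtraction is done in
 * unsigned arithmetic and then viewed as signed, so the test keeps
 * working when the tick counter wraps around. */
static int timer_expired(unsigned long now, unsigned long expires)
{
    return (long)(now - expires) >= 0;
}
```

Storing a bare delay in the expiry field would make the timer look long overdue on any system that has been up for more than that many ticks.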

Very short waits
Forum: Supporting Functions
Keywords: short timer jiffies sleep wait
Date: Mon, 23 Sep 1996 20:02:38 GMT
From: Kenn Humborg <kenn@wombat.ie>

Is there any way to wait for less than a jiffy without spinning and tying up the CPU? I'm trying to implement a key-click and kd_mksound can't make sounds shorter than 10ms.

Thanks
Kenn Humborg
kenn@wombat.ie

Add the kill_xxx() family to Supporting functions?
Forum: Supporting Functions
Keywords: kill_xxx(), signaling
Date: Sun, 22 Sep 1996 15:11:54 GMT
From: Burkhard Kohl <b.kohl@ipn-b.comlink.apc.org>

For the development of a char driver I needed functionality to signal an interrupt to the process in user space. The KHG does not give any hint how to do that. Finally, after quite some browsing through kernel sources I came across the kill_xxxx() family in exit.c.

I found kill_pg() and kill_proc() widely used in a couple of char drivers. Another one is kill_fasync() which is mostly used by mouse drivers. After some hacking I managed to use kill_proc() for my purpose. But I still don't know how to handle the priv parameter correctly. Obviously 0 means without and 1 with certain (what?) privileges. I have no idea what kill_fasync is used for.

Wouldn't it be nice to have the kill_xxxx() family described in the KHG? Michael, what do you think? Anyone willing to take this? I could do the stubs if someone who really knows will do the annotation.

Any comment, thoughts and flames are welcome.

Burkhard.

P.S. My email address is: b.kohl@ipn-b.comlink.apc.org

Allocating large amount of memory
Forum: Supporting Functions
Keywords: memory allocation
Date: Mon, 03 Jun 1996 22:25:40 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

Matt Welsh has designed a solution to the need for very large areas of contiguous physical memory, which is specifically necessary for some DMA needs. If you need it, pick up a copy of bigphysarea, which should work with most modern kernels.

Messages
1. bigphysarea for Linux 2.0? by Greg Hager

bigphysarea for Linux 2.0?
Forum: Supporting Functions
Re: Allocating large amount of memory (Michael K. Johnson)
Keywords: memory allocation Linux 2.0
Date: Wed, 24 Jul 1996 08:47:41 GMT
From: Greg Hager <hager@cs.yale.edu>

I acquired the bigphysarea patch (for a digitizer driver that I am writing), but unfortunately patch -p0 fails on Linux 2.0. Has anyone modified the patch for 2.0?

Greg

Translating Addresses in Kernel Space

From a message from Linus Torvalds to the linux-kernel mailing list of 27 Sep 1996, edited.

I'll take this opportunity to tell all device driver writers about the ugly secrets of portability. Things are actually worse than just physical and virtual addresses.

The aha1542 is a bus-master device, and [a patch posted to the linux-kernel list] makes the driver give the controller the physical address of the buffers, which is correct on x86, because all bus master devices see the physical memory mappings directly.

However, on many setups, there are actually three different ways of looking at memory addresses, and in this case we actually want the third, the so-called "bus address".

Essentially, the three ways of addressing memory are (this is "real memory", i.e. normal RAM, see later about other details):

- CPU untranslated. This is the "physical" address, ie physical address 0 is what the CPU sees when it drives zeroes on the memory bus.
- CPU translated address. This is the "virtual" address, and is completely internal to the CPU itself with the CPU doing the appropriate translations into "CPU untranslated".
- Bus address. This is the address of memory as seen by OTHER devices, not the CPU. Now, in theory there could be many different bus addresses, with each device seeing memory in some device-specific way, but happily most hardware designers aren't actually actively trying to make things any more complex than necessary, so you can assume that all external hardware sees the memory the same way.

Now, on normal PC's, the bus address is exactly the same as the physical address, and things are very simple indeed. However, they are that simple because the memory and the devices share the same address space, and that is not generally necessarily true on other PCI/ISA setups.

Now, just as an example, on the PReP (PowerPC Reference Platform), the CPU sees a memory map something like this (this is from memory):

    0-2GB    "real memory"
    2GB-3GB  "system IO" (ie inb/out type accesses on x86)
    3GB-4GB  "IO memory" (ie shared memory over the IO bus)

Now, that looks simple enough. However, when you look at the same thing from the viewpoint of the devices, you have the reverse, and the physical memory address 0 actually shows up as address 2GB for any IO master.

So when the CPU wants any bus master to write to physical memory 0, it has to give the master address 0x80000000 as the memory address.

So, for example, depending on how the kernel is actually mapped on the PPC, you can end up with a setup like this:

    physical address: 0

    virtual address:  0xC0000000
    bus address:      0x80000000

where all the addresses actually point to the same thing, it's just seen through different translations.

Similarly, on the alpha, the normal translation is

    physical address: 0
    virtual address:  0xfffffc0000000000
    bus address:      0x40000000

(but there are also alpha's where the physical address and the bus address are the same).

Anyway, the way to look up all these translations, you do:

    #include <asm/io.h>

    phys_addr = virt_to_phys(virt_addr);
    virt_addr = phys_to_virt(phys_addr);
    bus_addr = virt_to_bus(virt_addr);
    virt_addr = bus_to_virt(bus_addr);

Now, when do you need these?

You want the virtual address when you are actually going to access that pointer from the kernel. So you can have something like this (from the aha1542 driver):

    /*
     * this is the hardware "mailbox" we use to communicate with
     * the controller. The controller sees this directly.
     */
    struct mailbox {
        __u32 status;
        __u32 bufstart;
        __u32 buflen;
        ..
    } mbox;

    unsigned char * retbuffer;

    /* get the address from the controller */
    retbuffer = bus_to_virt(mbox.bufstart);
    switch (retbuffer[0]) {
        case STATUS_OK:
            ...

On the other hand, you want the bus address when you have a buffer that you want to give to the controller:

    /* ask the controller to read the sense status into "sense_buffer" */
    mbox.bufstart = virt_to_bus(&sense_buffer);
    mbox.buflen = sizeof(sense_buffer);
    mbox.status = 0;
    notify_controller(&mbox);

And you generally never want to use the physical address, because you can't use that from the CPU (the CPU only uses translated virtual addresses), and you can't use it from the bus master.

So why do we care about the physical address at all? We do need the physical address in some cases, it's just not very often in normal code. The physical address is needed if you use memory mappings, for example, because the remap_page_range() mm function wants the physical address of the memory to be remapped (the memory management layer doesn't know about devices outside the CPU, so it shouldn't need to know about "bus addresses" etc).

NOTE NOTE NOTE! The above is only one part of the whole equation. The above only talks about "real memory", i.e. CPU memory, i.e. RAM.

There is a completely different type of memory too, and that's the "shared memory" on the PCI or ISA bus. That's generally not RAM (although in the case of a video graphics card it can be normal DRAM that is just used for a frame buffer), but can be things like a packet buffer in a network card etc.

This memory is called "PCI memory" or "shared memory" or "IO memory" or whatever, and there is only one way to access it: the readb/writeb and related functions. You should never take the address of such memory, because there is really nothing you can do with such an address: it's not conceptually in the same memory space as "real memory" at all, so you cannot just dereference a pointer. (Sadly, on x86 it is in the same memory space, so on x86 it actually works to just dereference a pointer, but it's not portable).

For such memory, you can do things like

Reading:

    /*
     * read first 32 bits from ISA memory at 0xC0000, aka
     * C000:0000 in DOS terms
     */
    unsigned int signature = readl(0xC0000);

Remapping and writing:

    /*
     * remap framebuffer PCI memory area at 0xFC000000,
     * size 1MB, so that we can access it: We can directly
     * access only the 640k-1MB area, so anything else
     * has to be remapped.
     */
    char * baseptr = ioremap(0xFC000000, 1024*1024);

    /* write a 'A' to the offset 10 of the area */
    writeb('A',baseptr+10);

    /* unmap when we unload the driver */
    iounmap(baseptr);

Copying and clearing:

    /* get the 6-byte ethernet address at ISA address E000:0040 */
    memcpy_fromio(kernel_buffer, 0xE0040, 6);
    /* write a packet to the driver */
    memcpy_toio(0xE1000, skb->data, skb->len);
    /* clear the frame buffer */
    memset_io(0xA0000, 0, 0x10000);

Ok, that just about covers the basics of accessing IO portably. Questions? Comments? You may think that all the above is overly complex, but one day you might find yourself with a 500MHz alpha in front of you, and then you'll be happy that your driver works ;)

Note that kernel versions 2.0.x (and earlier) mistakenly called ioremap() "vremap()". ioremap() is the proper name, but I didn't think straight when I wrote it originally. People who have to support both can do something like:

    /* support old naming silliness */
    #if LINUX_VERSION_CODE < 0x020100
    #define ioremap vremap
    #define iounmap vfree
    #endif

at the top of their source files, and then they can use the right names even on 2.0.x systems.

And the above sounds worse than it really is. Most real drivers really don't do all that complex things (or rather: the complexity is not so much in the actual IO accesses as in error handling and timeouts etc). It's generally not hard to fix drivers, and in many cases the code actually looks better afterwards:

    unsigned long signature = *(unsigned int *) 0xC0000;

vs.

    unsigned long signature = readl(0xC0000);

I think the second version actually is more readable, no?

Linus
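The relationship between the three address views can be checked with plain arithmetic in user space. The sketch below is only a toy model: it hard-codes the PReP example numbers from the message (virtual = physical + 0xC0000000, bus = physical + 0x80000000), and every toy_* name is a hypothetical stand-in, not the kernel's virt_to_bus() family, whose real definitions are architecture specific.

```c
#include <stdint.h>

/*
 * Toy model of the three address views, using the PReP example numbers
 * quoted in the message. The offsets and all toy_* names are assumptions
 * made for illustration; they are not the kernel's implementation.
 */
#define TOY_VIRT_OFFSET 0xC0000000u /* kernel virtual = physical + 3GB */
#define TOY_BUS_OFFSET  0x80000000u /* bus address    = physical + 2GB */

uint32_t toy_virt_to_phys(uint32_t virt) { return virt - TOY_VIRT_OFFSET; }
uint32_t toy_phys_to_virt(uint32_t phys) { return phys + TOY_VIRT_OFFSET; }

/* What a bus master must be told in order to reach a kernel buffer. */
uint32_t toy_virt_to_bus(uint32_t virt)
{
    return toy_virt_to_phys(virt) + TOY_BUS_OFFSET;
}

/* What the CPU must dereference to read an address a device handed back. */
uint32_t toy_bus_to_virt(uint32_t bus)
{
    return toy_phys_to_virt(bus - TOY_BUS_OFFSET);
}

/* Mailbox shaped like the one in the message (fields only). */
struct toy_mailbox {
    uint32_t status;
    uint32_t bufstart; /* always a bus address: the controller reads it */
    uint32_t buflen;
};

/* Fill the mailbox so the modeled controller can find the buffer. */
void toy_fill_mailbox(struct toy_mailbox *mbox, uint32_t buf_virt, uint32_t len)
{
    mbox->bufstart = toy_virt_to_bus(buf_virt);
    mbox->buflen = len;
    mbox->status = 0;
}
```

In this model a buffer at virtual address 0xC0001000 sits at physical 0x1000 and must be given to the controller as 0x80001000, which is exactly the kind of mismatch that passing the physical address would hide on x86.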

Kernel-Level Exception Handling

From a message from Joerg Pommnitz to the linux-kernel mailing list of 11 Nov 1996, edited.

According to Linus Torvalds:

    People interested in low-level scary stuff should take a look at the uaccess.h files for x86 or alpha, and be ready to spend some time just figuring out what it all does ;)

I am, and I did.

Kernel-level exception handling in Linux 2.1.8

When a process runs in kernel mode, it often has to access user mode memory whose address has been passed by an untrusted program. To protect itself, the kernel has to verify this address.

In older versions of Linux, this was done with the

    int verify_area(int type, const void * addr, unsigned long size)

function. This function verified that the memory area starting at address addr and of size size was accessible for the operation specified in type (read or write). To do this, verify_area had to look up the virtual memory area (vma) that contained the address addr. In the normal case (correctly working program), this test was successful. It only failed for the (hopefully) rare, buggy program. In some kernel profiling tests, this normally unneeded verification used up a considerable amount of time.

To overcome this situation, Linus decided to let the virtual memory hardware present in every Linux capable CPU handle this test.

How does this work?

Whenever the kernel tries to access an address that is currently not accessible, the CPU generates a page fault exception and calls the page fault handler

    void do_page_fault(struct pt_regs *regs, unsigned long error_code)

in arch/i386/mm/fault.c. The parameters on the stack are set up by the low level assembly glue in arch/i386/kernel/entry.S. The parameter regs is a pointer to the saved registers on the stack, error_code contains a reason code for the exception.

do_page_fault first obtains the inaccessible address from the CPU control register CR2. If the address is within the virtual address space of the process, the fault probably occurred because the page was not swapped in, write protected or something similar. However, we are interested in the other case: the address is not valid, there is no vma that contains this address. In this case, the kernel jumps to the bad_area label.

There it uses the address of the instruction that caused the exception (i.e. regs->eip) to find an address where the execution can continue (fixup). If this search is successful, the fault handler modifies the return address (again regs->eip) and returns. The execution will continue at the address in fixup.

Where does fixup point to?

Since we jump to the contents of fixup, fixup obviously points to executable code. This code is hidden inside the user
access macros. I have picked the get_user macro defined in include/asm/uaccess.h as an example. The definition is somewhat hard to follow, so let's peek at the code generated by the preprocessor and the compiler. I selected the get_user call in drivers/char/console.c for a detailed examination.

The original code in console.c line 1405:

    get_user(c, buf);

The preprocessor output (edited to become somewhat readable):

    (
      {
        long __gu_err = - 14 , __gu_val = 0;
        const __typeof__(*( ( buf ) )) *__gu_addr = ((buf));
        if (((((0 + current_set[0])->tss.segment) == 0x18 ) ||
           (((sizeof(*(buf))) <= 0xC0000000UL) &&
           ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
          do {
            __gu_err = 0;
            switch ((sizeof(*(buf)))) {
              case 1:
                __asm__ __volatile__(
                  "1:      mov" "b" " %2,%" "b" "1\n"
                  "2:\n"
                  ".section .fixup,\"ax\"\n"
                  "3:      movl %3,%0\n"
                  "        xor" "b" " %" "b" "1,%" "b" "1\n"
                  "        jmp 2b\n"
                  ".section __ex_table,\"a\"\n"
                  "        .align 4\n"
                  "        .long 1b,3b\n"
                  ".text"  : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
                           ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ;
                break;
              case 2:
                __asm__ __volatile__(
                  "1:      mov" "w" " %2,%" "w" "1\n"
                  "2:\n"
                  ".section .fixup,\"ax\"\n"
                  "3:      movl %3,%0\n"
                  "        xor" "w" " %" "w" "1,%" "w" "1\n"
                  "        jmp 2b\n"
                  ".section __ex_table,\"a\"\n"
                  "        .align 4\n"
                  "        .long 1b,3b\n"
                  ".text"  : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                           ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err ));
                break;
              case 4:
                __asm__ __volatile__(
                  "1:      mov" "l" " %2,%" "" "1\n"
                  "2:\n"
                  ".section .fixup,\"ax\"\n"
                  "3:      movl %3,%0\n"
                  "        xor" "l" " %" "" "1,%" "" "1\n"
                  "        jmp 2b\n"
                  ".section __ex_table,\"a\"\n"
                  "        .align 4\n"
                  "        .long 1b,3b\n"
                  ".text"  : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                           ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err));
                break;
              default:
                (__gu_val) = __get_user_bad();
            }
          } while (0) ;
        ((c)) = (__typeof__(*((buf))))__gu_val;
        __gu_err;
      }
    );

WOW! Black GCC/assembly magic. This is impossible to follow, so let's see what code gcc generates:

            xorl %edx,%edx
            movl current_set,%eax
            cmpl $24,788(%eax)
            je .L1424
            cmpl $-1073741825,64(%esp)
            ja .L1423
    .L1424:
            movl %edx,%eax
            movl 64(%esp),%ebx
    #APP
    1:      movb (%ebx),%dl          /* this is the actual user access */
    2:
    .section .fixup,"ax"
    3:      movl $-14,%eax
            xorb %dl,%dl
            jmp 2b
    .section __ex_table,"a"
            .align 4
            .long 1b,3b
    .text
    #NO_APP
    .L1423:
            movzbl %dl,%esi

The optimizer does a good job and gives us something we can actually understand. Can we? The actual user access is quite obvious. Thanks to the unified address space we can just access the address in user memory. But what does the .section stuff do?

To understand this we have to look at the final kernel:

    $ objdump --section-headers vmlinux

    vmlinux:     file format elf32-i386

    Sections:
    Idx Name          Size      VMA       LMA       File off  Algn
      0 .text         00098f40  c0100000  c0100000  00001000  2**4
                      CONTENTS, ALLOC, LOAD, READONLY, CODE
      1 .fixup        000016bc  c0198f40  c0198f40  00099f40  2**0
                      CONTENTS, ALLOC, LOAD, READONLY, CODE
      2 .rodata       0000f127  c019a5fc  c019a5fc  0009b5fc  2**2
                      CONTENTS, ALLOC, LOAD, READONLY, DATA
      3 __ex_table    000015c0  c01a9724  c01a9724  000aa724  2**2
                      CONTENTS, ALLOC, LOAD, READONLY, DATA
      4 .data         0000ea58  c01abcf0  c01abcf0  000abcf0  2**4
                      CONTENTS, ALLOC, LOAD, DATA
      5 .bss          00018e21  c01ba748  c01ba748  000ba748  2**2
                      ALLOC
      6 .comment      00000ec4  00000000  00000000  000ba748  2**0
                      CONTENTS, READONLY
      7 .note         00001068  00000ec4  00000ec4  000bb60c  2**0
                      CONTENTS, READONLY

There are obviously 2 non standard ELF sections in the generated object file. But first we want to find out what happened to our code in the final kernel executable:

    $ objdump --disassemble --section=.text vmlinux

    c017e785 <do_con_write+c1> xorl   %edx,%edx
    c017e787 <do_con_write+c3> movl   0xc01c7bec,%eax
    c017e78c <do_con_write+c8> cmpl   $0x18,0x314(%eax)
    c017e793 <do_con_write+cf> je     c017e79f <do_con_write+db>
    c017e795 <do_con_write+d1> cmpl   $0xbfffffff,0x40(%esp,1)
    c017e79d <do_con_write+d9> ja     c017e7a7 <do_con_write+e3>
    c017e79f <do_con_write+db> movl   %edx,%eax
    c017e7a1 <do_con_write+dd> movl   0x40(%esp,1),%ebx
    c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
    c017e7a7 <do_con_write+e3> movzbl %dl,%esi

The whole user memory access is reduced to 10 x86 machine instructions. The instructions bracketed in the .section directives are no longer in the normal execution path. They are located in a different section of the executable file:

    $ objdump --disassemble --section=.fixup vmlinux

    c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
    c0199ffa <.fixup+10ba> xorb   %dl,%dl
    c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>

And finally:

    $ objdump --full-contents --section=__ex_table vmlinux

    c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
    c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
    c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................

or in human readable byte order:

    c01aa7c4 c017c093 c0199fe0 c017c097 c017c099  ................
    c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
. it creates the symbols __start_section and __stop_section delimiting the extents of the section.%dl and linked in vmlinux: c017e7a5 <do_con_write+e1> movb (%ebx)..iol.%dl http://ldp.fixup.. In order for the function search_exception_table to find the exception table in the __ex_table section.. The local label 1b (1b stands for next label 1 backward) is the address of the instruction that might fault.long 1b..."ax" .%eax and linked in vmlinux: c0199ff5 <.. what actually happens if a fault from kernel mode with no suitable vma occurs? 1.long 1b.section ..align 4 .... So the instructions 3: movl $-14. 1b and 3b are local labels.%dl jmp 2b ended up in the .10.fixup+10b5> movl $0xfffffff2.....html (5 di 6) [08/03/2001 10.. .%eax The assembly code .fixup section of the object file and the addresses ..... in our case the actual value is c0199ff5: the original assembly code: 3: movl $-14.3b ended up in the __ex_table section of the object file. In our case. access to invalid address: c017e7a5 <do_con_write+e1> movb (%ebx).39] .%dl The local label 3 (backwards again) is the address of the code to handle the fault."a" ...3b becomes the value pair c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ^this is ^this is 1b 3b c017e7a5. What happened? The assembly directives . the address of the label 1b is c017e7a5: the original assembly code: 1: movb (%ebx)...c0199ff5 in the exception table of the kernel.section __ex_table..it/LDP/khg/HyperNews/get/devices/exceptions. So search_exception_table brackets its search by __start___ex_table and __stop___ex_table Exception handling in action So.%eax xorb %dl.. it uses a linker feature: whenever the linker sees a section whose entire name is a valid C identifier..."a" told the assembler to move the following code to the specified sections in the ELF object file..Kernel-Level Exception Handling this is the interesting part! c01aa7e4 c0180a08 c019a001 c0180a0a c019a004 ..section __ex_table.

2. MMU generates exception
3. CPU calls do_page_fault
4. do_page_fault calls search_exception_table (regs->eip == c017e7a5).
5. search_exception_table looks up the address c017e7a5 in the exception table (i.e. the contents of the ELF section __ex_table) and returns the address of the associated fault handle code c0199ff5.
6. do_page_fault modifies its own return address to point to the fault handle code and returns.
7. execution continues in the fault handling code.
8. 8a) EAX becomes -EFAULT (== -14)
   8b) DL becomes zero (the value we "read" from user space)
   8c) execution continues at local label 2 (address of the instruction immediately after the faulting user access).

The steps 8a to 8c in a certain way emulate the faulting instruction.

That's it, mostly. If you look at our example, you might ask why we set EAX to -EFAULT in the exception handler code. Well, the get_user macro actually returns a value: 0, if the user access was successful, -EFAULT on failure. Our original code did not test this return value, however the inline assembly code in get_user tries to return -EFAULT. GCC selected EAX to return this value.

Joerg Pommnitz    | joerg@raleigh.ibm.com | Never attribute to malloc
Mobile/Wireless   | Dept UMRA             | that which can be adequately
Tel:(919)254-6397 | Office B502/E117      | explained by stupidity.
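The lookup at the heart of this walkthrough can be mimicked in user space. The sketch below is an assumption-laden toy, not the kernel's search_exception_table: the toy_* names are hypothetical, the table holds only four (faulting instruction, fixup) pairs copied from the objdump output in the article, and a plain binary search over a table sorted by instruction address stands in for whatever the real function does. A small helper also shows why the raw dump "a5e717c0" reads as c017e7a5 once the little-endian byte order is undone.

```c
#include <stddef.h>
#include <stdint.h>

/* One (faulting instruction, fixup code) pair, as stored in __ex_table. */
struct toy_exception_entry {
    uint32_t insn;
    uint32_t fixup;
};

/* Four pairs copied from the dump in the article, sorted by insn. */
const struct toy_exception_entry toy_ex_table[] = {
    { 0xc017c2f6u, 0xc0199fe9u },
    { 0xc017e7a5u, 0xc0199ff5u }, /* the get_user access in do_con_write */
    { 0xc0180a08u, 0xc019a001u },
    { 0xc0180a0au, 0xc019a004u },
};
const size_t toy_ex_table_len = sizeof(toy_ex_table) / sizeof(toy_ex_table[0]);

/* Binary search for addr; returns the fixup address, or 0 if none exists. */
uint32_t toy_search_exception_table(uint32_t addr)
{
    size_t lo = 0, hi = toy_ex_table_len;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (toy_ex_table[mid].insn == addr)
            return toy_ex_table[mid].fixup;
        if (toy_ex_table[mid].insn < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/*
 * Models steps 4 to 6: given the faulting eip, return the address where
 * execution resumes, or 0 when the fault has no fixup (a real kernel bug).
 */
uint32_t toy_resume_address(uint32_t faulting_eip)
{
    return toy_search_exception_table(faulting_eip);
}

/* Decode four bytes in dump order (little-endian) into an address. */
uint32_t toy_le32(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 | ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}
```

With this table, a fault at c017e7a5 resumes at c0199ff5, while a fault at an address with no entry yields 0, which corresponds to the case where the kernel has no way to recover.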

DMA to user space

Forum: Device Drivers
Date: Wed, 11 Jun 1997 09:23:28 GMT
From: Marcel Boosten <Marcel.Boosten@cern.ch>

Hello,

I'm developing a device driver for a PCI board meant for high performance communication. Interaction with the board is possible via DMA. In order to get optimal performance I need to do DMA directly to user space.

QUESTION: How do I implement DMA to user space?

SUBQUESTIONS:

In "The Linux Kernel", David A Rusling writes the following:

"Device drivers have to be careful when using DMA. First of all the DMA controller knows nothing of virtual memory, it only has access to the physical memory in the system. Therefore the memory that is being DMA'd to or from must be a contiguous block of physical memory. This means that you cannot DMA directly into the virtual address space of a process. YOU CAN HOWEVER LOCK THE PROCESSES PHYSICAL PAGES INTO MEMORY, PREVENTING THEM FROM BEING SWAPPED OUT TO THE SWAP DEVICE DURING A DMA OPERATION. Secondly, the DMA controller cannot access the whole of physical memory. The DMA channel's address register represents the first 16 bits of the DMA address, the next 8 bits come from the page register. This means that DMA requests are limited to the bottom 16 Mbytes of memory."
[see http://www.linuxhq.com/guides/TLK/node87.html]

Reading this the following subquestions arise:

- How can one lock specific process pages?
- How can one obtain the physical address of the pages involved?
- How can one ensure that the pages involved are DMA-able (below 16 MB)?
- Is it possible to obtain a contiguous block of physical memory in user space?

Greetings,
Marcel
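The addressing limits in the quoted passage can be turned into a small check. This is a hedged sketch rather than driver code: the toy_* names are hypothetical, the physical address is assumed to be already known, and the model is an 8-bit ISA DMA channel as the quote describes it (16 address bits from the channel's address register, 8 more from the page register). The rule that a transfer must not cross a 64 KB boundary is an inference from that split, since the page register does not change during a transfer; it is not stated in the quote.

```c
#include <stdint.h>

#define TOY_ISA_DMA_LIMIT 0x1000000u /* 16 MB: 16 + 8 = 24 address bits */
#define TOY_DMA_PAGE_SIZE 0x10000u   /* 64 KB covered by the 16-bit register */

/* Value that would go into the page register (address bits 16..23). */
uint8_t toy_dma_page(uint32_t phys)
{
    return (uint8_t)(phys >> 16);
}

/* Value that would go into the channel's address register (bits 0..15). */
uint16_t toy_dma_offset(uint32_t phys)
{
    return (uint16_t)(phys & 0xFFFFu);
}

/*
 * Returns 1 if the physical range [phys, phys + len) fits the model:
 * entirely below 16 MB and contained in a single 64 KB DMA page.
 * Returns 0 otherwise.
 */
int toy_dma_ok(uint32_t phys, uint32_t len)
{
    if (len == 0 || len > TOY_DMA_PAGE_SIZE)
        return 0;
    if (phys >= TOY_ISA_DMA_LIMIT || len > TOY_ISA_DMA_LIMIT - phys)
        return 0;
    /* First and last byte must share the same page register value. */
    return toy_dma_page(phys) == toy_dma_page(phys + len - 1);
}
```

A 4 KB buffer at physical 0x12000 passes, while an 8 KB buffer starting at 0x1F000 fails because it straddles the boundary at 0x20000. Whether a PCI board doing its own bus mastering is subject to these ISA limits at all depends on the hardware.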

How a device driver can driver his device

Forum: Device Drivers
Keywords: device driver
Date: Sat, 31 May 1997 07:31:17 GMT
From: Kim yeonseop <javakys@hyowon.pusan.ac.kr>

Hi, I am a beginner at device drivers on Linux. I wrote a sample driver, 'zero.c', which is introduced at 'http://www.redhat.com/~johnsonm/devices.html', to test on my system. But I don't know how to drive the 'zero device' with this device driver. Please help me. Thanks for your response.

Kim yeonseop : javakys@hyowon.pusan.ac.kr

Messages
1. Untitled

Untitled

Forum: Device Drivers
Re: How a device driver can driver his device (Kim yeonseop)
Keywords: device driver
Date: Thu, 05 Jun 1997 02:08:25 GMT
From: <unknown>

What do you mean by "drive the driver"? Please explain more clearly.

memcpy error?

Forum: Device Drivers
Keywords: memcpy verify_area
Date: Wed, 21 May 1997 14:33:34 GMT
From: Edgar Vonk <edgar@it.et.tudelft.nl>

I am using memcpy in a device driver to copy data between two buffers in kernel space (one is a DMA buffer) and I keep getting segmentation faults I can't explain. I changed the driver since and now it copies the DMA buffer directly into user space with memcpy_tofs (and verify_area) and this seems to work just fine. Anyone know why? Does this have to do with the memcpy faults under heavy system load? I saw a discussion and a kernel patch about this somewhere.

thanks,

(running i586-linux-2.0.30-RedHat4.1)

Unable to handle kernel paging request - error

Forum: Device Drivers
Keywords: kernel paging
Date: Wed, 14 May 1997 16:15:40 GMT
From: Edgar Vonk <edgar@it.et.tudelft.nl>

Hi,

just a simple question. What does the "Unable to handle kernel paging request at virtual address ..." usually indicate? Does this mean a memory allocation problem, or just a memory addressing problem? Also, why does it come back with a virtual address and not a physical one? Does this mean it is doing something in user space?

I am writing a device driver for a Data Acquisition Card, but haven't got a clue what the bug in my code is.

cheers,

(Running i586-Linux-2.0.30-RedHat4.1)

_syscallX() Macros

Forum: Device Drivers
Date: Wed, 26 Mar 1997 23:07:31 GMT
From: Tom Howley <unknown>

Is it possible to use _syscallX macros in loadable device drivers? I first of all have had problems with "errno: wrong version or undefined". It seems to be defined in linux/lib/errno.c. I want to be able to use the system calls signal, getitimer and setitimer in my driver. Does anybody know how I can get a _syscall() macro to work in my loadable device driver? Any advice would be much appreciated.

Tom.

MediaMagic Sound Card DSP-16. How to run in Linux.

Forum: Device Drivers
Keywords: MediaMagic DSP16
Date: Tue, 18 Mar 1997 04:25:37 GMT
From: Robert Hinson <oppie@afn.org>

I am looking for a way to run the MediaMagic Sound Card DSP-16 under Linux RedHat 4.0. I know it is SoundBlaster and SoundBlaster Pro compatible, but I don't know how to make it work, or how to set it up with the current drivers. I would like some help. I would very much appreciate it. My e-mail address is oppie@afn.org since I don't read this.

What does mark_bh() do?

Forum: Device Drivers
Keywords: network drivers interrupt mark_bh
Date: Wed, 12 Mar 1997 01:42:49 GMT
From: Erik Petersen <erik@spellcast.com>

Can someone explain when and how I use mark_bh()? mark_bh doesn't seem to be explained anywhere but is used by many net drivers for reasons I don't understand. I am assuming from general knowledge that it marks the end of the interrupt service routine and allows a context switch in following code.

Here is why I want to know. I have a network driver in which it would be advantageous to be able to sleep during code initiated by an interrupt. For example a piece of data is received by the device which is passed to a kernel daemon via a character device inode and a select call. I then want to wait for the daemon to respond or timeout. The problem is that I can't really continue transferring data until I get a response from the daemon.

The question is, if I call mark_bh(NET_BH) IMMEDIATE_BH?? before I sleep, can I sleep or do I get "Aiee.. Killing Interrupt handler. Idle task may not sleep"?

Is there somewhere I can look for this information? My only obvious alternative at this point is to create a request queue of some sort and respond to activity on the character device. Any thoughts?

Erik Petersen.

Messages
1. Untitled by Praveen Dwivedi

Untitled

Forum: Device Drivers
Re: What does mark_bh() do? (Erik Petersen)
Keywords: network drivers interrupt mark_bh
Date: Fri, 14 Mar 1997 08:28:12 GMT
From: Praveen Dwivedi <pkd@sequent.com>

I am not an expert on the Linux kernel but here is what my hacking wisdom says. mark_bh() marks the bottom half of a hardware interrupt handler so that it gets run later. The reason why this is the preferred way to do things is that you want to keep the actual interrupt handler as small as possible so as to avoid losing further interrupts. An example would be the timer interrupt, which comes 100 times a second. Generally what happens is that you do the minimum in the actual handler and call mark_bh(); the bottom half then takes care of updating lots of time related system stuff. Look at the code in do_timer. It may help.

-pkd
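The split between a short handler and a deferred bottom half can be sketched in user space. The model below is an illustration under stated assumptions: all toy_* names are hypothetical, marking is modeled as setting one bit in a pending mask, and a separate function later runs every marked routine once. It is loosely patterned on how 2.0-era kernels are generally described, and it says nothing about whether sleeping is allowed in either half.

```c
#include <stdint.h>

#define TOY_NR_BH 32
enum { TOY_TIMER_BH = 0, TOY_NET_BH = 1 }; /* hypothetical slot numbers */

static uint32_t toy_bh_active;                /* one pending bit per slot */
static void (*toy_bh_base[TOY_NR_BH])(void);  /* registered bottom halves */

int toy_ticks_pending;   /* work queued by the fast handler */
int toy_ticks_processed; /* work completed by the bottom half */

/* Register the routine that does the slow part for slot nr. */
void toy_init_bh(int nr, void (*routine)(void))
{
    toy_bh_base[nr] = routine;
}

/* Cheap enough to call from an interrupt handler: just sets a bit. */
void toy_mark_bh(int nr)
{
    toy_bh_active |= (uint32_t)1 << nr;
}

/* Run, once each, every bottom half marked since the last call. */
void toy_do_bottom_half(void)
{
    uint32_t active = toy_bh_active;

    toy_bh_active = 0;
    for (int nr = 0; nr < TOY_NR_BH; nr++)
        if ((active & ((uint32_t)1 << nr)) && toy_bh_base[nr])
            toy_bh_base[nr]();
}

/* Slow part: account for all ticks that arrived in the meantime. */
void toy_timer_bh(void)
{
    toy_ticks_processed += toy_ticks_pending;
    toy_ticks_pending = 0;
}

/* Fast part: record the tick, mark the bottom half, return at once. */
void toy_timer_interrupt(void)
{
    toy_ticks_pending++;
    toy_mark_bh(TOY_TIMER_BH);
}
```

Two calls to toy_timer_interrupt() followed by one toy_do_bottom_half() run the slow routine once and account for both ticks, which is the point of deferring: several interrupts can be absorbed by a single pass of the slower code.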

3D Acceleration

Forum: Device Drivers
Keywords: 3D acceleration driver
Date: Sat, 08 Mar 1997 18:04:25 GMT
From: <jamesbat@innotts.co.uk>

How would I go about making a driver for the Apocalypse 3D? Please email reply.

Device Drivers: /dev/radio...

Forum: Device Drivers
Keywords: device /dev radio
Date: Fri, 07 Mar 1997 00:56:50 GMT
From: Matthew Kirkwood <weejock@ferret.lmh.ox.ac.uk>

Hi,

I intend to write (when my radio card arrives in a couple of days) a driver for /dev/radio. I have already obtained reasonable information for this, which is all fair enough, but I have not yet seen anything along the lines of "/dev/* device creating for the inept...". Should I create a document explaining this? (/dev/radio, as I envisage it, would be a mostly ioctl based thing, depending upon hardware support...)

Thanks, and keep up the hacking.

Matthew.

Does anybody know why kernel wakes my driver up without apparent reasons?

Forum: Device Drivers
Keywords: wake_up interrupt time_out
Date: Wed, 26 Feb 1997 17:02:31 GMT
From: David van Leeuwen <david@tm.tno.nl>

Hi,

i've written a device driver for a cdrom device. It's old, I know. It used to work OK in the old 1.3.fourties. But i keep getting complaints that it doesn't work reliably. Since more modern kernel versions, it tended to break more often. Read errors...

I spent days tracking down the bug. Apparently, the driver was woken without an interrupt occurring, or my own time-out causing the wake-up. I found out that the sleep_on() could immediately wake up (i.e., not go to sleep) many times in a row. I was stymied. I had to hack around by trying to go to sleep up to 100 times, but i am not charmed by the hack.

Now i posted a message similar to this to the kernel list half a year ago, and there was some short reaction that my go_to_sleep routine should do something like

    while(!my_interrupt_woke_me) sleep_on(&wait)

But i wasn't capable of reading the list (sorry) because i use my e-mail address at work.

Why is this? Why does the kernel wake me up if i didn't ask for it (i.e., no interrupt occurred and no time-out occurred)? Does it have to do with the (new?) macros DEVICE_TIMEOUT and TIMEOUT_VALUE that i've _not_ defined (because i wrote it in the KHG 0.5 days)?

Thanks,

---david (david@tm.tno.nl)

Getting a DMA buffer aligned with 64k boundaries

Forum: Device Drivers
Date: Sun, 17 Nov 1996 01:25:45 GMT
From: Juan de La Figuera Bayon <juan@hobbes.fmc.uam.es>

I'm writing a device driver for a Data Translation DT2821 acquisition card. It includes DMA (and I have already worked with it under MSDOS). The polled modes for DA and AD conversion already work. But for the DMA, I need to ask for a buffer which can be up to 128k in size (ok, I usually ask for less than 256 words in my application). And it should be aligned with 64k boundaries. I suppose it is something pretty obvious, but it is my first try at device driver programming under Linux. Any help would be appreciated.

Hardware Interface I/O Access

Forum: Device Drivers
Keywords: I/O
Date: Mon, 07 Oct 1996 12:36:40 GMT
From: Terry Moore <tmoore@solbrn.dseg.ti.com>

I need to write a driver using inb() and outb(). I am struggling with compiling a simple test program to test these functions.

(1.) If I use gcc -o tst tst.c the following failure appears:

    undefined reference to __inbc
    undefined reference to __inb

What I expected was an executable program to run from the command line.

(2.) If I use gcc -o -DMODULE -D__KERNEL__ -c myfile.c I do not understand how to use the resulting file created -DMODULE.

Thanks
Terry M.
tmoore@solbrn.dseg.ti.com

Messages
1. You are somewhat confused... by Michael K. Johnson

You are somewhat confused...

Forum: Device Drivers
Re: Hardware Interface I/O Access (Terry Moore)
Keywords: I/O
Date: Mon, 14 Oct 1996 22:16:16 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

You've got two things mixed up--user level drivers and kernel loadable modules. An executable program is what you want, not a module, so don't define MODULE. Just compile your executable with -O and the undefined references should go away.

Does anybody know something about the SIS 496 IDE chipset?

Forum: Device Drivers
Date: Fri, 27 Sep 1996 13:13:21 GMT
From: Alexander <avenco@online.ru>

I use the SIS 496 (E)IDE controller chipset. Linux 2.0.21 doesn't support it. Does anybody know about it? I need technical information about the chipset for writing a driver.

e-mail: avenco@online.ru
alex

Vertical Retrace Interrupt - I need to use it

Forum: Device Drivers
Date: Wed, 04 Sep 1996 06:39:27 GMT
From: Brynn Rogers <brynn@wwa.com>

I am writing an application that provides new images to the screen every vertical refresh. (Think of it as an animation.) As I understand it, I need to write a device driver to hook the vertical retrace interrupt (whatever interrupt your graphics card generates), and to install a new colormap so the next image is cleanly flipped in. (I don't need many colors, but I need lots of images.)

I have been devouring all information (and donuts) I can get my hands on, and still am a little bit clueless as to how I should go about this. The driver only needs to know a few things, like which image planes are ready and the ID's? of the colormaps to use for which planes, and which screen or GC or whatever it needs.

What I am really confused about is this: Should I have a device that my animation program opens and then uses ioctls to talk to, or just have the driver wake my process and signal it, or something much better that somebody will clue me in on?

Brynn

Messages
1. Your choice... by Michael K. Johnson

Your choice.
Forum: Device Drivers
Re: Vertical Retrace Interrupt - I need to use it (Brynn Rogers)
Date: Sun, 29 Sep 1996 20:44:18 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

> What I am really confused about is this: Should I have a device that my
> animation program opens and then uses ioctls to talk to, or Just have the
> driver wake my process and signal it, or Something much better that
> somebody will clue me in on.

You are quite right that you need a device driver. I recommend avoiding ioctls if you can. If you can use the write() method to take data from the application and the read() method to give data back to the application (remember that those names are user-space-centric), I would recommend that you do it that way. It doesn't sound to me like a case in which ioctl()'s would be the cleanest solution.
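A minimal user-space sketch of what a read()/write() style interface could look like from the application's side. Everything here is an assumption made for illustration: the struct flip_request layout, the function names, and the use of a pipe as a stand-in for the device node (so the sketch runs without any driver). In a real driver, recv_flip() would correspond to the driver's write method receiving the data.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical message the application would write() to the driver:
 * which image plane is ready and which colormap to install for it. */
struct flip_request {
    uint32_t plane;
    uint32_t colormap_id;
};

/* Application side: hand one request over with a single write().
 * Returns 0 on success, -1 on a short or failed write. */
int send_flip(int fd, uint32_t plane, uint32_t colormap_id)
{
    struct flip_request req = { plane, colormap_id };
    return write(fd, &req, sizeof req) == (ssize_t)sizeof req ? 0 : -1;
}

/* Stand-in for the receiving end: pull one whole request with read().
 * Returns 0 on success, -1 on a short or failed read. */
int recv_flip(int fd, struct flip_request *out)
{
    assert(out != NULL);
    return read(fd, out, sizeof *out) == (ssize_t)sizeof *out ? 0 : -1;
}

/* Push one request through a pipe and check it arrives unchanged.
 * Returns 0 if the round trip preserved both fields, -1 otherwise. */
int flip_roundtrip(uint32_t plane, uint32_t colormap_id)
{
    int fds[2];
    struct flip_request got;
    int rc = -1;

    if (pipe(fds) != 0)
        return -1;
    if (send_flip(fds[1], plane, colormap_id) == 0 &&
        recv_flip(fds[0], &got) == 0 &&
        got.plane == plane && got.colormap_id == colormap_id)
        rc = 0;
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

Fixed-size records keep both ends simple: each write() carries exactly one request, so no ioctl numbers need to be defined.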

help working with skb structures
Forum: Device Drivers
Date: Thu, 29 Aug 1996 15:44:32 GMT
From: arkane <cat@iol.unh.edu>

I am working on interfacing directly with the networking device drivers on my linux box. I have tracked down the functions for transmitting ( dev_queue_xmit() ) packets down to the driver level. What I need to do is bypass the socket interface without destroying it, so that I can transmit my own packets of my own design down to the wire ( I am using this for my job of testing new networking hardware -- RMON probes mostly ) so I need to be able to create both good and bad packets with most any kind of data contained inside as RMON-2 will be able to pick apart a packet and identify its contents.

We can build the packets, but we can't get them to the wire through standard means. I think that this can be accomplished with the dev_queue_xmit() function.

Question is: in the sk_buff structure what do I need to set up specifically so that dev_queue_xmit() and the driver will simply pass my data to the hardware without building the standard headers required by ethernet and other network types? I'll worry about that, and if I make a mistake I will clean up the mess.

Any help is appreciated. TIA

cat@iol.unh.edu

Interrupt Sharing ?
Forum: Device Drivers
Keywords: Interrupt sharing, PCI, Plug&Play
Date: Tue, 11 Jun 1996 16:09:00 GMT
From: Frieder Löffler <floeff@mathematik.uni-stuttgart.de>

I wonder if interrupt sharing is an issue for the Linux kernel. I currently have a machine with 2 PCI Plug&Play devices that choose the same irq (an HP-Vectra onboard SCSI controller and a HP J2585 100-VG-AnyLan card). It seems that there is no way to use such configurations at the moment?

Frieder

Messages
1. Interrupt sharing-possible by Vladimir Myslik
   -> Interrupt sharing - How to do with Network Drivers? by Frieder Löffler
   -> Interrupt sharing 101 by Christophe Beauregard

Interrupt sharing-possible
Forum: Device Drivers
Re: Interrupt Sharing ? (Frieder Löffler)
Keywords: Interrupt sharing, PCI, Plug&Play
Date: Thu, 11 Jul 1996 02:24:57 GMT
From: Vladimir Myslik <xmyslik@cslab.felk.cvut.cz>

Linux kernel has support for shared interrupt usage. It has a list of routines (func.) that are called when an HW intr arises. On the interrupt arrival, the routines in the list are called in turn, in the order in which the devices' ISRs were hooked onto this chain.

So, if your SCSI generates int#11 and your ethernet card the same irq, and the bus really notifies the CPU about them, linux should have no problems.

However, the ISA and IMHO PCI devices have problems with sharing one IRQ line per several physical cards (devices). The devices should have been designed with open collector or with 3-state IRQ lines with transition to IRQ active only during the interrupt generation (log. 0/1), instead of sitting on the irq line.

So, a user wanting to find out whether it's possible to share one irq line should set both the cards to it, make either of them generate an interrupt (packet arrival, seek on disk) and look at the /proc/interrupts statistics to see whether the appropriate number incremented or not.

Got all from usenet&kernel sources, don't blame me.

Messages
1. Interrupt sharing - How to do with Network Drivers? by Frieder Löffler
   -> Interrupt sharing 101 by Christophe Beauregard

Interrupt sharing - How to do with Network Drivers?
Forum: Device Drivers
Re: Interrupt Sharing ? (Frieder Löffler)
Keywords: Interrupt sharing, PCI, Plug&Play
Date: Thu, 11 Jul 1996 09:02:56 GMT
From: Frieder Löffler <floeff@mathematik.uni-stuttgart.de>

Hi,

you are right - as I noticed in the AM53C974 SCSI driver, some drivers seem to be designed to share interrupts. But I cannot see at the moment how I can implement interrupt sharing in the networking drivers. Maybe someone could explain how this can be done - for example by adding some lines of code to skeleton.c ? Right now, I can't see how I am supposed to register the interrupt handler routine for the second driver.

Thanks, Frieder

Messages
1. Interrupt sharing 101 by Christophe Beauregard

Interrupt sharing 101
Forum: Device Drivers
Re: Interrupt Sharing ? (Frieder Löffler)
Re: Interrupt sharing - How to do with Network Drivers? (Frieder Löffler)
Keywords: Interrupt sharing, PCI, Plug&Play
Date: Wed, 28 Aug 1996 16:48:01 GMT
From: Christophe Beauregard <chrisb@truespectra.com>

I guess this would be a handy thing to have in the knowledge base.

The key thing to sharing an interrupt is to make sure that you have separate context information for each instance of the driver. That is, no static global variables. For most network drivers you just use the ``struct device* dev'' for the context. Pass this to request_irq() as the last argument:

    request_irq( irq, interrupt_handler, SA_SHIRQ, "MyDevice", dev );

Note that the SA_INTERRUPT flag is significant here, since you can't share an IRQ if one driver uses fast interrupts and the other uses slow interrupts. This is a bug, IMHO, since long chains of interrupt handlers may alter the timing such that processing is no longer ``fast''. A better behaviour would be to just implicitly change to slow interrupts when more than one device is on the IRQ (and change back when the device is released down to one fast handler, of course).

Then your interrupt handler looks something like this:

    static void interrupt_handler( int irq, void* dev_id, ... )
    {
        struct device* dev = (struct device*) dev_id;

        if( dev == NULL ) {
            /* stupid programmer error. Either we passed a NULL dev to
               request_irq or someone screwed up irq.c */
            ASSERT( 0 );
            return;
        }

        /* query the device to see if it caused the interrupt */
        if( !(inb(something)&something_else) ) {
            /* nope, not us - normally we'd call this a spurious
               interrupt, but it might belong to another device. */
            return;
        }

        /* now, using the dev structure, service the interrupt */
        ...

        /* tell the hardware device we're done (IMPORTANT)
           If this isn't done, the device will continue to hold the
           IRQ line high, and we go into a nasty interrupt loop.
           Some devices might do this implicitly in the interrupt
           processing (i.e. by emptying a buffer) */
        outb(something, something);
    }

Because you have a separate ``struct device*'' for each instance of the card, multiple cards can share the same IRQ. Of course, assuming they all Do The Right Thing, they can also share the IRQ with other cards.

You can usually modify an existing device to do shared IRQs by simply finding the part of the code where it spews out a spurious interrupt message and replacing that with a `return' statement, adding SA_SHIRQ to the request_irq call, and removing references to irq2dev_map[]. I've had no problems doing this for drivers including drivers/char/psaux.c, drivers/net/tulip.c, drivers/scsi/aic7xxx.c and most of the MCA drivers.
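The dispatch pattern described above can be tried out in user space. The sketch below is not kernel code: fake_request_irq(), fake_raise_irq() and struct fake_dev are hypothetical stand-ins, and a plain "pending" field plays the role of the status bit a real handler would read with inb(). It only shows why a per-instance dev_id lets several handlers hang off one line.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLERS 4

/* Stand-in for one card instance (the per-device context). */
struct fake_dev {
    const char *name;
    int pending;   /* nonzero: this device raised the interrupt */
    int serviced;  /* how many interrupts this device has handled */
};

typedef void (*irq_handler_t)(int irq, void *dev_id);

struct irq_action {
    irq_handler_t handler;
    void *dev_id;
};

/* One shared line: handlers are kept in registration order. */
static struct irq_action irq_chain[MAX_HANDLERS];
static int irq_chain_len;

/* Simplified stand-in for registering a shared handler.
 * A NULL dev_id is rejected because sharing needs per-device context. */
int fake_request_irq(irq_handler_t handler, void *dev_id)
{
    assert(handler != NULL);
    if (dev_id == NULL || irq_chain_len == MAX_HANDLERS)
        return -1;
    irq_chain[irq_chain_len].handler = handler;
    irq_chain[irq_chain_len].dev_id = dev_id;
    irq_chain_len++;
    return 0;
}

/* Handler shared by all instances; dev_id tells it which one it serves. */
void fake_handler(int irq, void *dev_id)
{
    struct fake_dev *dev = dev_id;

    (void)irq;
    if (!dev->pending)
        return;        /* not ours: it belongs to another device */
    dev->serviced++;   /* service the interrupt */
    dev->pending = 0;  /* acknowledge, so the line is released */
}

/* Simulate the line firing: every registered handler is called in turn. */
void fake_raise_irq(int irq)
{
    for (int i = 0; i < irq_chain_len; i++)
        irq_chain[i].handler(irq, irq_chain[i].dev_id);
}

/* Two devices on one line; returns scsi.serviced * 10 + eth.serviced. */
int shared_irq_demo(void)
{
    struct fake_dev scsi = { "scsi", 0, 0 };
    struct fake_dev eth = { "eth0", 0, 0 };

    irq_chain_len = 0;
    if (fake_request_irq(fake_handler, &scsi) != 0 ||
        fake_request_irq(fake_handler, &eth) != 0)
        return -1;

    eth.pending = 1;  fake_raise_irq(11);
    eth.pending = 1;  fake_raise_irq(11);
    scsi.pending = 1; fake_raise_irq(11);
    return scsi.serviced * 10 + eth.serviced;
}
```

Each call to fake_raise_irq() runs both handlers, but only the device whose pending flag is set does any work, which mirrors the "nope, not us" early return in the handler above.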

Device Driver notification of "Linux going down"
Forum: Device Drivers
Keywords: device drivers shutdown modules init watchdog
Date: Tue, 06 Aug 1996 20:21:06 GMT
From: Stan Troeh <stan@forthrt.com>

We have written a character device driver for FORTHRIGHT's PC WATCHDOG SYSTEM. Suggestions for better ways to package the driver are welcome.

We find, however, in developing a generic "watch" application (tells the hardware that Linux is still healthy) that we can't detect when Linux is intentionally shutting down. If Linux succeeds in going down and back up within the default 2 minute window, everything is transparent and no problem occurs. However we find that at times when the tolerance is set tighter or the user delays making a LILO selection, etc., the hardware performs a physical PC Reset while Linux is coming up (after a shutdown -r).

Is there a "shutting down" call made to the drivers? We have not found one, but have found a place where it could be added in ReadItab() or InitMain() of init.c. But that doesn't seem closely related to device driver module management.

We would also be willing to work on a "generic" solution (such as a device driver __halt() routine) if there is interest in this approach.

Messages
1. Through application which has opened the device by Michael K. Johnson
2. Device Driver notification of "Linux going down" by Marko Kohtala

Through application which has opened the device
Forum: Device Drivers
Re: Device Driver notification of "Linux going down" (Stan Troeh)
Keywords: device drivers shutdown modules init watchdog
Date: Wed, 14 Aug 1996 04:55:38 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

In order to shut down a device, have a user-level application have it opened, and when it is sent SIGTERM by init (or, presumably, any other process), close the device or alert it of the shutdown in some other way.
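A minimal sketch of such a user-level application, using only standard POSIX calls (sigaction(), raise(), write(), close()). The keepalive byte, the function names, and the use of /dev/null as a stand-in for the watchdog device node are assumptions made so the example runs anywhere; a real daemon would open the driver's own device and sleep between keepalives.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set from the signal handler; checked by the main loop. */
static volatile sig_atomic_t got_sigterm;

static void on_sigterm(int signo)
{
    (void)signo;
    got_sigterm = 1;
}

/* Install the SIGTERM handler. Returns 0 on success, -1 on error. */
int install_sigterm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Write keepalives until SIGTERM is seen or max_ticks is reached, then
 * close the device so the driver can notice the shutdown.
 * Returns the number of keepalives written, or -1 on a write error. */
int watch_loop(int fd, int max_ticks)
{
    int ticks = 0;

    assert(fd >= 0);
    while (!got_sigterm && ticks < max_ticks) {
        if (write(fd, "\0", 1) != 1) {
            close(fd);
            return -1;
        }
        ticks++;
        /* a real daemon would sleep() here between keepalives */
    }
    close(fd);
    return ticks;
}

/* Run the loop against /dev/null (stand-in for the device node).
 * If send_term is nonzero, SIGTERM is raised before the loop starts. */
int watchdog_demo(int send_term)
{
    int fd = open("/dev/null", O_WRONLY);

    if (fd < 0)
        return -1;
    if (install_sigterm_handler() != 0) {
        close(fd);
        return -1;
    }
    got_sigterm = 0;
    if (send_term)
        raise(SIGTERM);  /* handler has run by the time raise() returns */
    return watch_loop(fd, 5);
}
```

The handler only sets a flag; the close() happens in the main loop, which keeps the signal handler restricted to async-signal-safe work.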

Device Driver notification of "Linux going down"
Forum: Device Drivers
Re: Device Driver notification of "Linux going down" (Stan Troeh)
Keywords: device drivers shutdown modules init watchdog notifier
Date: Mon, 03 Mar 1997 09:33:14 GMT
From: Marko Kohtala <Marko.Kohtala@ntc.nokia.com>

In 2.1.x kernels there is a boot_notifier_list. See in include/linux/notifier.h and kernel/sys.c.
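For readers unfamiliar with the idea of a notifier list, here is a small user-space model of one: interested parties register a callback on a linked list, and whoever triggers the event walks the list. This is a sketch of the general pattern only. The type and function names below (struct notifier, notifier_register(), notifier_call_all(), EVENT_SHUTDOWN) are hypothetical and are not the kernel's declarations; consult include/linux/notifier.h for the real interface.

```c
#include <assert.h>
#include <stddef.h>

#define EVENT_SHUTDOWN 1UL  /* hypothetical event code */

/* One entry on the chain: a callback plus a link to the next entry. */
struct notifier {
    int (*call)(struct notifier *self, unsigned long event);
    struct notifier *next;
    int priority;  /* higher priority entries are called first */
};

/* Insert n into the list, keeping it sorted by descending priority. */
void notifier_register(struct notifier **list, struct notifier *n)
{
    assert(n != NULL && n->call != NULL);
    while (*list != NULL && (*list)->priority >= n->priority)
        list = &(*list)->next;
    n->next = *list;
    *list = n;
}

/* Call every registered callback with the event; returns how many ran. */
int notifier_call_all(struct notifier *list, unsigned long event)
{
    int called = 0;

    for (; list != NULL; list = list->next) {
        list->call(list, event);
        called++;
    }
    return called;
}

/* Stand-in for driver state: 1 while the watchdog timer is armed. */
static int watchdog_armed = 1;

/* Callback a watchdog driver might register: disarm on shutdown. */
static int watchdog_notify(struct notifier *self, unsigned long event)
{
    (void)self;
    if (event == EVENT_SHUTDOWN)
        watchdog_armed = 0;
    return 0;
}

/* Register the callback, signal shutdown, report the armed state. */
int notifier_demo(void)
{
    struct notifier *chain = NULL;
    struct notifier wd = { watchdog_notify, NULL, 0 };

    watchdog_armed = 1;
    notifier_register(&chain, &wd);
    notifier_call_all(chain, EVENT_SHUTDOWN);
    return watchdog_armed;
}
```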

Is waitv honored?
Forum: Device Drivers
Keywords: waitv VT
Date: Sun, 07 Jul 1996 02:18:18 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

The vt_mode structure in /usr/include/linux/vt.h has a member called waitv that doesn't seem to be used. That is, drivers/char/vt.c examines and sets it, and drivers/char/tty_io.c resets it when the terminal is reset, but nothing else seems to be done with it.

I'm guessing that it exists because the SVR4 VT code has a structure member of the same name, and that the only reason it is set and reset is for compatibility with apps written for SVR4. Am I right?

PCI Driver
Forum: Device Drivers
Keywords: A PCI Driver ???
Date: Wed, 12 Jun 1996 17:04:44 GMT
From: Flavia Donno <flavia@galileo.pi.infn.it>

Probably this is not the right place for this question, but please... answer me! Has anyone written a PCI driver for Linux ? Any example ? Documentation ?

Thank you in advance.

Flavia

Messages
1. There is linux-2.0/drivers/pci/pci.c by Hasdi

There is linux-2.0/drivers/pci/pci.c
Forum: Device Drivers
Re: PCI Driver (Flavia Donno)
Keywords: A PCI Driver ???
Date: Thu, 13 Jun 1996 19:38:02 GMT
From: Hasdi <hasdi@engin.umich.edu>

The subject says it all. I don't know why pci.c is the only file in the pci directory. I thought there are lots of pci drivers. Is there something about pci that every kernel should know about?

Re: Network Device Drivers
Forum: Device Drivers
Keywords: network driver prototype functions
Date: Wed, 22 May 1996 16:23:09 GMT
From: Paul Gortmaker <gpg109@rsphy1.anu.edu.au>

> I don't know anything about this topic.
> Someone has promised to write this section,
> so check back sometime. The kernel source includes a
> skeleton.c file that can get you started.

Hrrm, who was that? (just curious...)

Somebody asked me about a year or so ago as to what the basics of a net driver would look like. I haven't seen Alan's article in linux journal, so this may be useless in comparison. Regardless, here it is anyway.

------------------------------

1) Probe:
   called at boot to check for existence of card. Best if it can check unobtrusively by reading from memory, etc. Can also read from i/o ports. Writing to i/o ports in a probe is *not* allowed as it may kill another device. Some device initialization is usually done here (allocating i/o space, IRQs, filling in the dev->??? fields etc.)

2) Interrupt handler:
   Called by the kernel when the card posts an interrupt. This has the job of determining why the card posted an interrupt, and acting accordingly. Usual interrupt conditions are data to be rec'd, transmit completed, error conditions being reported, etc.

3) Transmit function
   Linked to dev->hard_start_xmit() and is called by the kernel when there is some data that the kernel wants to put out over the device. This puts the data onto the card and triggers the transmit.

4) Receive function
   Called by the interrupt handler when the card reports that there is data on the card. It pulls the data off the card, packages it into a sk_buff and lets the kernel know the data is there for it by doing a netif_rx(sk_buff)

5) Open function
   linked to dev->open and called by the networking layers when somebody does "ifconfig <device_name> up" -- this puts the device on line and enables it for Rx/Tx of data.

Someday, perhaps I will have the time to write a proper document on the subject.... Naaaaahhhhhh.

Paul.

Messages
1. Re: Network Device Drivers by Neal Tucker
   -> network driver info by Neal Tucker
   -> Network Driver Desprately Needed by Paul Atkinson
2. Transmit function by Joerg Schorr
   -> Re: Transmit function by Paul Gortmaker
   -> Skbuff by Joerg Schorr

Re: Network Device Drivers
Forum: Device Drivers
Re: Re: Network Device Drivers (Paul Gortmaker)
Keywords: network driver functions
Date: Thu, 30 May 1996 10:42:09 GMT
From: Neal Tucker <ntucker@adobe.com>

Paul Gortmaker says:
>Somebody asked me about a year or so ago as to what the basics
>of a net driver would look like.
>
>1) Probe:
>   called at boot to check for existence of card. Best if it
>   can check un-obtrsively by reading from memory, etc. Can
>   also read from i/o ports. Writing to i/o ports in a probe
>   is *not* allowed as it may kill another device.
>   Some device initialization is usually done here (allocating
>   i/o space, IRQs, filling in the dev->??? fields etc.)

You must be the guy that wrote that part of the ethernet HOWTO. :-) I've just recently been looking at the network device driver interface, and I read your stuff and this part confused me, since all the code I was looking at (dummy.c, loopback.c, slip.c, etc.) refers to this as "init", rather than "probe" (which may sound a bit nit picky, but there were other routines called "probe" that I studied for a while, thinking they were the important ones (they turned out to be used for module initialization only) :-).

But on to my real reason for writing. One thing that I think would be helpful to people trying to write a network driver for the first time is a description of how this is all hooked into the kernel. I've found plenty of examples of what the actual driver code needs to do (lots of some_driver.c files, including skeleton.c, which is usually what people point to), but no explanation of how to get it called. Now that I understand it, it seems a bit obvious, but back when I was going mad trying to figure out why my driver didn't execute, it would have been really nice to have it all spelled out.

Basically what it comes down to is an explanation of Space.c, which doesn't do very much, but is a bit funny looking to a first-timer. So once it's done, I will submit a description. If you'd like, check out a start at http://fester.axis.net/~linux/454.html. Make sure to let me and/or the rest of the world know what you think.

-Neal Tucker

Messages
1. network driver info by Neal Tucker
   -> Network Driver Desprately Needed by Paul Atkinson

network driver info
Forum: Device Drivers
Re: Re: Network Device Drivers (Paul Gortmaker)
Re: Re: Network Device Drivers (Neal Tucker)
Keywords: network driver functions
Date: Sat, 15 Jun 1996 03:33:21 GMT
From: Neal Tucker <ntucker@adobe.com>

Earlier, I posted a pointer to a bit of info on network device drivers, and the site that the web page is on is going away, so I am including what was there here...

How a Network Device Gets Added to the Kernel

There is a global variable called dev_base which points to a linked list of "device" structures. Each record represents a network device, and contains a pointer to the device driver's initialization function. The initialization function is the first code from the driver to ever get executed, and is responsible for setting up the hooks to the other driver code.

At boot time, the function device_setup (drivers/block/genhd.c) calls a function called net_dev_init (net/core/dev.c) which walks through the linked list pointed to by dev_base, calling each device's init function. If the init indicates failure (by returning a nonzero result), net_dev_init removes the device from the linked list and continues on.

This brings up the question of how the devices get added to the linked list of devices before any of their code is executed. That is accomplished by a clever piece of C preprocessor work in drivers/net/Space.c. This file has the static declarations for each device's "device" struct, including the pointer to the next device in the list.

How can we define these links statically without knowing which devices are going to be included? Here's how it's done (from drivers/net/Space.c):

    #define NEXT_DEV NULL

    #if defined(CONFIG_SLIP)
    static struct device slip_dev = {
        device name and some other info goes here
        ...
        NEXT_DEV,       /* <- link to previously listed */
                        /*    device struct (NULL here) */
        slip_init,      /* <- pointer to init function */
    };
    #undef NEXT_DEV
    #define NEXT_DEV (&slip_dev)
    #endif

    #if defined(CONFIG_PPP)
    static struct device ppp_dev = {
        device name and some other info goes here
        ...
        NEXT_DEV,       /* <- link to previously listed */
                        /*    device struct, which is now */
                        /*    defined as &slip_dev */
        ppp_init,       /* <- pointer to init function */
    };
    #undef NEXT_DEV
    #define NEXT_DEV (&ppp_dev)
    #endif

    struct device loopback_dev = {
        device name and some other info goes here
        ...
        NEXT_DEV,       /* <- link to previously listed */
                        /*    device struct, which is now */
                        /*    defined as &ppp_dev */
        loopback_init,  /* <- pointer to init function */
    };

    /* And finally, the head of the list, which points  */
    /* to the most recently defined device struct,      */
    /* loopback_dev. This (dev_base) is the pointer the */
    /* kernel uses to access all the devices.           */
    struct device *dev_base = &loopback_dev;

There is a constant, NEXT_DEV, defined to always point at the last device record declared. When each device record gets declared, it puts the value of NEXT_DEV in itself as the "next" pointer and then redefines NEXT_DEV to point to itself. This is how the linked list is built. Note that NEXT_DEV starts out NULL so that the first device structure is the end of the list, and at the end, the global dev_base, which is the head of the list, gets the value of the last device structure.
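The NEXT_DEV trick can be reproduced in an ordinary user-space program. The sketch below is an assumption-laden miniature, not the kernel's Space.c: struct device is cut down to three fields, the device names and demo_init() are made up, CONFIG_SLIP is defined by hand, and CONFIG_PPP is deliberately left undefined to show that the list still links up when a device is configured out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cut-down stand-in for the kernel's device struct. */
struct device {
    const char *name;
    struct device *next;
    int (*init)(struct device *dev);
};

/* Placeholder init function; returns 0 for success. */
static int demo_init(struct device *dev)
{
    (void)dev;
    return 0;
}

#define CONFIG_SLIP 1
/* CONFIG_PPP is intentionally not defined in this sketch. */

#define NEXT_DEV NULL

#if defined(CONFIG_SLIP)
static struct device slip_dev = { "sl0", NEXT_DEV, demo_init };
#undef NEXT_DEV
#define NEXT_DEV (&slip_dev)
#endif

#if defined(CONFIG_PPP)
static struct device ppp_dev = { "ppp0", NEXT_DEV, demo_init };
#undef NEXT_DEV
#define NEXT_DEV (&ppp_dev)
#endif

/* Always present; links to whatever was declared most recently. */
static struct device loopback_dev = { "lo", NEXT_DEV, demo_init };

/* Head of the list: the last device struct declared. */
struct device *dev_base = &loopback_dev;

/* Walk the list from dev_base and count the entries. */
int count_devices(void)
{
    int n = 0;

    for (struct device *d = dev_base; d != NULL; d = d->next)
        n++;
    return n;
}

/* Return 1 if the n-th entry (0 = head) exists and has this name. */
int device_name_is(int n, const char *name)
{
    struct device *d = dev_base;

    assert(name != NULL);
    while (d != NULL && n-- > 0)
        d = d->next;
    return d != NULL && strcmp(d->name, name) == 0;
}
```

With only CONFIG_SLIP defined, the list is lo -> sl0 -> NULL. Defining CONFIG_PPP as well would splice ppp0 in between without touching any other declaration.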

Ethernet devices

Ethernet devices are a bit of a special case in how they get called at initialization time, probably due to the fact that there are so many different types of ethernet devices that we'd like to be able to refer to them by just calling them ethernet devices (ie "eth0", "eth1", etc), rather than calling them by name (ie "NE2000", "3C509", etc).

In the linked list mentioned above, there is a single entry for all ethernet devices, whose initialization function is set to the function ethif_probe (also defined in drivers/net/Space.c). This function simply calls each ethernet device's init function until it finds one that succeeds. This is done with a huge expression made up of the ANDed results of the calls to the initialization functions (note that with the ethernet devices, the init function is conventionally called xxx_probe). Here is an abridged version of that function:

    static int ethif_probe(struct device *dev)
    {
        u_long base_addr = dev->base_addr;

        if ((base_addr == 0xffe0) || (base_addr == 1))
            return 1;

        if (1                   /* note start of expression here */
    #ifdef CONFIG_DGRS
            && dgrs_probe(dev)
    #endif
    #ifdef CONFIG_VORTEX
            && tc59x_probe(dev)
    #endif
    #ifdef CONFIG_NE2000
            && ne_probe(dev)
    #endif
            && 1 ) {            /* end of expression here */
            return 1;
        }
        return 0;
    }

The result is that the if statement bails out as false if any of the probe calls returns zero (success), and only one ethernet card is initialized and used, no matter how many drivers you have installed. For the drivers that aren't installed, the #ifdef removes the code completely, and the expression gets a bit smaller. The implications of this scheme are that supporting multiple ethernet cards is now a special case, and requires providing command line parameters to the kernel which cause ethif_probe to be executed multiple times.
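The short-circuit behaviour of that ANDed expression is easy to check in user space. In the sketch below the three probe functions, the CONFIG_* names, and the cut-down struct device are all hypothetical; each probe counts its calls so that it is visible which ones actually ran.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel's device struct. */
struct device {
    unsigned long base_addr;
    int claimed_by;  /* which probe claimed the device, 0 if none */
};

static int probe_calls;  /* how many probe functions were entered */

/* Probes follow the convention above: 0 = success, nonzero = failure. */
static int probe_a(struct device *dev)
{
    (void)dev;
    probe_calls++;
    return 1;  /* card A not present */
}

static int probe_b(struct device *dev)
{
    probe_calls++;
    dev->claimed_by = 2;
    return 0;  /* card B found */
}

static int probe_c(struct device *dev)
{
    (void)dev;
    probe_calls++;
    return 1;  /* never reached once probe_b has succeeded */
}

#define CONFIG_A 1
#define CONFIG_B 1
#define CONFIG_C 1

/* Same shape as ethif_probe: && stops at the first probe returning 0. */
int ethif_probe_sketch(struct device *dev)
{
    assert(dev != NULL);
    if (1
#ifdef CONFIG_A
        && probe_a(dev)
#endif
#ifdef CONFIG_B
        && probe_b(dev)
#endif
#ifdef CONFIG_C
        && probe_c(dev)
#endif
        && 1) {
        return 1;  /* every configured probe failed */
    }
    return 0;      /* some probe succeeded */
}

/* Returns result * 100 + probe_calls * 10 + claimed_by. */
int probe_demo(void)
{
    struct device dev = { 0, 0 };
    int rc;

    probe_calls = 0;
    rc = ethif_probe_sketch(&dev);
    return rc * 100 + probe_calls * 10 + dev.claimed_by;
}
```

Here probe_a fails, probe_b succeeds, and probe_c is never called, which is exactly why only one card gets initialized per pass through the function.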

Messages
1. Network Driver Desprately Needed by Paul Atkinson

Network Driver Desprately Needed
Forum: Device Drivers
Re: Re: Network Device Drivers (Paul Gortmaker)
Re: Re: Network Device Drivers (Neal Tucker)
Re: network driver info (Neal Tucker)
Keywords: device drivers network compaq tlan thunderlan netflex
Date: Tue, 06 May 1997 20:56:36 GMT
From: Paul Atkinson <patkinson@aerotek.co.uk>

I have looked everywhere for a Compaq Netflex 100BaseT network card device driver/patch and have come up with nothing :( I wouldn't know where to start to make my own (I have a hard enough time recompiling the kernel!). The card is based on a TI ThunderLAN chip.

If anyone would like to fill a void in Linux Hardware Compatibility it would be very much appreciated.

Many thanks
Paul.

Transmit function
Forum: Device Drivers
Re: Re: Network Device Drivers (Paul Gortmaker)
Keywords: network driver prototype functions
Date: Fri, 31 May 1996 20:55:37 GMT
From: Joerg Schorr <jschorr@studi.epfl.ch>

> 3) Transmit function
>    Linked to dev->hard_start_xmit() and is called by the
>    kernel when there is some data that the kernel wants
>    to put out over the device. This puts the data onto
>    the card and triggers the transmit.

Well, i'm having to do some work with networking on linux, and i also noticed this part for transmit, but the PC i am working on uses a WD80x3 card (using the wd.c driver), and as it seems the transmit function is wd_block_output. But what lies between dev->hard_start_xmit and wd_block_output?? I haven't figured it out for the moment.

Messages
1. Re: Transmit function by Paul Gortmaker
   -> Skbuff by Joerg Schorr

Re: Transmit function
Forum: Device Drivers
Re: Re: Network Device Drivers (Paul Gortmaker)
Re: Transmit function (Joerg Schorr)
Keywords: network driver prototype functions
Date: Fri, 31 May 1996 23:55:18 GMT
From: Paul Gortmaker <unknown>

The wd driver is not a complete driver by itself. It uses the code in 8390.c to do most of the work. The function ei_transmit() in 8390.c is what is linked to dev->hard_start_xmit(), and then ei_transmit will call ei_block_output() which in this case is pointing at wd_block_output().

Paul.

Messages
1. Skbuff by Joerg Schorr
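The two levels of indirection described in the reply above (a generic transmit routine reached through hard_start_xmit, which in turn calls a card-specific block output routine through a pointer) can be modelled in a few lines of user-space C. All names and types here are hypothetical stand-ins chosen to echo the post; this is not the 8390.c or wd.c code, and a byte array stands in for the card's shared memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Card-specific operations supplied by the low-level driver. */
struct chip_ops {
    void (*block_output)(const char *buf, int len);
};

/* Cut-down stand-in for a network device. */
struct net_dev {
    int (*hard_start_xmit)(struct net_dev *dev, const char *buf, int len);
    struct chip_ops ops;
    int tx_packets;
};

/* Stand-in for the card's shared memory. */
static char card_mem[64];
static int card_len;

/* Card-specific part (the role wd_block_output() plays in the post):
 * copy the packet into the card's memory. */
static void wd_like_block_output(const char *buf, int len)
{
    assert(len >= 0);
    if (len > (int)sizeof card_mem)
        len = (int)sizeof card_mem;
    memcpy(card_mem, buf, (size_t)len);
    card_len = len;
}

/* Generic part (the role ei_transmit() plays in the post):
 * knows nothing about the card except the ops pointer. */
static int generic_transmit(struct net_dev *dev, const char *buf, int len)
{
    dev->ops.block_output(buf, len);
    dev->tx_packets++;
    return 0;
}

/* Wire the two layers together and send one packet.
 * Returns 1 if the data reached "card memory" intact, 0 otherwise. */
int xmit_demo(void)
{
    struct net_dev dev = { generic_transmit, { wd_like_block_output }, 0 };

    card_len = 0;
    dev.hard_start_xmit(&dev, "ping", 4);
    return card_len == 4 && memcmp(card_mem, "ping", 4) == 0 &&
           dev.tx_packets == 1;
}
```

The networking layer only ever sees hard_start_xmit; swapping in a different card means supplying a different block_output routine and nothing else.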

Skbuff
Forum: Device Drivers
Re: Re: Network Device Drivers (Paul Gortmaker)
Re: Transmit function (Joerg Schorr)
Re: Re: Transmit function (Paul Gortmaker)
Keywords: network driver prototype functions
Date: Thu, 06 Jun 1996 19:39:48 GMT
From: Joerg Schorr <jschorr@studi.epfl.ch>

In the wd_block_output function (in wd.c), there is a moment where the buf (which is skb->data) is copied to the shared memory of the ethercard (if I understood it right).

But I haven't found out when the message and headers (ip and udp for the case I am interested in) are copied into skb->data?? Also: what exactly is in skb->data?? Is there more than the message and the headers?? Also the rest of the skbuffer??

Filesystems

There has been very little documentation so far regarding writing filesystems for Linux. Let's change that...

A tour of the Linux VFS
    Before you can consider writing a filesystem for Linux, you need to have at least a vague understanding of how the Linux Virtual Filesystem Switch operates. The basic principles are fairly simple.

Design and Implementation of the Second Extended Filesystem
    The three main authors and designers of ext2fs have written an excellent paper on it, and have contributed it to the KHG. This paper was first published in the Proceedings of the First Dutch International Symposium on Linux, ISBN 90-367-0385-9.

Filesystem Tutorial
    It may be a long time before I get around to writing a tutorial on how to write a Linux filesystem. After all, I've never done it. Would someone else like to help?

Copyright (C) 1996 Michael K. Johnson, johnsonm@redhat.com.

Messages
    Nevermind... by Mark Salter
    New File System by Vamsi Krishna
    Partition? by Wilfredo Lugo Beauchamp
    libext2fs documentation by James Beckett
    proc filesystem by Praveen Krishnan
    Need NFS documentation by Ermelindo Mauriello
    Even more ext2 documentation! by Michael K. Johnson
    Need /proc info by Kai Xu
    Where to find libext2 sources? by Mark Salter
    Where to find e2fsprogs by Theodore Ts'o
    libext2fs documentation by Theodore Ts'o
    man proc by Michael K. Johnson
    Please be more specific. by Theodore Ts'o
    Need documentation on userfs implementation by Natchu Vishnu Priya
    ext2fs tools by Wilfredo Lugo Beauchamp
    Done. by Michael K. Johnson
    Ext2 paper by Theodore Ts'o
    More ext2 documentation by Michael K. Johnson

A tour of the Linux VFS

I'm not an expert on this topic. I've never written a filesystem from scratch; I've only worked on the proc filesystem, and I didn't do much real filesystem hacking there, only extensions to what was already there. So if you see any mistakes or omissions here (there have got to be omissions in a piece this short on a topic this large), please respond, in order to let me fix them and let other people know about them.

In Linux, all files are accessed through the Virtual Filesystem Switch, or VFS. This is a layer of code which implements generic filesystem actions and vectors requests to the correct specific code to handle the request. Two main types of code modules take advantage of the VFS services, device drivers and filesystems. Because device drivers are covered elsewhere in the KHG, we won't cover them explicitly here. This tour will focus on filesystems. Because the VFS doesn't exist in a vacuum, we'll show its relationship with the favorite Linux filesystem, the ext2 filesystem.

One warning: without a decent understanding of the system calls that have to do with files, you are not likely to be able to make heads or tails of filesystems. Most of the VFS and most of the code in a normal Linux filesystem is pretty directly related to completing normal system calls, and you will not be able to understand how the rest of the system works without understanding the system calls on which it is based.

Where to find the code

The source code for the VFS is in the fs/ subdirectory of the Linux kernel source, along with a few other related pieces, such as the buffer cache and code to deal with each executable file format. Each specific filesystem is kept in a lower subdirectory; for example, the ext2 filesystem source code is kept in fs/ext2/.

This table gives the names of the files in the fs/ subdirectory and explains the basic purpose of each one. The middle column, labeled system, is supposed to show to which major subsystem the file is (mainly) dedicated. EXE means that it is used for recognizing and loading executable files. DEV means that it is for device driver support. BUF means buffer cache. VFS means that it is a part of the VFS, and delegates some functionality to filesystem-specific code. VFSg means that this code is completely generic and never delegates part of its operation to specific filesystem code (that I noticed, anyway) and which you shouldn't have to worry about while writing a filesystem.

    Filename         system  Purpose
    binfmt_aout.c    EXE     Recognize and execute old-style a.out executables.
    binfmt_elf.c     EXE     Recognize and execute new ELF executables
    binfmt_java.c    EXE     Recognize and execute Java apps and applets
    binfmt_script.c  EXE     Recognize and execute #!-style scripts
    block_dev.c      DEV     Generic read(), write(), and fsync() functions for block devices.
    buffer.c         BUF     The buffer cache, which caches blocks read from block devices.
    dcache.c         VFS     The directory cache, which caches directory name lookups.
    devices.c        DEV     Generic device support functions, such as registries.
    dquot.c          VFS     Generic disk quota support.
    exec.c           VFSg    Generic executable support. Calls functions in the binfmt_* files.
    fcntl.c          VFSg    fcntl() handling.
    fifo.c           VFSg    fifo handling.
    file_table.c     VFSg    Dynamically-extensible list of open files on the system.
    filesystems.c    VFS     All compiled-in filesystems are initialized from here by calling init_name_fs().
    inode.c          VFSg    Dynamically-extensible list of open inodes on the system.
    ioctl.c          VFS     First-stage handling for ioctl's; passes handling to the filesystem or device driver if necessary.
    locks.c          VFSg    Support for fcntl() locking, flock() locking, and mandatory locking.
    namei.c          VFS     Fills in the inode, given a pathname. Implements several name-related system calls.
    noquot.c         VFS     No quotas: optimization to avoid #ifdef's in dquot.c
    open.c           VFS     Lots of system calls including (surprise) open(), close(), and vhangup().
    pipe.c           VFSg    Pipes.
    read_write.c     VFS     read(), write(), readv(), writev(), lseek().
    readdir.c        VFS     Several different interfaces for reading directories.
    select.c         VFS     The guts of the select() system call
    stat.c           VFS     stat() and readlink() support.
    super.c          VFS     Superblock support, filesystem registry, mount()/umount().

Attaching a filesystem to the kernel

If you look at the code in any filesystem for init_name_fs(), you will find that it probably contains about one line of code. For instance, in the ext2fs, it looks like this (from fs/ext2/super.c):

    int init_ext2_fs(void)
    {
        return register_filesystem(&ext2_fs_type);
    }

All it does is register the filesystem with the registry kept in fs/super.c. ext2_fs_type is a pretty simple structure:

    static struct file_system_type ext2_fs_type = {
        ext2_read_super, "ext2", 1, NULL
    };

The ext2_read_super entry is a pointer to a function which allows a filesystem to be mounted (among other things; more later). "ext2" is the name of the filesystem type, which is used (when you type mount ... -t ext2) to determine which filesystem to use to mount a device. The 1 says that it needs a device to be mounted on (unlike the proc filesystem or a network filesystem), and the NULL is required to fill up space that will be used to keep a linked list of filesystem types in the filesystem registry, kept in (look it up in the table!) fs/super.c.

It's possible for a filesystem to support more than one type of filesystem. For instance, in fs/sysv/inode.c, three possible filesystem types are supported by one filesystem, with this code:

    static struct file_system_type sysv_fs_type[3] = {
        {sysv_read_super, "xenix", 1, NULL},
        {sysv_read_super, "sysv", 1, NULL},
        {sysv_read_super, "coherent", 1, NULL}
    };

    int init_sysv_fs(void)
    {
        int i;
        int ouch;

        for (i = 0; i < 3; i++) {
            if ((ouch = register_filesystem(&sysv_fs_type[i])) != 0)
                return ouch;
        }

you are almost sure to be able to provide a NULL pointer and get the default painlessly. There will actually be (I hope) two descriptions--a simple. in fs/ext2/super. pointers to functions specific to ext2. that's the VFS's job. which contains pointers to functions which do common operations related to superblocks. void (*statfs) (struct super_block *. from and to the disk will be covered in a different section. the filesystem may or may not actually have a block on disk that is the real superblock. http://ldp.A tour of the Linux VFS return ouch. ext2_put_inode. assume that it is done by magic. void (*write_inode) (struct inode *). The details of how the filesystem actually reads and writes the blocks. Operations that pertain to the filesystem as a whole (as opposed to individual files) are considered superblock operations. NULL. int *. ext2_write_inode. char *).it/LDP/khg/HyperNews/get/fs/vfstour.iol. You have probably noticed that there are a lot of pointers. and which refer to or change the status of the filesystem as a whole (statfs() and remount()). ext2_put_super. the superblock. All the author for the filesystem needs to do is fill in (usually static) structures with pointers to functions. Second. void (*put_inode) (struct inode *). notice how simple and clean the declaration is. That's pretty normal Linux behavior. as in the case of the DOS filesystem--that is.54] . whenever there is a sensible default behavior of a function pointer. For example. ext2_remount }. First. ext2_statfs. and that sensible default is what you want. notice that an unneeded entry has simply been set to NULL.10. int (*notify_change) (struct inode *. including the superblock.h>): struct super_operations { void (*read_inode) (struct inode *). in this case. A superblock is the block that defines an entire filesystem on a device. If it succeeds in reading the superblock and is able to mount the filesystem. and a more detailed one in a tour through the buffer cache. If not. int). 
s hidden in the VFS implementation. ext2_write_super. void (*put_super) (struct super_block *). That's the VFS part.c: static struct super_operations ext2_sops = { ext2_read_inode. struct iattr *). struct statfs *. int (*remount_fs) (struct super_block *. Here's the much simpler declaration of the ext2 instance of that structure. it has to make something up. it fills in the super_block structure with information that includes a pointer to a structure called super_operations. and pass pointers to those structures back to the VFS so it can get at the filesystem and the files. It is sometimes mythical. here. and especially pointers to functions. } Connecting the filesystem to a disk The rest of the communication between the filesystem code and the kernel doesn't happen until a device bearing that type of file system is mounted. When you mount a device containing an ext2 file system.. The good news is that all the messy pointer work is done. void (*write_super) (struct super_block *). }. The super_operations structure contains pointers to functions which manipulate inodes. ext2_read_super() is called. All the painful stuff like sb->s_op->write_super(sb).. For now. the super_operations structure looks like this (from <linux/fs. functional one in a section on how to write filesystems.html (3 di 6) [08/03/2001 10.
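The registration step described under "Attaching a filesystem to the kernel" can be modelled outside the kernel. The following is a minimal user-space sketch, not the kernel's code: every toy_ name is hypothetical, the structure only mirrors the four fields shown in ext2_fs_type, and the error codes and the duplicate-name check are assumptions of the sketch rather than documented kernel behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's super_block. */
struct toy_super_block { int mounted; };

/* Mirrors the four fields of file_system_type shown in the text. */
struct toy_fs_type {
    struct toy_super_block *(*read_super)(struct toy_super_block *sb);
    const char *name;          /* matched against "mount -t name" */
    int requires_dev;          /* 1: must be mounted on a device */
    struct toy_fs_type *next;  /* the NULL slot: link in the registry list */
};

static struct toy_fs_type *toy_registry = NULL;

/* Append fs to the registry list. Returns 0 on success, -1 on failure
 * (the choice of -1 is an assumption of this sketch). */
int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    if (fs == NULL || fs->next != NULL)
        return -1;
    for (p = &toy_registry; *p != NULL; p = &(*p)->next)
        if (*p == fs || strcmp((*p)->name, fs->name) == 0)
            return -1;
    *p = fs;
    return 0;
}

/* What "mount -t name" needs: find the type registered under name. */
struct toy_fs_type *toy_find_filesystem(const char *name)
{
    struct toy_fs_type *fs;

    assert(name != NULL);
    for (fs = toy_registry; fs != NULL; fs = fs->next)
        if (strcmp(fs->name, name) == 0)
            return fs;
    return NULL;
}

static struct toy_super_block *toy_sysv_read_super(struct toy_super_block *sb)
{
    sb->mounted = 1;
    return sb;
}

/* One body of code serving three type names, as in fs/sysv/inode.c. */
static struct toy_fs_type toy_sysv_fs_type[3] = {
    { toy_sysv_read_super, "xenix",    1, NULL },
    { toy_sysv_read_super, "sysv",     1, NULL },
    { toy_sysv_read_super, "coherent", 1, NULL }
};

/* Registers all three types; returns how many were registered, or -1. */
int toy_init_sysv_fs(void)
{
    int i, count = 0;

    for (i = 0; i < 3; i++) {
        if (toy_register_filesystem(&toy_sysv_fs_type[i]) != 0)
            return -1;
        count++;
    }
    return count;
}
```

Keeping the next pointer inside the structure is what lets the registry grow without allocating anything: each filesystem supplies its own list node, which is why the static initializers have to leave that slot NULL.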

Mounting a filesystem

When a filesystem is mounted (which file is in charge of mounting a filesystem? Look at the table above, and find that it is fs/super.c. You might want to follow along in fs/super.c), do_mount() calls read_super(), which ends up calling (in the case of the ext2 filesystem) ext2_read_super(), which returns the superblock. That superblock includes a pointer to that structure of pointers to functions that we see in the definition of ext2_sops above. It also includes a lot of other data; you can look at the definition of struct super_block in include/linux/fs.h if you like.

Finding a file

Once a filesystem is mounted, it is possible to access files on that filesystem. There are two main steps here: looking up the name to find what inode it points to, and then accessing the inode.

When the VFS is looking at a name, it includes a path. Unless the filename is absolute (it starts with a / character), it is relative to the current directory of the process that made the system call that included a path. The VFS uses filesystem-specific code to look up files on the filesystems specified. It takes the path name one component (filename components are separated with / characters) at a time, and looks it up. If it is a directory, then the next component is looked up in the directory returned by the previous lookup. Every component which is looked up, whether it is a file or a directory, returns an inode number which uniquely identifies it, and by which its contents are accessed.

If the file turns out to be a symbolic link to another file, then the VFS starts over with the new name which is retrieved from the symbolic link. In order to prevent infinite recursion, there's a limit on the depth of symlinks; the kernel will only follow so many symlinks in a row before giving up.

When the VFS and the filesystem together have resolved a name into an inode number (that's the namei() function in namei.c), then the inode can be accessed. The iget() function finds and returns the inode specified by an inode number. The iput() function is later used to release access to the inode. It is kind of like malloc() and free(), except that more than one process may hold an inode open at once, and a reference count is maintained to know when it's free and when it's not.

The integer file handle which is passed back to the application code is an offset into a file table for that process. That file table slot holds the inode number that was looked up with the namei() function until the file is closed or the process terminates. So whenever a process does anything to a ``file'' using a file handle, it is really manipulating the inode in question.

inode Operations

That inode number and inode structure have to come from somewhere, and the VFS can't make them up on its own. They have to come from the filesystem. So how does the VFS look up the name in the filesystem and get an inode back?

It starts at the beginning of the path name and looks up the inode of the first directory in the path. Then it uses that inode to look up the next directory in the path. When it reaches the end, it has found the inode of the file or directory it is trying to look up. But since it needs an inode to get started, how does it get started with the first lookup? There is an inode pointer kept in the superblock called s_mounted which points at an inode structure for the filesystem. This inode is allocated when the filesystem is mounted and de-allocated when the filesystem is unmounted. Normally, as in the ext2 filesystem, the s_mounted inode is the inode of the root directory of the filesystem. From there, all the other inodes can be looked up.

Each inode includes a pointer to a structure of pointers to functions. Sound familiar? This is the inode_operations structure. One of the elements of that structure is called lookup(), and it is used to look up another inode on the same filesystem. In general, a filesystem has only one lookup() function that is the same in every inode on the filesystem, but it is possible to have several different lookup() functions and assign them as appropriate for the filesystem; the proc filesystem does this because different directories in the proc filesystem have different purposes. The inode_operations structure looks like this (defined, like most everything we are looking at, in <linux/fs.h>):

    struct inode_operations {
            struct file_operations * default_file_ops;
            int (*create) (struct inode *,const char *,int,int,struct inode **);
            int (*lookup) (struct inode *,const char *,int,struct inode **);
            int (*link) (struct inode *,struct inode *,const char *,int);
            int (*unlink) (struct inode *,const char *,int);
            int (*symlink) (struct inode *,const char *,int,const char *);
            int (*mkdir) (struct inode *,const char *,int,int);
            int (*rmdir) (struct inode *,const char *,int);
            int (*mknod) (struct inode *,const char *,int,int,int);
            int (*rename) (struct inode *,const char *,int,struct inode *,const char *,int);
            int (*readlink) (struct inode *,char *,int);
            int (*follow_link) (struct inode *,struct inode *,int,int,struct inode **);
            int (*readpage) (struct inode *, struct page *);
            int (*writepage) (struct inode *, struct page *);
            int (*bmap) (struct inode *,int);
            void (*truncate) (struct inode *);
            int (*permission) (struct inode *, int);
            int (*smap) (struct inode *,int);
    };

Most of these functions map directly to system calls.

In the ext2 filesystem, directories, files, and symlinks have different inode_operations (this is normal). The file fs/ext2/dir.c contains ext2_dir_inode_operations, the file fs/ext2/file.c contains ext2_file_inode_operations, and the file fs/ext2/symlink.c contains ext2_symlink_inode_operations.

There are many system calls related to files (and directories) which aren't accounted for in the inode_operations structure; those are found in the file_operations structure. The file_operations structure is the same one used when writing device drivers and contains operations that work specifically on files, rather than inodes:

    struct file_operations {
            int (*lseek) (struct inode *, struct file *, off_t, int);
            int (*read) (struct inode *, struct file *, char *, int);
            int (*write) (struct inode *, struct file *, const char *, int);
            int (*readdir) (struct inode *, struct file *, void *, filldir_t);
            int (*select) (struct inode *, struct file *, int, select_table *);
            int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
            int (*mmap) (struct inode *, struct file *, struct vm_area_struct *);
            int (*open) (struct inode *, struct file *);
            void (*release) (struct inode *, struct file *);
            int (*fsync) (struct inode *, struct file *);
            int (*fasync) (struct inode *, struct file *, int);
            int (*check_media_change) (kdev_t dev);
            int (*revalidate) (kdev_t dev);
    };

There are also a few functions which aren't directly related to system calls--and where they don't apply, they can simply be set to NULL.

Summary

The role of the VFS is:
- Keep track of available filesystem types.
- Associate (and disassociate) devices with instances of the appropriate filesystem.
- Do any reasonable generic processing for operations involving files.
- When filesystem-specific operations become necessary, vector them to the filesystem in charge of the file, directory, or inode in question.
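The last point, vectoring an operation to the filesystem in charge, is the mechanism behind the component-by-component lookup described under "Finding a file". Here is a minimal user-space sketch of that walk. It is an illustration under stated assumptions, not the kernel's namei(): all toy_ names are hypothetical, the lookup signature is simplified, the in-memory tree stands in for a disk, and symbolic links are not handled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_inode;

/* Simplified counterpart of inode_operations: only lookup(). */
struct toy_inode_ops {
    struct toy_inode *(*lookup)(struct toy_inode *dir,
                                const char *name, size_t len);
};

struct toy_dirent { const char *name; struct toy_inode *inode; };

struct toy_inode {
    unsigned long ino;
    const struct toy_inode_ops *ops;   /* lookup == NULL: not a directory */
    const struct toy_dirent *entries;
    size_t nentries;
};

/* "Filesystem-specific" lookup: scan the directory's entry list. */
static struct toy_inode *toy_dir_lookup(struct toy_inode *dir,
                                        const char *name, size_t len)
{
    size_t i;

    for (i = 0; i < dir->nentries; i++)
        if (strlen(dir->entries[i].name) == len &&
            memcmp(dir->entries[i].name, name, len) == 0)
            return dir->entries[i].inode;
    return NULL;
}

static const struct toy_inode_ops toy_dir_ops  = { toy_dir_lookup };
static const struct toy_inode_ops toy_file_ops = { NULL };

/* A tiny tree: / (inode 2) -> etc (inode 11) -> passwd (inode 12). */
static struct toy_inode toy_passwd = { 12, &toy_file_ops, NULL, 0 };
static const struct toy_dirent toy_etc_entries[] = { { "passwd", &toy_passwd } };
static struct toy_inode toy_etc = { 11, &toy_dir_ops, toy_etc_entries, 1 };
static const struct toy_dirent toy_root_entries[] = { { "etc", &toy_etc } };
static struct toy_inode toy_root = { 2, &toy_dir_ops, toy_root_entries, 1 };

/* Generic walk: start at root for absolute paths, at cwd otherwise,
 * and hand each component to the current directory's own lookup(). */
struct toy_inode *toy_namei(struct toy_inode *root, struct toy_inode *cwd,
                            const char *path)
{
    struct toy_inode *cur;

    assert(path != NULL);
    cur = (*path == '/') ? root : cwd;
    while (cur != NULL) {
        size_t len;

        while (*path == '/')
            path++;                     /* skip separators */
        if (*path == '\0')
            break;                      /* no components left */
        len = strcspn(path, "/");
        if (cur->ops == NULL || cur->ops->lookup == NULL)
            return NULL;                /* component below a non-directory */
        cur = cur->ops->lookup(cur, path, len);
        path += len;
    }
    return cur;
}

/* Convenience wrapper: inode number for path, with /etc as the current
 * directory; 0 if the name does not resolve. */
unsigned long toy_lookup_ino(const char *path)
{
    struct toy_inode *inode = toy_namei(&toy_root, &toy_etc, path);

    return inode != NULL ? inode->ino : 0;
}
```

The walk itself never looks inside a directory; it only knows where to start (the root, playing the part of s_mounted, or the current directory) and which function pointer to call next.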

The interaction between the VFS and specific filesystem types occurs through two main data structures, the super_block structure and the inode structure, and their associated data structures, including super_operations, inode_operations, file_operations, and others, which are kept in the include file <linux/fs.h>.

Therefore, the role of a specific filesystem code is to provide a superblock for each filesystem mounted and a unique inode for each file on the filesystem, and to provide code to carry out actions specific to filesystems and files that are requested by system calls and sorted out by the VFS.

Copyright (C) 1996 Michael K. Johnson, johnsonm@redhat.com.

Messages
1. A couple of comments and corrections by Jeremy Fitzhardinge

The HyperNews Linux KHG Discussion Pages

A couple of comments and corrections

Forum: A tour of the Linux VFS
Keywords: comments, correction, implementation details
Date: Thu, 23 May 1996 03:20:42 GMT
From: Jeremy Fitzhardinge <jeremy@zip.com.au>

A minor point: binfmt_java only deals with Java class files, not JavaScript (which is unrelated to Java in all but name).

In general it looks OK, but it doesn't really talk about the distinction between file operations and inode operations (or the distinction between struct file and struct inode).

Inodes represent entities on disks: there's one (and only one) for each file, directory, symlink, device and FIFO on the filesystem.

When a user process does an open() syscall, it does a couple of things:
- it looks up the inode for the pathname
- it puts the inode into the inode cache
- it allocates a new file structure
- it allocates a file descriptor slot, and points it at the file structure

The number of the file descriptor slot is what is returned.

The distinction is subtle but important. Inodes have no notion of "current offset": that's in the file structure, so one inode can have many file structures, each at a different offset and mode (RO/WO/RW/append). If you use the dup() syscall, you can have multiple file descriptors pointing to the same file structure (so they share their offsets and open modes).

What this means for a filesystem is that you rarely need to implement the open() file operation, unless you actually care about specific file open and close operations. However, you must always implement iget() operations.
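The difference between a file descriptor, a file structure and an inode can be observed from user space with ordinary POSIX calls. The sketch below is an illustration, not part of the original comment: the function name and the /tmp path template are made up for the example. It opens one file twice and duplicates one of the descriptors, then moves the offset through the first descriptor only.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 if a dup()ed descriptor follows the original's offset while
 * an independently open()ed descriptor does not, 0 if that is not what
 * was observed, and -1 if the temporary file could not be set up. */
int toy_dup_shares_offset(void)
{
    char path[] = "/tmp/khg-vfs-XXXXXX";   /* hypothetical scratch file */
    int fd, dupfd = -1, fd2 = -1, result = -1;
    off_t dup_off, reopen_off;

    fd = mkstemp(path);                    /* first file structure */
    if (fd < 0)
        return -1;

    if (write(fd, "0123456789", 10) == 10 &&
        (dupfd = dup(fd)) >= 0 &&                  /* same file structure */
        (fd2 = open(path, O_RDONLY)) >= 0 &&       /* second file structure */
        lseek(fd, 4, SEEK_SET) == 4) {             /* move via fd only */
        assert(dupfd != fd);               /* two slots, one file structure */
        dup_off = lseek(dupfd, 0, SEEK_CUR);       /* follows fd */
        reopen_off = lseek(fd2, 0, SEEK_CUR);      /* still at its own offset */
        result = (dup_off == 4 && reopen_off == 0);
    }

    if (fd2 >= 0)
        close(fd2);
    if (dupfd >= 0)
        close(dupfd);
    close(fd);
    unlink(path);
    return result;
}
```

All three descriptors refer to the same inode, but only the two that share a file structure share an offset.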

I guess there's lots more that can go here:
- structure of a Unix filesystem (inodes/names/links)
- interaction with the VM system and page cache
- coping with dynamic filesystems (filesystems whose contents change of their own accord)
- what order to do things in, and how the various VFS calls interact
- documenting the args of the VFS interface functions

I know this is supposed to be documentation, but the kernel source is a good place to look.

[Plug mode] A good way of finding out about how Linux filesystems work is Userfs, which allows you to write user processes which implement filesystems. It has a readme which talks in some detail about how to implement filesystems, which is somewhat applicable to writing kernel resident filesystems too. The structure of the userfs kernel module is also a pretty good guide for a skeletal filesystem. It's available at tsx-11.mit.edu:pub/linux/ALPHA/userfs.

J

The HyperNews Linux KHG Discussion Pages

Design and Implementation of the Second Extended Filesystem

Rémy Card, Laboratoire MASI--Institut Blaise Pascal, E-Mail: card@masi.ibp.fr, and
Theodore Ts'o, Massachusetts Institute of Technology, E-Mail: tytso@mit.edu, and
Stephen Tweedie, University of Edinburgh, E-Mail: sct@dcs.ed.ac.uk

Introduction

Linux is a Unix-like operating system, which runs on PC-386 computers. It was implemented first as an extension to the Minix operating system [Tanenbaum 1987] and its first versions included support for the Minix filesystem only. The Minix filesystem contains two serious limitations: block addresses are stored in 16 bit integers, thus the maximal filesystem size is restricted to 64 megabytes, and directories contain fixed-size entries and the maximal file name is 14 characters.

We have designed and implemented two new filesystems that are included in the standard Linux kernel. These filesystems, called ``Extended File System'' (Ext fs) and ``Second Extended File System'' (Ext2 fs), raise the limitations and add new features.

In this paper, we describe the history of Linux filesystems. We briefly introduce the fundamental concepts implemented in Unix filesystems. We present the implementation of the Virtual File System layer in Linux and we detail the Second Extended File System kernel code and user mode tools. Last, we present performance measurements made on Linux and BSD filesystems and we conclude with the current status of Ext2fs and the future directions.

History of Linux filesystems

In its very early days, Linux was cross-developed under the Minix operating system. It was easier to share disks between the two systems than to design a new filesystem, so Linus Torvalds decided to implement support for the Minix filesystem in Linux. The Minix filesystem was an efficient and relatively bug-free piece of software.

However, the restrictions in the design of the Minix filesystem were too limiting, so people started thinking and working on the implementation of new filesystems in Linux.

In order to ease the addition of new filesystems into the Linux kernel, a Virtual File System (VFS) layer was developed. The VFS layer was initially written by Chris Provenzano, and later rewritten by Linus Torvalds before it was integrated into the Linux kernel. It is described in The Virtual File System.

After the integration of the VFS in the kernel, a new filesystem, called the ``Extended File System'', was implemented in April 1992 and added to Linux 0.96c. This new filesystem removed the two big Minix limitations: its maximal size was 2 gigabytes and the maximal file name size was 255 characters. It was an improvement over the Minix filesystem but some problems were still present in it. There was no support for the separate access, inode modification, and data modification timestamps. The filesystem used linked lists to keep track of free blocks and inodes and this produced poor performance: as the filesystem was used, the lists became unsorted and the filesystem became fragmented.

As a response to these problems, two new filesystems were released in Alpha version in January 1993: the Xia filesystem and the Second Extended File System. The Xia filesystem was heavily based on the Minix filesystem kernel code and only added a few improvements over this filesystem. Basically, it provided long file names, support for bigger partitions and support for the three timestamps. On the other hand, Ext2fs was based on the Extfs code with many reorganizations and many improvements. It had been designed with evolution in mind and contained space for future improvements. It will be described in more detail in The Second Extended File System.

When the two new filesystems were first released, they provided essentially the same features. Due to its minimal design, Xia fs was more stable than Ext2fs. As the filesystems were used more widely, bugs were fixed in Ext2fs and lots of improvements and new features were integrated. Ext2fs is now very stable and has become the de-facto standard Linux filesystem.

This table contains a summary of the features provided by the different filesystems:

                  Minix FS  Ext FS  Ext2 FS  Xia FS
Max FS size       64 MB     2 GB    4 TB     2 GB
Max file size     64 MB     2 GB    2 GB     64 MB
Max file name     16/30 c   255 c   255 c    248 c
3 times support   No        No      Yes      Yes
Extensible        No        No      Yes      No
Var. block size   No        No      Yes      No
Maintained        Yes       No      Yes      ?

Basic File System Concepts

Every Linux filesystem implements a basic set of common concepts derived from the Unix operating system [Bach 1986]: files are represented by inodes, directories are simply files containing a list of entries, and devices can be accessed by requesting I/O on special files.

Inodes

Each file is represented by a structure, called an inode. Each inode contains the description of the file: file type, access rights, owners, timestamps, size, pointers to data blocks. The addresses of data blocks allocated to a file are stored in its inode. When a user requests an I/O operation on the file, the kernel code converts the current offset to a block number, uses this number as an index in the block addresses table and reads or writes the physical block. This figure represents the structure of an inode:

[Figure: structure of an inode]
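The offset-to-block conversion described under "Inodes" can be sketched in a few lines. This is a hedged illustration rather than the Ext2fs code: the toy_ names are hypothetical, and the layout it assumes (12 direct block addresses followed by one single, one double and one triple indirect block, with 4-byte block addresses) is the classic Unix/ext2 arrangement, which the text above does not spell out.

```c
#include <assert.h>

#define TOY_NDIRECT 12UL   /* assumed number of direct block addresses */

enum toy_level {
    TOY_DIRECT,            /* address stored in the inode itself */
    TOY_INDIRECT,          /* reached through the single indirect block */
    TOY_DOUBLE,            /* reached through the double indirect block */
    TOY_TRIPLE,            /* reached through the triple indirect block */
    TOY_TOO_BIG            /* beyond what the inode can address */
};

/* Step 1: convert a byte offset in the file to a logical block number. */
unsigned long toy_block_index(unsigned long offset, unsigned long block_size)
{
    assert(block_size > 0);
    return offset / block_size;
}

/* Step 2: decide which part of the block address table holds the
 * address of that logical block. */
enum toy_level toy_block_level(unsigned long index, unsigned long block_size)
{
    unsigned long per;

    assert(block_size >= 4 && block_size % 4 == 0);
    per = block_size / 4;  /* 4-byte addresses per indirect block */

    if (index < TOY_NDIRECT)
        return TOY_DIRECT;
    index -= TOY_NDIRECT;
    if (index < per)
        return TOY_INDIRECT;
    index -= per;
    if (index < per * per)
        return TOY_DOUBLE;
    index -= per * per;
    if (index / per < per * per)   /* index < per^3, written to avoid overflow */
        return TOY_TRIPLE;
    return TOY_TOO_BIG;
}
```

With 1024-byte blocks each indirect block holds 256 addresses, so under these assumptions logical blocks 0 to 11 are direct, 12 to 267 go through the single indirect block, and the double indirect block takes over at block 268.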

Several names can be associated with a inode. the inode is loaded into memory and is used by subsequent requests. When a process uses a pathname. Directories are implemented as a special type of files. Each directory can contain files and subdirectories. After the name has been converted to an inode number.58] .it/LDP/khg/HyperNews/get/fs/ext2intro. Each entry contains an inode number and a file name.Design and Implementation of the Second Extended Filesystem Directories Directories are structured in a hierarchical tree.html (3 di 14) [08/03/2001 10.iol. a directory is a file containing a list of entries. Actually. This figure represents a directory: Links Unix filesystems implement the concept of link. the kernel code searchs in the directories to find the corresponding inode number. The inode http://ldp.10.

it replaces the name of the link by its contents. When a process issues a file oriented system call. The former allows I/O operations in character mode while the later requires data to be written in block mode via the buffer cache functions. This function handles the structure independent manipulations and redirects the call to a function contained in the physical filesystem code. Two types of special files exist: character and block special files. and in incrementing the links count in the inode. A special file is referenced by a major number. Device special files In Unix-like operating systems. when one uses the rm command to remove a filename. which identifies the unit. i. This type of link is called a hard link and can only be used within a single filesystem: it is impossible to create cross-filesystem hard links. the kernel calls a function contained in the VFS. hard links can only point on files: a directory hard link cannot be created to prevent the apparition of a cycle in the directory tree. 1993]. the name of the target file. and restarts the pathname interpretation. Another kind of links exists in most Unix filesystems. where the inode number points to the inode. and a minor number. devices can be accessed via special files.Design and Implementation of the Second Extended Filesystem contains a field containing the number associated with the file. the kernel decrements the links count and deallocates the inode if this count becomes zero. The Virtual File System Principle The Linux kernel contains a Virtual File System layer which is used during system calls acting on files. A device special file does not use any space on the filesystem. However. Seltzer et al. Symbolic links can point to any type of file.58] . it is possible to create cross-filesystems symbolic links.it/LDP/khg/HyperNews/get/fs/ext2intro. When an I/O request is made on a special file. Symbolic links are very useful because they don't have the limitations associated to hard links. 
This indirection mechanism is frequently used in Unix-like operating systems to ease the integration and the use of several filesystem types [Kleiman 1986. allocated for their inode and their data blocks. This scheme is illustrated in http://ldp. i. Since a symbolic link does not point to an inode.iol.10. Adding a link simply consists in creating a directory entry. When the kernel encounters a symbolic link during a pathname to inode conversion. The VFS is an indirection layer which handles the file oriented system calls and calls the necessary functions in the physical filesystem code to do the I/O. Moreover. When a link is deleted.e. which identifies the device type. it is forwarded to a (pseudo) device driver.html (4 di 14) [08/03/2001 10. Symbolic links are simply files which contain a filename. It is only an access point to the device driver. Filesystem code uses the buffer cache functions to request I/O on devices. even on nonexistent files. they use some disk space. which is responsible for handling the structure dependent operations. and cause an overhead in the pathname to inode conversion because the kernel has to restart the name interpretation when it encounters a symbolic link.e.

inodes.10. initializing its internal variables. It uses a table defined during the kernel configuration.it/LDP/khg/HyperNews/get/fs/ext2intro. After the filesystem is mounted. This interface is made up of a set of operations associated to three kinds of objects: filesystems. Each entry in this table describes a filesystem type: it contains the name of the filesystem type and a pointer on a function called during the mount operation. This function is responsible for reading the superblock from the disk. and returning a mounted filesystem descriptor to the VFS.Design and Implementation of the Second Extended Filesystem this figure: The VFS structure The VFS defines a set of functions that every filesystem has to implement. the appropriate mount function is called.iol. the VFS functions can use this descriptor to access the physical filesystem routines. and open files. When a filesystem is to be mounted.58] . A mounted filesystem descriptor contains several kinds of data: informations that are common to every http://ldp. The VFS knows about filesystem types supported in the kernel.html (5 di 14) [08/03/2001 10.

In the later case.58] . it is now possible to use big disks without the need of creating many partitions. the file descriptors contains pointer to functions which can only act on open files (e. Thus. We also wanted to provide a very robust filesystem in order to reduce the risk of data loss in intensive use. It uses variable length directory entries. write).g. The Second Extended File System Motivations The Second Extended File System has been designed and implemented to fix some problems present in the first Extended File System. recent work in the VFS layer have raised this limit to 4 TB. Two other types of descriptors are used by the VFS: an inode descriptor and an open file descriptor. A mount option allows the administrator to choose the file creation semantics. This limit could be extended to 1012 if needed. BSD or System V Release 4 semantics can be selected at mount time.iol. ``Advanced'' Ext2fs features In addition to the standard Unix features. unlink). Ext2fs supports some extensions which are not usually present in Unix filesystems.g. Each descriptor contains informations related to files in use and a set of operations provided by the physical filesystem code. directories. read. File attributes allow the users to modify the kernel behavior when acting on a set of files. This allows the administrator to recover easily from situations where user processes fill up filesystems. we wanted to Ext2fs to have excellent performance. The maximal file name size is 255 characters. create. On a filesystem mounted with BSD semantics.Design and Implementation of the Second Extended Filesystem filesystem types. Ext2fs is able to manage filesystems created on really big partitions.10. new files created in the directory inherit these attributes. files http://ldp. Ext2fs had to include provision for extensions to allow users to benefit from new features without reformatting their filesystem. which implements Unix file semantics and offers advanced features. Of course. 
but not least. device special files and symbolic links. The function pointers contained in the filesystem descriptors allow the VFS to access the filesystem internal routines. ``Standard'' Ext2fs features The Ext2fs supports standard Unix file types: regular files. While the inode descriptor contains pointers to functions that can be used to act on any file (e. pointers to functions provided by the physical filesystem kernel code. Our goal was to provide a powerful filesystem. Last. Ext2fs reserves some blocks for the super user (root). Ext2fs provides long file names. and private data maintained by the physical filesystem code. One can set attributes on a file or on a directory. Normally.html (6 di 14) [08/03/2001 10. 5% of the blocks are reserved. While the original kernel code restricted the maximal filesystem size to 2 GB.it/LDP/khg/HyperNews/get/fs/ext2intro.

are created with the same group id as their parent directory. System V semantics are a bit more complex: if a directory has the setgid bit set, new files inherit the group id of the directory and subdirectories inherit the group id and the setgid bit; in the other case, files and subdirectories are created with the primary group id of the calling process.

BSD-like synchronous updates can be used in Ext2fs. A mount option allows the administrator to request that metadata (inodes, bitmap blocks, indirect blocks and directory blocks) be written synchronously on the disk when they are modified. This can be useful to maintain a strict metadata consistency but this leads to poor performance. Actually, this feature is not normally used, since in addition to the performance loss associated with using synchronous updates of the metadata, it can cause corruption in the user data which will not be flagged by the filesystem checker.

Ext2fs allows the administrator to choose the logical block size when creating the filesystem. Block sizes can typically be 1024, 2048 and 4096 bytes. Using big block sizes can speed up I/O since fewer I/O requests, and thus fewer disk head seeks, need to be done to access a file. On the other hand, big blocks waste more disk space: on the average, the last block allocated to a file is only half full, so as blocks get bigger, more space is wasted in the last block of each file. In addition, most of the advantages of larger block sizes are obtained by Ext2 filesystem's preallocation techniques (see section Performance optimizations).

Ext2fs implements fast symbolic links. A fast symbolic link does not use any data block on the filesystem. The target name is not stored in a data block but in the inode itself. This policy can save some disk space (no data block needs to be allocated) and speeds up link operations (there is no need to read a data block when accessing such a link). Of course, the space available in the inode is limited so not every link can be implemented as a fast symbolic link. The maximal size of the target name in a fast symbolic link is 60 characters. We plan to extend this scheme to small files in the near future.

Ext2fs keeps track of the filesystem state. A special field in the superblock is used by the kernel code to indicate the status of the file system. When a filesystem is mounted in read/write mode, its state is set to ``Not Clean''. When it is unmounted or remounted in read-only mode, its state is reset to ``Clean''. At boot time, the filesystem checker uses this information to decide if a filesystem must be checked. The kernel code also records errors in this field. When an inconsistency is detected by the kernel code, the filesystem is marked as ``Erroneous''. The filesystem checker tests this to force the check of the filesystem regardless of its apparently clean state.

Always skipping filesystem checks may sometimes be dangerous, so Ext2fs provides two ways to force checks at regular intervals. A mount counter is maintained in the superblock. Each time the filesystem is mounted in read/write mode, this counter is incremented. When it reaches a maximal value (also recorded in the superblock), the filesystem checker forces the check even if the filesystem is ``Clean''. A last check time and a maximal check interval are also maintained in the superblock. These two fields allow the administrator to request periodical checks. When the maximal check interval has been reached, the checker ignores the filesystem state and forces a filesystem check.

Ext2fs offers tools to tune the filesystem behavior. The tune2fs program can be used to modify:

- the error behavior. When an inconsistency is detected by the kernel code, the filesystem is marked as ``Erroneous'' and one of the three following actions can be done: continue normal execution, remount the filesystem in read-only mode to avoid corrupting the filesystem, make the kernel panic and reboot to run the filesystem checker.

- the maximal mount count.
- the maximal check interval.
- the number of logical blocks reserved for the super user.

Mount options can also be used to change the kernel error behavior.

An attribute allows the users to request secure deletion on files. When such a file is deleted, random data is written in the disk blocks previously allocated to the file. This prevents malicious people from gaining access to the previous content of the file by using a disk editor.

Last, new types of files inspired from the 4.4 BSD filesystem have recently been added to Ext2fs. Immutable files can only be read: nobody can write or delete them. This can be used to protect sensitive configuration files. Append-only files can be opened in write mode but data is always appended at the end of the file. Like immutable files, they cannot be deleted or renamed. This is especially useful for log files which can only grow.

Physical Structure

The physical structure of Ext2 filesystems has been strongly influenced by the layout of the BSD filesystem [McKusick et al. 1984]. A filesystem is made up of block groups. Block groups are analogous to BSD FFS's cylinder groups. However, block groups are not tied to the physical layout of the blocks on the disk, since modern drives tend to be optimized for sequential access and hide their physical geometry from the operating system.

The physical structure of a filesystem is represented in this table:

| Boot Sector | Block Group 1 | Block Group 2 | ... | Block Group N |

Each block group contains a redundant copy of crucial filesystem control information (superblock and the filesystem descriptors) and also contains a part of the filesystem (a block bitmap, an inode bitmap, a piece of the inode table, and data blocks). The structure of a block group is represented in this table:

| Super Block | FS descriptors | Block Bitmap | Inode Bitmap | Inode Table | Data Blocks |

Using block groups is a big win in terms of reliability: since the control structures are replicated in each block group, it is easy to recover from a filesystem where the superblock has been corrupted. This structure also helps to get good performance: by reducing the distance between the inode table and the data blocks, it is possible to reduce the disk head seeks during I/O on files.

In Ext2fs, directories are managed as linked lists of variable length entries. Each entry contains the inode number, the entry length, the file name and its length. By using variable length entries, it is possible to implement long file names without wasting disk space in directories. The structure of a directory entry is shown in this table:

| inode number | entry length | name length | filename |
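As a rough illustration of what "linked lists of variable length entries" means in practice, the sketch below packs a few entries into a byte buffer and then walks them by hopping one entry length at a time. The 8-byte header layout (4-byte inode number, 2-byte entry length, 2-byte name length, little-endian) and the helper names pack_entries and walk_entries are assumptions made for this example; the inode numbers are made up, while the entry lengths 16, 40 and 12 are those of the paper's three-file directory example.

```python
import struct

# Assumed on-disk header: inode number (4 bytes), entry length (2 bytes),
# name length (2 bytes), little-endian, followed by the file name.
HEADER = struct.Struct("<IHH")


def pack_entries(entries):
    """Pack (inode, entry_length, name) tuples into one directory block."""
    block = bytearray()
    for inode, entry_length, name in entries:
        raw = name.encode("ascii")
        if entry_length < HEADER.size + len(raw):
            raise ValueError("entry length too small for " + name)
        record = HEADER.pack(inode, entry_length, len(raw)) + raw
        # Pad the record out to its declared entry length.
        block += record.ljust(entry_length, b"\0")
    return bytes(block)


def walk_entries(block):
    """Yield (inode, entry_length, name) by following entry lengths."""
    offset = 0
    while offset < len(block):
        inode, entry_length, name_length = HEADER.unpack_from(block, offset)
        if entry_length < HEADER.size:
            raise ValueError("corrupt entry length")
        start = offset + HEADER.size
        name = block[start:start + name_length].decode("ascii")
        yield inode, entry_length, name
        # The entry length acts as the link to the next entry.
        offset += entry_length


# Hypothetical inode numbers 11, 12, 13; entry lengths as in the paper's example.
block = pack_entries([(11, 16, "file1"), (12, 40, "long_file_name"), (13, 12, "f2")])
for inode, entry_length, name in walk_entries(block):
    print(inode, entry_length, name)
```

Note that the second entry declares 40 bytes although its header and name need only 22. The walker does not care: it trusts the entry length, which is what lets an entry carry slack space without breaking the chain.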

As an example, the next table represents the structure of a directory containing three files: file1, long_file_name, and f2:

| inode number | entry length | name length | filename       |
| i1           | 16           | 05          | file1          |
| i2           | 40           | 14          | long_file_name |
| i3           | 12           | 02          | f2             |

Performance optimizations

The Ext2fs kernel code contains many performance optimizations, which tend to improve I/O speed when reading and writing files.

Ext2fs takes advantage of the buffer cache management by performing readaheads: when a block has to be read, the kernel code requests the I/O on several contiguous blocks. This way, it tries to ensure that the next block to read will already be loaded into the buffer cache. Readaheads are normally performed during sequential reads on files and Ext2fs extends them to directory reads, either explicit reads (readdir(2) calls) or implicit ones (namei kernel directory lookup).

Ext2fs also contains many allocation optimizations. Block groups are used to cluster together related inodes and data: the kernel code always tries to allocate data blocks for a file in the same group as its inode. This is intended to reduce the disk head seeks made when the kernel reads an inode and its data blocks.

When writing data to a file, Ext2fs preallocates up to 8 adjacent blocks when allocating a new block. Preallocation hit rates are around 75% even on very full filesystems. This preallocation achieves good write performance under heavy load. It also allows contiguous blocks to be allocated to files, thus it speeds up the future sequential reads.

These two allocation optimizations produce a very good locality of:

- related files through block groups
- related blocks through the 8 bits clustering of block allocations.

The Ext2fs library

To allow user mode programs to manipulate the control structures of an Ext2 filesystem, the libext2fs library was developed. This library provides routines which can be used to examine and modify the data of an Ext2 filesystem, by accessing the filesystem directly through the physical device.

The Ext2fs library was designed to allow maximal code reuse through the use of software abstraction techniques. For example, several different iterators are provided. A program can simply pass in a function to ext2fs_block_iterate(), which will be called for each block in an inode. Another iterator function allows a user-provided function to be called for each file in a directory.

Many of the Ext2fs utilities (mke2fs, e2fsck, tune2fs, dumpe2fs, and debugfs) use the Ext2fs library. This greatly simplifies the maintenance of these utilities, since any changes to reflect new features in the Ext2 filesystem format need only be made in one place--in the Ext2fs library. This code

reuse also results in smaller binaries, since the Ext2fs library can be built as a shared library image.

Because the interfaces of the Ext2fs library are so abstract and general, new programs which require direct access to the Ext2fs filesystem can very easily be written. For example, the Ext2fs library was used during the port of the 4.4BSD dump and restore backup utilities. Very few changes were needed to adapt these tools to Linux: only a few filesystem dependent functions had to be replaced by calls to the Ext2fs library.

The Ext2fs library provides access to several classes of operations. The first class are the filesystem-oriented operations. A program can open and close a filesystem, read and write the bitmaps, and create a new filesystem on the disk. Functions are also available to manipulate the filesystem's bad blocks list.

The second class of operations affect directories. A caller of the Ext2fs library can create and expand directories, as well as add and remove directory entries. Functions are also provided to both resolve a pathname to an inode number, and to determine a pathname of an inode given its inode number.

The final class of operations are oriented around inodes. It is possible to scan the inode table, read and write inodes, and scan through all of the blocks in an inode. Allocation and deallocation routines are also available and allow user mode programs to allocate and free blocks and inodes.

The Ext2fs tools

Powerful management tools have been developed for Ext2fs. These utilities are used to create, modify, and correct any inconsistencies in Ext2 filesystems. The mke2fs program is used to initialize a partition to contain an empty Ext2 filesystem.

The tune2fs program can be used to modify the filesystem parameters. As explained in section ``Advanced'' Ext2fs features, it can change the error behavior, the maximal mount count, the maximal check interval, and the number of logical blocks reserved for the super user.

The most interesting tool is probably the filesystem checker. E2fsck is intended to repair filesystem inconsistencies after an unclean shutdown of the system. The original version of e2fsck was based on Linus Torvalds's fsck program for the Minix filesystem. However, the current version of e2fsck was rewritten from scratch, using the Ext2fs library, and is much faster and can correct more filesystem inconsistencies than the original version.

The e2fsck program is designed to run as quickly as possible. Since filesystem checkers tend to be disk bound, this was done by optimizing the algorithms used by e2fsck so that filesystem structures are not repeatedly accessed from the disk. In addition, the order in which inodes and directories are checked is sorted by block number to reduce the amount of time spent in disk seeks. Many of these ideas were originally explored by [Bina and Emrath 1989] although they have since been further refined by the authors.

In pass 1, e2fsck iterates over all of the inodes in the filesystem and performs checks over each inode as an unconnected object in the filesystem. That is, these checks do not require any cross-checks to other filesystem objects. Examples of such checks include making sure the file mode is legal, and that all of the blocks in the inode are valid block numbers. During pass 1, bitmaps indicating which blocks and inodes are in use are compiled.
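The pass 1 bookkeeping described above can be sketched in a few lines. This is only a toy model under stated assumptions: inodes are represented as a plain dictionary mapping an inode number to its list of block numbers, first_block and block_count stand in for the filesystem limits, and pass1 is a hypothetical name. None of this mirrors e2fsck's actual data structures.

```python
def pass1(inodes, first_block, block_count):
    """Check each inode on its own and compile the in-use maps.

    inodes maps an inode number to a list of block numbers (toy model).
    Returns (inodes_in_use, blocks_in_use, invalid, multiply_claimed).
    """
    inodes_in_use = set()
    blocks_in_use = set()
    invalid = []               # (inode, block) pairs outside the valid range
    multiply_claimed = set()   # blocks referenced more than once

    for ino in sorted(inodes):
        inodes_in_use.add(ino)
        for blk in inodes[ino]:
            # A block number is valid only if it lies inside the filesystem.
            if not first_block <= blk < block_count:
                invalid.append((ino, blk))
                continue
            if blk in blocks_in_use:
                multiply_claimed.add(blk)
            blocks_in_use.add(blk)
    return inodes_in_use, blocks_in_use, invalid, multiply_claimed


# Hypothetical input: inode 12 shares block 32 with inode 11 and also
# points at a block number beyond the end of the filesystem.
toy = {2: [20, 21], 11: [30, 31, 32], 12: [32, 9999]}
used_inodes, used_blocks, bad, shared = pass1(toy, first_block=1, block_count=1024)
print(sorted(used_blocks), bad, sorted(shared))
```

Each inode is examined without reference to any other object, which is the property the text attributes to pass 1; the two sets play the role of the block and inode bitmaps.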

If e2fsck notices data blocks which are claimed by more than one inode, it invokes passes 1B through 1D to resolve these conflicts, either by cloning the shared blocks so that each inode has its own copy of the shared block, or by deallocating one or more of the inodes.

Pass 1 takes the longest time to execute, since all of the inodes have to be read into memory and checked. To reduce the I/O time necessary in future passes, critical filesystem information is cached in memory. The most important example of this technique is the location on disk of all of the directory blocks on the filesystem. This obviates the need to re-read the directory inode structures during pass 2 to obtain this information.

Pass 2 checks directories as unconnected objects. Since directory entries do not span disk blocks, each directory block can be checked individually without reference to other directory blocks. This allows e2fsck to sort all of the directory blocks by block number, and check directory blocks in ascending order, thus decreasing disk seek time. The directory blocks are checked to make sure that the directory entries are valid, and contain references to inode numbers which are in use (as determined by pass 1).

For the first directory block in each directory inode, the `.' and `..' entries are checked to make sure they exist, and that the inode number for the `.' entry matches the current directory. (The inode number for the `..' entry is not checked until pass 3.)

Pass 2 also caches information concerning the parent directory in which each directory is linked. (If a directory is referenced by more than one directory, the second reference of the directory is treated as an illegal hard link, and it is removed.)

It is worth noting that at the end of pass 2, nearly all of the disk I/O which e2fsck needs to perform is complete. Information required by passes 3, 4 and 5 is cached in memory; hence, the remaining passes of e2fsck are largely CPU bound, and take less than 5-10% of the total running time of e2fsck.

In pass 3, the directory connectivity is checked. E2fsck traces the path of each directory back to the root, using information that was cached during pass 2. At this time, the `..' entry for each directory is also checked to make sure it is valid. Any directories which can not be traced back to the root are linked to the /lost+found directory.

In pass 4, e2fsck checks the reference counts for all inodes, by iterating over all the inodes and comparing the link counts (which were cached in pass 1) against internal counters computed during passes 2 and 3. Any undeleted files with a zero link count are also linked to the /lost+found directory during this pass.

Finally, in pass 5, e2fsck checks the validity of the filesystem summary information. It compares the block and inode bitmaps which were constructed during the previous passes against the actual bitmaps on the filesystem, and corrects the on-disk copies if necessary.

The filesystem debugger is another useful tool. Debugfs is a powerful program which can be used to examine and change the state of a filesystem. Basically, it provides an interactive interface to the Ext2fs library: commands typed by the user are translated into calls to the library routines.

Debugfs can be used to examine the internal structures of a filesystem, manually repair a corrupted filesystem, or create test cases for e2fsck. Unfortunately, this program can be dangerous if it is used by

people who do not know what they are doing; it is very easy to destroy a filesystem with this tool. For this reason, debugfs opens filesystems for read-only access by default. The user must explicitly specify the -w flag in order to use debugfs to open a filesystem for read/write access.

Performance Measurements

Description of the benchmarks

We have run benchmarks to measure filesystem performance. Benchmarks have been made on a middle-end PC, based on an i486DX2 processor, using 16 MB of memory and two 420 MB IDE disks. The tests were run on Ext2 fs and Xia fs (Linux 1.1.62) and on the BSD Fast filesystem in asynchronous and synchronous mode (FreeBSD 2.0 Alpha--based on the 4.4BSD Lite distribution).

We have run two different benchmarks. The Bonnie benchmark tests I/O speed on a big file--the file size was set to 60 MB during the tests. It writes data to the file using character based I/O, rewrites the contents of the whole file, writes data using block based I/O, reads the file using character I/O and block I/O, and seeks into the file. The Andrew Benchmark was developed at Carnegie Mellon University and has been used at the University of California, Berkeley to benchmark BSD FFS and LFS. It runs in five phases: it creates a directory hierarchy, makes a copy of the data, recursively examines the status of every file, examines every byte of every file, and compiles several of the files.

Results of the Bonnie benchmark

The results of the Bonnie benchmark are presented in this table:

             Char Write   Block Write   Rewrite   Char Read   Block Read
             (KB/s)       (KB/s)        (KB/s)    (KB/s)      (KB/s)
BSD Async    710          684           401       721         888
BSD Sync     699          677           400       710         878
Ext2 fs      452          1237          536       397         1033
Xia fs       440          704           380       366         895

The results are very good in block oriented I/O: Ext2 fs outperforms other filesystems. This is clearly a benefit of the optimizations included in the allocation routines. Writes are fast because data is written in cluster mode. Reads are fast because contiguous blocks have been allocated to the file. Thus there is no head seek between two reads and the readahead optimizations can be fully used.

On the other hand, performance is better in the FreeBSD operating system in character oriented I/O. This is probably due to the fact that FreeBSD and Linux do not use the same stdio routines in their respective C libraries. It seems that FreeBSD has a more optimized character I/O library and its performance is better.

Results of the Andrew benchmark

The results of the Andrew benchmark are presented in this table:

             P1 Create   P2 Copy   P3 Stat   P4 Grep   P5 Compile
             (ms)        (ms)      (ms)      (ms)      (ms)
BSD Async    2203        7391      6319      17466     75314
BSD Sync     2330        7732      6317      17499     75681
Ext2 fs      790         4791      7235      11685     63210
Xia fs       934         5402      8400      12912     66997

The results of the first two passes show that Linux benefits from its asynchronous metadata I/O. In passes 1 and 2, directories and files are created and BSD synchronously writes inodes and directory entries. There is an anomaly, though: even in asynchronous mode, the performance under BSD is poor. We suspect that the asynchronous support under FreeBSD is not fully implemented.

In pass 3, the Linux and BSD times are very similar. This is a big improvement over the same benchmark run six months ago. While BSD used to outperform Linux by a factor of 3 in this test, the addition of a file name cache in the VFS has fixed this performance problem.

In passes 4 and 5, Linux is faster than FreeBSD mainly because it uses a unified buffer cache management. The buffer cache space can grow when needed and use more memory than the one in FreeBSD, which uses a fixed size buffer cache. Comparison of the Ext2fs and Xiafs results shows that the optimizations included in Ext2fs are really useful: the performance gain between Ext2fs and Xiafs is around 5-10%.

Conclusion

The Second Extended File System is probably the most widely used filesystem in the Linux community. It provides standard Unix file semantics and advanced features. Moreover, thanks to the optimizations included in the kernel code, it is robust and offers excellent performance.

Since Ext2fs has been designed with evolution in mind, it contains hooks that can be used to add new features. Some people are working on extensions to the current filesystem: access control lists conforming to the Posix semantics [IEEE 1992], undelete, and on-the-fly file compression.

Ext2fs was first developed and integrated in the Linux kernel and is now actively being ported to other operating systems. An Ext2fs server running on top of the GNU Hurd has been implemented. People are also working on an Ext2fs port in the LITES server, running on top of the Mach microkernel [Accetta et al. 1986], and in the VSTa operating system. Last, but not least, Ext2fs is an important part of the Masix operating system [Card et al. 1993], currently under development by one of the authors.

Acknowledgments

The Ext2fs kernel code and tools have been written mostly by the authors of this paper. Some other people have also contributed to the development of Ext2fs either by suggesting new features or by sending patches. We want to thank these contributors for their help.

References

[Accetta et al. 1986] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young. Mach: A New Kernel Foundation For UNIX Development. In Proceedings of the USENIX 1986 Summer Conference, June 1986.

[Bach 1986] M. Bach. The Design of the UNIX Operating System. Prentice Hall, 1986.

[Bina and Emrath 1989] E. Bina and P. Emrath. A Faster fsck for BSD Unix. In Proceedings of the USENIX Winter Conference, January 1989.

[Card et al. 1993] R. Card, E. Commelin, S. Dayras, and F. Mével. The MASIX Multi-Server Operating System. In OSF Workshop on Microkernel Technology for Distributed Systems, June 1993.

[IEEE 1992] SECURITY INTERFACE for the Portable Operating System Interface for Computer Environments - Draft 13. Institute of Electrical and Electronics Engineers, Inc, 1992.

[Kleiman 1986] S. Kleiman. Vnodes: An Architecture for Multiple File System Types in Sun UNIX. In Proceedings of the Summer USENIX Conference, pages 260--269, June 1986.

[McKusick et al. 1984] M. McKusick, W. Joy, S. Leffler, and R. Fabry. A Fast File System for UNIX. ACM Transactions on Computer Systems, 2(3):181--197, August 1984.

[Seltzer et al. 1993] M. Seltzer, K. Bostic, M. McKusick, and C. Staelin. An Implementation of a Log-Structured File System for UNIX. In Proceedings of the USENIX Winter Conference, January 1993.

[Tanenbaum 1987] A. Tanenbaum. Operating Systems: Design and Implementation. Prentice Hall, 1987.

Need /proc info

Forum: Filesystems
Keywords: linux filesystem /proc
Date: Wed, 04 Jun 1997 15:37:10 GMT
From: Kai Xu <kai@caip.rutgers.edu>

Hi,

Where can I find the detailed write-ups about the /proc file system in Linux. I want to know how it works and how it is implemented (not just the source code, I have it already).

Please reply to kai@caip.rutgers.edu, if you can.

Thanks,
Kai Xu

Where to find libext2 sources?

Forum: Filesystems
Date: Fri, 09 May 1997 14:27:59 GMT
From: Mark Salter <marks@qms.com>

The subject says it all.

Messages
1. Nevermind... by Mark Salter

Nevermind...

Forum: Filesystems
Re: Where to find libext2 sources? (Mark Salter)
Date: Fri, 09 May 1997 19:34:50 GMT
From: Mark Salter <unknown>

I found it. I didn't realize it was part of e2fsprogs.

New File System

Forum: Filesystems
Keywords: linux file system file_write memcpy fragment
Date: Fri, 18 Apr 1997 05:15:59 GMT
From: Vamsi Krishna <vamsi@cslab.uky.edu>

Hi,

I am implementing a new file system, in Linux. I have borrowed lot of code from Minix for this purpose. I had a problem while implementing the function for file_write. I had to modify this function a lot for supporting fragments. I tried to use all these functions (memcpy_tofs, memcpy_fromfs, memcpy, bcopy, memmove) to copy the contents of the last block into one or more fragments. When I used memcpy_tofs or memcpy_fromfs, I got segmentation faults. And when I used the other functions, it only used to copy the first fragment properly and not the remaining ones.

Could anyone please help me in this regard.

Thank you
Vamsi

Partition?

Forum: Filesystems
Keywords: Partition
Date: Wed, 26 Mar 1997 03:05:49 GMT
From: Wilfredo Lugo Beauchamp <ak47@amadeus.upr.clu.edu>

Hi,

I'm working in some Linux ext2 filesystem applications and I need to read an inode. The problem is that the function iget(...) needs the superblock. Is it possible get the superblock without the partition info? Are there any functions that return the current partition?

Thanks for your attention
Wilfredo Lugo

Messages
1. Please be more specific... by Theodore Ts'o

Please be more specific...

Forum: Filesystems
Re: Partition? (Wilfredo Lugo Beauchamp)
Keywords: ext2 filesystem
Date: Wed, 26 Mar 1997 04:28:53 GMT
From: Theodore Ts'o <tytso@mit.edu>

It's not clear from your question whether you are trying to write a user mode application, or trying to write kernel code. Parts of your question imply that you're writing user-mode code, but iget() is a kernel routine which isn't available to user-mode programs.

Assuming that you're writing a user-mode application, are you trying to read the filesystem directly using the device file, and using direct I/O to the device? Or are you trying to get some information from a filesystem that is already mounted? Why do you need to read an inode? What are you trying to do with it?

Why don't you be a bit more specific about what you're trying to do, and perhaps we can help you out.

Need documentation on userfs implementation

Forum: Filesystems
Keywords: userfs ftpfs
Date: Thu, 13 Feb 1997 10:08:41 GMT
From: Natchu Vishnu Priya <vishnu@cs.iitm.ernet.in>

I need documentation on userfs, in linux. I seem to be able to find only an alpha version of userfs available and the ftpfs in that does not work. A list of function, etc. on the lines of those on the vfs.. might be helpful.

-vishnu

ext2fs tools

Forum: Filesystems
Keywords: ext2fs tools
Date: Wed, 05 Feb 1997 00:42:12 GMT
From: Wilfredo Lugo Beauchamp <ak47@amadeus.upr.clu.edu>

Hi,

I'm working on an undelete for Linux for my Operating Systems course. I would like to know where I can find the set of tools e2fsprogs. I read a document prepared by Mr. Linus Tauro of the University of California and he used these tools.

Wilfredo Lugo

Messages
1. Where to find e2fsprogs by Theodore Ts'o

Where to find e2fsprogs

Forum: Filesystems
Re: ext2fs tools (Wilfredo Lugo Beauchamp)
Keywords: ext2fs tools
Date: Wed, 05 Feb 1997 15:46:52 GMT
From: Theodore Ts'o <tytso@mit.edu>

The website for the e2fsprogs package can be found at http://web.mit.edu/tytso/www/linux/e2fsprogs.html

You can also ftp it from tsx-11.mit.edu, in the /pub/linux/packages/ext2fs directory.

libext2fs documentation

Forum: Filesystems
Keywords: libext2fs filesystem
Date: Wed, 15 Jan 1997 18:07:22 GMT
From: James Beckett <jmb@isltd.insignia.com>

After a repartition and (win95) reformat I find I didn't save away all the data I wanted from an ext2 fs, so I've spent a morning grovelling through the source and figuring out the structure. (I think I can get the data back, only the first block group got overwritten by format)

Now I find that libext2fs exists, is there any documentation on how to use it, and how much does it depend on the filesystem being intact? Can it be told to use a backup superblock? I discovered that mount(8) can be given an option to do so, but the utilities (e2fsck, debugfs etc) don't seem to, so is it some limitation of libext2fs?

Messages
1. libext2fs documentation by Theodore Ts'o

libext2fs documentation

Forum: Filesystems
Re: libext2fs documentation (James Beckett)
Keywords: libext2fs filesystem
Date: Wed, 22 Jan 1997 23:20:07 GMT
From: Theodore Ts'o <tytso@mit.edu>

No, there currently isnt any documentation on the libext2fs library. It would be nice to have some documentation on it, though, and I am soliciting volunteers who would be willing to do a first pass documentation on it. I'm definitely willing to work with someone who is interested in doing that sort of tech writing. The library is relatively well structured internally, though, and so most people who have looked at it haven't had *too* much trouble figuring it out.

As for your question, you can absolutely tell it to use a backup superblock; just take a look at the source code for the function signature for the ext2fs_open() function. One of the arguments is "superblock", and that's the block number for the superblock.

You're right that debugfs currently doesn't have a method for opening the filesystem with a backup superblock. E2fsck most certainly does have a way to do this, though, and it's documented in the man page. Try using "e2fsck -b 8193".
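The block number 8193 in that command follows from simple arithmetic if one assumes 1024-byte blocks, 8192 blocks per group and a first data block of 1. Those values are assumptions for this sketch (the post does not state them), and superblock_copies is a hypothetical helper name. Under those assumptions, and given that the paper earlier in this document says each block group carries a copy of the superblock, the group start blocks can be listed like this:

```python
def superblock_copies(total_blocks, blocks_per_group=8192, first_data_block=1):
    """Return the block number where each group's superblock copy would sit.

    Assumes every block group begins with a copy of the superblock; the
    default geometry values are assumptions, not read from a real filesystem.
    """
    copies = []
    start = first_data_block
    while start < total_blocks:
        copies.append(start)
        start += blocks_per_group
    return copies


# A hypothetical 40 MB filesystem with 1024-byte blocks (40960 blocks).
print(superblock_copies(40 * 1024))
```

The first entry is the primary superblock; the second, 8193, is the first backup, which is the value passed to e2fsck -b in the post. A filesystem created with a different block size or group size would have its backups elsewhere.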

proc filesystem

Forum: Filesystems
Keywords: proc filesystem
Date: Fri, 18 Oct 1996 21:18:40 GMT
From: Praveen Krishnan <praveen@kurinji.iitm.ernet.in>

Hello,

Isn't there any documentation on the /proc filesystem ? There was a chapter on it in the earlier versions of KHG but i dont see it here. Maybe i've missed it, if so, could someone please direct me to where i would be able to get some programming related info on the /proc filesystem.

Thanx a lot
praveen

Messages
1. man proc by Michael K. Johnson

man proc

Forum: Filesystems
Re: proc filesystem (Praveen Krishnan)
Keywords: proc filesystem
Date: Thu, 24 Oct 1996 23:20:10 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

man proc

The proc chapter was removed because the man page was more complete and more up-to-date.

unisa. Where can I find it ? Thanks in advance to all will help me.11.unisa. Actually this filesystem is out of the kernel but the project is to push it into the system using nfs module.it or www.dia. 27 Sep 1996 11:22:14 GMT From: Ermelindo Mauriello <ermmau@ikonos.unisa.it http://ldp.it.Need NFS documentation The HyperNews Linux KHG Discussion Pages Need NFS documentation Forum: Filesystems Keywords: NFS Date: Fri.02] .dia. ermmau@mikonos.it> I'm working on a Transparent Cryptographic Filesystem for Linux based on the NFS concept.it/LDP/khg/HyperNews/get/fs/fs/4.html [08/03/2001 10.dia.iol. Documentation about this filesystem can be found ad mikonos. So I need documentation about NFS module.globenet.

Even more ext2 documentation!

Forum: Filesystems
Keywords: info
Date: Wed, 12 Jun 1996 16:51:45 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

Even more documentation on ext2fs is available. The ext2ed (available via ftp from tsx-11.mit.edu in /pub/linux/packages/ext2fs) contains a set of detailed papers on ext2fs, including an overview, a design document, and a users guide for ext2ed. Also, an Analysis of the Ext2fs structure is available.

More ext2 documentation
Forum: Filesystems
Keywords: ext2fs
Date: Mon, 03 Jun 1996 22:15:17 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

Remy Card recently announced that he has made postscript versions of the slides which he prepared for the 3rd International Linux Conference in Berlin available for ftp at tsx-11.mit.edu:/pub/linux/packages/ext2fs/slides/berlin96

One talk was on quota management for ext2fs, and the other was on the implementation of POSIX.6 ACL's for ext2fs. Four sets of slides are available: two-up slides on quotas, one-up slides on quotas, two-up slides on ACL's, and one-up slides on ACL's.

Ext2 paper
Forum: Filesystems
Keywords: ext2 filesystem
Date: Wed, 29 May 1996 21:02:45 GMT
From: Theodore Ts'o <tytso@mit.edu>

At one point, the ext2 paper which Remy, Stephen and I wrote was supposed to be going into the KHG. It was written for the Amsterdam Linux conference 1-2 years ago, but we got copyright clearance so that it could be included in the KHG. However, it seems that it never did get included into the KHG. Does anyone know what happened with that? I think it would be a very valuable addition to the KHG. I no longer have a copy of the original TeX, but Remy (as the primary author) should.

Messages
1. Done. by Michael K. Johnson

Done.
Forum: Filesystems
Re: Ext2 paper (Theodore Ts'o)
Keywords: ext2 filesystem
Date: Wed, 12 Jun 1996 16:11:41 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

See above! Thanks, Ted et al.

Linux Memory Management

The Linux Cache Flush Architecture
David Miller wrote this document explaining how Linux tries to flush caches optimally, and more importantly, how people porting Linux can write code to use the architecture.

Linux Memory Management
This chapter is rather old; it was originally written when Linux was only a year old, and was updated a year later. The first section, on Linux's memory management code, is out of date by now, but may still provide some sort of understanding of the basic structure that will help you navigate through more recent kernels. The second section, an overview of 80386 memory management, is still mostly applicable; there are a few assumptions that should not get in your way in general.

80386 Memory Management
Linux's memory management was originally conceived for Intel's 80386 processor, which has fairly rich and relatively easy-to-use memory management features. During the port to the Alpha, Linux's memory management was abstracted in a way that has been successfully applied to many different processors, including the memory management units (MMU's) that are supplied with the 386, Alpha, Sparc (there are several different MMUs for the Sparc; most are supported), Motorola 68K, PowerPC, ARM, and MIPS CPUs.

Copyright (C) 1996 Michael K. Johnson, johnsonm@redhat.com

The Linux Cache Flush Architecture

1. The Players

The TLB. This is more of a virtual entity than a strict model as far as the Linux flush architecture is concerned. The only characteristics it has are:

1. It keeps track of process/kernel mappings in some way, whether in software or hardware.
2. Architecture specific code may need to be notified when the kernel has changed a process/kernel mapping.

The cache. This entity is essentially "memory state" as the flush architecture views it. In general it has the following properties:

1. It will always hold copies of data which will be viewed as up-to-date by the local processor.
2. Its proper functioning may be related to the TLB and process/kernel page mappings in some way, that is to say they may depend upon each other.
3. It may, in a virtually cached configuration, cause aliasing problems if one physical page is mapped at the same time to two virtual pages, and due to the bits of an address used to index the cache line, the same piece of data can end up residing in the cache twice, allowing inconsistencies to result.
4. Devices and DMA may or may not be able to see the most up-to-date copy of a piece of data which resides in the cache of the local processor.
5. Currently, it is assumed that coherence in a multiprocessor environment is maintained by the cache/memory subsystem. That is to say, when one processor requests a datum on the memory bus and another processor has a more up-to-date copy, by whatever means the requestor will get the up-to-date copy owned by the other processor.

(NOTE: SMP architectures without hardware cache coherence mechanisms are indeed possible; the current flush architecture does not handle this. If at some point a Linux port to some system where this is an issue occurs, I will add the necessary hooks. But it will not be pretty.)

2. What the flush architecture cares about

1. At all times the memory management hardware's view of a set of process/kernel mappings will be consistent with that of the kernel page tables.
2. If the memory management kernel code makes a modification to a user process page, by modifying the data via the kernel-space alias of the underlying physical page, the user thread of control will see the right data before it is allowed to continue execution, regardless of the cache architecture and/or semantics.

In general, when address space state is changed (on the generic kernel memory management code's behalf only) the appropriate flush architecture hook will be called describing that state change in full.

3. What the flush architecture does not care about

1. DMA/Driver coherency. This includes DMA mappings (in the sense of MMU mappings) and cache/DMA datum consistency. These sorts of issues have no business in the flush architecture; see below how they should be handled.
2. Split Instruction/Data cache consistency with respect to modifications made to the process instruction space performed by the signal dispatch code. Again, see below on how this should be handled in another way.

4. The interfaces for the flush architecture and how to implement them

In general all of the routines described below will be called with the following sequence:

    flush_cache_foo(...);
    modify_address_space();
    flush_tlb_foo(...);

The logic here is:

1. It may be illegal in a given architecture for a piece of cache data to exist when no mapping for that data exists, therefore the flush must occur before the change is made.
2. It is possible for a given MMU/TLB architecture to perform a hardware table walk of the kernel page tables. Therefore the TLB flush is done after the page tables have been changed so that afterwards the hardware can only load in the new copy of the page table information to the TLB.

    void flush_cache_all(void);
    void flush_tlb_all(void);

These routines are to notify the architecture specific code that a change has been made to the kernel address space mappings, which means that the mappings of every process have effectively changed.

An implementation shall:

1. Eliminate all cache entries which are valid at this point in time when flush_cache_all is invoked. This applies to virtual cache architectures. If the cache is write-back in nature, this routine shall commit the cache data to memory before invalidating each entry. For physical caches, no action need be performed since physical mappings have no bearing on address space translations.
2. For flush_tlb_all, all TLB mappings for the kernel address space should be made consistent with the OS page tables by whatever means necessary. Note that with an architecture that possesses the notion of "MMU/TLB contexts" it may be necessary to perform this synchronization in every "active" MMU/TLB context.

    void flush_cache_mm(struct mm_struct *mm);
    void flush_tlb_mm(struct mm_struct *mm);

These routines notify the system that the entire address space described by the mm_struct passed is changing. Please take note of two things in particular:

1. The mm_struct is the unit of mmu/tlb real estate as far as the flush architecture is concerned. In particular, an mm_struct may map to one or many tasks or none!
2. This "address space" change is considered to be occurring in user space only. It is therefore safe for code to avoid flushing kernel tlb/cache entries if that is possible for efficiency.

An implementation shall:

1. For flush_cache_mm, whatever entries could exist in a virtual cache for the address space described by mm_struct are to be invalidated.
2. For flush_tlb_mm, the tlb/mmu hardware is to be placed in a state where it will see the (now current) kernel page table entries for the address space described by the mm_struct.

    void flush_cache_range(struct mm_struct *mm, unsigned long start, unsigned long end);
    void flush_tlb_range(struct mm_struct *mm, unsigned long start, unsigned long end);

A change to a particular range of user addresses in the address space described by the mm_struct passed is occurring. The two notes above for flush_*_mm() concerning the mm_struct passed apply here as well.

An implementation shall:

1. For flush_cache_range, on a virtually cached system, all cache entries which are valid for the range start to end in the address space described by the mm_struct are to be invalidated.
2. For flush_tlb_range, whatever actions necessary to cause the MMU/TLB hardware to not contain stale
translations are to be performed. This means that whatever translations are in the kernel page tables in the range start to end in the address space described by the mm_struct are to be what the memory management hardware will see from this point forward, by whatever means.

    void flush_cache_page(struct vm_area_struct *vma, unsigned long address);
    void flush_tlb_page(struct vm_area_struct *vma, unsigned long address);

A change to a single page at address within user space to the address space described by the vm_area_struct passed is occurring. An implementation, if need be, can get at the associated mm_struct for this address space via vma->vm_mm. The VMA is passed for convenience so that an implementation can inspect vma->vm_flags. This way in an implementation where the instruction and data spaces are not unified, one can check to see if VM_EXEC is set in vma->vm_flags to possibly avoid flushing the instruction space, for example.

The two notes above for flush_*_mm() concerning the mm_struct (passed indirectly via vma->vm_mm) apply here as well.

An implementation shall:

1. For flush_cache_page, on a virtually cached system, all cache entries which are valid for the page at address in the address space described by the VMA are to be invalidated.
2. For flush_tlb_page, whatever actions necessary to cause the MMU/TLB hardware to not contain stale translations are to be performed. This means that whatever translations are in the kernel page tables for the page at address in the address space described by the VMA passed are to be what the memory management hardware will see from this point forward, by whatever means.

    void flush_page_to_ram(unsigned long page);

This is the ugly duckling. But its semantics are necessary on so many architectures that I needed to add it to the flush architecture for Linux.

Briefly, when (as one example) the kernel services a COW fault, it uses the aliased mappings of all physical memory in kernel space to perform the copy of the page in question to a new page. This presents a problem for virtually indexed caches which are write-back in nature. In this case, the kernel touches two physical pages in kernel space. The code sequence being described here essentially looks like:

    do_wp_page()
    {
        [ ... ]
        copy_cow_page(old_page, new_page);
        flush_page_to_ram(old_page);
        flush_page_to_ram(new_page);
        flush_cache_page(vma, address);
        modify_address_space();
        free_page(old_page);
        flush_tlb_page(vma, address);
        [ ... ]
    }

(Some of the actual code has been simplified for example purposes.)

Consider a virtually indexed cache which is write-back. At the point in time at which the copy of the page occurs to the kernel space aliases, it is possible for the user space view of the original page to be in the caches (at the user's address, i.e. where the fault is occurring). The page copy can bring this data (for the old page) into the caches. It will also place the data (at the new kernel aliased mapping of the page) being copied to into the cache, and for write-back caches this data will be dirty or modified in the cache.

In such a case main memory will not see the most recent copy of the data. The caches are stupid, so for the new page we
are giving to the user, without forcing the cached data at the kernel alias to main memory the process will see the old contents of the page (i.e. whatever garbage was there before the copy done by COW processing above).

A concrete example of what was just described:

Consider a process which shares a page, read-only, with another task (or many) at virtual address 0x2000 in user space. And for example purposes let us say that this virtual address maps to physical page 0x14000.

             Virtual Pages
    task 1   --------------
             | 0x00000000 |
             --------------
             | 0x00001000 |             Physical Pages
             --------------             --------------
             | 0x00002000 | --\         | 0x00000000 |
             --------------    \        --------------
                                \       | ...        |
    task 2   --------------      |----> --------------
             | 0x00000000 |     /       | 0x00014000 |
             --------------    /        --------------
             | 0x00001000 |   /         | ...        |
             --------------  /          --------------
             | 0x00002000 | --/
             --------------

If task 2 tries to write to the read-only page at address 0x2000 we will get a fault and eventually end up at the code fragment shown above in do_wp_page().

The kernel will get a new page for task 2, let us say this is physical page 0x26000, and let us also say that the kernel alias mappings for physical pages 0x14000 and 0x26000 can reside in two unique cache lines at the same time based upon the line indexing scheme of this cache.

The page contents get copied from the kernel mappings for physical page 0x14000 to the ones for physical page 0x26000.

At this point in time, on a write-back virtually indexed cache architecture we have a potential inconsistency. The new data copied into physical page 0x26000 is not necessarily in main memory at this point; in fact it could be all in the cache only at the kernel alias of the physical address. Also, the (non-modified, i.e. clean) data for the original (old) page is in the cache at the kernel alias for physical page 0x14000. This can produce an inconsistency later on, so to be safe it is best to eliminate the cached copies of this data as well.

Let us say we did not write back the data for the page at 0x26000 and we let it just stay there. We would return to task 2 (who has this new page now mapped in at virtual address 0x2000), he would complete his write, then he would read some other piece of data in this new page (i.e. expecting the contents that existed there beforehand). At this point in time if the data is left in the cache at the kernel alias for the new physical page, the user will get whatever was in main memory before the copy for his read. This can lead to disastrous results.

Therefore an architecture shall:

On virtually indexed cache architectures, do whatever is necessary to make main memory consistent with the cached copy of the kernel space page passed.

NOTE: It is actually necessary for this routine to invalidate lines in a virtual cache which is not write-back in nature. To see why this is really necessary, replay the above example with task 1 and 2, but this time fork() yet another task 3 before the COW faults occur, and consider the contents of the caches in both kernel and user space if the following sequence occurs in exact succession:

1. task 1 reads some of the page at 0x2000
2. task 2 COW faults the page at 0x2000
3. task 2 performs his writes to the new page at 0x2000
4. task 3 COW faults the page at 0x2000

Even on a non-writeback virtually indexed cache, task 3 can see inconsistent data after the COW fault if flush_page_to_ram does not invalidate the kernel aliased physical page from the cache.

    void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t pte);

Although not strictly part of the flush architecture, on certain architectures some critical operations and checks need to be performed here for things to work out properly and for the system to remain consistent.

In particular, for virtually indexed caches this routine must check to see that the new mapping being added by the current page fault does not add a "bad alias" to user space.

A "bad alias" is defined as two or more mappings (at least one of which is writable) to two or more virtual pages which all translate to the same exact physical page, and due to the indexing algorithm of the cache can also reside in unique and mutually exclusive cache lines.

If such a "bad alias" is detected an implementation needs to resolve this inconsistency somehow. One solution is to walk through all of the mappings and change the page tables to mark these pages as "non-cacheable" if the hardware allows such a thing.

The checks for this are very simple; all an implementation needs to do essentially is:

    if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))
        check_for_potential_bad_aliases();

So for the common case (shared writable mappings are extremely rare) only one comparison is needed for systems with virtually indexed caches.

5. Implications for SMP

Depending upon the architecture certain amends may be needed to allow the flush architecture to work on an SMP system.

The main concern is whether one of the above flush operations causes the entire system to globally see the flush, or the flush is only guaranteed to be seen by the local processor.

In the latter case a cross calling mechanism is needed. The current two SMP systems supported under Linux (Intel and Sparc) use inter-processor interrupts to "broadcast" the flush operation and cause it to run locally on all processors if necessary.

As an example, on sun4m Sparc systems all processors in the system must execute the flush request to guarantee consistency across the entire system. However, on sun4d Sparc machines, TLB flushes performed on the local processor are broadcast over the system bus by the hardware and therefore a cross call is not necessary.

6. Implications for context based MMU/CACHE architectures

The entire idea behind the concept of MMU and cache context facilities is to allow many address spaces to share the cache/mmu resources on the cpu.

To take full advantage of such a facility, and still maintain coherency as described above, requires some extra consideration from the implementor.

The issues involved will vary greatly from one implementation to another, at least this has been the experience of the author. But in particular some of the issues are likely to be:

1. The relationship of kernel space mappings to user space ones, as far as contexts are concerned. On some systems
kernel mappings have a "global" attribute, in that the hardware does not concern itself with context information when a translation is made which has this attribute. Therefore one flush (in any context) of a kernel cache/mmu mapping could be sufficient.

However it is possible in other implementations for the kernel to share the context key associated with a particular address space. It may be necessary in such a case to walk into all contexts which are currently valid and perform the complete flush in each one for a kernel address space flush.

2. The cost of per-context flushes can become a key issue, especially with respect to the TLB. For example, if a tlb flush is needed on a large range of addresses (or an entire address space) it may be more prudent to allocate and assign a new mmu context to this process for the sake of efficiency.

7. How to handle what the flush architecture does not do, with examples

The flush architecture just described makes no amends for device/DMA coherency with cached data. It also has no provisions for any mapping strategies necessary for DMA and devices should that be necessary on a certain machine Linux is ported to. Such issues are none of the flush architecture's business.

Such issues are most cleanly dealt with at the device driver level. The author is convinced of this after his experience with a common set of Sparc device drivers which needed to all function correctly on more than a handful of cache/mmu and bus architectures in the same kernel.

In fact this implementation is more efficient because the driver knows exactly when DMA needs to see consistent data or when DMA is going to create an inconsistency which must be resolved. Any attempt to reach this level of efficiency via hooks added to the generic kernel memory management code would be complex and if anything very unclean.

As an example, consider on the Sparc how DMA buffers are handled. When a device driver must perform DMA to/from either a single buffer or a scatter list of many buffers it uses a set of abstract routines:

    char *(*mmu_get_scsi_one)(char *, unsigned long, struct linux_sbus *sbus);
    void  (*mmu_get_scsi_sgl)(struct mmu_sglist *, int, struct linux_sbus *sbus);
    void  (*mmu_release_scsi_one)(char *, unsigned long, struct linux_sbus *sbus);
    void  (*mmu_release_scsi_sgl)(struct mmu_sglist *, int, struct linux_sbus *sbus);
    void  (*mmu_map_dma_area)(unsigned long addr, int len);

Essentially the mmu_get_* routines are passed a pointer or a set of pointers and size specifications to areas in kernel space for which DMA will occur; they return a DMA capable address (i.e. one which can be loaded into the DMA controller for the transfer). When the driver is done with the DMA and the transfer has completed the mmu_release_* routines must be called with the DMA'able address(es) so that the resources can be freed (if necessary) and cache flushes can be performed (if necessary).

The final routine is there for drivers which need to have a block of DMA memory for a long period of time; for example a networking driver would use this for a pool of transmit and receive buffers.

The final argument is a Sparc specific entity which allows the machine level code to perform the mapping if DMA mappings are set up on a per-BUS basis.

8. Open issues

There seem to be some very stupid cache architectures out there which want to cause trouble when an alias is placed into the cache (even a safe one where none of the aliased cache entries are writable!). Of note is the MIPS R4000 which will give an exception when such a situation occurs; these can occur when COW processing is happening in the current implementation. On most chips which do something stupid like this, the exception handler can flush the entries in the cache being complained about and all is well. The author is mostly concerned about the cost of these exceptions during COW processing and the effects this will have for system performance. Perhaps a new flush is necessary, which would be performed before the page copy in COW fault processing, which essentially is to flush a user space page if not doing so
would cause the trouble just described.

There has been heated talk lately about adding page flipping facilities for very intelligent networking hardware. It may be necessary to extend the flush architecture to provide the interfaces and facilities necessary for these changes to the networking code.

And by all means, the flush architecture is always subject to improvements and changes to handle new issues or new hardware which presents a problem that was to this point unknown.

David S. Miller
davem@caip.rutgers.edu
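
The flush_page_to_ram() example in the cache flush document (the COW copy through the kernel alias of physical page 0x26000, followed by a user read at 0x2000) can be replayed in a small user-space model. This is only a sketch under stated assumptions, not kernel code: it assumes a hypothetical direct-mapped, write-back cache with eight page-sized lines, indexed by virtual page number and tagged by physical page number, and it stores a single byte per page. All helper names (lookup, cached_write, cached_read, flush_line, cow_copy_then_user_read) and the kernel alias address 0xc0026000 are invented for the illustration.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 0x1000u
#define NLINES    8u      /* hypothetical: 8 page-sized lines (a 32 KB cache) */
#define NPAGES    64u     /* toy main memory: one byte stands in for each page */

struct line {
    int      valid;
    int      dirty;
    uint32_t ppn;         /* physical page number, used as the tag */
    uint8_t  data;
};

static uint8_t     main_memory[NPAGES];
static struct line cache[NLINES];

/* The line is chosen from the VIRTUAL address; this is what allows aliases. */
static struct line *lookup(uint32_t vaddr)
{
    return &cache[(vaddr / PAGE_SIZE) % NLINES];
}

/* Commit a dirty line to memory (if needed), then drop it from the cache. */
static void writeback_and_invalidate(struct line *l)
{
    if (l->valid && l->dirty)
        main_memory[l->ppn] = l->data;
    l->valid = 0;
    l->dirty = 0;
}

static void cached_write(uint32_t vaddr, uint32_t ppn, uint8_t value)
{
    struct line *l = lookup(vaddr);
    if (!(l->valid && l->ppn == ppn)) {   /* miss: evict whatever is there */
        writeback_and_invalidate(l);
        l->valid = 1;
        l->ppn = ppn;
    }
    l->data = value;
    l->dirty = 1;                         /* write-back: memory is not updated */
}

static uint8_t cached_read(uint32_t vaddr, uint32_t ppn)
{
    struct line *l = lookup(vaddr);
    if (!(l->valid && l->ppn == ppn)) {   /* miss: fill from main memory */
        writeback_and_invalidate(l);
        l->valid = 1;
        l->ppn = ppn;
        l->data = main_memory[ppn];
    }
    return l->data;
}

/* Stand-in for the effect flush_page_to_ram() must have on one kernel alias. */
static void flush_line(uint32_t kernel_vaddr, uint32_t ppn)
{
    struct line *l = lookup(kernel_vaddr);
    if (l->valid && l->ppn == ppn)
        writeback_and_invalidate(l);
}

/* Copy through the kernel alias, optionally flush, then read as the user. */
uint8_t cow_copy_then_user_read(int do_flush)
{
    const uint32_t new_ppn      = 0x26;         /* physical page 0x26000 */
    const uint32_t kernel_alias = 0xc0026000u;  /* indexes line 6 here */
    const uint32_t user_vaddr   = 0x00002000u;  /* indexes line 2 here */

    memset(cache, 0, sizeof cache);
    memset(main_memory, 0, sizeof main_memory);
    main_memory[new_ppn] = 0xAA;                /* garbage from earlier use */

    cached_write(kernel_alias, new_ppn, 0x55);  /* the COW page copy */
    if (do_flush)
        flush_line(kernel_alias, new_ppn);
    return cached_read(user_vaddr, new_ppn);
}
```

In this model cow_copy_then_user_read(0) returns the stale 0xAA, because the copied byte is still sitting dirty in line 6 while the user address fills line 2 from main memory; cow_copy_then_user_read(1) returns the copied 0x55. Real caches differ in line size, associativity and tagging, so the model only shows why the write-back and invalidate step is needed, not how any particular port implements it.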

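The "bad alias" test that the cache flush document above describes for update_mmu_cache() can also be sketched in user-space C. The following is an illustration under assumptions, not the code of any Linux port: the flag values, the 32 KB cache size, and the function names index_bits_differ and is_bad_alias are all hypothetical. It combines the VM_WRITE|VM_SHARED test quoted in the document with one plausible way to decide whether two virtual addresses of the same physical page would land in different cache lines.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values; a real kernel defines its own. */
#define VM_WRITE   0x0002u
#define VM_SHARED  0x0008u

#define PAGE_SIZE  0x1000u
#define CACHE_SIZE 0x8000u   /* hypothetical 32 KB virtually indexed cache */

/*
 * Two virtual mappings of one physical page can occupy different cache
 * lines when the index bits above the page offset differ.
 */
static bool index_bits_differ(uint32_t va1, uint32_t va2)
{
    uint32_t mask = (CACHE_SIZE - 1u) & ~(PAGE_SIZE - 1u);
    return ((va1 ^ va2) & mask) != 0;
}

/*
 * Returns true when adding a mapping at new_va, next to an existing mapping
 * of the same physical page at existing_va, would create a bad alias.
 * Only shared writable mappings need the more expensive comparison.
 */
bool is_bad_alias(unsigned vm_flags, uint32_t new_va, uint32_t existing_va)
{
    if ((vm_flags & (VM_WRITE | VM_SHARED)) != (VM_WRITE | VM_SHARED))
        return false;            /* the common case: one comparison only */
    return index_bits_differ(new_va, existing_va);
}
```

With these assumed parameters, a shared writable mapping at 0x2000 next to one at 0x5000 is reported as a bad alias, while 0x2000 next to 0xA000 is not, because the two addresses differ only in bits above the cache index. A private or read-only mapping is never reported, matching the single-comparison fast path the document describes.
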
Linux Memory Management Overview

[Note: This overview of Linux's Memory Management is several years old. Linux's MM has gone through a nearly complete rewrite since this was written. However, if you can't understand the Linux MM code, reading this and understanding that this documents the predecessor to the current MM code may help you out.]

The Linux memory manager implements demand paging with a copy-on-write strategy relying on the 386's paging support. A process acquires its page tables from its parent (during a fork()) with the entries marked as read-only or swapped. Then, if the process tries to write to that memory space, and the page is a copy-on-write page, it is copied, and the page is marked read-write. An exec() results in the reading in of a page or so from the executable. The process then faults in any other pages it needs.

Each process has a page directory, which means it can access 1K page tables pointing to 1M pages of 4 KB each, which is 4 GB of memory. A process' page directory is initialized during a fork by copy_page_tables(). The idle process has its page directory initialized during the initialization sequence.

Each user process has a local descriptor table that contains a code segment and data-stack segment. These user segments extend from 0 to 3 GB (0xc0000000). In user space, linear addresses and logical addresses are identical.

On the 80386, linear addresses run from 0 GB to 4 GB. A linear address points to a particular memory location within this space. A linear address is not a physical address; it is a virtual address. A logical address consists of a selector and an offset. The selector points to a segment and the offset tells how far into that segment the address is located.

The kernel code and data segments are privileged segments defined in the global descriptor table and extend from 3 GB to 4 GB. The swapper page directory (swapper_pg_dir) is set up so that logical addresses and physical addresses are identical in kernel space.

The space above 3 GB appears in a process' page directory as pointers to kernel page tables. This space is invisible to the process in user mode but the mapping becomes relevant when privileged mode is entered, for example, to handle a system call. Supervisor mode is entered within the context of the current process so address translation occurs with respect to the process' page directory but using kernel segments. This is identically the mapping produced by using the swapper_pg_dir and kernel segments as both page directories use the same page tables in this space. Only task[0] (the idle task, sometimes called the swapper task for historical reasons, even though it has nothing to do with swapping in the Linux implementation) uses the swapper_pg_dir directly.

- The user process' segment_base = 0x00, page_dir private to the process.
- user process makes a system call: segment_base = 0xc0000000, page_dir = same user page_dir.
- swapper_pg_dir contains a mapping for all physical pages from 0xc0000000 to 0xc0000000 + end_mem, so the first 768 entries in swapper_pg_dir are 0's, and then there are 4 or more that
point to kernel page tables.
- The user page directories have the same entries as swapper_pg_dir above 768. The first 768 entries map the user space.

The upshot is that whenever the linear address is above 0xc0000000 everything uses the same kernel page tables.

The user stack sits at the top of the user data segment and grows down. The kernel stack is not a pretty data structure or segment that I can point to with a ``yon lies the kernel stack.'' A kernel_stack_frame (a page) is associated with each newly created process and is used whenever the kernel operates within the context of that process. Bad things would happen if the kernel stack were to grow below its current stack frame. [Where is the kernel stack put? I know that there is one for every process, but where is it stored when it's not being used?]

User pages can be stolen or swapped. A user page is one that is mapped below 3 GB in a user page table. This region does not contain page directories or page tables. Only dirty pages are swapped.

Minor alterations are needed in some places (tests for process memory limits comes to mind) to provide support for programmer defined segments. [There is now a modify_ldt() system call used by dosemu, Wine, TWIN, and Wabi to create arbitrary segments.]

Physical memory

Here is a map of physical memory before any user processes are executed. The column on the left gives the starting address of the item; numbers in italics are approximate. The column in the middle names the item(s). The column on the far right gives the relevant routine or variable name or explains the entry.

    0x110000   FREE                  memory_end or high_memory
               mem_map               mem_init()
               inode_table           inode_init()
               device data           device_init()*
    0x100000   more pg_tables        paging_init()
    0x0A0000   RESERVED
    0x060000   FREE
               low_memory_start
    0x006000   kernel code + data
               floppy_track_buffer
               bad_pg_table          used by page_fault_handlers to kill
               bad_page              processes gracefully when out of memory.
    0x002000   pg0                   the first kernel page table.
    0x001000   swapper_pg_dir        the kernel page directory.
    0x000000   null page
*device-inits that acquire memory are (main.c): profil_buffer, con_init, psaux_init, rd_init, scsi_dev_init.

Note that all memory not marked as FREE is RESERVED (mem_init). RESERVED pages belong to the kernel and are never freed or swapped.

A user process' view of memory

    0xc0000000   The invisible kernel      reserved
                 initial stack
                 room for stack growth     4 pages
    0x60000000   shared libraries
    brk          unused
                 malloc memory
    end_data     uninitialized data
    end_code     initialized data
    0x00000000   text

Both the code segment and data segment extend all the way from 0x00 to 3 GB. Currently the page fault handler do_wp_page checks to ensure that a process does not write to its code space. However, by catching the SEGV signal, it is possible to write to code space, causing a copy-on-write to occur. The handler do_no_page ensures that any new pages the process acquires belong to either the executable, a shared library, the stack, or lie within the brk value.

A user process can reset its brk value by calling sbrk(). This is what malloc() does when it needs to. The text and data portions are allocated on separate pages unless one chooses the -N compiler option. Shared library load addresses are currently taken from the shared image itself. The address is between 1.5 GB and 3 GB, except in special cases.

User process Memory Allocation

                             swappable   shareable
    a few code pages         Y           Y
    a few data pages         Y           N?
    stack                    Y           N
    pg_dir                   N           N
    code/data page_table     N           N
    stack page_table         N           N
    task_struct              N           N
    kernel_stack_frame       N           N
    shlib page_table         N           N
    a few shlib pages        Y           Y?

[What do the question marks mean? Do they mean that they might go either way, or that you are not sure?]

The stack, shlibs and data are too far removed from each other to be spanned by one page table. All kernel page_tables are shared by all processes so they are not in the list. Only dirty pages are swapped. Clean pages are stolen so the process can read them back in from the executable if it likes. Mostly only clean pages are shared. A dirty page ends up shared across a fork until the parent or child chooses to write to it again.

Memory Management data in the process table

Here is a summary of some of the data kept in the process table which is used for memory management:

- Process memory limits: ulong start_code, end_code, end_data, brk, start_stack;
- Page fault counting: ulong min_flt, maj_flt, cmin_flt, cmaj_flt;
- Local descriptor table: struct desc_struct ldt[32] is the local descriptor table for task.
- rss: number of resident pages.
- swappable: if 0, then process's pages will not be swapped.
- kernel_stack_page: pointer to page allocated in fork.
- saved_kernel_stack: V86 mode stuff
- struct tss
  - Stack segments
      esp0: kernel stack pointer (kernel_stack_page)
      ss0: kernel stack segment (0x10)
      esp1 = ss1 = esp2 = ss2 = 0: unused privilege levels.
  - Segment selectors: ds = es = fs = gs = ss = 0x17, cs = 0x0f. All point to segments in the current ldt[].
  - cr3: points to the page directory for this process.
  - ldt: _LDT(n) selector for current task's LDT.

Memory initialization

In start_kernel() (main.c) there are 3 variables related to memory initialization:

memory_start      starts out at 1 MB. Updated by device initialization.
memory_end        end of physical memory: 8 MB, 16 MB, or whatever.
low_memory_start  end of the kernel code and data that is loaded initially.

Each device init typically takes memory_start and returns an updated value if it allocates space at memory_start (by simply grabbing it). paging_init() initializes the page tables in the swapper_pg_dir (starting at 0xc0000000) to cover all of the physical memory from memory_start to memory_end. Actually the first 4 MB is done in startup_32 (head.S). memory_start is incremented if any new page_tables are added. The first page is zeroed to trap null pointer references in the kernel.

In sched_init() the ldt and tss descriptors for task[0] are set in the GDT, and loaded into the TR and LDTR (the only time it's done explicitly). A trap gate (0x80) is set up for system_call(). The nested task flag is turned off in preparation for entering user mode. The timer is turned on. The task_struct for task[0] appears in its entirety in <linux/sched.h>.

mem_map is then constructed by mem_init() to reflect the current usage of physical pages. This is the state reflected in the physical memory map of the previous section.

Then Linux moves into user mode with an iret after pushing the current ss, esp, etc. Of course the user segments for task[0] are mapped right over the kernel segments so execution continues exactly where it left off.

task[0]:
  pg_dir = swapper_pg_dir, which means the only addresses mapped are in the range 3 GB to 3 GB + high_memory.
  LDT[1] = user code, base=0xc0000000, size = 640K
  LDT[2] = user data, base=0xc0000000, size = 640K

The first exec() sets the LDT entries for task[1] to the user values of base = 0x0, limit = TASK_SIZE = 0xc0000000. Thereafter, no process sees the kernel segments while in user mode.

Processes and the Memory Manager

Memory-related work done by fork():

- Memory allocation

  - 1 page for the task_struct.
  - 1 page for the kernel stack.
  - 1 for the pg_dir and some for pg_tables (copy_page_tables)
- Other changes
  - ss0 set to kernel stack segment (0x10) to be sure?
  - esp0 set to top of the newly allocated kernel_stack_page
  - cr3 set by copy_page_tables() to point to newly allocated page directory.
  - ldt = _LDT(task_nr) creates new ldt descriptor.
  - descriptors set in gdt for new tss and ldt[].
  - The remaining registers are inherited from parent.

The processes end up sharing their code and data segments (although they have separate local descriptor tables, the entries point to the same segments). The stack and data pages will be copied when the parent or child writes to them (copy-on-write).

Memory-related work done by exec():

- memory allocation
  - 1 page for exec header entire file for omagic
  - 1 page or more for stack (MAX_ARG_PAGES)
- clear_page_tables() used to remove old pages.
- change_ldt() sets the descriptors in the new LDT[]
    ldt[1] = code base=0x00, limit=TASK_SIZE
    ldt[2] = data base=0x00, limit=TASK_SIZE
  These segments are DPL=3, P=1, S=1, G=1. type=a (code) or 2 (data)
- Up to MAX_ARG_PAGES dirty pages of argv and envp are allocated and stashed at the top of the data segment for the newly created user stack.
- Set the instruction pointer of the caller eip = ex.a_entry
- Set the stack pointer of the caller to the stack just created (esp = stack pointer). These will be popped off the stack when the caller resumes.
- update memory limits
    end_code = ex.a_text
    end_data = end_code + ex.a_data
    brk = end_data + ex.a_bss

Interrupts and traps are handled within the context of the current task. In particular, the page directory of the current process is used in address translation. The segments, however, are kernel segments so that all linear addresses point into kernel memory. For example, assume a user process invokes a system call and the kernel wants to access a variable at address 0x01. The linear address is 0xc0000001 (using kernel segments) and the physical address is 0x01. The latter is because the process' page directory maps this range exactly as page_pg_dir.

The kernel space (0xc0000000 + high_memory) is mapped by the kernel page tables.

The kernel page tables are themselves part of the RESERVED memory. They are therefore shared by all processes. During a fork copy_page_tables() treats RESERVED page tables differently. It sets pointers in the process page directories to point to kernel page tables and does not actually allocate new page tables as it does normally. As an example the kernel_stack_page (which sits somewhere in the kernel space) does not need an associated page_table allocated in the process' pg_dir to map it.

The interrupt instruction sets the stack pointer and stack segment from the privilege 0 values saved in the tss of the current task. Note that the kernel stack is a really fragmented object--it's not a single object, but rather a bunch of stack frames each allocated when a process is created, and released when it exits. The kernel stack should never grow so rapidly within a process context that it extends below the current frame.

Acquiring and Freeing Memory: Paging Policy

[Note: swapping has also been massively changed in recent kernels, with the ``kswap'' changes.]

When any kernel routine wants memory it ends up calling get_free_page(). This is at a lower level than kmalloc() (in fact kmalloc() uses get_free_page() when it needs more memory).

get_free_page() takes one parameter, a priority. Possible values are GFP_BUFFER, GFP_KERNEL, GFP_NFS, and GFP_ATOMIC. It takes a page off of the free_page_list, updates mem_map, zeroes the page and returns the physical address of the page (note that kmalloc() returns a physical address. The logic of the mm depends on the identity map between logical and physical addresses).

That itself is simple enough. The problem, of course, is that the free_page_list may be empty. If you did not request an atomic operation, at this stage, you enter into the realm of page stealing which we'll go into in a moment. As a last resort (and for atomic requests) a page is torn off from the secondary_page_list (as you may have guessed, when pages are freed, the secondary_page_list gets filled up first).

The actual manipulation of the page_lists and mem_map occurs in this mysterious macro called REMOVE_FROM_MEM_QUEUE() which you probably never want to look into. Suffice it to say that interrupts are disabled. [I think that this should be explained here. It is not that hard...]

Now back to the page stealing bit. get_free_page() calls try_to_free_page() which repeatedly calls shrink_buffers() and swap_out() in that order until it is successful in freeing a page. The priority is increased on each successive iteration so that these two routines run through their page stealing loops more often.

Here's one run through swap_out():

- Run through the process table and get a swappable task, say, Q.
- Find a user page table (not RESERVED) in Q's space.
- For each page in the table try_to_swap_out(page).
- Quit when a page is freed.

Note that swap_out() (called by try_to_free_page()) maintains static variables so it may resume the search where it left off on the previous call.
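The control flow just described can be modelled in user space. The sketch below is not kernel code: the task table, every toy_ name and the number of passes are assumptions made for illustration. Only the order of the calls (buffers first, then swap_out()), the repetition over successive passes, and the static cursor that lets the scan resume where it stopped are taken from the text.

```c
#define TOY_NR_TASKS 4
#define TOY_PASSES   6   /* assumed number of escalating passes */

/* Hypothetical task table: each entry is just a count of stealable pages. */
int toy_stealable[TOY_NR_TASKS] = {0, 2, 0, 1};

/* Stand-in for shrink_buffers(): this model has no buffer cache to shrink. */
int toy_shrink_buffers(int pass)
{
    (void)pass;
    return 0;
}

/* Stand-in for swap_out(): the static cursor survives between calls, so
 * each call resumes the scan at the task where the previous one stopped. */
int toy_swap_out(int pass)
{
    static int cursor = 0;
    (void)pass;
    for (int n = 0; n < TOY_NR_TASKS; n++) {
        if (toy_stealable[cursor] > 0) {
            toy_stealable[cursor]--;   /* one page freed: quit */
            return 1;
        }
        cursor = (cursor + 1) % TOY_NR_TASKS;
    }
    return 0;                          /* nothing left to steal */
}

/* Stand-in for try_to_free_page(): try the buffers, then swap_out(),
 * and repeat over successive passes until a page has been freed. */
int toy_try_to_free_page(void)
{
    for (int pass = 0; pass < TOY_PASSES; pass++) {
        if (toy_shrink_buffers(pass))
            return 1;
        if (toy_swap_out(pass))
            return 1;
    }
    return 0;
}
```

With the table above, three calls to toy_try_to_free_page() each free one page and the fourth reports failure, since nothing stealable is left.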

try_to_swap_out() scans the page tables of all user processes and enforces the stealing policy:

1. Do not fiddle with RESERVED pages.
2. Age the page if it is marked accessed (1 bit).
3. Don't tamper with recently acquired pages (last_free_pages[]).
4. Leave dirty pages with map_counts > 1 alone.
5. Decrement the map_count of clean pages.
6. Free clean pages if they are unmapped.
7. Swap dirty pages with a map_count of 1.

Of these actions, 6 and 7 will stop the process as they result in the actual freeing of a physical page. Action 5 results in one of the processes losing an unshared clean page that was not accessed recently (decrement Q->rss) which is not all that bad, but the cumulative effects of a few iterations can slow down a process considerably. At present, there are 6 iterations, so a page shared by 6 processes can get stolen if it is clean.

Page table entries are updated and the TLB invalidated.

The actual work of freeing the page is done by free_page(), the complement of get_free_page(). It ignores RESERVED pages, updates mem_map, then frees the page and updates the page_lists if it is unmapped. For swapping (in 7 above), write_swap_page() gets called and does nothing remarkable from the memory management perspective.

The details of shrink_buffers() would take us too far afield. Essentially it looks for free buffers, then writes out dirty buffers, then goes at busy buffers and calls free_page() when it's able to free all the buffers on a page.

Note that page directories and page tables along with RESERVED pages do not get swapped, stolen or aged. They are mapped in the process page directory through reserved page tables. They are freed only on exit from the process.

The page fault handlers

When a process is created via fork, it starts out with a page directory and a page or so of the executable. So the page fault handler is the source of most of a process's memory.

The page fault handler do_page_fault() retrieves the faulting address from the register cr2. The error code (retrieved in sys_call.S) differentiates user/supervisor access and the reason for the fault--write protection or a missing page. The former is handled by do_wp_page() and the latter by do_no_page().

If the faulting address is greater than TASK_SIZE the process receives a SIGKILL. [Why this check? This can only happen in kernel mode because of segment level protection.]

These routines have some subtleties as they can get called from an interrupt. You can't assume that it is the ``current'' task that is executing.

do_no_page() handles three possible situations:

1. The page is swapped.
2. The page belongs to the executable or a shared library.
3. The page is missing--a data page that has not been allocated.

In all cases get_empty_pgtable() is called first to ensure the existence of a page table that covers the faulting address. In case 3 get_empty_page() is called to provide a page at the required address and in case of the swapped page, swap_in() is called.

In case 2, the handler calls share_page() to see if the page is shareable with some other process. If that fails it reads in the page from the executable or library (It repeats the call to share_page() in case another process did the same meanwhile). Any portion of the page beyond the brk value is zeroed.

A page read in from the disk is counted as a major fault (maj_flt). This happens with a swap_in() or when it is read from the executable or a library. Other cases are deemed minor faults (min_flt).

When a shareable page is found, it is write-protected. A process that writes to a shared page will then have to go through do_wp_page() which does the copy-on-write.

do_wp_page() does the following:

- Send SIGSEGV if any user process is writing to current code_space.
- If the old page is not shared then just unprotect it. Else get_free_page() and copy_page(). The page acquires the dirty flag from the old page. Decrement the map count of the old page.

Paging

Paging is swapping on a page basis rather than by entire processes. We will use swapping here to refer to paging, since Linux only pages, and does not swap, and people are more used to the word ``swap'' than ``page.'' Kernel pages are never swapped. Clean pages are also not written to swap. They are freed and reloaded when required. The swapper maintains a single bit of aging info in the PAGE_ACCESSED bit of the page table entries. [What are the maintenance details? How is it used?]

Linux supports multiple swap files or devices which may be turned on or off by the swapon and swapoff system calls. Each swapfile or device is described by a struct swap_info_struct (swap.c).

    static struct swap_info_struct {
            unsigned long flags;
            struct inode * swap_file;
            unsigned int swap_device;
            unsigned char * swap_map;
            char * swap_lockmap;
            int lowest_bit;
            int highest_bit;
    } swap_info[MAX_SWAPFILES];

The flags field (SWP_USED or SWP_WRITEOK) is used to control access to the swap files. When SWP_WRITEOK is off space will not be allocated in that file. This is used by swapoff when it tries to unuse a file.

When swapon adds a new swap file it sets SWP_USED. A static variable nr_swapfiles stores the number of currently active swap files. The fields lowest_bit and highest_bit bound the free region in the swap file and are used to speed up the search for free swap space.

The user program mkswap initializes a swap device or file. The first page contains a signature (`SWAP-SPACE') in the last 10 bytes, and holds a bitmap. Initially 0's in the bitmap signal bad pages. A `1' in the bitmap means the corresponding page is free. This page is never allocated so the initialization needs to be done just once.

The syscall swapon() is called by the user program swapon typically from /etc/rc. A couple of pages of memory are allocated for swap_map and swap_lockmap.

swap_map holds a byte for each page in the swapfile. It is initialized from the bitmap to contain a 0 for available pages and 128 for unusable pages. It is used to maintain a count of swap requests on each page in the swap file. swap_lockmap holds a bit for each page that is used to ensure mutual exclusion when reading or writing swap files.

When a page of memory is to be swapped out an index to the swap location is obtained by a call to get_swap_page(). This index is then stored in bits 1-31 of the page table entry so the swapped page may be located by the page fault handler, do_no_page() when needed.

The upper 7 bits of the index give the swapfile (or device) and the lower 24 bits give the page number on that device. That makes as many as 128 swapfiles, each with room for about 64 GB, but the space overhead due to the swap_map would be large. Instead the swapfile size is limited to 16 MB, because the swap_map then takes 1 page.

The function swap_duplicate() is used by copy_page_tables() to let a child process inherit swapped pages during a fork. It just increments the count maintained in swap_map for that page. Each process will swap in a separate copy of the page when it accesses it.

swap_free() decrements the count maintained in swap_map. When the count drops to 0 the page can be reallocated by get_swap_page(). swap_free() is called each time a swapped page is read into memory (swap_in()) or when a page is to be discarded (free_one_table(), etc.).

Copyright (C) 1992, 1993, 1996 Michael K. Johnson, johnsonm@redhat.com.
Copyright (C) 1992, 1993 Krishna Balasubramanian and Douglas Johnson

80386 Memory Management

A logical address specified in an instruction is first translated to a linear address by the segmenting hardware. This linear address is then translated to a physical address by the paging unit.

Paging on the 386

There are two levels of indirection in address translation by the paging unit. A page directory contains pointers to 1024 page tables. Each page table contains pointers to 1024 pages. The register CR3 contains the physical base address of the page directory and is stored as part of the TSS in the task_struct and is therefore loaded on each task switch.

A 32-bit Linear address is divided as follows:

31 ...... 22   21 ...... 12   11 ...... 0
DIR            TABLE          OFFSET

Physical address is then computed (in hardware) as:

CR3 + DIR            points to the table_base.
table_base + TABLE   points to the page_base.
physical_address =   page_base + OFFSET

Page directories (page tables) are page aligned so the lower 12 bits are used to store useful information about the page table (page) pointed to by the entry.

Format for Page directory and Page table entries:

31 ...... 12   11 .. 9   8   7   6   5   4   3   2     1     0
ADDRESS        OS        0   0   D   A   0   0   U/S   R/W   P

D     1 means page is dirty (undefined for page directory entry).
R/W   0 means readonly for user.
U/S   1 means user page.
P     1 means page is present in memory.
A     1 means page has been accessed (set to 0 by aging).
OS    bits can be used for LRU etc, and are defined by the OS.

The corresponding definitions for Linux are in .

When a page is swapped, bits 1-31 of the page table entry are used to mark where a page is stored in swap (bit 0 must be 0).
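As a worked illustration of the DIR/TABLE/OFFSET split and of the flag bits in a page directory or page table entry, here is a minimal sketch in C. The function and constant names are hypothetical and are not the kernel's own definitions; only the field widths and bit positions come from the layouts given in this section.

```c
#include <stdint.h>

/* Split a 32-bit linear address into its three fields (10 + 10 + 12 bits). */
uint32_t lin_dir(uint32_t lin)    { return lin >> 22; }            /* bits 31..22 */
uint32_t lin_table(uint32_t lin)  { return (lin >> 12) & 0x3ffu; } /* bits 21..12 */
uint32_t lin_offset(uint32_t lin) { return lin & 0xfffu; }         /* bits 11..0  */

/* Flag bits in the low 12 bits of an entry (hypothetical names). */
enum {
    PTE_P  = 0x001,   /* bit 0: present           */
    PTE_RW = 0x002,   /* bit 1: writable          */
    PTE_US = 0x004,   /* bit 2: user page         */
    PTE_A  = 0x020,   /* bit 5: accessed          */
    PTE_D  = 0x040    /* bit 6: dirty             */
};

/* The upper 20 bits of an entry hold the page-aligned physical base. */
uint32_t entry_base(uint32_t entry) { return entry & 0xfffff000u; }
```

For example, the kernel linear address 0xc0000001 falls in directory slot 768, table slot 0, at offset 1 within the page.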

Paging is enabled by setting the highest bit in CR0. [in head.S?] At each stage of the address translation access permissions are verified and pages not present in memory and protection violations result in page faults. The fault handler (in memory.c) then either brings in a new page or unwriteprotects a page or does whatever needs to be done.

Page Fault handling Information

- The register CR2 contains the linear address that caused the last page fault.
- Page Fault Error Code (16 bits):

    bit   cleared             set
    0     page not present    page level protection
    1     fault due to read   fault due to write
    2     supervisor mode     user mode

  The rest are undefined. These are extracted in sys_call.S.

The Translation Lookaside Buffer (TLB) is a hardware cache for physical addresses of the most recently used virtual addresses. When a virtual address is translated the 386 first looks in the TLB to see if the information it needs is available. If not, it has to make a couple of memory references to get at the page directory and then the page table before it can actually get at the page. Three physical memory references for address translation for every logical memory reference would kill the system, hence the TLB.

The TLB is flushed if CR3 loaded or by task switch that changes CR0. It is explicitly flushed in Linux by calling invalidate() which just reloads CR3.

Segments in the 80386

Segment registers are used in address translation to generate a linear address from a logical (virtual) address.

    linear_address = segment_base + logical_address

The linear address is then translated into a physical address by the paging hardware.

Each segment in the system is described by a 8 byte segment descriptor which contains all pertinent information (base, limit, type, privilege).

The segments are:

Regular segments
- code and data segments

System segments
- (TSS) task state segments
- (LDT) local descriptor tables

Characteristics of system segments

- System segments are task specific.
- There is a Task State Segment (TSS) associated with each task in the system. It contains the tss_struct (sched.h). The size of the segment is that of the tss_struct excluding the i387_union (232 bytes). It contains all the information necessary to restart the task.
- The LDT's contain regular segment descriptors that are private to a task. In Linux there is one LDT per task. There is room for 32 descriptors in the linux task_struct. The normal LDT generated by Linux has a size of 24 bytes, hence room for only 3 entries as above. Its contents are:
    LDT[0] Null (mandatory)
    LDT[1] user code segment descriptor.
    LDT[2] user data/stack segment descriptor.
- The user segments all have base=0x00 so that the linear address is the same as the logical address.

To keep track of all these segments, the 386 uses a global descriptor table (GDT) that is set up in memory by the system (located by the GDT register). The GDT contains a segment descriptor for each task state segment, each local descriptor table and also regular segments. The Linux GDT contains just two normal segment entries:

- GDT[0] is the null descriptor.
- GDT[1] is the kernel code segment descriptor.
- GDT[2] is the kernel data/stack segment descriptor.

The rest of the GDT is filled with TSS and LDT system descriptors:

- GDT[3] ???
- GDT[4] = TSS0, GDT[5] = LDT0,
- GDT[6] = TSS1, GDT[7] = LDT1
- ... etc ...

Note LDT[n] != LDTn

- LDT[n] = the nth descriptor in the LDT of the current task.
- LDTn = a descriptor in the GDT for the LDT of the nth task.

The kernel segments have base 0xc0000000 which is where the kernel lives in the linear view. Before a segment can be used, the contents of the descriptor for that segment must be loaded into the segment register. The 386 has a complex set of criteria regarding access to segments so you can't simply load a descriptor into a segment register. Also these segment registers have programmer invisible portions. The visible portion is what is usually called a segment register: cs, ds, es, fs, gs, and ss.

The programmer loads one of these registers with a 16-bit value called a selector. The selector uniquely identifies a segment descriptor in one of the tables. Access is validated and the corresponding descriptor loaded by the hardware.
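The GDT layout above implies simple arithmetic for locating the system descriptors of task n, and a selector is just a table index packed together with two small fields. The sketch below works this out in C. The helper names are hypothetical and are not the kernel's _TSS()/_LDT() macros; the starting slots 4 and 5 come from the list above, and the selector layout (index in bits 15..3, table indicator in bit 2, requested privilege level in bits 1..0) is the standard 386 format.

```c
#include <stdint.h>

#define GDT_FIRST_TSS 4u   /* GDT[4] = TSS0, per the layout above */
#define GDT_FIRST_LDT 5u   /* GDT[5] = LDT0, per the layout above */

/* Each task owns two consecutive GDT slots: its TSS, then its LDT. */
unsigned gdt_index_tss(unsigned n) { return GDT_FIRST_TSS + 2u * n; }
unsigned gdt_index_ldt(unsigned n) { return GDT_FIRST_LDT + 2u * n; }

/* Pack a selector: ti = 0 selects the GDT, ti = 1 the current LDT;
 * rpl is the privilege level (0 = kernel, 3 = user). */
uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    return (uint16_t)((index << 3) | ((ti & 1u) << 2) | (rpl & 3u));
}
```

Under these assumptions make_selector(1, 0, 0) gives 0x08 for the kernel code segment in GDT[1], and make_selector(2, 1, 3) gives 0x17 for the user data segment in LDT[2].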

Currently Linux largely ignores the (overly?) complex segment level protection afforded by the 386. It is biased towards the paging hardware and the associated page level protection. The segment level rules that apply to user processes are

1. A process cannot directly access the kernel data or code segments
2. There is always limit checking but given that every user segment goes from 0x00 to 0xc0000000 it is unlikely to apply.

[This has changed, and needs updating, please.]

Selectors in the 80386

A segment selector is loaded into a segment register (cs, ds, etc.) to select one of the regular segments in the system as the one addressed via that segment register.

Segment selector Format:

15 ...... 3   2    1 0
index         TI   RPL

TI   Table indicator:
       0 means selector indexes into GDT
       1 means selector indexes into LDT
RPL  Privilege level. Linux uses only two privilege levels.
       0 means kernel
       3 means user

Examples:

Kernel code segment
  TI=0, index=1, RPL=0, therefore selector = 0x08 (GDT[1])
User data segment
  TI=1, index=2, RPL=3, therefore selector = 0x17 (LDT[2])

Selectors used in Linux:

TI   index   RPL   selector   segment
0    1       0     0x08       kernel code         GDT[1]
0    2       0     0x10       kernel data/stack   GDT[2]
0    3       0     ???        ???                 GDT[3]
1    1       3     0x0F       user code           LDT[1]
1    2       3     0x17       user data/stack     LDT[2]

Selectors for system segments are not to be loaded directly into segment registers. Instead one must load the TR or LDTR.

On entry into syscall:

- ds and es are set to the kernel data segment (0x10)

- fs is set to the user data segment (0x17) and is used to access data pointed to by arguments to the system call.
- The stack segment and pointer are automatically set to ss0 and esp0 by the interrupt and the old values restored when the syscall returns.

Segment descriptors

There is a segment descriptor used to describe each segment in the system. There are regular descriptors and system descriptors. Here's a descriptor in all its glory. The strange format is essentially to maintain compatibility with the 286. Note that it takes 8 bytes.

bits     field
63-56    Base 31-24
55       G
54       D
53       R
52       U
51-48    Limit 19-16
47       P
46-45    DPL
44       S
43-40    TYPE
39-16    Segment Base 23-0
15-0     Segment Limit 15-0

Explanation:

R     reserved (0)
DPL   0 means kernel, 3 means user
G     1 means 4K granularity (Always set in Linux)
D     1 means default operand size 32bits
U     programmer definable
P     1 means present in physical memory
S     0 means system segment, 1 means normal code or data segment.
Type  There are many possibilities. Interpreted differently for system and normal descriptors.

Linux system descriptors:

TSS: P=1, DPL=0, S=0, type=9, limit = 231 room for 1 tss_struct.
LDT: P=1, DPL=0, S=0, type=2, limit = 23 room for 3 segment descriptors.

The base is set during fork(). There is a TSS and LDT for each task.

Linux regular kernel descriptors: (head.S)

code: P=1, DPL=0, S=1, G=1, D=1, type=a, base=0xc0000000, limit=0x3ffff
data: P=1, DPL=0, S=1, G=1, D=1, type=2, base=0xc0000000, limit=0x3ffff

The LDT for task[0] contains: (sched.h)

code: P=1, DPL=3, S=1, G=1, D=1, type=a, base=0xc0000000, limit=0x9f
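To make the descriptor layout concrete, the following sketch packs the individual fields into the 8-byte format and recovers the segment size from a packed value. Both helpers are hypothetical illustrations and are not taken from the kernel sources; the bit positions are those of the 386 descriptor format described above, and the sample values in the usage note are the ones listed for the kernel segments and for the task[0] LDT.

```c
#include <stdint.h>

/* Pack a regular or system segment descriptor into its 64-bit form. */
uint64_t make_descriptor(uint32_t base, uint32_t limit,
                         unsigned p, unsigned dpl, unsigned s, unsigned type,
                         unsigned g, unsigned d)
{
    uint64_t desc = 0;
    desc |= (uint64_t)(limit & 0xffffu);             /* limit 15..0  -> bits 15..0  */
    desc |= (uint64_t)(base & 0xffffffu) << 16;      /* base 23..0   -> bits 39..16 */
    desc |= (uint64_t)(type & 0xfu) << 40;           /* TYPE         -> bits 43..40 */
    desc |= (uint64_t)(s & 1u) << 44;                /* S            -> bit 44      */
    desc |= (uint64_t)(dpl & 3u) << 45;              /* DPL          -> bits 46..45 */
    desc |= (uint64_t)(p & 1u) << 47;                /* P            -> bit 47      */
    desc |= (uint64_t)((limit >> 16) & 0xfu) << 48;  /* limit 19..16 -> bits 51..48 */
    desc |= (uint64_t)(d & 1u) << 54;                /* D            -> bit 54      */
    desc |= (uint64_t)(g & 1u) << 55;                /* G            -> bit 55      */
    desc |= (uint64_t)(base >> 24) << 56;            /* base 31..24  -> bits 63..56 */
    return desc;
}

/* Segment size in bytes: (limit + 1) units of 4K if G=1, else of 1 byte. */
uint64_t descriptor_size_bytes(uint64_t desc)
{
    uint64_t limit = (desc & 0xffffu) | (((desc >> 48) & 0xfu) << 16);
    uint64_t units = limit + 1;
    return ((desc >> 55) & 1u) ? units << 12 : units;
}
```

Packing the kernel code values (base 0xc0000000, limit 0x3ffff, P=1, DPL=0, S=1, type=a, G=1, D=1) yields 0xc0c39a000000ffff, a 1 GB segment. The task[0] code entry with limit 0x9f works out to 640K, which agrees with the size quoted for task[0] earlier in this guide.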

Format of segment register: (Only the selector is programmer visible)

16-bit     32-bit               32-bit
selector   physical base addr   segment limit   attributes

The invisible portion of the segment register is more conveniently viewed in terms of the format used in the descriptor table entries that the programmer sets up. The descriptor tables have registers associated with them that are used to locate them in memory. The GDTR (and IDTR) are initialized at startup once the tables are defined. The LDTR is loaded on each task switch.

Format of GDTR (and IDTR):

32-bits            16-bits
Linear base addr   table limit

The TR and LDTR are loaded from the GDT and so have the format of the other segment registers. The task register (TR) contains the descriptor for the currently executing task's TSS. The execution of a jump to a TSS selector causes the state to be saved in the old TSS, the TR is loaded with the new descriptor and the registers are restored from the new TSS. This is the process used by schedule to switch to various user tasks. Note that the field tss_struct.ldt contains a selector for the LDT of that task. It is used to load the LDTR. (sched.h)

Macros used in setting up descriptors

Some assembler macros are defined in sched.h and system.h to ease access and setting of descriptors. Each TSS entry and LDT entry takes 8 bytes.

Manipulating GDT system descriptors:

_TSS(n), _LDT(n)
    These provide the index into the GDT for the n'th task. _LDT(n) is stored in the ldt field of the tss_struct by fork.

_set_tssldt_desc(n, addr, limit, type)
    ulong *n points to the GDT entry to set (see fork.c). The segment base (TSS or LDT) is set to 0xc0000000 + addr. Specific instances of the above are, where ltype refers to the byte containing P, DPL, S and type:

    set_ldt_desc(n, addr) ltype = 0x82
        P=1, DPL=0, S=0, type=2 means LDT entry. limit = 23 => room for 3 segment descriptors.
    set_tss_desc(n, addr) ltype = 0x89
        P=1, DPL=0, S=0, type = 9, means available 80386 TSS. limit = 231 room for 1 tss_struct.

load_TR(n), load_ldt(n)
    load descriptors for task number n into the task register and ldt register.

ulong get_base (struct desc_struct ldt)
    gets the base from a descriptor.

ulong get_limit (ulong segment)
    gets the limit (size) from a segment selector. Returns the size of the segment in bytes.

set_base(struct desc_struct ldt, ulong base), set_limit(struct desc_struct ldt, ulong limit)
    Will set the base and limit for descriptors (4K granular segments). The limit here is actually the size in bytes of the segment.

_set_seg_desc(gate_addr, type, dpl, base, limit)
    Default values 0x00408000 => D=1, P=1, G=0. Present, operation size is 32 bit and max size is 1M. gate_addr must be a (ulong *)

Copyright (C) 1992, 1993, 1996 Michael K. Johnson, johnsonm@redhat.com.
Copyright (C) 1992, 1993 Krishna Balasubramanian and Douglas Johnson

Messages

1. paging initialization, doc update by droux@cs.unm.edu
2. User Code and Data Segment no longer in LDT. by Lennart Benschop

paging initialization, doc update

Forum: 80386 Memory Management
Keywords: paging, initialization, x86, CR0
Date: Mon, 03 Jun 1996 20:46:15 GMT
From: <droux@cs.unm.edu>

From the x86 memory management doc:

> Paging is enabled by setting the highest bit in CR0.
> [in head.S?]

This is correct, the editor note can be suppressed. CR0 initialisation for paging is performed by "setup_paging" which is implemented in (1.2.13) arch/i386/kernel/head.S

User Code and Data Segment no longer in LDT.

Forum: 80386 Memory Management
Date: Tue, 23 Jul 1996 09:39:45 GMT
From: Lennart Benschop <benschop@eb.ele.tue.nl>

The user code and data segments of a process are no longer in the LDT, but in the GDT instead. The code and data segment of each process starts at linear address 0 anyway, only the physical address is different (different page directory =CR3)

Processes still have an LDT, this can be used by certain applications (WINE).

In very early versions of Linux, user space was restricted to 64 MB and there were a maximum of 64 processes (including process 0, which had the kernel in its user space), making a total of 4GB. There was only one page directory. Back then each process had a different linear address, and there were per-process code and data segments, included in the LDT. This (somewhat elegant) scheme was abandoned to allow more than 64 processes and a per process virtual address space of more than 64MB. That's why certain kernels had their user code and data segments in the LDT, though they were in fact the same segments for all processes.

How System Calls Work on Linux/i86

This section covers first the mechanisms provided by the 386 for handling system calls, and then shows how Linux uses those mechanisms. This is not a reference to the individual system calls: There are very many of them, new ones are added occasionally, and they are documented in man pages that should be on your Linux system.

What Does the 386 Provide?

The 386 recognizes two event classes: exceptions and interrupts. Both cause a forced context switch to a new procedure or task. Interrupts can occur at unexpected times during the execution of a program and are used to respond to signals from hardware. Exceptions are caused by the execution of instructions.

Two sources of interrupts are recognized by the 386: Maskable interrupts and Nonmaskable interrupts. Two sources of exceptions are recognized by the 386: Processor detected exceptions and programmed exceptions.

Each interrupt or exception has a number, which is referred to by the 386 literature as the vector. The NMI interrupt and the processor detected exceptions have been assigned vectors in the range 0 through 31, inclusive. The vectors for maskable interrupts are determined by the hardware. External interrupt controllers put the vector on the bus during the interrupt-acknowledge cycle. Any vector in the range 32 through 255, inclusive, can be used for maskable interrupts or programmed exceptions. Here is a listing of all the possible interrupts and exceptions:

0        divide error
1        debug exception
2        NMI interrupt
3        Breakpoint
4        INTO-detected Overflow
5        BOUND range exceeded
6        Invalid opcode
7        coprocessor not available
8        double fault
9        coprocessor segment overrun
10       invalid task state segment
11       segment not present
12       stack fault
13       general protection
14       page fault

15       reserved
16       coprocessor error
17-31    reserved
32-255   maskable interrupts

The priority of simultaneous interrupts and exceptions is:

HIGHEST  Faults except debug faults
.        Trap instructions INTO, INT n, INT 3
.        Debug traps for this instruction
.        Debug traps for next instruction
.        NMI interrupt
LOWEST   INTR interrupt

How Linux Uses Interrupts and Exceptions

Under Linux the execution of a system call is invoked by a maskable interrupt or exception class transfer, caused by the instruction int 0x80. We use vector 0x80 to transfer control to the kernel. This interrupt vector is initialized during system startup, along with other important vectors like the system clock vector.

iBCS2 requires an lcall 0,7 instruction, which Linux can send to the iBCS2 compatibility module appropriate if an iBCS2-compliant binary is being executed. In fact, Linux will assume that an iBCS2-compliant binary is being executed if an lcall 0,7 call is executed, and will automatically switch modes.

As of version 0.99.2 of Linux, there are 116 system calls. Documentation for these can be found in the man (2) pages. When a user invokes a system call, execution flow is as follows:

- Each call is vectored through a stub in libc. Each call within the libc library is generally a syscallX() macro, where X is the number of parameters used by the actual routine. Some system calls are more complex than others because of variable length argument lists, but even these complex system calls must use the same entry point: they just have more parameter setup overhead. Examples of a complex system call include open() and ioctl().

- Each syscall macro expands to an assembly routine which sets up the calling stack frame and calls _system_call() through an interrupt, via the instruction int $0x80

  For example, the setuid system call is coded as

      _syscall1(int,setuid,uid_t,uid);

  which will expand to:

      _setuid:
          subl $4,%esp
          pushl %ebx
          movzwl 12(%esp),%eax
          movl %eax,4(%esp)
          movl $23,%eax
          movl 4(%esp),%ebx
          int $0x80
          movl %eax,%edx
          testl %edx,%edx
          jge L2
          negl %edx
          movl %edx,_errno
          movl $-1,%eax
          popl %ebx
          addl $4,%esp
          ret
      L2:
          movl %edx,%eax
          popl %ebx
          addl $4,%esp
          ret

  The macro definition for the syscallX() macros can be found in /usr/include/linux/unistd.h.

- Not until the int $0x80 is executed does the call transfer to the kernel entry point _system_call(). It is responsible for saving all registers, checking to make sure a valid system call was invoked and then ultimately transferring control to the actual system call code via the offsets in the _sys_call_table. It is also responsible for calling _ret_from_sys_call() when the system call has been completed, but before returning to user space.

  Actual code for system_call entry point can be found in /usr/src/linux/kernel/sys_call.S Actual code for many of the system calls can be found in /usr/src/linux/kernel/sys.c, and the rest are found elsewhere. find is your friend.

- After the system call has executed, _ret_from_sys_call() is called. It checks to see if the scheduler should be run, and if so, calls it.

- Upon return from the system call, the syscallX() macro code checks for a negative return value, and if there is one, puts a positive copy of the return value in the global variable _errno, so that it can be accessed by code like perror().

How Linux Initializes the system call vectors

The startup_32() code found in /usr/src/linux/boot/head.S starts everything off by calling setup_idt(). This routine sets up an IDT (Interrupt Descriptor Table) with 256 entries. No interrupt entry points are actually loaded by this routine, as that is done only after paging has been enabled and the kernel has been moved to 0xC0000000. An IDT has 256 entries, each 8 bytes long, for a total of 2048 bytes.

When start_kernel() (found in /usr/src/linux/init/main.c) is called it invokes trap_init() (found in /usr/src/linux/kernel/traps.c).
and the user-space system call library code can be found in /usr/src/libc/syscall/ At this point no system code for the call has been executed. This entry point is the same for all system calls.4(%esp) movl $23. trap_init() sets up the IDT via the macro http://ldp. After the system call has executed.%esp ret L2: movl %edx.

html (4 di 5) [08/03/2001 10. How to Add Your Own System Calls 1. 5.c). A call to set_system_gate (0x80. See fs/Makefile. target fs. Create a directory under the /usr/src/linux/ directory to hold your code. Add the relocatable module produced by the link of your new kernel code to the ARCHIVES and the subdirectory to the SUBDIRS lines of the top level Makefile. trap_init() initializes the interrupt descriptor table as shown here: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18-48 divide_error debug nmi int3 overflow bounds invalid_op device_not_available double_fault coprocessor_segment_overrun invalid_TSS segment_not_present stack_segment general_protection page_fault reserved coprocessor_error alignment_check reserved At this point the interrupt vector for the system calls is not set up. &system_call) sets interrupt 0x80 to be a vector to the system_call() entry point. It is initialized by sched_init() (found in /usr/src/linux/kernel/sched. where xx.h).10] .11. Add an entry point for your system call to the sys_call_table in sys.o for an example. 3. It should match the index (xx) that you assigned in the previous step. the index. 4. http://ldp.h to assign a call number for your system call.h. 2. Add a #define __NR_xx to unistd. The NR_syscalls variable will be recalculated automatically.How System Calls Work on Linux/i86 set_trap_gate() (found in /usr/include/asm/system. It will be used to set up the vector through sys_call_table to invoke you code.it/LDP/khg/HyperNews/get/syscall/syscall86. is something descriptive relating to your system call. Put any include files in /usr/include/sys/ and /usr/include/linux/.iol.

3.iol. See the Annotated Bibliography. Johnson the solution to the problem by Vijay Gupta http://ldp.. etc.html (5 di 5) [08/03/2001 10. Run make from the top level to produce the new kernel incorporating your new code. would be nice to explain syscall macros by Tim Bird wrong file for syscallX() macro by Tim Bird the directory /usr/src/libc/syscall/ by vijay gupta 1. to take into account the environment needed to support your new code. johnsonm@redhat. Johnson. or use the proper _syscalln() macro in your user program for your programs to access the new system call. -> . as is James Turley's Advanced 80386 Programming Techniques.10] .How System Calls Work on Linux/i86 6.. 1.com.11.it/LDP/khg/HyperNews/get/syscall/syscall86. you will have to either add a syscall to your libraries. 1996 Michael K. At this point. 2. Copyright (C) 1993 Stanley Scalsky Messages wrong file for system_call code by Tim Bird 4. Copyright (C) 1993. The 386DX Microprocessor Programmer's Reference Manual is a helpful reference. Modify any kernel code in kernel/fs/mm/. by Michael K. 7.no longer exists.

Please replace K&R reference by Harbison/Steele
Forum: Annotated Bibliography
Keywords: C, K&R replacement, textbook, reference manual
Date: Sun, 19 May 1996 12:06:58 GMT
From: Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>

I suggest that you replace the K&R reference by the following much better and more up-to-date one. I have never touched my K&R again since I bought the following book.

C - A Reference Manual, fourth edition
Author: Samuel P. Harbison and Guy L. Steele Jr.
Publisher: Prentice Hall, 1995
ISBN: 0-13-326232-4 (hard) 0-13-326224-3 (paper)
Pages: 455
Price: ??.?? USD, 73.80 DEM

This book is an authoritative reference manual that provides a complete and precise description of the C language and the run-time library. It also teaches a C programming style that emphasizes correctness, portability, and maintainability. If you program in C, you want to have this book on your desk, even if you are already a C expert. The authors have been members of the ANSI/ISO C standards committee.

The Harbison/Steele has by now taken over the role of being the C bible from the traditional C book by Kernighan/Ritchie. In contrast to K&R, the Harbison/Steele covers the full ISO C standard, including the 1994 extensions. It also covers the old K&R C language as well as C++ compatibility issues. Especially the description of the standard C library, which every C programmer needs for daily reference, is considerably more complete and precise than the one found in appendix B of K&R.

If you don't want to throw K&R out, please add at least this reference and add to the K&R review that this old book does not cover the full ISO C run-time library (e.g. the wide character and locale support is missing almost completely) nor the 1994 C language extensions.

Messages
1. Replace, no; supplement, yes by Michael K. Johnson
   -> Right you are Mike! by rohit patil

Replace, no; supplement, yes
Forum: Annotated Bibliography
Re: Please replace K&R reference by Harbison/Steele (Markus Kuhn)
Keywords: C, textbook, reference manual
Date: Sun, 19 May 1996 15:27:40 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

I agree that H&S should be included, but not that K&R should be excluded. I happen to like K&R and find it easy to read and look things up in. I occasionally supplement it with an annotated (sometimes poorly, IMHO) copy of the ANSI standard published by Osborne/McGraw Hill, ISBN 0-07-881952-90. Even if (like me) you ignore the annotation, it's still cheaper than an official copy of the standard, or was last time I checked.

Since this is the kernel hackers' guide, and since the kernel doesn't use the run-time library, the fact that Harbison and Steele document the run-time library fully is rather irrelevant for the purposes of this document, even though it is relevant to C programmers in general.

I should probably add it to the bibliography, along with a pointer to the GNU C documentation, since the linux kernel does use a few GNU C extensions.

Messages
1. Right you are Mike! by rohit patil

Right you are Mike!
Forum: Annotated Bibliography
Re: Please replace K&R reference by Harbison/Steele (Markus Kuhn)
Re: Replace, no; supplement, yes (Michael K. Johnson)
Keywords: C, textbook, reference manual
Date: Thu, 02 Jan 1997 04:15:15 GMT
From: rohit patil <rohit@techie.com>

Yup! Can't replace K&R. Supplement, maybe :)


Very unfortunate
Forum: Annotated Bibliography
Re: 80386 book is apparently out of print now (Austin Donnelly)
Keywords: 80386 programming, out of print
Date: Sun, 26 May 1996 17:45:43 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

Osborne McGraw-Hill may be bringing the book in and out of print. When I got my copy in 1992, it was out of print but still available in bookstores, so if it just recently went out of print, they probably do infrequent printings and enough noise from potential readers may be sufficient to convince them to bring the book out of retirement.

Write to Osborne McGraw-Hill and let them know you want to buy a copy. Bookstores don't tell them when one person goes in and it's out of print, but publishers do sometimes listen when plenty of potential readers write to them. Their address is (or was when the book was published...)

    Osborne McGraw-Hill
    2600 Tenth Street
    Berkeley, California 94710
    U.S.A.

Good luck!

Linux Kernel Internals-> Kernel MM IPC fs drivers net modules
Forum: Annotated Bibliography
Keywords: Must have
Date: Tue, 15 Oct 1996 15:43:14 GMT
From: Alex Stewart <Alex@auriga.rose.brandeis.edu>

Linux Kernel Internals
ISBN 0-201-87741-4
Addison Wesley Longman 1996
Beck, Boehme, Dziadzka(!), Kunitz, Magnus, Verworner
$45 at Borders

A guide to the kernel code. Based on 1.2.13 and 1.3.x. Surprisingly easy to read given the subject matter, and indispensable to someone like me, who has to make hardware work with Linux, but is unfamiliar with OS details outside of windows/dos. This book talks you through the timer and scheduler code and fast and slow H/W interrupt handlers for instance, so I can see what things like get/setitimer will do for me.

wrong file for system_call code
Forum: How System Calls Work on Linux/i86
Keywords: syscall assembly error
Date: Fri, 13 Sep 1996 01:44:29 GMT
From: Tim Bird <tbird@caldera.com>

This page contains the sentence "Actual code for system_call entry point can be found in /usr/src/linux/kernel/sys_call.S"

This should read: Actual code for the system_call entry point (for the intel architecture) can be found in /usr/src/linux/arch/i386/kernel/entry.S

would be nice to explain syscall macros
Forum: How System Calls Work on Linux/i86
Keywords: syscall
Date: Fri, 13 Sep 1996 01:37:11 GMT
From: Tim Bird <tbird@caldera.com>

The syscall macros are a little dense to decipher. It took me a while to determine how the macro syscall1(int,setuid,uid_t,uid) expanded into the assembly code shown. It might be nice to show the macro, and explain a little about how it gets expanded.

Here is the source for the _syscall1 macro

    #define _syscall1(type,name,type1,arg1) \
    type name(type1 arg1) \
    { \
    long __res; \
    __asm__ volatile ("int $0x80" \
            : "=a" (__res) \
            : "0" (__NR_##name),"b" ((long)(arg1))); \
    if (__res >= 0) \
            return (type) __res; \
    errno = -__res; \
    return -1; \
    }

When expanded, this becomes the code

    int setuid(uid_t uid)
    {
    long __res;
    __asm__ volatile ("int $0x80"
            : "=a" (__res)
            : "0" (__NR_setuid), "b" ((long)(uid)));
    if (__res >= 0)
            return (int) __res;
    errno = -__res;
    return -1;
    }

It's pretty easy to see how the cleanup code converts into assembly, but the setup code eluded me until I figured out the following:

    "=a" (__res)          means the result comes back in %eax
    "0" (__NR_setuid)     means put the system call number into %eax on entry
    "b" ((long)(uid))     means put the first argument into %ebx on entry

syscallX macros that use additional parameters use %ecx, %edx, %esi, and %edi to hold additional values passed through the call.

wrong file for syscallX() macro
Forum: How System Calls Work on Linux/i86
Date: Fri, 13 Sep 1996 01:25:44 GMT
From: Tim Bird <tbird@caldera.com>

This page contains the sentence: "The macro definition for the syscallX() macros can be found in /usr/include/linux/unistd.h"

Actually, this file contains the system call numbers. The macros for system call generation are located in the file /usr/include/asm/unistd.h

the directory /usr/src/libc/syscall/
Forum: How System Calls Work on Linux/i86
Keywords: system call
Date: Sat, 18 May 1996 03:30:26 GMT
From: vijay gupta <vijay@crhc.uiuc.edu>

Hi,

The directory /usr/src/libc/syscall/ is essential for hacking the libc code in order to add a new system call. Unfortunately, I have been unable to find this directory from libc-5.3.9.tar.gz from ftp://sunsite.unc.edu/pub/Linux/GCC/ or from libc-4.X.

Does anyone have any ideas on this ?

Thank you very much,
Vijay Gupta (Email : vijay@crhc.uiuc.edu)

Messages
1. ...no longer exists. by Michael K. Johnson
   -> the solution to the problem by Vijay Gupta

...no longer exists.
Forum: How System Calls Work on Linux/i86
Re: the directory /usr/src/libc/syscall/ (vijay gupta)
Keywords: system call libc
Date: Sat, 18 May 1996 12:52:48 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

The /usr/src/libc/syscall/ directory no longer exists. Instead, the files you need to modify are in the libc/sysdeps/linux/ directory. If your system call only works on one architecture, then you need to use the architecture-dependent subdirectories i386 and m68k (at present; that will soon expand to at least sparc, and maybe other platforms).

If you need to know more, please ask more...

Messages
1. the solution to the problem by Vijay Gupta

the solution to the problem
Forum: How System Calls Work on Linux/i86
Re: the directory /usr/src/libc/syscall/ (vijay gupta)
Re: ...no longer exists. (Michael K. Johnson)
Keywords: system call libc
Date: Tue, 21 May 1996 23:05:39 GMT
From: Vijay Gupta <vijay@crhc.uiuc.edu>

Hi everybody,

Thanks to the two people who replied to me on this. The solution to the problem is as follows : the khg seems to be wrong in assuming there was a directory syscall in the C library. Instead, there is a directory sysdeps/linux, which contains, among others, socketpair.c, which defines the function

    int socketpair(int family, int type, int protocol, int sockvec[2])
    {
      unsigned long args[4];
      args[0] = family;
      args[1] = type;
      args[2] = protocol;
      args[3] = (unsigned long)sockvec;
      return socketcall(SYS_SOCKETPAIR, args);
    }

If you look at /usr/src/linux/net/socket.c, you will find a good match with that code. The socketcall function then is not defined by a C macro, but by an assembler macro in __socketcall.S:

    SYSCALL__ (socketcall, 2)
            ret

Please note that the socket system calls are special because of that level of indirection. The wait(2) function is declared as

    #ifdef __SVR4_I386_ABI_L1__
    #define wait4 __wait4
    #else
    static inline
    _syscall4(__pid_t,wait4,__pid_t,pid,__WAIT_STATUS_DEFN,status,int,options,struct rusage *,ru)
    #endif

    __pid_t
    __wait(__WAIT_STATUS_DEFN wait_stat)
    {
      return wait4(WAIT_ANY, wait_stat, 0, NULL);
    }

(so it is actually wait(3) in Linux, with wait4(2) being the system call).

------------------------
Thanks again,
Vijay

Other Sources of Information

Other sources specifically about writing device drivers.

The Annotated Bibliography mentions plenty of books out that don't have ``Linux'' in the title which may be useful to Linux programmers. Especially if you are new to kernel programming, you may do well to pick up one of the textbooks recommended in the bibliography.

Randy Bentson recently wrote an interesting book called Inside Linux, ISBN 0-916151-89-1, published by Specialized System Consultants. It has some information on basic operating system theory, some that is specifically related to Linux, and occasional parts that aren't really related to Linux at all (such as a discussion of the Georgia Tech shell).

Inline Assembly with DJGPP really applies to any version of GCC on a 386, and some of it is generic GCC inline assembly. Definitely required reading for anyone who wants to do inline assembly with Linux and GCC.

Copyright (C) 1996 Michael K. Johnson, johnsonm@redhat.com.

Messages
8. Linux Programmers Guide (LPG) by Federico Lucifredi
7. TTY documentation by Michael De La Rue
   1. In the queue... by Michael K. Johnson
      1. TTY documentation by Eugene Kanter
   2. Untitled by Yusuf Motiwala
5. The vger linux mail list archives by Drew Puch
4. German book on Linux Kernel Programming by Jochen Hein
   1. English version of Linux Kernel Internals by Naoshad Eduljee
      -> Book Review? by Josh Du"Bois
      -> Thumbed through it by Brian J. Murrell
3. Multi-architecture support by Michael K. Johnson
   1. Linux Architecture-Specific Kernel Interfaces by Drew Puch
2. Analysis of the Ext2fs structure by Michael K. Johnson
1. To add more sources of information: by Michael K. Johnson

Linux Programmers Guide (LPG)
Forum: Other Sources of Information
Keywords: Linux Programming
Date: Fri, 18 Apr 1997 04:58:44 GMT
From: Federico Lucifredi <lucifred@cs.bc.edu>

Does anybody know what happened to the LPG? It doesn't seem to have been updated beyond v 0.4 (3.95)

TTY documentation

Messages
1. In the queue... by Michael K. Johnson
   1. TTY documentation by Eugene Kanter
2. Untitled by Yusuf Motiwala

In the queue...
Forum: Other Sources of Information
Re: TTY documentation (Michael De La Rue)
Keywords: TTY TeX Documentation Kernel Device Driver
Date: Wed, 31 Jul 1996 15:47:36 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

The authors have agreed to have it included in the KHG. I have a copy of the article, and when I have time to set it in HTML, it will be added. Thanks much!

Messages
1. TTY documentation by Eugene Kanter

TTY documentation
Forum: Other Sources of Information
Re: TTY documentation (Michael De La Rue)
Re: In the queue... (Michael K. Johnson)
Keywords: TTY TeX Documentation Kernel Device Driver
Date: Wed, 23 Oct 1996 16:34:05 GMT
From: Eugene Kanter <eugene.kanter@ab.com>

May I have at least a plain text version of the TTY document? Thanks.

Untitled
Forum: Other Sources of Information
Re: TTY documentation (Michael De La Rue)
Keywords: TTY TeX Documentation Kernel Device Driver
Date: Sat, 29 Mar 1997 07:24:16 GMT
From: Yusuf Motiwala <yusuf@scientist.com>

Can you please mail me the tty documentation?

Regards,
Yusuf
ymotiwala@hss.hns.com

The vger linux mail list archives
Forum: Other Sources of Information
Keywords: mailing list
Date: Mon, 27 May 1996 17:51:08 GMT
From: Drew Puch <aapuch@eos.ncsu.edu>

Good place to see if the question you are about to ask is already answered:

vger mail list for linux topics

German book on Linux Kernel Programming
Forum: Other Sources of Information
Keywords: German book in Linux Kernel Hacking
Date: Mon, 27 May 1996 13:19:59 GMT
From: Jochen Hein <jochen.hein@informatik.tu-clausthal.de>

There is a German book on "Linux-Kernel-Programmierung":

Algorithmen und Strukturen der Version 1.2
Michael Beck, Harald Böhme, Mirko Dziadzka, Ulrich Kunitz, Robert Magnus, Dirk Verworner
Addison-Wesley, Germany, ISBN 3-89319-939-x, 500 pages, DEM 79.90

Covers Release 1.2 in detail. I don't know if there's an English translation.

Messages
1. English version of Linux Kernel Internals by Naoshad Eduljee
   -> Book Review? by Josh Du"Bois
   -> Thumbed through it by Brian J. Murrell

English version of Linux Kernel Internals
Forum: Other Sources of Information
Re: German book on Linux Kernel Programming (Jochen Hein)
Keywords: German book in Linux Kernel Hacking
Date: Sun, 02 Jun 1996 11:05:31 GMT
From: Naoshad Eduljee <naoshad@pacific.net.sg>

The English version of the German book on Linux Kernel Programming is published by Addison Wesley. Here is the text of the mail I received from Addison Wesley when I enquired about the book :

"LINUX Kernal Internals" is priced at $38.68 but will not be available until early June 1996.

Ordering information: The Book Express will gladly ship your order to any international location. Orders can be prepaid by a valid credit card or a check drawn on a US bank. Orders are shipped to international locations via Air Printed Matter Registered with an estimated delivery time of eight business days from our warehouse in Indiana, USA. Charges for this service are $15.00 for the first book, $8.00 for each additional book on the order.

You may order by mail:

    Addison-Wesley Book Express
    One Jacob Way
    Reading, MA 01867

by phone within the US: 1-800-824-7799
outside the US: 1-617-944-7273 extension 2188
or by fax: 1-617-944-7273

When ordering by fax, please include the title or book number, quantity of each book, credit card number and expiration date, as well as the appropriate shipping address. Please do not send credit card information via the internet; use the fax number listed above for prompt service.

If you need further ordering assistance or title information, please let us know.

Sincerely,
ADDISON-WESLEY BOOK EXPRESS

Messages
1. Book Review? by Josh Du"Bois
   -> Thumbed through it by Brian J. Murrell

Book Review?
Forum: Other Sources of Information
Re: German book on Linux Kernel Programming (Jochen Hein)
Re: English version of Linux Kernel Internals (Naoshad Eduljee)
Keywords: German book in Linux Kernel Hacking
Date: Fri, 12 Jul 1996 23:50:34 GMT
From: Josh Du"Bois <duboisj@is.com>

Has anyone read the English version of this book? I'd love to get my hands on a good linux kernel-hacking guide. If anyone has read this and has comments please post them here or email me at duboisj@is.com. If I don't hear that it's worthless, or if it takes a while for anyone to respond, I'll try and pick up a copy and read it myself/post a review here.

Naoshad Eduljee - thanks for the tip,
Josh.
------------
duboisj@is.com

Messages
1. Thumbed through it by Brian J. Murrell

Thumbed through it
Forum: Other Sources of Information
Re: German book on Linux Kernel Programming (Jochen Hein)
Re: English version of Linux Kernel Internals (Naoshad Eduljee)
Re: Book Review? (Josh Du"Bois)
Keywords: German book in Linux Kernel Hacking
Date: Thu, 16 Jan 1997 05:18:02 GMT
From: Brian J. Murrell <brian@ilinx.com>

I thumbed through it today at the bookstore. I was particularly interested in how a driver uses a handle in the proc filesystem to write information to a process willing to read, like kmsg does. In my thumbing I did not really get my answer.

The book looked decent but what really disappointed me was that despite its being a 1996 release, it only covers version 1.2 kernels. I now realize that this is because it was a translation from another book. :-(

I really would like to see an updated version of this book! It would definitely be on my bookshelf if it got updated.

b.

Multi-architecture support
Forum: Other Sources of Information
Date: Thu, 23 May 1996 15:45:37 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

Michael Hohmuth, of TU Dresden, wrote a new document on Linux's multiple architecture support. A PostScript version is also available.

Messages
1. Linux Architecture-Specific Kernel Interfaces by Drew Puch

Linux Architecture-Specific Kernel Interfaces
Forum: Other Sources of Information
Re: Multi-architecture support (Michael K. Johnson)
Keywords: instructions kernel header file intro
Date: Mon, 27 May 1996 17:38:34 GMT
From: Drew Puch <aapuch@eos.ncsu.edu>

Here is some kernel info, organized by header file. Thanks go out to Michael Hohmuth of TU Dresden, Dept. of Computer Science, OS Group.

The document corresponds to Linux 1.3.78++

Analysis of the Ext2fs structure
Forum: Other Sources of Information
Date: Sat, 18 May 1996 00:45:42 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

There's not much available on filesystems yet, but Analysis of the Ext2fs structure, by Louis-Dominique Dubeau, is worth visiting.

To add more sources of information:
Forum: Other Sources of Information
Keywords: instructions
Date: Wed, 15 May 1996 15:53:28 GMT
From: Michael K. Johnson <johnsonm@redhat.com>

In order to add another source to this page, simply respond to the page and mention the source. You can, if you like, simply type in the URL as your response--just click on the Respond button, enter the title of the web page you are connecting to in the Title box, click on the URL radiobutton for the format, and then type the URL into the large text window entitled Enter your response here:. That is all that is required. Just click the Preview Response button, and then if it looks right, submit it by clicking on the Post your Response button.

If you want to be notified of further changes made to this page, you can subscribe to it. Subscribing makes you a member, with special privileges, and also puts you on a mailing list. Click on the Membership item at the bottom. Members can also edit their posts if they want to make changes later. Also, the more members there are, the more motivated I will be to maintain this new version of the KHG... :-)

If you aren't subscribed, you should probably leave your name and email address, and possibly home page if you have one.

At some point, the URL may be moved from the response list into the body of the article. If that sentence didn't make sense to you, you can safely ignore it.

Thank you for your help!

michaelkjohnson

Loading shared objects - How?
Forum: The Linux Kernel Hackers' Guide
Date: Wed, 12 Aug 1998 19:31:19 GMT
From: Wesley Terpstra <terpstra@unixg.ubc.ca>

Does anyone know where I could find a good document about how shared objects are bound to an ELF executable before runtime?

I would like to be able to import symbols from a .so file at runtime based on user input and call the imported symbol (a function). I suspect gdb must do this since it loads shared libraries for programs one debugs and allows one to call the imported functions.

I hope to do this as portably as possible. Can anyone out there recommend a document?

Thanks.

How can I see the current kernel configuration?
Forum: The Linux Kernel Hackers' Guide
Date: Sun, 09 Aug 1998 10:32:18 GMT
From: Melwin <tsmelwin@hotmail.com>

Hi all,

I need help on how to see my current kernel configuration.

Thanks
Melwin

My mouse does not work in X Windows
Forum: The Linux Kernel Hackers' Guide
Re: How can I see the current kernel configuration? (Melwin)
Date: Tue, 11 Aug 1998 23:04:48 GMT
From: alfonso santana <alfonsosantana@hotmail.com>

I have a serial mouse on COM1, but I can't move it in X Windows (the cursor doesn't move). I tried mouseconfig, xf86config, and XF86Setup, and I killed the mouse process. I ran ls -l /dev/mouse and got /dev/mouse -> /dev/cua0, but it still does not work. I tried many protocols, but nothing helped.

Please help me. Thanks

The crash(1M) command in Linux?
Forum: The Linux Kernel Hackers' Guide
Keywords: kernel, crash
Date: Fri, 07 Aug 1998 16:29:12 GMT
From: Dmitry <defanov@romance.iki.rssi.ru>

Hi, all!

I know that there is the crash(1M) command in System V. Is there something like crash(1M) in Linux? And how can I get the addresses of the kernel's tables and structures (for instance, the process table or u-area)?

Thanks!
Dmitry

Where can I get detailed info on VM86

Forum: The Linux Kernel Hackers' Guide
Keywords: vm86
Date: Thu, 06 Aug 1998 14:53:02 GMT
From: Sebastien Plante <sebasp@cae.ca>

My project is to emulate an 8086 processor card. I think I can do it under Linux by using VM86. Where can I get enough information on VM86 to be able to use it?

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/338.html [08/03/2001 10.11.25]

How to print floating point numbers from the kernel?

Forum: The Linux Kernel Hackers' Guide
Date: Tue, 04 Aug 1998 16:51:29 GMT
From: <pkunisetty@hotmail.com>

I want to print floating point numbers from a kernel module. printk works fine for integers but does not work for floating point numbers. Is there any other way to print floating point numbers? Thanks.
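One common workaround, sketched here under the assumption that the value can be kept as a scaled integer (thousandths in this example), is to print the integer and fractional parts separately using integer formats only. The function name format_milli and the scale of 1000 are illustrative choices, not something from the thread; the sketch uses snprintf so it can be tried in user space, and the same format string should carry over to printk.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format a value stored as thousandths (e.g. 3141 means 3.141) into buf.
 * Only integer arithmetic and integer conversions are used.
 * Returns the number of characters snprintf would have written.
 */
int format_milli(long milli, char *buf, size_t size)
{
    const char *sign = "";
    unsigned long mag;

    if (milli < 0) {
        sign = "-";
        mag = 0UL - (unsigned long)milli;   /* magnitude without signed overflow */
    } else {
        mag = (unsigned long)milli;
    }
    /* %03lu zero-pads the fraction so 5 thousandths prints as .005 */
    return snprintf(buf, size, "%s%lu.%03lu", sign, mag / 1000, mag % 1000);
}
```

The sign is handled separately because a value such as -0.005 has an integer part of zero, which would otherwise lose its minus sign.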

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/335.html [08/03/2001 10.11.26]

PS/2 Mouse Operating in Remote Mode

Forum: The Linux Kernel Hackers' Guide
Date: Fri, 31 Jul 1998 20:00:10 GMT
From: Andrei Racz <andreir@worldnet.att.net>

I am experimenting with a PS/2 mouse in remote operation, that is, requesting a pointing packet and then waiting for it. No requests, no packets. Usually the mouse operates in stream mode, sending continuously.

I started with the psaux.c code. First I added a timer that fires a callback, which in turn sends the request. I ended up hanging the machine, with some message like "... idle task could not sleep" and then "AIEE ... scheduling in interrupt". Trying a task queue (tq_timer/tq_scheduler) did not help either.

I have limited experience with Linux. I would appreciate some advice on this matter.

Regards, Andrei Racz

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/333.html [08/03/2001 10.11.26]

basic module

Forum: The Linux Kernel Hackers' Guide
Date: Wed, 29 Jul 1998 06:55:21 GMT
From: <vano0023@tc.umn.edu>

What's wrong with this code? It will not print out current->pid.

    #define MODULE
    #include <linux/kernel.h>
    #include <linux/sched.h>
    #include <linux/module.h>

    extern struct task_struct *current;

    int init_module()
    {
        printk("<0>Process command %s pid %i", current->comm, current->pid);
        return 0;
    }

    void cleanup_module()
    {
        printk("<0>Goodbye\n");
    }

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/331.html [08/03/2001 10.11.26]

How to check if the user is local?

Forum: The Linux Kernel Hackers' Guide
Date: Mon, 27 Jul 1998 16:39:18 GMT
From: <jb@nicol.ml.org>

Access to some resources (starting the X server, access to the diskette drive) should be limited to local users only. I wrote a program that walks through the /proc/*/stat files and checks whether the tty field is between 1024 and 1087. If the process has a pseudoterminal, it checks sid, ppid, sid, and so on. If it finds a process that is a daemon, or that has a terminal other than a virtual console or a pseudoterminal, it reports that the user is remote. Is this a safe way?
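The tty check described above can be sketched as follows. This is an illustration of the field parsing and the range test only, not the poster's program and not a statement that the approach is safe; the function names stat_tty_nr and is_virtual_console are made up for the example. The range 1024 to 1087 corresponds to major number 4 with minors 0 to 63, using the major*256 + minor encoding.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the tty_nr field (7th field) from one line of /proc/<pid>/stat.
 * The command name sits in parentheses and may itself contain spaces or ')',
 * so parsing resumes after the LAST ')'. Returns -1 on malformed input.
 */
int stat_tty_nr(const char *stat_line)
{
    const char *p = strrchr(stat_line, ')');
    char state;
    int ppid, pgrp, session, tty_nr;

    if (p == NULL)
        return -1;
    /* Fields after the command name: state ppid pgrp session tty_nr ... */
    if (sscanf(p + 1, " %c %d %d %d %d",
               &state, &ppid, &pgrp, &session, &tty_nr) != 5)
        return -1;
    return tty_nr;
}

/* Virtual consoles: major 4, minors 0..63, i.e. 4*256+0 .. 4*256+63. */
int is_virtual_console(int tty_nr)
{
    return tty_nr >= 1024 && tty_nr <= 1087;
}
```

Searching for the last ')' matters because a process can choose its own command name, so a naive split on spaces can be fooled.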

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/329.html [08/03/2001 10.11.27]

Ldt & Privileges

Forum: The Linux Kernel Hackers' Guide
Keywords: LDT Memory Privilege
Date: Fri, 24 Jul 1998 15:17:57 GMT
From: Ganesh <ganesh@cs.sunysb.edu>

Hi, I need some help with something related to the modify_ldt system call which was added to Linux. I would greatly appreciate your help.

I am experimenting with a new protection mechanism. I want to push a user process to privilege level 2 in Linux (by adding a system call). If I do this, then at the second level of protection checks in the CPU (i.e. at the paging level), the user process would map to supervisor privileges. This is because x86 maps levels 0, 1, and 2 to supervisor and level 3 to user privileges at the paging level (that is what I understood from the manual; please correct me if I am wrong).

Will the process (at PL 2) be able to write to kernel pages, since the protection check would pass at the page level? If so, I guess I can prevent it at the segment level by adding a check to the modify_ldt code to figure out whether the process is making a pointer to a kernel segment. Is this correct? In any case, the process won't actually be able to reload LDTR or change the LDT directly without a system call (sys_modify_ldt). Or is there some roundabout way in which a process at privilege level 2 can make an entry in the LDT or access the kernel pages?

Again, any help would be greatly appreciated. Thanks a lot.

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/328.html [08/03/2001 10.11.27]

skb queues

Forum: The Linux Kernel Hackers' Guide
Date: Wed, 22 Jul 1998 01:19:19 GMT
From: Rahul Singh <rahul_sg@hotmail.com>

I was trying out some stuff that deals with creating a queue of the sk_buffs as they are passed from routines in ip_output.c to dev_queue_xmit() in /net/core/dev.c. I am using sk_buff_head to do the queueing and timer_list to control the rate at which skbs are passed from my routine to dev_queue_xmit(). The code is able to control the rate at which skbs are passed to dev_queue_xmit(), but it seems to have a few bugs. The error messages that I encountered are "killing of interrupt handlers" and "kfree-ing of memory allocated not using kmalloc" (when I try to have an upper bound on the queue size). It would be great if someone could give me a clue about the possible bugs. Thanks.

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/326.html [08/03/2001 10.11.27]

Page locking (for DMA) and process termination?

Forum: The Linux Kernel Hackers' Guide
Keywords: dma, page locking, process termination
Date: Tue, 21 Jul 1998 17:40:58 GMT
From: Espen Skoglund <espensk@stud.cs.uit.no>

I have some questions concerning locking of pages belonging to a user-level process. The scenario I have is as follows: I have implemented a driver for a PCI device (as a module). All processes that want to access the device will have to do an open on it. When the device file is opened, some of the device's memory is mapped into the user-level application. Communication between the application and the device goes either through this buffer or through the in-kernel module (via ioctl). The device is also able to initiate a DMA transfer all by itself to or from the application's memory. To be able to do this DMA transfer I will have to pin some pages in memory, do some virtual to physical mapping, and also use some scatter-gather mechanism. I am somehow able to cope with all this.

The problem that I am concerned with, however, is the case when a DMA operation is going on (or about to be started), and the process that is the destination or source of the DMA transfer dies. What is the best way to make sure that the pages stay pinned in memory until the device driver receives a release from the dying process? When this happens, the driver will be able to pause the termination of the process if a dangerous DMA transfer is in progress. When the DMA transfer has finished, it may then free the pinned pages and continue termination.

From what I've seen of the process termination code (I'm doing this in a 2.0.30 kernel), the memory mappings are freed before the open files are released (this rules out the obvious solution). I've thought of two other solutions:

1) Using a dummy "ghost thread" associated with the module. All locked user-level pages are also made sharable, and shared with this thread. During the termination of the process, the pinned memory is not freed (I think) because the memory is also shared with this thread.

2) Making the pages "reserved" instead of locked. From what I've seen of the kernel code, reserved pages are not freed by the free_pte() function. They are however freed by the forget_pte() function, which is called by the zeromap_page_range() function, something which is only possible after doing an open on /dev/mem (right?). So, marking pages as reserved should probably work OK. I am however reluctant to do this, since I suspect I am abusing the whole concept of reserved pages.

Are there any other ways to accomplish what I am trying to do, or have I misinterpreted the whole kernel code and overlooked an amazingly simple fact? (I guess this is a fairly easy thing to do, misinterpreting the code I mean. It is after all not what I would consider a well documented/commented piece of software.)

eSk

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/323.html [08/03/2001 10.11.28]

SMP code

Forum: The Linux Kernel Hackers' Guide
Keywords: SMP code
Date: Mon, 20 Jul 1998 20:05:11 GMT
From: <97yadavm@scar.utoronto.ca>

Does anyone out there know a good source of explanation of the Linux SMP code? I am writing an OS, and after reading the Intel MP spec and hearing about all the problems with SMP on Linux, I bet there is a little more to it. If anyone is an expert in this area and wouldn't mind chatting for a bit, it'd be much appreciated. Thanks!!!!!

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/322.html [08/03/2001 10.11.28]

Porting GC: Difficulties with pthreads

Forum: The Linux Kernel Hackers' Guide
Date: Fri, 17 Jul 1998 00:34:11 GMT
From: Talin <Talin@ACM.org>

I'm attempting to get Hans Boehm's gc to run under Linux. (GC is a conservative, incremental, generational garbage collector for C and C++ programs.) Apparently it has been ported to older versions of Linux, but the port appears broken. Searching around the web, I notice that one or two other people have attempted to get this thing working without success. It's tantalizing because the bloody thing almost works.

One major problem has to do with pthread support. GC needs to be able to stop a thread and examine its stack for potential pointers, and there's no defined way in the pthreads API to do this. On SunOS, gc uses the /proc primitives to accomplish this task; unfortunately, the Linux /proc lacks the ability to stop a process. Under Irix, it uses an evil hack: it sends a signal to the pthread, and has the pthread wait on a condition variable inside the signal handler! Needless to say, this method is to be avoided if at all possible. Unfortunately, the author of gc says that this is unavoidable due to the limitations of the pthreads API.

Does anyone have any ideas for how to go about suspending a thread and getting a copy of its register set under Linux?

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/319.html [08/03/2001 10.11.28]

Linux for "Besta - 88"?

Forum: The Linux Kernel Hackers' Guide
Keywords: Besta, Sapsan
Date: Mon, 13 Jul 1998 17:02:17 GMT
From: Dmitry <defanov@romance.iki.rssi.ru.>

Does anybody know about Linux for the Besta-88 workstation? It is based on the MVME147 board, but has its own design.

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/314.html [08/03/2001 10.11.29]

MVME147 Linux

Forum: The Linux Kernel Hackers' Guide
Re: Linux for "Besta - 88"? (Dmitry)
Keywords: Besta, Sapsan
Date: Thu, 16 Jul 1998 11:51:12 GMT
From: Edward Tulupnikov <allin1@allin1.ml.org>

Looks like http://www.mindspring.com/~chaos/linux147/ has some info on the issue. Maybe that'll be helpful.

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/314/1.html [08/03/2001 10.11.29]

/proc/locks

Forum: The Linux Kernel Hackers' Guide
Keywords: /proc/locks
Date: Mon, 13 Jul 1998 12:15:29 GMT
From: Marco Morandini <marc2@vedyac.aero.polimi.it>

Can you give me information about the /proc/locks file (what it is, its format, etc.)? I was not able to find any in the man pages. Hoping this is the right forum. Thanks in advance.

Marco Morandini

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/313.html [08/03/2001 10.11.29]

syscall

Forum: The Linux Kernel Hackers' Guide
Date: Wed, 08 Jul 1998 14:00:51 GMT
From: <ppappu@lrc.di.epfl.ch>

What is the mechanism for calling a system call when there is no library interface for it? For example, how do you call the get_kernel_syms system call?

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/310.html [08/03/2001 10.11.30]
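One mechanism that fits the question is the generic syscall() function provided by the C library, which takes a system call number followed by the call's arguments. The sketch below is an illustration under that assumption and is not from the thread; getpid stands in for get_kernel_syms because the latter may not be defined on the system where this is compiled, and raw_getpid is a made-up name.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Invoke getpid through the generic syscall() entry point instead of the
 * dedicated getpid() wrapper. The first argument is the system call number;
 * any further arguments are passed through to the kernel unchanged.
 */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

A call without a library wrapper would be invoked the same way, with its own number from <sys/syscall.h> and its arguments appended.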

How to run a bigger kernel ?

Forum: The Linux Kernel Hackers' Guide
Keywords: kernel size
Date: Sat, 04 Jul 1998 05:13:31 GMT
From: Kyung D. Ryu <kdryu@cs.umd.edu>

I added some features to Linux 2.0.32. When I modified just a couple of lines and rebuilt the kernel, I was able to reboot and run the new kernel. So I put a couple of functions in the source files and rebuilt it. The kernel got a bit bigger than the last one (kernel size: original: 446281, a few changes: 446289, more changes: 446664). It crashed when it was rebooted and attempted to uncompress the new kernel, giving the message "ran out of input data...". So I guess the kernel size does matter. How can I make this bigger new kernel run? Is there any parameter to change? Thanks in advance

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/308.html [08/03/2001 10.11.30]

Linux Terminal Device Driver

Forum: The Linux Kernel Hackers' Guide
Keywords: device driver
Date: Wed, 24 Jun 1998 12:39:01 GMT
From: Nils Appeldoorn <appeldoo@hio.hen.nl>

Hello, I'm a student from Holland and have received the following assignment: write a paper about the Linux terminal driver. Explain how it handles all the interrupts, how the data structure looks, what the functionality of its parts is, etcetera, etcetera. The problem is, we can't get a clear overview of all the needed source files. We've found /usr/src/linux/drivers/char/tty_io.c, but that's probably not the only one. It's a little bit fuzzy for starters like me, and we cannot figure it out. If you can help me just a little bit, I would appreciate it! Any help whatsoever is good!

Thanks anyway, Nils Appeldoorn

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/300.html [08/03/2001 10.11.30]

Terminal DD

Forum: The Linux Kernel Hackers' Guide
Re: Linux Terminal Device Driver (Nils Appeldoorn)
Keywords: device driver
Date: Tue, 30 Jun 1998 13:22:00 GMT
From: Doug McNash <dmcnash@computone.com>

Quickly:

serial.c is the device driver for bare UARTs (8250-16550). Others are present for various cards like stallion, digi, specialix et al., but you probably can't get the tech doc for those. This is the interface with the hardware.

tty_io.c and tty_ioctl.c provide some common support functions.

n_tty.c is the code for the line discipline, which does the processing of the input/output stream as well as some control functions. This is the interface between the "user" and the driver.

tty.h, termio[s].h, termbits.h, ioctl.h, serial.h contain the structure definitions and defines.

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/300/1.html [08/03/2001 10.11.31]

DMA to user allocated buffer ?

Forum: The Linux Kernel Hackers' Guide
Keywords: DMA user buffer physio
Date: Mon, 22 Jun 1998 17:07:54 GMT
From: Chris Read <support@active-silicon.com>

I am porting an application from Solaris and NT to Linux. I need to DMA fairly large (>1 MByte) data areas to a user-assigned buffer (assigned using malloc). I thus need to either:

1) Lock down the pages manually in the driver and generate a physical scatter/gather table for the DMA (I assume that the malloc'ed pages will not be contiguous in physical memory)

2) Remap the user buffer into physically contiguous memory, without changing the user virtual mapping (i.e. same user virtual address)

3) Implement a Unix (Solaris) like physio routine to perform the DMA in chunks, akin to the read entry point uoi->buf mappings.

Any pointers or ideas?

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/297.html [08/03/2001 10.11.31]
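The bookkeeping behind option 1 can be sketched as below: a virtually contiguous buffer is cut at page boundaries so that each piece lies within a single page, which is the unit a scatter/gather table would describe. This shows the address arithmetic only, on plain integers in user space. Pinning the pages and translating each piece to a physical or bus address is driver work that is not shown, and the names sg_chunk and split_at_pages are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One scatter/gather entry: start address and byte count within one page. */
struct sg_chunk {
    uintptr_t addr;
    size_t    len;
};

/*
 * Split [addr, addr + len) at page boundaries. Writes up to `max` entries
 * into `out` and returns the number of entries needed in total, so a
 * return value larger than `max` means the table was too small.
 * page_size must be a power of two.
 */
size_t split_at_pages(uintptr_t addr, size_t len, size_t page_size,
                      struct sg_chunk *out, size_t max)
{
    size_t n = 0;

    while (len > 0) {
        /* Bytes left before the next page boundary. */
        size_t room = page_size - (size_t)(addr & (page_size - 1));
        size_t take = len < room ? len : room;

        if (n < max) {
            out[n].addr = addr;
            out[n].len  = take;
        }
        n++;
        addr += take;
        len  -= take;
    }
    return n;
}
```

A buffer that starts or ends partway through a page produces short first and last entries, so a 1 MByte buffer with 4 KByte pages needs 256 entries when aligned and 257 otherwise.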

allocator-example in A.Rubini's book

Forum: The Linux Kernel Hackers' Guide
Re: DMA to user allocated buffer ? (Chris Read)
Keywords: DMA user buffer physio
Date: Tue, 07 Jul 1998 08:19:04 GMT
From: Thomas Sefzick <t.sefzick@fz-juelich.de>

Have a look at the examples from A.Rubini's 'Linux Device Drivers' book, downloadable from the website advertised at the top of the KHG page. Could the 'allocator' code in ftp/v2.1/misc-modules/allocator.[ch] and ftp/v2.1/misc-modules/README.allocator be a solution to your problem?

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/297/1.html [08/03/2001 10.11.31]

5 and underneath this directory was kernel_patch and etc.it/LDP/khg/HyperNews/get/khg/293.ca> Hi Every one I currently have kernel v2. For mounting the files I used the following command in Linux: mount -t msdos /dev/fd0 /mnt But after the mounting in /mnt I had different garbled file names such as rtlinu~1 & kernel~1.iol.0.. I am stuck here :)" Another problem was.Patching problems The HyperNews Linux KHG Discussion Pages Patching problems Forum: The Linux Kernel Hackers' Guide Date: Wed.11. unzipped it and stored it in a disk..27 running in my computer and would like to patch the rtlinux and do some experience on it. While listing the files in windows I got these directory names: