Introduction to Linux Device Driver Development

Prepared by Richard A. Sevenich
Chapter 4. A Selection of Topics from the Kernel Internals
General References:
• Love, Linux Kernel Development, Sams (2004).
• Bovet & Cesati, Understanding the Linux Kernel, 2nd Edition, O'Reilly (2003).
• Beck et alii, Linux Kernel Programming, 3rd Edition, Addison-Wesley (2002).
• Rubini & Corbet, Linux Device Drivers, 2nd Edition, O'Reilly (2001).
• Linux Kernel Version 2.4 Source Code.
Note: At this writing the Rubini & Corbet book is freely available via download from
4.1 Introduction
The topic areas we'll skim are:
• system calls
• signals
• wait queues
• task queues
• kernel timers and other timely topics
• interrupt handling
• process scheduler
The kernel continues to change in each of these areas, particularly in response to the need for greater scalability.
4.2 The System Call Dispatcher
Some of you will recall the MS DOS function dispatcher. It had functionality such as to
• send a character to the screen or printer
• receive a character from the keyboard
• read/write a disk drive
• get/set time or date
This was implemented in assembly language and used the following interface:
• put the number of the desired function in the ah register
• perform any other initialization needed by the function (using other registers)
• call interrupt 0x21
The corresponding interrupt handler was the function dispatcher.
The Linux system call dispatcher may be more complex, but is essentially similar to the MS DOS function
dispatcher. Such a jump table is not a new idea. Let's look at an example:
#include <unistd.h>
int main() {
    int result;
    result = write(1, "hello\n", 6);
    return 0;
}
The code above makes a library call, write, which is a wrapper around the sys_write system call. Some authors refer
to the library call as a stub. The arguments of write are passed via the stack to the library function which does some
setup and then invokes (assuming IA 32) int 0x80, the system call dispatcher.
R.A. Sevenich © 2004 Introduction to Linux Device Driver Development 4 - 1
In our example, the library call would do this setup for the IA32:
• put the system call number for write (4) in the eax register
• put the first argument (stdout = 1) into the ebx register
• put the second argument (a pointer to the string "hello\n") into the ecx register
• put the third argument (length of string = 6) into the edx register
• invoke int 0x80
The interrupt handler switches to kernel mode, performs the task, and returns a result in eax to the library function.
Our goal in this chapter is to get familiar with some details of the underlying implementation and then implement
our own system call and then use it. Integrating our system call into the kernel will necessitate recompiling the
kernel as you have done before. However, you have a good .config file, right - and it's backed up, right?
4.2.1 Implementation Details
An important header file to look at is <asm/unistd.h>. It starts with a table of #define's containing the system call
numbers. The number for the write library call we considered earlier appears thusly
#define __NR_write 4
Further investigation of this header file suggests that the library wrapper around the system call or stub can be
generated by a macro call of this form (see 'man 2 intro'):
_syscallX(type, name, type1, arg1, type2, arg2, ...)
• X is the number of arguments taken by the stub (range is 0 through 5)
• type is the return type of the system call
• name is the name of the system call
• typeN is the type of the Nth argument
• argN is the name of the Nth argument
These macros can be seen in <linux/unistd.h>. For example, in that header file we find:
#define _syscall3(type, name, type1, arg1, type2, arg2, type3, arg3) \
type name(type1 arg1, type2 arg2, type3 arg3) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
    : "=a" (__res) \
    : "0" (__NR_##name), "b" ((long)(arg1)), "c" ((long)(arg2)), \
      "d" ((long)(arg3))); \
__syscall_return(type, __res); \
}
and later:
static inline _syscall3(int, write, int, fd, const char *, buf, off_t, count)
Hence we know how the macro expands into the definition of write, i.e.
int write(int fd, const char * buf, off_t count)
{
    long __res;
    __asm__ volatile ("int $0x80"
        : "=a" (__res)
        : "0" (__NR_write), "b" ((long)(fd)), "c" ((long)(buf)),
          "d" ((long)(count)));
    __syscall_return(int, __res);
}
Here we see in which registers parameters are passed, how the value 4 identifying the write system call is
determined, etc. It is left for you to expand the __syscall_return macro. Note that we have explained the scenario
from making the library call in the user code to having the library call subsequently invoke int 0x80.
Now we've claimed that write is a wrapper around the actual kernel level system call, sys_write. How is sys_write
called and where in the source code is it? If we knew those answers we'd be on our way to doing our own
implementation. We note that int 0x80 ultimately results in executing the code in <linux/arch/i386/kernel/entry.S>.
Look particularly at the code starting from ENTRY(system_call) noting that it soon does a call to a reference in the
sys_call_table. That table is at the end of the entry.S file where we find, for example, that entry number 4
(__NR_write) is
.long SYMBOL_NAME(sys_write)
So that's how it finds the call to sys_write. Where in the source code is sys_write? Using,
we find that it is in /usr/src/linux/fs/read_write.c. Check to see what other directories are at this level and see what
other system calls you might locate (e.g. sys_fork, sys_chmod, sys_alarm).
We have enough details and putting the picture together will allow us to create our own system call.
4.2.2 Implementing our own System Call
We'll just lay out the recipe, based on what we discovered in the previous section. Further we'll pick a specific
example so everything is concrete. Here is the recipe:
1. Name your new system call sys_my_new_call and put it in a file it_b_mine.c. As root, copy the file to /usr/src/linux/kernel/.
2. Modify the Makefile in /usr/src/linux/kernel/.
3. Edit /usr/src/linux/arch/i386/kernel/entry.S and /usr/src/linux/include/asm/unistd.h, in that order (described in Section 4.2.5).
4. Recompile the kernel via 'make bzImage' while in /usr/src/linux, copy bzImage to the appropriate vmlinuz in /boot, run lilo, and reboot (cf. Chapter 2 of the course notes).
5. Write a user program which exercises your new system call
You might tar and zip your current /usr/src/linux, because we're going to make some changes which you'll want to
remove subsequently.
4.2.3 The new system call
Here's the file it_b_mine.c:
#include <linux/kernel.h>
asmlinkage int sys_my_new_call(void)
{
    printk(KERN_ALERT "sys_my_new_call at your service\n");
    return 0;
}
As root, copy it into /usr/src/linux/kernel. Double check that the ownership and permissions are consistent with other
files in that directory.
4.2.4 Modify the Makefile
Modify the Makefile in /usr/src/linux/kernel to add it_b_mine.o to the entries for obj-y.
4.2.5 As root, edit unistd.h and entry.S
Near the end of the file, /usr/src/linux/arch/i386/kernel/entry.S, you'll find the jump table. At the very end of that
table, add
.long SYMBOL_NAME(sys_my_new_call)
and note the position. In my case, it was 226. Save the new entry.S.
Next, near the beginning of the file, /usr/src/linux/include/asm/unistd.h, you'll find the table of system call numbers.
Add the appropriate entry i.e. at the end I added
#define __NR_my_new_call 226
where, in your case, the number might be different than 226, but must match that from the entry.S file. Save this file as well.
4.2.6 Recompile and reboot
Unless you are also doing some reconfiguration, you need not do all the steps seen earlier in Section 1.4 of Chapter
1. In particular, you can start with Step 5 of that section and then something along the lines of Steps 8 and 9.
Essentially all you need to do then is
● compile via 'make bzImage'
● copy the new kernel to /boot
● revise lilo.conf, if necessary, and rerun lilo ... or modify /boot/grub/menu.lst, if necessary
● reboot
4.2.7 A user program using our new system call
Let's continue to assume the source hierarchy, in which we are working, is /usr/src/linux. Now gcc expects the
include files to be at /usr/include/, but ours are at /usr/src/linux/include. Sometimes there are symbolic links from the
former to the latter, in particular from
/usr/include/asm to /usr/src/linux/include/asm
and from
/usr/include/linux to /usr/src/linux/include/linux
So that we don't need to modify linkages for our particular example, we'll just tell gcc where those files are when we
compile the user program i.e.
gcc -I /usr/src/linux/include ... and so on.
Here is a user program:
/* Use my_new_call */
#include <errno.h> /* the _syscallX macros store error codes in errno */
#include <sys/types.h>
#include <linux/unistd.h>
static inline _syscall0(int, my_new_call)
int main() {
    int result;
    result = my_new_call();
    return result;
}
Compile and run this program. It should print to some log file e.g. to /var/log/messages:
sys_my_new_call at your service
which you can verify via something like
tail -f /var/log/messages
If it's printing to some other log file, you can do some detective work looking at time stamps via
ls -l /var/log/
and see which log files have been written recently.
4.2.8 Return to normalcy
If desired, back out all the changes you made in this chapter and return your system to its original state.
4.2.9 Adding a bit more substance to our system call
User programs, of course, cannot be allowed access to kernel space. Yet we may need to pass information back and
forth under tight control e.g. via the system call mechanism and appropriate kernel functions. Linux provides various
ways to do this. Here we'll introduce two macros:
• get_user() - can be called by a kernel process to get a single datum from the user's memory space
• put_user() - can be called by a kernel process to put a single datum into the user's memory space
Here is the necessary information for get_user():
#include <asm/uaccess.h>
get_user(datum, ptr)
This will read the datum from user space, where ptr is the user space address. The size of the datum transferred
depends on the type of the ptr argument and is determined by gcc at compile time. The macro returns 0 on success,
otherwise an error.
Here is the necessary information for put_user():
#include <asm/uaccess.h>
put_user(datum, ptr)
This will write the datum to user space, where ptr is the user space address. The size of the datum transferred
depends on the type of the ptr argument and is determined by gcc at compile time. The macro returns 0 on success,
otherwise an error.
As an example, we'll invent two new system calls:
sys_new_sys1 - will use get_user()
sys_new_sys2 - will use put_user()
We'll package them together in the same file and put that file in /usr/src/linux/kernel. We also must modify the
Makefile in that directory and put two new entries in both
/usr/src/linux/arch/i386/kernel/entry.S and
/usr/src/linux/include/asm/unistd.h
So we are essentially just following the recipe at the start of Section 4.2.2.
Here is the new kernel program:
#include <linux/kernel.h>
#include <asm/uaccess.h>
#include <asm/errno.h>
static int shared_int = 0;
/* multiply the stored value by 5 and copy it back to user space */
asmlinkage int sys_new_sys2(unsigned long arg)
{
    shared_int = 5 * shared_int;
    printk(KERN_ALERT "sys_new_sys2 will call put_user()\n");
    if (put_user(shared_int, (int *)arg) != 0) return -EFAULT;
    return 0;
}
/* copy an int from user space into shared_int */
asmlinkage int sys_new_sys1(unsigned long arg)
{
    shared_int = 0;
    printk(KERN_ALERT "sys_new_sys1 will call get_user()\n");
    if (get_user(shared_int, (int *)arg) != 0) return -EFAULT;
    return 0;
}
Here is an example user program which makes use of the two new system calls.
#include <stdio.h>
#include <stdlib.h>
#include <errno.h> /* the _syscallX macros store error codes in errno */
#include <linux/unistd.h>
#include <sys/types.h>
static inline _syscall1(int, new_sys1, int *, foo1)
static inline _syscall1(int, new_sys2, int *, foo2)
int main()
{
    int user_space_int;
    user_space_int = 16;
    printf("user_space_int starts with value %d\n", user_space_int);
    if (new_sys1(&user_space_int) != 0)
        printf("new_sys1 failed.\n");
    if (new_sys2(&user_space_int) != 0)
        printf("new_sys2 failed.\n");
    printf("user_space_int finishes with value %d\n", user_space_int);
    return 0;
}
4.3 Signals

We will see that there is a variety of available signals and there are various ways a program can be set up to respond
to signals - giving the signal mechanism both power and flexibility. More specifically, a signal can have these
possible effects on a program (please note the similarity to the hardware interrupt mechanism):
• The signal is 'caught' by the program: Execution is transferred to a signal handler and, upon its completion,
control is returned to the signaled program.
• There is no signal handler so the appropriate default is exercised:
STOP: The program is put into a stopped state, but can be returned to a runnable state later.
EXIT: The program is forced to exit.
CORE: The program is forced to exit and a core dump is generated and filed in the program's directory.
IGNORE: The signal is ignored.
• The SIGKILL and SIGSTOP signals are distinct in that they can neither be caught nor ignored.
A program's response to a signal is consistent throughout the process, so that all threads within a process respond the
same way.
Signals have names (all starting with 'SIG'), values, and default actions. These are listed in the man page i.e. enter
'man 7 signal'. You'll note from the man page that there is a POSIX signal API and a legacy API. The referenced
book by Johnson and Troan has a very nice chapter on signals which moves through the legacy signal mechanisms
which were in some cases incompatible with each other. It also discusses the unreliability of ANSI C standardization
of the signal() function. It is recommended that the well defined and reliable POSIX signal API be used.
4.3.1 The kernel's use of signals
Of course, the kernel already uses signals to conduct its everyday business. Here are some examples from the man
page:
• If a program makes an invalid memory reference (e.g. a wild pointer), the kernel sends the offending process a
SIGSEGV, with default action CORE.
• If a child process has stopped or terminated, the kernel sends the parent a SIGCHLD, with default action
IGNORE.
• If the suspend keystroke combination (often CTRL-z) is pressed, the kernel sends SIGTSTP to any foreground
process, with default action STOP.
• If a program writes to a pipe which has no readers, the kernel sends that process a SIGPIPE, with default action
EXIT.
In general, the kernel uses signals for various reasons, not merely on error conditions. A categorization of such
reasons might include:
• Program termination
• Program stopping and subsequent continuing
• Dealing with errant programs
• Terminal handling
• Program Notification (e.g. a timeout alarm, death of child)
Again, note that some signals originate in response to a hardware interrupt i.e. the interrupt handler causes a signal
to be sent.
4.3.2 Signals in user programs
As expected, user programs' use of signals is more restricted. They cannot, for example, just send signals to anyone.
They can, however, set themselves up to catch a variety of kernel generated signals - often having to do with signals
sent in connection to terminal activity. Furthermore, the POSIX signals include a pair of user-defined signals,
SIGUSR1 and SIGUSR2, whereby two user programs with the same uid can communicate.
4.3.3 Signal handlers in user programs
Although there may be instances where we want the default response to the signal, it is alternatively possible that we
will want to catch and handle the signal - that will be the focus of this section. POSIX signals are organized in sets,
represented by a data type sigset_t. Linux provides us with a group of functions for safely manipulating signal sets:
empty the referenced set of all signals
int sigemptyset(sigset_t * set);
fill the referenced set with all signals
int sigfillset(sigset_t * set);
add a specified signal to the referenced set
int sigaddset(sigset_t * set, int signo);
remove a specified signal from the referenced set
int sigdelset(sigset_t * set, int signo);
test whether a specified signal is a member of the referenced set
int sigismember(const sigset_t * set, int signo);
The program that wishes to catch the signal will also declare the signal handler. The prototype for a signal handler is
typedef void (*__sighandler_t)(int signo);
The reference to your signal handler is placed in the struct sigaction, which specifies how the kernel should deliver
signals to your program. The struct looks like this:
struct sigaction {
    __sighandler_t sa_handler;
    unsigned long sa_flags;
    void (*sa_restorer)(void);
    sigset_t sa_mask;
};
Now we'll describe the items in this struct:
• sa_handler is a pointer to your signal handler, alternatively it can be
SIG_IGN - tells the kernel to ignore the signal
SIG_DFL - tells the kernel to use the default response
• sa_flags is a bitmask that controls kernel behavior when the signal is received and OR's various possibilities. Our
subsequent example sets this to zero. You might investigate other options.
• sa_restorer is not used by Linux
• sa_mask specifies the signals to be blocked while the signal handler is executing
Once the sigaction struct is declared, the sigaction() system call can be invoked to deliver the information to the
kernel detailing how the signal should be delivered. The following user space program provides an example.
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#define true 1
#define false 0
volatile sig_atomic_t caught = false;
/* here's a trivial signal handler */
void mysig_handler(int sig) {
    printf("mysig_handler got SIGALRM.\n");
    caught = true;
}
int main(void)
{
    /* declare the sigaction struct */
    struct sigaction mysig_action;
    /* fill in the necessary fields in the prior struct */
    mysig_action.sa_handler = mysig_handler;
    mysig_action.sa_flags = 0;
    sigemptyset(&mysig_action.sa_mask); /* block no extra signals in the handler */
    /* pass the signal and related struct to the kernel */
    sigaction(SIGALRM, &mysig_action, NULL);
    printf("Now calling alarm(5)\n");
    /* set up a SIGALRM at 5 seconds from now */
    alarm(5);
    /* let's hang around until the signal is caught */
    while (caught == false)
        pause();
    printf("Resumed program upon signal handler completion.\n");
    return 0;
}
4.4 Wait Queues
It routinely happens in a wide variety of circumstances that a kernel process needs to wait for a particular event to
happen. Although there are instances where the process may then do a busy waiting loop (e.g. spinlocks in a
multiprocessor environment) it is often more appropriate that the process block, so other processes can continue to
keep the cpu busy doing useful work. This capability is supported by wait queues. The wait queue struct is a cyclic
linked list:
struct wait_queue
{
    struct task_struct * task;
    struct wait_queue * next;
};
The supporting macros include those that
• put the process to sleep
• awaken the process
• add and delete wait queue members
We'll examine these next.
Putting a process to sleep on a wait queue
These include the following:
• void sleep_on(struct wait_queue **p);
This sets the process state to TASK_UNINTERRUPTIBLE, enters the process in the designated wait queue,
and relinquishes control by calling the scheduler. The process must be awakened by some other process
which does a wake up call (discussed under 'Awakening a process on a wait queue' below) for this queue.
• void interruptible_sleep_on(struct wait_queue **p);
This sets the process state to TASK_INTERRUPTIBLE and enters the process in the designated wait queue,
and relinquishes control by calling the scheduler. The process must be awakened by some other process
which does a wake up call for this queue, but can also be awakened by a signal.
• void sleep_on_timeout(struct wait_queue **p, long timeout);
This sets the process state to TASK_UNINTERRUPTIBLE, enters the process in the designated wait queue,
and relinquishes control by calling schedule_timeout. The process is awakened at the time specified by the
timeout argument, rather than by requiring some other process to do a wake up call for this queue.
• void interruptible_sleep_on_timeout(struct wait_queue **p, long timeout);
This sets the process state to TASK_INTERRUPTIBLE and enters the process in the designated wait queue,
and relinquishes control by calling schedule_timeout. The process is awakened at the time specified by the
timeout argument, rather than by requiring some other process to do a wake up call for this queue. However,
the process can also be awakened by a signal.
Awakening a process on a wait queue
These include the following:
• void wake_up(struct wait_queue **p);
This will wake up both interruptible and noninterruptible sleepers on the designated queue.
• void wake_up_interruptible(struct wait_queue **p);
This will wake up only interruptible sleepers on the designated queue.
Note that the wake up calls will not awaken processes which were explicitly stopped.
Adding/deleting wait queue members
To safely add and remove members of wait queues we have:
• void add_wait_queue(struct wait_queue **queue, struct wait_queue *entry);
• void remove_wait_queue(struct wait_queue **queue, struct wait_queue *entry);
In both cases, the first argument refers to the queue of interest, while the second refers to the entry to be added or
removed, respectively.
4.4.1 Race Conditions
Let's say we put some process to sleep until some condition is true, maybe using a construction like this:
while (wake_condition == false)
    interruptible_sleep_on(&my_wait_queue);
With the demise of the big kernel lock, this may be subject to race conditions. This will occur if the wake condition
evaluates as false in the first line and becomes true before the second line executes. In the worst case, the process
will experience deadlock. This can be avoided with some clever programming, but this has been encapsulated in the
kernel - so we don't even need to be clever. The appropriate replacement for the prior code snippet is
wait_event_interruptible(my_wait_queue, wake_condition == true);
There is also the expected
wait_event(my_wait_queue, wake_condition == true);
4.5 Task Queues
Task queues hold tasks to be executed at a later time. The kernel provides predefined task queues in which you can
register your task. The scheduler then decides just when tasks in such a queue will be executed. Alternatively, you
can define your own task queue and specify when it should execute. A queue element is a tq_struct as defined by:
#include <linux/tqueue.h>
struct tq_struct
{
    struct tq_struct *next;   /* linked list of queued tasks */
    unsigned long sync;       /* must be initialized to zero */
    void (*routine)(void *);  /* function to call */
    void *data;               /* argument to function */
};
Once you have declared an element, you should
• clear the next and sync fields
• enter appropriate items in the routine and data fields
Then you may queue the task with the queue_task function whose prototype is
void queue_task(struct tq_struct *task, task_queue *list);
Note: For the predefined tq_scheduler queue, the related code must use schedule_task to put the task on the
tq_scheduler queue, not queue_task. We'll see an example shortly.
To run a queue of tasks the function used is run_task_queue with prototype
void run_task_queue(task_queue *list);
which the kernel invokes for its predefined task queues and which you must call for any task queue you define.
4.5.1 Queues Predefined by the Kernel
The four queues predefined by the kernel are:
• tq_scheduler - queued tasks in here execute whenever the scheduler runs (not executed at interrupt time)
• tq_timer - execution of these tasks is triggered by the timer tick (executed at interrupt time)
• tq_immediate - these tasks are run as soon as possible, either on return from a system call or when the scheduler
is run (executed at interrupt time)
• tq_disk - not available to modules; used internally by memory management
This essentially leaves the first three for us.
4.5.2 The tq_timer and tq_immediate queues
Note that tasks in the tq_timer and tq_immediate queues are executed in interrupt time. This has important
consequences. First, in interrupt mode, there is no process context so that
• the queued task cannot access user space
• the current pointer is not meaningful.
Second, if the process attempts to sleep or calls a function which can sleep, the queued task may hang. Note that
functions which attempt to reserve system resources are quite likely to have a need to sleep (e.g. kmalloc).
An example of usage of tq_timer or tq_immediate
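#include <linux/tqueue.h>
#include <linux/interrupt.h> /* mark_bh() and IMMEDIATE_BH */
static struct tq_struct my_task;
void my_own_task(unsigned long ptr)
{
    ... some valid code ...
}
void init_and_enqueue_my_task()
{
    my_task.routine = (void *)&my_own_task;
    my_task.data = (void *)&some_data;
    queue_task(&my_task, &tq_immediate);
    mark_bh(IMMEDIATE_BH); /* tq_immediate only: ask the kernel to run the queue */
}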
4.5.3 The tq_scheduler queue
Tasks in the tq_scheduler queue are not executed in interrupt time, so the constraints mentioned at the start of
section 4.5.2 do not apply. A further difference from tq_immediate and tq_timer emerged in the 2.4 kernel series -
the related code must use schedule_task to put the task on the tq_scheduler queue, not queue_task. An example of
usage of tq_scheduler follows.
#include <linux/tqueue.h>
static struct tq_struct my_task;
static char my_msg[] = "<1>\nmy_special_task has executed.\n";
void my_special_task(unsigned long ptr)
{
    printk((void *)ptr);
}
void init_and_enqueue_my_task()
{
    my_task.routine = (void *)&my_special_task;
    my_task.data = (void *)&my_msg;
    schedule_task(&my_task); /* not queue_task, as noted above */
}
4.5.4 Your own Task Queues
In this case, since the queue is not predefined, the queue is declared by a macro in this style:
DECLARE_TASK_QUEUE(my_tq);
The fields would be filled in as before and then the task would be enqueued by:
queue_task(&my_task, &my_tq);
Unlike the predefined queues, this would need to be executed overtly by
run_task_queue(&my_tq);
This leaves the question of how the task queue execution would be triggered. This is done by registering the prior
function in one of the predefined queues.
4.6 Time Related Functionality
4.6.1 Current Time
The kernel keeps track of time via the timer interrupt, which in my IA-32 machine occurs 100 times per second
(defined by HZ in /usr/src/linux/include/asm/param.h). The timer interrupt handler updates the value in jiffies. This
is defined as an unsigned long volatile in /usr/src/linux/include/linux/sched.h. This 32-bit quantity is zeroed when
your machine is powered up. The value in the variable jiffies is one method to measure time intervals in kernel code.
If your driver needs the current time, the do_gettimeofday function is provided. It gives near microsecond resolution
for most architectures. A usage example is shown in this fragment:
struct timeval tv;
do_gettimeofday(&tv);
printk(KERN_ALERT "Current seconds = %08u.%06u\n",
       (unsigned int)(tv.tv_sec%100000000), (unsigned int)(tv.tv_usec));
In addition to the timer interrupt driven jiffies value, most modern processors have acknowledged the need for a
much finer time resolution. This will be based on the processor clock speed and made available in a special register.
This is architecture dependent and we will describe the situation in the more recent and ubiquitous IA32 (Pentium
and later). The IA32 has a 64-bit register called the time stamp counter (TSC) available via the assembly language
instruction rdtsc.

The TSC is also accessible via the C macros rdtsc and rdtscl described by:
#include <asm/msr.h>
rdtsc(low, high) - here low and high are each 32-bit variables holding the two parts of the 64-bit TSC
rdtscl(low) - here low is just the low part of the 64-bit TSC
4.6.2 Delays
Long Delays
For pedagogical reasons, we'll start with a poor solution for creating a delay and move toward better. Each example
will rely on this information:
/*resolution on order of jiffies */
unsigned long my_delay = desired_seconds * HZ;
unsigned long target_time = jiffies + my_delay;
Since jiffies will eventually roll over and since Linux machines are relatively stable, target_time could roll over and
be less than jiffies. Hence, a set of macros that accommodates roll over properly is provided in <linux/timer.h>.
These are as follows:
• time_before(jiffies, target_time) - rollover corrected; evaluates as true, if jiffies < target_time
• time_after( jiffies, target_time) - rollover corrected; evaluates as true, if jiffies > target_time
• time_before_eq(jiffies, target_time) - rollover corrected; evaluates as true, if jiffies <= target_time
• time_after_eq( jiffies, target_time) - rollover corrected; evaluates as true, if jiffies >= target_time
Let's examine some delay possibilities. The first example is known as "busy waiting" and should be avoided. It is
while (time_before(jiffies, target_time));
/* the CPU stays busy in this loop, stalling any other work */
The fact that jiffies is declared as volatile forces it to be reread each time it is accessed in your code - so you won't
be haunted by a cached value. However, jiffies is changed by the timer interrupt, so using this busy waiting loop
while hardware interrupts were disabled would hang the machine.
Our second example removes both problems:
while (time_before(jiffies, target_time)) schedule();
This process calls the scheduler, so other tasks can run. However, this task remains in the execution queue which
creates a subtle problem. If this is the only task, it will keep getting turns to run and it will keep calling the scheduler
- but it's really doing nothing useful. On the other hand, if there are no tasks to run, the scheduler runs the 'idle'
process which provides these benefits:
• it reduces the CPU's workload, reducing temperature and increasing lifetime (e.g. a laptop will go longer before
needing its battery recharged)
• the time used by the process is accountable (maybe a non issue)
Our third example removes the prior problem as follows:
current->state = TASK_INTERRUPTIBLE;
schedule_timeout(my_delay);
Here, current is the task_struct of the executing process. The scheduler will avoid the task until the timeout has been
reached.
Short Delays
The prior delays have resolution in the jiffies range. To get delays in the microsecond range, you can use the udelay
function based on the processor's bogomips measurement. Its prototype is
#include <linux/delay.h>
void udelay(unsigned long usecs);
For example,
udelay(50);
would be a busy waiting loop that lasts for 50 microseconds. It is recommended that the argument passed to udelay
not exceed 1000, because fast machines (i.e. with high bogomips) may encounter an overflow. A wrapper iterating
around udelay is provided by mdelay e.g.
mdelay(70);
would provide a delay of 70 milliseconds.
4.6.3 Kernel Timers
Like task queues, kernel timers provide a way to defer execution of a task until a later time. The kernel timers are
kept in a doubly linked list. The data structure for a timer is given in /usr/src/linux/include/linux/timer.h as:
struct timer_list
{
    struct timer_list *next; /* MUST be first element */
    struct timer_list *prev;
    unsigned long expires;
    unsigned long data;
    void (*function)(unsigned long);
};
where 'expires' (3rd element) is the time in jiffies at which timeout occurs and '*function' (5th element) denotes the
function to call at timeout. There are three important functions provided for manipulating timers:
• init_timer() - initializes the timer structure by zeroing the 'next' and 'prev' pointers
• add_timer() - inserts a timer structure into the global list of active timers
• del_timer() - for removing a timer from the list before its timeout has transpired
Note that when a timer times out, it is automatically removed from the list.
Here are the elements of a trivial example:
#include <linux/time.h>
#include <linux/timer.h>
#include <linux/wait.h>
#include <linux/param.h>
static struct timer_list my_timer;
static char msg[] = "<1>\nmy_timer has timed out.\n";
/* called at timeout; ptr carries the address of msg */
void upon_my_timeout(unsigned long ptr)
{
    printk((char *)ptr);
}
/* arm my_timer to time out four seconds from now */
void wait_four(void)
{
    init_timer(&my_timer);
    my_timer.function = upon_my_timeout;
    my_timer.data = (unsigned long)&msg;
    my_timer.expires = jiffies + (4 * HZ);
    add_timer(&my_timer);
}
The time-outs provided by such timers are unlike task queues in that the timer specifies precisely when the timeout
function is to be executed; whereas with a task queue all you know is that the queued task will be performed at some
later time. Occasionally the need for such functionality arises in a driver.
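To make the life cycle of a timer concrete, here is a minimal user-space model of the list handling described above. It is a sketch, not kernel code: the model_ names, the simulated tick counter model_jiffies, the HZ value of 100 and the count_timeout callback are all assumptions made for illustration.

```c
#include <stddef.h>

#define HZ 100                          /* assumed tick rate for this model */

struct model_timer {
    struct model_timer *next;           /* doubly linked list of active timers */
    struct model_timer *prev;
    unsigned long expires;              /* tick count at which timeout occurs */
    unsigned long data;                 /* argument handed to function */
    void (*function)(unsigned long);
};

static struct model_timer *active_timers;   /* head of the active list */
static unsigned long model_jiffies;         /* simulated tick counter */

static void model_init_timer(struct model_timer *t)
{
    t->next = t->prev = NULL;
}

static void model_add_timer(struct model_timer *t)
{
    t->prev = NULL;
    t->next = active_timers;
    if (active_timers)
        active_timers->prev = t;
    active_timers = t;
}

/* Returns 1 if the timer was on the list and has been removed, else 0. */
static int model_del_timer(struct model_timer *t)
{
    if (t != active_timers && !t->prev)
        return 0;                       /* not on the list */
    if (t->prev)
        t->prev->next = t->next;
    else
        active_timers = t->next;
    if (t->next)
        t->next->prev = t->prev;
    t->next = t->prev = NULL;
    return 1;
}

/* Advance the clock by one tick and fire every timer that has expired. */
static void model_tick(void)
{
    struct model_timer *t, *next;

    model_jiffies++;
    for (t = active_timers; t; t = next) {
        next = t->next;
        if (t->expires <= model_jiffies) {
            model_del_timer(t);         /* removed automatically at timeout */
            t->function(t->data);
        }
    }
}

/* Example timeout function: increments the int whose address is in ptr. */
static void count_timeout(unsigned long ptr)
{
    ++*(int *)ptr;
}
```

A timer armed with expires = model_jiffies + 4 * HZ fires on the 400th call to model_tick() in this model, and a later model_del_timer() on it returns 0 because the timer is no longer on the list.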
4.7 Interrupt Handling
We'll have a short discussion here on the Linux approach to IA-32 style hardware interrupts, with the assumption that
the reader is familiar with the 'traditional' irq -> PIC/APIC <-> CPU interrupt mechanism. The interrupt handler
does not run within the context of a process and cannot transfer data to/from user space. The interrupt handler starts
executing with hardware interrupts disabled, but can reenable them if it so wishes, masking irq's appropriately before
the sti. Other than that, the interrupt handler is normal C code. The writer of that code needs to understand how the
handler must interact with the hardware. For example, some devices will not issue another interrupt until the
interrupt handler has acknowledged its response to the current irq signal, perhaps by clearing a specified I/O port.
4.7.1 The Bottom Half Mechanism
The handler needs to do its work quickly and efficiently. If there are subtasks that require significant time but are
not urgent, they can be deferred until later. This is the so called 'bottom-half' mechanism provided by Linux. There
are, in fact, only 32 'genuine' bottom halves available and the average joe device driver writer won't have one
assigned to his/her use. However, a driver without a genuine bottom half can employ the immediate queue to
provide bottom half functionality. What one does is to declare a task queue element (a struct tq_struct), initialize
its routine field to point at the bottom half code you wrote, initialize its data field as needed, and then add the
initialized element to the immediate queue. Finally mark_bh(IMMEDIATE_BH) is called to schedule the function
which will later execute all the functions in the immediate queue.
4.7.2 An Example Bottom Half
Let's say we have an interrupt handler, my_irq_handler, to which we want to add a bottom half, say,
void some_bottom_half();
We then take these steps:
• declare a task struct e.g.
#include <linux/tqueue.h>
static struct tq_struct some_bh;
• initialize the struct somewhere appropriate such as in init_module e.g.
some_bh.routine = (void *)&some_bottom_half;
some_bh.data = NULL;
some_bh.sync = 0;
• add code to my_irq_handler to enqueue and mark the bottom half e.g.
queue_task(&some_bh, &tq_immediate);
mark_bh(IMMEDIATE_BH);
We note that the bottom half is actually taken care of by the tasklet mechanism in the 2.4 series kernel.
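The enqueue, mark and run sequence of the immediate queue can be modelled in user space. The sketch below is not the kernel implementation; the model_ names and the count_bottom_half routine are hypothetical. It assumes that the sync field acts as an "already queued" flag, so that a task enqueued twice before the queue runs is executed only once.

```c
#include <stddef.h>

struct model_tq {
    struct model_tq *next;          /* link to the next queued task */
    unsigned long sync;             /* nonzero while the task is queued */
    void (*routine)(void *);        /* the deferred function */
    void *data;                     /* argument passed to routine */
};

struct model_queue {
    struct model_tq *head, *tail;   /* pending tasks in FIFO order */
    int marked;                     /* set by model_mark(), like mark_bh() */
};

/* Queue a task unless it is already queued; returns 1 if it was added. */
static int model_queue_task(struct model_tq *t, struct model_queue *q)
{
    if (t->sync)
        return 0;
    t->sync = 1;
    t->next = NULL;
    if (q->tail)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
    return 1;
}

/* Request that the queue be run at the next opportunity. */
static void model_mark(struct model_queue *q)
{
    q->marked = 1;
}

/* Run everything queued so far, if a run has been requested. */
static void model_run_queue(struct model_queue *q)
{
    struct model_tq *t;

    if (!q->marked)
        return;
    q->marked = 0;
    t = q->head;
    q->head = q->tail = NULL;
    while (t) {
        struct model_tq *next = t->next;

        t->sync = 0;                /* the task may be queued again from now on */
        t->routine(t->data);
        t = next;
    }
}

/* Example bottom half: counts how often it has run. */
static void count_bottom_half(void *data)
{
    ++*(int *)data;
}
```

In this model the interrupt handler's part corresponds to model_queue_task() followed by model_mark(); the call to model_run_queue() stands for the later point at which the kernel executes the deferred work.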
4.7.3 The Tasklet Alternative
The tasklet is quite similar to a task in a predefined task queue. Further, it runs in interrupt time so the constraints of
section 4.5.2 apply. Other important properties of tasklets include these, copied from interrupt.h:
• If tasklet_schedule() is called, then tasklet is guaranteed to be executed on some cpu at least once after this.
• If the tasklet is already scheduled, but its execution is still not started, it will be executed only once.
• If this tasklet is already running on another CPU (or schedule is called from tasklet itself), it is rescheduled for
later.
• Tasklet is strictly serialized wrt itself, but not wrt another tasklets. If client needs some intertask synchronization
he makes it with spinlocks.
The tasklet_struct follows:
struct tasklet_struct {
    struct tasklet_struct *next;
    unsigned long state;
    atomic_t count;
    void (*func)(unsigned long);
    unsigned long data;
};
4.7.4 A Tasklet Example
Let's say we have an interrupt handler, my_irq_handler, to which we want to add a bottom half via the tasklet
mechanism, say,
void some_bottom_half();
We then take these steps:
• ensure you have the needed header
#include <linux/interrupt.h>
• declare and initialize the tasklet_struct:
DECLARE_TASKLET(some_bh, some_bottom_half, 0);
• add code to my_irq_handler to schedule the bottom half e.g.
tasklet_schedule(&some_bh);
Note that you do not need to separately declare:
struct tasklet_struct some_bh;
The DECLARE_TASKLET takes care of that.
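Two of the tasklet properties can be illustrated with a small user-space model: scheduling a tasklet twice before it runs executes it only once, and a tasklet whose count field is nonzero is held back. This is a single-CPU sketch with hypothetical model_ names; treating bit 0 of state as the "scheduled" flag and a nonzero count as "disabled" are assumptions of the model, not statements taken from the kernel source.

```c
#include <stddef.h>

struct model_tasklet {
    struct model_tasklet *next;     /* link in the pending list */
    unsigned long state;            /* bit 0 set while scheduled */
    int count;                      /* nonzero means disabled */
    void (*func)(unsigned long);
    unsigned long data;
};

static struct model_tasklet *pending;   /* tasklets waiting to run */

/* Schedule a tasklet; a second call before it runs has no further effect. */
static void model_tasklet_schedule(struct model_tasklet *t)
{
    if (t->state & 1UL)
        return;
    t->state |= 1UL;
    t->next = pending;
    pending = t;
}

/* Run all pending tasklets; disabled ones stay pending for a later pass. */
static void model_tasklet_action(void)
{
    struct model_tasklet *list = pending;

    pending = NULL;
    while (list) {
        struct model_tasklet *t = list;

        list = list->next;
        if (t->count != 0) {        /* disabled: put it back on the list */
            t->next = pending;
            pending = t;
            continue;
        }
        t->state &= ~1UL;           /* func itself may schedule the tasklet again */
        t->func(t->data);
    }
}

/* Example tasklet function: counts how often it has run. */
static void count_run(unsigned long data)
{
    ++*(int *)data;
}
```

Here model_tasklet_schedule() plays the role of the call made from my_irq_handler, and model_tasklet_action() stands for the later point at which the kernel runs the pending tasklets.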
4.8 The Process Scheduler
4.8.1 Introduction to the scheduler
The Linux kernel is currently not preemptive: code executing in kernel mode lies outside the realm of the scheduler,
whose main job is to pick the next process to run. More specifically, we can state that
• There is no mechanism by which a 'higher priority' process can preempt a kernel mode process, but the latter can
decide to relinquish control.
• A kernel process can be interrupted by an interrupt/exception handler. Upon completion of the handler, control
returns to the interrupted kernel process.
• The interrupt/exception handler is itself a kernel mode process and can be interrupted by an interrupt/exception
handler.
• Kernel mode processes can 'turn off' external hardware interrupts as appropriate.
The scheduler for the current Linux 2.6 series has changed. Further, kernel processes there can be configured as
preemptive. We focus on the 2.4 series here. In any case, it makes a good first exposure to scheduling. The excellent
O'Reilly book, Understanding the Linux Kernel by Bovet & Cesati, has a good chapter on this topic and, if you go to
the O'Reilly web site, you will find that the description of this book contains the chapter on
the scheduler as a downloadable example.
Recall that a process can exist in one of a possible set of states. For Linux, these are
• TASK_RUNNING
• TASK_INTERRUPTIBLE
• TASK_UNINTERRUPTIBLE
• TASK_ZOMBIE
• TASK_STOPPED
To determine the next process to run, the scheduler chooses from among processes in the TASK_RUNNING state.
It is assumed here that the reader has had some exposure to the concepts used in schedulers, so that no time will be
spent on general background. Further, we will not discuss scheduling for SMP machines. In this section, we will cover
• scheduling policies and preemption
• when does the scheduler execute?
• process goodness and priorities
• the epoch
• the scheduling algorithm
4.8.2 Scheduling Policies and Preemption
In <linux/sched.h>, we find the three Linux scheduling policies:
#define SCHED_OTHER 0
#define SCHED_FIFO 1
#define SCHED_RR 2
Normal user tasks will run under the SCHED_OTHER policy. As such they are preemptible and run in a time sliced
environment involving dynamic priorities, to be described later.
SCHED_FIFO is a (soft) real-time policy. A SCHED_FIFO process is not time sliced and will execute until one of the
following conditions becomes true:
• it completes
• it blocks for I/O
• it relinquishes the CPU by calling sched_yield()
• a higher priority process enters the TASK_RUNNING state
SCHED_RR is also a (soft) real-time policy. However, SCHED_RR processes are subject to a time slice. A set of
SCHED_RR processes having the same priority would be scheduled in a classic round robin fashion with respect to
each other. Such a process will complete its time slice unless one of the following occurs:
• it completes
• it blocks for I/O
• it relinquishes the CPU by calling sched_yield()
• a higher priority process enters the TASK_RUNNING state
If it is preempted, it is placed at the head of its queue. Next time it runs it completes its preempted time slice. On
the other hand, if the SCHED_RR process completes its time quantum, it is placed at the tail of its queue in the
traditional round robin fashion.
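From user space, a process can ask which of these policies it is running under by means of the POSIX call sched_getscheduler(). The following is a minimal sketch; policy_name and my_policy_name are hypothetical helper names introduced only for this illustration.

```c
#include <sched.h>

/* Map a policy number to the name used in <linux/sched.h>. */
static const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "unknown";
    }
}

/* Name of the policy the calling process currently runs under. */
static const char *my_policy_name(void)
{
    return policy_name(sched_getscheduler(0));   /* pid 0 means the caller */
}
```

An ordinary user process would normally be expected to report SCHED_OTHER; switching to one of the real-time policies is done with sched_setscheduler() and generally requires suitable privileges.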
4.8.3 When does the scheduler execute?
There are several ways that scheduler execution is triggered. These can be categorized as direct and indirect.
Direct - a call to schedule()
A process running in kernel mode can make a call to schedule(). If you look for references to schedule() via the Linux
cross reference web site, you'll see that it is called in many places such as
• file system code
• memory management code
• network management code
• many drivers
A typical scenario is this:
• A piece of code needs to block.
• It puts itself on the appropriate wait queue.
• It calls the scheduler.
Indirect - via need_resched = 1
The task struct has a field, need_resched, which is checked when returning to user mode from an interrupt or
exception. If this field equals 1, schedule() is called. Hence any time a process sets need_resched to 1, this ensures
that schedule() will be called in the near future. Setting need_resched to 1 occurs in the following cases:
• when sched_setscheduler() or sched_yield() is called
• when a process is awakened and has higher goodness than the current process
• when the current process exhausts its time quantum
4.8.4 The Epoch
From the scheduler's viewpoint, CPU time is divided into epochs as a means of encapsulating a group of runnable
processes and their respective time quanta. A pseudocode overview follows:
epoch_init:
• set quantum value for every process, except TASK_ZOMBIE processes
start_epoch:
• choose highest goodness TASK_RUNNING process to run (goodness is
discussed in Section 4.8.5)
• run that process until it blocks, is preempted, relinquishes the CPU
voluntarily, or finishes its time quantum
• if all runnable processes have exhausted their quanta, go to epoch_init
• else go to start_epoch
4.8.5 Process Goodness and Priorities
To make a scheduling decision, Linux calculates what is called the 'goodness' of each process currently in the
TASK_RUNNING state and then chooses the process having the highest value of goodness to run next. Linux uses
other parameters called priorities as constituents of goodness and therefore was forced to invent a new term,
'goodness', rather than overloading the word 'priority'.
The goodness of SCHED_FIFO and SCHED_RR processes
The goodness of SCHED_FIFO and SCHED_RR processes lies in a range well above the goodness of any
SCHED_OTHER process. Hence, a SCHED_OTHER process will never be chosen if there is an available (soft)
real-time process.
Let's consider how the goodness of a process is calculated. For a SCHED_FIFO or SCHED_RR process,
goodness = 1000 + rt_priority
where
1 <= rt_priority <= 99
Note that rt_priority is a field in the task structure. The scheduler does not change rt_priority, so it is called a 'static'
priority. However, under certain conditions, the rt_priority of a real-time process can be changed by system calls not
discussed here.

The goodness of SCHED_OTHER processes
The SCHED_OTHER goodness is somewhat more complex, is dynamic, and (as expected) does not depend on
rt_priority. In this case, the goodness depends on two other fields from the task structure
• priority - both the base time quantum and base priority for the process
• counter - number of timer ticks (via irq0) left to the process before its time quantum expires
The goodness is given by
goodness = priority + counter
Now the counter is decremented each timer tick, and when it reaches zero the process has exhausted its time
quantum. At that point, the formula above is replaced by setting
counter = 0
goodness = 0.
The base time quantum is initialized to DEF_PRIORITY for process 0, where currently
#define DEF_PRIORITY (20*HZ/100)
At the start of a new epoch, the new value of counter for each process is given by
counter = priority + counter/2.
Hence if the process is one that has just exhausted its quantum (counter = 0), it gets a new counter value equal to its
base quantum. However, if the process is, for example, in the TASK_INTERRUPTIBLE state, its counter will be
enhanced at the start of every epoch. This gives some preference to I/O bound processes.
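The goodness formulas and the start-of-epoch update given above can be collected into a short user-space model. This is a sketch of the rules exactly as stated in this section, not the kernel's own goodness() function; the model_ names and the MODEL_SCHED_ constants are hypothetical.

```c
#define MODEL_SCHED_OTHER 0
#define MODEL_SCHED_FIFO  1
#define MODEL_SCHED_RR    2

struct model_task {
    int policy;         /* one of the MODEL_SCHED_ values */
    int rt_priority;    /* 1..99, used only by the real-time policies */
    int priority;       /* base time quantum, in timer ticks */
    int counter;        /* ticks left in the current quantum */
};

/* Goodness as described in the text. */
static int model_goodness(const struct model_task *p)
{
    if (p->policy != MODEL_SCHED_OTHER)
        return 1000 + p->rt_priority;   /* static, always above SCHED_OTHER */
    if (p->counter == 0)
        return 0;                       /* quantum exhausted */
    return p->priority + p->counter;
}

/* Counter update applied to a process at the start of a new epoch. */
static void model_new_epoch(struct model_task *p)
{
    p->counter = p->priority + p->counter / 2;
}
```

With priority = 20, a process that exhausted its quantum restarts with counter = 20, while a process that slept through successive epochs with counter = 20 moves to 30, then 35, and so on. Since a counter below twice the priority always yields a new value below twice the priority, the bonus for sleeping processes is bounded.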
At a fork, the child always inherits the base time quantum of its parent. It is possible, albeit rare, for a process to
change its base time quantum. As a result, most processes in the system have the same base time quantum,
DEF_PRIORITY. Also at a fork, the counter of the parent is split in two, half going to the parent and half to the
child.
4.8.6 The Scheduling Algorithm
Starting with a very high level, coarse viewpoint, the scheduler does this:
• does some general housekeeping such as executing all interrupt handler bottom halves and deferred processes on
task queues
• calculates the goodness for the processes in the TASK_RUNNING state to determine the next process to run
• turns the CPU over to the chosen process
In this section, we'll look more closely at this scenario; it will perhaps take several readings to assimilate.
After understanding the somewhat more detailed steps that follow, you might go to the source code itself.
1. Run any deferred tasks in queue tq_scheduler.
2. Run any pending bottom halves.
3. Save current in local variable, prev.
4. If (prev is a SCHED_RR process), then assign it a new quantum and put it at the end of the run queue.
5. If (prev is in state TASK_INTERRUPTIBLE and has nonblocked, pending signals), then make its state
TASK_RUNNING.
6. If (prev is not in the TASK_RUNNING state), then remove it from the run queue.
7. If the run queue is empty, point next at the idle_task. Otherwise, find the process in the run queue which has the
highest goodness and reference that process with next.
• If there is a tie for highest non zero goodness between prev and some other process, prev is chosen to save
a context switch.
• If all the runnable processes have zero goodness, this is the end of an epoch and a new quantum is assigned
to all processes except TASK_ZOMBIE processes.
8. If (prev != next) then update the context switch statistics and perform a context switch from prev to next.
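Step 7 is the heart of the algorithm and can be sketched as a user-space model. The code below is an illustration under stated assumptions, not the kernel's schedule(): the model_ names are hypothetical, the run queue is represented by a plain array with a runnable flag, and a NULL return stands for choosing the idle task.

```c
#include <stddef.h>

struct model_task {
    int realtime;       /* nonzero for SCHED_FIFO or SCHED_RR */
    int rt_priority;    /* 1..99 for real-time tasks */
    int priority;       /* base time quantum (assumed positive) */
    int counter;        /* ticks left in the current quantum */
    int runnable;       /* nonzero if in the TASK_RUNNING state */
};

static int model_goodness(const struct model_task *p)
{
    if (p->realtime)
        return 1000 + p->rt_priority;
    return p->counter ? p->priority + p->counter : 0;
}

/*
 * Choose the next task from tasks[0..n-1]; prev is the task that was running.
 * Returns NULL when nothing is runnable (the caller would pick the idle task).
 */
static struct model_task *model_pick_next(struct model_task *tasks, size_t n,
                                          struct model_task *prev)
{
    struct model_task *next = NULL;
    int pass;

    for (pass = 0; pass < 2; pass++) {
        int best = -1;
        size_t i;

        next = NULL;
        if (prev && prev->runnable) {   /* considered first, so prev wins ties */
            next = prev;
            best = model_goodness(prev);
        }
        for (i = 0; i < n; i++) {
            int g;

            if (!tasks[i].runnable)
                continue;
            g = model_goodness(&tasks[i]);
            if (g > best) {             /* strictly greater: ties keep the earlier choice */
                best = g;
                next = &tasks[i];
            }
        }
        if (next == NULL || best > 0)
            break;
        /* all runnable tasks have zero goodness: start a new epoch */
        for (i = 0; i < n; i++)
            tasks[i].counter = tasks[i].priority + tasks[i].counter / 2;
    }
    return next;
}
```

Because another task replaces prev only when its goodness is strictly greater, a tie leaves prev in place and no context switch is needed. The new-epoch update is applied to every task in the array, including those that are not runnable, which is how sleeping processes accumulate their bonus.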