
Modeling device driver effects in real-time schedulability analysis:
Study of a network driver∗

Mark Lewandowski, Mark J. Stanovich, Theodore P. Baker, Kartik Gopalan, An-I Andy Wang
Department of Computer Science
Florida State University
Tallahassee, FL 32306-4530
e-mail: [lewandow, stanovic, baker, awang]@cs.fsu.edu, kartik@cs.binghamton.edu

∗ Based upon work supported in part by the National Science Foundation under Grant No. 0509131, and a DURIP equipment grant from the Army Research Office.

Abstract

Device drivers are integral components of operating systems. The computational workloads imposed by device drivers tend to be aperiodic and unpredictable because they are triggered in response to events that occur in the device, and may arbitrarily block or preempt other time-critical tasks. This characteristic poses significant challenges in real-time systems, where schedulability analysis is essential to guarantee system-wide timing constraints. At the same time, device driver workloads cannot be ignored. Demand-based schedulability analysis is a technique that has been successful in validating the timing constraints in both single and multiprocessor systems. In this paper we present two approaches to demand-based schedulability analysis of systems that include device drivers. First, we derive load-bound functions using empirical measurement techniques. Second, we modify the scheduling of network device driver tasks in Linux to implement an algorithm for which a load-bound function can be derived analytically. We demonstrate the practicality of our approach through detailed experiments with a network device under Linux. Our results show that, even though the network device driver does not conform to conventional periodic or sporadic task models, it can be successfully modeled using hyperbolic load-bound functions that are fitted to empirical performance measurements.

1 Introduction

Device drivers are the software components for managing I/O devices. Traditionally, device drivers for common hardware devices (e.g., network cards and hard disks) are implemented as part of the operating system kernel for performance reasons. Device drivers have also traditionally been a weak spot of most operating systems, especially in terms of accounting and control of the resources consumed by these software components. Each device driver's code may run in multiple (possibly concurrent) execution contexts, which makes resource accounting difficult, if not impossible. For instance, Linux device drivers are scheduled in a hierarchy of ad hoc mechanisms, namely hard interrupt service routines (ISRs), softirqs, and process or thread contexts, in decreasing order of execution priority.

While the traditional ways of scheduling device drivers can be tolerated in best-effort systems, they tend to present a problem for real-time systems. Real-time systems need to guarantee that certain workloads can be completed within specified time constraints. This implies that any workload within a real-time system must be amenable to schedulability analysis, which is defined as the application of abstract workload and scheduling models to predict the ability of the real-time system to meet all of its timeliness guarantees.

The workloads imposed by device drivers tend to be aperiodic and hard to characterize, and they defy schedulability analysis because much of their computational workload is triggered by unpredictable events (e.g., arrival of network packets or completion of disk I/O). There may be blocking due to nonpreemptable critical sections within device drivers, and preemption due to ISR code that executes in response to a hardware interrupt. The interference caused by device drivers on the execution of time-critical tasks, through
such blocking and preemption, needs to be accurately modeled and included in the schedulability analysis of the system. In addition, the device drivers themselves may have response time constraints, imposed by the need to maintain some quality of I/O service.

In this paper we present two approaches to demand-based schedulability analysis of systems including device drivers, based on a combination of analytically and empirically derived load-bound functions. Demand-based schedulability analysis views the schedulability analysis problem in terms of supply and demand. One defines a measure of computational demand and then shows that a system can meet all deadlines by proving that demand in any time interval cannot exceed the computational capacity of the available processors. This analysis technique has been successfully applied to several abstract workload models and scheduling algorithms, for both single and multiprocessor systems [9, 1, 3, 2].

Aperiodic device-driver tasks present a special challenge for demand-based schedulability analysis, because their potential computational demand is unknown. In principle, analysis would be possible if they were scheduled according to an algorithm that budgets compute time. However, the common practice in commodity operating systems is to schedule them using the combination of ad hoc mechanisms described above, for which it may be impractical or impossible to derive an analytical bound on the interference that the device driver tasks may cause other time-critical tasks. There are two possible approaches to analysis:

1. Derive a load-bound function for the driver empirically.

2. Modify the way device driver tasks are scheduled in the operating system, to use a time-budgeting algorithm for which a load-bound function can be derived analytically.

In the rest of this paper we evaluate both of the above approaches, using a device driver as a case study – the Linux e1000 driver for the Intel Pro/1000 family of Ethernet network interface adapters. We focus on demand-based schedulability analysis using fixed-priority scheduling in a uniprocessor environment.

2 Demand Analysis

Our view of demand analysis is derived from studies of traditional workload models [9, 1, 3, 2], which are based on the concepts of job and task. A job is a schedulable component of computational work with a release time (earliest start time), a deadline, and an execution time. The computational demand of a job J in a given time interval [a, b) for a given schedule, denoted by demand_J(a, b), is defined to be the actual amount of processor time consumed by that job within the interval.

Suppose there is a single processor, scheduled according to a policy that is priority driven. Every job will be completed on time as long as the sum of its own execution time and the interference caused by the execution of other higher priority jobs within the time window during which the job must be completed adds up to no more than the length of the window. That is, suppose J = {J_1, J_2, ...} is the (possibly infinite) collection of jobs to be scheduled, numbered in order of decreasing priority. A job J_k released at time r_k with deadline r_k + d_k and execution time e_k will be completed by its deadline if

    e_k + \sum_{i<k} \mathrm{demand}_{J_i}(r_k, r_k + d_k) \le d_k        (1)

Traditional schedulability analysis relies on imposing constraints on the release times, execution times, and deadlines of the jobs of a system to ensure that inequality (1) is satisfied for every job. This is done by characterizing each job as belonging to one of a finite collection of tasks. A task is an abstraction for a collection of possible sequences of jobs.

The best understood type of task is periodic, with release times separated by a fixed period p_i, deadlines at a fixed offset d_i relative to the release times, and actual execution times bounded by a fixed worst-case execution time e_i. A sporadic task is a slight relaxation of the periodic task model, in which the period p_i is only a lower bound on the separation between the release times of the task's jobs.

The notions of computational demand and interference extend naturally to tasks. The function demand^max_{τ_i}(∆) is the maximum of the combined demands of all the jobs of τ_i in any time interval of length ∆, taken over all possible job sequences of τ_i. That is, if S is the collection of all possible job sequences of τ_i, then

    \mathrm{demand}^{\max}_{\tau_i}(\Delta) \stackrel{\mathrm{def}}{=} \max_{S \in \mathcal{S},\, t > 0} \sum_{J \in S} \mathrm{demand}_J(t - \Delta, t)        (2)

Restated in terms of tasks, the same reasoning says that a task τ_k with relative deadline d_k will always meet its deadline if the sum of its own execution time and the interference of higher priority tasks within any time window of length d_k never exceeds d_k. That is, suppose there is a set of tasks τ = τ_1, ..., τ_n, numbered in order of decreasing priority, with d_k ≤ p_k. Then every job of
τ_k will complete within its deadline if

    e_k + \sum_{i=1}^{k-1} \mathrm{demand}^{\max}_{\tau_i}(d_k) \le d_k        (3)

A core observation for preemptive fixed-priority scheduling of periodic and sporadic tasks is the following traditional demand bound:

    \mathrm{demand}^{\max}_{\tau_i}(\Delta) \le \left\lceil \frac{\Delta}{p_i} \right\rceil e_i        (4)

This says that the maximum computation time of τ_i in any interval of length ∆ can be no more than the maximum execution time required by one job of τ_i, multiplied by the maximum number of jobs of τ_i that can execute in that interval. Replacing the maximum demand in (3) by the expression on the right of (4) leads to a well known response test for fixed-priority schedulability, i.e., τ_k is always scheduled to complete by its deadline if

    e_k + \sum_{i=1}^{k-1} \left\lceil \frac{d_k}{p_i} \right\rceil e_i \le d_k        (5)
In Section 4 we report experiments in which we measured the actual interfering processor demand due to certain high priority tasks, over time intervals of different lengths. When we computed the maximum observed demand from that data, we observed that it never reached the level of the traditional demand bound, given by the expression on the right of (4). That is because the traditional bound over-estimates the actual worst-case execution time of τ_i in many intervals, by including the full execution time e_i even in cases where the interval is not long enough to permit it. For example, suppose p_i = d_i = 7 and e_i = 2, and consider the case ∆ = 8. Since the release times of τ_i must be separated by at least 7, the maximum amount of time that τ_i can execute in any interval of length 8 is 3, not 4.

In this paper we introduce a refined demand bound, obtained by including only the portion of the last job's execution time that fits into the interval, as follows:

    \mathrm{demand}^{\max}_{\tau_i}(\Delta) \le j e_i + \min(e_i,\ \Delta - j p_i)        (6)

where

    j \stackrel{\mathrm{def}}{=} \left\lfloor \frac{\Delta}{p_i} \right\rfloor

The difference between the traditional demand bound and our refined bound in (6) is shown by Figure 1 for a periodic task with p_i = 7 and e_i = 2. The two bounds are equal between points that correspond to the earliest and latest possible completion times of jobs, but the refined bound is tighter for other points.

[Figure 1: demand bound versus interval length, comparing the traditional, refined, and linear demand bounds. Caption: Comparison of the demand bounds of (5) and (6), for a periodic task with p_i = 7 and e_i = 2.]

The diagonal line in the figure corresponds to a simplified upper bound function, obtained by interpolating linearly between the points ∆ = j p_i + e_i at which the traditional and refined demand bounds converge. At these points, the expression on the right of (6) reduces to

    \left( \frac{\Delta - e_i}{p_i} + 1 \right) e_i = u_i (\Delta + p_i - e_i)

and so

    \mathrm{demand}^{\max}_{\tau_i}(\Delta) \le \min(\Delta,\ u_i (\Delta + p_i - e_i))        (7)

The above definitions and analyses can also be expressed in terms of the ratio of demand to interval length, which we call load. That is, load_{τ_i}(t − ∆, t) is defined as demand_{τ_i}(t − ∆, t)/∆, and load^max_{τ_i}(∆) as demand^max_{τ_i}(∆)/∆. It follows from (3) that a task τ_k will always complete by its deadline if

    \frac{e_k}{d_k} + \sum_{i=1}^{k-1} \mathrm{load}^{\max}_{\tau_i}(d_k) \le 1        (8)

That is, to verify that a task τ_k always completes by its deadline, it is sufficient to add the percentage of CPU time used by all higher priority tasks and the percentage required by τ_k in any interval of length d_k. If this sum is less than or equal to one, then there is enough CPU time available for τ_k to finish its work on time.

The corresponding refined load-bound function of a periodic task can be derived by dividing (6) by ∆, resulting in

    \mathrm{load}^{\max}_{\tau_i}(\Delta) \le \frac{j e_i + \min(e_i,\ \Delta - j p_i)}{\Delta}        (9)

where j is defined as in (6), and a simplified hyperbolic load bound can be obtained by dividing (7) by ∆, resulting in

    \mathrm{load}^{\max}_{\tau_i}(\Delta) \le \min\left(1,\ u_i \left(1 + \frac{p_i - e_i}{\Delta}\right)\right)        (10)
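Both load bounds are simple to evaluate numerically. The following C helpers are our illustrative sketch (not code from the paper); they compute the right-hand sides of (9) and (10):

```c
#include <math.h>

/* Refined load bound, right side of (9), for a periodic task with
   period p and worst-case execution time e. */
static double refined_load_bound(double delta, double p, double e)
{
    double j = floor(delta / p);            /* j as defined in (6) */
    double last = fmin(e, delta - j * p);   /* part of the last job that fits */
    return (j * e + last) / delta;
}

/* Hyperbolic load bound, right side of (10). */
static double hyperbolic_load_bound(double delta, double p, double e)
{
    double u = e / p;                       /* utilization factor u_i */
    return fmin(1.0, u * (1.0 + (p - e) / delta));
}
```

For the task of Figures 1 and 2 (p = 7, e = 2), refined_load_bound(8, 7, 2) evaluates to 3/8, matching the example above, and both functions converge to u = 2/7 as the interval length grows.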
The refined load-bound function on the right of (9) and the hyperbolic approximation on the right of (10) are compared to the traditional load-bound function in Figure 2, for the same task as Figure 1.

[Figure 2: load bound versus interval length, showing the traditional load bound, the hyperbolic bound, the refined load bound, and the utilization level. Caption: Load bounds for a periodic task with p_i = 7 and e_i = 2.]

We find the load-based formulation more intuitive, since it allows us to view the interference that a task may cause other tasks as a percentage of the total available CPU time, which converges to the utilization factor u_i = e_i/p_i for sufficiently long intervals. This can be seen in Figure 2, where all the load bounds converge to the limit 2/7. Because of the critical zone property [9], an upper bound on the percentage interference a periodic or sporadic task would cause any job of a given lower priority task can be discovered by reading the Y-value of any of these load-bound functions at the X-value that corresponds to the deadline of the lower priority task.

Demand-based schedulability analysis extends from periodic and sporadic tasks to non-periodic tasks through the introduction of aperiodic server thread scheduling algorithms, for which a demand-bound function similar to the one above can be shown to apply even to non-periodic tasks. The simplest such scheduling algorithm is the polling server [14], in which a task with a fixed priority level (possibly the highest) and a fixed execution budget is scheduled periodically and allowed to execute until it has consumed the budgeted amount of execution time, or until it suspends itself voluntarily (whichever occurs first). Other aperiodic server scheduling policies devised for use in a fixed-priority preemptive scheduling context include the Priority Exchange, Deferrable Server [16, 8], and Sporadic Server (not to be confused with a sporadic task) algorithms [15, 10]. These algorithms improve upon the polling server by allowing a thread to suspend itself without giving up its remaining budget, and so are termed bandwidth-preserving algorithms.
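As an illustration of the idea, a polling server reduces to a periodic task that drains a work queue until its budget is exhausted. This is our sketch, not code from the paper, and the queue and timer primitives are hypothetical stand-ins:

```c
/* Hypothetical primitives, declared only to make the sketch self-contained. */
extern int  work_available(void);
extern void do_one_unit_of_work(void);       /* one budget unit of aperiodic work */
extern void sleep_until_next_period(void);

/* Polling server: released periodically with a fresh budget. Budget left
 * over when the queue empties is forfeited; bandwidth-preserving servers
 * (Deferrable Server, Sporadic Server) instead retain it. */
void polling_server(int budget_units)
{
    for (;;) {
        int budget = budget_units;           /* replenished at each release */
        while (budget > 0 && work_available()) {
            do_one_unit_of_work();
            budget--;
        }
        sleep_until_next_period();           /* unused budget is lost */
    }
}
```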
[Figure 3: the three scheduling levels (Level 1 generic ISR, Level 2 driver ISR, Level 3 softirq) and their placement in interrupt versus thread context, for vanilla Linux and Timesys Linux. Caption: Comparison of execution contexts for vanilla and Timesys Linux.]

3 The Linux e1000 Driver

Network interface device drivers are representative of the devices that present the biggest challenge for modeling and schedulability analysis, because they generate a very large workload with an unpredictable arrival pattern. Among network devices, we chose a gigabit Ethernet device for its high data rate, and the Intel Pro/1000 because it has one of the most advanced open-source drivers, namely the e1000 driver. This section describes how the e1000 driver is scheduled in the Linux kernel.

The Linux e1000 driver implements the new Linux API (NAPI) for network device drivers [11], which leaves the hardware interrupts for incoming packets disabled as long as there are queued received packets that have not been processed. The device interrupt is only re-enabled when the server thread has polled, discovered it has no more work, and so suspends itself. These mechanisms were originally developed to reduce receive livelock, but also have the effect of reducing the number of per-packet hardware interrupts.

The device-driven workload of the e1000 driver can be viewed as two device-driven tasks: (1) input processing, which includes dequeuing packets that the device has previously received and copied directly into system memory, and replenishing the list of DMA buffers available to the device for further input; (2) output processing, which includes dequeuing packets already sent and enqueueing more packets to send. In both cases, execution is triggered by a hardware interrupt, which causes execution of a hierarchy of handlers and threads.
The scheduling of the e1000 device-driven tasks can be described as occurring at three levels. The scheduling of the top two levels differs between the two Linux kernel versions considered here (Figure 3), which are the standard "vanilla" 2.6.16 kernel from kernel.org, and Timesys Linux, a version of the 2.6.16 kernel patched by Timesys Corporation to better support real-time applications.

Level 1. The hardware preempts the currently executing thread and transfers control to a generic interrupt service routine (ISR), which saves the processor state and eventually calls a Level 2 ISR installed by the device driver. The Level 1 processing is always preemptively scheduled at the device priority. The only way to control when such an ISR executes is to selectively enable and disable the interrupt at the hardware level.

Level 2. The driver's ISR does the minimum amount of work necessary, and then requests that the rest of the driver's work be scheduled to execute at Level 3 via the kernel's "softirq" (software interrupt) mechanism. In vanilla Linux this Level 2 processing is called directly from the Level 1 handler, and so it is effectively scheduled at Level 1. In contrast, Timesys Linux defers the Level 2 processing to a scheduled kernel thread, one thread per IRQ number on the x86 architectures.

Level 3. The softirq handler does the rest of the driver's work, including call-outs to perform protocol-independent and protocol-specific processing. In vanilla Linux, the Level 3 processing is scheduled via a complicated mechanism with two sub-levels: a limited number of softirq calls are executed ahead of the system scheduler, on exit from interrupt handlers and at other system scheduling points. Repeated rounds over the list of pending softirq handlers are made, allowing each handler to execute to completion without preemption, until either all have been cleared or a maximum iteration count is reached. Any softirqs that remain pending are served by a kernel thread. The reason for this ad hoc approach is to achieve a balance between throughput and responsiveness. Using this mechanism produces very unpredictable scheduling results, since the actual instant and priority at which a softirq handler executes can be affected by any number of dynamic factors. In contrast, the Timesys kernel handles softirqs entirely in threads; there are two such threads for network devices, one for input processing and one for output processing.
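The control flow across the three levels, combined with the NAPI discipline described above, can be summarized in schematic C. This is a simplified sketch of the structure, not the actual e1000 or kernel source, and the helper names are hypothetical stand-ins:

```c
/* Hypothetical device and kernel primitives, declared for the sketch. */
extern void device_irq_disable(void), device_irq_enable(void);
extern void schedule_rx_softirq(void);    /* request Level 3 processing */
extern int  rx_queue_nonempty(void);
extern void dequeue_received_packet(void);
extern void replenish_dma_buffer(void);

/* Level 2: the driver's ISR, reached from the Level 1 generic ISR.
 * It does minimal work and defers the rest, NAPI style. */
void driver_isr(void)
{
    device_irq_disable();           /* no further packet interrupts for now */
    schedule_rx_softirq();
}

/* Level 3: softirq handler (borrowed context in vanilla Linux, a kernel
 * thread in Timesys). Polls until the receive queue is empty, then
 * re-enables the interrupt, as NAPI requires. */
void rx_softirq(void)
{
    while (rx_queue_nonempty()) {
        dequeue_received_packet();  /* packet was already DMA'd into memory */
        replenish_dma_buffer();     /* give the device a fresh buffer */
    }
    device_irq_enable();            /* only now may the device interrupt again */
}
```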
The arrival processes of the e1000 input and output processing tasks generally need to be viewed as aperiodic, although there may be cases where the network traffic inherits periodic or sporadic characteristics from the tasks that generate it. The challenge is how to model the aperiodic workloads of these tasks in a way that supports schedulability analysis.

4 Empirical Load Bound

In this section we show how to model the workload of a device-driven task by an empirically derived load-bound function, which can then be used to estimate the preemptive interference effects of the device driver on the other tasks in a system.

For example, suppose one wants to estimate the total worst-case device-driven processor load of a network device driver, viewed as a single conceptual task τ_D. The first step is to experimentally estimate load^max_{τ_D}(∆) for enough values of ∆ to be able to produce a plot similar to Figure 2 in Section 2. The value of load^max_{τ_D}(∆) for each value of ∆ is approximated by the maximum observed value of demand_{τ_D}(t − ∆, t)/∆ over a large number of intervals [t − ∆, t).

One way to measure the processor demand of a device-driven task in an interval is to modify the kernel, including the softirq and interrupt handlers, to keep track of every time interval during which the task executes. We started with this approach, but were concerned about the complexity and the additional overhead introduced by the fine-grained time accounting. Instead, we settled on a subtractive approach, in which the CPU demand of a device driver task is inferred by measuring the processor time that is left for other tasks.

To estimate the value of demand_{τ_D}(t − ∆, t) for a network device driver we performed the following experiment, using two computers attached to a dedicated network switch. Host A sends messages to host C at a rate that maximizes the CPU time demand of C's network device driver. On system C, an application thread τ_2 attempts to run continuously at lower priority than the device driver and monitors how much CPU time it accumulates within a chosen-length interval. All other activity on C is either shut down or run at a priority lower than τ_2. If ∆ is the length of the interval, and τ_2 is able to execute for x units of processor time in the interval, then the CPU demand attributed to the network device is ∆ − x and the load is (∆ − x)/∆.

It is important to note that this approach measures only CPU interference. It does not address memory cycle interference due to DMA operations. The reason is that most if not all of the code of τ_2 will operate out of the processor's cache, and therefore virtually no utilization of the memory bus will result from τ_2. This effect, known as cycle stealing, can slow down a memory-intensive task. Measurement of memory cycle interference is outside the scope of the present paper.
Each host had a Pentium D processor running in single-core mode at 3.0 GHz, with 2 GB of memory and an Intel Pro/1000 gigabit Ethernet adapter, and was attached to a dedicated gigabit switch. Task τ_2 was run using the SCHED_FIFO policy (strict preemptive priorities, with FIFO service among threads of equal priority) at a real-time priority just below that of the network softirq server threads. All its memory was locked into physical memory, so there were no other I/O activities (e.g., paging and swapping).

The task τ_2 estimated its own running time using a technique similar to the Hourglass benchmark system [12]. It detected preemption events by reading the system clock as frequently as possible and looking for larger jumps than would occur if the thread ran between clock read operations without preemption. It then added up the lengths of all the time intervals where it was not preempted, plus the clock reading overhead for the intervals where it was preempted, to estimate the amount of time that it was able to execute.
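A minimal user-space version of this gap-detection loop might look as follows. This is our sketch rather than the Hourglass code; the 2 µs gap threshold and 5 msec window are arbitrary assumptions, and the SCHED_FIFO setup is omitted:

```c
#include <stdio.h>
#include <time.h>

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    const long long gap_ns = 2000;        /* jumps longer than this count as preemption */
    const long long window_ns = 5000000;  /* one measurement interval, Delta = 5 msec */
    long long start = now_ns(), prev = start, ran = 0;

    for (;;) {
        long long t = now_ns();
        if (t - prev < gap_ns)
            ran += t - prev;              /* the thread ran between the two clock reads */
        /* else: preempted; charge only the clock-read overhead (omitted here) */
        prev = t;
        if (t - start >= window_ns) {
            /* load attributed to higher-priority work: (Delta - x) / Delta */
            printf("load = %.3f\n",
                   (double)(window_ns - ran) / (double)window_ns);
            start = t; ran = 0;           /* begin the next window */
        }
    }
}
```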
The first experiment was to determine the baseline preemptive interference experienced by τ_2 when τ_D is idle because no network traffic is directed at the system. That is, we measured the maximum processor load that τ_2 could place on the system when no device driver execution is required, and subtracted that value from one. This provided a basis for determining the network device driver demand in later experiments, by subtracting the idle-network interference from the total interference observed when the network device driver was active.

[Figure 4: percent interference versus interval length (µsec), for vanilla Linux and Timesys. Caption: Observed interference with no network traffic.]

Figure 4 shows the results of this experiment in terms of the percent interference observed by task τ_2. For this and the subsequent graphs, note that each data point represents the maximum observed preemptive interference over a series of trial intervals of a given length. This is a hard lower bound, and it is also a statistical estimate of the experimental system's worst-case interference over all intervals of the given length. Assuming the interference and the choice of trial intervals are independent, the larger the number of trial intervals examined, the closer the observed maximum should converge to the system's worst-case interference.

The envelope of the data points should be approximately hyperbolic; that is, there should be an interval length below which the maximum interference is 100%, and there should be an average processor utilization to which the interference converges for long intervals. There can be two valid reasons for deviation from the hyperbola: (1) periodic or nearly periodic demand, which results in a zig-zag shaped graph similar to the line labeled "refined load bound" in Figure 2 (see Section 2); (2) not having sampled enough intervals to encounter the system's worst-case demand. The latter effect should diminish as more intervals are sampled, but the former should persist.

In the case of Figure 4 we believe that the tiny blips in the Timesys line around 1 and 2 msec are due to processing for the 1 msec timer interrupt. The data points for vanilla Linux exhibit a different pattern, aligning along what appear to be multiple hyperbolae. In particular, there is a set of high points that seems to form one hyperbola, a layer of low points that closely follows the Timesys plot, and perhaps a middle layer of points that seems to fall on a third hyperbola. This appearance is what one would expect if there were some rare events (or co-occurrences of events) that caused preemption for long blocks of time. When one of those occurs, it logically should contribute to the maximum load for a range of interval lengths, up to the length of the corresponding block of preemption, but it only shows up in the one data point for the length of the trial interval where it was observed. The three levels of hyperbolae in the vanilla Linux graph suggest that there are some events or combinations of events that occur too rarely to show up in all the data points, but that if the experiment were continued long enough, data points on the upper hyperbola would be found for all interval lengths.

Clearly the vanilla kernel is not as well behaved as Timesys. The high variability of data points for the vanilla kernel suggests that the true worst-case interference is much higher than the envelope suggested by the data. That is, if more trials were performed for each data point, then higher levels of interference would be expected to occur throughout. By comparison, the
observed maximum interference for Timesys appears to be bounded within a tight envelope over all interval lengths. The difference is attributed to Timesys' patches to increase preemptability.

The remaining experiments measured the behavior of the network device driver task τ_D under a heavy load, consisting of one ICMP "ping" packet every 10 µsec. ICMP "ping" packets were chosen because they execute entirely in the context of the device driver's receive thread, from actually receiving the packet through replying to it (TCP and UDP split execution between send and receive threads).

[Figure 5: percent interference versus interval length (µsec), for vanilla Linux, Timesys, and the fitted hyperbolic bound. Caption: Observed interference with ping flooding, including reply.]

Figure 5 shows the observed combined interference of the driver and base operating system under a network load of one ping every 10 µsec. The high variance of data points observed for the vanilla kernel appears to extend to Timesys. This indicates a rarely occurring event, or combination of events, that occurs in connection with network processing and causes a long block of preemption. We believe that this may be a "batching" effect arising from the NAPI policy, which alternates between polling and interrupt-triggered execution of the driver. A clear feature of the data is that the worst-case preemptive interference due to the network driver is higher with the Timesys kernel than the vanilla kernel. We believe that this is the result of additional time spent in scheduling and context switching, because the network softirq handlers are executed in scheduled threads rather than in borrowed context.

Given a set of data from experimental measurements of interference, we can fit the hyperbolic bound through application of inequality (10) from Section 2. There are several ways to choose the utilization and period so that the hyperbolic bound is tight. The method used here is: (1) eliminate any upward jogs from the data by replacing each data value with the maximum of the values to its right, resulting in a downward staircase function; (2) approximate the utilization by the value at the rightmost step; (3) choose the smallest period for which the resulting hyperbola intersects at least one of the data points and is above all the rest.
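In C, these three steps might be implemented along the following lines (our sketch; it assumes the driver task is modeled as a periodic task, so that e = u·p and hence p − e = p(1 − u), with 0 < u < 1):

```c
#include <math.h>

/* Fit the hyperbolic bound (10) to measured points (delta[i], load[i]),
 * sorted by increasing delta. On return, the hyperbola
 * min(1, u*(1 + (p - e)/delta)) with e = u*p lies on or above every
 * point and touches at least one. load[] is overwritten in step (1). */
static void fit_hyperbola(const double *delta, double *load, int n,
                          double *u_out, double *p_out)
{
    /* (1) downward staircase: each value becomes the max of those to its right */
    for (int i = n - 2; i >= 0; i--)
        load[i] = fmax(load[i], load[i + 1]);

    /* (2) utilization = height of the rightmost step */
    double u = load[n - 1];

    /* (3) smallest p - e that keeps the hyperbola above all points:
     * u*(1 + (p - e)/delta_i) >= load_i  <=>  p - e >= (load_i - u)*delta_i/u */
    double pe = 0.0;
    for (int i = 0; i < n; i++)
        pe = fmax(pe, (load[i] - u) * delta[i] / u);

    *u_out = u;
    *p_out = pe / (1.0 - u);   /* from p - e = p*(1 - u), assuming u < 1 */
}
```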
[Figure 6: percent interference versus interval length (µsec), with fitted hyperbolic bounds, for reply to ping flood, ping flood with no reply, and no load. Caption: Observed interference with ping flooding, with no reply.]

To carry the analysis further, an experiment was done to separate the load bound for receive processing from the load bound for transmit processing. The normal system action for a ping message is to send a reply message. The work of replying amounts to about half of the work of the network device driver tasks for ping messages. A more precise picture of the interference caused by just the network receiving task can be obtained by informing the kernel not to reply to ping requests. The graph in Figure 6 juxtaposes the observed interference due to the driver and base operating system with ping-reply processing, without ping-reply processing, and without any network load. The fitted hyperbolic load bound is also shown for each case. An interesting difference between the data for the "no reply" and normal ping processing cases is the clear alignment of the "no reply" data into just two distinct hyperbolae, as compared to the more complex pattern for the normal case. The more complex pattern of variation in the data for the case with replies may be due to the summing of the interferences of the two threads, whose busy periods sometimes coincide. If this is true, it suggests a possible improvement in performance by forcing separation of the execution of these two threads.

Note that understanding these phenomena is not necessary to apply the techniques presented here. In fact, the ability to model device driver interference
without knowledge of the exact causes of the interference is the chief reason for using these techniques.

5 Interference vs. I/O Service Quality

This section describes further experiments, involving the device driver with two sources of packets and two hard-deadline periodic tasks. These were intended to explore how well empirical load bounds derived by the technique in Section 4 work together with analytical load bounds for periodic tasks in whole-system schedulability analysis. We were also interested in comparing the degree to which scheduling techniques that reduce the interference the device-driver task causes other tasks (e.g., lowering its priority or limiting its bandwidth through an aperiodic server scheduling algorithm) would affect the quality of network input service.

The experiments used three computers, referred to as hosts A, B, and C. Host A sent host C a heartbeat datagram once every 10 msec, host B sent a ping packet to host C every 10 µsec (without waiting for a reply), and host C ran the following real-time tasks:

• τ_D is the device-driven task that is responsible for processing packets received and sent on the network interface (viewing the two kernel threads softirq-net-rx and softirq-net-tx as a single task).

• τ_1 is a periodic task with a hard implicit deadline and an execution time of 2 msec. It attempts one non-blocking input operation on a UDP datagram socket every 10 msec, expecting to receive a heartbeat packet, and counts the number of heartbeat packets it receives. The packet loss rate measures the quality of I/O service provided by the device driver task τ_D.

• τ_2 is another periodic task, with the same period and relative deadline as τ_1. Its execution time was varied, and the number of deadline misses was counted at each CPU utilization level. The number of missed deadlines reflects the effects of interference caused by the device driver task τ_D.

All the memory of these tasks was locked into physical memory, so there were no other activities. Their only competition for execution was from Level 1 and Level 2 ISRs. The priority of the system thread that executes the latter was set to the maximum real-time priority, so that τ_D would always be queued to do work as soon as input arrived.

Tasks τ_1 and τ_2 were implemented by modifying the Hourglass benchmark [12] to accommodate task τ_1's nonblocking receive operations.

    Server        τ_1    τ_2    τ_D        OS
    Traditional   high   med    hybrid     vanilla
    Background    high   med    low        Timesys
    Foreground    med    low    high       Timesys
    Sporadic      high   low    med (SS)   Timesys

    Table 1. Configurations for experiments.

We tested the above task set in four scheduling configurations (Table 1). The first was the vanilla Linux kernel. The other three used Timesys with some modifications of our own to add support for a Sporadic Server scheduling policy (SS). The SS policy was chosen because it is well known and is likely to be already implemented in the application thread scheduler of any real-time operating system, since it is the only aperiodic server scheduling policy included in the standard POSIX and Unix real-time APIs.

The tasks were assigned relative priorities and scheduling policies as shown in Table 1. The scheduling policy was SCHED_FIFO except where SS is indicated.
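Setting up such a configuration from user space uses standard POSIX calls. For example (an illustrative sketch; the priority value is arbitrary and error handling is minimal), each periodic task can pin itself to a SCHED_FIFO priority and lock its memory as follows:

```c
#include <stdio.h>
#include <sched.h>
#include <sys/mman.h>

/* Put the calling process at a fixed SCHED_FIFO priority and lock its
 * memory, as was done for tau_1 and tau_2. The driver's softirq server
 * threads would be given priorities above or below this one, per Table 1. */
static int become_rt(int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {  /* no paging or swapping */
        perror("mlockall");
        return -1;
    }
    return 0;
}
```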
[Figure 7: percent missed deadlines (log scale) versus τ_2 execution time (msec), for vanilla Linux; Timesys with τ_D low; Timesys with τ_D high; and Timesys with τ_D med SS. Caption: Percent missed deadlines of τ_2 with interference from τ_1 (e_1 = 2 and p_1 = 10) and τ_D subject to one PING message every 10 µsec.]

Figures 7 and 8 show the percentage of deadlines that task τ_2 missed and the number of heartbeat packets that τ_1 received, for each of the experimental configurations.

The Traditional Server experiments showed that the vanilla Linux two-level scheduling policy for softirqs causes τ_2 to miss deadlines at lower utilization levels, and causes a higher heartbeat packet loss rate for τ_1, than the other driver scheduling methods. Nevertheless, the vanilla Linux behavior does exhibit some desirable properties. One is a nearly constant packet loss rate, independent of the load from τ_1 and τ_2. That is due
to the ability of the driver to obtain some processing time at top priority, but only a limited amount. (See the description of Level 3 processing in Section 3 for details.) Another property, which is positive for soft-deadline applications, is that the missed-deadline rate of τ_2 degrades gracefully with increasing system load. These are also characteristics of an aperiodic scheduling algorithm, which the Linux policy approximates by allocating a limited rate of softirq handler executions at top priority and deferring the excess to be completed at low priority. However, the vanilla Linux policy is not simple and predictable enough to support schedulability analysis. Additionally, this strategy does not allow for user-level tuning of the device driver scheduling.

[Figure 8: number of heartbeat packets received versus τ_2 execution time (msec), for the same four configurations as Figure 7. Caption: Number of heartbeat packets received by τ_1 with interference from τ_2 (e_1 = 2 and p_1 = 10) and τ_D subject to one PING message every 10 µsec.]

The Background Server experiments confirmed that assigning τ_D the lowest priority of the three tasks (the default for Timesys) succeeds in maximizing the probability of τ_2 meeting its deadlines, but it also gives the worst packet loss behavior. Figure 9 shows the combined load for τ_1 and τ_2. The values near the deadline (10) suggest that if there is no interference from τ_D or other system activity, τ_2 should be able to complete within its deadline until e_2 exceeds 7 msec. This is consistent with the data in Figure 7. The heartbeat packet receipt rate for τ_1 starts out better than vanilla Linux, but degenerates for longer τ_2 execution times.

[Figure 9: percent interference versus interval length (msec), for e_2 = 5, 6, and 7 msec. Caption: Sum of load-bound functions for τ_1 and τ_2, for three different values of the execution time e_2.]

The Foreground Server experiments confirmed that assigning the highest priority to τ_D causes the worst deadline-miss performance for τ_2, but also gives the best heartbeat packet receipt rate for τ_1. The line labeled "τ_1 + τ_D" in Figure 10 shows the sum of the theoretical load bound for τ_1 and the empirical hyperbolic load bound for τ_D derived in Section 4. By examining the graph at the deadline (10000 µsec), and allowing some margin for release-time jitter, overhead, and measurement error, one would predict that τ_2 should not miss any deadlines until its execution time exceeds 1.2 msec. That appears to be consistent with the actual performance in Figure 7.

[Figure 10: percent interference versus interval length (µsec), showing the load bounds for τ_D and τ_1 individually, and their sum τ_1 + τ_D. Caption: Individual load-bound functions for τ_1 and τ_D, and their sum.]

The Sporadic Server experiments represent an attempt to achieve a compromise that balances missed heartbeat packets for τ_1 against missed deadlines for τ_2, by scheduling τ_D according to a bandwidth-budgeted aperiodic server scheduling algorithm, running at a priority between τ_1 and τ_2. This has the effect of reserving a fixed amount of high-priority execution time for τ_D, effectively lowering the load-bound curves. It allows τ_D to preempt τ_2 for the duration of the budget, but later reduces its priority to permit τ_2 to execute,
thereby increasing the number of deadlines τ_2 is able to meet. The Sporadic Server algorithm implemented here uses the native (and rather coarse) time accounting granularity of Linux, which is 1 msec. The server budget is 1 msec; the replenishment period is 10 msec; and the number of outstanding replenishments is limited to two. It can be seen in Figure 7 that running the experiments on the SS implementation produces data that closely resembles the behavior of the vanilla Linux kernel. (This is consistent with our observations on the similarity of these two algorithms in the comments on the Traditional Server experiments above.) Under ideal circumstances the SS implementation should not allow τ_2 to miss a deadline until its execution time exceeds the sum of its own initial budget and the execution time of τ_1. In this experiment our implementation of the SS fell short of this by 3 msec. In continuing research, we plan to narrow this gap by reducing the accounting granularity of our implementation and increasing the number of pending replenishments, and to determine how much of the currently observed gap is due to the inevitable overhead of time accounting, context switches, and priority queue reordering.
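The bookkeeping involved can be sketched as follows, using the parameters above (1 msec ticks, a 1 msec budget, a 10 msec replenishment period, and at most two outstanding replenishments). This is a simplification for illustration, not our kernel patch or the full POSIX SCHED_SPORADIC rules:

```c
/* Simplified sporadic-server accounting, in 1 msec ticks. */
#define SS_PERIOD   10                /* replenishment period, in ticks */
#define SS_MAX_REPL 2                 /* outstanding replenishments limit */

struct repl { long when; int amount; };

struct sporadic_server {
    int budget;                       /* remaining high-priority time */
    struct repl pending[SS_MAX_REPL];
    int npending;
};

/* Charge one tick of server execution at time t, scheduling the budget's
 * return one replenishment period later. With the budget exhausted (or
 * the replenishment list full) the server runs at background priority. */
void ss_charge_tick(struct sporadic_server *ss, long t)
{
    if (ss->budget > 0 && ss->npending < SS_MAX_REPL) {
        ss->budget--;
        ss->pending[ss->npending++] = (struct repl){ t + SS_PERIOD, 1 };
    }
}

/* At every tick, return any replenishments that have come due. */
void ss_replenish(struct sporadic_server *ss, long t)
{
    int kept = 0;
    for (int i = 0; i < ss->npending; i++) {
        if (ss->pending[i].when <= t)
            ss->budget += ss->pending[i].amount;
        else
            ss->pending[kept++] = ss->pending[i];
    }
    ss->npending = kept;
}
```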
6 Related Work

Previous research has considered a variety of techniques for dealing with interference between interrupt-driven execution of device-driver code and the scheduling of application threads. We classify these techniques into two broad groups, according to whether they apply before or after the interrupt occurs.

The first technique is to "schedule" hardware interrupts in a way that reduces interference, either by reducing the number of interrupts or by making them more predictable through limiting when they can occur. On some hardware platforms, including the Motorola 68xxx series of microprocessors, this can be done by assigning different hardware priorities to different interrupts. The most basic approach to scheduling interrupts involves enabling and disabling interrupts intelligently. The Linux network device driver model called NAPI applies this concept to reduce hardware interrupts during periods of high network activity [11]. Regehr and Duongsaa [13] propose two other techniques for reducing interrupt overloads, one through special hardware support and the other in software. RTLinux can be viewed as also using this technique. That is, to reduce interrupts on the host operating system, RTLinux interposes itself between the hardware and the host operating system [18]. In this way it relegates all device driver execution for the host to background priority, unless there is a need for better I/O performance. In the latter case, RTLinux allows device driver code to run as an RTLinux thread (see below).

The second technique, followed in the current paper, is to defer most interrupt-triggered work to scheduled threads. Hardware interrupt handlers are kept as short and simple as possible. They only serve to notify a scheduler that it should schedule the later execution of a thread to perform the rest of the interrupt-triggered work. There are variations to this approach, depending on whether the logical interrupt-handler threads execute in borrowed (interrupt) context or in independent contexts (e.g., normal application threads), and on whether they have an independent lower-level scheduler (e.g., RTLinux threads or vanilla Linux softirq handlers) or are scheduled via the same scheduler as normal application threads. The more general the thread scheduling mechanism, the more flexibility the system developer has in assigning an appropriate scheduling policy and priority to the device-driven threads. The job of bounding device driver interference then focuses on analyzing the workload and scheduling of these threads. This technique has been the subject of several studies, including [7, 4, 17, 5], and is implemented in Windows CE and real-time versions of the Linux kernel.

Facchinetti et al. [6] recently proposed an instance of the work-deferral approach, in which a system executes all driver code as one logical thread, at the highest system priority. The interrupt server has a CPU time budget, which imposes a bound on interference from the ISRs. They execute the ISRs in a non-preemptable manner, in interrupt context, ahead of the application thread scheduler. Their approach is similar to the softirq mechanism of the vanilla Linux system, in that both schedule interrupt handlers to run at the highest system priority, both execute in interrupt context, and both have a mechanism that limits server bandwidth consumption. However, time budgets are enforced directly in [6].

Zhang and West [19] recently proposed another variation of the work-deferral approach, which attempts to minimize the priority of the bottom halves of driver code across all current I/O-consuming processes. The algorithm predicts the priority of the process that is waiting on some queued I/O, and then executes the bottom half in its own thread at the highest predicted priority per interrupt. It then charges the execution time to the predicted process. This approach makes sense for device driver execution that can logically be charged to an application process.

The above two techniques partially address the problem considered in this paper. That is, they restructure the device-driven workload in ways that
potentially allow more of it to be executed below interrupt priority, and they schedule the execution according to a policy that can be analyzed if the workload can be modeled. However, they do not address the problem of how to model the workload that has been moved out of the ISRs, or how to model the workload that remains in the ISRs.

A difference between the Facchinetti approach and our use of aperiodic server scheduling is that we have multiple threads, at different priorities, executing in independent contexts and scheduled according to standard thread scheduling policies that are also available to application threads. We have observed that different devices (e.g., NIC, disk controller, etc.) generate unique workloads, which we believe warrant different scheduling strategies and different time budgets. In contrast, all devices in the Facchinetti system are forced to share the same budget and the same priority; the system is not able to distinguish between different priority levels of I/O, and is forced to handle all I/O in FIFO order. Imagine a scenario where the real-time system is flooded with packets. In the Facchinetti system the NIC could exhaust the ISR server's budget. If a high priority task requests disk I/O while the ISR server's budget is exhausted, the disk I/O will be delayed until the ISR server budget is replenished, and the high priority task may not receive its disk service in time to meet its deadline. This scenario is pessimistic, but it explains our motivation to move ISR execution into multiple fully schedulable threads.

A difference between the Zhang and West approach and ours is that we focus on the case where there is no application process to which the device-driven activity can logically be charged. Our experiments use ICMP packets, which are typically processed in the context of the kernel and cannot logically be charged to a process.

Another difference is that our model is not subject to a middle-priority process delaying the execution of a higher priority process by causing a backlog in the bottom-half processing of I/O for a device on which the high priority process depends. Consider a system with three real-time processes, at three different priorities. Suppose the low priority process initiates a request for a stream of data over the network device, and that between packets received by the low priority process, the middle-priority process (which does not use the network device) wakes up and begins executing. Under the Zhang and West scheme, the network device server thread would have too low a priority for the network device's bottom half to preempt the middle-priority process, and so a backlog of received packets would build up in the DMA buffers. Next, suppose the high priority process wakes up and, during its execution, attempts to read from the network device. This will raise the bottom half's priority to that of the high priority process. However, since the typical network device driver handles packets in FIFO order, the bottom half is forced to work through the backlog of the low-priority process's input before it gets to the packet destined for the high priority process. This additional delay could be enough to cause the high priority process to miss its deadline. That would not have happened if the low-priority packets had been cleared out earlier, as they would have been if the device bottom half had been able to preempt the middle-priority task. In contrast, with our approach the bottom half still handles incoming packets in FIFO order, but by executing the bottom half in a server with a budget of high-priority time we are able to empty the incoming DMA queue more frequently. This can prevent the scenario above from occurring, unless the input rate exceeds the bottom-half server's budgeted bandwidth.

7 Conclusion

We have described two ways to approach the problem of accounting for the preemptive interference effects of device driver tasks in demand-based schedulability analysis. One is to model the worst-case interference of the device driver by a hyperbolic load-bound function derived from empirical performance data. The other is to schedule the device driver by an aperiodic server algorithm that budgets processor time, consistent with the analytically derived load-bound function of a periodic task. We experimented with the application of both techniques to the Linux device driver for Intel Pro/1000 Ethernet adapters.

The experimental data show that hyperbolic load bounds can be derived for base system activity, network receive processing, and network transmit processing. Further, the hyperbolic load bounds may be combined with analytically derived load bounds to predict the schedulability of hard-deadline periodic or sporadic tasks. We believe this technique of using empirically derived hyperbolic load-bound functions to model processor interference may also have potential applications outside of device drivers, to aperiodic application tasks that are too complex for any other load modeling technique.

The data also show preliminary indications that aperiodic-server scheduling algorithms, such as the Sporadic Server, can be useful in balancing device driver interference against quality of I/O service. This provides an alternative in situations where neither of the two extremes otherwise available will do, i.e., where running the device driver at a fixed high priority causes unacceptable levels of interference with other tasks,
and running the device driver at a fixed lower priority causes unacceptably low levels of I/O performance. In future work, we plan to study other device types and other types of aperiodic server scheduling algorithms. We also plan to extend our study of empirically derived interference bounds to include memory cycle interference. As mentioned in this paper, our load-measuring task can execute out of cache, and so does not experience the effects of memory cycle stealing due to DMA. Even where there is no CPU interference, DMA memory cycle interference may increase the time to complete a task past the anticipated worst-case execution time, resulting in missed deadlines. We plan to perform an analysis of DMA interference on memory-intensive tasks. By precisely modeling these effects, increases in execution time due to cycle stealing will be known, and worst-case execution times will be more accurately predicted. Further, by coordinating the DMA and memory-intensive tasks, the contention for access to memory can be minimized.

References

[1] N. C. Audsley, A. Burns, M. Richardson, and A. J. Wellings. Hard real-time scheduling: the deadline monotonic approach. In Proc. 8th IEEE Workshop on Real-Time Operating Systems and Software, pages 127–132, Atlanta, GA, USA, 1991.

[2] T. P. Baker and S. K. Baruah. Schedulability analysis of multiprocessor sporadic task systems. In I. Lee, J. Y.-T. Leung, and S. Son, editors, Handbook of Real-Time and Embedded Systems. CRC Press, 2007 (to appear).

[3] S. K. Baruah, A. K. Mok, and L. E. Rosier. Preemptively scheduling hard-real-time sporadic tasks on one processor. In Proc. 11th IEEE Real-Time Systems Symposium, pages 182–190, 1990.

[4] L. L. del Foyo, P. Mejia-Alvarez, and D. de Niz. Predictable interrupt management for real time kernels over conventional PC hardware. In Proc. 12th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'06), pages 14–23, San Jose, CA, Apr. 2006.

[5] P. Druschel and G. Banga. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proc. 2nd USENIX Symposium on Operating Systems Design and Implementation, pages 261–275, Oct. 1996.

[6] T. Facchinetti, G. Buttazzo, M. Marinoni, and G. Guidi. Non-preemptive interrupt scheduling for safe reuse of legacy drivers in real-time systems. In Proc. 17th IEEE Euromicro Conference on Real-Time Systems, Palma de Mallorca, July 2005.

[7] S. Kleiman and J. Eykholt. Interrupts as threads. ACM SIGOPS Operating Systems Review, 29(2):21–26, Apr. 1995.

[8] J. P. Lehoczky, L. Sha, and J. K. Strosnider. Enhanced aperiodic responsiveness in a hard real-time environment. In Proc. 8th IEEE Real-Time Systems Symposium, pages 261–270, 1987.

[9] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard real-time environment. Journal of the ACM, 20(1):46–61, Jan. 1973.

[10] J. W. S. Liu. Real-Time Systems. Prentice-Hall, 2000.

[11] J. Mogul and K. Ramakrishnan. Eliminating receive livelock in an interrupt-driven kernel. ACM Transactions on Computer Systems, 15(3):217–252, 1997.

[12] J. Regehr. Inferring scheduling behavior with Hourglass. In Proc. of the USENIX Annual Technical Conf., FREENIX Track, pages 143–156, Monterey, CA, June 2002.

[13] J. Regehr and U. Duongsaa. Preventing interrupt overload. In Proc. 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pages 50–58, Chicago, Illinois, June 2005.

[14] L. Sha, J. P. Lehoczky, and R. Rajkumar. Solutions for some practical problems in prioritizing preemptive scheduling. In Proc. 7th IEEE Real-Time Systems Symposium, 1986.

[15] B. Sprunt, L. Sha, and J. P. Lehoczky. Aperiodic task scheduling for hard real-time systems. Real-Time Systems, 1(1):27–60, 1989.

[16] J. Strosnider, J. P. Lehoczky, and L. Sha. The deferrable server algorithm for enhanced aperiodic responsiveness in real-time environments. IEEE Trans. Computers, 44(1):73–91, Jan. 1995.

[17] C. A. Thekkath, T. D. Nguyen, E. Moy, and E. Lazowska. Implementing network protocols at user level. IEEE Trans. Networking, 1(5):554–565, Oct. 1993.

[18] V. Yodaiken. The RTLinux manifesto. In Proc. 5th Linux Expo, Raleigh, NC, 1999.

[19] Y. Zhang and R. West. Process-aware interrupt scheduling and accounting. In Proc. 27th IEEE Real-Time Systems Symposium, Rio de Janeiro, Brazil, Dec. 2006.

Acknowledgment

The authors are grateful to Timesys Corporation for providing access to their distribution of the Linux kernel at a reduced price. We are also thankful to the anonymous members of the RTAS 2007 program committee for their suggestions.