1 Introduction

As hardware has become more complex, OS kernels have also become more complex. Monolithic kernels have added such features as dynamic module loading, multithreading, and kernel locks to allow reentrant access to the data structures and code sequences inside. Another way of dealing with the complexity has been to structure the kernel of the operating system as a microkernel.

The overhead of coordinating the different parts of the microkernel limited the performance possible in early microkernels: initially, message-passing (IPC) overhead degraded performance. Subsequent research investigated hybridization to solve the performance problem of microkernels. IPC overhead was reduced by allowing strongly logically connected components to reside within the same memory space.

Related research into the comparative performance of kernel designs (e.g. [3]) has not looked at commercial microkernel operating systems such as MacOS X and Windows. In this paper, we examine some performance characteristics of hybrid and monolithic kernels on commodity hardware using a standard benchmark.

Copyright (c) 2006 University of Saskatchewan. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.
2 Methodology

We chose lmbench3, which has been widely used [2], as our benchmark program. Four tests were selected from lmbench's suite to characterize some of the performance details of the processor, memory, and OS of the test systems. The first test was memory latency, which provides a baseline measurement for the hardware. The other three tests were related to combinations of the software and hardware: context switch overhead, memory read bandwidth, and file read bandwidth.

Context switch overhead is measured by passing a token between a number of concurrent processes via IPC, capturing IPC implementation efficiency. Memory read bandwidth is tested by allocating a block of memory, zeroing it, and then copying the contents of the first half to the second half. The file read bandwidth benchmark is similar in concept: open a file and read it in 64 KB chunks.

Each operating system was installed and minimally configured to support compilation and execution of the lmbench3 benchmark. The benchmark was executed 10 times on each combination of software and hardware.

3 Test Environment

Table 1 lists the operating systems tested and the corresponding hardware for each test. Two systems have monolithic kernels (Linux, OpenBSD), while two are microkernel based with some hybrid features in each (MacOS X, Windows).

CPU/mode         Operating System
PowerPC          Linux (2.6.12.X), MacOS X (10.4.5), OpenBSD 3.8
Sempron          Windows 2K (SP4)
Athlon64 32-bit  Linux, MacOS X, OpenBSD, Windows
Athlon64 64-bit  Linux, OpenBSD

Table 1: Operating Systems Tested

Three processors were tested. Two of these processors have a CISC interface, while the third is a RISC machine. Table 2 gives more detail regarding the hardware components. The bus speed referred to is the memory bus speed.

Feature     AMD64     Sempron    PPC G4
CPU         2.0 GHz   1.75 GHz   1.5 GHz
Memory      1024 MB   768 MB     512 MB
Bus Speed   800 MHz   266 MHz    333 MHz
L1 Cache    64 KB     64 KB      32 KB
L2 Cache    512 KB    256 KB     512 KB

Table 2: Hardware Characteristics

4 Results

Preliminary analysis of the result graphs identified interesting performance behaviour. Detailed results for Windows are not shown, as lmbench interacted with the Cygwin environment in unpredictable ways.

Fig. 1 shows the memory latency for the A64 in 32-bit mode for Linux; it is typical of all the graphs on this architecture. The stride size is the offset into a fixed array for each access. The small memory size operations were very fast and did not differ between operating systems or hardware platforms. This corresponds to data being held in registers and/or the L1 cache. Once the memory accesses reached the boundary of the L2 cache, latencies went up by a fixed amount representing the time required to service the request by issuing a read command, the delay of waiting for the data to become ready, and the bus transit time.

[Figure 1: Memory latency in nanoseconds versus memory size for stride sizes 16-1024 (A64, 32-bit Linux)]
2
Some stride sizes have a higher initial latency roughly a 10 ns increase.
as the cache refills with data, settling to a lower
overall latency. This pattern was not found 160 MacOS X 4k
MacOS X 16k
MacOS X 64k
on any other machine, and improves latency 140
Linux 4k
Linux 16k
Overhead in microseconds
OpenBSD 64k
100
tencies again increase. Differences can be ob-
80
served between all stride lines on all architec-
tures, with smaller strides generally gaining the 60
40
A subset of the memory throughput perfor-
Overhead in microseconds
30
mance for the A64 in 32-bit mode is shown in
Fig. 4. Memory accesses less than L1 cache
20 size have constant speed. As working set sizes
increase, the hardware becomes more of a lim-
10
iting factor. Moving from L1 to L2 reduces the
achieved bandwidth substantially.
10 20 30 40 50 60
Processes The filesystem performance builds directly
on the memory performance. If the kernel is
Figure 2: Context Switch Latency (A64) able to align its accesses and be smart about
page management for the disk-cache, a signifi-
cant performance improvement can be gained.
Context switch latency under MacOS X is The effect of the L1 and L2 caches on the
similar in performance to Linux. In Fig. 3, graphs (Fig. 5) are noticable. Within the L2
MacOS X appears to have 4 ns of additional cache size, algorithms which invalidate fewer
latency. This is possibly due to the microkernel cache lines have a higher throughput. Once
context switch overhead. This also holds for the L2 cache boundary is exceeded, most sys-
the Power PC, where MacOS X’s latency shows tems experience nearly identical throughput –
3
[Figure 4: Memory read and write bandwidth (MB/s) for Linux, MacOS X, and OpenBSD (A64, 32-bit)]

[Figure 5: File read bandwidth (MB/s, log scale) versus memory size for read, mmap, and mmap-open2close on Linux, MacOS X, and OpenBSD]

Figure 5: File Read (A64, 32-bit)

Within the L1 cache boundary, Linux outperformed MacOS X by an order of magnitude on some operations, while within the boundaries of the L2 cache they perform within a few percentage points of each other. Once the working set exceeded the size of the L2 cache, Linux and MacOS X performed similarly. OpenBSD was the only operating system to exhibit different behaviour: its file mmap open2close line starts off much lower than any other operating system, and does not appear to be affected by caching.

5 Conclusions

Based on our limited set of experiments, differences between microkernels and monolithic kernels are still present, but hardware seems to be the dominant factor. Linux appears to have a performance advantage on some operations, but this advantage is marginal once the working set exceeds the cache sizes.

References

[1] … Conference, June 1986.

[2] A. Brown and M. Seltzer. Operating System Benchmarking in the Wake of lmbench: A Case Study of the Performance of NetBSD on the Intel x86 Architecture. In ACM SIGMETRICS, pages 214-224, Seattle, WA, USA, 1997.

[3] H. Hartig, M. Hohmuth, J. Liedtke, S. Schonberg, and J. Wolter. The Performance of uKernel-Based Systems. Operating Systems Review, 31(5):66-77, 1997.

[4] J. Liedtke. On µ-Kernel Construction. In SOSP, pages 237-250, 1995.

[5] A. Tanenbaum, J. Herder, and H. Bos. Can We Make Operating Systems Reliable and Secure? IEEE Computer, 39(5):44-51, May 2006.