
Debugging operating systems with

time-traveling virtual machines

by S.T. King, G.W. Dunlap, P.M. Chen

Presented by: Mirna Limic


What is it? And why have one?

virtual machine + time travel = time-traveling virtual machine (TTVM)

virtual machine (VM): a software abstraction of a physical machine
time travel: the ability to navigate through an immutable execution history

A TTVM is used to debug an operating system (OS) because an OS:

- is non-deterministic
- runs for long periods of time
- may have its state perturbed by the debugger
- interacts directly with hardware devices
virtual machine monitor (VMM): the software layer that
provides the abstraction of a virtual machine
guest OS: the OS that runs on the VMM
host OS: the OS on which the VMM runs

IDEA: EXTEND gdb TO MAKE USE OF TIME TRAVEL

[Architecture diagram: gdb attaches to the guest-user and guest-kernel
host processes; these sit on the TTVM functionality layer
(checkpointing, logging, replay), which runs on the host OS.]
VM state: the VM's physical memory, the virtual disk, the CPU
registers, and any state in the VMM or host kernel
that affects the execution of the virtual machine
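
As a concrete picture of what must be saved, here is a minimal C sketch
of such a state record; the struct and field names are hypothetical, not
UML's or the paper's.

    /* Hypothetical sketch of the state a TTVM checkpoint captures.
     * Names and layout are illustrative; UML's real structures differ. */
    #include <stddef.h>
    #include <stdint.h>

    struct vm_state {
        uint8_t  *phys_mem;      /* the guest's "physical" memory            */
        size_t    phys_mem_len;
        int       disk_fd;       /* backing file of the guest's virtual disk */
        uint64_t  regs[32];      /* guest CPU register file (simplified)     */
        void     *vmm_state;     /* VMM/host-kernel state affecting the guest */
    };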

run: the time from when the virtual machine was powered on
to the last instruction it executed

TTVM capabilities (achieved with logging, replay, and checkpointing):
1. reconstruct the complete state of the VM at any point in a run
2. start from any point in a run and replay the instruction stream
   executed during the original run
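
As a rough sketch of the logging/replay half, assuming a simplified
event record: each non-deterministic event is logged with the
instruction count at which it occurred, then re-delivered at the same
point during replay. ReVirt's actual log format and delivery mechanism
differ.

    /* Minimal sketch of deterministic replay: record non-deterministic
     * events with their instruction counts, re-deliver them on replay. */
    #include <stdint.h>
    #include <stdio.h>

    struct nd_event {
        uint64_t instr_count;  /* when the event must be delivered     */
        int      type;         /* e.g., interrupt, IN result, DMA load */
        uint64_t payload;      /* the non-deterministic value itself   */
    };

    void log_event(FILE *log, const struct nd_event *e) {
        fwrite(e, sizeof *e, 1, log);            /* logging run: append */
    }

    int next_event(FILE *log, struct nd_event *e) {
        return fread(e, sizeof *e, 1, log) == 1; /* replay run: read back */
    }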
VMM, logging and replaying

The VMM used is User-Mode Linux (UML).

The logging/replay system used is ReVirt.

Host device drivers in the guest OS

UML exports a set of virtual devices with no hardware
equivalent. Problem: how do you debug device drivers?

Workaround: modify UML to run real device drivers
in the guest OS.

Result: I/O instructions and DMA requests of the guest OS
are forwarded to the host hardware.
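
A hedged sketch of that forwarding for port I/O, assuming the VMM runs
with raw port access granted via iopl(); logging the returned value is
what makes later replay deterministic. This is illustrative only, not
UML's actual implementation.

    /* Sketch: a guest IN instruction traps to the VMM, which executes it
     * against the real host port and logs the result for replay.
     * Requires root and iopl(3). */
    #include <stdio.h>
    #include <sys/io.h>

    unsigned char handle_guest_inb(unsigned short port, FILE *log) {
        unsigned char value = inb(port);  /* forward to host hardware    */
        fwrite(&value, 1, 1, log);        /* log the result for replay   */
        return value;                     /* deliver to the guest driver */
    }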
Host device drivers in the guest OS (cont'd)


Logging is performed on any information sent from the
device to the driver (IN instructions, memory-mapped
I/O reads, and DMA memory loads).

The host OS provides regions of its physical memory for the
guest's memory-mapped I/O and DMA.

Potential problem: corruption of the host's memory?

Solution: deny access to memory outside the intended region.
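
One plausible way to enforce that fence is to map only the permitted
host region into the guest's address space, so any access outside it
faults. A minimal sketch; the paper does not give the exact mechanism.

    /* Sketch: expose exactly one host physical-memory region to the
     * guest, so stray accesses fault instead of corrupting host memory. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>

    void *map_io_region(off_t phys_addr, size_t len) {
        int fd = open("/dev/mem", O_RDWR);   /* host physical memory */
        if (fd < 0)
            return NULL;
        /* Only [phys_addr, phys_addr + len) becomes accessible. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, phys_addr);
        return p == MAP_FAILED ? NULL : p;
    }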
Checkpointing

Checkpointing is used to speed up time travel over long time periods.

It works by logging memory and disk changes into undo and redo logs.

Difference between memory and disk:
- memory: log the actual pages at every checkpoint into the
  undo and redo logs
- disk: keep multiple versions of the guest's disk blocks, but log only
  the changes to the guest-to-host disk block map in the undo and
  redo logs (see the block-map sketch below)
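
A hedged sketch of that disk scheme, assuming a simple array-based
block map: a guest write goes to a freshly allocated host block, and
only the remapping is recorded. The names are illustrative, not the
paper's.

    /* Sketch: copy-on-write guest disk. The undo/redo logs record only
     * map changes, never block contents; old versions stay on disk. */
    #include <stdint.h>

    #define GUEST_BLOCKS 1024

    struct map_change {
        uint32_t guest_block;
        uint32_t old_host_block;   /* undo: restore this mapping */
        uint32_t new_host_block;   /* redo: reapply this mapping */
    };

    static uint32_t block_map[GUEST_BLOCKS]; /* guest block -> host block */
    static uint32_t next_free_host_block;    /* simplistic allocator      */

    struct map_change write_guest_block(uint32_t guest_block) {
        struct map_change c = {
            .guest_block    = guest_block,
            .old_host_block = block_map[guest_block],
            .new_host_block = next_free_host_block++,
        };
        block_map[guest_block] = c.new_host_block; /* old version kept  */
        return c;                                  /* append to the log */
    }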
Checkpointing: logging of memory

[Diagram: between checkpoint1 and checkpoint2 the guest writes pages
A, B, C; between checkpoint2 and checkpoint3 it writes pages A, D, E.
The undo/redo log pair for the first interval records pages A, B, C,
and the pair for the second interval records pages A, D, E: each pair
holds exactly the pages written in its interval.]
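
A minimal sketch of how the memory side could detect and save those
written pages, assuming mprotect()-based write faults, one common
copy-on-write trick; the paper's implementation details differ.

    /* Sketch: at each checkpoint, write-protect guest memory; the first
     * write to a page then faults, and the handler saves the old copy to
     * the undo log before permitting the write. Illustrative only. */
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    static FILE *undo_log;   /* opened elsewhere in this sketch */

    /* Called at a checkpoint: re-arm write faults on all guest pages.
     * (Pages dirtied since the last checkpoint also go to the redo log.) */
    void take_checkpoint(void *guest_mem, size_t len) {
        mprotect(guest_mem, len, PROT_READ);
    }

    /* Called from the write-fault handler with the faulting page. */
    void on_write_fault(void *page) {
        fwrite(page, PAGE_SIZE, 1, undo_log);              /* old copy    */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE); /* allow write */
    }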
TTVM-aware gdb

Commands added to gdb:

reverse continue - takes the VM back to the previous point
(a point is the reverse equivalent of a forward breakpoint,
watchpoint, or step)

reverse step - goes back a specified number of instructions

goto - jumps to an arbitrary time in the execution
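
A rough sketch of how reverse continue can be composed from checkpoints
and replay: replay forward from the checkpoint preceding the current
time, note the last point where a breakpoint or watchpoint fires, then
travel there. restore_checkpoint(), replay_step(), hit_breakpoint(),
and instr_now() are hypothetical helpers, not TTVM or gdb APIs.

    /* Sketch of reverse continue: two replay passes over one interval. */
    #include <stdint.h>

    extern void     restore_checkpoint(uint64_t before); /* nearest prior */
    extern void     replay_step(void);                   /* one instr     */
    extern int      hit_breakpoint(void);
    extern uint64_t instr_now(void);

    void reverse_continue(uint64_t now) {
        uint64_t last_hit = 0;  /* falls back to start of run if no hit */

        restore_checkpoint(now);
        while (instr_now() < now) {      /* pass 1: find last trigger */
            replay_step();
            if (hit_breakpoint())
                last_hit = instr_now();
        }
        restore_checkpoint(last_hit);    /* pass 2: travel to it */
        while (instr_now() < last_hit)
            replay_step();
    }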


Performance

Machine: uniprocessor 3 GHz Pentium 4, 1 GB memory,
120 GB Hitachi Deskstar GXP disk
Host OS: Linux 2.4.18 with UML running in skas mode,
plus the TTVM modifications
Guest OS: 256 MB memory, 5 GB disk
Both guest and host filesystems were initialized from Red Hat 9.

Three guest workloads were measured:
- SPECweb99 using Apache (SPECweb99 is a benchmark for
  evaluating the performance of web servers)
- 3 successive builds of the Linux 2.4 kernel, where each build
  executes make clean; make dep; make bzImage
- PostMark, a filesystem benchmark
Performance (cont'd)

Time and space overhead of logging for the three workloads.

Logging without checkpointing:

  Workload       Time overhead   Space overhead
  SPECweb99      12 %            85 KB/sec
  kernel build   11 %             7 KB/sec
  PostMark        3 %             2 KB/sec

Replay without checkpointing:
1 - 3 % longer than the original run for all three workloads
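
For scale: at the worst-case SPECweb99 rate, the log grows by roughly
85 KB/sec × 86,400 sec/day ≈ 7 GB per day, so even multi-day runs fit
comfortably on the 120 GB test disk.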
Performance (cont'd)

Running time with checkpointing:

Running times are normalized to running the workload
without any checkpoints.

Workload running time without checkpoints:

  SPECweb99      1135 sec
  kernel build   1027 sec
  PostMark       1114 sec
Performance (cont'd)

Space overhead of checkpointing


Performance (cont'd)

Time to restore a checkpoint


A common problem with traditional debuggers is that using the
debugger changes the timing of events in the application.
Are you convinced that this particular implementation can reproduce
the playback reliably enough for debugging purposes?

Would you say that the authors can claim the debugging strength
of their TTVM based on the debugging examples given in the paper?

Do you think that this technique can be adapted for debugging
parallel applications, which generally incur high replay cost and
complexity (multiprocessors)?

In general, which OS processes experience the most bugs, and the
most significant bugs? Would it be sufficient to monitor those
sections of the OS alone with TTVM?

How can TTVM be enhanced to identify OS bugs that it has not yet
encountered or might not encounter in the near future? How can the
entire range of OS bugs be identified?

Do you think that this idea is easily applicable to non-x86
architectures?

Would the guest kernel need to be modified if TTVM were implemented
on hardware-based virtualization technology (e.g., AMD-V, Intel VT)?

During checkpointing, how would you capture network state and
replay it later? What would you log?
