
Debugging operating systems with

time-traveling virtual machines

by S.T. King, G.W. Dunlap, P.M. Chen

Presented by: Mirna Limic


What is it? And why have one?

virtual machine + time travel = time-traveling virtual machine (TTVM)

virtual machine (VM): a software abstraction of a physical machine
time travel: the ability to navigate through an immutable execution history

A TTVM is used to debug an operating system (OS) because an OS:

- is non-deterministic
- runs for long periods of time
- may have its state perturbed by the debugger
- interacts directly with hardware devices
virtual machine monitor (VMM): the software layer that
provides the abstraction of a virtual machine
guest OS: the OS that runs on the VMM
host OS: the OS on which the VMM runs

IDEA: EXTEND gdb TO MAKE USE OF TIME TRAVEL

[Architecture diagram: gdb attaches to the guest-user and guest-kernel
host processes; these sit on the TTVM functionality layer
(checkpointing, logging, replay), which runs on the host OS.]
VM state: the VM's physical memory, the virtual disk, the CPU
registers, and any state in the VMM or host kernel
that affects the execution of the virtual machine
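
As a concrete picture of what must be saved, here is a minimal C sketch
of such a state record; the struct and field names are hypothetical, not
UML's or the paper's.

    /* Hypothetical sketch of the state a TTVM checkpoint captures.
     * Names and layout are illustrative; UML's real structures differ. */
    #include <stddef.h>
    #include <stdint.h>

    struct vm_state {
        uint8_t  *phys_mem;      /* the guest's "physical" memory            */
        size_t    phys_mem_len;
        int       disk_fd;       /* backing file of the guest's virtual disk */
        uint64_t  regs[32];      /* guest CPU register file (simplified)     */
        void     *vmm_state;     /* VMM/host-kernel state affecting the guest */
    };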

run: the time from when the virtual machine was powered on
to the last instruction it executed

TTVM capabilities (achieved with logging, replay, and checkpointing):
1. reconstruct the complete state of the VM at any point in a run
2. start from any point in a run and replay the instruction stream
   executed during the original run
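
As a rough sketch of the logging/replay half, assuming a simplified
event record: each non-deterministic event is logged with the
instruction count at which it occurred, then re-delivered at the same
point during replay. ReVirt's actual log format and delivery mechanism
differ.

    /* Minimal sketch of deterministic replay: record non-deterministic
     * events with their instruction counts, re-deliver them on replay. */
    #include <stdint.h>
    #include <stdio.h>

    struct nd_event {
        uint64_t instr_count;  /* when the event must be delivered     */
        int      type;         /* e.g., interrupt, IN result, DMA load */
        uint64_t payload;      /* the non-deterministic value itself   */
    };

    void log_event(FILE *log, const struct nd_event *e) {
        fwrite(e, sizeof *e, 1, log);            /* logging run: append */
    }

    int next_event(FILE *log, struct nd_event *e) {
        return fread(e, sizeof *e, 1, log) == 1; /* replay run: read back */
    }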
VMM, logging and replaying

The VMM used is User-Mode Linux (UML).

The logging/replay system used is ReVirt.

Host device drivers in the guest OS

UML exports a set of virtual devices with no hardware
equivalent. Problem: how do you debug device drivers?

Workaround: modify UML to run real device drivers
in the guest OS.

Result: I/O instructions and DMA requests of the guest OS
are forwarded to the host hardware.
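
A hedged sketch of that forwarding for port I/O, assuming the VMM runs
with raw port access granted via iopl(); logging the returned value is
what makes later replay deterministic. This is illustrative only, not
UML's actual implementation.

    /* Sketch: a guest IN instruction traps to the VMM, which executes it
     * against the real host port and logs the result for replay.
     * Requires root and iopl(3). */
    #include <stdio.h>
    #include <sys/io.h>

    unsigned char handle_guest_inb(unsigned short port, FILE *log) {
        unsigned char value = inb(port);  /* forward to host hardware    */
        fwrite(&value, 1, 1, log);        /* log the result for replay   */
        return value;                     /* deliver to the guest driver */
    }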
Host device drivers in the guest OS (cont'd)


Logging is performed on any information sent from the
device to the driver (IN instructions, memory-mapped
I/O reads, and DMA memory loads).

The host OS provides regions of its physical memory for the
guest's memory-mapped I/O and DMA.

Potential problem: corruption of the host's memory?

Solution: deny access to memory outside the intended region.
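
One plausible way to enforce that fence is to map only the permitted
host region into the guest's address space, so any access outside it
faults. A minimal sketch; the paper does not give the exact mechanism.

    /* Sketch: expose exactly one host physical-memory region to the
     * guest, so stray accesses fault instead of corrupting host memory. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>

    void *map_io_region(off_t phys_addr, size_t len) {
        int fd = open("/dev/mem", O_RDWR);   /* host physical memory */
        if (fd < 0)
            return NULL;
        /* Only [phys_addr, phys_addr + len) becomes accessible. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, phys_addr);
        return p == MAP_FAILED ? NULL : p;
    }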
Checkpointing

Checkpointing is used to speed up time travel over long time periods.

It works by logging memory and disk changes into undo and redo logs.

Difference between memory and disk:
- memory: log the actual pages at every checkpoint into the
  undo and redo logs
- disk: keep multiple versions of the guest's disk blocks, but log only
  the changes to the guest-to-host disk block map in the undo and
  redo logs (see the block-map sketch below)
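
A hedged sketch of that disk scheme, assuming a simple array-based
block map: a guest write goes to a freshly allocated host block, and
only the remapping is recorded. The names are illustrative, not the
paper's.

    /* Sketch: copy-on-write guest disk. The undo/redo logs record only
     * map changes, never block contents; old versions stay on disk. */
    #include <stdint.h>

    #define GUEST_BLOCKS 1024

    struct map_change {
        uint32_t guest_block;
        uint32_t old_host_block;   /* undo: restore this mapping */
        uint32_t new_host_block;   /* redo: reapply this mapping */
    };

    static uint32_t block_map[GUEST_BLOCKS]; /* guest block -> host block */
    static uint32_t next_free_host_block;    /* simplistic allocator      */

    struct map_change write_guest_block(uint32_t guest_block) {
        struct map_change c = {
            .guest_block    = guest_block,
            .old_host_block = block_map[guest_block],
            .new_host_block = next_free_host_block++,
        };
        block_map[guest_block] = c.new_host_block; /* old version kept  */
        return c;                                  /* append to the log */
    }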
Checkpointing: logging of memory

[Diagram: between checkpoint1 and checkpoint2 the guest writes pages
A, B, C; between checkpoint2 and checkpoint3 it writes pages A, D, E.
The undo/redo log pair for the first interval records pages A, B, C,
and the pair for the second interval records pages A, D, E: each pair
holds exactly the pages written in its interval.]
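
A minimal sketch of how the memory side could detect and save those
written pages, assuming mprotect()-based write faults, one common
copy-on-write trick; the paper's implementation details differ.

    /* Sketch: at each checkpoint, write-protect guest memory; the first
     * write to a page then faults, and the handler saves the old copy to
     * the undo log before permitting the write. Illustrative only. */
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    static FILE *undo_log;   /* opened elsewhere in this sketch */

    /* Called at a checkpoint: re-arm write faults on all guest pages.
     * (Pages dirtied since the last checkpoint also go to the redo log.) */
    void take_checkpoint(void *guest_mem, size_t len) {
        mprotect(guest_mem, len, PROT_READ);
    }

    /* Called from the write-fault handler with the faulting page. */
    void on_write_fault(void *page) {
        fwrite(page, PAGE_SIZE, 1, undo_log);              /* old copy    */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE); /* allow write */
    }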
TTVM-aware gdb

Commands added to gdb:

reverse continue - takes the VM back to the previous point
(a point is the reverse equivalent of a forward breakpoint,
watchpoint, or step)

reverse step - goes back a specified number of instructions

goto - jumps to an arbitrary time in the execution
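
A rough sketch of how reverse continue can be composed from checkpoints
and replay: replay forward from the checkpoint preceding the current
time, note the last point where a breakpoint or watchpoint fires, then
travel there. restore_checkpoint(), replay_step(), hit_breakpoint(),
and instr_now() are hypothetical helpers, not TTVM or gdb APIs.

    /* Sketch of reverse continue: two replay passes over one interval. */
    #include <stdint.h>

    extern void     restore_checkpoint(uint64_t before); /* nearest prior */
    extern void     replay_step(void);                   /* one instr     */
    extern int      hit_breakpoint(void);
    extern uint64_t instr_now(void);

    void reverse_continue(uint64_t now) {
        uint64_t last_hit = 0;  /* falls back to start of run if no hit */

        restore_checkpoint(now);
        while (instr_now() < now) {      /* pass 1: find last trigger */
            replay_step();
            if (hit_breakpoint())
                last_hit = instr_now();
        }
        restore_checkpoint(last_hit);    /* pass 2: travel to it */
        while (instr_now() < last_hit)
            replay_step();
    }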


Performance

Machine: uniprocessor 3 GHz Pentium 4, 1 GB memory,
120 GB Hitachi Deskstar GXP disk
Host OS: Linux 2.4.18 with UML running in skas mode,
plus the TTVM modifications
Guest OS: 256 MB memory, 5 GB disk
Both guest and host filesystems were initialized from Red Hat 9.

Three guest workloads were measured:
- SPECweb99 using Apache (SPECweb99 is a benchmark for
  evaluating the performance of web servers)
- 3 successive builds of the Linux 2.4 kernel, where each build
  executes make clean; make dep; make bzImage
- PostMark, a filesystem benchmark
Performance (cont'd)

Time and space overhead of logging for the three workloads.

Logging without checkpointing:

  Workload       Time overhead   Space overhead
  SPECweb99      12 %            85 KB/sec
  kernel build   11 %             7 KB/sec
  PostMark        3 %             2 KB/sec

Replay without checkpointing:
1 - 3 % longer than the original run for all three workloads
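
For scale: at the worst-case SPECweb99 rate, the log grows by roughly
85 KB/sec × 86,400 sec/day ≈ 7 GB per day, so even multi-day runs fit
comfortably on the 120 GB test disk.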
Performance (cont'd)

Running time with checkpointing:

Running times are normalized to running the workload
without any checkpoints.

Workload running time without checkpoints:

  SPECweb99      1135 sec
  kernel build   1027 sec
  PostMark       1114 sec
Performance (cont'd)

Space overhead of checkpointing


Performance (cont'd)

Time to restore a checkpoint


A common problem with traditional debuggers is that using the
debugger changes the timing of events in the application.
Are you convinced that this particular implementation can reproduce
the playback reliably enough for debugging purposes?

Would you say that the authors can claim the debugging strength
of their TTVM based on the debugging examples given in the paper?

Do you think that this technique can be adapted for debugging
parallel applications, which generally incur high replay cost and
complexity (multiprocessors)?

In general, which OS processes experience the most bugs, and the
most significant bugs? Would it be sufficient to monitor those
sections of the OS alone with TTVM?

How can TTVM be enhanced to identify OS bugs that it has not yet
encountered or might not encounter in the near future? How can the
entire range of OS bugs be identified?

Do you think that this idea is easily applicable to non-x86
architectures?

Would the guest kernel need to be modified if TTVM were implemented
on hardware-based virtualization technology (e.g., AMD-V, Intel VT)?

During checkpointing, how would you capture network state and
replay it later? What would you log?
