An Introduction to Virtual Machines: Implementation and Applications

by

Qian Huang

M.Sc., Tsinghua University, 2002
B.Sc., Tsinghua University, 2000

AN ESSAY SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE STUDIES


(Computer Science)

The University of British Columbia

October 2006

© Qian Huang, 2006


Abstract
Virtual machines provide an abstraction of the underlying physical system to the guest operating system running on it. Based on the level of abstraction the VMM provides and on whether the guest and host systems use the same ISA, we can classify virtual machines into many different types. For system virtual machines, there are two major development approaches: full system virtualization and paravirtualization. Because virtual machines provide desirable features such as software flexibility, better protection, and hardware independence, they are applied in various research areas and have great potential.

Contents
Abstract
Contents
Acknowledgments
Chapter 1  Introduction
Chapter 2  Virtual Machine Principles
  2.1  Virtual Machine Implementation
  2.2  An Early View of Virtual Machines
  2.3  A Systematic Taxonomy
    2.3.1  ISA and ABI
    2.3.2  Process Level Virtual Machines
    2.3.3  System Level Virtual Machines
    2.3.4  Taxonomy of Virtual Machines
Chapter 3  Full System Virtualization - VMWare
  3.1  Hosted approach - VMWare Workstation
  3.2  Hypervisor approach - VMWare ESX Server
Chapter 4  Para Virtualization - Xen
  4.1  CPU
  4.2  Memory
  4.3  I/O Devices
  4.4  Domain 0 - the Administration Interface
Chapter 5  Application of Virtual Machines
  5.1  System logger - ReVirt
  5.2  Migration
  5.3  Mate: Virtual machines for sensor networks
Chapter 6  Conclusions
Bibliography

Acknowledgments
I would like to express my gratitude to my supervisor, Norm Hutchinson, for his patience, inspiration, and encouragement. Without his help, completing this essay would have been impossible. I would also like to thank my parents and my husband for their endless love and support.

Chapter 1

Introduction

Standard computer systems are hierarchically constructed from three components: bare hardware, the operating system, and application software. To improve software compatibility, the standard Instruction Set Architecture (ISA) was proposed to precisely define the interface between hardware and software. In other words, the ISA is the part of the processor that is visible to the programmer or compiler writer. It includes both user and system instructions. User instructions are the set accessible to both the operating system and application programs, while system instructions are special privileged instructions for managing and protecting shared hardware resources, e.g., the processor, memory, and the I/O system. Application programs can access these resources only through system calls.
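As an illustration (my own sketch, not part of the original essay), the following Python fragment shows an application touching a shared resource, a file on disk, purely through system-call wrappers: `os.open`, `os.write`, `os.read`, and `os.close` are thin wrappers around the kernel's open(2)/write(2)/read(2)/close(2) interfaces. The file path is an arbitrary choice for the example.

```python
import os

# Applications reach privileged resources (here, the disk) only via
# system calls; they never drive the hardware directly.
path = "/tmp/syscall_demo.txt"

fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"hello via system calls\n")   # write(2)
os.close(fd)                                # close(2)

fd = os.open(path, os.O_RDONLY)             # open(2)
data = os.read(fd, 100)                     # read(2)
os.close(fd)
```

Everything the program "owns" above the kernel boundary is mediated this way, which is exactly the interface the system instructions protect.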

The standard architecture has many advantages. Since the interfaces are cleanly defined, application developers can skip hardware details such as I/O and memory allocation, and hardware and software designs can be decoupled. Within the same ISA, software can be reused across different hardware configurations and even across hardware generations. But this architecture also has its disadvantages.

Flexibility

In the standard architecture, the hardware, operating system, and application software are fixed, e.g., Windows applications can only execute on the Windows operating system, and Windows can only run on x86 machines. The three components, the hardware, the operating system, and the application programs, are not interchangeable. MacOS cannot run on an x86 machine, and applications compiled for Windows cannot execute on Linux. So software is restricted by the operating system it is compiled for and by the particular ISA underneath. It cannot move freely among all the computers connected by a network, because those computers usually vary in hardware and operating systems. Similarly, from the perspective of hardware, a single ISA cannot run all programs. So this architecture loses flexibility, which becomes more obvious as the Internet grows larger and application software becomes more complicated.

Protection

For application programs running concurrently on one OS, the isolation is not good enough. All sharing and protection management is handled by the single operating system, and since the programs share the system hardware, this gives malicious programs opportunities to exploit security holes.

Performance

Applications sharing one OS do not have exclusive access to the system's resources, so they inevitably interfere with each other. For example, developing an application on the same machine on which it will be deployed can cause the machine to restart, or worse. The stability of the system is weakened in this way.

Given all these limitations of standard computer systems, virtual machines provide a new way to address these problems.

Virtual machines were first developed by IBM in the 1960s and were very popular in the 1970s [1]. At that time, computer systems were large and expensive, so IBM invented the concept of virtual machines as a way of time-sharing for mainframes, partitioning machine resources among different users. A virtual machine is defined as a fully protected and isolated replica of the underlying physical machine's hardware. Thus it allows the same computer to be shared as if there were several machines, and users need not be aware of the virtual machine layer.

This essay first discusses the basic idea of virtual machine implementation. It then describes the two main methods for practical development of virtual machines: full system virtualization and paravirtualization. Finally, we present some applications of virtual machines in current systems research and their possible future trends.

Chapter 2

Virtual Machine Principles

Since the concept of the virtual machine was developed, it has always been a hot topic. Recently in particular it has experienced a great resurgence, and it plays an important role in systems research.

In this section, we discuss some general design and implementation issues of virtual machines and introduce a systematic taxonomy of them.

2.1 Virtual Machine Implementation

A virtual machine provides a fully protected and isolated replica of the underlying physical system. It takes a layered approach to achieve this goal: a new layer above the original bare system abstracts the physical resources and provides an interface to the operating systems running on it. This layer is called the Virtual Machine Monitor (VMM). The VMM is the essential part of the virtual machine implementation, because it performs the translation between the bare hardware and the virtualized platform, providing virtual processors, memory, and virtualized I/O devices. Since all the virtual machines share the same bare hardware, the VMM must also provide appropriate protection so that each virtual machine is an isolated replica.

The basic virtual machine model is shown in Figure 1 [2], where the virtual machine monitor sits between the bare system hardware and the operating systems.

Figure 1: Simple Virtual Machine Model. Applications run on guest operating systems (systems 1-3), which run on the Virtual Machine Monitor, which runs on the bare machine.

Usually the underlying platform comprising the virtual machine monitor and the bare machine, which provides the virtual machine environment, is called the "host", and the operating system and the applications running on it are called "guests". Actually this is just one of many possible virtual machine models, and we will address the others later in this chapter. Also, the abstract interfaces which VMMs provide can be of different types. Some virtual machine monitors perform whole system virtualization, which means the guest operating system does not need any changes to run on the virtualized system hardware, while other VMMs do not do full virtualization, and some code of the guest operating system must be changed to suit the abstract interface. This type of virtual machine mechanism is called paravirtualization.

2.2 An Early View of Virtual Machines

Robert Goldberg gave a good summary of the virtual machine research of the 1960s and 1970s [3], and he also summarized the principles for implementing a virtual machine. As he described, the major purposes of virtual machines were to solve software transportability, to debug OSes, and to run test and diagnostic programs. Since the architecture of the third generation computers could not be virtualized directly, virtualization had to be done with difficult software workarounds. Some researchers then proposed an approach to address this problem: virtualizable architectures that directly support virtual machines, including Goldberg's Hardware Virtualizer. The basic idea is to avoid the trap-and-simulate mechanism, which makes the VMM smaller and simpler and the machine more efficient. This sounds like a great idea, but it has not become the main trend for virtual machines. Currently there are still no "virtualizable architectures", and the implementation of a virtual machine still needs a lot of effort, as we discuss in Chapters 3 and 4.

2.3 A Systematic Taxonomy

Besides the original virtual machine type mentioned above, there are many other types in different research areas. In all these types, the VMM sits at different layers of a standard system and plays different roles, so the term "virtual machine" is sometimes confusing. J.E. Smith and Ravi Nair proposed a systematic taxonomy to classify all these virtual machines and introduced a diagram language that can precisely distinguish the different types [4].

So before we continue with the implementation of two concrete examples of virtual machines, we need to see the whole picture of the different virtual machines.

2.3.1 ISA and ABI

Since a virtual machine is a layer which abstracts all the layers below it and provides an interface to the layer above it, the level at which the virtual machine does the abstraction is a good criterion for classifying virtual machines.

There are two perspectives on what a machine is: that of a process, and that of the whole system.

From the perspective of a process, the machine is the assigned memory address space, the instructions, and the user-level registers. A process does not have direct access to disk or other secondary storage and I/O resources; it can access the I/O resources only through system calls. The entire system, by contrast, provides a full environment that can support multiple processes simultaneously and allocates physical memory and I/O resources to the processes. Also, the operating system, as part of the system, handles how the processes interact with their resources.

So based on the abstraction level, we have process virtual machines and system virtual machines. As the names imply, a process virtual machine supports an individual process, while a system virtual machine supports a complete operating system and its environment.

To better understand these two types of virtual machines, we need to know about two standardized interfaces: the ISA and the Application Binary Interface (ABI). As discussed above, the ISA is the part of the processor visible to the programmer or compiler writer, and it includes both user and system instructions. The ABI, in turn, comprises the whole set of user instructions plus the system call interfaces through which applications access hardware resources. In other words, the ABI separates processes from the rest of the whole system, and the ISA separates the hardware from the rest.

Given these definitions, we can say that process level virtual machines provide an ABI to applications, and system level virtual machines provide an ISA to the operating system and the applications running on it. Based on whether they support an ABI or an ISA, and on whether the host and guest systems use the same ISA, we can classify virtual machines into different types.

2.3.2 Process Level Virtual Machines

Actually most of the process level virtual machines mentioned below are not commonly known as "virtual machines". But they all have the defining property of virtual machines: they provide virtual layers to the modules above them.

Multiprogramming

Multiprogramming is a standard feature in modern operating systems. The operating system provides a replicated ABI to each process, and each process thinks it owns the whole machine. So the concurrently executing applications are actually running on process level virtual machines. In this type of virtual machine, the guest and host systems share the same ISA and the same OS.

Emulation

The second type of process level virtual machine runs program binaries compiled for a source ISA on underlying hardware with a different ISA. The virtual machine needs to emulate the execution of the source ISA. The simplest way is interpretation, i.e., the VM interprets every source instruction by executing several native ISA instructions. Clearly this method has poor performance. So binary translation is more commonly used: it converts source instructions to native instructions with equivalent functions, and once a block of instructions is translated, it can be cached and reused repeatedly. We can see that interpretation has minimal start-up cost but a huge overhead for emulating each instruction, while binary translation, on the contrary, has a bigger initial overhead but executes each instruction quickly thereafter.
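The trade-off can be pictured with a toy sketch (entirely my own, not based on any real emulator): a two-instruction "source ISA" is either decoded on every run by an interpreter, or translated once into a Python closure that is cached and reused, mirroring interpretation's low start-up cost versus binary translation's fast repeated execution.

```python
# Toy source ISA: ("ADD", n) adds n to the accumulator, ("MUL", n)
# multiplies it by n. Real ISAs and translators are far richer.

def interpret(block, acc):
    """Decode and emulate every instruction on every run
    (cheap start, high per-instruction overhead)."""
    for op, n in block:
        if op == "ADD":
            acc += n
        elif op == "MUL":
            acc *= n
        else:
            raise ValueError(f"unknown opcode {op}")
    return acc

_translation_cache = {}

def translate(block):
    """Translate a block once into 'native' code (here, a Python
    closure), then cache it for reuse (one-time cost, fast runs)."""
    key = tuple(block)
    if key not in _translation_cache:
        ops = []
        for op, n in block:
            if op == "ADD":
                ops.append(lambda a, n=n: a + n)
            elif op == "MUL":
                ops.append(lambda a, n=n: a * n)
            else:
                raise ValueError(f"unknown opcode {op}")
        def compiled(acc, ops=ops):
            for f in ops:
                acc = f(acc)
            return acc
        _translation_cache[key] = compiled
    return _translation_cache[key]

block = [("ADD", 3), ("MUL", 4)]
assert interpret(block, 1) == 16          # (1 + 3) * 4
assert translate(block)(1) == 16          # same result via cached code
```

The second call to `translate` for the same block skips translation entirely, which is where binary translation recoups its higher initial cost.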

Dynamic Optimizers

In the above type of virtual machine, the source and target ISAs are different, so the purpose of the virtual machine is to emulate execution. Other virtual machines have the same source and target ISA, and the goal of the VM is to perform optimizations on the running process. The implementation of virtual machines for dynamic optimization is very similar to that for emulation.

High Level Virtual Machines

The last type of process level virtual machine is the most commonly recognized one, partly due to the popularity of Java. The purpose of the previous three kinds of virtual machines, except for dynamic optimizers, is to improve cross-platform portability. But their approaches require great effort for every ISA, so a better way is to move the virtualization to a higher level: bring the process level virtual machine into the high level language design. Two good examples of this type of virtual machine are Pascal and Java. In a conventional system, HLL programs are compiled to an abstract intermediate code, which a code generator then turns into object code for a specific ISA/OS. But with Pascal and Java, the code to be distributed is not the object code but the intermediate code: P-code for Pascal and bytecode for Java. On every ISA/OS, there is a virtual machine that interprets the intermediate code into platform specific host instructions. So this type of process virtual machine provides maximal platform independence.
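The per-platform interpreter can be pictured with a minimal stack-machine sketch (my own simplification; real P-code and Java bytecode instruction sets are much larger). The same intermediate code runs unchanged wherever this small interpreter exists:

```python
def run_bytecode(code):
    """Interpret a tiny stack-based intermediate code.
    PUSH n pushes a constant; ADD/MUL pop two values and push the
    result. Only the interpreter is platform specific; the bytecode
    itself is fully portable."""
    stack = []
    for instr in code:
        op = instr[0]
        if op == "PUSH":
            stack.append(instr[1])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack.pop()

# (2 + 3) * 5, compiled once to bytecode, runs on any host with the VM.
program = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 5), ("MUL",)]
assert run_bytecode(program) == 25
```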

2.3.3 System Level Virtual Machines

System virtual machines are our main focus; they are also what is most commonly meant when the term "virtual machine" is used.

Classic Virtual Machines

Classic virtual machines are the original model of system virtual machines. As in Figure 1, the VMM sits directly on the bare hardware and provides a hardware replica and resource management. In most cases this model brings efficiency, but the VMM has to handle all the device drivers, and users have to wipe the existing system clean before installing the VMM and guest OSes.

Hosted Virtual Machines

Hosted virtual machines, as the name implies, build the VMM on top of an existing host OS. So it is convenient for users to install the VMM, which works just like installing an application program, because the VMM does not run in privileged mode. The VMM can also use facilities provided by the host OS, for example, device drivers. But this kind of virtual machine implementation is in most cases less efficient than a classic virtual machine because of the extra software layer.

In both the classic virtual machines and the hosted virtual machines, the
ISA of the guest OS is the same as the underlying hardware.

Whole system Virtual Machines

Sometimes we need to run an operating system and applications on a different ISA. In these cases, complete emulation and translation of the whole OS and its applications are required, hence the name whole system virtual machines. Usually the VMM sits on top of a host OS running on the underlying ISA's hardware.

Co-designed Virtual Machines

The above three kinds of system virtual machines are all built on a well-established ISA. Co-designed virtual machines focus on improving performance or efficiency with non-standard ISAs. There is no native implementation of the guest ISA, so no native execution is possible; usually the VMM uses a binary translator to convert guest instructions to native ISA instructions. The VMM works like part of the hardware implementation, providing the guest operating system and applications a VM platform that looks just like a native hardware platform. The native ISA is totally concealed from the guest OS and software.

2.3.4 Taxonomy of Virtual Machines

From all the classifications above, based on the level of the virtual machine (ABI or ISA) and on whether the host and guest use the same ISA, we get the overall taxonomy of Figure 2 [4]:

Figure 2. A taxonomy of virtual machine architectures. Process VMs provide an ABI: with the same ISA they include multiprogramming systems and dynamic optimizers; with a different ISA, dynamic translators and high level language VMs. System VMs provide an ISA: with the same ISA they include classic OS VMs and hosted VMs; with a different ISA, whole system VMs and co-designed VMs.

As said above, system virtual machines are our research focus, so in the next chapter we discuss the two major approaches to implementing system VMs.

Chapter 3

Full System Virtualization - VMWare

There are various ways to implement a virtual machine. Most of them can
be classified into two categories: full virtualization and para-virtualization.

Full virtualization is designed to provide an unmodified illusion/abstraction of the underlying physical system, creating a complete virtual system in which the guest operating system can execute. The goal of full virtualization is that guest OSes and the applications running on them need zero changes when migrated to the virtual machine. The abstraction provided by full virtualization must be exactly the same as the physical hardware, so that the guest OS and the applications are not aware that they are running in a virtual machine instead of on the physical machine. Thus one of the big advantages of full virtualization is that you can port any existing guest OS and application for a given system to a fully virtualized machine without any additional cost. For example, with VMWare Server, a full virtual machine, you are ready to run all commodity x86 operating systems and applications. However, because of the strict requirement of completely mirroring the underlying physical system, full virtualization usually pays a performance penalty.

Here I will first discuss the details of the implementation of VMWare [6],
a representative example of the full virtualization approach.

There are two types of full virtualization in VMWare's solutions [6]: the hosted architecture and the hypervisor architecture. Both target the IA-32 architecture and support running unmodified commodity operating systems, like Windows 2000, Windows XP, and Red Hat Linux. VMWare Workstation [2] uses the hosted approach, in which the VMM and the guest OS are installed and run on top of a standard commodity operating system; it uses the host OS to support a broad range of hardware devices. The hypervisor architecture, in contrast, installs a layer of software, called the hypervisor, directly on top of the bare hardware, and the guest OS runs on top of the hypervisor. VMWare ESX Server [7][8] is the representative of the hypervisor architecture. Next I will explain the techniques used in full virtualization by comparing and contrasting these two VMWare products.

3.1 Hosted approach - VMWare Workstation

There are two major advantages to this hosted architecture:

The PC's "open" architecture resulted in a large diversity of hardware devices which need to be managed by the virtual machine. The hosted approach can leverage the existing device drivers in the standard operating system, saving the effort of porting hundreds of device drivers to virtual drivers.

Most PC users have a large amount of software installed and configured properly in their existing operating system. They do not want to lose this software by switching to the virtual machine. The hosted approach allows the co-existence of the original OS/software and the virtual machine/guest OS.

However, the downsides of this hosted approach are also obvious. Because the host OS has full control of the hardware, the VMM cannot perform full-fledged scheduling even though it has full system and hardware privileges. For example, the VM cannot be guaranteed a certain CPU share, because the VMM itself is scheduled by the host OS. Secondly, to obtain acceptable performance, the guest OS needs to run on the physical hardware directly as much as possible, and the context switch between the guest OS/VM world and the host OS world is even more expensive than a process switch. I/O performance thus becomes a big issue, because I/O operations in the guest OS have to be forwarded to the device drivers in the host OS, and the context switches are unavoidable there.

Figure 3 [2] illustrates the structure of a guest OS in a virtual machine in VMWare Workstation's hosted architecture. Installing VMWare Workstation in the host OS is the same as installing a normal application. When it runs, the application portion (VMApp) uses a driver (VMDriver) loaded into the host OS to create a privileged virtual machine monitor component (VMM). This component lives in the guest OS's kernel space and runs directly on the physical hardware.

Figure 3. Hosted architecture for VMWare Workstation. In the host world, applications and the VMApp run on the host OS with the VMDriver loaded; in the VMM world, the guest OS and its applications form the virtual machine running on the VM monitor. Both worlds share the physical machine.

The physical processor now switches between two worlds: the host OS world and the VMM world. The guest OS and the applications on it all run in user mode. Execution is a combination of direct execution and binary translation. Most non-privileged instructions run on the physical hardware directly. Privileged instructions are translated at run time into other instruction sequences; the translated sequence ensures a trap into the VMM and emulates the effect of executing the privileged instruction. When the guest OS performs an I/O operation, it is intercepted by the VMM. Instead of accessing the physical hardware directly, the VMM switches to the host OS world and calls the VMApp to perform this I/O operation via a normal system call on behalf of the VM. The VMM may also yield control to the host OS when necessary, so that the host OS can handle the interrupt sent back from the hardware when the I/O operation finishes. Only the host OS deals with the hardware via the normal device drivers; the VMApp bridges requests and replies back and forth between the VM and the host OS, and the VMM never touches a physical I/O device. Memory virtualization is another interesting topic for full virtualization: a shadow page table must be introduced to map physical addresses to machine addresses, giving the virtual machine the illusion of a contiguous zero-based physical memory. We will focus on VMWare's memory virtualization techniques in the next section.
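A crude model of this world-switching I/O path can be sketched as follows (entirely my own simulation, not VMware code; `ToyVMM`, `vmapp_do_io`, and the other names are invented stand-ins): non-privileged guest work runs "directly", while a privileged I/O request traps into the VMM, which switches worlds and lets the application-level proxy issue a normal host system call on the VM's behalf.

```python
# Toy model of the hosted-architecture I/O path:
#   guest OS -> trap into VMM -> world switch -> VMApp host syscall.
host_log = []

def vmapp_do_io(request):
    """Stands in for the VMApp performing I/O via a host system call."""
    host_log.append(("host-syscall", request))
    return f"done:{request}"

class ToyVMM:
    def __init__(self):
        self.world_switches = 0

    def handle_guest_io(self, request):
        # Privileged operation: trap, switch to the host world, and
        # hand the request to the VMApp. Each switch is expensive.
        self.world_switches += 1
        return vmapp_do_io(request)

class GuestOS:
    def __init__(self, vmm):
        self.vmm = vmm

    def run(self):
        results = []
        results.append(2 + 2)  # non-privileged work: direct execution
        results.append(self.vmm.handle_guest_io("read disk block 7"))
        return results

vmm = ToyVMM()
out = GuestOS(vmm).run()
assert out == [4, "done:read disk block 7"]
assert vmm.world_switches == 1       # one costly world switch for one I/O
```

The counter makes the performance argument concrete: every guest I/O costs at least one world switch, which is why I/O is the weak point of the hosted approach.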

3.2 Hypervisor approach - VMWare ESX Server

Unlike VMWare Workstation, which builds on top of an existing operating system, VMWare ESX Server [7][8] runs on the physical hardware directly. For CPU virtualization, it uses the same technique as VMWare Workstation, i.e., direct execution with dynamic binary translation. I will skip the CPU virtualization to avoid repetition, and instead dive into the innovative memory virtualization techniques in VMWare ESX Server.

Shadow page tables for memory virtualization. There are three types of addresses in the virtual machine world: the virtual address, which is the application-visible address in the guest OS; the physical address, which is the zero-based linear address space abstracted by the VMM and presented to the guest OS; and the machine address, which is the hardware memory address accessed by the physical processor. The unmodified commodity guest OS assumes it is running directly on hardware whose linear address space starts at zero, and it is the VMM's responsibility to provide this illusion. To support this, VMWare ESX Server maintains a pmap data structure for each VM to map physical addresses to machine addresses. Instead of using the guest OS's normal page table, the processor uses a "shadow page table" which contains the translation from virtual addresses directly to machine addresses. All guest OS instructions that manipulate page tables and TLBs are trapped by dynamic binary translation, and ESX Server updates both the pmap and the shadow page table to keep them synchronized. The biggest advantage of the shadow page table is that normal memory accesses in the guest OS can execute directly on the native processor, because the virtual-to-machine translation is in the TLB. However, ESX Server pays a performance penalty to maintain the correctness of the shadow page table.
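The two-level mapping can be sketched as follows (a toy model with invented page numbers; real ESX keeps these structures per address space and handles TLB invalidation, permissions, and faults): the guest page table maps virtual pages to guest-"physical" pages, the pmap maps those to machine pages, and the shadow table composes the two so hardware resolves virtual pages to machine pages in one step.

```python
# Page frame numbers, not byte addresses; all values are made up.
guest_page_table = {0: 0, 1: 1, 2: 2}   # virtual page -> "physical" page
pmap = {0: 57, 1: 12, 2: 89}            # "physical" page -> machine page

def build_shadow(guest_pt, pmap):
    """Compose the two mappings so the TLB can go straight from
    virtual page to machine page."""
    return {vpn: pmap[ppn] for vpn, ppn in guest_pt.items()}

shadow = build_shadow(guest_page_table, pmap)
assert shadow == {0: 57, 1: 12, 2: 89}

# A trapped guest instruction remaps virtual page 1 onto physical
# page 2; the VMM must rebuild/patch the shadow entry to stay in sync.
guest_page_table[1] = 2
shadow = build_shadow(guest_page_table, pmap)
assert shadow[1] == 89   # virtual page 1 now hits physical 2's frame
```

Rebuilding on every change is what the "performance penalty to maintain correctness" refers to; a real VMM patches individual entries rather than recomputing the whole table.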

Ballooning to reclaim memory under over-commitment. Over-commitment is considered one of the important advantages of virtual machines: the total memory configured for all virtual machines exceeds the total physical memory, and memory pages shift among the VMs based on configuration and workload. This uses the limited memory resource more efficiently, because most of the time the different guest OSes have different levels of demand for memory, and overall performance improves when more memory is allocated to the guest OS with the higher demand. The problem, however, is how to find pages to reclaim. ESX Server lets the guest OSes make the choice, based on the fact that the best information about the least valuable pages is known only to the guest OS. A ballooning device driver is installed in each guest OS. When ESX Server needs to squeeze memory out of a guest OS, it asks the ballooning driver to inflate, i.e., to request more memory from that guest OS. Using its own replacement algorithm, the guest OS pages out its least valuable pages to the virtual disk, and the pages obtained by the ballooning driver are passed to ESX Server, which updates the shadow page tables to move them to another guest OS with higher memory demand. The ballooning driver in the latter guest OS then performs a deflation operation, i.e., returns pages to its OS, which now has more free pages to allocate to applications. With ballooning, ESX Server avoids tracking page usage history and coding complicated replacement algorithms: the decision is made at the place best suited to make it.
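The inflate/deflate cycle can be modeled with a small sketch (my own toy; `ToyVM`, `rebalance`, and the page counts are invented, and the guest's replacement policy is elided): inflating the balloon in the donor VM frees machine pages, which the hypervisor hands to the VM under memory pressure.

```python
# Toy balloon-based reclamation between two VMs sharing machine pages.
class ToyVM:
    def __init__(self, name, pages):
        self.name = name
        self.pages = pages      # machine pages currently backing this VM
        self.balloon = 0        # guest pages pinned by the balloon driver

    def inflate(self, n):
        """The balloon driver requests n pages from this guest; the
        guest evicts its own least valuable pages (policy elided) and
        the freed machine pages go back to the hypervisor."""
        n = min(n, self.pages)
        self.pages -= n
        self.balloon += n
        return n

    def deflate_and_receive(self, n):
        """The receiving guest's balloon deflates: pages return to
        its OS for allocation to applications."""
        self.pages += n

def rebalance(donor, receiver, n):
    moved = donor.inflate(n)
    receiver.deflate_and_receive(moved)
    return moved

a, b = ToyVM("A", pages=100), ToyVM("B", pages=100)
moved = rebalance(a, b, 30)            # B is under memory pressure
assert moved == 30
assert (a.pages, b.pages) == (70, 130)
assert a.pages + b.pages == 200        # machine pages are conserved
```

Note that only the guests decide *which* pages to give up; the hypervisor only decides *how many*, which is exactly the division of labor ballooning is designed for.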

Content-based transparent memory sharing. The shadow page table makes it very easy to share a page among different VMs: multiple virtual page numbers can be mapped to a single machine page number. This reduces the overall memory footprint, and in some cases lowers overhead by eliminating copies. ESX Server uses a content-based transparent memory sharing technique. Transparent means the VMs do not know that pages are shared; shared pages look the same as privately owned ones. Disco [5] can discover shared pages at page creation time, but that requires a change to the guest OS, which is unacceptable for ESX Server. Content-based page sharing means the VMM shares all pages having the same content. Obviously, comparing every page with every other page has O(n^2) complexity in page comparisons, so ESX Server uses hashing to reduce the number of full page comparisons. A hash function is first applied to every read-only page to summarize its content, and the resulting hash value is used as a key in a global hash table. Whenever a new read-only page appears, its hash value is computed and looked up in the global hash table. If the key is already in the table, a full page comparison is performed between the new page and the pages keyed by that hash value. If there is a match, a shared page has been found: the shadow page table is updated to map the shared page, and the shared page is marked copy-on-write (COW). If no match is found, the new page is added to the page set in the global hash table. Even though content-based memory sharing is more expensive to maintain, it can find identical pages that cannot be detected by the traditional approach, and thus potentially saves more memory and more data copy operations.
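The hash-then-compare flow above can be sketched directly (my own toy, not ESX code; pages are byte strings, the names are invented, and the COW fault path is elided): the digest narrows the candidates, a full comparison confirms the match, and confirmed matches are shared and marked copy-on-write.

```python
import hashlib

# Toy content-based page sharing. "Machine memory" is a list indexed
# by machine page number; the global hash table maps content digests
# to candidate machine page numbers.
machine_pages = []        # machine page number -> page content
hash_table = {}           # digest -> [machine page numbers]
cow = set()               # machine pages marked copy-on-write

def insert_page(content):
    """Return a machine page number for `content`, sharing the page
    when an identical one already exists."""
    digest = hashlib.sha1(content).digest()
    for mpn in hash_table.get(digest, []):
        if machine_pages[mpn] == content:   # full compare on hash hit
            cow.add(mpn)                    # shared pages become COW
            return mpn
    machine_pages.append(content)           # no match: new machine page
    mpn = len(machine_pages) - 1
    hash_table.setdefault(digest, []).append(mpn)
    return mpn

p1 = insert_page(b"\x00" * 4096)   # e.g. a zero page from VM 1
p2 = insert_page(b"kernel text")   # a unique page from VM 2
p3 = insert_page(b"\x00" * 4096)   # identical zero page from VM 2
assert p1 == p3 and p1 != p2       # identical content is shared
assert p1 in cow and p2 not in cow # only shared pages are COW
assert len(machine_pages) == 2     # two machine pages back three pages
```

The full comparison after a hash hit is what keeps the scheme correct even in the unlikely event of a digest collision.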

Chapter 4

Para Virtualization - Xen

Para-virtualization [9][11], unlike full virtualization, does not target an exactly identical abstraction of the underlying physical system, thereby avoiding an unnecessary performance penalty. It provides each virtual machine an abstraction that can be implemented efficiently on the given hardware. Because this abstraction differs from the original hardware interface, the guest OS has to be modified to run on the virtual machine. Depending on the design, however, the ABI (Application Binary Interface) can be kept untouched, so applications run in the guest OS without any modification. Relaxing the abstraction gives the virtual machine monitor a much larger design space in which to reduce overhead and improve performance. Para-virtualization has been getting more attention recently because it can shrink the performance penalty to a very small extent and yield performance very close to the native OS. One of the successful para-virtualization examples is Xen [9][10], developed by a group of researchers at the University of Cambridge. Some industry data centers have already started to provide Xen-based virtual servers to their clients at a lower price but with similar performance.

Xen targets IA-32 (x86), the most prevalent architecture in the world, but also
one on which full virtualization is notoriously expensive to implement:
supervisor instructions executed with insufficient privilege may fail silently
instead of causing a convenient trap, and the TLB is managed by hardware rather
than software and is not tagged with an address-space identifier. To avoid
paying this extra performance penalty, Xen chose to present the virtual
machines with a new interface and to modify the guest OS, XenoLinux, to adopt
it.

Drawing on previous VMM research, the Xen designers guided the whole project
with the following design principles:

Support for unmodified application binaries. Even though some modifications
have to be made in the guest OS, the existing standard ABI must be kept the
same; otherwise users will not transition to Xen. This is one of the key
factors in Xen's success.

Support for full multi-application operating systems. Xen deliberately
distinguished itself from the Denali [11] VMM, which is designed to support
thousands of virtual machines running network services. In Denali, applications
are customized and linked directly into an equally customized guest OS, like
the libOS in the Exokernel [19]. In contrast, the modified XenoLinux can run
multiple standard, complex Linux applications concurrently in a single guest OS
instance.

Para-virtualization. Obtaining strong resource isolation and high performance
at the same time is difficult, especially on the uncooperative x86
architecture. Para-virtualization is necessary to overcome those difficulties.

Do not completely hide real resources from the guest operating system. This
decision was made for both the correctness and the performance of the virtual
machine. It is desirable for the guest OS to see both virtual and real
resources in some situations; for example, the guest OS can handle
time-sensitive tasks, like TCP timeouts, better if it is provided with both
virtual and real time.

4.1 CPU

The x86 architecture has four hardware privilege levels, numbered zero to three
and referred to as ring 0 through ring 3. Ring 0 is the most privileged level
and is therefore where a native OS always runs, because it is the only level
that can execute privileged instructions. Applications usually run in the least
privileged level, ring 3, and the other two rings are unused in most cases. In
order to protect the hypervisor (Xen) from OS misbehavior, the hypervisor must
run at a higher privilege level than the guest OS. In Xen's implementation, the
guest OS is moved to ring 1, so it has to delegate privileged instructions to
Xen, which runs in ring 0. Xen then validates and executes those privileged
instructions on behalf of the guest OS to enforce protection and isolation.
Most of the time, as long as no privileged instruction is invoked, the guest OS
runs at this less privileged level without any interference from Xen, which
eliminates all unnecessary overhead.

System calls (software exceptions) and page faults are the two types of
exceptions that may have a significant impact on system performance. In Xen,
the guest OS can register fast exception handlers for system calls. Xen
validates the installation of these fast handlers, but once they are
registered, the guest OS can invoke them directly without indirecting via ring
0. Page faults, however, cannot be handled this way, because the
faulting-address register cannot be read from ring 1. Xen has to intervene when
a page fault happens: it saves the value of that register and copies it to a
place accessible from ring 1.

Three types of time are presented to the guest OS by Xen: real time from the
processor's actual cycle counter, virtual time which stops when the guest OS
does not occupy the CPU, and wall-clock time which is the real time plus an
offset. The guest OS can program a pair of alarm timers, one for real time and
one for virtual time. This gives the guest OS better control over different
tasks: it is easy to imagine that time-sensitive tasks, like TCP timeouts, will
use the real timer, while the virtual timer will be used for fair process
scheduling within the guest OS.

4.2 Memory

The x86 manages its memory by combining paging with segmentation. A virtual
address consists of a 16-bit segment selector and a 16- or 32-bit segment
offset. The selector is used to fetch a segment descriptor from the segment
descriptor tables (actually, there are two tables, and one bit of the selector
chooses between them). The 64-bit descriptor contains the 32-bit address of the
segment (called the segment base), 21 bits indicating its length, and
miscellaneous bits indicating protections and other options. The segment length
is encoded as a 20-bit limit plus one bit indicating whether the limit should
be interpreted in bytes or in pages. If the offset from the original virtual
address does not exceed the segment length, it is added to the base to produce
a "physical" address called the linear address. If paging is turned off, the
linear address really is the physical address. Otherwise, it is translated by a
two-level page table, with the 32-bit address divided into two 10-bit page
numbers and a 12-bit offset, assuming a 4K page size. The TLB caches
translations from virtual page numbers to physical frame numbers. If a TLB
lookup misses, the processor walks the page table structure automatically in
hardware. If the relevant part of the page table is not in memory, a page fault
exception takes the process to the page fault handler.
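The two-level translation described above can be illustrated with a small sketch. This is a simplified model of the lookup arithmetic only (no TLB, no segmentation, no permission bits); the function and structure names are my own.

```python
PAGE_SIZE = 4096  # 4K pages => 12-bit offset, two 10-bit indices

def translate(linear: int, page_dir: dict) -> int:
    """Walk a two-level x86-style page table (illustrative, not real MMU code).

    page_dir maps a 10-bit directory index to a page table (another dict),
    which maps a 10-bit table index to a physical frame number.
    """
    dir_idx = (linear >> 22) & 0x3FF      # top 10 bits: page directory index
    tbl_idx = (linear >> 12) & 0x3FF      # next 10 bits: page table index
    offset = linear & 0xFFF               # low 12 bits: offset within page
    page_table = page_dir.get(dir_idx)
    if page_table is None or tbl_idx not in page_table:
        raise LookupError("page fault")   # hardware would raise #PF here
    frame = page_table[tbl_idx]
    return (frame << 12) | offset
```

For example, with `page_dir = {1: {2: 0x42}}`, translating the linear address `(1 << 22) | (2 << 12) | 0x34` yields physical address `0x42034`.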

For better performance, the Xen designers decided to minimize Xen's involvement
in virtual address translation: the guest OS must manage the hardware page
table on its own, and Xen comes into play only when needed to ensure safety and
isolation. Each time a guest OS initiates a new page table, it allocates a page
from its own memory reservation and registers it with Xen via a hypercall (a
system call to the hypervisor). Xen inspects this request and, if validation
passes, registers the new page table directly with the hardware MMU. From that
point on, the guest OS must relinquish direct write privileges to the
page-table memory, and all subsequent updates must be validated by Xen. Apart
from this, the guest OS can perform all virtual address translation operations
through the MMU without invoking Xen (on a page fault, Xen does need to be
involved to forward the faulting-address register and to update the page
table).
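The register-then-validate flow can be sketched as a toy monitor. All names here are hypothetical; the real Xen hypercall interface works on machine frames and enforces many more invariants (reference counts, page types, and so on).

```python
class PageTableMonitor:
    """Toy model of hypervisor-validated page-table updates."""

    def __init__(self, reservation: set):
        self.reservation = reservation   # machine frames owned by this guest
        self.registered = {}             # page-table frame -> {vpn: mfn}

    def register_table(self, frame: int):
        """Hypercall: validate and register a new page table."""
        if frame not in self.reservation:
            raise PermissionError("frame outside guest reservation")
        # From here on the guest has no direct write access to this frame.
        self.registered[frame] = {}

    def update(self, frame: int, vpn: int, mfn: int):
        """Hypercall: every subsequent update is validated by the monitor."""
        if frame not in self.registered:
            raise PermissionError("page table not registered")
        if mfn not in self.reservation:
            raise PermissionError("mapping targets foreign memory")
        self.registered[frame][vpn] = mfn
```

The key invariant the checks preserve is isolation: a guest can never install a mapping to memory it does not own.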

Recall that the x86 TLB has no address-space tags, which means a TLB flush on
every address-space switch. To avoid an address-space switch during hypercalls,
Xen occupies a 64MB section at the top of every address space. This top 64MB
region is not used by the standard x86 ABI, so the change does not break
application compatibility. To enforce safety, the region is made neither
accessible nor remappable by the guest OS.

4.3 I/O Devices

To provide isolation, demultiplexing, and high performance at the same time,
Xen's designers had to resolve two big challenges: resource management and
event notification.

To address the first challenge, data transfer between Xen and the guest OSs is
implemented efficiently by exploiting virtual memory, i.e., shared memory. When
requesting an I/O operation, the guest OS commits a buffer from its own memory
reservation. The buffer is passed to Xen via a synchronous hypercall; Xen first
validates the request, for example verifying that the buffer lies within the
guest OS's memory reservation, and then remaps the page to be accessible only
by Xen. The DMA operation is then executed by the DMA controller. When the data
transfer finishes, the device delivers an interrupt to Xen, and upon receiving
it, Xen remaps the buffer back into the guest OS address space.

This brings us to the second challenge, event notification. Xen supports a
lightweight event delivery mechanism that propagates asynchronous notifications
to the guest OS. Notifications are made by updating a bitmap of pending event
types in the guest OS kernel and, optionally, by calling a callback event
handler registered by the guest OS.
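The bitmap-plus-optional-callback scheme can be modeled in a few lines. This is a sketch under my own naming, not Xen's event-channel API; a real implementation would use atomic bit operations shared between hypervisor and guest.

```python
class EventChannel:
    """Toy model of lightweight event notification (names illustrative)."""

    def __init__(self, num_types: int):
        self.pending = [False] * num_types   # bitmap in the guest kernel
        self.callback = None                 # optional guest-registered handler

    def register_callback(self, handler):
        self.callback = handler

    def notify(self, event_type: int):
        """Hypervisor side: mark the event pending, optionally upcall."""
        self.pending[event_type] = True
        if self.callback is not None:
            self.callback(event_type)

    def poll(self):
        """Guest side: collect and clear all pending events."""
        fired = [i for i, p in enumerate(self.pending) if p]
        self.pending = [False] * len(self.pending)
        return fired
```

Because the bitmap only records event *types*, notification stays cheap no matter how many individual events fire; the guest recovers the details by polling the relevant queues.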

Xen creates an I/O descriptor ring for every guest OS to handle asynchronous
I/O requests and responses. The ring is allocated in the guest OS's memory
reservation but is also accessible from within Xen. It carries a pair of
request producer/consumer pointers and a pair of response producer/consumer
pointers. The guest OS is responsible for advancing the request producer
pointer when adding a new I/O request, and the response consumer pointer when
it has finished processing a response. Xen, in turn, advances the request
consumer pointer when it starts processing a request, and the response producer
pointer when it posts an I/O response back.
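The four-pointer protocol can be modeled as a toy ring. The structure below is a sketch (the names and the single shared slot array are my simplifications); Xen's real rings use shared memory, index wrapping via masks, and memory barriers between producer and consumer.

```python
class DescriptorRing:
    """Toy model of a split request/response I/O ring."""

    def __init__(self, size: int = 8):
        self.size = size
        self.slots = [None] * size
        self.req_prod = self.req_cons = 0   # advanced by guest / by Xen
        self.rsp_prod = self.rsp_cons = 0   # advanced by Xen / by guest

    # --- guest side ---
    def put_request(self, req):
        assert self.req_prod - self.rsp_cons < self.size, "ring full"
        self.slots[self.req_prod % self.size] = req
        self.req_prod += 1                  # guest advances request producer

    def take_response(self):
        assert self.rsp_cons < self.rsp_prod, "no responses"
        rsp = self.slots[self.rsp_cons % self.size]
        self.rsp_cons += 1                  # guest advances response consumer
        return rsp

    # --- hypervisor side ---
    def process_one(self, handler):
        assert self.req_cons < self.req_prod, "no requests"
        req = self.slots[self.req_cons % self.size]
        self.req_cons += 1                  # Xen advances request consumer
        self.slots[self.rsp_prod % self.size] = handler(req)
        self.rsp_prod += 1                  # Xen advances response producer
```

Reusing a consumed request slot for the response, as `process_one` does here, mirrors how the real rings keep requests and responses in the same circular buffer.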

4.4 Domain 0 – the Administration Interface

One of the big principles in system design is the separation of policy and
mechanism. It is not hard to imagine that each guest OS may have a very
different configuration, and that a guest OS may change its configuration
during its lifetime. Xen has a clean and elegant administration interface for
managing all guest OSs, including bootstrapping new domains (guest OSs),
shutting domains down, and changing the configuration of a particular domain in
real time, for example changing the virtual firewall-router rules to open or
block certain types of connections. This interface is exposed via a set of
administration tools run from the command line in domain 0.

Domain 0 is a special domain created at boot time. It is responsible for
hosting the application-level management software and the profiling software
that collects and reports statistics on the current state of the system. With
all sorts of other applications available in a full-fledged guest OS,
maintenance, diagnosis, and debugging become much easier than if these
functions were performed at the VMM level.

Chapter 5

Application of Virtual Machines

Nowadays computer systems are getting cheaper and cheaper, so the original
purpose of virtual machines, sharing expensive computer systems, is no longer
as important. But virtual machines have unique features that continue to draw
interest from systems researchers: they provide flexibility for software,
better protection between applications, hardware independence, and a good
environment for system development and debugging. Peter Chen and Brian Noble
[12] propose replacing the current operating system and software structure with
a new three-layer virtual machine / operating system / software structure, and
argue that this structure is very useful for certain systems research, such as
secure logging, intrusion prevention and detection, and environment migration.

A number of research projects have explored this idea independently [13, 14,
15, 16, 17]; here we pick some of them to illustrate the applications of
virtual machine technologies.

5.1 System Logger - ReVirt

System logs provide first-hand information for intrusion analysis after attacks
have occurred on a system. Ironically, though, the reliability of
operating-system-level logging depends on the integrity of the operating system
itself, which is the very target most attacks try to compromise. Once an
attacker gains administrative privilege, the first thing he may do is hide his
tracks by using that privilege to modify the system log. So if a log is to be
used for attack analysis, it must be protected even when the kernel is
compromised; in other words, system logging needs to be done at an even higher
privilege level than the kernel. A normal OS runs on bare hardware where no
such higher privilege level exists, which means that once an attacker is inside
the kernel, it is very hard to protect the logging system from tampering. But
if the OS runs on a virtual machine monitor, the VMM is a perfect place to
perform system logging, since it runs at a higher privilege level than the
guest OS.

ReVirt [13] is such a logging system running in a VMM. It logs all
non-deterministic events after the system boots. With this information, a
system administrator can replay the entire execution history and find out how
the system was broken into and what damage was caused. ReVirt runs in UMLinux,
a VMM that runs a guest OS as a single process in a host OS. All guest OS
system calls and interrupts are mapped to various signals in the host OS; the
VMM intercepts these signals, logs the corresponding events, and then passes
them on to the guest OS. When replaying, ReVirt starts the system from a known
initial state and injects the logged non-deterministic events at the right
points, so that the system evolves exactly as it did on the previous run.
ReVirt runs at the VMM level and is almost transparent to the guest OS, except
that one non-deterministic instruction in the guest OS has to be replaced with
a system call.
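The log-and-replay idea can be sketched in a few lines. This is a toy model under my own names; ReVirt itself records interrupt timings, inputs, and other non-deterministic events at the VMM level rather than return values of a function.

```python
import random

class Logger:
    """Live run: record every non-deterministic value as it happens."""
    def __init__(self):
        self.log = []
    def nondet(self, value):
        self.log.append(value)   # e.g. an input byte or interrupt point
        return value

class Replayer:
    """Replay: ignore the live value and re-inject the logged one."""
    def __init__(self, log):
        self.log = iter(log)
    def nondet(self, _value):
        return next(self.log)

def run(env):
    # A toy "system" whose behavior depends on non-deterministic inputs.
    total = 0
    for _ in range(3):
        total += env.nondet(random.randint(0, 9))
    return total

live = Logger()
first = run(live)
second = run(Replayer(live.log))
assert first == second           # replay reproduces the original execution
```

Everything deterministic is simply re-executed; only the logged events need to be stored, which is what keeps ReVirt's log small.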

We should note that this does not mean the VMM cannot be compromised. But
considering the VMM's relatively narrow interface and its simpler (thus less
vulnerable and easier to verify) software, breaking into a VMM should be much
harder than breaking into a guest OS, so the VMM provides a quite secure
platform for logging. And because the VMM forwards all system interactions
between the guest OS and the host OS or hardware, logging in the VMM is an easy
and natural way to record enough information for a full replay.

5.2 Migration

Because a VMM is a software abstraction of the hardware, the entire state of a
running environment, including the guest operating system and all the
applications running on it, can easily be captured, packaged into a capsule,
sent over the network, and resumed on a remote host. The capsule contains all
the information the target host needs to resume the running processes and the
entire guest OS; that information usually includes the state of the virtual
disks, memory, CPU registers, and I/O devices. This environment migration
operates at the virtual machine level, a larger scale than process migration.
It allows a user to move between computers at home and at work without
interrupting ongoing work, or lets a system administrator prepare a patched
image and deploy it across an entire server fleet so that all servers start
from a clean, fresh state.

However, considering the gigabytes of a virtual disk and the hundreds of
megabytes of memory, the capsule can be so large that constructing it and
sending it over the network is very expensive. In [14], the authors explain an
effective way to construct a capsule so that it can be transported over a DSL
connection within a reasonable time. The following techniques are used to
reduce the capsule size:

After the initial capsule is transmitted, future virtual disk changes are
captured incrementally. Using copy-on-write, only disk updates are written into
the capsule.

Before capturing memory into the capsule, a "balloon" process can zero out most
of the memory by paging most memory pages back out to disk, so only a small
amount of memory goes into the capsule.

Instead of waiting for the entire capsule, the target host can start early with
partial information. Disk pages, in particular, can be fetched on demand while
the capsule runs.

A hash of each data block is sent first; if the target host already has a block
that produces the same hash, the local copy is used instead of sending the
block over the wire.
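The last technique, hash-based block deduplication, can be sketched as follows. The function name and interface are my own; the paper's implementation exchanges hashes over the network rather than passing a precomputed set.

```python
import hashlib

def blocks_to_send(capsule_blocks, target_hashes):
    """Sketch of hash-based deduplication for capsule transfer.

    capsule_blocks: list of data blocks making up the capsule.
    target_hashes: set of hex digests of blocks the target already holds.
    Returns only the blocks that actually need to cross the wire.
    """
    missing = []
    for block in capsule_blocks:
        digest = hashlib.sha1(block).hexdigest()  # hash is sent first
        if digest not in target_hashes:           # target lacks this block
            missing.append(block)
    return missing
```

Combined with the balloon process above, this is especially effective: zeroed memory pages all hash to the same value, so at most one of them ever needs to be transmitted.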

The result is significant: the capsule can start running on the target host
after only 20 minutes of transmission over a 384 kbps link.

5.3 Mate: Virtual machines for sensor networks

As we discussed in Chapter 2, unlike system-level virtual machines, which
provide an abstraction above the hardware, high-level virtual machines aim to
provide a more generic, simpler, safer, and easier-to-use interface at the
programming language level. Java and C# are the two big names in this category.
Along this line, some other research projects have explored the same idea in
very different computing environments.

Mate [15], a language interpreter for a new abstract ISA, is designed to cope
with the special requirements of the sensor network environment. A sensor
network is composed of hundreds of motes, very simple computing devices, each
with its own processor, ROM, and RAM, that communicate with one another via
wireless connections. In the sensor network world, energy is the most precious
resource, because a mote cannot be recharged once deployed, and the energy
spent transmitting a single byte wirelessly could instead run thousands of
instructions. Yet the only way to reprogram a mote after deployment, whether
for a parameter change or the installation of an entirely new binary, is via
wireless transfer. The challenge is to provide reprogramming capability with
minimal network transmission.

The authors believed that "most sensor network applications are composed of a
common set of services and sub-systems, combined in different ways. A system
that allows these compositions to be concisely described (a few packets) could
provide much of the flexibility of reprogramming with the transmission costs of
parameter tweaking." [15] So they abstracted a language interface at the
service and sub-system level and built a virtual machine to interpret programs
written in this language. The new interface is much more concise and easier to
use as well. It also provides a protected execution environment with a
user/kernel boundary on motes that have no hardware protection mechanisms.

Chapter 6

Conclusions

The concept of virtual machines is not new. In the 1960s, IBM first developed
virtual machines to share machine resources among users. The virtual machine
has always been an interesting research topic, and recently it has drawn more
attention than ever.

The essential part of a virtual machine is the virtual machine monitor (VMM).
It abstracts the physical resources of the underlying bare hardware and
provides a fully protected and isolated replica of the physical system. It is
transparent to the operating system running above it, i.e., the guest operating
system.

While the above structure describes the original virtual machines, there are
many different types of virtual machines across research areas. Based on
whether the VMM provides an abstracted ISA or ABI, we can distinguish system
virtual machines from process virtual machines. Combined with the criterion of
whether the guest and host systems use the same ISA, we can classify virtual
machines into different types and establish an overall taxonomy.

There are two major approaches to implementing system virtual machines: full
virtualization and para-virtualization. Full virtualization provides an
identical abstraction of the underlying physical system, so the guest operating
system and the applications running on it need no changes to be migrated to the
virtual machine. Full virtualization thus has the obvious advantage that any
guest OS and its applications can be ported to a fully virtualized machine
without additional cost. However, because IA-32, currently the most dominant
system architecture, was not designed to be virtualized, full virtualization
pays an expensive performance penalty to implement a completely identical
abstraction of the physical system. Para-virtualization, by contrast, provides
only an abstraction that can be efficiently implemented on the given hardware.
This means the guest operating system has to be modified to run on the virtual
machine, but because para-virtualization mitigates the expensive performance
penalty, it has become more popular in recent systems research.

Nowadays the price of hardware resources is no longer a problem, so the
original motivation for virtual machines is less important. Researchers are
more interested in the features virtual machines can provide, such as
flexibility, stronger protection, and hardware independence. Many projects use
virtual machines to build secure system loggers, to migrate working
environments, or to simplify the reprogramming interface of motes in sensor
networks. Xen in particular, thanks to its excellent performance, is the target
of many research efforts; for example, the PlanetLab project plans to improve
the PlanetLab OS to support both VServer- and Xen-based slices to achieve
better remote management of PlanetLab nodes [18]. So we can say that virtual
machine research has great potential.

Bibliography

[1] R. J. Creasy. The origin of the VM/370 time-sharing system. IBM Journal of
Research and Development, vol. 25, no. 5, p. 483, 1981.

[2] J. Sugerman, G. Venkitachalam, and B. Lim. Virtualizing I/O devices on
VMware Workstation's hosted virtual machine monitor. In Proceedings of the 2001
USENIX Annual Technical Conference, Boston, MA, USA, Jun. 2001.

[3] R. P. Goldberg. Survey of virtual machine research. IEEE Computer, pp.
34-45, Jun. 1974.

[4] J. E. Smith and R. Nair. An overview of virtual machine architectures.
Elsevier Science, pp. 5-6, 2006.

[5] E. Bugnion, S. Devine, and M. Rosenblum. Disco: Running commodity operating
systems on scalable multiprocessors. In Proceedings of the 16th ACM SIGOPS
Symposium on Operating Systems Principles, volume 31(5) of ACM Operating
Systems Review, pages 143-156, Oct. 1997.

[6] VMware whitepaper: Virtualization overview.
http://www.vmware.com/pdf/virtualization.pdf

[7] C. A. Waldspurger. Memory resource management in VMware ESX Server. In
Proceedings of the 5th Symposium on Operating Systems Design and
Implementation, Boston, MA, USA, Dec. 2002.

[8] M. R. Ferre. VMware ESX Server: scale up or scale out?
http://www.redbooks.ibm.com/abstracts/redp3953.html

[9] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R.
Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In
Proceedings of the 19th ACM Symposium on Operating Systems Principles, Bolton
Landing, NY, USA, Oct. 2003.

[10] Xen website: http://www.cl.cam.ac.uk/research/srg/netos/xen/

[11] A. Whitaker, M. Shaw, and S. D. Gribble. Scale and performance in the
Denali isolation kernel. In Proceedings of the 5th Symposium on Operating
Systems Design and Implementation, ACM Operating Systems Review, Winter 2002
Special Issue, pages 195-210, Boston, MA, USA, Dec. 2002.

[12] P. M. Chen and B. D. Noble. When virtual is better than real. In
Proceedings of the Eighth Workshop on Hot Topics in Operating Systems, p. 133,
2001.

[13] G. W. Dunlap, S. T. King, S. Cinar, M. Basrai, and P. M. Chen. ReVirt:
Enabling intrusion analysis through virtual-machine logging and replay. In
Proceedings of the 5th Symposium on Operating Systems Design and Implementation
(OSDI 2002), ACM Operating Systems Review, Winter 2002 Special Issue, pages
211-224, Boston, MA, USA, Dec. 2002.

[14] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M.
Rosenblum. Optimizing the migration of virtual computers. In Proceedings of the
5th Symposium on Operating Systems Design and Implementation, pages 377-390,
Boston, MA, USA, Dec. 2002.

[15] P. Levis and D. Culler. Mate: A tiny virtual machine for sensor networks.
In Proceedings of the 10th International Conference on Architectural Support
for Programming Languages and Operating Systems, San Jose, CA, USA, Oct. 2002.

[16] S. T. King and P. M. Chen. Backtracking intrusions. In Proceedings of the
19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, USA,
Oct. 2003.

[17] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh. Terra: A
virtual machine-based platform for trusted computing. In Proceedings of the
19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, USA,
Oct. 2003.

[18] PlanetLab website: http://www.planet-lab.org/Software/roadmap.php#os

[19] M. F. Kaashoek, D. R. Engler, G. R. Ganger, H. M. Briceno, R. Hunt, D.
Mazieres, T. Pinckney, R. Grimm, J. Jannotti, and K. Mackenzie. Application
performance and flexibility on Exokernel systems. In Proceedings of the 16th
ACM Symposium on Operating Systems Principles, St. Malo, France, Oct. 1997.
