
107 - Introduction

Welcome back.
As we mentioned at the conclusion of the previous lesson module, the drive for extensibility in operating
system services led to innovations in the internal structure of operating systems and dynamic loading of operating
system modules in modern operating systems.
But it does not stop there.
In this lesson, we will see how the concept of virtualization has taken the vision of extensibility to a whole
new level. Namely, allowing the simultaneous co-existence of entire operating systems on top of the same
hardware platform. Strap your seatbelts and get ready for an exciting ride. Before we look deep into the nuts and
bolts of virtualization, let's cover some basics.
108 - Virtualization Question

109 - Virtualization Question

You know what, if you picked all of them you're right on.

We've heard the term virtualization in the context of virtual memory systems. Data centers today catering to
cloud computing use virtualization technology.
The Java virtual machine and Virtual Box are things that you may be using on your laptop, for instance.
And IBM VM/370 was the mother of all virtualization in some sense, because way back in the 60s and the
70s, it pioneered this concept of virtualization.
Google Glass, of course many of you may have heard the hype behind it, as well as the opportunities for
marketing that this provides.
Cloud computing, Dalvik Virtual Machine in the context of Android, VMware Workstation as an alternative
to Virtual Box.
And of course the movie, Inception, has a lot of virtualization themes in it.
110 - Platform Virtualization

The words virtual and virtualization are popular these days, from virtual worlds and virtual reality to
application-level virtual machines like Dalvik or Java's virtual machine. What we are concerned with here are
virtual platforms and by platform we mean an operating system running on top of some hardware.

Let's say Alice has her own company, Alice Inc. and she has some cool applications to run for productivity.
Let's say that they're running on a Windows machine on some server that the company maintains. Now, if cost
were not an issue, then this would be the ideal situation for Alice Inc.
The hope of virtualization is that we will be able to give a company not as well endowed as Alice Inc., say Bala
Inc., almost the same experience that Alice Inc. gets, at a fraction of the cost. So, instead of a real platform, Bala
Inc. gets a virtual platform.
In this diagram, what I've shown you is a black box to represent the virtual platform. Because as far as Bala
Inc. is concerned, they don't really care what goes on inside this black box. All they want is to make sure that the
apps that they want to run can run on top of this virtual platform. In other words as long as the platform offers the
same capabilities and the same abstractions for running the applications that Bala wants, he is happy.
Now, we however as aspiring operating system designers are very much interested in what goes on inside this
black box. We want to know
• How we can give Bala Inc. the illusion of having his own platform without actually giving him one.
• And how we can do so while keeping down the cost of implementing and maintaining such a
virtual platform.
111 - Utility Computing

Now, if we peek inside the black box however, we find that it is not just Bala who is using the resources in the
virtual platform, but there is also Piero, Kim and possibly others. Why would we want to do this? Well, unless
we are quite unlucky, the hope is that sharing hardware resources across several different user communities is
going to result in making the cost of ownership and maintenance of the shared resources much cheaper.
The fundamental intuition that makes sharing hardware resources across a diverse set of users work is the fact that
resource usage is typically very bursty. If we were to draw Bala's memory usage over time,
it might look like this. And maybe Piero's need for memory over time may look like this. And Kim's like this, and
so on. Now, adding all of these dynamic needs of different user communities, we may see a cumulative usage
pattern that might look like this.

Now, let's consider Bala's cost. If he were to buy his own server, then he would have to buy as much as the peak
usage that he has. Probably he'll even want to buy more than that, just to be on the safe side. The shared, virtualized
platform, on the other hand, has a total available memory that's much more than the individual needs of any one of these
guys. Each of these guys gets to share the cost of the total resources among themselves.
On a big enough scale, each of these guys potentially has access to a lot more resources than they could
individually afford to pay for, at a fraction of the cost, because both the cost of acquiring the resources as well as
maintaining and upgrading them and so on is borne collectively. And that, in a nutshell, is the whole idea behind utility
computing that is promoted by data centers worldwide, and this is how AWS, Microsoft, and so on provide
resources on a shared basis to a wide clientele.
Some of you may have already seen the connection to what we have seen in the previous lecture. Yes,
virtualization is the logical extension of the idea of extensibility or specialization of services that we've seen
in the previous lesson, through the SPIN and Exokernel papers. Now it is applied at a much larger granularity,
namely an entire operating system. In other words virtualization is extensibility applied at the granularity of
an entire operating system as opposed to individual services within an operating system.
112 - Hypervisors

Now returning to our look inside the black box, notice how we have multiple operating systems running on the
same shared hardware resources. How is this possible? How are the operating systems protected from one
another, and who decides who gets the resource and at what time? What we need then is an operating system of
operating systems, something called a VMM, which stands for Virtual Machine Monitor, or hypervisor. And
the OSes that are running on top of the shared hardware resources are often referred to as Virtual Machines
(VMs) or Guest OS.

There are two types of hypervisors.


• The first type is what is called a native hypervisor or bare metal, meaning that the hypervisor is running
on top of the bare hardware. And that's why it's called a bare-metal hypervisor or a native hypervisor.
All of the operating systems that I'm showing you inside of the black box are running on top of this
hypervisor. They're called the guest operating systems because they're the guest of the hypervisor
running on the shared resource.
• The second type of hypervisor is the hosted hypervisor. The hosted ones run on top of a host OS and
allow users to emulate the functionality of other OSes. So the hosted hypervisor is not running on top
of the bare metal, but it is running as an application process on top of the host OS. And the guest
operating systems are clients of this hosted hypervisor. Some examples of hosted hypervisors include
VMware Workstation and Virtual Box. If you don't have access to a computer that's running the Linux
operating system, in this course you're likely to be doing your course projects on Virtual Box or
VMware Workstation running on a Windows platform.
For the purpose of this lesson today, however, we will be focusing on the bare-metal hypervisors. These bare-metal
hypervisors interfere minimally with the normal operation of the guest operating systems on the shared
resources, in a spirit very similar to the extensible operating systems that we studied earlier, like
SPIN and Exokernel. Therefore the bare-metal hypervisors offer the best performance for the guest operating
systems on the shared resource.
113 - Connecting the Dots

If you're a student of history you probably know that ideas come at some point of time and the practical use may
come at a much later point of time. An excellent example of that is Boolean algebra, which was invented in the
mid-1800s as a pure mathematical exercise by George Boole. Now, Boolean algebra is the basis for pretty
much anything and everything we do with computers. The concept of virtualization also had its beginning way
back in time.

• It started with IBM VM/370 in the 60s and the early 70s. And the intent was to give the illusion to every
user on the computer as though the computer is theirs. That was the vision and it was also a vehicle for
binary support for legacy applications that may run on older versions of IBM platforms.
• Then we had the microkernels that we have discussed earlier which surfaced in the 80s and early 90s. That
in turn gave way to extensibility of operating systems in the 90s.
• The Stanford project SimOS in the late 90s laid the basis for the modern resurgence of virtualization
technology at the operating system level and in fact, was the basis for VMware.
• The specific ideas we're going to explore in this course module, through the Xen and VMware papers,
date back to the early 2000s. They were proposed from the point of view of supporting application mobility,
server consolidation, co-located hosting facilities, and distributed web services.
• Today, virtualization has taken off like anything.
Why the big resurgence today? Well, companies want a share of everybody's pie. One of the things that has
become very obvious is the margin for device making companies in terms of profits is very small. Everybody
wants to get into providing services for end users. This is pioneered by IBM and others are following suit as well.
So the attraction with virtualization technology is that companies can now provide resources with complete
performance isolation and bill each individual user separately. Companies like Microsoft, Amazon,
HP...everybody's in this game wanting to provide computing facilities through their data centers to a wide
diversity of user communities. It's a win-win situation both for the users, who do not want to maintain and
constantly upgrade their own computing infrastructure, and for companies like IBM, which now have a way of providing these
resources on a rental basis, on a utility basis, to the user community.
You can see the dots connecting up from the extensibility studies of the 90s, like SPIN and Exokernel, to
virtualization today, which is providing computational resources much like we depend on utility companies to
provide electricity and water. In other words, virtualization technology has made computing like the other utilities
that we are used to, such as electricity and water. That's the reason why there's a huge resurgence in virtualization
technology in all the data centers across the world.
114 - Full Virtualization

One idea for this virtualization framework is what is called full virtualization, and in full virtualization the idea
is to leave the operating system pretty much untouched. So you can run the unchanged binary of the OS on top
of the hypervisor. This is called full virtualization because the OS is completely untouched.

However, operating systems running on top of the hypervisor are run as user-level processes. They're not
running at the same level of privilege as a Linux that is running on bare metal. But if the OS code is unchanged,
it doesn't know that it does not have the privilege for doing certain things that it would do normally on bare metal
hardware. In other words, when the OS executes some privileged instructions, those instructions will create a trap
that goes into the hypervisor and the hypervisor will then emulate the intended functionality of the OS. This is
what is called the trap and emulate strategy. Essentially, each Guest OS thinks it is running on bare metal.
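To make the trap-and-emulate idea concrete, here is a minimal, hypothetical sketch of how a bare-metal hypervisor might dispatch traps coming from a deprivileged guest. The trap codes and the emulate_* helpers are invented for illustration; they are not taken from any real hypervisor or architecture manual.

```c
/* Hypothetical sketch of trap-and-emulate dispatch in a hypervisor.
 * The trap reasons and emulate_* helpers are illustrative only. */
#include <stdio.h>

enum trap_reason { TRAP_PRIV_INSTR, TRAP_PAGE_FAULT, TRAP_IO_ACCESS };

struct vcpu {              /* per-guest virtual CPU state kept by the VMM */
    int guest_id;
    unsigned long fault_addr;
    unsigned long instr;   /* the privileged instruction that trapped     */
};

/* Each helper emulates, on behalf of the guest, what the instruction
 * would have done had the guest really been running in kernel mode.  */
static void emulate_priv_instr(struct vcpu *v) {
    printf("guest %d: emulating privileged instruction %#lx\n",
           v->guest_id, v->instr);
}
static void emulate_page_fault(struct vcpu *v) {
    printf("guest %d: reflecting page fault at %#lx back to the guest OS\n",
           v->guest_id, v->fault_addr);
}
static void emulate_io(struct vcpu *v) {
    printf("guest %d: emulating device access\n", v->guest_id);
}

/* Entry point the hardware would vector to when the (deprivileged)
 * guest OS attempts something it is no longer allowed to do.         */
void vmm_trap_handler(struct vcpu *v, enum trap_reason why) {
    switch (why) {
    case TRAP_PRIV_INSTR: emulate_priv_instr(v); break;
    case TRAP_PAGE_FAULT: emulate_page_fault(v); break;
    case TRAP_IO_ACCESS:  emulate_io(v);         break;
    }
    /* resume the guest as if the instruction had executed natively */
}

int main(void) {
    struct vcpu win = { .guest_id = 1, .instr = 0x0f01 };
    vmm_trap_handler(&win, TRAP_PRIV_INSTR);
    return 0;
}
```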
There are some thorny issues with this trap-and-emulate strategy of full virtualization. In some architectures,
some privileged instructions may fail silently. What that means is, you would think that the instruction actually
succeeded, but it did not, and you may never know about it.
In order to get around this problem in fully virtualized systems, the hypervisor will resort to a binary
translation strategy, meaning it knows which instructions might fail silently on the architecture and looks
for those gotchas in the binaries of the unmodified guest OS. Through binary editing,
it ensures that those instructions are dealt with carefully, so that if they would fail silently the
hypervisor can catch them and take the appropriate action. This was a problem in early versions of the Intel architecture.
Both Intel and AMD have since started adding virtualization support to the hardware, so that such problems
don't exist anymore. But in the early going, when virtualization technology was experimented with in the late 90's
and the early 2000s, this was a problem that virtualization technology had to overcome in order to make sure that
you can run operating systems as unchanged binaries on a fully virtualized hypervisor.
Full virtualization is the technology that is employed in the VMWare system.

115 - Para Virtualization

Another approach to virtualization is to modify the source code of the guest operating system.
If we can do that, not only can we avoid problematic instructions, as I mentioned earlier with full virtualization,
but we can also include optimizations. For instance, letting the guest OS see and access the real hardware resources
underneath the hypervisor, and being able to employ tricks such as page coloring
that exploit the characteristics of the underlying hardware.

It is important to note, however, that so far as the applications are concerned, nothing is changed about the OS,
because the interfaces that the applications see are exactly the interfaces provided by the guest operating system.
For example, if the application is running on top of Linux, it sees exactly the same API as it would if this Linux
operating system was running on native hardware.
In this sense, there's no change to the applications themselves. But the operating system has to be modified in
order to account for the fact that it is not running on bare metal, but it is running as a guest of the hypervisor. This
technology is often referred to as para virtualization, meaning it is not fully virtualized, but a part of it is
modified to account for being a guest of the hypervisor. The Xen product family uses this para virtualization
approach.
Now this brings up an interesting question, in order to do this para virtualization we have to modify the
operating system, but how big is this modification?
116 - Modification of Guest OS Code Question

117 - Modification of Guest OS Code Question

The right answer is that the percentage of the guest operating system code that would need to be modified with
the para virtualization technology is minuscule, less than 2%.
And this is shown by proof by construction in Xen.
What they did was implement multiple operating systems on top of the para-virtualized Xen hypervisor.
118 - Para Virtualization (cont)

This table is showing you the lines of code that the designers of Xen hypervisor had to change in the native
operating systems.
The two native operating systems that they implemented on top of Xen hypervisor are Linux and Windows XP.

You can see that, in the case of Linux, the total amount of the original code base that had to be changed is just
1.36% and in the case of XP, it is minuscule, almost an annoyance.
So, in other words, even though in para virtualization we have to modify the operating system to run on top of
the hypervisor, the amount of code change that has to be done in the operating system in order to make it run on
top of the hypervisor can be bounded to a very small percentage of the total code base of the original operating
system. Which is good news.

119 - Big Picture

So, what is the big picture with virtualization?


In either of the two approaches that I mentioned (full virtualization or para-virtualization), we have to virtualize
the hardware resources and make them available safely to the operating systems that are running on top of the
hypervisor.
When we talk about hardware resources, we're talking about the memory hierarchy, the CPU, and the
devices that are there in the hardware platform.
How do we virtualize them and make them available in a transparent manner for use by the operating systems that
live above the hypervisor? How do we effect data and control transfer between the guest OS and the hypervisor?
These are all the questions that we will be digging deeper into in this course module. That wraps up the basic
introduction to virtualization technology. Now it is time to roll up our sleeves and look deeper into the nuts and
bolts of virtualizing the different hardware elements in the hypervisor.
120 - Introduction

We will now dig deeper into what needs to be done in the systems software stack to support virtualization of the
hardware resources for use by the many operating systems living on top of the hardware simultaneously.
As you've seen in the earlier course module, on operating system extensibility, the question boils down to how
to be flexible while being performance conscious and safe.
You'll see that the bulk of our discussion on virtualization centers around memory systems. The reason is quite
simple: the memory hierarchy is crucial to performance. Efficient handling of device virtualization is heavily
reliant on how the memory system is virtualized and made available to the operating systems running on top of the
hypervisor. So let's start with techniques for virtualizing the memory management in a performance-conscious
manner.

121 - Memory Hierarchy

I'm sure by now this picture is very familiar to you, showing the memory hierarchy going all the way from
the CPU to the virtual memory on the disk.
• You have several levels of caches, of course, the TLB that holds address translations for you (virtual to
physical), and you have the main memory and then the virtual memory on the disk.
• Now, caches are physically tagged, so you don't need to do anything special about them from the point
of view of virtualizing the memory hierarchy.
The really thorny issue is handling virtual memory, namely the virtual address to the physical memory mapping,
which is the key functionality of the memory management subsystem in any operating system.
122 - Memory Subsystem Recall

Recall that in any modern operating system, each process is in its own protection domain and usually a separate
hardware address space.
The operating system maintains a page table on behalf of each of these processes. The page table is the operating
system's data structure that holds the mapping between the virtual page numbers and the physical pages where
those virtual pages are contained in the main memory of the hardware.
• The physical memory of course is contiguous, starting from zero up to whatever the maximum limit of the
hardware is.
• The virtual address space of a given process is not contiguous in physical memory, but it is scattered all
over the physical memory.
That's in some sense the advantage that you get with page-based memory management: a process's notion of its
virtual address space being contiguous is not necessarily reflected in the physical mapping of those virtual pages to the
physical pages in the main memory.
123 - Memory Management and Hypervisor

In the virtualized set up the hypervisor sits between the guest operating system and the hardware. So the picture
gets a little bit complicated.
Inside each one of these Guest OS, of course there are user-level processes and each process is in its own
protection domain. What that means is that in each of these operating systems, there is a distinct page table that
the operating system maintains on behalf of the processes that are running in that operating system.
Now, in this virtualized setting, does the hypervisor know about the page tables maintained on behalf of the processes
that are running in each one of these individual operating systems? The answer is no.
Windows and Linux are simply protection domains distinct from one another as far as the hypervisor is
concerned. The fact that Windows and Linux each contain processes within them is something that the
hypervisor doesn't know about at all.
124 - Memory Manager Zoomed Out

So far as each one of the Guest OS is concerned, they think of physical memory as contiguous. But, unfortunately,
the real physical memory, we're going to call it machine memory from now on, is in the control of the hypervisor.
Not in the control of any one of the Guest OS. The physical memory of each of the Guest OS is an illusion. It's
no longer contiguous in terms of the real machine memory that the hypervisor controls.

For instance, look at the physical memory that the Guest Windows has. I've broken that down into 2 regions, R1
and R2.
• R1 has a number of pages 0 through Q, and R2 has a number of pages starting from Q plus 1, through N.
So this is a physical memory. But if you look at this region R1, it's occupying a space in the machine
memory controlled by the hypervisor over here (green blocks). R2 is not contiguous with respect to R1 in
the machine memory.
• Come over to the Linux VM and it has its own physical memory. We'll call that region 1 and region 2
again and it has a total capacity of M + 1 physical page frames. Page 0 through L are contiguous in the
machine memory. L+1 through M are contiguous in machine memory. Again, they're not contiguous with
respect to the other region R1.
Why would this happen? The hypervisor is not being nasty to the operating system. After all, even if all of the
N+1 pages of Windows and the M+1 pages of Linux were each contiguous in the machine memory, they cannot both
start at 0, because the machine memory is the only thing that is really contiguous.
The machine memory has to be partitioned between these two Guest OS, and therefore, the starting point for
the physical memory (the illusion of the physical memory to each Guest OS) cannot be 0.
Also, the memory requirements of operating systems are dynamic and bursty. If Guest Windows started out
initially with Q+1 pages and later it needed additional memory, then it's going to request the hypervisor. At that
point, it may not be possible for the hypervisor to give another region, which is contiguous with the previous
region that Windows already has.
So these are the reasons why the physical memory of a Guest operating system is just an illusion, in terms of
how it is actually contained in the machine memory.
125 - Zooming Back In

Zooming back in to what is going on within a given operating system, we already know that the process address
space for an application is an illusion in the sense that the virtual memory space of this process is contiguous, but
the physical memory is not contiguous. The page table data structure that the Guest OS maintains on behalf of
the processes, is the one that supports this illusion so far as each process is concerned.
• The Guest OS maps the virtual page number of the process, using the page table for this particular process,
into the physical page number where a particular virtual page may be contained in physical memory. This
is a setting in a non-virtualized operating system. So the page table serves as the broker to convert a virtual
page number to a physical page number.
• In a virtualized setting, we have another level of indirection, from the Guest OS physical page number to
the machine memory or the machine page numbers (MPN). This goes from zero through some max,
which is the total memory capacity that you have in your hardware.
This data structure, the page table maintained by the Guest OS, is a traditional page table that gives a mapping
between the virtual page number and the physical page number.
The mapping between the physical page number and the machine page number, that is PPN to MPN mapping, is
kept in another page table, which is called shadow page table, S-PT.
Now in a virtualized setting, there's a two-step translation process to go from VPN to MPN. The page table
maintained by the Guest OS is the one that translates VPN to PPN and then there is this additional page table
called a shadow page table that converts the PPN to MPN.
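To make the two-step translation concrete, here is a simplified model in C of the two data structures just described. Flat arrays stand in for real multi-level page tables; the sizes and names are purely illustrative.

```c
/* Simplified model of the two-step translation in a virtualized setting:
 * the guest page table maps VPN -> PPN, and the hypervisor's shadow
 * page table maps PPN -> MPN.  Flat arrays are a stand-in for real
 * multi-level page tables. */
#include <stdio.h>

#define GUEST_PAGES 8   /* virtual pages of one guest process            */
#define PHYS_PAGES  8   /* "physical" pages the guest thinks it owns     */

static int guest_pt[GUEST_PAGES];   /* VPN -> PPN, maintained by guest OS   */
static int shadow_pt[PHYS_PAGES];   /* PPN -> MPN, maintained by hypervisor */

int translate(int vpn) {
    int ppn = guest_pt[vpn];     /* step 1: the guest's illusion            */
    int mpn = shadow_pt[ppn];    /* step 2: the hypervisor's real mapping   */
    return mpn;
}

int main(void) {
    guest_pt[3]  = 5;            /* guest thinks VPN 3 lives in PPN 5       */
    shadow_pt[5] = 42;           /* hypervisor placed PPN 5 in MPN 42       */
    printf("VPN 3 -> MPN %d\n", translate(3));
    return 0;
}
```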
126 - Who Keeps PPN MPN Mapping Question

It's clear that the VPN to PPN mapping is in the Guest OS. Now the question is, where should this PPN to MPN
mapping be kept in a fully virtualized setting. Should it be in the Guest OS? or should it be in the hypervisor?
Similarly in the context of a para-virtualized system, where should the mapping be between PPN and MPN?
Should it be in the guest operating system or in the hypervisor?

127 - Who Keeps PPN MPN Mapping Solution

In the case of a fully virtualized hypervisor, the Guest OS has no knowledge of machine pages. It thinks that its
physical memory is contiguous because it thinks that it is running on bare metal; nothing has been changed
in the operating system to run on top of a hypervisor in a fully virtualized setting. And therefore, it is the
responsibility of the hypervisor to maintain the PPN to MPN mapping.
In a para virtualized setting, on the other hand, the guest operating system knows that it is not running on bare
metal. It knows that there's a hypervisor in between it, and the real hardware. And it knows, therefore, that its
physical memory is going to be fragmented in the machine memory. So the mapping PPN to MPN can be kept in
either the Guest OS or the hypervisor. But, usually, it is kept in the Guest OS.
We'll talk more about that later on.
128 - Shadow Page Table

Let's understand what exactly this shadow page table is.
In many architectures, e.g. Intel's x86 family, the CPU uses the page table for address translation. What that
means is that, presented with a virtual address (VPN), the CPU first looks up the TLB to see if there is a match for
the VPN. If there is a match, it's a TLB hit and it can translate this virtual address to the physical address. If it is
a miss, CPU knows where in memory the page table data structure is kept by the OS. It goes to the page table,
which is in main memory, and retrieves the specific entry which will give it the translation from the VPN to the
PPN. Once it gets that PPN, it'll stash this PPN in the TLB for future reference.
That's the way the CPU does the translation in many architectures. In other words, both the TLB and the page
table are data structures that the architecture uses for address translation.
The page table is also a data structure that is set up by the operating system to enable the processor to do this
translation. So, in other words, the hardware page table is really the shadow page table in the virtualized setting,
if the architecture is going to use the page table for address translation.
129 - Efficient Mapping (Full Virtualization)

As I mentioned in a fully virtualized setting, the guest operating system has no idea about machine pages. It thinks
that the physical page number that it is generating is the real thing. But it is not. Therefore, there are two levels
of indirection:
• One level of indirection going from virtual page to physical page and this is an illusion.
• Then the hypervisor has to take this physical page and using the shadow page table, convert it into the
machine page.
And as I said, the shadow page table may be the real hardware page table that the CPU uses as well. So this is the
data structure that is maintained by the hypervisor to translate the PPN to MPN.
How to make this efficient? Because on every memory access of a process of the guest operating system, the
virtual address has to be converted to a machine address. So in other words, we want to avoid the one level of
indirection that's happening because the VPN has to be converted to a PPN by the guest OS and then it has to be
looked up in the shadow page table by the hypervisor to generate the MPN.
So, we would like to make this process more efficient by getting rid of one level of indirection, i.e. the translation
performed by the Guest OS.

• Remember that the Guest OS is the one that establishes this mapping in the first place between a VPN and
a PPN by creating an entry in the Guest page table for the process that is generating this virtual address in
the first place.

• Updating the Guest page table is a privileged instruction. So, when the Guest OS tries to update the page
table (what it thinks is the hardware page table), it'll result in a trap, and this trap is caught by the hypervisor.

• What the hypervisor is going to do is, it's basically going to say, oh this particular VPN corresponds to a
specific entry in the shadow page table. So the update that guest OS is making to the Guest page table data
structure is not the real thing.
• The hypervisor is updating the same mapping by saying well, this particular VPN is going to this machine
page number, that's the one that we're going to put as the entry here.

• So as a result, what we have done is putting the real translations between VPN and MPN in the shadow
page table, which may be the hardware page table if the processor is using the page table for its address
translation or it could be the TLB.
So now we have basically bypassed the Guest OS page table in order to do the translation. Now, every time a process
generates a virtual address, we are not going through the Guest OS to do the translation; as long as the
translation has already been installed in the TLB or the hardware page table (the shadow page table), the hypervisor,
without the intervention of the Guest OS, can translate the VPN of a user-level process running on top of the Guest
OS directly to the MPN using the TLB or the hardware page table.
This is a trick to make address translation efficient. It's extremely crucial because it is not acceptable to go
through the guest OS to do the address translation on every memory access. This is the trick that is used in
VMware ESX server that is implemented on top of Intel hardware.
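Here is a minimal sketch of that trick, under the simplifications already used above (flat arrays as page tables). It is not VMware's actual code; it only shows the composition of the two mappings that the hypervisor performs when the guest's page-table write traps.

```c
/* Sketch of the shadow-page-table trick described above.  When the
 * deprivileged guest OS writes a VPN->PPN entry into what it thinks is
 * the hardware page table, the write traps into the hypervisor, which
 * composes that mapping with its own PPN->MPN table and installs the
 * VPN->MPN translation directly in the real (shadow) page table. */
#include <stdio.h>

#define N 16

static int guest_pt[N];    /* guest's VPN -> PPN view (the illusion)       */
static int ppn_to_mpn[N];  /* hypervisor's allocation of machine pages     */
static int shadow_pt[N];   /* the page table the MMU actually walks        */

/* Invoked on the trap caused by the guest's privileged page-table write. */
void on_guest_pt_update(int vpn, int ppn) {
    guest_pt[vpn]  = ppn;                 /* keep the guest's view current  */
    shadow_pt[vpn] = ppn_to_mpn[ppn];     /* real translation: VPN -> MPN   */
}

int main(void) {
    ppn_to_mpn[2] = 77;                   /* hypervisor gave PPN 2 = MPN 77 */
    on_guest_pt_update(9, 2);             /* guest maps VPN 9 -> PPN 2      */
    printf("MMU will translate VPN 9 -> MPN %d\n", shadow_pt[9]);
    return 0;
}
```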
130 - Efficient Mapping (Para Virtualization)

In a para-virtualized setting, the Guest OS knows that its physical memory is not contiguous. Therefore this
burden of efficient mapping can be shifted into the Guest OS itself. The Guest OS is also going to know that its
notion of physical memory, is not the same as machine memory, and it will map the discontiguous physical
memory to real hardware pages. The burden of doing the PPN to MPN mapping can be pushed into the Guest OS
in a para virtualized setting.
As a result, on an architecture like Intel, where the page table is a data structure of the operating system and is
also used by the hardware to do the address translation, the responsibility of allocating and managing the hardware
page table data structure can be shifted into the Guest OS.
In a fully virtualized setting, it's not possible to do that, because the OS in a fully virtualized setting is unaware
of the fact that it is not running on bare metal. But in a para-virtualized setting, it is possible to do that and
to push this efficient mapping handling into the Guest OS.
For example, Xen, which is a para-virtualized hypervisor, provides a set of hypercalls for the Guest OS
to tell the hypervisor about changes to the hardware page table.
For instance, there is a call that says “create page table” and this allows a guest operating system to allocate and
initialize a page frame that it has previously acquired from the hypervisor as a hardware resource. It can target
that physical page frame as a page table data structure.
Recall that each Guest OS would have gotten a bunch of physical page frames from the hypervisor at the beginning,
when it established its footprint on the hypervisor. So it can use one of those real page frames to host a page
table data structure on behalf of a new process that it launches.
• Anytime a new process starts up in the Guest OS, the Guest OS will make a hypercall to Xen, saying
“please create a page table for me, and this is the page frame that I'm giving you to use as the page
table”.
• When the Guest OS wants to switch to another process, it can make another hypercall to the hypervisor, saying
“please switch to this page table, and here is the location of the page table”. The hypervisor doesn't know
about all these processes; all it understands is that there's a hypercall that says change the page table from
whatever it used to be to this new page table. That essentially results in this Guest OS switching the
address space of the currently running process on the bare metal to that of P1 through this
switch-page-table hypercall. Xen will do the appropriate thing of setting the hardware register of the
processor to point to this page table data structure, in response to this hypercall from the Guest OS.
• If the process P1 were to page fault at some point of time, the page fault would be handled by the
Guest OS. Once it handles that page fault for P1 and says, “oh, this particular virtual page of this process
is now going to correspond to a physical frame that I own. I'm going to tell the hypervisor that the mapping
in the page table, has to be set for this translation that I just established for the faulted page for this process”.
So there's another hypercall that's available for updating a given page table data structure. Using this,
the Guest OS can deal with modifications to the page table data structure.
All the things that an operating system would have to do in a normal setting on bare metal need to be done
in the setting where you have the hypervisor sitting between the real hardware and the Guest OS.
And the three things that are required to be done in the context of memory management in a para-virtualized
setting are being able to:
• create a brand new hardware address space for a newly launched process which involves creating a page
table. That's a hypercall that's available.
• When you do a context switch, you want to switch the page table. That's a hypercall that's available when
you do a context switch in the Guest OS from P1 to P2, the guest can tell the hypervisor that the page table
to use from now on is such and so. That's the way the guest can do a context switch from one process to
another.
• And, thirdly, since not all of the address space or the memory footprint of a process will be in physical
memory, if the currently running process were to incur a page fault, that has to be handled by the Guest
OS. In handling that, it establishes a mapping between the missing virtual page for this process and the
physical frame in which the contents of the page are now contained. That mapping has to be put into the
page table for this particular process. Again, that's something that only the hypervisor can do, because it is
a privileged operation on the hardware. And for that purpose, the hypervisor provides a
hypercall that allows a guest operating system to update the page table data structure.
So at the outset I said that handling virtual memory is a thorny issue. Doing the mapping from virtual to physical
on every memory access, without the intervention of the Guest OS, is the key to good performance. And it can be
done both in the fully virtualized and the para-virtualized setting by the tricks that we just talked about.
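To summarize the para-virtualized path, here is a schematic of the three page-table hypercalls described above. The names and signatures are invented for illustration; Xen's real interface (e.g. mmu_update, mmuext_op) is more involved.

```c
/* Schematic of the three page-table hypercalls described above.
 * Names and signatures are illustrative only. */
#include <stdio.h>

typedef unsigned long mfn_t;   /* machine frame number */
typedef unsigned long vaddr_t;

/* Guest donates a machine frame it already owns to hold a new page table. */
void hypercall_create_page_table(mfn_t frame) {
    printf("hypervisor: pinning MFN %#lx as a page table\n", frame);
}

/* Guest context-switches: point the MMU's base register at another table. */
void hypercall_switch_page_table(mfn_t page_table_base) {
    printf("hypervisor: loading page-table base %#lx\n", page_table_base);
}

/* Guest handled a page fault and wants a new VA -> MFN entry installed. */
void hypercall_update_page_table(mfn_t page_table_base, vaddr_t va, mfn_t mfn) {
    printf("hypervisor: in table %#lx map VA %#lx -> MFN %#lx\n",
           page_table_base, va, mfn);
}

int main(void) {
    mfn_t pt = 0x1000;                              /* frame acquired earlier  */
    hypercall_create_page_table(pt);                /* new process P1 starts   */
    hypercall_switch_page_table(pt);                /* context switch to P1    */
    hypercall_update_page_table(pt, 0xb000, 0x2f3); /* fix P1's page fault     */
    return 0;
}
```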

131 - Dynamically Increasing Memory

The next thing we are going to talk about is how can we dynamically increase the amount of physical memory,
that's available to a particular Guest OS running on top of the hypervisor?
As I mentioned, memory requirements tend to be bursty and therefore the hypervisor has to be able to allocate
real physical memory or machine memory on demand to the requesting Guest OS on top of it.

Let's look at this picture here and assume that this is the total amount of machine memory that's available to
the hypervisor.
The hypervisor has divvied up the available machine memory among these two Guest OS. So the hypervisor
has no spare memory at all. It is completely divided up and given to these two guys.
What if the Guest Windows experiences a burst in memory usage, and therefore requires more memory from
the hypervisor? It may happen because of some resource hungry application that has started up in windows that
is gobbling up a lot of physical memory, and therefore Guest Windows needs more memory and comes to the
hypervisor asking for additional hardware resources.
But unfortunately, the bank is empty. What the hypervisor could do is recognize that, well maybe, this Guest
Linux doesn't need all of the physical resources I allocated it so I'm going to grab back a portion of the physical
memory that Linux has. And once I get back this portion of physical memory that I previously allocated to Linux,
I can then give it to Guest Windows to satisfy its sudden hunger for more memory.
Well, this principle of robbing Peter to pay Paul, can lead to unexpected and anomalous behavior of
applications running on the guest operating system.
A standard approach of course would be to coax one of the guest operating systems, in this case perhaps Linux,
to give up some of its physical memory voluntarily to satisfy the needs of a peer that is currently experiencing
memory pressure.

132 - Ballooning

The idea of ballooning is to have a special device driver installed in every Guest OS. So even if it is a fully
virtualized setting, since device drivers can be installed on the fly, with the cooperation of the Guest OS, the
hypervisor can install the device driver, which is called a balloon. This balloon device driver is the key for
managing memory pressures that maybe experienced by a virtual machine or a Guest OS in a virtualized setting.
Let's say that the house needs more memory suddenly. This may be a result of another Guest OS, saying that it
needs more memory. Then the hypervisor will contact one of the Guest OS that is not actively using all of its
memory and talk to this balloon driver.
• The balloon driver was installed through a private channel by the hypervisor, so this is something that only
the hypervisor knows about. It knows how to get to this device driver and tell it to do something.

• In this case, the hypervisor is going to tell this balloon device driver to inflate. What that means is that
this balloon device driver is going to make requests to the Guest OS, saying “I need more memory”. And
the Guest OS will give requested memory to this balloon driver.

• The amount of physical memory that's available to the Guest OS is finite. If one process (in this case, the
balloon driver) is requesting more memory, the Guest OS has to necessarily make room for that by paging
out unwanted pages from the total memory footprint in this Guest OS to the disk. Once it is done with
swapping out of pages, it can make room for the request from this balloon driver.

• Now, once the balloon driver has gotten all this extra memory out of this Guest OS, it can return those
physical page frames (the real physical memory, or machine memory) back to the hypervisor.

• So we started with this house needing more machine memory. The house/hypervisor got that extra
machine memory by contacting the balloon driver installed in one of the Guest OS that is not actively
using all of its resources. Asking the balloon to inflate essentially means that we are acquiring more
memory from the Guest OS. So, you can see visually that the footprint of the balloon goes up because of
the inflation. That means, it's got a bigger memory footprint and all of this memory footprint has extra
resources that it can simply return to the hypervisor. The balloon itself is not going to use these resources,
it just wants to get it so that it can return it to the hypervisor.
The opposite of what I've just described is a situation where the house needs less memory, or, in other words, it
has memory to give back to a Guest OS. Maybe it is this guest that requested more memory in the first
place, and now the hypervisor wants that guest to get the memory it asked for. The hypervisor can do that as follows.
• Once again the hypervisor through its private channel will contact the balloon driver and tell the balloon
driver to deflate the balloon.

• By deflating, the balloon driver contracts its memory footprint. If it contracts its memory footprint by
deflating, it is actually releasing memory into the Guest OS. So, the available physical memory in the
Guest OS is going to increase.

• That's the effect of the balloon deflating: the Guest OS has more memory to play with, which means that
it can page in from the disk the working sets of the processes that are executing on this Guest OS, so that those
processes can have more memory resources than they have been able to get while the balloon was occupying a
lot of the memory.
The technique of ballooning assumes cooperation with the guest operating system.
It's sort of like airline reservations. You may have noticed that airlines tend to sell
more seats than they actually have, with the hope that someone is not going to show up.
That's the same sort of thing that is being done here: there is a finite amount of physical resource available
within the hypervisor and it is giving it out to all the Guest OS. What the hypervisor wants to do is to be able to
reacquire some of those resources to satisfy the needs of another Guest OS that is experiencing a memory pressure
at any point of time.
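Here is a toy model of the inflate/deflate interaction just described. Real balloon drivers (e.g. in VMware or Xen) talk to the hypervisor over a private channel and interact with the guest's memory allocator; here plain function calls and counters stand in for both, purely as a sketch.

```c
/* Toy model of ballooning.  Function calls stand in for the private
 * channel; counters stand in for the guest's memory allocator. */
#include <stdio.h>

struct guest {
    const char *name;
    int free_pages;        /* physical pages the guest OS has available  */
    int balloon_pages;     /* pages currently held by the balloon driver */
};

/* Hypervisor asks the balloon to inflate: the driver requests pages from
 * its guest OS (which may page things out to make room) and hands the
 * underlying machine pages back to the hypervisor. */
int balloon_inflate(struct guest *g, int want) {
    int got = (want < g->free_pages) ? want : g->free_pages;
    g->free_pages    -= got;
    g->balloon_pages += got;
    printf("%s: balloon inflated by %d pages, returned to hypervisor\n",
           g->name, got);
    return got;                       /* machine pages reclaimed */
}

/* Hypervisor has memory to spare: deflate, releasing pages back to guest. */
void balloon_deflate(struct guest *g, int pages) {
    if (pages > g->balloon_pages) pages = g->balloon_pages;
    g->balloon_pages -= pages;
    g->free_pages    += pages;
    printf("%s: balloon deflated by %d pages, guest can page work back in\n",
           g->name, pages);
}

int main(void) {
    struct guest linux_vm = { "guest-linux", 100, 0 };
    int reclaimed = balloon_inflate(&linux_vm, 40);  /* another VM is hungry */
    balloon_deflate(&linux_vm, reclaimed);           /* pressure subsides    */
    return 0;
}
```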

133 - Sharing Memory Across Virtual Machines

Memory is a precious resource. You don't want to waste it. You want protection between the virtual machines
from one another. But at the same time if there's an opportunity to share memory so that the available physical
resource can be maximally utilized, you want to be able to do that.
So the question is: can we share memory across virtual machines without affecting the integrity of these machines?
Protection is important, but without affecting the protection, can we actually enhance the utility of the available
physical memory in the hypervisor? And the answer is yes.
• Think about it: you may have one instance of Linux contained in a virtual machine (VM1) and another
Linux hosted in another virtual machine (VM2). In both VMs, the memory footprint of the Guest OS is exactly
the same. Even the applications (Firefox) that run on them are going to have exactly the same content.
• This VM1 Linux is going to have a page table unique to this particular Firefox instance. Similarly, this
instance of Linux in VM2 is going to have a page table data structure unique to its instance of Firefox.
• If all things are equal, in terms of versions and so on, then a particular virtual page of the Firefox in VM1
and VM2 is going to have the same content.
• So there is an opportunity to make both of those page table entries point to the same machine page. If we
can do that, then we are avoiding duplication without compromising safety. This is particularly true for
the code pages. The code pages are immutable, so the code pages for VM1's Firefox and VM2's Firefox
could actually share the same page in physical memory. And if you do that, you're using the resources
much more effectively in a virtualized setting.
One way of doing the sharing is by a cooperation between the virtual machines and the hypervisor.
• In other words, the Guest OS has hooks that allow the hypervisor to mark pages Copy-On-Write (COW)
and have the PPNs point to exactly the same machine page. In this way, if a page is going to be written
into by the operating system, it'll result in a fault and we can make a copy of the page that is currently
being shared across two different virtual machines.
• That is one way of doing it: with the cooperation between the hypervisor and the VM, we can share
machine pages across virtual machines, but mark the entries in the individual Guest page tables as COW.
• As long as they only read, it is perfectly fine. The minute any one of the Guest OS wants to write into a
particular page, you make a copy of it and make these two Guest VPNs point to different machine pages.
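A minimal sketch of that copy-on-write behavior is shown below, under simplified assumptions (a tiny "machine memory" array and a single mapping structure per guest PPN; the names are invented for illustration, not from any real hypervisor).

```c
/* Sketch of cooperative copy-on-write sharing: two guests' PPNs point at
 * one machine page until one of them writes to it. */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 16
#define MAX_MPN   8

static char machine_mem[MAX_MPN][PAGE_SIZE];
static int  next_free_mpn = 2;           /* MPNs 0 and 1 already in use */

struct mapping { int mpn; int cow; };    /* one guest PPN -> MPN entry  */

/* Called on the write fault that a COW-marked page generates. */
void cow_write_fault(struct mapping *m, int offset, char value) {
    if (m->cow) {
        int new_mpn = next_free_mpn++;                     /* private copy */
        memcpy(machine_mem[new_mpn], machine_mem[m->mpn], PAGE_SIZE);
        m->mpn = new_mpn;
        m->cow = 0;
        printf("broke sharing: writer now maps private MPN %d\n", new_mpn);
    }
    machine_mem[m->mpn][offset] = value;
}

int main(void) {
    /* Both VMs' Firefox code page mapped to the same machine page, COW. */
    struct mapping vm1 = { .mpn = 0, .cow = 1 };
    struct mapping vm2 = { .mpn = 0, .cow = 1 };
    cow_write_fault(&vm1, 3, 'x');       /* VM1 writes: gets its own copy */
    printf("vm1 -> MPN %d, vm2 -> MPN %d\n", vm1.mpn, vm2.mpn);
    return 0;
}
```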

134 - VM Oblivious Page Sharing

An alternative is to achieve the same effect, but in a way completely oblivious to the Guest OS. This is used in
VMWare ESX server.
The idea is to use content-based sharing. In order to support that, VMWare has a data structure which is a hash
table kept in the hypervisor. This hash table data structure contains a content hash of the machine pages. For
instance, one entry is saying that in VM3, the content of PPN 0x43f8 is stored in MPN 0x123b, and its content
hash is ...06f. If you hash the contents of this memory page, you get a signature. That signature is the content hash,
and that content hash is stored in this data structure.

Now let's see how this data structure is used for doing VM-oblivious page sharing in the ESX Server.
• We want to know if there is a page in VM2 that is content-wise the same as this page contained in VM3.
• In particular we want to know if the VM2 PPN = 0x2868, which is mapped to MPN = 0x1096, is content-
wise the same as MPN = 0x123b.
• So how do we find that out? We take the contents of 0x1096 and create a content hash. The
hypervisor is going to run a particular algorithm that creates a content hash of MPN 0x1096.
• Now we take this content hash and look through hypervisor's data structure to see if there is a match
between the hash of 0x1096 and any page currently in the data structure.
• We have a match between the content hash of VM2’s 0x1096 and VM3’s 0x123b.
Can we know that this page and this page are exactly the same? Well, we cannot. It's only a hint.
• The content hash for 0x123b was calculated when we created this data structure to represent this as a hint
frame.
• VM3 could have been using this page actively and modified it. If it has modified it, then this content hash
that we have in this data structure may no longer be valid.
• Therefore, even though we've got a match, it's only a hint, not an absolute. We still want to do a full
comparison between these two guys to make sure that they are exactly the same. This is called full
comparison upon match.

135 - Successful Match

If the content hash of 0x1096 and 0x123b are exactly the same, then we can modify the PPN to MPN mapping in
VM2 for the page 0x2868 (which used to point to 0x1096) and we can now make it point to 0x123b.
• Once we have done that, then we increment the reference count to this hash table entry to 2, indicating
that there are 2 different virtual machines that map to this same machine page 0x123b.
• We're also going to mark these two mappings (0x2868, 0x123b) and (0x43f8, 0x123b) as Copy-on-Write
entries, indicating that they can share this page, as long as these 2 VMs are only reading it. If any one of
them tries to write it later, you have to make a copy of this page and change the mappings for those PPNs
to go to different MPNs, i.e. COW.
• Now, we can free-up MPN 0x1096 and there is one more machine page frame that's available for the
house/hypervisor in terms of allocation.
All of these things that I mentioned just now are fairly labor-intensive. You don't want to do this when there is
active usage of the system. You want to do it as a background activity of the server when it is lightly loaded:
• Scanning the pages (going through all of a VM's pages to see if a page with the same content may already be
present in the machine memory, reflecting the contents of some other VM) and looking for such matches
• Mapping the virtual machines to share the same machine page
• Freeing up machine memory for allocation by the hypervisor.
The important thing to notice is that, as opposed to the earlier mechanism that I mentioned that the hypervisor
can get into the page table data structures inside the Guest OS, there is no such thing here.
It is completely done oblivious to the Guest OS and there is no change made to the Guest OS.
This technique is applicable to both fully-virtualized and para-virtualized environments.
Because basically, all that we are saying is: let the hypervisor go through the memory contents of a VM
and see if any particular machine page frames can be shared with some other VMs. If so, let's do that and free up
some memory.
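Here is a toy version of that scan: hash a candidate machine page, look the hash up in the hypervisor's table, and on a hint match do a full byte-for-byte comparison before sharing. The hash function, page size, and table are deliberately simplistic stand-ins, not the ESX implementation.

```c
/* Toy content-based page sharing: hash -> hint lookup -> full comparison. */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 64
#define TABLE_SZ  16

struct hint { unsigned hash; int mpn; int refs; int used; };
static struct hint table[TABLE_SZ];

static unsigned content_hash(const char *page) {
    unsigned h = 2166136261u;                 /* tiny FNV-style hash */
    for (int i = 0; i < PAGE_SIZE; i++)
        h = (h ^ (unsigned char)page[i]) * 16777619u;
    return h;
}

/* Returns the MPN the caller should map to: a shared page on success,
 * otherwise its own candidate page (which then becomes a hint frame).  */
int try_share(char machine_mem[][PAGE_SIZE], int candidate_mpn) {
    unsigned h = content_hash(machine_mem[candidate_mpn]);
    for (int i = 0; i < TABLE_SZ; i++) {
        if (table[i].used && table[i].hash == h &&
            /* the hash is only a hint: confirm with a full comparison */
            memcmp(machine_mem[candidate_mpn], machine_mem[table[i].mpn],
                   PAGE_SIZE) == 0) {
            table[i].refs++;                  /* both mappings now COW       */
            return table[i].mpn;              /* candidate MPN can be freed  */
        }
    }
    for (int i = 0; i < TABLE_SZ; i++)        /* no match: record a hint     */
        if (!table[i].used) {
            table[i] = (struct hint){ h, candidate_mpn, 1, 1 };
            break;
        }
    return candidate_mpn;
}

int main(void) {
    static char mem[4][PAGE_SIZE];
    strcpy(mem[1], "identical firefox code page");
    strcpy(mem[2], "identical firefox code page");
    try_share(mem, 1);                        /* first page becomes the hint */
    printf("second VM's page now maps to MPN %d\n", try_share(mem, 2));
    return 0;
}
```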
136 - Memory Allocation Policies

Up to now, we talked about mechanisms for virtualizing the memory resource. In particular for dealing with
dynamic memory pressure and sharing machine pages across VMs.
A higher level issue is the policies we have to use for allocating and reclaiming memory from the domains to
which we've allocated them in the first place. Ultimately, the goal of virtualization is maximizing the utilization
of the resources. Memory is a precious resource and virtualized environments may use different policies for
memory allocation.

• One can be a pure share-based policy. The idea is, you pay less, you get less. So if you have a service
level agreement with the data center, then the data center gives you a certain amount of resources based on
the dollars you put on the table. The problem with a share-based approach is the fact that it could lead to
hoarding. If a virtual machine gets a bunch of resources and is not really using them, they are just wasted.

• The desired behavior is: if the working set of a virtual machine goes up, you give it more memory. If its
working set shrinks, you get back the memory so that you can give it to somebody else. So a working-set-based
approach would be the saner approach.
At the same time, if I paid money, I need my resources. So, one thing that can be done is sort of put these two
ideas together in implementing a dynamic idle-adjusted-shares approach.
• In other words, you're going to tax the guys that are hoarders by taxing the idle pages more than active
pages.
• If I've given you a bunch of resources, if you're actively using it, more power to you.
• But if you're hoarding it, I'm going to tax you. I'm even going to take away the resources from you and
you may not even notice it because you're not using it anyway.
So that's the idea in this dynamic idle-adjusted shares approach. Now what is this tax?
Well, we could make the tax rate 0%, that is plutocracy, meaning you paid for it, you got it, you can sit on it,
I'm not going to tax you, that's one approach.
Or I could make the tax 100%, meaning that if you've got some resources and you're not using it, I'm going to
take all of it away. So that's the wealth redistribution, sort of a socialistic policy, use it or lose it. In other words,
if you make the tax 100%, we are ignoring shares altogether.
Now, something in between is probably the best way to do it. So, for instance, you could use a tax rate of 50% or
75%, saying that if you have idle pages, there's a 50% (or 75%) chance I'll take them away. And that's what
is being done in the VMware ESX server today in terms of how it allocates memory to the domains that need it.
By having a tax rate that is not quite 100% but maybe 50% or 75%, we can reclaim most of the idle memory
from the VMs that are not actively using it. But at the same time, since we're not taxing idle pages 100%, it allows
for sudden working set increases that might be experienced by a particular domain. Suddenly, a domain starts
needing more memory, at that point it may still have some reserves in terms of the idle pages that I did not take
away from that particular domain.
So the key point is that you don't want to tax at 100%, because leaving some idle pages in place allows for sudden
working-set increases that may occur in a virtual machine that happens to be idle for some time, but in which work suddenly picks up.
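The sketch below illustrates only the policy intuition: idle pages are taxed, so a hoarding VM gives up far more than a busy one. This is not VMware ESX's actual formula; the structure and numbers are invented for illustration.

```c
/* Simplified illustration of the idle-adjusted-shares idea: idle pages
 * are "taxed" at some rate, making them the first candidates for
 * reclamation.  Not the ESX formula, just the intuition from the lecture. */
#include <stdio.h>

struct vm {
    const char *name;
    int shares;         /* what the customer paid for            */
    int allocated;      /* pages currently allocated to the VM   */
    int active;         /* pages in the VM's current working set */
};

/* Pages the hypervisor may take back from this VM at the given tax rate. */
int reclaimable(const struct vm *v, double tax_rate) {
    int idle = v->allocated - v->active;
    return (int)(idle * tax_rate);  /* tax 0.0 = plutocracy, 1.0 = use it or lose it */
}

int main(void) {
    struct vm hoarder = { "vm-hoarder", 1000, 800, 200 };
    struct vm busy    = { "vm-busy",    1000, 800, 780 };
    double tax = 0.75;              /* the in-between setting discussed above */
    printf("%s: reclaim up to %d pages\n", hoarder.name, reclaimable(&hoarder, tax));
    printf("%s: reclaim up to %d pages\n", busy.name,    reclaimable(&busy, tax));
    return 0;
}
```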
137 - Introduction

As we said at the outset of this course module, virtualizing the memory subsystem in a safe and performance
conscious manner is the key to the success of virtualization.
What remains to be virtualized are the CPU and the devices. That's what we'll talk about next. Memory
virtualization is sort of under the covers. When it comes to the CPU and the devices, the interference among the
virtual machines living on top of the hypervisor becomes much more explicit.
The challenge in virtualizing the CPU and the devices is giving the illusion to the Guest OS living on top, that
they own the resources that are protected and handed to them by the hypervisor on a need basis.
There are two parts to CPU virtualization.
• We want to give the illusion to each guest operating system, that it owns the CPU. That is: it does not
even know the existence of other guests on top of the same CPU. If you think about it, this is not very far
removed from the concepts of a time shared operating system, which has to give the illusion to each
running process that that process is the only one running on the processor. The main difference in the
virtual setting is that this illusion is being provided by the hypervisor at the granularity of an entire
operating system. That's the first part.
• The second part is that we want the hypervisor to field events arising due to the execution of a process that
belongs to a parent guest operating system. In particular, during the execution of a process on the processor,
there are going to be discontinuities that occur. Those program discontinuities have to be fielded by the
hypervisor and passed to the right Guest OS.
138 - CPU Virtualization

To keep things simple, let's assume that there's a single CPU.


• Each Guest OS is already multiplexing processes on the “real” CPU in a non-virtualized setting. So each
operating system has a ready queue of processes that can be scheduled on the CPU.
• But there is this hypervisor that is sitting in between a Guest OS, its ready queue and the CPU.
• The first part that the hypervisor has to do is to give an illusion of CPU ownership to each of the VM, so
that each VM can schedule the processes that it currently has in its ready queue on the CPU.
• If you look at it from the hypervisor's point of view, it has to have a precise way of accounting the time
that a particular VM uses on the CPU. From the point of view of billing the different customers, that is
the key thing that hypervisor is worried about.
• For the time that has been allocated to a particular VM, how is this VM using the CPU time? What
processes are being scheduled on the CPU? The hypervisor doesn't care about that. The Guest OS is free
to schedule its pool of processes on the CPU for the time that has been given to it by the hypervisor.
Similar to the policy that we discussed for memory allocation, one straightforward way to share the CPU
among the Guest OS, is to give a proportional share of the CPU for each Guest OS, depending on the service
agreement that a VM has with the hypervisor. This is called a proportional-share scheduler, and it is used in the
VMware ESX server. Another approach is to use a fair-share scheduler, which is going to give a fair share of
the CPU to each of the Guest OS running on top of the hypervisor. Both of these strategies, the proportional-share
scheduler and the fair-share scheduler, are straightforward conceptual mechanisms. You can learn more about them
from the assigned readings for this course.

In either case, the hypervisor has to account for the time used on the CPU on behalf of a different guest during
the allocated time for a particular guest. For example, this can happen if there is an external interrupt that is
intended for the Linux VM while a process in the Windows VM is currently executing on the CPU. Through this
accounting, the hypervisor needs to make sure that any time that was stolen away from a particular VM to
service an external interrupt for a different VM will get rewarded later on.
We have discussed this issue already in the context of operating system extensibility, specifically when we
discussed the exokernel approach to extensibility. So, it's not a new problem. It is just that this problem occurs in
the hypervisor setting also.
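The toy loop below shows the accounting intuition only: shares drive a proportional pick, and time stolen to service another VM's interrupt is credited back. It is not the ESX proportional-share scheduler or the Xen credit scheduler; all names and numbers are illustrative.

```c
/* Toy proportional-share CPU accounting across guests, including the
 * "stolen time" credit mentioned above. */
#include <stdio.h>

struct vm {
    const char *name;
    int shares;          /* proportional entitlement               */
    long used_ms;        /* CPU time actually consumed by this VM  */
    long credit_ms;      /* time stolen from this VM, owed back    */
};

/* Pick the VM that is furthest behind its proportional entitlement. */
int pick_next(struct vm vms[], int n, long total_ms) {
    int total_shares = 0, best = 0;
    for (int i = 0; i < n; i++) total_shares += vms[i].shares;
    double worst = 1e18;
    for (int i = 0; i < n; i++) {
        double entitled = (double)total_ms * vms[i].shares / total_shares
                          + vms[i].credit_ms;
        double lag = vms[i].used_ms - entitled;   /* negative = behind */
        if (lag < worst) { worst = lag; best = i; }
    }
    return best;
}

int main(void) {
    struct vm vms[2] = { { "windows", 2, 0, 0 }, { "linux", 1, 0, 0 } };
    long clock_ms = 0;
    for (int tick = 0; tick < 6; tick++) {
        int i = pick_next(vms, 2, clock_ms);
        long slice = 10, stolen = (tick == 2) ? 3 : 0; /* interrupt for the other VM */
        vms[i].used_ms   += slice - stolen;
        vms[i].credit_ms += stolen;            /* account for the stolen time */
        clock_ms         += slice;
        printf("t=%ldms ran %s\n", clock_ms, vms[i].name);
    }
    return 0;
}
```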
139 - Second Part (Common to Full and Para)

The second part of CPU virtualization, which is common to both full and para-virtualized environments, is being
able to deliver events to the parent guest operating system. These events are events that are generated by a process
that belongs to the guest operating system, currently executing on the CPU.
Let's see what happens to this process when it is executing on the CPU. Once this process has been scheduled
on the CPU, during its normal program execution, everything should be happening at hardware speed. What does
that mean?
The process is going to generate virtual addresses that have to be translated to machine addresses. We have
talked about how the hypervisor can ensure that the translation from virtual address to machine address is
done at hardware speed by some clever tricks.
• In a fully virtualized environment, the hypervisor is responsible for ensuring that the virtual address gets
translated directly to the machine address.
• Through the intervention of the Guest OS, a para-virtualized environment can ensure that the page table used for translating virtual addresses to physical addresses is one that has been installed by the hypervisor on behalf of the Guest OS, so that the translation can happen at hardware speed.
This is the most crucial part of ensuring good performance for the currently executing process in a virtualized
environment.
Let's look at other things that can happen to this process during the course of its execution.
• One thing that this process may do is to execute a system call. For instance, it might want to open a file
and that's really a call into the Guest OS.
• Another thing that can happen to this process is that it can incur a page fault, because not all of the virtual address pages of the process are going to be in the machine memory.
• The process may also throw an exception. For instance, it might do something silly like divide by zero
which can result in an exception.
• Lastly, through no fault of this process, there could be an external interrupt while this process is executing.
These kinds of discontinuities, called program discontinuities, affect the normal execution of the process. The first three that I mentioned (system call, page fault, exception) are due to this process in some shape or form. But the fourth one (external interrupt) happens asynchronously, unbeknownst to what this process intended to do while running on the processor.
All of these discontinuities have to be passed up by the hypervisor to the Guest OS. So the common issue in both full and para-virtualized environments is that all such program discontinuities for the currently running
process have to be passed up to the parent guest OS by the hypervisor. Nothing special needs to be done in the
Guest OS for fielding these events from the hypervisor, because all of these events are going to be packaged as
software interrupts by the hypervisor and delivered to the Guest OS. Any operating system knows how to handle
interrupts.
There are some quirks of the architecture that may get in the way, and the hypervisor may have to deal with that. Recall that system calls, page faults, and exceptions are all things that need to be handled by the Guest OS; the Guest OS has entry points in it for dealing with all these kinds of program discontinuities.
Now some of the things that the Guest OS may have to do to deal with these program discontinuities may
require the Guest OS to have privileged access to the CPU. That is, certain instructions of the processor can only
be executed in the kernel mode, or the privileged mode. But recall that the guest operating system itself is not running in privileged mode. It is above the red line, which means it has no more privilege than a normal user-level process.
This is a problem, especially in a fully virtualized environment, because in a fully virtualized environment the Guest OS has no knowledge that it lacks these privileges. So when it tries to execute some instruction that can only be executed in privileged mode, it will trap into the hypervisor, and the hypervisor will then do the work that the fully virtualized Guest OS was intending to do.
Here is where the quirks of the architecture come into play. Unfortunately, some privileged instructions fail silently on some architectures when they are executed at the user level, and this is a problem for the fully virtualized environment, where the Guest OS binary is unchanged. In the para-virtualized environment, because we know that the para-virtualized guest is not running on bare metal, we know that it does not have the same privilege as the hypervisor. Therefore, we can collaboratively make sure that anything the guest has to do in privileged mode is delegated to the hypervisor. But in a fully virtualized setting, the Guest OS has no such knowledge.
So the only mechanism that will save the day is this: if the guest tries to execute a privileged instruction, it will trap into the hypervisor, and the hypervisor can do the necessary thing. However, some privileged instructions do not trap when executed at the user level. Instead, they fail silently, and that is a problem. This happens in older versions of the x86 ISA when privileged instructions are executed in user mode.
Therefore, the hypervisor has to be intimately aware of the quirks of the hardware and ensure that there are workarounds for such quirks. Specifically, in a fully virtualized setting, the hypervisor has to look at the unchanged binary of the operating system, look for places where these quirks may surface, and do binary rewriting in order to catch those instructions when they are executed. Having said that, I should mention that newer versions of Intel's and AMD's x86 architectures have included virtualization support so that these kinds of problems do not occur any more.
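
To make the trap-and-emulate path concrete, here is a minimal, hypothetical sketch of a hypervisor trap handler: privileged instructions attempted by the guest are emulated on its behalf, while the program discontinuities discussed earlier are packaged as software interrupts and reflected up to the Guest OS. The trap codes and helpers are illustrative assumptions; real x86 handling (and the binary rewriting needed for instructions that fail silently) is far more involved.

#include <stdio.h>

/* Hypothetical trap reasons the hardware reports to the hypervisor. */
enum trap_reason { TRAP_PRIV_INSTR, TRAP_SYSCALL, TRAP_PAGE_FAULT, TRAP_EXT_IRQ };

struct trap_frame {
    enum trap_reason reason;
    unsigned long    info;   /* e.g. faulting address or instruction details */
    int              vm_id;  /* which guest was running                      */
};

/* Illustrative stubs -- stand-ins for the real emulation/delivery machinery. */
static void emulate_privileged_instruction(int vm_id, unsigned long info)
{
    printf("emulating privileged op %#lx for VM %d\n", info, vm_id);
}
static void reflect_as_software_interrupt(int vm_id, enum trap_reason r,
                                          unsigned long info)
{
    printf("delivering event %d (info %#lx) up to guest %d\n", r, info, vm_id);
}

void hypervisor_trap_handler(struct trap_frame *tf)
{
    switch (tf->reason) {
    case TRAP_PRIV_INSTR:
        /* Fully virtualized guest tried a privileged op: do it on its behalf. */
        emulate_privileged_instruction(tf->vm_id, tf->info);
        break;
    case TRAP_SYSCALL:
    case TRAP_PAGE_FAULT:
    case TRAP_EXT_IRQ:
        /* Program discontinuities are passed up to the parent Guest OS. */
        reflect_as_software_interrupt(tf->vm_id, tf->reason, tf->info);
        break;
    }
}
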

As a result of servicing the events delivered by the hypervisor, the guest may have to do certain things, which may result in communication back to the hypervisor.
• In the case of a fully virtualized environment, communication from the guest to the hypervisor is always
implicit via traps. For example, as a result of page fault servicing, the guest may try to install a new entry
into the page table. When it tries to do that, it traps into the hypervisor, and the
hypervisor will take the appropriate action.
• In a para-virtualized environment, the communication from the guest to the hypervisor is explicit. There are APIs (hyper-calls) that the hypervisor supports for the Guest OS to communicate back to the hypervisor. Why is that needed? Well, we talked about memory management done by the Guest OS in a para-virtualized environment such as Xen, where the Guest OS may have to tell the hypervisor: “Here is a page table entry, please install it in the page table for me.” That is the kind of communication that may have to come back from the Guest OS down to the hypervisor (sketched below).
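
As a hedged sketch of what such an explicit interface could look like (the names and layout below are illustrative, not Xen's actual hyper-call ABI), a para-virtualized guest might batch page-table updates and hand them to the hypervisor, which validates ownership before installing them:

#include <stdint.h>

/* Illustrative request format for a page-table-update hyper-call. */
struct pt_update_req {
    uint64_t ptr;  /* machine address of the page-table entry to modify */
    uint64_t val;  /* new entry value the guest wants installed         */
};

/* Assumed hyper-call stub: in a real para-virtualized guest this would be
 * a trapping instruction into the hypervisor, which checks that the guest
 * owns the pages involved before writing the entries.                    */
static long hypercall_pt_update(struct pt_update_req *reqs, unsigned int count)
{
    (void)reqs; (void)count;
    return 0;   /* stub: pretend the hypervisor accepted the update */
}

/* Guest-side page-fault servicing: instead of writing the PTE directly
 * (which it is not privileged to do), the guest asks the hypervisor.    */
int guest_install_pte(uint64_t pte_machine_addr, uint64_t new_pte)
{
    struct pt_update_req req = { .ptr = pte_machine_addr, .val = new_pte };
    return hypercall_pt_update(&req, 1) == 0 ? 0 : -1;
}
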

140 - Device Virtualization Intro

The next issue is virtualizing devices. Here again, we want to give each guest the illusion that it owns the I/O devices.
141 - Device Virtualization

In the case of full virtualization, the Guest OS thinks that it already owns all the devices, and the way devices are virtualized is the familiar trap-and-emulate technique. For the devices that the Guest OS thinks it owns, when it tries to access those devices, it's going to result in a trap into the hypervisor, and the hypervisor will emulate the functionality that the Guest OS intends for that particular device. In this sense, there is not much scope for innovation in the way devices are virtualized in a fully virtualized environment. There are, of course, lots of details that the hypervisor has to worry about once the Guest OS traps into it: ensuring the legality of the I/O operation, whether the guest is allowed to make that I/O operation, and so on. But nothing conceptually fundamental is there in terms of device virtualization in a fully virtualized environment.
The para-virtualized setting is much more interesting. The I/O devices seen by the Guest OS are exactly the ones that are available to the hypervisor. This gives an opportunity to innovate in the interaction between the Guest OS and the hypervisor to make device virtualization more efficient.
• So it is possible for the hypervisor to come up with clean and simple device abstractions that can be used
by the para-virtualized Guest OS.
• Further, through APIs it becomes possible for the hypervisor to expose shared buffers to the Guest OS,
so that data can be passed between the Guest OS and the hypervisor and to the devices efficiently without
incurring the overhead of copying multiple times.
• Similarly, there can be innovations in the way event delivery happens between the hypervisor and the
Guest OS.

So in order to do device virtualization, we have to worry about two things:


• One is how to transfer control back and forth between the hypervisor and the Guest OS. Because devices,
being hardware entities, need to be manipulated by the hypervisor in a privileged state.
• And there is data transfer that needs to be done because the hypervisor is in a different protection domain
compared to the Guest OS.
We'll see how control transfer and data transfer are accomplished in fully and para virtualized settings.

142 - Control Transfer

Control transfer in a fully virtualized setting happens implicitly from the guest to the hypervisor.
• When the Guest OS executes any privileged instruction, it will result in a trap, and the hypervisor will catch it
and do the appropriate thing. That's how control is transferred from the guest to the hypervisor implicitly.
• In the other direction, control transfer happens as we already mentioned, via software interrupts or events,
from the hypervisor to the guest.
In a para virtualized setting, the control transfer happens explicitly via hyper-calls from the guest into the
hypervisor.
• I gave you the example of page table updates that the Guest OS may want to communicate to the
hypervisor through the hyper-calls. When it executes the API calls, it results in control transfer from the
guest into the hypervisor.
• Similar to the full virtualization case, in para virtualization, in the other direction from the hypervisor to
the guest, it is done through software interrupts.

So that's how control transfer is handled in both the fully virtualized and paravirtualized environments.
The additional facility that you have in a para-virtualized environment is that the guest has control, via hyper-calls, over when event notifications need to be delivered. In the case of full virtualization, since the Guest OS is unaware of the existence of the hypervisor, events are delivered by the hypervisor to the guest as and when they occur.
In para-virtualization, via hyper-calls, the guest can indicate to the hypervisor, “leave me alone, don't send me any event notifications now,” or it can say, “now is a good time to send me event notifications.” That level of control exists in a para-virtualized environment and does not exist in a fully virtualized environment. It is similar to an operating system disabling interrupts, except that in a para-virtualized environment the facility is available at the granularity of an entire operating system.
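
A minimal sketch of this facility, assuming a small structure shared between the guest and the hypervisor (the names are illustrative; Xen's shared-info page carries a similar per-VCPU mask): the guest sets a flag to hold events back and clears it when it is ready to receive them.

#include <stdbool.h>

/* Illustrative shared state mapped into both the guest and the hypervisor. */
struct shared_event_info {
    volatile bool events_masked;  /* guest-writable: "leave me alone"        */
    volatile bool event_pending;  /* hypervisor-writable: delivery deferred  */
};

/* Guest side: bracket a region during which no event upcalls are wanted.  */
void guest_mask_events(struct shared_event_info *s)   { s->events_masked = true;  }
void guest_unmask_events(struct shared_event_info *s) { s->events_masked = false; }

/* Hypervisor side: deliver the event now only if the guest allows it,
 * otherwise just mark it pending so it can be delivered at unmask time.   */
void hypervisor_deliver_event(struct shared_event_info *s, void (*upcall)(void))
{
    if (s->events_masked)
        s->event_pending = true;
    else
        upcall();   /* software-interrupt-style upcall into the Guest OS */
}
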

143 - Data Transfer

Again, data transfer in full virtualization is implicit. In a para virtualized setting, for example in Xen, there's an
opportunity to innovate because you can be explicit about the data movement from the Guest OS into the
hypervisor and vice-versa.
There are two aspects to resource management and accountability when it comes to data transfer.
• The first is CPU time. When an interrupt comes in from a device, the hypervisor has to demultiplex the data from the device to the domains very quickly. CPU time is needed to make such copies, so the hypervisor has to account for the computation time spent managing the buffers on behalf of the virtualized operating systems above it. CPU time accountability is crucial from the point of view of billing in data centers. Therefore hypervisors pay a lot of attention to how CPU time is accounted for and charged to the appropriate virtual machines.
• The second aspect of data transfer is how the memory buffers are managed. This is a space issue, while the CPU issue is a time issue. In the context of full virtualization, there is no scope for innovation. But
in the context of paravirtualization, there's a lot of scope for innovation in the way memory buffers are
handled between the Guest OS and the hypervisor.
Specifically, let's look at the opportunities for innovation in the Guest OS to Xen communication.
• Xen provides asynchronous I/O rings, which is basically a data structure that is shared between the guest
and the Xen for communication.
• Any number of such I/O rings can be allocated for handling the device I/O needs of a particular guest
domain.
• The I/O ring itself is just a set of descriptors; what you see in the figure are all descriptors in this data structure. The idea is that requests from a guest can be placed in this I/O ring by populating these descriptors.
Every descriptor is a unique I/O request coming from the Guest OS. So every request has a unique ID
associated with it.
• Recall what I said earlier that the I/O ring is specific to a guest. Every guest can declare a set of I/O rings
as data structures for its communication with Xen.
• After it completes processing a request, Xen will place a response back in one of these descriptors in the same I/O ring.
• And this response is going to carry the same unique ID that was used to send the request to Xen in the first place.
So it's sort of a symmetric relationship between the guest and Xen, in terms of request and response for things
that the guest wants Xen to get done on its behalf and for Xen to communicate the response back.
The guest is the producer of requests. It has a pointer into this data structure that dictates where it has to place the next request. This pointer (P1) is updated by the guest and readable by Xen; it is a shared pointer in that sense. For example, the Guest OS may have placed new requests, and that is why P1 has moved, indicating that it has placed two new requests and that the next empty slot for a new request is here.
The consumer of the requests is Xen, and it processes requests in the order in which they were produced by the Guest OS. It has its own pointer (P2) into this I/O ring data structure, indicating where exactly it is in servicing Guest OS requests. Right now, the pointer is pointing here, indicating that Xen is yet to process these two requests created by the Guest OS. P2 is private to Xen, and the difference between P1 and P2, i.e. |P1 - P2|, tells Xen how many outstanding requests it still has to process.
Similar to request production, Xen is going to be the guy that is offering responses back to the Guest OS, i.e.
Xen is the response producer. It is going to have its own producer pointer P3, indicating where it can place new
responses. Once again, P3 is a shared pointer, which is updated by Xen and readable by the Guest OS. For example,
these are two new responses that Xen has placed in the I/O Ring and they are yet to be picked up by the consumer.
The Guest has a private pointer P4 that says where it is in this I/O ring in terms of picking up the responses from
Xen. The difference between P3 and P4 is the number of responses that are yet to be picked up by the Guest OS.
All pointers in the figure move clockwise.
• P1 moves when a new Guest request is put into the ring. P2 moves when Xen processes a Guest request.
• P3 moves when Xen puts a response back into the ring. P4 moves when the Guest picks up a response from Xen.
That is the idea behind this I/O ring data structure: a very powerful abstraction that allows the guest to place requests asynchronously and Xen to place responses asynchronously.
• As I mentioned already, the guests can identify the request for which a particular response is intended,
because the ID that is associated with the response is the same as the ID of the request.
• The other thing is that these are just descriptors of the requests and responses. We are only passing pointers to a machine page between the Guest and Xen, so that Xen can pick up the data directly from that machine page without any copying. Similarly, for responses, the Guest will get a pointer from the descriptor data structure to the machine page and access the data directly without copying.
These asynchronous I/O rings are a powerful mechanism, both for efficient communication between the Guest OS and Xen and for avoiding copying overhead between the para-virtualized Guest OS and Xen. A minimal sketch of such a ring appears below. We will then look at two specific examples of how these I/O rings are used for guest-to-Xen communication.
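
Here is a minimal sketch of such an asynchronous I/O ring in the spirit of Xen's design; the field names, the fixed ring size, and the absence of memory barriers are simplifications of my own, not the actual Xen ring interface. Descriptors carry only IDs and machine-page pointers, so no data is copied through the ring.

#include <stdint.h>

#define RING_SIZE 64u   /* illustrative; a power of two keeps indexing cheap */

/* A descriptor names the request and points at the data's machine page,
 * so requests and responses move without copying the data itself.         */
struct ring_desc {
    uint64_t id;         /* matches a response to its request              */
    uint64_t mach_page;  /* machine address of the guest's data buffer     */
    uint32_t len;
    uint32_t op;         /* e.g. read / write / transmit                   */
};

/* Shared between one guest and Xen. P1 and P3 from the text live here;
 * P2 (Xen's request cursor) and P4 (the guest's response cursor) stay
 * private to their owners.                                                */
struct io_ring {
    struct ring_desc  slots[RING_SIZE];
    volatile uint32_t req_prod;  /* P1: written by the guest, read by Xen   */
    volatile uint32_t rsp_prod;  /* P3: written by Xen, read by the guest   */
};

/* Guest: place a request at P1 and advance it (request producer). */
void guest_push_request(struct io_ring *r, struct ring_desc d)
{
    r->slots[r->req_prod % RING_SIZE] = d;
    r->req_prod++;                       /* now visible to Xen             */
}

/* Xen: consume requests while its private cursor P2 lags behind P1. */
int xen_pop_request(struct io_ring *r, uint32_t *req_cons /* P2 */,
                    struct ring_desc *out)
{
    if (*req_cons == r->req_prod)
        return 0;                        /* |P1 - P2| == 0: nothing to do  */
    *out = r->slots[*req_cons % RING_SIZE];
    (*req_cons)++;
    return 1;
}

Responses travel the same way in the opposite direction: Xen fills a descriptor carrying the original request ID, advances rsp_prod (P3), and the guest consumes responses with its private P4 cursor.
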

144 - Control and Data Transfer in Action

The first example that we will look at is how control and data transfer are done in Xen for network virtualization.
Each guest has two I/O rings, one for transmission and one for reception.
If the guest wants to transmit packets, it enqueues descriptors into the transmit I/O ring via the hyper-calls that
Xen provides.
• The packets that need to be transmitted are not copied into Xen. The data packets stay in the Guest OS
buffers, and the Guest OS simply embeds pointers to these buffers in the descriptors and enqueues the descriptors into the I/O ring. So in other words, there is no copying of data packets from the Guest buffers
into Xen.
• For the duration of the transmission, the pages associated with these network packets are pinned so that
Xen can complete the transmission of the packets.
• This is the interaction between one Guest OS and Xen. Of course, more than one VM may want to transmit, and Xen uses its own round-robin packet scheduler to transmit packets from the different VMs (a sketch of the zero-copy transmit path appears after this list).
Receiving packets from the network and passing them to the appropriate domain works similarly to transmission, but in the opposite direction.
• In order to make things efficient, a Guest OS will pre-allocate network buffers, which are pages owned by the Guest OS.
• When a packet comes in, Xen can directly put the network packet into the buffer that is owned by the
Guest OS and enqueue a response descriptor to the I/O ring for that particular Guest OS. So once again,
we can avoid copying data from Xen to the Guest OS.
• One of the cute tricks that Xen also plays is that, when Xen receives a packet into a machine page, it can
simply exchange that machine page with some other page that belongs to the Guest OS. That, again, is another trick to avoid copying.
In summary, either the guest can pre-allocate a buffer for Xen to put the packet into directly, or Xen can simply swap the machine page it received the packet into with a page that belongs to the guest.
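
Building on the ring sketch above, here is a hedged example of the zero-copy transmit path: the guest never copies the packet into Xen; it pins the page holding the payload and enqueues a descriptor pointing at it. The helper names and the notification hyper-call are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>

struct tx_desc {
    uint64_t id;         /* used to match the completion response later   */
    uint64_t mach_page;  /* machine page holding the packet payload       */
    uint16_t offset;     /* payload offset within that page               */
    uint16_t len;        /* packet length in bytes                        */
};

/* Illustrative stubs standing in for the real mechanisms. */
static void pin_page(uint64_t mach_page)
{
    printf("pin page %#llx for the duration of the transmit\n",
           (unsigned long long)mach_page);
}
static void ring_push_tx(struct tx_desc *d)
{
    printf("enqueue tx descriptor id %llu\n", (unsigned long long)d->id);
}
static long hypercall_notify_tx(void)
{
    return 0;   /* kick Xen so its packet scheduler picks up the work */
}

/* Guest-side transmit: only a descriptor crosses into the shared ring. */
void guest_transmit(uint64_t mach_page, uint16_t off, uint16_t len, uint64_t id)
{
    struct tx_desc d = { .id = id, .mach_page = mach_page,
                         .offset = off, .len = len };
    pin_page(mach_page);     /* page must stay resident until Xen finishes   */
    ring_push_tx(&d);        /* descriptor only; the data stays in the guest */
    hypercall_notify_tx();   /* explicit notification that work is pending   */
}
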

145 - Disk I/O Virtualization

Disk I/O virtualization works quite similarly: every VM has an I/O ring that is dedicated to disk I/O.
For example, this is an I/O ring associated with VM1 and this is an I/O ring associated with VM2. As with network virtualization, the communication between the Guest OS and Xen strives to avoid copying altogether.
• There is no copying into Xen, because we only enqueue descriptors for the disk I/O that we want to get done. Inside the descriptors are pointers to the Guest OS buffers, which either already contain the data for the transfer to the disk or serve as placeholders for the data coming from the disk.

• The philosophy is asynchronous I/O. The Guest OS enqueues these descriptors into the I/O ring asynchronously with respect to Xen enqueuing responses back.
o Since Xen is in charge of the actual devices (the disk subsystem in this case), Xen may reorder
requests from competing domains in order to make the I/O throughput efficient.

o There may be situations where such request reordering is inappropriate for the semantics of the I/O operation. Therefore, Xen also provides a reorder barrier for the Guest OS to enforce that operations are performed in exactly the order in which they were requested. Such a reorder barrier may be necessary for higher-level semantics such as write-ahead logging and the like (sketched below).
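
As a hedged sketch of the reorder-barrier idea (the descriptor fields and batching logic are my own illustration, not Xen's block interface): Xen can reorder freely within a batch but must drain the batch whenever a barrier request appears, which is exactly what write-ahead logging needs.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct disk_req {
    uint64_t id;
    uint64_t sector;
    uint64_t guest_buf;  /* machine address of the guest's data buffer   */
    uint32_t nsectors;
    bool     write;
    bool     barrier;    /* no request may be reordered across this one  */
};

/* Illustrative stub: submit a batch, reordered (e.g. by sector) for throughput. */
static void issue_batch(struct disk_req *reqs, int from, int to)
{
    for (int i = from; i < to; i++)
        printf("issue request %llu\n", (unsigned long long)reqs[i].id);
}

/* Xen-side (conceptual): split the request stream at barriers; requests
 * inside a batch may be reordered, but batches are issued strictly in order. */
void schedule_disk_requests(struct disk_req *reqs, int n)
{
    int batch_start = 0;
    for (int i = 0; i < n; i++) {
        if (reqs[i].barrier) {
            issue_batch(reqs, batch_start, i + 1);  /* includes the barrier  */
            batch_start = i + 1;                    /* next batch starts here */
        }
    }
    issue_batch(reqs, batch_start, n);              /* trailing requests      */
}
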
This completes the discussion of all the subsystems that need to be virtualized, in both a fully virtualized environment and a para-virtualized environment.
Next, we will talk about usage and billing.
146 - Measuring Time

The whole tenet of utility computing is that resources are being shared by several potential clients, and we need a mechanism for billing them.
It's extremely important that we have good ways of measuring the usage of every one of these resources: the CPU, memory, storage, and network. Virtualized environments have mechanisms for recording both the time and the space usage of the resources they possess, so that they can accurately bill the users of those resources.
147 - Xen and Guests

I want to conclude this course module with a couple of pictures.

• One shows Xen and the guests supported on top of it. This is from the original paper, which shows all the resources available, how they are virtualized, and how Guest OSes may sit on top of it.

• This is a similar picture for VMware. As you know, Xen is a para-virtualized environment and VMware
is a fully virtualized environment. This shows how VMware allows different guests to sit on top of its
virtualization layer.
The main difference between virtualization technology and the extensible operating systems that we saw in the earlier course module is this: in the case of virtualization technology (para-virtualization or full virtualization), the focus is on protection and flexibility. Performance is important, but we are focusing on protection and flexibility, and on making sure that we share the resources not at the granularity of individual applications that may run on top of the hardware, but at the granularity of entire operating systems running on top of the virtualization layer.

148 - Conclusion

In closing, we saw how the vision of extensible services (which had its roots in the HYDRA OS) and the virtualization technology for hardware (which started with IBM VM/370) culminated in the virtualization technology of today, which has now become mainstream.
Data centers around the world are powered by this technology. All the major IT giants of the day (chip makers, box makers, service providers, etc.) are in this game. Processor chips from both Intel and AMD have started incorporating virtualization support, and that makes it easier to implement the hypervisor, especially for overcoming the architectural quirks that we discussed in the context of VMware-like full virtualization.
Even Xen, which started out with para-virtualization, has newer versions that exploit such architectural features of the processor to run unmodified operating systems on top of Xen, a technique they call hardware-assisted virtualization.
