820 3703
Table of Contents
Introduction .............................................. 1
    Hardware Level Virtualization ......................... 2
    Scope ................................................. 4
Section 1: Background Information ......................... 7
    Virtual Machine Monitor Basics ........................ 9
        VMM Requirements .................................. 9
        VMM Architecture ................................. 11
    The x86 Processor Architecture ....................... 21
    SPARC Processor Architecture ......................... 29
Section 2: Hardware Virtualization Implementations ....... 37
    Sun xVM Server ....................................... 39
        Sun xVM Server Architecture Overview ............. 40
        Sun xVM Server CPU Virtualization ................ 45
        Sun xVM Server Memory Virtualization ............. 52
        Sun xVM Server I/O Virtualization ................ 56
    Sun xVM Server with Hardware VM (HVM) ................ 63
        HVM Operations and Data Structure ................ 64
        Sun xVM Server with HVM Architecture Overview .... 68
    Logical Domains ...................................... 79
        Logical Domains (LDoms) Architecture Overview .... 80
        CPU Virtualization in LDoms ...................... 84
        Memory Virtualization in LDoms ................... 88
        I/O Virtualization in LDoms ...................... 91
    VMware ............................................... 97
        VMware Infrastructure Overview ................... 98
        VMware CPU Virtualization ........................ 98
        VMware Memory Virtualization .................... 103
        VMware I/O Virtualization ....................... 103
Section 3: Additional Information ....................... 107
    VMM Comparison ...................................... 109
    References .......................................... 111
    Terms and Definitions ............................... 113
    Author Biography .................................... 117
Chapter 1
Introduction
In the IT industry, virtualization is a mechanism for presenting a set of logical computing resources over a fixed hardware configuration, so that these logical resources can be accessed in the same manner as the original hardware configuration. The concept of virtualization is not new. First introduced in the late 1960s on mainframe computers, virtualization has recently become popular as a means to consolidate servers and reduce the costs of hardware acquisition, energy consumption, and space utilization. The hardware resources that can be virtualized include computer systems, storage, and the network.

Server virtualization can be implemented at different levels of the computing stack, including the application level, operating system level, and hardware level:

An example of application level virtualization is the Virtual Machine for the Java platform (Java Virtual Machine, or JVM)1. The JVM implementation provides an application execution environment as a layer between the application and the OS, removing application dependency on OS-specific APIs and hardware-specific characteristics.

OS level virtualization abstracts OS services such as file systems, devices, networking, and security, and provides a virtualized operating environment to applications. Typically, OS level virtualization is implemented by the OS kernel: only one instance of the kernel runs on the system, and it provides multiple virtualized operating environments to applications. Examples of OS level virtualization include Solaris Containers technology, Linux VServers, and FreeBSD Jails. OS level virtualization has less performance overhead and better system resource utilization than hardware level virtualization. However, since one OS kernel is shared among all virtual operating environments, isolation among the virtualized operating environments is only as good as the OS provides.
Hardware level virtualization, discussed in detail in this paper, has become popular recently because of increasing CPU power and the low utilization of CPU resources in the IT data center. Hardware level virtualization allows a system to run multiple OS instances. With less sharing of system resources than OS level virtualization, hardware virtualization provides stronger isolation of operating environments. The Solaris OS includes bundled support for application and OS level virtualization with its JVM software and Solaris Containers offerings. Sun first added support for hardware virtualization in the Solaris 10 11/06 release with Sun Logical Domains (LDoms) technology, supported on Sun servers that utilize UltraSPARC T1 or UltraSPARC T2
1. The terms "Java Virtual Machine" and "JVM" mean a Virtual Machine for the Java(TM) platform.
processors. VMware also supports the Solaris OS as a guest OS in its VMware Server and Virtual Infrastructure products, starting with the Solaris 10 1/06 release. In October 2007, Sun announced the Sun xVM family of products, which includes the Sun xVM Server and the Sun xVM Ops Center management system:

- Sun xVM Server: includes support for the Xen open source community work [6] on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform
- Sun xVM Ops Center: a management suite for the Sun xVM Server
Note: In this paper, in order to distinguish the discussion of x86 and UltraSPARC T1/T2 processors, Sun xVM Server is specifically used to refer to the Sun hardware virtualization product for the x86 platform, and LDoms is used to refer to the Sun hardware virtualization product for the UltraSPARC T1 and T2 platforms.
The hardware virtualization technology, and the new products built around it, have expanded the options and opportunities for deploying servers with better utilization, more flexibility, and enhanced functionality. In reaping the benefits of hardware virtualization, IT professionals also face the challenge of operating within the limitations of a virtualized environment while delivering the same level of service as the physical operating environment. Meeting this requirement demands a good understanding of virtualization technologies, CPU architecture, and software implementations, as well as an awareness of their strengths and limitations.
Hardware resource virtualization can take the form of sharing, partitioning, or delegating:

- Sharing: Resources are shared among VMs, and the VMM coordinates their use. For example, the VMM may include a CPU scheduler that runs the threads of VMs based on a pre-determined scheduling policy and VM priority.
- Partitioning: Resources are partitioned so that each VM gets the portion of resources allocated to it. Partitioning can be dynamically adjusted by the VMM based on the utilization of each VM. Examples of resource partitioning include the ballooning memory technique employed in Sun xVM Server and VMware, and the allocation of CPU resources in Logical Domains technology.
- Delegating: Resources are not directly accessible by a VM. Instead, all resource accesses are made through a control VM that has direct access to the resource. I/O device virtualization is normally implemented via delegation.

The distinctions and boundaries between these virtualization methods are often not clear. For example, sharing may be used for one component and partitioning for others, and together they make up an integral functional module.
hardware and maintenance expenses, floor space, cooling costs, and power consumption.

Workload Migration

Hardware level virtualization decouples the OS from the underlying physical platform resources. A guest OS state, along with the user applications running on top of it, can be encapsulated into an entity and moved to another system. This capability is useful for migrating a legacy OS from an old, under-powered server to a more powerful server while preserving the investment in software. When a server needs maintenance, a VM can be dynamically migrated to a new server with no down time, further enhancing availability. Changes in workload intensity can be addressed by dynamically shifting underlying resources to the starving VMs. Legacy applications that ran natively on a server continue to run on the same OS inside a VM, leveraging the existing investment in applications and tools.

Workload Isolation

Workload isolation includes fault and security isolation. Multiple guest OSes run independently, so a software failure in one VM does not affect other VMs. However, the VMM layer introduces a single point of failure that can bring down all VMs on the system. A VMM failure, although potentially catastrophic, is less probable than a failure in the OS because the complexity of the VMM is much less than that of an OS. Multiple VMs also provide strong security isolation among themselves, with each VM running an independent OS. Security intrusions are confined to the VM in which they occur. The boundary around each VM is enforced by the VMM, and inter-domain communication, if provided by the VMM, is restricted to specific kernel modules only.

One distinct feature of hardware level virtualization is the ability to run multiple instances of heterogeneous operating systems on a single hardware platform.
This feature is important for the following reasons:

- Better security and fault containment among application services can be achieved through OS isolation.
- Applications written for one OS can run on a system that supports a different OS.
- Better management of system resource utilization is possible among the virtualized environments.
Scope
This paper explores the underlying hardware architecture and software implementation for enabling hardware virtualization. Great emphasis has been placed on the CPU hardware architecture limitations for virtualizing CPU services and their software workarounds. In addition, this paper discusses in detail the software architecture for implementing the following types of virtualization:
- CPU virtualization: uses the processor privileged mode to control resource usage by the VM, and relays hardware traps and interrupts to VMs
- Memory virtualization: partitions physical memory among multiple VMs and handles page translations for each VM
- I/O virtualization: uses a dedicated VM with direct access to I/O devices to provide device services

The paper is organized into three sections. Section I, Background Information, contains information on VMMs and provides details on the x86 and SPARC processors:

- Virtual Machine Monitor Basics on page 9 discusses the core of hardware virtualization, the VMM, as well as requirements for the VMM and several types of VMM implementations.
- The x86 Processor Architecture on page 21 describes features of the x86 processor architecture that are pertinent to virtualization.
- SPARC Processor Architecture on page 29 describes features of the SPARC processor that affect virtualization implementations.

Section II, Hardware Virtualization Implementations, provides details on the Sun xVM Server, Logical Domains, and VMware implementations:

- Sun xVM Server on page 39 discusses a paravirtualized Solaris OS that is based on an open source VMM implementation for x86 [6] processors and is planned for inclusion in a future Solaris release.
- Sun xVM Server with Hardware VM (HVM) on page 63 continues the discussion of Sun xVM Server for the x86 processors that support hardware virtual machines: Intel VT and AMD-V.
- Logical Domains on page 79 discusses Logical Domains (LDoms), supported on Sun servers that utilize UltraSPARC T1 or T2 processors, and describes Solaris OS support for this feature.
- VMware on page 97 discusses the VMware implementation of the VMM.

Section III, Additional Information, contains a concluding comparison, references, and appendices:

- VMM Comparison on page 109 presents a summary of the VMM implementations discussed in this paper.
- References on page 111 provides a comprehensive listing of related references.
- Terms and Definitions on page 113 contains a glossary of terms.
- Author Biography on page 117 provides information on the author.
Section I
Background Information
Chapter 2: Virtual Machine Monitor Basics (page 9)
Chapter 3: The x86 Processor Architecture (page 21)
Chapter 4: SPARC Processor Architecture (page 29)
Chapter 2
VMM Requirements
A software program communicates with the computer hardware through instructions, which in turn operate on registers and memory. If any of the instructions, registers, or memory involved in an action is privileged, the action is a privileged action. Sometimes an action that is not necessarily privileged attempts to change the configuration of resources in the system, and would thereby impact other actions whose behavior or results depend on that configuration. The instructions that result in such operations are called sensitive instructions. In the context of the virtualization discussion, a processor's instructions can be classified into three groups:

- Privileged instructions trap if the processor is in non-privileged mode and do not trap if it is in privileged mode.
- Sensitive instructions change or reference the configuration of resources (memory), affect the processor mode without going through the memory trap sequence (page fault), or reference sensitive registers whose contents change when the processor switches to run another VM.
- Non-privileged and non-sensitive instructions do not fall into either of the categories described above.
Sensitive instructions have a major bearing on the virtualizability of a machine [1] because of their system-wide impact. In a virtualized environment, a GOS should only contain non-privileged and non-sensitive instructions. If sensitive instructions are a subset of privileged instructions, it is relatively easy to build a VM because all sensitive instructions will result in a trap. In this case a VMM can be constructed to catch all traps that result from the execution of sensitive instructions by a GOS. All privileged and sensitive actions from VMs would be caught by the VMM, and resources could be allocated and managed accordingly (a technique called trap-and-emulate). A GOS's trap handler could then be called by the VMM trap handler to perform the GOS-specific actions for the trap.

If a sensitive instruction is a non-privileged instruction, its execution by one VM will go unnoticed. Robin and Irvine [3] identified several x86 instructions in this category. These instructions cannot be safely executed by a GOS, as they can impact the operations of other VMs or adversely affect the operation of its own GOS. Instead, these instructions must be substituted by a VMM service. The substitution can take the form of an API for the GOS to call, or a dynamic conversion of these instructions into explicit processor traps.
Types of VMM
In a virtualized environment, the VMM controls the hardware resources. VMMs can be categorized into two types, based on this control of resources:

- Type I: maintains exclusive control of hardware resources
- Type II: leverages the host OS by running inside the OS kernel

The Type I VMM [3] has several distinct characteristics: it is the first software to run (besides the BIOS and the boot loader), it has full and exclusive control of system hardware, and it runs in privileged mode directly on the physical processor. The GOS on a Type I VMM implementation runs in a less privileged mode than the VMM to avoid conflicts in managing the hardware resources. An example of a Type I VMM is Sun xVM Server, which includes a bundled VMM, the Sun xVM Hypervisor for x86. The Sun xVM Hypervisor for x86 is the first software, besides the BIOS and boot loader, to run during boot, as shown in the GRUB
menu.lst file:
    title Sun xVM Server
        kernel$ /boot/$ISADIR/xen.gz
        module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix
        module$ /platform/i86pc/$ISADIR/boot_archive
The GRUB bootloader first loads the Sun xVM Hypervisor for x86, xen.gz. After the VMM gains control of the hardware, it loads the Solaris kernel,
/platform/i86xpv/kernel/$ISADIR/unix, to run as a GOS.
Sun's Logical Domains and VMware's Virtual Infrastructure 3 [4] (formerly known as VMware ESX Server), described in detail in Chapter 7, Logical Domains, on page 79 and Chapter 8, VMware, on page 97, are also Type I VMMs.

A Type II VMM typically runs inside a host OS kernel as an add-on module, and the host OS maintains control of the hardware resources. The GOS in a Type II VMM is a process of the host OS. A Type II VMM leverages the kernel services of the host OS to access hardware, and it intercepts a GOS's privileged operations and performs them in the context of the host OS. Type II VMMs have the advantage of preserving an existing installation by allowing a new GOS to be added to a running OS. An example of a Type II VMM is VMware's VMware Server (formerly known as VMware GSX Server). Figure 2 illustrates the relationships among hardware, VMM, GOS, host OS, and user applications in virtualized environments.
Figure 2. Virtual machine monitors vary in how they support guest OS, host OS, and user applications in virtualized environments.
VMM Architecture
As discussed in VMM Requirements on page 9, the VMM performs some of the functions that an OS normally does: namely, it controls and arbitrates CPU and memory resources, and it provides services to upper layer software for sensitive and privileged operations. These functions require the VMM to run in privileged mode and the OS to relinquish the privileged and sensitive operations to the VMM. In addition to processor and memory operations, I/O device support also has a large impact on VMM architecture.
- Dynamically translating the GOS sensitive instructions in software: As described in a previous section, VMware uses binary translation to replace the GOS sensitive instructions with VMM instructions.
- Dynamically translating the GOS sensitive instructions in hardware: This method requires the processor to provide a special mode of operation that is entered when a sensitive instruction is executed in reduced privileged mode.

The first approach, which involves modifying the GOS source code, is called paravirtualization, because the VMM provides only partial virtualization of the processor. The GOS must replace its sensitive and privileged operations with calls to the VMM service. The remaining two approaches provide full virtualization to the VM, enabling the GOS to run without modification.

In addition to OS modification, performance requirements, processor architecture design, tolerance of a single point of failure, and support for legacy OS installations all have an impact on the design of a VMM architecture.
A ballooning technique [5] has been used in some virtualization products to achieve better utilization of physical memory among VMs. The idea behind the ballooning technique is simple. The VMM controls a balloon module in a GOS. When the VMM wants to reclaim memory, it inflates the balloon to increase pressure on memory, forcing the GOS to page out memory to disk. If the demand for physical memory decreases, the VMM deflates the balloon in a VM, enabling the GOS to claim more memory.
page table and the shadow page table is handled by the VMM when page faults occur. Figure 4 shows three different page translation implementations in the Solaris OS on x86 and SPARC platforms:

1. The paravirtualized Sun xVM Server uses the following approach on x86 platforms: [1] The GOS uses the hypervisor call method to update the page tables maintained by the VMM.

2. The Sun xVM Server with HVM and VMware use the following approach: [2a] The GOS maintains its own guest page table. The synchronization between the guest page table and the hardware page table (shadow page table) is handled by the VMM when page faults occur. [2b] The x86 CPU loads the page translation from the hardware page table into the TLB.

3. On SPARC systems, the Solaris OS uses the following approach for Logical Domains: [3a] The GOS maintains its own page table, and takes an entry from the page table as an argument to the hypervisor call that loads the translation into the TLB. [3b] The VMM gets the page translation from the GOS and loads it into the TLB.
[Figure 4. Page translation paths on x86 and SPARC: hypervisor calls to VMM-maintained page tables (1), the VMM-synchronized hardware page table feeding the TLB (2a/2b), and hypervisor calls that load GOS page table entries into the TLB (3a/3b)]
The memory management implementation for Sun xVM Server, Sun xVM Server with HVM, VMware, and Logical Domains using these mechanisms is discussed in detail in later sections of this paper.
I/O Virtualization
I/O devices are typically managed by a special software module, the device driver, running in the kernel context. Because of the vast variety of device types and device drivers, the VMM either includes few device drivers or leaves device management entirely to the GOS. In the latter case, because of existing device architecture limitations (discussed later in this section), a device can only be exclusively managed by one VM. This constraint creates some challenges for I/O access by a VM, and limits the following:

- What devices are exported to a VM
- How devices are exported to a VM
- How each I/O transaction is handled by a VM and the VMM

Consequently, I/O presents the most challenges in the areas of compatibility and performance for virtual machines. In order to explain what devices are exported and how they are exported, it is first necessary to understand the options available for handling I/O transactions in a VM. There are, in general, three approaches to I/O virtualization, as illustrated in Figure 5:

- Direct I/O (VM1 and VM3)
- Virtual I/O using I/O transaction emulation (VM2)
- Virtual I/O using device emulation (VM4)
Figure 5. Different I/O virtualization techniques used by virtual machine monitors.
For direct I/O, the VMM exports all or a portion of the physical devices attached to the system to a VM, and relies on VMs to manage the devices. A VM that has direct I/O access uses the existing driver in the GOS to communicate directly with the device. VM1 and VM3 in Figure 5 have direct I/O access to devices. VM1 is also a special I/O VM that provides virtual I/O for other VMs, such as VM2, to access devices.
Virtual I/O is made possible by controlling the device types exported to a VM. There are two different methods of implementing virtual I/O: I/O transaction emulation (shown in VM2 in Figure 5) and device emulation (shown in VM4).

I/O transaction emulation requires virtual drivers on both ends for each type of I/O transaction (data and control functions). As shown in Figure 5, the virtual driver on the client side (VM2) receives I/O requests from applications and forwards the requests through the VMM to the virtual driver on the server side (VM1); the virtual driver on the server side then sends the request out to the device. I/O transaction emulation is typically used in paravirtualization because the OS on the client side needs to include the special drivers that communicate with their corresponding drivers in the OS on the server side, and needs to add kernel interfaces for inter-domain communication using the VMM services. However, it is possible to have PV drivers in a non-paravirtualized OS (full virtualization) for better I/O performance. For example, Solaris 10, which is not paravirtualized, can include PV drivers on an HVM-capable system to get better performance than that achieved using device emulation drivers such as QEMU. (See Sun xVM Server with HVM I/O Virtualization (QEMU) on page 71.) I/O transaction emulation may cause application compatibility issues if the virtual driver does not provide all the data and control functions (for example, ioctl(2)) that the existing driver does.

Device emulation provides an emulation of a device type, enabling the existing driver for the emulated device in a GOS to be used. The VMM exports emulated device nodes to a VM so that the existing drivers for the emulated devices in the GOS are used. By doing this, the VMM controls the driver used by a GOS for a particular device type (for example, using the e1000g driver for all network devices). Thus, the VMM can focus on the emulation of the underlying hardware using one driver interface.
Driver accesses to the I/O registers and ports in a GOS, which result in traps due to invalid addresses, are caught and converted to accesses to the real device hardware. VM4 in Figure 5 uses native OS drivers to access emulated devices exported by the VMM. Device emulation is in general less efficient, and more limited in the platforms supported, than I/O transaction emulation. Device emulation does not require changes in the GOS and, therefore, is typically used to provide full virtualization to a VM.

Virtual I/O, unlike direct I/O, requires additional drivers in either the I/O VM or the VMM to provide I/O virtualization. This constraint:

- Limits the types of devices that are made available to a VM
- Limits device functionality
- Causes significant I/O performance overhead

While virtualization provides full application binary compatibility, I/O becomes a trouble area in terms of application compatibility and performance in a VM. One
solution to the I/O virtualization issues is to allow VMs to directly access I/O, as shown by VM3 in Figure 5. Direct I/O access by VMs requires additional hardware support to ensure device accesses by a VM are isolated and restricted to the resources owned by the assigned VM.

In order to understand the industry effort to allow an I/O device to be shared among VMs, it is necessary to examine device operations from an OS point of view. The interactions between an OS and a device consist, in general, of three operations:

1. Programmed I/O (PIO): CPU-initiated access to a device's registers and memory, performed by the host driver through load and store instructions.

2. Direct Memory Access (DMA): device-initiated data transfer without CPU involvement. In DMA, a host OS writes an address of its memory and the transfer size to a device's DMA descriptor. After receiving an enable-DMA instruction from the host driver, the device performs the data transfer at a time it chooses and uses interrupts to notify the host OS of DMA completion.

3. Interrupts: device-initiated notifications to the host OS of asynchronous events, such as DMA completion.
Interrupts are already virtualized by all VMM implementations, as shown in the later discussions of Sun xVM Server, Logical Domains, and VMware. The challenge of I/O sharing among VMs therefore lies in the device handling for PIO and DMA. To meet these challenges, the PCI SIG has released a suite of IOV specifications for PCI Express (PCIe) devices, in particular the Single Root I/O Virtualization and Sharing Specification (SRIOV) [35] for device sharing and PIO operation, and the Address Translation Services (ATS) specification [30] for DMA operation.

Device Configuration and PIO

A PCI device exports its memory to the host through Base Address Registers (BARs) in its configuration space. A device's configuration space is identified in the PCI configuration address space as shown in Figure 6.
    31        24 23         16 15        11 10         8 7          2 1  0
   +------------+-------------+------------+------------+------------+----+
   |  Reserved  | Bus Number  |   Device   |  Function  |  Register  | 00 |
   |            |             |   Number   |   Number   |   Number   |    |
   +------------+-------------+------------+------------+------------+----+

Figure 6. Layout of the PCI configuration address.
A PCI device can have up to 8 physical functions (PFs). Each PF has its own 256-byte configuration header. The BARs of a PCI function, which are 32 bits wide, are located at offsets 0x10-0x24 in the configuration header. The host gets the size of the memory region mapped by a BAR by writing a value of all 1's to the BAR and then reading the value back. The address written to a BAR is the assigned starting address of the memory region mapped to the BAR.
To allow multiple VMs to share a PF, the SRIOV specification introduces the notion of a Virtual Function (VF). Each VF shares some common configuration header fields with the PF and the other VFs. The VF BARs are defined in the PCIe SRIOV extended capabilities structure. A VF contains a set of non-shared physical resources, such as work queues and data buffers, which are required to deliver function-specific services. These resources are exported through the VF BARs and are directly accessible by a VM. The starting address of a VF's memory space is derived from the first VF's memory space address and the size of the VF BAR. For any given VFx, the starting address of its memory space mapped to BARa is calculated according to the following formula:
    addr(VFx, BARa) = addr(VF1, BARa) + (x - 1) * (VF BARa aperture size)
where addr(VF1, BARa) is the starting address of BARa for the first VF, and (VF BARa aperture size) is the size of the VF BARa as determined by writing a value of all 1's to BARa and reading the value back. Using this mechanism, a GOS in a VM is able to share the device with other VMs while performing device operations that pertain only to that VM.

DMA

In many current implementations (especially on most x86 platforms), physical addresses are used in DMA. Since a VM shares the same physical address space on the system with other VMs, a VM might read or write another VM's memory through DMA. For example, a device driver in a VM might write memory contents that belong to other VMs to a disk and read the data back into the VM's memory. This causes a potential breach in security and fault isolation among VMs. To provide isolation during DMA operation, the ATS specification defines a scheme for a VM to use addresses mapped to its own physical memory for DMA operation. (This approach is used in similar designs such as the IOMMU Specification [31] and DMA Remapping [28].) This DMA ATS enables DMA memory to be partitioned into multiple domains, and keeps DMA transactions in one domain isolated from other domains. Figure 7 shows device DMA with and without ATS. With DMA ATS, the DMA address is like a virtual address that is associated with a context (VM). DMA transactions initiated by a VM can only be associated with the memory owned by that VM. DMA ATS is a chipset function that resides outside of the processor.
Figure 7. Device DMA with and without ATS. Without ATS, the CPU and PCI devices present physical addresses (PA) directly to the North Bridge. With an IOMMU in the South Bridge, PCI devices present a DVA/GPA that the IOMMU translates to an HPA before the DMA reaches system memory.
As shown in Figure 7, the physical address (PA) is used on hardware platforms without support for ATS. On platforms with hardware support for ATS, a GOS in a VM writes either a device virtual address (DVA) or a guest physical address (GPA) to the device's DMA engine. The device driver in the GOS loads the mapping of the DVA or GPA to the host physical address (HPA) into the hardware IOMMU. The HPA is the address understood by the memory controller.
Note The distinction between the HPA and GPA is described in detail in later sections for Sun xVM Server (see Physical Memory Management on page 52), for UltraSPARC LDoms (see Physical Memory Allocation on page 88), and for VMware (see Physical Memory Management on page 103).
When the device performs a DMA operation, a DVA/GPA address appears on the PCI bus and is intercepted by the hardware IOMMU. The hardware IOMMU looks up the mapping for the DVA/GPA, finds the corresponding HPA, and moves the PCI data to the system memory pointed to by the HPA. Since each VM's DVA or GPA range forms its own address space, ATS allows system memory for DMA to be partitioned and thus prevents a VM from accessing another VM's DMA buffer.
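The per-domain translation that the IOMMU performs can be sketched as follows; the structures and names are hypothetical simplifications of a real IOMMU's page tables:

```c
#include <stdint.h>
#include <stdbool.h>

#define IOMMU_PAGE_SHIFT 12
#define IOMMU_MAX_PAGES  16

/* One translation context (domain) per VM: DVA/GPA page -> HPA page. */
typedef struct {
    uint64_t hpa_page[IOMMU_MAX_PAGES];
    bool     valid[IOMMU_MAX_PAGES];
} iommu_domain_t;

/* Translate a device DMA address for one domain; returns false (a
   translation fault) if the domain has no mapping for that page, so a
   device programmed by one VM cannot reach another VM's memory. */
bool iommu_translate(const iommu_domain_t *dom, uint64_t dva, uint64_t *hpa)
{
    uint64_t pg = dva >> IOMMU_PAGE_SHIFT;
    if (pg >= IOMMU_MAX_PAGES || !dom->valid[pg])
        return false;
    *hpa = (dom->hpa_page[pg] << IOMMU_PAGE_SHIFT)
         | (dva & ((1u << IOMMU_PAGE_SHIFT) - 1));
    return true;
}
```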
Chapter 3
Timer Devices

The x86 platform includes several timer devices for timekeeping purposes. Knowledge of the characteristics of these devices is important to fully understand timekeeping in a VM: some timer devices are interrupt driven (which is virtualized and delayed), and some require privileged access to update the device counter.
Protected Mode
The x86 architecture's protected mode provides a protection mechanism to limit access to certain segments or pages and prevent unprivileged access. The processor's segment-protection mechanism recognizes four privilege levels, numbered 0 to 3 (Figure 8). The greater the level number, the lesser the privileges. The page-level protection mechanism restricts access to pages based on two privilege levels: supervisor mode and user mode. If the processor is operating at a current privilege level (CPL) of 0, 1, or 2, it is in supervisor mode and can access all pages. If the processor is operating at CPL 3, it is in user mode and can access only user-level pages.
When the processor detects a privilege-level violation, it generates a general-protection exception (#GP). The x86 has more than 20 privileged instructions, which can be executed only when the current privilege level (CPL) is 0 (most privileged). In addition to the CPL, the x86 has an I/O privilege level (IOPL) field in the EFLAGS register that indicates the I/O privilege level of the currently running program. Some instructions, while allowed to execute when the CPL is not 0, generate a #GP exception if the CPL value is higher than the IOPL. These instructions include CLI (clear interrupt flag), STI (set interrupt flag), IN/INS (input from port), and OUT/OUTS (output to port). Beyond these, there are many instructions [3] that, while not privileged, reference registers or memory locations that would allow a VM to access a memory region not assigned to that VM. These sensitive instructions do not cause a #GP exception, so the trap-and-emulate method for virtualization of a GOS, as stated in VMM Requirements on page 9, does not apply to them. However, these instructions may impact other VMs.
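A minimal model of these privilege checks, with hypothetical helper names (the full rules are in the processor manuals):

```c
#include <stdbool.h>

/* Privileged instructions require CPL 0. */
bool can_exec_privileged(unsigned cpl)
{
    return cpl == 0;
}

/* IOPL-sensitive instructions (CLI, STI, IN/INS, OUT/OUTS) fault with
   #GP when the CPL is numerically higher (less privileged) than IOPL. */
bool can_exec_iopl_sensitive(unsigned cpl, unsigned iopl)
{
    return cpl <= iopl;
}

/* Page-level protection: CPL 0-2 is supervisor mode, CPL 3 is user mode. */
bool is_supervisor(unsigned cpl)
{
    return cpl <= 2;
}
```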
Segmented Architecture
In protected mode, all memory accesses go through a logical address → linear address (LA) → physical address (PA) translation scheme. The logical-address-to-LA translation is managed by the x86 segmentation architecture, which divides a process's address space into multiple protected segments. A logical address, which is used as the address of an operand or of an instruction, consists of a 16-bit segment selector and a 32-bit offset. A segment selector points to a segment descriptor that defines the segment (see Figure 11 on page 24). The segment base address is contained in the segment descriptor. The sum of the offset in a logical address and the segment base address gives the LA. The Solaris OS directly maps an LA to a process's virtual address (VA) by setting the segment base address to 0:

Segmentation: VA + segment base address (always 0 in Solaris) → linear address
Paging: linear address → physical address
For each memory reference, a VA and a segment selector are provided to the processor (Figure 9). The segment selector, which is loaded to the segment register, is used to identify a segment descriptor for the address.
Figure 9. Segment selector. Bits 15:3 hold the index (up to 8K descriptors); bit 2 is TI (Table Indicator; 0 = GDT, 1 = LDT); bits 1:0 are the RPL (Requested Privilege Level).
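The selector fields of Figure 9 and the segmentation step described above can be modeled as follows (the names are illustrative):

```c
#include <stdint.h>

/* Fields of a 16-bit segment selector, per Figure 9. */
typedef struct {
    unsigned index; /* bits 15:3 - descriptor table index (up to 8K entries) */
    unsigned ti;    /* bit 2    - table indicator: 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 1:0 - requested privilege level */
} selector_t;

selector_t decode_selector(uint16_t sel)
{
    selector_t s = { sel >> 3, (sel >> 2) & 1, sel & 3 };
    return s;
}

/* Linear address = segment base + offset. Solaris sets the base to 0,
   so the linear address equals the virtual address. */
uint32_t linear_addr(uint32_t seg_base, uint32_t offset)
{
    return seg_base + offset;
}
```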
Every segment register has a visible part and a hidden part, as illustrated in Figure 10 (see also [7], Volume 3A Section 3.4.3). The visible part is the segment selector, an index that points into either the global descriptor table (GDT) or the local descriptor table (LDT) to identify the descriptor from which the hidden part of the segment register is loaded. The hidden part contains segment descriptor information loaded from the descriptor table.
Figure 10. Each segment register has a visible part (the selector) and a hidden part (the type, base address, limit, and CPL).
The hidden fields of a segment register are loaded to the processor from a descriptor table and are stored in the descriptor cache registers. The descriptor cache registers, like the TLB, allow the processor to refer to the contents of the segment register's hidden part without further reference to the descriptor table. Each time a segment register is loaded, the descriptor cache register is fully loaded from the descriptor table. Since each VM has its own descriptor table (for example, the GDT), the VMM has to maintain a shadow copy of each VM's descriptor table. A context switch to a VM causes the VM's shadow descriptor table to be loaded into the hardware descriptor table. If the content of the descriptor table is changed by the VMM because of a context switch to another VM, the segment is non-reversible, meaning the segment cannot be restored if an event such as a trap causes the segment to be saved and replaced.

The current privilege level (CPL) is stored in the hidden portion of the segment register. The CPL is initially equal to the privilege level of the code segment from which it is loaded. The processor changes the CPL when program control is transferred to a code segment with a different privilege level.

The segment descriptor contains the size, location, access control, and status information of the segment and is stored in either the LDT or the GDT. The OS sets segment descriptors in the descriptor table and controls which descriptor entry to use for a segment (Figure 11). See CPU Privilege Mode on page 45 for a discussion of setting the segment descriptor in the Solaris OS.
Figure 11. Segment descriptor. The descriptor contains the segment base address (bits 31:24, 23:16, and 15:00), the segment limit (bits 19:16 and 15:00), and the following fields: G (granularity), D/B (default operation size; 0 = 16-bit segment, 1 = 32-bit segment), L (64-bit code segment), AVL (available for use by system software), P (segment present), DPL (descriptor privilege level), S (descriptor type; 0 = system, 1 = code or data), and Type (segment type).
The privilege check performed by the processor recognizes three types of privilege levels: requested privilege level (RPL), current privilege level (CPL), and descriptor privilege level (DPL). A segment can be loaded if the DPL of the segment is numerically greater than or equal to both the CPL and the RPL. In other words, a segment can be accessed only by code that has an equal or higher privilege level. Otherwise, a general-protection fault exception, #GP, is generated and the segment register is not loaded.

On 64-bit systems, a linear address space (flat memory model) is used to create a continuous, unsegmented address space for both the kernel and application programs. Segmentation is disabled in the sense that privilege checking cannot be applied to the VA-to-LA translation, because that translation no longer exists. The only protection left to prevent a user application from accessing kernel memory is the page-protection mechanism. This is why the kernel of a GOS has to run in ring 3 (user mode in page-level protection) on a 64-bit system.
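The data-segment privilege check can be summarized in a one-line predicate (a simplified model; the full rules are in the processor manuals):

```c
#include <stdbool.h>

/* A segment load succeeds only when DPL is numerically greater than
   or equal to both CPL and RPL, i.e., the accessing code is at least
   as privileged as the segment it is loading. */
bool segment_loadable(unsigned dpl, unsigned cpl, unsigned rpl)
{
    return dpl >= cpl && dpl >= rpl;
}
```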
Paging Architecture
When operating in protected mode, the LA → PA translation is performed by the paging hardware of the x86 processor. To access data in memory, the processor requires the presence of a VA → PA translation in the TLB (in Solaris, the LA is equal to the VA), a page table backing the TLB entry, and a page of physical memory. For the x86 processor, loading the VA → PA page translation from the page table into the TLB is performed automatically by the processor. The OS is responsible for allocating physical memory and loading the VA → PA translation into the page table. When the processor cannot load a translation from the page table, it generates a page fault exception, #PF. Because the loading of translations from the page table into the TLB is handled by the processor, a #PF exception on x86 processors usually means a physical page has not been allocated (Figure 12).
Figure 12. Translations through the TLB are accomplished in the processor itself, while translations through page tables are performed by the OS.
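The division of labor in Figure 12 can be sketched as follows, with the hardware TLB fill and the OS #PF handler as separate steps (all structures are simplified illustrations):

```c
#include <stdint.h>
#include <stdbool.h>

#define NPAGES 8

/* A toy page table: VA page -> PA page; 0 means no page allocated. */
typedef struct {
    uint64_t pte[NPAGES];
} page_table_t;

/* Returns true and the PA page on success. A false return models a
   #PF: the processor walked the page table itself and found no entry,
   so the fault is delivered to the OS. */
bool translate(page_table_t *pt, unsigned va_page, uint64_t *pa_page)
{
    if (pt->pte[va_page] == 0)
        return false;               /* #PF: no physical page allocated */
    *pa_page = pt->pte[va_page];    /* hardware TLB fill, no OS involved */
    return true;
}

/* OS #PF handler: allocate a physical page and install the translation. */
void pagefault_handler(page_table_t *pt, unsigned va_page, uint64_t new_pa)
{
    pt->pte[va_page] = new_pa;
}
```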
The x86 processor uses a control register, %cr3, to manage the loading of address translations from the page table into the TLB. The base address of a process's page table is kept by the OS and loaded into %cr3 when the process is context-switched in to run. In the Solaris OS, the value destined for %cr3 is kept in the kernel hat structure; each address space, as, has one hat structure. The mdb(1) command can be used to find the value of the %cr3 register of a process:
% mdb -k
> ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
....
R   9352   9351   9352   9352  28155 0x4a014000 fffffffec2ae78c0 bash
> fffffffec2ae78c0::print -t 'struct proc' ! grep p_as
struct as *p_as = 0xfffffffed15ba7e0
> 0xfffffffed15ba7e0::print -t 'struct as' ! grep a_hat
struct hat *a_hat = 0xfffffffed1718e98
> 0xfffffffed1718e98::print -t 'struct hat' ! grep hat_htable
htable_t *hat_htable = 0xfffffffed0f67678
> 0xfffffffed0f67678::print -t 'struct htable' ! grep ht_pfn
pfn_t ht_pfn = 0x16d37    // %cr3
When multiple VMs are running, the automatic loading of page translations from the page table into the TLB actually makes virtualization more difficult, because all page tables have to be accessible by the processor. As a result, page table updates can only be performed by the VMM, to enforce consistent memory usage on the system. Page Translations Virtualization on page 14 discusses two mechanisms for managing page tables by the VMM.

Another issue of the x86 paging architecture is related to the flushing of TLB entries. Unlike many RISC processors, which support a tagged TLB, the x86 TLB is not tagged. A TLB miss results in a walk of the page table by the processor to find and load the translation into the TLB. Since the TLB is not tagged, a change in the %cr3 register due to a virtual memory context switch invalidates all TLB entries. This adversely affects performance if the VMM and VM are not in the same address space. A typical solution to address the performance impact of TLB flushing is to reserve a region of the VM address space for the VMM. With this solution, the VMM and VM can run in the same address space and thus avoid a TLB flush when a VM memory operation traps to the VMM. The latest CPUs from Intel and AMD with hardware virtualization support include tagged TLBs, so translations for different address spaces can coexist in the TLB.
The x86 processor allows device memory and registers to be accessed through either an I/O address space or memory-mapped I/O. An I/O address space access is performed using special I/O instructions such as IN and OUT. These instructions, while allowed to execute when the CPL is not 0, result in a #GP exception if the processor's CPL value is higher than the I/O privilege level (IOPL). The Sun xVM Hypervisor for x86 provides a hypervisor call to set the IOPL, enabling a GOS to directly access I/O ports by setting the IOPL to its privilege level. When using memory-mapped I/O, any of the processor's instructions that reference memory can be used to access an I/O location, with protection provided through segmentation and paging. PIO, whether through the I/O address space or memory-mapped I/O, is normally uncacheable, as device registers usually must be accessed in precise program order. PIO uses addresses in a VM's address space and doesn't cause any security or isolation issues.

The x86 processor uses physical addresses for DMA. DMA in a virtualized x86 system has certain issues: a 32-bit, non-dual-address-cycle (DAC) PCI device cannot address beyond 4 GB of memory, and it is possible for one domain's DMA to intrude into another domain's physical memory, creating the risk of a security violation. The solution to these issues is to have an I/O memory management unit (IOMMU) as a part of an I/O bridge or north bridge that translates I/O addresses (for example, an address that appears on the PCI bus) to machine memory addresses. The I/O address can be any address that is recognized by the IOMMU. An IOMMU can also improve the performance of large data transfers by mapping a contiguous I/O address range to multiple physical pages in one DMA transaction. However, the IOMMU may hurt I/O performance for small data transfers, because the DMA setup cost is higher than that of DMA without an IOMMU.
For more details on the IOMMU, also known as hardware address translation service (hardware ATS), see I/O Virtualization on page 16.
Timer Devices
An OS typically uses several timer devices for different purposes. Timer devices are characterized by their frequency granularity and reliability, and by their ability to generate interrupts and accept counter input. Understanding the characteristics of timer devices is important for the discussion of timekeeping in a virtualized environment, as the VMM provides virtualized timekeeping of some timers to its overlaying VMs. Virtualized timekeeping has a significant impact on the accuracy of time-related functions in the GOS and, thus, on the performance and results of time-sensitive applications.
An x86 system typically includes the following timer devices:

Programmable Interrupt Timer (PIT)
The PIT uses a 1.193182 MHz crystal oscillator and has a 16-bit counter and counter input register. The PIT contains three timers. Timer 0 can generate interrupts and is used by the Solaris OS as the system timer. Timer 1 was historically used for RAM refresh, and timer 2 for the PC speaker.

Time Stamp Counter (TSC)
The TSC is a feature of the x86 architecture that is accessed via the RDTSC instruction. The TSC, a 64-bit counter, changes with the processor speed. The TSC cannot generate interrupts and has no counter input register. The TSC is the finest grained of all the timers and is used in the Solaris OS as the high-resolution timer. For example, the gethrtime(3C) function uses the TSC to return the current high-resolution real time.

Real Time Clock (RTC)
The RTC is used as the time-of-day (TOD) clock in the Solaris OS. The RTC uses a battery as an alternate power source, enabling it to continue to keep time while the primary source of power is not available. The RTC can generate interrupts and has a counter input register. It is the lowest grained timer on the system.

Local Advanced Programmable Interrupt Controller (APIC) Timer
The local APIC timer, which is a part of the local APIC, has a 32-bit counter and counter input register. It can generate interrupts and runs at the same frequency as the front-side bus. The Solaris OS supports the use of the local APIC timer as one of the cyclic timers.

High Precision Event Timer (HPET)
The HPET is a relatively new timer available in some newer x86 systems. It is intended to replace the PIT and the RTC for generating periodic interrupts. The HPET can generate interrupts, is 64 bits wide, and has a counter input register. The Solaris OS currently does not use the HPET.

Advanced Configuration and Power Interface (ACPI) Timer
The ACPI timer has a 24-bit counter, can generate interrupts, and has no counter input register. The Solaris OS does not use the ACPI timer.
Chapter 4
Trap and Interrupt Handling

Each strand (virtual processor) has its own trap and interrupt priority registers. This functionality allows the hypervisor to redirect traps to the target CPU and enables the trap to be taken by the GOS's trap handler.
Note The terms strand, hardware thread, logical processor, virtual CPU, and virtual processor are used by various documents to refer to the same concept. For consistency, the term strand is used in this chapter.
Table 1 lists the availability of instructions, registers, and address spaces for each of the privilege modes, along with references to where further details can be found.

Table 1. Availability of instructions, registers, and address spaces by privilege mode.

Component: Instructions
Comments: All instructions except SIR, RDHPR, and WRHPR (which require hyperprivileged mode to execute) can be executed from the privileged mode.

Component: Registers
Comments: There are seven hyperprivileged registers: HPSTATE, HTSTATE, HINTP, HTBA, HVER, HSTICK_CMPR, and STRAND_STS. These registers are used by the hypervisor in the hyperprivileged mode.

Component: Address Space
Comments: ASIs 0x30-0x7F are for hyperprivileged access only. These ASIs are mainly for CMT control, MMU, TLB, and hyperprivileged scratch registers.
Based on the availability of instructions, registers, and ASIs in hyperprivileged mode, the following functions of the hypervisor can be deduced:
Reset the processor: SIR instruction
Control hyperprivileged traps and interrupts: HTSTATE, HTBA, and HINTP registers
Control strand operation: ASI 0x41, and the HSTICK_CMPR and STRAND_STS registers
Manage the MMU: ASIs 0x50-0x5F
Processor Components
The UltraSPARC T1 processor [10] contains eight cores, and each core has hardware support for four strands. One FPU and one L2 cache are shared among all cores in the processor. Each core has its own Level 1 instruction and data caches (L1 Icache and Dcache) and a TLB, which are shared among all strands in the core. In addition, each strand contains the following:
A full register file with eight register windows and four sets of global registers (a total of 160 registers: 8 windows × 16 registers per window, plus 4 × 8 global registers)
Most of the ASIs
Ancillary and privileged registers
A trap queue with up to 16 entries
This hardware support in each strand allows the hypervisor to partition the processor into 32 domains, with one strand for each domain. Each strand can execute instructions separately, without requiring a software scheduler in the hypervisor to coordinate the processor resources. Table 2 summarizes the association of processor components to their location in the processor, core, and strand.
Table 2. Location of key processor components in the UltraSPARC T1 processor.

Strand: register file with 160 registers; most of the ASIs; ancillary state registers (ASRs); trap registers; privileged registers
The UltraSPARC T2 processor [33] is built upon the UltraSPARC T1 architecture. It has the following enhancements over the UltraSPARC T1 processor:
Eight strands per core (for a total of 64 strands)
Two integer pipelines per core, with each integer pipeline supporting four strands
A floating-point and graphics unit (FGU) per core
Integrated PCI-E and 10 Gb/Gb Ethernet (System-on-Chip)
Eight banks of 4 MB L2 cache
The UltraSPARC T2 has a total of 64 strands in 8 cores, and each core has its own floating-point and graphics unit (FGU). This allows up to 64 domains to be created on the UltraSPARC T2 processor. The design also adds integrated support for industry-standard I/O interfaces such as PCI-Express and 10 Gb Ethernet. Table 3 summarizes the association of processor components to the physical processor, core, and strand.
Table 3. Location of key processor components in the UltraSPARC T2 processor.

Processor: 8 banks of 4 MB L2 cache; L2 cache crossbar; memory controller; PCI-E; 10 Gb/Gb Ethernet
Core: 2 instruction pipelines (8 stages); L1 Icache and Dcache; TLB; FGU (12 stages)
Strand: full register file with 8 windows; most of the ASIs; ancillary state register (ASR); privileged registers
Table 9-1 and Table 10-1 of [2] provide a summary and description of each ASI.
The Memory Management Unit (MMU) of the UltraSPARC processor provides the translation of VAs to PAs. This translation enables user programs to use a VA to locate data in physical memory. The SpitFire Memory Management Unit (sfmmu) is Sun's implementation of the UltraSPARC MMU. The sfmmu hardware consists of Translation Lookaside Buffers (TLBs) and a number of MMU registers:

Translation Lookaside Buffer (TLB)
The TLB provides virtual-to-physical address translations. Each entry of the TLB is a Translation Table Entry (TTE) that holds the information for a single page mapping of virtual to physical addresses. The format of the TTE is shown in Figure 13. The TTE consists of two 64-bit words representing the tag and data of the translation. The privileged field, P, controls whether or not the page can be accessed by nonprivileged software.

MMU registers
A number of MMU registers are used for accessing TLB entries, removing TLB entries (demap), context management, handling TLB misses, and supporting Translation Storage Buffer (TSB) access. The TSB, an array of TTE entries, is a cache of translation tables used to quickly reload the TLB. The TSB resides in system memory and is managed entirely by the OS. The UltraSPARC processors include some MMU hardware registers for speeding up TSB access. The TLB miss handler first searches the TSB for the translation. If the translation is not found in the TSB, the TLB miss handler calls a more sophisticated (and slower) TSB miss handler to load the translation into the TSB.
Figure 13. The translation lookaside buffer (TLB) is an array of translation table entries containing tag and data portions. The TTE tag contains the context ID and the virtual address; the TTE data contains the physical address and attribute bits, including the privileged bit P.
A TLB hit occurs if both the context and the virtual address match an entry in the TLB. Address aliasing (multiple TLB entries with the same physical address) is permitted. Unlike in the x86 processor, the loading of page translations into the TLB is managed by software through traps. In the event of a TLB miss, a trap is generated, and the handler first tries to get the translation from the Translation Storage Buffer (TSB) (Figure 14). The TSB, an in-memory array of translations, acts like a direct-mapped cache for the TLB. If the translation is not present in the TSB, a TSB miss trap is generated. The TSB miss trap handler uses a software lookup mechanism based on the hash memory entry block structure, hme_blk, to obtain the TTE. If a translation is still not found in hme_blk, the kernel generic trap handler is invoked to call the kernel function pagefault() to allocate physical memory for the virtual address and load the translation into the hme_blk hash structure. Figure 14 depicts the mechanism for handling TLB misses in an unvirtualized domain.
Figure 14. Handling a TLB miss in an unvirtualized domain, UltraSPARC T1/T2 processor architecture. On a TLB miss, the processor MMU loads the TTE from the TSB (a TTE cache in memory); on a TSB miss, the OS obtains the TTE from the hme_blk data structure and loads it into the TSB with hat_memload(); if no translation exists, pagefault() allocates memory.
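The miss chain of Figure 14 can be sketched as follows; the direct-mapped TSB and the structure names are simplified illustrations, not the Solaris implementation:

```c
#include <stdint.h>
#include <stdbool.h>

#define TSB_ENTRIES 4

/* A toy TTE: virtual page tag plus the translated physical page. */
typedef struct { uint64_t va_tag; uint64_t pa; bool valid; } tte_t;

typedef struct { tte_t tsb[TSB_ENTRIES]; } mmu_state_t;

/* TSB lookup: direct-mapped by the low bits of the VA page number.
   A false return models the TSB miss trap, which falls through to the
   slower hme_blk hash lookup (and ultimately pagefault()). */
bool tsb_lookup(mmu_state_t *m, uint64_t va_pg, uint64_t *pa)
{
    tte_t *t = &m->tsb[va_pg % TSB_ENTRIES];
    if (t->valid && t->va_tag == va_pg) {
        *pa = t->pa;   /* TTE found: reload the TLB from here */
        return true;
    }
    return false;
}

/* Models hat_memload(): after the slower path has produced a TTE,
   load it into the TSB so the next miss on this page is fast. */
void tsb_load(mmu_state_t *m, uint64_t va_pg, uint64_t pa)
{
    tte_t *t = &m->tsb[va_pg % TSB_ENTRIES];
    t->va_tag = va_pg;
    t->pa = pa;
    t->valid = true;
}
```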
Similarly, Figure 15 depicts how TLB misses are handled in a virtualized domain. In a virtualized environment, the UltraSPARC T1/T2 processor adds a real address type, in addition to the VA and PA, to the types of memory addressing (Figure 15). Real addresses (RAs), which are equivalent to the physical memory abstraction in Sun xVM Server (see Physical Memory Management on page 52), are presented to the GOS as the underlying physical memory allocated to it. The GOS-maintained TSBs are used to translate VAs into RAs. The hypervisor manages the translation from RA to PA.
Figure 15. Handling a TLB miss in a virtualized domain, UltraSPARC T1/T2 processor architecture. The flow is the same as in Figure 14, except that the RA-to-PA translation on TLB fill is managed by the hypervisor.
Applications, which are nonprivileged software, use only VAs. The OS kernel, which is privileged software, uses both VAs and RAs. The hypervisor, which is hyperprivileged software, normally uses PAs. Physical Memory Allocation on page 88 discusses in detail the types of memory addressing used in LDoms.

The UltraSPARC T2 processor adds a hardware table walk for loading TLB entries. The hardware table walk accesses the TSBs to find TTEs that match the virtual address and context ID of the request. Since a GOS cannot access or control physical memory, the TTEs in the TSBs controlled by a GOS contain real page numbers, not physical page numbers (see Physical Memory Allocation on page 88). TTEs in the TSBs controlled by the hypervisor can contain either real page numbers or physical page numbers. The hypervisor performs the RA-to-PA translation within the hardware table walk, permitting the hardware table walk to load a GOS's TTEs into the TLB for VA-to-PA translation.
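The two-level addressing can be sketched as a composition of two mappings, the GOS's VA-to-RA translation and the hypervisor's RA-to-PA translation; the fixed offsets here are purely illustrative:

```c
#include <stdint.h>

/* Illustrative constants, not real firmware values. */
#define RA_BASE 0x00400000ull  /* guest real-address window */
#define PA_BASE 0x80000000ull  /* machine pages backing that window */

/* GOS-maintained TSB: VA -> RA (the guest never sees a PA). */
uint64_t gos_va_to_ra(uint64_t va) { return RA_BASE + va; }

/* Hypervisor-managed mapping: RA -> PA. */
uint64_t hv_ra_to_pa(uint64_t ra)  { return PA_BASE + (ra - RA_BASE); }

/* Hardware table walk with the hypervisor's RA-to-PA step folded in,
   so the TLB ends up holding a direct VA-to-PA translation. */
uint64_t va_to_pa(uint64_t va)
{
    return hv_ra_to_pa(gos_va_to_ra(va));
}
```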
Traps
In the SPARC processor, a trap transfers software execution from one privileged mode to another privileged mode at the same or a higher level. The only exception is that nonprivileged mode cannot trap to another nonprivileged mode. A trap can be generated by the following methods:
Internally by the processor (memory faults, privileged exceptions, and so on)
Externally by I/O devices (interrupts)
Externally by another processor (cross calls)
By software (for example, the Tcc instruction)
A trap is associated with a Trap Type (TT), a 9-bit value. (TT values 0x180-0x1FF are reserved for future use.) The transfer of software execution occurs through a trap table that contains an array of TT handlers indexed by the TT value. Each trap table entry is 32 bytes in length and contains the first eight instructions of the TT handler. When a trap occurs, the processor gets the TT from the TT register and the trap table base address from the TBA register. After saving the current execution state and updating some registers, the processor starts to execute the instructions in the trap table handler.

The SPARC processors support nested traps using a trap level (TL). The maximum TL (MAXTL) value is typically in the range of 2-6 and depends on the processor; in UltraSPARC T1/T2 processors, MAXTL is 6. Each trap level has one set of trap stack control registers: trap type (TT), trap program counter (TPC), trap next program counter (TNPC), and trap state (TSTATE). These registers provide trap software execution state and control for the current TL. The ability to support nested traps in SPARC processors makes the implementation of an OS trap handler easier and more efficient, as the OS doesn't need to explicitly save the current trap stack information.
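The trap dispatch arithmetic follows directly from the entry size: with 32-byte entries indexed by the 9-bit trap type, the handler address is TBA + TT × 32. A minimal sketch:

```c
#include <stdint.h>

/* Address of the trap table entry (the first eight instructions of
   the handler) for a given trap type. TT is a 9-bit value, so it is
   masked to 0x1FF; each entry is 32 bytes. */
uint64_t trap_handler_addr(uint64_t tba, unsigned tt)
{
    return tba + (uint64_t)(tt & 0x1FF) * 32;
}
```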
On UltraSPARC T1/T2 processors, each strand has a full set of trap control and stack registers which include TT, TL, TPC, TNPC, TSTATE, HTSTATE (hyperprivileged trap state), TBA, HTBA (hyperprivileged trap base address), and PIL (priority interrupt level). This design feature allows each strand to receive traps independently of other strands. This capability significantly helps trap handling and management by the hypervisor, as traps are delivered to a strand without being queued up in the hypervisor.
Interrupts
On SPARC platforms, interrupt requests are delivered to the CPU as traps. Traps 0x041 through 0x04F are used for Priority Interrupt Level (PIL) interrupts, and trap 0x60 is used for the vector interrupt. There are 15 interrupt levels for PIL interrupts. Interrupts are serviced in accordance with their PIL, with higher PILs having higher priority. The vector interrupt is used to support the data-bearing vector interrupt, which allows a device to include its private data in the interrupt packet (also known as the mondo vector). With the vector interrupt, device CSR accesses can be eliminated and the complexity of the device hardware can be reduced.

PIL interrupts are delivered to the processor through the ASR's SOFTINT_REG register. The SOFTINT_REG register contains a 15-bit int_level field. When a bit in this field is set, a trap is generated, and the PIL of the trap corresponds to the position of the bit in that field. There is one SOFTINT_REG for each strand.

In LDoms, the interrupt delivery from an I/O device to a GOS is a two-step process:
1. An I/O device sends an interrupt request using the vector interrupt (trap 0x60) to the hypervisor. The hypervisor inserts the interrupt request into the interrupt queue of the target virtual processor.
2. The target processor receives the interrupt request on its interrupt queue through trap 0x7D (for devices) or 0x7C (for cross calls), and schedules an interrupt to itself to be processed at a later time by setting bits in the privileged SOFTINT register, which causes a PIL interrupt (trap 0x41-0x4F).
For more details on interrupt delivery, see Trap and Interrupt Handling on page 85.
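The posting and priority selection of PIL interrupts through the int_level field can be modeled as follows (a simplified illustration of the mechanism only; details of the real SOFTINT register layout are omitted):

```c
#include <stdint.h>

/* Post a PIL interrupt by setting bit N of the int_level field
   (pil is 1..15; bit position encodes the level). */
void softint_post(uint16_t *int_level, unsigned pil)
{
    *int_level |= (uint16_t)(1u << pil);
}

/* Highest pending PIL, 0 if none: higher PILs have higher priority,
   so the highest set bit is serviced first. */
unsigned softint_highest(uint16_t int_level)
{
    for (unsigned pil = 15; pil >= 1; pil--)
        if (int_level & (1u << pil))
            return pil;
    return 0;
}
```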
Section II
Chapter 5: Sun xVM Server (page 39)
Chapter 6: Sun xVM Server with Hardware VM (HVM) (page 63)
Chapter 7: Logical Domains (page 79)
Chapter 8: VMware (page 97)
Chapter 5
Note Sun xVM Server includes support for the Xen open source community work on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform. In this paper, in order to distinguish the discussion of x86 and UltraSPARC T1/T2 processors, Sun xVM Server is specifically used to refer to the Sun hardware virtualization product for the x86 platform, and LDoms is used to refer to the Sun hardware virtualization product for the UltraSPARC T1 and T2 platforms.
This chapter is organized as follows:
Sun xVM Server Architecture Overview on page 40 provides an overview of the Sun xVM Server architecture.
Sun xVM Server CPU Virtualization on page 45 discusses the CPU virtualization employed by Sun xVM Server.
Sun xVM Server Memory Virtualization on page 52 describes memory management issues.
Sun xVM Server I/O Virtualization on page 56 discusses the I/O virtualization used in Sun xVM Server.
The Dom0 VM has some unique characteristics not available in other VMs:
It is the first VM started by the VMM.
It is able to directly access I/O devices.
It runs the domain manager to create, start, stop, and configure other VMs.
It provides I/O access services to other VMs (DomU).
Each DomU VM runs an instance of a paravirtualized GOS and gets VMM services through a set of hypercalls. Access to I/O devices from each DomU VM is provided by drivers in Dom0.
The calling convention is compliant with the AMD64 ABI [8]. The SYSCALL instruction is intended to enable unprivileged software (ring 3) to access services from privileged software (ring 0). Solaris system calls also use SYSCALL to allow user applications to access Solaris kernel services. Because SYSCALL is used by both Solaris system calls and the hypercalls, a SYSCALL made by a user process in Solaris is delivered indirectly by the VMM to the Solaris kernel. This causes a slight overhead for each Solaris system call.
Privilege Operations:
long set_trap_table(trap_info_t *table);
long mmu_update(mmu_update_t *req, int count, int *success_count, domid_t domid);
long set_gdt(ulong_t *frame_list, int entries);
long stack_switch(ulong_t ss, ulong_t esp);
long fpu_taskswitch(int set);
long mmuext_op(struct mmuext_op *req, int count, int *success_count, domid_t domain_id);
long update_descriptor(maddr_t ma, uint64_t desc);
long update_va_mapping(ulong_t va, uint64_t new_pte, ulong_t flags);
long set_timer_op(uint64_t timeout);
long physdev_op(void *physdev_op);
long vm_assist(uint_t cmd, uint_t type);
long update_va_mapping_otherdomain(ulong_t va, uint64_t new_pte, ulong_t flags, domid_t domain_id);
long iret();
long set_segment_base(int reg, ulong_t value);
long nmi_op(ulong_t op, void *arg);
long hvm_op(int cmd, void *arg);
VMM Services: long set_callbacks(ulong_t event_address, ulong_t failsafe_address, ulong_t syscall_address); long grant_table_op(uint_t cmd, void *uop, uint_t count); long event_channel_op(void *op); long xen_version(int cmd, void *arg); long set_debugreg(int reg, ulong_t value); long get_debugreg(int reg); long multicall(void *call_list, int nr_calls); long console_io(int cmd, int count, char *str); long sched_op(int cmd, void *arg); long do_kexec_op(unsigned long op, int arg1, void *arg); VM Control Operations: long sched_op_compat(int cmd, ulong_t arg); long platform_op(xen_platform_op_t *platform_op); long memory_op(int cmd, void *arg); long vcpu_op(int cmd, int vcpuid, void *extra_args); long sysctl(xen_sysctl_t *sysctl); long domctl(xen_domctl_t *domctl); long acm_op();
As Table 4 shows, the hypercalls provide a variety of functions for a GOS:
- Perform privileged operations, such as setting the trap table, updating the page table, loading the GDT, and setting the GS and FS segment registers
- Get services from the VMM, such as the event channel, grant table, set_callbacks services, and scheduler operations
- Control VM operations, such as platform_op, domain control, and virtual CPU control
An example use of a hypercall is to request a set of page table updates. For example, a new process created by the fork(2) call requires the creation of page tables. The hypercall HYPERVISOR_mmu_update(), which validates and applies a list of
updates, is called by the Solaris kernel to perform the page table updates. This routine returns control to the calling domain when the operation is completed. In the following example, a kmdb(1M) breakpoint is set at the mmu_update() call. The stack trace illustrates how the mmu_update() function is called after a new process is created by fork():
[1]> set_pteval+0x4f:b          // set breakpoint at HYPERVISOR_mmu_update
[1]> :c                         // continue
kmdb: stop at set_pteval+0x4f   // the breakpoint reached
kmdb: target stopped at: set_pteval+0x4f: call -0x5a34 <HYPERVISOR_mmu_update>
[1]> $c                         // display the stack trace
set_pteval+0x4f(c753000, 1fb, 3, f9c29027)
x86pte_copy+0x73(fffffffec08115a8, fffffffec2a8a0d8, 1fb, 5)
hat_alloc+0x228(fffffffec2fa88c0)
as_alloc+0x99()
as_dup+0x3f(fffffffec27b1d28, fffffffec2a11168)
cfork+0x102(0, 1, 0)
forksys+0x25(0, 0)
sys_syscall32+0x13e()
[1]>
The above example shows that the kernel doesn't maintain its own copy of the page table; it uses the mmu_update() hypercall to request that the VMM update the page table.

Event Channels

To a GOS, a VMM event is the equivalent of a hardware interrupt. Communication from the VMM to a VM is provided through an asynchronous event mechanism, called an event channel, which replaces the usual delivery mechanisms for device interrupts. A VM creates an event channel to send and receive asynchronous event notifications. Three classes of events are delivered by this event channel mechanism:
- Bi-directional inter- and intra-VM connections. A VM can bind an event-channel port to another domain or to another virtual CPU within the VM.
- Physical interrupts. A VM with direct access to hardware (Dom0) can bind an event-channel port to a physical interrupt source.
- Virtual interrupts. A VM can bind an event-channel port to a virtual interrupt source, such as the virtual timer device.
Event channels are addressed by a port. Each channel is associated with two bits of information:

unsigned long evtchn_pending[sizeof(unsigned long) * 8]
This bit notifies the domain that there is a pending notification to be processed. It is cleared by the GOS.

unsigned long evtchn_mask[sizeof(unsigned long) * 8]
This bit specifies whether the event channel is masked. If this bit is clear and PENDING is set, an asynchronous upcall will be scheduled. This bit is only updated by the GOS; it is read-only within the VMM.

Interrupts to a VM are virtualized by mapping them to event channels. These interrupts are delivered asynchronously to the target domain using a callback supplied via the set_callbacks hypercall. A guest OS can map these events onto its standard interrupt dispatch mechanisms. The VMM is responsible for determining the target domain that will handle each physical interrupt source. Interrupts and Exceptions on page 49 provides a detailed discussion of how an interrupt is handled by the VMM and delivered to a VM using an event channel.

Grant Tables

The Sun xVM Hypervisor for x86 allows memory to be shared among VMs, and between the VMM and a VM, through a grant table mechanism. Each VM makes some of its pages available to other VMs by granting access to them. The grant table is a data structure that a VM uses to expose some of its pages and to specify what permissions other VMs have on those pages. The following example shows the information stored in a grant table entry:
struct grant_entry {
    /* GTF_xxx: various type and flag information. [XEN,GST] */
    uint16_t flags;
    /* The domain being granted foreign privileges. [GST] */
    domid_t  domid;
    uint32_t frame;     /* page frame number (PFN) */
};
The flags field stores the type and various flag information for the grant table entry. There are three types of grant table entries:
- GTF_invalid: Grants no privileges.
- GTF_permit_access: Allows the domain domid to map/access the specified frame.
- GTF_accept_transfer: Allows domid to transfer ownership of one page frame to this guest; the VMM writes the page number to frame.
The type information acts as a capability which the grantee can use to perform operations on the granter's memory. A grant reference also encapsulates the details of a shared page, removing the need for a domain to know the real machine address of a page it is sharing. This makes it possible to share memory correctly with domains running in fully virtualized memory. Device drivers in Sun xVM Server (see Sun xVM Server I/O Virtualization on page 56) use grant tables to send data between drivers in different domains, and use event channels and callback services for asynchronous notification of data availability.

XenStore

XenStore [22] is a shared storage space used by domains to communicate and store configuration information. XenStore is the mechanism by which control-plane activities occur, including:
- Setting up shared memory regions and event channels for use with split device drivers
- Notifying the guest of control events (for example, balloon driver requests)
- Reporting status information from the guest (for example, performance-related statistics)
The store is arranged as a hierarchical collection of key-value pairs. Each domain has a directory hierarchy containing data related to its configuration. Domains are permitted to register for notifications about changes in a subtree of the store, and to apply changes to the store transactionally.
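The hierarchical shape of the store can be illustrated with a toy path test. This is a model of the key space only, not the XenStore API: a watch registered on a subtree matches a changed key when the key equals the subtree root or lies beneath it.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of XenStore's hierarchical key space (not the real API):
 * keys are slash-separated paths. A watch registered on a subtree
 * fires for a changed key only when the key equals the subtree root
 * or lies strictly beneath it. */
static int xs_in_subtree(const char *subtree, const char *key)
{
    size_t n = strlen(subtree);

    if (strncmp(key, subtree, n) != 0)
        return 0;                       /* prefix doesn't match */
    return key[n] == '\0' || key[n] == '/';
}
```

Note that the trailing-character check is what keeps a watch on /local/domain/3 from firing for /local/domain/30.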
On 64-bit systems, a linear address space (flat memory model) is used to create a continuous, unsegmented address space for both the kernel and application programs. Segmentation is effectively disabled, and rings 1 and 2, which are essentially unused, have the same paging privilege as ring 0 (see Protected Mode and following sections beginning on page 22). To protect the VMM, the Sun xVM Server kernel is therefore restricted to run in ring 3 in 64-bit mode and in ring 1 in 32-bit mode, as seen in the definitions in segments.h:
% cat intel/sys/segments.h
....
#if defined(__amd64)
#define SEL_XPL 0       /* xen privilege level */
#define SEL_KPL 3       /* both kernel and user in ring 3 */
#elif defined(__i386)
#define SEL_XPL 0       /* xen privilege level */
#define SEL_KPL 1       /* kernel privilege level under xen */
#endif  /* __i386 */
If both kernel and user application run with the same privilege level, how does Sun xVM Server protect the kernel from user applications? The answer is given as follows [32]:
1. The VMM performs context switching between kernel mode and the currently running application in user mode. The VMM tracks which mode, kernel or user, the GOS is running in.
2. The GOS maintains two top-level (PML4) page tables per process, one for the kernel and one for user mode, and registers both page tables with the VMM. The kernel page table contains translations for both kernel and user addresses, while the user page table contains translations only for user addresses. During a context switch, the VMM switches the top-level page table so that kernel addresses are not visible to the user process.
The linear address mapping to the paging data structure for a 64-bit x86 processor is shown below in Figure 17:
Figure 17. Linear address mapping to paging data structure for 64-bit x86 processor. The linear address divides into these fields: bits 63-48 are the sign extension of bit 47, bits 47-39 index the PML4 table, bits 38-30 index the PDP table, bits 29-21 index the PDE, bits 20-12 index the PTE, and bits 11-0 are the page offset.
Switching the PML4 page tables between kernel and user mode enables a 64-bit address space to be split into two logically separate address spaces. In this logical separation of a 64-bit address space, the kernel can access both its address space and a user address space while a user process can access only its own address space. The user address space in this addressing scheme is therefore restricted to use the lower 48 bits of the 64-bit address space. The resulting address space partition in the 64-bit Sun xVM Server is shown as follows, in Figure 18:
Figure 18. Address space partitioning in the 64-bit Sun xVM Server. The kernel occupies the upper portion of the 64-bit address space, while the user process (ring 3) is restricted to the lower 48 bits, starting at address 0.
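The field boundaries in Figure 17 can be expressed as shift-and-mask helpers. This is an illustrative sketch (the function names are mine): each paging-structure index is 9 bits wide, and the page offset is 12 bits.

```c
#include <stdint.h>

/* Illustrative decomposition of a 64-bit x86 linear address, following
 * Figure 17: bits 47:39 select the PML4 entry, 38:30 the PDP entry,
 * 29:21 the PDE, 20:12 the PTE, and 11:0 the byte offset in the page. */
static inline unsigned va_pml4_index(uint64_t va)  { return (va >> 39) & 0x1ff; }
static inline unsigned va_pdp_index(uint64_t va)   { return (va >> 30) & 0x1ff; }
static inline unsigned va_pde_index(uint64_t va)   { return (va >> 21) & 0x1ff; }
static inline unsigned va_pte_index(uint64_t va)   { return (va >> 12) & 0x1ff; }
static inline unsigned va_page_offset(uint64_t va) { return va & 0xfff; }
```

Because each index is 9 bits, every paging structure holds 512 entries, and the five fields together account for the lower 48 bits that a user process may use.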
As discussed previously (see Segmented Architecture on page 23), the processor privilege level is set when a segment is loaded. The Solaris OS uses the GDT for user and kernel segments. The segment index of each segment type is assigned as shown in Table 5 on page 54. The command kmdb(1M) can be used to examine the segment descriptor of kernel code:
[0]> gdt0+30::print -t 'struct user_desc'    // 64-bit kernel code segment
{
    unsigned long usd_lolimit :16 = 0x7000
    unsigned long usd_lobase :16 = 0xe030
    unsigned long usd_midbase :8 = 0
    unsigned long usd_type :5 = 0xe
    unsigned long usd_dpl :2 = 0x3
    unsigned long usd_p :1 = 0x1
    unsigned long usd_hilimit :4 = 0x4
    unsigned long usd_avl :1 = 0
    unsigned long usd_long :1 = 0
    unsigned long usd_def32 :1 = 0
    unsigned long usd_gran :1 = 0x1
    unsigned long usd_hibase :8 = 0xfb
}
[0]> gdt0+40::print -t 'struct user_desc'    // 32-bit user code segment
{
    unsigned long usd_lolimit :16 = 0xc450
    unsigned long usd_lobase :16 = 0xe030
    unsigned long usd_midbase :8 = 0xf8
    unsigned long usd_type :5 = 0xe
    unsigned long usd_dpl :2 = 0x3
    unsigned long usd_p :1 = 0x1
    unsigned long usd_hilimit :4 = 0x1
    unsigned long usd_avl :1 = 0
    unsigned long usd_long :1 = 0
    unsigned long usd_def32 :1 = 0
    unsigned long usd_gran :1 = 0x1
    unsigned long usd_hibase :8 = 0xfb
}
The descriptor privilege level (DPL) of both the kernel and 32-bit user code segments is set to 3. At boot time, the Sun xVM Hypervisor for x86 is loaded into memory in ring 0. After initialization, it loads the Solaris kernel to run as Dom0 in ring 3. The domain Dom0 is permitted to use the VM control hypercall interfaces (see Table 4 on page 42), and is responsible for hosting the application-level management software.
CPU Scheduling
The Sun xVM Hypervisor for x86 provides two schedulers for the user to choose between: Credit and simple Earliest Deadline First (sEDF). The Credit scheduler is the default; sEDF might be phased out and removed from the Sun xVM Server implementation. The Credit scheduler is a proportional fair share CPU scheduler. Each physical CPU (PCPU) manages a queue of runnable virtual CPUs (VCPUs), sorted by VCPU priority. A VCPU's priority can be either over or under, representing whether the VCPU has exceeded its share of the PCPU or not. A VCPU's share is determined by the weight assigned to the VM and the credit accumulated by the VCPU in each accounting period.
    Credit_VM(i) = (Credit_total × Weight_VM(i) + (Weight_total − 1)) / Weight_total

    Credit_VCPU(j,i) = Credit_VM(i) / N_VCPU(i)

where N_VCPU(i) is the number of VCPUs in VM i.
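As a numeric illustration of the credit equations above (a sketch with my own function names, not Xen's scheduler code): the added term (Weight_total − 1) rounds the integer division up, and each VCPU receives an equal slice of its VM's credit.

```c
#include <stdint.h>

/* Sketch of the credit arithmetic described in the text (names are
 * mine, not Xen's). A VM's per-period credit is its weight's
 * proportional share of the total credit, rounded up by adding
 * (weight_total - 1) before dividing; each VCPU then gets an equal
 * slice of the VM's credit. */
static int64_t vm_credit(int64_t credit_total, int64_t weight_vm,
                         int64_t weight_total)
{
    return (credit_total * weight_vm + (weight_total - 1)) / weight_total;
}

static int64_t vcpu_credit(int64_t credit_vm, int nr_vcpus)
{
    return credit_vm / nr_vcpus;
}
```

For example, with a total credit of 300 split over three VMs of equal weight, each VM receives 100 credits per accounting period, and a two-VCPU VM splits its credit 50/50 between its VCPUs.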
The first equation determines the total credit of a VM, and the second equation determines the credit of a VCPU in a VM. Credit_total is a constant; Weight_total is the sum of the weights of all domains. A VM's weight is assigned using xm(1M) (for example, xm sched-credit -w weight). In each accounting period, a fixed amount of credits is added to idle VCPUs and subtracted from running VCPUs. A VCPU has the priority under if it has not consumed all the credits it possesses. On each PCPU, at every scheduling decision (when a VCPU blocks, yields, completes its time slice, or is awakened), the next VCPU to run is picked off the head of the run queue of priority under. When a VM runs, it consumes the credits of its VCPU[s]. When a VCPU uses up all its allocated credits, its priority changes from under to over. When a PCPU doesn't find a VCPU of priority under on its local run queue, it looks on other PCPUs for one. This load balancing guarantees each VM receives its fair share of PCPU resources system-wide. Before a PCPU goes idle, it looks on other PCPUs to find any runnable VCPU, which guarantees that no PCPU idles when there is runnable work in the system. Earliest Deadline First (EDF) scheduling provides weighted CPU sharing by comparing the deadlines of scheduled periodic processes (or domains, in the case of Sun xVM
Server). This scheduler places domains in a priority queue. Each domain is associated with two parameters: the time requested to run, and an interval or deadline. Whenever a scheduling event occurs, the queue is searched for the domain closest to its deadline, and that domain is scheduled next for its requested time. The EDF scheduler gives better CPU utilization when a system is underloaded. When the system is overloaded, the set of domains that will miss deadlines is largely unpredictable (it is a function of the exact deadlines and the time at which the overload occurs).
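The sEDF decision described above can be sketched as a scan for the runnable domain closest to its deadline. This is a toy illustration; the struct and field names are mine, not the hypervisor's.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy illustration of the sEDF decision (not the hypervisor's code):
 * each domain carries a requested slice and an absolute deadline; at
 * every scheduling event, the runnable domain with the earliest
 * deadline runs next. */
struct sedf_dom {
    int64_t deadline;   /* absolute time by which the slice must finish */
    int64_t slice;      /* CPU time requested per period */
    int     runnable;
};

static int sedf_pick_next(const struct sedf_dom *doms, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (!doms[i].runnable)
            continue;
        if (best < 0 || doms[i].deadline < doms[best].deadline)
            best = (int)i;
    }
    return best;        /* index of next domain, or -1 if none runnable */
}
```

This also makes the overload behavior noted above visible: which domains miss deadlines depends entirely on how the deadlines happen to be ordered when demand exceeds capacity.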
The Solaris kernel function init_desctbls() passes each of its exception and interrupt vectors to the VMM using the set_trap_table() hypercall:
void
xen_idt_write(gate_desc_t *sgd, uint_t vec)
{
    trap_info_t trapinfo[2];

    bzero(trapinfo, sizeof (trapinfo));
    if (xen_idt_to_trap_info(vec, sgd, &trapinfo[0]) == 0)
        return;
    if (xen_set_trap_table(trapinfo) != 0)
        panic("xen_idt_write: xen_set_trap_table() failed");
}
The set_trap_table() hypercall has one argument, trap_info, which contains the privilege level of the GOS code segment, the code segment selector, and the address of the handler that will be used to set the instruction pointer when the VMM passes control back to the GOS (see the following code segment). The value of trap_info is set in the function xen_idt_to_trap_info() using the settings in the IDT gate descriptor.
On a 64-bit system, the interrupt descriptor has the descriptor privilege level (DPL) 3, similar to the segment descriptor:
[0]> idt0::print 'struct gate_desc'
{
    sgd_looffset = 0x4bf0
    sgd_selector = 0xe030
    sgd_ist = 0
    sgd_resv1 = 0
    sgd_type = 0xe
    sgd_dpl = 0x3
    sgd_p = 0x1
    sgd_hioffset = 0xfb84
    sgd_hi64offset = 0xffffffff
    sgd_resv2 = 0
    sgd_zero = 0
    sgd_resv3 = 0
}
When an interrupt or exception occurs, the VMM's trap handler is invoked to handle it. If the exception was caused by a GOS, the VMM's trap handler sets the pending bit (see Event Channels on page 43) and calls the GOS's exception handler. Interrupts for the GOS are virtualized by mapping them to event channels, and are delivered asynchronously to the target GOS via the callback registered with the set_callbacks() hypercall.
In the following example, a kmdb(1M) breakpoint is set at the interrupt service routine of the sd driver, sdintr(). The function xen_callback_handler(), the callback function used for processing events from the VMM, is registered with the VMM through the hypercall set_callbacks(). When an interrupt intended for sd arrives, the hypercall HYPERVISOR_block() detects that an event is available and then invokes the callback function:
sd`sdintr:
sd`sdintr:      ec8b4855 = pushq %rbp
[0]> $c
sd`sdintr(fffffffec0670000)
mpt`mpt_intr+0xdb(fffffffec0670000, 0)
av_dispatch_autovect+0x78(1b)
dispatch_hardint+0x33(1b, 0)
switch_sp_and_call+0x13()
do_interrupt+0x9b(ffffff0001005ae0, 1)
xen_callback_handler+0x36c(ffffff0001005ae0, 1)
xen_callback+0xd9()
HYPERVISOR_sched_op+0x29(1, 0)
HYPERVISOR_block+0x11()
mach_cpu_idle+0x52()
cpu_idle+0xcc()
idle+0x10e()
thread_start+8()
[0]>
Pending events are stored in a per-domain bitmask (see Event Channels on page 43), which is updated by the VMM before invoking an event-callback handler specified by the GOS. The function xen_callback_handler() is responsible for resetting the set of pending events and responding to the notifications in an appropriate manner. A VM may explicitly defer event handling by setting a VMM-readable software flag; this is analogous to disabling interrupts on a real processor.
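The per-port test that gates an upcall (see Event Channels above) reduces to bit operations over the two per-domain bitmasks. The sketch below is my own simplification, not the hypervisor's code: an event on a port is deliverable only when its pending bit is set and its mask bit is clear, and the handler clears the pending bit once the event is processed.

```c
#include <stdint.h>

/* Simplified model of the evtchn_pending/evtchn_mask test (my own
 * sketch): a port's event is deliverable when PENDING is set and the
 * port is not masked. Ports are spread across an array of 64-bit words. */
#define EVT_WORD(p)  ((p) / 64)
#define EVT_BIT(p)   (1ULL << ((p) % 64))

static int evtchn_deliverable(const uint64_t *pending, const uint64_t *mask,
                              unsigned port)
{
    return (pending[EVT_WORD(port)] & EVT_BIT(port)) != 0 &&
           (mask[EVT_WORD(port)] & EVT_BIT(port)) == 0;
}

/* The GOS clears the pending bit after processing the event. */
static void evtchn_clear_pending(uint64_t *pending, unsigned port)
{
    pending[EVT_WORD(port)] &= ~EVT_BIT(port);
}
```

Setting a mask bit is the event-channel analogue of disabling one interrupt line: the notification stays pending but no upcall is scheduled until the bit is cleared again.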
Timer Services
Timer Devices on page 27 discusses several hardware timers available on x86 systems. These hardware devices vary in their frequency reliability, granularity, counter size, and ability to generate interrupts. The Solaris OS employs some of these timer devices for running the OS clock and high-resolution timer:
- OS system clock. The Solaris OS uses the local APIC timer on multiprocessor systems to generate ticks for the system clock. On uniprocessor systems, the Solaris OS uses the PIT to generate ticks for the system clock.
- High-resolution timer. The Solaris OS uses the TSC for the high-resolution timer. The PIT counter is used to calibrate the TSC counter.
- Time-of-day clock. The time-of-day (TOD) clock is based on the RTC. Only Dom0 can set the TOD clock; the DomU VMs don't have permission to update the machine's physical RTC. Therefore, any attempt by the date(1) command to set the date and time on DomU is quietly ignored.
In Sun xVM Server, the VMM provides the system time to each VCPU when it is scheduled to run. The high-resolution timer, gethrtime(), still runs through the unprivileged RDTSC instruction, so the high-resolution timer is not virtualized. The virtualized system time relies on the current TSC value to calculate the time in nanoseconds since the VCPU was scheduled.
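The virtualized system time computation can be sketched as follows. The structure and field names are hypothetical, standing in for the per-VCPU time record the VMM supplies when the VCPU is scheduled: system time is the time recorded at scheduling plus the TSC ticks elapsed since, scaled to nanoseconds.

```c
#include <stdint.h>

/* Hypothetical per-VCPU time record (field names are mine), standing
 * in for the data the VMM supplies when the VCPU goes on CPU. */
struct vcpu_time {
    uint64_t ns_at_schedule;    /* system time when VCPU was scheduled */
    uint64_t tsc_at_schedule;   /* TSC value sampled at that moment */
    uint64_t tsc_hz;            /* TSC frequency in Hz */
};

/* System time = time at schedule + elapsed TSC ticks in nanoseconds.
 * (A real implementation would use scaled fixed-point arithmetic to
 * avoid the overflow in ticks * 1e9 for long intervals.) */
static uint64_t vcpu_system_time(const struct vcpu_time *t, uint64_t tsc_now)
{
    uint64_t ticks = tsc_now - t->tsc_at_schedule;

    return t->ns_at_schedule + ticks * 1000000000ULL / t->tsc_hz;
}
```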
XENMEM_increase_reservation
XENMEM_decrease_reservation
XENMEM_populate_physmap
XENMEM_maximum_ram_page
XENMEM_current_reservation
XENMEM_maximum_reservation
XENMEM_machphys_mfn_list
XENMEM_add_to_physmap
XENMEM_translate_gpfn_list
XENMEM_memory_map
XENMEM_machine_memory_map
XENMEM_set_memory_map
XENMEM_machphys_mapping
XENMEM_exchange
Page Translations
Segmented Architecture on page 23 describes two stages of address translation to arrive at a physical address: virtual address (VA) to linear address (LA) translation using segmentation, and LA to physical address (PA) translation using paging. Solaris x64 uses a flat address space in which the VA and LA are equivalent, which means the base address of the segment is 0. In Solaris 10, the Global Descriptor Table (GDT) contains the segment descriptor for the code and data segments of both kernel and user processes, as shown in Table 5 on page 54. Since there is only one GDT in a system, the VMM maintains the GDT in its memory. If a GOS wishes to use something other than the default segment mapping that the VMM GDT provides, it must register a custom GDT with the VMM using the set_gdt() hypercall. In the following code sample, frame_list is the physical address of the page that contains the GDT and entries is the number of entries in the GDT.
xen_set_gdt(ulong_t *frame_list, int entries)
{
    ....
    if ((err = HYPERVISOR_set_gdt(frame_list, entries)) != 0) {
        ....
    }
    return (err);
}
The Solaris 32-bit thread library uses %gs to refer to the LWP state manipulated by the internals of the thread library. The 64-bit thread library uses %fs to refer to the LWP state, as specified by the AMD64 ABI [8]. The 64-bit kernel still uses %gs for its CPU state (%fs is never used in the kernel). The KernelGSBase MSR is used to store the kernel %gs contents while the CPU runs a 32-bit user LWP, and the privileged instruction SWAPGS restores the kernel %gs on the switch back to kernel context. So when the VMM performs a context switch between guest kernel mode and guest user mode, it executes SWAPGS as part of the context switch (see CPU Privilege Mode on page 45). The GDT segments are given in Table 5 below:
% cat intel/sys/segments.h
#define GDT_NULL      0     /* null */
#define GDT_B32DATA   1     /* dboot 32 bit data descriptor */
#define GDT_B32CODE   2     /* dboot 32 bit code descriptor */
#define GDT_B16CODE   3     /* bios call 16 bit code descriptor */
#define GDT_B16DATA   4     /* bios call 16 bit data descriptor */
#define GDT_B64CODE   5     /* dboot 64 bit code descriptor */
#define GDT_BGSTMP    7     /* kmdb descriptor only used in boot */

#if defined(__amd64)
#define GDT_KCODE     6         /* kernel code seg %cs */
#define GDT_KDATA     7         /* kernel data seg %ds */
#define GDT_U32CODE   8         /* 32-bit process on 64-bit kernel %cs */
#define GDT_UDATA     9         /* user data seg %ds (32 and 64 bit) */
#define GDT_UCODE     10        /* native user code seg %cs */
#define GDT_LDT       12        /* LDT for current process */
#define GDT_KTSS      14        /* kernel tss */
#define GDT_FS        GDT_NULL  /* kernel %fs segment selector */
#define GDT_GS        GDT_NULL  /* kernel %gs segment selector */
#define GDT_LWPFS     55        /* lwp private %fs segment selector (32-bit) */
#define GDT_LWPGS     56        /* lwp private %gs segment selector (32-bit) */
#define GDT_BRANDMIN  57        /* first entry in GDT for brand usage */
#define GDT_BRANDMAX  61        /* last entry in GDT for brand usage */
#define NGDT          62        /* number of entries in GDT */
Every LWP context switch requires an update to the GDT for the new LWP. The GOS uses update_descriptor() for this task:

intel/ia32/os/desctbls.c
update_gdt_usegd(uint_t sidx, user_desc_t *udp)
{
    ....
    if (HYPERVISOR_update_descriptor(pa_to_ma(dpa), *(uint64_t *)udp))
        panic("xen_update_gdt_usegd: HYPERVISOR_update_descriptor");
}
On an x86 system, the base physical address of the page directory is contained in the control register %cr3. In the Solaris OS, the value of %cr3 is stored in the process's hat structure, proc->p_as->a_hat->hat_table->ht_pfn, as shown in Paging Architecture on page 25. The loading of %cr3 is performed by the VMM for security and coherency reasons.
Page Translations Virtualization on page 14 discusses two alternatives for updating page tables in a virtualized environment: hypercalls against a read-only page table, and shadow page tables. The Sun xVM Hypervisor for x86 provides an additional alternative, a writable page table, for the GOS to implement page translations. In the default mode of operation, the VMM uses both read-only and writable page tables to manage page tables. The VMM allows the GOS to use a writable page table to update the lowest-level page tables (that is, the PTEs). The higher levels, such as the PDE, PDP, and PML4, use a read-only page table and are updated with the mmu_update() hypercall. Updates to the higher-level page tables are much less frequent than PTE updates.

- Read-only page table. The GOS has read-only access to page tables and uses the mmu_update() hypercall to update them. As described in the previous section, Physical Memory Management on page 52, the GOS has a view of pseudo-physical memory, and a translation from physical address to machine address is performed before the mmu_update() call.
void
set_pteval(paddr_t table, uint_t index, uint_t level, x86pte_t pteval)
{
    ....
    ma = pa_to_ma(PT_INDEX_PHYSADDR(pfn_to_pa(ht->ht_pfn), entry));
    t[0].ptr = ma | MMU_NORMAL_PT_UPDATE;
    t[0].val = new;
    if (HYPERVISOR_mmu_update(t, cnt, &count, DOMID_SELF))
        panic("HYPERVISOR_mmu_update() failed");
    ....
}
- Writable page table. If a GOS attempts to write to a page table that is maintained by the VMM, the attempt results in a #PF fault to the VMM. The VMM fault handling routine performs the following tasks:
  1. Holds the lock for all further page table updates
  2. Disconnects the page that contains the updated page table by clearing the page present bit of the page table entry in the parent page table
  3. Makes the page writable by the GOS
  The page is reconnected to the paging hierarchy automatically in a number of situations, including when the guest modifies a different page-table page, when the domain is preempted, and whenever the guest uses the VMM's explicit page-table update interfaces.

- Shadow page table. The VMM maintains an independent copy of the page tables, called the shadow page table, which is pointed to by the %cr3 register. If a page fault occurs when a GOS's page table is accessed, the VMM propagates changes made to the GOS's page table to the shadow page table. Shadow page mode can be set for the GOS by calling dom0_op(DOM0_SHADOW_CONTROL).
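The disconnect step in the writable-page-table path amounts to clearing the Present bit of the parent entry that maps the page-table page. The sketch below is illustrative only; the Present bit value is the standard x86 bit 0, but the function names are mine.

```c
#include <stdint.h>

/* x86 page-table entries keep the Present flag in bit 0. Clearing it
 * in the parent entry "disconnects" the page-table page, so any access
 * through it faults to the VMM; setting it again reconnects the page
 * after revalidation. (Illustrative sketch, not Xen's code.) */
#define PTE_PRESENT 0x1ULL

static uint64_t pt_disconnect(uint64_t parent_entry)
{
    return parent_entry & ~PTE_PRESENT;
}

static uint64_t pt_reconnect(uint64_t parent_entry)
{
    return parent_entry | PTE_PRESENT;
}
```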
In addition to creating a translation entry, the VMM provides the mmuext_op() hypercall for the GOS to flush, invalidate, or lock a page translation. For example, it is necessary to lock the translations of a process when it is being created. The mmuext_op() hypercall is invoked by the kernel during the fork(2) system call:
[3]> :c
kmdb: stop at xen_pin+0x3a
kmdb: target stopped at:
xen_pin+0x3a:   call +0x208b1 <HYPERVISOR_mmuext_op>
[3]> $c
xen_pin+0x3a(ff2c, 3)
hat_alloc+0x285(fffffffec381b7e8)
as_alloc+0x99()
as_dup+0x3f(fffffffec381ba88, fffffffec3d0f8d0)
cfork+0x102(0, 1, 0)
forksys+0x25(0, 0)
sys_syscall32+0x13e()
Figure 19. The split device driver architecture employed by Sun xVM Server includes a front-end driver in DomU and a back-end driver in Dom0.
Dom0 is a special VM that has access to the real device hardware. The front-end driver appears to a GOS in DomU as a real device. This driver receives I/O requests from applications as usual. However, since the front-end driver does not have access to the physical hardware of the system, it must send requests to the back-end driver in Dom0. The back-end driver is responsible for issuing I/O requests to the real device hardware. When the I/O completes, the back-end notifies the front-end that the data is ready for use; the front-end is then able to report I/O completion and unblock the I/O call. When the Solaris OS is initialized, devices identify themselves and are organized into the device tree. This device tree depicts a hierarchy of nodes, with each node on the tree representing a device. Sun xVM Server exports a complete device tree to domain Dom0 so that it can directly access all physical devices on the system. For DomU domains, the paravirtualized Solaris OS uses information passed to it by xm(1M) to disable PCI bus probing and create virtual Sun xVM Server device nodes under the VMM virtual device nexus driver, xpvd.
Output from the prtconf(1M) command shows the device tree as exported by Sun xVM Server to a VM in a DomU domain. As the prtconf(1M) output shows, there are no physical devices of any kind on the device tree in DomU:
# prtconf
System Configuration:  Sun Microsystems  i86pc
Memory size: 2048 Megabytes
System Peripherals (Software Nodes):

i86xpv
    scsi_vhci, instance #0
    isa (driver not attached)
    xpvd, instance #0
        xencons, instance #0
        xenbus, instance #0
        domcaps, instance #0
        balloon, instance #0
        xdf, instance #0
        xnf, instance #0
    iscsi, instance #0
    pseudo, instance #0
    agpgart, instance #0
    options, instance #0
    xsvc, instance #0
    cpus (driver not attached)
        cpu, instance #0 (driver not attached)
        cpu, instance #1 (driver not attached)
A driver that provides services to other drivers is called a bus nexus driver and is shown in the device tree hierarchy as a node with children. The nexus driver provides bus mapping and translation services to subordinate devices in the device tree. The types of services provided by the nexus driver include interrupt priority assignment, DMA resource mapping, and device memory mapping. As seen in the previous prtconf(1M) output, the xpvd driver is the root nexus driver for all Sun xVM Server devices on DomU. An individual device driver is represented in the tree as a node with no children; this type of node is referred to as a leaf driver. In the above example, xenbus, domcaps, xencons, xdf, and xnf are leaf drivers.
The Sun xVM Server-related driver modules for Dom0 and DomU, respectively, are shown below:

Sun xVM Server related device modules on Dom0:
    xpvtod   (TOD module for Xen)
    xpvd     (virtual device nexus driver)
    xencons  (virtual console driver)
    privcmd  (privcmd driver)
    evtchn   (evtchn driver)
    xenbus   (virtual bus driver)
    xdb      (vbd backend driver)
    xnb      (xnb module)
    xsvc     (xsvc driver)
    balloon  (balloon driver)

Sun xVM Server related device modules on DomU:
    xenbus   (virtual bus driver)
    xpvtod   (TOD module for i86xpv)
    xpvd     (virtual device nexus driver)
    xencons  (virtual console driver)
    xdf      (Xen virtual block driver)
    xnf      (virtual Ethernet driver)
The xpvtod driver provides setting and getting of the time-of-day for the VM. TOD service is provided by the RTC timer. If a request to set the TOD comes from a DomU domain, the request is silently ignored, as DomU doesn't have permission to set the RTC timer. The nexus driver in Solaris provides bus mapping and translation services to subordinate devices in the device tree. The xpvd driver is the nexus driver for all virtual I/O drivers that don't directly access a physical device. This driver's primary functions are to provide interrupt mapping and to invoke the initialization routines of its children devices. The xenbus driver provides a bus abstraction that drivers can use to communicate between VMs. The bus is mainly used for configuration negotiation, leaving most data transfer to be done via an interdomain channel composed of a grant table and an event channel. The xenbus driver also makes the configuration data available to the XenStore shared storage repository (see XenStore on page 45). The evtchn driver is used for receiving and demultiplexing event-channel signals to user land. The balloon driver is controlled by the VMM to manage physical memory usage by a VM (see Physical Memory Virtualization on page 13 and Physical Memory Management on page 52). The privcmd driver is used by the domain manager on Dom0 to obtain VMM services for VM management.
The drivers xdf and xdb, the front-end and back-end block device drivers respectively, are discussed in Disk Driver on page 60. The xnf and xnb drivers, the front-end and back-end network drivers respectively, are discussed in Network Driver on page 61. Data transfer between interdomain drivers is provided mainly by the VMM grant table and event-channel services. Most of the data transfer is handled in a fashion similar to a DMA transfer between host and device: data is put in the grant table by the sending VM, and notification is sent to the receiving VM through the event channel. The callback routine in the receiving VM is then invoked to process the data.
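The DMA-like transfer pattern described above can be modeled in miniature: the sender fills a granted page and raises an event, and the receiver's callback drains it. All names below are mine; real drivers use the grant table and event channel hypercalls rather than a shared struct and a flag.

```c
#include <stdint.h>
#include <string.h>

/* Toy model (all names mine) of interdomain data transfer: a page-sized
 * shared buffer stands in for a granted page, and a flag stands in for
 * the event-channel notification. */
struct shared_page {
    uint8_t  data[4096];
    uint32_t len;
};

static int event_raised;

/* Sender side: copy data into the granted page, then notify the peer. */
static void send_to_peer(struct shared_page *gref, const void *buf,
                         uint32_t len)
{
    memcpy(gref->data, buf, len);
    gref->len = len;
    event_raised = 1;       /* stands in for the event-channel kick */
}

/* Receiver side: the callback consumes the data only when an event is
 * pending, then acknowledges it. Returns the number of bytes drained. */
static uint32_t peer_callback(struct shared_page *gref, void *out)
{
    if (!event_raised)
        return 0;
    memcpy(out, gref->data, gref->len);
    event_raised = 0;
    return gref->len;
}
```

The design point the model captures is that the event channel carries only the notification; the payload itself moves through the shared (granted) memory, just as a device interrupt signals the completion of a DMA transfer rather than carrying the data.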
Disk Driver
The xdb driver, the back-end driver on Dom0, provides services for block device management. This driver receives I/O requests from DomU domains and sends them on to the native driver. On DomU, xdf is the pseudo block driver that receives I/O requests from applications and sends them to the xdb driver in Dom0. The xdf driver provides functions similar to those of the SCSI target disk driver, sd, on an unvirtualized Solaris system. On Solaris systems, the main interface between a file system and a storage device is the strategy(9E) driver entry point. The strategy(9E) entry point takes only one argument, buf(9S), which is the basic data structure for block I/O transfer. An I/O request made by a file system through the strategy(9E) entry point is called PAGEIO, as the memory buffer for the I/O is allocated from the kernel page pool. An application can also open the storage device as a raw device and perform read(2) and write(2) operations directly on it. Such an I/O request is called PHYSIO, physio(9F), as the memory buffer for the I/O is allocated by the application.
In addition to the strategy(9E) driver entry point for supporting file system and raw device access, a disk driver also supports a set of ioctl(2) operations for disk control and management. The dkio(7I) disk control operations define a standard set of ioctl(2) commands. Normally, support for dkio(7I) operations requires direct access to the device. In DomU, xdf supports most of the ioctl(2) commands defined in dkio(7I) by emulating the disk control inside xdf. No communication is made by xdf to the back-end driver for ioctl(2) operations.
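This local emulation can be pictured with a hedged sketch: the front end answers a geometry-style ioctl from state cached at attach time, with no message to the back end. The structure, field, and command names here are invented for illustration and do not match the Solaris xdf sources:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative front-end driver state; geometry is cached when the
 * virtual disk is attached, so ioctls never reach the back end. */
struct vdisk_geom { uint32_t ncyl, nhead, nsect, secsize; };

struct xdf_softstate {
    struct vdisk_geom geom;   /* cached at attach time */
};

enum { FAKE_DKIOCGGEOM = 1 };  /* stand-in for a dkio(7I) command */

int xdf_ioctl(struct xdf_softstate *ss, int cmd, void *arg)
{
    switch (cmd) {
    case FAKE_DKIOCGGEOM:
        /* Emulated entirely inside the front end: copy cached data. */
        memcpy(arg, &ss->geom, sizeof(ss->geom));
        return 0;
    default:
        return -1;             /* unsupported command */
    }
}
```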
The sequence of events for disk I/O data transfer is illustrated in Figure 20. The disk control path, ioctl(2), is similar to the data path. When a disk I/O request is issued by a DomU domain, the sequence is as follows:

1. The file system calls the xdf driver's strategy(9E) entry point as a result of a read(2) or write(2) system call.
2. The xdf driver puts the I/O buffer, buf(9S), on the grant table. This buffer is allocated from the DomU memory, and access to this memory is granted to the other domain.
3. The xdf driver notifies Dom0 of an event through the event channel.
4. The VMM event channel generates an interrupt to the xdb driver in Dom0.
5. The xdb driver in Dom0 gets the DomU I/O buffer through the grant table.
6. The xdb driver in Dom0 calls the native driver's strategy(9E) entry point.
7. The native driver performs DMA.
8. The VMM receives the device interrupt.
9. The VMM generates an event to Dom0.
10. The xdb driver's iodone() routine is called by biodone(9F).
11. The xdb driver's iodone() routine generates an event to DomU.
12. The xdf driver in DomU receives an interrupt, frees up the grant table and DMA resources, and calls biodone(9F) to wake up anyone waiting for the I/O.

When a disk I/O request is issued by the control domain Dom0, the sequence is as follows:

13. Block I/O requests are sent directly to the native driver.
Figure 20. Sequence of events for an I/O request from a Sun xVM Server virtual machine.
Network Driver
The Sun xVM Server network drivers use an approach similar to that of the disk block drivers for handling network packets. On DomU, the pseudo network driver xnf gets I/O requests from the network stack and sends them to xnb on Dom0. The back-end network driver xnb on Dom0 forwards packets sent by xnf to the native network driver.
The buffer management for packet receiving has more impact on network performance than that for packet transmitting. On the packet receiving end, the data is transferred via DMA into the native driver's receive buffer on Dom0. Then, the packet is copied from the native driver buffer to the VMM buffer. The VMM buffer is then mapped to the DomU kernel address space without another copy of the data. The sequence of operations for packet receiving is as follows:

1. Data is transferred via DMA into the native driver (bge) receive buffer ring.
2. The xnb driver gets a new buffer from the VMM and copies data from the bge receive ring to the new buffer.
3. The xnb driver sends DomU an event through the event channel.
4. The xnf driver in DomU receives an interrupt.
5. The xnf driver maps an mblk(9S) to the VMM buffer and sends the mblk(9S) to the upper stack.
Figure 21. Sequence of events for a network request from a Sun xVM Server virtual machine.
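As a rough model of this single-copy receive path (all names and buffer sizes below are invented for illustration; real buffers are DMA-mapped pages, not plain arrays):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PKT 256

static unsigned char native_ring[PKT];  /* bge receive ring (DMA lands here) */
static unsigned char vmm_buf[PKT];      /* VMM-owned buffer */
static int copies;                      /* count data copies on the RX path */

/* Step 2: xnb copies from the native ring into a fresh VMM buffer. */
void xnb_copy_to_vmm(size_t len)
{
    memcpy(vmm_buf, native_ring, len);
    copies++;
}

/* Step 5: xnf maps an mblk to the VMM buffer; no second copy is made. */
const unsigned char *xnf_map_mblk(void)
{
    return vmm_buf;                     /* mapping, not copying */
}
```

The point of the design is visible in the model: exactly one copy happens on receive, after which the guest sees the VMM buffer by mapping.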
Chapter 6
Sun xVM Server with Hardware VM (HVM)
Note Intel's virtualization extension is called Virtual Machine Extensions (VMX), and is documented in the IA-32 Intel Architecture Software Developer's Manual (see [7] Volume 3B Chapters 19-23). AMD's extension is called Secure Virtual Machine (SVM), and is documented in the AMD64 Architecture Programmer's Manual Volume 2: System Programming (see [9] Chapter 15).
                                                   Intel
Virtualization operation                           VMX
Privileged mode                                    VMX Root
Reduced-privileged mode                            VMX non-Root
HVM Control and State Data Structure (HVMCSDS)     VMCS
Entering non-privileged mode                       VMLAUNCH/VMRESUME
Exiting non-privileged mode                        Implicit
After HVM is enabled, the processor operates in privileged mode. Transitions from privileged mode to reduced-privileged mode are called VM entries. Transitions from reduced-privileged mode to privileged mode are called VM exits. Figure 22 illustrates entry and exit with the HVMCSDS.
Figure 22. Virtual machine entry and exit with hardware support on AMD and Intel processors.
state, VMM state, control fields, and the VM state before loading the VM state from the HVMCSDS to launch the VM entry. As part of VM entry, the VMM can inject an event into the VM. The event injection process is used to deliver virtualized external interrupts to a VM. A VM normally does not get interrupts from I/O devices, because I/O devices are not exposed to VMs (with the exception of Dom0). As will be shown in Sun xVM Server with HVM I/O Virtualization (QEMU) on page 71, a VM's I/O is handled by a special domain (Dom0) that runs a paravirtualized OS and has direct access to I/O devices. When an I/O operation completes, Dom0 informs the VMM to send an interrupt through an hvm_op hypercall. The VMM prepares the HVMCSDS for event injection, and the VM's return instruction pointer (RIP) is pushed on the stack.

VM exit occurs implicitly in response to certain instructions and events in a VM. The VMM governs the conditions causing a VM exit by manipulating the control fields in the HVMCSDS. The events that can be controlled to result in a VM exit include the following (see [9] Chapter 20):

- External interrupts, non-maskable interrupts, and system management interrupts
- Executing certain instructions (such as RDPMC, RDTSC, or instructions that access the control registers)
- Exceptions

The exact conditions that cause a VM exit are defined in the HVMCSDS control fields. Certain conditions may cause a VM exit for one VM but not for other VMs. A VM exit behaves like a fault, meaning that the instruction causing the VM exit does not execute and no processor state is updated by the instruction. The VM exit handler
in the VMM is responsible for taking appropriate actions for the VM exit. Unlike exceptions, which are dispatched through the IDT, the VM exit handler is specified in the HVMCSDS host RIP field:
static void construct_vmcs(struct vcpu *v)
{
    ....
    /* Host CS:RIP. */
    __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS);
    __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
    ....
}
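The control-field mechanism, in which the same event exits one VM but not another depending on that VM's per-VM controls, can be sketched as follows. The bit positions and names below are illustrative, not the architected encodings:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-VM execution controls; real encodings differ. */
#define CTL_EXIT_ON_EXTINT  (1u << 0)   /* external interrupts */
#define CTL_EXIT_ON_RDTSC   (1u << 1)   /* RDTSC instruction */
#define CTL_EXIT_ON_CR_ACC  (1u << 2)   /* control-register access */

enum vm_event { EV_EXTINT, EV_RDTSC, EV_CR_ACCESS };

/* Each VM carries its own control word in its HVMCSDS-like area,
 * so the same event may force an exit for one VM and not another. */
int causes_vm_exit(uint32_t exec_controls, enum vm_event ev)
{
    switch (ev) {
    case EV_EXTINT:    return (exec_controls & CTL_EXIT_ON_EXTINT) != 0;
    case EV_RDTSC:     return (exec_controls & CTL_EXIT_ON_RDTSC) != 0;
    case EV_CR_ACCESS: return (exec_controls & CTL_EXIT_ON_CR_ACC) != 0;
    }
    return 0;
}
```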
Instruction             Description
VMLAUNCH/VMRESUME       launch/resume VM
VMCLEAR                 clear VMCS
VMPTRLD/VMPTRST         load/store VMCS
VMREAD/VMWRITE          read/write VMCS
VMXON/VMXOFF            enable/disable VMX operation
VMCALL                  call to the VMM
In addition to the new VMX instructions and the VMCS, Intel VT introduces a direct I/O architecture [28] to improve VM security, reliability, and performance through I/O enhancements. As will be shown in Sun xVM Server with HVM I/O Virtualization (QEMU) on page 71, the current I/O virtualization implementation for Sun xVM Server with HVM, which is based on the QEMU project, is inefficient, as all I/O transactions have to go through Dom0; unreliable, as the I/O virtualization layer on Dom0 becomes a single point of failure; and insecure, as a VM may access another VM's DMA memory by manipulating the values written to I/O ports.
The Intel-VT direct I/O architecture specifies the following hardware capabilities for the VMM:

- DMA remapping: This feature provides IOMMU support for I/O address translation and caching. The IOMMU specified in the architecture includes a page table hierarchy similar to the processor page tables, and an IOTLB for frequently accessed I/O pages. Addresses used in DMA transactions are allocated from the IOMMU address space, and the IOMMU hardware provides address translation from the IOMMU address space to the system memory address space.
- I/O device assignment across VMs: This feature allows a PCI/PCI-X device that is behind a PCI-E to PCI/PCI-X bridge, or a PCI-E device, to be assigned to a VM, regardless of how the PCI bus is bound to a VM.
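The DMA-remapping idea can be illustrated with a toy one-level translation. A real IOMMU uses a multi-level page-table hierarchy plus an IOTLB, and the entry layout below is invented:

```c
#include <assert.h>
#include <stdint.h>

/* Flat model: device-visible I/O page frame -> system page frame.
 * Entry format (illustrative): sys_pfn in the upper bits, bit 0 = present. */
#define IO_PAGE_SHIFT 12
#define IO_PAGES      16

static uint64_t io_page_table[IO_PAGES];

void iommu_map(uint64_t io_pfn, uint64_t sys_pfn)
{
    io_page_table[io_pfn] = (sys_pfn << 1) | 1;
}

/* Returns the system physical address, or -1 on a translation fault
 * (an unmapped I/O address means the DMA is blocked). */
int64_t iommu_translate(uint64_t io_addr)
{
    uint64_t pfn = io_addr >> IO_PAGE_SHIFT;
    uint64_t off = io_addr & ((1u << IO_PAGE_SHIFT) - 1);
    if (pfn >= IO_PAGES || !(io_page_table[pfn] & 1))
        return -1;
    return (int64_t)(((io_page_table[pfn] >> 1) << IO_PAGE_SHIFT) | off);
}
```

The protection property comes from the fault path: a device can only reach system memory the VMM chose to map for it.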
generalization of two facilities included in the AMD64 architecture: the Graphics Aperture Remapping Table (GART) and the Device Exclusion Vector (DEV). The GART provides address translation of I/O device accesses to a small range of the system physical address space, and the DEV provides a limited degree of I/O device classification and memory protection.
The VMM uses hvm_function_table to provide a VCPU to a VM. The entry points in hvm_function_table fall into two categories: setup and runtime. The setup entry points are called when a VM is being created. The runtime entry points are called before VM entry or after VM exit. Since the HVMCSDS data structure abstracts the state and controls of a VCPU, the entry points in hvm_function_table are primarily used to manipulate that data structure.
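A hedged sketch of such a split ops table follows; the struct, fields, and functions are illustrative, not the actual hvm_function_table:

```c
#include <assert.h>

/* Illustrative VCPU and ops table; real Xen state is far richer. */
struct vcpu { int id; int csds_initialized; int entries; };

struct hvm_ops {
    /* setup: called once, when the VM is created */
    void (*vcpu_initialize)(struct vcpu *v);
    /* runtime: called around each VM entry/exit */
    void (*prepare_entry)(struct vcpu *v);
};

static void vmx_vcpu_initialize(struct vcpu *v) { v->csds_initialized = 1; }
static void vmx_prepare_entry(struct vcpu *v)   { v->entries++; }

static const struct hvm_ops vmx_ops = {
    .vcpu_initialize = vmx_vcpu_initialize,
    .prepare_entry   = vmx_prepare_entry,
};

void create_vcpu(const struct hvm_ops *ops, struct vcpu *v)
{
    ops->vcpu_initialize(v);   /* setup path */
}

void run_vcpu(const struct hvm_ops *ops, struct vcpu *v)
{
    ops->prepare_entry(v);     /* runtime path, before VM entry */
}
```

The indirection lets the same VMM core drive either the Intel (VMX) or AMD (SVM) back end by swapping the ops table.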
In supporting shadow page tables [29], the Sun xVM Hypervisor for x86 attempts to intercept all updates to a guest page table, and updates both the VM's page table and the shadow page table maintained by the VMM, keeping the two synchronized at all times. This implementation results in two page faults: one due to faulting the actual page, and a second due to the page table access. This shadow page table scheme has a significant impact on VM performance. Alternatives such as nested page tables (see AMD Secure Virtual Machine Specifics on page 67) have been proposed to improve memory virtualization performance.
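The write-interception scheme can be modeled in a few lines; the tables and the intercept hook below are illustrative only (a real implementation write-protects the guest table so that each update traps into the VMM):

```c
#include <assert.h>
#include <stdint.h>

#define PT_ENTRIES 8

static uint64_t guest_pt[PT_ENTRIES];    /* the VM's page table */
static uint64_t shadow_pt[PT_ENTRIES];   /* the VMM's shadow copy */
static int write_faults;                 /* intercepts taken */

/* The VMM write-protects the guest table; every update lands here. */
void intercepted_pt_write(int index, uint64_t pte)
{
    write_faults++;            /* the extra fault that makes shadowing slow */
    guest_pt[index] = pte;
    shadow_pt[index] = pte;    /* keep the shadow synchronized */
}
```

The counter makes the cost visible: every guest page-table write pays an interception, which is the overhead nested page tables were designed to remove.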
Dependent HVM Functions on page 69) to set the HVMCSDS event injection control field. The event is delivered when the VM is entered.
Emulated BIOS
The PC BIOS provides hardware initialization, boot services, and runtime services to the OS. There are some restrictions on VMX operation: an OS in HVM cannot operate in real mode. Unlike a paravirtualized OS, which can change its bring-up sequence for an environment without a BIOS, an unmodified OS requires an emulated BIOS to perform some real-mode operations before control is passed to the OS. Sun xVM Server includes a BIOS emulator, hvmloader, as a surrogate for the real BIOS. The hvmloader BIOS emulation contains three components: ROMBIOS, VGABIOS, and VMXAssist. Both ROMBIOS and VGABIOS are based on the open source Bochs BIOS [23]. The VMXAssist component is included in hvmloader to emulate real mode, which is required by hvmloader and bootstrap loaders. The hvmloader BIOS emulator is bootstrapped like any other 32-bit OS. After it is loaded, hvmloader copies
The hvmloader BIOS emulator does not directly interface with physical devices. It communicates with virtual devices as discussed in the following section Sun xVM Server with HVM I/O Virtualization (QEMU).
The Solaris OS running in DomU can use pcn to communicate with QEMU on a Solaris Dom0 that has an e1000g NIC. The pcnet emulation in QEMU converts Solaris pcn transactions to a generic virtual network interface (such as TAP), which forwards the packet to the driver for the native network interface (such as e1000g). QEMU I/O emulation is illustrated in Figure 23. The principle of operation for sending out an I/O request is outlined as follows:

1. An OS interfaces with a device through I/O ports and/or memory-mapped device memory. The device performs certain operations, such as DMA, in response to I/O port/memory access by the OS. At the completion of the operation, the device generates an interrupt to notify the OS (Steps 1 and 2 in Figure 23).
2. The VMM monitors and intercepts the device I/O port and memory accesses (Step 3 in Figure 23).
3. The VMM forwards the I/O port/memory data to an I/O virtualization layer such as QEMU (Step 4 in Figure 23).
4. QEMU decodes the I/O port/memory data and performs the necessary emulation for the I/O request (Step 5 in Figure 23).
5. QEMU delivers the emulated I/O request to the OS native device interface (Steps 6 and 7 in Figure 23).
Figure 23. I/O emulation in Sun xVM Server using QEMU for dynamic binary translation.
Using the AMD PCnet LANCE PCI Ethernet controller as an example, the vendor ID and device ID of the PCnet chip are 1022 and 2000, respectively. From prtconf(1M) output, the PCI registers exported by the device are:
% prtconf -v
....
pci1022,2000, instance #0
    Hardware properties:
        name='assigned-addresses' type=int items=5
            value=81008810.00000000.00001400.00000000.00000080
        name='reg' type=int items=10
            value=00008800.00000000.00000000.00000000.00000000.01008810.00000000.00000000.00000000.00000080
....
According to IEEE 1275 OpenBoot firmware [25], the reg property is generated by reading the base address registers in the configuration address space. Each entry in the reg property consists of one 32-bit cell for register configuration, a 64-bit address cell, and a 64-bit size cell [26]. As the prtconf(1M) output shows, the PCnet chip has a 128-byte (0x00000080) register region in the I/O address space (the 01 in the first byte of 0x01008810 denotes I/O address space). QEMU emulation for PCnet simply monitors the Solaris driver's access to these 128 bytes of registers using x86 IN/OUT instructions.
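Decoding such a reg entry can be sketched as follows, assuming the address-space selector sits in bits 25:24 of the first cell as in the PCI bus binding; the struct and names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* One "reg" entry: a 32-bit configuration cell, then a 64-bit
 * address and a 64-bit size (field names are illustrative). */
struct pci_reg_entry {
    uint32_t phys_hi;    /* configuration cell */
    uint64_t addr;
    uint64_t size;
};

enum pci_space { SPACE_CONFIG = 0, SPACE_IO = 1,
                 SPACE_MEM32 = 2, SPACE_MEM64 = 3 };

enum pci_space reg_space(const struct pci_reg_entry *e)
{
    /* bits 25:24 of phys.hi select the address space */
    return (enum pci_space)((e->phys_hi >> 24) & 0x3);
}
```

Applied to the second entry in the prtconf output (configuration cell 0x01008810, size 0x80), this yields an I/O-space region of 128 bytes, matching the text above.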
The QEMU virtualization for transmitting and receiving a packet using the PCnet emulation is illustrated in Figure 23 on page 72. The sequence of events corresponding to the numbered dots in the figure is described below:

1. Applications make an I/O request to the driver through system calls.
2. The pcn driver writes to the DMA descriptor using the OUT instruction. In pcn, pcn_send() calls pcn_OutCSR() to start the DMA transaction. Then, pcn_OutCSR() calls ddi_put16() to write a value to an I/O address. Next, ddi_put16() checks whether the mapping (io_handle) is for I/O space or memory space. If the mapping is for the I/O space, it moves its third argument to %rax and the port ID to %rdx, and issues the OUTW instruction to the port referenced by %dx.
pcn_send()
{
    ....
    pcn_OutCSR(pcnp, CSR0, CSR0_INEA | CSR0_TDMD);
    ...
}

static void
pcn_OutCSR(struct pcninstance *pcnp, uintptr_t reg, ushort_t value)
{
    ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RAP), reg);
    ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RDP), value);
}

ENTRY(ddi_put16)
    movl    ACC_ATTR(%rdi), %ecx
    cmpl    $_CONST(DDI_ACCATTR_IO_SPACE|DDI_ACCATTR_DIRECT), %ecx
    jne     8f
    movq    %rdx, %rax
    movq    %rsi, %rdx
    outw    (%dx)
    ret
The OUT instruction causes a VM exit. The CPU is set up by the VMM to take an unconditional VM exit if the VM executes IN/OUT/INS/OUTS, as shown in the setting of the CPU_BASED_UNCOND_IO_EXITING bit in the VM execution controls (see Table 20-6 in [7]).
#define MONITOR_CPU_BASED_EXEC_CONTROLS                 \
    ( MONITOR_CPU_BASED_EXEC_CONTROLS_SUBARCH |         \
      CPU_BASED_HLT_EXITING |                           \
      CPU_BASED_INVDPG_EXITING |                        \
      CPU_BASED_MWAIT_EXITING |                         \
      CPU_BASED_MOV_DR_EXITING |                        \
      CPU_BASED_UNCOND_IO_EXITING |                     \
      CPU_BASED_USE_TSC_OFFSETING )

void vmx_init_vmcs_config(void)
{
    ....
    _vmx_vmexit_control =
        adjust_vmx_controls(MONITOR_VM_EXIT_CONTROLS,
                            MSR_IA32_VMX_EXIT_CTLS_MSR);
    ....
}
The VM exit handler is set in the host RIP field in the HVMCSDS (see HVM Operations and Data Structure on page 64). The VM exit handler examines the exit reason and calls the I/O instruction function, vmx_io_instruction(), to handle the VM exit.
asmlinkage void vmx_vmexit_handler(struct cpu_user_regs *regs)
{
    ....
    case EXIT_REASON_IO_INSTRUCTION:
        exit_qualification = __vmread(EXIT_QUALIFICATION);
        inst_len = __get_instruction_length();
        vmx_io_instruction(exit_qualification, inst_len);
        break;
    ....
}
3. The VM exit handler for I/O instructions in the VMM examines the exit qualification and gets the OUT information from the HVMCSDS. This information includes:

- Size of the access (1 byte, 2 bytes, or 4 bytes)
- Direction of the access (IN or OUT)
- Port number
- The direction flag (DF) setting
- Size and address of the string buffer, if this is an I/O string operation
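Decoding the exit qualification can be sketched along these lines. The field layout follows the Intel manual's description (access size in the low bits, a direction bit, the port number in bits 31:16), but consult [7] for the authoritative encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of an I/O-instruction exit qualification. */
struct io_exit_info {
    unsigned size;      /* 1, 2, or 4 bytes */
    int is_in;          /* 1 = IN, 0 = OUT */
    int is_string;      /* INS/OUTS */
    uint16_t port;
};

struct io_exit_info decode_io_exit(uint64_t qual)
{
    struct io_exit_info io;
    io.size      = (unsigned)(qual & 0x7) + 1;  /* encoded as size - 1 */
    io.is_in     = (int)((qual >> 3) & 1);
    io.is_string = (int)((qual >> 4) & 1);
    io.port      = (uint16_t)(qual >> 16);
    return io;
}
```

For the OUTW issued by ddi_put16() above, the handler would see a 2-byte OUT with the port number taken from %dx.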
The VM exit handler then fills in struct ioreq fields, and sends the I/O request to its client by calling send_pio_req().
static void vmx_io_instruction(unsigned long exit_qualification,
                               unsigned long inst_len)
{
    ....
    send_pio_req(port, count, size, addr, dir, df, 1);
    ....
}
4. The client of the I/O request (qemu-dm) is blocked on the event channel device node created by the evtchn module (see Event Channels on page 43). In the VMM, hvm_send_assist_req() is called by send_pio_req() to set the event pending bit of the event channel and wake up the qemu-dm client waiting on the event.
void hvm_send_assist_req(struct vcpu *v)
{
    ....
    p->state = STATE_IOREQ_READY;
    notify_via_xen_event_channel(v->arch.hvm_vcpu.xen_port);
}
5. The QEMU emulator, qemu-dm, is a user process that contains the ioemu module for I/O emulation. The ioemu module waits on one end of the event channel for I/O requests from the VMM.
int main_loop(void)
{
    ....
    qemu_set_fd_handler(evtchn_fd, cpu_handle_ioreq, NULL, env);
    while (1) {
        ....
        main_loop_wait(10);
    }
    ....
}
When an I/O request arrives, ioemu is unblocked and cpu_handle_ioreq() is called to get the ioreq structure from the event channel. Based on the information in ioreq, the appropriate pcnet functions are invoked to handle the I/O request.
6. After pcnet decodes the ioreq structure, ioemu sends the packet to the TAP network interface. The TAP network interface [27] is a virtual Ethernet network device that provides two interfaces to applications:

- Character device /dev/tapX
- Virtual network interface tapX

where X is the instance number of the TAP interface. Applications can write Ethernet frames to the /dev/tapX character interface, and the TAP driver will deliver the frame through the tapX network interface. In the same manner, a packet that the kernel writes to the tapX network interface can be read by an application from the /dev/tapX character device node. ioemu writes the packet to the TAP character interface, which forwards the packet to the native driver interface.
static void pcnet_transmit(PCNetState *s)
{
    ....
    qemu_send_packet(s->vc, s->buffer, s->xmit_pos);
    ....
}

static void tap_receive(void *opaque, const uint8_t *buf, int size)
{
    ....
    for (;;) {
        ret = write(s->fd, buf, size);
        ....
    }
}
7. The Dom0 native driver sends the packet to the network hardware. This marks the end of transmitting a packet from DomU to the real network.
8. Dom0 receives an interrupt indicating a packet intended for DomU has arrived. This marks the beginning of receiving a packet targeted to a DomU from the real network. The native network driver forwards the packet through a bridge to the TAP network interface, tapX.
9. Next, tap_send() is invoked when data is written to the file. The packet is read from the /dev/tapX character interface. Then, qemu_send_packet() calls pcnet_receive() to send out the buffer.
static void tap_send(void *opaque)
{
    ....
    size = read(s->fd, buf, sizeof(buf));
    if (size > 0) {
        qemu_send_packet(s->vc, buf, size);
    }
}
10. The pcnet_receive() function in ioemu copies the data read from the TAP character device into VMM memory. The data can be either an I/O port value from the IN instruction or a network packet. At the end of the data transfer, pcnet informs the VMM to generate an interrupt.
static void pcnet_receive(void *opaque, const uint8_t *buf, int size)
{
    ....
    cpu_physical_memory_write(rbadr, src, count);
    ...
    pcnet_update_irq(s);
}
11. The ioemu module makes an hvm_op(set_pci_intx_level) hypercall to the VMM to generate an interrupt to the target domain.
int xc_hvm_set_pci_intx_level(
    int xc_handle, domid_t dom,
    uint8_t domain, uint8_t bus, uint8_t device, uint8_t intx,
    unsigned int level)
{
    ....
    hypercall.op = __HYPERVISOR_hvm_op;
    hypercall.arg[0] = HVMOP_set_pci_intx_level;
    hypercall.arg[1] = (unsigned long)&arg;
    ....
    rc = do_xen_hypercall(xc_handle, &hypercall);
    ....
}
The VMM sets the guest HVMCSDS area to inject an event with the next VM entry. The target VM will get an interrupt when the VMM launches a VM entry to the target domain (see Sun xVM Server Interrupt and Exception Handling for HVM on page 70).
virtualization model as the Sun xVM Server PV architecture (see Sun xVM Server I/O Virtualization on page 56). Paravirtualized drivers (PV drivers) like xbf and xnf are included in the OS distribution. When a VM is created, Dom0 exports virtual I/O devices (for example, xnf and xbf) instead of emulated I/O devices (for example, pcn and mpt) to the GOS. PV drivers are subsequently bound to these virtual devices and used for handling I/O. The I/O transactions follow the same path as described in Chapter 5, Sun xVM Server. PV drivers will be provided for Solaris 10 and Windows so they can run unmodified on Sun xVM Server with better I/O performance.
Chapter 7
Logical Domains
The Logical Domains (LDoms) technology from Sun Microsystems allows a system's resources, such as memory, CPUs, and I/O devices, to be allocated into logical groupings. Multiple isolated systems, each with its own operating system, resources, and identity within a single computer system, can then be created using these partitioned resources. Unlike Sun xVM Server, LDoms technology partitions a processor into multiple strands and assigns each strand its own hardware resources. (See Terms and Definitions on page 113.) Each virtual machine, called a domain in LDoms terminology, is associated with one or more dedicated strands. A thin layer of firmware, called the hypervisor, is interposed between the hardware and the operating system (Figure 24). The hypervisor abstracts the hardware resources and provides an interface to the operating system software.
Figure 24. The hypervisor, a thin layer of firmware, abstracts hardware resources and presents them to the OS.
The LDoms implementation includes four components:

- UltraSPARC T1/T2 processor
- UltraSPARC hypervisor
- Logical Domain Manager (LDM)
- Paravirtualized Solaris OS
Note The terms strand, hardware thread, logical processor, virtual CPU, and virtual processor are used by various documents to refer to the same concept. For consistency, the term strand is used in this chapter.
Note In Sun documents, the term hypervisor is used to refer to the hyperprivileged software that performs the functions of the VMM, and the term domain is used to refer to a VM. To accommodate Sun's terminology, hypervisor and domain (instead of VMM and VM) are used in this chapter.
This chapter assumes a basic understanding of the UltraSPARC T1/T2 processor, which plays a major role in the implementation of LDoms. (See Chapter 4, SPARC Processor Architecture on page 29.) The remainder of the chapter is organized as follows:

- Logical Domains (LDoms) Architecture Overview on page 80 provides an overview of the LDoms architecture and the other three components of LDoms: the paravirtualized Solaris OS, the UltraSPARC hypervisor, and the Logical Domain Manager.
- CPU Virtualization in LDoms on page 84 discusses CPU virtualization, including trap and interrupt handling.
- Memory Virtualization in LDoms on page 88 discusses memory virtualization, including physical memory allocation and page translations.
- I/O Virtualization in LDoms on page 91 discusses I/O virtualization and describes the operation of the disk block and network drivers.
Figure 25. A control domain, Solaris OS, and Linux guest domains running in logical domains on an UltraSPARC T1/T2 processor-powered server.
The UltraSPARC T1/T2 processor architecture is described earlier in Chapter 4, SPARC Processor Architecture on page 29. In this section, the other three components of the LDoms technology (the paravirtualized Solaris OS, the hypervisor, and the Logical Domain Manager) are discussed.
Paravirtualized Solaris OS
The Solaris kernel implementation for the UltraSPARC T1/T2 hardware class (uname -m) is referred to as the Solaris sun4v architecture. In this implementation, the Solaris OS is paravirtualized to replace operations that require hyperprivileged mode with hypervisor calls. The Solaris OS communicates with the hypervisor through a set of hypervisor APIs, and uses these APIs to request that the hypervisor perform hyperprivileged operations. Sun4v support for LDoms is a combination of partitioning the UltraSPARC T1/T2 processor into strands and virtualization of memory and I/O services. Unlike Sun xVM Server and VMware, an LDoms domain does not share strands with other domains. Each domain has one or more strands assigned to it, and each strand has its own hardware resources so that it can execute instructions independently of other strands. The virtualization of CPU functions to support CMT is implemented at the processor rather than at the software level (that is, there is no software scheduler). A Solaris guest OS can directly access strand-specific registers in a domain and can, for example, perform operations such as setting an OS trap table to the trap base address register (TBA). The Solaris sun4v architecture assumes that the platform includes the hypervisor as part of its firmware. The hypervisor runs in the hyperprivileged mode, and the Solaris
OS runs in the privileged mode of the processor. The Solaris kernel uses hypercalls to request that the hypervisor perform hyperprivileged functions of the processor. Like Intel's VT and AMD's Pacifica architectures, the sun4v architecture leverages CPU support (hyperprivileged mode) for the implementation of the hypervisor. Unlike Intel's VT and AMD's Pacifica architectures, which provide a special mode of execution for the hypervisor and thus make the hypervisor transparent to the GOS, the support for the hypervisor in UltraSPARC T1/T2 is non-transparent to the GOS. The UltraSPARC T1/T2 processors provide a set of hypervisor APIs through which the GOS delegates hyperprivileged operations to the hypervisor.
Hypervisor Services
The hypervisor layer is a component of the UltraSPARC T1/T2 system's firmware. An UltraSPARC system's firmware consists of the Open Boot PROM (OBP), Advanced Lights Out Management (ALOM), Power-On Self Test (POST), and the hypervisor. The hypervisor leverages the UltraSPARC T1/T2 hyperprivileged extensions to provide a protection mechanism for running multiple guest domains on the system. The hypervisor provides a number of services to its overlaying domains. These services include hypervisor APIs, which are the interfaces for a GOS to request hypervisor services, and Logical Domain Channel (LDC) services, which are used by virtual device drivers for inter-domain communications.

Hypervisor API

The sun4v hypervisor API [11] uses the Tcc instruction to cause the GOS to trap into hyperprivileged mode, in a fashion similar to how OS system calls are implemented. The function of the hypervisor API is equivalent to that of system calls in the OS, which enable user applications to request services from the OS. The sun4v hypervisor API allows a GOS to perform the following actions:

- Request services from the hypervisor
- Get and set CPU information through the hypervisor

The UltraSPARC Virtual Machine Specification [11] lists the complete set of services and APIs for:

- API versioning: request and check for a version of the hypervisor APIs with which the GOS may be compatible
- Domain services: enable a control domain to request information about or to affect other domains
- CPU services: control and configure a strand; includes operations such as start/stop/suspend a strand, set/get the trap base address register, and configure the interrupt queue
- MMU services: perform MMU-related operations such as configuring the TSB, mapping/demapping the TLB, and configuring the fault status register
- Memory services: zero and flush data from cache to memory
- Interrupt services: get/set interrupt enabled, target strand, and state of the interrupt
- Time-of-Day services: get/set time-of-day
- Console services: get/put a character to the console
- Channel services: provide communication channels between domains (see Logical Domain Channel (LDC) Services on page 83)

The following two examples of hv_mem_sync() and hv_api_set_version() show the implementation of hypervisor calls:
% mdb -k
> hv_mem_sync,6/ai
hv_mem_sync:
hv_mem_sync:             mov     %o2, %o4
hv_mem_sync+4:           mov     0x32, %o5
hv_mem_sync+8:           ta      %icc, %g0 + 0
hv_mem_sync+0xc:         retl
hv_mem_sync+0x10:        stx     %o1, [%o4]
> hv_api_set_version,6/ai
hv_api_set_version:
hv_api_set_version:      mov
hv_api_set_version+4:    clr
hv_api_set_version+8:    ta
hv_api_set_version+0xc:  retl
hv_api_set_version+0x10: stx
Trap types in the range 0x180-0x1FF are used to transition from privileged mode to hyperprivileged mode. In the two preceding examples, a TT value of 0x180 (offset of 0) is used for hv_mem_sync(), and a TT value of 0x1FF (offset of 0x7f) is used for hv_api_set_version().
Hypervisor calls are normally invoked during the startup of the kernel to set up strands for the domain. Only a few hypercall functions are called during the runtime of the kernel, including: hv_tod_set(), hv_tod_get(), hv_set_ctx0(), hv_mmu_map_perm_addr(), hv_mmu_unmap_perm_addr(), hv_set_ctxnon0(), and hv_mmu_set_stat_area().
Logical Domain Channel (LDC) Services

The hypervisor provides communication channels between domains. These channels are accessed within a domain as endpoints. Two endpoints are connected together, forming a bidirectional point-to-point LDC. All traffic sent to a local endpoint arrives at the corresponding endpoint at the other end of the channel in the form of short fixed-length (64-byte) message packets. Each endpoint is associated with one receive queue and one transmit queue. Messages from a channel are deposited by the hypervisor at the tail of a queue, and the receiving
domain indicates receipt by moving the corresponding head pointer for the queue. To send a packet down an LDC, a domain inserts the packet into its transmit queue, and then uses a hypervisor API call to update the tail pointer for the transmit queue. In the Solaris OS, the hypervisor LDC service is used as a simulated I/O bus interface, enabling a virtual device to communicate with a real device on the I/O domain. All virtual devices that communicate with the I/O domain for device access are leaf nodes on the LDC bus. For example, the virtual disk client driver, vdc, uses the LDC service to communicate with the virtual disk server driver, vds, on the other side of the channel. Both vdc and vds are leaf nodes on the channel bus (see I/O Virtualization in LDoms on page 91).
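The queue mechanics can be modeled with a small sketch. In a real LDC the tail update is a hypervisor API call and the queues live in memory registered with the hypervisor; everything below is illustrative:

```c
#include <assert.h>
#include <string.h>

/* Toy LDC endpoint queue: fixed 64-byte packets, a head pointer the
 * receiver advances and a tail pointer the sender advances. */
#define LDC_PKT_SIZE 64
#define LDC_QDEPTH   8

struct ldc_queue {
    unsigned char pkts[LDC_QDEPTH][LDC_PKT_SIZE];
    unsigned head;   /* receiver moves this to indicate receipt */
    unsigned tail;   /* sender moves this (via hypercall in real LDC) */
};

int ldc_send(struct ldc_queue *q, const void *pkt)
{
    unsigned next = (q->tail + 1) % LDC_QDEPTH;
    if (next == q->head)
        return -1;                           /* queue full */
    memcpy(q->pkts[q->tail], pkt, LDC_PKT_SIZE);
    q->tail = next;                          /* publish the packet */
    return 0;
}

int ldc_recv(struct ldc_queue *q, void *pkt)
{
    if (q->head == q->tail)
        return -1;                           /* queue empty */
    memcpy(pkt, q->pkts[q->head], LDC_PKT_SIZE);
    q->head = (q->head + 1) % LDC_QDEPTH;    /* acknowledge receipt */
    return 0;
}
```

Nonempty means head differs from tail, which is exactly the condition the hypervisor uses to notify the receiving domain.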
Each strand has two interrupt queues: cpu_mondo and dev_mondo. The cpu_mondo queue is used for CPU-to-CPU cross-call interrupts; the dev_mondo queue is used for device-to-CPU interrupts. The Solaris kernel allocates memory for each queue and registers these queues with the hv_cpu_qconf() hypercall. When a queue is nonempty (that is, the queue head is not equal to the queue tail), a trap is generated to the target CPU. The data of the received interrupt (mondo data) is stored in the queue.
The I/O and CPU cross-call interrupt delivery mechanism is as follows:

1. An I/O device asserts its interrupt line to generate an interrupt to the processor. The I/O bridge chip receives the interrupt request and prepares a mondo packet to be sent to the target processor, whose CPU number is stored in a bridge chip register by the OS. The mondo packet contains an interrupt number that uniquely identifies the source of the interrupt.

2. The hypervisor receives the interrupt request from the hardware through the interrupt vector trap (0x60). For example, the trap table for the T2000 firmware has the following entries:
ENTRY(htraptable)
....
TRAP(tt0_05e, HSTICK_INTR)
TRAP(tt0_05f, NOT)
TRAP(tt0_060, VECINTR)
....
The CPU number and interrupt number are also delivered along with the interrupt trap. The interrupt vector trap handler, VECINTR, uses the interrupt number to determine the source of the interrupt. If the interrupt is coming from I/O, the trap handler uses the CPU number to find the dev_mondo queue associated with the CPU and adds the interrupt to the tail of the dev_mondo queue. When the head of the queue is not equal to the tail, a trap (0x7C for CPU cross calls and 0x7D for I/O) is generated to the CPU that owns the queue.

3. Traps 0x7C and 0x7D are taken via the GOS trap table. For I/O interrupts, dev_mondo() is the trap handler for 0x7D.
# mdb -k
> trap_table+0x20*0x7c/ai
0x1000f80:
0x1000f80:      ba,a,pt %xcc, +0xc784 <cpu_mondo>
> trap_table+0x20*0x7d/ai
0x1000fa0:
0x1000fa0:      ba,a,pt %xcc, +0xc800 <dev_mondo>
>
The dev_mondo() handler takes the interrupt out of the queue by incrementing the queue head. It also finds the interrupt vector data, struct intr_vec, from the system's interrupt vector table. The struct intr_vec data contains the priority interrupt level (PIL) and the driver's interrupt service routine (ISR) for the interrupt. The dev_mondo() handler then sets the SOFTINT register with the PIL of the interrupt.
4. Setting the SOFTINT register causes an interrupt_level_n trap, 0x41-0x4f, to be generated, where n is the PIL of the interrupt. The GOS's trap handler for the interrupt_level_n trap then processes the interrupt. If the PIL of the interrupt is below the clock PIL, an interrupt thread is allocated to handle the interrupt. Otherwise, the high-level interrupt is handled by the currently executing thread.

In summary, the interrupt delivery mechanism is a two-stage process. First, an interrupt is delivered to the hypervisor as the interrupt vector trap, 0x60. Then the interrupt is added to an interrupt queue, which causes another trap to the GOS.
The high-resolution timer is provided by the rdtick instruction, which reads the counter field of the TICK register. The rdtick instruction is a privileged instruction that can be executed by the Solaris OS without hypervisor involvement.
[Table: address space identifiers for Real Address (memory), Noncacheable Real Address, Real Address Little-endian, and Noncacheable Real Address Little-endian accesses.]
The partition ID register is defined in ASI 0x58, VA 0x80 [2] with an 8-bit field for the partition ID. The full representation of each type of address is as follows:
real_address     = context_ID :: virtual_address
physical_address = partition_ID :: real_address

or:

physical_address = partition_ID :: context_ID :: virtual_address
Figure 26. Different types of addressing are used in different modes of operation.
Page Translations
Page translations in the UltraSPARC architecture are managed by software through several different types of traps (see Memory Management Unit on page 32). Depending on the trap type, traps may be handled by the hypervisor or the GOS. Table 9 summarizes the MMU-related trap types (see also Table 12-4 in [2]).
Table 9. MMU-related trap types in the UltraSPARC T1/T2 processor
[Table 9: the MMU-related trap types include iTLB Miss, dTLB Miss, and Protection Violation; each has several possible causes.]
In the hypervisor trap table, htraptable, the instructions for handling dTLB miss, trap 0x68, are:
% mdb ./ontario/release/q
> htraptable+0x20*0x68,8/ai
htraptable+0xd00:
htraptable+0xd00:   rdpr    %priv_16, %g1
htraptable+0xd04:   cmp     %g1, 3
htraptable+0xd08:   bgu,pn  %xcc, +0x73b8 <watchdog_guest>
htraptable+0xd0c:   mov     0x28, %g1
htraptable+0xd10:   ba,pt   %xcc, +0x97a0 <dmmu_miss>
htraptable+0xd14:   ldxa    [%g1] 0x4f, %g1
htraptable+0xd18:   illtrap 0
htraptable+0xd1c:   illtrap 0
The trap table transfers control to dmmu_miss() to load the page translation from the TSB. If the translation doesn't exist in the TSB, dmmu_miss() calls dtsb_miss(). The handler dtsb_miss() sets the TT register to trap type 0x31 (data_access_MMU_miss), changes the PSTATE register to the privileged mode, and transfers control to the GOS's trap handler for trap 0x31. The portion of
dtsb_miss() that performs this functionality is shown in the following example:
> dtsb_miss,80/ai
....
wrpr  %g0, 0x31, %tt     ! write 0x31 to %tt
rdpr  %pstate, %g3       ! read %pstate to %g3
or    %g3, 4, %g3
wrpr  %g3, %pstate       ! write %g3 to %pstate
rdpr  %tba, %g3          ! get privileged mode's trap
                         ! table base address
....                     ! set %g3 to the address of
                         ! trap type 0x31
....  %g3                ! jump to 0x31 trap handler
In the Solaris OS, the trap handler for trap type 0x31 calls the handler sfmmu_slow_dmmu_miss() to load the page translation from hme_blk. If no entry is found there, the page fault handler sfmmu_pagefault() is called:
% mdb -k
> trap_table+0x20*0x31,2/ai
scb+0x620:
scb+0x620:      ba,a    +0xc1b4 <sfmmu_slow_dmmu_miss>
scb+0x624:      illtrap 0
> sfmmu_pagefault,80/ai
....
sfmmu_pagefault+0x78:   sethi
sfmmu_pagefault+0x7c:   or
sfmmu_pagefault+0x80:   ba,pt
During the system boot, the OBP device tree information is passed to the Solaris OS and used to create the system device nodes. Output from the following prtconf(1M) command shows the system configuration of a typical non-I/O domain:
# prtconf
System Configuration:  Sun Microsystems  sun4v
Memory size: 1024 Megabytes
System Peripherals (Software Nodes):

SUNW,Sun-Fire-T200
    scsi_vhci, instance #0
    packages (driver not attached)
        SUNW,builtin-drivers (driver not attached)
        deblocker (driver not attached)
        disk-label (driver not attached)
        terminal-emulator (driver not attached)
        dropins (driver not attached)
        SUNW,asr (driver not attached)
        kbd-translator (driver not attached)
        obp-tftp (driver not attached)
        ufs-file-system (driver not attached)
    chosen (driver not attached)
    openprom (driver not attached)
        client-services (driver not attached)
    options, instance #0
    aliases (driver not attached)
    memory (driver not attached)
    virtual-memory (driver not attached)
    virtual-devices, instance #0
        ncp (driver not attached)
        console, instance #0
        channel-devices, instance #0
            disk, instance #0
            network, instance #0
    cpu (driver not attached)
    cpu (driver not attached)
    cpu (driver not attached)
    cpu (driver not attached)
    iscsi, instance #0
    pseudo, instance #0
As this system configuration shows, no physical devices are exported to the domain. The virtual-devices entry is the nexus node of all virtual devices. The channel-devices entry is the bus node for the virtual devices that require communication with the I/O domain. The disk and network entries are leaf nodes on the channel-devices bus.
The Solaris drivers that are specific to the LDom configuration are listed below:
LDOM drivers:
    vdc     (virtual disk client 1.4)            non-I/O domain only
    ldc     (sun4v LDC module v1.5)
    ds      (Domain Services 1.3)
    cnex    (sun4v channel-devices nexus driver)
    vnex    (sun4v virtual-devices nexus driver)
    dr_cpu  (sun4v CPU DR 1.2)
    drctl   (DR Control pseudo driver v1.1)
    qcn     (sun4v console driver v1.5)
    vnet    (vnet driver v1.4)                   non-I/O domain only
    vds     (virtual disk server v1.6)           I/O domain only
    vsw     (sun4v Virtual Switch Driver 1.5)    I/O domain only
Similar to Sun xVM Server, the LDoms VIO on a non-I/O domain uses a split device driver architecture for virtual disk and network devices. The vdc and vnet client drivers are used in non-I/O domains. The vds and vsw server drivers are used in the I/O domain to support the vdc and vnet drivers. The vnex nexus driver, the driver for the virtual-devices nexus node, provides bus services to its child nodes, vnet and vdc.
The VIO framework uses the hypervisor's Logical Domain Channel (LDC) service, described earlier, for driver communication between domains: each LDC forms a bi-directional point-to-point link that carries short fixed-length (64-byte) message packets between the transmit queue of one endpoint and the receive queue of the other.
Figure 27. Sequence of events for disk I/O from a non-I/O domain to an I/O domain.
For non-I/O domains, the following events occur when applications use read(2) and
write(2) system calls to access a file:
1. The file system calls the vdc driver's strategy(9E) entry point.
2. The vdc driver sends the I/O buf, buf(9S), to the LDC.
3. The vdc driver returns after all data is successfully sent to the LDC.
4. The vds driver is notified by the hypervisor that messages are available on its queue.
5. The vds driver retrieves data from the LDC and sends it to the device service that is mapped to the client virtual disk.
6. The vds driver starts the block I/O by sending the I/O request to the native driver and then dispatching a task queue to await I/O completion.
7. The native SCSI driver receives the device interrupt.
8. The vds driver's I/O completion task is woken up by biodone(9F), and vds sends a message to vdc indicating I/O completion.
9. The vdc driver receives the message from vds, and calls biodone(9F) to wake up anyone waiting for it.
For I/O domains, the I/O path of data requests is simpler:

10. Block I/O requests are sent directly from the file system to the native driver.

In addition to the strategy(9E) driver entry point for supporting file system and raw device access, the vdc driver also supports most of the ioctl(2) commands as
defined in dkio(7I) for disk control. The Solaris kernel variable dk_ioctl1 defines the exact disk ioctl commands supported by the vdc driver.
Network Driver
The Solaris LDoms network drivers include a client network driver, vnet, and a virtual switch, vsw, on the server side. To transmit a packet, vnet sends the packet over the LDC to vsw. The binding of vnet to vsw is defined in the vnet resource of the domain when the domain is created. The vsw driver forwards the packet to the native driver, and includes the IP address of vnet as the source address. The vnet driver returns as soon as the packet has been put on a buffer and the buffer has been added to the tail of the LDC queue.

When receiving packets from the network, if the native driver is configured as a virtual switch in the vswitch resource of the domain, the packet is passed up from the native driver to vsw. The vsw driver finds the MAC address associated with the destination IP address from its ARP table, gets the target domain from the MAC address, and gets the vnet interface from the vnet resource. The packet is then sent to the LDC of the designated vnet driver. The vnet driver uses Solaris GLD v3 interfaces and is fully compatible with native drivers using the same GLD v3 interface.

Figure 28 depicts the flow of receiving a packet from the network through an I/O domain to a guest domain. The sequence of operations for receiving packets is as follows:

1. Data is stored via DMA into the receive buffer ring of the native driver, e1000g.
2. The vsw driver sends the packet to the client driver, vnet, through the LDC.
3. The LDC receive worker thread gets the packet and sends it to the vnet driver.
1. Information on the Solaris kernel variable dk_ioctl can be looked up at the Web site: http://www.opensolaris.org/.
I/O Domain

ldc`ldc_write
vsw`vsw_dringsend+0x234
vsw`vsw_portsend+0x60
vsw`vsw_forward_all+0x134
vsw`vsw_switch_l2_frame+0x248
mac`mac_rx+0x58
e1000g`e1000g_intr_pciexpress+0xb8
px`px_msiq_intr+0x1b8
intr_thread+0x170
cpu_halt+0xc0
idle+0x128
thread_start+4
Figure 28. Flow of control for receiving a network packet from an I/O domain to a guest domain.
Chapter 8
VMware
VMware, the current market leader in virtualization software for x86 platforms, offers three virtual machine systems: VMware Workstation; the no-cost VMware Server, formerly known as VMware GSX Server; and VMware Infrastructure 3, a suite of virtualization products based on VMware ESX Server Version 3.

The VMware Workstation and VMware Server products are add-on software modules that run on a host OS such as Windows, Linux, or BSD variants (Figure 29). In these implementations the VMM is a part of, and has the same privilege as, the host OS kernel. The guest OS runs as an application on the host OS. The Solaris OS can only run as a guest OS on VMware Workstation and VMware Server.

The VMware Infrastructure suite of products is built around VMware ESX Server. VMware ESX Server runs on bare metal and uses a derived version of SimOS [18] as the kernel for running the VMM and I/O services. All other operating systems run as guest OSes. VMware Infrastructure supports Windows, Linux, and Solaris as guest OSes. VMware ESX Server provides lower overhead and better control of system resources than VMware Workstation and VMware Server. However, because it provides all device drivers, it supports fewer devices than VMware Workstation and VMware Server. Figure 29 shows the configuration of VMware ESX Server and GSX Server.
Figure 29. VMware GSX Server (VMware Workstation and VMware Server products) runs within a host operating system, while VMware ESX Server runs on the bare metal.
VMware ESX Server is a Type I VMM, and has exclusive control of hardware resources (see Types of VMM on page 10). In contrast, VMware Workstation and VMware Server are Type II VMMs, and leverage the host OS by running inside the OS kernel.
[Figure 30. VMware ESX Server architecture: guest applications with network and SCSI drivers run above the VMM; the VMkernel provides the network stack, storage stack, storage emulation, and storage drivers; a service console exposes the management interface; CPU, network, and storage hardware sit below.]
The following sections discuss the functional components of VMware Infrastructure, with particular emphasis on the virtualization layer which forms the core of all VMware virtualization products.
The core of the ESX virtualization layer is the VMM, which includes three modules (Figure 31) [12]:

Execution decision module: decides whether VM instructions should be sent to the direct execution module or the binary translation module

Binary translation module: used to execute the VM whenever the hardware processor is in a state in which direct execution cannot be used

Direct execution module: enables the VM to directly execute its instruction sequences on the underlying hardware processor
Figure 31. VMware ESX Server virtualizes the CPU hardware through binary translation whenever the processor itself cannot directly execute an instruction.
The decision to use binary translation or direct execution depends on the state of the processor and whether the segment is reversible or not (see Segmented Architecture on page 23). If the content of the descriptor table, for example the GDT, is changed by the VMM because of a context switch to another VM, the segment is non-reversible. Direct execution can be used only if the VM is running in an unprivileged mode and the hidden descriptors of the segment register are reversible. In all other cases, the VMM will switch to the binary translation module.
Binary Translation
The binary translation (BT) module is believed to be influenced by the machine simulators Shade [13] and Embra [14]. Embra is part of SimOS [18], which was developed by a Stanford team led by Mendel Rosenblum, one of the founders of VMware. While extensive details of the BT module implementation have not been published, Agesen [15], Embra [14], and Shade [13] provide some information on its implementation. The BT module translates GOS instructions, which are running in a deprivileged VM, into instructions that can run in the privileged VMM segment. The BT module receives x86 binary instructions, including privileged instructions, as input. The output of the module is a set of instructions that can be safely executed in non-privileged mode. Agesen [15] gives an example of how control flow is handled in the BT module.
100
VMware
To avoid frequently retranslating blocks of instructions, translated blocks are kept in a Translation Cache (TC). The execution of a block of instructions is simulated by locating the block's translation in the TC and jumping to it. A hash table maintains the mappings from a program counter to the address of the translated code in the TC. The main loop of the dynamic binary translation simulator is shown in Figure 32. The loop checks to see if the current simulated program counter address is present in the TC. If it is present, the translated block is executed. If it is not, the translator is called to add the block to the TC. Each block of translated code ends by loading the new simulated program counter and jumping back to the main loop for dispatching.
main() {
    ....
    /* dispatch loop */
    if (PC_not_in_TC(pc))
        tc = translate(pc);
    newpc = pc_to_tc(pc);
    jump_to_pc(newpc);
    ....
}

translate(pc) {
    ....
    blk = read_instructions(pc);
    perform_translation(blk);
    write_into_TC(blk);
    ....
}
Figure 32. Binary translation manages a translation cache to reduce the need to re-translate frequently executed blocks of instructions.
A more detailed description of binary translation is beyond the scope of this paper. Readers should refer to Shade [13] and Embra [14] for more details about dynamic binary translation.

Some privileged instructions that have simple operations use in-TC sequences. For example, a clear interrupt instruction (cli) can be replaced by setting a virtual processor flag. Privileged instructions that have more complex operations (such as setting cr3 during a context switch) require a call out of the TC to perform the emulation work.

In addition to binary translation and logic for determining the code execution, the virtualization layer employs other techniques to overcome x86 virtualization issues:

Memory Tracing: The virtualization layer traces modifications on any given physical page of the virtual machine, and is notified of all read and write accesses made to that page in a transparent manner. This memory tracing ability in the VMM is enabled by page faults and the ability to single-step the virtual machine via binary translation.
101
VMware
Shadow Descriptor Tables: The x86's segmented architecture (see Segmented Architecture on page 23) has a segment caching mechanism that allows the segment register's hidden fields to be re-used. However, this approach can cause difficulty if the descriptor table is modified in a non-coherent way. The virtualization layer supports the GOS system descriptor tables using VMM shadow descriptor tables. The VMM descriptor tables include shadow descriptors that correspond to predetermined descriptors of the VM descriptor tables. The VMM also includes a segment tracking mechanism that compares the shadow descriptors with their corresponding VM segment descriptors. This mechanism detects any lack of correspondence between the shadow descriptor tables and their corresponding VM descriptor tables, and updates the shadow descriptors so that they correspond to their respective VM segment descriptors.

The ESX Server's VMM implementation is unique in that each GOS has an associated VMM. The ESX Server may include any number of VMMs in a given physical system, each supporting a corresponding VM; the number of VMMs is limited only by available memory and speed requirements. The features in the virtualization layer mentioned in the previous discussion allow multiple concurrent VMMs, with each VMM supporting an unmodified GOS.
CPU Scheduling
The ESX Server implements a rate-based proportional-share scheduler [19] that is similar to the fair-share scheduler used by the Solaris OS (see [21] Chapter 8), in which each virtual machine is given a number of shares. The amount of CPU time given to each VM is based on its fractional share of the total number of shares of active VMs in the whole system. The term share is used to define a portion of the system's CPU resources that is allocated to a VM. If a greater number of CPU shares is assigned to a VM, relative to other VMs, then that VM receives more CPU resources from the scheduler. CPU shares are not equivalent to percentages of CPU resources. Rather, shares define the relative weight of a CPU load in a VM in relation to the CPU loads of other VMs. The following formula shows how the scheduler calculates per-domain allocation of CPU resources.
Allocation_domain = Shares_domain / TotalShares
The ESX scheduler allows specifying minimum (reservation) and maximum (limit) CPU utilization for each virtual machine. A minimum CPU reservation guarantees that a virtual machine always has this minimum percentage of a physical CPU's time
allocated to it, regardless of the total number of shares. A maximum CPU limit ensures that the virtual machine never uses more than this maximum percentage of a physical CPU's time, even if extra idle time is available. The proportional-share algorithm is applied only if the VM CPU utilization falls within the range between the reservation and limit CPU utilization. Figure 33 shows how CPU resource allocation is calculated.
In an SMP environment in which a VM can have more than one virtual CPU (VCPU), a scalability issue arises when one VCPU is spinning on a lock held by another VCPU that gets de-scheduled. The spinning VCPU wastes CPU cycles spinning on the lock until the lock owner VCPU is finally scheduled again and releases the lock. ESX implements co-scheduling to work around this problem. In co-scheduling (also called gang scheduling), all virtual processors of a VM are mapped one-to-one onto the underlying processors and simultaneously scheduled for an equal time slice. The ESX scheduler guarantees that no VCPUs are spinning on a lock held by a VCPU that has been preempted. However, co-scheduling does introduce other problems. Because all VCPUs are scheduled at the same time, co-scheduling activates a VCPU regardless of whether there are jobs in the VCPU's run queue. Co-scheduling also precludes multiplexing multiple VCPUs on the same physical processor.
Timer Services
Similar to Sun xVM Server, ESX Server faces the issue of getting clock interrupts delivered to VMs at the configured interval [16]. This issue arises because the VM may not be scheduled when interrupts are due to be delivered. ESX Server keeps track of the clock interrupt backlog and tries to deliver clock interrupts at a higher rate when the backlog gets large. However, the backlog can get so large that it is not possible for the GOS to catch up with real time. In such cases, ESX Server stops attempting to catch
up if the clock interrupt backlog grows beyond 60 seconds. Instead, ESX Server sets its record of the clock interrupt backlog to zero and synchronizes the GOS clock with the host machine clock. ESX Server virtualizes the Time Stamp Counter (TSC) so that the virtualized TSC counter matches the GOS clock (see Time Stamp Counter (TSC) on page 28). When the clock interrupt backlog is cleared, either by catching up or by the reset when the backlog is too large, the virtualized TSC catches up with the adjusted clock.
Page Translations
Each GOS in the ESX Server maintains page tables for virtual-to-physical address mappings. The VMM also maintains shadow page tables for the virtual-to-machine page mappings along with physical-to-machine mappings in its memory. The processor's MMU uses the VMM's shadow page table. When a GOS updates its page tables with a virtual-to-physical translation, the VMM intercepts the instruction, gets the physical-to-machine mapping from its memory, and loads the shadow page table with the virtual-to-machine mapping. This mechanism allows normal memory accesses in the VM to execute without adding address translation overhead if the shadow page tables are set up for that access.
sound chip [20]. In addition, ESX Server also provides virtual PCI emulation for PCI addon devices such as SCSI, Ethernet, and SVGA graphics (see Figure 30 on page 98). The device tree as exported by the VMM to a GOS is shown in the following
prtconf(1M) output.
% prtconf
System Configuration:  Sun Microsystems  i86pc
Memory size: 1648 Megabytes
System Peripherals (Software Nodes):

i86pc
    scsi_vhci, instance #0
    isa, instance #0
        i8042, instance #0
            keyboard, instance #0
            mouse, instance #0
        lp (driver not attached)
        asy, instance #0 (driver not attached)
        asy, instance #1 (driver not attached)
        fdc, instance #0
            fd, instance #0
    pci, instance #0
        pci15ad,1976 (driver not attached)
        pci8086,7191, instance #0
            pci15ad,1976 (driver not attached)
            pci-ide, instance #0
                ide, instance #0
                    sd, instance #16
                ide (driver not attached)
            pci15ad,1976 (driver not attached)
            display, instance #0
            pci1000,30, instance #0
                sd, instance #0
            pci15ad,750, instance #0
    iscsi, instance #0
    pseudo, instance #0
    options, instance #0
    agpgart, instance #0 (driver not attached)
    xsvc, instance #0
    objmgr, instance #0
    acpi (driver not attached)
    used-resources (driver not attached)
    cpus (driver not attached)
        cpu, instance #0 (driver not attached)
        cpu, instance #1 (driver not attached)
The PCI vendor ID of VMware is 15ad. The following entries are relevant to VMware I/O virtualization:

Device Entry     Description
pci15ad,750      VMware emulation of Intel's 100FX Gigabit Ethernet
pci8086,7191     VMware emulation of the Intel 440BX/ZX PCI bridge chip
pci1000,30       The LSI Logic 53C1020/1030 SCSI controller
pci15ad,1976     VMware virtual SVGA
For the example device tree shown here, the Solaris OS binds the e1000g driver to pci15ad,750 and uses e1000g as the network driver. The actual network hardware used on the system is a Broadcom NetXtreme Dual Gigabit Adapter with the PCI ID pci14e4,1468. VMware translates the e1000g device interfaces passed by the Solaris e1000g driver, and sends them to the Broadcom NetXtreme device. For storage, unlike Sun xVM Server, ESX Server continues to use sd as the interface to file systems. The emulation of the disk interface is provided at the SCSI bus adapter interface (the LSI Logic SCSI controller) instead of at the SCSI target interface (the SCSI disk driver, sd).
Device Emulation
Each storage device, regardless of the specific adapter, appears as a SCSI drive connected to an LSI Logic SCSI adapter within the VM. For network I/O, ESX Server emulates an AMD Lance/PCnet or Intel e1000g device, or uses a custom interface called vmxnet, for the physical network adapter. VMware provides device emulation rather than the I/O emulation used by Sun xVM Server and UltraSPARC LDoms (see I/O Virtualization on page 16). In a simple scenario, consider an application within the VM making an I/O request to the GOS, as illustrated in Figure 34:
1. Applications perform I/O operations through the interface to the device as exported by the VMware VMM (see VMware I/O Virtualization on page 103). The virtual device interface uses the native drivers (for example, e1000g for network and mpt for the LSI SCSI HBA) in the Solaris kernel.
2. The Solaris native driver attempts to access the device via the IN/OUT instructions (for example, by writing a DMA descriptor to the device's DMA engine).
3. The VMM intercepts the I/O instructions and then transfers control to the device-independent module in the VMkernel for handling the I/O request.
4. The VMkernel converts the I/O request from the emulated device to one for the real device, and sends the converted I/O request to the driver for the real device.
5. The VMware driver sends the I/O request to the real I/O device.
6. When an I/O request completion interrupt (for example, a DMA completion interrupt) arrives, the VMkernel device driver receives and processes the interrupt.
7. The VMkernel then notifies the VMM of the target virtual machine, which copies data to the VM memory and then raises the interrupt to the GOS.
8. The Solaris driver's interrupt service routine (ISR) is called.
9. The Solaris driver performs a sequence of I/O accesses (for example, reads the transaction status, acknowledges receipt) to the I/O ports before passing the data to its applications.
The VMkernel ensures that data intended for each virtual machine is isolated from other VMs.
Section III
Additional Information
Appendix A: VMM Comparison (page 109) Appendix B: References (page 111) Appendix C: Terms and Definitions (page 113) Appendix D: Author Biography (page 117)
Appendix A
VMM Comparison
This chapter presents a summary comparison of the four virtual machine monitors discussed in this paper: Sun xVM Server without HVM, Sun xVM Server with HVM, VMware, and Logical Domains (LDoms). Table 10 summarizes their general characteristics; provides information on their CPU, memory, and I/O virtualization implementations; and lists the management options available for each.
Table 10. VMM comparison (general, CPU, and memory characteristics)

General
    Sun xVM Server w/o HVM: VMM version 3.0.4; supported ISA x86 and IA-64; VMM runs on bare metal; paravirtualization; supported GOS Linux, NetBSD, FreeBSD, OpenBSD, Solaris; SMP GOS yes; 64-bit GOS yes; maximum VMs limited by memory; method of operation modified GOS; license GPL (free).
    Sun xVM Server w/HVM: VMM version 3.0.4; supported ISA x86 and IA-64; VMM runs on bare metal; full virtualization; supported GOS Linux, NetBSD, FreeBSD, OpenBSD, Windows; SMP GOS yes; 64-bit GOS yes; maximum VMs limited by memory; method of operation hardware virtualization; license GPL (free).
    VMware: supported GOS Windows, Linux, NetWare, Solaris; SMP GOS yes; 64-bit GOS yes; maximum VMs limited by memory; method of operation binary translation; proprietary license.
    LDoms: supported GOS Linux, Solaris; SMP GOS yes; 64-bit GOS yes; maximum of 32 domains on UltraSPARC T1 and 64 on UltraSPARC T2; method of operation modified OS; license CDDL (free).

CPU
    Sun xVM Server w/o HVM: credit scheduler; VMM privilege mode privileged (ring 0); GOS privilege mode unprivileged (ring 3 for a 64-bit kernel, ring 1 for a 32-bit kernel); fractional CPU granularity; interrupts queued and delivered when the VM runs.
    Sun xVM Server w/HVM: credit scheduler; VMM privileged (ring 0); GOS reduced privilege; fractional CPU granularity; interrupts queued and delivered when the VM runs.
    VMware: fair-share scheduler; VMM privileged; GOS deprivileged.
    LDoms: no CPU scheduling (N/A); VMM hyperprivileged; GOS privileged.

Memory
    Sun xVM Server w/o HVM: page translation by hypercall to the VMM; physical memory allocation by balloon driver; page tables managed by the VMM.
    Sun xVM Server w/HVM: shadow page tables; balloon driver; page tables managed by the VMM.
    VMware: shadow page tables; balloon driver; page tables managed by the VMM.
I/O
    Sun xVM Server w/o HVM: I/O granularity shared; I/O virtualization by I/O emulation in Dom0; native drivers on DomU and Dom0.
    Sun xVM Server w/HVM: I/O granularity shared; device emulation by QEMU or I/O emulation by Dom0; native drivers on DomU and Dom0 (QEMU).
    VMware: I/O granularity shared; device emulation by the vmkernel; native drivers on the guest supported by the VMM.
    LDoms: I/O granularity PCI bus; I/O emulation by the I/O domain; virtual drivers on non-I/O domains and native drivers on the I/O domain.

Management
    Sun xVM Server (w/o and w/HVM): Dom0 (single point of failure); CLI: xm(1); GUI: virt-manager.
    VMware: service console (single point of failure); GUI: Virtual Center.
    LDoms: control domain; CLI: ldm(1M), XML, and SNMP MIBs.
Appendix B
References
1. Popek, Gerald J. and Goldberg, Robert P. "Formal Requirements for Virtualizable Third Generation Architectures," Communications of the ACM 17(7), pages 412-421, July 1974.
2. UltraSPARC Architecture 2005: One Architecture.... Multiple Innovative Implementations, Draft D0.9, 15 May 2007.
3. Robin, John Scott and Irvine, Cynthia E. "Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor," Proceedings of the 9th USENIX Security Symposium, August 2000.
4. VMware: http://www.vmware.com/vinfrastructure/
5. Waldspurger, Carl A. "Memory Resource Management in VMware ESX Server," Proceedings of the 5th Symposium on Operating Systems Design and Implementation, December 2002.
6. Xen, The Xen virtual machine monitor, University of Cambridge Computer Laboratory: http://www.cl.cam.ac.uk/research/srg/netos/xen/
7. IA-32 Intel Architecture Software Developer's Manual, March 2006.
8. System V Application Binary Interface, AMD64 Architecture Processor Supplement, Draft Version 0.98, September 27, 2006. http://www.x86-64.org/documentation/abi.pdf
9. AMD64 Architecture Programmer's Manual, Volume 2: System Programming, Rev. 3.12, September 2006.
10. OpenSPARC T1 Microarchitecture Specification, Revision A, August 2006.
11. UltraSPARC Virtual Machine Specification (The sun4v architecture and Hypervisor API specification), Revision 1.0, January 24, 2006.
12. Devine, Scott W.; Bugnion, Edouard; Rosenblum, Mendel. Virtualization system including a virtual machine monitor for a computer with a segmented architecture, U.S. Patent 6,397,242, October 26, 1998.
13. Cmelik, Robert F. and Keppel, David. "Shade: A Fast Instruction-Set Simulator for Execution Profiling," ACM SIGMETRICS Performance Evaluation Review, pages 128-137, May 1994.
14. Witchel, Emmett and Rosenblum, Mendel. "Embra: Fast and Flexible Machine Simulation," Proceedings of ACM SIGMETRICS '96: Conference on Measurement and Modeling of Computer Systems, 1996.
15. Adams, Keith and Agesen, Ole. "A Comparison of Software and Hardware Techniques for x86 Virtualization," ASPLOS 2006, San Jose, CA, USA, October 21-25, 2006.
16. Timekeeping in VMware Virtual Machines, VMware white paper, August 2005.
17. Bittman, T. Gartner RAS Core Strategic Planning SPA-21-5502, Research Note 14, November 2003.
18. Rosenblum, Mendel; Herrod, Stephen A.; Witchel, Emmett; and Gupta, Anoop. "Complete Computer Simulation: The SimOS Approach," IEEE Parallel and Distributed Technology, pages 34-43, Winter 1995.
19. VMware ESX Server 2 Architecture and Performance Implications, VMware white paper, 2005.
20. Sugerman, Jeremy; Venkitachalam, Ganesh; and Lim, Beng-Hong. "Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor," Proceedings of the 2001 USENIX Annual Technical Conference, Boston, Massachusetts, USA, June 25-30, 2001.
21. System Administration Guide: Solaris Containers - Resource Management and Solaris Zones, Part No. 817-1592-14, June 2007.
22. Drakos, Nikos; Hennecke, Marcus; Moore, Ross; and Swan, Herb. Xen Interface
30. PCI-SIG, Address Translation Services, Revision 1.0, March 8, 2007.
31. AMD I/O Virtualization Technology (IOMMU) Specification, Revision 1.20, Publication #34434, February 2007.
32. Nakajima, Jun; Mallick, Asit; Pratt, Ian; and Fraser, Keir. "x86-64 XenLinux: Architecture, Implementation, and Optimizations," Proceedings of the Linux Symposium, July 19-22, 2006, Ontario, Canada.
33. OpenSPARC T2 Core Microarchitecture Specification, Revision 5, July 2007.
34. UltraSPARC Architecture 2007: Hyperprivileged, Privileged, and Nonprivileged, Draft D0.91, August 2007.
35. PCI-SIG, Single Root I/O Virtualization and Sharing Specification, Revision 1.0, September 11, 2007.
Appendix C
OS includes: file system, devices, networking, security, and Inter-Process Communication (IPC).

Pacifica
AMD's implementation of hardware virtualization, also known as AMD-V or AMD SVM.

Paravirtualization
An implementation of virtual machines that requires the guest OS to be modified to run in the VM. Paravirtualization provides partial emulation of the underlying hardware to a VM; the guest OS must replace all sensitive instructions with calls that pass control to the VMM, which handles these operations.

Privileged Instructions
Privileged instructions are those that trap if the processor is running in user mode and do not trap if the processor is running in supervisor mode.

Secure Virtual Machine (SVM)
AMD's implementation of hardware virtualization, also known as Pacifica or AMD-V (see [9], Chapter 15).

Sensitive Instructions
Sensitive instructions [1][12] are those that change the configuration of resources (memory), affect the processor mode without going through the memory trap sequence (page fault), or whose behavior depends on the processor mode or the contents of the relocation register. If the sensitive instructions are a subset of the privileged instructions, it is relatively easy to build a VM: every sensitive instruction results in a trap, and the underlying VMM can process the trap and emulate the instruction's behavior. If some sensitive instructions are not privileged instructions, special measures must be taken to handle them.

Shadow Page
A technique for hiding the layout of machine memory from a virtual machine's operating system. A virtual page table is presented to the guest OS by the VMM but is not connected to the processor's memory management unit (MMU). The VMM is responsible for trapping accesses to the table, validating updates, and maintaining consistency with the real page table that is visible to the processor MMU.
A shadow page table is typically used to provide full virtualization to a VM.

Simple Earliest Deadline First (sEDF)
One of the scheduling algorithms used in the Sun xVM Hypervisor for x86 to schedule domains. See the section CPU Scheduling on page 48 for a detailed description of sEDF.

Strand
A strand [2] is the state that hardware must maintain in order to execute a software thread. Specifically, a strand is the software-visible state (PC, NPC, general-purpose registers, floating-point registers, condition codes, status registers, ASRs, etc.) of a thread, plus any microarchitectural state required by the hardware for its execution. Strand replaces the ambiguous term hardware thread. The number of strands in a processor defines the number of threads that an operating system can schedule on that processor at any given time.

Sun xVM Hypervisor for x86
The VMM of the Sun xVM Server.

Sun xVM Infrastructure
Sun Cross Virtualization and Management Infrastructure, a complete solution for virtualizing and managing the data center: Sun xVM Infrastructure = Sun xVM Server + Sun xVM Ops Center.

Sun xVM Ops Center
The management suite for the Sun xVM Server.
115
Sun xVM Server
A paravirtualized Solaris OS that includes support for the Xen open source community work on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform. In this paper, Sun xVM Server refers specifically to the Sun xVM Server for the x86 platform.

Vanderpool
Intel's implementation of hardware virtualization, also known as Intel VT.

Virtual CPU (VCPU)
An entity that can be dispatched by the scheduler of a guest OS. For LDoms on UltraSPARC processors, a VCPU is also known as a strand, hardware thread, or logical processor.

Virtual Machine (VM)
A discrete execution environment that abstracts computer platform resources to an operating system. Each virtual machine runs an independent and separate instance of an operating system. Popek and Goldberg [1] also define a VM as an efficient, isolated duplicate of a real machine.

Virtual Machine Monitor (VMM)
A software layer that runs directly on top of the hardware and virtualizes all resources of the computer system. The VMM layer sits between the VMs and the hardware resources; it abstracts the hardware resources to the VMs and performs privileged and sensitive actions on behalf of the VMs.

Virtualization Technology (VT)
Intel's implementation of hardware virtualization, also known as Vanderpool.

Xen
An open source VMM for x86, IA-64, and PPC [6].
Appendix D
Author Biography
Chien-Hua Yen is currently a senior staff engineer in the ISV Engineering group at Sun. Before joining Sun more than 12 years ago, he worked as a software development engineer at several Silicon Valley companies on UNIX file systems, real-time embedded systems, and device drivers. His first job at Sun was with the kernel I/O group, developing a kernel virtual memory segment driver for device memory mapping. After the kernel group, he worked with third-party hardware vendors on developing PCI drivers for the Solaris OS and high-availability products for the Sun CompactPCI board. In the last two years, Chien-Hua has been working with ISVs on application performance tuning, Solaris 10 adoption, and Solaris virtualization.
Acknowledgements
The author would like to thank Honlin Su, Lodewijk Bonebakker, Thomas Bastian, Ray Voight, and Joost Pronk for their invaluable comments; Patric Change for his encouragement and support; Suzanne Zorn for her editorial work; and Kemer Thompson for his constructive comments and his coordination of the reviews.
Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com
© 2007 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Java, JVM, Solaris, and Sun BluePrints are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon architecture developed by Sun Microsystems, Inc. Information subject to change without notice. Printed in USA 11/07