820 3703
Table of Contents
Introduction .............................................. 1
    Hardware Level Virtualization ......................... 2
    Scope ................................................. 4
Section 1: Background Information ......................... 7
    Virtual Machine Monitor Basics ........................ 9
        VMM Requirements .................................. 9
        VMM Architecture ................................. 11
    The x86 Processor Architecture ....................... 21
    SPARC Processor Architecture ......................... 29
Section 2: Hardware Virtualization Implementations ....... 37
    Sun xVM Server ....................................... 39
        Sun xVM Server Architecture Overview ............. 40
        Sun xVM Server CPU Virtualization ................ 45
        Sun xVM Server Memory Virtualization ............. 52
        Sun xVM Server I/O Virtualization ................ 56
    Sun xVM Server with Hardware VM (HVM) ................ 63
        HVM Operations and Data Structure ................ 64
        Sun xVM Server with HVM Architecture Overview .... 68
    Logical Domains ...................................... 79
        Logical Domains (LDoms) Architecture Overview .... 80
        CPU Virtualization in LDoms ...................... 84
        Memory Virtualization in LDoms ................... 88
        I/O Virtualization in LDoms ...................... 91
    VMware ............................................... 97
        VMware Infrastructure Overview ................... 98
        VMware CPU Virtualization ........................ 98
        VMware Memory Virtualization .................... 103
        VMware I/O Virtualization ....................... 103
Section 3: Additional Information ....................... 107
    VMM Comparison ...................................... 109
    References .......................................... 111
    Terms and Definitions ............................... 113
    Author Biography .................................... 117
Chapter 1
Introduction
In the IT industry, virtualization is a mechanism for presenting a set of logical computing resources over a fixed hardware configuration, so that these logical resources can be accessed in the same manner as the original hardware configuration. The concept of virtualization is not new. First introduced in the late 1960s on mainframe computers, virtualization has recently become popular as a means to consolidate servers and reduce the costs of hardware acquisition, energy consumption, and space utilization. The hardware resources that can be virtualized include computer systems, storage, and the network.

Server virtualization can be implemented at different levels of the computing stack, including the application level, operating system level, and hardware level:

An example of application level virtualization is the Virtual Machine for the Java platform (Java Virtual Machine, or JVM)1. The JVM implementation provides an application execution environment as a layer between the application and the OS, removing application dependency on OS-specific APIs and hardware-specific characteristics.

OS level virtualization abstracts OS services such as file systems, devices, networking, and security, and provides a virtualized operating environment to applications. Typically, OS level virtualization is implemented by the OS kernel: only one instance of the kernel runs on the system, and it provides multiple virtualized operating environments to applications. Examples of OS level virtualization include Solaris Containers technology, Linux VServers, and FreeBSD Jails. OS level virtualization has less performance overhead and better system resource utilization than hardware level virtualization. However, since one OS kernel is shared among all virtual operating environments, isolation among the virtualized operating environments is only as good as the OS provides.
Hardware level virtualization, discussed in detail in this paper, has become popular recently because of increasing CPU power and the low utilization of CPU resources in the IT data center. Hardware level virtualization allows a system to run multiple OS instances. With less sharing of system resources than OS level virtualization, hardware virtualization provides stronger isolation of operating environments. The Solaris OS includes bundled support for application and OS level virtualization with its JVM software and Solaris Containers offerings. Sun first added support for hardware virtualization in the Solaris 10 11/06 release with Sun Logical Domains (LDoms) technology, supported on Sun servers that utilize UltraSPARC T1 or UltraSPARC T2
1. The terms "Java Virtual Machine" and "JVM" mean a Virtual Machine for the Java(TM) platform.
processors. VMware also supports the Solaris OS as a guest OS in its VMware Server and Virtual Infrastructure products, starting with the Solaris 10 1/06 release. In October 2007, Sun announced the Sun xVM family of products, which includes the Sun xVM Server and the Sun xVM Ops Center management system:

- Sun xVM Server: includes support for the Xen open source community work [6] on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform
- Sun xVM Ops Center: a management suite for the Sun xVM Server
Note: In this paper, in order to distinguish the discussion of x86 and UltraSPARC T1/T2 processors, Sun xVM Server is specifically used to refer to the Sun hardware virtualization product for the x86 platform, and LDoms is used to refer to the Sun hardware virtualization product for the UltraSPARC T1 and T2 platforms.
The hardware virtualization technology, and the new products built around it, have expanded the options and opportunities for deploying servers with better utilization, more flexibility, and enhanced functionality. In reaping the benefits of hardware virtualization, IT professionals also face the challenge of operating within the limitations of a virtualized environment while delivering the same level of service as the physical operating environment. Meeting this requirement demands a good understanding of virtualization technologies, CPU architecture, and software implementations, as well as an awareness of their strengths and limitations.
Hardware resource virtualization can take the form of sharing, partitioning, or delegating:

- Sharing: Resources are shared among VMs, and the VMM coordinates their use. For example, the VMM may include a CPU scheduler that runs the threads of VMs based on a pre-determined scheduling policy and VM priority.
- Partitioning: Resources are partitioned so that each VM gets the portion of resources allocated to it. Partitioning can be dynamically adjusted by the VMM based on the utilization of each VM. Examples of resource partitioning include the ballooning memory technique employed in Sun xVM Server and VMware, and the allocation of CPU resources in Logical Domains technology.
- Delegating: Resources are not directly accessible by a VM. Instead, all resource accesses are made through a control VM that has direct access to the resource. I/O device virtualization is normally implemented via delegation.

The distinctions and boundaries between these virtualization methods are often not clear. For example, sharing may be used for one component and partitioning for others, and together they make up an integral functional module.
hardware and maintenance expenses, floor space, cooling costs, and power consumption.

Workload Migration

Hardware level virtualization decouples the OS from the underlying physical platform resources. A guest OS state, along with the user applications running on top of it, can be encapsulated into an entity and moved to another system. This capability is useful for migrating a legacy OS from an old, under-powered server to a more powerful server while preserving the investment in software. When a server needs maintenance, a VM can be dynamically migrated to a new server with no down time, further enhancing availability. Changes in workload intensity can be addressed by dynamically shifting underlying resources to the starving VMs. Legacy applications that ran natively on a server continue to run on the same OS inside a VM, leveraging the existing investment in applications and tools.

Workload Isolation

Workload isolation includes fault and security isolation. Multiple guest OSes run independently, so a software failure in one VM does not affect other VMs. However, the VMM layer introduces a single point of failure that can bring down all VMs on the system. A VMM failure, although potentially catastrophic, is less probable than a failure in the OS because the complexity of the VMM is much less than that of an OS. Multiple VMs also provide strong security isolation among themselves, with each VM running an independent OS. Security intrusions are confined to the VM in which they occur. The boundary around each VM is enforced by the VMM, and inter-domain communication, if provided by the VMM, is restricted to specific kernel modules only.

One distinct feature of hardware level virtualization is the ability to run multiple instances of heterogeneous operating systems on a single hardware platform.
This feature is important for the following reasons:

- Better security and fault containment among application services can be achieved through OS isolation.
- Applications written for one OS can run on a system that supports a different OS.
- Better management of system resource utilization is possible among the virtualized environments.
Scope
This paper explores the underlying hardware architecture and software implementation for enabling hardware virtualization. Great emphasis has been placed on the CPU hardware architecture limitations for virtualizing CPU services and their software workarounds. In addition, this paper discusses in detail the software architecture for implementing the following types of virtualization:
- CPU virtualization: uses the processor privileged mode to control resource usage by the VM, and relays hardware traps and interrupts to VMs
- Memory virtualization: partitions physical memory among multiple VMs and handles page translations for each VM
- I/O virtualization: uses a dedicated VM with direct access to I/O devices to provide device services

The paper is organized into three sections. Section I, Background Information, contains information on VMMs and provides details on the x86 and SPARC processors:

- Virtual Machine Monitor Basics on page 9 discusses the core of hardware virtualization, the VMM, as well as requirements for the VMM and several types of VMM implementations.
- The x86 Processor Architecture on page 21 describes features of the x86 processor architecture that are pertinent to virtualization.
- SPARC Processor Architecture on page 29 describes features of the SPARC processor that affect virtualization implementations.

Section II, Hardware Virtualization Implementations, provides details on the Sun xVM Server, Logical Domains, and VMware implementations:

- Sun xVM Server on page 39 discusses a paravirtualized Solaris OS that is based on an open source VMM implementation for x86 [6] processors and is planned for inclusion in a future Solaris release.
- Sun xVM Server with Hardware VM (HVM) on page 63 continues the discussion of Sun xVM Server for the x86 processors that support hardware virtual machines: Intel VT and AMD-V.
- Logical Domains on page 79 discusses Logical Domains (LDoms), supported on Sun servers that utilize UltraSPARC T1 or T2 processors, and describes Solaris OS support for this feature.
- VMware on page 97 discusses the VMware implementation of the VMM.

Section III, Additional Information, contains a concluding comparison, references, and appendices:

- VMM Comparison on page 109 presents a summary of the VMM implementations discussed in this paper.
- References on page 111 provides a comprehensive listing of related references.
- Terms and Definitions on page 113 contains a glossary of terms.
- Author Biography on page 117 provides information on the author.
Section I
Background Information
Chapter 2: Virtual Machine Monitor Basics (page 9)
Chapter 3: The x86 Processor Architecture (page 21)
Chapter 4: SPARC Processor Architecture (page 29)
Chapter 2
VMM Requirements
A software program communicates with the computer hardware through instructions, which in turn operate on registers and memory. If any of the instructions, registers, or memory involved in an action is privileged, the action is a privileged action. Sometimes an action that is not necessarily privileged attempts to change the configuration of resources in the system, and would thereby impact other actions whose behavior or results depend on that configuration. The instructions that result in such operations are called sensitive instructions. In the context of the virtualization discussion, a processor's instructions can be classified into three groups:

- Privileged instructions trap if the processor is in non-privileged mode and do not trap if it is in privileged mode.
- Sensitive instructions change or reference the configuration of resources (memory), affect the processor mode without going through the memory trap sequence (page fault), or reference sensitive registers whose contents change when the processor switches to run another VM.
- Non-privileged and non-sensitive instructions do not fall into either of the categories described above.
Sensitive instructions have a major bearing on the virtualizability of a machine [1] because of their system-wide impact. In a virtualized environment, a GOS should only contain non-privileged and non-sensitive instructions. If sensitive instructions are a subset of privileged instructions, it is relatively easy to build a VM because all sensitive instructions will result in a trap. In this case a VMM can be constructed to catch all traps that result from the execution of sensitive instructions by a GOS. All privileged and sensitive actions from VMs would be caught by the VMM, and resources could be allocated and managed accordingly (a technique called trap-and-emulate). A GOS's trap handler could then be called by the VMM trap handler to perform the GOS-specific actions for the trap.

If a sensitive instruction is a non-privileged instruction, its execution by one VM will go unnoticed. Robin and Irvine [3] identified several x86 instructions in this category. These instructions cannot be safely executed by a GOS, as they can impact the operations of other VMs or adversely affect the operation of its own GOS. Instead, these instructions must be substituted by a VMM service. The substitution can take the form of an API for the GOS to call, or a dynamic conversion of these instructions into explicit processor traps.
Types of VMM
In a virtualized environment, the VMM controls the hardware resources. VMMs can be categorized into two types, based on this control of resources:

- Type I: maintains exclusive control of hardware resources
- Type II: leverages the host OS by running inside the OS kernel

The Type I VMM [3] has several distinct characteristics: it is the first software to run (besides the BIOS and the boot loader), it has full and exclusive control of system hardware, and it runs in privileged mode directly on the physical processor. The GOS on a Type I VMM implementation runs in a less privileged mode than the VMM to avoid conflicts in managing the hardware resources. An example of a Type I VMM is Sun xVM Server, which includes a bundled VMM, the Sun xVM Hypervisor for x86. The Sun xVM Hypervisor for x86 is the first software, besides the BIOS and boot loader, to run during boot, as shown in the GRUB
menu.lst file:
    title Sun xVM Server
        kernel$ /boot/$ISADIR/xen.gz
        module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix
        module$ /platform/i86pc/$ISADIR/boot_archive
The GRUB bootloader first loads the Sun xVM Hypervisor for x86, xen.gz. After the VMM gains control of the hardware, it loads the Solaris kernel,
/platform/i86xpv/kernel/$ISADIR/unix, to run as a GOS.
Sun's Logical Domains and VMware's Virtual Infrastructure 3 [4] (formerly known as VMware ESX Server), described in detail in Chapter 7, Logical Domains, on page 79 and Chapter 8, VMware, on page 97, are also Type I VMMs.

A Type II VMM typically runs inside a host OS kernel as an add-on module, and the host OS maintains control of the hardware resources. The GOS in a Type II VMM is a process of the host OS. A Type II VMM leverages the kernel services of the host OS to access hardware, and it intercepts a GOS's privileged operations and performs them in the context of the host OS. Type II VMMs have the advantage of preserving an existing installation by allowing a new GOS to be added to a running OS. An example of a Type II VMM is VMware's VMware Server (formerly known as VMware GSX Server). Figure 2 illustrates the relationships among hardware, VMM, GOS, host OS, and user applications in virtualized environments.
Figure 2. Virtual machine monitors vary in how they support guest OS, host OS, and user applications in virtualized environments.
VMM Architecture
As discussed in VMM Requirements on page 9, the VMM performs some of the functions that an OS normally does: namely, it controls and arbitrates CPU and memory resources, and it provides services to upper layer software for sensitive and privileged operations. These functions require the VMM to run in privileged mode and the OS to relinquish the privileged and sensitive operations to the VMM. In addition to processor and memory operations, I/O device support also has a large impact on VMM architecture.
- Dynamically translating the GOS sensitive instructions in software: As described in a previous section, VMware uses binary translation to replace the GOS sensitive instructions with VMM instructions.
- Dynamically translating the GOS sensitive instructions in hardware: This method requires the processor to provide a special mode of operation that is entered when a sensitive instruction is executed in reduced privileged mode.

The first approach, which involves modifying the GOS source code, is called paravirtualization, because the VMM provides only partial virtualization of the processor. The GOS must replace its sensitive and privileged operations with calls to the VMM service. The remaining two approaches provide full virtualization to the VM, enabling the GOS to run without modification.

In addition to OS modification, performance requirements, processor architecture design, tolerance of a single point of failure, and support for legacy OS installations all have an impact on the design of a VMM architecture.
A ballooning technique [5] has been used in some virtualization products to achieve better utilization of physical memory among VMs. The idea behind the ballooning technique is simple. The VMM controls a balloon module in a GOS. When the VMM wants to reclaim memory, it inflates the balloon to increase pressure on memory, forcing the GOS to page out memory to disk. If the demand for physical memory decreases, the VMM deflates the balloon in a VM, enabling the GOS to claim more memory.
page table and the shadow page table is handled by the VMM when page faults occur. Figure 4 shows three different page translation implementations in the Solaris OS on x86 and SPARC platforms:

1. The paravirtualized Sun xVM Server uses the following approach on x86 platforms: [1] The GOS uses the hypervisor call method to update the page tables maintained by the VMM.

2. The Sun xVM Server with HVM and VMware use the following approach: [2a] The GOS maintains its own guest page table. The synchronization between the guest page table and the hardware page table (shadow page table) is handled by the VMM when page faults occur. [2b] The x86 CPU loads the page translation from the hardware page table into the TLB.

3. On SPARC systems, the Solaris OS uses the following approach for Logical Domains: [3a] The GOS maintains its own page table, and takes an entry from the page table as an argument to the hypervisor call that loads the translation into the TLB. [3b] The VMM gets the page translation from the GOS and loads it into the TLB.
[Figure 4. Page translation paths on x86 and SPARC: hypervisor calls to VMM-maintained page tables (1), the VMM-synchronized hardware page table feeding the TLB (2a/2b), and hypervisor calls that load GOS page table entries into the TLB (3a/3b)]
The memory management implementation for Sun xVM Server, Sun xVM Server with HVM, VMware, and Logical Domains using these mechanisms is discussed in detail in later sections of this paper.
I/O Virtualization
I/O devices are typically managed by a special software module, the device driver, running in the kernel context. Because of the vast variety of device types and device drivers, the VMM either includes few device drivers or leaves device management entirely to the GOS. In the latter case, because of existing device architecture limitations (discussed later in this section), a device can only be exclusively managed by one VM. This constraint creates some challenges for I/O access by a VM, and limits the following:

- What devices are exported to a VM
- How devices are exported to a VM
- How each I/O transaction is handled by a VM and the VMM

Consequently, I/O presents the most challenges in the areas of compatibility and performance for virtual machines. In order to explain what devices are exported and how they are exported, it is first necessary to understand the options available for handling I/O transactions in a VM. There are, in general, three approaches to I/O virtualization, as illustrated in Figure 5:

- Direct I/O (VM1 and VM3)
- Virtual I/O using I/O transaction emulation (VM2)
- Virtual I/O using device emulation (VM4)
Figure 5. Different I/O virtualization techniques used by virtual machine monitors.
For direct I/O, the VMM exports all or a portion of the physical devices attached to the system to a VM, and relies on VMs to manage the devices. A VM that has direct I/O access uses the existing driver in the GOS to communicate directly with the device. VM1 and VM3 in Figure 5 have direct I/O access to devices. VM1 is also a special I/O VM that provides virtual I/O for other VMs, such as VM2, to access devices.
Virtual I/O is made possible by controlling the device types exported to a VM. There are two different methods of implementing virtual I/O: I/O transaction emulation (shown in VM2 in Figure 5) and device emulation (shown in VM4).

I/O transaction emulation requires virtual drivers on both ends for each type of I/O transaction (data and control functions). As shown in Figure 5, the virtual driver on the client side (VM2) receives I/O requests from applications and forwards the requests through the VMM to the virtual driver on the server side (VM1); the virtual driver on the server side then sends the request out to the device. I/O transaction emulation is typically used in paravirtualization because the OS on the client side needs to include the special drivers that communicate with their corresponding drivers in the OS on the server side, and needs to add kernel interfaces for inter-domain communication using the VMM services. However, it is possible to have PV drivers in a non-paravirtualized OS (full virtualization) for better I/O performance. For example, Solaris 10, which is not paravirtualized, can include PV drivers on an HVM-capable system to get better performance than that achieved using device emulation drivers such as QEMU. (See Sun xVM Server with HVM I/O Virtualization (QEMU) on page 71.) I/O transaction emulation may cause application compatibility issues if the virtual driver does not provide all the data and control functions (for example, ioctl(2)) that the existing driver does.

Device emulation provides an emulation of a device type, enabling the existing driver for the emulated device in a GOS to be used. The VMM exports emulated device nodes to a VM so that the existing drivers for the emulated devices in the GOS are used. By doing this, the VMM controls the driver used by a GOS for a particular device type (for example, using the e1000g driver for all network devices). Thus, the VMM can focus on the emulation of the underlying hardware using one driver interface.
Driver accesses to the I/O registers and ports in a GOS, which result in traps due to invalid addresses, are caught and converted to accesses to the real device hardware. VM4 in Figure 5 uses native OS drivers to access emulated devices exported by the VMM. Device emulation is in general less efficient, and more limited in the platforms supported, than I/O transaction emulation. Device emulation does not require changes in the GOS and, therefore, is typically used to provide full virtualization to a VM.

Virtual I/O, unlike direct I/O, requires additional drivers in either the I/O VM or the VMM to provide I/O virtualization. This constraint:

- Limits the types of devices that are made available to a VM
- Limits device functionality
- Causes significant I/O performance overhead

While virtualization provides full application binary compatibility, I/O becomes a trouble area in terms of application compatibility and performance in a VM. One
solution to the I/O virtualization issues is to allow VMs to directly access I/O, as shown by VM3 in Figure 5. Direct I/O access by VMs requires additional hardware support to ensure device accesses by a VM are isolated and restricted to the resources owned by the assigned VM.

In order to understand the industry effort to allow an I/O device to be shared among VMs, it is necessary to examine device operations from an OS point of view. The interactions between an OS and a device consist, in general, of three operations:

1. Programmed I/O (PIO): CPU-initiated access to a device's registers and memory, performed by the host driver through load and store instructions.

2. Direct Memory Access (DMA): device-initiated data transfer without CPU involvement. In DMA, a host OS writes an address of its memory and the transfer size to a device's DMA descriptor. After receiving an enable-DMA instruction from the host driver, the device performs the data transfer at a time it chooses and uses interrupts to notify the host OS of DMA completion.

3. Interrupts: device-initiated notifications to the host OS of asynchronous events, such as DMA completion.
Interrupts are already virtualized by all VMM implementations, as shown in the later discussions of Sun xVM Server, Logical Domains, and VMware. The challenge of I/O sharing among VMs therefore lies in the device handling for PIO and DMA. To meet these challenges, the PCI SIG has released a suite of IOV specifications for PCI Express (PCIe) devices, in particular the Single Root I/O Virtualization and Sharing Specification (SRIOV) [35] for device sharing and PIO operation, and the Address Translation Services (ATS) specification [30] for DMA operation.

Device Configuration and PIO

A PCI device exports its memory to the host through Base Address Registers (BARs) in its configuration space. A device's configuration space is identified in the PCI configuration address space as shown in Figure 6.
    31        24 23         16 15        11 10         8 7          2 1  0
   +------------+-------------+------------+------------+------------+----+
   |  Reserved  | Bus Number  |   Device   |  Function  |  Register  | 00 |
   |            |             |   Number   |   Number   |   Number   |    |
   +------------+-------------+------------+------------+------------+----+

Figure 6. Layout of the PCI configuration address.
A PCI device can have up to 8 physical functions (PFs). Each PF has its own 256-byte configuration header. The BARs of a PCI function, which are 32 bits wide, are located at offsets 0x10-0x24 in the configuration header. The host gets the size of the memory region mapped by a BAR by writing a value of all 1's to the BAR and then reading the value back. The address written to a BAR is the assigned starting address of the memory region mapped to the BAR.
To allow multiple VMs to share a PF, the SRIOV specification introduces the notion of a Virtual Function (VF). Each VF shares some common configuration header fields with the PF and the other VFs. The VF BARs are defined in the PCIe SRIOV extended capabilities structure. A VF contains a set of non-shared physical resources, such as work queues and data buffers, which are required to deliver function-specific services. These resources are exported through the VF BARs and are directly accessible by a VM. The starting address of a VF's memory space is derived from the first VF's memory space address and the size of the VF BAR. For any given VFx, the starting address of its memory space mapped to BARa is calculated according to the following formula:
    addr(VFx, BARa) = addr(VF1, BARa) + (x - 1) * (VF BARa aperture size)
where addr(VF1, BARa) is the starting address of BARa for the first VF, and (VF BARa aperture size) is the size of the VF BARa as determined by writing a value of all 1's to BARa and reading the value back. Using this mechanism, a GOS in a VM is able to share the device with other VMs while performing device operations that pertain only to that VM.

DMA

In many current implementations (especially on most x86 platforms), physical addresses are used in DMA. Since a VM shares the same physical address space on the system with other VMs, a VM might read or write another VM's memory through DMA. For example, a device driver in a VM might write memory contents that belong to other VMs to a disk and read the data back into the VM's memory. This causes a potential breach in security and fault isolation among VMs. To provide isolation during DMA operation, the ATS specification defines a scheme for a VM to use addresses mapped to its own physical memory for DMA operation. (This approach is used in similar designs such as the IOMMU Specification [31] and DMA Remapping [28].) This DMA ATS enables DMA memory to be partitioned into multiple domains, and keeps DMA transactions in one domain isolated from other domains. Figure 7 shows device DMA with and without ATS. With DMA ATS, the DMA address is like a virtual address that is associated with a context (VM). DMA transactions initiated by a VM can only be associated with the memory owned by that VM. DMA ATS is a chipset function that resides outside of the processor.
Figure 7. Device DMA with and without ATS. Without ATS, the CPU and PCI devices present physical addresses (PA) directly to the North Bridge. With an IOMMU in the South Bridge, PCI devices present a DVA/GPA that the IOMMU translates to an HPA before the DMA reaches system memory.
As shown in Figure 7, the physical address (PA) is used on hardware platforms without support for ATS. On platforms with hardware support for ATS, a GOS in a VM writes either a device virtual address (DVA) or a guest physical address (GPA) to the device's DMA engine. The device driver in the GOS loads the mapping of the DVA or GPA to the host physical address (HPA) into the hardware IOMMU. The HPA is the address understood by the memory controller.
Note The distinction between the HPA and GPA is described in detail in later sections for Sun xVM Server (see Physical Memory Management on page 52), for UltraSPARC LDoms (see Physical Memory Allocation on page 88), and for VMware (see Physical Memory Management on page 103).
When the device performs a DMA operation, a DVA/GPA address appears on the PCI bus and is intercepted by the hardware IOMMU. The hardware IOMMU looks up the mapping for the DVA/GPA, finds the corresponding HPA, and moves the PCI data to the system memory pointed to by the HPA. Since each VM's DVA or GPA range forms its own address space, ATS allows system memory for DMA to be partitioned and thus prevents a VM from accessing another VM's DMA buffer.
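The per-domain translation that the IOMMU performs can be sketched as follows; the structures and names are hypothetical simplifications of a real IOMMU's page tables:

```c
#include <stdint.h>
#include <stdbool.h>

#define IOMMU_PAGE_SHIFT 12
#define IOMMU_MAX_PAGES  16

/* One translation context (domain) per VM: DVA/GPA page -> HPA page. */
typedef struct {
    uint64_t hpa_page[IOMMU_MAX_PAGES];
    bool     valid[IOMMU_MAX_PAGES];
} iommu_domain_t;

/* Translate a device DMA address for one domain; returns false (a
   translation fault) if the domain has no mapping for that page, so a
   device programmed by one VM cannot reach another VM's memory. */
bool iommu_translate(const iommu_domain_t *dom, uint64_t dva, uint64_t *hpa)
{
    uint64_t pg = dva >> IOMMU_PAGE_SHIFT;
    if (pg >= IOMMU_MAX_PAGES || !dom->valid[pg])
        return false;
    *hpa = (dom->hpa_page[pg] << IOMMU_PAGE_SHIFT)
         | (dva & ((1u << IOMMU_PAGE_SHIFT) - 1));
    return true;
}
```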
Chapter 3
Timer Devices

The x86 platform includes several timer devices for timekeeping purposes. Knowledge of the characteristics of these devices is important to fully understand timekeeping in a VM: some timer devices are interrupt driven (which is virtualized and delayed), and some require privileged access to update the device counter.
Protected Mode
The x86 architecture's protected mode provides a protection mechanism to limit access to certain segments or pages and prevent unprivileged access. The processor's segment-protection mechanism recognizes four privilege levels, numbered 0 to 3 (Figure 8). The greater the level number, the lesser the privileges. The page-level protection mechanism restricts access to pages based on two privilege levels: supervisor mode and user mode. If the processor is operating at a current privilege level (CPL) of 0, 1, or 2, it is in supervisor mode and can access all pages. If the processor is operating at CPL 3, it is in user mode and can access only user-level pages.
When the processor detects a privilege-level violation, it generates a general-protection exception (#GP). The x86 has more than 20 privileged instructions, which can be executed only when the current privilege level (CPL) is 0 (most privileged). In addition to the CPL, the x86 has an I/O privilege level (IOPL) field in the EFLAGS register that indicates the I/O privilege level of the currently running program. Some instructions, while allowed to execute when the CPL is not 0, generate a #GP exception if the CPL value is higher than the IOPL. These instructions include CLI (clear interrupt flag), STI (set interrupt flag), IN/INS (input from port), and OUT/OUTS (output to port). Beyond these, there are many instructions [3] that, while not privileged, reference registers or memory locations that would allow a VM to access a memory region not assigned to that VM. These sensitive instructions do not cause a #GP exception, so the trap-and-emulate method for virtualization of a GOS, as stated in VMM Requirements on page 9, does not apply to them. However, these instructions may impact other VMs.
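A minimal model of these privilege checks, with hypothetical helper names (the full rules are in the processor manuals):

```c
#include <stdbool.h>

/* Privileged instructions require CPL 0. */
bool can_exec_privileged(unsigned cpl)
{
    return cpl == 0;
}

/* IOPL-sensitive instructions (CLI, STI, IN/INS, OUT/OUTS) fault with
   #GP when the CPL is numerically higher (less privileged) than IOPL. */
bool can_exec_iopl_sensitive(unsigned cpl, unsigned iopl)
{
    return cpl <= iopl;
}

/* Page-level protection: CPL 0-2 is supervisor mode, CPL 3 is user mode. */
bool is_supervisor(unsigned cpl)
{
    return cpl <= 2;
}
```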
Segmented Architecture
In protected mode, all memory accesses go through a logical address → linear address (LA) → physical address (PA) translation scheme. The logical-address-to-LA translation is managed by the x86 segmentation architecture, which divides a process's address space into multiple protected segments. A logical address, which is used as the address of an operand or of an instruction, consists of a 16-bit segment selector and a 32-bit offset. A segment selector points to a segment descriptor that defines the segment (see Figure 11 on page 24). The segment base address is contained in the segment descriptor. The sum of the offset in a logical address and the segment base address gives the LA. The Solaris OS directly maps an LA to a process's virtual address (VA) by setting the segment base address to 0:

Segmentation: VA + segment base address (always 0 in Solaris) → linear address
Paging: linear address → physical address
For each memory reference, a VA and a segment selector are provided to the processor (Figure 9). The segment selector, which is loaded to the segment register, is used to identify a segment descriptor for the address.
Figure 9. Segment selector. Bits 15:3 hold the index (up to 8K descriptors); bit 2 is TI (Table Indicator; 0 = GDT, 1 = LDT); bits 1:0 are the RPL (Requested Privilege Level).
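The selector fields of Figure 9 and the segmentation step described above can be modeled as follows (the names are illustrative):

```c
#include <stdint.h>

/* Fields of a 16-bit segment selector, per Figure 9. */
typedef struct {
    unsigned index; /* bits 15:3 - descriptor table index (up to 8K entries) */
    unsigned ti;    /* bit 2    - table indicator: 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 1:0 - requested privilege level */
} selector_t;

selector_t decode_selector(uint16_t sel)
{
    selector_t s = { sel >> 3, (sel >> 2) & 1, sel & 3 };
    return s;
}

/* Linear address = segment base + offset. Solaris sets the base to 0,
   so the linear address equals the virtual address. */
uint32_t linear_addr(uint32_t seg_base, uint32_t offset)
{
    return seg_base + offset;
}
```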
Every segment register has a visible part and a hidden part, as illustrated in Figure 10 (see also [7], Volume 3A Section 3.4.3). The visible part is the segment selector, an index that points into either the global descriptor table (GDT) or the local descriptor table (LDT) to identify the descriptor from which the hidden part of the segment register is loaded. The hidden part contains segment descriptor information loaded from the descriptor table.
Figure 10. Each segment register has a visible part (the selector) and a hidden part (the type, base address, limit, and CPL).
The hidden fields of a segment register are loaded to the processor from a descriptor table and are stored in the descriptor cache registers. The descriptor cache registers, like the TLB, allow the processor to refer to the contents of the segment register's hidden part without further reference to the descriptor table. Each time a segment register is loaded, the descriptor cache register is fully loaded from the descriptor table. Since each VM has its own descriptor table (for example, the GDT), the VMM has to maintain a shadow copy of each VM's descriptor table. A context switch to a VM causes the VM's shadow descriptor table to be loaded into the hardware descriptor table. If the content of the descriptor table is changed by the VMM because of a context switch to another VM, the segment is non-reversible, meaning the segment cannot be restored if an event such as a trap causes the segment to be saved and replaced.

The current privilege level (CPL) is stored in the hidden portion of the segment register. The CPL is initially equal to the privilege level of the code segment from which it is loaded. The processor changes the CPL when program control is transferred to a code segment with a different privilege level.

The segment descriptor contains the size, location, access control, and status information of the segment and is stored in either the LDT or the GDT. The OS sets segment descriptors in the descriptor table and controls which descriptor entry to use for a segment (Figure 11). See CPU Privilege Mode on page 45 for a discussion of setting the segment descriptor in the Solaris OS.
Figure 11. Segment descriptor. The descriptor contains the segment base address (bits 31:24, 23:16, and 15:00), the segment limit (bits 19:16 and 15:00), and the following fields: G (granularity), D/B (default operation size; 0 = 16-bit segment, 1 = 32-bit segment), L (64-bit code segment), AVL (available for use by system software), P (segment present), DPL (descriptor privilege level), S (descriptor type; 0 = system, 1 = code or data), and Type (segment type).
The privilege check performed by the processor recognizes three types of privilege levels: requested privilege level (RPL), current privilege level (CPL), and descriptor privilege level (DPL). A segment can be loaded if the DPL of the segment is numerically greater than or equal to both the CPL and the RPL. In other words, a segment can be accessed only by code that has an equal or higher privilege level. Otherwise, a general-protection fault exception, #GP, is generated and the segment register is not loaded.

On 64-bit systems, a linear address space (flat memory model) is used to create a continuous, unsegmented address space for both the kernel and application programs. Segmentation is disabled in the sense that privilege checking cannot be applied to the VA-to-LA translation, because that translation no longer exists. The only protection left to prevent a user application from accessing kernel memory is the page-protection mechanism. This is why the kernel of a GOS has to run in ring 3 (user mode in page-level protection) on a 64-bit system.
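The data-segment privilege check can be summarized in a one-line predicate (a simplified model; the full rules are in the processor manuals):

```c
#include <stdbool.h>

/* A segment load succeeds only when DPL is numerically greater than
   or equal to both CPL and RPL, i.e., the accessing code is at least
   as privileged as the segment it is loading. */
bool segment_loadable(unsigned dpl, unsigned cpl, unsigned rpl)
{
    return dpl >= cpl && dpl >= rpl;
}
```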
Paging Architecture
When operating in protected mode, the LA → PA translation is performed by the paging hardware of the x86 processor. To access data in memory, the processor requires the presence of a VA → PA translation in the TLB (in Solaris, the LA is equal to the VA), a page table backing the TLB entry, and a page of physical memory. For the x86 processor, loading the VA → PA page translation from the page table into the TLB is performed automatically by the processor. The OS is responsible for allocating physical memory and loading the VA → PA translation into the page table. When the processor cannot load a translation from the page table, it generates a page fault exception, #PF. Because the loading of translations from the page table into the TLB is handled by the processor, a #PF exception on x86 processors usually means a physical page has not been allocated (Figure 12).
Figure 12. Translations through the TLB are accomplished in the processor itself, while translations through page tables are performed by the OS.
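The division of labor in Figure 12 can be sketched as follows, with the hardware TLB fill and the OS #PF handler as separate steps (all structures are simplified illustrations):

```c
#include <stdint.h>
#include <stdbool.h>

#define NPAGES 8

/* A toy page table: VA page -> PA page; 0 means no page allocated. */
typedef struct {
    uint64_t pte[NPAGES];
} page_table_t;

/* Returns true and the PA page on success. A false return models a
   #PF: the processor walked the page table itself and found no entry,
   so the fault is delivered to the OS. */
bool translate(page_table_t *pt, unsigned va_page, uint64_t *pa_page)
{
    if (pt->pte[va_page] == 0)
        return false;               /* #PF: no physical page allocated */
    *pa_page = pt->pte[va_page];    /* hardware TLB fill, no OS involved */
    return true;
}

/* OS #PF handler: allocate a physical page and install the translation. */
void pagefault_handler(page_table_t *pt, unsigned va_page, uint64_t new_pa)
{
    pt->pte[va_page] = new_pa;
}
```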
The x86 processor uses a control register, %cr3, to manage the loading of address translations from the page table into the TLB. The base address of a process's page table is kept by the OS and loaded into %cr3 when the process is context-switched in to run. In the Solaris OS, the value destined for %cr3 is kept in the kernel hat structure; each address space, as, has one hat structure. The mdb(1) command can be used to find the value of the %cr3 register of a process:
% mdb -k
> ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
....
R   9352   9351   9352   9352  28155 0x4a014000 fffffffec2ae78c0 bash
> fffffffec2ae78c0::print -t 'struct proc' ! grep p_as
struct as *p_as = 0xfffffffed15ba7e0
> 0xfffffffed15ba7e0::print -t 'struct as' ! grep a_hat
struct hat *a_hat = 0xfffffffed1718e98
> 0xfffffffed1718e98::print -t 'struct hat' ! grep hat_htable
htable_t *hat_htable = 0xfffffffed0f67678
> 0xfffffffed0f67678::print -t 'struct htable' ! grep ht_pfn
pfn_t ht_pfn = 0x16d37    // %cr3
When multiple VMs are running, the automatic loading of page translations from the page table into the TLB actually makes virtualization more difficult, because all page tables have to be accessible by the processor. As a result, page table updates can only be performed by the VMM, to enforce consistent memory usage on the system. Page Translations Virtualization on page 14 discusses two mechanisms for managing page tables by the VMM.

Another issue of the x86 paging architecture is related to the flushing of TLB entries. Unlike many RISC processors, which support a tagged TLB, the x86 TLB is not tagged. A TLB miss results in a walk of the page table by the processor to find and load the translation into the TLB. Since the TLB is not tagged, a change in the %cr3 register due to a virtual memory context switch invalidates all TLB entries. This adversely affects performance if the VMM and VM are not in the same address space. A typical solution to address the performance impact of TLB flushing is to reserve a region of the VM address space for the VMM. With this solution, the VMM and VM can run in the same address space and thus avoid a TLB flush when a VM memory operation traps to the VMM. The latest CPUs from Intel and AMD with hardware virtualization support include tagged TLBs, so translations for different address spaces can coexist in the TLB.
The x86 processor allows device memory and registers to be accessed through either an I/O address space or memory-mapped I/O. An I/O address space access is performed using special I/O instructions such as IN and OUT. These instructions, while allowed to execute when the CPL is not 0, result in a #GP exception if the processor's CPL value is higher than the I/O privilege level (IOPL). The Sun xVM Hypervisor for x86 provides a hypervisor call to set the IOPL, enabling a GOS to directly access I/O ports by setting the IOPL to its privilege level. When using memory-mapped I/O, any of the processor's instructions that reference memory can be used to access an I/O location, with protection provided through segmentation and paging. PIO, whether through the I/O address space or memory-mapped I/O, is normally uncacheable, as device registers usually must be accessed in precise program order. PIO uses addresses in a VM's address space and doesn't cause any security or isolation issues.

The x86 processor uses physical addresses for DMA. DMA in a virtualized x86 system has certain issues: a 32-bit, non-dual-address-cycle (DAC) PCI device cannot address beyond 4 GB of memory, and it is possible for one domain's DMA to intrude into another domain's physical memory, creating the risk of a security violation. The solution to these issues is to have an I/O memory management unit (IOMMU) as a part of an I/O bridge or north bridge that translates I/O addresses (for example, an address that appears on the PCI bus) to machine memory addresses. The I/O address can be any address that is recognized by the IOMMU. An IOMMU can also improve the performance of large data transfers by mapping a contiguous I/O address range to multiple physical pages in one DMA transaction. However, the IOMMU may hurt I/O performance for small data transfers, because the DMA setup cost is higher than that of DMA without an IOMMU.
For more details on the IOMMU, also known as hardware address translation service (hardware ATS), see I/O Virtualization on page 16.
Timer Devices
An OS typically uses several timer devices for different purposes. Timer devices are characterized by their frequency granularity and reliability, and by their ability to generate interrupts and accept counter input. Understanding the characteristics of timer devices is important for the discussion of timekeeping in a virtualized environment, as the VMM provides virtualized timekeeping of some timers to its overlaying VMs. Virtualized timekeeping has a significant impact on the accuracy of time-related functions in the GOS and, thus, on the performance and results of time-sensitive applications.
An x86 system typically includes the following timer devices:

Programmable Interrupt Timer (PIT)
The PIT uses a 1.193182 MHz crystal oscillator and has a 16-bit counter and counter input register. The PIT contains three timers. Timer 0 can generate interrupts and is used by the Solaris OS as the system timer. Timer 1 was historically used for RAM refresh, and timer 2 for the PC speaker.

Time Stamp Counter (TSC)
The TSC is a feature of the x86 architecture that is accessed via the RDTSC instruction. The TSC, a 64-bit counter, changes with the processor speed. The TSC cannot generate interrupts and has no counter input register. The TSC is the finest grained of all the timers and is used in the Solaris OS as the high-resolution timer. For example, the gethrtime(3C) function uses the TSC to return the current high-resolution real time.

Real Time Clock (RTC)
The RTC is used as the time-of-day (TOD) clock in the Solaris OS. The RTC uses a battery as an alternate power source, enabling it to continue to keep time while the primary source of power is not available. The RTC can generate interrupts and has a counter input register. It is the lowest grained timer on the system.

Local Advanced Programmable Interrupt Controller (APIC) Timer
The local APIC timer, which is a part of the local APIC, has a 32-bit counter and counter input register. It can generate interrupts and runs at the same frequency as the front-side bus. The Solaris OS supports the use of the local APIC timer as one of the cyclic timers.

High Precision Event Timer (HPET)
The HPET is a relatively new timer available in some newer x86 systems. It is intended to replace the PIT and the RTC for generating periodic interrupts. The HPET can generate interrupts, is 64 bits wide, and has a counter input register. The Solaris OS currently does not use the HPET.

Advanced Configuration and Power Interface (ACPI) Timer
The ACPI timer has a 24-bit counter, can generate interrupts, and has no counter input register. The Solaris OS does not use the ACPI timer.
Chapter 4
Trap and Interrupt Handling

Each strand (virtual processor) has its own trap and interrupt priority registers. This functionality allows the hypervisor to redirect traps to the target CPU and enables the trap to be taken by the GOS's trap handler.
Note The terms strand, hardware thread, logical processor, virtual CPU, and virtual processor are used by various documents to refer to the same concept. For consistency, the term strand is used in this chapter.
Table 1 lists the availability of instructions, registers, and address spaces for each of the privilege modes, along with references to where further details can be found.

Table 1. Availability of instructions, registers, and address spaces by privilege mode.

Component: Instructions
Comments: All instructions except SIR, RDHPR, and WRHPR (which require hyperprivileged mode to execute) can be executed from the privileged mode.

Component: Registers
Comments: There are seven hyperprivileged registers: HPSTATE, HTSTATE, HINTP, HTBA, HVER, HSTICK_CMPR, and STRAND_STS. These registers are used by the hypervisor in the hyperprivileged mode.

Component: Address Space
Comments: ASIs 0x30-0x7F are for hyperprivileged access only. These ASIs are mainly for CMT control, MMU, TLB, and hyperprivileged scratch registers.
Based on the availability of instructions, registers, and ASIs in hyperprivileged mode, the following functions of the hypervisor can be deduced:
Reset the processor: SIR instruction
Control hyperprivileged traps and interrupts: HTSTATE, HTBA, and HINTP registers
Control strand operation: ASI 0x41, and the HSTICK_CMPR and STRAND_STS registers
Manage the MMU: ASIs 0x50-0x5F
Processor Components
The UltraSPARC T1 processor [10] contains eight cores, and each core has hardware support for four strands. One FPU and one L2 cache are shared among all cores in the processor. Each core has its own Level 1 instruction and data caches (L1 Icache and Dcache) and a TLB, which are shared among all strands in the core. In addition, each strand contains the following:
A full register file with eight register windows and four sets of global registers (a total of 160 registers: 8 windows × 16 registers per window, plus 4 × 8 global registers)
Most of the ASIs
Ancillary and privileged registers
A trap queue with up to 16 entries
This hardware support in each strand allows the hypervisor to partition the processor into 32 domains, with one strand for each domain. Each strand can execute instructions separately, without requiring a software scheduler in the hypervisor to coordinate the processor resources. Table 2 summarizes the association of processor components to their location in the processor, core, and strand.
Table 2. Location of key processor components in the UltraSPARC T1 processor.

Strand: register file with 160 registers; most of the ASIs; ancillary state registers (ASRs); trap registers; privileged registers
The UltraSPARC T2 processor [33] is built upon the UltraSPARC T1 architecture. It has the following enhancements over the UltraSPARC T1 processor:
Eight strands per core (for a total of 64 strands)
Two integer pipelines per core, with each integer pipeline supporting four strands
A floating-point and graphics unit (FGU) per core
Integrated PCI-E and 10 Gb/Gb Ethernet (System-on-Chip)
Eight banks of 4 MB L2 cache
The UltraSPARC T2 has a total of 64 strands in 8 cores, and each core has its own floating-point and graphics unit (FGU). This allows up to 64 domains to be created on the UltraSPARC T2 processor. The design also adds integrated support for industry-standard I/O interfaces such as PCI-Express and 10 Gb Ethernet. Table 3 summarizes the association of processor components to the physical processor, core, and strand.
Table 3. Location of key processor components in the UltraSPARC T2 processor.

Processor: 8 banks of 4 MB L2 cache; L2 cache crossbar; memory controller; PCI-E; 10 Gb/Gb Ethernet
Core: 2 instruction pipelines (8 stages); L1 Icache and Dcache; TLB; FGU (12 stages)
Strand: full register file with 8 windows; most of the ASIs; ancillary state register (ASR); privileged registers
Table 9-1 and Table 10-1 of [2] provide a summary and description of each ASI.
The Memory Management Unit (MMU) of the UltraSPARC processor provides the translation of VAs to PAs. This translation enables user programs to use a VA to locate data in physical memory. The SpitFire Memory Management Unit (sfmmu) is Sun's implementation of the UltraSPARC MMU. The sfmmu hardware consists of Translation Lookaside Buffers (TLBs) and a number of MMU registers:

Translation Lookaside Buffer (TLB)
The TLB provides virtual-to-physical address translations. Each entry of the TLB is a Translation Table Entry (TTE) that holds the information for a single page mapping of virtual to physical addresses. The format of the TTE is shown in Figure 13. The TTE consists of two 64-bit words representing the tag and data of the translation. The privileged field, P, controls whether or not the page can be accessed by nonprivileged software.

MMU registers
A number of MMU registers are used for accessing TLB entries, removing TLB entries (demap), context management, handling TLB misses, and supporting Translation Storage Buffer (TSB) access. The TSB, an array of TTE entries, is a cache of translation tables used to quickly reload the TLB. The TSB resides in system memory and is managed entirely by the OS. The UltraSPARC processors include some MMU hardware registers for speeding up TSB access. The TLB miss handler first searches the TSB for the translation. If the translation is not found in the TSB, the TLB miss handler calls a more sophisticated (and slower) TSB miss handler to load the translation into the TSB.
Figure 13. The translation lookaside buffer (TLB) is an array of translation table entries containing tag and data portions. The TTE tag contains the context ID and the virtual address; the TTE data contains the physical address and attribute bits, including the privileged bit P.
A TLB hit occurs if both the context and the virtual address match an entry in the TLB. Address aliasing (multiple TLB entries with the same physical address) is permitted. Unlike in the x86 processor, the loading of page translations into the TLB is managed by software through traps. In the event of a TLB miss, a trap is generated, and the handler first tries to get the translation from the Translation Storage Buffer (TSB) (Figure 14). The TSB, an in-memory array of translations, acts like a direct-mapped cache for the TLB. If the translation is not present in the TSB, a TSB miss trap is generated. The TSB miss trap handler uses a software lookup mechanism based on the hash memory entry block structure, hme_blk, to obtain the TTE. If a translation is still not found in hme_blk, the kernel generic trap handler is invoked to call the kernel function pagefault() to allocate physical memory for the virtual address and load the translation into the hme_blk hash structure. Figure 14 depicts the mechanism for handling TLB misses in an unvirtualized domain.
Figure 14. Handling a TLB miss in an unvirtualized domain, UltraSPARC T1/T2 processor architecture. On a TLB miss, the processor MMU loads the TTE from the TSB (a TTE cache in memory); on a TSB miss, the OS obtains the TTE from the hme_blk data structure and loads it into the TSB with hat_memload(); if no translation exists, pagefault() allocates memory.
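The miss chain of Figure 14 can be sketched as follows; the direct-mapped TSB and the structure names are simplified illustrations, not the Solaris implementation:

```c
#include <stdint.h>
#include <stdbool.h>

#define TSB_ENTRIES 4

/* A toy TTE: virtual page tag plus the translated physical page. */
typedef struct { uint64_t va_tag; uint64_t pa; bool valid; } tte_t;

typedef struct { tte_t tsb[TSB_ENTRIES]; } mmu_state_t;

/* TSB lookup: direct-mapped by the low bits of the VA page number.
   A false return models the TSB miss trap, which falls through to the
   slower hme_blk hash lookup (and ultimately pagefault()). */
bool tsb_lookup(mmu_state_t *m, uint64_t va_pg, uint64_t *pa)
{
    tte_t *t = &m->tsb[va_pg % TSB_ENTRIES];
    if (t->valid && t->va_tag == va_pg) {
        *pa = t->pa;   /* TTE found: reload the TLB from here */
        return true;
    }
    return false;
}

/* Models hat_memload(): after the slower path has produced a TTE,
   load it into the TSB so the next miss on this page is fast. */
void tsb_load(mmu_state_t *m, uint64_t va_pg, uint64_t pa)
{
    tte_t *t = &m->tsb[va_pg % TSB_ENTRIES];
    t->va_tag = va_pg;
    t->pa = pa;
    t->valid = true;
}
```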
Similarly, Figure 15 depicts how TLB misses are handled in a virtualized domain. In a virtualized environment, the UltraSPARC T1/T2 processor adds a real address type, in addition to the VA and PA, to the types of memory addressing (Figure 15). Real addresses (RAs), which are equivalent to the physical memory abstraction in Sun xVM Server (see Physical Memory Management on page 52), are presented to the GOS as the underlying physical memory allocated to it. The GOS-maintained TSBs are used to translate VAs into RAs. The hypervisor manages the translation from RA to PA.
Figure 15. Handling a TLB miss in a virtualized domain, UltraSPARC T1/T2 processor architecture. The flow is the same as in Figure 14, except that the RA-to-PA translation on TLB fill is managed by the hypervisor.
Applications, which are nonprivileged software, use only VAs. The OS kernel, which is privileged software, uses both VAs and RAs. The hypervisor, which is hyperprivileged software, normally uses PAs. Physical Memory Allocation on page 88 discusses in detail the types of memory addressing used in LDoms.

The UltraSPARC T2 processor adds a hardware table walk for loading TLB entries. The hardware table walk accesses the TSBs to find TTEs that match the virtual address and context ID of the request. Since a GOS cannot access or control physical memory, the TTEs in the TSBs controlled by a GOS contain real page numbers, not physical page numbers (see Physical Memory Allocation on page 88). TTEs in the TSBs controlled by the hypervisor can contain either real page numbers or physical page numbers. The hypervisor performs the RA-to-PA translation within the hardware table walk, permitting the hardware table walk to load a GOS's TTEs into the TLB for VA-to-PA translation.
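The two-level addressing can be sketched as a composition of two mappings, the GOS's VA-to-RA translation and the hypervisor's RA-to-PA translation; the fixed offsets here are purely illustrative:

```c
#include <stdint.h>

/* Illustrative constants, not real firmware values. */
#define RA_BASE 0x00400000ull  /* guest real-address window */
#define PA_BASE 0x80000000ull  /* machine pages backing that window */

/* GOS-maintained TSB: VA -> RA (the guest never sees a PA). */
uint64_t gos_va_to_ra(uint64_t va) { return RA_BASE + va; }

/* Hypervisor-managed mapping: RA -> PA. */
uint64_t hv_ra_to_pa(uint64_t ra)  { return PA_BASE + (ra - RA_BASE); }

/* Hardware table walk with the hypervisor's RA-to-PA step folded in,
   so the TLB ends up holding a direct VA-to-PA translation. */
uint64_t va_to_pa(uint64_t va)
{
    return hv_ra_to_pa(gos_va_to_ra(va));
}
```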
Traps
In the SPARC processor, a trap transfers software execution from one privileged mode to another privileged mode at the same or a higher level. The only exception is that nonprivileged mode cannot trap to another nonprivileged mode. A trap can be generated by the following methods:
Internally by the processor (memory faults, privileged exceptions, and so on)
Externally by I/O devices (interrupts)
Externally by another processor (cross calls)
By software (for example, the Tcc instruction)
A trap is associated with a Trap Type (TT), a 9-bit value. (TT values 0x180-0x1FF are reserved for future use.) The transfer of software execution occurs through a trap table that contains an array of TT handlers indexed by the TT value. Each trap table entry is 32 bytes in length and contains the first eight instructions of the TT handler. When a trap occurs, the processor gets the TT from the TT register and the trap table base address from the TBA register. After saving the current execution state and updating some registers, the processor starts to execute the instructions in the trap table handler.

The SPARC processors support nested traps using a trap level (TL). The maximum TL (MAXTL) value is typically in the range of 2-6 and depends on the processor; in UltraSPARC T1/T2 processors, MAXTL is 6. Each trap level has one set of trap stack control registers: trap type (TT), trap program counter (TPC), trap next program counter (TNPC), and trap state (TSTATE). These registers provide trap software execution state and control for the current TL. The ability to support nested traps in SPARC processors makes the implementation of an OS trap handler easier and more efficient, as the OS doesn't need to explicitly save the current trap stack information.
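The trap dispatch arithmetic follows directly from the entry size: with 32-byte entries indexed by the 9-bit trap type, the handler address is TBA + TT × 32. A minimal sketch:

```c
#include <stdint.h>

/* Address of the trap table entry (the first eight instructions of
   the handler) for a given trap type. TT is a 9-bit value, so it is
   masked to 0x1FF; each entry is 32 bytes. */
uint64_t trap_handler_addr(uint64_t tba, unsigned tt)
{
    return tba + (uint64_t)(tt & 0x1FF) * 32;
}
```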
On UltraSPARC T1/T2 processors, each strand has a full set of trap control and stack registers which include TT, TL, TPC, TNPC, TSTATE, HTSTATE (hyperprivileged trap state), TBA, HTBA (hyperprivileged trap base address), and PIL (priority interrupt level). This design feature allows each strand to receive traps independently of other strands. This capability significantly helps trap handling and management by the hypervisor, as traps are delivered to a strand without being queued up in the hypervisor.
Interrupts
On SPARC platforms, interrupt requests are delivered to the CPU as traps. Traps 0x041 through 0x04F are used for Priority Interrupt Level (PIL) interrupts, and trap 0x60 is used for the vector interrupt. There are 15 interrupt levels for PIL interrupts. Interrupts are serviced in accordance with their PIL, with higher PILs having higher priority. The vector interrupt is used to support the data-bearing vector interrupt, which allows a device to include its private data in the interrupt packet (also known as the mondo vector). With the vector interrupt, device CSR accesses can be eliminated and the complexity of the device hardware can be reduced.

PIL interrupts are delivered to the processor through the ASR's SOFTINT_REG register. The SOFTINT_REG register contains a 15-bit int_level field. When a bit in this field is set, a trap is generated, and the PIL of the trap corresponds to the position of the bit in that field. There is one SOFTINT_REG for each strand.

In LDoms, the interrupt delivery from an I/O device to a GOS is a two-step process:
1. An I/O device sends an interrupt request using the vector interrupt (trap 0x60) to the hypervisor. The hypervisor inserts the interrupt request into the interrupt queue of the target virtual processor.
2. The target processor receives the interrupt request on its interrupt queue through trap 0x7D (for devices) or 0x7C (for cross calls), and schedules an interrupt to itself to be processed at a later time by setting bits in the privileged SOFTINT register, which causes a PIL interrupt (trap 0x41-0x4F).
For more details on interrupt delivery, see Trap and Interrupt Handling on page 85.
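The posting and priority selection of PIL interrupts through the int_level field can be modeled as follows (a simplified illustration of the mechanism only; details of the real SOFTINT register layout are omitted):

```c
#include <stdint.h>

/* Post a PIL interrupt by setting bit N of the int_level field
   (pil is 1..15; bit position encodes the level). */
void softint_post(uint16_t *int_level, unsigned pil)
{
    *int_level |= (uint16_t)(1u << pil);
}

/* Highest pending PIL, 0 if none: higher PILs have higher priority,
   so the highest set bit is serviced first. */
unsigned softint_highest(uint16_t int_level)
{
    for (unsigned pil = 15; pil >= 1; pil--)
        if (int_level & (1u << pil))
            return pil;
    return 0;
}
```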
Section II
Chapter 5: Sun xVM Server (page 39)
Chapter 6: Sun xVM Server with Hardware VM (HVM) (page 63)
Chapter 7: Logical Domains (page 79)
Chapter 8: VMware (page 97)
Chapter 5
Note Sun xVM Server includes support for the Xen open source community work on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform. In this paper, in order to distinguish the discussion of x86 and UltraSPARC T1/T2 processors, Sun xVM Server is specifically used to refer to the Sun hardware virtualization product for the x86 platform, and LDoms is used to refer to the Sun hardware virtualization product for the UltraSPARC T1 and T2 platforms.
This chapter is organized as follows:
Sun xVM Server Architecture Overview on page 40 provides an overview of the Sun xVM Server architecture.
Sun xVM Server CPU Virtualization on page 45 discusses the CPU virtualization employed by Sun xVM Server.
Sun xVM Server Memory Virtualization on page 52 describes memory management issues.
Sun xVM Server I/O Virtualization on page 56 discusses the I/O virtualization used in Sun xVM Server.
The Dom0 VM has some unique characteristics not available in other VMs:
It is the first VM started by the VMM.
It is able to directly access I/O devices.
It runs the domain manager to create, start, stop, and configure other VMs.
It provides I/O access services to other VMs (DomU).
Each DomU VM runs an instance of a paravirtualized GOS and gets VMM services through a set of hypercalls. Access to I/O devices from each DomU VM is provided by drivers in Dom0.
The calling convention is compliant with the AMD64 ABI [8]. The SYSCALL instruction is intended to enable unprivileged software (ring 3) to access services from privileged software (ring 0). Solaris system calls also use SYSCALL to allow user applications to access Solaris kernel services. Because SYSCALL is used by both Solaris system calls and the hypercalls, a SYSCALL made by a user process in Solaris is delivered indirectly by the VMM to the Solaris kernel. This causes a slight overhead for each Solaris system call.
Privilege Operations:
long set_trap_table(trap_info_t *table);
long mmu_update(mmu_update_t *req, int count, int *success_count, domid_t domid);
long set_gdt(ulong_t *frame_list, int entries);
long stack_switch(ulong_t ss, ulong_t esp);
long fpu_taskswitch(int set);
long mmuext_op(struct mmuext_op *req, int count, int *success_count, domid_t domain_id);
long update_descriptor(maddr_t ma, uint64_t desc);
long update_va_mapping(ulong_t va, uint64_t new_pte, ulong_t flags);
long set_timer_op(uint64_t timeout);
long physdev_op(void *physdev_op);
long vm_assist(uint_t cmd, uint_t type);
long update_va_mapping_otherdomain(ulong_t va, uint64_t new_pte, ulong_t flags, domid_t domain_id);
long iret();
long set_segment_base(int reg, ulong_t value);
long nmi_op(ulong_t op, void *arg);
long hvm_op(int cmd, void *arg);
VMM Services: long set_callbacks(ulong_t event_address, ulong_t failsafe_address, ulong_t syscall_address); long grant_table_op(uint_t cmd, void *uop, uint_t count); long event_channel_op(void *op); long xen_version(int cmd, void *arg); long set_debugreg(int reg, ulong_t value); long get_debugreg(int reg); long multicall(void *call_list, int nr_calls); long console_io(int cmd, int count, char *str); long sched_op(int cmd, void *arg); long do_kexec_op(unsigned long op, int arg1, void *arg); VM Control Operations: long sched_op_compat(int cmd, ulong_t arg); long platform_op(xen_platform_op_t *platform_op); long memory_op(int cmd, void *arg); long vcpu_op(int cmd, int vcpuid, void *extra_args); long sysctl(xen_sysctl_t *sysctl); long domctl(xen_domctl_t *domctl); long acm_op();
As Table 4 shows, the hypercalls provide a variety of functions for a GOS:
- Perform privileged operations, such as setting the trap table, updating the page table, loading the GDT, and setting the GS and FS segment registers
- Get services from the VMM, such as the event channel, grant table, set_callbacks services, and scheduler operations
- Control VM operations, such as platform_op, domain control, and virtual CPU control
An example use of a hypercall is to request a set of page table updates. For example, a new process created by the fork(2) call requires the creation of page tables. The hypercall HYPERVISOR_mmu_update(), which validates and applies a list of
updates, is called by the Solaris kernel to perform the page table updates. This routine returns control to the calling domain when the operation is completed. In the following example, a kmdb(1M) breakpoint is set at the mmu_update() call. The stack trace illustrates how the mmu_update() function is called after a new process is created by fork():
[1]> set_pteval+0x4f:b          // set breakpoint at HYPERVISOR_mmu_update
[1]> :c                         // continue
kmdb: stop at set_pteval+0x4f   // the breakpoint reached
kmdb: target stopped at: set_pteval+0x4f: call -0x5a34 <HYPERVISOR_mmu_update>
[1]> $c                         // display the stack trace
set_pteval+0x4f(c753000, 1fb, 3, f9c29027)
x86pte_copy+0x73(fffffffec08115a8, fffffffec2a8a0d8, 1fb, 5)
hat_alloc+0x228(fffffffec2fa88c0)
as_alloc+0x99()
as_dup+0x3f(fffffffec27b1d28, fffffffec2a11168)
cfork+0x102(0, 1, 0)
forksys+0x25(0, 0)
sys_syscall32+0x13e()
[1]>
The above example shows that the kernel doesn't maintain its own copy of the page table; it uses the mmu_update() hypercall to request that the VMM update the page table.

Event Channels

To a GOS, a VMM event is the equivalent of a hardware interrupt. Communication from the VMM to a VM is provided through an asynchronous event mechanism, called an event channel, which replaces the usual delivery mechanisms for device interrupts. A VM creates an event channel to send and receive asynchronous event notifications. Three classes of events are delivered by this event channel mechanism:
- Bi-directional inter- and intra-VM connections. A VM can bind an event-channel port to another domain or to another virtual CPU within the VM.
- Physical interrupts. A VM with direct access to hardware (Dom0) can bind an event-channel port to a physical interrupt source.
- Virtual interrupts. A VM can bind an event-channel port to a virtual interrupt source, such as the virtual timer device.
Event channels are addressed by a port. Each channel is associated with two bits of information:

unsigned long evtchn_pending[sizeof(unsigned long) * 8]
This bit notifies the domain that there is a pending notification to be processed. It is cleared by the GOS.

unsigned long evtchn_mask[sizeof(unsigned long) * 8]
This bit specifies whether the event channel is masked. If this bit is clear and PENDING is set, an asynchronous upcall will be scheduled. This bit is only updated by the GOS; it is read-only within the VMM.

Interrupts to a VM are virtualized by mapping them to event channels. These interrupts are delivered asynchronously to the target domain using a callback supplied via the set_callbacks hypercall. A guest OS can map these events onto its standard interrupt dispatch mechanisms. The VMM is responsible for determining the target domain that will handle each physical interrupt source. Interrupts and Exceptions on page 49 provides a detailed discussion of how an interrupt is handled by the VMM and delivered to a VM using an event channel.

Grant Tables

The Sun xVM Hypervisor for x86 allows memory to be shared among VMs, and between the VMM and a VM, through a grant table mechanism. Each VM makes some of its pages available to other VMs by granting access to them. The grant table is a data structure that a VM uses to expose some of its pages and to specify what permissions other VMs have on those pages. The following example shows the information stored in a grant table entry:
struct grant_entry {
    /* GTF_xxx: various type and flag information. [XEN,GST] */
    uint16_t flags;
    /* The domain being granted foreign privileges. [GST] */
    domid_t  domid;
    uint32_t frame;     /* page frame number (PFN) */
};
The flags field stores the type and various flag information for the grant table entry. There are three types of grant table entries:
- GTF_invalid: Grants no privileges.
- GTF_permit_access: Allows the domain domid to map/access the specified frame.
- GTF_accept_transfer: Allows domid to transfer ownership of one page frame to this guest; the VMM writes the page number to frame.
The type information acts as a capability which the grantee can use to perform operations on the granter's memory. A grant reference also encapsulates the details of a shared page, removing the need for a domain to know the real machine address of a page it is sharing. This makes it possible to share memory correctly with domains running in fully virtualized memory. Device drivers in Sun xVM Server (see Sun xVM Server I/O Virtualization on page 56) use grant tables to send data between drivers in different domains, and use event channels and callback services for asynchronous notification of data availability.

XenStore

XenStore [22] is a shared storage space used by domains to communicate and store configuration information. XenStore is the mechanism by which control-plane activities occur, including:
- Setting up shared memory regions and event channels for use with split device drivers
- Notifying the guest of control events (for example, balloon driver requests)
- Reporting status information from the guest (for example, performance-related statistics)
The store is arranged as a hierarchical collection of key-value pairs. Each domain has a directory hierarchy containing data related to its configuration. Domains are permitted to register for notifications about changes in a subtree of the store, and to apply changes to the store transactionally.
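The hierarchical shape of the store can be illustrated with a toy path test. This is a model of the key space only, not the XenStore API: a watch registered on a subtree matches a changed key when the key equals the subtree root or lies beneath it.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of XenStore's hierarchical key space (not the real API):
 * keys are slash-separated paths. A watch registered on a subtree
 * fires for a changed key only when the key equals the subtree root
 * or lies strictly beneath it. */
static int xs_in_subtree(const char *subtree, const char *key)
{
    size_t n = strlen(subtree);

    if (strncmp(key, subtree, n) != 0)
        return 0;                       /* prefix doesn't match */
    return key[n] == '\0' || key[n] == '/';
}
```

Note that the trailing-character check is what keeps a watch on /local/domain/3 from firing for /local/domain/30.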
On 64-bit systems, a linear address space (flat memory model) is used to create a continuous, unsegmented address space for both the kernel and application programs. Segmentation is effectively disabled, and rings 1 and 2, which are essentially unused, have the same paging privilege as ring 0 (see Protected Mode and following sections beginning on page 22). To protect the VMM, the Sun xVM Server kernel is therefore restricted to run in ring 3 in 64-bit mode and in ring 1 in 32-bit mode, as seen in the definitions in segments.h:
% cat intel/sys/segments.h
....
#if defined(__amd64)
#define SEL_XPL 0       /* xen privilege level */
#define SEL_KPL 3       /* both kernel and user in ring 3 */
#elif defined(__i386)
#define SEL_XPL 0       /* xen privilege level */
#define SEL_KPL 1       /* kernel privilege level under xen */
#endif  /* __i386 */
If both kernel and user application run with the same privilege level, how does Sun xVM Server protect the kernel from user applications? The answer is given as follows [32]:
1. The VMM performs context switching between kernel mode and the currently running application in user mode. The VMM tracks which mode, kernel or user, the GOS is running in.
2. The GOS maintains two top-level (PML4) page tables per process, one for the kernel and one for user mode, and registers both page tables with the VMM. The kernel page table contains translations for both kernel and user addresses, while the user page table contains translations only for user addresses. During a context switch, the VMM switches the top-level page table so that kernel addresses are not visible to the user process.
The linear address mapping to the paging data structure for a 64-bit x86 processor is shown below in Figure 17:
Figure 17. Linear address mapping to paging data structure for 64-bit x86 processor. The linear address divides into these fields: bits 63-48 are the sign extension of bit 47, bits 47-39 index the PML4 table, bits 38-30 index the PDP table, bits 29-21 index the PDE, bits 20-12 index the PTE, and bits 11-0 are the page offset.
Switching the PML4 page tables between kernel and user mode enables a 64-bit address space to be split into two logically separate address spaces. In this logical separation of a 64-bit address space, the kernel can access both its address space and a user address space while a user process can access only its own address space. The user address space in this addressing scheme is therefore restricted to use the lower 48 bits of the 64-bit address space. The resulting address space partition in the 64-bit Sun xVM Server is shown as follows, in Figure 18:
Figure 18. Address space partitioning in the 64-bit Sun xVM Server. The kernel occupies the upper portion of the 64-bit address space, while the user process (ring 3) is restricted to the lower 48 bits, starting at address 0.
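The field boundaries in Figure 17 can be expressed as shift-and-mask helpers. This is an illustrative sketch (the function names are mine): each paging-structure index is 9 bits wide, and the page offset is 12 bits.

```c
#include <stdint.h>

/* Illustrative decomposition of a 64-bit x86 linear address, following
 * Figure 17: bits 47:39 select the PML4 entry, 38:30 the PDP entry,
 * 29:21 the PDE, 20:12 the PTE, and 11:0 the byte offset in the page. */
static inline unsigned va_pml4_index(uint64_t va)  { return (va >> 39) & 0x1ff; }
static inline unsigned va_pdp_index(uint64_t va)   { return (va >> 30) & 0x1ff; }
static inline unsigned va_pde_index(uint64_t va)   { return (va >> 21) & 0x1ff; }
static inline unsigned va_pte_index(uint64_t va)   { return (va >> 12) & 0x1ff; }
static inline unsigned va_page_offset(uint64_t va) { return va & 0xfff; }
```

Because each index is 9 bits, every paging structure holds 512 entries, and the five fields together account for the lower 48 bits that a user process may use.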
As discussed previously (see Segmented Architecture on page 23), the processor privilege level is set when a segment is loaded. The Solaris OS uses the GDT for user and kernel segments. The segment index of each segment type is assigned as shown in Table 5 on page 54. The command kmdb(1M) can be used to examine the segment descriptor of kernel code:
[0]> gdt0+30::print -t 'struct user_desc'    // 64-bit kernel code segment
{
    unsigned long usd_lolimit :16 = 0x7000
    unsigned long usd_lobase :16 = 0xe030
    unsigned long usd_midbase :8 = 0
    unsigned long usd_type :5 = 0xe
    unsigned long usd_dpl :2 = 0x3
    unsigned long usd_p :1 = 0x1
    unsigned long usd_hilimit :4 = 0x4
    unsigned long usd_avl :1 = 0
    unsigned long usd_long :1 = 0
    unsigned long usd_def32 :1 = 0
    unsigned long usd_gran :1 = 0x1
    unsigned long usd_hibase :8 = 0xfb
}
[0]> gdt0+40::print -t 'struct user_desc'    // 32-bit user code segment
{
    unsigned long usd_lolimit :16 = 0xc450
    unsigned long usd_lobase :16 = 0xe030
    unsigned long usd_midbase :8 = 0xf8
    unsigned long usd_type :5 = 0xe
    unsigned long usd_dpl :2 = 0x3
    unsigned long usd_p :1 = 0x1
    unsigned long usd_hilimit :4 = 0x1
    unsigned long usd_avl :1 = 0
    unsigned long usd_long :1 = 0
    unsigned long usd_def32 :1 = 0
    unsigned long usd_gran :1 = 0x1
    unsigned long usd_hibase :8 = 0xfb
}
The descriptor privilege level (DPL) of both the kernel and 32-bit user code segments is set to 3. At boot time, the Sun xVM Hypervisor for x86 is loaded into memory in ring 0. After initialization, it loads the Solaris kernel to run as Dom0 in ring 3. The domain Dom0 is permitted to use the VM control hypercall interfaces (see Table 4 on page 42), and is responsible for hosting the application-level management software.
CPU Scheduling
The Sun xVM Hypervisor for x86 provides two schedulers for the user to choose between: Credit and simple Earliest Deadline First (sEDF). The Credit scheduler is the default; sEDF might be phased out and removed from the Sun xVM Server implementation. The Credit scheduler is a proportional fair share CPU scheduler. Each physical CPU (PCPU) manages a queue of runnable virtual CPUs (VCPUs), sorted by VCPU priority. A VCPU's priority can be either over or under, representing whether the VCPU has exceeded its share of the PCPU or not. A VCPU's share is determined by the weight assigned to the VM and the credit accumulated by the VCPU in each accounting period.
    Credit_VM(i) = (Credit_total × Weight_VM(i) + (Weight_total − 1)) / Weight_total

    Credit_VCPU(j,i) = Credit_VM(i) / N_VCPU(i)

where N_VCPU(i) is the number of VCPUs in VM i.
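As a numeric illustration of the credit equations above (a sketch with my own function names, not Xen's scheduler code): the added term (Weight_total − 1) rounds the integer division up, and each VCPU receives an equal slice of its VM's credit.

```c
#include <stdint.h>

/* Sketch of the credit arithmetic described in the text (names are
 * mine, not Xen's). A VM's per-period credit is its weight's
 * proportional share of the total credit, rounded up by adding
 * (weight_total - 1) before dividing; each VCPU then gets an equal
 * slice of the VM's credit. */
static int64_t vm_credit(int64_t credit_total, int64_t weight_vm,
                         int64_t weight_total)
{
    return (credit_total * weight_vm + (weight_total - 1)) / weight_total;
}

static int64_t vcpu_credit(int64_t credit_vm, int nr_vcpus)
{
    return credit_vm / nr_vcpus;
}
```

For example, with a total credit of 300 split over three VMs of equal weight, each VM receives 100 credits per accounting period, and a two-VCPU VM splits its credit 50/50 between its VCPUs.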
The first equation determines the total credit of a VM, and the second equation determines the credit of a VCPU in a VM. Credit_total is a constant; Weight_total is the sum of the weights of all domains. A VM's weight is assigned using xm(1M) (for example, xm sched-credit -w weight). In each accounting period, a fixed amount of credits is added to idle VCPUs and subtracted from running VCPUs. A VCPU has the priority under if it has not consumed all the credits it possesses. On each PCPU, at every scheduling decision (when a VCPU blocks, yields, completes its time slice, or is awakened), the next VCPU to run is picked off the head of the run queue of priority under. When a VM runs, it consumes the credits of its VCPU[s]. When a VCPU uses up all its allocated credits, its priority changes from under to over. When a PCPU doesn't find a VCPU of priority under on its local run queue, it looks on other PCPUs for one. This load balancing guarantees each VM receives its fair share of PCPU resources system-wide. Before a PCPU goes idle, it looks on other PCPUs to find any runnable VCPU, which guarantees that no PCPU idles when there is runnable work in the system. Earliest Deadline First (EDF) scheduling provides weighted CPU sharing by comparing the deadlines of scheduled periodic processes (or domains, in the case of Sun xVM
Server). This scheduler places domains in a priority queue. Each domain is associated with two parameters: the time requested to run, and an interval or deadline. Whenever a scheduling event occurs, the queue is searched for the domain closest to its deadline, and that domain is scheduled next for its requested time. The EDF scheduler gives better CPU utilization when a system is underloaded. When the system is overloaded, the set of domains that will miss deadlines is largely unpredictable (it is a function of the exact deadlines and the time at which the overload occurs).
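The sEDF decision described above can be sketched as a scan for the runnable domain closest to its deadline. This is a toy illustration; the struct and field names are mine, not the hypervisor's.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy illustration of the sEDF decision (not the hypervisor's code):
 * each domain carries a requested slice and an absolute deadline; at
 * every scheduling event, the runnable domain with the earliest
 * deadline runs next. */
struct sedf_dom {
    int64_t deadline;   /* absolute time by which the slice must finish */
    int64_t slice;      /* CPU time requested per period */
    int     runnable;
};

static int sedf_pick_next(const struct sedf_dom *doms, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (!doms[i].runnable)
            continue;
        if (best < 0 || doms[i].deadline < doms[best].deadline)
            best = (int)i;
    }
    return best;        /* index of next domain, or -1 if none runnable */
}
```

This also makes the overload behavior noted above visible: which domains miss deadlines depends entirely on how the deadlines happen to be ordered when demand exceeds capacity.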
The Solaris kernel function init_desctbls() passes each of its exception and interrupt vectors to the VMM using the set_trap_table() hypercall:
void
xen_idt_write(gate_desc_t *sgd, uint_t vec)
{
    trap_info_t trapinfo[2];

    bzero(trapinfo, sizeof (trapinfo));
    if (xen_idt_to_trap_info(vec, sgd, &trapinfo[0]) == 0)
        return;
    if (xen_set_trap_table(trapinfo) != 0)
        panic("xen_idt_write: xen_set_trap_table() failed");
}
The set_trap_table() hypercall has one argument, trap_info, which contains the privilege level of the GOS code segment, the code segment selector, and the address of the handler that will be used to set the instruction pointer when the VMM passes control back to the GOS (see the following code segment). The value of trap_info is set in the function xen_idt_to_trap_info() using the settings in the IDT gate descriptor.
On a 64-bit system, the interrupt descriptor has the descriptor privilege level (DPL) 3, similar to the segment descriptor:
[0]> idt0::print 'struct gate_desc'
{
    sgd_looffset = 0x4bf0
    sgd_selector = 0xe030
    sgd_ist = 0
    sgd_resv1 = 0
    sgd_type = 0xe
    sgd_dpl = 0x3
    sgd_p = 0x1
    sgd_hioffset = 0xfb84
    sgd_hi64offset = 0xffffffff
    sgd_resv2 = 0
    sgd_zero = 0
    sgd_resv3 = 0
}
When an interrupt or exception occurs, the VMM's trap handler is invoked to handle it. If the exception was caused by a GOS, the VMM's trap handler sets the pending bit (see Event Channels on page 43) and calls the GOS's exception handler. Interrupts for the GOS are virtualized by mapping them to event channels, and are delivered asynchronously to the target GOS via the callback registered with the set_callbacks() hypercall.
In the following example, a kmdb(1M) breakpoint is set at the interrupt service routine of the sd driver, sdintr(). The function xen_callback_handler(), the callback function used for processing events from the VMM, is registered with the VMM through the hypercall set_callbacks(). When an interrupt intended for sd arrives, the hypercall HYPERVISOR_block() detects that an event is available and then invokes the callback function:
sd`sdintr:
sd`sdintr:      ec8b4855 = pushq %rbp
[0]> $c
sd`sdintr(fffffffec0670000)
mpt`mpt_intr+0xdb(fffffffec0670000, 0)
av_dispatch_autovect+0x78(1b)
dispatch_hardint+0x33(1b, 0)
switch_sp_and_call+0x13()
do_interrupt+0x9b(ffffff0001005ae0, 1)
xen_callback_handler+0x36c(ffffff0001005ae0, 1)
xen_callback+0xd9()
HYPERVISOR_sched_op+0x29(1, 0)
HYPERVISOR_block+0x11()
mach_cpu_idle+0x52()
cpu_idle+0xcc()
idle+0x10e()
thread_start+8()
[0]>
Pending events are stored in a per-domain bitmask (see Event Channels on page 43), which is updated by the VMM before invoking an event-callback handler specified by the GOS. The function xen_callback_handler() is responsible for resetting the set of pending events and responding to the notifications in an appropriate manner. A VM may explicitly defer event handling by setting a VMM-readable software flag; this is analogous to disabling interrupts on a real processor.
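The per-port test that gates an upcall (see Event Channels above) reduces to bit operations over the two per-domain bitmasks. The sketch below is my own simplification, not the hypervisor's code: an event on a port is deliverable only when its pending bit is set and its mask bit is clear, and the handler clears the pending bit once the event is processed.

```c
#include <stdint.h>

/* Simplified model of the evtchn_pending/evtchn_mask test (my own
 * sketch): a port's event is deliverable when PENDING is set and the
 * port is not masked. Ports are spread across an array of 64-bit words. */
#define EVT_WORD(p)  ((p) / 64)
#define EVT_BIT(p)   (1ULL << ((p) % 64))

static int evtchn_deliverable(const uint64_t *pending, const uint64_t *mask,
                              unsigned port)
{
    return (pending[EVT_WORD(port)] & EVT_BIT(port)) != 0 &&
           (mask[EVT_WORD(port)] & EVT_BIT(port)) == 0;
}

/* The GOS clears the pending bit after processing the event. */
static void evtchn_clear_pending(uint64_t *pending, unsigned port)
{
    pending[EVT_WORD(port)] &= ~EVT_BIT(port);
}
```

Setting a mask bit is the event-channel analogue of disabling one interrupt line: the notification stays pending but no upcall is scheduled until the bit is cleared again.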
Timer Services
Timer Devices on page 27 discusses several hardware timers available on x86 systems. These hardware devices vary in their frequency reliability, granularity, counter size, and ability to generate interrupts. The Solaris OS employs some of these timer devices for running the OS clock and high-resolution timer:
- OS system clock. The Solaris OS uses the local APIC timer on multiprocessor systems to generate ticks for the system clock. On uniprocessor systems, the Solaris OS uses the PIT to generate ticks for the system clock.
- High-resolution timer. The Solaris OS uses the TSC for the high-resolution timer. The PIT counter is used to calibrate the TSC counter.
- Time-of-day clock. The time-of-day (TOD) clock is based on the RTC. Only Dom0 can set the TOD clock; the DomU VMs don't have permission to update the machine's physical RTC. Therefore, any attempt by the date(1) command to set the date and time on DomU is quietly ignored.
In Sun xVM Server, the VMM provides the system time to each VCPU when it is scheduled to run. The high-resolution timer, gethrtime(), still runs through the unprivileged RDTSC instruction, so the high-resolution timer is not virtualized. The virtualized system time relies on the current TSC value to calculate the time in nanoseconds since the VCPU was scheduled.
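The virtualized system time computation can be sketched as follows. The structure and field names are hypothetical, standing in for the per-VCPU time record the VMM supplies when the VCPU is scheduled: system time is the time recorded at scheduling plus the TSC ticks elapsed since, scaled to nanoseconds.

```c
#include <stdint.h>

/* Hypothetical per-VCPU time record (field names are mine), standing
 * in for the data the VMM supplies when the VCPU goes on CPU. */
struct vcpu_time {
    uint64_t ns_at_schedule;    /* system time when VCPU was scheduled */
    uint64_t tsc_at_schedule;   /* TSC value sampled at that moment */
    uint64_t tsc_hz;            /* TSC frequency in Hz */
};

/* System time = time at schedule + elapsed TSC ticks in nanoseconds.
 * (A real implementation would use scaled fixed-point arithmetic to
 * avoid the overflow in ticks * 1e9 for long intervals.) */
static uint64_t vcpu_system_time(const struct vcpu_time *t, uint64_t tsc_now)
{
    uint64_t ticks = tsc_now - t->tsc_at_schedule;

    return t->ns_at_schedule + ticks * 1000000000ULL / t->tsc_hz;
}
```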
XENMEM_increase_reservation
XENMEM_decrease_reservation
XENMEM_populate_physmap
XENMEM_maximum_ram_page
XENMEM_current_reservation
XENMEM_maximum_reservation
XENMEM_machphys_mfn_list
XENMEM_add_to_physmap
XENMEM_translate_gpfn_list
XENMEM_memory_map
XENMEM_machine_memory_map
XENMEM_set_memory_map
XENMEM_machphys_mapping
XENMEM_exchange
Page Translations
Segmented Architecture on page 23 describes two stages of address translation to arrive at a physical address: virtual address (VA) to linear address (LA) translation using segmentation, and LA to physical address (PA) translation using paging. Solaris x64 uses a flat address space in which the VA and LA are equivalent, which means the base address of the segment is 0. In Solaris 10, the Global Descriptor Table (GDT) contains the segment descriptor for the code and data segments of both kernel and user processes, as shown in Table 5 on page 54. Since there is only one GDT in a system, the VMM maintains the GDT in its memory. If a GOS wishes to use something other than the default segment mapping that the VMM GDT provides, it must register a custom GDT with the VMM using the set_gdt() hypercall. In the following code sample, frame_list is the physical address of the page that contains the GDT and entries is the number of entries in the GDT.
xen_set_gdt(ulong_t *frame_list, int entries)
{
    ....
    if ((err = HYPERVISOR_set_gdt(frame_list, entries)) != 0) {
        ....
    }
    return (err);
}
The Solaris 32-bit thread library uses %gs to refer to the LWP state manipulated by the internals of the thread library. The 64-bit thread library uses %fs to refer to the LWP state, as specified by the AMD64 ABI [8]. The 64-bit kernel still uses %gs for its CPU state (%fs is never used in the kernel). The KernelGSBase MSR is used to store the kernel %gs contents while the CPU runs a 32-bit user LWP, and the privileged instruction SWAPGS restores the kernel %gs on the switch back to kernel context. So when the VMM performs a context switch between guest kernel mode and guest user mode, it executes SWAPGS as part of the context switch (see CPU Privilege Mode on page 45). The GDT segments are given in Table 5 below:
% cat intel/sys/segments.h
#define GDT_NULL      0     /* null */
#define GDT_B32DATA   1     /* dboot 32 bit data descriptor */
#define GDT_B32CODE   2     /* dboot 32 bit code descriptor */
#define GDT_B16CODE   3     /* bios call 16 bit code descriptor */
#define GDT_B16DATA   4     /* bios call 16 bit data descriptor */
#define GDT_B64CODE   5     /* dboot 64 bit code descriptor */
#define GDT_BGSTMP    7     /* kmdb descriptor only used in boot */

#if defined(__amd64)
#define GDT_KCODE     6         /* kernel code seg %cs */
#define GDT_KDATA     7         /* kernel data seg %ds */
#define GDT_U32CODE   8         /* 32-bit process on 64-bit kernel %cs */
#define GDT_UDATA     9         /* user data seg %ds (32 and 64 bit) */
#define GDT_UCODE     10        /* native user code seg %cs */
#define GDT_LDT       12        /* LDT for current process */
#define GDT_KTSS      14        /* kernel tss */
#define GDT_FS        GDT_NULL  /* kernel %fs segment selector */
#define GDT_GS        GDT_NULL  /* kernel %gs segment selector */
#define GDT_LWPFS     55        /* lwp private %fs segment selector (32-bit) */
#define GDT_LWPGS     56        /* lwp private %gs segment selector (32-bit) */
#define GDT_BRANDMIN  57        /* first entry in GDT for brand usage */
#define GDT_BRANDMAX  61        /* last entry in GDT for brand usage */
#define NGDT          62        /* number of entries in GDT */
Every LWP context switch requires an update to the GDT for the new LWP. The GOS uses update_descriptor() for this task:

intel/ia32/os/desctbls.c
update_gdt_usegd(uint_t sidx, user_desc_t *udp)
{
    ....
    if (HYPERVISOR_update_descriptor(pa_to_ma(dpa), *(uint64_t *)udp))
        panic("xen_update_gdt_usegd: HYPERVISOR_update_descriptor");
}
On an x86 system, the base physical address of the page directory is contained in the control register %cr3. In the Solaris OS, the value of %cr3 is stored in the process's hat structure, proc->p_as->a_hat->hat_table->ht_pfn, as shown in Paging Architecture on page 25. The loading of %cr3 is performed by the VMM for security and coherency reasons.
Page Translations Virtualization on page 14 discusses two alternatives for updating page tables in a virtualized environment: hypercalls against a read-only page table, and shadow page tables. The Sun xVM Hypervisor for x86 provides an additional alternative, a writable page table, for the GOS to implement page translations. In the default mode of operation, the VMM uses both read-only and writable page tables to manage page tables. The VMM allows the GOS to use a writable page table to update the lowest-level page tables (that is, the PTEs). The higher levels, such as the PDE, PDP, and PML4, use a read-only page table and are updated with the mmu_update() hypercall. Updates to the higher-level page tables are much less frequent than PTE updates.

- Read-only page table. The GOS has read-only access to page tables and uses the mmu_update() hypercall to update them. As described in the previous section, Physical Memory Management on page 52, the GOS has a view of pseudo-physical memory, and a translation from physical address to machine address is performed before the mmu_update() call.
void
set_pteval(paddr_t table, uint_t index, uint_t level, x86pte_t pteval)
{
    ....
    ma = pa_to_ma(PT_INDEX_PHYSADDR(pfn_to_pa(ht->ht_pfn), entry));
    t[0].ptr = ma | MMU_NORMAL_PT_UPDATE;
    t[0].val = new;
    if (HYPERVISOR_mmu_update(t, cnt, &count, DOMID_SELF))
        panic("HYPERVISOR_mmu_update() failed");
    ....
}
- Writable page table. If a GOS attempts to write to a page table that is maintained by the VMM, the attempt results in a #PF fault to the VMM. The VMM fault handling routine performs the following tasks:
  1. Holds the lock for all further page table updates
  2. Disconnects the page that contains the updated page table by clearing the page present bit of the page table entry in the parent page table
  3. Makes the page writable by the GOS
  The page is reconnected to the paging hierarchy automatically in a number of situations, including when the guest modifies a different page-table page, when the domain is preempted, and whenever the guest uses the VMM's explicit page-table update interfaces.

- Shadow page table. The VMM maintains an independent copy of the page tables, called the shadow page table, which is pointed to by the %cr3 register. If a page fault occurs when a GOS's page table is accessed, the VMM propagates changes made to the GOS's page table to the shadow page table. Shadow page mode can be set for the GOS by calling dom0_op(DOM0_SHADOW_CONTROL).
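The disconnect step in the writable-page-table path amounts to clearing the Present bit of the parent entry that maps the page-table page. The sketch below is illustrative only; the Present bit value is the standard x86 bit 0, but the function names are mine.

```c
#include <stdint.h>

/* x86 page-table entries keep the Present flag in bit 0. Clearing it
 * in the parent entry "disconnects" the page-table page, so any access
 * through it faults to the VMM; setting it again reconnects the page
 * after revalidation. (Illustrative sketch, not Xen's code.) */
#define PTE_PRESENT 0x1ULL

static uint64_t pt_disconnect(uint64_t parent_entry)
{
    return parent_entry & ~PTE_PRESENT;
}

static uint64_t pt_reconnect(uint64_t parent_entry)
{
    return parent_entry | PTE_PRESENT;
}
```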
In addition to creating a translation entry, the VMM provides the mmuext_op() hypercall for the GOS to flush, invalidate, or lock a page translation. For example, it is necessary to lock the translations of a process when it is being created. The mmuext_op() hypercall is invoked by the kernel during the fork(2) system call:
[3]> :c
kmdb: stop at xen_pin+0x3a
kmdb: target stopped at:
xen_pin+0x3a:   call +0x208b1 <HYPERVISOR_mmuext_op>
[3]> $c
xen_pin+0x3a(ff2c, 3)
hat_alloc+0x285(fffffffec381b7e8)
as_alloc+0x99()
as_dup+0x3f(fffffffec381ba88, fffffffec3d0f8d0)
cfork+0x102(0, 1, 0)
forksys+0x25(0, 0)
sys_syscall32+0x13e()
Figure 19. The split device driver architecture employed by Sun xVM Server includes a front-end driver in DomU and a back-end driver in Dom0.
Dom0 is a special VM that has access to the real device hardware. The front-end driver appears to a GOS in DomU as a real device. This driver receives I/O requests from applications as usual. However, since the front-end driver does not have access to the physical hardware of the system, it must send requests to the back-end driver in Dom0. The back-end driver is responsible for issuing I/O requests to the real device hardware. When the I/O completes, the back-end notifies the front-end that the data is ready for use; the front-end is then able to report I/O completion and unblock the I/O call. When the Solaris OS is initialized, devices identify themselves and are organized into the device tree. This device tree depicts a hierarchy of nodes, with each node on the tree representing a device. Sun xVM Server exports a complete device tree to domain Dom0 so that it can directly access all physical devices on the system. For DomU domains, the paravirtualized Solaris OS uses information passed to it by xm(1M) to disable PCI bus probing and create virtual Sun xVM Server device nodes under the VMM virtual device nexus driver, xpvd.
Output from the prtconf(1M) command shows the device tree as exported by Sun xVM Server to a VM in a DomU domain. As the prtconf(1M) output shows, there are no physical devices of any kind on the device tree in DomU:
# prtconf
System Configuration:  Sun Microsystems  i86pc
Memory size: 2048 Megabytes
System Peripherals (Software Nodes):

i86xpv
    scsi_vhci, instance #0
    isa (driver not attached)
    xpvd, instance #0
        xencons, instance #0
        xenbus, instance #0
        domcaps, instance #0
        balloon, instance #0
        xdf, instance #0
        xnf, instance #0
    iscsi, instance #0
    pseudo, instance #0
    agpgart, instance #0
    options, instance #0
    xsvc, instance #0
    cpus (driver not attached)
        cpu, instance #0 (driver not attached)
        cpu, instance #1 (driver not attached)
A driver that provides services to other drivers is called a bus nexus driver and is shown in the device tree hierarchy as a node with children. The nexus driver provides bus mapping and translation services to subordinate devices in the device tree. The types of services provided by the nexus driver include interrupt priority assignment, DMA resource mapping, and device memory mapping. As seen in the previous prtconf(1M) output, the xpvd driver is the root nexus driver for all Sun xVM Server devices on DomU. An individual device driver is represented in the tree as a node with no children; this type of node is referred to as a leaf driver. In the above example, xenbus, domcaps, xencons, xdf, and xnf are leaf drivers.
The Sun xVM Server-related driver modules for Dom0 and DomU, respectively, are shown below:

Sun xVM Server related device modules on Dom0:
    xpvtod   (TOD module for Xen)
    xpvd     (virtual device nexus driver)
    xencons  (virtual console driver)
    privcmd  (privcmd driver)
    evtchn   (evtchn driver)
    xenbus   (virtual bus driver)
    xdb      (vbd backend driver)
    xnb      (xnb module)
    xsvc     (xsvc driver)
    balloon  (balloon driver)

Sun xVM Server related device modules on DomU:
    xenbus   (virtual bus driver)
    xpvtod   (TOD module for i86xpv)
    xpvd     (virtual device nexus driver)
    xencons  (virtual console driver)
    xdf      (Xen virtual block driver)
    xnf      (virtual Ethernet driver)
The xpvtod driver provides setting and getting of the time-of-day for the VM. TOD service is provided by the RTC timer. If a request to set the TOD comes from a DomU domain, the request is silently ignored, as DomU doesn't have permission to set the RTC timer. The nexus driver in Solaris provides bus mapping and translation services to subordinate devices in the device tree. The xpvd driver is the nexus driver for all virtual I/O drivers that don't directly access a physical device. This driver's primary functions are to provide interrupt mapping and to invoke the initialization routines of its children devices. The xenbus driver provides a bus abstraction that drivers can use to communicate between VMs. The bus is mainly used for configuration negotiation, leaving most data transfer to be done via an interdomain channel composed of a grant table and an event channel. The xenbus driver also makes the configuration data available to the XenStore shared storage repository (see XenStore on page 45). The evtchn driver is used for receiving and demultiplexing event-channel signals to user land. The balloon driver is controlled by the VMM to manage physical memory usage by a VM (see Physical Memory Virtualization on page 13 and Physical Memory Management on page 52). The privcmd driver is used by the domain manager on Dom0 to obtain VMM services for VM management.
The drivers xdf and xdb, the front-end and back-end block device drivers respectively, are discussed in Disk Driver on page 60. The xnf and xnb drivers, the front-end and back-end network drivers respectively, are discussed in Network Driver on page 61. Data transfer between interdomain drivers is provided mainly by the VMM grant table and event-channel services. Most of the data transfer is handled in a fashion similar to a DMA transfer between host and device: data is put in the grant table by the sending VM, and notification is sent to the receiving VM through the event channel. The callback routine in the receiving VM is then invoked to process the data.
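The DMA-like transfer pattern described above can be modeled in miniature: the sender fills a granted page and raises an event, and the receiver's callback drains it. All names below are mine; real drivers use the grant table and event channel hypercalls rather than a shared struct and a flag.

```c
#include <stdint.h>
#include <string.h>

/* Toy model (all names mine) of interdomain data transfer: a page-sized
 * shared buffer stands in for a granted page, and a flag stands in for
 * the event-channel notification. */
struct shared_page {
    uint8_t  data[4096];
    uint32_t len;
};

static int event_raised;

/* Sender side: copy data into the granted page, then notify the peer. */
static void send_to_peer(struct shared_page *gref, const void *buf,
                         uint32_t len)
{
    memcpy(gref->data, buf, len);
    gref->len = len;
    event_raised = 1;       /* stands in for the event-channel kick */
}

/* Receiver side: the callback consumes the data only when an event is
 * pending, then acknowledges it. Returns the number of bytes drained. */
static uint32_t peer_callback(struct shared_page *gref, void *out)
{
    if (!event_raised)
        return 0;
    memcpy(out, gref->data, gref->len);
    event_raised = 0;
    return gref->len;
}
```

The design point the model captures is that the event channel carries only the notification; the payload itself moves through the shared (granted) memory, just as a device interrupt signals the completion of a DMA transfer rather than carrying the data.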
Disk Driver
The xdb driver, the back-end driver on Dom0, provides services for block device management. This driver receives I/O requests from DomU domains and sends them on to the native driver. On DomU, xdf is the pseudo block driver that receives I/O requests from applications and sends them to the xdb driver in Dom0. The xdf driver provides functions similar to those of the SCSI target disk driver, sd, on an unvirtualized Solaris system. On Solaris systems, the main interface between a file system and a storage device is the strategy(9E) driver entry point. The strategy(9E) entry point takes only one argument, buf(9S), which is the basic data structure for block I/O transfer. An I/O request made by a file system through the strategy(9E) entry point is called PAGEIO, as the memory buffer for the I/O is allocated from the kernel page pool. An application can also open the storage device as a raw device and perform read(2) and write(2) operations directly on it. Such an I/O request is called PHYSIO, physio(9F), as the memory buffer for the I/O is allocated by the application.
In addition to the strategy(9E) driver entry point for supporting file system and raw device access, a disk driver also supports a set of ioctl(2) operations for disk control and management. The dkio(7I) disk control operations define a standard set of ioctl(2) commands. Normally, support for dkio(7I) operations requires direct access to the device. In DomU, xdf supports most of the ioctl(2) commands defined in dkio(7I) by emulating the disk control inside xdf. No communication is made by xdf to the back-end driver for ioctl(2) operations.
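This local emulation can be pictured with a hedged sketch: the front end answers a geometry-style ioctl from state cached at attach time, with no message to the back end. The structure, field, and command names here are invented for illustration and do not match the Solaris xdf sources:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative front-end driver state; geometry is cached when the
 * virtual disk is attached, so ioctls never reach the back end. */
struct vdisk_geom { uint32_t ncyl, nhead, nsect, secsize; };

struct xdf_softstate {
    struct vdisk_geom geom;   /* cached at attach time */
};

enum { FAKE_DKIOCGGEOM = 1 };  /* stand-in for a dkio(7I) command */

int xdf_ioctl(struct xdf_softstate *ss, int cmd, void *arg)
{
    switch (cmd) {
    case FAKE_DKIOCGGEOM:
        /* Emulated entirely inside the front end: copy cached data. */
        memcpy(arg, &ss->geom, sizeof(ss->geom));
        return 0;
    default:
        return -1;             /* unsupported command */
    }
}
```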
The sequence of events for disk I/O data transfer is illustrated in Figure 20. The disk control path, ioctl(2), is similar to the data path. When a disk I/O request is issued by a DomU domain, the sequence is as follows:

1. The file system calls the xdf driver's strategy(9E) entry point as a result of a read(2) or write(2) system call.
2. The xdf driver puts the I/O buffer, buf(9S), on the grant table. This buffer is allocated from the DomU memory, and access to this memory is granted to the other domain.
3. The xdf driver notifies Dom0 of an event through the event channel.
4. The VMM event channel generates an interrupt to the xdb driver in Dom0.
5. The xdb driver in Dom0 gets the DomU I/O buffer through the grant table.
6. The xdb driver in Dom0 calls the native driver's strategy(9E) entry point.
7. The native driver performs DMA.
8. The VMM receives the device interrupt.
9. The VMM generates an event to Dom0.
10. The xdb driver's iodone() routine is called by biodone(9F).
11. The xdb driver's iodone() routine generates an event to DomU.
12. The xdf driver in DomU receives an interrupt, frees up the grant table and DMA resources, and calls biodone(9F) to wake up anyone waiting for the I/O.

When a disk I/O request is issued by the control domain Dom0, the sequence is as follows:

13. Block I/O requests are sent directly to the native driver.
Figure 20. Sequence of events for an I/O request from a Sun xVM Server virtual machine.
Network Driver
The Sun xVM Server network drivers use an approach similar to that of the disk block drivers for handling network packets. On DomU, the pseudo network driver xnf gets I/O requests from the network stack and sends them to xnb on Dom0. The back-end network driver xnb on Dom0 forwards packets sent by xnf to the native network driver.
The buffer management for packet receiving has more impact on network performance than that for packet transmitting. On the packet receiving end, the data is transferred via DMA into the native driver's receive buffer on Dom0. Then, the packet is copied from the native driver buffer to the VMM buffer. The VMM buffer is then mapped to the DomU kernel address space without another copy of the data. The sequence of operations for packet receiving is as follows:

1. Data is transferred via DMA into the native driver (bge) receive buffer ring.
2. The xnb driver gets a new buffer from the VMM and copies data from the bge receive ring to the new buffer.
3. The xnb driver sends DomU an event through the event channel.
4. The xnf driver in DomU receives an interrupt.
5. The xnf driver maps an mblk(9S) to the VMM buffer and sends the mblk(9S) to the upper stack.
Figure 21. Sequence of events for a network request from a Sun xVM Server virtual machine.
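As a rough model of this single-copy receive path (all names and buffer sizes below are invented for illustration; real buffers are DMA-mapped pages, not plain arrays):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PKT 256

static unsigned char native_ring[PKT];  /* bge receive ring (DMA lands here) */
static unsigned char vmm_buf[PKT];      /* VMM-owned buffer */
static int copies;                      /* count data copies on the RX path */

/* Step 2: xnb copies from the native ring into a fresh VMM buffer. */
void xnb_copy_to_vmm(size_t len)
{
    memcpy(vmm_buf, native_ring, len);
    copies++;
}

/* Step 5: xnf maps an mblk to the VMM buffer; no second copy is made. */
const unsigned char *xnf_map_mblk(void)
{
    return vmm_buf;                     /* mapping, not copying */
}
```

The point of the design is visible in the model: exactly one copy happens on receive, after which the guest sees the VMM buffer by mapping.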
Chapter 6
Sun xVM Server with Hardware VM (HVM)
Note Intel's virtualization extension is called Virtual Machine Extensions (VMX), and is documented in the IA-32 Intel Architecture Software Developer's Manual (see [7] Volume 3B Chapters 19-23). AMD's extension is called Secure Virtual Machine (SVM), and is documented in the AMD64 Architecture Programmer's Manual Volume 2: System Programming (see [9] Chapter 15).
                                                   Intel
Virtualization operation                           VMX
Privileged mode                                    VMX Root
Reduced-privileged mode                            VMX non-Root
HVM Control and State Data Structure (HVMCSDS)     VMCS
Entering non-privileged mode                       VMLAUNCH/VMRESUME
Exiting non-privileged mode                        Implicit
After HVM is enabled, the processor operates in privileged mode. Transitions from privileged mode to reduced-privileged mode are called VM entries. Transitions from reduced-privileged mode to privileged mode are called VM exits. Figure 22 illustrates entry and exit with the HVMCSDS.
Figure 22. Virtual machine entry and exit with hardware support on AMD and Intel processors.
state, VMM state, control fields, and the VM state before loading the VM state from the HVMCSDS to launch the VM entry. As part of VM entry, the VMM can inject an event into the VM. The event injection process is used to deliver virtualized external interrupts to a VM. A VM normally does not get interrupts from I/O devices, because I/O devices are not exposed to VMs (with the exception of Dom0). As will be shown in Sun xVM Server with HVM I/O Virtualization (QEMU) on page 71, a VM's I/O is handled by a special domain (Dom0) that runs a paravirtualized OS and has direct access to I/O devices. When an I/O operation completes, Dom0 informs the VMM to send an interrupt through an hvm_op hypercall. The VMM prepares the HVMCSDS for event injection, and the VM's return instruction pointer (RIP) is pushed on the stack.

VM exit occurs implicitly in response to certain instructions and events in a VM. The VMM governs the conditions causing a VM exit by manipulating the control fields in the HVMCSDS. The events that can be controlled to result in a VM exit include the following (see [9] Chapter 20):

- External interrupts, non-maskable interrupts, and system management interrupts
- Executing certain instructions (such as RDPMC, RDTSC, or instructions that access the control registers)
- Exceptions

The exact conditions that cause a VM exit are defined in the HVMCSDS control fields. Certain conditions may cause a VM exit for one VM but not for other VMs. A VM exit behaves like a fault, meaning that the instruction causing the VM exit does not execute and no processor state is updated by the instruction. The VM exit handler
in the VMM is responsible for taking appropriate actions for the VM exit. Unlike exceptions, which are dispatched through the IDT, the VM exit handler is specified in the HVMCSDS host RIP field:
static void construct_vmcs(struct vcpu *v)
{
    ....
    /* Host CS:RIP. */
    __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS);
    __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
    ....
}
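The control-field mechanism, in which the same event exits one VM but not another depending on that VM's per-VM controls, can be sketched as follows. The bit positions and names below are illustrative, not the architected encodings:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-VM execution controls; real encodings differ. */
#define CTL_EXIT_ON_EXTINT  (1u << 0)   /* external interrupts */
#define CTL_EXIT_ON_RDTSC   (1u << 1)   /* RDTSC instruction */
#define CTL_EXIT_ON_CR_ACC  (1u << 2)   /* control-register access */

enum vm_event { EV_EXTINT, EV_RDTSC, EV_CR_ACCESS };

/* Each VM carries its own control word in its HVMCSDS-like area,
 * so the same event may force an exit for one VM and not another. */
int causes_vm_exit(uint32_t exec_controls, enum vm_event ev)
{
    switch (ev) {
    case EV_EXTINT:    return (exec_controls & CTL_EXIT_ON_EXTINT) != 0;
    case EV_RDTSC:     return (exec_controls & CTL_EXIT_ON_RDTSC) != 0;
    case EV_CR_ACCESS: return (exec_controls & CTL_EXIT_ON_CR_ACC) != 0;
    }
    return 0;
}
```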
Instruction             Description
VMLAUNCH/VMRESUME       launch/resume VM
VMCLEAR                 clear VMCS
VMPTRLD/VMPTRST         load/store VMCS
VMREAD/VMWRITE          read/write VMCS
VMXON/VMXOFF            enable/disable VMX operation
VMCALL                  call to the VMM
In addition to the new VMX instructions and the VMCS, Intel VT introduces a direct I/O architecture [28] to improve VM security, reliability, and performance through I/O enhancements. As will be shown in Sun xVM Server with HVM I/O Virtualization (QEMU) on page 71, the current I/O virtualization implementation for Sun xVM Server with HVM, which is based on the QEMU project, is inefficient, as all I/O transactions have to go through Dom0; unreliable, as the I/O virtualization layer on Dom0 becomes a single point of failure; and insecure, as a VM may access another VM's DMA memory by manipulating the values written to I/O ports.
The Intel-VT direct I/O architecture specifies the following hardware capabilities for the VMM:

- DMA remapping: This feature provides IOMMU support for I/O address translation and caching. The IOMMU specified in the architecture includes a page table hierarchy similar to the processor page tables, and an IOTLB for frequently accessed I/O pages. Addresses used in DMA transactions are allocated from the IOMMU address space, and the IOMMU hardware provides address translation from the IOMMU address space to the system memory address space.
- I/O device assignment across VMs: This feature allows a PCI/PCI-X device that is behind a PCI-E to PCI/PCI-X bridge, or a PCI-E device, to be assigned to a VM, regardless of how the PCI bus is bound to a VM.
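The DMA-remapping idea can be illustrated with a toy one-level translation. A real IOMMU uses a multi-level page-table hierarchy plus an IOTLB, and the entry layout below is invented:

```c
#include <assert.h>
#include <stdint.h>

/* Flat model: device-visible I/O page frame -> system page frame.
 * Entry format (illustrative): sys_pfn in the upper bits, bit 0 = present. */
#define IO_PAGE_SHIFT 12
#define IO_PAGES      16

static uint64_t io_page_table[IO_PAGES];

void iommu_map(uint64_t io_pfn, uint64_t sys_pfn)
{
    io_page_table[io_pfn] = (sys_pfn << 1) | 1;
}

/* Returns the system physical address, or -1 on a translation fault
 * (an unmapped I/O address means the DMA is blocked). */
int64_t iommu_translate(uint64_t io_addr)
{
    uint64_t pfn = io_addr >> IO_PAGE_SHIFT;
    uint64_t off = io_addr & ((1u << IO_PAGE_SHIFT) - 1);
    if (pfn >= IO_PAGES || !(io_page_table[pfn] & 1))
        return -1;
    return (int64_t)(((io_page_table[pfn] >> 1) << IO_PAGE_SHIFT) | off);
}
```

The protection property comes from the fault path: a device can only reach system memory the VMM chose to map for it.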
generalization of two facilities included in the AMD64 architecture: the Graphics Aperture Remapping Table (GART) and the Device Exclusion Vector (DEV). The GART provides address translation of I/O device accesses to a small range of the system physical address space, and the DEV provides a limited degree of I/O device classification and memory protection.
The VMM uses hvm_function_table to provide a VCPU to a VM. The entry points in hvm_function_table fall into two categories: setup and runtime. The setup entry points are called when a VM is being created. The runtime entry points are called before VM entry or after VM exit. Since the HVMCSDS data structure abstracts the state and controls of a VCPU, the entry points in hvm_function_table are primarily used to manipulate that data structure.
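A hedged sketch of such a split ops table follows; the struct, fields, and functions are illustrative, not the actual hvm_function_table:

```c
#include <assert.h>

/* Illustrative VCPU and ops table; real Xen state is far richer. */
struct vcpu { int id; int csds_initialized; int entries; };

struct hvm_ops {
    /* setup: called once, when the VM is created */
    void (*vcpu_initialize)(struct vcpu *v);
    /* runtime: called around each VM entry/exit */
    void (*prepare_entry)(struct vcpu *v);
};

static void vmx_vcpu_initialize(struct vcpu *v) { v->csds_initialized = 1; }
static void vmx_prepare_entry(struct vcpu *v)   { v->entries++; }

static const struct hvm_ops vmx_ops = {
    .vcpu_initialize = vmx_vcpu_initialize,
    .prepare_entry   = vmx_prepare_entry,
};

void create_vcpu(const struct hvm_ops *ops, struct vcpu *v)
{
    ops->vcpu_initialize(v);   /* setup path */
}

void run_vcpu(const struct hvm_ops *ops, struct vcpu *v)
{
    ops->prepare_entry(v);     /* runtime path, before VM entry */
}
```

The indirection lets the same VMM core drive either the Intel (VMX) or AMD (SVM) back end by swapping the ops table.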
In supporting shadow page tables [29], the Sun xVM Hypervisor for x86 attempts to intercept all updates to a guest page table, and updates both the VM's page table and the shadow page table maintained by the VMM, keeping the two synchronized at all times. This implementation results in two page faults: one due to faulting the actual page, and a second due to the page table access. This shadow page table scheme has a significant impact on VM performance. Alternatives such as nested page tables (see AMD Secure Virtual Machine Specifics on page 67) have been proposed to improve memory virtualization performance.
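The write-interception scheme can be modeled in a few lines; the tables and the intercept hook below are illustrative only (a real implementation write-protects the guest table so that each update traps into the VMM):

```c
#include <assert.h>
#include <stdint.h>

#define PT_ENTRIES 8

static uint64_t guest_pt[PT_ENTRIES];    /* the VM's page table */
static uint64_t shadow_pt[PT_ENTRIES];   /* the VMM's shadow copy */
static int write_faults;                 /* intercepts taken */

/* The VMM write-protects the guest table; every update lands here. */
void intercepted_pt_write(int index, uint64_t pte)
{
    write_faults++;            /* the extra fault that makes shadowing slow */
    guest_pt[index] = pte;
    shadow_pt[index] = pte;    /* keep the shadow synchronized */
}
```

The counter makes the cost visible: every guest page-table write pays an interception, which is the overhead nested page tables were designed to remove.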
Dependent HVM Functions on page 69) to set the HVMCSDS event injection control field. The event is delivered when the VM is entered.
Emulated BIOS
The PC BIOS provides hardware initialization, boot services, and runtime services to the OS. There are some restrictions on VMX operation: an OS in HVM cannot operate in real mode. Unlike a paravirtualized OS, which can change its bring-up sequence for an environment without a BIOS, an unmodified OS requires an emulated BIOS to perform some real-mode operations before control is passed to the OS. Sun xVM Server includes a BIOS emulator, hvmloader, as a surrogate for the real BIOS. The hvmloader BIOS emulation contains three components: ROMBIOS, VGABIOS, and VMXAssist. Both ROMBIOS and VGABIOS are based on the open source Bochs BIOS [23]. The VMXAssist component is included in hvmloader to emulate real mode, which is required by hvmloader and bootstrap loaders. The hvmloader BIOS emulator is bootstrapped like any other 32-bit OS. After it is loaded, hvmloader copies
The hvmloader BIOS emulator does not directly interface with physical devices. It communicates with virtual devices as discussed in the following section Sun xVM Server with HVM I/O Virtualization (QEMU).
The Solaris OS running in DomU can use pcn to communicate with QEMU on a Solaris Dom0 that has an e1000g NIC. The pcnet emulation in QEMU converts Solaris pcn transactions to a generic virtual network interface (such as TAP), which forwards the packet to the driver for the native network interface (such as e1000g). QEMU I/O emulation is illustrated in Figure 23. The principle of operation for sending out an I/O request is outlined as follows:

1. An OS interfaces with a device through I/O ports and/or memory-mapped device memory. The device performs certain operations, such as DMA, in response to I/O port/memory access by the OS. At the completion of the operation, the device generates an interrupt to notify the OS (Steps 1 and 2 in Figure 23).
2. The VMM monitors and intercepts the device I/O port and memory accesses (Step 3 in Figure 23).
3. The VMM forwards the I/O port/memory data to an I/O virtualization layer such as QEMU (Step 4 in Figure 23).
4. QEMU decodes the I/O port/memory data and performs the necessary emulation for the I/O request (Step 5 in Figure 23).
5. QEMU delivers the emulated I/O request to the OS native device interface (Steps 6 and 7 in Figure 23).
Figure 23. I/O emulation in Sun xVM Server using QEMU for dynamic binary translation.
Using the AMD PCnet LANCE PCI Ethernet controller as an example, the vendor ID and device ID of the PCnet chip are 1022 and 2000, respectively. From prtconf(1M) output, the PCI registers exported by the device are:
% prtconf -v
....
pci1022,2000, instance #0
    Hardware properties:
        name='assigned-addresses' type=int items=5
            value=81008810.00000000.00001400.00000000.00000080
        name='reg' type=int items=10
            value=00008800.00000000.00000000.00000000.00000000.01008810.00000000.00000000.00000000.00000080
....
According to IEEE 1275 OpenBoot firmware [25], the reg property is generated by reading the base address registers in the configuration address space. Each entry in the reg property consists of one 32-bit cell for register configuration, a 64-bit address cell, and a 64-bit size cell [26]. As the prtconf(1M) output shows, the PCnet chip has a 128-byte (0x00000080) register region in the I/O address space (the 01 in the first byte of 0x01008810 denotes I/O address space). QEMU emulation for PCnet simply monitors the Solaris driver's access to these 128 bytes of registers using x86 IN/OUT instructions.
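Decoding such a reg entry can be sketched as follows, assuming the address-space selector sits in bits 25:24 of the first cell as in the PCI bus binding; the struct and names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* One "reg" entry: a 32-bit configuration cell, then a 64-bit
 * address and a 64-bit size (field names are illustrative). */
struct pci_reg_entry {
    uint32_t phys_hi;    /* configuration cell */
    uint64_t addr;
    uint64_t size;
};

enum pci_space { SPACE_CONFIG = 0, SPACE_IO = 1,
                 SPACE_MEM32 = 2, SPACE_MEM64 = 3 };

enum pci_space reg_space(const struct pci_reg_entry *e)
{
    /* bits 25:24 of phys.hi select the address space */
    return (enum pci_space)((e->phys_hi >> 24) & 0x3);
}
```

Applied to the second entry in the prtconf output (configuration cell 0x01008810, size 0x80), this yields an I/O-space region of 128 bytes, matching the text above.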
The QEMU virtualization for transmitting and receiving a packet using the PCnet emulation is illustrated in Figure 23 on page 72. The sequence of events corresponding to the numbered dots in the figure is described below:

1. Applications make an I/O request to the driver through system calls.
2. The pcn driver writes to the DMA descriptor using the OUT instruction. In pcn, pcn_send() calls pcn_OutCSR() to start the DMA transaction. Then, pcn_OutCSR() calls ddi_put16() to write a value to an I/O address. Next, ddi_put16() checks whether the mapping (io_handle) is for I/O space or memory space. If the mapping is for the I/O space, it moves its third argument to %rax and the port ID to %rdx, and issues the OUTW instruction to the port referenced by %dx.
pcn_send()
{
    ....
    pcn_OutCSR(pcnp, CSR0, CSR0_INEA | CSR0_TDMD);
    ...
}

static void
pcn_OutCSR(struct pcninstance *pcnp, uintptr_t reg, ushort_t value)
{
    ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RAP), reg);
    ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RDP), value);
}

ENTRY(ddi_put16)
    movl    ACC_ATTR(%rdi), %ecx
    cmpl    $_CONST(DDI_ACCATTR_IO_SPACE|DDI_ACCATTR_DIRECT), %ecx
    jne     8f
    movq    %rdx, %rax
    movq    %rsi, %rdx
    outw    (%dx)
    ret
The OUT instruction causes a VM exit. The CPU is set up by the VMM to take an unconditional VM exit if the VM executes IN/OUT/INS/OUTS, as shown in the setting of the CPU_BASED_UNCOND_IO_EXITING bit in the VM execution controls (see Table 20-6 in [7]).
#define MONITOR_CPU_BASED_EXEC_CONTROLS                 \
    ( MONITOR_CPU_BASED_EXEC_CONTROLS_SUBARCH |         \
      CPU_BASED_HLT_EXITING |                           \
      CPU_BASED_INVDPG_EXITING |                        \
      CPU_BASED_MWAIT_EXITING |                         \
      CPU_BASED_MOV_DR_EXITING |                        \
      CPU_BASED_UNCOND_IO_EXITING |                     \
      CPU_BASED_USE_TSC_OFFSETING )

void vmx_init_vmcs_config(void)
{
    ....
    _vmx_vmexit_control =
        adjust_vmx_controls(MONITOR_VM_EXIT_CONTROLS,
                            MSR_IA32_VMX_EXIT_CTLS_MSR);
    ....
}
The VM exit handler is set in the host RIP field in the HVMCSDS (see HVM Operations and Data Structure on page 64). The VM exit handler examines the exit reason and calls the I/O instruction function, vmx_io_instruction(), to handle the VM exit.
asmlinkage void vmx_vmexit_handler(struct cpu_user_regs *regs)
{
    ....
    case EXIT_REASON_IO_INSTRUCTION:
        exit_qualification = __vmread(EXIT_QUALIFICATION);
        inst_len = __get_instruction_length();
        vmx_io_instruction(exit_qualification, inst_len);
        break;
    ....
}
3. The VM exit handler for I/O instructions in the VMM examines the exit qualification and gets the OUT information from the HVMCSDS. This information includes:

- Size of the access (1 byte, 2 bytes, or 4 bytes)
- Direction of the access (IN or OUT)
- Port number
- The direction flag (DF) setting
- Size and address of the string buffer, if this is an I/O string operation
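Decoding the exit qualification can be sketched along these lines. The field layout follows the Intel manual's description (access size in the low bits, a direction bit, the port number in bits 31:16), but consult [7] for the authoritative encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of an I/O-instruction exit qualification. */
struct io_exit_info {
    unsigned size;      /* 1, 2, or 4 bytes */
    int is_in;          /* 1 = IN, 0 = OUT */
    int is_string;      /* INS/OUTS */
    uint16_t port;
};

struct io_exit_info decode_io_exit(uint64_t qual)
{
    struct io_exit_info io;
    io.size      = (unsigned)(qual & 0x7) + 1;  /* encoded as size - 1 */
    io.is_in     = (int)((qual >> 3) & 1);
    io.is_string = (int)((qual >> 4) & 1);
    io.port      = (uint16_t)(qual >> 16);
    return io;
}
```

For the OUTW issued by ddi_put16() above, the handler would see a 2-byte OUT with the port number taken from %dx.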
The VM exit handler then fills in struct ioreq fields, and sends the I/O request to its client by calling send_pio_req().
static void vmx_io_instruction(unsigned long exit_qualification,
                               unsigned long inst_len)
{
    ....
    send_pio_req(port, count, size, addr, dir, df, 1);
    ....
}
4. The client of the I/O request (qemu-dm) is blocked on the event channel device node created by the evtchn module (see Event Channels on page 43). In the VMM, hvm_send_assist_req() is called by send_pio_req() to set the event pending bit of the event channel and wake up the qemu-dm client waiting on the event.
void hvm_send_assist_req(struct vcpu *v)
{
    ....
    p->state = STATE_IOREQ_READY;
    notify_via_xen_event_channel(v->arch.hvm_vcpu.xen_port);
}
5. The QEMU emulator, qemu-dm, is a user process that contains the ioemu module for I/O emulation. The ioemu module waits on one end of the event channel for I/O requests from the VMM.
int main_loop(void)
{
    ....
    qemu_set_fd_handler(evtchn_fd, cpu_handle_ioreq, NULL, env);
    while (1) {
        ....
        main_loop_wait(10);
    }
    ....
}
When an I/O request arrives, ioemu is unblocked and cpu_handle_ioreq() is called to get the ioreq structure from the event channel. Based on the information in ioreq, the appropriate pcnet functions are invoked to handle the I/O request.
6. After pcnet decodes the ioreq structure, ioemu sends the packet to the TAP network interface. The TAP network interface [27] is a virtual Ethernet network device that provides two interfaces to applications:

- Character device /dev/tapX
- Virtual network interface tapX

where X is the instance number of the TAP interface. Applications can write Ethernet frames to the /dev/tapX character interface, and the TAP driver will deliver the frame through the tapX network interface. In the same manner, a packet that the kernel writes to the tapX network interface can be read by an application from the /dev/tapX character device node. ioemu writes the packet to the TAP character interface, which forwards the packet to the native driver interface.
static void pcnet_transmit(PCNetState *s)
{
    ....
    qemu_send_packet(s->vc, s->buffer, s->xmit_pos);
    ....
}

static void tap_receive(void *opaque, const uint8_t *buf, int size)
{
    ....
    for (;;) {
        ret = write(s->fd, buf, size);
        ....
    }
}
7. The Dom0 native driver sends the packet to the network hardware. This marks the end of transmitting a packet from DomU to the real network.
8. Dom0 receives an interrupt indicating a packet intended for DomU has arrived. This marks the beginning of receiving a packet targeted to a DomU from the real network. The native network driver forwards the packet through a bridge to the TAP network interface, tapX.
9. Next, tap_send() is invoked when data is written to the file. The packet is read from the /dev/tapX character interface. Then, qemu_send_packet() calls pcnet_receive() to send out the buffer.
static void tap_send(void *opaque)
{
    ....
    size = read(s->fd, buf, sizeof(buf));
    if (size > 0) {
        qemu_send_packet(s->vc, buf, size);
    }
}
10. The pcnet_receive() function in ioemu copies the data read from the TAP character device into VMM memory. The data can be either an I/O port value from the IN instruction or a network packet. At the end of the data transfer, pcnet informs the VMM to generate an interrupt.
static void pcnet_receive(void *opaque, const uint8_t *buf, int size)
{
    ....
    cpu_physical_memory_write(rbadr, src, count);
    ...
    pcnet_update_irq(s);
}
11. The ioemu module makes an hvm_op(set_pci_intx_level) hypercall to the VMM to generate an interrupt to the target domain.
int xc_hvm_set_pci_intx_level(
    int xc_handle, domid_t dom,
    uint8_t domain, uint8_t bus, uint8_t device, uint8_t intx,
    unsigned int level)
{
    ....
    hypercall.op = __HYPERVISOR_hvm_op;
    hypercall.arg[0] = HVMOP_set_pci_intx_level;
    hypercall.arg[1] = (unsigned long)&arg;
    ....
    rc = do_xen_hypercall(xc_handle, &hypercall);
    ....
}
The VMM sets the guest HVMCSDS area to inject an event with the next VM entry. The target VM will get an interrupt when the VMM launches a VM entry to the target domain (see Sun xVM Server Interrupt and Exception Handling for HVM on page 70).
virtualization model as the Sun xVM Server PV architecture (see Sun xVM Server I/O Virtualization on page 56). Paravirtualized drivers (PV drivers) like xbf and xnf are included in the OS distribution. When a VM is created, Dom0 exports virtual I/O devices (for example, xnf and xbf) instead of emulated I/O devices (for example, pcn and mpt) to the GOS. PV drivers are subsequently bound to these virtual devices and used for handling I/O. The I/O transactions follow the same path as described in Chapter 5, Sun xVM Server. PV drivers will be provided for Solaris 10 and Windows so they can run unmodified on Sun xVM Server with better I/O performance.
Chapter 7
Logical Domains
The Logical Domains (LDoms) technology from Sun Microsystems allows a system's resources, such as memory, CPUs, and I/O devices, to be allocated into logical groupings. Multiple isolated systems, each with its own operating system, resources, and identity within a single computer system, can then be created using these partitioned resources. Unlike Sun xVM Server, LDoms technology partitions a processor into multiple strands and assigns each strand its own hardware resources. (See Terms and Definitions on page 113.) Each virtual machine, called a domain in LDoms terminology, is associated with one or more dedicated strands. A thin layer of firmware, called the hypervisor, is interposed between the hardware and the operating system (Figure 24). The hypervisor abstracts the hardware resources and provides an interface to the operating system software.
Figure 24. The hypervisor, a thin layer of firmware, abstracts hardware resources and presents them to the OS.
The LDoms implementation includes four components:

- UltraSPARC T1/T2 processor
- UltraSPARC hypervisor
- Logical Domain Manager (LDM)
- Paravirtualized Solaris OS
Note The terms strand, hardware thread, logical processor, virtual CPU, and virtual processor are used by various documents to refer to the same concept. For consistency, the term strand is used in this chapter.
Note In Sun documents, the term hypervisor is used to refer to the hyperprivileged software that performs the functions of the VMM, and the term domain is used to refer to a VM. To accommodate Sun's terminology, hypervisor and domain (instead of VMM and VM) are used in this chapter.
This chapter assumes a basic understanding of the UltraSPARC T1/T2 processor, which plays a major role in the implementation of LDoms. (See Chapter 4, SPARC Processor Architecture on page 29.) The remainder of the chapter is organized as follows:

- Logical Domains (LDoms) Architecture Overview on page 80 provides an overview of the LDoms architecture and the other three components of LDoms: the paravirtualized Solaris OS, the UltraSPARC hypervisor, and the Logical Domain Manager.
- CPU Virtualization in LDoms on page 84 discusses CPU virtualization, including trap and interrupt handling.
- Memory Virtualization in LDoms on page 88 discusses memory virtualization, including physical memory allocation and page translations.
- I/O Virtualization in LDoms on page 91 discusses I/O virtualization and describes the operation of the disk block and network drivers.
Figure 25. A control domain, Solaris OS, and Linux guest domains running in logical domains on an UltraSPARC T1/T2 processor-powered server.
The UltraSPARC T1/T2 processor architecture is described earlier in Chapter 4, SPARC Processor Architecture on page 29. In this section, the other three components of the LDoms technology (the paravirtualized Solaris OS, the hypervisor, and the Logical Domain Manager) are discussed.
Paravirtualized Solaris OS
The Solaris kernel implementation for the UltraSPARC T1/T2 hardware class (uname -m) is referred to as the Solaris sun4v architecture. In this implementation, the Solaris OS is paravirtualized to replace operations that require hyperprivileged mode with hypervisor calls. The Solaris OS communicates with the hypervisor through a set of hypervisor APIs, and uses these APIs to request that the hypervisor perform hyperprivileged operations. Sun4v support for LDoms is a combination of partitioning the UltraSPARC T1/T2 processor into strands and virtualization of memory and I/O services. Unlike Sun xVM Server and VMware, an LDoms domain does not share strands with other domains. Each domain has one or more strands assigned to it, and each strand has its own hardware resources so that it can execute instructions independently of other strands. The virtualization of CPU functions to support CMT is implemented at the processor rather than at the software level (that is, there is no software scheduler). A Solaris guest OS can directly access strand-specific registers in a domain and can, for example, perform operations such as setting an OS trap table to the trap base address register (TBA). The Solaris sun4v architecture assumes that the platform includes the hypervisor as part of its firmware. The hypervisor runs in the hyperprivileged mode, and the Solaris
OS runs in the privileged mode of the processor. The Solaris kernel uses hypercalls to request that the hypervisor perform hyperprivileged functions of the processor. Like Intel's VT and AMD's Pacifica architectures, the sun4v architecture leverages CPU support (hyperprivileged mode) for the implementation of the hypervisor. Unlike Intel's VT and AMD's Pacifica architectures, which provide a special mode of execution for the hypervisor and thus make the hypervisor transparent to the GOS, the support for the hypervisor in UltraSPARC T1/T2 is non-transparent to the GOS. The UltraSPARC T1/T2 processors provide a set of hypervisor APIs through which the GOS delegates hyperprivileged operations to the hypervisor.
Hypervisor Services
The hypervisor layer is a component of the UltraSPARC T1/T2 system's firmware. An UltraSPARC system's firmware consists of the Open Boot PROM (OBP), Advanced Lights Out Management (ALOM), Power-On Self Test (POST), and the hypervisor. The hypervisor leverages the UltraSPARC T1/T2 hyperprivileged extensions to provide a protection mechanism for running multiple guest domains on the system. The hypervisor provides a number of services to its overlaying domains. These services include hypervisor APIs, which are the interfaces for a GOS to request hypervisor services, and Logical Domain Channel (LDC) services, which are used by virtual device drivers for inter-domain communications.

Hypervisor API

The sun4v hypervisor API [11] uses the Tcc instruction to cause the GOS to trap into hyperprivileged mode, in a fashion similar to how OS system calls are implemented. The function of the hypervisor API is equivalent to that of system calls in the OS, which enable user applications to request services from the OS. The sun4v hypervisor API allows a GOS to perform the following actions:

- Request services from the hypervisor
- Get and set CPU information through the hypervisor

The UltraSPARC Virtual Machine Specification [11] lists the complete set of services and APIs for:

- API versioning: request and check for a version of the hypervisor APIs with which the GOS may be compatible
- Domain services: enable a control domain to request information about or to affect other domains
- CPU services: control and configure a strand; includes operations such as start/stop/suspend a strand, set/get the trap base address register, and configure the interrupt queue
- MMU services: perform MMU-related operations such as configuring the TSB, mapping/demapping the TLB, and configuring the fault status register
- Memory services: zero and flush data from cache to memory
- Interrupt services: get/set interrupt enabled, target strand, and state of the interrupt
- Time-of-Day services: get/set time-of-day
- Console services: get/put a character to the console
- Channel services: provide communication channels between domains (see Logical Domain Channel (LDC) Services on page 83)

The following two examples of hv_mem_sync() and hv_api_set_version() show the implementation of hypervisor calls:
% mdb -k
> hv_mem_sync,6/ai
hv_mem_sync:
hv_mem_sync:             mov     %o2, %o4
hv_mem_sync+4:           mov     0x32, %o5
hv_mem_sync+8:           ta      %icc, %g0 + 0
hv_mem_sync+0xc:         retl
hv_mem_sync+0x10:        stx     %o1, [%o4]
> hv_api_set_version,6/ai
hv_api_set_version:
hv_api_set_version:      mov
hv_api_set_version+4:    clr
hv_api_set_version+8:    ta
hv_api_set_version+0xc:  retl
hv_api_set_version+0x10: stx
Trap types in the range 0x180-0x1FF are used to transition from privileged mode to hyperprivileged mode. In the two preceding examples, a TT value of 0x180 (offset of 0) is used for hv_mem_sync(), and a TT value of 0x1FF (offset of 0x7f) is used for hv_api_set_version().
Hypervisor calls are normally invoked during the startup of the kernel to set up strands for the domain. Only a few hypercall functions are called during the runtime of the kernel, including: hv_tod_set(), hv_tod_get(), hv_set_ctx0(), hv_mmu_map_perm_addr(), hv_mmu_unmap_perm_addr(), hv_set_ctxnon0(), and hv_mmu_set_stat_area().
Logical Domain Channel (LDC) Services

The hypervisor provides communication channels between domains. These channels are accessed within a domain as endpoints. Two endpoints are connected together, forming a bidirectional point-to-point LDC. All traffic sent to a local endpoint arrives at the corresponding endpoint at the other end of the channel in the form of short fixed-length (64-byte) message packets. Each endpoint is associated with one receive queue and one transmit queue. Messages from a channel are deposited by the hypervisor at the tail of a queue, and the receiving
domain indicates receipt by moving the corresponding head pointer for the queue. To send a packet down an LDC, a domain inserts the packet into its transmit queue, and then uses a hypervisor API call to update the tail pointer for the transmit queue. In the Solaris OS, the hypervisor LDC service is used as a simulated I/O bus interface, enabling a virtual device to communicate with a real device on the I/O domain. All virtual devices that communicate with the I/O domain for device access are leaf nodes on the LDC bus. For example, the virtual disk client driver, vdc, uses the LDC service to communicate with the virtual disk server driver, vds, on the other side of the channel. Both vdc and vds are leaf nodes on the channel bus (see I/O Virtualization in LDoms on page 91).
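The queue mechanics can be modeled with a small sketch. In a real LDC the tail update is a hypervisor API call and the queues live in memory registered with the hypervisor; everything below is illustrative:

```c
#include <assert.h>
#include <string.h>

/* Toy LDC endpoint queue: fixed 64-byte packets, a head pointer the
 * receiver advances and a tail pointer the sender advances. */
#define LDC_PKT_SIZE 64
#define LDC_QDEPTH   8

struct ldc_queue {
    unsigned char pkts[LDC_QDEPTH][LDC_PKT_SIZE];
    unsigned head;   /* receiver moves this to indicate receipt */
    unsigned tail;   /* sender moves this (via hypercall in real LDC) */
};

int ldc_send(struct ldc_queue *q, const void *pkt)
{
    unsigned next = (q->tail + 1) % LDC_QDEPTH;
    if (next == q->head)
        return -1;                           /* queue full */
    memcpy(q->pkts[q->tail], pkt, LDC_PKT_SIZE);
    q->tail = next;                          /* publish the packet */
    return 0;
}

int ldc_recv(struct ldc_queue *q, void *pkt)
{
    if (q->head == q->tail)
        return -1;                           /* queue empty */
    memcpy(pkt, q->pkts[q->head], LDC_PKT_SIZE);
    q->head = (q->head + 1) % LDC_QDEPTH;    /* acknowledge receipt */
    return 0;
}
```

Nonempty means head differs from tail, which is exactly the condition the hypervisor uses to notify the receiving domain.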
Each strand has two interrupt queues: cpu_mondo and dev_mondo. The cpu_mondo queue is used for CPU-to-CPU cross-call interrupts; the dev_mondo queue is used for device-to-CPU interrupts. The Solaris kernel allocates memory for each queue and registers these queues with the hv_cpu_qconf() hypercall. When a queue is nonempty (that is, the queue head is not equal to the queue tail), a trap is generated to the target CPU. The data of the received interrupt (mondo data) is stored in the queue.
The I/O and CPU cross-call interrupt delivery mechanism is as follows:

1. An I/O device asserts its interrupt line to generate an interrupt to the processor. The I/O bridge chip receives the interrupt request and prepares a mondo packet to be sent to the target processor, whose CPU number is stored in a bridge chip register by the OS. The mondo packet contains an interrupt number that uniquely identifies the source of the interrupt.

2. The hypervisor receives the interrupt request from the hardware through the interrupt vector trap (0x60). For example, the trap table for the T2000 firmware has the following entries:
ENTRY(htraptable)
....
TRAP(tt0_05e, HSTICK_INTR)
TRAP(tt0_05f, NOT)
TRAP(tt0_060, VECINTR)
....
The CPU number and interrupt number are also delivered along with the interrupt trap. The interrupt vector trap handler, VECINTR, uses the interrupt number to determine the source of the interrupt. If the interrupt is coming from I/O, the trap handler uses the CPU number to find the dev_mondo queue associated with the CPU and adds the interrupt to the tail of the dev_mondo queue. When the head of the queue is not equal to the tail, a trap (0x7C for CPU cross calls and 0x7D for I/O) is generated to the CPU that owns the queue.

3. Traps 0x7C and 0x7D are taken via the GOS trap table. For I/O interrupts, dev_mondo() is the trap handler for 0x7D.
# mdb -k
> trap_table+0x20*0x7c/ai
0x1000f80:
0x1000f80:      ba,a,pt %xcc, +0xc784 <cpu_mondo>
> trap_table+0x20*0x7d/ai
0x1000fa0:
0x1000fa0:      ba,a,pt %xcc, +0xc800 <dev_mondo>
>
The dev_mondo() handler takes the interrupt out of the queue by incrementing the queue head. It also finds the interrupt vector data, struct intr_vec, from the system's interrupt vector table. The struct intr_vec data contains the priority interrupt level (PIL) and the driver's interrupt service routine (ISR) for the interrupt. The dev_mondo() handler then sets the SOFTINT register with the PIL of the interrupt.
4. Setting the SOFTINT register causes an interrupt_level_n trap, 0x41-0x4f, to be generated, where n is the PIL of the interrupt. The GOS's trap handler for the interrupt_level_n trap then processes the interrupt. If the PIL of the interrupt is below the clock PIL, an interrupt thread is allocated to handle the interrupt. Otherwise, the high-level interrupt is handled by the currently executing thread.

In summary, the interrupt delivery mechanism is a two-stage process. First, an interrupt is delivered to the hypervisor as the interrupt vector trap, 0x60. Then the interrupt is added to an interrupt queue, which causes another trap to the GOS.
The high-resolution timer is provided by the rdtick instruction, which reads the counter field of the TICK register. The rdtick instruction is a privileged instruction that can be executed by the Solaris OS without hypervisor involvement.
[Table: address space identifiers for Real Address (memory), Noncacheable Real Address, Real Address Little-endian, and Noncacheable Real Address Little-endian accesses.]
The partition ID register is defined in ASI 0x58, VA 0x80 [2] with an 8-bit field for the partition ID. The full representation of each type of address is as follows:
real_address     = context_ID :: virtual_address
physical_address = partition_ID :: real_address

or:

physical_address = partition_ID :: context_ID :: virtual_address
Figure 26. Different types of addressing are used in different modes of operation.
Page Translations
Page translations in the UltraSPARC architecture are managed by software through several different types of traps (see Memory Management Unit on page 32). Depending on the trap type, traps may be handled by the hypervisor or the GOS. Table 9 summarizes the MMU-related trap types (see also Table 12-4 in [2]).
Table 9. MMU-related trap types in the UltraSPARC T1/T2 processor
[Table 9: the MMU-related trap types include iTLB Miss, dTLB Miss, and Protection Violation; each has several possible causes.]
In the hypervisor trap table, htraptable, the instructions for handling dTLB miss, trap 0x68, are:
% mdb ./ontario/release/q
> htraptable+0x20*0x68,8/ai
htraptable+0xd00:
htraptable+0xd00:   rdpr    %priv_16, %g1
htraptable+0xd04:   cmp     %g1, 3
htraptable+0xd08:   bgu,pn  %xcc, +0x73b8 <watchdog_guest>
htraptable+0xd0c:   mov     0x28, %g1
htraptable+0xd10:   ba,pt   %xcc, +0x97a0 <dmmu_miss>
htraptable+0xd14:   ldxa    [%g1] 0x4f, %g1
htraptable+0xd18:   illtrap 0
htraptable+0xd1c:   illtrap 0
The trap table transfers control to dmmu_miss() to load the page translation from the TSB. If the translation doesn't exist in the TSB, dmmu_miss() calls dtsb_miss(). The handler dtsb_miss() sets the TT register to trap type 0x31 (data_access_MMU_miss), changes the PSTATE register to the privileged mode, and transfers control to the GOS's trap handler for trap 0x31. The portion of
dtsb_miss() that performs this functionality is shown in the following example:
> dtsb_miss,80/ai
....
wrpr  %g0, 0x31, %tt     ! write 0x31 to %tt
rdpr  %pstate, %g3       ! read %pstate to %g3
or    %g3, 4, %g3
wrpr  %g3, %pstate       ! write %g3 to %pstate
rdpr  %tba, %g3          ! get privileged mode's trap
                         ! table base address
....                     ! set %g3 to the address of
                         ! trap type 0x31
....  %g3                ! jump to 0x31 trap handler
In the Solaris OS, the trap handler for trap type 0x31 calls the handler sfmmu_slow_dmmu_miss() to load the page translation from hme_blk. If no entry is found there, the page fault handler sfmmu_pagefault() is called:
% mdb -k
> trap_table+0x20*0x31,2/ai
scb+0x620:
scb+0x620:      ba,a    +0xc1b4 <sfmmu_slow_dmmu_miss>
scb+0x624:      illtrap 0
> sfmmu_pagefault,80/ai
....
sfmmu_pagefault+0x78:   sethi
sfmmu_pagefault+0x7c:   or
sfmmu_pagefault+0x80:   ba,pt
During the system boot, the OBP device tree information is passed to the Solaris OS and used to create the system device nodes. Output from the following prtconf(1M) command shows the system configuration of a typical non-I/O domain:
# prtconf
System Configuration:  Sun Microsystems  sun4v
Memory size: 1024 Megabytes
System Peripherals (Software Nodes):

SUNW,Sun-Fire-T200
    scsi_vhci, instance #0
    packages (driver not attached)
        SUNW,builtin-drivers (driver not attached)
        deblocker (driver not attached)
        disk-label (driver not attached)
        terminal-emulator (driver not attached)
        dropins (driver not attached)
        SUNW,asr (driver not attached)
        kbd-translator (driver not attached)
        obp-tftp (driver not attached)
        ufs-file-system (driver not attached)
    chosen (driver not attached)
    openprom (driver not attached)
        client-services (driver not attached)
    options, instance #0
    aliases (driver not attached)
    memory (driver not attached)
    virtual-memory (driver not attached)
    virtual-devices, instance #0
        ncp (driver not attached)
        console, instance #0
        channel-devices, instance #0
            disk, instance #0
            network, instance #0
    cpu (driver not attached)
    cpu (driver not attached)
    cpu (driver not attached)
    cpu (driver not attached)
    iscsi, instance #0
    pseudo, instance #0
As this system configuration shows, no physical devices are exported to the domain. The virtual-devices entry is the nexus node of all virtual devices. The channel-devices entry is the bus node for the virtual devices that require communication with the I/O domain. The disk and network entries are leaf nodes on the channel-devices bus.
The Solaris drivers that are specific to the LDom configuration are listed below:
LDOM drivers:
    vdc     (virtual disk client 1.4)            non-I/O domain only
    ldc     (sun4v LDC module v1.5)
    ds      (Domain Services 1.3)
    cnex    (sun4v channel-devices nexus driver)
    vnex    (sun4v virtual-devices nexus driver)
    dr_cpu  (sun4v CPU DR 1.2)
    drctl   (DR Control pseudo driver v1.1)
    qcn     (sun4v console driver v1.5)
    vnet    (vnet driver v1.4)                   non-I/O domain only
    vds     (virtual disk server v1.6)           I/O domain only
    vsw     (sun4v Virtual Switch Driver 1.5)    I/O domain only
Similar to Sun xVM Server, the LDoms VIO on a non-I/O domain uses a split device driver architecture for virtual disk and network devices. The vdc and vnet client drivers are used in non-I/O domains. The vds and vsw server drivers are used in the I/O domain to support the vdc and vnet drivers. The vnex nexus driver, the driver for the virtual-devices nexus node, provides bus services to its child nodes, vnet and vdc.
The VIO framework uses the hypervisor's Logical Domain Channel (LDC) service, described earlier, for driver communication between domains: each LDC forms a bi-directional point-to-point link that carries short fixed-length (64-byte) message packets between the transmit queue of one endpoint and the receive queue of the other.
Figure 27. Sequence of events for disk I/O from a non-I/O domain to an I/O domain.
For non-I/O domains, the following events occur when applications use read(2) and
write(2) system calls to access a file:
1. The file system calls the vdc driver's strategy(9E) entry point.
2. The vdc driver sends the I/O buf, buf(9S), to the LDC.
3. The vdc driver returns after all data is successfully sent to the LDC.
4. The vds driver is notified by the hypervisor that messages are available on its queue.
5. The vds driver retrieves data from the LDC and sends it to the device service that is mapped to the client virtual disk.
6. The vds driver starts the block I/O by sending the I/O request to the native driver and then dispatching a task queue to await I/O completion.
7. The native SCSI driver receives the device interrupt.
8. The vds driver's I/O completion task is woken up by biodone(9F), and vds sends a message to vdc indicating I/O completion.
9. The vdc driver receives the message from vds, and calls biodone(9F) to wake up anyone waiting for it.
For I/O domains, the I/O path of data requests is simpler:

10. Block I/O requests are sent directly from the file system to the native driver.

In addition to the strategy(9E) driver entry point for supporting file system and raw device access, the vdc driver also supports most of the ioctl(2) commands as
defined in dkio(7I) for disk control. The Solaris kernel variable dk_ioctl1 defines the exact disk ioctl commands supported by the vdc driver.
Network Driver
The Solaris LDoms network drivers include a client network driver, vnet, and a virtual switch, vsw, on the server side. To transmit a packet, vnet sends the packet over the LDC to vsw. The binding of vnet to vsw is defined in the vnet resource of the domain when the domain is created. The vsw driver forwards the packet to the native driver, and includes the IP address of vnet as the source address. The vnet driver returns as soon as the packet has been put on a buffer and the buffer has been added to the tail of the LDC queue.

When receiving packets from the network, if the native driver is configured as a virtual switch in the vswitch resource of the domain, the packet is passed up from the native driver to vsw. The vsw driver finds the MAC address associated with the destination IP address from its ARP table, gets the target domain from the MAC address, and gets the vnet interface from the vnet resource. The packet is then sent to the LDC of the designated vnet driver. The vnet driver uses Solaris GLD v3 interfaces and is fully compatible with native drivers using the same GLD v3 interface.

Figure 28 depicts the flow of receiving a packet from the network through an I/O domain to a guest domain. The sequence of operations for receiving packets is as follows:

1. Data is stored via DMA into the receive buffer ring of the native driver, e1000g.
2. The vsw driver sends the packet to the client driver, vnet, through the LDC.
3. The LDC receive worker thread gets the packet and sends it to the vnet driver.
1. Information on the Solaris kernel variable dk_ioctl can be looked up at the Web site: http://www.opensolaris.org/.
I/O Domain

ldc`ldc_write
vsw`vsw_dringsend+0x234
vsw`vsw_portsend+0x60
vsw`vsw_forward_all+0x134
vsw`vsw_switch_l2_frame+0x248
mac`mac_rx+0x58
e1000g`e1000g_intr_pciexpress+0xb8
px`px_msiq_intr+0x1b8
intr_thread+0x170
cpu_halt+0xc0
idle+0x128
thread_start+4
Figure 28. Flow of control for receiving a network packet from an I/O domain to a guest domain.
Chapter 8
VMware
VMware, the current market leader in virtualization software for x86 platforms, offers three virtual machine systems: VMware Workstation; the no-cost VMware Server, formerly known as VMware GSX Server; and VMware Infrastructure 3, a suite of virtualization products based on VMware ESX Server Version 3.

The VMware Workstation and VMware Server products are add-on software modules that run on a host OS such as Windows, Linux, or BSD variants (Figure 29). In these implementations the VMM is a part of, and has the same privilege as, the host OS kernel. The guest OS runs as an application on the host OS. The Solaris OS can only run as a guest OS on VMware Workstation and VMware Server.

The VMware Infrastructure suite of products is built around VMware ESX Server. VMware ESX Server runs on bare metal and uses a derived version of SimOS [18] as the kernel for running the VMM and I/O services. All other operating systems run as guest OSes. VMware Infrastructure supports Windows, Linux, and Solaris as guest OSes. VMware ESX Server provides lower overhead and better control of system resources than VMware Workstation and VMware Server. However, because it provides all device drivers, it supports fewer devices than VMware Workstation and VMware Server. Figure 29 shows the configuration of VMware ESX Server and GSX Server.
Figure 29. VMware GSX Server (VMware Workstation and VMware Server products) runs within a host operating system, while VMware ESX Server runs on the bare metal.
VMware ESX Server is a Type I VMM, and has exclusive control of hardware resources (see Types of VMM on page 10). In contrast, VMware Workstation and VMware Server are Type II VMMs, and leverage the host OS by running inside the OS kernel.
[Figure 30. VMware ESX Server architecture: guest applications with network and SCSI drivers run above the VMM; the VMkernel provides the network stack, storage stack, storage emulation, and storage drivers; a service console exposes the management interface; CPU, network, and storage hardware sit below.]
The following sections discuss the functional components of VMware Infrastructure, with particular emphasis on the virtualization layer which forms the core of all VMware virtualization products.
The core of the ESX virtualization layer is the VMM, which includes three modules (Figure 31) [12]:

Execution decision module: decides whether VM instructions should be sent to the direct execution module or the binary translation module

Binary translation module: used to execute the VM whenever the hardware processor is in a state in which direct execution cannot be used

Direct execution module: enables the VM to directly execute its instruction sequences on the underlying hardware processor
Figure 31. VMware ESX Server virtualizes the CPU hardware through binary translation whenever the processor itself cannot directly execute an instruction.
The decision to use binary translation or direct execution depends on the state of the processor and whether the segment is reversible or not (see Segmented Architecture on page 23). If the content of the descriptor table, for example the GDT, is changed by the VMM because of a context switch to another VM, the segment is non-reversible. Direct execution can be used only if the VM is running in an unprivileged mode and the hidden descriptors of the segment register are reversible. In all other cases, the VMM will switch to the binary translation module.
Binary Translation
The binary translation (BT) module is believed to be influenced by the machine simulators Shade [13] and Embra [14]. Embra is part of SimOS [18], which was developed by a Stanford team led by Mendel Rosenblum, one of the founders of VMware. While extensive details of the BT module implementation have not been published, Agesen [15], Embra [14], and Shade [13] provide some information on its implementation. The BT module translates GOS instructions, which are running in a deprivileged VM, into instructions that can run in the privileged VMM segment. The BT module receives x86 binary instructions, including privileged instructions, as input. The output of the module is a set of instructions that can be safely executed in non-privileged mode. Agesen [15] gives an example of how control flow is handled in the BT module.
100
VMware
To avoid frequently retranslating blocks of instructions, translated blocks are kept in a Translation Cache (TC). The execution of a block of instructions is simulated by locating the block's translation in the TC and jumping to it. A hash table maintains the mappings from a program counter to the address of the translated code in the TC. The main loop of the dynamic binary translation simulator is shown in Figure 32. The loop checks to see if the current simulated program counter address is present in the TC. If it is present, the translated block is executed. If it is not, the translator is called to add the block to the TC. Each block of translated code ends by loading the new simulated program counter and jumping back to the main loop for dispatching.
main() {
    ....
    /* dispatch loop */
    if (PC_not_in_TC(pc))
        tc = translate(pc);
    newpc = pc_to_tc(pc);
    jump_to_pc(newpc);
    ....
}

translate(pc) {
    ....
    blk = read_instructions(pc);
    perform_translation(blk);
    write_into_TC(blk);
    ....
}
Figure 32. Binary translation manages a translation cache to reduce the need to re-translate frequently executed blocks of instructions.
A more detailed description of binary translation is beyond the scope of this paper. Readers should refer to Shade [13] and Embra [14] for more details about dynamic binary translation.

Some privileged instructions that have simple operations use in-TC sequences. For example, a clear interrupt instruction (cli) can be replaced by setting a virtual processor flag. Privileged instructions that have more complex operations (such as setting cr3 during a context switch) require a call out of the TC to perform the emulation work.

In addition to binary translation and logic for determining the code execution, the virtualization layer employs other techniques to overcome x86 virtualization issues:

Memory Tracing: The virtualization layer traces modifications on any given physical page of the virtual machine, and is notified of all read and write accesses made to that page in a transparent manner. This memory tracing ability in the VMM is enabled by page faults and the ability to single-step the virtual machine via binary translation.
101
VMware
Shadow Descriptor Tables: The x86's segmented architecture (see Segmented Architecture on page 23) has a segment caching mechanism that allows the segment register's hidden fields to be re-used. However, this approach can cause difficulty if the descriptor table is modified in a non-coherent way. The virtualization layer supports the GOS system descriptor tables using VMM shadow descriptor tables. The VMM descriptor tables include shadow descriptors that correspond to predetermined descriptors of the VM descriptor tables. The VMM also includes a segment tracking mechanism that compares the shadow descriptors with their corresponding VM segment descriptors. This mechanism detects any lack of correspondence between the shadow descriptor tables and their corresponding VM descriptor tables, and updates the shadow descriptors so that they correspond to their respective VM segment descriptors.

The ESX Server's VMM implementation is unique in that each GOS has an associated VMM. The ESX Server may include any number of VMMs in a given physical system, each supporting a corresponding VM; the number of VMMs is limited only by available memory and speed requirements. The features in the virtualization layer mentioned in the previous discussion allow multiple concurrent VMMs, with each VMM supporting an unmodified GOS.
CPU Scheduling
The ESX Server implements a rate-based proportional-share scheduler [19] that is similar to the fair-share scheduler used by the Solaris OS (see [21] Chapter 8), in which each virtual machine is given a number of shares. The amount of CPU time given to each VM is based on its fractional share of the total number of shares of active VMs in the whole system. The term share is used to define a portion of the system's CPU resources that is allocated to a VM. If a greater number of CPU shares is assigned to a VM, relative to other VMs, then that VM receives more CPU resources from the scheduler. CPU shares are not equivalent to percentages of CPU resources. Rather, shares define the relative weight of a CPU load in a VM in relation to the CPU loads of other VMs. The following formula shows how the scheduler calculates per-domain allocation of CPU resources.
Allocation_domain = Shares_domain / TotalShares
The ESX scheduler allows specifying minimum (reservation) and maximum (limit) CPU utilization for each virtual machine. A minimum CPU reservation guarantees that a virtual machine always has this minimum percentage of a physical CPU's time
allocated to it, regardless of the total number of shares. A maximum CPU limit ensures that the virtual machine never uses more than this maximum percentage of a physical CPU's time, even if extra idle time is available. The proportional-share algorithm is applied only if the VM CPU utilization falls within the range between the reservation and limit CPU utilization. Figure 33 shows how CPU resource allocation is calculated.
In an SMP environment in which a VM can have more than one virtual CPU (VCPU), a scalability issue arises when one VCPU is spinning on a lock held by another VCPU that gets de-scheduled. The spinning VCPU wastes CPU cycles spinning on the lock until the lock owner VCPU is finally scheduled again and releases the lock. ESX implements co-scheduling to work around this problem. In co-scheduling (also called gang scheduling), all virtual processors of a VM are mapped one-to-one onto the underlying processors and simultaneously scheduled for an equal time slice. The ESX scheduler guarantees that no VCPUs are spinning on a lock held by a VCPU that has been preempted. However, co-scheduling does introduce other problems. Because all VCPUs are scheduled at the same time, co-scheduling activates a VCPU regardless of whether there are jobs in the VCPU's run queue. Co-scheduling also precludes multiplexing multiple VCPUs on the same physical processor.
Timer Services
Similar to Sun xVM Server, ESX Server faces the issue of getting clock interrupts delivered to VMs at the configured interval [16]. This issue arises because the VM may not be scheduled when interrupts are due to be delivered. ESX Server keeps track of the clock interrupt backlog and tries to deliver clock interrupts at a higher rate when the backlog gets large. However, the backlog can get so large that it is not possible for the GOS to catch up with real time. In such cases, ESX Server stops attempting to catch
up if the clock interrupt backlog grows beyond 60 seconds. Instead, ESX Server sets its record of the clock interrupt backlog to zero and synchronizes the GOS clock with the host machine clock. ESX Server virtualizes the Time Stamp Counter (TSC) so that the virtualized TSC counter matches the GOS clock (see Time Stamp Counter (TSC) on page 28). When the clock interrupt backlog is cleared, either by catching up or by the reset when the backlog is too large, the virtualized TSC catches up with the adjusted clock.
Page Translations
Each GOS in the ESX Server maintains page tables for virtual-to-physical address mappings. The VMM also maintains shadow page tables for the virtual-to-machine page mappings along with physical-to-machine mappings in its memory. The processor's MMU uses the VMM's shadow page table. When a GOS updates its page tables with a virtual-to-physical translation, the VMM intercepts the instruction, gets the physical-to-machine mapping from its memory, and loads the shadow page table with the virtual-to-machine mapping. This mechanism allows normal memory accesses in the VM to execute without adding address translation overhead if the shadow page tables are set up for that access.
sound chip [20]. In addition, ESX Server also provides virtual PCI emulation for PCI addon devices such as SCSI, Ethernet, and SVGA graphics (see Figure 30 on page 98). The device tree as exported by the VMM to a GOS is shown in the following
prtconf(1M) output.
% prtconf
System Configuration:  Sun Microsystems  i86pc
Memory size: 1648 Megabytes
System Peripherals (Software Nodes):

i86pc
    scsi_vhci, instance #0
    isa, instance #0
        i8042, instance #0
            keyboard, instance #0
            mouse, instance #0
        lp (driver not attached)
        asy, instance #0 (driver not attached)
        asy, instance #1 (driver not attached)
        fdc, instance #0
            fd, instance #0
    pci, instance #0
        pci15ad,1976 (driver not attached)
        pci8086,7191, instance #0
            pci15ad,1976 (driver not attached)
            pci-ide, instance #0
                ide, instance #0
                    sd, instance #16
                ide (driver not attached)
            pci15ad,1976 (driver not attached)
            display, instance #0
            pci1000,30, instance #0
                sd, instance #0
            pci15ad,750, instance #0
    iscsi, instance #0
    pseudo, instance #0
    options, instance #0
    agpgart, instance #0 (driver not attached)
    xsvc, instance #0
    objmgr, instance #0
    acpi (driver not attached)
    used-resources (driver not attached)
    cpus (driver not attached)
        cpu, instance #0 (driver not attached)
        cpu, instance #1 (driver not attached)
The PCI vendor ID of VMware is 15ad. The following entries are relevant to VMware I/O virtualization:

Device Entry     Description
pci15ad,750      VMware emulation of Intel's 100FX Gigabit Ethernet
pci8086,7191     VMware emulation of the Intel 440BX/ZX PCI bridge chip
pci1000,30       The LSI Logic 53C1020/1030 SCSI controller
pci15ad,1976     VMware virtual SVGA
For the example device tree shown here, the Solaris OS binds the e1000g driver to pci15ad,750 and uses e1000g as the network driver. The actual network hardware used on the system is a Broadcom NetXtreme Dual Gigabit Adapter with the PCI ID pci14e4,1468. VMware translates the e1000g device interfaces passed by the Solaris e1000g driver, and sends them to the Broadcom NetXtreme device. For storage, unlike Sun xVM Server, ESX Server continues to use sd as the interface to file systems. The emulation of the disk interface is provided at the SCSI bus adapter interface (the LSI Logic SCSI controller) instead of at the SCSI target interface (the SCSI disk driver, sd).
Device Emulation
Each storage device, regardless of the specific adapter, appears as a SCSI drive connected to an LSI Logic SCSI adapter within the VM. For network I/O, ESX Server emulates an AMD Lance/PCnet or Intel e1000g device, or uses a custom interface called vmxnet, for the physical network adapter. VMware provides device emulation rather than the I/O emulation used by Sun xVM Server and UltraSPARC LDoms (see I/O Virtualization on page 16). In a simple scenario, consider an application within the VM making an I/O request to the GOS, as illustrated in Figure 34:
1. Applications perform I/O operations through the interface to the device as exported by the VMware VMM (see VMware I/O Virtualization on page 103). The virtual device interface uses the native drivers (for example, e1000g for network and mpt for the LSI SCSI HBA) in the Solaris kernel.
2. The Solaris native driver attempts to access the device via the IN/OUT instructions (for example, by writing a DMA descriptor to the device's DMA engine).
3. The VMM intercepts the I/O instructions and then transfers control to the device-independent module in the VMkernel for handling the I/O request.
4. The VMkernel converts the I/O request from the emulated device to one for the real device, and sends the converted I/O request to the driver for the real device.
5. The VMware driver sends the I/O request to the real I/O device.
6. When an I/O request completion interrupt (for example, a DMA completion interrupt) arrives, the VMkernel device driver receives and processes the interrupt.
7. The VMkernel then notifies the VMM of the target virtual machine, which copies data to the VM memory and then raises the interrupt to the GOS.
8. The Solaris driver's interrupt service routine (ISR) is called.
9. The Solaris driver performs a sequence of I/O accesses (for example, reads the transaction status, acknowledges receipt) to the I/O ports before passing the data to its applications.
The VMkernel ensures that data intended for each virtual machine is isolated from other VMs.
Section III
Additional Information
Appendix A: VMM Comparison (page 109) Appendix B: References (page 111) Appendix C: Terms and Definitions (page 113) Appendix D: Author Biography (page 117)
Appendix A
VMM Comparison
This chapter presents a summary comparison of the four virtual machine monitors discussed in this paper: Sun xVM Server without HVM, Sun xVM Server with HVM, VMware, and Logical Domains (LDoms). Table 10 summarizes their general characteristics; provides information on their CPU, memory, and I/O virtualization implementations; and lists the management options available for each.
Table 10. VMM comparison (general, CPU, and memory characteristics)

General
    Sun xVM Server w/o HVM: VMM version 3.0.4; supported ISA x86 and IA-64; VMM runs on bare metal; paravirtualization; supported GOS Linux, NetBSD, FreeBSD, OpenBSD, Solaris; SMP GOS yes; 64-bit GOS yes; maximum VMs limited by memory; method of operation modified GOS; license GPL (free).
    Sun xVM Server w/HVM: VMM version 3.0.4; supported ISA x86 and IA-64; VMM runs on bare metal; full virtualization; supported GOS Linux, NetBSD, FreeBSD, OpenBSD, Windows; SMP GOS yes; 64-bit GOS yes; maximum VMs limited by memory; method of operation hardware virtualization; license GPL (free).
    VMware: supported GOS Windows, Linux, NetWare, Solaris; SMP GOS yes; 64-bit GOS yes; maximum VMs limited by memory; method of operation binary translation; proprietary license.
    LDoms: supported GOS Linux, Solaris; SMP GOS yes; 64-bit GOS yes; maximum of 32 domains on UltraSPARC T1 and 64 on UltraSPARC T2; method of operation modified OS; license CDDL (free).

CPU
    Sun xVM Server w/o HVM: credit scheduler; VMM privilege mode privileged (ring 0); GOS privilege mode unprivileged (ring 3 for a 64-bit kernel, ring 1 for a 32-bit kernel); fractional CPU granularity; interrupts queued and delivered when the VM runs.
    Sun xVM Server w/HVM: credit scheduler; VMM privileged (ring 0); GOS reduced privilege; fractional CPU granularity; interrupts queued and delivered when the VM runs.
    VMware: fair-share scheduler; VMM privileged; GOS deprivileged.
    LDoms: no CPU scheduling (N/A); VMM hyperprivileged; GOS privileged.

Memory
    Sun xVM Server w/o HVM: page translation by hypercall to the VMM; physical memory allocation by balloon driver; page tables managed by the VMM.
    Sun xVM Server w/HVM: shadow page tables; balloon driver; page tables managed by the VMM.
    VMware: shadow page tables; balloon driver; page tables managed by the VMM.
I/O
    Sun xVM Server w/o HVM: I/O granularity shared; I/O virtualization by I/O emulation in Dom0; native drivers on DomU and Dom0.
    Sun xVM Server w/HVM: I/O granularity shared; device emulation by QEMU or I/O emulation by Dom0; native drivers on DomU and Dom0 (QEMU).
    VMware: I/O granularity shared; device emulation by the vmkernel; native drivers on the guest supported by the VMM.
    LDoms: I/O granularity PCI bus; I/O emulation by the I/O domain; virtual drivers on non-I/O domains and native drivers on the I/O domain.

Management
    Sun xVM Server (w/o and w/HVM): Dom0 (single point of failure); CLI: xm(1); GUI: virt-manager.
    VMware: service console (single point of failure); GUI: Virtual Center.
    LDoms: control domain; CLI: ldm(1M), XML, and SNMP MIBs.
Appendix B
References
1. Popek, Gerald J. and Goldberg, Robert P. "Formal Requirements for Virtualizable Third Generation Architectures," Communications of the ACM 17(7), pages 412-421, July 1974.
2. UltraSPARC Architecture 2005: One Architecture.... Multiple Innovative Implementations, Draft D0.9, 15 May 2007.
3. Robin, John Scott and Irvine, Cynthia E. "Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor," Proceedings of the 9th USENIX Security Symposium, August 2000.
4. VMware: http://www.vmware.com/vinfrastructure/
5. Waldspurger, Carl A. "Memory Resource Management in VMware ESX Server," Proceedings of the 5th Symposium on Operating Systems Design and Implementation, December 2002.
6. Xen, The Xen virtual machine monitor, University of Cambridge Computer Laboratory: http://www.cl.cam.ac.uk/research/srg/netos/xen/
7. IA-32 Intel Architecture Software Developer's Manual, March 2006.
8. System V Application Binary Interface, AMD64 Architecture Processor Supplement, Draft Version 0.98, September 27, 2006. http://www.x86-64.org/documentation/abi.pdf
9. AMD64 Architecture Programmer's Manual, Volume 2: System Programming, Rev. 3.12, September 2006.
10. OpenSPARC T1 Microarchitecture Specification, Revision A, August 2006.
11. UltraSPARC Virtual Machine Specification (The sun4v architecture and Hypervisor API specification), Revision 1.0, January 24, 2006.
12. Devine, Scott W.; Bugnion, Edouard; Rosenblum, Mendel. Virtualization system including a virtual machine monitor for a computer with a segmented architecture, U.S. Patent 6,397,242, October 26, 1998.
13. Cmelik, Robert F. and Keppel, David. "Shade: A Fast Instruction-Set Simulator for Execution Profiling," ACM SIGMETRICS Performance Evaluation Review, pages 128-137, May 1994.
14. Witchel, Emmett and Rosenblum, Mendel. "Embra: Fast and Flexible Machine Simulation," Proceedings of ACM SIGMETRICS '96: Conference on Measurement and Modeling of Computer Systems, 1996.
15. Adams, Keith and Agesen, Ole. "A Comparison of Software and Hardware Techniques for x86 Virtualization," ASPLOS 2006, San Jose, CA, USA, October 21-25, 2006.
16. Timekeeping in VMware Virtual Machines, VMware white paper, August 2005.
17. Bittman, T. Gartner RAS Core Strategic Planning SPA-21-5502, Research Note 14, November 2003.
18. Rosenblum, Mendel; Herrod, Stephen A.; Witchel, Emmett; and Gupta, Anoop. "Complete Computer Simulation: The SimOS Approach," IEEE Parallel and Distributed Technology, pages 34-43, Winter 1995.
19. VMware ESX Server 2 Architecture and Performance Implications, VMware white paper, 2005.
20. Sugerman, Jeremy; Venkitachalam, Ganesh; and Lim, Beng-Hong. "Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor," Proceedings of the 2001 USENIX Annual Technical Conference, Boston, Massachusetts, USA, June 25-30, 2001.
21. System Administration Guide: Solaris Containers - Resource Management and Solaris Zones, Part No. 817-1592-14, June 2007.
22. Drakos, Nikos; Hennecke, Marcus; Moore, Ross; and Swan, Herb. Xen Interface
30. PCI-SIG, Address Translation Services, Revision 1.0, March 8, 2007.
31. AMD I/O Virtualization Technology (IOMMU) Specification, Revision 1.20, Publication #34434, February 2007.
32. Nakajima, Jun; Mallick, Asit; Pratt, Ian; and Fraser, Keir. "x86-64 XenLinux: Architecture, Implementation, and Optimizations," Proceedings of the Linux Symposium, July 19-22, 2006, Ontario, Canada.
33. OpenSPARC T2 Core Microarchitecture Specification, Revision 5, July 2007.
34. UltraSPARC Architecture 2007: Hyperprivileged, Privileged, and Nonprivileged, Draft D0.91, August 2007.
35. PCI-SIG, Single Root I/O Virtualization and Sharing Specification, Revision 1.0, September 11, 2007.
Appendix C
OS includes: file system, devices, networking, security, and Inter-Process Communication (IPC).

Pacifica
AMD's implementation of hardware virtualization, also known as AMD-V or AMD SVM.

Paravirtualization
An implementation of virtual machines that requires the guest OS to be modified to run in the VM. Paravirtualization provides partial emulation of the underlying hardware to a VM; the guest OS must replace all sensitive instructions with calls that pass control to the VMM, which handles these operations.

Privileged Instructions
Privileged instructions are those that trap if the processor is running in user mode and do not trap if the processor is running in supervisor mode.

Secure Virtual Machine (SVM)
AMD's implementation of hardware virtualization, also known as Pacifica or AMD-V (see [9], Chapter 15).

Sensitive Instructions
Sensitive instructions [1][12] are those that change the configuration of resources (memory), affect the processor mode without going through the memory trap sequence (page fault), or whose behavior depends on the processor mode or the contents of the relocation register. If the sensitive instructions are a subset of the privileged instructions, it is relatively easy to build a VM: every sensitive instruction results in a trap, and the underlying VMM can process the trap and emulate the instruction's behavior. If some sensitive instructions are not privileged instructions, special measures must be taken to handle them.

Shadow Page
A technique for hiding the layout of machine memory from a virtual machine's operating system. A virtual page table is presented to the guest OS by the VMM but is not connected to the processor's memory management unit (MMU). The VMM is responsible for trapping accesses to the table, validating updates, and maintaining consistency with the real page table that is visible to the processor MMU.
A shadow page table is typically used to provide full virtualization to a VM.

Simple Earliest Deadline First (sEDF)
One of the scheduling algorithms used in the Sun xVM Hypervisor for x86 to schedule domains. See the section CPU Scheduling on page 48 for a detailed description of sEDF.

Strand
A strand [2] is the state that hardware must maintain in order to execute a software thread. Specifically, a strand is the software-visible state (PC, NPC, general-purpose registers, floating-point registers, condition codes, status registers, ASRs, etc.) of a thread, plus any microarchitectural state required by the hardware for its execution. Strand replaces the ambiguous term hardware thread. The number of strands in a processor defines the number of threads that an operating system can schedule on that processor at any given time.

Sun xVM Hypervisor for x86
The VMM of the Sun xVM Server.

Sun xVM Infrastructure
Sun Cross Virtualization and Management Infrastructure, a complete solution for virtualizing and managing the data center: Sun xVM Infrastructure = Sun xVM Server + Sun xVM Ops Center.

Sun xVM Ops Center
The management suite for the Sun xVM Server.
115
Sun xVM Server
A paravirtualized Solaris OS that includes support for the Xen open source community work on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform. In this paper, Sun xVM Server refers specifically to the Sun xVM Server for the x86 platform.

Vanderpool
Intel's implementation of hardware virtualization, also known as Intel VT.

Virtual CPU (VCPU)
An entity that can be dispatched by the scheduler of a guest OS. For LDoms on UltraSPARC processors, a VCPU is also known as a strand, hardware thread, or logical processor.

Virtual Machine (VM)
A discrete execution environment that abstracts computer platform resources to an operating system. Each virtual machine runs an independent and separate instance of an operating system. Popek and Goldberg [1] also define a VM as an efficient, isolated duplicate of a real machine.

Virtual Machine Monitor (VMM)
A software layer that runs directly on top of the hardware and virtualizes all resources of the computer system. The VMM layer sits between the VMs and the hardware resources; it abstracts the hardware resources to the VMs and performs privileged and sensitive actions on behalf of the VMs.

Virtualization Technology (VT)
Intel's implementation of hardware virtualization, also known as Vanderpool.

Xen
An open source VMM for x86, IA-64, and PPC [6].
Appendix D
Author Biography
Chien-Hua Yen is currently a senior staff engineer in the ISV Engineering group at Sun. Before joining Sun more than 12 years ago, he worked as a software development engineer at several Silicon Valley companies on UNIX file systems, real-time embedded systems, and device drivers. His first job at Sun was with the kernel I/O group, developing a kernel virtual memory segment driver for device memory mapping. After the kernel group, he worked with third-party hardware vendors on developing PCI drivers for the Solaris OS and high-availability products for the Sun CompactPCI board. In the last two years, Chien-Hua has been working with ISVs on application performance tuning, Solaris 10 adoption, and Solaris virtualization.
Acknowledgements
The author would like to thank Honlin Su, Lodewijk Bonebakker, Thomas Bastian, Ray Voight, and Joost Pronk for their invaluable comments; Patric Change for his encouragement and support; Suzanne Zorn for her editorial work; and Kemer Thompson for his constructive comments and his coordination of the reviews.
Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com
© 2007 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Java, JVM, Solaris, and Sun BluePrints are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon architecture developed by Sun Microsystems, Inc. Information subject to change without notice. Printed in USA 11/07