MIT IAP Course Lecture #1: Virtualization 101

Carl Waldspurger (SB SM ’89 PhD ’95)
VMware R&D
January 16, 2007

Copyright © 2007 VMware, Inc. All rights reserved.

What is Virtualization?

vir•tu•al (adj): existing in essence or effect, though not in actual fact
Virtual systems
• Abstract physical components using logical objects
• Dynamically bind logical objects to physical configurations

Examples
• Network – Virtual LAN (VLAN), Virtual Private Network (VPN)
• Storage – Storage Area Network (SAN), LUN
• Computer – Virtual Machine (VM), simulator

Overview
• Virtual Machines
• Virtualization Approaches
• Processor Virtualization
• Additional Topics

Starting Point: A Physical Machine
Physical Hardware
• Processors, memory, chipset, I/O bus and devices, etc.
• Physical resources often underutilized

Software
• Tightly coupled to hardware
• Single active OS image
• OS controls hardware

What is a Virtual Machine?
Hardware-Level Abstraction
• Virtual hardware: processors, memory, chipset, I/O devices, etc.
• Encapsulates all OS and application state

Virtualization Software
• Extra level of indirection decouples hardware and OS
• Multiplexes physical hardware across multiple “guest” VMs
• Strong isolation between VMs
• Manages physical resources, improves utilization

VM Isolation
Secure Multiplexing
• Run multiple VMs on single physical host
• Processor hardware isolates VMs, e.g. MMU
Strong Guarantees
• Software bugs, crashes, viruses within one VM cannot affect other VMs
Performance Isolation
• Partition system resources
• Example: VMware controls for reservation, limit, shares

VM Encapsulation
Entire VM is a File
• OS, applications, data
• Memory and device state
Snapshots and Clones
• Capture VM state on the fly and restore to point-in-time
• Rapid system provisioning, backup, remote mirroring
Easy Content Distribution
• Pre-configured apps, demos
• Virtual appliances

VM Compatibility
Hardware-Independent
• Physical hardware hidden by virtualization layer
• Standard virtual hardware exposed to VM
Create Once, Run Anywhere
• No configuration issues
• Migrate VMs between hosts
Legacy VMs
• Run ancient OS on new platform
• E.g. DOS VM drives virtual IDE and vLance devices, mapped to modern SAN and GigE hardware

Common Virtualization Uses Today
• Test and Development – Rapidly provision test and development servers; store libraries of pre-configured test machines
• Server Consolidation and Containment – Eliminate server sprawl by deploying systems into virtual machines that can run safely and move transparently across shared hardware
• Business Continuity – Reduce cost and complexity by encapsulating entire systems into single files that can be replicated and restored onto any target server
• Enterprise Desktop – Secure unmanaged PCs without compromising end-user autonomy by layering a security policy in software around desktop virtual machines

Overview
Virtual Machines
Virtualization Approaches
• Virtual machine monitors (VMMs)
• Virtualization platform types
• Alternative system virtualizations
Processor Virtualization
Additional Topics

What is a Virtual Machine Monitor?
An Old Concept
• Classic definition from Popek & Goldberg ’74
• IBM mainframes since ’60s
VMM Characteristics
• Fidelity
• Performance
• Isolation / Safety

VMM Technology
So this is just like Java, right?
• No, a Java VM is very different from the physical machine that runs it
• A hardware-level VM reflects underlying processor architecture
Like a simulator or emulator that can run old Nintendo games?
• No, they emulate the behavior of different hardware architectures
• Simulators generally have very high overhead
• A hardware-level VM utilizes the underlying physical processor directly

VMMs Past
An Old Idea
• Hardware-level VMs since ’60s
• IBM S/360, IBM VM/370 mainframe systems
• Timeshare multiple single-user OS instances on expensive hardware
Classical VMM
• Run VM directly on hardware
• “Trap and emulate” model for privileged instructions
• Vendors had vertical control over proprietary hardware, operating systems, VMM
[Figure: From IBM VM/370 product announcement, ca. 1972]

VMMs Present
Renewed Interest
• Academic research since ’90s
• VMs for commodity systems
• Server consolidation
VMM for x86
• Industry-standard hardware, from laptops to datacenter
• Run unmodified commodity guest operating systems
• Significant challenges, e.g. “non-virtualizable” instructions
• Pioneered by VMware in ’98
[Figure: VMware Fusion for Mac OS X running WinXP, 2006]

VMM Platform Types
Hosted Architecture
• Install as application on existing x86 “host” OS, e.g. Windows, Linux, OS X
• Small context-switching driver
• Leverage host I/O stack and resource management
• Examples: VMware Player/Workstation/Server, Microsoft Virtual PC/Server, Parallels Desktop
Bare-Metal Architecture
• “Hypervisor” installs directly on hardware
• Acknowledged as preferred architecture for high-end servers
• Examples: VMware ESX Server, Xen, Microsoft Viridian (2008)

System Virtualization Alternatives
Virtual machines abstracted using a layer at different places
• Language Level
• OS Level
• Hardware Level

System Virtualization Taxonomy
System Virtualization
Hardware Level
• Bare-Metal/Hypervisor: HP Integrity VM, IBM zSeries z/VM, VMware ESX Server, Xen
• Hosted: Microsoft Virtual Server, Microsoft Virtual PC, Parallels Desktop, VMware Player, VMware Workstation, VMware Server
Para-virtualization
• Virtual Iron, VMware VMI, Xen
OS Level
• FreeBSD Jail, HP Secure Resource Partitions, Sun Solaris Zones, SWsoft Virtuozzo, User-Mode Linux
High-Level Language
• Java, Microsoft .NET / Mono, Smalltalk
Emulators
• Bochs, Microsoft VPC for Mac, QEMU, Virtutech Simics

Overview
Virtual Machines
Virtualization Approaches
Processor Virtualization
• Classical techniques
• Software x86 VMM
• Hardware-assisted x86 VMM
• Para-virtualization
Additional Topics

Classical Instruction Virtualization
Trap and Emulate
• Run guest operating system deprivileged
• All privileged instructions trap into VMM
• VMM emulates instructions against virtual state, e.g. disable virtual interrupts, not physical interrupts
• Resume direct execution from next guest instruction
Implementation Technique
• This is just one technique
• Popek and Goldberg criteria permit others
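
The trap-and-emulate loop above can be sketched as a toy interpreter. This is only an illustration under stated assumptions: the instruction names (`inc`, `cli`, `sti`), the `VirtualCPU` fields, and the helper functions are hypothetical, and a Python exception stands in for the hardware trap.

```python
from dataclasses import dataclass


class PrivilegedInstructionTrap(Exception):
    """Stands in for the hardware trap taken by deprivileged guest code."""


@dataclass
class VirtualCPU:
    """Per-VM virtual state owned by the VMM (hypothetical fields)."""
    interrupts_enabled: bool = True
    pc: int = 0


PRIVILEGED = {"cli", "sti"}  # toy stand-ins for privileged instructions


def direct_execute(instr: str, regs: dict) -> None:
    """Stand-in for the processor running one deprivileged guest instruction."""
    if instr in PRIVILEGED:
        raise PrivilegedInstructionTrap(instr)
    if instr == "inc":
        regs["acc"] += 1


def emulate(instr: str, vcpu: VirtualCPU) -> None:
    """VMM handler: apply the instruction to virtual state, not physical state."""
    if instr == "cli":
        vcpu.interrupts_enabled = False
    elif instr == "sti":
        vcpu.interrupts_enabled = True


def run_guest(program: list, vcpu: VirtualCPU) -> tuple:
    """Run the guest program, counting how many instructions trapped."""
    regs = {"acc": 0}
    traps = 0
    while vcpu.pc < len(program):
        instr = program[vcpu.pc]
        try:
            direct_execute(instr, regs)
        except PrivilegedInstructionTrap:
            traps += 1
            emulate(instr, vcpu)
        vcpu.pc += 1  # resume from the next guest instruction
    return regs, traps


vcpu = VirtualCPU()
regs, traps = run_guest(["inc", "cli", "inc", "inc"], vcpu)
print(regs["acc"], traps, vcpu.interrupts_enabled)
```

Only the `cli` traps; the three `inc` instructions run without VMM involvement, and the guest's attempt to disable interrupts changes only the virtual flag.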

Classical Memory Virtualization
Traditional VMM Approach
Extra Level of Indirection
• Virtual → “Physical”: Guest maps VPN to PPN using primary page tables
• “Physical” → Machine: VMM maps PPN to MPN
Shadow Page Table
• Composite of two mappings
• For ordinary memory references, hardware maps VPN to MPN
• Cached by physical TLB
[Figure: guest maps VPN to PPN; VMM maps PPN to MPN; the shadow page table maps VPN to MPN and feeds the hardware TLB]
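
As a rough illustration of "composite of two mappings", the sketch below composes a guest page table with the VMM's physical-to-machine map. Page tables are modeled as plain dictionaries, and the name `build_shadow` and all page numbers are made up for the example.

```python
def build_shadow(guest_pt: dict, pmap: dict) -> dict:
    """Compose VPN->PPN (guest) with PPN->MPN (VMM) into VPN->MPN."""
    return {vpn: pmap[ppn] for vpn, ppn in guest_pt.items() if ppn in pmap}


guest_pt = {0x10: 0x2, 0x11: 0x3, 0x12: 0x7}  # guest primary page table: VPN -> PPN
pmap = {0x2: 0x9A, 0x3: 0x41}                 # VMM map: PPN -> MPN (PPN 0x7 not backed)
shadow = build_shadow(guest_pt, pmap)         # what the hardware actually walks
print({hex(v): hex(m) for v, m in shadow.items()})
```

VPN 0x12 has no shadow entry because its PPN has no backing machine page in this example; a real VMM would handle the resulting fault and fill the entry in.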

Memory Traces
Shadow Page Table
• Derived from primary page table in guest
• VMM must keep primary and shadow coherent
Trace = Coherency Mechanism
• Write-protect primary page table
• Trap guest writes to primary
• Update or invalidate corresponding shadow
• Transparent to guest
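
A minimal sketch of the coherency mechanism, assuming dictionary page tables and an "invalidate, then refill lazily" policy (the slide allows either update or invalidate). The class and method names are hypothetical; the write-protection fault is modeled as an ordinary method call that counts itself.

```python
class TracedPageTable:
    """Toy primary/shadow pair kept coherent by a write trace."""

    def __init__(self, pmap: dict):
        self.pmap = dict(pmap)   # VMM map: PPN -> MPN
        self.primary = {}        # guest page table: VPN -> PPN (write-protected)
        self.shadow = {}         # derived table: VPN -> MPN
        self.trace_faults = 0

    def guest_write_pte(self, vpn: int, ppn: int) -> None:
        """Guest store to its page table; the write protection makes it trap."""
        self.trace_faults += 1        # trap into the VMM
        self.primary[vpn] = ppn       # VMM performs the write for the guest
        self.shadow.pop(vpn, None)    # invalidate the stale shadow entry

    def translate(self, vpn: int) -> int:
        """Shadow lookup; on a miss the VMM refills from primary and pmap."""
        if vpn not in self.shadow:
            self.shadow[vpn] = self.pmap[self.primary[vpn]]
        return self.shadow[vpn]


pt = TracedPageTable({1: 100, 2: 200})
pt.guest_write_pte(0x40, 1)
first = pt.translate(0x40)
pt.guest_write_pte(0x40, 2)   # guest remaps the page; shadow must not go stale
second = pt.translate(0x40)
print(first, second, pt.trace_faults)
```

The guest never observes the shadow table or the fault, which is what "transparent to guest" means here.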

Classical VMM Performance
Native Speed Except for Traps
• No overhead in direct execution
• Overhead = trap frequency × average trap cost
Trap Sources
• Most frequent: Guest page table traces
• Privileged instructions
• Memory-mapped device traces
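
To make the overhead formula concrete, here is a small worked calculation. The numbers (50,000 traps per second, 2,000 cycles per trap, a 2 GHz clock) are hypothetical and chosen only to show the arithmetic; they are not measurements from the lecture.

```python
def trap_overhead_fraction(traps_per_second: float,
                           cycles_per_trap: float,
                           clock_hz: float) -> float:
    """Fraction of CPU time spent in traps: frequency x cost / available cycles."""
    return traps_per_second * cycles_per_trap / clock_hz


# Hypothetical workload: 50,000 traps/s at 2,000 cycles each on a 2 GHz core.
overhead = trap_overhead_fraction(50_000, 2_000, 2_000_000_000)
print(f"{overhead:.1%}")
```

Halving either the trap frequency or the per-trap cost halves the overhead, which is why later slides attack both.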

x86 Virtualization Challenges
Not Classically Virtualizable
• x86 ISA includes instructions that read or modify privileged state
• But which don’t trap in unprivileged mode
Example: POPF instruction
• Pop top-of-stack into EFLAGS register
• EFLAGS.IF bit privileged (interrupt enable flag)
• POPF silently ignores attempts to alter EFLAGS.IF in unprivileged mode!
• So no trap to return control to VMM
Deprivileging not possible with x86!
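
The POPF problem can be modeled in a few lines. This is a simplified model of protected-mode behavior that tracks only two flag bits (CF as an ordinary flag, IF as the privileged one); the function name `popf` and its parameters are illustrative, and everything else about the real instruction is ignored.

```python
IF_FLAG = 1 << 9   # EFLAGS.IF, the interrupt enable flag
CF_FLAG = 1 << 0   # carry flag, an ordinary unprivileged bit


def popf(eflags: int, popped: int, cpl: int, iopl: int = 0) -> int:
    """Toy POPF: returns the new EFLAGS value given the word popped off the stack."""
    new = (eflags & ~CF_FLAG) | (popped & CF_FLAG)    # CF is always writable
    if cpl <= iopl:
        new = (new & ~IF_FLAG) | (popped & IF_FLAG)   # privileged enough: IF updated
    # Otherwise IF silently keeps its old value; no fault is raised,
    # so a VMM running the guest kernel deprivileged never gets control.
    return new


eflags = IF_FLAG                            # interrupts currently enabled
native = popf(eflags, 0, cpl=0)             # kernel at CPL 0 clears IF
deprivileged = popf(eflags, 0, cpl=3)       # same instruction, run deprivileged
print(bool(native & IF_FLAG), bool(deprivileged & IF_FLAG))
```

The same guest instruction produces different architectural state depending on privilege level, and the deprivileged case gives the VMM no trap to emulate.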

How to Virtualize x86?
Interpretation
• Problem – too inefficient
• x86 decoding slow
Code Patching
• Problem – not transparent
• Guest can inspect its own code
Binary Translation (BT)
• Approach pioneered by VMware
• Run any unmodified x86 OS in VM
Extend x86 Architecture

Software VMM: Binary Translation
Direct execute unprivileged guest application code
• Will run at full speed until it traps, we get an interrupt, etc.
“Binary translate” all guest kernel code, run it unprivileged
• Since x86 has non-virtualizable instructions, proactively transfer control to the VMM (no need for traps)
• Safe instructions are emitted without change
• For “unsafe” instructions, emit a controlled emulation sequence
• VMM translation cache for good performance

VMware Translator Properties
• Binary – input is x86 “hex”, not source
• Dynamic – interleave translation and execution
• On Demand – translate only what about to execute (lazy)
• System Level – makes no assumptions about guest code
• Subsetting – full x86 to safe subset
• Adaptive – adjust translations based on guest behavior

BT Mechanics
Each Translator Invocation
• Consume a basic block (BB)
• Produce a compiled code fragment (CCF)
Store CCF in Translation Cache
• Future reuse
• Capture working set of guest kernel
• Amortize translation costs
• Not “patching in place”
[Figure: Input: BB (55 ff 33 c7 03 ...) → translator → Output: CCF (55 ff 33 c7 03 ...)]
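
A minimal sketch of on-demand translation with a translation cache, assuming guest code is represented as lists of instruction strings rather than x86 bytes. The `UNSAFE` set, the `call vmm_emulate_*` sequences, and the class name are invented for illustration; they do not describe VMware's translator.

```python
UNSAFE = {"popf", "cli", "sti"}  # toy set of instructions needing emulation


class Translator:
    """Toy translator: one basic block in, one compiled code fragment out."""

    def __init__(self, guest_code: dict):
        self.guest_code = guest_code  # guest address -> basic block (never modified)
        self.cache = {}               # translation cache: guest address -> CCF
        self.translations = 0

    def translate(self, bb: list) -> list:
        """Emit safe instructions unchanged; replace unsafe ones with a call-out."""
        ccf = []
        for instr in bb:
            if instr in UNSAFE:
                ccf.append(f"call vmm_emulate_{instr}")  # controlled emulation sequence
            else:
                ccf.append(instr)                         # identical (IDENT) translation
        return ccf

    def lookup(self, addr: int) -> list:
        """Translate lazily on first execution, then reuse the cached CCF."""
        if addr not in self.cache:
            self.translations += 1
            self.cache[addr] = self.translate(self.guest_code[addr])
        return self.cache[addr]


guest_code = {0x1000: ["push %ebp", "cli", "ret"]}
t = Translator(guest_code)
ccf_first = t.lookup(0x1000)
ccf_again = t.lookup(0x1000)   # served from the cache, no retranslation
print(ccf_first, t.translations)
```

The guest's own code is left untouched, so a guest that inspects its instructions sees the original bytes; only the cached copy differs.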

Example: IDENT Translation
BB
  80403a69  push %ebp
  80403a6a  push (%ebx)
  80403a6c  mov  (%ebx), ffffffff
  80403a72  mov  %edx, %esp
  80403a74  mov  %esp, 81c(%ebx)
  80403a7a  push %edx
  80403a7b  mov  %ebp, %eax
  80403a7d  call 80460ba4
CCF
  25555b0   push %ebp
  25555b1   push (%ebx)
  25555b3   mov  (%ebx), ffffffff
  25555b9   mov  %edx, %esp
  25555bb   mov  %esp, 81c(%ebx)
  25555c1   push %edx
  25555c2   mov  %ebp, %eax
  25555c4   push 80403a82
  25555c9   int  3a
  25555cb   data: 80460ba4
• 25555c4: push return address
• 25555c9: invoke translator on callee

Adaptive BT
Translated Code Is Fast
• Mostly IDENT translations
• Runs “at speed”
Except Writes to Traced Memory
• Page fault (shown as !*!)
• Decode and interpret instruction
• Fire trace callbacks
• Resume execution
• Can take 1000’s of cycles
[Figure: Translation Cache with a faulting instruction marked !*!; Invoke Translator]

Adaptive BT: Fast Trace Handling
Detect and Track Trace Faults
Splice in TRACE Translation
• Execute memory access in software
• Avoid page fault
• No re-decoding
• Faster resumption
Faster Traces
• 10x performance improvement
• Adapts to runtime behavior
[Figure: Translation Cache with a JMP spliced in to a TRACE translation; Invoke Translator]
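
The adaptation policy can be sketched as a per-instruction fault counter. The cycle costs and the threshold below are hypothetical placeholders (the 10:1 cost ratio simply mirrors the "10x" bullet), and the function models only the accounting, not the actual code splicing.

```python
from collections import Counter

FAULT_COST = 2000   # hypothetical cycles: page fault + decode + interpret
TRACE_COST = 200    # hypothetical cycles: spliced-in TRACE translation
THRESHOLD = 3       # hypothetical fault count that triggers retranslation


def run_traced_writes(addr_stream, threshold=THRESHOLD):
    """Charge cycles for each traced write, adapting hot instructions.

    addr_stream lists the addresses of translated instructions that write
    to traced memory, in execution order.
    """
    fault_counts = Counter()
    trace_translated = set()
    cycles = 0
    for addr in addr_stream:
        if addr in trace_translated:
            cycles += TRACE_COST          # access handled in software, no fault
            continue
        cycles += FAULT_COST              # slow path via page fault
        fault_counts[addr] += 1
        if fault_counts[addr] >= threshold:
            trace_translated.add(addr)    # splice in a TRACE translation
    return cycles, trace_translated


stream = [0x25555B3] * 10
adaptive_cycles, adapted = run_traced_writes(stream)
naive_cycles = len(stream) * FAULT_COST
print(adaptive_cycles, naive_cycles)
```

With these made-up costs, ten writes from one hot instruction take 7,400 cycles instead of 20,000, and the gap widens as the instruction keeps executing.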

Software VMM Evaluation
Benefits
• Adaptation
• Fast traces
• Fast I/O emulation
• Flexibility
Costs
• Running translator
• Path lengthening
• System call slowdown
• Complexity

Hardware-Assisted VMM
Recent x86 Extension
• 1998 – 2005: Software-only VMMs using binary translation
• 2005: Intel and AMD start extending x86 to support virtualization
First-Generation Hardware
• Enables classical trap-and-emulate VMMs
• Intel VT, aka “Vanderpool Technology”
• AMD SVM, aka “Pacifica”
Performance
• VT/SVM help avoid BT, but not MMU ops (actually slower!)
• Main problem is efficient virtualization of MMU and I/O, not executing the virtual instruction stream

VT/SVM Architecture
Diagram
• Y-axis: old school x86 privilege (CPL)
• X-axis: virtualization privilege
Guest Mode
• Runs unmodified OS
• Sensitive operations “exit” (trap out) to host mode
VMCB
• Virtual Machine Control Block
• VMM-controlled, hardware-walked
• Buffers simple exits
[Figure: Host and Guest columns, each with CPL 0 through CPL 3]

Hardware-Assisted VMM
[Figure: Hardware-Assisted Direct Exec runs the guest in guest mode (CPL 0-3); Fault, Interrupt, Trace, I/O ... exit to the VMM in host mode (CPL 0-3); the VMM then resumes the guest]

Hardware-Assisted VMM Evaluation
Benefits
• Simplicity (no BT)
• Fast system calls
• No translator overheads
Costs
• Exits: 1000’s of cycles for traces and I/O
• No adaptation or software flexibility
• Stateless model
Future
• Hardware support for fast MMU virtualization
• Intel EPT, AMD NPT

What is Paravirtualization?
Full Virtualization
• No modifications to guest OS
• Excellent compatibility, good performance, but complex
Paravirtualization Exports Simpler Architecture
• Term coined by Denali project in ’01, popularized by Xen
• Modify guest OS to be aware of virtualization layer
• Remove non-virtualizable parts of architecture
• Avoid rediscovery of knowledge in hypervisor
• Excellent performance and simple, but poor compatibility
Ongoing Linux Standards Work
• “Paravirt Ops” interface between guest and hypervisor
• Small team from VMware, Xen, IBM LTC, etc.

Paravirtualization: Conceptual Diagram
[Figure: two stacks, Full Virtualization and Paravirtualization, each showing Guest OS over Hypervisor over Hardware. Labels: System call interface; Hypercalls (GOOD); NOT GOOD!]

VMware Vision: Transparent Paravirtualization
[Figure: Same OS binary (VMI Linux) shown on three platforms: Xen 3.0.x, VMware ESX, and Native. Other labels: Dom0, DomU, Xeno Linux, Windows, Solaris]

Further Reading
VMware Publications
• www.vmware.com/academic/resources.html
• A Comparison of Software and Hardware Techniques for x86 Virtualization (ASPLOS ’06)
• Fast Transparent Migration for Virtual Machines (USENIX ’05)
• Memory Resource Management in VMware ESX Server (OSDI ’02)
• Virtualizing I/O Devices on VMware Workstation’s Hosted VMM (USENIX ’01)
Additional Academic Publications
• Xen and the Art of Virtualization (SOSP ’03)
• Disco: Running Commodity Operating Systems on Scalable Multiprocessors (SOSP ’97)
• Many more …

Additional Topics
• I/O Virtualization
• Memory Management

I/O Virtualization Stack
Guest Device Driver
Virtual Device
• Model existing device, e.g. e1000
• Model an idealized device, e.g. vmxnet
Virtualization Layer
• Emulates the virtual device
• Remaps guest and real I/O addresses
• Multiplexes and drives physical device
• Provides additional features, e.g. transparent NIC teaming
Real Device
• Physical hardware, e.g. bcm5700
• Likely to be different than virtual device
[Figure: Guest OS with Device Driver, over a virtualization layer containing Device Emulation, I/O Stack, and Device Driver]

I/O Virtualization Implementations
Emulated I/O: Hosted or Split
• VMware Workstation, VMware Server, VMware ESX Server (for slow devices), Xen, Microsoft Viridian, Virtual Server
Emulated I/O: Hypervisor Direct
• VMware ESX Server (storage and network)
Passthrough I/O
• A Future Option
• Many Challenges
[Figure: three stacks. Hosted or Split: Guest OS Device Driver, with Device Emulation, I/O Stack, and Device Driver in the Host OS/Dom0/Parent Domain. Hypervisor Direct: Guest OS Device Driver over Device Emulation, I/O Stack, and Device Driver. Passthrough I/O: Guest OS Device Driver over a Device Manager]

Passthrough I/O Virtualization
High Performance
• Guest drives device directly
• Minimizes CPU utilization
Enabled by HW Assists
• I/O-MMU for DMA isolation, e.g. Intel VT-d, AMD IOMMU
• Partitionable I/O device, e.g. PCI-SIG IOV spec
Challenges
• Hardware independence
• Migration, suspend/resume
• Memory overcommitment
[Figure: three Guest OS Device Drivers over a virtualization layer with I/O MMU and Device Manager, over an I/O Device exposing one PF and three VFs. PF = Physical Function, VF = Virtual Function]

Additional Topics
• I/O Virtualization
• Memory Management

Memory Management
Desirable capabilities
• Efficient memory overcommitment
• Accurate resource controls
• Exploit sharing opportunities
Challenges
• Allocations should reflect both importance and working set
• Best data to guide decisions known only to guest OS
• Guest and meta-level policies may clash

VMware Memory Management
Reclamation mechanisms
• Ballooning – guest driver allocates pinned PPNs, hypervisor deallocates backing MPNs
• Swapping – hypervisor transparently pages out PPNs, paged in on demand
• Page sharing – hypervisor identifies identical PPNs based on content, maps to same MPN copy-on-write
Allocation policies
• Proportional sharing – revoke memory from VM with minimum shares-per-page ratio
• Idle memory tax – charge VM more for idle pages than for active pages to prevent unproductive hoarding
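
The proportional-sharing rule ("revoke memory from VM with minimum shares-per-page ratio") can be sketched as below. The `VM` record, the function names, and the share and page counts are hypothetical, and the idle memory tax is deliberately left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    shares: int   # relative importance assigned by the administrator
    pages: int    # machine pages currently allocated to the VM


def pick_victim(vms: list) -> VM:
    """Return the VM paying the fewest shares per page it holds."""
    candidates = [vm for vm in vms if vm.pages > 0]
    return min(candidates, key=lambda vm: vm.shares / vm.pages)


def reclaim(vms: list, n_pages: int) -> None:
    """Revoke n_pages one at a time, re-evaluating the ratio after each page."""
    for _ in range(n_pages):
        pick_victim(vms).pages -= 1


a = VM("A", shares=2000, pages=1000)   # ratio 2.0
b = VM("B", shares=1000, pages=1000)   # ratio 1.0, so B is revoked first
first_victim = pick_victim([a, b]).name
reclaim([a, b], 200)
print(first_victim, a.pages, b.pages)
```

After 200 pages are revoked, B's ratio has risen to 1.25, still below A's 2.0, so every page in this example comes from B.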

Ballooning
[Figure: a balloon inside the Guest OS. Inflate balloon (+ pressure): guest may page out to virtual disk. Deflate balloon (– pressure): guest may page in from virtual disk. Guest OS manages memory; implicit cooperation]

Page Sharing
Motivation
• Multiple VMs running same OS, apps
• Collapse redundant copies of code, data, zeros
Transparent page sharing
• Map multiple PPNs to single MPN copy-on-write
• Pioneered by Disco [Bugnion ’97], but required guest OS hooks
Content-based sharing
• General-purpose, no guest OS changes
• Background activity saves memory over time

Page Sharing: Scan Candidate PPN
[Figure: contents of a candidate page (011010 110101 010111 101100) are hashed (…2bd806af); VM 1, VM 2, and VM 3 map into Machine Memory; the hash table holds a hint frame entry: Hash …06af, VM 3, PPN 43f8, MPN 123b]

Page Sharing: Successful Match
[Figure: VM 1, VM 2, and VM 3 map into Machine Memory; the hash table entry is now a shared frame: Hash …06af, Refs 2, MPN 123b]
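
The scan and match steps from the last two slides can be sketched as follows. This is a toy model under several assumptions: SHA-1 from `hashlib` stands in for whatever hash the real system uses, pages are `bytes` objects, MPNs are never reused, and all class and method names are hypothetical.

```python
import hashlib


class PageSharer:
    """Toy content-based page sharing with hint frames and copy-on-write."""

    def __init__(self):
        self.machine = {}   # MPN -> page contents
        self.refs = {}      # MPN -> reference count (shared frames only)
        self.pmap = {}      # (vm, PPN) -> MPN
        self.table = {}     # hash -> ("hint", vm, ppn, mpn) or ("shared", mpn)
        self.next_mpn = 0

    def alloc(self, vm, ppn, data):
        """Back (vm, ppn) with a fresh private machine page."""
        mpn, self.next_mpn = self.next_mpn, self.next_mpn + 1
        self.machine[mpn] = data
        self.pmap[(vm, ppn)] = mpn
        return mpn

    def scan(self, vm, ppn):
        """Hash one candidate page; return True if it was collapsed into a shared frame."""
        mpn = self.pmap[(vm, ppn)]
        data = self.machine[mpn]
        key = hashlib.sha1(data).hexdigest()
        entry = self.table.get(key)
        if entry is None:
            self.table[key] = ("hint", vm, ppn, mpn)   # first sighting: leave a hint
            return False
        if entry[0] == "hint":
            _, hvm, hppn, hmpn = entry
            if (hvm, hppn) == (vm, ppn):
                return False
            # A hint page is still writable, so it may be stale: re-check it.
            if self.pmap.get((hvm, hppn)) != hmpn or self.machine[hmpn] != data:
                self.table[key] = ("hint", vm, ppn, mpn)
                return False
            self.refs[hmpn] = 1                        # promote hint to shared frame
            self.table[key] = ("shared", hmpn)
            target = hmpn
        else:
            target = entry[1]
            if self.machine[target] != data:           # hash collision: do not share
                return False
        if target == mpn:
            return False
        self.pmap[(vm, ppn)] = target                  # remap to the shared copy
        self.refs[target] += 1
        del self.machine[mpn]                          # reclaim the redundant page
        return True

    def write(self, vm, ppn, data):
        """Guest write: shared frames are copy-on-write, private ones are updated."""
        mpn = self.pmap[(vm, ppn)]
        if mpn in self.refs:
            self.refs[mpn] -= 1
            if self.refs[mpn] == 0:                    # last user left: free the frame
                old_key = hashlib.sha1(self.machine[mpn]).hexdigest()
                del self.table[old_key], self.machine[mpn], self.refs[mpn]
            self.alloc(vm, ppn, data)                  # private copy with new contents
        else:
            self.machine[mpn] = data


s = PageSharer()
zero_page = bytes(4096)
s.alloc("VM1", 0x10, zero_page)
s.alloc("VM3", 0x43F8, zero_page)
hinted = s.scan("VM3", 0x43F8)    # no match yet, becomes a hint frame
shared = s.scan("VM1", 0x10)      # full compare succeeds, pages collapse
shared_mpn = s.pmap[("VM1", 0x10)]
print(hinted, shared, s.refs[shared_mpn], len(s.machine))
```

A later `s.write("VM1", 0x10, ...)` would drop the reference count to 1 and give VM1 a private page again, leaving VM3's view of the zero page untouched.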