
Virtual Machines

(part II)

Wei Chung Hsu


徐慰中

05/17/2019
Efficient Emulation for VM
 Process Virtual Machines: Often Cross-ISA
– Key tech: Dynamic Binary Translation
 E.g. QEMU, ARIES, Rosetta, IA-32 EL, ….
 System Virtual Machines: Often Same-ISA
– Key tech: Trap and emulate, i.e. trapping privileged/sensitive instructions
 Software support
 Hardware support
– Intel VT-x, Intel VT-i, AMD-V, ARM Virtualization Extensions
– Intel VT-d, Intel VT-c
– IOMMU
Source ISA States
 Hold complete source architecture states in the interpreter's data memory
[Figure: the interpreter's memory holds the source Code, Data, and Stack, plus the source Architecture State (program counter, condition codes, Register 0, Register 1, …, Register N); the Interpreter Code operates on them.]
Example

Source ISA States
 Hold complete source architecture states in the interpreter's data memory, for example as a struct:

struct States {
    int PC;            /* program counter */
    char EFlag[6];     /* condition codes */
    int IReg[8];       /* integer registers */
    double FReg[8];    /* floating-point registers */
    ….
};
Decode-Dispatch Interpretation

while (!halt && !interrupt)
{
    inst = code(PC);                /* PC indexes the source ISA code */
    opcode = extract(inst,31,6);
    switch(opcode) {
    case LoadWord: LoadWord(inst); break;
    case ALU: ALU(inst); break;
    case Branch: Branch(inst); break;
    ...
    }
}
Instruction Function: Load Word

LoadWord(inst)
{
RT = extract (inst,25,5);
RA = extract (inst,20,5);
displacement =extract (inst,15,16);
source = regs[RA];
address = source + displacement ;
regs[RT] = data[address];
PC = PC + 4;
}
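The decode-dispatch loop and the LoadWord routine above can be put together into a small runnable sketch. It keeps the slide's field layout (opcode in bits 31..26, RT in 25..21, RA in 20..16, displacement in 15..0), but the opcode numbers, the `SourceState` struct, the `encode` helper, and the `add_imm` instruction are assumptions made only for illustration, not part of any real ISA.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcodes for this sketch; not taken from a real ISA. */
enum { OP_HALT = 0, OP_LOADWORD = 1, OP_ADDI = 2 };

#define MEM_WORDS 64

typedef struct {
    uint32_t pc;               /* source program counter (byte address) */
    uint32_t regs[32];         /* source general-purpose registers */
    uint32_t code[MEM_WORDS];  /* source instructions, one per word */
    uint32_t data[MEM_WORDS];  /* source data memory, indexed by address / 4 */
    int halt;
} SourceState;

/* Return `len` bits of `inst` whose most significant bit is bit `hi`. */
static uint32_t extract(uint32_t inst, int hi, int len)
{
    assert(len > 0 && len < 32 && hi < 32 && hi >= len - 1);
    return (inst >> (hi - len + 1)) & ((1u << len) - 1u);
}

/* Build an instruction word in the layout that extract() expects. */
static uint32_t encode(uint32_t op, uint32_t rt, uint32_t ra, uint32_t disp)
{
    return (op << 26) | (rt << 21) | (ra << 16) | (disp & 0xFFFFu);
}

static void load_word(SourceState *s, uint32_t inst)
{
    uint32_t rt = extract(inst, 25, 5);
    uint32_t ra = extract(inst, 20, 5);
    uint32_t address = s->regs[ra] + extract(inst, 15, 16);
    assert(address / 4 < MEM_WORDS);
    s->regs[rt] = s->data[address / 4];
    s->pc += 4;
}

static void add_imm(SourceState *s, uint32_t inst)
{
    uint32_t rt = extract(inst, 25, 5);
    uint32_t ra = extract(inst, 20, 5);
    s->regs[rt] = s->regs[ra] + extract(inst, 15, 16);
    s->pc += 4;
}

/* Fetch, decode, and dispatch one source instruction per iteration. */
static void interpret(SourceState *s)
{
    while (!s->halt) {
        assert(s->pc / 4 < MEM_WORDS);
        uint32_t inst = s->code[s->pc / 4];
        switch (extract(inst, 31, 6)) {
        case OP_LOADWORD: load_word(s, inst); break;
        case OP_ADDI:     add_imm(s, inst);   break;
        default:          s->halt = 1;        break;  /* OP_HALT or unknown */
        }
    }
}

/* Run a three-instruction program and return the final value of r2. */
static uint32_t run_demo(void)
{
    SourceState s = {0};
    s.data[2] = 40;                            /* word at byte address 8 */
    s.code[0] = encode(OP_LOADWORD, 1, 0, 8);  /* r1 = data[r0 + 8] */
    s.code[1] = encode(OP_ADDI, 2, 1, 2);      /* r2 = r1 + 2 */
    s.code[2] = encode(OP_HALT, 0, 0, 0);
    interpret(&s);
    return s.regs[2];
}
```

Every executed instruction pays for the fetch, the field extraction, and the switch, which is the overhead that binary translation tries to remove.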
From executable to executable
Source ISA: assume MIPS
    Add r1, r2, r3
    Ld r4, 4(sp)
    Add r5, r4, r6
    …
    …

Code Morphing

Target ISA: assume x86
    addl %eax,%ebx,%ecx
    …..
    movl %edx,4(%esp)
    …..
    movl %ecx, 100(%esp);
    movl %edx, 104(%esp);
    addl %eax, %ecx,%edx;
    movl %eax, 108(%esp);
    ….
    ….

The target executable could be many
times bigger than the source, but can be
directly executed rather than interpreted.
Binary Translation
 Generate custom code for every source
instruction. For example, a load instruction in
source code could be translated into a
respective load instruction in native code.
 Get rid of repeated parsing, decoding, and
jumping overhead.
 Register mapping (of different ISAs) is needed
to reduce load/stores significantly.
 Compiled emulation is an early form of binary
translation.
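A real binary translator emits native machine code, which is hard to show in a few portable lines. As a stand-in, the sketch below uses pre-decoding: each source instruction is parsed exactly once into a record holding a handler pointer and its operands, so the execution loop no longer extracts fields or switches on opcodes. This only illustrates the point about removing repeated parsing and decoding; the opcode values, struct names, and handlers are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define OP_HALT 0u   /* hypothetical opcode numbers */
#define OP_ADDI 2u

typedef struct Machine Machine;
typedef struct Decoded Decoded;
typedef void (*Handler)(Machine *m, const Decoded *d);

struct Decoded {             /* one pre-decoded source instruction */
    Handler run;             /* routine that implements the instruction */
    uint32_t rt, ra, imm;    /* operands, extracted once */
};

struct Machine {
    uint32_t regs[32];
    uint32_t next;           /* index of the next decoded instruction */
    int halt;
};

static void do_addi(Machine *m, const Decoded *d)
{
    m->regs[d->rt] = m->regs[d->ra] + d->imm;
    m->next++;
}

static void do_halt(Machine *m, const Decoded *d)
{
    (void)d;
    m->halt = 1;
}

/* Decode every instruction once, ahead of execution. */
static void predecode(const uint32_t *code, int n, Decoded *out)
{
    for (int i = 0; i < n; i++) {
        uint32_t inst = code[i];
        out[i].rt  = (inst >> 21) & 31u;
        out[i].ra  = (inst >> 16) & 31u;
        out[i].imm = inst & 0xFFFFu;
        out[i].run = ((inst >> 26) == OP_ADDI) ? do_addi : do_halt;
    }
}

/* Execution loop: no field extraction and no opcode switch. */
static void execute(Machine *m, const Decoded *prog, int n)
{
    while (!m->halt) {
        assert(m->next < (uint32_t)n);
        prog[m->next].run(m, &prog[m->next]);
    }
}

/* r1 = r0 + 5; r1 = r1 + 7; halt. Returns the final value of r1. */
static uint32_t run_predecoded_demo(void)
{
    const uint32_t code[3] = {
        (OP_ADDI << 26) | (1u << 21) | (0u << 16) | 5u,
        (OP_ADDI << 26) | (1u << 21) | (1u << 16) | 7u,
        OP_HALT << 26
    };
    Decoded prog[3];
    Machine m = {0};
    predecode(code, 3, prog);
    execute(&m, prog, 3);
    return m.regs[1];
}
```

The indirect call per instruction is still overhead that generated native code would avoid, and the source registers still live in memory, which is why register mapping matters.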
Challenges to Static Translation
– Code Discovery Problems
– Code Location Problems
– SMC (Self-Modifying Code) Problems
– Data Misalignment problems
– ….
Code Discovery Problem
 In order to translate, the emulator must be able to discover code.
 Easier said than done, especially with x86: instructions are variable length in x86 and ARM/Thumb, so instructions cannot easily be distinguished from data.

Q: Do you know what ELF is?


Examples of Embedded Data
that can occur within instructions
1) PC-relative loads
    ARM instruction:
        LDR R1, [pc, #4]
        ...
        ----- data -----
    Most data references in x64 are RIP-relative
2) Padded data
        ---- 1st procedure ---
        ….
        return
        ---- data ---- /* padding */
        ---- 2nd procedure ---
        ….
3) Jump table
    Source:
        switch (num) {
        case 1: …
        case 2: …
        case 3: …
        case 4: …
        }
    Compiled code:
        jump Switch_start
        CASE1_label: …, jump beyond
        CASE2_label: …, jump beyond
        CASE3_label: …, jump beyond
        CASE4_label: …, jump beyond
        CASE_JUMP_TABLE:
            DATA CASE1_label
            DATA CASE2_label
            DATA CASE3_label
            DATA CASE4_label
        Switch_start:
            ADD #-1, R7
            MOVE CASE_JUMP_TABLE, R0
            Left_shift #2, R7
            JUMP (R0,R7) /* indirect jump */
Code Discovery Methods
 Easier for fixed length instruction ISA
– All instruction boundaries are clearly identified
– Even if data is mis-interpreted as instruction,
such instructions will not get executed anyway.
– Unfortunately, the most popular ISAs (x86 and
ARM/Thumb) are variable length.
 If binary translation targets compiler-generated
code, embedded data are usually easier to
identify (harder in hand-written code)
– PC-relative data
– Jump Tables
Code Location Problem
 TPC (Target PC) is different from SPC
(Source PC)
 For direct branches, static binary translation
can tell which TPC an SPC is mapped to.
 For indirect branches, the target SPC is unknown
at static translation time. So we must provide a
way to map SPCs to TPCs at runtime.
 Example (IA32 to PPC) of an incorrect translation:
IA32 source:
    movl %eax, 4(%esp)
    jmp %eax
PPC translation:
    Addi r16,r11,4
    Lwzx r4,r2,r16
    Mtctr r4
    Bctr /* jump indirect through ctr,
            but ctr contains SPC */
Code Location Resolution
 Some form of mapping from SPC to TPC is
required at runtime. One simple
implementation is to use an address-mapping
table.
 For DBT (Dynamic Binary Translation)
systems, address mapping table is built on-the-
fly. Its size is determined by the dynamic code
regions executed.
 For SBT (Static Binary Translation), the table
size will be an issue since every instruction
might be a branch target.
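One way to picture the address-mapping table is a small hash table keyed by SPC. The sketch below is only an illustration under stated assumptions: the table size, the multiplicative hash, linear probing, and the use of 0 as a "not translated yet" sentinel are choices made here, not how any particular translator does it.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_SLOTS   256u   /* power of two; an arbitrary size for the sketch */
#define TPC_MISSING 0u     /* sentinel: this SPC has no translation yet */

typedef struct {
    uint32_t spc[MAP_SLOTS];   /* source PC keys */
    uint32_t tpc[MAP_SLOTS];   /* target PC values */
    uint8_t  used[MAP_SLOTS];
    uint32_t count;
} AddrMap;

/* Multiplicative hash; the top 8 bits select one of the 256 slots. */
static uint32_t slot_of(uint32_t spc)
{
    return (spc * 2654435761u) >> 24;
}

/* Record that source address `spc` was translated to target address `tpc`. */
static void map_insert(AddrMap *m, uint32_t spc, uint32_t tpc)
{
    assert(tpc != TPC_MISSING);
    uint32_t i = slot_of(spc);
    while (m->used[i] && m->spc[i] != spc)
        i = (i + 1u) & (MAP_SLOTS - 1u);       /* linear probing */
    if (!m->used[i]) {
        assert(m->count < MAP_SLOTS - 1u);     /* keep one slot free so probes end */
        m->used[i] = 1;
        m->spc[i] = spc;
        m->count++;
    }
    m->tpc[i] = tpc;
}

/* What translated code would call before an indirect branch. */
static uint32_t map_lookup(const AddrMap *m, uint32_t spc)
{
    uint32_t i = slot_of(spc);
    while (m->used[i]) {
        if (m->spc[i] == spc)
            return m->tpc[i];
        i = (i + 1u) & (MAP_SLOTS - 1u);
    }
    return TPC_MISSING;   /* a DBT would translate the code at spc now */
}

/* Insert two made-up mappings, then look up `spc`. */
static uint32_t demo_lookup(uint32_t spc)
{
    AddrMap m = {0};
    map_insert(&m, 0x08048000u, 0x10000000u);
    map_insert(&m, 0x08048005u, 0x10000020u);
    return map_lookup(&m, spc);
}
```

In a DBT the table holds only the entry points that were actually executed, so it stays small. A static translator would need an entry for every address that could be an indirect branch target.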
Dynamic Binary Translation
 A common approach to speed up emulation
– Application migration
 Cross-ISA, e.g.
IA-32 Execution Layer: IA32 to Itanium
Rosetta: PPC to IA32/X64
ARIES: HP PA-RISC to Itanium
FX!32: IA32 to DEC Alpha
Houdini: ARM to x86
 Same-ISA, e.g. Dynamo, Adore, DynamoRIO, PIN, Valgrind
– Virtual platform building
 Emulate future hardware, e.g. to develop software before the hardware is available
ARMIE: ARM SVE simulator
Shade: Sun SPARC simulator
 Friendly development environment, e.g.
Android Emulator: develop Android applications

Department of Computer
Science and Engineering 15
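Retargetable Dynamic Binary Translation
[Figure: QEMU as a retargetable DBT. Guests (ARMv7, ARMv8, X86, MIPS, new ISAs) are translated into the DBT IR, which is then translated to the hosts (ARMv7, ARMv8, X86, MIPS).]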
Application Virtualization
 Isolation Properties
– Fault isolation, Software isolation
– Performance isolation
 Encapsulation
– Cleanly capture all VM states
– Enables VM snapshots, clones
 Migration
– Independent of physical hardware
– Enables migration of live VMs
 Interposition
– All requests go through VMM – this allows VM management such as
profiling, encryption, compression, replication.
System VMs 17
Application Virtualization
 Resource consolidation
– Server consolidation
– Client consolidation
 Simultaneous support for multiple OSes/Apps
– Easy way to implement timesharing, e.g. IBM System/370
 Simultaneous support for different OSes/Apps
– E.g. Windows and Unix
 Error containment
– If one VM crashes, the other VMs can continue to work
Assumes VMM is correct (smaller/simpler)
 Operating System debugging
– Can proceed while system is being used for normal work

Resource Consolidation
 Server consolidation
– Reduce number of servers
– Reduce space, power and cooling
– 70-80% reduction numbers cited in industry
 Client consolidation
– Developers: test multiple OS versions, distributed application
configurations on a single machine
– End user: Windows on Linux, Windows on Mac
– Reduce physical desktop space, avoid managing multiple
physical computers

Application Virtualization, contd.
 Operating System Migration
– Can proceed while the “old” OS continues to be used
[Figure: timeline of an OS migration. New release: system programmers first, then converted production users. Old release: production users, then unconverted production users, finally permanently unconverted production users. Milestones over time: new release being tested, new release installed, newer release being tested.]
Today’s Applications

Server Consolidation
[Figure: several separate App/OS/HW machines are consolidated as App/OS guests on one VMM and HW.]
Benefit: Cost Savings

Work Isolation
[Figure: R&D and Production workloads run as App/OS guests on a VMM over HW.]
Benefit: Business Agility and Productivity
Emerging Applications

Disaster Recovery
[Figure: App/OS guests on VMMs over separate HW.]
Benefit: Business Continuity

Dynamic load balancing
[Figure: Apps 1 to 4, each on its own OS, on VMMs over separate HW; CPU usage 90% and 30%.]

Partitioning
[Figure: Apps 1 to 4, each on its own OS, share one VMM and HW.]
CPU Virtualization
 ISA Virtualizability
Ideally, an ISA separates privileged and
non-privileged instructions so that all
control-sensitive and behavior-sensitive
instructions are privileged instructions. Then
the trap-and-emulate model can be
efficiently implemented.
[Figure: the instruction set split into privileged and non-privileged instructions.]
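The virtualizability condition can be read as a set check: every sensitive instruction must also be privileged. The sketch below encodes that check over a tiny instruction table. The `InstClass` struct and both tables are hypothetical and heavily simplified; the `popf` entry reflects the well-known pre-VT-x IA-32 case of a sensitive instruction that does not trap in user mode.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    int privileged;          /* traps when executed in user mode */
    int control_sensitive;   /* changes system resource configuration */
    int behavior_sensitive;  /* result depends on that configuration */
} InstClass;

static int is_sensitive(const InstClass *c)
{
    return c->control_sensitive || c->behavior_sensitive;
}

/* Count sensitive instructions that would NOT trap in user mode. */
static size_t count_untrapped_sensitive(const InstClass *isa, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        assert(isa[i].name != NULL);
        if (is_sensitive(&isa[i]) && !isa[i].privileged)
            bad++;
    }
    return bad;
}

/* Trap-and-emulate works efficiently only if nothing sensitive escapes the trap. */
static int is_virtualizable(const InstClass *isa, size_t n)
{
    return count_untrapped_sensitive(isa, n) == 0;
}

/* Hypothetical ISA in which every sensitive instruction is privileged. */
static const InstClass clean_isa[3] = {
    { "add",       0, 0, 0 },   /* innocuous */
    { "set_ptbr",  1, 1, 0 },   /* control-sensitive and privileged */
    { "read_mode", 1, 0, 1 },   /* behavior-sensitive and privileged */
};

/* Simplified IA-32-like table: popf is sensitive but not privileged. */
static const InstClass x86_like_isa[3] = {
    { "add",  0, 0, 0 },
    { "lgdt", 1, 1, 0 },
    { "popf", 0, 1, 0 },
};
```

For example, `is_virtualizable(x86_like_isa, 3)` reports failure because of `popf`, which is the "trapping too few" situation that binary translation, para-virtualization, or hardware support has to handle.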

Instruction Types -- Summary

[Figure: Venn diagram relating privileged, non-privileged, sensitive (behavior-sensitive and control-sensitive), and innocuous instructions.]

 Innocuous instructions: those that are neither control-sensitive nor behavior-sensitive

Ideally, we would like to trap the sensitive instructions.

Running the guest OS in de-privileged mode has two problems:
 Trapping too many: there is no need to trap on all sensitive instructions
 Trapping too few: some sensitive instructions are not trapped

Para-Virtualization vs. Full Virtualization
 Full Virtualization (FV)
– Transparent. Guest OSes are unmodified.
 Para-Virtualization (PV)
– The necessary parts of the guest OS are actively modified.
– Special hooks allow the guests and the host to
communicate.
– Simplifies the VMM and reduces overhead
– Requires the guest OS to be explicitly ported to the
“Para-API”.

Hardware Assisted System VM: Intel’s VT-x
Pre VT-x
– VMM ring de-privileging of guest OS
– Guest OS aware it is not at Ring 0

Post VT-x
– VMM executes in the VMX root-mode
– Guest OS de-privileging eliminated
– Guest OS runs directly on hardware

Source: [2]
Full Virtualization
 Support multiple guest OSes on a single hardware
platform; all running the same ISA
[Figure: three guests (Linux, Windows, Solaris), each running its applications on an unmodified OS on a virtual Intel x86 machine. Guests enter the hypervisor through traps. The hypervisor runs on the Intel x86 hardware with memory and I/O devices.]
SW-Assisted Virtualization
 DBT to translate

[Figure: three guests (Linux, Windows, Solaris), each running its applications on an unmodified OS on a virtual Intel x86 machine. Guests enter the hypervisor through traps. The hypervisor runs on the Intel x86 hardware (+ support for VM) with memory and I/O devices.]
HW-Assisted Virtualization
 Hardware support for virtualization has been added since
2005 to simplify Full Virtualization, such as Intel
VT-x, VT-i, VT-d, VT-c, AMD-V, ….
[Figure: three guests (Linux, Windows, Solaris), each running its applications on an unmodified OS on a virtual Intel x86 machine. Guests enter the hypervisor through traps. The hypervisor runs on the Intel x86 hardware (+ support for VM) with memory and I/O devices.]
Para-virtualization
 Guest OSes may be modified to communicate with the
hypervisor via hypercalls. I/O drivers have been
specialized.
[Figure: three guests (Linux, Windows, Solaris), each running its applications on a modified OS on a virtual Intel x86 machine. Guests call the hypervisor through hypercalls. The hypervisor runs on the Intel x86 hardware with memory and I/O devices.]
Identical Guest Systems
 Support multiple guest OSes on a single hardware
platform; all running the same ISA
[Figure: three identical guests, each running Linux applications on a Linux OS on a virtual Intel x86 machine, on top of the Intel x86 hardware with memory and I/O devices.]
Multi-processing in Linux

[Figure: three Linux applications share a single Linux OS running on the Intel x86 hardware with memory and I/O devices.]
Container/Docker
[Figure: system containers (e.g. OpenVZ) and application containers (e.g. Docker) run on a container engine, which runs on a Linux OS on the Intel x86 hardware with memory and I/O devices.]
Container Technology
 A container is a virtual environment that groups and isolates a set
of processes and resources from the host and from other containers.
 Interest in container technologies for cloud computing has surged
in recent years.
 Docker is intended to run a single application per container, such as
MySQL, Nginx (a web server or a load balancer), or Redis (a key-value
database). To run two or more applications, consider two Docker
containers or a system container (such as LXC, a Linux instance).
 Containers have often been called “lightweight virtual machines”.
 Compared to VMs, containers have a more limited scope. However,
containers have lower overhead, hence are welcomed by the cloud
computing community.
Native VM (Type-I) vs. Hosted VMs (Type-II)

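[Figure: four system organizations, showing which software runs in privileged mode and which runs in non-privileged modes.]
– Traditional uniprocessor system: applications on an OS on hardware
– Native VM system (Type-I): virtual machines on a VMM that runs directly on hardware
– User-mode hosted VM system (Type-II): virtual machines on a VMM that runs on top of a host OS
– Dual-mode hosted VM system (Type-II): virtual machines on a VMM that shares the hardware with a host OS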
Native System VM Environment
Example: Xen
[Figure: Linux, Windows, and Solaris guests, each running its applications on a virtual Intel IA-32 machine, on top of the Virtual Machine Monitor (VMM) on Intel IA-32 hardware.]
VMM is responsible for scheduling and managing the allocation of HW resources
User Mode Hosted VM

Example: VMware GSX Server
[Figure: Windows apps run on guest OSes (Windows) on top of the VMM, which runs on the host OS (Linux) on Intel IA-32 hardware.]
The VMM can patch privileged instructions into VMM calls (traps), or use DBT techniques


Some Popular Virtual Machines

 Oracle VirtualBox (Windows/Mac/Linux)

 Parallels (Windows/Mac/Linux)

 VMware (Windows/Linux)

 QEMU (Linux)

VirtualBox
• Available for x86 based machines (both Intel and AMD).
• Users can load multiple guest OS under a single host OS.
• Support both software-based and hardware-based virtualization.
• Open Source Software
• Free
Parallels
• Available for Intel-based Apple Mac machines
• Users can load multiple guest OS (e.g. Linux, Windows) under a single Mac host OS.
• $79.99 (~NTD $2600)
VMware
• Available for x86 based machines (Intel, AMD, and Mac).
• Users can load multiple guest OS under a single host OS.
• VMware Fusion lets you run more than 200 OSes, including Windows XP through Windows 8.
• Deliver Windows applications to Mac users
• $189
QEMU
• A generic and open source machine emulator and virtualizer, supporting both process VM and system VM.
• When used as a machine emulator, QEMU can run OS and applications cross ISA (e.g. ARM app on PC) with good performance.
• QEMU supports virtualization when executing under Xen or KVM hypervisors.
• Free
Intel VT-x Technology (Vanderpool)
 New CPU Modes: VMX root/non-root
modes
– VMM runs in VMX root mode
– Guest VM runs in VMX non-root mode
– Each mode has ring 0 to ring 3
 Virtual Machine Control Structure (VMCS)
 Transitions
– VM entry: root to non-root transition
– VM exit: non-root to root transition

VMCS
VMCS consists of 6 control groups
Guest state area
– Guest states saved on VM exits and loaded
on VM entries
Host state area
– Host states loaded from the host state area on
VM exits
VM execution control fields
VM-exit control fields
VM-entry control fields
VM-exit information fields
VM Timesharing
 VMM Timeshares resources among guests
– Similar to OS timesharing applications

[Figure: timeline of one switch between VMs]
First VM Active
1) Timer interrupt occurs
VMM Active
2) VMM saves architected state of running VM
3) VMM determines next VM to be activated
4) VMM restores architected state for next VM
5) VMM sets timer interval and enables interrupts
6) VMM sets PC to timer interrupt handler of OS in next VM
Next VM Active
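The switch sequence in the timeline can be sketched as a single timer-interrupt handler inside the VMM. This is a simulation under stated assumptions, not a real hypervisor: the `cpu` field stands in for the hardware registers, round-robin is just one possible policy for choosing the next VM, and `epc` (a saved return address for the guest's handler) is a hypothetical register added so the guest can resume where it was interrupted.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_VMS    3
#define TIME_SLICE 1000u   /* arbitrary timer interval for the sketch */

typedef struct {
    uint32_t pc;
    uint32_t epc;          /* hypothetical: where the guest handler returns to */
    uint32_t regs[4];
} ArchState;

typedef struct {
    ArchState saved;            /* architected state while the VM is not running */
    uint32_t timer_handler_pc;  /* guest OS timer interrupt handler */
} VirtualMachine;

typedef struct {
    VirtualMachine vms[NUM_VMS];
    ArchState cpu;              /* stands in for the real hardware registers */
    int current;                /* index of the running VM */
    uint32_t timer_interval;
    int interrupts_enabled;
} Vmm;

/* Runs in the VMM when the timer interrupt occurs; returns the new VM index. */
static int on_timer_interrupt(Vmm *v)
{
    assert(v->current >= 0 && v->current < NUM_VMS);
    v->interrupts_enabled = 0;
    v->vms[v->current].saved = v->cpu;         /* save state of running VM */
    v->current = (v->current + 1) % NUM_VMS;   /* choose next VM (round robin) */
    v->cpu = v->vms[v->current].saved;         /* restore state of next VM */
    v->timer_interval = TIME_SLICE;            /* set timer interval ... */
    v->interrupts_enabled = 1;                 /* ... and enable interrupts */
    v->cpu.epc = v->cpu.pc;                    /* remember the resume point */
    v->cpu.pc = v->vms[v->current].timer_handler_pc;  /* enter guest handler */
    return v->current;
}

/* Deliver `ticks` timer interrupts; return the resume address of the active VM. */
static uint32_t timeshare_demo(int ticks)
{
    Vmm v = {0};
    for (int i = 0; i < NUM_VMS; i++) {
        v.vms[i].saved.pc = 0x1000u * (uint32_t)(i + 1);
        v.vms[i].timer_handler_pc = 0x8000u + (uint32_t)i;
    }
    v.cpu = v.vms[0].saved;   /* VM 0 starts out running */
    v.cpu.pc = 0x1234u;       /* ... and has made some progress */
    for (int t = 0; t < ticks; t++)
        on_timer_interrupt(&v);
    return v.cpu.epc;
}
```

After three ticks the round robin returns to VM 0, and its resume address is the `0x1234` that was saved on the first interrupt, which shows that the architected state survives while other VMs run.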
