
Session WV23

Fast and Easy Disk Workload Characterization on VMware ESX Server
Irfan Ahmad
R&D Engineer
VMware, Inc.
Motivation

This session is useful to you if you:


Have a disk-performance-sensitive workload
Want to improve disk performance of your production VMs
Need to communicate better with your SAN admin about
your I/O workload
Want to match different RAID arrays to different workloads
Wish to monitor your disk workload characteristics over time
Like to pick the best filesystem for each job
Abstract and Outline

Disk I/O characterization of applications is the first step in tuning disk subsystems; key questions:
I/O block size
Spatial locality
I/O interarrival period
Active queue depth
Latency
Read/Write ratios
Our technique allows transparent and online collection of essential workload characteristics
Applicable to arbitrary, unmodified operating systems running in virtual machines
We demonstrate our technique using:
Filebench OLTP
Differences between ZFS and UFS filesystems on Solaris
OSDL DBT-2
Windows large file copy
Negligible overheads in CPU, memory, and latency
Target Audience

System administrators interested in optimizing the disk subsystem
Good tool for sysadmins to gather data for the
consumption of SAN administrators
Meant for power users
In its current form, the user interface isn’t very pretty
Command line only; no graphical interface
Data produced is very detailed
Requires minimal post-processing for basic analysis
Workload Characterization Technique

Histograms of observed data values can be much more informative than single numbers such as the mean, median, or standard deviation
E.g., multimodal behaviors are easily identified by plotting a histogram, but obscured by a mean (see the made-up example and the sketch below)
Histograms can actually be calculated efficiently online
Why take one number if you can have a distribution?
Made-up Example
[Figure: made-up latency histogram. X-axis: latency of an operation (microseconds); y-axis: frequency. The distribution has modes near 1 and 10 microseconds, yet the mean is 5.3.]
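To make the point concrete, here is a minimal Python sketch (not from the original slides; the latency values and bin limits are invented) showing how a bimodal distribution vanishes into a single mean but stands out in a histogram.

```python
# Toy illustration with made-up data: the mean hides the two latency
# modes that a histogram makes obvious.
from bisect import bisect_left
from collections import Counter

BIN_LIMITS = [1, 2, 5, 10, 20, 50, 100]   # hypothetical bucket upper bounds (us)

# Invented latencies: one mode near 1-2 us, a second mode near 10 us.
latencies = [1, 1, 2, 1, 2, 1, 9, 10, 11, 10, 1, 2, 10, 9, 1]

mean = sum(latencies) / len(latencies)
hist = Counter(BIN_LIMITS[bisect_left(BIN_LIMITS, v)] for v in latencies)

print(f"mean = {mean:.1f} us")            # one number, no hint of two modes
for limit in BIN_LIMITS:
    print(f"<= {limit:3d} us : {hist.get(limit, 0)}")
```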
Workload Characterization Technique

The ESX disk I/O workload characterization is on a per-virtual-disk basis
Data is collected per virtual disk
Allows us to separate each different type of workload into its own container and observe trends
Histograms only collected if enabled; no overhead otherwise
Technique:
For each virtual machine I/O request in ESX, we insert some values into histograms (a small sketch follows)
E.g., size of I/O request → 4KB
[Figure: example I/O size histogram with buckets at 1024, 2048, 4096, and 8192 bytes; y-axis: frequency.]
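A rough Python sketch of the bookkeeping described above, purely illustrative: the class and method names (VDiskStats, record_io) are hypothetical, and this is not ESX's actual implementation.

```python
# Illustrative sketch of per-virtual-disk online histogram collection.
# VDiskStats and record_io are hypothetical names; ESX's real code differs.
from bisect import bisect_left
from collections import defaultdict

IO_SIZE_BINS = [512, 1024, 2048, 4095, 4096, 8191, 8192, 16383, 16384]

class VDiskStats:
    def __init__(self):
        self.io_size_hist = defaultdict(int)   # bucket upper bound -> count

    def record_io(self, size_bytes):
        i = bisect_left(IO_SIZE_BINS, size_bytes)
        key = IO_SIZE_BINS[i] if i < len(IO_SIZE_BINS) else "overflow"
        self.io_size_hist[key] += 1            # O(log #bins) work per request

# One container per virtual disk keeps each workload separate.
stats = defaultdict(VDiskStats)
stats["vm1:scsi0:0"].record_io(4096)           # e.g. a 4KB request
```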
Workload Characterization Technique
Full List of Histograms

Read/Write distributions are available for our histograms
Overall Read/Write ratio?
Are Writes smaller or larger than Reads in this workload?
Are Reads more sequential than Writes?
Which type of I/O is incurring more latency?
In reality, the problem is not knowing which question to ask
Collect data, see what you find
Full list:
I/O Size: All, Reads, Writes
Seek Distance: All, Reads, Writes
Seek Distance Shortest Among Last 16
Outstanding IOs: All, Reads, Writes
I/O Interarrival Times: All, Reads, Writes
Latency: All, Reads, Writes
Workload Characterization Technique
Histogram Buckets

To make the histograms practical, bin boundaries are on a rather irregular scale
E.g., the I/O length histogram bin boundaries run …, 2048, 4095, 4096, 8191, 8192, … which looks rather odd: some buckets are big and others are as small as just 1
Certain block sizes are really special since the underlying storage subsystems may optimize for them; single those out from the start (else lose that precise information)
E.g., it is important to know if the I/O was 16KB or some other size in the interval (8KB, 16KB); see the sketch after this slide
[Figure: example bin boundaries 2048, 4095, 4096, 8191, 8192, 16383, 16384, 32768.]
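The sketch below illustrates how such irregular buckets keep exact power-of-two sizes distinguishable. The bin-edge list is an assumption modeled on the values shown above, not the exact table ESX uses.

```python
# Irregular bucket boundaries (assumed values, modeled on the slide above):
# exact power-of-two sizes get a one-wide bucket so they remain visible.
from bisect import bisect_left

BIN_EDGES = [512, 1024, 2048, 4095, 4096, 8191, 8192,
             16383, 16384, 32768, 49152, 65535, 65536]

def bucket(size_bytes):
    i = bisect_left(BIN_EDGES, size_bytes)
    return BIN_EDGES[i] if i < len(BIN_EDGES) else ">max"

print(bucket(16384))   # -> 16384 : exactly 16KB, its own bucket
print(bucket(12288))   # -> 16383 : "some size in (8KB, 16KB)"
print(bucket(8192))    # -> 8192  : exactly 8KB
```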
Test Setup

Machine Model            HP DL 585 G2
CPU                      8 CPUs (4 sockets, dual-core) @ 2.4 GHz
Total Memory             8 GB
Hypervisor               VMware ESX Server
Disk Subsystem (FC SAN)  EMC Symmetrix, 500GB RAID
                         QLogic 2340 (Fibre Channel)

Table 1. Machine/storage specifications


Filebench OLTP (Solaris)

Filebench is a model-based workload generator for file systems, developed by Sun Microsystems
The input to this program is a model file that specifies processes and threads in a workflow
Filebench OLTP “personality” is a model to emulate an
Oracle database server generating I/Os under an online
transaction processing workload
Other personalities include fileserver, webserver, etc.
Used two different filesystems (UFS and ZFS)
To study what effect a filesystem can have on I/O characteristics
Ran filebench on Solaris 5.11 (build 55)
I/O Length
Filebench OLTP

[Figure: I/O Length Histogram, UFS. X-axis: length (bytes); y-axis: frequency.]
[Figure: I/O Length Histogram, ZFS. X-axis: length (bytes); y-axis: frequency.]
Annotation: 4K and 8K I/Os transformed into 128K by ZFS?
Seek Distance
Filebench OLTP

Seek distance: a measure of sequentiality versus randomness in a workload
Somehow a random workload is transformed into a sequential one by ZFS! More details needed...

[Figure: Seek Distance Histogram, UFS. X-axis: distance (sectors); y-axis: frequency.]
[Figure: Seek Distance Histogram, ZFS. X-axis: distance (sectors); y-axis: frequency.]
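The slides do not spell out how seek distance is computed; a common definition, assumed in this sketch, is the gap in sectors between an I/O's starting LBA and the end of the previous I/O, so 0 means perfectly sequential and large positive or negative values mean random access.

```python
# Assumed seek-distance definition: gap in sectors between an I/O's start
# LBA and the end of the previous I/O (0 = sequential, large +/- = random).
def seek_distances(ios):
    """ios: iterable of (start_lba, length_in_sectors) tuples."""
    prev_end = None
    for lba, nsectors in ios:
        if prev_end is not None:
            yield lba - prev_end            # negative = backward seek
        prev_end = lba + nsectors

trace = [(1000, 8), (1008, 8), (1016, 8), (500000, 8), (1024, 8)]
print(list(seek_distances(trace)))          # [0, 0, 498976, -498984]
```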
Seek Distance
Filebench OLTP, More Detail: Split Out Reads and Writes

[Figure: Seek Distance Histograms, UFS, writes and reads plotted separately. X-axis: distance (sectors); y-axis: frequency.]
[Figure: Seek Distance Histograms, ZFS, writes and reads plotted separately. X-axis: distance (sectors); y-axis: frequency.]

Transformation from random to sequential: primarily for Writes
Reads: seek distance is reduced (look at histogram shape and scales)
Filebench OLTP
Summary

So, what have we learnt about Filebench OLTP?


I/O is primarily 4K but 8K isn’t uncommon (~30%)
Access pattern is mostly random
Reads are entirely random
Writes do have a forward-leaning pattern
ZFS is able to transform random Writes into sequential:
Aggressive I/O scheduling
Copy-on-write (COW) technique (blocks on disk not modified in place)
Changes to blocks from app writes are written to alternate locations
This streams otherwise random data writes into a sequential pattern on disk
Performed this detailed analysis in just a few minutes
OSDL Database Test 2 (Linux 2.6.17-10)

OSDL DBT-2:
A fair usage implementation of the TPC-C benchmark specification
Simulates a wholesale parts supplier
Several workers access a database, update customer information
and check on parts inventories
More info on DBT-2 @
More info on TPC-C @ http://www.tpc.org/tpcc
Used the PostgreSQL 8.1 RDBMS implementation of DBT-2
http://www.postgresql.org/docs/techdocs
Ubuntu 6.10 server distribution
Linux 2.6.17-10
Scaling factor of 250 (warehouses) with 50 connections
OSDL Database Test 2 (Linux 2.6.17-10)
Analysis
Workload is primarily random (big spikes towards the right and left edges of the graph)
Still, many I/Os are within 500 sectors (20%) or within 5,000 sectors (33%) of the previous command
The workload is almost exclusively 8K for both reads and writes

[Figure: Seek Distance Histogram (Writes). X-axis: distance (sectors); y-axis: frequency.]
[Figure: I/O Length Histogram. X-axis: length (bytes); y-axis: frequency.]
OSDL Database Test 2 (Linux 2.6.17-10)
Analysis (2)
The number of outstanding I/Os is very different in this workload between reads and writes
PostgreSQL almost always issues 32 write I/Os simultaneously
The I/O rate from this workload varies over time, by as much as 15% over a 2-minute period

[Figure: Outstanding I/Os Histogram (Reads, Writes). X-axis: I/Os outstanding at arrival time; y-axis: frequency.]
[Figure: Outstanding I/Os Histogram over time (6-second intervals). X-axis: I/Os outstanding at arrival time; y-axis: frequency.]
OSDL Database Test 2 (Linux 2.6.17-10)
Summary

In aggregate, the workload appears random
But 20% of I/Os are within 250KB and 33% are within 2.4MB of the previous one! (500 sectors × 512 bytes ≈ 250KB; 5,000 sectors × 512 bytes ≈ 2.4MB)
I/O size is 8K for both reads and writes
Outstanding I/Os very different between reads and writes
PostgreSQL almost always issues 32 write I/Os simultaneously
I/O rate varies over time (up to 15%)
Don’t assume that every database workload behaves the
same; measure and determine for yourself
Windows File Copy
XP versus Vista

XP issues 64KB I/Os
I/Os are largely sequential
Vista is issuing very large I/Os (1MB)
Latency is higher
Number of commands is lower
I/Os are very sequential
Vista enables large I/Os to be issued; file copy is just an example
Keep an eye out for increasing I/O sizes in future workloads

[Figure: I/O Latency Histogram, Vista Enterprise vs. XP Pro. X-axis: latency (microseconds); y-axis: frequency.]
[Figure: I/O Length Histogram, Vista Enterprise vs. XP Pro. X-axis: length (bytes); y-axis: frequency.]
[Figure: Seek Distance Histogram, Vista Enterprise vs. XP Pro. X-axis: distance; y-axis: frequency.]
Performance Overhead of Stats Collection

Used Iometer to generate 4KB sequential reads
16 outstanding I/Os
Windows 2003 Enterprise Edition 64-bit VM
4KB is the most realistic worst-case scenario for overheads
Overhead is negligible (tested on internal build)

Online Histo Service             Disabled   Enabled
IOps                             8187       8137
IOps Std. Dev.                   6.5        200
MBps                             35.1       34.8
CPU (out of 800)                 106.0      108.0
CPU Std. Dev.                    2.7        4.8
CPU Efficiency (UsedSec/IOps)    0.0417     0.0424
Latency (ms)                     1.6        1.6

Table 2. Microbenchmark Performance


How to Run

Simple command-line interface; help is self-explanatory


$ /usr/lib/vmware/bin/vscsiStats -h
Use Excel to graph the output or inspect visually
Not detailed enough?
For analysis that cannot be done efficiently online, we provide a
virtual SCSI command tracing framework.
No actual customer data is included in the trace; only a list of (command type, logical block address, time) tuples is recorded
$ vscsiStats -t …
Users can write scripts to post-process and analyze this trace
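Since the slides leave trace post-processing to the user, here is a small hedged example: it assumes the trace has been exported as CSV lines of command type, logical block address, and a timestamp in microseconds (the real `vscsiStats -t` output format may differ) and computes per-command-type interarrival times.

```python
# Example trace post-processing script. The input format assumed here
# (CSV lines of command_type,lba,timestamp_us) is an assumption; adapt the
# parsing to whatever your vscsiStats build actually emits.
import csv
import sys
from collections import defaultdict

def interarrival_times(path):
    last_ts = {}
    gaps = defaultdict(list)
    with open(path) as f:
        for cmd, lba, ts in csv.reader(f):
            ts = int(ts)
            if cmd in last_ts:
                gaps[cmd].append(ts - last_ts[cmd])
            last_ts[cmd] = ts
    return gaps

if __name__ == "__main__":
    for cmd, deltas in sorted(interarrival_times(sys.argv[1]).items()):
        print(f"{cmd}: mean interarrival {sum(deltas) / len(deltas):.1f} us")
```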
How to Run
Sample Output
$ /usr/lib/vmware/bin/vscsiStats -p iolength
Histogram: IO lengths of commands {
 min : 512
 max : 32768
 mean : 11731
 count : 241
 {
   5 (<= 512)
   14 (<= 1024)
   5 (<= 2048)
   17 (<= 4095)
   76 (<= 4096)
   1 (<= 8191)
   20 (<= 8192)
   18 (<= 16383)
   36 (<= 16384)
   49 (<= 32768)
   0 (<= 49152)
   0 (<= 65535)
   0 (<= 65536)
   0 (<= 81920)
   0 (<= 131072)
   0 (<= 262144)
   0 (<= 524288)
   0 (> 524288)
 }
}

$ /usr/lib/vmware/bin/vscsiStats -p latency
Histogram: latency of IOs in Microseconds (us) {
 min : 191
 max : 13391
 mean : 598
 count : 288
 {
   0 (<= 1)
   0 (<= 10)
   0 (<= 100)
   248 (<= 500)
   28 (<= 1000)
   4 (<= 5000)
   8 (<= 15000)
   0 (<= 30000)
   0 (<= 50000)
   0 (<= 100000)
   0 (> 100000)
 }
}

Bin ranges (bucket limits): think of these as the x-axis of the histogram plots
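Because the slides suggest graphing this output in Excel, a small helper like the following can convert the bracketed "count (<= limit)" lines shown above into CSV. The regular expression is based only on the sample output above and may need adjusting for other tool versions.

```python
# Convert vscsiStats histogram output (format as in the sample above) to CSV
# for Excel. The "count (<= limit)" line format is assumed from that sample.
import re
import sys

ROW = re.compile(r"^\s*(\d+)\s+\((<=|>)\s*(\d+)\)")

def to_csv(lines):
    print("bucket_limit,relation,count")
    for line in lines:
        m = ROW.match(line)
        if m:
            count, relation, limit = m.groups()
            print(f"{limit},{relation},{count}")

if __name__ == "__main__":
    # e.g.: /usr/lib/vmware/bin/vscsiStats -p iolength | python to_csv.py
    to_csv(sys.stdin)
```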
Best Practices

When to use
When deploying a new disk performance sensitive workload
When optimizing an existing disk performance critical production VM
Which metrics to start with
Most important metrics to discuss with SAN Admin:
I/O Size
Read/Write Ratios
Outstanding I/Os
How to interpret
Pay attention to changes in distribution shape as well as magnitude
Corrective actions
Tune disk subsystem and remeasure; pay attention to latency histogram
Limitation: Histograms Are Per Virtual Disk

Strength: allows deep analysis of the separate flavors of workload in each VM by splitting workloads by virtual disk
Place DB redo logs on a separate virtual disk from the DB tablespaces
Weakness: doesn't give a complete picture of the I/O going to a VMFS LUN
Many VMs might be doing I/O from the same ESX host
VMs from different ESX hosts might also be doing I/O
In general, figuring out the aggregate pattern at the LUN is a hard problem
Rule of thumb: I/O to a LUN from different apps is effectively random
Still, storage arrays are often smart enough to pick out individual sequential streams and schedule I/O per stream
Interference with Multiple VMs
Analogous to multiple hosts connected to the same storage
Two workloads:
Windows XP filecopy
Windows 2003 Iometer: 512K random writes with 64 outstanding I/Os
Run together, the Windows filecopy experienced a slowdown
See the histograms shift to the right, indicating increasing disk I/O latencies

[Figure: I/O Latency Histogram (Writes), Filecopy Solo vs. Filecopy (Iometer Interference). X-axis: latency (microseconds); y-axis: frequency.]
[Figure: I/O Latency Histogram (Reads), Filecopy Solo vs. Filecopy (Iometer Interference). X-axis: latency (microseconds); y-axis: frequency.]
Potential Future Work
If Enough Interest Exists

Further work in this area really depends on your feedback
Some ideas worth talking about:
A graphical control and visualization UI
Recommendations for virtual disk placement
Binning virtual disks by workload similarity
User-specified histogram bin sizes
Provide feedback and file feature requests
Email: irfan@vmware.com
Stay tuned at author’s blog: http://virtualscoop.org
VROOM blog: http://blogs.vmware.com/performance
Questions? What’s Next?

Meet The Engineers Session
Area A @ Tues 6:00 pm - 7:30 pm
Read my academic paper
“Easy and Efficient Disk I/O Workload Characterization in
VMware ESX Server”
Submitted to IEEE International Symposium on Workload
Characterization (IISWC) 2007. To be presented: Sept 29, 2007
Try it out for yourself
For VMware ESX 3.0, vm-support collects much of this data. If
you send us the output we can parse it for you.
The interactive version of this tool (as shown today) is under
development. Not in any currently shipping products.
