
Troubleshooting Storage Performance

Sajjad Siddicky, GSS Escalation

1
Theme: Storage “Where bad things happen”

This is not easy stuff


• This is a complex and confusing topic

The impact on your virtual infrastructure is serious


• Disk latency can have a serious impact on applications running in a
virtual environment

2
Scope, scope, scope!

 What is affected?

• A particular application
• Only one guest
• All guests accessing the same LUN/volume
• All guests running on the same ESXi host
• Are there guests on the same LUN NOT reporting issues?

3
So, you think it IS storage! Where do I start?

Multiple ESXi hosts affected? Start with the array logs


 SAN Array logs
• Error logs
• Latency stats (IOPS / MBps)
• Cache status
• Scheduled tasks (backup, replication, etc.)

1 ESXi host affected? Start with host logs


 ESXi logs
• vmkernel.log (SCSI sense code failures, reservation errors)
• esxtop

4
Storage – “Where bad things happen”

I/O flows through the ESXi host (Virtual SCSI, VMFS / NFS client, Paths) and
the array (Front-end, Processor, Array cache, Back-end and device
configuration, Spindles – “just not enough disks”).

Typical problem areas: IOps/MBps maximums, I/O – “not enough speed”,
processor saturation, cache issues.

5
ESXi host - vmkernel logs

Location: /var/log/vmkernel.log (ESXi 5.x)

Example #1:
vmkernel: 1:08:42:28.062 cpu3:8374)NMP:
nmp_CompleteCommandForPath:2190: Command 0x16 (0x41047faed080) to
NMP device "naa.600508b40006c1700001200000080000" failed on physical
path "vmhba39:C0:T1:L16" H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0
0x0.

VMK_SCSI_DEVICE_QUEUE_FULL (TASK SET FULL) = 0x28

This status is returned when the LUN cannot accept SCSI commands from
initiators due to a lack of resources, namely the queue depth on the array.

* KB 1030381 for a complete listing of device-side NMP errors

6
ESXi host - vmkernel logs

Example #2:

vmkernel: 116:03:44:19.039 cpu4:4100)NMP:


nmp_CompleteCommandForPath: Command 0x2a (0x4100020e0b00) to
NMP device "naa.600508b40006c1700001200000080000" failed on physical
path "vmhba2:C0:T0:L152" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0
0x0.

VMK_SCSI_HOST_BUS_BUSY = 0x02 or 0x2

This status is returned when the HBA driver is unable to issue a command to
the device. This status can occur due to dropped FCP frames in the
environment.

* KB 1029039 for a complete listing of host-side NMP errors
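
When a host reports these, a quick way to see how widespread the NMP failures are is to scan the log for the pattern shown above. A minimal sketch (not a VMware tool; it assumes the single-line log format used by ESXi, whereas the slides wrap the lines):

```python
import re
from collections import Counter

# Minimal sketch: tally NMP command failures per device and status code from a
# copy of /var/log/vmkernel.log (format as in the examples above).
PATTERN = re.compile(
    r'NMP device "(?P<device>naa\.[0-9a-f]+)" failed on physical path '
    r'"(?P<path>[^"]+)" H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) '
    r'P:(?P<p>0x[0-9a-f]+)', re.IGNORECASE)

counts = Counter()
with open("vmkernel.log") as log:
    for line in log:
        m = PATTERN.search(line)
        if m:
            key = (m.group("device"),
                   f"H:{m.group('h')} D:{m.group('d')} P:{m.group('p')}")
            counts[key] += 1   # e.g. ('naa.600508b4...', 'H:0x0 D:0x28 P:0x0')

for (device, status), n in counts.most_common(10):
    print(f"{n:6d}  {device}  {status}")
```
Map the D: value against KB 1030381 and the H: value against KB 1029039.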

7
ESXi host - vmkernel logs (ESXi 5.x)

Device naa.5000c5000b36354b performance has deteriorated. I/O latency
increased from average value of 1832 microseconds to 19403 microseconds.

Note: The message shows microseconds, which can be converted to
milliseconds: 19403 microseconds = 19.403 milliseconds.

Only relevant if the error repeats for long periods on the same device
and/or the latency value is large (>20 milliseconds)
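
These messages can be summarized quickly across a log bundle. A minimal sketch (not a VMware tool; the exact message wording can vary between ESXi builds):

```python
import re
from collections import defaultdict

# Minimal sketch: summarize "performance has deteriorated" messages from a
# copy of vmkernel.log, converting the reported microseconds to milliseconds
# and counting how often each device exceeds the ~20 ms threshold.
PATTERN = re.compile(r"Device (?P<device>\S+) performance has deteriorated"
                     r".*?to (?P<usec>\d+) microsecond")

samples = defaultdict(list)
with open("vmkernel.log") as log:
    for line in log:
        m = PATTERN.search(line)
        if m:
            samples[m.group("device")].append(int(m.group("usec")) / 1000.0)  # µs -> ms

for device, latencies_ms in samples.items():
    over = sum(1 for v in latencies_ms if v > 20.0)
    print(f"{device}: {len(latencies_ms)} messages, "
          f"worst {max(latencies_ms):.1f} ms, {over} over 20 ms")
```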

8
ESXTOP

- Local console
- SSH
- vMA (vSphere Management Assistant)

ESXTOP screens:
• c: CPU (default)
• i: interrupts
• p: power management
• m: memory
• n: network
• d: disk adapter
• u: disk device
• v: disk VM

CPU scheduler: c, i, p | Memory scheduler: m | Virtual switch: n | vSCSI: d, u, v

9
ESXtop Disk Adapter Screen (d)

Host bus adapters (HBAs) - includes SCSI, iSCSI, RAID, and FC-HBA adapters

Latency stats from the Device, Kernel and the Guest

DAVG/cmd - Average latency (ms) from the Device (LUN)
KAVG/cmd - Average latency (ms) in the VMkernel
GAVG/cmd - Average latency (ms) in the Guest

 Kernel Latency Average (KAVG)


• The amount of time an I/O spends in the VMkernel (mostly made up of kernel queue time)
• Investigation threshold: > 2 ms; should typically be 0 ms

 Device Latency Average (DAVG)


• This is the latency seen at the device driver level
• Investigation threshold: > 20 ms; lower is better, some spikes are okay

10
Disk I/O – 3 Main Latencies

I/O path: Application -> Guest OS -> VMM -> vSCSI -> ESX storage stack ->
Driver -> HBA -> Fabric -> Array SP

GAVG = DAVG + KAVG
KAVG = QAVG + kernel processing time
QAVG = time the I/O spends in the storage adapter queue
DAVG = device latency, from the driver/HBA through the fabric to the array SP

11
Interpreting Latency Values

 DAVG is HIGH (>20 ms)


 Is it always high (over 20 ms constantly)?
 Check array logs and the fabric / network
 Check for scheduled tasks (backup / replication, etc.)

 KAVG is HIGH (>2 ms)


 Host resource contention
 QAVG high (>2ms) – QUEUEING
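
These thresholds are easy to script against values read off the esxtop disk screens. A minimal sketch (the sample numbers are hypothetical):

```python
# Minimal sketch: apply the investigation thresholds above to latency values
# (in ms) read from esxtop's disk screens. Sample numbers are hypothetical.
DAVG_MS, KAVG_MS, QAVG_MS = 20.0, 2.0, 2.0   # investigation thresholds

def interpret(davg, kavg, qavg):
    findings = []
    if davg > DAVG_MS:
        findings.append("DAVG high: check array logs, fabric/network, scheduled tasks")
    if kavg > KAVG_MS:
        if qavg > QAVG_MS:
            findings.append("KAVG high with QAVG high: queuing (check queue depths / SIOC)")
        else:
            findings.append("KAVG high: host resource contention")
    return findings or ["latencies within thresholds"]

print(interpret(davg=35.2, kavg=0.1, qavg=0.0))
print(interpret(davg=5.0, kavg=4.8, qavg=4.5))
```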

12
Queue Length (max # of active commands)

Queues down the I/O path:
• Guest OS – GQLEN (Guest OS queue length)
• VMM / VMkernel / vSCSI / ESX storage stack
• Driver / HBA – AQLEN (Adapter queue length)
• Device / LUN – DQLEN (Device/LUN queue length)
• Fabric / Array SP – SQLEN (Array (SP) queue length)

DQLEN can change dynamically with SIOC enabled

SIOC will throttle depending on shares / priorities

13
Queuing example

1 HBA can support only 2,000 active commands and is addressing 40 LUNs

Device/LUN queue depth (DQLEN) = 64

Each LUN gets its own queue, so 64 x 40 = 2,560 potential commands

Result:
I/O will queue up in the kernel (> 2,000 adapter maximum)

VMware sets 32 as the default device queue length
(for QLogic in 5.x, it is 64)

Do not change it unless the array vendor recommends doing so

* KB 1267 for more information
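
The arithmetic behind this example, as a minimal sketch (numbers taken from the example above):

```python
# Minimal sketch: does the sum of per-LUN queue depths oversubscribe what the
# adapter can keep in flight? Numbers are from the example above.
adapter_max_active = 2000   # active commands the HBA can support
luns_on_adapter = 40
dqlen_per_lun = 64          # device/LUN queue depth

potential_in_flight = luns_on_adapter * dqlen_per_lun   # 2,560
print(f"Potential in-flight commands: {potential_in_flight}")
if potential_in_flight > adapter_max_active:
    excess = potential_in_flight - adapter_max_active
    print(f"Oversubscribed by {excess}: excess I/O will queue in the VMkernel")
```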

14
Queuing in the VMkernel

When the active requests exceed the device queue depth, all additional I/O
will be queued in the VMkernel and will show up in QAVG.

Example: LUN queue depth is 32 – 32 I/Os in flight (100% active), 32 queued.

Esxtop “u” screen: queuing is occurring when KAVG is non-zero
GAVG = DAVG + KAVG (QAVG + kernel time)

When the device queue is full, I/O will back up to the VMkernel queue

15
Storage Latencies Will Affect CPU State Times (“c” esxtop)

• WAIT – Waiting on idle (idle VMX – not too much activity) or waiting on a
memory page to be swapped on disk
• IDLE – % of time the VCPU is in the idle loop
• SWPWT – % of time the world is waiting for ESX swapping for the VM
• VMWAIT – Blocked, waiting on storage I/O completion
• RDY – % of time the world was not scheduled due to PCPU contention or limits
• CSTP – Co-de-scheduled state for SMP VMs (the CPU scheduler pausing access)
• MLMTD – % of time not scheduled due to CPU limit violations
• RUN – % of time the VM is running on a PCPU

If %WAIT is high, is it due to %VMWAIT (Blocked)?

16
No errors on the ESX host or the storage – now what?

Check path to Storage

Fibre Channel

• CRC errors
– Bad SFP, cable
• C3 discards, BB_credit exhaustion
– Fabric overloaded, oversubscription
• Fabric routing issues, etc.

iSCSI / NAS

• Wrong VMkernel port used, wrong uplink


• Physical switch
• Spanning-tree Flooding
• Port errors
• Network latency
• Switch CPU usage, etc.

17
Error Check (FIBRE-CHANNEL)
cat /proc/scsi/qla2xxx/1

QLogic PCI to Fibre Channel Host Adapter for QMI2572:


FC Firmware version 5.06.02 (90d5), Driver version 911.k1.1-19vmw
………….
Dpc flags = 0x0
Link down Timeout = 045
Port down retry = 005
Login retry count = 008
Execution throttle = 2048
ZIO mode = 0x6, ZIO timer = 1
Commands retried with dropped frame(s) = 417238

Things to check:
- HBA Driver known issues
- Fabric errors

* KB 1005576 – Enabling verbose logging on QLogic and Emulex Host Bus Adapters

18
Error Check (iSCSI / NFS)

ESXTOP “n”

PORT-ID USED-BY TEAM-PNIC DNAME PKTTX/s MbTX/s PKTRX/s MbRX/s %DRPTX %DRPRX
16777217 iSCSI n/a vSwitch1 0.00 0.00 0.00 0.00 0.00 0.00
16777218 vmnic2 - vSwitch1 1.61 0.00 10.00 0.01 0.00 0.00
16777219 vmk1 vmnic2 vSwitch1 1.20 0.00 6.83 0.01 0.00 0.00

VMKPING:
# vmkping -d -s 1472 10.16.224.237 -c 30

PING 10.16.224.237 (10.16.224.237): 1472 data bytes
1480 bytes from 10.16.224.237: icmp_seq=0 ttl=64 time=0.053 ms
1480 bytes from 10.16.224.237: icmp_seq=1 ttl=64 time=0.091 ms
1480 bytes from 10.16.224.237: icmp_seq=2 ttl=64 time=0.091 ms
1480 bytes from 10.16.224.237: icmp_seq=3 ttl=64 time=0.095 ms
1480 bytes from 10.16.224.237: icmp_seq=4 ttl=64 time=0.094 ms
………….
1480 bytes from 10.16.224.237: icmp_seq=28 ttl=64 time=0.084 ms
1480 bytes from 10.16.224.237: icmp_seq=29 ttl=64 time=0.112 ms

--- 10.16.224.237 ping statistics ---
30 packets transmitted, 30 packets received, 0% packet loss
round-trip min/avg/max = 0.031/0.089/0.129 ms

TCPDUMP:
# tcpdump-uw -i vmk1 -w /vmfs/volumes/localdatastore/nfs.pcap

19
Quick Tip

Disable Bad path


# esxcli storage core path set --state=off -p path

Where:
path is the particular path to be enabled/disabled
device is the NAA ID of the device
state is active or off

# esxcli storage core path set --state=off -p fc.2000001b32865b73:2100001b32865b73-
fc.50060160c6e018eb:5006016646e018eb-naa.6006016095101200d2ca9f57c8c2de11

- This can also be done easily from the GUI: go to “Modify path”, right-click
the path, and select Disable

20
Helpful Tools

21
Guest-level issues
 Iometer
 Perfmon (Windows) / top (Linux)

Host-level issues


 esxtop
 vCenter performance graphs
 vCenter Operations Manager

22
Perfmon (Windows)

Avg. Disk sec/Transfer = average time for each data transfer
~ GAVG
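
Perfmon reports this counter in seconds, while esxtop reports GAVG in milliseconds, so convert before comparing. A minimal sketch (sample values are hypothetical):

```python
# Minimal sketch: compare Perfmon's "Avg. Disk sec/Transfer" (seconds) with
# esxtop's GAVG (milliseconds). Sample values are hypothetical.
perfmon_avg_disk_sec_per_transfer = 0.021   # seconds, from Perfmon
gavg_ms = 22.5                              # milliseconds, from esxtop

perfmon_ms = perfmon_avg_disk_sec_per_transfer * 1000.0
print(f"Perfmon: {perfmon_ms:.1f} ms, GAVG: {gavg_ms:.1f} ms, "
      f"delta: {abs(perfmon_ms - gavg_ms):.1f} ms")
# A large gap points at latency added inside the guest (drivers, guest
# queuing) rather than in the ESXi storage stack or the array.
```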

23
GAVG should be close to R
Latency measuring points down the stack:
• A = Application latency (application / file system in the guest)
• R = Perfmon “Avg. Disk sec/Transfer” (guest I/O drivers / device queue)
• S = Windows physical disk service time
• G = Guest latency (virtual SCSI)
• K = ESX kernel (VMkernel VMFS / NFS client)
• D = Device latency

24
Iometer (I/O workload generator tool)

Simulate I/O

 Windows and Linux


 Sequential / Random
 Metrics Collected
• Total I/Os per Sec.
• Throughput (MBps)
• CPU Utilization
• Latency (avg. & max)

25
vCenter “Disk” Performance Chart

Latency statistics available for Disks in the vCenter performance charts

KAVG
• Kernel Read latency
• Kernel write latency
• Kernel command latency

QAVG
• Queue command latency
• Queue write latency
• Queue read latency

GAVG
• Read latency
• Write latency
• Command latency

DAVG
• Physical device command latency
• Physical device read latency
• Physical device write latency

26
Capture ESXTOP results while issue exists

Batch mode:
esxtop -b -d 2 -n 100 > esxtopcapture.csv

Where “-b” stands for batch mode, “-d 2” is a delay of 2 seconds, and “-n 100” means 100 iterations. In this specific case
esxtop will log all metrics for 200 seconds.
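
The resulting CSV can be opened in Windows Perfmon or analyzed with a script. A minimal sketch is below; it assumes the perfmon-style column headers produced by esxtop batch mode contain “MilliSec/Command”, which may differ between builds, so adjust the filter to your output:

```python
import csv

# Minimal sketch: pull per-adapter/device latency columns out of an esxtop
# batch-mode CSV (esxtopcapture.csv from the command above) and print the
# worst sample seen for each. Adjust the header filter to match your capture.
with open("esxtopcapture.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    latency_cols = [i for i, name in enumerate(header) if "MilliSec/Command" in name]
    worst = {i: 0.0 for i in latency_cols}
    for row in reader:
        for i in latency_cols:
            try:
                worst[i] = max(worst[i], float(row[i]))
            except (ValueError, IndexError):
                pass  # skip blank or malformed samples

for i, value in sorted(worst.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{value:8.2f} ms  {header[i]}")
```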

vm-support snapshot (preferred):


- Pre-ESXi 5.x:
vm-support -s -d <duration in seconds> -i <interval in seconds>

- ESXi 5.x:
vm-support -p -d <duration in seconds> -i <interval in seconds>

KB articles:
- Gathering esxtop performance data at specific times using crontab (http://kb.vmware.com/kb/1033346)
- Collecting performance snapshots using vm-support (http://kb.vmware.com/kb/1967)

27
Considerations and Recommendations

28
VAAI (VMware vStorage APIs for Array Integration)

• Offloads tasks to the SAN storage (reduces CPU load on the host)

• Especially helpful for environments with:


• VDI (boot storms, snapshot creation)
• Mass virtual machine provisioning (VM creation)
• Mass cloning
• Mass Storage vMotions

Make sure SAN storage firmware


is upgraded to support VAAI

29
Partition Alignment

Mis-aligned

Aligned

VMFS partitions are automatically aligned at a 64 KB boundary when created
from vSphere

30
VMFS vs RDM

 VMFS is a distributed file system
 VMFS has negligible performance cost and superior functionality

Use VMFS unless RDM is required

[Chart: VMFS scalability – IOPS for VMFS, RDM (virtual) and RDM (physical)
at 4K, 16K and 64K I/O sizes]

31
Virtual Disk modes

 Independent Persistent
• Changes persistently written to disk

 Independent Non-persistent
• Changes written to a redo log; the ESXi host reads the redo log first on reads (performance hit)
• Changes are lost when the VM is powered off

 Snapshot
• Changes written to a redo log; the ESXi host reads the redo log first on reads (performance hit)

Independent Persistent has the best performance but no snapshot capabilities

32
Thick vs Thin (VMDK)

 Thick VMDK:
 Eager-zeroed
 Blocks zeroed out during VMDK creation
 Performance hit during creation but faster later
 Lazy-zeroed
 Space allocated first, blocks zeroed out on first write
 Faster creation, but slower first write

 Thin VMDK:
 Same first-write performance hit as thick lazy-zeroed
 Once fully inflated and zeroed, same as thick eager-zeroed

http://www.vmware.com/pdf/vsp_4_thinprov_perf.pdf

No real performance difference; VAAI will offload zeroing anyway

33
Multipathing policy

 MRU (default for Active-Passive array)


 No option to improve performance

 FIXED policy: (default for Active-Active array)


 Balancing LUN ownership between the two storage processors (SPs) can improve
performance

 Round-Robin policy:
 Utilizes ALL available paths by load-balancing
 Best performance in most cases

ALWAYS USE THE POLICY RECOMMENDED BY THE ARRAY VENDOR to avoid issues (LUN
thrashing) impacting performance

Use Round-Robin for best performance

34
Extents vs no extents?

 Theoretically, using extents can provide performance benefits in a shared


environment

 But, considering the management overhead, VMware Engineering recommends NOT


using extents for VMFS volumes. VMFS-5 provides better management
capabilities by allowing larger LUN sizes, which makes a significant
amount of the storage administration overhead go away.

Use extents only if you have to:


- You are still on VMFS-3 and need a datastore larger than 2 TB.
- You have storage devices which cannot be grown at the back-end, but you need a
datastore larger than 2 TB.

DO NOT use Extents for VMFS volumes, no performance difference

35
Virtual Storage Adapters

 BusLogic Parallel
 LSI Logic Parallel
 LSI Logic SAS
 PVSCSI (Paravirtual)
 Reduces CPU utilization
 Increased throughput
 Not supported as a boot device for most guest operating systems

Use PVSCSI if possible

Make sure VMware Tools for the guest is updated

36
Throttling I/O per VM

• Use shares and limits individually on hosts

Example: VM A with 1500 shares and VM B with 500 shares on the same ESX
server receive 75% and 25% of the device queue depth respectively.

Provide higher share values for I/O-intensive disks

Shares are proportional to the other VMs on the same ESXi host
(when SIOC is disabled)
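
The proportional split works out as a VM’s shares over the total shares on the host. A minimal sketch of that math (the 32-slot queue depth is just the common default used for illustration):

```python
# Minimal sketch: split a device queue depth between VMs on one host according
# to their disk shares (SIOC disabled). 32 is used as an illustrative default.
def queue_split(shares_by_vm, device_queue_depth=32):
    total = sum(shares_by_vm.values())
    return {vm: device_queue_depth * s / total for vm, s in shares_by_vm.items()}

print(queue_split({"VM A": 1500, "VM B": 500}))
# {'VM A': 24.0, 'VM B': 8.0}  -> a 75% / 25% split, as in the example above
```
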
37
SIOC: Storage Contention Solution

 SIOC calculates a datastore-wide normalized latency to identify storage
contention

 SIOC enforces fairness when datastore latency crosses a threshold of 30 ms
(configurable)

 Fairness is enforced by limiting each VM’s access to device queue slots

 With Storage I/O Control, the actual disk resources utilized by each VM are
in the correct ratio even across ESX hosts. Example: VM A with 1500 shares
and VM B and VM C with 500 shares each receive 60% / 20% / 20% of the
array (storage) queue.

Provide higher share values for I/O-intensive disks.

SIOC only kicks in when the normalized latency threshold is exceeded

With SIOC – latency is controlled

38
“Common Containers” – Why?

• Mix disk-intensive with non-disk-intensive virtual machines on a datastore.

• Mix virtual machines with different peak access times.

• But… also ideally think about “SLA”

39
VMDK Workload Consolidation

 Too many sequential threads on a LUN will appear as a random workload to
the storage – negative impact on sequential performance.

 Mixing sequential with random workloads can hurt sequential throughput –
negative impact on sequential performance.

 Group similar workloads together (random with random and sequential with
sequential).

40
Sizing Storage

RAID level comparison:

RAID level | *IOPS | Write MB/s | Read MB/s
RAID 0     | 175   | 44         | 110
RAID 5     | 40    | 31         | 110
RAID 6     | 30    | 30         | 110
RAID 10    | 85    | 39         | 110
* 100% sequential write for 15k disks

Rules of thumb:
• 50 - 150 IOPS / VM
• < 15 ms latencies
• Typical workload: ~8K I/O size, 45% write, 80% random

Drive type   | MB/sec                   | IOPS                        | Latency | Use case
FC 4Gb (15k) | 100                      | 200                         | 5.5 ms  | High perf. transactional
FC 4Gb (10k) | 75                       | 165                         | 6.8 ms  | High perf. transactional
SAS (10k)    | 150                      | 185                         | 12.7 ms | Streaming
SATA (7200)  | 140                      | 38                          | 12.7 ms | Streaming / nearline
SSD          | 230 (read) / 180 (write) | 25000 (read) / 6000 (write) | < 1 ms  | High perf. transactional, tiered storage / cache
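
For rough capacity planning, these rules of thumb can be combined with the classic RAID write-penalty rule. A minimal sketch, with assumed write penalties (RAID 10 = 2, RAID 5 = 4, RAID 6 = 6) and the 10k SAS figure from the table; adjust both for your array:

```python
# Minimal sketch: back-of-the-envelope spindle count for a target workload.
# Write penalties and per-disk IOPS are assumptions, not array-specific data.
WRITE_PENALTY = {"RAID 0": 1, "RAID 10": 2, "RAID 5": 4, "RAID 6": 6}

def disks_needed(frontend_iops, write_ratio, raid_level, iops_per_disk=185):
    penalty = WRITE_PENALTY[raid_level]
    # Back-end IOPS = reads + writes amplified by the RAID write penalty
    backend_iops = (frontend_iops * (1 - write_ratio)
                    + frontend_iops * write_ratio * penalty)
    return int(-(-backend_iops // iops_per_disk))   # ceiling division

# 100 VMs at ~100 IOPS each, 45% write (the "typical workload" above)
target_iops = 100 * 100
for raid in ("RAID 10", "RAID 5", "RAID 6"):
    print(f"{raid}: ~{disks_needed(target_iops, 0.45, raid)} x 10k SAS disks")
```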

41
vSphere 5.x
New Storage Features

42
vSphere 5.x - Storage Performance Features / Studies

 SIOC for NFS: Cluster-wide virtual machine storage I/O prioritization

 SDRS: Intelligent placement and on-going space & load balancing of


Virtual Machines across Datastores in the same Datastore cluster.

 VAAI: vSphere Storage APIs for Array Integration primitives for Thin
Provisioning

 1 Million IOPS: vSphere 5.x can support astonishingly high levels of I/O


operations per second, enough to support today’s largest and most
consolidated cloud environments

 FCoE Performance: New vSphere 5.x ability to utilize built-in software-


based FCoE virtual adapters to connect to your FCoE-based storage
infrastructure

43
vFlash in vSphere 5.5

vFlash Read Cache sits in the storage stack between the virtual SCSI layer
and the VMkernel VMFS / NFS client – in front of the paths, the array
front-end, processor, array cache, back-end and spindles.

44
Common causes of Storage Performance issues

• Under-sized storage arrays/devices unable to provide the needed performance

• Infrastructure issue (Fabric, Network)

• I/O Stack Queue congestion

• ESX Host CPU Saturation

• Incorrectly Tuned Applications

• Guest Level Driver and Queuing Interactions

45
Storage Optimization in VMware (Best practices)

Array side:
• Consult SAN configuration best practice guides
• Ensure disks are correctly distributed
• Ensure the appropriate controller cache is enabled
• Spread I/O requests across available paths
• Appropriate RAID level used
• Array firmware / HBA

ESXi:
• Use the correct multipathing for the array type (Round Robin preferred)
• Change queue depth values only when suggested by the array vendor
• Isolate iSCSI / NFS traffic from management and vMotion traffic; use jumbo
frames if possible
• Utilize VAAI, SIOC, NIOC features
• HBA driver / firmware

46
Troubleshooting Process revisited:

 SCOPE the issue – SAVE TIME


 1 ESXi host affected –> check host logs first
 Multiple ESXi hosts affected –> check array logs first
 Application latency? –> check application tuning best practices

 ESXtop
 DAVG high? -> check SAN IOPS / latency.
-> If SAN IOPS / latency is low, check the fabric / network
 KAVG high? -> check QAVG (queue stats)
 Look out for I/O throttling imposed by SIOC / shares

47
Thank you!
Email: ssiddicky@vmware.com
Presented by,
Sajjad Siddicky, GSS Escalation

48
