

Virtual CPU scheduling techniques for Kernel Based Virtual Machine (KVM)

K. T. Raghavendra
raghavendra.kt@in.ibm.com
Linux Technology Center, IBM India

Abstract—In a multi-CPU Virtual Machine (VM), virtual CPUs (VCPUs) are not guaranteed to be scheduled simultaneously. Operating System (OS) constructs such as busy-wait (e.g., spin locks), written with the assumption of CPUs running concurrently on bare metal, waste a lot of CPU time. The hardware-assisted Pause Loop Exit (PLE) feature detects unnecessary busy-loop constructs in guest VMs and traps to the VCPU scheduler, a.k.a. the PLE handler, to choose the best VCPU candidate to run. The existing approach (before the optimizations described in this paper) does a directed yield¹ to a random VCPU and needs more intelligence.

We also need to carefully consider the over-commit ratio² while designing the VCPU scheduling algorithm. For example, trapping to the PLE handler is pure overhead in under-commit cases. The existing approach lacks over-commit ratio awareness. Hence we need effective scheduling of VCPUs to boost the performance of VMs.

We present three major improvements to the old VCPU scheduling technique, including choosing a better VCPU for directed yield and optimizing for under-commit cases. All these approaches have been accepted into the Linux kernel. These changes bring around 300-400% improvement to I/O intensive cloud VMs (large under-committed guests) and up to 25% improvement to over-committed CPU intensive VMs.

¹ A task giving away its CPU time to another task.
² Ratio of total virtual CPUs to physical CPUs.

I. INTRODUCTION

Adapting traditional Operating Systems (OSs) to virtualization has been a complex process. In virtualization, virtual CPUs (VCPUs) are tasks for the host OS. Busy-wait constructs written with the assumption of CPUs running in parallel lead to performance degradation [1]. The hardware-assisted Pause Loop Exit (PLE) [2] feature detects busy-wait, forces a guest exit, and traps to the PLE handler. The PLE handler of KVM [3] takes the VCPU scheduling decisions. Unfortunately, the PLE feature lacks knowledge of under-commit cases, where each VCPU can be mapped to at least one physical CPU, which obviates the need for any special mechanism to address busy-wait related problems; instead it only adds overhead. Hence we need under-commit aware, intelligent VCPU scheduling techniques that are also smart enough to choose a better VCPU in over-commit cases. In this paper we discuss three algorithmic improvements to the current KVM PLE handler in the Linux kernel [4], viz.,

1) Choosing a better VCPU to do a directed yield to, by recognizing the VCPUs that are busy-looping.
2) Optimizing the VCPU scheduling in the under-commit scenario by detecting potential under-commit cases.
3) Refining the VCPU candidate to yield to by determining potential Lock Holder Preempted (LHP) VCPUs.

II. VIRTUAL CPU (VCPU) SCHEDULING TECHNIQUES

A simplified version of the original PLE handler algorithm is presented in Fig. 1. It iterates over all the VCPUs until it finds a non-running VCPU and does a directed yield to it. The search for the non-running VCPU starts from the last boosted VCPU³. This logic has a high probability of yielding to a spinning VCPU, and yielding to such VCPUs results in severe performance degradation.

³ The VCPU to which a directed yield was done last time.

function PLE_HANDLER(vcpu V)
    kvm = V.kvm
    last_boosted_vcpu = kvm.last_boosted_vcpu
    yielded = 0
    for pass = 0; pass < 2 AND !yielded; pass = pass + 1 do
        for i = 0; i < num_vcpu; i = i + 1 do
            ▷ i iterates over all VCPUs in the VM
            if !pass AND i <= last_boosted_vcpu then
                i = last_boosted_vcpu
                continue
            else if pass AND i > last_boosted_vcpu then
                break
            end if
            cur_vcpu = kvm.vcpu_array[i]
            if cur_vcpu = V then
                continue
            end if
            if cur_vcpu.running then
                continue    ▷ skip VCPUs that are already running
            end if
            if kvm_vcpu_yield_to(cur_vcpu) then
                kvm.last_boosted_vcpu = i
                yielded = 1
                break
            end if
        end for
    end for
end function

Fig. 1. Original PLE handler
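To connect Fig. 1 to the code base, the following C sketch shows roughly how such a handler looked in KVM (kvm_vcpu_on_spin() in virt/kvm/kvm_main.c of the 3.x-era kernels). It is a simplified reconstruction for illustration, not the verbatim upstream function; in particular, the waitqueue check (skipping halted VCPUs) stands in for the "non-running VCPU" test above.

/* Simplified C reconstruction of Fig. 1, modelled on KVM's
 * kvm_vcpu_on_spin(); an illustration, not the exact upstream code. */
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int last_boosted_vcpu = kvm->last_boosted_vcpu;
	int yielded = 0;
	int pass, i;

	/* Two passes: first the VCPUs after the last boosted one,
	 * then wrap around to the ones before it. */
	for (pass = 0; pass < 2 && !yielded; pass++) {
		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (!pass && i <= last_boosted_vcpu) {
				i = last_boosted_vcpu;
				continue;
			} else if (pass && i > last_boosted_vcpu)
				break;
			if (vcpu == me)
				continue;
			/* Skip halted VCPUs waiting for an event. */
			if (waitqueue_active(&vcpu->wq))
				continue;
			if (kvm_vcpu_yield_to(vcpu)) {
				kvm->last_boosted_vcpu = i;
				yielded = 1;
				break;
			}
		}
	}
}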
A. Choosing a better VCPU

We use two main heuristics to refine the VCPU candidates [5]. Firstly, we do not yield to a VCPU that has recently done a PLE. This keeps us from choosing a busy-looping VCPU that could potentially spin further and burn CPU time. But we also need to consider that a spinning VCPU becomes an eligible lock-holder after some time. Hence we have additional logic to choose such a VCPU in the second iteration, after all the preempted lock-holders have been given a chance to release the lock. Fig. 3 presents the algorithm.

Explanation of the algorithm: When a VCPU enters the PLE handler, it sets the in_spin_loop flag. Then, in the VCPU iteration loop, we check for a dynamically eligible VCPU to yield to. Finally, we reset the in_spin_loop and eligibility flags while exiting the PLE handler. The dynamic eligibility checking and eligibility updating logic is presented in Fig. 2. A VCPU is eligible when:

1) the VCPU is not in the spin loop, OR
2) the VCPU is in the spin loop but it was skipped in the last iteration.

We use an eligibility flag (dy_eligible) to toggle the eligibility in each iteration.

function IS_ELIGIBLE_FOR_YIELD(vcpu V)
    eligible = !V.in_spin_loop OR (V.in_spin_loop AND V.dy_eligible)
    ▷ the VCPU is eligible when it is not in a spin loop or it was skipped once already
    if V.in_spin_loop then
        V.dy_eligible = !V.dy_eligible    ▷ toggle the eligibility flag for the next iteration
    end if
    return eligible
end function

Fig. 2. Directed yield eligibility checking function

The PLE handler updated with this optimization is shown in Fig. 3.

function PLE_HANDLER(vcpu V)
    kvm = V.kvm
    last_boosted_vcpu = kvm.last_boosted_vcpu
    yielded = 0
    V.in_spin_loop = true    ▷ mark the current VCPU as being in a spin loop
    for pass = 0; pass < 2 AND !yielded; pass = pass + 1 do
        for i = 0; i < num_vcpu; i = i + 1 do
            ▷ i iterates over all VCPUs in the VM
            if !pass AND i <= last_boosted_vcpu then
                i = last_boosted_vcpu
                continue
            else if pass AND i > last_boosted_vcpu then
                break
            end if
            cur_vcpu = kvm.vcpu_array[i]
            if cur_vcpu = V then
                continue
            end if
            if cur_vcpu.running then
                continue
            end if
            if !is_eligible_for_yield(cur_vcpu) then
                continue    ▷ check whether the target VCPU is eligible for yield
            end if
            if kvm_vcpu_yield_to(cur_vcpu) then
                kvm.last_boosted_vcpu = i
                yielded = 1
                break
            end if
        end for
    end for
    V.in_spin_loop = false    ▷ reset the VCPU's spin loop flag
    V.dy_eligible = false     ▷ ensure the VCPU is not eligible during the next spin loop
end function

Fig. 3. PLE handler with the best-VCPU-choosing optimization
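For reference, here is a C sketch of the eligibility check of Fig. 2 along the lines of what was merged into KVM by [5] (kvm_vcpu_eligible_for_directed_yield() in virt/kvm/kvm_main.c); the exact field layout may differ across kernel versions.

/* Sketch of Fig. 2 as merged into KVM [5]; field names mirror
 * the upstream patch, details may vary by kernel version. */
static bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
{
	bool eligible;

	/* Eligible if not pause-loop exiting itself, or if it is,
	 * but was already skipped once in an earlier look. */
	eligible = !vcpu->spin_loop.in_spin_loop ||
		   (vcpu->spin_loop.in_spin_loop &&
		    vcpu->spin_loop.dy_eligible);

	/* Toggle, so a spinner skipped now becomes eligible next time. */
	if (vcpu->spin_loop.in_spin_loop)
		vcpu->spin_loop.dy_eligible = !vcpu->spin_loop.dy_eligible;

	return eligible;
}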
B. Optimization for under-commit cases
As we already noted, in under-commit scenarios PLE exhibits overhead due to unnecessary iteration to choose a better VCPU. The magnitude of degradation is so high that cloud vendors disable PLE for under-commit deployments. We present a statistical solution to identify the under-commit scenario [6]. In such a case, VCPUs return to the guest and spin instead of wasting time in the PLE handler. This spinning helps to quickly acquire the lock when the lock-holder releases it. To detect potential under-commit cases, the modified directed yield function returns -1 when the CPUs running the source and target VCPU tasks each have a single running task. The modified directed yield logic is shown in Fig. 5.

When the PLE handler fails thrice in succession with a return value of -1 from directed yield, we are in potential under-commit. Refer to Fig. 7 for the algorithm. The rationale for choosing three successive failures, to avoid falsely exiting the PLE handler, is explained below.

Let p be the probability of finding a run queue of length one on a particular CPU. For n tries, the probability of exiting the PLE handler is p^(n+1), because we would have come across one source VCPU and n target VCPUs with run queue length one. The theoretical worst case of false exiting occurs at 1.5x over-commit. The probability for that case is tabulated in Fig. 4.

    number of tries    probability
    1                  1/4
    2                  1/8
    3                  1/16

Fig. 4. Probability of falsely exiting the PLE handler at 1.5x over-commit

Reducing the probability further with more tries results in performance overhead: trying thrice for directed yield means we have already iterated over three eligible VCPUs along with many non-eligible candidates, and in the worst case we iterate over all the VCPUs and performance takes a hit. Thus, by trying thrice, the chance of a false PLE exit is reduced to 1/16 without noticeably affecting performance.
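To make the worst case concrete (an inference from the Fig. 4 values, stated here as an assumption): at 1.5x over-commit, e.g., three VCPU tasks spread over two CPUs, one run queue holds one task and the other holds two, so the chance of finding a run queue of length one is p = 1/2, giving

\[
P(\text{false exit after } n \text{ tries}) = p^{\,n+1} = \left(\frac{1}{2}\right)^{n+1}
\;\Longrightarrow\;
\frac{1}{4}\;(n=1),\quad \frac{1}{8}\;(n=2),\quad \frac{1}{16}\;(n=3).
\]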
function KVM_VCPU_YIELD_TO(vcpu V)
    yielded = 0
    target_task = vcpu_to_task(V)
    current_rq = run_queue_of(current)
    target_rq = run_queue_of(target_task)
    if target_rq.nr_running = 1 AND current_rq.nr_running = 1 then
        yielded = -1    ▷ both CPUs have a single runnable task: potential under-commit
    else if yield_to(current, target_task) then
        yielded = 1
    end if
    return yielded
end function

Fig. 5. Modified directed yield
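On the scheduler side, the -1 in Fig. 5 was realized by teaching the kernel's yield_to() (kernel/sched/core.c) to bail out early [6]. The sketch below keeps only that check; run queue locking, task-state handling, and the actual resched path are elided, and -ESRCH plays the role of the -1 return value.

/* Sketch of the under-commit bail-out added to yield_to() by the
 * work in [6]; heavily simplified, not the exact upstream code. */
int yield_to(struct task_struct *p, bool preempt)
{
	struct rq *rq = this_rq();	/* source CPU's run queue */
	struct rq *p_rq = task_rq(p);	/* target CPU's run queue */

	/* A single runnable task on both run queues means nobody
	 * else is competing for either CPU: a strong hint of
	 * under-commit, so report it instead of yielding. */
	if (rq->nr_running == 1 && p_rq->nr_running == 1)
		return -ESRCH;

	/* ... otherwise do the usual directed yield to task p ... */
	return 1;
}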
C. Yielding to a potential lock-holder VCPU

In section II-A we discussed how to filter out spinning VCPUs; we improve further by choosing a potential preempted lock-holder VCPU [7]. We use preempt notifiers to record the VCPUs that are preempted while in running state.

Preempt notifiers in the Linux kernel provide a way to do additional processing during a context switch, e.g., tracing. When we are about to schedule out or schedule in a VCPU task, we record the preemption information for the VCPU as in Fig. 6. A VCPU is considered preempted if it was in running state when a context switch occurred. Note that spinning VCPUs also fall into that category, but they are already filtered out as described in section II-A. In the PLE handler we have added an extra condition to the VCPU iteration loop so that only preempted VCPUs are considered. Fig. 7 presents the final version accommodating all the improvements.

function KVM_SCHED_IN(preempt_notifier pn)
    vcpu = preempt_notifier_to_vcpu(pn)
    if vcpu.preempted then
        vcpu.preempted = false
    end if
end function

function KVM_SCHED_OUT(preempt_notifier pn)
    vcpu = preempt_notifier_to_vcpu(pn)
    if current.state = TASK_RUNNING then
        vcpu.preempted = true
    end if
end function

Fig. 6. Preempt notifier functions

function PLE_HANDLER(vcpu V)
    kvm = V.kvm
    last_boosted_vcpu = kvm.last_boosted_vcpu
    yielded = 0
    try = 3
    V.in_spin_loop = true
    for pass = 0; pass < 2 AND yielded < 1 AND try; pass = pass + 1 do
        for i = 0; i < num_vcpu; i = i + 1 do
            if !pass AND i <= last_boosted_vcpu then
                i = last_boosted_vcpu
                continue
            else if pass AND i > last_boosted_vcpu then
                break
            end if
            cur_vcpu = kvm.vcpu_array[i]
            if !cur_vcpu.preempted then
                continue    ▷ consider only VCPUs preempted while runnable
            end if
            if cur_vcpu = V then
                continue
            end if
            if cur_vcpu.running then
                continue
            end if
            if !is_eligible_for_yield(cur_vcpu) then
                continue
            end if
            yielded = kvm_vcpu_yield_to(cur_vcpu)
            if yielded > 0 then
                kvm.last_boosted_vcpu = i
                break
            else if yielded < 0 then
                try = try - 1    ▷ -1 signals a potential under-commit situation
                if try = 0 then
                    break
                end if
            end if
        end for
    end for
    V.in_spin_loop = false
    V.dy_eligible = false
end function

Fig. 7. Final PLE handler
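As a concrete illustration of the preempt notifier plumbing behind Fig. 6, the sketch below registers sched_in/sched_out hooks using the kernel's preempt notifier API (linux/preempt.h). The struct vcpu and the registration helper here are simplified stand-ins for KVM's real structures, not the exact upstream code.

#include <linux/preempt.h>
#include <linux/sched.h>
#include <linux/kernel.h>	/* container_of() */

/* Illustrative stand-in; the real code embeds the notifier
 * inside struct kvm_vcpu. */
struct vcpu {
	struct preempt_notifier pn;
	bool preempted;
};

/* sched_in: the VCPU task is getting a CPU again. */
static void vcpu_sched_in(struct preempt_notifier *pn, int cpu)
{
	struct vcpu *v = container_of(pn, struct vcpu, pn);

	v->preempted = false;
}

/* sched_out: the VCPU task is being descheduled; it counts as
 * preempted only if it was still runnable at that moment. */
static void vcpu_sched_out(struct preempt_notifier *pn,
			   struct task_struct *next)
{
	struct vcpu *v = container_of(pn, struct vcpu, pn);

	if (current->state == TASK_RUNNING)
		v->preempted = true;
}

static struct preempt_ops vcpu_preempt_ops = {
	.sched_in  = vcpu_sched_in,
	.sched_out = vcpu_sched_out,
};

/* Called from the VCPU task itself (illustrative helper name). */
static void vcpu_track_preemption(struct vcpu *v)
{
	preempt_notifier_init(&v->pn, &vcpu_preempt_ops);
	preempt_notifier_register(&v->pn);
}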
III. RESULTS

We have used the kernbench [8], ebizzy [9], hackbench [10], sysbench [11], and dbench [12] benchmarks for evaluation. The hypervisor and guest kernels are based on an Enterprise Linux kernel. Fig. 8 describes the test environment.

    Machine:     host with 32 CPU cores and 256 GB RAM, running 32-VCPU guests
    OS:          base: 2.6.32 based enterprise kernel
                 patched: base with the VCPU scheduling improvements
    Scenarios:   1x: benchmark running on a single guest
                 2x: benchmark running on two guests
                 3x: benchmark running on three guests
                 4x: benchmark running on four guests
    Run method:  #threads = 2 * #vcpus

Fig. 8. Test setup

                      base       patched    % improvement
    kernbench 1x      56.6044    43.2922    23.51796
    kernbench 2x     138.2129   133.8037     3.19015
    kernbench 3x     276.4728   262.5978     5.01858
    kernbench 4x     434.7892   417.6703     3.93729
    sysbench 1x       15.6517    13.3413    14.76134
    sysbench 2x       13.2901    13.0005     2.17907
    sysbench 3x       18.6782    18.5560     0.65424
    sysbench 4x       24.6450    24.0154     2.55468
    hackbench 1x      62.3579    50.8957    18.38131
    hackbench 2x     130.5594   113.2193    13.28139
    hackbench 3x     247.1980   193.5613    21.69787
    hackbench 4x     406.3914   312.8074    23.02805

Fig. 9. Kernbench, sysbench and hackbench: execution time in seconds. Lower is better

                  base        patched     % improvement
    ebizzy 1x     1130.4000   5475.0000   384.34183
    ebizzy 2x     1811.6000   2729.0000    50.64032
    ebizzy 3x     1527.4167   2241.9167    46.77833
    ebizzy 4x     1247.7500   1751.8750    40.40272

Fig. 10. Ebizzy: records per second. Higher is better

                  base        patched     % improvement
    dbench 1x      959.0109   3760.4990   292.12265
    dbench 2x     1203.0520   2127.0720    76.80632
    dbench 3x      905.3629   1310.9867    44.80234
    dbench 4x      645.9955    927.2237    43.53408

Fig. 11. Dbench: throughput. Higher is better

The results show that I/O intensive workloads such as dbench have improved by around 300% for under-commit cases, and by around 43 to 76% for over-commit cases. CPU intensive benchmarks such as hackbench have also shown 18 to 23% improvements. These results are quite promising for KVM based cloud systems. We further conclude that the improvements bring us closer to the theoretical expectation for the under-commit scenario, which is achieved by disabling the PLE feature. Analysis with tools such as perf⁴ has shown a significant reduction in the time spent on the double run queue lock, as shown in Fig. 12. This has resulted in the expected performance improvement. Improved performance despite more time spent in the PLE handler in over-commit cases implies efficient VCPU scheduling.

                                   % time base   % time patched
    1x   double runqueue lock      60.13          8.34
         ple handler                6.61          0.84
    2x   double runqueue lock      53.21         22.78
         ple handler                5.29          5.69

Fig. 12. Dbench: performance analysis

⁴ The perf tool is part of the Linux kernel.

Fig. 13. Kernbench result: execution time in seconds. Lower is better
Fig. 14. Sysbench: execution time in seconds. Lower is better
Fig. 15. Hackbench: execution time in seconds. Lower is better
Fig. 16. Ebizzy: records per second. Higher is better
Fig. 17. Dbench: throughput. Higher is better
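For reference, the % improvement columns in Figs. 9-11 follow the usual conventions for a lower-is-better time metric and a higher-is-better throughput metric:

\[
\%\,\text{improvement}_{\text{time}} = \frac{t_{\text{base}} - t_{\text{patched}}}{t_{\text{base}}} \times 100,
\qquad
\%\,\text{improvement}_{\text{throughput}} = \frac{r_{\text{patched}} - r_{\text{base}}}{r_{\text{base}}} \times 100.
\]

For example, kernbench 1x: (56.6044 - 43.2922) / 56.6044 x 100 = 23.52%, matching Fig. 9.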
IV. RELATED WORKS

There are many proposed improvements to the PLE handler, or related to its overheads, in KVM. Several ideas on choosing a better VCPU have been discussed in [13], though they lack commit awareness.

Andrew Theurer proposed an idea for optimizing away the need for the double run queue lock [14]. It is expected to reduce the double run queue lock overhead during directed yield.

Andrew Theurer also proposed directed yield throttling [15], so that we iterate over VCPUs to choose a better candidate only periodically, instead of doing it on every exit, to reduce the overhead.

Paravirtualizing the busy-wait constructs for KVM [16] shows a huge benefit in over-committed scenarios in a non-PLE environment. On PLE-assisted hardware it brings moderate improvements, especially in over-commit scenarios. For example, paravirtual spinlocks help in identifying the eligible lock-waiter, and paravirtual TLB flush helps to reduce unnecessary waiting on preempted VCPUs doing a TLB flush.

Gang scheduling [17] tries to schedule all the VCPUs of the same VM simultaneously. This helps in avoiding virtualization related issues like lock-holder preemption and lock-waiter preemption. But it needs complex changes to the existing scheduler, and gang scheduling suffers from a scalability problem since it involves communication between CPUs to achieve scheduling synchronization.

Several simultaneous directed yields are possible when many VCPUs spin, and having all the pause-loop-exited VCPUs start from the last boosted VCPU to choose a better candidate is not a good idea. Approaches like choosing a truly random VCPU [18] as the starting point to iterate over VCPUs, or starting with the next VCPU [19], have been tried to solve that. Neither approach proved significantly useful, except for obviating the need for the last boosted VCPU.

For future work, accurately detecting under-commit, instead of making a statistical guess, by noting the actual number of preempted VCPUs with the help of preempt notifiers, and doing a conditional reschedule or yield when directed yield attempts fail in an over-committed scenario, can potentially help us improve the performance further.

V. ACKNOWLEDGEMENT

I thank Srivatsa Vaddagiri, Srikar D, Avi Kivity, Marcelo Tosatti, Peter Zijlstra, Gleb Natapov, Jiannan, Rik van Riel and many others who suggested improvements and ideas and also contributed as reviewers of the source code on the kernel mailing list.

LEGAL STATEMENT

This work represents the views of the authors and does not necessarily represent the view of IBM. Linux is a registered trademark of Linus Torvalds. Other company, product, logos and service names may be trademarks or service marks of others.

This document is provided "AS-IS", with no express or implied warranties. Use the information in this document at your own risk. Results mentioned in the paper are for reference purposes only, and are not to be relied on in any manner.
REFERENCES

[1] K. T. Raghavendra, S. Vaddagiri, N. Dadhania, and J. Fitzhardinge, "Paravirtualization for scalable kernel-based virtual machine (KVM)," in Cloud Computing in Emerging Markets (CCEM), 2012 IEEE International Conference on, 2012, pp. 1-5.
[2] "Intel virtualization technology specification for the IA-32 Intel architecture," April 2005.
[3] "Kernel-based virtual machine (KVM)," http://www.linux-kvm.org.
[4] "The Linux kernel archives," http://www.kernel.org/pub/linux/kernel/v3.x/.
[5] K. T. Raghavendra, "Improving directed yield in PLE handler," July 2012, https://lkml.org/lkml/2012/7/18/247.
[6] P. Zijlstra and K. T. Raghavendra, "Improving undercommit scenarios," January 2013, https://lkml.org/lkml/2013/1/22/104.
[7] K. T. Raghavendra, "Better yield to candidate using preemption notifiers," March 2013, https://lkml.org/lkml/2013/3/4/332.
[8] C. Kolivas, "Kernbench," December 2009, http://mirror.sit.wisc.edu/pub/linux/kernel/people/ck/apps/kernbench/.
[9] V. Henson, "Ebizzy," January 2008, http://sourceforge.net/projects/ebizzy/files/ebizzy/0.3/.
[10] "Hackbench," 2008, https://build.opensuse.org/package/files?package=hackbench&project=benchmark.
[11] A. Kopytov, "System performance benchmark," March 2009, http://sourceforge.net/projects/sysbench/files/sysbench/0.4.12/.
[12] A. Tridgell, "dbench," February 2008, ftp://samba.org/pub/tridge/dbench.
[13] T. Friebel, "Preventing guests from spinning around," Xen Summit, June 2008, http://www.xen.org/files/xensummitboston08/LHP.pdf.
[14] A. Theurer, "Reducing double runqueue lock overhead," September 2012, https://lkml.org/lkml/2012/9/7/305.
[15] A. Theurer, "Throttled yield," September 2012, https://lkml.org/lkml/2012/11/28/718.
[16] S. Vaddagiri, K. T. Raghavendra, and J. Fitzhardinge, "Paravirtualized ticket spinlocks," June 2013, https://lkml.org/lkml/2013/6/1/168.
[17] N. Dadhania, "Gang scheduling in Linux kernel scheduler," January 2012, http://lanyrd.com/2012/linuxconfau/spdzq/.
[18] K. T. Raghavendra, "Use random vcpu," June 2012, https://lkml.org/lkml/2012/6/21/166.
[19] K. T. Raghavendra, "Use vcpu id as pivot," August 2012, https://lkml.org/lkml/2012/8/29/232.

