

Locality-Aware Scheduling for Containers in Cloud Computing

Dongfang Zhao, Mohamed Mohamed, and Heiko Ludwig

D. Zhao is with the Department of Computer Science & Engineering, University of Nevada, Reno, NV 89557. E-mail: dzhao@unr.edu.
M. Mohamed and H. Ludwig are with the Ubiquitous Platforms Group, IBM Almaden Research Center, San Jose, CA 95120. E-mail: {mmohamed, hludwig}@us.ibm.com.

Manuscript received 23 Aug. 2017; revised 4 Dec. 2017; accepted 12 Jan. 2018. Date of publication 16 Jan. 2018; date of current version 9 June 2020. (Corresponding author: Dongfang Zhao.) Recommended for acceptance by C. Krintz. Digital Object Identifier no. 10.1109/TCC.2018.2794344.

Abstract—The state-of-the-art scheduler of containerized cloud services considers load balance as its only criterion; many other important properties, including application performance, are overlooked. In the era of Big Data, however, applications are increasingly data-intensive and therefore perform poorly when deployed on such containerized cloud services. To that end, this paper aims to improve today's cloud services by taking application performance into account in the next-generation container schedulers. More specifically, we build and analyze a new model that respects both load balance and application performance. Unlike prior studies, our model abstracts the tension between load balance and application performance into a unified optimization problem and then employs a statistical method to solve it efficiently. The most challenging part is that some sub-problems are extremely complex (for example, NP-hard), so heuristic algorithms have to be devised. Last but not least, we implement a system prototype of the proposed scheduling strategy for containerized cloud services. Experimental results show that our system can significantly boost application performance while preserving high load balance.

Index Terms—Cloud computing, service computing, containers, data management, high-performance computing

1 INTRODUCTION

One core ingredient of cloud computing is virtualization (for example, Xen [30], Red Hat KVM [22], Microsoft Hyper-V [18], EMC/VMware vSphere [9], IBM z/VM [12]), which enables high resource utilization as well as the isolation of applications. Conventionally, virtualization is achieved through a hypervisor that manages multiple virtual machines (VMs) on the same physical machine. Nevertheless, VMs require a complete stack of software from the operating system to application libraries, which takes a significant amount of time and resources before the application is ready to execute. Recently, lightweight isolation (for example, the Docker container [8]) has emerged and been widely embraced because it achieves virtualization without booting a full VM; thanks to the user-level engine, a container usually takes only a couple of seconds to start in its own logical space.

Lightweight isolation at this fine granularity (i.e., containers), however, introduces new challenges for scheduling jobs on containers. The conventional VM-level scheduler considers VMs as standardized and independent units that are maintained by the lower-layer hypervisor. Containers, in contrast, are more proximate to each other in the sense that the controller runs at the same level (all in user space), which means the isolation between containers is not as strong as between VMs. Moreover, the state-of-the-art scheduler strives to preserve load balance and is completely agnostic about the physical layout of the containers, which can result in intense resource contention within and between containers (for example, local disk I/O bandwidth, network bandwidth). All of these combined urge us to answer the following research question: how can we improve the scheduler of containerized cloud services and make it more aware of its applications' data locality?

In this paper, we make the following contributions.

- We identify a critical performance deficiency in the state-of-the-art resource scheduler in cloud computing. More specifically, we pinpoint that the round-robin policy of containerized cloud services incurs significant performance overhead from both local disk I/O and network traffic.
- We abstract the minimization of this performance deficiency into optimization problems and analyze their complexities. In particular, we prove that some sub-problems are NP-hard.
- We devise heuristic algorithms for the scheduling problems. Specifically, we tackle the problem in a bottom-up fashion: we first address it from two exclusive perspectives, local disk and network traffic, respectively; then we incorporate both aspects and develop a dynamic solution for general scenarios.
- We implement the proposed approach by extending Diego—the scheduler of the open-source project CloudFoundry [5] and IBM's Bluemix [11].
- We experimentally verify the correctness of the system prototype at small scales and evaluate its effectiveness at large scale. Small-scale experiments show that our system can save up to 60 percent of network traffic with no effect on load balance. At large scales (i.e., 2,400 jobs), our system could save up to 89 percent of network traffic with 50 percent skewness of load balance.


The remainder of this paper is organized as follows. Section 2 gives a brief background of containerized services in cloud computing and articulates the unique challenges. We formulate the problem under different scenarios, show that some of them are NP-hard, and devise and analyze heuristic algorithms in Section 3. To justify the effectiveness of the proposed algorithms and optimizations, we implement the models and algorithms in Diego (Section 4) and report experimental results (Section 5). Section 6 reviews related work on job scheduling, and Section 7 concludes this paper.

2 BACKGROUND AND MOTIVATION

Fig. 1. Application-service affinity placement.

As a motivating example, Fig. 1 shows a typical scenario where several deployed applications depend on certain services. In the state-of-the-art scheduler of cloud computing (for example, Diego [7]), these dependencies are not considered in the decision-making process; the default strategy is to deploy applications in a straightforward round-robin fashion to achieve good load balance. As a result, the network traffic incurred by the placement decision could be so large that it jeopardizes the application's performance.

In addition to the overhead from network traffic, another major source of performance degradation stems from intra-node resource contention. More specifically, physical resources shared between multiple containers on the same node (either a physical node or a cloud instance) could become a performance bottleneck. Note that "shared resources" are not necessarily circumvented by parallelization or partitioning. For instance, a conventional hard-disk drive (HDD) could serve as multiple virtual block devices for cells or Docker containers. The partitions of the HDD meet the requirement of disjoint spaces for different applications, yet fail to isolate the resources from a performance point of view: because only a single head exists for disk seek and data access, concurrent accesses to the same HDD cause seriously degraded I/O throughput and latency.

As a concrete example illustrating resource contention in containerized cloud services, we show three scenarios of different types of workloads deployed on the same node (or, cell): compute-intensive workloads only, I/O-intensive workloads only, and a mix of both compute- and I/O-intensive workloads. In this oversimplified example, we allocate four containers in a VM. The compute-intensive application is a web application written in Ruby that recursively calculates the Fibonacci series to number 40. The data-intensive application is a simple web portal that triggers a write of 2 GB of data to the disk.

Fig. 2. Multiple compute-intensive applications on the same cell have relatively low contention.

Fig. 2 shows that a full scale (i.e., 4 applications) of compute-intensive applications adds about 9 percent overhead compared with the baseline. This experiment makes a strong case that the isolation of CPU resources is excellent in the context of containerized applications. After all, multiple cores within the same chip are relatively independent of each other, which results in small interaction overhead between peers.

Fig. 3. Multiple data-intensive applications on the same cell cause serious resource contention and degenerated performance.

Fig. 3 shows that two data-intensive applications compete for the I/O bandwidth of the underlying disk and finish in about twice as long a time as the baseline. The root cause of this phenomenon is that the disk I/O is not truly parallelized, although the space is exclusively partitioned among multiple containers. That is, the single head of the underlying disk is busy with the concurrent I/O requests from multiple containers; the disk becomes the performance bottleneck and a single point of failure.

Fig. 4. Compute- and data-intensive applications deployed on the same cell do not cause noticeable contention.

Lastly, Fig. 4 shows that the data-intensive application and the compute-intensive application barely interfere with each other when both are deployed on the same cell. In other words, the execution times of both applications are almost the same as their respective baselines.


Those results confirm our conjecture: the contention between distinct types of resources is light inside the same cell.

3 REDUCING RESOURCE CONTENTION WITH LOCALITY-AWARE JOB SCHEDULING

3.1 Assumption

We assume that the existing mapping between services and containers does not change. That is, at least one qualified cell exists for the request. In practice, when this assumption does not hold, multiple alternatives can be applied in order to restore it; for instance, the provider might return an error message, block the request until a qualified cell becomes available, or reselect a cell that causes no contention.

A cell is the smallest isolated environment where an application can run independently. In practice, a cell can be implemented with a Docker instance. A zone is usually a data center located at a single geographic location, within which the network speed is orders of magnitude higher than that between zones.

The list of available cells is denoted by C, where C_j represents the jth cell. The indexes of the applications running on a particular cell are represented by C_j.apps. Similarly, Z_j.apps indicates the list of applications running on the jth zone, Z_j. Available resources of C_j can be retrieved by C_j.cpu, C_j.ram, and C_j.disk. The total I/O bandwidth of C_j is denoted by C_j.io.

The list of applications is denoted by A, where A_i represents the ith application. The index of the cell where an application has been deployed is represented by cell(A_i). The requested resources (for example, CPUs, memory, disk space) of A_i are represented by A_i.cpu, A_i.ram, and A_i.disk. We denote the required disk throughput of A_i by A_i.io. If A_i.io is not explicitly specified, we assume A_i will aggressively take all the disk bandwidth of C_j.

The definitions for cells can be generalized to other entities as well. For example, if we are interested in the metrics at the zone level, then the above notations are applicable by replacing C with Z. Similarly, the zone where an application A_i resides is denoted by zone(A_i), and the available resources of Z_j are denoted by Z_j.cpu, Z_j.ram, and Z_j.disk.

The dependency matrix D specifies the interaction between applications. For example, (D_{i,j} == 1) means application A_i has non-trivial I/O to application A_j. Another matrix T stores the pair-wise traffic of data movement among all location units (for example, cells, physical nodes, zones). The numerical value in T_{i,j} indicates the closeness between C_i and C_j. For instance, the cost of transferring data between two cells on the same physical machine is obviously lower than between cells on two different racks, which in turn is lower than between cells in two different locations (i.e., zones). In practice, the closeness could be network latency if users issue many small I/Os, or network bandwidth if a large volume of data is expected.

One important assumption of some of the following discussions is that containerized applications and services can be migrated to arbitrary locations. This is a required property for the global optimization where deployed applications need to be adjusted. However, how to migrate and how much overhead is involved with the migration is beyond the scope of this paper. In practice, both offline and online (live) migrations are available for VMs and containers (for example, Jelastic [14]).
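To make the notation above concrete, the following Go sketch models cells, applications, and the dependency and traffic matrices. The type and field names are our own illustrative assumptions; they are not taken from the Diego code base.

```go
// A minimal sketch of the scheduling model described in Section 3.1.
// All names and types are illustrative assumptions, not the actual
// Diego/CloudFoundry data structures.
package model

// Cell is the smallest isolated environment (e.g., a Docker instance).
type Cell struct {
	CPU, RAM, Disk float64 // available resources
	IO             float64 // total local disk bandwidth, C_j.io
	Apps           []int   // indexes of applications deployed on this cell
}

// App is a containerized application (or service) to be scheduled.
type App struct {
	CPU, RAM, Disk float64 // requested resources
	IO             float64 // required disk throughput; 0 means "take all"
	Cell           int     // index of the hosting cell, -1 if not placed yet
}

// Model gathers the inputs of the optimization problems in Section 3.
type Model struct {
	Cells []Cell
	Apps  []App
	Dep   [][]int     // D: Dep[i][k] == 1 iff A_i has non-trivial I/O to A_k
	Traf  [][]float64 // T: Traf[m][n] is the traffic cost between units m and n
}
```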


3.2 Local Contention

The following optimization minimizes the number of cells that suffer I/O contention. That is, the goal is to allocate [A_1, ..., A_i] to C (i.e., a mapping array M[i] = j, 1 ≤ j ≤ |C|) such that

$$\arg\min_{M} \sum_{j=1}^{|C|} \Big[\, \sum \{ A_k.\mathrm{io} \mid \mathrm{cell}(A_k) = C_j \} \;\geq\; C_j.\mathrm{io} \,\Big], \qquad (1)$$

subject to

C_j.cpu > A_i.cpu
C_j.ram > A_i.ram
C_j.disk > A_i.disk.

The above equation aims to minimize the number of cells where I/O contention occurs. If the applications are not I/O-intensive, there is a high chance that no cell will experience I/O contention, in which case applications can easily fit into random cells as long as other metrics are met (e.g., CPU usage, load balance). However, the problem becomes extremely challenging when the optimal placement is not obvious: when many I/O-intensive applications are deployed to a limited number of nodes, we know there will be some nodes that are I/O-throttled; the question then becomes how to minimize such bottlenecks, or, equivalently, minimize the number of I/O-contention nodes.

Obviously, there exists an exponential solution to this problem, which is not feasible in a real-world scheduler; more efficient and practical algorithms are needed. As a matter of fact, cloud providers might not want to deploy any data-intensive application at all if no fit is available, because that would cause side effects to existing applications in addition to slowing down the newly deployed one. A good strategy is to defer the application until either (1) any of the deployed jobs finishes, or (2) more resources (for example, new VMs, additional disks) are allocated. Then the problem becomes the following: how to allocate as many applications as possible without exhausting any cell's I/O bandwidth

$$\arg\max_{M} \sum_{k=1}^{i} \Big[\, \sum \{ A_h.\mathrm{io} \mid \mathrm{cell}(A_h) = C_j \;\wedge\; \mathrm{cell}(A_k) = C_j \} \;\leq\; C_j.\mathrm{io} \,\Big], \qquad (2)$$

subject to

C_j.cpu > A_i.cpu
C_j.ram > A_i.ram
C_j.disk > A_i.disk.

In contrast to Equation (1), which minimizes over the cells, here we try to maximize over the total number i of applications. The application A_h in the first line of Equation (2) denotes that the cell C_j has already been occupied by an I/O-intensive application. Thus, we want to ensure that the subsequent application A_k in the second line would not saturate the total I/O bandwidth of the cell.

Equation (2) is a variation of the classical 0-1 Multiple Knapsack Problem (MKP) with two additional conditions:

1) all items (i.e., applications) have the same profit (we assume all applications have the same priority), and
2) three additional constraints (CPU, RAM, and disk space) on each item.

The MKP optimization problem itself is NP-hard, and Jansen et al. [13] showed that relaxing the profits (i.e., condition 1) does not make the problem simpler. Therefore, we propose a pseudo-heuristic algorithm with dynamic programming to meet condition 2 in addition to the optimization problem.

Algorithm 1, in essence, is a greedy algorithm sharing the same spirit as two classical algorithms: (1) Kruskal's algorithm [15] for finding the minimal spanning tree in a graph, and (2) the best-fit algorithm in memory page allocation. Our algorithm first sorts both applications and cells according to their I/O status (Lines 1–2). Then, from lowest to highest, each application is assigned to the cell whose free I/O resource is larger than and closest to the application's need and that satisfies the three conditions on CPU, RAM, and disk space (Lines 4–14). Finally, if a qualified cell is found for all the requirements, its I/O status is updated and its position is moved to the appropriate place in the sorted queue (Lines 15–18).

Algorithm 1. Heuristically Deploying I/O-Intensive Applications to Qualified Containers to Reduce Local I/O Contention
Input: Applications to be scheduled: A_1, ..., A_i; available cells C_j where 1 ≤ j ≤ |C|
Output: The mapping between A and C: M[k] := j
1: Sort A_k (1 ≤ k ≤ i) in increasing order of A_k.io
2: Sort C in increasing order of: C_j.io_avail := C_j.io − Σ{A_m.io | cell(A_m) == C_j}
3: for k := 1..i do
4:   Binary search for C_j such that: C_j.io_avail ≥ A_k.io && C_{j−1}.io_avail < A_k.io if j > 1
5:   if C_j does not exist then
6:     return M
7:   else
8:     for (t := j; t <= |C|; t := t + 1) do
9:       if (C_t.cpu ≥ A_k.cpu && C_t.ram ≥ A_k.ram && C_t.disk ≥ A_k.disk) then
10:        M[k] := t
11:        break
12:      end if
13:    end for
14:  end if
15:  if (t ≤ |C|) then
16:    C_t.io_avail := C_t.io_avail − A_k.io
17:    Adjust the position of C_t in C (binary search)
18:  end if
19: end for

Because the progress of the optimization is cached through memoization in Line 10, the time complexity of Algorithm 1 is O(n²) in total, as analyzed in the following (we use n instead of i by convention to indicate the size of the input). Because presumably i > |C| (otherwise the optimization problem has trivial solutions), we also use O(n) to indicate the worst time when iterating over C. Both Line 1 and Line 2 take O(n log n). Line 4 takes O(log n) and Lines 8–13 take O(n). Line 16 takes O(1) and Line 17 takes another O(log n). Therefore, the most costly part is the nested loop of Lines 3 and 8, which gives us O(n²).
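For readers who prefer code to pseudocode, the following Go sketch mirrors the spirit of Algorithm 1 on top of the Model type sketched in Section 3.1. It is an illustrative re-implementation, not the Diego code; for brevity it uses a linear best-fit scan where the pseudocode keeps a sorted queue and binary-searches it.

```go
// A sketch of Algorithm 1: greedily place I/O-intensive applications onto
// the qualified cell whose remaining I/O bandwidth is the smallest that fits.
package model

import "sort"

// PlaceByIO returns a mapping from application index to cell index, or -1
// for applications that could not be placed.
func (m *Model) PlaceByIO() []int {
	// Remaining I/O bandwidth per cell (Line 2 of Algorithm 1).
	avail := make([]float64, len(m.Cells))
	for j, c := range m.Cells {
		avail[j] = c.IO
		for _, a := range c.Apps {
			avail[j] -= m.Apps[a].IO
		}
	}

	// Schedule applications in increasing order of required I/O (Line 1).
	order := make([]int, len(m.Apps))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(a, b int) bool {
		return m.Apps[order[a]].IO < m.Apps[order[b]].IO
	})

	mapping := make([]int, len(m.Apps))
	for i := range mapping {
		mapping[i] = -1
	}

	for _, k := range order {
		app := m.Apps[k]
		best := -1
		for j, c := range m.Cells {
			// Best fit: the smallest remaining I/O that still satisfies the
			// I/O, CPU, RAM, and disk constraints (Lines 4-14).
			if avail[j] >= app.IO && c.CPU > app.CPU && c.RAM > app.RAM && c.Disk > app.Disk {
				if best == -1 || avail[j] < avail[best] {
					best = j
				}
			}
		}
		if best == -1 {
			return mapping // no qualified cell exists; stop (Lines 5-6)
		}
		mapping[k] = best
		avail[best] -= app.IO // update the cell's I/O status (Line 16)
	}
	return mapping
}
```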


3.3 Network Contention

There are various granularities of network contention, such as cell-to-cell, machine-to-machine, and zone-to-zone. To keep the following analysis clear, we analyze zone-to-zone contention as the representative scenario. The idea and validity, however, hold for other scenarios as well (for example, cell-to-cell, machine-to-machine). We do not differentiate between an application and a service in order to simplify the following analysis; after all, both are just jobs from the system's point of view. Therefore, the dependency matrix D introduced in Section 3.1 is also applicable here, in a context where (D_{i,k} == 1) means that application A_i is dependent on service/application A_k. The problem of achieving the minimal network traffic for an application then becomes the following. Given A_i, we find Z_j (1 ≤ j ≤ |Z|) such that

$$\arg\min_{j} \sum_{k} D_{i,k} \cdot T_{j,\,\mathrm{zone}(A_k)}, \qquad (3)$$

subject to

Z_j.cpu > A_i.cpu
Z_j.ram > A_i.ram
Z_j.disk > A_i.disk.

It should be clear that Equation (3) only optimizes the network traffic of a single new application. This problem obviously has a polynomial solution. We can iterate over all the available zones (O(|Z|)), and in each iteration we calculate the aggregate cost between A_i and its dependent services (O(|A|)). The minimal cost is also updated during each iteration. Therefore, the overall cost is O(|Z| · |A|).

The above discussion considers only network traffic; load balance is out of the question. The following discussion focuses on how to tune both factors—load balance and network traffic—in a coordinated manner. That is, in the following we will have one more constraint on load balance when optimizing network traffic. In the simplest scenario, the coefficient of variance (CV) between the loads on different zones should be under a user-defined threshold (UDT). Formally,

$$\arg\min_{j} \sum_{k} D_{i,k} \cdot T_{j,\,\mathrm{zone}(A_k)}, \qquad (4)$$

subject to

CV(Z) ≤ UDT
Z_j.cpu > A_i.cpu
Z_j.ram > A_i.ram
Z_j.disk > A_i.disk.

Equation (4) does not add much complexity to the original criterion of Equation (3). In essence, in each iteration when looping over |Z| we need to recalculate the CV and drop the unqualified candidates. Thus the process takes O(|Z| × (|Z| + |A|)) in total. However, an obvious limitation exists in Equation (4): what if no qualified zone is available under the required CV?

We propose to normalize both the load-balance and network-traffic criteria and apply a weight factor to each. The normalized network traffic is measured by the ratio of the application's actual traffic to the aggregate bandwidth. Ideally, if all applications are executed on local nodes only, the normalized network traffic is simply zero. Formally, the normalized network traffic (tr) for A_i on Z_j (i.e., a mapping matrix M where M_{i,j} indicates A_i is deployed on Z_j) is

$$tr(M_{i,j}) = \frac{\sum_{k=1}^{i-1} D_{i,k} \cdot T_{j,\,\mathrm{zone}(A_k)}}{\sum_{m=1}^{|Z|} \sum_{n=1}^{|Z|} T_{m,n}}. \qquad (5)$$

Obviously, we have

$$tr(M_{i,j}) \in [0, 1], \quad \forall i, j,$$

which can be observed from the fact that the total throughput of the applications cannot exceed the overall bandwidth.

The normalized load balance is defined as the adjusted coefficient of variance of the overall deployed applications on each zone. That is,

$$cv(M_{i,j}) = \frac{\sqrt{\big(1 + |Z_j.\mathrm{apps}| - \bar{Z}\big)^2 + \sum_{k \neq j} \big(|Z_k.\mathrm{apps}| - \bar{Z}\big)^2}}{\sqrt{|Z|} \cdot \bar{Z}}, \qquad (6)$$

where

$$\bar{Z} = \frac{1 + \sum_{j=1}^{|Z|} |Z_j.\mathrm{apps}|}{|Z|}. \qquad (7)$$

It should be noted that in Equation (6) we only consider the number of applications deployed to a zone as its workload. That is, we do not explicitly account for resources (e.g., CPU usage, memory footprint, disk space) in our model. The reason is that the proposed model assumes all applications and zones are homogeneous for the sake of simplicity; as a result, the number of applications implicitly reflects the resources being used in a specific zone. Indeed, a more realistic model would have parameterized the sizes of distinct availability zones as well as the resources taken by each application, but doing so would break the brevity of our analysis. To address that, we can assume that applications taking heterogeneous resources can be normalized into virtual applications, each of which is considered identical in terms of resources. This is analogous to vCPUs [3] used in Amazon Web Services (AWS): although instance types range widely from tiny single-core machines to high-end server-class Intel Xeon nodes, they are still comparable in terms of the number of virtual cores.

We are now ready to define the objective function with both factors. Let α and β denote the weights for the normalized traffic tr() and coefficient of variance cv(), respectively. Our goal is then to find Z_j for A_i (i.e., M_{i,j}) such that

$$\arg\min_{M} F(M), \qquad (8)$$

where the scoring function is

$$F(M) = \frac{\alpha \cdot tr(M) + \beta \cdot cv(M)}{\alpha + \beta},$$

subject to

Z_j.cpu > A_i.cpu
Z_j.ram > A_i.ram
Z_j.disk > A_i.disk.

In Equation (8), we use the weighted arithmetic mean (AM) as the objective function. We did not choose other forms because the weights in the AM strike a good balance between sensitivity and complexity. We want the objective function to have medium sensitivity: if the model does not promptly respond to an updated weight, then tuning the parameters will incur significant overhead; if the model is oversensitive, the optimal parameter can easily be missed. The cost of calculating the objective function should also be minimized, as otherwise the overall cost may be significantly affected. This is why we did not introduce more complex objective functions, such as the geometric mean, harmonic mean, quadratic mean, and so on.

In this section, without loss of generality, we assume that traffic and load balance contribute equally to the objective function. In practice, this can easily be changed according to specific requirements. For example, α can be amplified by 10 times if traffic control is far more important than load balance. We will experimentally show the impact of distinct α's and β's in Section 5.

The complexity analysis of Equation (8) is as follows. Equation (7) can be efficiently updated in constant time assuming zones do not change. If we maintain the difference square for each zone in memory, Equation (6) can also be updated in constant time; otherwise, it takes O(|Z|) to update Equation (6). In Equation (5), every dependency is checked, taking O(|A|). Therefore, the overall time is O(|Z| × (|Z| + |A|)). That is, although taking both network traffic and load balance into consideration looks more complicated than the simple threshold-check approach (i.e., Equation (4)), the former essentially takes time of the same order of magnitude as the latter.
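To illustrate how Equations (5) through (8) could drive the per-application decision, the following Go sketch scores every candidate zone and picks the one with the lowest weighted score. The ZoneModel type and its fields are our own assumptions for the sketch, not names from the Diego implementation; the resource constraints of Equation (8) are elided for brevity.

```go
// Score a candidate zone j for application i following Equations (5)-(8):
// tr is the normalized dependency traffic, cv is the adjusted coefficient of
// variance of per-zone application counts after the tentative placement.
package model

import "math"

type ZoneModel struct {
	Traf    [][]float64 // T: inter-zone traffic cost
	AppZone []int       // zone index of every already-placed application
	AppNum  []int       // current number of applications in each zone
	Dep     [][]int     // D: dependency matrix
}

func (z *ZoneModel) score(i, j int, alpha, beta float64) float64 {
	// Equation (5): traffic of A_i's dependencies, normalized by the total of T.
	var traffic, total float64
	for k, zk := range z.AppZone {
		if z.Dep[i][k] == 1 {
			traffic += z.Traf[j][zk]
		}
	}
	for _, row := range z.Traf {
		for _, t := range row {
			total += t
		}
	}
	tr := traffic / total

	// Equations (6)-(7): CV of per-zone counts with A_i tentatively on Z_j.
	n := float64(len(z.AppNum))
	sum := 1.0
	for _, c := range z.AppNum {
		sum += float64(c)
	}
	mean := sum / n
	var sq float64
	for jj, c := range z.AppNum {
		x := float64(c)
		if jj == j {
			x++ // the tentative placement
		}
		sq += (x - mean) * (x - mean)
	}
	cv := math.Sqrt(sq) / (math.Sqrt(n) * mean)

	// Equation (8): weighted arithmetic mean of the two normalized terms.
	return (alpha*tr + beta*cv) / (alpha + beta)
}

// BestZone returns the zone with the minimal score; in a full implementation
// the CPU/RAM/disk constraints would filter the candidates first.
func (z *ZoneModel) BestZone(i int, alpha, beta float64) int {
	best, bestScore := -1, math.Inf(1)
	for j := range z.AppNum {
		if s := z.score(i, j, alpha, beta); s < bestScore {
			best, bestScore = j, s
		}
	}
	return best
}
```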


The problem becomes significantly harder (even without considering load balance) with a subtle change: when we consider scheduling a batch of applications, or all the applications globally. In essence, instead of a linear problem over the single dimension |Z|, we are now facing a combinatorial problem of placing |A| applications on |Z| zones such that

$$\arg\min_{M} \sum_{i} \sum_{k} D_{i,k} \cdot T_{j,\,\mathrm{zone}(A_k)}, \qquad (9)$$

subject to

Z_j.cpu > A_i.cpu
Z_j.ram > A_i.ram
Z_j.disk > A_i.disk.

A brute-force solution would obviously take exponential time, O(|Z|^{|A|}). We will prove that the problem of Equation (9) is NP-hard by reducing the Traveling Salesman Problem (TSP) to it. The zone matrix can be represented as a graph whose vertexes represent the zones and whose edges represent the network traffic between two zones, which is a setup similar to TSP. Instead of TSP, where we try to find the shortest path covering all vertexes, we are now trying to find the smallest summation of edges arbitrarily determined by the dependency matrix. If we specify a dependency matrix where each application only depends on its two neighbor applications and enforce that one zone holds only one application, then our global-network-minimization problem contains TSP as a special case.

This paper is focused on the optimization problem with regard to a single application. Therefore, the algorithm, analysis, and evaluation all concentrate on how to minimize the contention when a new application is deployed. We leave the optimization over all applications as an open question to the community as well as one of our future works.

3.4 Generalized Model

The two types of contention we have discussed so far—local (Section 3.2) and network (Section 3.3)—are disconnected. A strategy from the entire system's point of view is the focus of this section. As a case in point, when a new application is to be deployed, should it be scheduled to avoid local or network contention? More specifically, although we should try to assign an application to the same cell or zone as its dependent service, is it wise to do so if both the application and its dependent service incur significant disk I/O? The previous sections partially answer this question from two exclusive aspects; the following discusses how to coordinate both in a systematic manner.

The idea is to still keep the network traffic as low as possible, with the additional condition that no local disk bandwidth is depleted. Therefore, Equation (8) is slightly changed into

$$\arg\min_{M} F(M), \qquad (10)$$

where

$$F(M) = \frac{\alpha \cdot tr(M) + \beta \cdot cv(M)}{\alpha + \beta},$$

subject to

Z_j.cpu > A_j.cpu
Z_j.ram > A_j.ram
Z_j.disk > A_j.disk
Z_j.io > A_j.io.

The newly added condition on Z_j.io and A_j.io indicates that (1) the system maintains zone-level metadata for its I/O bandwidth usage, and (2) the I/O requirement of applications is known upfront. While the first is straightforward to implement, the second can be implemented as a probe service or as a machine learning algorithm based on historical usage.

3.5 Broader Applicability and Limitations

Containers, as a lightweight alternative to VMs, are by nature a good candidate for the new computing paradigm where the workload (computation, storage, communication) is pushed to the edge—edge computing or fog computing. While a concrete definition of fog computing is still under active debate [32], we strongly believe that pushing workloads toward end devices will become a trend with the enabling technology of containers in many computing communities, including scientific computing, social networks, web services, and so forth.

Although this paper concentrates on the differentiated performance across multiple availability zones in conventional geographically-distributed clouds, the idea and approach of minimizing local and network contention can be applied to more computing paradigms, including fog computing. Essentially, in fog computing the computing and storage units are slimmer and more distributed than data centers—they are everywhere, from personal electronic devices (for example, smart phones, smart watches) to enterprise facilities such as weather sensors and surveillance monitors. Despite the radical change from provider-hosted resources to ubiquitous dispersion, one of the key challenges remains: the computation should be deployed as close as possible to its data to avoid costly network congestion.

The locality problem is more challenging in fog computing from the following two perspectives.

- The network is more heterogeneous and volatile than in cloud computing. In fog computing, the network could be a mobile network, WiFi, an Ethernet connection, Bluetooth, and so forth, which is a completely different story compared with the relatively stable Internet connections in cloud computing. As a case in point, a mobile phone's latency is significantly more volatile than that of a physical node in a data center. Consequently, the matrix we introduce to characterize the network traffic would need to be updated frequently.
- The scale is significantly larger. In cloud computing, the availability zones under discussion might number on the order of O(10). The entity granularity is coarse: the machines within the same data center are considered homogeneous nodes, so the entire data center might be treated as a single entity. In fog computing, however, each device is a potential entity because there is little association between devices. As a result, the traffic matrix would be proportional to the total number of devices instead of O(10). Note that there could be millions of users or more, which could invalidate the matrix approach we have discussed so far in this paper.

4 IMPLEMENTATION

We have implemented the hybrid model discussed in Section 3. We modified the Diego [7] source code to allow users to choose whether locality should be considered in the scheduling decision. In particular, a new module called locality.go is developed, among other changes to existing modules such as auctionrunner.go and scheduler.go.

Fig. 5. Definition of the Locality class.

The Locality class (technically, a struct in the Go programming language) is defined in Fig. 5. The dep and traffic fields represent the dependency and traffic matrices discussed in Section 3. The appZone field is a hashmap from an existing application to its assigned zone. The zones field is a sortable array of all the available zones. The zoneAppNum field records the number of running applications on a particular zone.


A Locality reference is instantiated and initialized in the main scheduling entity, named auctionrunner. That is, only one instance of Locality (for example, the dependency matrix, the traffic matrix, and so forth) is maintained. The Locality reference is then passed to the scheduler object, which can call either the vanilla placement algorithm or the method named AssignCellWithMinTraffic(app string) exposed by Locality.

A simplified diagram of the key components is shown in Fig. 6. Two major packages are involved: auctionrunner and models.

- The auctionrunner package is responsible for all the auction-related scheduling, where most changes were made to the vanilla Diego. It takes the deployment request and initializes a Locality object (step 1), whose reference is then passed to the Scheduler class (step 2). The scheduler deploys the application to the cell according to the placement algorithm specified in the Locality class.
- The models package maintains all the abstraction models in Diego; it was extended in the DesiredLRP to also manage the application dependencies. That is, the Locality class retrieves the application dependency from the DesiredLRP (step 3), which is originally specified in the manifest file by the users.

The dependency matrix between applications is specified in the manifest file when the application is launched. The information is saved in a new field named ServiceDependency in the DesiredLRP class. Later on, when scheduling the application, the dependency information is pulled out to build the dependency matrix.

Fig. 6. Diego [7] extended with the Locality class.
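As an illustration of the last step, the following Go sketch shows how per-application dependency lists (the ServiceDependency information described above) could be folded into the dependency matrix consumed by Locality. The helper name and signature are hypothetical; they are not taken from the Diego source.

```go
// A sketch of building the dependency matrix D from per-application
// dependency declarations. Names are illustrative, not Diego code.
package locality

// buildDepMatrix maps each application's declared dependencies (by name)
// onto the index space of all known applications.
func buildDepMatrix(appNames []string, serviceDeps map[string][]string) [][]int {
	index := make(map[string]int, len(appNames))
	for i, name := range appNames {
		index[name] = i
	}
	dep := make([][]int, len(appNames))
	for i := range dep {
		dep[i] = make([]int, len(appNames))
	}
	for app, deps := range serviceDeps {
		i, ok := index[app]
		if !ok {
			continue
		}
		for _, d := range deps {
			if k, ok := index[d]; ok {
				dep[i][k] = 1 // D[i][k] == 1: app i depends on app/service k
			}
		}
	}
	return dep
}
```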


5 EVALUATION

We verified the correctness of the proposed scheduling algorithms at small scales (i.e., eight applications) and then evaluated their effectiveness at large scales (i.e., 2,400 applications/containers), both by leveraging the built-in simulation module in Diego. The experiments were carried out on 64-bit Ubuntu operating systems with quad-core CPUs and 64 GB memory. The original Diego scheduler was released as v0.13 and was primarily implemented in the Go programming language v1.5; each Docker image was installed with the cflinuxfs2 stack supported by Cloud Foundry. Note that we assume the containers in this discussion are of the same capacity for clear presentation; in the real world the containers might have different capacities, yet they can be normalized to the same capacity and the following evaluation remains valid. The simulation module arbitrarily creates various types of workloads (i.e., with different numbers of applications, containers, cells, zones, etc.) that trigger the Diego scheduler. While the workloads are simulated, the scheduler itself is indeed executed (not simulated).

We started with an oversimplified example: two cells (REP-1 and REP-2) provisioned in two zones (Z0 and Z1), respectively. Each of the cells has a capacity of eight containers. Eight applications are to be deployed, namely A0 to A7. The vanilla placement algorithm in Diego, unsurprisingly, assigns four applications to each zone for the sake of load balance, as the simulator output shows in Fig. 7, where "+" indicates a deployed application and "." represents a free container.

Fig. 7. Distribution of eight applications on two zones with Diego.

Now let us consider the same set of applications being deployed with the proposed locality-aware algorithms. Applications A0 to A7 are deployed to our 2-zone cloud in a serial manner, where each application is dependent on the other applications marked as 1 in Fig. 8. For instance, A0 and A1 do not have any dependent applications (or services), while A2 is dependent on A1 and A3 is dependent on A0. For i := 4..7, A(i) is dependent on A(i-1).

Fig. 8. Dependency matrix of applications.

The traffic matrix between zones is shown in Fig. 9. Here we assume the network traffic across different zones is symmetric, which is why the matrix is also symmetric. The unit of the numbers does not matter as long as they accurately reflect the relative traffic. In practice, "traffic" could be end-to-end latency for small file accesses, network throughput for data-intensive workloads, or a mix of both. Due to the huge performance difference between local and remote traffic, we should try to assign dependent applications to the same zone if possible from the performance perspective. If either zone yields the same traffic, then we pick the one with fewer running applications (from the zoneAppNum field in Fig. 5) for the sake of load balance. Since we assume all containers are of the same capacity, this decision is equivalent to choosing the zone with the most free capacity when multiple zones qualify for the application.

Fig. 9. Traffic matrix of zones.

In this example, the traffic across two zones is 10 times larger than the local traffic. We chose 10 by observing that the inter-zone latency is roughly 10 times larger than the intra-zone latency on Amazon EC2: the latency within North California is 34 milliseconds while the latency between North California and Singapore is 385 milliseconds [6]. In the following discussion, we use 10 and 100 for clear presentation instead of the raw measurements (i.e., 34 and 385).

We are now ready to understand why our proposed scheduler places six applications in Z0 and the two others in Z1, as shown in Fig. 10. Since neither A0 nor A1 has any dependency, A0 is assigned to Z0 and A1 is assigned to Z1. For A2, because it is dependent on A1, it is deployed to Z1 as well. Yet, because A3 is dependent on A0, it is deployed to Z0. Similarly, A4 ... A7 are deployed to Z0 because their dependencies are all on Z0. Therefore, we observe six applications on Z0: A0, A3, ..., A7; and two applications on Z1: A1 and A2.

Fig. 10. Distribution of the same eight applications on two zones with Diego/P.

The overall application traffic is calculated as follows. Recall that, according to the strict load-balance rule of Diego, all even-numbered applications are deployed to Z0 and all odd-numbered applications are deployed to Z1. Unfortunately, all of A2, ..., A7 have dependencies (refer to Fig. 8) residing on the zone other than their own. Therefore, the overall traffic is 100 × 6 = 600. On the other hand, in our proposed scheduler all six of these applications have their dependencies in the same zone, resulting in an overall traffic of 6 × 10 = 60. That is, with locality-aware scheduling we are able to turn costly inter-zone traffic into lighter intra-zone traffic.

While the network traffic is greatly reduced with the proposed scheduling strategy, one obvious limitation lies in the skewness of job placement: Z0 has 6 applications, 300 percent of Z1. Statistically, this results in significant variance: ((6 − 4)² + (2 − 4)²) / 2 = 4. That is, the coefficient of variation is √4 / 4 = 50%. Extremely low network traffic is definitely desirable in terms of performance, yet it indicates low network utilization from the vendor's perspective. Therefore, we are interested in trading off a portion of the network traffic to (significantly) improve the CV.

TABLE 1
Function Scores and Placement of Eight Applications on Two Zones (Z0, Z1) in Two Setups: {F1: α = 1, β = 1} and {F2: α = 1, β = 2} (Selected Zone is Underlined)

App      A0   A1   A2   A3   A4   A5   A6   A7
Z0 F1   .50  .50  .39  .02  .12  .19  .24  .27
Z1 F1   .50  .00  .19  .02  .12  .19  .24  .27
Z0 F2   .67  .67  .37  .02  .15  .24  .25  .15
Z1 F2   .67  .00  .24  .02  .15  .15  .11  .15

We first re-ran the same workloads as discussed above, assigning the same weight to load balance and locality, that is, α = 1 and β = 1. As shown in Table 1 (first two rows), the allocation is exactly the same as when considering no load balance at all (see Fig. 10). This is best explained by the fact that the coefficient of variance between zones is not significant enough to impact the weighted function, which is dominated by the network traffic.
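The traffic accounting in this example can be checked with a small, self-contained Go program. The dependency chain and the 10/100 intra-/inter-zone costs are transcribed from Figs. 8 and 9 as described above; the code itself is only an illustration, not part of the prototype.

```go
// Reproduce the traffic totals of the eight-application example:
// round-robin placement yields 600, the locality-aware placement yields 60.
package main

import "fmt"

func totalTraffic(dep [][]int, zone []int, cost [2][2]float64) float64 {
	var sum float64
	for i := range dep {
		for k := range dep[i] {
			if dep[i][k] == 1 {
				sum += cost[zone[i]][zone[k]]
			}
		}
	}
	return sum
}

func main() {
	// D: A2->A1, A3->A0, and A(i)->A(i-1) for i = 4..7.
	dep := make([][]int, 8)
	for i := range dep {
		dep[i] = make([]int, 8)
	}
	dep[2][1], dep[3][0] = 1, 1
	for i := 4; i <= 7; i++ {
		dep[i][i-1] = 1
	}
	cost := [2][2]float64{{10, 100}, {100, 10}} // intra-zone 10, inter-zone 100

	roundRobin := []int{0, 1, 0, 1, 0, 1, 0, 1}    // vanilla Diego placement
	localityAware := []int{0, 1, 1, 0, 0, 0, 0, 0} // our scheduler (Fig. 10)

	fmt.Println(totalTraffic(dep, roundRobin, cost))    // 600
	fmt.Println(totalTraffic(dep, localityAware, cost)) // 60
}
```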


Fig. 11. Network traffic and load balance of 2,400 jobs on 30 zones with different configurations.

Then we increase the weight on load balance (i.e., cv): α = 1 and β = 2. The results are reported in the bottom two rows of Table 1. Now we observe that A5 and A6 are placed in Z1 rather than Z0. That is, the increased weight on load balance counteracts the network traffic for these two applications. Also note that, even though in this scenario both Z0 and Z1 have the same number of applications just like in vanilla Diego (Fig. 7), the applications in each zone are different.

In the scenario of {α = 1, β = 2}, the overall network traffic is 240 because A0 = A1 = 0, A2 = A3 = A4 = A6 = 10, and A5 = A7 = 100. While the traffic is higher than under the strict network-optimal strategy, the ratio is still relatively low: 240 / (220 × 8) = 13.6%. In return, the system is perfectly load-balanced. Compared with the vanilla Diego scheduler, our proposed approach achieves the same load balance but reduces the traffic from 600 to 240—saving 1 − 240/600 = 60% of the network bandwidth.

We then scale the experiment out to 2,400 applications and 30 zones. The traffic savings and load balance are reported in Fig. 11 for various combinations of parameters. Several interesting observations are discussed in the following.

First, the right-most column reports that the upper bound of performance improvement is 89 percent, where load balance is not considered at all (i.e., β = 0). This result is understandable because the inter-zone network is 10× as costly as the intra-zone network, which means that diverting most inter-zone traffic to within the zone roughly results in about 10 percent of the network traffic.

Second, the effect of the locality weight at smaller ratios (i.e., α/β ≤ 5) is more significant for both traffic savings and load balance. When the ratio increases from 1 to 5, the traffic savings increase by 60 percent (from 12 to 72 percent) and the coefficient of variance increases by 0.35 (from 0.01 to 0.36). On the other hand, when the ratio increases from 5 to 1,000, the traffic savings increase by only 17 percent (from 72 to 89 percent) and the coefficient of variance increases by only 0.14 (from 0.36 to 0.50). Therefore, a good practice for choosing α and β is to sweep the parameter space only at small values of the ratio between α and β.

We now visually demonstrate how the proposed locality-aware scheduler places the applications on different containers and zones under different criteria. We first show two representative scenarios in Figs. 12 and 13, respectively. Both experiments are deployed at the largest scale, i.e., 2,400 jobs on 30 zones, with different parameters. Afterwards, we illustrate two extreme cases (Figs. 14 and 15) where α is set to 1 and 1,000, respectively.

The first setup, in Fig. 12, shows how our proposed scheduler, when α = 2 and β = 1, makes a better trade-off between data locality and load balance compared to the case of α = 1 and β = 1, where only 12 percent of the network traffic is reduced although the placement is perfectly load balanced. Specifically, this placement slightly gives up load balance by deploying more workloads in the middle containers (i.e., a coefficient of variance of 0.21 per Fig. 11). In return, it helps save almost half of the network traffic (i.e., 46.43 percent per Fig. 11).

Fig. 12. Placement of 2,400 jobs on 30 zones when α = 2, β = 1.


Fig. 13. Placement of 2,400 jobs on 30 zones when α = 100, β = 1.

The second setup, in Fig. 13, where α = 100 and β = 1, represents a placement decision at the other end of the spectrum—load balance is largely skewed in order to reduce considerable network traffic. Indeed, one can visually tell the skewness: the zones from Z20 to Z26 host only a handful of jobs, whereas all the other containers are almost at full capacity. On the other hand, per Fig. 11, such a setup would save 88 percent of the network traffic. Depending on the service level agreement (SLA), this placement might be desired in practice, for example when load balance is not taken into account by the vendor while network performance is one of the major metrics for the provisioned service.

Fig. 14. Placement of 2,400 jobs on 30 zones when α = 1, β = 1.

The third setup, in Fig. 14 (α = 1 and β = 1), exhibits the scenario where load balance is set as the top priority. The high load balance can be observed visually from the figure: almost all zones are assigned a similar number (≈2) of applications. However, this does not come for free, as it does not save as much network traffic as the other setups (e.g., 12 percent versus 46 percent when α = 2). Nevertheless, depending on the application's requirements, this might be a desirable setup.

Fig. 15. Placement of 2,400 jobs on 30 zones when α = 1,000, β = 1.

The fourth and last experiment is shown in Fig. 15 (α = 1,000 and β = 1). We observe an application-zone allocation similar to Fig. 13: most zones are exhausted, with a few left with very light workloads. In fact, with an extremely large α, several zones do not have any workload at all. Compared to α = 100, the load balance is further compromised (CV = 0.50 versus CV = 0.48) with only a marginal increase in traffic savings (i.e., 88.12% → 89.06%). This experiment illustrates that α is not very sensitive when set to extremely large values. A theoretical analysis of α's sensitivity at extreme scales is beyond the scope of this paper, but we plan to work in this direction in our future work.

6 RELATED WORK

By nature, load balance and data locality are fundamentally orthogonal to each other. Some of our prior work [26] tried to find a good trade-off between the two metrics in high-performance computing. This work, however, concentrates on the analytical model of load balance and application performance in the context of containerized cloud services, which are nowadays widely adopted by platform-as-a-service (PaaS) offerings.

Scheduling is actively studied to improve I/O performance at large scales. For instance, one piece of our prior work focused on batch scheduling on petascale systems [36]. The objective was to design and evaluate a batch scheduler with a holistic view of both system state and jobs' activities on an extreme-scale production system, Mira [19] at Argonne National Laboratory, a top-5 supercomputer in the world [25]. In contrast, this work is focused on the scheduling of containerized cloud services with respect to network traffic in addition to disk I/O.

Ahn et al. [2] proposed VM-level scheduling techniques to migrate micro-architectural resources such as shared caches and memory controllers. The micro-architectural resources are not traditionally isolated at the VM layer, but manipulated within the system. They showed that the proposed scheduling approaches are highly effective for cache sharing and non-uniform memory access (NUMA) affinity. There are more studies [4], [21] on NUMA scheduling at the VM level, as well as adaptive VM scheduling optimized for non-concurrent workloads [29]. At the infrastructure-as-a-service layer, more resource management tools have been introduced, such as security checks [34], network monitoring [35], dynamic deployment [20], [23], etc. This work, on the other hand, targets a different set of micro-architectural resources (disk I/O bandwidth and network traffic) at a finer granularity (i.e., containers).

Scheduling is also researched to improve other aspects of the system. For example, one of the most interesting angles is power consumption and the electricity bill [10], [17], [31]. Those works showed that a well-tuned job scheduler can significantly reduce power consumption and the electricity bill. This work can be combined with the aforementioned approaches to further reduce the financial investment if network traffic is considered a major portion of the operation cost.

Although Docker [8] is one of the most popular implementations of container services, it per se does not expose any data locality for the deployed containers.


While Kubernetes [16] does provide management of container services in the form of "pods", it assigns containers to pods without much optimization informed by the application's I/O patterns. In contrast, this work proposes a series of models and algorithms that allow users to specify a degree of skewness of load balance in order to achieve better data locality and, consequently, better performance from a global perspective of the entire cluster.

There are also important works on scheduling for better load balance from the distributed systems and high-performance computing (HPC) communities, such as [1], [28], [33]. In particular, Wang et al. [27] proposed an ephemeral burst-buffer file system, a breakthrough to alleviate the conventional I/O pressure that has existed in HPC systems for decades. We believe the ephemeral burst buffer can be leveraged by the cloud computing community as well, as it is orthogonal to the algorithms and models proposed in this paper. Thus, applying the ephemeral burst-buffer file system proposed in [27] will likely further improve the performance of many cloud computing applications.

7 CONCLUSION AND FUTURE WORK

The emerging containerized cloud services, although they provide a finer granularity of virtualization, inevitably pose new challenges, as any new technology does. Among them, the performance of applications and services is of great interest, not only from the users' perspective but also for the more performance-aware service-level agreements between providers and customers. This paper identifies the containerized service scheduler as one of the performance deficiencies in the state-of-the-art design of PaaS. More specifically, we pinpoint that a simple round-robin policy of placing applications/services on containers incurs significant performance overhead from both local disk I/O and network traffic.

The first step we take to tackle the challenge is formulating it into optimization problems and analyzing their complexities. Unsurprisingly, some of the problems turn out to be NP-hard. To that end, we devise heuristic algorithms and solutions for different scenarios. The proposed technique is implemented and evaluated in the de facto PaaS platform—CloudFoundry [5]. We justify the correctness of the proposed work at small scales, showing that the network traffic can be reduced by 60 percent with no side effect on load balance. In order to demonstrate its effectiveness at large scale, we carry out large-scale experiments (i.e., 2,400 applications) showing that the proposed scheduling strategy is able to save up to 89 percent of the network traffic by trading off 50 percent of load balance.

There are three major future research directions for this work. First, an application's I/O requirement should be better quantified. The current design assumes that the user specifies this property in a configuration file (e.g., the manifest). An automatic and systematic approach is desirable to make the system more accurate and to reduce human intervention. Second, an efficient approach to achieve the global optimization of network traffic is needed. This paper scratches the surface of this problem—proving it NP-hard—and leaves developing efficient algorithms for the optimization as an open question for future work. Third, the models, algorithms, and prototypes discussed in this work will be extended to a larger context and integrated into an orchestrated cloud service. Currently, this work assumes that the storage layer is a shared-nothing cluster of local Linux file systems (e.g., ext4); we will be working on providing a GPFS [24] interface so that other communities (e.g., scientific data management, high-performance computing) would also benefit from the scheduling techniques presented in this paper.

ACKNOWLEDGMENTS

This work is supported in part by the Big Data Research Initiatives sponsored by the Nevada System of Higher Education (on behalf of the University of Nevada, Reno), the AWS Research Grants from Amazon, and the Azure Research Award from Microsoft. Some preliminary work was conducted when D. Zhao worked at IBM Almaden Research Center. The authors would like to thank Prof. Ioan Raicu (Illinois Institute of Technology), Prof. Ion Stoica (University of California, Berkeley), and Prof. Magdalena Balazinska (University of Washington) for insightful discussions.

REFERENCES

[1] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar, "ShuffleWatcher: Shuffle-aware scheduling in multi-tenant MapReduce clusters," in Proc. USENIX Annu. Tech. Conf., 2014, pp. 1–12.
[2] J. Ahn, C. Kim, J. Han, Y.-R. Choi, and J. Huh, "Dynamic virtual machine scheduling in clouds for architectural shared resources," in Proc. 4th USENIX Conf. Hot Topics Cloud Comput., 2012, p. 19.
[3] AWS vCPU. (2017). [Online]. Available: https://aws.amazon.com/ec2/virtualcores, Accessed on: Dec. 3, 2017.
[4] Y. Cheng, W. Chen, X. Chen, B. Xu, and S. Zhang, "A user-level NUMA-aware scheduler for optimizing virtual machine performance," in Proc. Int. Symp. Adv. Parallel Process. Technol., 2013, pp. 32–46.
[5] Cloud Foundry. (2015). [Online]. Available: http://docs.cloudfoundry.org/, Accessed on: Jul. 17, 2015.
[6] CloudWatch. (2015). [Online]. Available: http://www.cloudwatch.in/, Accessed on: Jul. 15, 2015.
[7] Diego Project. (2015). [Online]. Available: https://github.com/cloudfoundry-incubator/diego-release, Accessed on: Jul. 16, 2015.
[8] Docker container. (2015). [Online]. Available: https://github.com/docker/docker, Accessed on: Jul. 16, 2015.
[9] EMC/VMware vSphere. (2015). [Online]. Available: https://www.vmware.com/products/vsphere/, Accessed on: Jul. 17, 2015.
[10] J. Hikita, A. Hirano, and H. Nakashima, "Saving 200kW and $200K/year by power-aware job/machine scheduling," in Proc. IEEE Int. Symp. Parallel Distrib. Process., 2008, pp. 1–8.
[11] IBM Bluemix. (2015). [Online]. Available: http://www.ibm.com/cloud-computing/bluemix/, Accessed on: Jul. 17, 2015.
[12] IBM z/VM. (2015). [Online]. Available: http://www.vm.ibm.com/overview/, Accessed on: Jul. 17, 2015.
[13] K. Jansen, F. Land, and K. Land, "Bounding the running time of algorithms for scheduling and packing problems," in Proc. 13th Int. Conf. Algorithms Data Struct., 2013, pp. 439–450.
[14] Jelastic. (2015). [Online]. Available: http://ops-docs.jelastic.com/cluster-features#c, Accessed on: Jul. 16, 2015.
[15] J. B. Kruskal, "On the shortest spanning subtree of a graph and the traveling salesman problem," Proc. Amer. Math. Soc., vol. 7, no. 1, pp. 48–50, 1956.
[16] Kubernetes. (2017). [Online]. Available: https://github.com/kubernetes/kubernetes, Accessed on: Jul. 24, 2017.
[17] O. Mammela, M. Majanen, R. Basmadjian, H. De Meer, A. Giesler, and W. Homberg, "Energy-aware job scheduler for high-performance computing," Comput. Sci. Res. Develop., vol. 27, no. 4, pp. 265–275, 2012.
[18] Microsoft Hyper-V. (2015). [Online]. Available: https://technet.microsoft.com/library/hh831531.aspx, Accessed on: Jul. 17, 2015.
[19] Mira. (2015). [Online]. Available: http://www.alcf.anl.gov/user-guides/mira-cetus-vesta, Accessed on: Jul. 17, 2015.


[20] F. Paraiso, S. Challita, Y. Al-Dhuraibi, and P. Merle, "Model-driven management of docker containers," in Proc. IEEE 9th Int. Conf. Cloud Comput., 2016, pp. 718–725.
[21] J. Rao, K. Wang, X. Zhou, and C.-Z. Xu, "Optimizing virtual machine scheduling in NUMA multicore systems," in Proc. IEEE 19th Int. Symp. High Perform. Comput. Archit., 2013, pp. 306–317.
[22] Red Hat KVM. (2015). [Online]. Available: http://www.redhat.com/en/files/resources/en-rh-kvm-kernal-based-virtual-machine.pdf, Accessed on: Jul. 17, 2015.
[23] S. G. Saez, V. Andrikopoulos, R. J. Sanchez, F. Leymann, and J. Wettinger, "Dynamic tailoring and cloud-based deployment of containerized service middleware," in Proc. IEEE 8th Int. Conf. Cloud Comput., 2015, pp. 349–356.
[24] F. Schmuck and R. Haskin, "GPFS: A shared-disk file system for large computing clusters," in Proc. 1st USENIX Conf. File Storage Technol., 2002, Art. no. 19.
[25] Top500. (2014). [Online]. Available: http://www.top500.org/list/2014/06/, Published Jun. 2014; Accessed on: Sep. 5, 2014.
[26] K. Wang, X. Zhou, T. Li, D. Zhao, M. Lang, and I. Raicu, "Optimizing load balancing and data-locality with data-aware scheduling," in Proc. IEEE Int. Conf. Big Data, 2014, pp. 119–128.
[27] T. Wang, K. Mohror, A. Moody, K. Sato, and W. Yu, "An ephemeral burst-buffer file system for scientific applications," in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2016, pp. 69:1–69:12.
[28] Y. Wang, R. Goldstone, W. Yu, and T. Wang, "Characterization and optimization of memory-resident MapReduce on HPC systems," in Proc. IEEE 28th Int. Parallel Distrib. Process. Symp., 2014, pp. 799–808.
[29] C. Weng, Q. Liu, L. Yu, and M. Li, "Dynamic adaptive scheduling for virtual machines," in Proc. 20th Int. Symp. High Perform. Distrib. Comput., 2011, pp. 239–250.
[30] Xen. (2015). [Online]. Available: http://www.xenproject.org/, Accessed on: Jul. 17, 2015.
[31] X. Yang et al., "Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems," in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2013, Art. no. 60.
[32] S. Yi, C. Li, and Q. Li, "A survey of fog computing: Concepts, applications and issues," in Proc. Workshop Mobile Big Data, 2015, pp. 37–42.
[33] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling," in Proc. 5th Eur. Conf. Comput. Syst., 2010, pp. 265–278.
[34] Y. Zhai, L. Yin, J. Chase, T. Ristenpart, and M. Swift, "CQSTR: Securing cross-tenant applications with cloud containers," in Proc. 7th ACM Symp. Cloud Comput., 2016, pp. 223–236.
[35] D. Zhao, "Toward real-time and fine-grained monitoring of software-defined networking in the cloud," in Proc. IEEE 9th Int. Conf. Cloud Comput., 2016, pp. 884–887.
[36] Z. Zhou et al., "I/O-aware batch scheduling for petascale computing systems," in Proc. IEEE Int. Conf. Cluster Comput., 2015, pp. 254–263.

Dongfang Zhao received the PhD degree in computer science from the Illinois Institute of Technology, Chicago. He is an assistant professor in the Department of Computer Science & Engineering, University of Nevada, Reno. His research interests span data management systems, high-performance computing, cloud computing, distributed systems, and machine intelligence. He completed his postdoctoral fellowship in the School of Computer Science & Engineering, University of Washington, Seattle.

Mohamed Mohamed received the PhD degree in computer science from Telecom SudParis, Institut Mines-Telecom, Paris, France. He is a research staff member in the Ubiquitous Platforms Group, IBM Almaden Research Center, San Jose, California. He works on projects primarily related to PaaS, including data persistence management and SLA management. He is one of the main contributors to the design and implementation of the rSLA language and framework, as well as CloudFoundry's persistence support.

Heiko Ludwig received the master's and PhD degrees in information systems from Otto-Friedrich University Bamberg, Germany. He is a research staff member and manager with IBM Almaden Research Center, San Jose, California. Leading the Ubiquitous Platforms research group, he is currently working on topics related to container-based systems and Platform as a Service (PaaS), IoT platforms, SLA and quality management, as well as platform security.
