Abstract—State-of-the-art schedulers for containerized cloud services consider load balance as the only criterion; many other important properties, including application performance, are overlooked. In the era of Big Data, however, applications have become increasingly data-intensive and thus perform poorly when deployed on containerized cloud services. To that end, this paper aims to improve today's cloud services by taking application performance into account in next-generation container schedulers. More specifically, we build and analyze a new model that respects both load balance and application performance. Unlike prior studies, our model abstracts the tension between load balance and application performance into a unified optimization problem and then employs a statistical method to solve it efficiently. The most challenging part is that some sub-problems are extremely complex (for example, NP-hard), so heuristic algorithms have to be devised. Last but not least, we implement a system prototype of the proposed scheduling strategy for containerized cloud services. Experimental results show that our system can significantly boost application performance while preserving high load balance.
Index Terms—Cloud computing, service computing, containers, data management, high-performance computing
1 INTRODUCTION
Authorized licensed use limited to: Kyunghee Univ. Downloaded on November 15,2020 at 03:53:18 UTC from IEEE Xplore. Restrictions apply.
636 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020
Fig. 2. Multiple compute-intensive applications on the same cell have relatively low contention.

Fig. 4. Compute- and data-intensive applications deployed on the same cell do not cause noticeable contention.
ZHAO ET AL.: LOCALITY-AWARE SCHEDULING FOR CONTAINERS IN CLOUD COMPUTING 637
completed almost in the same time duration as their respective baselines. Those results confirm our conjecture: the contention between distinct types of resources is light inside the same cell.

need to be adjusted. However, how to migrate and how much overhead the migration involves are beyond the scope of this paper. In practice, both offline and online (live) migrations are available for VMs and containers (for example, Jelastic [14]).
3 REDUCING RESOURCE CONTENTION WITH LOCALITY-AWARE JOB SCHEDULING

3.1 Assumption
We assume the existing mapping between services and containers does not change. That is, at least one qualified cell exists for the request. In practice, when this assumption does not hold, multiple alternatives can be applied to restore it; for instance, the provider might return an error message, block the request until a qualified cell becomes available, or reselect a cell that causes no contention.

A cell is the smallest isolated environment where an application can run independently. In practice, a cell can be implemented with a Docker instance. A zone is usually a data center located at a single geographic location, within which the network speed is orders of magnitude higher than that between zones.

The list of available cells is denoted by C, where C_j represents the jth cell. The indexes of the applications running on a particular cell are represented by C_j.apps. Similarly, Z_j.apps indicates the list of applications running on the jth zone, Z_j. Available resources of C_j can be retrieved by C_j.cpu, C_j.ram, and C_j.disk. The total I/O bandwidth of C_j is denoted as C_j.io.

The list of applications is denoted by A, where A_i represents the ith application. The index of the cell where an application has been deployed is represented by cell(A_i). The requested resources (for example, CPUs, memory, disk space) of A_i are represented by A_i.cpu, A_i.ram, and A_i.disk. We denote the required disk throughput of A_i as A_i.io. If A_i.io is not explicitly specified, we assume A_i will aggressively take all the disk bandwidth of C_j.

The definitions for cells can be generalized to other entities as well. For example, if we are interested in the metrics at the zone level, the above notations apply after replacing C with Z. Similarly, the zone where an application A_i resides is denoted by zone(A_i), and the available resources of Z_j are denoted by Z_j.cpu, Z_j.ram, and Z_j.disk.

The dependency matrix D specifies the interaction between applications. For example, (D_{i,j} == 1) means application A_i has non-trivial I/O to application A_j. Another matrix T stores the pair-wise traffic of data movement among all location units (for example, cells, physical nodes, zones). The numerical value in T_{i,j} indicates the closeness between C_i and C_j. For instance, the cost of transferring data between two cells on the same physical machine is obviously lower than between cells on two different racks, which in turn is lower than between cells in two different locations (i.e., zones). In practice, the closeness could be network latency if users issue many small I/Os, or network bandwidth if a large volume of data is expected.

One important assumption of some of the following discussions is that containerized applications and services can be migrated to arbitrary locations. This is a required property for the global optimization, where deployed applications may need to be adjusted.

3.2 Local Contention
The following optimization minimizes the number of I/O-contention nodes. That is, the goal is to allocate [A_1, ..., A_i] to C (i.e., a mapping array M[i] = j, 1 \leq j \leq |C|) such that

  \arg\min_{M} \sum_{j=1}^{|C|} \mathbb{1}\Big[ \sum_{k:\, cell(A_k) = C_j} A_k.io \;\geq\; C_j.io \Big],   (1)

subject to

  C_j.cpu > A_i.cpu
  C_j.ram > A_i.ram
  C_j.disk > A_i.disk.

The above equation aims to minimize the number of cells where I/O contention occurs. If the applications are not I/O-intensive, there is a high chance that no cell will experience I/O contention, in which case applications can easily fit into random cells as long as other metrics are met (e.g., CPU usage, load balance). However, the problem becomes extremely challenging when the optimal placement is not obvious: when many I/O-intensive applications are deployed to a limited number of nodes, we know there will be some nodes that are I/O throttled; the question then becomes how to minimize this bottleneck, or, equivalently, the number of I/O-contention nodes.

Obviously, there exists an exponential-time solution to this problem, which is not feasible in a real-world scheduler; more efficient and practical algorithms are needed. As a matter of fact, cloud providers might not want to deploy any data-intensive application at all if no fit is available, because that would cause side effects for existing applications in addition to slowing down the newly deployed one. A good strategy is to defer the application until either (1) one of the deployed jobs finishes, or (2) more resources (for example, new VMs, additional disks) are allocated. Then the problem becomes the following: how to allocate as many applications as possible without exceeding any cell's I/O bandwidth,

  \arg\max_{M} \sum_{k=1}^{i} \mathbb{1}\Big[ \sum_{h:\, cell(A_h) = C_j \,\wedge\, cell(A_k) = C_j} A_h.io \;\leq\; C_j.io \Big],   (2)

subject to

  C_j.cpu > A_i.cpu
  C_j.ram > A_i.ram
  C_j.disk > A_i.disk.

In contrast to Equation (1), which minimizes over the cells, here we try to maximize over the total number i of applications. The application A_h in the first line of Equation (2) denotes that the cell C_j has been occupied by an I/O-intensive application.
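To make the local-contention objective concrete, the following sketch (in Go, the language of the implementation described later in Section 4) counts the I/O-contention cells that Equation (1) minimizes under a candidate mapping. The types, field names, and the function itself are illustrative and not taken from the paper's code.

```go
package main

import "fmt"

// App and Cell carry only the fields used by Equation (1); the names
// are illustrative, mirroring the A_k.io and C_j.io notation.
type App struct{ IO float64 }
type Cell struct{ IO float64 } // total I/O bandwidth of the cell

// contentionCells returns the number of cells whose aggregate requested
// I/O reaches or exceeds their bandwidth -- the quantity Equation (1)
// minimizes over all mappings M, where M[k] is the index of the cell
// hosting application k.
func contentionCells(apps []App, cells []Cell, M []int) int {
	agg := make([]float64, len(cells))
	for k, a := range apps {
		agg[M[k]] += a.IO
	}
	n := 0
	for j, c := range cells {
		if agg[j] >= c.IO {
			n++
		}
	}
	return n
}

func main() {
	apps := []App{{60}, {60}, {10}}
	cells := []Cell{{100}, {100}}
	// Packing both heavy applications on cell 0 causes contention there.
	fmt.Println(contentionCells(apps, cells, []int{0, 0, 1})) // 1
	// Separating them avoids I/O contention entirely.
	fmt.Println(contentionCells(apps, cells, []int{0, 1, 1})) // 0
}
```

A full scheduler would search over mappings M subject to the CPU, RAM, and disk constraints; this sketch only evaluates the objective for a given mapping.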
Thus, we want to ensure that the subsequent application A_k in the second line would not saturate the total I/O bandwidth of the cell.

Equation (2) is a variation of the classical 0-1 Multiple Knapsack Problem (MKP) with two additional conditions:

1) all items (i.e., applications) have the same profit (we assume all applications have the same priority), and
2) three additional constraints (CPU, RAM, and disk space) on each item.

The MKP optimization problem itself is NP-hard, and Jansen et al. [13] showed that the relaxation on profits (i.e., condition 1) does not make the problem simpler. Therefore, we propose a pseudo-heuristic algorithm with dynamic programming to meet condition 2 in addition to the optimization problem.

Algorithm 1, in essence, is a greedy algorithm sharing the same spirit as two classical algorithms: (1) Kruskal's algorithm [15] for finding the minimal spanning tree in a graph, and (2) the best-fit algorithm in memory paging allocation. Our algorithm first sorts both applications and cells according to their I/O status (Lines 1–2). Then, from lowest to highest, each application is assigned to the cell whose free I/O resource is larger than, and closest to, the application's need, subject to the three conditions on CPU, RAM, and disk space (Lines 4–14). Finally, if a cell is found qualified for all the requirements, its I/O status is updated and its location is moved to the appropriate place in the sorted queue (Lines 15–18).

The complexity of Algorithm 1 is O(n^2) in total, as analyzed in the following (we use n instead of i by convention to indicate the size of the input). Because presumably i > |C| (otherwise the optimization problem has trivial solutions), we also use O(n) to indicate the worst time when iterating on C. Both Line 1 and Line 2 take O(n log n). Line 4 takes O(log n), and Lines 8–13 take O(n). Line 16 takes O(1), and Line 17 takes another O(log n). Therefore, the most costly part is the nested loop of Lines 3 and 8, which gives us O(n^2).

3.3 Network Contention
There are various granularities of network contention, such as cell-to-cell, machine-to-machine, and zone-to-zone. To keep the following analysis clear, we analyze zone-to-zone contention as the representative scenario; the idea and validity, however, hold true for the other scenarios as well (for example, cell-to-cell, machine-to-machine). We do not differentiate between an application and a service, to simplify the following analysis; after all, both are just jobs from the system's point of view. Therefore, the dependency matrix D introduced in Section 3.1 is also applicable here, in a context where (D_{i,k} == 1) means that application A_i is dependent on service/application A_k. The problem for an application to achieve the minimal network traffic then becomes the following. Given A_i, we find Z_j (1 \leq j \leq |Z|) such that

  \arg\min_{j} \sum_{k} D_{i,k} \, T_{j,\,zone(A_k)},   (3)

subject to

  Z_j.cpu > A_i.cpu
  Z_j.ram > A_i.ram
  Z_j.disk > A_i.disk.
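Equation (3) can be evaluated directly: for the application being placed, pick the zone that minimizes the dependency-weighted traffic to the zones of its already-placed dependencies. The sketch below (in Go, matching the implementation language of Section 4) shows that evaluation; feasibility checks on CPU, RAM, and disk are omitted for brevity, and all names are illustrative.

```go
package main

import "fmt"

// bestZone is a direct sketch of Equation (3): for application i, it
// returns the zone j minimizing sum_k D[i][k] * T[j][zone[k]], where
// D is the dependency matrix, T the zone-to-zone traffic matrix, and
// zone[k] the zone of already-placed application k (-1 if unplaced).
func bestZone(i int, D, T [][]float64, zone []int) int {
	best, bestCost := -1, 0.0
	for j := range T {
		cost := 0.0
		for k, zk := range zone {
			if zk >= 0 { // only already-placed applications count
				cost += D[i][k] * T[j][zk]
			}
		}
		if best == -1 || cost < bestCost {
			best, bestCost = j, cost
		}
	}
	return best
}

func main() {
	// Three zones; inter-zone traffic costs 10, intra-zone traffic is free.
	T := [][]float64{{0, 10, 10}, {10, 0, 10}, {10, 10, 0}}
	D := [][]float64{{0, 0}, {1, 0}} // app 1 depends on app 0
	zone := []int{2, -1}             // app 0 already lives on zone 2
	fmt.Println(bestZone(1, D, T, zone)) // 2: co-locate with the dependency
}
```

This is the per-application (linear) form of the problem; Section 3.3 goes on to show that the batch version over all applications is much harder.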
looping on |Z| we need to recalculate CV and drop the unqualified candidates. Thus the process takes O(|Z| \cdot (|Z| + |A|)) in total. However, an obvious limitation exists in Equation (4): what if no qualified zone is available with the required CV?

We propose to normalize both the load-balance and network-traffic criteria and apply a weight factor to each. The normalization of network traffic is measured as the application's actual traffic divided by the aggregate bandwidth. Ideally, if all applications are executed on local nodes only, the normalized network traffic is simply zero. Formally, the normalized network traffic (tr) for A_i on Z_j (i.e., a mapping matrix M where M_{i,j} indicates A_i is deployed on Z_j) is

  tr(M_{i,j}) = \frac{\sum_{k=1}^{i-1} D_{i,k} \, T_{j,\,zone(A_k)}}{\sum_{m=1}^{|Z|} \sum_{n=1}^{|Z|} T_{m,n}}.   (5)

Obviously, we have

  tr(M_{i,j}) \in [0, 1], \quad \forall i, j,

which follows from the fact that the total throughput of the applications cannot exceed the overall bandwidth.

The normalization of load balance is defined as the adjusted coefficient of variance of the applications deployed on each zone. That is,

  cv(M_{i,j}) = \frac{\sqrt{(1 + |Z_j.apps| - \bar{Z})^2 + \sum_{k \neq j} (|Z_k.apps| - \bar{Z})^2}}{\sqrt{|Z|} \cdot \bar{Z}},   (6)

where

  \bar{Z} = \frac{1 + \sum_{j=1}^{|Z|} |Z_j.apps|}{|Z|}.   (7)

It should be noted that in Equation (6) we only consider the number of applications deployed on a zone as its workload. That is, we do not explicitly account for resources (e.g., CPU usage, memory footprint, disk space) in our model. The reason is that the proposed model assumes all applications and zones are homogeneous for the sake of simplicity; as a result, the number of applications implicitly reflects the resources being used in a specific zone. Indeed, a more realistic model would have parameterized the sizes of distinct availability zones as well as the resources taken by each application, but that would break the brevity of our analysis. To address this, we can assume that applications taking heterogeneous resources are normalized into virtual applications, each of which is considered identical in terms of resources. This is analogous to vCPUs [3] used in Amazon Web Services (AWS): although instance types range widely from tiny single-core machines to high-end server-class Intel Xeon nodes, they are still comparable in terms of the number of virtual cores.

We are now ready to define the objective function with both factors. Let \alpha and \beta denote the weights for the normalized traffic tr() and the coefficient of variance cv(), respectively. Our goal is then to find Z_j for A_i (i.e., M_{i,j}) such that

  \arg\min_{M} F(M),   (8)

where the scoring function is

  F(M) = \frac{\alpha \cdot tr(M) + \beta \cdot cv(M)}{\alpha + \beta},

subject to

  Z_j.cpu > A_i.cpu
  Z_j.ram > A_i.ram
  Z_j.disk > A_i.disk.

In Equation (8), we use the weighted arithmetic mean (AM) as the objective function. We did not choose other forms because the weights in AM strike a good balance between sensitivity and complexity. We want the objective function to have a medium sensitivity: if the model does not promptly respond to an updated weight, then tuning the parameters will incur significant overhead; if the model is oversensitive, the optimal parameter can easily be missed. The cost of calculating the objective function should also be minimized, as otherwise the overall cost may be significantly affected. This is why we did not introduce more complex objective functions, such as the geometric mean, harmonic mean, quadratic mean, and so on.

In this section, without loss of generality, we assume that the traffic and the load balance contribute equally to the objective function. In practice, this can easily be changed according to specific requirements. For example, \alpha can be amplified by 10 times if traffic control is far more important than load balance. We will experimentally show the impact of distinct \alpha's and \beta's in Section 5.

The complexity analysis of Equation (8) is as follows. Equation (7) can be efficiently updated in constant time, assuming zones are not changed. If we maintain the difference square for each zone in memory, Equation (6) can also be updated in constant time. Nevertheless, it takes O(|Z|) to update Equation (6). In Equation (5), every dependency is checked, taking O(|A|). Therefore, the overall time is O(|Z| \cdot (|Z| + |A|)). That is, although taking both network traffic and load balance into consideration looks more complicated than the simple threshold-check approach (i.e., Equation (4)), the former essentially takes time of the same order of magnitude as the latter.

The problem becomes significantly harder (even without considering load balance) with a subtle change: when we consider scheduling a batch of applications, or all the applications globally. In essence, instead of a linear problem on a single dimension of |Z|, we are now facing the combinatorial problem of placing |A| applications on |Z| zones such that

  \arg\min_{M} \sum_{i} \sum_{k} D_{i,k} \, T_{j,\,zone(A_k)},   (9)

subject to

  Z_j.cpu > A_i.cpu
  Z_j.ram > A_i.ram
  Z_j.disk > A_i.disk.

A brute-force solution would obviously take exponential time, O(|Z|^{|A|}). We will prove that the problem of Equation (9) is NP-hard by a reduction from the Traveling Salesman Problem (TSP). The zone matrix can be represented as a graph whose
vertexes represent the zones and whose edges represent the network traffic between two zones, which is a setup similar to TSP. Instead of TSP, where we try to find the shortest path covering all vertexes, we are now trying to find the smallest summation of edges, arbitrarily determined by the dependency matrix. If we specify a dependency matrix where each application depends only on its two neighbor applications, and enforce that each zone holds only one application, then TSP is obtained as a special case of our global-network-minimization problem.

This paper is focused on the optimization problem with regard to a single application. Therefore, the algorithm, analysis, and evaluation are all concentrated on how to minimize the contention when a new application is deployed. We leave the optimization over all applications as an open question to the community as well as one of our future works.

3.4 Generalized Model
The two types of contention we have discussed so far—local (Section 3.2) and network (Section 3.3)—are disconnected. A strategy from the entire system's point of view is the focus of this section. As a case in point, when a new application is to be deployed, should it be scheduled to avoid local or network contention? More specifically, although we should try to assign an application to the same cell or zone as its dependent service, is it wise to do so if both the application and its dependent service incur significant disk I/O? The previous sections partially answer this question from two exclusive aspects; the following discusses how to coordinate both in a systematic manner.

The idea is to still keep the network traffic as low as possible, with the additional condition that no local disk bandwidth is depleted. Therefore, Equation (8) is slightly changed into

  \arg\min_{M} F(M),   (10)

where

  F(M) = \frac{\alpha \cdot tr(M) + \beta \cdot cv(M)}{\alpha + \beta},

subject to

  Z_j.cpu > A_i.cpu
  Z_j.ram > A_i.ram
  Z_j.disk > A_i.disk
  Z_j.io > A_i.io.

The newly added condition on Z_j.io and A_i.io indicates that (1) the system maintains zone-level metadata for its I/O bandwidth usage, and (2) the I/O requirement of applications is known upfront. While the first is straightforward to implement, the second can be implemented as a probe service or as a machine learning algorithm based on historical usage.

3.5 Broader Applicability and Limitations
Containers, as a lightweight alternative to VMs, are by nature a good candidate for the new computing paradigm where the workload (computation, storage, communication) is pushed to the edge—edge computing or fog computing. While a concrete definition of fog computing is still under active debate [32], we strongly believe that pushing workloads toward end devices will become a trend, with containers as the enabling technology, in many computing communities including scientific computing, social networks, web services, and so forth.

Although this paper concentrates on the differentiated performance across multiple availability zones in conventional geographically-distributed clouds, the idea and approach of minimizing local and network contention can be applied to more computing paradigms, including fog computing. Essentially, in fog computing the computing and storage units are slimmer and more distributed than data centers—they are everywhere, from personal electronic devices (for example, smart phones, smart watches) to enterprise facilities such as weather sensors and surveillance monitors. Despite the radical change from provider-hosted resources to ubiquitous dispersion, one of the key challenges remains: the computation should be deployed as close as possible to its data to avoid costly network congestion.

The locality problem is more challenging in fog computing from the following two perspectives.

1) The network is more heterogeneous and volatile than in cloud computing. In fog computing, the network could be a mobile network, WiFi, an Ethernet connection, Bluetooth, and so forth, which is a completely different story compared with the relatively stable Internet connection in cloud computing. As a case in point, a mobile phone's latency is significantly more volatile than that of a physical node in a data center. Consequently, the matrix we introduce to characterize the network traffic needs to be frequently updated.

2) The scale is significantly larger. In cloud computing, availability zones might number on the order of O(10). The entity granularity is coarse: the machines within the same data center are considered homogeneous nodes, so the entire data center might be treated as a single entity. In fog computing, however, each device is a potential entity because there is little association between devices. As a result, the traffic matrix would be proportional to the total number of devices instead of O(10). Note that there could be millions of users or more, which could invalidate the matrix approach we have discussed so far in this paper.

4 IMPLEMENTATION
We have implemented the hybrid model discussed in Section 3. We modified the Diego [7] source code to allow users to choose whether locality should be considered in the scheduling decision. In particular, a new module called locality.go is developed, among other changes to existing modules such as auctionrunner.go and scheduler.go.

The Locality class¹ is defined in Fig. 5. The dep and traffic fields represent the dependency and traffic matrices discussed in Section 3. The appZone field is a hashmap

1. Technically, it is a struct in the Go programming language.
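The actual definition of the Locality struct is in Fig. 5, which is not reproduced in this excerpt. Based on the field names the text does give (dep, traffic, appZone), a plausible sketch of its shape, together with the extra disk-bandwidth feasibility check that Equation (10) adds, might look as follows; the field types, the Zone/App types, and the feasible helper are all our assumptions, not the paper's code.

```go
package main

import "fmt"

// Locality approximates the struct described in Section 4: the
// dependency and traffic matrices from Section 3 plus a map from
// application ID to its current zone. Types are assumed.
type Locality struct {
	dep     [][]int        // dep[i][k] == 1: app i depends on app k
	traffic [][]float64    // zone-to-zone traffic cost (matrix T)
	appZone map[string]int // application ID -> zone index
}

// Zone and App carry the four resources checked by Equation (10);
// the IO fields are the condition newly added over Equation (8).
type Zone struct{ CPU, RAM, Disk, IO float64 }
type App struct{ CPU, RAM, Disk, IO float64 }

// feasible reports whether zone z can host app a under the
// constraints of Equation (10), including the I/O-bandwidth check
// that prevents depleting the zone's local disk bandwidth.
func feasible(z Zone, a App) bool {
	return z.CPU > a.CPU && z.RAM > a.RAM && z.Disk > a.Disk && z.IO > a.IO
}

func main() {
	loc := Locality{appZone: map[string]int{"svc-db": 0}}
	_ = loc // the scheduler would consult dep/traffic when scoring zones

	z := Zone{CPU: 8, RAM: 32, Disk: 500, IO: 100}
	fmt.Println(feasible(z, App{2, 4, 50, 40}))  // true
	fmt.Println(feasible(z, App{2, 4, 50, 120})) // false: would deplete zone I/O
}
```

In the real scheduler, this feasibility test would gate the candidate zones before the Equation (10) objective is minimized over them.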
Fig. 7. Distribution of eight applications on two zones with Diego.

Fig. 10. Distribution of the same eight applications on two zones with Diego/P.
TABLE 1
Function Scores and Placement of Eight Applications on Two Zones (Z0, Z1) in Two Setups: {F1: α = 1, β = 1} and {F2: α = 1, β = 2} (Selected Zone is Underlined)

App     A0   A1   A2   A3   A4   A5   A6   A7
Z0 F1  .50  .50  .39  .02  .12  .19  .24  .27
Z1 F1  .50  .00  .19  .02  .12  .19  .24  .27
Z0 F2  .67  .67  .37  .02  .15  .24  .25  .15
Z1 F2  .67  .00  .24  .02  .15  .15  .11  .15
Fig. 8. Dependency matrix of applications.
Fig. 11. Network traffic and load balance of 2,400 jobs on 30 zones with different configurations.
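The scores reported in Table 1 are values of the objective F from Equation (8). The sketch below (in Go, the paper's implementation language) shows one way to compute that score from the tr and cv normalizations of Equations (5)–(7); the matrix names follow Section 3's notation, while the function itself and its inputs are our illustrative assumptions, not the paper's code.

```go
package main

import (
	"fmt"
	"math"
)

// score evaluates the objective of Equation (8) for placing application
// i on zone j: the weighted arithmetic mean of the normalized network
// traffic (Equation (5)) and the adjusted coefficient of variance of
// the per-zone load (Equations (6) and (7)). load[z] is the number of
// applications currently on zone z; zone[k] is app k's zone.
func score(i, j int, D, T [][]float64, zone []int, load []int, alpha, beta float64) float64 {
	// Equation (5): dependency-weighted traffic over aggregate traffic.
	var num, denom float64
	for k := 0; k < i; k++ {
		num += D[i][k] * T[j][zone[k]]
	}
	for _, row := range T {
		for _, t := range row {
			denom += t
		}
	}
	tr := num / denom

	// Equation (7): mean load, counting the application being placed.
	nz := float64(len(load))
	total := 1.0
	for _, l := range load {
		total += float64(l)
	}
	zbar := total / nz

	// Equation (6): the candidate zone j is charged one extra app.
	var ss float64
	for z, l := range load {
		d := float64(l) - zbar
		if z == j {
			d = float64(l) + 1 - zbar
		}
		ss += d * d
	}
	cv := math.Sqrt(ss) / (math.Sqrt(nz) * zbar)

	// Equation (8): weighted arithmetic mean of the two criteria.
	return (alpha*tr + beta*cv) / (alpha + beta)
}

func main() {
	D := [][]float64{{0, 0}, {1, 0}} // app 1 depends on app 0
	T := [][]float64{{0, 5}, {5, 0}} // cross-zone traffic costs 5
	zone := []int{0, -1}             // app 0 already on zone 0
	load := []int{1, 0}
	// With traffic weighted 3x over load balance, co-locating the
	// dependent pair on zone 0 beats spreading to empty zone 1.
	local := score(1, 0, D, T, zone, load, 3, 1)
	remote := score(1, 1, D, T, zone, load, 3, 1)
	fmt.Println(local, remote) // 0.25 0.375
}
```

The candidate zone with the lowest score wins, which is how differing α and β weights flip placements such as A5 and A6 between the two setups of Table 1.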
rows in Table 1. Now we observe that A5 and A6 are placed in Z1 rather than Z0. That is, the increased weight on load balance outweighs the network traffic for these two applications. Also note that, even though in this scenario both Z0 and Z1 host the same number of applications, just as with vanilla Diego (Fig. 7), the applications in each zone are different.

In the scenario of {α = 1, β = 2}, the overall network traffic is 240 because A0 = A1 = 0, A2 = A3 = A4 = A6 = 10, and A5 = A7 = 100. While the traffic usage is higher than with the strict network-optimal strategy, the ratio is still relatively low: 240/(220 × 8) = 13.6%. In return, the system is perfectly load-balanced. Compared with the vanilla Diego scheduler, our proposed approach achieves the same load balance but reduces the traffic from 600 to 240—saving 1 − 240/600 = 60% of the network bandwidth.

We scale the experiment out to 2,400 applications and 30 zones. The traffic saving and load balance are reported in Fig. 11 for various combinations of parameters. Several interesting observations are discussed in the following.

First, the right-most column reports that the upper bound of performance improvement is 89 percent, where load balance is not considered at all (i.e., β = 0). This result is understandable because the inter-zone network is 10× as costly as the intra-zone network, which means diverting most inter-node traffic to within the zone roughly results in about 10 percent network traffic.

Second, the effect of the locality weight at smaller scales (i.e., α/β ≤ 5) is more significant for both traffic saving and load balance. When the ratio increases from 1 to 5, the traffic saving increases by 60 percentage points (from 12 to 72 percent) and the coefficient of variance increases by 0.35 (from 0.01 to 0.36). On the other hand, when the ratio increases from 5 to 1,000, the traffic saving increases by only 17 percentage points (from 72 to 89 percent) and the coefficient of variance increases by only 0.14 (from 0.36 to 0.50). Therefore, a good practice for choosing α and β is to sweep the parameter space only at small scales of the ratio between α and β.

We visually demonstrate how the proposed locality-aware scheduler places the applications in different containers and zones under different criteria. We first show two representative scenarios in Figs. 12 and 13, respectively. Both experiments are deployed at the largest scale, i.e., 2,400 jobs on 30 zones, with different parameters. Afterwards, we illustrate two extreme cases (Figs. 14 and 15) where α is set to 1 and 1,000, respectively.

The first setup in Fig. 12 shows how our proposed scheduler, with α = 2 and β = 1, makes a better trade-off between data locality and load balance compared with the case of α = 1 and β = 1, where only 12 percent of the network traffic is reduced although the placement is perfectly load balanced. Specifically, this placement slightly gives up load balance by deploying more workloads in the middle containers (i.e., coefficient of variance 0.21 per Fig. 11). In return, it

Fig. 12. Placement of 2,400 jobs on 30 zones when α = 2, β = 1.
Fig. 13. Placement of 2,400 jobs on 30 zones when α = 100, β = 1.

Fig. 15. Placement of 2,400 jobs on 30 zones when α = 1,000, β = 1.
helps save almost half of the network traffic (i.e., 46.43 percent per Fig. 11).

The second setup in Fig. 13, where α = 100 and β = 1, represents a placement decision at the other end of the spectrum—load balance is largely skewed in order to reduce considerable network traffic. Indeed, one can visually tell the skewness: six zones (Z20 to Z26) host only a handful of jobs, whereas all the other containers are almost at full capacity. On the other hand, per Fig. 11, such a setup saves 88 percent of the network traffic. Depending on the service level agreement (SLA), this placement might be desired in practice, for example when load balance is not taken into account by the vendor while network performance is one of the major metrics for the provisioned service.

The third setup in Fig. 14 (α = 1 and β = 1) exhibits the scenario where load balance is set as the top priority. The high load balance can be observed visually from the figure: almost all zones are assigned a similar number of applications. However, this does not come for free, as it does not save as much network traffic as the other setups (e.g., 12 percent versus 46 percent when α = 2). Nevertheless, depending on the application's requirements, this might be a desirable setup.

The fourth and last experiment is shown in Fig. 15 (α = 1,000 and β = 1). We observe an application-zone allocation similar to Fig. 13: most zones are exhausted, with a few left with very light workloads. In fact, with an extremely large α, several zones now have no workload at all. Compared to α = 100, the load balance is further compromised (CV = 0.5 versus CV = 0.48) with a marginal increase in traffic saving (i.e., 88.12% → 89.06%). This experiment illustrates that α is not very sensitive when set to extremely large values. A theoretical analysis of α's sensitivity at extreme scales is beyond the scope of this paper, but we plan to work in this direction in our future work.

Fig. 14. Placement of 2,400 jobs on 30 zones when α = 1, β = 1.

6 RELATED WORK
By nature, load balance and data locality are fundamentally orthogonal to each other. Some of our prior work [26] tried to find a good trade-off between the two metrics in high-performance computing. This work, however, concentrates on the analytical model of load balance and application performance in the context of containerized cloud services, which is nowadays widely adopted by platform-as-a-service (PaaS) offerings.

Scheduling is actively studied to improve I/O performance at large scales. For instance, one of our prior works focused on batch scheduling on petascale systems [36]. The objective was to design and evaluate a batch scheduler with a holistic view of both the system state and jobs' activities on an extreme-scale production system, Mira [19] at Argonne National Laboratory, a top-5 supercomputer in the world [25]. In contrast, this work is focused on the scheduling of containerized cloud services with respect to network traffic in addition to disk I/O.

Ahn et al. [2] proposed VM-level scheduling techniques to migrate micro-architectural resources such as shared cache and the memory controller. Micro-architectural resources are not traditionally isolated at the VM layer, but manipulated within the system. They showed that the proposed scheduling approaches are highly effective for cache sharing and non-uniform memory access (NUMA) affinity. There are more studies [4], [21] on NUMA scheduling at the VM level, as well as adaptive VM scheduling optimized for non-concurrent workloads [29]. At the infrastructure-as-a-service layer, more resource management tools have been introduced, such as security checks [34], network monitoring [35], dynamic deployment [20], [23], etc. This work, on the other hand, targets a different set of micro-architectural resources (disk I/O bandwidth and network traffic) at a finer granularity (i.e., containers).

Scheduling is also researched to improve other aspects of the system. For example, one of the most interesting angles is power consumption and the electricity bill [10], [17], [31]. Those works showed that a well-tuned job scheduler could significantly reduce the power consumption and electricity bill. This work can be combined with the aforementioned ones to further reduce the financial investment if network traffic is considered a major portion of the operation cost.

Although Docker [8] is one of the most popular implementations of container services, it per se does not expose
any data locality for the deployed containers. While Kubernetes [16] does provide management of container services in the form of “pods”, it assigns containers to pods without much optimization informed by the application's I/O patterns. In contrast, this work proposes a series of models and algorithms that allow users to specify a degree of skewness of load balance in order to achieve better data locality, and consequently better performance, from a global perspective of the entire cluster.

There are also important works on scheduling for better load balance from the distributed systems and high-performance computing (HPC) communities, such as [1], [28], [33]. In particular, Wang et al. [27] proposed an ephemeral burst-buffer file system, a breakthrough in alleviating the conventional I/O pressure that has existed in HPC systems for decades. We believe the ephemeral burst buffer can be leveraged by the cloud computing community as well, as it is orthogonal to the algorithms and models proposed in this paper. Thus, applying the ephemeral burst-buffer file system proposed in [27] will likely further improve the performance of many cloud computing applications.

in this work will be extended to a larger context and integrated into an orchestrated cloud service. Currently, this work assumes that the storage layer is a shared-nothing cluster of local Linux file systems (e.g., ext4); we will be working on providing a GPFS [24] interface so that other communities (e.g., scientific data management, high-performance computing) would also benefit from the scheduling techniques presented in this paper.

ACKNOWLEDGMENTS
This work is supported in part by the Big Data Research Initiatives sponsored by the Nevada System of Higher Education (on behalf of the University of Nevada, Reno), the AWS Research Grants from Amazon, and the Azure Research Award from Microsoft. Some preliminary work was conducted when D. Zhao worked at IBM Almaden Research Center. The authors would like to thank Prof. Ioan Raicu (Illinois Institute of Technology), Prof. Ion Stoica (University of California, Berkeley), and Prof. Magdalena Balazinska (University of Washington) for insightful discussions.
[20] F. Paraiso, S. Challita, Y. Al-Dhuraibi, and P. Merle, "Model-driven management of docker containers," in Proc. IEEE 9th Int. Conf. Cloud Comput., 2016, pp. 718–725.
[21] J. Rao, K. Wang, X. Zhou, and C.-Z. Xu, "Optimizing virtual machine scheduling in NUMA multicore systems," in Proc. IEEE 19th Int. Symp. High Perform. Comput. Archit., 2013, pp. 306–317.
[22] Red Hat KVM. (2015). [Online]. Available: http://www.redhat.com/en/files/resources/en-rh-kvm-kernal-based-virtual-machine.pdf, Accessed on: Jul. 17, 2015.
[23] S. G. Saez, V. Andrikopoulos, R. J. Sanchez, F. Leymann, and J. Wettinger, "Dynamic tailoring and cloud-based deployment of containerized service middleware," in Proc. IEEE 8th Int. Conf. Cloud Comput., 2015, pp. 349–356.
[24] F. Schmuck and R. Haskin, "GPFS: A shared-disk file system for large computing clusters," in Proc. 1st USENIX Conf. File Storage Technol., 2002, Art. no. 19.
[25] Top500. (2014). [Online]. Available: http://www.top500.org/list/2014/06/, Published Jun. 2014; Accessed on: Sep. 5, 2014.
[26] K. Wang, X. Zhou, T. Li, D. Zhao, M. Lang, and I. Raicu, "Optimizing load balancing and data-locality with data-aware scheduling," in Proc. IEEE Int. Conf. Big Data, 2014, pp. 119–128.
[27] T. Wang, K. Mohror, A. Moody, K. Sato, and W. Yu, "An ephemeral burst-buffer file system for scientific applications," in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2016, pp. 69:1–69:12.
[28] Y. Wang, R. Goldstone, W. Yu, and T. Wang, "Characterization and optimization of memory-resident MapReduce on HPC systems," in Proc. IEEE 28th Int. Parallel Distrib. Process. Symp., 2014, pp. 799–808.
[29] C. Weng, Q. Liu, L. Yu, and M. Li, "Dynamic adaptive scheduling for virtual machines," in Proc. 20th Int. Symp. High Perform. Distrib. Comput., 2011, pp. 239–250.
[30] Xen. (2015). [Online]. Available: http://www.xenproject.org/, Accessed on: Jul. 17, 2015.
[31] X. Yang et al., "Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems," in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2013, Art. no. 60.
[32] S. Yi, C. Li, and Q. Li, "A survey of fog computing: Concepts, applications and issues," in Proc. Workshop Mobile Big Data, 2015, pp. 37–42.
[33] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling," in Proc. 5th Eur. Conf. Comput. Syst., 2010, pp. 265–278.
[34] Y. Zhai, L. Yin, J. Chase, T. Ristenpart, and M. Swift, "CQSTR: Securing cross-tenant applications with cloud containers," in Proc. 7th ACM Symp. Cloud Comput., 2016, pp. 223–236.
[35] D. Zhao, "Toward real-time and fine-grained monitoring of software-defined networking in the cloud," in Proc. IEEE 9th Int. Conf. Cloud Comput., 2016, pp. 884–887.
[36] Z. Zhou et al., "I/O-aware batch scheduling for petascale computing systems," in Proc. IEEE Int. Conf. Cluster Comput., 2015, pp. 254–263.

Dongfang Zhao received the PhD degree in computer science from the Illinois Institute of Technology, Chicago. He is an assistant professor in the Department of Computer Science & Engineering, University of Nevada, Reno. His research interests span data management systems, high-performance computing, cloud computing, distributed systems, and machine intelligence. He completed his postdoctoral fellowship in the School of Computer Science & Engineering, University of Washington, Seattle.

Mohamed Mohamed received the PhD degree in computer science from Telecom SudParis, Institut Mines-Telecom, Paris, France. He is a research staff member in the Ubiquitous Platforms Group, IBM Almaden Research Center, San Jose, California. He is working on different projects that are primarily related to PaaS, including data persistence management and SLA management. He is one of the main contributors to the design and implementation of the rSLA language and framework, as well as CloudFoundry's persistence support.

Heiko Ludwig received the master's and PhD degrees in information systems from Otto-Friedrich University Bamberg, Germany. He is a research staff member and manager with the IBM Almaden Research Center in San Jose, California. Leading the Ubiquitous Platforms research group, he is currently working on topics related to container-based systems and Platform as a Service (PaaS), IoT platforms, SLA and quality management, as well as platform security.