
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2020.3034500, IEEE Transactions on Cloud Computing
IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. XX, NO. XX, XX XXXX 1

Characterizing Co-located Workloads in Alibaba Cloud Datacenters
Congfeng Jiang, Yitao Qiu, Weisong Shi, Zhefeng Ge, Jiwei Wang, Shenglei Chen,
Christophe Cérin, Zujie Ren, Guoyao Xu, and Jiangbin Lin

Abstract—Workload characteristics are vital for both data center operation and job scheduling in co-located data centers, where online services and batch jobs are deployed on the same production cluster. In this paper, we conduct a comprehensive analysis of Alibaba's cluster-trace-v2018, collected from a production cluster of 4034 machines. The findings and insights are the following: (1) The workload on the production cluster exhibits a daily cyclical fluctuation in CPU and disk I/O utilization, and the memory system has become the performance bottleneck of the co-located cluster. (2) The distributions of batch jobs, including their tasks and derived instances, can be approximated by a Zipf distribution. However, batch jobs with directed acyclic graph (DAG) dependencies suffer from co-location with online services, since the online services are given higher priority. (3) The resource usage of containers shows cyclical fluctuation consistent with that of the whole cluster, while their memory usage remains approximately constant. (4) The number of batch jobs that can be co-located with online services depends on the mispredictions per kilo instructions (MPKI) of the online services. To guarantee the QoS of online services, when the MPKI of online services rises, the number of batch jobs co-located on the same machine should decrease.

Index Terms—Co-located jobs, workload characterization, online services, batch jobs, Internet data center, scheduling.

• C. F. Jiang, Y. T. Qiu, Z. F. Ge, J. W. Wang and S. L. Chen are with the School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China. E-mail: {cjiang, qiuyitao, gezhefeng, wangjiwei, chenshenglei}@hdu.edu.cn
• W. S. Shi is with the Department of Computer Science, Wayne State University, Detroit, MI, USA. E-mail: weisong@wayne.edu
• C. Cérin is with University of Paris 13, Sorbonne Paris Cité, LIPN/CNRS UMR 7030, France. E-mail: christophe.cerin@lipn.univ-paris13.fr
• Z.J. Ren is with Zhejiang Lab, Hangzhou, China. E-mail: renzj@hdu.edu.cn
• G.Y. Xu, and J.B. Lin are with Alibaba Cloud, Hangzhou, China. E-mail: {yao.xgy, jiangbin.lin}@alibaba-inc.com
Manuscript received November 26, 2019; revised XXX, XXX.

1 INTRODUCTION

Quality of experience (QoE) of online services, such as searching, shopping and video streaming, is strongly affected by their latency and responsiveness, especially the long tail latency. Therefore, online service providers usually employ over-provisioning approaches to respond to intermittent burst workloads for better QoE and competitiveness. Such resource over-provisioning wastes energy and many resources, and it is even worse for small and mid-sized data centers [1], [2]. Currently, to increase resource utilization and reduce power and energy costs, companies increasingly co-locate online services and batch jobs on the same cluster in their data centers. The rationale behind the co-location is that the varying workload in data centers results in resource utilization fluctuation, which provides an opportunity for resource multiplexing and interleaving. Certain online services are latency-critical but may not consume much CPU and/or memory resources. At the same time, offline batch jobs demand as many resources as possible to maximize their performance, such as job throughput. Moreover, online services and batch jobs are complementary in both temporal and spatial scale, which provides an opportunity for the co-location of services.

Although the co-location of such jobs improves machine utilization, it challenges the data center scheduler and cluster management system in various aspects, including the QoS (Quality of Service) of online services, performance interference, and isolation between the two kinds of workloads. Moreover, the co-location of online services and batch jobs also results in scheduling complexity and interference, which sometimes can deteriorate system performance. For example, the performance of latency-critical online services will significantly degrade if they are not properly co-located with batch jobs, because slight CPU and memory contention can result in a significant increase in service latency and performance jitter. In some extreme cases, batch jobs will be evicted to secure the QoE of online services. Therefore, online services and batch jobs must be carefully scheduled and co-located temporally and spatially by trading off between the online services' latency and the throughput of batch jobs. Specifically, a carefully designed co-location of online services and batch jobs in a data center not only provides further workload consolidation, but also helps ensure the QoS of applications and reduce energy consumption.

Before job co-location, workload characteristics, including task arrival patterns, job hierarchy, resource usage, job distribution, job dependency, job waiting time, and job failure patterns, are vital for operation, job scheduling, power management, and server health management in data centers. For example, online services and batch jobs cannot be co-located in a complementary way until their respective resource usage and job distribution are identified. Therefore, workload characterization helps understand workload patterns and design better job scheduling policies in data centers [3]. Researchers have proposed various approaches to characterize data center workloads, such as time series analysis [4], workload prediction [5], energy

2168-7161 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Carleton University. Downloaded on May 26,2021 at 12:33:29 UTC from IEEE Xplore. Restrictions apply.

performance [6], [7], workload shifting [8], and dynamic resizing [9].

Moreover, trace data from a production cluster system can help both industry and academic communities design a better job scheduling system [10], [11]. However, due to commercial confidentiality reasons, few trace data are released from production systems, especially from large Internet companies. Thus, current cloud research lacks data on the characteristics of the production co-located workloads of large cloud service providers (CSPs). In 2011, Google released trace data from a 12.5K-machine cell over about a month-long period [12]. In 2017, Microsoft released trace data of the 2M first-party virtual machine (VM) workload of Microsoft Azure in one of its geographical regions for 30 consecutive days [13]. The Google trace data is relatively old and may not reflect the capabilities and merits of current cluster management systems, while the Microsoft trace data only contains virtual machine-related information.

Containerization is currently widely deployed due to its lightweight nature, responsiveness, resource density, and operational simplicity. The trace data of containers from large service providers is incredibly important yet not available to the research community. In 2017, Alibaba released trace data, cluster-trace-v2017 [14], which contains information on a production cluster of 1.3K machines running co-located online services and batch jobs over 24 hours (the container data is 12 hours long). In 2018, Alibaba released another trace data set, cluster-trace-v2018, of a production cluster system with 4034 machines which ran co-located jobs for eight days (some data spans nine days) [14]. Both cluster-trace-v2017 and cluster-trace-v2018 are captured from Alibaba's private cloud. The cluster-trace-v2018 trace data is by far the largest trace of co-located online services (long-running applications, LRA, i.e., containers) and batch jobs in production clusters. The exposure of this data to the public can help address the challenges that large Internet Data Centers (IDCs) face when online services and batch jobs are co-located. Characterization of this trace data may provide useful insights for scheduler cooperation between online services and batch jobs. It can also help trade off resource allocation between online services and batch jobs, balancing improved throughput of batch jobs against acceptable service quality and fast failure recovery for online services.

In our previous work [10], [11], [15], we analyzed the Taobao production cluster and the cluster-trace-v2017 and found many results that are useful for both industry and academia. Since cluster-trace-v2018 has more data (such as directed acyclic graph (DAG) information of jobs and failure statistics) and a more extended period (eight days) than cluster-trace-v2017, it is worth analyzing this newly released trace data for the public community. In this paper, we analyze the Alibaba cluster-trace-v2018 trace data along three dimensions, namely machines, batch jobs, and online services, including characteristics of resource usage, job distribution, job dependency, and job completion time. We find that:
1) The workload exhibits a daily cyclical fluctuation in CPU and disk I/O utilization of the servers. In such co-located clusters with online services and batch jobs, the resource usage of containers shows an apparent cyclical pattern only for CPU and disk I/O utilization, not for memory utilization. Moreover, the memory system has become the performance bottleneck of co-located data centers.
2) For batch jobs, the distributions of tasks and derived instances can be approximated by a Zipf distribution. This approximation can help the community design a better job scheduling model.
3) Batch jobs with DAG dependency information suffer from co-location with online services, since minimizing the performance loss of online services is given higher priority.
4) The resource usage (CPU and disk) of containers has similar cyclical fluctuations, consistent with the whole cluster, while their memory usage remains approximately constant. This usage pattern of containers provides an opportunity for job co-location and resource utilization estimation.
5) For interference elimination, the co-location of batch jobs with online services must be conducted based on online service performance, including mispredictions per kilo instructions, cycles per instruction, and memory bandwidth. The co-location should trade off between resource usage improvement and performance degradation.

Our findings and insights presented in this paper can help the industrial and academic communities better understand the workload characteristics and improve resource utilization in co-located Internet data centers.

The remainder of this paper is organized as follows. In Section 2, we give an overview of the cluster-trace-v2018 trace data. Section 3 provides the resource usage analysis results of all machines, such as resource usage pattern analysis and quantification, diurnal cycle modeling, and over-subscription evaluation. In Section 4, we give the results of our analysis of online services, i.e., containers, including application deployment size distribution, container resource usage, and application deployment load balancing. In Section 5, we present an analysis of the batch jobs and characterize the jobs, tasks, and instances in terms of job size distribution and task DAG dependency. In Section 6, we present an interference analysis of co-located workloads, including resource usage correlation, workload intensity and quantity, and performance impact on containers. In Section 8, we offer the related work on workload characterization and scheduling in cloud data centers. We summarize our work and conclude the paper in Section ??.

2 AN OVERVIEW OF THE ANALYZED DATASET

In December 2018, Alibaba released its second trace data from a real production cluster, namely, the cluster-trace-v2018 dataset. The cluster-trace-v2018 dataset is the latest trace data from a giant CSP. This trace data is from the Alibaba private cloud cluster, co-located and scheduled by the Alibaba Fuxi [16] and Sigma schedulers. Specifically, Fuxi is responsible for batch job scheduling while Sigma is for online container scheduling. In this trace, each machine is identical in hardware configuration, i.e., the same number


of processor cores and the same memory capacity. The co-location of jobs means that offline jobs and online services are mixed, i.e., containers (running online services) and batch jobs run simultaneously on the same machine of the cluster. There are 6 files in this trace data (in .csv file format). The dataset has three types of information, i.e., machines, containers, and batch jobs. We list the basic information of the cluster-trace-v2018 dataset in Table 1.
TABLE 1: Data volume of the trace

Type         File Name         Number of records   File size
machines     machine_meta      17,592              539.6KB
machines     machine_usage     246,637,252         8.4GB
containers   container_meta    370,540             17.8MB
containers   container_usage   4,015,763,787       164.1GB
batch jobs   batch_task        14,295,731          765.1MB
batch jobs   batch_instance    1,351,255,775       103.7GB
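
Because files of this size (for example, 8.4GB for machine usage) are awkward to load in one piece, per-machine averages can be accumulated in a single streaming pass. The following is a minimal sketch under stated assumptions, not the authors' tooling: the column order (machine id, timestamp, CPU percent, memory percent as the first four fields of a header-less CSV), the helper name `average_utilization`, and the sample values are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def average_utilization(rows):
    """Stream (machine_id, time_stamp, cpu, mem, ...) rows and return
    {machine_id: (avg_cpu, avg_mem)} without keeping the rows in memory.

    The column order is an assumption made for this sketch; adjust the
    indices to the actual schema of the trace file.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # cpu sum, mem sum, record count
    for row in rows:
        machine_id, cpu, mem = row[0], row[2], row[3]
        if cpu == "" or mem == "":
            continue  # skip records with a missing measurement
        acc = sums[machine_id]
        acc[0] += float(cpu)
        acc[1] += float(mem)
        acc[2] += 1
    return {m: (c / n, r / n) for m, (c, r, n) in sums.items()}


# Tiny in-memory stand-in for a header-less usage file (made-up values).
sample = io.StringIO(
    "m_1,0,30,90\n"
    "m_1,10,50,92\n"
    "m_2,0,10,40\n"
    "m_2,10,,41\n"
)
averages = average_utilization(csv.reader(sample))
print(averages)
```

For a real file, the same function can be fed with `csv.reader(open(path, newline=""))`, so memory use grows with the number of machines rather than the number of records.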

The entities in the dataset are machines, applications, containers, batch jobs, tasks, and instances. In a production system, a single job may include multiple tasks, and a single task may contain multiple instances. We list the numbers of each entity in the trace data in Table 2.

TABLE 2: Numbers of entities in the trace data

Machines   Applications   Containers   Jobs        Tasks        Instances
4,034      9,790          71,476       4,201,014   14,295,731   1,350,473,907
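
The job, task, and instance hierarchy behind these counts can be illustrated with a small sketch. The record layout `(job_name, task_name, instance_name)`, the helper name `count_entities`, and the sample identifiers are hypothetical and chosen only for illustration; they are not taken from the trace.

```python
from collections import defaultdict


def count_entities(instance_records):
    """Count distinct jobs, tasks, and instances from
    (job_name, task_name, instance_name) tuples.

    A task is keyed by (job, task) because, in this hypothetical layout,
    the same task name may appear under different jobs.
    """
    tasks_per_job = defaultdict(set)
    instances_per_task = defaultdict(set)
    for job, task, instance in instance_records:
        tasks_per_job[job].add(task)
        instances_per_task[(job, task)].add(instance)
    return {
        "jobs": len(tasks_per_job),
        "tasks": len(instances_per_task),
        "instances": sum(len(s) for s in instances_per_task.values()),
    }


# Made-up records: job j_1 has two tasks, job j_2 has one.
records = [
    ("j_1", "M1", "ins_1"),
    ("j_1", "M1", "ins_2"),
    ("j_1", "R2_1", "ins_3"),
    ("j_2", "M1", "ins_4"),
]
counts = count_entities(records)
print(counts)
```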

3 CHARACTERIZING CO-LOCATED WORKLOAD

3.1 Machine Usage

3.1.1 Overall Machine Usage and Grouping

Resource multiplexing and job co-location can increase resource utilization and reduce the energy consumption of a computing cluster. Resource management is very effective and useful for large CSPs who know their workload characteristics very well. Moreover, resource usage and performance optimization can be improved if an accurate prediction model can be established from such workloads. For example, a precise estimation of task completion and waiting time can help set up contention-aware scheduling in a co-located cluster. Therefore, resource utilization analysis is vital for the quality evaluation of such job co-location and resource multiplexing. The average CPU, memory, and disk I/O utilization of each machine over eight days (i.e., 192 hours in total) is presented in Fig 1.

Fig. 1: Machine utilization during eight days. Panels: (a) CPU utilization, (b) Memory utilization, (c) Disk utilization.

Fig 1 shows that the CPU utilization of all machines follows a cyclical trend over time. Most of the machines have higher memory utilization and lower disk utilization than CPU utilization. The median value of the average CPU utilization of all machines is 38.1%, while that value for memory is 90.75%. This fact suggests that the memory resource is almost used up during the week while the CPU utilization is reasonably high, considering the machines are co-located with online services and batch jobs. However, the disk I/O utilization is much lower. On day 8, the whole cluster appears to exhibit a notable increase in workload, as the CPU and memory utilization increase accordingly.

The resource usage reflects the workload on the corresponding machine. Fig 1 shows the overall resource usage trend over 8 days. Although the 4034 machines are identical in hardware configuration, different workloads may be scheduled on the machines at different times. Moreover, a production system usually leverages a load balancing policy to smooth workload variation among different machines. To investigate the potential cause of imbalance in the cluster, the 4034 machines in the cluster are grouped into four groups, according to the instance execution on different machines, as shown in Table 3. Table 3


shows that different subsets of machines execute different amounts of batch instances. The numbers suggest that although the whole cluster is co-located with online services and batch jobs, not all of the machines are co-located with batch jobs throughout the eight days.

TABLE 3: Machine group by scheduled instance numbers

Machine Type   Amount   Percentage   Min sched Inst   Max sched Inst
A              167      4.14%        0                0
B              16       0.40%        6                26400
C              76       1.88%        57778            87232
D              3775     93.58%       140461           493865

The cumulative scheduled instances on each machine are depicted in Fig 2. Fig 2 shows a noticeable segmentation trend and an increase across the four groups of machines listed in Table 3. Fig 2 also shows that instances are evenly distributed on machines of Type D. Table 3 shows that batch jobs are not perfectly distributed across machines, even though a load balancing mechanism is deployed in the cluster management system. One explanation is that varying user requests from outside are redirected to the online applications on some machines. Therefore, to secure the QoS guarantee of online services, batch jobs are not allowed to be scheduled on these machines. However, this also shows that the initial load balancing function of online services may not function well. This observation suggests one-directional interference that originates from the workload imbalance of online services and is passed on to batch jobs.

Fig. 2: Scheduled instances on machines

Table 3 also suggests that, apart from machines of Type A, B, and C, machines of Type D account for 93.58% of all machines. Therefore, machines of Type D are good representatives of co-located machines in the production cluster. Note that in Table 3, 167 machines of Type A are not scheduled to execute any batch jobs over eight days. As shown in Table 4, these 167 machines can be further split into two subsets, 19 of which (labeled as the M_Idle machine type) are idle all the time and 148 of which (labeled as the M_ContainerOnly machine type) only run online containers. We group the 4034 machines into 4 groups in Table 4. Note that the M_Idle and M_ContainerOnly subsets in Table 4 together make up the 167 machines that do not execute any batch jobs.

TABLE 4: Machine group by jobs co-location

Machine Type         Amount   Percentage   Grouping
M_Idle               19       0.47%        Type A machines as in Table 3
M_ContainerOnly      148      3.67%        Type A machines as in Table 3
M_BatchOnly          10       0.25%
M_Containers&Batch   3857     95.61%

The CPU usage of M_Idle, M_ContainerOnly, M_BatchOnly, and M_Containers&Batch machines is depicted in Fig 3. One explanation for the Type A machines is that these machines run some higher-priority online services. The average and median values of CPU utilization in Fig 3 are listed in Table 5. Fig 3 also suggests that if a machine is dedicated exclusively to running containers, its average and median CPU utilization are 7.51% and 8.04%, respectively. However, after co-location, the machine's average and median CPU utilization increase significantly to 39.28% and 38.18%, respectively. This fact validates the rationale of co-location in the production cluster for improved resource efficiency. Fig 3 shows that the peaks and valleys of online services and batch jobs are staggered and complementary. Therefore, job scheduling can adapt to this characteristic to smooth the resource utilization. Fig 3c also shows that these machines are well co-located. Fig 3 also shows that the CPU utilization of M_BatchOnly machines spans a wider range than that of M_ContainerOnly machines. Moreover, the CPU utilization of M_BatchOnly machines is much higher than that of M_ContainerOnly machines. This is because batch jobs are more computing-intensive than online services. Since online services are more latency-sensitive and their latency must not exceed the predefined range, the machines running online services must keep their CPU utilization in a lower range. Moreover, rigorous latency control for online services limits the resource multiplexing between online services and batch jobs. Therefore, the batch jobs account for the majority of the utilization of server CPU resources, and dominate the CPU usage on machines.

TABLE 5: CPU utilization of different types of machines

Machine Type         Average   Median
M_ContainerOnly      7.51%     8.04%
M_BatchOnly          26.68%    24.92%
M_Containers&Batch   39.28%    38.18%
M_Idle               0.89%     0.89%

3.1.2 Resource Usage and Quantification Modeling

In terms of CPU utilization, the distribution across machines is depicted in Fig 4. All machines' CPU utilization can be quantified as a Gaussian distribution (single Gaussian and double Gaussian), and both single and double Gaussian fittings have very high quantification precision values, while


double Gaussian fittings have a slightly higher fitting precision than single Gaussian fittings. The maximal, minimal, and median values of machine average CPU utilization are 60.5%, 0.0%, and 38.1%, respectively. Moreover, 90% of the machines exhibit CPU utilization between 30% and 50%.

Fig. 3: CPU utilization of machines (x axis is timestamp, y axis is average machine CPU utilization. Note that records in table machine_usage.csv around timestamp 150000s are missing and these values are set as 0)

Fig. 4: Gaussian fitting of machine average CPU utilization (x axis is average CPU utilization, y axis is machine numbers; curves: Single Gaussian Fitting, Double Gaussian Fitting)

As for machine memory utilization, its distribution is depicted in Fig 5. All machines' memory utilization can be quantified as a Gaussian distribution. The maximal, minimal, and median values of machine average memory utilization are 96.1%, 3.0%, and 90.75%, respectively. Moreover, 92% of the machines exhibit memory utilization between 87% and 92%.

Fig. 5: Gaussian fitting of machine average memory utilization (x axis is average memory utilization, y axis is machine numbers)

In Alibaba cluster-trace-v2017, memory utilization has higher fluctuations than CPU, and its memory utilization changes periodically over 24 hours [15]. However, in Alibaba cluster-trace-v2018, the memory utilization is more stable and fluctuates only slightly around a level above 90%, which shows that the analyzed cluster is currently almost using up its memory resources. Similarly, machines' disk utilization distribution is depicted in Fig 6. All machines' disk utilization can be quantified as a Gaussian distribution. The maximal, minimal, and median values of machine average disk utilization are 98.11%, 0.0%, and 7.42%, respectively. Moreover, 86% of the machines exhibit disk utilization between 5% and 11%.

Fig. 6: Gaussian fitting of machine average disk utilization (x axis is average disk utilization, y axis is machine numbers)

Fig 4 to Fig 6 show that the machines in the production cluster exhibit significant temporal and spatial imbalance in resource utilization in terms of CPU, memory, and disk utilization. Specifically, for workload characterization and modeling, resource utilization of machines can be quantified by a Gaussian distribution. The k-means algorithm is used to cluster all machines into six groups, and the Calinski-Harabasz metric is used to evaluate the clustering result as follows:

$$ s(k) = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \cdot \frac{m - k}{k - 1} \qquad (1) $$

Here s(k) is the score of clustering, where m is the number


of samples, k is the number of classes, B_k is the between-class covariance matrix, W_k is the within-class covariance matrix, and tr denotes the matrix trace. A higher score means a better clustering result.
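
To make the score in Eq. (1) concrete, the sketch below computes s(k) = tr(B_k)/tr(W_k) * (m - k)/(k - 1) for a labeled sample, using the fact that the two traces equal the between-cluster and within-cluster sums of squared distances. It is an illustration under stated assumptions, not the authors' code: the function name `calinski_harabasz` and the four sample points are hypothetical.

```python
def calinski_harabasz(points, labels):
    """Return s(k) = tr(B_k) / tr(W_k) * (m - k) / (k - 1).

    points: list of equal-length numeric tuples (one per sample).
    labels: cluster id for each point.
    tr(B_k) is the size-weighted squared distance of cluster centroids to
    the overall centroid; tr(W_k) is the squared distance of points to
    their own cluster centroid.
    """
    m = len(points)
    dim = len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= m:
        raise ValueError("the score needs 2 <= k < m")

    def centroid(ps):
        return [sum(p[d] for p in ps) / len(ps) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    tr_b = 0.0
    tr_w = 0.0
    for ps in clusters.values():
        c = centroid(ps)
        tr_b += len(ps) * sq_dist(c, overall)
        tr_w += sum(sq_dist(p, c) for p in ps)
    if tr_w == 0.0:
        return float("inf")  # every point sits exactly on its centroid
    return (tr_b / tr_w) * (m - k) / (k - 1)


# Four made-up points forming two tight pairs far apart on the first axis.
pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = calinski_harabasz(pts, [0, 0, 1, 1])  # groups the tight pairs
bad = calinski_harabasz(pts, [0, 1, 0, 1])   # splits each tight pair
print(good, bad)
```

The labeling that keeps the tight pairs together receives the much higher score, which matches the statement that a higher score means a better clustering result.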
The features of machine clustering are listed in Table 6, and the CDFs of their resource utilization are shown in Fig 7. Note that 13 machines with abnormal CPU utilization values (negative or greater than 100) are excluded before clustering. Therefore, the clustered machines in Table 6 number 4021 in total.

TABLE 6: Features of machine clustering according to resource utilization

Group Name   Feature vector: CPU   memory        disk          Amount
mGroup0      35.77-44.42           81.84-92.61   3.30-32.60    1711
mGroup1      0.0002-28.39          3.00-48.93    0.0-25.94     116
mGroup2      2.26-21.42            52.16-96.16   0.61-24.01    136
mGroup3      39.91-60.56           81.56-92.38   2.94-28.12    1028
mGroup4      34.72-58.79           81.58-92.08   40.82-98.11   32
mGroup5      20.18-37.65           49.44-94.62   2.74-20.35    998

For CPU utilization, mGroup1 and mGroup2 have similar values. In terms of memory utilization, mGroup0, mGroup3, and mGroup4 have overlapping values, while mGroup1 has much lower memory utilization. For disk utilization, mGroup4 has the highest values (50%-60% on average), while other groups have much lower disk utilization, 15% on average. Among all these machines, only some of them exhibit obvious group-wide usage patterns. For example, resource utilization of the 116 machines in mGroup1 is very low, while resource utilization of the 32 machines in mGroup4 is relatively high.
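
The grouping of machines by their (CPU, memory, disk) utilization vectors can be sketched with a plain Lloyd's k-means. This is a minimal illustration, not the authors' implementation: the initialization (k sample points drawn with a fixed seed), the function name `kmeans`, and the six feature vectors are assumptions made for the example, and the paper does not state which k-means variant or library was used.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Returns (centroids, labels). Initial centroids are k distinct sample
    points drawn with a fixed seed so that runs are reproducible.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Made-up (cpu, memory, disk) utilization vectors in percent:
# three lightly loaded machines followed by three heavily loaded ones.
machines = [
    (5, 20, 3), (6, 25, 4), (4, 22, 2),
    (40, 90, 10), (42, 91, 12), (41, 89, 11),
]
centroids, labels = kmeans(machines, k=2)
print(labels)
```

On this toy input the two load levels end up in separate clusters; on real data, the number of clusters would be chosen by comparing a quality score across several values of k.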


Fig. 7: CDF of resource utilization of 6 clustered machine groups. Panels: (a) CPU, (b) Memory, (c) Disk (x axis is machine average utilization, y axis is CDF).

3.1.3 Implications for Machine Usage

Over the past decades, advances in CPU speed have far out-paced advances in DRAM latency. Therefore, main-memory access has become a performance bottleneck for many computer applications, widely known as the "memory wall." Worse, workload consolidation may result in cache pollution and degrade memory system performance. For example, in the co-located cloud, resource multiplexing increases utilization but causes performance interference between containers and batch jobs. Different containers (or tenants in the Cloud) may suffer from cache flushes or pollution caused by their noisy neighbors [17].

One of the frequently used approaches to break the memory wall is to optimize cache allocation, isolation, and memory bandwidth partitioning [17]. In the analyzed cluster in this paper, most of the batch job and container workloads are written in Java, which has automatic memory management support. However, Java application performance is largely affected by its garbage collector (GC), which is responsible for collecting objects that are no longer being used. In a large-scale co-located environment like the analyzed cluster in this paper, the garbage collection of the Java Virtual Machine (JVM) is problematic and unable to efficiently support such huge co-located workloads with a huge number of generated objects in terms of throughput and pause times [18]. As analyzed in Fig 3, the batch jobs are both data-intensive and memory-intensive. Further, most of the business logic of batch jobs generates multiple objects during execution. Such code increasingly demands in-memory processing for high-volume, high-velocity data analytics to achieve performance goals, while high object churn and large heap sizes put severe strain on the garbage collector. Non-scalable garbage collection pauses can reduce throughput for batch workloads significantly, and cause high tail latencies for interactive applications. Therefore, the more batch jobs are co-located, the more Java objects are generated, and the worse the JVM garbage collection performs. This is the side effect of co-location beyond improved resource efficiency.


Our analysis in this paper shows that memory is becoming the new potential bottleneck on modern co-located machines: the CPU is not fully stressed, with average utilization around 38%, but memory utilization is reaching capacity. This fact suggests that the memory inefficiency of the JVM limits the further co-location of batch jobs and containers. Therefore, hardware-software co-optimization, including dynamic CAT (cache allocation technology) for CMPs (chip multiprocessors) and application-driven memory-usage pattern characterization, has been proposed by researchers [19]. For example, the application-layer optimizer characterizes data object usage patterns, allocates objects to appropriate memory locations/pages in the virtual address space, and sends the corresponding page-level memory usage information to the OS, while the OS-layer optimizer uses this guidance when deciding which physical page should back a particular virtual page, according to user-specified performance goals. More aggressively, using region-based memory management instead of GC in distributed data processing systems may reduce memory overheads significantly.

As discussed in the following section, since the workload in the analyzed cluster exhibits an apparent diurnal cyclical pattern, fine-grained source-code-level profiling may help reduce the memory footprint and improve system performance, and eventually lead the way to a higher co-location level in the data center. Finally, apart from the memory wall problem, the analyzed cluster exhibits an obvious imbalance of resource utilization, even though it is a production system with a load balancer.

3.2 Diurnal Cycles and Workload Self-Similarity

Fig 1 in Section 2 already showed that the CPU utilization of each machine in the cluster exhibits a daily cyclical fluctuation. We present the cluster-level CPU and memory utilization in Fig 8, which shows that the cluster CPU utilization peaks around 6 AM every day. We found that this is because the number of batch jobs peaks at 3 AM and, afterward, the system utilization increases until 6 AM. More specifically, we stack each day's CPU utilization of the cluster, which shows a noticeable diurnal pattern, and this observation validates the findings in [20].

[Figure omitted: average CPU and average memory utilization of the cluster versus time (h).]
Fig. 8: Cluster level resource utilization

Since memory utilization is almost always above 90%, the diurnal fluctuation of the workload shows up in the CPU utilization. Usually, people sleep at night and work during the daytime, and in the early morning few people visit online e-commerce websites, which creates the opportunity for the highest batch job co-location in the early morning. Moreover, the daily workload has a strong self-correlation, except for the 8-hour-long workload on day 8. To investigate the self-similarity of the daily CPU utilization, we calculate the correlation coefficients between days. Except for day 8, the minimal correlation coefficient between any two days is 0.744, and the average correlation coefficients of each day are 0.836, 0.882, 0.884, 0.844, 0.874, and 0.888 for day 1 and days 3 to 7, respectively.

We calculate the daily arrival rate of batch jobs and find apparent bursts at 3 AM every day except day 5, which exhibits one more burst at 8 PM. One possible explanation is that this outlier is due to online stress testing of the production cluster at that time. It is usually necessary to stress the production system with a co-located batch job workload to obtain real performance feedback. Fig 9 gives the batch job arrival statistics over eight days and shows that batch job arrivals peak at 3 AM, and then the machine utilization peaks at 4 AM. Considering all co-located workloads, the CPU utilization of all machines peaks at 6 AM, while the CPU utilization of the containers falls to its lowest at 4 AM every day. Fig 9 shows that, except for day 1, the cluster peaks at 4 AM for batch workloads, while at the same time the containers' CPU utilization is the lowest every day. It also suggests that the Fuxi scheduler scales for massive batch workloads, from 30K to 150K jobs per hour. The correlation coefficients of the daily batch job arrival rates are also calculated. Except for day 5, the minimal correlation coefficient between any two days is 0.782, and the average correlation coefficients of each day are 0.916, 0.921, 0.831, 0.894, 0.924, and 0.868 for days 2 to 4 and days 6 to 8, respectively.

[Figure omitted: number of submitted batch jobs (x10^4) versus time stamp (h).]
Fig. 9: Submitted batch jobs in each hour (8 days long)

We also compute the average CPU utilization of all online services, i.e., containers, in Fig 10, together with their correlation coefficients. Fig 10 shows that container CPU utilization peaks at 10 AM and falls to its lowest at 4 AM every day. Usually, people do not access online services from 2 AM to 6 AM. Therefore, batch jobs are scheduled to co-locate with online services to share the whole cluster. That is why the batch job arrival rate peaks at 3 AM, after online service requests begin to fall at 11 PM.

[Figure omitted: average container CPU utilization versus time of day (h), one curve per day for Day2 to Day9.]
Fig. 10: Diurnal patterns of container CPU utilization

[Figure omitted: number of machines versus over-subscription proportion, with an inset CDF, in two panels: (a) CPU oversubscription, (b) Memory oversubscription.]
Fig. 11: Machine resource over-subscription

Implication: The workload of most online websites exhibits a diurnal pattern. Therefore, the main idea behind co-location is to multiplex the server resources among online services and batch jobs. Batch jobs can be precisely scheduled to machines when requests to online services fall during the night. The trace data shows that the analyzed cluster is well co-located. However, workload characterization can enable more precise co-location scheduling and interference avoidance. Fig 10 shows the strong self-similarity of daily workloads and indicates that the workload of containers is highly predictable from day to day. Moreover, self-similarity can help identify potential workload exceptions or power exceptions in data centers. For example, if web accesses and I/O communications do not increase significantly but power consumption increases a lot, the data center may be under a power attack.
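The day-to-day correlation analysis of Section 3.2 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the cluster CPU utilization is available as one value per hour, splits it into 24-hour days, and reports the minimal pairwise Pearson correlation and the average correlation of each day with all other days. The function names and the synthetic series are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(vx * vy)


def daily_self_similarity(hourly, hours_per_day=24):
    """Return (minimal pairwise r, average r of each day against the others)."""
    days = [hourly[i:i + hours_per_day]
            for i in range(0, len(hourly) - hours_per_day + 1, hours_per_day)]
    if len(days) < 2:
        raise ValueError("need at least two full days")
    pair_r = {(i, j): pearson(days[i], days[j])
              for i, j in combinations(range(len(days)), 2)}
    per_day = []
    for d in range(len(days)):
        rs = [r for (i, j), r in pair_r.items() if d in (i, j)]
        per_day.append(sum(rs) / len(rs))
    return min(pair_r.values()), per_day


# Synthetic, made-up utilization: a daily sine wave plus a small perturbation
# that grows from day to day (NOT taken from the trace).
base = [0.30 + 0.10 * math.sin(2 * math.pi * h / 24) for h in range(24)]
series = []
for day in range(3):
    series += [u + 0.01 * day * ((h % 3) - 1) for h, u in enumerate(base)]

min_r, per_day_r = daily_self_similarity(series)
print(f"min pairwise r = {min_r:.3f}")
print("average r per day:", [round(r, 3) for r in per_day_r])
```

A day whose average coefficient is far below the others (as reported for day 8 in the CPU data and day 5 in the arrival data) would stand out in `per_day_r` and could be excluded before computing the minimum.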
3.3 Resource Over-subscription

Over-subscription is commonly used in Cloud computing environments for resource multiplexing and server consolidation. By reserving and promising more resources than are physically available, over-subscription can increase resource utilization and host more containers on the same machine. However, resource over-subscription has a significant negative impact on system performance when multiple co-located containers and batch jobs contend for system resources. For online services, we propose the following metric to quantify the resource usage ratio r:

$$ r = \frac{\sum_{j=1}^{n} \mathrm{resource\_of\_container}_j}{\mathrm{resource\_of\_machine}} \qquad (2) $$

If r > 1.0, the machine is oversubscribed. We plot the CPU and memory over-subscription ratios of the container workloads on all machines in Fig 11.

Fig 11 shows that 43.3% of machines are oversubscribed by container CPU usage, and 19.9% of machines have r < 0.5. In terms of memory, no machine is oversubscribed by containers, and 86.02% of machines have r < 0.5. Recall from Section 3.1 that almost all the machines used up their memory resources, while their CPU utilization was nearly 40%. Fig 11 suggests that containers are more memory-bound than compute-bound, while batch jobs are both compute-intensive and memory-intensive. To validate our findings, we depict the planned CPU and memory usage of batch jobs in Fig 12, which shows that the memory requested by each task of the batch jobs is distributed in the lower zone compared to the CPU, whose requests are more evenly distributed from 0 cores to 11 cores. The over-subscription analysis shows that the Sigma scheduler may prioritize the memory over the CPU of the target machine when allocating a new container to it. The lower CPU utilization of machines found in the former analysis is consistent with this. Specifically, containers consume more memory than CPU, which makes CPU over-subscription possible.

[Figure omitted: two log-scale histograms (vertical axis from 10^0 to 10^8) in two panels: (a) Planned CPU, over CPU cores planned per task; (b) Planned memory, over percentage of memory planned per task.]
Fig. 12: Planned CPU and memory usage of batch jobs

Implication: Resource over-subscription makes resource scheduling more challenging when the requested resources exceed the total physical resource limit, in which case SLOs (service level objectives) may be violated and QoS cannot be guaranteed. Workload characterization provides insights into workload patterns and opportunities for resource over-subscription. However, the over-subscription decision must be made carefully according to resource usage patterns.

4 CHARACTERIZING CO-LOCATED CONTAINERS

Since Alibaba is an e-commerce-oriented company, online services, including browsing, shopping, and transactions, are highly prioritized and guaranteed. Online browsing and shopping include frequent dynamic Java web page generation, image browsing, database queries, item searching and sorting, price matching and price change history queries, and some event- or trigger-based notification services such as temporary promotions and price change reminders. Online transactions include payment database ACID (atomicity, consistency, isolation, and durability) checking, and risk control processing such as fraud identification. These online services are interactive and very latency-sensitive, and high latency can lead to users leaving and business loss. Thus, a good user experience can favorably impact the CSP's financial revenue, and the low latency of online services must be guaranteed at a high priority. Online services are so essential that, in some cases, co-located batch jobs must be evicted to maintain a high QoE of online services. Moreover, there exists interference caused by co-locating batch jobs. Therefore, characterization of co-located containers can help understand the patterns of user access behavior and avoid the interference of co-location.

4.1 Application Deployment Size

In the Alibaba private cloud, a service is deployed in multiple containers spanning different machines, which together are called a deployment unit (DU). All the containers inside one DU are identical. In a real co-location environment, various applications are deployed in the same cluster. Since different applications may have different service sizes, we list the deployment amounts of each application in Fig 13 and the CDF of the number of machines deployed for one application in Fig 14. In the trace data, there are 9790 applications, where one application has at most 629 containers and at least one container. More specifically, an application may be deployed on at most 444 machines. This fact means that one application may have multiple containers on one single machine. Although this deployment may increase resource utilization, it may cause fault-tolerance issues if the machine is down or suffering from serious resource contention. Therefore, a soft fault-tolerance mechanism must be enhanced in such a co-location scenario to provide enough service availability based on multiple containers. Moreover, for 629 containers spanning 444 machines, load balancing is also critical to provide stable and continuous service provisioning.

To investigate the deployment size of each application, we compute the following statistics. The average and median numbers of machines deployed for each application are 6.5 and 2, respectively. Among all applications, 3067 (3067/9790 = 31.3%) are deployed on two machines, and 71.8% of all applications are deployed on 1, 2, or 3 machines. 80% of applications are deployed on fewer than six containers, while 90% of applications are deployed on fewer than 12 machines. Only 12 applications are deployed on more than 200 containers.

[Figure omitted: (a) number of containers versus application deployment unit ID (0 to 9790); (b) number of applications versus number of containers, both on log scales, with an inset CDF.]
Fig. 13: Deployment amounts of each application

[Figure omitted.]
Fig. 14: CDF of machine number deployed for one application

Fig 14 shows that 95% of the applications are deployed on fewer than 25 machines, and 99% of the applications are deployed on fewer than 99 machines. We list the machine distribution of each application deployment in Fig 15. Fig 15a shows that most deployment units span more than one machine. Fig 15b shows that 3067 applications are deployed on two machines, which is the most common deployment size. 71.8% of applications are deployed on fewer than four machines, and 90% of applications are deployed on fewer than 12 machines.

[Figure omitted: (a) Application deployment across machines: number of machines versus application deployment unit ID; (b) Application proportion according to deployed machines numbers: number of applications versus number of machines (log scale).]
Fig. 15: Machine distribution of single deployment application

Implication: Application deployment size is a crucial factor for resource management in a co-location environment. Load balancing and system monitoring services must scale when the deployment size increases with the workload. Since multiple containers of one application are distributed across multiple machines, communication and collaboration are frequent for software-level fault tolerance, load balancing, and request re-direction. Traditional co-location usually relies on proactive or ad-hoc heuristics based on the resources available during slack periods of online services. However, such co-location may cause interference between online services and batch jobs, resource over-commitment, and resource contention. Control-theory-based co-location is promising for QoS guarantees and resource contention optimization when the application deployment size increases.

4.2 Container Resource Usage

In a co-located environment, online services are latency-sensitive and may suffer from tail latency when concurrent batch jobs contend for resources. Typically, a container is allocated memory, CPU, and disk resources. CPU allocation is represented by the number of CPU cores. Memory and disk allocations are normalized values. Usually, containers are over-provisioned to guarantee better QoS. However, at co-location runtime, container resource usage is a crucial indicator of workload characteristics and can be used for performance bottleneck diagnosis and performance profiling.

The container resource usage is depicted in Fig 16. Note that the trace data spans from day 1 to day 9. However, there is no container information for day 1. Therefore, in Fig 16, all container-related data starts from day 2, i.e., after the first 24 hours. Fig 16 shows that the CPU utilization exhibits a cyclical pattern over the eight days, and it also exhibits a diurnal pattern from 9 AM to 9 PM every day. At 4 AM, CPU utilization falls to its lowest daily value. For all containers, CPU and disk utilization are much lower than memory usage.

[Figure omitted: average CPU, memory, and disk utilization (%) of containers versus time stamp (h), from hour 24 to hour 215.]
Fig. 16: Container resource usage

The container resource utilization is quantified with different distributions, as listed in Table 7 (x is the resource utilization and y is the number of containers with that resource utilization). For example, the average CPU and memory utilization follow an exponential distribution, while disk utilization follows a Gaussian distribution.

TABLE 7: Fitting function and parameters of container resource utilization distribution

Resource category | Fitting function | a | b | c | R-square
CPU | y = a * exp(b * x) | 5792 | -0.08659 | / | 0.9773
memory | y = a * exp(b * x) | 0.002976 | 0.146 | / | 0.6168
disk_io | y = a * exp(-((x - b)/c)^2) | 14430 | 8.203 | 2.452 | 0.9926

Implication: Through the analysis, we observe that the majority of containers exhibit low CPU and disk utilization but very high memory utilization. Moreover, more than 50% of containers show memory utilization of more than 90%. Since the QoS of online services is constrained by the upper limit of their memory resource provisioning, a server cannot accommodate more containers if its memory resources are reaching that upper limit. However, batch jobs may still be co-located on such machines for CPU and disk resource multiplexing while maintaining the co-location performance.
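The two fitting functions of Table 7 can be evaluated directly. The sketch below is only an illustration of the formulas with the parameters copied from the table; it assumes that x is expressed in percent (as in Fig 16), which the table itself does not state, and the helper names are hypothetical.

```python
import math


def exponential_fit(x, a, b):
    """y = a * exp(b * x): form listed in Table 7 for CPU and memory."""
    return a * math.exp(b * x)


def gaussian_fit(x, a, b, c):
    """y = a * exp(-((x - b) / c) ** 2): form listed in Table 7 for disk I/O."""
    return a * math.exp(-(((x - b) / c) ** 2))


# Parameters copied from Table 7.
CPU = dict(a=5792, b=-0.08659)
MEMORY = dict(a=0.002976, b=0.146)
DISK_IO = dict(a=14430, b=8.203, c=2.452)


def predicted_counts(utilizations, fit, params):
    """Predicted number of containers at each utilization level."""
    return [fit(x, **params) for x in utilizations]


grid = [0, 10, 20, 40, 80]  # utilization levels, ASSUMED to be in percent
cpu_counts = predicted_counts(grid, exponential_fit, CPU)
mem_counts = predicted_counts(grid, exponential_fit, MEMORY)
disk_peak = gaussian_fit(DISK_IO["b"], **DISK_IO)  # Gaussian maximum at x = b

for x, c_cpu, c_mem in zip(grid, cpu_counts, mem_counts):
    print(f"x={x:3d}%  CPU fit={c_cpu:10.2f}  memory fit={c_mem:10.4f}")
print("disk_io fit peaks at x = b with y =", disk_peak)
```

Because b is negative for CPU and positive for memory, the fitted container count decays with CPU utilization but grows with memory utilization, which matches the low-CPU, high-memory picture described in the Implication above. The memory fit has a much lower R-square (0.6168), so its values should be read with more caution.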


4.3 Load Balancing of Application Deployment

We first compute the average resource utilization over the eight days of each container belonging to the same application. Then we compute the average resource utilization of all the containers belonging to the same application and label this value as the application's average resource utilization.

We compute the distribution of the resource utilization of each application and find that the CPU utilization follows an exponential distribution with fitting curve $y = a \exp(b x)$ (x is the average CPU utilization, and y is the number of applications with such CPU utilization), where a = 2185, b = -0.252, and R^2 = 0.8925. This suggests that online services are over-provisioned to guarantee low latency; therefore, each container has very low CPU utilization.

In terms of the memory utilization of each application, the maximal, minimal, average, and median values are 99.99%, 0%, 67.24%, and 70.42%, respectively. Here the memory utilization follows a tri-Gaussian distribution with R^2 = 0.8596.

Similarly, the disk utilization follows a Gaussian distribution with fitting curve $y = a \exp(-((x - b)/c)^2)$ (x is the average disk utilization, and y is the number of applications with such disk utilization), where a = 2746, b = 8.258, c = 1.819, and R^2 = 0.9922. It is mainly distributed in the 5%-10% range, and 90% of applications have an average disk utilization of less than 10%.

Quantitative modeling shows that most of the online applications exhibit low CPU and disk utilization but high memory utilization, which is consistent with the machines' overall resource utilization in the cluster. Moreover, this also suggests an imbalance of resource utilization in CPU, memory, and disk across applications. More specifically, applications differ significantly in CPU and memory utilization. Through fitting, we find that the applications' average CPU utilization follows an exponential distribution while the average memory and disk utilization follow Gaussian distributions. Note that the memory usage here is not the exact total of actively used memory. It also includes memory that is no longer used but not yet reclaimed by the JVM. That is why memory usage of more than 90% is still allowed in the Alibaba production system. However, since it is difficult to precisely quantify the actively used memory in real time, current memory usage metrics remain meaningful and vital for workload characterization and resource efficiency quantification.

Implication: In a co-located cluster, memory is becoming the bottleneck that affects the performance of batch jobs and online services. Moreover, memory resources will be exhausted much faster if the efficiency of garbage collection does not scale with the increase of workload size. In Alibaba, most of the applications are written in Java, and the garbage collection efficiency of the JVM in a highly parallel environment may lead to memory reclaim delays and even failures. When memory usage reaches the high water level, batch jobs may be evicted to secure the QoS of online services. Therefore, efficient memory reclaiming and compression mechanisms are vital for system performance guarantees [21].

5 CHARACTERIZING CO-LOCATED BATCH JOBS

5.1 Job Size

5.1.1 Job Hierarchy Distribution

Usually, one batch job is divided into multiple tasks, and each task executes different business logic through its instances. The tasks belonging to one job form a directed acyclic graph (DAG) due to data dependency or procedure dependency. An instance is the smallest unit of batch job scheduling. For batch processing, all instances within a task execute the same application code with the same resource request but with different input data. To approximate the job distribution, we first depict the CDF of jobs and tasks in Fig 17.

[Figure omitted.]
Fig. 17: Job size in a job-task-instance hierarchy

Fig 17 shows that most batch jobs are small: more than 40% of jobs contain only one task, and more than 65% of tasks contain only one instance. For example, 95% of jobs have fewer than ten tasks, and 99% of jobs have fewer than 21 tasks. Moreover, for job size in terms of instance numbers, 95% of jobs have fewer than 1059 instances and 99% of jobs have fewer than 5728 instances. Similarly, for task size in terms of instance numbers, 95% of tasks have fewer than 270 instances, and 99% of tasks have fewer than 1241 instances. Among all the jobs, more than 30% contain only one instance. This fact suggests that lightweight jobs make up a high proportion of the total jobs. Among all the 14295731 tasks, 23557 do not have any instance information; therefore, there are 14272174 valid tasks. Among them, the average number of instances per task is 98, while the median is 1.

Moreover, among all the 4034 machines, there are ten which only run batch jobs without any co-located online services. On these ten machines, there are 842148 tasks, the average number of instances per task is 1331, and the median is 372. We present the comparison of instance numbers between all tasks and those run on machines without co-location in Fig 18. Fig 18 shows that at the 65.5th percentile, the number of instances per task is 1 for all tasks, while for tasks without co-location the value is 668. This observation suggests that bigger tasks may be executed on dedicated machines to achieve better performance.

[Figure omitted: annotated values 0.655 and 1111.]
Fig. 18: CDF comparison of jobs size in all tasks and tasks without co-location

To quantify the distribution of task size, we compute the frequency of tasks for each number of instances. We list the ten most frequent task sizes among all tasks and among tasks without co-location in Table 8. For all tasks, 65.5% have only one instance, and 5.0% have three instances. However, these 65.5% of tasks only account for 0.7% of all instances in the trace data. Moreover, for tasks on machines without co-location, the two most frequent sizes are 1111 instances (4.7%) and 1 instance (3.2%). Table 8 shows that among the top 10 task sizes without co-location, 8 have more than 100 instances per task. Tasks without co-location are much bigger than tasks overall. Although most of the tasks (more than 65.5%) are small and contain only one instance, there are many tasks that contain more than 90000 instances each. For example, each of the top 1000 biggest tasks contains 85570 instances on average, with a median of 86496.

TABLE 8: Top 10 tasks with the largest proportion among all tasks (sorted by proportions)

Rank | All tasks: instances per task | All tasks: frequency | Tasks without co-location: instances per task | Tasks without co-location: frequency
1 | 1 | 9353657 (65.5%) | 1111 | 39976 (4.7%)
2 | 3 | 720244 (5.0%) | 1 | 26746 (3.2%)
3 | 2 | 294228 (2.1%) | 1018 | 7383 (0.9%)
4 | 5 | 282959 (2.0%) | 351 | 4777 (0.6%)
5 | 7 | 218887 (1.5%) | 179 | 4456 (0.5%)
6 | 4 | 162229 (1.1%) | 371 | 4365 (0.5%)
7 | 9 | 149685 (1.0%) | 3 | 4360 (0.5%)
8 | 11 | 112828 (0.8%) | 796 | 4002 (0.5%)
9 | 13 | 106102 (0.7%) | 556 | 3602 (0.4%)
10 | 6 | 82218 (0.6%) | 223 | 3246 (0.4%)
In total | | 80.3% | | 12.2%

5.1.2 Zipf Distribution of Job Size and Job Imbalance

Fig 19 shows the CDF of the number of instances per task. For example, the top 1000 biggest tasks (0.007% of all tasks) account for 6.1% of all instances. Moreover, 9.53% of tasks contain 95% of the instances of all tasks, and 24.38% of tasks contain 99% of the instances of all tasks. This shows a significant imbalance of the task size distribution in the whole analyzed trace. This result is consistent with the results of cluster-trace-v2017 [15].

[Figure omitted: CDF versus proportion of instances of the tasks (sorted by task size), with annotated points (9.53%, 0.95) and (24.38%, 0.99).]
Fig. 19: CDF distribution of number of instances per task (x axis is the proportion of instances of the tasks sorted by task size)

We group the tasks by their size, rank the sizes by the frequency with which they occur, and model the task size with a Zipf distribution as follows:

$$ P(r) = \frac{C}{r^{\alpha}} \qquad (3) $$

where r is the rank of the frequency of a given number of instances per task, P(r) is that frequency, and C and α are distribution constants.

For all tasks, the fit gives α = 3.491, C = 9.349e6, and R^2 = 0.998. For tasks excluding the top 10 biggest tasks, the fit gives α = 1.159, C = 1.232e6, and R^2 = 0.996. This result suggests that the ranks and frequencies follow a Zipf distribution with very high coefficients of determination.

Implication: For workload generation, simulation, and scheduling modeling, job size can be modeled as a Zipf distribution. This result also suggests that there is an imbalance of task size. Moreover, a production cluster must have enough capacity to host the maximum size of a task and a job. The dominance of small tasks suggests that the service provider need not reserve large amounts of capacity within each cluster for batch job growth. In the analyzed cluster, batch jobs have a lower priority than online services and can be deferred or evicted when high-priority online services do not have enough resources. Meanwhile, bigger tasks can be executed on separate machines to achieve their typical performance and avoid the penalties of co-locating with online services.

5.2 Task Dependency

In batch jobs, many tasks depend on other tasks. With batch task dependencies, users can create tasks that are scheduled for execution on compute nodes after the completion of one or more preceding tasks. Specifically, a task cannot start execution until its preceding tasks finish. These dependencies are represented by DAG information in the trace data.

Fig 20 gives the DAG of job j_2459, which contains 17 tasks and 11509 instances, with 16 dependencies.
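The rule stated above, that a task cannot start until all of its preceding tasks finish, can be sketched with Kahn's algorithm. The example below is an illustration only: the function name, the (parent, child) edge format, and the five-task job are hypothetical and are not taken from the trace or from job j_2459. It groups tasks into "waves", where every task in a wave has all of its predecessors in earlier waves.

```python
from collections import defaultdict


def schedule_waves(tasks, deps):
    """Group tasks into waves that respect the dependencies.

    tasks: iterable of task ids
    deps:  iterable of (parent, child) pairs; the child waits for the parent
    """
    tasks = list(tasks)
    children = defaultdict(list)
    pending = {t: 0 for t in tasks}  # number of unfinished predecessors
    for parent, child in deps:
        children[parent].append(child)
        pending[child] += 1

    ready = [t for t in tasks if pending[t] == 0]
    waves, done = [], 0
    while ready:
        waves.append(sorted(ready))
        next_ready = []
        for t in ready:
            done += 1
            for c in children[t]:
                pending[c] -= 1
                if pending[c] == 0:  # last predecessor just finished
                    next_ready.append(c)
        ready = next_ready

    if done != len(tasks):
        raise ValueError("dependency cycle: the task graph is not a DAG")
    return waves


# Hypothetical job: tasks 1 and 2 feed task 3, which feeds tasks 4 and 5.
deps = [(1, 3), (2, 3), (3, 4), (3, 5)]
waves = schedule_waves(range(1, 6), deps)
print(waves)  # [[1, 2], [3], [4, 5]]
```

Tasks inside one wave have no dependency on each other, so a scheduler is free to run them in parallel, subject to available resources.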


Note that the number along each edge is the execution time (in seconds) of each task, and the number in each vertex is the task id. For better representation, we add a starting point and an ending point to the DAG. The completion time of job j_2459 is 519s (the sum of the completion times on the critical path, i.e., the ideal job execution time). In comparison, its waiting time is 254s, which represents a 32.86% communication ratio over its makespan.

[Figure omitted: panels (a) and (b) plot the CPU utilization and the average CPU utilization (in cores) of batch jobs and online services versus time (hour).]
Fig. 21: Correlation of CPU usage
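The critical-path time and the waiting ratio discussed for job j_2459 can be sketched as follows. This is a minimal illustration under stated assumptions: the four-task DAG and its durations are hypothetical (the edges of Fig 20 are not reproduced here), and the makespan is assumed to be the critical-path time plus the waiting time, a reading that is consistent with the reported 32.86% (254 / (519 + 254)).

```python
from functools import lru_cache


def critical_path_time(duration, deps):
    """Longest start-to-end path in seconds, i.e. the ideal job execution time.

    duration: {task: execution time in seconds}
    deps:     iterable of (parent, child) pairs; must form a DAG
    """
    parents = {t: [] for t in duration}
    for parent, child in deps:
        parents[child].append(parent)

    @lru_cache(maxsize=None)
    def finish(task):
        # Earliest finish = own duration + latest finish among predecessors.
        return duration[task] + max((finish(p) for p in parents[task]), default=0)

    return max(finish(t) for t in duration)


def waiting_ratio(critical_s, waiting_s):
    """Share of the makespan (ASSUMED = critical path + waiting) spent waiting."""
    return waiting_s / (critical_s + waiting_s)


# Hypothetical job, NOT job j_2459: A and B feed C, which feeds D.
duration = {"A": 30, "B": 50, "C": 10, "D": 5}
deps = [("A", "C"), ("B", "C"), ("C", "D")]
ideal = critical_path_time(duration, deps)  # path B -> C -> D
print("critical path:", ideal, "s")

# Numbers reported in the text for job j_2459.
ratio_j2459 = waiting_ratio(519, 254)
print(f"waiting ratio: {100 * ratio_j2459:.2f}%")
```

The longest path is used rather than the sum of all task times because independent tasks can overlap; only the chain of dependent tasks bounds the ideal completion time.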

Fig. 20: DAG representation of job j_2459 (vertices are task ids between the added start and end points; edge labels are task execution times in seconds)

We list the dependency statistics of all tasks in Table 9. We also compute the task size as shown in Table 9, which shows that among all the scheduled instances, 96.9% belong to tasks with DAG, and 3.1% do not. In terms of task size, each task with DAG contains 113 instances on average, while each task without DAG has only 19 instances on average. This fact also suggests that tasks with DAG dependencies are bigger than those without DAG dependencies.

Implication: In a batch processing system, end users can define task dependencies to run a task or set of tasks only after a parent task has been completed. Some examples of this scenario include MapReduce-style workloads in the cloud, jobs whose data processing tasks can be expressed as a DAG, or some pre-rendering and post-rendering processes, where each task must complete before the next task can begin. Failure of the parent task may cause a significant delay in its successive tasks. Our findings show that the communication time ratio over a job's makespan increases when the total number of dependencies increases. This result suggests that dependency structure optimization and depth reduction may help decrease the communication time ratio over a job's makespan (due to page limitations, we omit the detailed quantitative analysis).

Furthermore, the wait time of tasks is influenced by both execution time and interval time. However, the above findings show that the execution time is highly affected by the size of tasks and their topologies. As a result, compared to execution time, interval time is the more promising target for optimization. Since the scheduler assigns a lower priority to batch jobs, tasks always line up to be processed. As they are not latency-sensitive, the processing queue order can be adjusted to improve overall efficiency.

6 INTERFERENCE BETWEEN CO-LOCATED WORKLOADS

Deploying online services and batch jobs on the same server will inevitably cause competition for resources (CPU or memory). This section discusses the main factors that drive changes in server resource usage and the mutual interference between co-located workloads.

6.1 Main Factors of Resource Utilization Fluctuation

To explore the impact on CPU resource usage caused by co-located services and batch workloads, we analyze the CPU resource usage of online services and batch tasks within the time range of 24h-192h, as shown in Fig 21. The results in Fig 21 show that the CPU usage of both online and batch workloads exhibits daily periodic fluctuations. In the figure, peaks and valleys of the two lines are staggered, indicating that the two types of workloads complement each other in the use of CPU resources. In other words, when the CPU resource usage of the online service is high, the CPU usage of the batch jobs is low, and vice versa. Moreover, we calculated the correlation between the CPU resource usage of containers, batch jobs, and the entire cluster. The correlation coefficient of the CPU resource usage of containers and batch jobs is -0.17. The correlation coefficient of batch jobs and CPU usage of the entire cluster is 0.79, and the correlation coefficient between containers and the CPU usage of the entire cluster is 0.36. Therefore, it can be inferred from the data that 1) batch jobs are the main factor leading to the rise of the entire cluster's CPU resource usage, and 2) ideally, the correlation coefficient of the CPU usage of batch jobs and containers should be -1, whereas it is only -0.17. This result means that the co-located strategy can still be improved in Alibaba’s co-located cluster.

To improve the entire cluster’s CPU resource efficiency, operators need to locate the source of the increase in CPU resource usage. It is necessary to investigate whether the fluctuation of the CPU resource usage of the cluster is due to the rise in the number of workloads or the increase in the CPU usage of a single-source workload. Fig 21(b) shows the average CPU usage for batch jobs and containers. It can be seen that the average CPU usage also exhibits periodic fluctuations. Moreover, we calculated the correlation coefficient between the average and total CPU usage of containers and batch jobs. The correlation coefficient between the average and total CPU usage of containers is 0.82, while the correlation coefficient of the average and total CPU usage of batch jobs is -0.23. This result suggests that the fluctuation of the


CPU usage of containers is more likely caused by changes in the CPU usage of the containers themselves.

TABLE 9: Task dependencies statistics of all jobs

                            No DAG               With DAG                                        In total
                                                 with Dependency         without Dependency
Jobs                        1,025,989 (24.4%)    3,175,025 (75.6%)                               4,201,014 (100%)
Tasks                       2,283,759 (16.0%)    6,728,933 (47.0%)       5,283,039 (37.0%)       14,295,731 (100%)
Scheduled instances         41,779,379 (3.1%)    1,309,476,396 (96.9%)                           1,351,255,775 (100%)
Instances per task (avg.)   19                   113                                             98

(Rows with a single With DAG value report that value for the With DAG group as a whole.)

Fig. 22: Correlation of memory usage. (a) Memory usage of batch jobs and online services versus time (hour); (b) average memory usage of batch jobs and online services versus time (hour).

Fig. 23: Correlation of workload intensity and quantity between containers and batch tasks. Panels (a) and (b) plot the numbers of containers, jobs, tasks, and instances versus time (hour).

Fig 22(a) shows the change of the total memory usage of batch tasks and containers with time, and Fig 22(b) shows the change of the average memory usage of batch tasks and containers with time. Similarly, due to anomalies in the data, the container changes between 24h and 39h will not be analyzed. It can be seen from Fig 22 that the total and average values of container memory usage change in almost the same way, which suggests that the memory usage of each container fluctuates little but remains high. The memory usage of the batch jobs changes periodically, and the correlation coefficient between the number of instances and the memory usage is 0.79, which suggests that the change in the memory usage of the batch jobs is mainly due to the change in the number of batch jobs themselves.

Furthermore, we investigate the relationship between CPU usage and the number of containers and batch tasks, and calculate the fluctuation of the number of containers and instances of batch tasks over time, as shown in Fig 23. Since batch jobs are composed of a job-task-instance three-layer structure, the changes in the number of jobs and tasks with time are also investigated. In Fig 23(a), the number of containers is relatively small from 24h to 39h, and the container records are missing within that period, which suggests the data in that period is abnormal. After 39h, the number of containers is stable near 60,000, and there appears to be no apparent relationship between the change in CPU usage and the number of containers.

As shown in Fig 23, the numbers of batch jobs, tasks, and instances all show periodic fluctuations. We calculated the correlation coefficient between the number of offline tasks at different levels (job, task, instance) and CPU usage. The correlation coefficients between the CPU usage and the number of batch instances, batch tasks, and batch jobs are 0.88, 0.77, and 0.63, respectively. Therefore, it can be inferred that the fluctuation of CPU resource usage of batch tasks is mainly related to the number of batch tasks deployed.

In summary, the above analysis suggests that:
1) Batch jobs have a greater impact on the cluster’s CPU resource usage, and containers have a greater impact on the cluster’s memory usage;
2) Batch jobs’ resource usage is more related to the number of batch workloads;
3) Containers’ resource usage is more correlated with the resource usage of one single source workload.

6.2 Impact on Online Container Services

In cluster-trace-v2018, the fields CPI (Cycles Per Instruction), MPKI (Mispredictions Per Kilo Instructions) and Mem_GPS (Memory Bandwidth) are metrics that reflect container performance. However, it should be noted that the CPI, MPKI, and Mem_GPS fields are not recorded continuously, and the timestamp range for recording these fields is [553010, 691200], which means that these fields only appear in the last two days of the eight days.

We first calculate the average values of CPI, MPKI and Mem_GPS of the containers deployed on co-located and online service-only machines, respectively, for each timestamp in the interval [553010, 691200]. Then we calculate the maximum, minimum, average and standard deviation of these per-timestamp averages, which are shown in Table 10 to Table 12. The maximum value and standard deviation of CPI and MPKI of the containers on co-located machines are significantly larger than those of the containers on non-co-located machines, which indicates that the response latency and cache miss rate of the containers become more unstable after co-locating containers and batch workloads. Moreover, the Mem_GPS of the containers deployed on the co-located machines is much lower than that of the containers deployed on the non-co-located machines, which means that the memory bandwidth of the container is sharply reduced after co-locating workloads. One reasonable explanation is that the batch jobs occupy part of the memory resources, while the original memory resources are relatively scarce. This has a great impact on the performance of online services. Specifically, the co-location of batch jobs increases the uncertainty of online services, and the QoS of the containers is also degraded.
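The two-step aggregation described above (average each metric per timestamp, then summarize those per-timestamp averages) can be sketched as follows. This is a minimal illustration under stated assumptions: the `(timestamp, value)` record layout and the name `summarize_metric` are hypothetical and do not reflect the trace's actual schema, the sample values are made up, and since the text does not say whether a population or sample standard deviation was used, the population form is assumed.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_metric(records, start=553010, end=691200):
    """Average a metric per timestamp, then summarize those averages.

    records: iterable of (timestamp, value) pairs for one machine group,
             e.g. CPI samples of containers on co-located machines.
    """
    by_timestamp = defaultdict(list)
    for timestamp, value in records:
        if start <= timestamp <= end:  # keep only the window where the fields exist
            by_timestamp[timestamp].append(value)
    if not by_timestamp:
        raise ValueError("no samples inside the requested interval")

    per_timestamp_avg = [mean(values) for values in by_timestamp.values()]
    return {
        "max": max(per_timestamp_avg),
        "min": min(per_timestamp_avg),
        "average": mean(per_timestamp_avg),
        "std": pstdev(per_timestamp_avg),  # population form; an assumption
    }


# Made-up samples: two timestamps inside the window, one outside it.
samples = [(553010, 1.0), (553010, 2.0), (553020, 2.0), (553020, 3.0), (100, 9.0)]
print(summarize_metric(samples))
```

Running the same function once on the co-located group and once on the online service-only group would yield one row each of a table shaped like Tables 10 to 12.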


Fig. 24: Performance impact on online containers suffering from co-located batch tasks. Panels: (a) CPI, (b) MPKI, (c) Mem_GPS.

TABLE 10: Data characteristics of CPI

                  Max      Min      Average   Standard Deviation
co-located        2.090    1.170    1.580     0.117
non-co-located    1.680    1.380    1.577     0.012

TABLE 11: Data characteristics of MPKI

                  Max      Min      Average   Standard Deviation
co-located        1.460    1.020    1.075     0.038
non-co-located    1.300    1.000    1.057     0.016

TABLE 12: Data characteristics of Mem_GPS

                  Max      Min      Average   Standard Deviation
co-located        0.039    0.060    0.124     0.037
non-co-located    1.040    0.580    0.779     0.043

To further investigate the performance impact of batch workloads on online services, we statistically analyze the relationship between container performance and the number of batch workloads deployed. Since results show that the number of batch workloads deployed peaks at around 4 AM every day, we only select the [600000, 640000] time interval (about 4 AM on the 8th day) for statistical analysis and remove the abnormal data. The results are shown in Fig 24.

Fig 24 shows the relationship between the number of batch workloads deployed and CPI, MPKI, and Mem_GPS, respectively. The left y-axis of each sub-figure is the average of each statistical indicator, and the right y-axis is the number of batch workloads deployed. The red curve reflects the trend of the three indicators over time and the blue curve shows the trend of the number of batch workloads deployed over time. As shown in Fig 24, the number of batch workloads deployed is negatively correlated with the three indicators of container performance. This result means that when CPI, MPKI, or Mem_GPS of the containers increases, the number of batch workloads deployed on co-located machines decreases. Therefore, it can be inferred that the cluster scheduler determines the number of batch workloads deployed on a machine based on the containers’ current runtime performance. When containers reach a stable state, the scheduler increases the number of batch tasks deployed on the physical machine where the containers are located. However, under such circumstances, the containers’ performance might in turn decrease if batch task deployment bursts. For instance, the containers’ contention intensity is alleviated when CPI and MPKI show a downward trend. As a result, idle resources are left on the machines that already host the containers. The co-located scheduler then tends to dispatch batch workloads onto such machines to improve resource efficiency. The CPI of the containers goes up as the number of deployed batch workloads increases, indicating an upward trend in the latency of the online services in the containers. Thus, in this case, the scheduler starts to set a ceiling on the number of batch tasks deployed on the machines, to ensure that the QoS of the online services can be guaranteed.

Dynamic adjustment of the co-located ratio between batch jobs and containers can be enabled by modifying the number of batch tasks deployed according to container performance. However, such an adjustment method inevitably results in the batch workload continuously increasing and decreasing, which leads to container performance fluctuation. Therefore, the QoS of container services cannot be consistently guaranteed. Also, the overhead of the scheduler increases in such a situation. Choosing a suitable co-located ratio of containers and batch workloads is a significant challenge facing the data centers that are currently adopting co-located technology. Adaptive scheduling is a promising way to dynamically adjust the scheduling strategy according to the runtime status of the system. However, adaptive scheduling faces greater challenges in a co-located cluster because the coordinated architecture of multiple schedulers adds more complexity to the scheduler design. Further, machine learning is also a promising way to mine potential relationships between workloads’ features from large-scale trace data and can be used to establish more efficient scheduling strategies.

7 RELATED WORK

Workload characterization is an essential process for data center operators to identify system bottlenecks and figure out solutions for optimizing performance. Therefore, to properly design and provision current and future services


in data centers, much effort has been made to understand cloud workloads’ nature. For example, Cortez et al. [22] present a detailed characterization of several VM workload behaviors from Microsoft Azure. They analyze the workloads’ key characteristics by VM lifetime, VM deployment size, and resource consumption. Jia et al. [23] identify that the data analysis workloads are significantly diverse in terms of both speedup performance and micro-architectural characteristics in data centers.

Workload characterization can also help understand workload patterns and design better job scheduling policies in data centers. Many workload-characteristics-aware resource management and job scheduling policies have been proposed. For example, Guo et al. [24] divide the workload into two categories: delay-sensitive interactive workload and delay-tolerant batch workload, and they present the Stochastic Cost Minimization Algorithm (SCMA) for workload management. Chen et al. [25] and Chong et al. [26] argue that workload characterization is important to optimize the workload management in data centers, and they dynamically make decisions with the knowledge of the newly emerging workload. Khan et al. [4] predict resource demand, resource utilization, and job/task length for resource provisioning or job scheduling purposes. They estimate the future resource needs of applications, allocate the resources in advance, and release them once they are no longer required.

Resource utilization is the key factor that affects the data center resource efficiency and energy efficiency [6], [27]. Therefore, the resource usage pattern in data centers can help understand the resource efficiency and workload distribution in data centers. Sun et al. [28] present an overview of different kinds of resource management mechanisms for data centers. Tan et al. [29] and Mazumdar et al. [30] propose methods for analyzing resource usage and modeling resource usage patterns. Contemporary data center operators consolidate and co-locate workloads on the same computing cluster to increase resource utilization, using techniques such as virtualization and co-location. Ahmad et al. [31] analyze and compare the current VM migration and server consolidation frameworks. Varasteh et al. [32] survey virtual machine migration and server consolidation, the parameters, and algorithmic approaches used to consolidate VMs onto PMs. Mastroianni et al. [33] present ecoCloud, a self-organizing and adaptive approach for the consolidation of VMs on two kinds of resources, namely CPU and RAM, to limit the number of VM migrations and server switches. Also, since consolidating different applications may lead to a drop in performance, Chen et al. [34] develop a light-weight, non-intrusive methodology to achieve application-centric performance targets while consolidating homogeneous and heterogeneous applications together. Besides, to maintain high resource utilization, various resource allocation strategies have been proposed for CPU and RAM. For example, Warneke et al. [35] propose an approach to improve memory utilization.

Although there has been research on traditional workload characterization, server consolidation, and VM migration in data centers, workload characterization on co-located data centers is rare. There have been some studies analyzing Alibaba's trace data cluster-trace-v2017 [36], [37], [38], [39]. They focus on imbalance phenomena in the cloud. In our previous work [15], we characterize the cluster-trace-v2017 in new dimensions including failure patterns and correlations among CPU and memory usage. Guo et al. [20] analyze the Alibaba cluster-trace-v2018 trace data and focus on the resource efficiency. They identified that the memory resource is becoming the performance bottleneck in Alibaba's co-located data center.

Tian et al. [40] analyze the Alibaba cluster-trace-v2018 trace data, and they mainly focus on one aspect of production cluster workload, i.e., the dependencies of tasks. They focus on the gathering or scattering characteristics of DAG jobs and the in-degree and out-degree statistics of DAGs. For example, the results show that recurring jobs are widespread, and their resource usage and duration are predictable.

In this paper, we conduct a comprehensive analysis of Alibaba cluster-trace-v2018 in all three dimensions: (1) machines, including machine usage and diurnal cyclic patterns; (2) online services, including application deployment size distribution, container resource usage, and application deployment load balancing; and (3) batch jobs, including task DAG dependencies, job completion time and job size. Our work provides a different profile of a production cluster compared to existing work, including the quantitative analysis of jobs at different granularities (job, task, and instance).

8 CONCLUSIONS

Trace data from a production cluster can help both industry and academic communities design better job schedulers and optimize resource usage. Contemporary CSPs co-locate online services and batch jobs on the same clusters to increase machine utilization and reduce energy costs. However, the mixture of online services and batch jobs also results in scheduling complexity and interference among online services and batch jobs. To understand the workload characteristics of co-located data centers, characterizing the workload in terms of task arrival patterns, job hierarchy, resource usage, job distribution, and job inter-dependency is vital but complicated for both data center operators and researchers.

In this paper, we analyze various characteristics of co-located online services and batch jobs from a production cluster in Alibaba's private Cloud. We present a detailed analysis of machines, containers, and batch workloads. We found that the workload exhibits a daily cyclical fluctuation in terms of CPU and disk I/O utilization of the servers. Moreover, we found that batch jobs, including their tasks and derived instances, can be approximated by a Zipf distribution. This quantification can help the community design a better job scheduling model. We also identified that co-locating batch processing jobs of modern big data analytics may cause garbage collection failures and memory resource shortages, especially for managed programming languages like Java and Scala. Furthermore, we also analyzed the DAG dependencies of batch jobs. Our findings and insights presented here can help data center operators better understand the workload characteristics and improve resource utilization in co-located cloud data centers.


ACKNOWLEDGMENTS

This work is supported by the Natural Science Foundation of China (No. 61972118), and the Key Research and Development Program of Zhejiang Province (No. 2018C01098). We also greatly appreciate the discussions with the Alibaba Fuxi and Sigma scheduler team members: Haiyang Ding, Qi Lv, Zheng Cao, Liping Zhang, and many other engineers. Their thoughtful discussions and insights inspired our work presented in this paper.

REFERENCES

[1] C. Jiang, Y. Wang, D. Ou, B. Luo, and W. Shi, “Energy proportional servers: Where are we in 2016?” in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2017, pp. 1649–1660.
[2] H. Fuchs, A. Shehabi, M. Ganeshalingam, L. Desroches, B. Lim, K. Roth, and A. Tsao, “Characteristics and energy use of volume servers in the United States,” 2017.
[3] M. C. Calzarossa, L. Massari, and D. Tessera, “Workload characterization: A survey revisited,” ACM Computing Surveys (CSUR), vol. 48, no. 3, p. 48, 2016.
[4] A. Khan, X. Yan, S. Tao, and N. Anerousis, “Workload characterization and prediction in the cloud: A multiple time series approach,” in 2012 IEEE Network Operations and Management Symposium. IEEE, 2012, pp. 1287–1294.
[5] J. Xue, F. Yan, R. Birke, L. Y. Chen, T. Scherer, and E. Smirni, “Practise: Robust prediction of data center time series,” in 2015 11th International Conference on Network and Service Management (CNSM). IEEE, 2015, pp. 126–134.
[6] D. F. Snelling and C. S. van den Berghe, “Characterization of data center energy performance,” Fujitsu Sci. Tech. J, vol. 48, no. 2, pp. 230–236, 2012.
[7] C. Jiang, Y. Wang, D. Ou, Y. Li, J. Zhang, J. Wan, B. Luo, and W. Shi, “Energy efficiency comparison of hypervisors,” Sustainable Computing: Informatics and Systems, vol. 22, pp. 311–321, 2019.
[8] Z. Liu, A. Wierman, Y. Chen, B. Razon, and N. Chen, “Data center demand response: Avoiding the coincident peak via workload shifting and local generation,” Performance Evaluation, vol. 70, no. 10, pp. 770–791, 2013.
[9] K. Wang, M. Lin, F. Ciucu, A. Wierman, and C. Lin, “Characterizing the impact of the workload on the value of dynamic resizing in data centers,” in ACM SIGMETRICS Performance Evaluation Review, vol. 40, no. 1. ACM, 2012, pp. 405–406.
[10] Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou, “Workload characterization on a production Hadoop cluster: A case study on Taobao,” in 2012 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2012, pp. 3–13.
[11] Z. Ren, J. Wan, W. Shi, X. Xu, and M. Zhou, “Workload analysis, implications, and optimization on a production Hadoop cluster: A case study on Taobao,” IEEE Transactions on Services Computing, vol. 7, no. 2, pp. 307–321, 2013.
[12] https://github.com/google/cluster-data/.
[13] https://github.com/Azure/AzurePublicDataset/.
[14] https://github.com/alibaba/clusterdata/.
[15] C. Jiang, G. Han, J. Lin, G. Jia, W. Shi, and J. Wan, “Characteristics of co-allocated online services and batch jobs in internet data centers: A case study from Alibaba cloud,” IEEE Access, vol. 7, pp. 22495–22508, 2019.
[16] Z. Zhang, C. Li, Y. Tao, R. Yang, H. Tang, and J. Xu, “Fuxi: a fault-tolerant resource management and job scheduling system at internet scale,” in Proceedings of the VLDB Endowment, vol. 7, no. 13. VLDB Endowment Inc., 2014, pp. 1393–1404.
[17] J. Park, S. Park, and W. Baek, “Copart: Coordinated partitioning of last-level cache and memory bandwidth for fairness-aware workload consolidation on commodity servers,” in Proceedings of the Fourteenth EuroSys Conference 2019. ACM, 2019, p. 10.
[18] L. Xu, T. Guo, W. Dou, W. Wang, and J. Wei, “An experimental evaluation of garbage collectors on big data applications,” Proceedings of the VLDB Endowment, vol. 12, no. 5, pp. 570–583, 2019.
[19] J. Simão, S. Esteves, A. Pires, and L. Veiga, “Gc-wise: A self-adaptive approach for memory-performance efficiency in Java VMs,” Future Generation Computer Systems, vol. 100, pp. 674–688, 2019.
[20] J. Guo, Z. Chang, S. Wang, H. Ding, Y. Feng, L. Mao, and Y. Bao, “Who limits the resource efficiency of my datacenter: An analysis of Alibaba datacenter traces,” in 2019 IEEE/ACM 27th International Symposium on Quality of Service (IWQoS). IEEE, 2019, pp. 1–10.
[21] P.-A. Tsai and D. Sanchez, “Compress objects, not cache lines: An object-based compressed memory hierarchy,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2019, pp. 229–242.
[22] E. Cortez, A. Bonde, A. Muzio, M. Russinovich, M. Fontoura, and R. Bianchini, “Resource central: Understanding and predicting workloads for improved resource management in large cloud platforms,” in Proceedings of the 26th Symposium on Operating Systems Principles. ACM, 2017, pp. 153–167.
[23] Z. Jia, L. Wang, J. Zhan, L. Zhang, and C. Luo, “Characterizing data analysis workloads in data centers,” in 2013 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2013, pp. 66–76.
[24] Y. Guo, Y. Gong, Y. Fang, P. P. Khargonekar, and X. Geng, “Energy and network aware workload management for sustainable data centers with thermal storage,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 8, pp. 2030–2042, 2013.
[25] T. Chen, X. Wang, and G. B. Giannakis, “Cooling-aware energy and workload management in data centers via stochastic optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 2, pp. 402–415, 2015.
[26] F. T. Chong, M. J. Heck, P. Ranganathan, A. A. Saleh, and H. M. Wassel, “Data center energy efficiency: Improving energy efficiency in data centers beyond technology scaling,” IEEE Design & Test, vol. 31, no. 1, pp. 93–104, 2013.
[27] Y. Qiu, C. Jiang, Y. Wang, D. Ou, Y. Li, and J. Wan, “Energy aware virtual machine scheduling in data centers,” Energies, vol. 12, no. 4, p. 646, 2019.
[28] X. Sun, N. Ansari, and R. Wang, “Optimizing resource utilization of a data center,” IEEE Communications Surveys & Tutorials, vol. 18, no. 4, pp. 2822–2846, 2016.
[29] J. Tan, P. Dube, X. Meng, and L. Zhang, “Exploiting resource usage patterns for better utilization prediction,” in 2011 31st International Conference on Distributed Computing Systems Workshops. IEEE, 2011, pp. 14–19.
[30] S. Mazumdar and A. S. Kumar, “Statistical analysis of a data centre resource usage patterns: A case study,” in Proceedings of the International Conference on Computing and Communication Systems. Springer, 2018, pp. 767–779.
[31] R. W. Ahmad, A. Gani, S. H. A. Hamid, M. Shiraz, A. Yousafzai, and F. Xia, “A survey on virtual machine migration and server consolidation frameworks for cloud data centers,” Journal of network and computer applications, vol. 52, pp. 11–25, 2015.
[32] A. Varasteh and M. Goudarzi, “Server consolidation techniques in virtualized data centers: A survey,” IEEE Systems Journal, vol. 11, no. 2, pp. 772–783, 2015.
[33] C. Mastroianni, M. Meo, and G. Papuzzo, “Probabilistic consolidation of virtual machines in self-organizing cloud data centers,” IEEE Transactions on Cloud Computing, vol. 1, no. 2, pp. 215–228, 2013.
[34] L. Y. Chen, G. Serazzi, D. Ansaloni, E. Smirni, and W. Binder, “What to expect when you are consolidating: effective prediction models of application performance on multicores,” Cluster computing, vol. 17, no. 1, pp. 19–37, 2014.
[35] D. Warneke and C. Leng, “A case for dynamic memory partitioning in data centers,” in Proceedings of the Second Workshop on Data Analytics in the Cloud. ACM, 2013, pp. 41–45.
[36] C. Lu, K. Ye, G. Xu, C.-Z. Xu, and T. Bai, “Imbalance in the cloud: An analysis on Alibaba cluster trace,” in 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017, pp. 2884–2892.
[37] W. Chen, K. Ye, Y. Wang, G. Xu, and C.-Z. Xu, “How does the workload look like in production cloud? analysis and clustering of workloads on Alibaba cluster trace,” in 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 2018, pp. 102–109.
[38] Y. Cheng, A. Anwar, and X. Duan, “Analyzing Alibaba’s co-located datacenter workloads,” in 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 292–297.
[39] Q. Liu and Z. Yu, “The elasticity and plasticity in semi-containerized co-locating cloud workload: A view from Alibaba trace,” in Proceedings of the ACM Symposium on Cloud Computing. ACM, 2018, pp. 347–360.


[40] H. Tian, Y. Zheng, and W. Wang, “Characterizing and synthesizing task dependencies of data-parallel jobs in alibaba cloud,” in Proceedings of the ACM Symposium on Cloud Computing, 2019, pp. 139–151.

Congfeng Jiang received his Ph.D. degree from Huazhong University of Science and Technology in 2007. He is currently a Professor in the School of Computer Science and Technology, Hangzhou Dianzi University, China. His research interests include system optimization, performance evaluation, and distributed system benchmarking. He is an IEEE Member.

Yitao Qiu is currently a master's student in the School of Computer Science and Technology, Hangzhou Dianzi University, China. His research interests include cloud computing, big data analysis, and distributed systems.

Weisong Shi is a Charles H. Gershenson Distinguished Faculty Fellow and a Professor of Computer Science at Wayne State University. There he directs the Mobile and Internet Systems Laboratory, the Connected and Autonomous Driving Laboratory, and the Intel IoT Innovators Lab. His research interests include edge computing, cloud computing, and distributed systems. He is an IEEE Fellow and a Distinguished Scientist of ACM.

Zhefeng Ge is currently a master's student in the School of Computer Science and Technology, Hangzhou Dianzi University, China. His research interests include cloud computing, data analysis, and distributed systems.

Jiwei Wang is currently a master's student in the School of Computer Science and Technology, Hangzhou Dianzi University, China. His research interests include cloud computing, edge computing, and distributed systems.

Shenglei Chen is currently a master's student in the School of Computer Science and Technology, Hangzhou Dianzi University, China. Her research interests include container scheduling, cloud computing, and distributed systems.

Christophe Cérin has been a Professor of Computer Science at the University of Paris 13, France, since 2005. His research interests are in the field of high performance computing, including grid computing. He is developing middleware, algorithms, tools, and methods for managing distributed systems. He is an IEEE Member.

Zujie Ren received the Ph.D. degree in computer engineering from Zhejiang University in 2010. He is currently the Director of the Research Center for Cross-Media Co-Processing, Zhejiang Lab, Hangzhou, China. His research interests include big data systems, cloud computing, and data center technologies.

Guoyao Xu is currently a research engineer at Alibaba Cloud. He obtained his B.S. degree from Xidian University, Xi’an, China, in 2011. He received his Master's and Ph.D. degrees from Wayne State University in 2013 and 2019, respectively. His research interests include datacenter operation optimization.

Jiangbin Lin is currently an expert software engineer at Alibaba Cloud. He received the B.S. degree from Peking University in 2008. He was the innovator of the Pangu Testing Framework. His research interests include distributed storage systems, system benchmarking, and reliability optimization.
