
LOPO: An Out-of-order Layer Pulling Orchestration Strategy for Fast Microservice Startup


Lin Gu∗ , Junhao Huang∗ , Shaoxing Huang∗ , Deze Zeng† , Bo Li‡ , Hai Jin∗

∗ National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
† School of Computer Science, China University of Geosciences, Wuhan, Hubei, China
‡ Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong

Abstract—Container based microservices have been widely applied to promote cloud elasticity. The mainstream Docker containers are structured in layers, which are organized in a stack with bottom-up dependency. To start a microservice, the required layers are pulled from a remote registry and stored on its host server, following the layer dependency order. This incurs long microservice startup time and hinders performance efficiency. In this paper, we discover, for the first time, that the layer pulling order can be adjusted to accelerate microservice startup. Specifically, we address the problem of microservice layer pulling orchestration for startup time minimization and prove it NP-hard. We propose a Longest-chain based Out-of-order layer Pulling Orchestration (LOPO) strategy with low computational complexity and a guaranteed approximation ratio. Through extensive real-world trace driven experiments, we verify the efficiency of our LOPO and demonstrate that it reduces the microservice startup time by 22.71% on average in comparison with state-of-the-art solutions.

Index Terms—Microservice Startup Acceleration, Image Layer Pulling, Image Layer Storage

(The research was supported in part by National Key Research and Development Program of China under grant 2022ZD0115301, NSFC grants No. 61972171 and 62032008, RGC RIF grant under the contract R6021-20, and RGC GRF grants under the contracts 16209120 and 16200221. Deze Zeng is the corresponding author (deze@cug.edu.cn).)

Fig. 1. The startup time of three applications under in-order and out-of-order layer pulling strategies. (Bar chart: y-axis Startup Time (s); x-axis Applications: Hotel Reservation, Media Service, Social Network; legend: In-order, Out-of-order.)

I. INTRODUCTION

Containers have become the mainstream service provision approach due to their huge advantages in lightweightness, elasticity, and scalability over conventional virtual machines [1], [2]. Container-based microservices are deployed on-demand, and their lifetime is relatively short [3]. For example, Google starts over 7000 microservices per second, while half of them are alive for less than 30 minutes, 27% for less than 5 minutes, and 10% even for less than 1 minute. Meanwhile, to start a new microservice, its corresponding container must be pulled from a remote registry to a local server, often incurring long startup time on the order of minutes. Recent statistics show that the startup time occupies a large portion, even as high as 50%, of a microservice's lifetime [4]–[6]. The long startup time has become the main bottleneck, and severely hinders the performance of microservices [7]–[9].

Therefore, much effort has been devoted to reducing the microservice startup time. Existing solutions are primarily container level orchestration and highly depend on the microservice placement decisions. For example, Lou et al. [10] place the dependent microservices on one server and pull the smallest microservice first to shorten the startup time. Gu et al. [11] and Fu et al. [12] accelerate microservice startup by placing the microservices with commonly shared layers onto the same server to eliminate the duplication of pulling the same layer. Some recent studies [13], [14] restructure the containers into multiple versions with different sizes and functionalities, so that users can pull the smaller images first toward faster startup. In general, existing startup acceleration solutions need to explicitly adjust the microservice placement decisions (e.g., hacking the container orchestration in Kubernetes) or to restructure the containers. This limits their applicability in practical scenarios, while it is highly desirable to design acceleration mechanisms as a plug-in without modifications on the container-based microservice.

To this end, let us first take a closer look into the startup process. A container usually consists of multiple layers packaging the required functionalities such as system dependencies and runtime tools [15]. To start a microservice, all its required layers must first be pulled from a remote registry and then stored on the local hard disk. The host server will initialize several pulling threads (three by default for Docker) for all microservices to retrieve their layers concurrently. Meanwhile, one storage thread is initialized for each microservice, responsible for storing the pulled layers.

The constituent layers of a container are structured in a stack, and the layer storage procedure must be done in a bottom-up order to guarantee the layer dependency. That is, an upper layer can be stored on the local hard disk only when all its lower layers have already been stored [16]. However, it is worth pointing out that this does not impose any restriction on the layer pulling order, although the most widely used container product, i.e., Docker, simply issues the pulling in bottom-up order. From our preliminary investigation, we discover that the pulling order greatly influences the startup time and therefore should be carefully orchestrated.

We have conducted a pilot experiment by starting 3 real world applications, i.e., Hotel Reservation, Media Service, and Social Network [17], which comprise 5, 7, and 6 microservices with totally 42, 52, and 61 layers, respectively. Fig. 1 shows the startup time of these three applications using the default in-order pulling strategy and our manually adjusted out-of-order strategy. With in-order pulling, we can see that it takes 89.09, 102.79, and 120.53 seconds to start these three applications, respectively. Then, we apply an out-of-order strategy by grouping the small layers and large layers to balance the pulling workload of different threads. In this case, the startup time is decreased to 64.30, 71.69, and 102.61 seconds, reduced by 27.83%, 30.31%, and 14.87%, respectively. The results illustrate that out-of-order layer pulling can potentially accelerate the startup process. Furthermore, it is particularly advantageous in that this re-ordering decision does not require any container restructuring or microservice placement adjustment.

However, it is challenging to design the optimal layer pulling strategy for startup time minimization. On one hand, a host server may have to start a large number of microservices at the same time, and their containers usually consist of many layers with different sizes. Although layers can be pulled out-of-order, they must be stored in-order. On the other hand, there are only a limited number of pulling threads on each host server, and a microservice can be started up only when all its layers have been completely pulled. As a result, the startup time minimization formulation has to address two correlated problems: (1) how to balance the heterogeneous pulling workloads on different pulling threads, and (2) how to determine the pulling order on each thread. Focusing on this problem, in this paper, we propose a Longest-chain based Out-of-order layer Pulling Orchestration (LOPO) strategy, which can be integrated with any existing container orchestration system. The main contributions are summarized as follows.

• To our best knowledge, we are the first to investigate the layer pulling order orchestration problem for microservice startup acceleration. The problem is proved NP-hard by reduction from the classical list scheduling problem.

• To deal with the computational complexity, we construct a layer chain to estimate the startup time with joint consideration of the layer size and layer dependency, and accordingly design our LOPO strategy with low computational complexity and a guaranteed approximation ratio of 2.

• We conduct extensive trace driven experiments to validate the effectiveness of our LOPO strategy. The results show that compared with the state-of-the-art solutions Docker, GLSA [10], ADAL [11], and CNTR [13], our LOPO averagely reduces the startup time by 23.56%, 22.79%, 23.56%, and 20.91%, respectively.

The remainder of this paper is organized as follows. We present the problem formulation and NP-hardness analysis in Section II. Then, we propose our LOPO strategy in Section III and report our experiment results in Section IV. Section V summarizes some related work. Finally, Section VI concludes this work.

II. SYSTEM MODEL AND PROBLEM FORMULATION

In this section, we present a formal description of the layer pulling orchestration problem, and analyze its complexity.

A. System Model

We mainly focus on the layer pulling behaviors of one server. We assume that there is a server to be deployed with a set M of container based microservices. As the containers are structured in layers, we use L to represent all the layers required by M. For a container m ∈ M, Θ(l, m) ∈ {0, 1}, ∀m ∈ M, l ∈ L is used to indicate whether m requires layer l (Θ(l, m) = 1) or not (Θ(l, m) = 0). To start up the microservices, a set K of threads is initialized to pull all these layers L. For each thread k ∈ K, B is used to indicate the allocated network bandwidth.

The startup of a microservice can be divided into two procedures: (1) pulling the layers in .tar.gz format, and (2) decompressing these layers and storing them locally. The sizes of layer l ∈ L before and after decompression are defined as Ω_p(l) and Ω_s(l), respectively. For layer l ∈ L, we define x_{l,k} to indicate whether thread k is responsible for pulling layer l (x_{l,k} = 1) or not (x_{l,k} = 0). Then, these pulled layers must be decompressed and stored in-order according to the bottom-up dependency relationship. The storage bandwidth is denoted as W. With the consideration of layer dependency, we specially introduce N_l to represent l's successor layer set, i.e., the layers that can only be stored after layer l.

B. Problem Formulation

Based on the system model above, we next present a formal description of the layer pulling orchestration problem for startup time minimization. For the convenience of the readers, the major notations in this paper are summarized in Table I.

1) Layer Pulling Procedure: To start up the microservices, all the required layers must be completely pulled and stored. Put another way, each required layer must be pulled by one and only one pulling thread.

\sum_{k \in K} x_{l,k} = 1, \quad \forall l \in L    (1)

TABLE I: MAJOR NOTATIONS

Constants:
  M, L, K       The microservice set, layer set, and pulling thread set
  Θ(l, m)       Whether layer l is required by microservice m
  N_l           The successor layer set of layer l
  Ω_p(l)        The size of layer l before decompression
  Ω_s(l)        The size of layer l after decompression
  B             The pulling bandwidth of a layer pulling thread
  W             The storage bandwidth of a layer storage thread

Variables:
  x_{l,k}       Whether layer l is pulled by thread k or not
  T_b^p(l, k)   The starting time of pulling layer l on thread k
  T_f^p(l, k)   The finishing time of pulling layer l on thread k
  T_b^s(l, m)   The starting time of storing layer l of container m
  T_f^s(l, m)   The finishing time of storing layer l of container m

Each pulling thread can pull one and only one layer at a time.

T_f^p(l', k) \le T_b^p(l^*, k) \;\|\; T_b^p(l', k) \ge T_f^p(l^*, k),
    \quad \forall l', l^* \in L, k \in K, l' \ne l^*, x_{l',k} = x_{l^*,k} = 1    (2)

Here T_b^p(l, k) and T_f^p(l, k) represent the starting time and the finishing time of pulling layer l on thread k, respectively.

If the pulling of layer l is not allocated to thread k, its starting time should be enforced to zero.

0 \le T_b^p(l, k) \le A \cdot x_{l,k}, \quad \forall l \in L, k \in K    (3)

Here A is an arbitrarily large number.

The finishing time of layer pulling is determined by the starting time, the pulling bandwidth B, and the compressed layer size Ω_p(l).

T_f^p(l, k) = T_b^p(l, k) + x_{l,k} \cdot \frac{\Omega_p(l)}{B}, \quad \forall l \in L, k \in K    (4)

2) Layer Storage Procedure: A layer l ∈ L can be stored only after it is completely pulled from the registry. Hence, the starting time T_b^s(l, m) of storing layer l of container m ∈ M must be after the finishing time T_f^p(l, k) of pulling layer l.

T_b^s(l, m) \ge \Theta(l, m) \cdot T_f^p(l, k), \quad \forall l \in L, m \in M, k \in K    (5)

Note that there is only one storage thread for each microservice. Hence, if layers l' and l^* belong to the same microservice m and layer l' is the successor of layer l^*, then l' can be stored only when the storage of layer l^* finishes. Such a partial order guaranteeing the layer dependency can be expressed as follows.

T_b^s(l', m) \ge \Theta(l', m) \cdot \Theta(l^*, m) \cdot T_f^s(l^*, m), \quad \forall l^* \in L, l' \in N_{l^*}, m \in M    (6)

The time consumed by storing l is decided by the decompressed layer size Ω_s(l) and the storage bandwidth of the layer storage thread for microservice m.

T_f^s(l, m) = T_b^s(l, m) + \Theta(l, m) \cdot \frac{\Omega_s(l)}{W}, \quad \forall l \in L, m \in M    (7)

Essentially, there are two decisions to be made for layer pulling orchestration, i.e., the layer pulling task allocation (x_{l,k}, ∀l ∈ L, k ∈ K) and the layer pulling order on each thread. The layer pulling order is represented by the starting time of layer pulling, i.e., T_b^p(l, k). We aim to minimize the startup time of all the microservices, which is determined by the latest finishing time of layer storage, i.e., max_{l∈L,m∈M} T_f^s(l, m). Summing up the above, the layer pulling orchestration (LPO) problem with the objective of minimizing the startup time can be formulated as follows.

LPO:  \min_{x_{l,k},\, T_b^p(l,k)} \; \max_{l \in L,\, m \in M} T_f^s(l, m)
      s.t.: (1)–(7)

C. NP-hardness Analysis

We prove the NP-hardness of our LPO problem through a reduction from the list scheduling problem [18]. A list scheduling problem usually consists of a set H of identical processors and a set J of tasks. Each task j ∈ J should be allocated and processed on one and only one processor h ∈ H. Whenever there is an idle processor h, a task j should be immediately allocated. Let µ(j, h) denote the processing time for processor h to execute task j and α denote the maximum processing time over all |H| processors, i.e., the task processing makespan, where |·| is the cardinality function. The goal of the list scheduling problem is to find a processing sequence for all the tasks as F = (f_1, f_2, ..., f_{|J|}) with the minimum processing makespan.

In our LPO problem, the set of pulling threads K can be regarded as the set of processors H, while the layers l ∈ L to be pulled and stored can be regarded as the task set J with the processing times τ(l, k) = Ω_p(l)/B and τ(l, m) = Ω_s(l)/W. Let us consider a special case where the storage bandwidth W is large enough, which means Ω_s(l)/W → 0, ∀l ∈ L. In this case, we only need to focus on the pulling makespan minimization, i.e., min ω̃_p where ω̃_p = max_{k∈K} Σ_{l∈L} x_{l,k} · τ(l, k). Hence, the LPO problem can be solved by generating a sequence list Q̃ to pull l ∈ L, aiming to minimize the pulling makespan ω̃_p. This is a typical list scheduling problem, which has been proved NP-hard [19]. Hence, the LPO problem, as a general case, is also NP-hard.

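Before moving on, it may help to see the objective played out in code. The following is a minimal sketch of our own (not from the paper): given a fixed pulling decision, it replays constraints (2)-(7) under the work-conserving assumption that every pull and store starts as early as possible, and returns max T_f^s(l, m). All function and variable names are illustrative.

def startup_time(pull_queues, services, omega_p, omega_s, B, W):
    """Replay a fixed pulling decision and return max T_f^s(l, m).

    pull_queues: one list of layer ids per pulling thread, in pull order.
    services:    dict microservice -> bottom-up list of its layer ids.
    omega_p/omega_s: compressed / decompressed size of each layer id.
    """
    # Constraints (2)-(4): each thread pulls its queue back to back.
    pull_finish = {}
    for queue in pull_queues:
        t = 0.0
        for l in queue:
            t += omega_p[l] / B       # pulling time of layer l
            pull_finish[l] = t        # T_f^p(l, k)
    # Constraints (5)-(7): one storage thread per microservice, which
    # stores layers bottom-up; a layer starts once it is pulled (5)
    # and its predecessor is stored (6), and takes Omega_s(l)/W (7).
    makespan = 0.0
    for layers in services.values():
        t = 0.0
        for l in layers:
            t = max(t, pull_finish[l]) + omega_s[l] / W
        makespan = max(makespan, t)   # objective: max T_f^s(l, m)
    return makespan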
III. LONGEST-CHAIN BASED ALGORITHM

A. Algorithm Design

To tackle the computational complexity, we propose a longest-chain based strategy, LOPO, as summarized in Algorithm 1. Recall that microservice startup consists of the pulling and storage procedures; let P = {p_1, p_2, ..., p_{|L|}} be the layer pulling task set and S = {s_1, s_2, ..., s_{|L|}} be the layer storage task set. Note that layer pulling can be done concurrently and out-of-order, while the storage must be done in order. Hence, for each layer l in microservice m, we construct a layer chain consisting of the pulling task p_l, the storage task s_l, and the storage tasks of all its successor layers N_l, as C(l) = {p_l → s_l → s_{l+1} → ... → s_{|L_m|}}. It can be seen that all tasks in this chain must be done sequentially, since we have to pull l first and then store {l → l+1 → ... → |L_m|} in order. Let L_C(l) = Ω_p(l)/B + Ω_s(l)/W + Σ_{l'∈N_l} Ω_s(l')/W be the length of the chain C(l). It can be seen that the chain length L_C(l) takes both the layer size and the layer dependency into consideration. What is more, no matter when we start to pull layer l, it takes at least a further time L_C(l) to start the microservices, since the successor layer storage may be delayed by their pulling tasks. Hence, we leverage the layer chain to estimate the remaining startup time and make the layer pulling decisions.

Algorithm 1 LOPO Algorithm
1: Initialize thread allocation decisions X = {x_{l,k}, ∀l ∈ L, k ∈ K} and pulling starting time decisions T_b^p = {T_b^p(l, k), ∀l ∈ L, k ∈ K}. Initialize thread workloads e_k = 0, ∀k ∈ K and layer pulling task queues Q̃_k, ∀k ∈ K
2: for m ∈ M do
3:   for l ∈ L & Θ(l, m) = 1 do
4:     C(l) = {p_l → s_l → s_{l+1} → ... → s_{|L_m|}}
5:     Calculate the chain length L_C(l)
6:   end for
7: end for
8: Sort layer set L by L_C(l) in descending order as L̃
9: for l̃ ∈ L̃ do
10:   for k ∈ K do
11:     find k̃ such that e_{k̃} = min{e_k, ∀k ∈ K}
12:   end for
13:   Append l̃ to pulling queue Q̃_{k̃}
14:   Update pulling decisions x_{l̃,k̃} = 1, T_b^p(l̃, k̃) ← e_{k̃}/B
15:   Update thread workload e_{k̃} ← e_{k̃} + Ω_p(l̃)
16: end for
17: X and T_b^p form the feasible solution to our problem

First, the thread allocation decisions x_{l,k} ∈ X, the pulling starting time decisions T_b^p(l, k) ∈ T_b^p, and the thread workloads e_k, ∀k ∈ K are initialized and set to 0. A layer pulling queue Q̃_k is initialized for each thread in line 1. Since all layers of one microservice share one storage thread, for each microservice m ∈ M, we construct the layer chain for each l in m from line 3 to line 6. The chain length is calculated as L_C(l) = Ω_p(l)/B + Ω_s(l)/W + Ω_s(l+1)/W + ... + Ω_s(|L_m|)/W, adding up the pulling and storage time of layer l itself and the storage time of all its successor layers in line 5. If a layer l belongs to multiple microservices, we choose the longest chain as L_C(l). Then, we sort all layers in L in descending order of chain length into a new layer sequence L̃ in line 8. After sorting, the first layer in L̃ has the longest chain, implying the longest pulling and storage time, and should be pulled with higher priority. Therefore, in each iteration from line 9 to line 16, we choose the first layer l̃ in L̃ as the layer to be pulled. Now, we need to determine the pulling thread for layer l̃. To balance the workload of the threads, the thread k̃ with the minimum pulling workload is selected to pull layer l̃ in line 11. Hence, we append l̃ to thread k̃'s layer pulling queue Q̃_{k̃} in line 13. After allocating layer l̃ to thread k̃, the corresponding allocation decision x_{l̃,k̃} = 1 and the pulling starting time T_b^p(l̃, k̃) ← e_{k̃}/B are updated in line 14, and we add the layer size Ω_p(l̃) to the thread workload as e_{k̃} ← e_{k̃} + Ω_p(l̃) in line 15. This procedure repeats until all layers in sequence L̃ have been pulled. Finally, we obtain the feasible solution X and T_b^p to the LPO problem.

The computational complexity of the layer chain construction from line 2 to line 7 is O(|L|), the layer sort in line 8 is O(|L| · log(|L|)), and the thread selection from line 9 to line 16 is O(|K| · |L|), with |K| being a constant. Hence, our LOPO algorithm has a computational complexity of O(|L| · log(|L|)).

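For concreteness, below is our own compact rendering of Algorithm 1 in Python; it is a sketch, not the authors' implementation, and the input layout (a dict of bottom-up layer lists) is an assumption. The chain length of a layer is its pulling time plus the storage time of itself and of every layer above it in its microservice; layers are then dispatched longest-chain-first to the least-loaded thread.

import heapq

def lopo(services, omega_p, omega_s, B, W, num_threads):
    """services: dict microservice -> bottom-up list of layer ids."""
    # Lines 2-7: chain length L_C(l) via suffix sums of storage time.
    chain = {}
    for layers in services.values():
        tail = 0.0
        for l in reversed(layers):
            tail += omega_s[l] / W              # store l and all layers above it
            length = omega_p[l] / B + tail      # line 5: L_C(l)
            # a layer shared by several microservices keeps its longest chain
            chain[l] = max(chain.get(l, 0.0), length)
    # Line 8: sort layers by chain length in descending order.
    order = sorted(chain, key=chain.get, reverse=True)
    # Lines 9-16: append each layer to the least-loaded thread. A heap
    # replaces the linear scan of lines 10-12 (O(log|K|) per layer).
    load = [(0.0, k) for k in range(num_threads)]   # (e_k, k) pairs
    heapq.heapify(load)
    queues = [[] for _ in range(num_threads)]
    start = {}
    for l in order:
        e, k = heapq.heappop(load)    # line 11: thread with minimum e_k
        queues[k].append(l)           # line 13
        start[l] = e / B              # line 14: T_b^p(l, k) <- e_k / B
        heapq.heappush(load, (e + omega_p[l], k))   # line 15: update e_k
    return queues, start              # line 17: the feasible solution

The queues returned here can be fed to the startup_time sketch from Section II to obtain the resulting startup time of the schedule.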
B. Algorithm Analysis

Recall that microservice startup consists of the layer pulling tasks {p_1, p_2, ..., p_{|L|}} and the storage tasks {s_1, s_2, ..., s_{|L|}}, with pulling time τ(p_l) = Ω_p(l)/B and storage time τ(s_l) = Ω_s(l)/W for each layer l ∈ L. The goal of our LOPO strategy is to determine the layer pulling order Q̃_k on each thread so as to minimize the microservice startup time ω̃ = max_{l∈L} T_f^s(l, m).

First, let us consider a special case of LPO with only the layer pulling procedure, consisting of the pulling task set P, denoted as LPO-P. With only the layer pulling tasks in LPO-P, reducing the startup time ω is equivalent to minimizing the layer pulling makespan ω_p, which is determined by the latest finished layer pulling task.

Lemma 1. Defining ω_p and ω_p' as the pulling makespans of any two solutions Q_k and Q_k' for LPO-P, respectively, we have ω_p/ω_p' ≤ 2 − 1/|K|.

Fig. 2. An example solution for the LPO-P problem with two empty tasks and the makespan of 17

Proof: All the time points inside the pulling makespan, i.e., ∀t ∈ (0, ω_p], can be partitioned into two subsets, A and B, where A includes the time points when all the threads are busy and B contains the remaining time points when at least one thread (but not all) is idle. Fig. 2 shows the pulling procedure of one 8-layer microservice consisting of the task set P = {p_1, ..., p_8} consuming τ(p_l) = {8, 6, 5, 7, 6, 4, 2, 5} time units, respectively. It can be seen that the time points in (0, 12] can be categorized into the set A and the time points in (12, 17] belong to the set B. Specially, there is no pulling task in the last 3 time units of thread k1 and the last 5 time units of thread k3, which can be defined as the empty tasks Φ = {φ_1, φ_2}.

Assume that p_{l*} is the last completed task, whose completion time is the pulling makespan ω_p (i.e., ∃ p_{l*} ∈ P, T_f^p(l*, k) = ω_p, if x_{l*,k} = 1). For any time point t before the starting time of pulling task p_{l*}, i.e., t ≤ T_b^p(l*, k), all the pulling threads K should be busy. The reason is that if t ∈ B, there should be at least one idle thread and pulling task p_{l*} should have been allocated to another thread at time t or earlier. That is to say, t must be in set A if t ≤ T_b^p(l*, k) and x_{l*,k} = 1.

Hence, we can conclude that the pulling time interval of the last pulling task covers the set B, i.e., B ⊂ (T_b^p(l*, k), T_b^p(l*, k) + τ(p_{l*})]. Meanwhile, there are at most (|K| − 1) threads idle at any time point of set B. Hence, considering the empty tasks,

\sum_{\varphi_e \in \Phi} \tau(\varphi_e) \le (|K| - 1) \cdot \tau(p_{l^*})    (8)

Obviously, both the pulling makespans ω_p and ω_p' must not be shorter than the processing time of any single pulling task.

\omega_p \ge \tau(p_{l^*})    (9)

and

\omega_p' \ge \tau(p_{l^*})    (10)

Specially, if a thread processes only one pulling task from time 0 to the end, (9) or (10) takes the equal sign.

From Fig. 2, we can see that by introducing the concept of empty tasks, each thread k ∈ K is always busy processing a pulling task or an empty task until the makespan. Hence, ω_p can be calculated as follows.

\omega_p = \frac{1}{|K|} \Big\{ \sum_{\varphi_e \in \Phi} \tau(\varphi_e) + \sum_{p_l \in P} \tau(p_l) \Big\} \le \frac{1}{|K|} \big\{ (|K| - 1)\,\omega_p' + |K|\,\omega_p' \big\}    (11)

The second inequality can be obtained by integrating (8), (9), and the fact that no more than |K| times of any makespan is required to complete all the pulling tasks. Based on (11), we can derive the general bound of LPO-P as follows.

\frac{\omega_p}{\omega_p'} \le 2 - \frac{1}{|K|}    (12)

That is, the pulling makespan of any solution for the LPO-P problem will not be longer than 2 − 1/|K| times of the optimal makespan.

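The accounting behind (11) can be sanity-checked on the Fig. 2 numbers with a few lines of Python (ours; the thread assignment below is one plausible reading of the figure, with k1 idle for the last 3 time units and k3 for the last 5):

tau = [8, 6, 5, 7, 6, 4, 2, 5]             # tau(p_1) ... tau(p_8)
threads = [[0, 1], [2, 3, 7], [4, 5, 6]]   # task indices per thread
busy = [sum(tau[i] for i in q) for q in threads]   # 14, 17, 12
makespan = max(busy)                        # omega_p = 17
empty = sum(makespan - b for b in busy)     # empty tasks: 3 + 0 + 5 = 8
# (11): |K| * omega_p = total empty time + total pulling time
assert len(threads) * makespan == empty + sum(tau)   # 51 == 8 + 43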
By solving the problem of LPO-P, a pulling sequence Q_k can be obtained. Following the pulling order of Q_k, the corresponding storage sequence can be determined. That is, from the LPO-P solution Q_k, we can further calculate the microservice startup time ω. Let ω† denote the optimal startup time of LPO.

Theorem 1. The general bound for the LPO problem is ω/ω† ≤ 3 − 1/|K|.

Proof: Following Lemma 1, let ω_p and ω_p† denote the corresponding pulling makespans of an arbitrary LPO solution ω and the optimal solution ω†, respectively.

\frac{\omega}{\omega^\dagger} = \frac{\omega_p + (\omega - \omega_p)}{\omega_p^\dagger + (\omega^\dagger - \omega_p^\dagger)} \le \frac{\omega_p}{\omega_p^\dagger} + \frac{\omega - \omega_p}{\omega^\dagger} \le 2 - \frac{1}{|K|} + \frac{\omega - \omega_p}{\omega^\dagger}    (13)

The second inequality holds because the LPO solution processes at least one more storage task at time ω_p or ω_p† than the corresponding LPO-P solution (i.e., ω > ω_p and ω† > ω_p†). The third inequality can be obtained according to Lemma 1.

Here ω − ω_p indicates the storage tail latency, i.e., the remaining storage task processing time after all pulling tasks are completed. Obviously, the storage tail latency is equal to or shorter than the longest layer storage time among all the microservices, defined as C_1. Note that C_1 will not be greater than the shortest startup time ω†. Hence, we have the general bound of the LPO problem based on the solution of LPO-P as follows.

\frac{\omega}{\omega^\dagger} \le 2 - \frac{1}{|K|} + \frac{C_1}{\omega^\dagger} \le 3 - \frac{1}{|K|}    (14)

Since ω/ω† ≤ 3 − 1/|K| is the general bound for the LPO problem derived from any solution of LPO-P, it is not tight enough. Next, we try to reduce the startup time by improving the pulling makespan ω_p.

Theorem 2. By re-orchestrating the pulling tasks of LPO-P into Q̂_k and obtaining the pulling makespan ω̂_p, the startup time of the LPO problem can be reduced to ω̂ and the approximation ratio can be improved to 7/3 − 1/(3|K|).

Proof: Following the traditional makespan optimization idea in [18], we allocate the pulling task with the largest processing time to an idle thread in every round, i.e., largest-layer-first; then the following inequality holds.

\frac{\hat{\omega}_p}{\omega_p^\ddagger} \le \frac{4}{3} - \frac{1}{3|K|}    (15)

Here ω_p‡ denotes the optimal pulling makespan of LPO-P. ω_p‡ should be no longer than the pulling makespan of the optimal solution for the LPO problem, i.e., ω_p‡ ≤ ω_p†.

Similar to the proof of Theorem 1, we have the following inequality.

\frac{\hat{\omega}}{\omega^\dagger} \le \frac{\hat{\omega}_p}{\omega_p^\dagger} + \frac{\hat{\omega} - \hat{\omega}_p}{\omega^\dagger} \le \frac{\hat{\omega}_p}{\omega_p^\ddagger} + \frac{\hat{\omega} - \hat{\omega}_p}{\omega^\dagger} \le \frac{4}{3} - \frac{1}{3|K|} + \frac{\hat{\omega} - \hat{\omega}_p}{\omega^\dagger} \le \frac{7}{3} - \frac{1}{3|K|}    (16)

The third inequality can be derived from (15). By pulling the largest layer first, we improve the approximation ratio of LPO from 3 − 1/|K| (the general bound) to 7/3 − 1/(3|K|).

However, Theorem 2 focuses on the pulling makespan optimization and only takes the layer size into consideration, which may lead to longer storage tail latency as well as longer startup time. We consider a special case with only one pulling thread, where the optimal pulling makespan can be obtained. As shown in Fig. 3, the pulling makespans of pulling all layers sequentially and reversely are both 17. However, their storage tail latencies differ, being 3 and 11, respectively. From Fig. 3 we can see that the layer pulling makespan and the storage tail latency should be carefully balanced toward the goal of minimizing the startup time. To this end, both the layer dependency and the layer size should be considered, as mentioned in Section I. Based on this observation, we next show that our LOPO strategy further reduces the approximation ratio from 7/3 − 1/(3|K|) to 2.

Fig. 3. The effect of pulling order on storage tail latency
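This effect is easy to reproduce. The sketch below (ours, with made-up layer sizes rather than the actual Fig. 3 data) pulls the same single-microservice layer set on one thread in two different orders: the pulling makespan is identical, but the storage tail latency is not.

def tail_latency(pull_order, bottom_up, pull, store):
    t, pull_finish = 0.0, {}
    for l in pull_order:                 # one pulling thread
        t += pull[l]
        pull_finish[l] = t               # T_f^p of layer l
    makespan = t                         # identical for every order
    s = 0.0
    for l in bottom_up:                  # storage must go bottom-up
        s = max(s, pull_finish[l]) + store[l]
    return makespan, s - makespan        # (pulling makespan, tail latency)

pull = {"l1": 8, "l2": 6, "l3": 3}       # assumed pulling times per layer
store = {"l1": 2, "l2": 2, "l3": 2}      # assumed storage times per layer
bottom_up = ["l1", "l2", "l3"]
print(tail_latency(bottom_up, bottom_up, pull, store))        # (17.0, 2.0)
print(tail_latency(bottom_up[::-1], bottom_up, pull, store))  # (17.0, 6.0)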
Theorem 3. The LOPO strategy can solve the LPO problem with a guaranteed approximation ratio of 2.

Proof: We assume that the pulling makespan and startup time obtained by our LOPO strategy are ω̃_p and ω̃, respectively. Similar to the proof of Lemma 1, we can obtain an approximation ratio on the LOPO solution ω̃ by checking the lengths of set A and set B. Different from Lemma 1, all pulling threads are considered to process empty tasks after the pulling makespan ω̃_p, since all pulling tasks are completed. The empty tasks will continue until time ω̃ when all tasks are completed.

Now, we start with the last finished storage task s_l, i.e., T_f^s(l, m) = ω̃, and look backward to find a layer chain C(l̃) used in our LOPO strategy, in which the tasks can only be processed sequentially. Based on this layer chain, we can derive the upper bound of the set B.

For the storage task s_l, its starting time T_b^s(l, m) is directly affected by its pulling task or its predecessor storage tasks. If T_b^s(l, m) > ω̃_p, we have T_f^s(l−1, m) = T_b^s(l, m), because all pulling tasks have already completed at time ω̃_p but the predecessor layer storage task s_{l−1} is not completed before T_b^s(l, m). Then, for s_{l−1}, if T_b^s(l−1, m) > ω̃_p, we can still find such a predecessor storage task s_{l−2} through the above assessment. Such assessment can be repeated until a storage task s_{l−i} whose starting time T_b^s(l−i, m) ≤ ω̃_p is found. Obviously, {s_{l−i+1} → s_{l−i+2} → ... → s_l} should be included in the layer chain.

Then, for the storage task s_{l−i} starting before or at time ω̃_p, if the corresponding pulling task p_{l−i} processed on thread k* finishes before the storage task starts, i.e., T_f^p(l−i, k*) < T_b^s(l−i, m), the starting time of storage task s_{l−i} is determined by its predecessor storage task, i.e., T_f^s(l−i−1, m) = T_b^s(l−i, m). Otherwise, s_{l−i} could start earlier. We move backward to s_{l−i−1} and repeat the above assessment until a storage task s_{l−i−j} satisfying T_f^p(l−i−j, k') = T_b^s(l−i−j, m) is found. Similarly, {s_{l−i−j+1} → s_{l−i−j+2} → ... → s_l} should also be included in the layer chain.

For the pulling task p_{l−i−j}, every time point before its starting time on thread k' belongs to set A. That is, t must be in A if t < T_b^p(l−i−j, k'). Otherwise, there must be an idle thread and the task p_{l−i−j} should have started at t or earlier. By such means, we can find the layer chain C(l−i−j) = {p_{l−i−j} → s_{l−i−j} → ... → s_{l−i} → ... → s_l} that covers the set B.

Fig. 4. An example chain C(l) in the LOPO solution to cover set B

Let us take the startup procedure of two 9-layered microservices, i.e., tomee and php, as an example to illustrate the layer chain searching process, as shown in Fig. 4. At time point t = ω̃, the last finished storage task is s_{php9}, which corresponds to s_l in the above searching process. Then, we look for the predecessor storage tasks of s_{php9} until s_{php6}, because s_{php6} satisfies T_b^s(php6, php) ≤ ω̃_p, and s_{php6} corresponds to s_{l−i}. Continuing from s_{php6}, the storage task s_{php3} and pulling task p_{php3} corresponding to s_{l−i−j} and p_{l−i−j} can be found, respectively, since s_{php3} and p_{php3} satisfy T_f^p(php3, k3) = T_b^s(php3, php). Hence, the layer chain can be presented as C(php3) = {p_{php3} → s_{php3} → ... → s_{php6} → ... → s_{php9}}, which covers the set B of the time interval (24.56s, 34.62s].

Now, we obtain the layer chain C(l−i−j), which covers the set B. Similar to (8), we have the following inequality.

\sum_{\varphi_e \in \Phi} \tau(\varphi_e) \le |K| \cdot |C(l-i-j)|    (17)

Note that all tasks in C(l−i−j) cannot be done concurrently. This implies that the chain exists in any solution of LPO. Then, the startup time of any solution, including the optimal one, must not be less than the chain length |C(l−i−j)|. Let ω† denote the optimal startup time.

\omega^\dagger \ge |C(l-i-j)|    (18)

Similar to Lemma 1, we have the following inequality.

\tilde{\omega} = \frac{1}{|K|} \Big\{ \sum_{\varphi_e \in \Phi} \tau(\varphi_e) + \sum_{p_l \in P} \tau(p_l) \Big\} \le \frac{1}{|K|} \big\{ |K|\,\omega^\dagger + |K|\,\omega^\dagger \big\}    (19)

This implies that no more than |K| times of the optimal startup time is required to complete all the pulling tasks. Finally, we get the approximation ratio ω̃/ω† ≤ 2.

It can be seen that by introducing the longest chain to estimate the remaining startup time and make the layer pulling decisions, our LOPO strategy further reduces the approximation ratio from 3 − 1/|K| to 2.

IV. PERFORMANCE EVALUATION

In this section, we conduct trace based experiments on our LOPO strategy and report the performance evaluation results. The default settings are as follows. The network bandwidth, storage bandwidth, and the number of threads are set in the ranges of 100∼400Mbps, 100∼400Mbps, and 1∼6, respectively. We compare our LOPO with Docker, as well as the state-of-the-art startup acceleration solutions GLSA [10], ADAL [11], and CNTR [13].

A. On Different Network Bandwidths

1) With the top 100 most pulled microservices: We select the top 100 most pulled microservices from Docker Hub and start them on 5 servers according to the Zipf distribution [20]. The top 100 microservice images contain 1 to 21 layers, with layer sizes ranging from 0.0024MB to 618.30MB. Following the default setting, there are three pulling threads. We change the thread network bandwidth from 100Mbps to 400Mbps and report the achieved startup time in Fig. 5.
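As a reference for reproducing this setup, Zipf-skewed startup requests over a ranked image list can be drawn as below; this is our guess at the harness, and the exponent, request count, and image names are placeholders rather than values from the paper.

import random

def zipf_requests(images, n, exponent=1.0, seed=0):
    """Draw n startup requests; images are ranked by popularity."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** exponent
               for rank in range(len(images))]
    return rng.choices(images, weights=weights, k=n)

images = [f"image-{i:03d}" for i in range(100)]   # placeholder names
requests = zipf_requests(images, n=500)           # assumed request count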
First, Fig. 5(a) shows the microservice pulling makespan (colored bars) and storage tail latency (grey bars) on each server when the thread network bandwidth is 100Mbps. It can be seen that our LOPO always achieves the shortest startup time, pulling makespan, and storage tail latency. Compared with Docker, GLSA, ADAL, and CNTR, our LOPO decreases the startup time by 21.76%, 24.91%, 21.76%, and 20.93%, respectively. It is noticeable that Docker and ADAL show the same startup time. This is because ADAL reduces the startup time by placing the microservices with common layers onto the same server. Hence, when the microservice placement is prefixed, ADAL can do nothing and hence shows the same startup time as Docker.

Next, we vary the thread network bandwidth from 100Mbps to 400Mbps and show the average startup time in Fig. 5(b). As expected, with the increase of network bandwidth, the startup time of all strategies decreases. This is because higher network bandwidth implies shorter pulling makespan as well as shorter startup time. Fig. 5(c) further shows the average pulling makespan. Our LOPO decreases the pulling makespan by 12.08%, 15.32%, 12.08%, and 9.28%, compared with Docker, GLSA, ADAL, and CNTR, respectively. We can also see that as the network bandwidth increases, the pulling makespan gap among the five strategies becomes smaller. This is because with higher network bandwidth the pulling time of each layer decreases and the optimization space for orchestrating layer pulling also becomes small. It can be expected that, with sufficiently large network bandwidth, the pulling makespans of all strategies will converge.

Next, Fig. 5(d) details the storage tail latency. We first observe, somewhat surprisingly, the huge advantage of LOPO on storage tail latency. Another interesting finding is that the storage tail latency of Docker, GLSA, ADAL, and CNTR keeps relatively stable while the storage tail latency of the LOPO strategy shows an increasing trend. This is because the storage tail latency is determined by the total storage time and the pulling makespan. Under a fixed total storage time, the tail latency becomes longer with a shorter pulling makespan. Since LOPO achieves a shorter pulling makespan with higher network bandwidth, its storage tail latency becomes longer.

Overall, the startup time of LOPO is 27.34%, 31.67%, 27.34%, and 25.43% shorter than Docker, GLSA, ADAL, and CNTR, respectively.

Fig. 5. The startup time of the 100 most pulled images from Docker Hub under different network bandwidths: (a) startup time under network bandwidth of 100Mbps; (b) average startup time under different network bandwidths; (c) pulling makespan under different network bandwidths; (d) storage tail latency under different network bandwidths.

2) With DeathStarBench: We take the real world trace of DeathStarBench with the 3 applications Social Network, Media Service, and Hotel Reservation to test the performance of LOPO. Social Network consists of 6 microservices with totally 61 layers and layer sizes in the range of 0.00012MB to 219.80MB. Media Service consists of 7 microservices with totally 52 layers and layer sizes in the range of 0.00012MB to 219.80MB. Hotel Reservation consists of 5 microservices with totally 42 layers and layer sizes in the range of 0.00012MB to 134.80MB. We start different types and numbers of microservices in these 3 applications on seven servers, and present their average startup time in Fig. 6. Findings similar to those of Fig. 5 can also be observed from Fig. 6. It can be seen that our LOPO always achieves the shortest startup time among all five strategies under different bandwidths in Fig. 6(a) and Fig. 6(b). Compared with Docker, GLSA, ADAL, and CNTR, LOPO also reduces the pulling makespan by 16.15%, 9.39%, 16.15%, and 13.57%, respectively, as shown in Fig. 6(c). Meanwhile, the storage tail latency is reduced by 46.15%, 57.01%, 46.15%, and 49.31%, respectively, as shown in Fig. 6(d). For servers 1, 4, 5, and 7, the storage tail latency is even reduced to 2.74s, 1.97s, 1.48s, and 1.09s, since our LOPO can better utilize the multiple threads and reduce the storage tail latency by processing the layers with the longest chains first. Compared with Docker, GLSA, ADAL, and CNTR, our LOPO averagely reduces the startup time by 23.56%, 22.79%, 23.56%, and 20.91%, respectively.

Fig. 6. The startup time of different applications from DeathStarBench under different network bandwidths: (a) startup time under network bandwidth of 100Mbps; (b) average startup time under different network bandwidths; (c) pulling makespan under different network bandwidths; (d) storage tail latency under different network bandwidths.

B. On Different Number of Threads

Next, we vary the number of pulling threads from 1 to 6. The network bandwidth of each thread is set as 100Mbps.

Fig. 7 shows the startup time under different numbers of threads with the DeathStarBench applications. We can see that the LOPO strategy always performs the best and the startup time gradually decreases with an increasing number of threads. When there is only 1 thread, it is impossible to reduce the layer pulling makespan, which therefore has the same value for all five strategies. Yet, it is still possible to orchestrate the layer pulling to hide the storage time. By such means, LOPO can significantly reduce the storage tail latency, making it even hard to be seen in Fig. 7. With more than one thread, LOPO also always shows the best performance. Note that our LOPO can better orchestrate the layer pulling order via the longest layer chain and allocate each pulling task to the thread with the least workload. Hence, no matter how many pulling threads there are, LOPO can determine a better pulling order and fully utilize the multiple pulling threads, thereby achieving the shortest startup time, averagely 17.91%, 14.95%, 17.91%, and 16.64% shorter than Docker, GLSA, ADAL, and CNTR, respectively. One interesting observation is that there exists a diminishing marginal effect with the number of pulling threads. That is, when the number of threads is large enough, e.g., larger than the number of layers, each layer pulling task can be handled by one thread individually. In this case, the startup time cannot be further optimized by any strategy.

Fig. 7. Startup time under different number of threads

C. On Different Storage Bandwidths

Finally, let us check how the storage bandwidth affects the startup time with the DeathStarBench applications. We set the network bandwidth as 100Mbps, the number of pulling threads as 3, and vary the storage bandwidth of each microservice from 100Mbps to 400Mbps. The storage bandwidth only affects the storage tail latency. The results are reported in Fig. 8.

Fig. 8. Storage tail latency under different storage bandwidth

A decreasing trend of storage tail latency can be seen when the storage bandwidth changes from 100Mbps to 400Mbps.
Moreover, with the increase of storage bandwidth, the storage tail latencies of all five strategies gradually converge. This is because when the storage bandwidth is sufficient, most layers can be stored immediately after being pulled. Hence, the storage tail latency is mainly determined by the storage time of the few last pulled layers. Under large storage bandwidth, this storage tail latency is relatively short for all strategies, e.g., after the storage bandwidth reaches 300Mbps. Therefore, the gaps between different strategies become smaller with increasing storage bandwidth.

V. RELATED WORK

Container-based microservices are widely implemented in the cloud to provide on-demand online services [21], [22]. However, compared with the short lifetime of microservices, the startup time is relatively long and has become one of the main challenges. Some studies focus on microservice placement and task scheduling to increase the overlap between pulled images, thereby reducing image pull time [23]. For example, Gu et al. [11] take advantage of common layer sharing to minimize the image pulling traffic and microservice startup latency. Some works propose new storage drivers for containers to reduce microservice startup time. Harter et al. [24] design a new Docker storage driver for fast microservice startup which lazily pulls image data. Other studies try to modify the microservice image structure to reduce the image size. For example, Thalheim et al. [13] propose CNTR, which splits an image into thin and fat versions, with the thin image being pulled first for fast startup and the fat one being pulled on-demand.

Unfortunately, all the above solutions need explicit intervention such as microservice placement decision adjustment or container restructuring. Once the microservice system is chosen or the placement is pre-determined, these strategies cannot be applied. Compared to existing solutions, LOPO is more applicable since it accelerates the startup on each server individually. It is also for this reason that LOPO can be combined with existing solutions to further reduce the startup time.

VI. CONCLUSION

Container based microservices suffer from slow startup. We discover that by orchestrating the microservice layer pulling order, the startup time can be reduced. In this paper, we study the problem of layer pulling order orchestration for microservice startup acceleration. The problem is first proved NP-hard by reduction from the list scheduling problem. We then construct a layer chain to estimate the startup time and further design our LOPO strategy with polynomial computational complexity. We also theoretically analyze the guaranteed approximation ratio of LOPO as 2. To evaluate the performance efficiency of LOPO, extensive trace based experiments are conducted. The results show that LOPO averagely reduces the startup time by 23.56%, 22.79%, 23.56%, and 20.91%, respectively, compared with four state-of-the-art solutions.

REFERENCES

[1] S. Arnautov, B. Trach, F. Gregor, T. Knauth, A. Martin, C. Priebe, J. Lind, D. Muthukumaran, D. O'Keeffe, M. L. Stillwell, D. Goltzsche, D. Eyers, R. Kapitza, P. Pietzuch, and C. Fetzer, "SCONE: Secure linux containers with intel SGX," in Proc. of USENIX OSDI, 2016, pp. 689–703.
[2] P. Sharma, L. Chaufournier, P. Shenoy, and Y. C. Tay, "Containers and virtual machines at scale: A comparative study," in Proc. of ACM Middleware, 2016, pp. 1–13.
[3] L. Du, T. Wo, R. Yang, and C. Hu, "Cider: A rapid docker container deployment system through sharing network storage," in Proc. of IEEE HPCC, 2017, pp. 332–339.
[4] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, "Large-scale cluster management at google with borg," in Proc. of ACM EuroSys, 2015, pp. 1–17.
[5] K. Mahajan, S. Mahajan, V. Misra, and D. Rubenstein, "Exploiting content similarity to address cold start in container deployments," in Proc. of ACM CoNEXT, 2019, pp. 37–39.
[6] M. Shahrad, R. Fonseca, Í. Goiri, G. Chaudhry, P. Batum, J. Cooke, E. Laureano, C. Tresness, M. Russinovich, and R. Bianchini, "Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider," in Proc. of USENIX ATC, 2020, pp. 205–218.
[7] L. Wang, M. Li, Y. Zhang, T. Ristenpart, and M. Swift, "Peeking behind the curtains of serverless platforms," in Proc. of USENIX ATC, 2018, pp. 133–146.
[8] P. Silva, D. Fireman, and T. E. Pereira, "Prebaking functions to warm the serverless cold start," in Proc. of ACM Middleware, 2020, pp. 1–13.
[9] H. Qiu, S. S. Banerjee, S. Jha, Z. T. Kalbarczyk, and R. K. Iyer, "FIRM: An intelligent fine-grained resource management framework for SLO-Oriented microservices," in Proc. of USENIX OSDI, 2020, pp. 805–825.
[10] J. Lou, H. Luo, Z. Tang, W. Jia, and W. Zhao, "Efficient container assignment and layer sequencing in edge computing," IEEE Transactions on Services Computing, pp. 1–14, 2022.
[11] L. Gu, D. Zeng, J. Hu, H. Jin, S. Guo, and A. Y. Zomaya, "Exploring layered container structure for cost efficient microservice deployment," in Proc. of IEEE INFOCOM, 2021, pp. 1–9.
[12] S. Fu, R. Mittal, L. Zhang, and S. Ratnasamy, "Fast and efficient container startup at the edge via dependency scheduling," in Proc. of USENIX HotEdge, 2020.
[13] J. Thalheim, P. Bhatotia, P. Fonseca, and B. Kasikci, "Cntr: Lightweight OS containers," in Proc. of USENIX ATC, 2018, pp. 199–212.
[14] S. Li, A. Zhou, X. Ma, M. Xu, and S. Wang, "Commutativity-guaranteed docker image reconstruction towards effective layer sharing," in Proc. of ACM WWW, 2022, pp. 3358–3366.
[15] T. Xu and D. Marinov, "Mining container image repositories for software configuration and beyond," in Proc. of ACM ICSE-NIER, 2018, pp. 49–52.
[16] L. Gu, Q. Tang, S. Wu, H. Jin, Y. Zhang, G. Shi, T. Lin, and J. Rao, "N-docker: A NVM-HDD hybrid docker storage framework to improve docker performance," in Proc. of IFIP NPC, 2019, pp. 182–194.
[17] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, and B. Jackson, "An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems," in Proc. of ACM ASPLOS, 2019, pp. 3–18.
[18] R. L. Graham, "Bounds on multiprocessing timing anomalies," SIAM Journal on Applied Mathematics, vol. 17, no. 2, pp. 416–429, 1969.
[19] M. R. Garey, D. S. Johnson, and R. Sethi, "The complexity of flowshop and jobshop scheduling," Mathematics of Operations Research, vol. 1, no. 2, pp. 117–129, 1976.
[20] L. A. Adamic and B. A. Huberman, "Zipf's law and the Internet," Glottometrics, pp. 143–150, 2002.
[21] A. Celesti, L. Carnevale, A. Galletta, M. Fazio, and M. Villari, "A watchdog service making container-based micro-services reliable in IoT Clouds," in Proc. of IEEE FiCloud, 2017, pp. 372–378.
[22] F. Manco, C. Lupu, F. Schmidt, J. Mendes, S. Kuenzer, S. Sati, K. Yasukata, C. Raiciu, and F. Huici, "My VM is lighter (and safer) than your container," in Proc. of ACM SOSP, 2017, pp. 218–233.
[23] I. Gog, M. Schwarzkopf, A. Gleave, R. N. M. Watson, and S. Hand, "Firmament: Fast, centralized cluster scheduling at scale," in Proc. of USENIX OSDI, 2016, pp. 99–115.
[24] T. Harter, B. Salmon, R. Liu, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "Slacker: Fast distribution with lazy docker containers," in Proc. of ACM FAST, 2016, pp. 181–195.

