You are on page 1of 15

Journal of Network and Computer Applications 196 (2021) 103224

Contents lists available at ScienceDirect

Journal of Network and Computer Applications


journal homepage: www.elsevier.com/locate/jnca

HOPASS: A two-layer control framework for bandwidth and delay guarantee


in datacenters
Kai Lei a,b ,∗, Junlin Huang a , Xiaodong Li a , Yu Li a , Ye Zhang a , Bo Bai c , Fan Zhang c ,
Gong Zhang c , Jingjie Jiang c
a
Shenzhen Key Lab for Information Centric Networking & Blockchain Technology (ICNLAB), School of Electronic and Computer Engineering (SECE), Peking
University, Shenzhen, China
b PCL Research Center of Networks and Communications, Peng Cheng Laboratory, Shenzhen, China
c
Theory Lab, 2012 Labs, Huawei Technologies, Co. Ltd., Hong Kong

ARTICLE INFO ABSTRACT

Keywords: In data center networks (DCNs), flows with different objectives coexist and compete for limited network
Data center resources (such as bandwidth and buffer space). Without harmonious resource planning, chaotic competition
Multi-objective optimization among these flows would lead to severe performance degradation. Furthermore, low latency is critical for
Receiver-driven protocol
many emerging applications such as augmented reality(AR), virtual reality(VR) and telepresence, making the
Flow scheduling
network control problem even more challenging. To address the above issues, this paper novelly proposes
Bandwidth allocation
a receiver-driven two-layer control framework called HOPASS, which incorporates a slow control layer and
a fast control layer to strike a balance among multiple network sharing objectives and achieve low latency.
The slow control layer ensures bandwidth guarantee in an aggregated flow level by solving a multi-objective
network utility maximization (NUM) problem using an online learning approach. Then the results will be
dispatched to the data plane by configuring weights in the switches with weighted fair queue functionality.
Under the configuration dictated by the slow control layer, the fast control layer leverages the token packets
sent by receivers to dynamically probe and reserve network capacity, so that it can proactively prevent
network congestion and guarantee low latency data delivery. To evaluate the proposed framework, we have
implemented HOPASS in ns-3 and conduct extensive experiments under various network scenarios. The
simulation results show that HOPASS achieves near-optimal performance in terms of bandwidth allocation
in multi-objective scenarios and also guarantees low end-to-end delay. Moreover, it outperforms DCTCP and
NewReno in terms of average bandwidth utilization and global total network utility at the aggregated flow
level. Therefore, we conclude that HOPASS provides an effective framework for DCNs when considering both
multi-objective optimization and a low latency network.

1. Introduction For applications like on-demand high-definition video streaming, the


associated flows require high throughput. Meanwhile, some real-time
Due to the development of the emerging applications with various applications, like live telecast and network conference, need low end-
service requirements in today’s data center networks (DCNs), flows to-end latency for flows. Moreover, some applications demand high
with different performance objectives coexist and compete for the lim- throughput and low end-to-end delay for flows, such as augmented
ited network resources. Infrastructure-as-a-Service clouds has introduce reality (AR), virtual reality (VR) and telepresence. As for applications
multiple kinds of virtual machine (VMs) (Masdari and Zangakani, 2019; such as email, they only require best-effort services. It is evident that
Hosseinzadeh et al., 2020), which could have different amounts of some of these objectives may conflict with each other. Thus, a multi-
various resources. Through the Vms, customers could achieve their objective network utility maximization (NUM) problem needs to be
different level of Quality of Service (QoS), which brings a significant solved to strike a balance among these conflicting objectives.
challenge for the network resource allocation and flow scheduling.

∗ Corresponding author at: Shenzhen Key Lab for Information Centric Networking & Blockchain Technology (ICNLAB), School of Electronic and Computer
Engineering (SECE), Peking University, Shenzhen, China.
E-mail addresses: leik@pkusz.edu.cn (K. Lei), huangjunlin@pku.edu.cn (J. Huang), lxdong0128@stu.pku.edu.cn (X. Li), liyu7u@pku.edu.cn (Y. Li),
zhangye21@pku.edu.cn (Y. Zhang), baibo8@huawei.com (B. Bai), zhang.fan2@huawei.com (F. Zhang), nicholas.zhang@huawei.com (G. Zhang),
jiang.jingjie@huawei.com (J. Jiang).

https://doi.org/10.1016/j.jnca.2021.103224
Received 12 April 2021; Received in revised form 12 August 2021; Accepted 10 September 2021
Available online 9 October 2021
1084-8045/© 2021 Elsevier Ltd. All rights reserved.
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

There has been a tremendous amount of effort in solutions to the • An online learning based approach to solving the multi-
NUM problem. However, as far as we know, most of the existing objective NUM problem in the slow control layer: The learning
solutions only try to optimize network performance with a single based approach can solve NUM problems with a wide variety of
objective. For example, the solutions in Alizadeh et al. (2010, 2013), utility functions. What is more, our proposed method converges
Munir et al. (2013), Nagaraj et al. (2016), Perry et al. (2014), Vamanan fast and the convergent result is near-optimal.
et al. (2012), Zats et al. (2012), Frohlich and Gelenbe (2020) and • An objective-oriented scheduling among flows based on a
Fröhlich et al. (2021) only focus on optimizing a single performance receiver-driven approach: Receivers dynamically adjust the to-
objective such as maximum throughput, minimum congestion, or min- ken sending rate based on the estimated network state (e.g., con-
imum latency. Note that the single objective optimization may lead to gestion level and packet drop rate), and schedule small flows and
the degradation of other performance metrics because some objectives large flows at different frequencies. Then the senders can only
such as throughput and delay naturally contradict each other. Simply send data after receiving tokens. Such a mechanism can positively
applying a single-objective solution can only satisfy some applications’ address the possible congestion and meet the requirements from
requirements while hurting the performance of applications with a both latency-sensitive mice flows and bandwidth-hungry elephant
different objective. It is desirable to consider the trade off among flows.
different performance objectives to achieve relative fairness among
different flows. The rest of the paper is organized as follows. Section 2 introduces
the related work and the motivation. Section 3 describes the design
However, even given a multi-objective resource allocation solution,
details of HOPASS. Section 4 presents the modeling process and the
there still needs an objective-oriented scheduling among flows under
theoretic analysis of the SCL and FCL, respectively. Section 5 evaluates
various constraints to achieve the performance balance and to address
HOPASS against the serval classic method and shows the performance
the congestion in the network data plane. Such a congestion control
of HOPASS. Finally, Section 6 makes a conclusion.
problem is very challenging because of the well-known incast prob-
lem incurred by massive bursty and concurrent traffic from parallel-
computing (Zaharia et al., 2012) and distributed deep learning (Abadi 2. Background and motivation
et al., 2016; Li et al., 2014) applications. Traditional sender-driven
transport layer protocols such as TCP CUBIC (Ha et al., 2008) and 2.1. Multi-objective network utility maximization
DCTCP (Alizadeh et al., 2010) cannot ensure low queuing delay, zero
data loss and high throughput under the incast scenario. Moreover, in
In DCNs, flows associated with different applications may have
DCNs, queuing delay is the main contributing factor for latency since
different performance requirements, such as flow completion time and
propagation delay is low. Therefore, considering the commonly used
end-to-end throughput. Most of the existing solutions only focus on
many-to-one communication pattern in DCNs, we propose to adopt
optimizing a single performance objective, such as minimizing the end-
receiver-driven congestion control to resolve this issue.
to-end delay or maximizing the throughput. NUMFabric (Nagaraj et al.,
Received-driven protocols (Gao et al., 2015; Cho et al., 2017; Han-
2016) converted the bandwidth allocation problem into a Network
dley et al., 2017; Montazeri et al., 2018) have been quite popular
Utility Maximization (NUM) problem and solved it in a distributed
recently to solve congestion issues in DCNs due to incast traffic. The
manner. However, the solution in NUMFabric cannot be extended to
idea of receiver-driven control is that the senders can learn the guar-
solve the multi-objective NUM problem. DeTail (Zats et al., 2012),
anteed amount of traffic to be sent based on the number of tokens
𝐿2 𝑇 𝐶𝑃 (Munir et al., 2013), pHost (Gao et al., 2015), pFabric (Alizadeh
received from the receiver. Such a control scheme reacts to the network
et al., 2013), PIAS (Bai et al., 2015) and Homa (Montazeri et al., 2018)
congestion in a proactive way instead of the passive manner used in the
are typical works with the objective of minimizing the flow completion
sender-driven control schemes.
time (FCT). Even though some works studied the coexistence of mul-
In this paper, we propose HOPASS, a two-layer control framework
tiple utilities in the network, none of them can give a complete and
for DCNs. With the cooperation of the two control layers, HOPASS can
effective solution.
not only achieve the near-optimal performance for the global multi-
On the other hand, existing works that consider multiple objectives
objective NUM problem but also guarantees low latency for small flows.
either fail to clarify how to allocate resources among flows with dif-
Specifically, HOPASS consists of two layers: (i) a slow control layer
ferent objectives, or require too much time to converge to the optimal
(SCL) that allocates bandwidth among aggregated flows with different
allocation. SCC (Tian et al., 2017) proposed a two-layer control method
objectives and works at a coarse-grained mode, and (ii) a fast control
to solve a multi-tenant multi-objective bandwidth allocation problem,
layer (FCL) that works at a fine-grained mode and acts as a received-
but it only solves this problem with given weights. SCC does not give a
driven protocol to schedule the flows under the allocated bandwidth
solution on how to get the weights that determine how many resources
decided by the SCL and to do the congestion control function.
each type of flow is allocated with. BwE (Kumar et al., 2015) proposed
There are several challenges to designing such a two-layer control
a hierarchical bandwidth allocation framework, which optimizes the
framework. Firstly, different objectives may contradict each other and
performance of flows with different requirements in terms of through-
how to design a scheme to ensure fairness across flows is very challeng-
put and fairness. However, BwE is a purely centralized solution that
ing. Secondly, enforcing the solution of the multi-objective optimiza-
requires global information from the network and induces a very long
tion in a distributed manner is desirable for practical implementation.
convergence time.
Thirdly, ensuring low latency for small flows while preventing the large
Besides, some methods based on learning approach have been ap-
flows from starving. Our proposed framework, HOPASS, addresses the
plied to solving bandwidth allocation problems recently and are quite
above challenges and the contributions are summarized as follows:
useful when dealing with highly dynamic networks. The authors in Xu
• A two-layer framework for solving multi-objective NUM and et al. (2018) proposed a centralized flow scheduling strategy based on
ensuring low latency: This framework includes two layers that deep reinforcement learning (DRL). The authors in Dong et al. (2018)
focus on solving a multi-objective NUM problem and resolve also consider using DRL approach to adjust the flow transmission rate
congestion, respectively. The two layers work together to balance to adapt to the dynamics of the network. However, even these solutions
the performances of flows with different objectives and ensure a can modify the reward function to realize different objectives, they are
low latency in the data plane. not able to achieve multiple objectives at the same time.

2
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

Table 1
Summary of characteristics.
In-network congestion Differentiated QoS No switch support
pHost (Gao et al., 2015) ✗ ✗ ✓
Homa (Montazeri et al., 2018) ✗ ✓ ✓
ExpressPass (Cho et al., 2017) ✓ ✗ ✗
FCL ✓ ✓ ✗

2.2. Receiver-driven protocol with congestion control

Typical applications in DCN, such as MapReduce (Dean and Ghe-


mawat, 2004), web search, and distributed machine learning, often
employ the many-to-one communication pattern. This leads to bursty
flow arrivals and a large number of concurrent flows. Such incast
traffic essentially makes most network congestion happening at the
receiver side. Hence, the traditional sender-based congestion control
schemes (Ha et al., 2008; Alizadeh et al., 2010; Wilson et al., 2011;
Vamanan et al., 2012) fail to perform well since the congestion signal
Fig. 1. The average queuing delay at different switches.
delivered from the far end are pretty delayed and the senders can only
react to the delayed signal to avoid congestion in a very passive way.
Table 2
Instead, the receiver-driven protocol enables proactive congestion Common optimization objectives and corresponding utility functions.
control using token packets (also known as credits Cho et al., 2017 Objective of flows Utility function
or message grants Montazeri et al., 2018). When a sender initiates a
Proportional fairness 𝑈𝑝𝑓 (𝑥) = 𝑤1 𝑙𝑜𝑔(𝑥)
new flow, it first sends a request-to-start (RTS) message to the receiver. 1−𝛼
𝛼-fairness 𝑈𝛼 (𝑥) = 𝑤2 𝑥1−𝛼
Once receiving an RTS message, the receiver determines how many 1−𝛼
Minimize flow completion time 𝑈𝑓 𝑐𝑡 (𝑥) = 𝑤3 𝑥𝑠
data packets can be sent by each sender based on its available network 𝑓

capacity. The receiver then issues the associated number of tokens to Remark: To ensure that the utility functions are concave, 𝛼 ranges from 0 to 1.
the senders per maximum transmission unit (MTU) time. On receiving
each token from the receiver, the sender sends one packet. Therefore,
the number of tokens received strictly determines the amount of data
traffic the sender pours into the network. Then the sender can have requirements from different kinds of users and concentrate more on
a rough estimate on the available network capacity, and the receiver applications’ efficiency. That is why we are trying to solve the multi-
can also avoid network congestion and bursty traffic pouring into the objective NUM problem in data center networks under limited network
network by adjusting the token sending rate. computing and transmission resources. The reward for networks or
However, existing receiver-driven protocols have mainly focused customers may be different for flows of different services while ob-
on minimizing flow completion time (FCT) for latency-sensitive mice taining the same network resources. Utility functions, which map the
flows. Bandwidth-hungry elephant flows either become starved due obtained bandwidth to the expected reward, are introduced to quantify
to low priority (Gao et al., 2015; Montazeri et al., 2018), or only the network utility. Diversified or user-defined optimization objectives
achieve fair sharing (Cho et al., 2017; Handley et al., 2017) without can be supported by changing the form or parameters of the utility
differentiated QoS. Furthermore, congestion can happen within the functions. At present, the objectives that data center networks focus
core network. As shown in Fig. 1, we measure the queue lengths of the on mainly include but are not limited to those listed in Table 2.
switches in the network core under three different production traces In Table 2, different utility functions represent different require-
and the same topology as in pHost’s paper (Gao et al., 2015), the ments of bandwidth. 𝛼-fairness (Kelly, 1997) focuses on the fairness
average queuing delay is as large as 895 ns (180 + 715 under IMC10). among the bandwidth allocation of different users. Proportional fair-
Given that the per-link propagation delay is only 200 ns, the queuing ness is a special case of 𝛼-fairness. When 𝛼 → ←← 1, according to the
delay is crucial in end-to-end latency. We compared the characteristics L’Hospital’s rule, the utility function of 𝛼-fairness can convert to utility
of FCL with some similar solutions, the comparisons of FCL against function of proportional fairness. The third utility function concentrates
these solutions are summarized in Table 1. In pHost (Gao et al., 2015) on a performance metric named flow completion time (Munir et al.,
and Homa (Montazeri et al., 2018), the authors assume congestion only 2013). In this functions, 𝑠𝑓 represents the flow size.
happens in the edge network and the core network is congestion-free. In the scenario of multi-objective coexistence, flow control method
Hence, as the table shown, the proposed receiver-driven approaches should allocate the limited network resources to different flows reason-
cannot be applied to solve any in-network congestion. ExpressPass (Cho ably to maximize the total utility of the data center networks. With the
et al., 2017) tackles the in-network congestion issue by actively limiting help of utility functions, flow control problem in data center networks
the token sending rate at switches and enforcing symmetric routes can be modeled as a multi-objective NUM problem. Its mathematical
for token and data packets. However, it requires non-trivial modifica- expression can be listed as follows:
tions of switches which is highly undesirable for commodity switches. 𝑁𝑈 𝐹𝑖
∑ ∑
Although FCL takes switch support into the design to achieve better 𝑚𝑎𝑥𝑖𝑚𝑖𝑧𝑒 𝑈𝑖 (𝑥𝑖𝑗 )
performance with the cooperation of SCL, the operations in switches 𝑖 𝑗
(1)
are quite simple and can be soft-implemented. 𝑠𝑢𝑏𝑗𝑒𝑐𝑡 𝑡𝑜 𝐀𝐗 < 𝐂
𝑥𝑖𝑗 ≥ 0
2.3. Motivation and challenges
Here, 𝑁𝑈 represents the number of utility functions coexisting in the
Nowadays, optimization of network performance pays more atten- networks. 𝐹𝑖 represents the number of flows for 𝑖th objective. 𝑈𝑖 repre-
tion to the efficiency of network resources such as throughput, band- sents the utility function of 𝑖th objective. 𝐂 represents the bandwidth
width utilization, delay and so on. Under the trend of cloudification capacity of the link. 𝐗 represents the bandwidth allocation results. 𝑥𝑖𝑗
for applications, data center networks should satisfy the differentiated represents the bandwidth allocation for the 𝑗th flows with objective 𝑖

3
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

and 𝑥𝑖𝑗 ∈ 𝐗. 𝐀 is a 𝐿 × 𝐹𝑖 matrix and 𝐿 denotes the number of links


of the network. 𝐴𝑙𝑖 in the matrix is 1 when the 𝑖th flow gets through
the 𝑙th link, otherwise it is 0. The constraint condition of this problem
mainly comes from limited link capacity.
If we constrain the utility functions to concave functions, the multi-
objective NUM problem will be transformed into a convex optimization
problem. This problem can be solved mathematically by the condi-
tions of Lagrange Multiplier and KKT (Karush–Kuhn–Tucker) condition.
Relying on the centralized control architecture, the mathematical so-
lution mentioned above is theoretically feasible, but there are still
many challenges in practical applications. As the scale of traffic in
data center networks continues to increase, the multi-objective NUM
problem’s scale is growing larger and larger. Although centralized
control architecture can guarantee a globally optimal solution, it also
introduces long communication and calculation time. Long communi-
cation latency is caused by the extensive information exchange between
the central controller and other network devices, such as hosts and
switches. Long calculation time comes from the centralized solution
to large-scale convex optimization problems. However, the traffic in
data center networks has obvious heavy-tailed characteristics. Most
of the flows are short flows, but the long flows occupy most of the
traffic. Most of the time, the short flows have finished transmitting Fig. 2. Architecture of HOPASS.
before the optimal bandwidth allocation results are figured out. On the
contrary, the distributed control architecture can provide fast control
for packets, but it cannot guarantee convergence to the global optimal decentralized way, and it is adaptive to the dynamic information of
point because the decisions were made based on local information. the network (such as non-persistent congestion). Also, this layer consid-
Neither the centralized control architecture nor the distributed control ers both latency-sensitive mice flows and bandwidth-hungry elephant
architecture can simultaneously take the needs of both fast control flows. We illustrate such an design idea in Fig. 2.
and accurate control into consideration. To address this problem, it is Under the above design of the two-layer control framework, we
inevitable to re-design the control framework. observe that both layers are critical to ensure good end-to-end per-
formance. Suppose there is only a slow control layer, it will be hard
2.4. A two-layer control framework to guarantee low latency network and effective weighted bandwidth
allocation without the network congestions control in the fast control
Based on the above analysis, it is complicated to either solve the layer. On the contrary, if there is no global performance guarantee
multi-objective NUM problem or achieve multi-objective performance without the slow control layer, the overall network performance would
with a low-latency guarantee for small flows and reasonable congestion be severely affected. What is more, the two-layer control combined with
control in a distributed manner. It is far more challenging if one wants fast and slow control can balance the contradiction between accurate
to achieve both at the same time. bandwidth allocation and fast scheduling for packets.
Some researchers do this problem from a macro perspective. They
aim to find request routing paths or a mapping between requests 3. Design of HOPASS
and services through machine learning methods. CRE (Francois and
Gelenbe, 2016) is a logically centralized cognitive routing engine based In this section, we illustrate the design ideas behind HOPASS. Note
on random Neural Networks with Reinforcement Learning. When the that our goal is to achieve multi-objective NUM and provide a low-
overlay nodes use the public Internet as the communication means, latency network infrastructure based on reasonable congestion control.
it tries to find the optimal overlay paths with minimal monitoring It is difficult to ultimately achieve the above goals if innovations are
overhead. Reinforcement Learning is also used in paper (Frohlich et al., only made in algorithm design. In order to strike a balance between
2020) to help an SDN controller to direct user requests at the edge control accuracy and efficiency, changes in the control framework are
toward appropriate servers where the requests can be satisfied. Pa- necessary. This paper profoundly studies the traffic control architecture
per (Wang and Gelenbe, 2018) designs three online QoS aware adaptive of data center network from the two dimensions of control framework
task allocation schemes to provide a lower response time by assign- and algorithm.
ing tasks to sub-systems and splitting the task arrival stream into
sub-streams at rates computed from the hosts’ processing capabilities. 3.1. Overview of the architecture
Instead of leading requests to the instantiated services, paper (Frohlich
and Gelenbe, 2020) focuses on the services placement optimization by The nervous system of humans has important enlightenment for
using the SDN and machine learning means. the design of the control system. It is mainly composed of the brain
Different from the above methods, this paper focus on finer granu- and spinal cord. The brain is responsible for regulating some complex
larity. Suppose that the flows’ destinations have been determined, when activities such as thinking, learning and so on. The spinal cord is
the multi-objective flows go through the same path, how to guarantee responsible for some simple reflex activities, such as knee jerk reflex.
a fair and stable bandwidth allocation and low latency simultaneously? Inspired by the nervous system, John C. Doyle proposed fast and slow
We propose a two-layer control framework that incorporates a slow neural control theory, emphasizing the importance of combining cen-
control layer and a fast control layer to address this. Specifically, the tralized slow control and distributed fast control. Based on the above
slow control layer works in a coarse-time granularity and allocates theory, we proposed a coordinated distributed flow control framework.
the network resources at a macroscopic scale, that is, aggregated flow This framework is still a distributed control framework in essence
level. The slow control layer is adaptive to the static network states to guarantee low control delay and transmission delay. However, it
(such as network topology) and guarantees the bandwidth allocation allows control nodes to fully coordinate with each other or collect
for aggregated flows. As for the fast control layer, it operates in a information on a longer time scale. And then, an optimal control

4
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

result will be calculated to adjust the fast control for each packet.
The proposed control framework can be regarded as a compromise
between centralized control and distributed control. It is expected that
this framework is able to provide accurate flow control results while
ensuring the timeliness of control.
Based on the proposed coordinated distributed control framework,
this paper design a hierarchical flow control method named HOPASS to
solve the multi-objective NUM problem in data center networks. Unlike
the flat control paradigm, such as traditional pure distributed control
or centralized control, a hierarchical control structure makes it possible Fig. 3. Control cycle of slow control layer.
to leverage the idea of divide and conquer to solve the large-scale NUM
flow control problem. The multi-objective flow control problem can be
decomposed into the following two sub-problems: collect more local information in a coarse-time granularity to support
an optimal decision, the control cycle of the SCL is relatively long.
• Coarse-grained bandwidth allocation problem focuses on how
In SCL, a packet is the basic unit of scheduling. Therefore, com-
to allocate bandwidth among different objectives to maximize the
pared with SCL, the control cycle of FCL is significantly smaller. The
overall utility of the network. This problem can be defined as
available packets scheduling methods of FCL are diversified, such as
Eq. (2).
the adjustment for transmission window or rate, queue management
𝑁𝑈
∑ ∑
𝐿 strategy, etc. As a result, the deployment of FCL can be divided into
𝑚𝑎𝑥𝑖𝑚𝑖𝑧𝑒 𝑈𝑖 ( 𝑋𝑖𝑙 ) two parts. The first part is the queue management strategy deployed
𝑖 𝑙
(2) on the switch to ensure that the network traffic approximates the
𝑁𝑈
∑ near-optimal results calculated by SCL. The second part is deployed
𝑠𝑢𝑏𝑗𝑒𝑐𝑡 𝑡𝑜 𝑋𝑖𝑙 ≤ 𝐶𝑙 , ∀𝑙
𝑖 on the end-hosts to provide transmission control and objective-orient
packets scheduling. In design, the FCL is compatible with existing
where 𝑁𝑈 represents the number of utility functions coexisting in
or redesigned single-objective flow control methods. In our previous
the networks. 𝑈𝑖 represents the utility function of 𝑖th objective. 𝐶𝑙
work (Lei et al., 2019), we made necessary modifications to an existing
represents the bandwidth capacity of the link 𝑙. 𝑋𝑖𝑙 represents the
method NUMFabric (Nagaraj et al., 2016), and then used it as the fast
bandwidth reserved for the 𝑖th objective on link 𝑙.
control layer algorithm, which verified the feasibility of this design.
• Fine-grained packets scheduling problem, defined as Eq. (3),
In this paper, we redesign a receiver-driven protocol with congestion
mainly focuses on solving the single-objective NUM problem and
control for FCL. Proactive congestion control in FCL helps to meet the
realizes objective-oriented packet scheduling under the specific
requirements of weighted bandwidth sharing and low-latency.
bandwidth constraints. Besides, it is also an important prob-
lem that how to build a low-latency DCN through fast packet
3.2. Design of slow control layer
scheduling.
𝐹𝑖
∑ In this section, we will illustrate the design of SCL in details.
𝑚𝑎𝑥𝑖𝑚𝑖𝑧𝑒 𝑈𝑖 (𝑥𝑖𝑘 ) While solving the coarse-grained bandwidth allocation problem shown
𝑘
(3) in Eq. (2), the SCL algorithm’s design is faced with two challenges. On
𝐹𝑖
∑ the one hand, the requirements of upper-layer applications and users
𝑠𝑢𝑏𝑗𝑒𝑐𝑡 𝑡𝑜 𝑎𝑙𝑖𝑘 𝑟𝑎𝑡𝑒𝑖𝑘 ≤ 𝑋𝑖𝑙 , ∀𝑙
𝑘 keep changing, which puts forward the new demand for the service
quality provided by the under-layer data center networks. As a result,
where 𝐹𝑖 represents the number of flows for 𝑖th objective. 𝑥𝑖𝑘
the expressions and parameters of utility functions are diversified and
represents the transmission rate for the 𝑘th flows with objective 𝑖.
will continue to expand. On the other hand, traffic in DCNs is highly
𝑎𝑙𝑖𝑘 indicates whether the 𝑘th flows with objective 𝑖 passes through
dynamic because of the difference and uncertainty in flow size, objec-
the link 𝑙. When the flow passes through link 𝑙, this value will be
tives, priority and start time. Besides, burst traffic and network failures
set as 1, otherwise 0.
are inevitable and unpredictable. Most of the traditional flow control
HOPASS consists of a Slow Control Layer (SCL) and a Fast Control methods in DCNs were designed based on the network’s mathematical
Layer (FCL). The SCL is responsible for the macro control for the model. To guarantee that the method works in a specific network, it is
network resources. It allocates the network resources in a relatively necessary to build an accurate model for the network environment and
long timescale (usually every few microseconds) and searches for a user requirements.
near-optimal bandwidth allocation for different objectives to maximize To deal with the two challenges mentioned above, HOPASS applies
the overall network utility. The FCL focuses on micro scheduling at an online convex optimization approach in the design of SCL. The
the packet level. Taking the bandwidth pre-allocated by the SCL for online learning approach has the following two advantages. Firstly,
different objectives as input, the FCL uses specific packet control strate- it is independent of any specific model. Secondly, compared with the
gies to further allocate bandwidth pre-allocated to flows with the same traditional model-based approach, this approach is applicable to a
objective. Another target for FCL is providing a best-effort performance wider range of utility functions. So the restrictions of utility functions
guarantee for both latency-sensitive mice flows and bandwidth-hungry are limited. To ensure a near-optimal solution, the utility functions are
elephant flows. In HOPASS, the combination of SCL and FCL can not required to be smooth and strictly concave.
only ensure the fast scheduling for packets on a small time scale, but Leveraging the idea of Online Gradient Descent (ODD) in online
also calculate an optimal bandwidth allocation scheme on a large time convex optimization (Shalev-Shwartz, 2011), the SCL can dynamically
scale to improve the overall utility of the DCNs. More details of HOPASS search for an optimal or near-optimal bandwidth allocation scheme
are shown in Fig. 2. through periodic exploration and decision. The SCL algorithm includes
Deployed only on the switch, the SCL prefers to leverage a learning- two key steps. First, SCL explore the rewards for different bandwidth al-
based approach to solve the bandwidth allocation problem among location. Second, SCL would make decisions based on several historical
different objectives. As a result, in SCL, the unit of scheduling is decisions’ feedback.
aggregated flow. For a specific link, the flows with the same utility In SCL, time is divided into continuous slots of equal size. The
function will be classified into an aggregated flow. Since SCL needs to bandwidth allocation will be updated every time slot. As shown in

5
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

Fig. 4. Description of SCL algorithm.

Fig. 3, each control cycle consists of two phases, i.e., an exploration


phase and a decision phase. Each exploration phase contains several
time slots. The number of time slots in the study phase is equal to
the number of objectives coexisting in the network. In the exploration
phase, each time slot is used to explore the reward of a specific
bandwidth adjustment strategy. The reward is estimated in terms of
the increment of the total utility value of all the objectives. Each Fig. 5. Implementation of the switch.
decision phase contains only one time slot. According to the result of
the exploration phase, the decision maker will calculate a bandwidth
allocation result with a potential largest reward for aggregated flows. In decision phase, the decision maker evaluates the bandwidth
The proposed online learning approach is implemented on the allocation strategies that has been explored in the exploration phase
switches. Fig. 4 shows the update process of bandwidth allocation in and find out a bandwidth adjustment direction which has the largest
the SCL. More details are given as following: potential reward. Then the decision maker will compares this reward
Denote 𝑁 as the number of aggregated flows coexisting in a link. with the reward of previous decision phase. If this reward is larger,
𝑟𝑙𝑛 represents the ratio of capacity that the 𝑛th utility shares on the 𝑙th the decision maker will update the bandwidth allocation according to .
link. Since we want to ensure a high utilization of bandwidth and avoid Otherwise, it means that the total network utility cannot be further
wasting the resources, the constraint shown as Eq. (4) is set for {𝑟𝑙𝑛 }: improve by any adjustment of bandwidth allocation. The bandwidth
allocation will remain the same as the previous control cycle.

𝑁
𝑟𝑙𝑛 = 1, ∀𝑙 (4) ⎧ 𝑙 { }
𝑛=1
𝑙
⎪𝑟𝑚 = 𝐷𝑚 + min 𝑆𝑑 , 𝑆𝑚𝑎𝑥 , 𝑚 = 𝑛∗
⎨ (7)
Therefore, the allocated bandwidth on link 𝑙 for aggregated flow 𝑛 can 𝑙 − min{𝑆𝑑 ,𝑆𝑚𝑎𝑥 } ,
⎪𝑟𝑙𝑚 = 𝐷𝑚 𝑚 ≠ 𝑛∗
be calculated using ⎩ 𝑁−1

where 𝑛∗ is index of time slot that gained the largest total network
𝐵𝑛𝑙 = 𝑟𝑙𝑛 𝐶𝑙 (5)
utility. As shown in Eq. (8), 𝑆𝑑 is the increment step size in decision
In exploration phase, each time slot is used to explore the reward phase. If the step size is too large, the risk of overshooting the optimal
of a bandwidth allocation adjustment direction. In each bandwidth solution will increase. So we set an upper boundary 𝑆𝑚𝑎𝑥 for it.
adjustment direction, the bandwidth of a utility is increased by a 𝜂(𝑈𝑚 (𝑥𝑙𝑚 ) − 𝑈𝑚 (𝑥𝑙𝑛 ))
fixed step length 𝑆𝑒 , and the reward of bandwidth of other utilities 𝑆𝑑 = (8)
1
are reduced in equal proportion. At the end of each time slot in the 𝑆𝑒 (1 + 𝑁−1
)(𝑈𝑚 (𝑥𝑙𝑚 ) − 𝑈𝑚 (0))
exploration stage, according to the results of network measurement, the Here, 𝜂 > 0 is a sensitive variable which measures how aggressive the
decision maker uses the corresponding utility function to quantify the bandwidth strategy is updated.
transmission performance of the time slice into numerical values.
The bandwidth allocation strategy is calculated based on the band- 3.3. Design of fast control layer
width allocation results 𝐷𝑛𝑙 of the previous control cycle. At the 𝑛th slot
within the exploration phase, the 𝑛th bandwidth allocation is increased In this section, we will discuss the design of the FCL. As mentioned
𝑆𝑒
by 𝑆𝑒 while the others are all decreased by 𝑁−1 . This rule maintains above, the implementation of FCL includes two parts, i.e., the queue
the constraint in Eq. (4). The expressions for bandwidth update are management strategy on the switches and the transmission control
summarized as follows: protocol on the end hosts.
{
𝑟𝑙𝑚 = 𝐷𝑚𝑙 +𝑆 ,
𝑒 𝑚=𝑛
𝑙 𝑙 − 𝑆𝑒 ,
(6) 3.3.1. Queue management strategy on switches
𝑟𝑚 = 𝐷𝑚 𝑁−1
𝑚≠𝑛
To avoid the mutual interference among packets with different
where 𝑟𝑙𝑚 represents the updated allocated bandwidth value for the 𝑚th objectives while scheduling, the SCL requires the underlayer to guar-
aggregate flow on the 𝑙th link. antee performance isolation among different objectives. As shown in
The value of the step size 𝑆𝑒 have an impact on the convergence of Fig. 5, an isolated queue mechanism is designed in the FCL. Multiple
the algorithm of SCL. If the step size is too small, it will induce a long isolated queues are used to replace the single queue to manage packets.
convergence time, which is undesirable in highly dynamic networks. It Packets with different optimization objectives are processed in different
is also inappropriate to set it too large because a large step size induces queues. To ensure that the actual network traffic is approximate to
a coarse searching process in the feasible set of problem (2), which may the bandwidth allocation result calculated by the SCL, the round-
lead to suboptimal solutions. robin mechanism is used as the inter-queue scheduling strategy in FCL.

6
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

relatively ‘‘large’’ but have been waiting for a long time. It is defined
as follows:
𝑂𝑃 𝑇
𝑂𝑟 =
current_time − flow_start_time + 𝜏
where 𝜏 is a constant value to protect the flows that are relatively small
from being preempted too easily by flows that are relatively large and
just wait for a little while. And OPT means the optimal FCT of flow
(that is, the flow completion time if it is the only flow in the network).
Specifically, the tokens of large flows from the same receiver will be
sent to the sender in the increasing order of 𝑂𝑟 , namely, the flow with
the smallest 𝑂𝑟 is scheduled first at receivers. For example, there are
three large flows F1, F2 and F3, with a same receiver. F1 is relative
large whose OPT is 1000 and starts at time 0. F2 and F3 start at time
Fig. 6. Fast control layer overview.
300 whose OPT are 200 and 900 respectively. Assume 𝜏 equals 100.
At time 300, F2 will be scheduled first since its 𝑂𝑟 equals 2 and is
smaller than F1’s 𝑂𝑟 (2.5) and F3’s 𝑂𝑟 (9). Then F1 will be scheduled
The packets management strategy inside the queue depends on the as it has been waiting for a long time, and finally F3 will get its turn
transmission control protocol. The macro control results for bandwidth to be scheduled.
allocation calculated by the SCL are used to update the queue man-
agement strategy’s parameters periodically. In HOPASS, end-hosts rely
Algorithm 1 FCL algorithm at sender
on the Explicit Congestion Notification (ECN) mechanism (Leung and
Muppala, 2002) to realize proactive window adjustment according to 1: if flow arrives then
the congestion level. The packets dequeue according to a first-in–first- 2: send RquestToStart;
out (FIFO) manner. Besides, the packets will be marked with an ECN 3: send 8 (or less) Pioneer data packets;
label once queue’s length exceeds a certain threshold. The marking 4: if token packet received then
threshold will be updated periodically based on the output of the SCL. 5: tokenQueue.push(token); ⊳ FIFO
6: while sender is idle do
3.3.2. Transmission control protocol on end-hosts 7: Token = tokenQueue.top();
Specifically, the FCL gives a receiver-driven congestion control with 8: send data(Token);
three participants as shown in Fig. 6, i.e., senders, switches, and 9: tokenQueue.pop();
receivers. The implement of fast control layer algorithm mainly in-
cludes two parts: sending and receiving packets by end host and queue
As for the scheduling between large flows and small flows, the
management of switch. The realization of the function of sending
receiver simply adopts Weighted-Round-Robin (WRR) scheduling. So
data packets by the end host mainly includes checking the sending
small flows will be assigned tokens more frequently than large flows,
window and setting the packet header parameters of the data packets.
which means that the small flows have a higher priority to be sent
The behavior of the end host receiving data packets mainly includes
and will not experience head-of-line blocking due to large-sized flows.
updating the measured values and reading the information exchanged
When there are only small flows or large flows, the active flows will
with the intermediate nodes.
be scheduled in every slot to avoid bandwidth waste. This scheduling
As shown in algorithm 1, when a flow starts, the sender will send an
scheme aims to improve the throughput of large flows with a small
RTS message and several pioneering data packets to the receiver. The
compromise for the FCT of small flows.
RTS message contains some information such as flow size, flow weight
However, only WRR scheduling cannot handle in-network conges-
and so on. Pioneering data packets transmission without waiting for the
tion. The leading cause of in-network congestion is the traffic from
first batch of tokens from the receiver mitigates the end-to-end delay large flows. Therefore, the large flow should back off when congestion
and reduces the latency of small flows. For subsequent data transfer, occurs. Besides, to achieve weighted bandwidth allocation, large flows’
the sender will send data in the arrival order of corresponding tokens. token rates should be decreased based on their weights.
As for switches, they need to support ECN marking firstly. When in- ECN is proven to be efficient for congestion control in data center
network congestion occurs and the queue length exceeds the marking networks (Alizadeh et al., 2010; Vamanan et al., 2012; Mittal et al.,
threshold, data packets will be marked by switches. Such congestion 2015), which is available in commodity switches. We use ECN in our
information will be passed to receivers to handle. Secondly, to enhance design as the signal of congestion to adjust token sending rate, which
the cooperation with SCL, the part of FCL, which is deployed on the can react to congestion from the receiver side. Furthermore, to meet
switch, is designed to take weighted-round-robin scheduling for flows the weighted-fair bandwidth allocation, we play a small trick on the
with different objectives. back-off size.
Receivers are responsible for scheduling flows, updating token rate DCTCP (Alizadeh et al., 2010) introduces 𝛼∕2 to decrease rate sizes.
and handling in-network congestion, as shown in algorithm 2. Each When congestion happens, every flow has the same back-off size, which
receiver maintains one queue for active small flows and one queue for is not reasonable as the large flows with different objectives have
active large flows. When a receiver receives an RTS, it puts the flow different weights. We embed flow weights in the token control loop
into the corresponding queue according to its size. To achieve the near- (see the formula in line 15 of algorithm 2). The larger the weight is,
optimal performance of latency-sensitive small flows, each receiver in the smaller the back-off size is. We will prove this token control loop
HOPASS just applies shortest-remaining-processing-time (SRPT) first can achieve weighted sharing in Section 4.
scheduling inside the queue of active small flows, since SRPT is known As flows have different objectives, the data packets of a flow
as the optimal algorithm for minimizing FCT. However, for the in-queue will enter the corresponding logical weighted queue (which is soft-
scheduling of large flows, if we just use the same way, it is obvious implemented) inside each switch along the path and get a weighted
that there will be some large flows that get starved. So we introduce round-robin scheduling. The slow control layer has configured a ratio
a new metric 𝑂𝑟 to assign each large flow an ordering, to alleviate the of capacity, 𝑟𝑙𝑖 for each queue. If the aggregated data rate of flows
conflict between the flows that are relatively ‘‘small’’ and flows that are belonging to the same slice exceeds the allocated capacity, 𝑟𝑙𝑖 𝑐𝑙 , a

7
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

persistent queue will be built up. Once the queue length exceeds the When 𝛼 → 1, 𝐗 approaches 𝐗∗ , which means 𝐗 become a point in
ECN threshold, all subsequent data packets will be marked. Receivers the neighborhood of 𝐗∗ . However, Eq. (10) shows that 𝑓 (𝐗) > 𝑓 (𝐗∗ ). It
will then adjust the token sending rate accordingly. In this way, the contradicts with the assumption that 𝐗∗ is the local maximum solution.
fast control layer can gradually learn the capacity allocated by the slow This is because if 𝐗∗ is the local maximum solution, inequality 𝑓 (𝐗) >
control layer, and flows with the same objective will fairly compete for 𝑓 (𝐳) always holds for any point 𝐳 in its neighborhood. Therefore, such
the allocated bandwidth. an assumption is invalid. The local maximum value is also the global
maximum value in convex optimization problem.
Algorithm 2 FCL algorithm at receiver In the next step, it will be proved that the problem described in
formula (2) is a convex optimization problem, so the principle proved
1: if RequestToStart received then above can be applied on this problem. The proof is described as follow.
2: if f.size>threshold then Firstly, the feasible region of formula (2) is a convex set, as the con-
𝑂𝑃 𝑇
3: 𝑓 .𝑄𝑟 = current_time−flow_start_time+𝜏 ; straints are linear. Secondly, the sum of all utility functions is concave.
4: LargeflowQueue.push(f); Specifically, to verify the concavity condition, we calculate the second-
5: sort LargeflowQueue in ascending order of 𝑄𝑟 order derivative of the objective function of the NUM problem and we
6: else have
7: SmallflowQueue.push(f);
𝑑 2 𝑈𝑝𝑓 (𝑥) 𝑤1
8: sort SmallflowQueue using SRPT =−
9: Assign f.tokenWin; ⊳ Weighted Share 𝑑2𝑥 𝑥2
𝑑 2 𝑈𝛼 (𝑥)
10: if data packet received for Token T then = −𝑤2 𝛼𝑥−𝛼−1 (11)
𝑑2𝑥
11: set token T as responded 2
𝑑 𝑈𝑓 𝑐𝑡 (𝑥) 2𝑤3 𝑠𝑓
12: if updateTime is up then ⊳ update per RTT =−
13: f.tokenWin+ = 1; ⊳ upper-bounded by the capacity 𝑑2𝑥 𝑥3
14: if data with ECN marking then Since 𝐗 is positive and bounded and 𝛼 ∈ [0, 1], 𝑈𝑖 (𝑥𝑖 ) ≤ 0 always
15: f.tokenWin-=f.tokenWin ∗ 𝛼∕(2 ∗ f.weight); ⊳ Weighted holds, meaning the utility functions are all concave. Therefore, the
Back-off with in-network congestion objective function is concave.
16: while receiver is idle do
17: f = SmallflowQueue.top() or LargeflowQueue.top(); ⊳ WRR 4.2. Analysis of fast control layer
18: if flow’s token window is not used up then ⊳ token rate
control In this section, we analyze the properties of FCL from a theoretic
19: Sort SmallflowQueue and LargeflowQueue; perspective. We first develop a fluid model for large flows’ control
20: Send token; feedback loop of token rate. On top of the model, we further derive how
the classification threshold will influence the bandwidth sharing among
large flows and the queuing delay experienced by the small flows.
4. Modeling and theoretic analysis Before the detailed illustration of the fluid model, let us review the
update process of large flows’ token rates in congestion. The receiver
In this section, we give theoretic analysis for the two control layers maintains a running estimate of the fraction of marked data packets.
of HOPASS respectively. The analysis for SCL is to verify the correctness This estimate, 𝛼, is updated once roughly each round-trip time:
and the efficiency of the multi-objective NUM problem solving. And the 𝛼 ← (1 − 𝑔)𝛼 + 𝑔𝐹 , (12)
analysis for FCL, we mainly focus on the part deployed on end-hosts
which acts as a receiver-driven protocol. where F is the fraction of marked data packets in the most recent 𝑅𝑇 𝑇 ,
and 𝑔 ∈ (0, 1) is a fixed parameter using for the exponentially weighted
moving average estimation. And the token rate of a large flow with a
4.1. Analysis of slow control layer
normalized weight 𝑤̃ 𝑖 (the smallest weight equal to one) is updated as
follows:
In this section, we will explore the conditions to ensure that the SCL
𝛼
algorithm can converge to a near-optimal solution, and the associated 𝑇 𝑟𝑖 ← (1 − )𝑇 𝑟𝑖 (13)
2𝑤̃ 𝑖
proofs will be provided.
It is clearly that only the token rate of the large flow with the smallest
It has been established in paper (Zinkevich, 2003) that gradient
normalized weight 1 will be nearly reduced by half, when 𝛼 is close to
descent is an effective method for the online convex programming prob-
1 (heavy congestion).
lems. The average regret can be guaranteed to approach zero, while
applying gradient descent algorithm to solve the convex optimization
4.2.1. Fluid model for weighted sharing among large flows
problem.
To analyze the behavior of our token control loop, we develop a
When the objective function is concave, it has only one local max-
fluid model for large flows with different predefined weights. 𝑁 long-
imum. The only local maximum value found by the gradient descent
lived flows traverse a single bottleneck switch with capacity 𝐶. The
algorithm is also the global maximum. Reduction to absurdity will be
bottleneck only resides in the path for data packets, whereas the token
applied to prove this principle as follows: Assuming that the bandwidth
packets are sent to senders without experiencing any congestion. The
allocation vector 𝐗∗ calculated by the SCL is only a local maximum following non-linear, delay-differential equations describe the dynam-
instead of a global maximum, then there must be another bandwidth ics of token window 𝑇𝑖 (𝑡), the running estimate of the fraction of marked
allocation vector 𝐗′ satisfying the linear constraints in Eq. (2), so that data packets 𝛼(𝑡), and the queue size at the switch, 𝑞(𝑡):
𝑓 (𝐗′ ) > 𝑓 (𝐗∗ ). We can construct the following vector 𝐗
𝑑𝑇𝑖 1 𝑇 (𝑡)𝛼(𝑡)
= − 𝑝(𝑡 − 𝑅∗ ) 𝑖 , (14)
𝐗 = 𝛼𝐗′ + (1 − 𝛼)𝐗∗ (9) 𝑑𝑡 𝑅(𝑡) 2𝑤̃ 𝑖 𝑅(𝑡)
𝑑𝛼 𝑔
According to the property of concave function, we have the follow- = (𝑝(𝑡 − 𝑅∗ ) − 𝛼(𝑡)), (15)
𝑑𝑡 𝑅(𝑡)
ing inequality:
𝑑𝑞 ∑𝑁
𝑇𝑖 (𝑡)
′ ∗ ∗ = − 𝐶, (16)
𝑓 (𝐗) ≥ 𝛼𝑓 (𝐗 ) + (1 − 𝛼)𝑓 (𝐗 ) > 𝑓 (𝐗 ) (10) 𝑑𝑡 𝑅(𝑡)
𝑖=1

8
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

For flows with sizes smaller than the threshold, they will be classi-
fied as small flows, enter the queue with higher scheduling frequency,
get sufficient tokens and finish. The average size of small flows in the
queue with high scheduling frequency can thus be computed as:

𝛿(𝐻) = 𝑥𝑓 (𝑥)𝑑𝑥 (18)
∫1
𝐻
𝑠 𝑠(1 − 𝐻 1−𝑠 )
= 𝑥 𝑑𝑥 = (19)
∫1 𝑥𝑠+1 𝑠−1
Since 𝑠 > 1 for production data traces (e.g., 𝑠 = log54 satisfies the ‘‘80-
20 rule’’), 𝛿 is an increasing function with the threshold 𝐻. Since these
small flows can get enough tokens in one round-trip, their data packets
will directly enter the network and complete transfer before reacting
to the token control loop. Therefore, the amount of data packets in the
bottleneck queue that comes from these flows is 𝜆𝛿.
For the large flows, they will send tokens assigned by receivers until
the bottleneck queue length reaches the ECN marking threshold 𝐾. At
this point, we denote the token window for each large flow as 𝑇𝑖∗ . We
have

𝑇𝑖∗ + 𝜆𝛿(𝐻) = 𝐾 + 𝐶 × 𝑑 (20)
𝑖
Fig. 7. Numerical analysis of FCL. ∑
(𝑇𝑖∗ + 1) + 𝜆𝛿(𝐻) = 𝑁 + 𝐾 + 𝐶 × 𝑑 (21)
𝑖

Since 𝑇𝑖 = (𝑇𝑖∗ + 1) is the maximal token window, and is proportional


Here 𝑅(𝑡) = 𝑑 + 𝑞(𝑡)∕𝐶 is the round-trip time (𝑅𝑇 𝑇 ), where 𝑑 is the
to the weight in steady state, we have
propagation delay (assumed to be equal for all flows). 𝑅∗ = 𝑑 + 𝐾∕𝐶 is
the approximate fixed value for the delay. 𝑝(𝑡) = 𝟏{𝑞(𝑡)>𝐾} indicates the 𝑁 + 𝐾 + 𝐶𝑑 − 𝜆𝛿(𝐻)
𝑇𝑖 = ∑ . (22)
packet marking process at the bottleneck switch; 𝑤̃ 𝑖 is the normalized 𝑖𝑤̃𝑖
weight such that the smallest weight equal to one. Since 𝛿(𝐻) is increasing with the threshold 𝐻, it follows that a
Eq. (14) models the evolution of token rate. 1∕𝑅(𝑡) is the standard
larger threshold will harm the bandwidth of each large flow in the
additive increase term when there is no congestion, and 𝑇𝑖 (𝑡)𝛼(𝑡)∕
queue with low scheduling frequency. We may increase 𝐾 to compen-
(2𝑤̃ 𝑖 𝑅(𝑡)) is the multiplicative decrease term which models the reduc-
sate the damage for weighted bandwidth sharing among large flows.
tion of token rate by a factor 𝛼(𝑡)∕2𝑤̃ 𝑖 when the congestion occurs.
However, since the maximal queuing delay is approximated as (𝑁 +
Eq. (15) is a continuous approximation of (12). And Eq. (16) models
∑ 𝐾 + 𝜆𝛿)∕𝐶, increasing 𝐾 will harm the delay experienced by the small
the queue evolution : 𝑁 𝑖=1 𝑇𝑖 (𝑡)∕𝑅(𝑡) is the net input rate and C is the
flows.
bandwidth capacity.
Through numerical analysis, we plot the trajectory for the token
window and queue lengths when three flows compete through one 5. Evaluation
bottleneck switch in Fig. 7. It is clear to see that HOPASS is periodic
stable, which can be proved using the same method in Alizadeh et al. In this section, we present the evaluation of HOPASS by analyzing
(2011). We can see from the results that the token window of flows
the simulation results on ns-3 (Henderson et al., 2008) which is used
with different weights stabilize around its own weighted share. Since
to evaluate pFabric (Alizadeh et al., 2013). We evaluate our approach
senders in HOPASS immediately sends a data packet once it receives
under different network topologies and different kinds of utility func-
a token, the throughput of each flow is essentially proportional to the
tions. We have compared HOPASS with existing approaches to prove
number of tokens it received in every round trip. Since all flows share
its effectiveness. The evaluation is divided into three parts:
the common bottleneck, they will experience the same increase phase
of token window, and then use their own weights to backoff once • The overall performance of HOPASS with both fast and slow
receiving ECN markings. It follows that in the stable state, the decrease control layer: HOPASS is implemented HOPASS in ns-3 and
phase of different flows’ token rates satisfies: compared with DCTCP (Alizadeh et al., 2010) and TCP NewReno
𝑇𝑖 𝛼(𝑡) 𝑇𝑗 𝛼(𝑡) on the performance metrics such as average bandwidth utilization
𝛥𝑇𝑖 = = 𝛥𝑇𝑗 = (17)
2𝑤̃ 𝑖 𝑅(𝑡) 2𝑤̃𝑗 𝑅(𝑡) and overall network utility.
• Optimality of the slow control: As the slow control layer of
where 𝑇𝑖 and 𝑇𝑗 are the maximal token window before entering the
HOPASS aims to solve the multi-objective NUM problem in ag-
decrease phase. Therefore, we have 𝑇𝑖 ∕𝑇𝑗 = 𝑤𝑖 ∕𝑤𝑗 . In other words, the
gregate flow level, we evaluate the feasibility of the slow control
peak token rates achieve strictly weighted sharing.
layer through a scenario where flows with three different objec-
4.2.2. Influence of classification threshold tives coexist. We use the CVX toolbox of MATLAB (Higham and
We now analyze the influence of the classification threshold for our Higham, 2016) to calculate the optimal solution of the multi-
WRR scheduling. Suppose the threshold for classifying flows into large objective NUM problem and compare it with the result of HOPASS
flows with low scheduling frequency is 𝐻. Flow arrivals form a Poisson to verify the solution’s accuracy.
process with rate 𝜆. It has been observed that in DCNs, the flow size • Performance of the fast control: We conducted extensive sim-
distribution is heavy-tailed (Montazeri et al., 2018). We assume the ulations and evaluated the performance of the fast control layer
flow size distribution follows a Pareto distribution with the tail index over a wide range of topologies, workloads, traffic models, and
(or Pareto index) as 𝑠 > 1. Let 𝑓 (𝑥) be the probability density function performance metrics. We compare the performance with pHost
𝑠
for flow size distribution. We have 𝑓 (𝑥) = . (Gao et al., 2015) and pFabric (Alizadeh et al., 2013).
𝑥𝑠+1

9
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

Fig. 8. Average bandwidth utilization of all the links.


Fig. 9. Global total network utility of all the aggregated flows.

5.1. Overall performance of HOPASS

To evaluate the overall performance of HOPASS, we performed a


simulation in ns-3 with large-scale topology and complex workload in
multiple objective coexistence scenarios.
Topology: Two-tier multi-rooted tree. It is a typical topology in
data center networks, for the simulations in ns-3. In the two-tier
multi-rooted tree topology, each core switch is connected with all the
aggregation switches. The links among them are core links. The links
that connect end hosts with aggregation switches are edge links.
We used a two-tier multi-rooted tree with four core switches and
eight aggregation switches, and each aggregation switch is connected
to sixteen end hosts. The capacity of all the edge links is set to 10
Gbps, and the capacity of core links is 40 Gbps. The propagation delay
between any two end hosts is 16 μs. The workload we used is the same
Fig. 10. Simple topology.
as that in NUMFabric (Nagaraj et al., 2016).
Workload: Workload is based on the measurements of applications
from large enterprises and web searches, whose services’ obvious fea-
5.2. Optimality of slow control layer
ture is heavy-tailed, which means that most of the flows are small-sized
flows and they contribute most of the total workload. The interarrival
5.2.1. Experimental setup
time of flows obeys the exponential distribution.
To evaluate the performance of the learning-based method in the
Performance metrics: In this part of the evaluation, we continually slow control layer, we conducted packet-level simulations in ns-3 and
build a three-objective coexistence scenario. As Table 2 indicated, the calculated the optimal bandwidth allocation using the CVX toolbox of
1−𝛼 1−𝛼
utility functions are 𝑈1 (𝑥) = 𝛽𝑙𝑜𝑔(𝑥), 𝑈2 (𝑥) = 𝑥1−𝛼 and 𝑈3 (𝑥) = 𝑥𝑠 , MATLAB. In this evaluation, we mainly focus on the convergence time
𝑓
respectively with 𝛽 = 1200, 𝛼 = 0.1 and 𝑠𝑓 = 1.1. and optimality of the slow control layer decisions.
Setup: To verify the performance of HOPASS, we also compare Topology: We chose a simple topology for the simulation. This sim-
HOPASS with DCTCP (Alizadeh et al., 2010), a widely deployed con- ple topology (Fig. 10) is a small asymmetric two-tier multi-rooted tree
gestion control algorithm, and TCP NewReno, a traditional Transport with two core switches, two aggregation switches, and four end hosts.
Control Protocol. These two algorithms are both implemented in ns- The capacity of all the edge links is set to 10 Gbps with 2 μs propagation
3 simulator and use the same topology and workload to conduct delay. The core links’ capacity is 40 Gbps and the propagation delay of
the simulation. The performance metrics used to evaluate these three core links is the same as the edge links. The buffer size of each port on
approaches include: (a) average bandwidth utilization of all the links; the switches is 1 MB.
(b) global total network utility of all the aggregated flows. Setup: The slow control layer focuses on the multi-objective NUM
problem of aggregated flows on each link. We use three long-lived
Experimental result analysis: As shown in Fig. 8, the average
flows to simulate three aggregated flows with different utility functions.
bandwidth utilization of DCTCP and NewReno is almost equal, but the
These three long flows are generated from sender 𝑆1 to receivers
average bandwidth utilization of HOPASS is almost twice these two
𝑅1, 𝑅2, 𝑅3, respectively. All these three flows share the same bottleneck
benchmark approaches. It proves that the proposed HOPASS is able to
link 𝑆1 − 𝐴1. The flows among three source–destination pairs have
achieve high utilization of bandwidth. different performance objectives: proportional fairness, 𝛼-fairness, and
Fig. 9 shows that HOPASS is able to achieve higher global total minimum flow completion time. The utility functions are the same as
network utility of all the aggregated flows compared with DCTCP and Section 5.1.
Newreno, which proves the effectiveness of the slow control layer. The
learning-based method helps update the bandwidth allocation strategy 5.2.2. The optimality of HOPASS
so that the network has a higher probability of achieving a more The first experiment is used to evaluate the optimality of HOPASS
considerable global total utility of all the aggregated flows. on the multi-objective NUM problem. We compare the bandwidth

10
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

Fig. 11. Throughput and the rate of aggregated flow on the bottleneck link in multiple
objective coexistence scenario.

allocation of each objective and the total utility of the bottleneck link
with the optimal solution calculated by MATLAB’s CVX toolbox, which Fig. 12. Topology.
is a convex optimization problem solver. In order to ensure the network
has enough time to converge to a stable bandwidth allocation, we set
a large flow size (500 MB in the experiment) for these three flows.
performance metrics: To verify that HOPASS is suitable for a wide of the bottleneck link reaches its capacity in 0.005 s, which means
range of objectives, we change the network’s preference for different HOPASS ensures the link converges to a high bandwidth utilization
objectives by changing the parameters of utility functions and con- quickly. We owe the high utilization of bandwidth to both the SCL and
ducted multiple sets of experiments. The results are shown in Table 3. FCL algorithms of HOPASS. In the slow control layer, as (4) shown,
By comparing the bandwidth allocation result of HOPASS with the the sum of the bandwidth allocation ratio is always equal to 1, which
optimal solution calculated by MATLAB, we verify the optimality of means that there is no waste of bandwidth in any control cycle. Besides,
HOPASS over different objectives. We define a formula to measure the the fast control layer algorithm strictly controls the aggregated rate of
difference between the results of HOPASS and the optimal solution: each objective to the bandwidth constraint given by the slow control
layer, which also contributes to the high utilization of bandwidth. We
𝑈𝑐𝑣𝑥 − 𝑈𝐻𝑂𝑃 𝐴𝑆𝑆
𝛿= × 100% (23) further verify the performance of the fast control layer in the next part.
𝑈𝑐𝑣𝑥
where, 𝛿 denotes the normalized error between the result of HOPASS 5.3. Experimental in fast control layer
and the optimal solution; 𝑈𝑐𝑣𝑥 denotes the total maximum utility of the
optimal bandwidth allocation; 𝑈𝐻𝑂𝑃 𝐴𝑆𝑆 denotes the total utility under
In this section, we provide simulation results and evaluate the
the bandwidth allocation of HOPASS.
performance of FCL over a wide range of topologies, workloads, traffic
experimental result analysis: As shown in Table 3, even though
models, and performance metrics, and compare its performance with
in some cases the bandwidth allocation results of HOPASS are different
pHost (Gao et al., 2015) and pFabric (Alizadeh et al., 2013).
from that of the ideal solution, the error of utilities between HOPASS
Topology: Two-tier multi-rooted tree. We use the same tree topology
and the ideal solution is less than 0.4%. So these solutions can be
as in pHost (Gao et al., 2015), pFabric (Alizadeh et al., 2013), and
regarded as near-optimal. The error is inevitable because of the limited
PIAS (Bai et al., 2015), shown in Fig. 12. The two-tier multi-rooted
precision of the convergence step size in the slow control layer algo-
tree topology consists of three levels of components. The top layer has
rithm. HOPASS is able to provide an optimal or near-optimal solution
four core switches. Each core switch has nine aggregation switches
to the multi-objective network utility optimization problem.
(at the second layer), and each aggregation switch has 16 hosts (at
5.2.3. Learning-based method in SCL the bottom layer). The link of the core-aggregation switch is 40 Gbps,
The second experiment evaluates the convergence time of the and the link of host-aggregation is 10 Gbps. Each link has a 12.5 μs
learning-based method used in the slow control layer. To simulate propagation delay. Network switches implement cut-through routing
various flow patterns, we set the sizes of these three long flows to and each switch port has a 36 kB queue buffer (same as pFabric).
100 MB, 500 MB and 1 GB, respectively. In the measurement of the Dumbbell. Fig. 12 shows the dumbbell topology that consists of two
simulation, the first sample point was measured 0.005 s after the start switches, two senders and two receivers. Each link is 10 Gbps, 12.5 μs
of the simulation. Then the rate of these three aggregated flows is propagation delay, with 36 kB buffer at the switch.
measured every 0.02 s. Workload: the naive workload produces approximately 100 MB
Fig. 11 shows that the network reaches a stable bandwidth alloca- large flow and 30 kB small flow, mainly used to verify the weighted
tion in 0.08 s after the start of the simulation. The converged bandwidth bandwidth sharing. Then we compare the performance of FCL against
allocation is 2588.2 Mbps, 6428.1 Mbps, 1642.3 Mbps, which is a near- pHost and pFabric over Three traces: IMC10, Data Mining, and Web
optimal bandwidth allocation according to Table 3. Besides, when one Search, same as in pHost. The IMC10 and Data Mining workloads have
of the flow stops injecting traffic into the network, a new convergence a larger fraction of small flows than Web Search.
can be reached within only 0.02 s. This can be attributed to the fact that Traffic model: we generate the flows under three models. Concur-
the slow control layer can quickly respond to the dynamics of flows. rent traffic has concurrent large flows, and the small flows are generated
Besides, this experiment can also prove that HOPASS can achieve according to a Poisson distribution. Default traffic means many-to-many
high utilization of bandwidth. As shown in Fig. 11, the throughput traffic, same as in pHost, each host can be the sender or receiver, and

11
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

Table 3
Comparison between bandwidth allocation result of HOPASS and the solution calculated by Matlab.
Utility function HOPASS Ideal solution Error (%)
Proportional fairness 𝛼-fairness Minimize FCT Total utility Proportional fairness 𝛼-fairness Minimize FCT Total utility
𝛽 = 1000, 𝑠𝑓 = 0.7 1952.1 Mbps 359.7 Mbps 8353.3 Mbps 12 635.5 1912.7 Mbps 655.9 Mbps 8096.4 Mpbs 12 640.2 0.037
𝛽 = 1000, 𝑠𝑓 = 1.1 3138.0 Mbps 6960.5 Mbps 561.8 Mbps 11 515.0 2432.5 Mbps 7252.8 Mbps 975.0 Mbps 11 555.0 0.35
𝛽 = 1200, 𝑠𝑓 = 1.0 3665.8 Mbps 5364.4 Mbps 1626.4 Mbps 13 150.0 2853.8 Mbps 5785.6 Mbps 2017.3 Mbps 13 193.4 0.33
𝛽 = 800 , 𝑠𝑓 = 1.0 2588.2 Mbps 6428.1 Mbps 1642.8 Mbps 10 042.4 1924.1 Mbps 6476.7 Mbps 2258.3 Mbps 10 085.2 0.4

Fig. 13. Dumbbell topology. For small flows, the slowdown under FCL is similar to pFabric, which is optimal due to the shortest remaining processing time first strategy. Compared
to pHost, the 80% queuing delay under FCL is less than 12 μs, which is 63.6% lower than pHost (33 μs). For large flows, FCL’s mean throughput is higher than pFabric, and is
only 16.7% lower than pHost.

Fig. 14. Four large flows with different weights on the Dumbbell topology. pFabric excessively suppresses the throughput of large flows when there are still small flows (before
sample = 40). pHost’s two receivers schedule one large flow at a time. When f0 and f3 finishes, they will schedule f1 and f2. FCL achieves weighted bandwidth sharing for large
flows.

the flow arrival times are subject to a Poisson distribution. Incast traffic according to different weights to give way to the small flows. Thus,
means each receiver gets flows from a specified number of data sources. the small flow’s FCT and queuing delay are better, at the price of
Performance metrics: slowdown. We use the slowdown to evaluate acceptable throughput degradation for large flows. pHost only degrades
the flow completion time as in pHost and pFabric, OPT(i) denotes the flow when the expired token exceeds a BDP, and the congestion may
optimal FCT of flow i, ACT(i) denotes the actual FCT in observation. have occurred for a period, which is hysteretic and not adaptive.
The slowdown denotes the ratio of the ACT(i) and OPT(i). The closer This will cause high throughput, high FCT and long queuing delays.
that the slowdown is to 1, the better performance. pFabric’s priority scheduling and priority-based packet drop play a
significant role in the FCT. By using rate control, it prevents spurious
5.3.1. Concurrent flows in dumbbell topology packet drops to approximate SRPT. However, it also leads to the lowest
We use the naive workload and concurrent traffic model to measure mean throughput for large flows.
the performance of FCL under the dumbbell topology. There are four Fig. 14 shows the large flow’s time series diagram of the three
large flows (f0, f1, f2, f3), their weights are 1:1:2:4, generated from protocols. pFabric suppresses the throughput of large flows at the
sender1 and sender2 to receiver1 and receiver2 respectively. There are switch in the beginning, because the small flow gets a high priority
5000 small flows generated randomly, and their arrival time is subject than the large flow. When small flow finishes (after sample = 40), it
to Poisson distribution. schedules the large flows one by one at a switch, which may make other
Fig. 13 shows the performance of pFabric, pHost and FCL. We flows starve for a long time. pHost schedules the large flow one by one
set the best parameters for pFabric and pHost, to evaluate the small at a receiver, and switch fairly allocates the bandwidth for the incoming
flow’s mean slowdown, the large flow’s mean throughput and CDF flows, which cannot guarantee the bandwidth for different applications.
of the small flow’s queuing delay. FCL shows similar FCT of small FCL approaches the weighted bandwidth allocation by the weighted
flows with pFabric (both better than pHost), and 80% of small flows’ back-off mechanism, and will not make one flow or receiver wait for
queuing delay is under 12 μs, 14.3% lower than pFabric (14 μs) and a long time. In addition, it makes the best use of the link capacity,
63.6% lower than pHost (33 μs). FCL schedules the small flows more when f3 and f2 are finished, the remaining flows weighted share the
frequently, and it is one of the mechanisms to minimize the FCT. The rest bandwidth. It does not absolutely obey the weight ratio, because
second mechanism is that the receiver takes rate control when in- there is small flows’ interference. Also, the rate increase phase does not
network congestion occurs. Specifically, the large flows would back off take the weighted step.

12
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

Fig. 15. FatTree topology. Fast control layer’s mean small flows’ slowdown is similar to pFabric and pHost. 99% small flows’ queuing delay is less than 2 μs, which is 75% lower
than pHost (8 μs). For large flows, the mean throughput is consistently higher than pFabric, and is higher than pHost under the data mining trace.

5.3.2. Default traffic in two-tier multi-rooted tree topology Declaration of competing interest
To show our scheme’s scalability, we use the topology as in
Fig. 12(a), default traffic and Three traces to evaluate three protocols’ The authors declare that they have no known competing finan-
performance. Fig. 15 indicates that FCL achieves a similar mean FCT cial interests or personal relationships that could have appeared to
for small flow with pFabric and pHost, and improves the large flow’s influence the work reported in this paper.
throughput. Besides, 99% small flows’ queuing delay of FCL is less
than 2 μs, 75% lower than pHost (8 μs), while pFabric’s queuing delay
Acknowledgments
is close to 0. pFabric achieves global optimization for small flows’
FCT, but it needs the switch’s help to schedule and drop the data.
Moreover, it ignores the objective for large flow’s throughput. pHost This work was supported by National Natural Science Foundation
just handles the congestion by stopping sending tokens for a time, hence of China (NSFC 62072012), Key-Area Research and Development Pro-
its ’’on-off’’behavior causes the unstable throughput. FCL achieves both gram of Guangdong Province, China (2020B0101090003), Shenzhen
low latency for the small flows and high throughput for the large Research Project, China (JSGG20191129110603831), Shenzhen Key
flows, because it employs the WRR scheduling and weighted back-off Laboratory Project, China (ZDSYS201802051831427).
at the receivers. WRR scheduling makes the small flows be scheduled
more frequently, and the weighted back-off can guarantee weighted References
bandwidth sharing among the large flows.
Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey,
6. Conclusion Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, et al., 2016.
Tensorflow: a system for large-scale machine learning. In: Proc. UNSENIX OSDI,
This paper innovatively proposes a two-layer network resource allo- Vol. 16. pp. 265–283.
cation framework named HOPASS for DCNs to solve a multi-objective Alizadeh, Mohammad, Greenberg, Albert, Maltz, David A., Padhye, Jitendra, Pa-
tel, Parveen, Prabhakar, Balaji, Sengupta, Sudipta, Sridharan, Murari, 2010. Data
NUM problem and ensure a low-latency network with reasonable con-
center TCP (DCTCP). ACM SIGCOMM Comput. Commun. Rev. 40 (4), 63–74.
gestion control. HOPASS consists of a slow control layer (SCL) and fast Alizadeh, Mohammad, Javanmard, Adel, Prabhakar, Balaji, 2011. Analysis of DCTCP:
control layer (FCL), and these two layers are designed to solve different Stability, convergence, and fairness. In: ACM SIGMETRICS. pp. 73–84.
problems: at the SCL, we propose a learning-based approach to solve Alizadeh, Mohammad, Yang, Shuang, Sharif, Milad, Katti, Sachin, Mckeown, Nick,
the multi-objective NUM at an aggregate flow level. The FCL adopts a Prabhakar, Balaji, Shenker, Scott, 2013. pFabric: minimal near-optimal datacenter
receiver-driven approach that gives a performance guarantee for both transport. In: Proc. ACM SIGCOMM. pp. 435–446.
Bai, Wei, Chen, Li, Chen, Kai, Han, Dongsu, Tian, Chen, Wang, Hao, 2015. Information-
latency-sensitive mice flows and bandwidth-hungry elephant flows and
agnostic flow scheduling for commodity data centers. In: Proc. Usenix NSDI. pp.
takes proactive congestion control in a distributed manner to meet the 455–468.
requirements of weighted bandwidth sharing. Cho, Inho, Jang, Keon, Han, Dongsu, 2017. Credit-scheduled delay-bounded congestion
The simulation results in ns-3 show that HOPASS can achieve the control for datacenters. In: Proc. ACM SIGCOMM. pp. 239–252.
near-optimal solution to the multi-objective NUM problem in multi- Dean, Jeffrey, Ghemawat, Sanjay, 2004. MapReduce: simplified data processing on large
objective scenarios. In the simple topology, the bottleneck link band- clusters. In: Proc. USENIX OSDI.
width utilization of the proposed method is 99.56%, the convergence Dong, Mo, Meng, Tong, Zarchy, Doron, Arslan, Engin, Godfrey, Brighten,
Schapira, Michael, 2018. PCC vivace: Online-learning congestion control. In:
time of the algorithm is less than 0.4 s, the error between the control
Proc. USENIX NSDI.
result and the theoretical optimal control result solved by Matlab is less Francois, Frederic, Gelenbe, Erol, 2016. Optimizing secure SDN-enabled inter-data
than 0.2%. In addition, with limited additional overhead, the proposed centre overlay networks through cognitive routing. In: 2016 IEEE 24th Inter-
flow control scheme achieves less vibration and 2.03 times the overall national Symposium on Modeling, Analysis and Simulation of Computer and
network utility of the typical data center flow control method DCTCP, Telecommunication Systems (MASCOTS). IEEE, pp. 283–288.
which fully verifies the superiority of HOPASS. Finally, by leveraging Frohlich, Piotr, Gelenbe, Erol, 2020. Optimal fog services placement in SDN IoT network
using random neural networks and cognitive network map. In: The 19th Inter-
the learning-based method, HOPASS is flexible and applicable to a wide
national Conference on Artificial Intelligence and Soft Computing, Zakopane, PL,
range of network utilities. Springer LNAI. 12415, pp. 78–89. http://dx.doi.org/10.1007/978-3-030-61401-0.
Fröhlich, P., Gelenbe, E., Fiołka, J., Checinski, J., Nowak, M., Filus, Z., 2021. Smart
CRediT authorship contribution statement SDN management of fog services to optimize QoS and energy. Sensors 21, 3105.
http://dx.doi.org/10.3390/s21093105.
Kai Lei: Conceptualization, Methodology, Source, Supervision. Jun- Frohlich, Piotr, Gelenbe, Erol, Nowak, Mateusz P., 2020. Smart SDN management of fog
lin Huang: Software, Investigation, Formal analysis. Xiaodong Li: services. In: GIOTS 2020: Global IoT Summit 2020, IEEE Communications Society,
1-5 June 2020, Dubin, Ireland. TechRxiv.
Writing – review & editing, Data curation. Yu Li: Writing – original
Gao, Peter X, Narayan, Akshay, Kumar, Gautam, Agarwal, Rachit, Ratnasamy, Sylvia,
draft, Visualization. Ye Zhang: Formal analysis, Validation. Bo Bai:
Shenker, Scott, 2015. pHost: Distributed near-optimal datacenter transport over
Projection administration, Source, Funding acquisition. Fan Zhang: commodity network fabric. In: Proc. ACM CoNEXT. pp. 1:1–1:12.
Funding acquisition, Source. Gong Zhang: Funding acquisition, Source. Ha, Sangtae, Rhee, Injong, Xu, Lisong, 2008. CUBIC: a new TCP-friendly high-speed
Jingjie Jiang: Source. TCP variant. ACM SIGOPS Oper. Syst. Rev. 42 (5), 64–74.

13
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

Handley, Mark, Raiciu, Costin, Agache, Alexandru, Voinescu, Andrei, Moore, An- Kai Lei received his Ph.D. degree in Computer Science
drew W., Antichi, Gianni, Wójcik, Marcin, 2017. Re-architecting datacenter from Peking University, China, in 2015, M.Sc. in Computer
networks and stacks for low latency and high performance. In: Proc. ACM Science from Columbia University in 1999, and B.Sc. in C.S.
SIGCOMM. pp. 29–42. from Peking University in 1998. He worked for companies
Henderson, Thomas R., Lacage, Mathieu, Riley, George F., Dowell, Craig, including IBM T.J Watson Research Center, Citigroup, Or-
Kopena, Joseph, 2008. Network simulations with the ns-3 simulator. SIGCOMM acle, and Google from 1999 to 2004. He is currently an
Demonstr. 14 (14), 527. associate professor in the School of Electronic and Computer
Higham, Desmond J., Higham, Nicholas J., 2016. MATLAB Guide. SIAM. Engineering, Peking University, China. His research interests
Hosseinzadeh, Mehdi, Ghafour, Marwan Yassin, Hama, Hawkar Kamaran, Vo, Bay, include Future Internet, Blockchain and Federated Learning.
Khoshnevis, Afsane, 2020. Multi-objective task and workflow scheduling approaches
in cloud computing: a comprehensive review. J. Grid Comput. 1–30.
Kelly, Frank, 1997. Charging and rate control for elastic traffic. Eur. Trans.
Telecommun. 8 (1), 33–37.
Kumar, Alok, Jain, Sushant, Naik, Uday, Raghuraman, Anand, Kasinadhuni, Nikhil, Zer-
meno, Enrique Cauich, Gunn, C. Stephen, Ai, Jing, Amarandei-Stavila, Mihai, 2015.
BwE:Flexible, hierarchical bandwidth allocation for WAN distributed computing. In:
Proc. ACM SIGCOMM. pp. 1–14. Junlin Huang received his B.S. degree in Electronic Engi-
Lei, K., Huang, J., Li, Y., Zhang, F., Susanto, H., Bai, B., Zhang, G., Liu, J., 2019. neering from Sun Yat-sen University, China, in 2017 and
HOMMO: A hierarchical flow management framework for multi-objective data M.S. degree in Computer Science from Peking University,
center networks. In: 2019 IEEE Global Communications Conference (GLOBECOM). China, in 2020, respectively. He is currently a software
pp. 1–6. engineer at Tencent. His research interest mainly focuses
on flow management and scheduling.
Leung, I.K.-K., Muppala, J.K., 2002. Packet marking strategies for explicit congestion
notification (ECN). In: Conference Proceedings of the 2001 IEEE International
Performance, Computing, and Communications Conference (Cat. No.01CH37210).
Li, Mu, Andersen, David G., Park, Jun Woo, Smola, Alexander J., Ahmed, Amr,
Josifovski, Vanja, Long, James, Shekita, Eugene J., Su, Bor-Yiing, 2014. Scaling
distributed machine learning with the parameter server. In: Proc. UNSENIX OSDI,
Vol. 14. pp. 583–598.
Masdari, Mohammad, Zangakani, Mehran, 2019. Green cloud computing using
proactive virtual machine placement: challenges and issues. J. Grid Comput. 1–33.
Mittal, Radhika, Lam, Vinh The, Dukkipati, Nandita, Blem, Emily, Wassel, Hassan,
Ghobadi, Monia, Vahdat, Amin, Wang, Yaogong, Wetherall, David, Zats, David,
Xiaodong Li received his B.S. degree in University of
2015. TIMELY: RTT-based congestion control for the datacenter. ACM SIGCOMM
Science and Technology Beijing, China in 2020. He is
Comput. Commun. Rev. 45 (4), 537–550.
working toward his M.S. degree in Computer Science in
Montazeri, Behnam, Li, Yilong, Alizadeh, Mohammad, Ousterhout, John, 2018. Homa: Peking University. His research interest in mainly focused
A receiver-driven low-latency transport protocol using network priorities. In: on the area of the network measurement and programmable
Proc. ACM SIGCOMM. switches.
Munir, Ali, Qazi, Ihsan A., Uzmi, Zartash A., Mushtaq, Aisha, Ismail, Saad N.,
Iqbal, M. Safdar, Khan, Basma, 2013. Minimizing flow completion times in data
centers. In: INFOCOM, 2013 Proceedings IEEE. pp. 2157–2165.
Nagaraj, Kanthi, Bharadia, Dinesh, Mao, Hongzi, Chinchali, Sandeep, Alizadeh, Mo-
hammad, Katti, Sachin, 2016. Numfabric: Fast and flexible bandwidth allocation
in datacenters. In: Conference on ACM SIGCOMM 2016 Conference. pp. 188–201.
Perry, Jonathan, Ousterhout, Amy, Balakrishnan, Hari, Shah, Devavrat, Fugal, Hans,
2014. Fastpass: a centralized zero-queue datacenter network. In: Proc. ACM Yu Li received her B.S. degree in Software Engineering
SIGCOMM. pp. 307–318. from Sun Yat-sen University in 2018. She is working toward
Shalev-Shwartz, Shai, 2011. Online learning and online convex optimization. Found. her M.S. degree in Computer Science at Peking University.
Trends Mach. Learn. 4 (2), p.A6,1–11,13–79,81–85,87,89–91. Her research interest is mainly focused on the area of the
Tian, Chen, Munir, Ali, Liu, Alex X., Liu, Yingtong, Li, Yanzhao, Sun, Jiajun, Zhang, Fan, network controlling.
Zhang, Gong, 2017. Multi-tenant multi-objective bandwidth allocation in datacen-
ters using stacked congestion control. In: INFOCOM 2017 - IEEE Conference on
Computer Communications, IEEE. pp. 1–9.
Vamanan, Balajee, Hasan, Jahangir, Vijaykumar, T.N., 2012. Deadline-aware datacenter
tcp (D2TCP). Acm Sigcomm Comput. Commun. Rev. 42 (4), 115–126.
Wang, Lan, Gelenbe, Erol, 2018. Adaptive dispatching of tasks in the cloud. IEEE Trans.
Cloud Comput. 6 (1), 33–45.
Wilson, Christo, Ballani, Hitesh, Karagiannis, Thomas, Rowtron, Ant, 2011. Better never
than late:meeting deadlines in datacenter networks. In: Proc. ACM SIGCOMM. pp.
50–61.
Xu, Zhiyuan, Tang, Jian, Meng, Jingsong, Zhang, Weiyi, Wang, Yanzhi, Liu, Chi Harold, Ye Zhang received her B.S. degree in Spatial Information
Yang, Dejun, 2018. Experience-driven networking: A deep reinforcement learning and Digital Technology from Wuhan University, China, in
based approach. 2016 and M.S. degree in Computer Science from Peking
University, China, in 2019, respectively. Her research inter-
Zaharia, Matei, Chowdhury, Mosharaf, Das, Tathagata, Dave, Ankur, Ma, Justin,
est mainly focuses on congestion control and receiver-driven
McCauley, Murphy, Franklin, Michael J., Shenker, Scott, Stoica, Ion, 2012. Resilient
protocol design.
distributed datasets: A fault-tolerant abstraction for in-memory cluster computing.
In: Proc. USENIX NSDI.
Zats, David, Das, Tathagata, Mohan, Prashanth, Borthakur, Dhruba, Katz, Randy, 2012.
Detail:reducing the flow completion time tail in datacenter networks. Acm Sigcomm
Comput. Commun. Rev. 42 (4), 139–150.
Zinkevich, Martin, 2003. Online convex programming and generalized infinitesimal
gradient ascent. In: Proceedings of the Twentieth International Conference on
International Conference on Machine Learning. In: ICML’03, AAAI Press, pp.
928–935.

14
K. Lei et al. Journal of Network and Computer Applications 196 (2021) 103224

Bo Bai is currently the Director of Theory Lab, 2012 Labs, Gong Zhang is now a Principal researcher in the future
Huawei Technologies Co., Ltd., Hong Kong. He received his network architecture of Huawei 2012 Labs. He has over
Ph.D. degree in the Department of Electronic Engineering 18 years of Research experience as a system architect in
from Tsinghua University, Beijing, China, in 2010. Before Network, distributed systems, and communication systems.
that, he received his B.S. degree from Xidian University, Recently years, his primary research interests are Network
Xi’an China, in 2004. His current research interests include architecture and large-scale distributed systems.
B5G/6G mobile networking, graph informatics, non-linear
information theory, optimization problem solver, etc.

Jingjie Jiang was a researcher in Theory Lab, Huawei Hong


Kong research center. She received the BEng degree from
the Department of Automation, Tsinghua University, China,
in 2012, and the Ph.D. degree in Department of Computer
Fan Zhang is currently with Theory Lab, Huawei Hong
Science and Engineering, Hong Kong University of Science
Kong Research Institute as a principle researcher. He re-
and Technology, in 2017. Her research interests include
ceived the BEng (first class Hons) degree from Chu Kochen
datacenter networking and scheduling, congestion control,
Honors College, Zhejiang University (ZJU), in 2010, and
blockchain, and distributed systems.
the Ph.D. degree from the Hong Kong University of Science
and Technology (HKUST), in 2015. His research interests in-
clude the neural network in artificial intelligence, stochastic
optimization, convex optimization, control theory, etc.

15

You might also like