Professional Documents
Culture Documents
& CLOUD COMPUTING NOW powers applications applications are interactive, latency critical
from every domain of human endeavor, which services that must meet strict performance
require ever improving performance, respon- (throughput and tail latency), and availability
siveness, and scalability.2,5,6,8 Many of these constraints, while also handling frequent soft-
ware updates.4–7; 12 The past five years have
Digital Object Identifier 10.1109/MM.2020.2985960 seen a significant shift in the way cloud services
Date of publication 22 April 2020; date of current version 22 are designed, from large monolithic implemen-
May 2020. tations, where the entire functionality of a
0272-1732 ß 2020 IEEE Published by the IEEE Computer Society IEEE Micro
10
Authorized licensed use limited to: b-on: Universidade de Lisboa Reitoria. Downloaded on January 18,2021 at 15:32:14 UTC from IEEE Xplore. Restrictions apply.
Second, microservices enable programming
language and framework heterogeneity, with
each tier developed in the most suitable lan-
guage, only requiring a common API for micro-
services to communicate with each other;
typically over remote procedure calls (RPC) or a
RESTful API. In contrast, monoliths limit the lan-
guages used for development, and make fre-
quent updates cumbersome and error-prone.
Figure 1. Differences in the deployment of
Finally, microservices separate failure doma-
monoliths and microservices.
ins across application tiers, allowing cleaner error
isolation, and simplifying correctness and perfor-
service is implemented in a single binary, to mance debugging, unlike in monoliths, where
large graphs of single-concerned and loosely- resolving bugs often involves troubleshooting the
coupled microservices.1,10 This shift is becom- entire service. This also makes them applicable to
ing increasingly pervasive, with large cloud pro- Internet-of-Things (IoT) applications that often
viders, such as Amazon, Twitter, Netflix, Apple, host mission-critical computation.
and EBay having already adopted the microser- Despite their advantages, microservices rep-
vices application model, and Netflix reporting resent a significant departure from the way cloud
more than 200 unique microservices in their services are traditionally designed, and have
ecosystem, as of the end of 2016.1 broad implications in both hardware and soft-
The increasing popularity of microservices is ware, changing a lot of assumptions current ware-
justified by several reasons. First, they promote house-scale systems are designed with. For
composable software design, simplifying and example, since dependent microservices are typi-
accelerating development, with each microser- cally placed on different physical machines, they
vice being responsible for a small subset of the put a lot more pressure on high bandwidth and
application’s functionality. The richer the func- low latency networking than traditional applica-
tionality of cloud services becomes, the more tions. Furthermore, the dependencies between
the modular design of microservices helps man- microservices introduce backpressure effects
age system complexity. They similarly facilitate between dependent tiers, leading to cascading
deploying, scaling, and updating individual micro- QoS violations that propagate and amplify
services independently, avoiding long develop- through the system, making performance debug-
ment cycles, and improving elasticity. For ging expensive in both resources and time.11
applications that are updated on a daily basis, Given the increasing prevalence of microser-
modifying, recompiling, and testing a large mono- vices in both cloud and IoT settings, it is impera-
lith is both cumbersome and prone to bugs. tive to study both their opportunities and
Figure 1 shows the deployment differences challenges. Unfortunately most academic work
between a traditional monolithic service, and an on cloud systems is limited to the available
application built with microservices. While the open-source applications; monolithic designs in
entire monolith is scaled out on multiple servers, their majority. This not only prevents a wealth
microservices allow individual components of of interesting research questions from being
the end-to-end application to be elastically explored, but can also lead to misdirected
scaled, with microservices of complementary research efforts whose results do not translate
resources bin-packed on the same physical to the way real cloud services are implemented.
server. Even though modularity in cloud services
was already part of the service-oriented architec-
ture (SOA) design approach, the fine granularity DeathstarBench SUITE
of microservices, and their independent deploy- Our article,10 presented at ASPLOS’19,
ment create hardware and software challenges addresses the lack of representative and open-
different from those in traditional SOA workloads. source benchmarks built with microservices, and
May/June 2020
11
Authorized licensed use limited to: b-on: Universidade de Lisboa Reitoria. Downloaded on January 18,2021 at 15:32:14 UTC from IEEE Xplore. Restrictions apply.
Top Picks
Figure 2. Graph of microservices in Social Network. Figure 3. Graph of microservices in Media Service.
quantifies the opportunities and challenges of this unidirectional follow relationships. Figure 2
new application model across the system stack. shows the architecture of the end-to-end service.
Benchmark Suite Design: We have designed, Users (client) send requests over http, which
implemented, and open-sourced a set of end- first reach a load balancer, implemented with
to-end applications built with interactive micro- nginx. Once a specific webserver is selected,
services, representative of popular production also in nginx, the latter uses a php-fpm module
online services using this application model. Spe- to talk to the microservices responsible for com-
cifically, the benchmark suite includes a social posing and displaying posts, as well as microser-
network, a media service, an ecommerce shop, a vices for advertisements and search engines. All
hotel reservation site, a secure banking system, messages downstream of php-fpm are Apache
and a coordination control platform for UAV Thrift RPCs. Users can create posts embedded
swarms. Across all applications, we adhere to the with text, media, links, and tags to other users.
design principles of representativeness, modular- Their posts are then broadcasted to all their fol-
ity, extensibility, software heterogeneity, and end- lowers. Users can also read, favorite, and repost
to-end operation. posts, as well as reply publicly, or send a direct
Each service includes tens of microservices in message to another user. The application also
different languages and programming models, includes machine learning plugins, such as user
including node.js, Python, C/Cþþ, Java, Java- recommender engines, a search service using
script, Scala, and Go, and leverages open-source Xapian, and microservices to record and display
applications, such as NGINX, memcached, Mon- user statistics, e.g., number of followers, and to
goDB, Cylon, and Xapian. To create the end-to-end allow users to follow, unfollow, or block other
services, we built custom RPC and RESTful APIs accounts. The service’s backend uses memc-
using popular open-source frameworks like ached for caching, and MongoDB for persistent
Apache Thrift, and gRPC. Finally, to track how storage for posts, profiles, media, and recom-
user requests progress through microservices, we mendations. The service is broadly deployed at
have developed a lightweight and transparent to our institution, currently servicing several hun-
the user distributed tracing system, similar to Dap- dred users. We also use this deployment to
per and Zipkin that tracks requests at RPC granu- quantify the tail at scale effects of microservices.
larity, associates RPCs belonging to the same end- Media Service: The application implements an
to-end request, and records traces in a centralized end-to-end service for browsing movie informa-
database. We study both traffic generated by real tion, as well as reviewing, rating, renting, and
users of the services, and synthetic loads gener- streaming movies. Figure 3 shows the architec-
ated by open-loop workload generators. ture of the end-to-end service. As with the social
network, a client request hits the load balancer,
Applications in DeathStarBench which distributes requests among multiple nginx
Social Network: The end-to-end service imple- webservers. Users can search and browse infor-
ments a broadcast-style social network with mation about movies, including their plot,
12 IEEE Micro
Authorized licensed use limited to: b-on: Universidade de Lisboa Reitoria. Downloaded on January 18,2021 at 15:32:14 UTC from IEEE Xplore. Restrictions apply.
photos, videos, cast, and review information, as account, search information about the bank, or
well as insert new reviews in the system for a contact a representative. Once logged in, a user
specific movie by logging into their account. can process a payment, pay their credit card bill,
Users can also select to rent a movie, which browse information about loans or request one,
involves a payment authentication module to and obtain information about wealth manage-
verify that the user has enough funds, and a ment options. Most microservices are written in
video streaming module using nginx-hls, a pro- Java and Javascript. The back-end databases use
duction nginx module for HTTP live streaming. memcached and MongoDB instances.
The actual movie files are stored in NFS, to avoid IoT Swarm Coordination: Finally, we explore an
the latency and complexity of accessing chunked environment where applications run both on the
records from nonrelational databases, while cloud and on edge devices. The service coordi-
movie reviews are kept in memcached and Mon- nates the routing of a swarm of programmable
goDB instances. Movie information is main- drones, which perform image recognition and
tained in a sharded and replicated MySQL obstacle avoidance. We have designed two ver-
database. The application also includes movie sion of this service. In the first, the majority of the
and advertisement recommenders, as well as a computation happens on the drones, including
couple auxiliary services for maintenance and the motion planning, image recognition, and
service discovery, which are not shown in the obstacle avoidance, with the cloud only con-
figure. structing the initial route per-drone, and holding
E-Commerce Site: The service implements an persistent copies of sensor data. This architec-
e-commerce site for clothing. The design draws ture avoids the high network latency between
inspiration, and uses several components of the cloud and edge, however, it is limited by the on-
open-source Sockshop application. The applica- board resources. In the second version, the cloud
tion front-end in this case is a node.js service. is responsible for most of the computation. It per-
Clients can use the service to browse the inven- forms motion control, image recognition, and
tory using catalogue, a Go microservice that obstacle avoidance for all drones, using the
mines the back-end memcached and MongoDB ardrone-autonomy, and Cylon libraries, in
instances holding information about products. OpenCV and Javascript, respectively. The edge
Users can also place orders (Go) by adding items devices are only responsible for collecting sensor
to their cart (Java). After they log in (Go) to their data and transmitting them to the cloud, as well
account, they can select shipping options (Java), as recording some diagnostics using a local node.
process their payment (Go), and obtain an js logging service. In this case, almost every
invoice (Java) for their order. Finally, the service action suffers the cloud-edge network latency,
includes a recommender engine for suggested although services benefit from the additional
products, and microservices for creating an item cloud resources. We use 24 programmable Parrot
wishlist (Java), and displaying current discounts. AR2.0 drones, together with a backend cluster of
Hotel Reservation: The service implements a 20 two-socket, 40-core servers.
hotel reservation site, where users can browse
information about hotels and complete reserva-
tions. The service is primarily written in Go, with Adoption
the backend tiers implemented using memc- DeathStarBench is open-source software
ached and MongoDB. Users can filter hotels under a GPL license. The project is currently in
according to ratings, price, location, and avail- use by several tens of research groups both in
ability. They also receive recommendations on academia and industry. In addition to the open-
hotels they may be interested in. source project, we have also deployed the social
Banking System: The service implements a network as an internal social network at Cornell
secure banking system that processes payments, University, currently used by over 500 students,
loan requests, and credit card transactions. and have used execution traces for several
Users interface with a node.js front-end, similar
to the one in E-commerce to login to their https://github.com/delimitrou/DeathStarBench.
May/June 2020
13
Authorized licensed use limited to: b-on: Universidade de Lisboa Reitoria. Downloaded on January 18,2021 at 15:32:14 UTC from IEEE Xplore. Restrictions apply.
Top Picks
14 IEEE Micro
Authorized licensed use limited to: b-on: Universidade de Lisboa Reitoria. Downloaded on January 18,2021 at 15:32:14 UTC from IEEE Xplore. Restrictions apply.
Figure 5. (a) Overview of the FPGA configuration for RPC Figure 7. Cascading QoS violations in Social
acceleration, and (b) the performance benefits of acceleration in Network compared to per-microservice CPU
terms of network and end-to-end tail latency. utilization.
from acceleration on network processing latency performance issue, but can on occasion make it
alone, and on the end-to-end latency of each of the worse, by admitting more traffic into the system.
services. Network processing latency improves by The more complex the dependence graph
1068x over native TCP, whereas end-to-end tail between microservices, the more pronounced
latency improves by 43% and up to 2:2x. For inter- such issues become. Figure 6 shows the microser-
active, latency-critical services, where even a vices dependence graphs for three major cloud
small improvement in tail latency is significant, service providers, and for one of our applications
network acceleration provides a major boost in (Social Network). The perimeter of the circle (or
performance. sphere surface) shows the different microservi-
ces, and edges show dependencies between
Cluster Management them. Such dependencies are difficult for develop-
A major challenge with microservices has to ers or users to describe, and furthermore, they
do with cluster management. Even though the change frequently, as old microservices are
cluster manager can elastically scale out individ- swapped out and replaced by newer services.
ual microservices on-demand instead of the entire Figure 7 shows the impact of cascading QoS
monolith, dependencies between microservices violations in the Social Network service. Darker
introduce backpressure effects and cascading colors show tail latency closer to nominal opera-
QoS violations that propagate through the sys- tion for a given microservice in Figure 7(a), and
tem, hurting quality of service (QoS). Backpres- low utilization in Figure 7(b). Brighter colors sig-
sure can additionally trick the cluster manager nify high per-microservice tail latency and high
into penalizing or upsizing a highly utilized micro- CPU utilization. Microservices are ordered based
service, even though its saturation is the result of on the service architecture, from the back-end
backpressure from another, potentially not-satu- services at the top, to the front-end at the bot-
rated service. Not only does this not solve the tom. Figure 7(a) shows that once the back-end
service at the top experiences high tail latency,
the hotspot propagates to its upstream services,
and all the way to the front-end. Utilization in
this case can be misleading. Even though the sat-
urated back-end services have high utilization in
Figure 7(b), microservices in the middle of the
figure also have even higher utilization, without
this translating to QoS violations.
Conversely, there are microservices with
relatively low utilization and degraded perfor-
mance, for example, due to waiting on a blocking/
synchronous request from another, saturated
Figure 6. Microservices graphs for three production tier. This highlights the need for cluster manag-
clouds, and our Social Network. ers that account for the impact dependencies
May/June 2020
15
Authorized licensed use limited to: b-on: Universidade de Lisboa Reitoria. Downloaded on January 18,2021 at 15:32:14 UTC from IEEE Xplore. Restrictions apply.
Top Picks
16 IEEE Micro
Authorized licensed use limited to: b-on: Universidade de Lisboa Reitoria. Downloaded on January 18,2021 at 15:32:14 UTC from IEEE Xplore. Restrictions apply.
As with the hardware and cluster management
implications above, these results again emphasize
the need for hardware and software techniques
that improve performance predictability at scale
without hurting latency and resource efficiency.
May/June 2020
17
Authorized licensed use limited to: b-on: Universidade de Lisboa Reitoria. Downloaded on January 18,2021 at 15:32:14 UTC from IEEE Xplore. Restrictions apply.
Top Picks
showed that programmable acceleration can 6. C. Delimitrou and C. Kozyrakis, “Quasar: Resource-
greatly reduce one of the primary overheads of efficient and QoSAware cluster management,” in
multitier services; network processing. Proc. 19th Int. Conf. Archit. Support Program.
As microservices continue to evolve, it is Lang. Oper. Syst., Salt Lake City, UT, USA, 2014,
essential for datacenter hardware, operating pp. 127–144.
and networking systems, cluster managers, and 7. C. Delimitrou and C. Kozyrakis, “HCloud: Resource-
programming frameworks to also evolve with efficient provisioning in shared cloud systems,” in
them, to ensure that their prevalence does not Proc. 21st Int. Conf. Archit. Support Program. Lang.
come at a performance and/or efficiency loss. Oper. Syst., Apr. 2016, pp. 473–488.
Both DeathStarBench and the resulting study of 8. C. Delimitrou and C. Kozyrakis, “Bolt: I know what you
the system implications of microservices are a did last summer... In the cloud,” in Proc. 22nd Int. Conf.
call to action for the research community to fur- Archit. Support Program. Lang. Oper. Syst., Apr. 2017,
ther explore the opportunities and challenges of pp. 599–613.
this emerging application model. 9. D. Firestone et al., “Azure accelerated networking:
Smartnics in the public cloud,” in Proc. 15th USENIX
ACKNOWLEDGMENTS Symp. Netw. Syst. Design Implementation, 2018,
We sincerely thank C. Kozyrakis, D. Sanchez, pp. 51–66.
D. Lo, as well as the academic and industrial users 10. Y. Gan et al., “An open-source benchmark suite for
of the benchmark suite, and the anonymous microservices and their hardware-software
reviewers for their feedback on earlier versions of implications for cloud and edge systems,” in Proc.
this article. This work was supported in part by 24th Int. Conf. Archit. Support Program. Lang. Oper.
an NSF CAREER award, in part by NSF grant CNS- Syst., Apr. 2019, pp. 3–18.
1422088, in part by a Google Faculty Research 11. Y. Gan et al., “Seer: Leveraging big data to
Award, in part by a Alfred P. Sloan Foundation Fel- navigate the complexity of performance debugging
lowship, in part by a Facebook Faculty Research in cloud microservices,” in Proc. 24th Int. Conf.
Award, in part by a John and Norma Balen Sesqui- Archit. Support Program. Lang. Oper. Syst., Apr.
centennial Faculty Fellowship, and in part by gen- 2019, pp. 19–33.
erous donations from Google Compute Engine, 12. D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and
Windows Azure, and Amazon EC2. C. Kozyrakis, “Heracles: Improving resource
efficiency at scale,” in Proc. 42nd Annu. Int. Symp.
Comput. Archit., 2015, pp. 450–462.
& REFERENCES
1. “The evolution of microservices,” 2016. [Online].
Available: https://www.slideshare.net/adriancockcroft/ Yu Gan is currently working toward the Ph.D. degree
evolution-of-microservices-craft-conference
with the School of Electrical and Computer Engineer-
ing, Cornell University, where he works on cloud
2. L. Barroso, U. Hoelzle, and P. Ranganathan, The
computing and root cause analysis for interactive
Datacenter as a Computer: An Introduction to the
microservices. He is a student member of IEEE and
Design of Warehouse-Scale Machines. Morgan & ACM. Contact him at yg397@cornell.edu.
Claypool: San Rafael, CA, USA, 2018.
3. S. Chen, S. Galon, C. Delimitrou, S. Manne, and Yanqi Zhang is currently working toward the Ph.D.
J. F. Martinez, “Workload characterization of degree with the School of Electrical and Computer
interactive cloud services on big and small server Engineering, Cornell University, where he works on
platforms,” in Proc. Int. Symp. Workload cloud systems and resource management for inter-
Characterization, Oct. 2017, pp. 125–134. active microservices. He is a student member of
4. J. Dean and L. A. Barroso, “The tail at scale,” IEEE and ACM. Contact him at yz2297@cornell.edu.
Commun. ACM, vol. 56 no. 2, pp. 74–80, 2013.
5. C. Delimitrou and C. Kozyrakis, “Paragon: QoS-aware Dailun Cheng is currently working toward the
scheduling for heterogeneous datacenters,” in Proc. M.Eng. degree with the School of Electrical and
18th Int. Conf. Archit. Support Program. Lang. Oper. Computer Engineering, Cornell University. Contact
Syst., Houston, TX, USA, 2013, pp. 77–88. him at dc924@cornell.edu.
18 IEEE Micro
Authorized licensed use limited to: b-on: Universidade de Lisboa Reitoria. Downloaded on January 18,2021 at 15:32:14 UTC from IEEE Xplore. Restrictions apply.
Ankitha Shetty is currently working toward Chris Colen is currently working toward the M.Eng.
the M.Eng. degree with the School of Computer degree with the School of Computer Science, Cornell
Science, Cornell University. Contact him at University. Contact him at cdc99@cornell.edu.
aas394@cornell.edu.
Fukang Wen is currently working toward the M.Eng.
Priyal Rathi is currently working toward the M.Eng. degree with the School of Computer Science, Cornell
degree with the School of Computer Science, Cornell University. Contact him at fw224@cornell.edu.
University. Contact him at pr348@cornell.edu.
Catherine Leung is currently working toward the
Nayan Katarki is currently working toward the M.Eng. degree with the School of Computer Science,
M.Eng. degree with the School of Electrical and Cornell University. Contact him at chl66@cornell.edu.
Computer Engineering, Cornell University. Contact
him at nk646@cornell.edu. Siyuan Wang is currently working toward the M.Eng.
degree with the School of Computer Science, Cornell
Ariana Bruno is currently working toward the University. Contact him at sw884@cornell.edu.
M.Eng. degree with the School of Electrical and
Computer Engineering, Cornell University. Contact Leon Zaruvinsky is currently working toward the
him at amb633@cornell.edu. M.Eng. degree with the School of Computer Science,
Cornell University. Contact him at laz37@cornell.edu.
Justin Hu is currently working toward the M.Eng.
degree with the School of Computer Science, Cornell Mateo Espinosa is currently working toward the
University. Contact him at jh2625@cornell.edu. M.Eng. degree with the School of Computer Science,
Cornell University. Contact him at me326@cornell.edu.
Brian Ritchken is currently working toward the
M.Eng. degree with the School of Electrical and Rick Lin is currently working toward the M.Eng.
Computer Engineering, Cornell University. Contact degree with the School of Electrical and Computer
him at bjr96@cornell.edu. Engineering, Cornell University. Contact him at
cl2545@cornell.edu.
Brendon Jackson is currently working toward the
M.Eng. degree with the School of Electrical and Zhongling Liu is currently working toward the
Computer Engineering, Cornell University. Contact M.Eng. degree with the School of Electrical and
him at btj28@cornell.edu. Computer Engineering, Cornell University. Contact
him at zl682@cornell.edu.
Kelvin Hu is currently working toward the M.Eng.
degree with the School of Computer Science, Cornell Jake Padilla is currently working toward the
University. Contact him at sh2442@cornell.edu. M.Eng. degree with the School of Computer Science,
Cornell University. Contact him at jsp264@cornell.edu.
Meghna Pancholi is currently working toward the
B.S. degree with the School of Computer Science, Christina Delimitrou is currently an Assistant
Cornell University. Contact him at mp832@cornell.edu. Professor with the School of Electrical and Computer
Engineering, Cornell University, where she works on
Yuan He is currently working toward the M.Eng. computer architecture and distributed systems.
degree with the School of Electrical and Computer Her research interests include resource-efficient data-
Engineering, Cornell University. Contact him at centers, scheduling and resource management with
yh772@cornell.edu. quality-of-service guarantees, emerging cloud and IoT
application models, and cloud security. Delimitrou
Brett Clancy is currently working toward the M.Eng. received the Ph.D. degree in electrical engineering
degree with the School of Computer Science, Cornell from Stanford University. She is a member of IEEE and
University. Contact him at bjc265@cornell.edu. ACM. Contact her at delimitrou@cornell.edu.
May/June 2020
19
Authorized licensed use limited to: b-on: Universidade de Lisboa Reitoria. Downloaded on January 18,2021 at 15:32:14 UTC from IEEE Xplore. Restrictions apply.