
The InfoQ eMag / Issue #86 / October 2020

The Real-Time APIs

Design, Operation, and Observation

In this issue: The Challenges of Building a Reliable Real-Time Event-Driven Ecosystem • Real-Time APIs: Mike Amundsen on Designing for Speed and Observability • Real-Time APIs in the Context of Apache Kafka

FACILITATING THE SPREAD OF KNOWLEDGE AND INNOVATION IN PROFESSIONAL SOFTWARE DEVELOPMENT

IN THIS ISSUE

The Challenges of Building a Reliable Real-Time Event-Driven Ecosystem

Real-Time APIs: Mike Amundsen on Designing for Speed and Observability

Load Testing APIs and Websites with Gatling: It's Never Too Late to Get Started

Four Case Studies for Implementing Real-Time APIs

Real-Time APIs in the Context of Apache Kafka

PRODUCTION EDITOR Ana Ciobotaru / COPY EDITORS Lawrence Nyveen & Susan Conant / DESIGN Dragos Balasoiu
GENERAL FEEDBACK feedback@infoq.com / ADVERTISING sales@infoq.com / EDITORIAL editors@infoq.com
CONTRIBUTORS

Matthew O'Riordan
is the technical co-founder of Ably, a global, cloud-based real-time messaging platform that provides APIs used by thousands of developers and businesses. Matthew has been a programmer for over 20 years and first started working on commercial internet projects in the mid-90s, when Internet Explorer 3 and Netscape were still battling it out. While he enjoys coding, the challenges he faces as an entrepreneur starting and scaling businesses are what drive him. Matthew has previously started and successfully exited two tech businesses.

Karthik Krishnaswamy
is Director of Product Marketing at F5 Networks, where he drives marketing initiatives for NGINX API Management and F5 Cloud Services. He is an experienced product marketer with a proven track record of developing and promoting IT solutions. Prior to F5, Karthik held similar positions at Fluke Networks, Cisco Systems, and Nimble Storage, a Hewlett Packard Enterprise company.

Guillaume Corré
works as a software engineer, consultant, and Tech Lead at Gatling Corp, based in Paris at Station F, the world's biggest startup campus. A Swiss army knife by nature, he enjoys simple things such as data viz, optimization, and crashing production environments using Gatling, the best developer tool to load test your application, preventing applications and websites from becoming victims of their own success and helping them face critical situations like go-lives or Black Fridays.

Robin Moffatt
is a Senior Developer Advocate at Confluent, the company founded by the original creators of Apache Kafka, as well as an Oracle ACE Director (Alumnus). He has been speaking at conferences since 2009, including QCon, Devoxx, Strata, Kafka Summit, and Øredev. You can find his talks online, subscribe to his YouTube channel (thanks, lockdown!), and read his blog.
A LETTER FROM THE EDITOR

Daniel Bryant works as a Product Architect at Datawire, is the News Manager at InfoQ, and is Chair for QCon London. His current technical expertise focuses on DevOps tooling, cloud/container platforms, and microservice implementations. Daniel is a leader within the London Java Community (LJC), contributes to several open source projects, writes for well-known technical websites such as InfoQ, O'Reilly, and DZone, and regularly presents at international conferences such as QCon, JavaOne, and Devoxx.

Application Programming Interfaces (APIs) are seemingly everywhere. Thanks to the popularity of web-based products, cloud-based X-as-a-service offerings, and IoT, it is becoming increasingly important for engineers to understand all aspects of APIs, from design, to building, to operation.

Research shows that there is increasing demand for near real-time APIs, in which speed and flexibility of response are vitally important. This eMag explores this emerging trend in more detail.

In "Real-Time APIs: Mike Amundsen on Designing for Speed and Observability", we learn how to meet the increasing demand placed on API-based systems and explore three areas for consideration: architecting for performance, monitoring for performance, and managing for performance.

Amundsen argues that understanding a system and the resulting performance -- and being able to identify bottlenecks -- is critical to meeting performance targets. Services must also be monitored, both from an operational perspective (SLIs) and from a business perspective (KPIs).

According to Akamai research, API calls now make up 83% of all web traffic. In "Four Case Studies for Implementing Real-Time APIs", Karthik Krishnaswamy explores the idea that competitive advantage is no longer won by simply having APIs; the key to gaining ground is the performance and reliability of those APIs.

The four case studies all provide interesting lessons. However, a key takeaway is that when revenue is correlated with speed, performance trumps feature richness in API management solutions. Teams should recognize when this is the case and design systems, applications, and infrastructure accordingly.

In "Load Testing APIs and Websites with Gatling: It's Never Too Late to Get Started", Guillaume Corré discusses how to conduct load tests against APIs and websites that can both validate performance after a long stretch of development and also get useful feedback from an application in order to increase its scaling capabilities and performance.

Corré argues that engineers should avoid creating "the cathedral" of load testing and ending up with little time to improve performance overall. Instead, write the simplest possible test and iterate from there. It is important to establish the goals, constraints, and conditions of any load test. And always identify and verify any assumptions.

In "Real-Time APIs in the Context of Apache Kafka", we change gears, and Robin Moffatt explores one of the challenges that we have always faced in building systems: how to exchange information between them efficiently whilst retaining the flexibility to modify the interfaces without undue impact elsewhere.

Moffatt argues that events offer a Goldilocks-style approach in which real-time APIs can be used as the foundation for applications that are flexible yet performant, loosely coupled yet efficient. He goes on to provide examples using Apache Kafka, a scalable event streaming platform with which you can build applications around the powerful concept of events.

In "The Challenges of Building a Reliable Real-Time Event-Driven Ecosystem", Matthew O'Riordan argues that to truly benefit from the power of real-time data, the entire tech stack needs to be event-driven.

When it comes to event-driven APIs, engineers can choose between multiple different protocols. Options include the simple webhook, the newer WebSub, and popular open protocols such as WebSockets, MQTT, or SSE. In addition to choosing a protocol, engineers also have to think about subscription models: server-initiated (push-based) or client-initiated (pull-based).

We hope you enjoy this collection of articles focused on real-time APIs. Please provide feedback via editors@infoq.com or find us on Twitter.
The Challenges of Building a Reliable Real-Time Event-Driven Ecosystem

by Matthew O'Riordan, Co-Founder of Ably

Globally, there is an increasing appetite for data delivered in real time. Since both producers and consumers are more and more interested in faster experiences and instantaneous data transactions, we are witnessing the emergence of the real-time API.

This new type of event-driven API is suitable for a wide variety of use cases. It can be used to power real-time functionality and technologies such as chat, alerts, and notifications, or IoT devices. Real-time APIs can also be used to stream high volumes of data between different businesses or different components of a system.

This article starts by exploring the fundamental differences between the REST model and real-time APIs. Up next, we dive into some of the many engineering challenges and considerations involved in building a reliable and scalable event-driven ecosystem, such as choosing the right communication protocol and subscription model, managing client- and server-side complexity, or scaling to support high-volume data streams.

What exactly is a real-time API?

Usually, when we talk about data being delivered in real time, we think about speed. By this logic, one could assume that improving REST APIs to be more responsive and able to execute operations in real time (or as close as possible) makes them real-time APIs. However, that's just an improvement of an existing condition, not a fundamental change. Just because a traditional REST API can deliver data in real time does not make it a real-time API.

The basic premise around real-time APIs is that they are event-driven. According to the event-driven design pattern, a system should react or respond to events as they happen. Multiple types of APIs can be regarded as event-driven, as illustrated in Figure 1.

Figure 1: The real-time API family. Source: Ably
Streaming, Pub/Sub, and Push are patterns that can be successfully delivered via an event-driven architecture. This makes all of them fall under the umbrella of event-driven APIs.

Unlike the popular request-response model of REST APIs, event-driven APIs follow an asynchronous communication model. An event-driven architecture consists of the following main components:

• Event producers—they push data to channels whenever an event takes place.

• Channels—they push the data received from event producers to event consumers.

• Event consumers—they subscribe to channels and consume the data.

Let's look at a simple and fictional example to better understand how these components interact. Let's say we have a football app that uses a data stream to deliver real-time updates to end-users whenever something relevant happens on the field. If a goal is scored, the event is pushed to a channel. When a consumer uses the app, they connect to the respective channel, which then pushes the event to the client device.

Note that in an event-driven architecture, producers and consumers are decoupled. Components perform their task independently and are unaware of each other. This separation of concerns allows you to more reliably scale a real-time system, and it can prevent potential issues with one of the components from impacting the other ones.

Compared to REST, event-driven APIs invert complexity and put more responsibility on the shoulders of producers rather than consumers. (See Figure 2.) This complexity inversion relates to the very foundation of the way event-driven APIs are designed. While in a REST paradigm the consumer is always responsible for maintaining state and always has to trigger requests to get updates, in an event-driven system the producer is responsible for maintaining state and pushing updates to the consumer.

Figure 2: REST vs event-driven: complexity is inverted. Source: Ably
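To make the producer–channel–consumer relationship from the football example concrete, here is a minimal, illustrative sketch in Scala. It is not Ably's API or the article's code; the Channel class and FootballApp names are invented for the example, and a production channel would of course sit on the network rather than in memory:

// Minimal pub/sub sketch: the producer only knows the channel, and the channel
// forwards each event to whoever happens to be subscribed at that moment.
class Channel[A] {
  private var subscribers = List.empty[A => Unit]

  def subscribe(handler: A => Unit): Unit =
    subscribers ::= handler          // a consumer registers interest

  def publish(event: A): Unit =
    subscribers.foreach(_(event))    // the producer never sees the consumers
}

object FootballApp extends App {
  val goals = new Channel[String]

  // Consumer side: two independent client devices subscribe.
  goals.subscribe(msg => println(s"Phone A received: $msg"))
  goals.subscribe(msg => println(s"Phone B received: $msg"))

  // Producer side: push an event whenever something happens on the field.
  goals.publish("GOAL! 1-0")
}

Because the producer only ever calls publish, adding or removing consumers requires no change on the producer side—which is the decoupling property the article describes.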
Event-driven architecture considerations

Building a dependable event-driven architecture is by no means an easy feat. There is an entire array of engineering challenges you will have to face and decisions you will have to make. Among them, protocol fragmentation and choosing the right subscription model (client-initiated or server-initiated) for your specific use case are some of the most pressing things you need to consider.

While traditional REST APIs all use HTTP as the transport and protocol layer, the situation is much more complex when it comes to event-driven APIs. You can choose between multiple different protocols. Options include the simple webhook, the newer WebSub, popular open protocols such as WebSockets, MQTT, or SSE, or even streaming protocols such as Kafka.

This diversity can be a double-edged sword—on one hand, you aren't restricted to only one protocol; on the other hand, you need to select the best one for your use case, which adds an additional layer of engineering complexity.

Besides choosing a protocol, you also have to think about subscription models: server-initiated (push-based) or client-initiated (pull-based). Note that some protocols can be used with both models, while some protocols only support one of the two subscription approaches. Of course, this brings even more engineering complexity to the table.

In a client-initiated model, the consumer is responsible for connecting and subscribing to an event-driven data stream. This model is simpler from a producer perspective: if no consumers are subscribed to the data stream, the producer has no work to do and relies on clients to decide when to reconnect. Additionally, complexities around maintaining state are also handled by consumers. Wildly popular and effective even when high volumes of data are involved, WebSockets represent the most common example of a client-initiated protocol.

In contrast, with a server-initiated approach, the producer is responsible for pushing data to consumers whenever an event occurs. This model is often preferable for consumers, especially when the volume of data increases, as they are not responsible for maintaining any state—this responsibility sits with the producer. The common webhook—which is great for pushing rather infrequent low-latency updates—is the most obvious example of a server-initiated protocol.

We are now going to dive into more detail and explore the strengths, weaknesses, and engineering complexities of client-initiated and server-initiated subscriptions.

Client-initiated vs. server-initiated models—challenges and use cases

Client-initiated models are the best choice for last-mile delivery of data to end-user devices. These devices only need access to data when they are online (connected) and don't care what happens when they are disconnected. Due to this fact, the complexity of the producer is reduced, as the server side doesn't need to be stateful. The complexity is kept at a low level even on the consumer side: generally, all client devices have to do is connect and subscribe to the channels they want to listen to for messages.

There are several client-initiated protocols you can choose from. The most popular and efficient ones are:

• WebSocket. Provides full-duplex communication channels over a single TCP connection, with much lower overhead than half-duplex alternatives such as HTTP polling. A great choice for financial tickers, location-based apps, and chat solutions.

• MQTT. The go-to protocol for streaming data between devices with limited CPU power and/or battery life, and networks with expensive or low bandwidth, unpredictable stability, or high latency. Great for IoT.

• SSE. An open, lightweight, subscribe-only protocol for event-driven data streams. Ideal for subscribing to data feeds, such as live sport updates (see the sketch below).
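To illustrate how little work the consumer does with a subscribe-only protocol such as SSE, here is a minimal sketch using the JDK 11+ HTTP client. The endpoint URL is hypothetical and the code is illustrative only, not tied to any particular vendor SDK:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object SseSubscriber extends App {
  // Hypothetical SSE endpoint that pushes live score updates.
  val request = HttpRequest.newBuilder(URI.create("https://example.com/scores/stream"))
    .header("Accept", "text/event-stream")
    .build()

  // ofLines() streams the long-lived response body line by line as it arrives.
  val response = HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofLines())

  // An SSE stream is just text: every "data:" line carries one event payload.
  response.body().forEach { line =>
    if (line.startsWith("data:"))
      println(s"Received event: ${line.stripPrefix("data:").trim}")
  }
}

The consumer simply opens a connection and reacts to events as they arrive; all the state about what to push, and when, lives on the producer side.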
Among these, WebSocket is arguably the most widely used protocol. There are even a couple of proprietary protocols and open solutions built on top of raw WebSockets, such as Socket.IO. All of them are generally lightweight, and they are well supported by various development platforms and programming languages. This makes them ideal for B2C data delivery.

Let's look at a real-life use case to demonstrate how WebSocket-based solutions can be used to power a client-initiated event-driven system. Tennis Australia (the governing body for tennis in Australia) wanted a solution that would allow them to stream real-time rally and commentary updates to tennis fans browsing the Australian Open website. Tennis Australia had no way of knowing how many client devices could subscribe to updates at any given moment, nor where these devices could be located throughout the world. Additionally, client devices are generally unpredictable—they can connect and disconnect at any moment.

Due to these constraints, a client-initiated model where a client device would open a connection whenever it wanted to subscribe to updates was the right way to go. However, since millions of client devices could connect at the same time, it wouldn't have been scalable to have a 1:1 relationship with each client device.

Tennis Australia was interested in keeping engineering complexity to a minimum—they wanted to publish one message every time there was an update and distribute that message to all connected client devices via a message broker.

In the end, instead of building their own proprietary solution, Tennis Australia chose to use Ably as the message broker. This enables Tennis Australia to keep things very simple on their side—all they have to do is publish a message to Ably whenever there's a score update. The message looks something like this:

var ably = new Ably.Realtime('API_KEY');
var channel = ably.channels.get('tennis-score-updates');

// Publish a message to the tennis-score-updates channel
channel.publish('score', 'Game Point!');

Ably then distributes that message to all connected client devices over WebSockets, using a pub/sub approach, while also handling most of the engineering complexity on the producer side, such as connection churn, backpressure, or message fan-out.

Things are kept simple for consumers as well. All a client device has to do is open a WebSocket connection and subscribe to updates:

// Subscribe to messages on channel
channel.subscribe('score', function(message) {
  alert(message.data);
});

Traditionally, developers use the client-initiated model to build apps for end-users. It's a sensible choice, since protocols like WebSockets, MQTT, or SSE can be successfully used to stream frequent updates to a high number of users, as demonstrated by the Tennis Australia example. However, it's hazardous to think that client-initiated models scale well when high-throughput streams of data are involved—I'm referring to scenarios where businesses exchange large volumes of data, or where an organization is streaming information from one system to another.

In such cases, it's usually not practical to stream all that data over a single consumer-initiated connection (between one server that is the producer and another one that is the consumer). Often, to manage the influx of data, the consumer needs to shard it across multiple nodes. But by doing so, the consumer also has to figure out how to distribute these smaller streams of data across multiple connections and deal with other complex engineering challenges, such as fault tolerance.

We'll use an example to better illustrate some of the consumer-side complexities. Let's say you have two servers (A and B) that are consuming two streams of data. Let's imagine that server A fails. How does server B know it needs to pick up additional work? How does it even know where server A left off? This is just a basic example, but imagine how hard it would be to manage hundreds of servers consuming hundreds of data streams. As a general rule, when there's a lot of complexity involved, it's the producer's responsibility to handle it; data ingestion should be as simple as possible for the consumer.

That's why in the case of streaming data at scale you should adopt a server-initiated model. This way, the responsibility of sharding data across multiple connections and managing those connections rests with the producer. Things are kept rather simple on the consumer side—consumers would typically have to use a load balancer to distribute the incoming data streams to available nodes for consumption, but that's about as complex as it gets.

Webhooks are often used in server-initiated models. The webhook is a very popular pattern because it's simple and effective. As a consumer, you would have a load balancer that receives webhook requests and distributes them to servers to be processed. However, webhooks become less and less effective as the volume of data increases. Webhooks are HTTP-based, so there's an overhead with each webhook event (message) because each one triggers a new request. In addition, webhooks provide no integrity or message-ordering guarantees.
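To make the consumer side of a webhook concrete, here is a minimal, illustrative sketch using the JDK's built-in HTTP server. The /webhooks path and the port are arbitrary choices for the example, not anything prescribed by the article:

import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress

object WebhookReceiver extends App {
  // Each incoming event is a brand-new HTTP request: that per-request overhead is
  // exactly why webhooks stop scaling gracefully as event volume grows.
  val server = HttpServer.create(new InetSocketAddress(8080), 0)

  server.createContext("/webhooks", (exchange: HttpExchange) => {
    val payload = new String(exchange.getRequestBody.readAllBytes(), "UTF-8")
    println(s"Received webhook event: $payload") // hand off to a worker pool in practice
    exchange.sendResponseHeaders(200, -1)        // acknowledge quickly; -1 means empty body
    exchange.close()
  })

  server.start()
}

In a real deployment this endpoint would sit behind the load balancer mentioned above, and the producer would retry delivery if the acknowledgment never arrives.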
That's why for streaming data at scale you should usually go with a streaming protocol such as AMQP, Kafka, or ActiveMQ, to name just a few. Streaming protocols generally have much lower overheads per message, and they provide ordering and integrity guarantees. They can even provide additional benefits—idempotency, for example. Last but not least, streaming protocols enable you to shard data before streaming it to consumers.

It's time to look at a real-life implementation of a server-initiated model. HubSpot is a well-known developer of marketing, sales, and customer service software. As part of its offering, HubSpot provides a chat service (Conversations) that enables communication between end-users. The organization is also interested in streaming all that chat data to other HubSpot services for onward processing and persistent storage. Using a client-initiated subscription to successfully stream high volumes of data to their internal message buses is not really an option. For this to happen, HubSpot would need to know what channels are active at any point in time, in order to pull data from them.

To avoid having to deal with complex engineering challenges, HubSpot decided to use Ably as a message broker that enables chat communication between end-users. Furthermore, Ably uses a server-initiated model to push chat data into Amazon Kinesis, which is the data processing component of HubSpot's message bus ecosystem. (See Figure 3.)

Consumer complexity is kept to a minimum. HubSpot only has to expose a Kinesis endpoint, and Ably streams the chat data over as many connections as needed.

A brief conclusion

Hopefully, this article offers a taste of what real-time APIs are and helps readers navigate some of the many complexities and challenges of building an effective real-time architecture. It is naive to think that by improving a traditional REST API to be quicker and more responsive you get a real-time API. Real time in the context of APIs means so much more.

By design, real-time APIs are event-driven; this is a fundamental shift from the request-response pattern of RESTful services. In the event-driven paradigm, the responsibility is inverted, and the core of engineering complexities rests with the data producer, with the purpose of making data ingestion as easy as possible for the consumer.
But having one real-time API is not enough—this is not a solution to a problem. To truly benefit from the power of real-time data, your entire tech stack needs to be event-driven. Perhaps we should start talking more about event-driven architectures than about event-driven APIs. After all, can pushing high volumes of data to an endpoint (see the HubSpot example above for details) even be classified as an API?

Figure 3: High-level overview of HubSpot chat architecture. Source: Ably

TL;DR

• Globally, there is an increasing appetite for data delivered in real time. As both producers and consumers are more and more interested in faster experiences and instantaneous data transactions, we are witnessing the emergence of the real-time API.

• When it comes to event-driven APIs, engineers can choose between multiple different protocols. Options include the simple webhook, the newer WebSub, popular open protocols such as WebSockets, MQTT, or SSE, or even streaming protocols, such as Kafka. In addition to choosing a protocol, engineers also have to think about subscription models: server-initiated (push-based) or client-initiated (pull-based).

• Client-initiated models are the best choice for the "last mile" delivery of data to end-user devices. These devices only need access to data when they are online (connected) and don't care what happens when they are disconnected. Due to this fact, the complexity of the producer is reduced, as the server side doesn't need to be stateful.

• In the case of streaming data at scale, engineers should adopt a server-initiated model. The responsibility of sharding data across multiple connections and managing those connections rests with the producer, and other than the potential use of a client-side load balancer, things are kept rather simple on the consumer side.
Real-Time APIs: Mike Amundsen on Designing for Speed and Observability

by Daniel Bryant, Product Architect @ambassadorlabs | News Manager @InfoQ | Chair @QConLondon

In a recent apidays webinar, Mike Amundsen, trainer and author of the recent O'Reilly book API Traffic Management 101, presented "High Performing APIs: Architecting for Speed at Scale". Drawing on recent research by IDC, he argued that organizations will have to drive systemic changes to meet the upcoming increased demand for consumption of business services via APIs. This change in requirements relates to both an increased scale of operation and a decreased response time.

The changing nature of customer expectations, combined with new data-driven products and an increase in consumption at the edge (mobile, IoT, etc.), has meant that the need for low-latency "real-time APIs" is rapidly becoming the norm.

The recent adoption of cloud technology and microservices-based architecture has enabled innovation and increased the speed of development throughout the business world. However, this technology and architecture style is not always conducive to creating performant applications; in the cloud, nearly everything communicates over a virtualized network, and in a service-oriented architecture multiple separate processes are typically invoked to fulfill each business function. Amundsen stated that a series of holistic changes to how systems are designed, operated, and managed is required to meet the new demands.
The performance imperative

Amundsen began his presentation by describing the "performance imperative" he is seeing throughout the IT industry. Focusing first on the ecosystem transformation, he referenced a 2019 IDC research report, "IDC MaturityScape: Digital Transformation Platforms 1.0." This report states that 75% of organizations will be completely digitally transformed in the next decade, with those companies not embracing a modern way of working not surviving. By 2022, 90% of new applications being built will feature a microservice architecture, and 35% of all production applications will be "cloud native."

API call volumes are increasing as more organizations embrace digital transformations. According to the same report, 71% of organizations expect to see the volume of API calls increase in the next 2 years. About 60% expect over 250 million API calls per month (~10 million per business day). There is an increasing focus on transaction response time, with 59% of organizations expecting the latency of a typical API request to be under 20 milliseconds, and 93% expecting a latency under 50 milliseconds.

Amundsen argued that to meet the increasing demand placed on organizations, there are three areas for consideration: architecting for performance, monitoring for performance, and managing for performance. (See figure below.)

Architecting for performance

Migrating applications to a cloud vendor's platform brings many benefits, such as the ability to take advantage of data processing or machine learning-based services to innovate rapidly, or reducing the total cost of ownership (TCO) of an organization's platform. However, cloud infrastructure can be significantly different than traditional hardware, both in terms of configuration and performance. Amundsen cautioned that performing a "lift and shift" of an existing application is not enough to ensure performance.
The vast majority of infrastructure components and services within a cloud platform are connected over a network—e.g., block stores are often implemented as network-attached storage (NAS). Colocation of components is not guaranteed—e.g., an API gateway might be located in a different physical data center than the virtual machine (VM) a backend application is running on. The global reach of cloud platforms opens opportunities for reaching new customers and also provides more effective disaster recovery options, but this also means that the distance between your customers and your services can increase.

Amundsen suggested that engineering teams redesign components into smaller services, following the best practices associated with the microservices architectural style. These services can be independently released and independently scaled. Embracing asynchronous methods of communicating, such as messaging and events, can reduce wait states. This offers the possibility of an API call rapidly returning an acknowledgment and decreasing latency from an end-user perspective, even if the actual work has been queued at the backend. Engineers will also need to build in transaction reversal and recovery processes to reduce the impact of inevitable failures. Additional thought leaders in this space, such as Bernd Ruecker, encourage engineers to build "smart APIs" and learn more about business process modeling and event sourcing.

For systems to perform as required, data read and write patterns will frequently have to be reengineered. Amundsen suggested judicious use of caching results, which can remove the need to constantly query upstream services. Data may also need to be "staged" appropriately throughout the entire end-to-end request handling process—for example, caching results and data in localized points of presence (PoPs) via content delivery networks (CDNs), caching in an API gateway, and replicating data stores across availability zones (local data centers) and globally. For some high transaction-throughput use cases, writes may have to be streamed to meet demand—for example, writing data locally, or via a high-throughput distributed logging system like Apache Kafka, for writing to an external data store at a later point in time.

Engineers may have to "rethink the network" (respecting the eight fallacies of distributed computing) and design their cloud infrastructure to follow best practices relevant to their cloud vendor and application architecture. Decreasing request and response size may also be required to meet demands. This may be engineered in tandem with the ability to increase the message volume. The industry may see the "return of the RPC," with strongly typed communication contracts and high-performance binary protocols. As convenient as JSON over HTTP is, a lot of computation (and time) is used in serialization and deserialization, and the text-based message payloads sent over the wire are typically larger.
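As a minimal illustration of the acknowledge-then-queue idea mentioned above—not code from the presentation, and with an invented endpoint, port, and queue—a handler can enqueue the real work and return 202 Accepted immediately:

import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress
import java.util.concurrent.{Executors, LinkedBlockingQueue}

object AsyncAckService extends App {
  // Work accepted by the API is queued here and processed out of band.
  val workQueue = new LinkedBlockingQueue[String]()

  // Background worker: does the slow part after the caller has already been answered.
  Executors.newSingleThreadExecutor().submit(new Runnable {
    def run(): Unit = while (true) {
      val job = workQueue.take()
      Thread.sleep(500)                     // stand-in for the expensive processing
      println(s"Finished processing: $job")
    }
  })

  val server = HttpServer.create(new InetSocketAddress(8080), 0)
  server.createContext("/orders", (exchange: HttpExchange) => {
    val body = new String(exchange.getRequestBody.readAllBytes(), "UTF-8")
    workQueue.put(body)                     // enqueue the actual work
    exchange.sendResponseHeaders(202, -1)   // 202 Accepted: acknowledged, not yet done
    exchange.close()
  })
  server.start()
}

The caller sees a fast acknowledgment while the work itself is completed asynchronously; in a real system the in-memory queue would be replaced by a durable broker and the client would need a way to learn the eventual outcome.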
Monitoring for performance

Understanding a system and the resulting performance, and being able to identify bottlenecks, are critical to meeting performance targets. Observability and monitoring are critical aspects of any modern application platform. Infrastructure must be monitored, from machines to all aspects of the networking stack. Modern cloud networking may consist of many layers—from the high-level API gateway and service mesh implementations, to the cloud software-defined networking (SDN) components, to the actual virtualized and physical networking hardware—and these need to be instrumented effectively.

Services must also be monitored, both from an operational perspective and from a business perspective. Many organizations are increasingly adopting the site reliability engineering (SRE) approach of defining and collecting service level indicators (SLIs) for operational metrics, which consist of top-line metrics such as utilization, saturation, and errors (USE), or request rate, errors, and duration (RED). Business metrics are typically related to key performance indicators (KPIs) that are driven by an organization's objectives and key results (OKRs).

With infrastructure emitting metrics and services producing SLI- and KPI-based metrics, the corresponding data must be effectively captured, processed, and presented to key stakeholders for it to be acted on. This is often a cultural challenge as much as it is a technical challenge.

Managing for performance

Amundsen stated that the organization should aim to create a "dashboard culture." Monitoring top-line metrics, user traffic, and the continuous delivery process for "insight" into how a system is performing is vitally important. As stated by Dr. Nicole Forsgren and colleagues in Accelerate, the four key metrics that are correlated with high-performing organizations are lead time, deployment frequency, mean time to restore (MTTR), and change failure percentage.
Engineers should be able to use internal tooling to effectively solve problems, such as controlling access to functionality, scaling services, and diagnosing errors.

The addition of security mitigation is often in conflict with performance—for example, adding layers of security verification can slow responses—but this is a delicate balancing act. Engineers should be able to understand systems, their threat models, and whether there are any indicators of current problems—e.g., an increased number of HTTP 503 error codes being generated, or a gateway being overloaded with traffic.

The elastic nature of cloud infrastructure typically allows almost unlimited scaling, albeit with potentially unlimited cost. However, the culture of managing for performance requires the ability to easily understand how current systems are being used. For example, engineers need to know whether scaling specific components is an appropriate response to seeing increased loads at any given point in time.

Diagnosing errors in a cloud-based microservices system typically requires an effective user experience (or developer experience), in addition to a collection of tooling to collect, process, and display metrics, logs, and traces. Dashboards and tooling should show top-level metrics related to a customer's experience, and also allow engineers to test hypotheses for issues by drilling down to see individual service-to-service metrics. Being able to answer ad hoc questions of an observability or monitoring solution is currently an area of active research and product development.

Once the ability to gain insight and solve problems has been obtained, the next stage of managing for performance is the ability to anticipate. This is the capability to automatically recover from failure—building antifragile systems—and the ability to experiment with the system, such as examining fault-tolerance properties via the use of chaos engineering.
Summary

Amundsen concluded the presentation by reminding the audience of the IDC research report: the IT industry should prepare for API call volumes to increase and the requirements for transaction processing time to decrease. Supporting these new requirements will demand organization- and system-wide changes.

Software developers will have to learn about redesigning services, reengineering data, and instrumenting services for increased visibility into both operational and business metrics. Platform teams will have to consider rethinking networks, monitoring infrastructure, and managing traffic. Application operation and SRE teams will need to work across the engineering organization to enable the effective identification and resolution of problems, and also to anticipate issues and enable experimentation.

A more detailed exploration of the topics discussed here can be found in Mike Amundsen's recent O'Reilly book API Traffic Management 101.

TL;DR

• The "IDC MaturityScape: Digital Transformation Platforms 1.0" report states that 71% of organizations expect to see the volume of API calls increase in the next 2 years. 59% of organizations expect the latency of a typical API request to be under 20 milliseconds, and 93% expect a latency under 50 milliseconds.

• According to the same report, 90% of new applications being built will feature a microservice architecture, and 35% of all production applications will be "cloud native."

• To meet the increasing demand placed on API-based systems, there are three areas for consideration: architecting for performance, monitoring for performance, and managing for performance.

• Good architectural practices include avoiding a simple "lift and shift" to the cloud, embracing asynchronous processing, and reengineering data and networks.

• Understanding a system and the resulting performance, and being able to identify bottlenecks, is critical to meeting performance targets. Services must also be monitored, both from an operational perspective (SLIs) and from a business perspective (KPIs).
Load Testing APIs and Websites with Gatling: It's Never Too Late to Get Started

by Guillaume Corré, Tech Lead, Gatling Corp

You open your software engineering logbook and start writing—"Research and Development Team's log. Never could we have foreseen such a failure with our application and frankly, I don't get it. We had it all; it was a masterpiece! Tests were green, metrics were good, users were happy, and yet when the traffic arrived in droves, we still failed."

Pausing to take a deep breath, you continue—"We thought we were prepared; we thought we had the hang of running a popular site. How did we do? Not even close. We initially thought that a link on TechCrunch wouldn't bring that much traffic. We also thought we could handle the spike in load after our TV spot ran. What was the point of all this testing if the application stopped responding right after launch, and we couldn't fix this regardless of the number of servers we threw at it, trying to salvage the situation as best we could?"

After a long pause, "It's already late, time goes by. There is so much left to tell you, so many things I wish I knew, so many mistakes I wish I never made." Unstoppable, you write, "If only there was something we could have done to make it work..."

And then, lucidity striking back at you, you wonder: why do we even do load testing in the first place? Is it to validate performance after a long stretch of development, or is it really to get useful feedback from our application and increase its scaling capabilities and performance? In both cases validation takes place, but in the latter, getting feedback is at the center of the process.

You see, this isn't really about load testing per se; the goal is to make sure your application won't crash when it goes live. You don't want to be writing "the cathedral" of load testing and end up with little time to improve performance overall. Often, working on improvements is where you want to spend most if not all of your time. Load testing is just a means to an end and nothing else.

If you are afraid because your deadline is tomorrow and you are looking for a quick win, then welcome, this is the article for you.

Writing the simplest possible simulation

In this article, we will be writing a bit of code—Scala code, to be more precise, as we will be using Gatling. While the code might look scary, I can assure you it is not. In fact, Scala can be left aside for the moment and we can think about Gatling as its own language:
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class SimplestPossibleSimulation extends Simulation {

  val baseHttpProtocol =
    http.baseUrl("https://computer-database.gatling.io")

  val scn = scenario("simplest")
    .exec(
      http("Home")
        .get("/")
    )

  setUp(
    scn.inject(atOnceUsers(1))
  ).protocols(baseHttpProtocol)
}

Let's break this down into smaller parts. As Gatling is built using Scala, itself running on the Java Virtual Machine, there are some similarities to expect. The first one is that the code always starts with package imports:

import io.gatling.core.Predef._
import io.gatling.http.Predef._

They contain all the Gatling language definitions, and you'll use them all the time, along with other imports, depending on the need. Then, you create a class to encapsulate the simulation:

class SimplestPossibleSimulation extends Simulation {}

SimplestPossibleSimulation is the name of the simulation, and Simulation is a construct defined by Gatling that you "extend" and that will contain all the simulation code, in three parts: (1) we need to define which protocol we want to use and which parameters we want it to have:

val baseHttpProtocol =
  http.baseUrl("https://computer-database.gatling.io")

(2) The code of the scenario itself:

val scn = scenario("simplest")
  .exec(
    http("Home")
      .get("/")
  )

Notice I used the term scenario even though we only spoke about simulations until now. While they are often conflated, they have really distinct meanings:

• A scenario describes the journey a single user performs on the application, navigating from page to page, endpoint to endpoint, and so on, depending on the type of application

• A simulation is the definition of a complete test, with populations (e.g., admins and users) assigned to their scenario

• When you launch a simulation, the result is called a run

This terminology helps make sense of the final section, (3) setting up the simulation:

setUp(
  scn.inject(atOnceUsers(1))
).protocols(baseHttpProtocol)

Since a simulation is roughly a collection of scenarios, it is the sum of all the scenarios configured with their own injection profile (which we will ignore for now). The protocol on which all the requests in a scenario are based can be either shared, if chained after setUp, or defined per scenario, if we had multiple ones, scn1 and scn2, as such:

setUp(
  scn1.inject(atOnceUsers(1)).protocols(baseHttpProtocol1),
  scn2.inject(atOnceUsers(1)).protocols(baseHttpProtocol2)
)

Notice how everything is chained, separated by a comma: scn1 configured with such injection profile and such protocol, etc.

When you run a Gatling simulation, it will launch all the scenarios configured inside at the same time, which is what we'll do right now.

Running a simulation

One way to run a Gatling simulation is by using the Gatling Open-Source bundle. It is a self-contained project with a folder to drop simulation files inside and a script to run them.

Warning! All installations require a proper Java installation, JDK 8 minimum; see the Gatling installation docs for all the details.

After unzipping the bundle, you can create a file named SimplestPossibleSimulation.scala inside the user-files/simulations folder. Then, from the command line, at the root of the bundle folder, type:

./bin/gatling.sh

Or, on Windows:

.\bin\gatling.bat

This will scan the simulations inside the previous folder and prompt you, asking which one you want to run. As the bundle contains some examples, you'll have more than one to choose from.

While this is the easiest way to run a Gatling simulation, it doesn't work well with source repositories. The preferred way is to use a build tool, such as Maven or SBT. If this is the solution you would like to try, you can clone our demo repositories using git:

• Gatling Maven Plugin Demo

• Gatling's SBT plugin demo

Or, if you want to follow along with this article, you can clone the following repository, which was made for the occasion:

• Gatling Maven and SBT project

After running the test, you notice there is a URL at the end of the output, which you open and stumble upon:
You will notice there is a request named "Home Redirect 1" which we didn't make ourselves. You might be thinking it's probably Gatling following a redirect automatically and separating the response times from all the different queries—and you would be right. But it does look kind of underwhelming.

Configuring the injection profile

Looking back, when configuring our scenario we did the following:

scn.inject(atOnceUsers(1))

This is great for debugging purposes, but not really thrilling in relation to load testing. The thing is, there is no right or wrong way to write an injection profile, but first things first.

An injection profile is a chain of rules, executed in the provided order, that describes at which rate you want your users to start their own scenario. While you might not know the exact profile you must write, you almost always have an idea of the expected behavior. This is because you are typically either anticipating a lot of incoming users at a specific point in time, or you are expecting more users to come as your business grows.

Questions to think about include: do you expect users to arrive all at the same time? This would be the case if you intend to offer a flash sale or your website will appear on TV soon. And do you have an idea of the pattern your users will follow? It could be that users arrive at specific hours, are spread across the day, or appear only within working hours, etc. This knowledge will give you an angle of attack, which we call a type of test. I will present three of them.

Stress testing

When we think about load testing, we often think about "stress testing," which it turns out is only a single type of test; "flash sales" is the underlying meaning. The idea is simple: lots of users, in the smallest amount of time possible. atOnceUsers is perfect for that, with some caveats:

scn.inject(
  atOnceUsers(10000)
)

It is difficult to say how many users you can launch at the same time without being too demanding on hardware. There are a lot of possible limitations: CPU, memory, bandwidth, the number of connections the Linux kernel can open, the number of available sockets on the machine, etc. It is easy to strain the hardware by putting in too high a value and to end up with nonsensical results.

This is where you could need multiple machines to run your test. To give an example, a Linux kernel, if optimized properly, can easily open 5k connections every second, 10k being already too much.

Splitting 10 thousand users over the span of a minute would still be considered a stress test, because of how strenuous the time constraint is:

scn.inject(
  rampUsers(10000) during 1.minute
)

Soak test

If the time span gets too long, the user journey will start feeling more familiar. Think "users per day," "users per week," and so on. However, imitating how users arrive along a day is not the purpose of a soak test, just the consequence.

When soak testing, what you want to test is the system behavior over a long stretch of time: how does the CPU behave? Can we see any memory leaks? How do the disks behave? The network?

A way of doing this is to model users arriving over a long period of time:

scn.inject(
  rampUsers(10000000) during 10.hours // ~277 users per sec
)

This will feel like doing "users per day." Still, if modeling your analytics is your goal, then computing the number of users per second that your ramp-up is doing and reducing the duration would give you results faster:

scn.inject(
  constantUsersPerSec(300) during 10.minutes
)

Speaking about this, rampUsers versus constantUsersPerSec is just a matter of taste. The former gives you an easy overview of the total users arriving, while the latter is more about throughput. However, thinking about throughput makes it easier to do "ramping up," i.e., progressively arriving at the final destination:

scn.inject(
  rampUsersPerSec(1) to 300 during 10.minutes,
  constantUsersPerSec(300) during 2.hours
)

Capacity test

Finally, you could simply be testing how much throughput your system can handle, in which case a capacity test is the way to go. Combining methods from the previous tests, the idea is to level the throughput for some arbitrary time, increase the load, level again, and continue until everything goes down and we get a limit. Something like:

scn.inject(
  constantUsersPerSec(10) during 5.minutes,
  rampUsersPerSec(10) to 20 during 30.seconds,
  constantUsersPerSec(20) during 5.minutes,
  rampUsersPerSec(20) to 30 during 30.seconds,
  ...
)

You could say doing this 20 times could be a bit cumbersome… but as the base of Gatling is code, you could either make a loop that generates the previous injection profile, or use our DSL dedicated to capacity testing:

scn.inject(
  incrementUsersPerSec(10)
    .times(20)
    .eachLevelLasting(5.minutes)
    .separatedByRampsLasting(30.seconds) // optional
    .startingFrom(10) // users per sec too!
)

See Figure 1 below for how to model this graphically. And that's it!

Where to start with load testing?

If it is your first time load testing, whether you already know the target user behavior or not, you should start with a capacity test. Stress testing is useful, but analyzing the metrics is really tricky under such a load. Since everything is failing at the same time, it makes the task difficult, even impossible. Capacity testing offers the luxury of going slowly to failure, which is more comfortable for a first analysis.

To get started, just run a capacity test that makes your application crash as soon as possible. You only need to add complexity to the scenario when everything seems to run smoothly.

Then, you need to look at the metrics. (See Figure 2.)

What do all of these results even mean?

The chart in Figure 2 shows response time percentiles. When load testing, we could be tempted to use averages to analyze the global response time, and that would be error-prone. While an average can give you a quick overview of what happened in a run, it will hide under the rug all the things you actually want to look at. This is where percentiles come in handy.
Figure 1

Figure 2

Think of it this way: if the average response time is some amount of milliseconds, how does the experience feel in the worst case for 1% of your user base? Better or worse? How does it feel for 0.1% of your users? And so on, getting closer and closer to zero. The higher the number of users and requests, the closer you'll need to get to zero in order to study extreme behaviors. To give you an example, if you had 3 million users performing a single request on your application and 0.01% of them timed out, that would be 300 users who weren't able to access your application.

Percentiles are usually used, which correspond to thinking about this the other way around. The 1% worst case for users is turned into "how does the experience feel at best for 99% of the users," 0.1% is turned into 99.9%, etc. A way to model the computation is to sort all the response times in ascending order and mark spots (see the sketch after this list):

• 0% of the response times is the minimum, marking the lowest value

• 100% is the maximum

• 50% is the median

From here, you go to 99% and as close to 100% as possible, by adding nines, depending on how many users you have.
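A minimal, illustrative way to compute such percentiles over a list of recorded response times—using the simple nearest-rank method, not Gatling's internal implementation—looks like this:

object PercentileSketch extends App {
  // Nearest-rank percentile: sort the response times and pick the value at the
  // requested position. Illustrative only; Gatling computes this for you in its reports.
  def percentile(responseTimesMs: Seq[Int], p: Double): Int = {
    val sorted = responseTimesMs.sorted
    val rank = math.ceil((p / 100.0) * sorted.size).toInt.max(1)
    sorted(rank - 1)
  }

  val times = Seq(12, 15, 18, 20, 22, 25, 30, 45, 63, 65) // milliseconds

  println(s"50th percentile (median): ${percentile(times, 50)} ms")
  println(s"99th percentile: ${percentile(times, 99)} ms")
  println(s"Max (100th percentile): ${percentile(times, 100)} ms")
}

With this tiny sample, the median is 22 ms while the 99th percentile is 65 ms—exactly the kind of gap an average would hide.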
In the previous chart, we had a 99th percentile of 63 milliseconds, but is that good or not? With the maximum being 65, it would seem so. Would it be better if it were 10 milliseconds, though?

Most of the time, metrics are contextual and don't have any meaning by themselves. Broadly speaking, a 10ms response time on localhost isn't an achievement, and it would be impossible from Paris to Australia due to the speed-of-light constraint. We have to ask, "in what conditions was the run actually performed?" This will greatly help us deduce whether or not the run was actually that good. These conditions include:

• What type of server is the application running on?

• Where is it located?

• What is the application doing?

• Is it accessed over a network?

• Does it have TLS?

• What is the scenario doing?

• How do you expect the application to behave?

• Does everything run on the same cloud provider in the same data center?

• Are there any kinds of latency to be expected? Think mobile (3G), long distance.

• Etc.

This is the most important part. If you know the running conditions of a test, you can do wonders. You can (and should) test locally, with everything running on your computer, as long as you understand what it boils down to: no inference can be made on how it will run in production, but it allows you to do regression testing. Just compare the metrics between multiple runs, and they will tell you whether you made the performance better or worse.

You don't need a full-scale testing environment to do that. If you know your end-users are mobile users, testing with all machines located in a single data center could lead to a disaster. However, you don't need to be 100% realistic either. Just keep in mind what it implies and what you can deduce from the testing conditions: a data center being a huge local network, results can't be compared to mobile networks, and observed results would in fact be way better than reality.

Specifying the need

Not only should you be aware of the conditions under which the tests are run, but you should also decide beforehand what makes the test a success or a failure. To do that, we define criteria, such as:

1. Mean response time under 250ms

2. 2000 active users

3. Less than 1% failed requests

There is a catch though. Out of these three acceptance criteria, only one is actually useful to describe a system under load. Can you guess which, and what the others can be replaced with?

(1) "Mean response time under 250ms": It should come naturally from the previous section that while the average isn't bad, it isn't sufficient to describe user behavior; it should always be seconded with percentiles:

• Mean response time under 250ms

• 99th percentile under 250ms

• Max under 1000ms

(2) "2000 active users": This one is more tricky. Nowadays we are flooded with analytics showing us "active users" and such, so it can be tempting to define acceptance criteria using this measurement. That is an issue though. The only way to model an amount of active users directly is by creating a closed model. Here, you will end up with a queue of users, and this is where the issue lies. Just think about it. If your application were to slow down and users were in a queue, they would start piling up in the queue, but only a small amount (the maximum amount allowed in the application) would be "active" and performing requests. The amount of active users would stay the same, but users would be waiting outside, as they would be at a shop.
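For reference, this is roughly what such a closed, "active users" workload model looks like in Gatling's DSL. This is an illustrative fragment only, meant to sit inside a simulation class like the earlier examples (with the same imports, scn, and baseHttpProtocol); the figure of 2000 comes from the acceptance criteria above, and the duration is an arbitrary choice:

// Closed workload model: keep exactly 2000 virtual users active, starting a new
// one only when a previous one finishes its scenario (hence the queueing behavior).
setUp(
  scn.inject(
    constantConcurrentUsers(2000) during 10.minutes
  )
).protocols(baseHttpProtocol)

Contrast this with the open, rate-based profiles used earlier (rampUsersPerSec, constantUsersPerSec), which keep injecting arrivals regardless of how the system is coping—which is usually closer to how real traffic behaves.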
In real life, if a website goes down, people will continue to refresh the page until something comes up, which will further worsen the performance of the website until people give up and you lose them. You shouldn't model active users directly unless that really is your use case—you should instead aim at a target amount of active users. It can be difficult, but doable. It will depend on the duration of the scenario (including response times, pauses, etc.), the network, the injection profile, and so on.

This is what you could measure instead:

• 2000 active users

• Between 100 and 200 new users per second

• More than 100 requests per second

(3) "Less than 1% failed requests" was in fact the only criterion of the three that properly represents a system under load. However, it is not to be taken as a rule of thumb. Depending on the use case, 1% may be too high. Think of an e-commerce site: you might allow some pages to fail here and there, but having a failure at the end of the conversion funnel, right before buying, would be fatal to your business model. You would give this specific request a failure criterion of 0% and the rest either a higher percentage or no criterion at all.

All of this leads to a specification based on the Given-When-Then model and how to think about load testing in general, with everything we learned so far:

• Given: Injection Profile

• When: Scenario

• Then: Acceptance Criteria

Using the previous examples, it can look like this:

• Given: a load of 200 users per second

• When: users visit the home page

• Then: we get at least 100 requests per second, the 99th percentile is under 250ms, and there are less than 1% failed requests

Finally, the Given-When-Then model can be fully integrated as Gatling simulation code:

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class FullySpecifiedSimulation extends Simulation {

  val baseHttpProtocol =
    http.baseUrl("https://computer-database.gatling.io")

  // When
  val scn = scenario("simplest")
    .exec(
      http("Home")
        .get("/")
    )

  // Given
  setUp(
    scn.inject(
      rampUsersPerSec(1) to 200 during 1.minute,
      constantUsersPerSec(200) during 9.minutes
    )
  ).protocols(baseHttpProtocol)
    // Then
    .assertions(
      global.requestsPerSec.gte(100),
      global.responseTime.percentile(99).lt(250),
      global.failedRequests.percent.lte(1)
    )
}

The Scenario (When) part didn't change (yet), but you will find the Injection Profile (Given) in the same place, and a new part, called assertions, as the Then. What assertions do is compute metrics over the whole simulation and fail the "build" if they are under/over the requirements. Using a Maven project, for example, you'll see:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:07 min
[INFO] Finished at: 2020-08-27T20:52:52+02:00
[INFO] ------------------------------------------------------------------------

And BUILD FAILURE otherwise. This is handy when using a CI tool such as Jenkins, for which we provide a Gatling Jenkins plugin. Using it, you can configure the running of simulations right into your CI, and get notified if your acceptance criteria fail and by how much.

Note that in the previous example we used the keyword global as a starting point, but you could be as precise as having a single request with no failures and ignore all the others:

.assertions(
  details("Checkout").failedRequests.percent.is(0)
)

You can find more examples in our assertions documentation.

Defining a test protocol
If you believe that a specific variable, for example the network, is the culprit behind the poor performance, you will need a proper testing protocol to test your assumptions. Each variable should be tested against a “witness.” If you suspect “something” is the cause, test it with and without, then compare. Make a baseline and make small variations, then test against this.

It boils down to ALWAYS TESTING YOUR ASSUMPTIONS.

Additional tooling will help catch the essentials that Gatling can’t. As all metrics given by Gatling are from the point of view of the user, you’ll need to make sure you are equipped with system and application monitoring on the other side of the spectrum. System monitoring is also very useful to have on the Gatling side. It will help you see if you are using too many resources on the machine on a huge test, with information such as CPU, memory, network connections and their states, networking issues, etc. There are many popular metrics collection solutions out there, such as Prometheus combined with Node Exporter. These can be used alongside Grafana combined with an appropriate dashboard. Let it run and collect metrics all the time; when you run a test, check the metrics over the test’s time window.

Equipped with such tools, the test protocol will boil down to:

1. Ensure system and application monitoring is enabled
2. Run a load test (capacity, for example)
3. Analyze and deduce a limit
4. Tune
5. Try again

Going beyond
Closing your logbook, you realize you weren’t that far off succeeding after all. Rather than building a full-fledged load testing Cathedral, you decide to go one small step at a time, understanding that knowledge is key. The sooner you have the information you need, the better.

From now on, when everything seems to work fine, you will have the possibility to add complexity to the scenario, approaching more and more the way your users actually perform actions on your application. Your most helpful resources will be the Gatling OSS documentation, along with:

• The Quickstart, so you can learn how to record this user journey without writing much code right at the beginning
• The Advanced Tutorial to go deeper
• The Cheat-Sheet that lists all the Gatling language keywords and their usage
Four Case Studies for
Implementing Real-Time APIs
by Karthik Krishnaswamy, Director, Product Marketing at F5

API calls now make up 83% of all web traffic. The age of the API has arrived, and companies should be well past the point of just having enthusiasm for developing APIs — they need them to survive in digital business. But in the digital era, it’s easy for your customers and partners to switch services. Don’t like your bank? Opening a new account in another bank is as simple as downloading an app. APIs have made it easy for us to consume services across all industries.

That means competitive advantage is no longer won by simply having the APIs; the key to gaining ground is based on the performance and the reliability of those APIs. According to our research at NGINX, you need to be able to process an API call in 30ms or less in order to deliver real-time experiences.

Here are four case studies of companies that created an API structure capable of delivering the real-time speed needed to win business in this highly competitive landscape.

Automate API management at scale for microservices: Greenfield online bank
Objective: Deliver real-time performance and reliability at scale to customers accessing applications from non-smart mobile devices.

Problem: Microservices-based applications delivering poor performance and struggling with added latency when processing internal API calls for transactions.
Key takeaway: External services can consist of multiple, internal API calls among microservices; slow-performing API infrastructure can degrade service levels.

Background and challenge
This green-field bank was launched in 2016 with the goal of providing banking services to poor and rural areas in South Africa. In order to compete with the incumbent banks, they decided to build a modern digital banking application from the ground up using containers and microservices to provide digital banking services to unbanked or underbanked households.

They built a distributed architecture based on this microservices reference architecture. This provided the bank with the flexibility needed to deliver new services quickly enough to match the expectations of today’s digital consumers, while limiting downtime. The bank chose this reference architecture as it was prescriptive and gave them a roadmap to start small – initially handing ingress traffic to a modest Kubernetes cluster – and grow to hundreds or even thousands of microservices connected via a service mesh.

However, pivoting to a microservices architecture introduced a lot of complexity around both API scalability and increased inter-service communication (east-west traffic), which developers needed to allow microservices to communicate with each other.

This posed a significant challenge to the bank as most of their customers were not accessing their services from smart devices but using non-smart mobile phones. As a result, the bank was not able to rely on customer computing power. Instead, a simple mobile phone would communicate with the bank via SMS, which would kick off a series of internal API calls to process the transaction and deliver the result back to the customer via SMS.

How real-time API management helped
Unreliable or slow performance can directly impact or even prevent the adoption of new digital services, making it difficult for a business to maximize the potential of new products and expand its offerings. Thus, it is not only crucial that an API processes calls at acceptable speeds, but it is equally important to have an API infrastructure in place that is able to route traffic to resources correctly, authenticate users, secure APIs, prioritize calls, provide proper bandwidth, and cache API responses.

Most traditional APIM solutions were made to handle traffic between servers in the data center and the client applications accessing those APIs externally (north-south traffic). They also need constant connectivity between the control plane and data plane, which requires using third-party modules, scripts, and local databases. Processing a single request creates significant overhead — and it only gets more complex when dealing with the east-west traffic associated with a distributed application.

Considering that a single transaction or request could require multiple internal API calls, the bank found it extremely difficult to deliver good user experiences to their customers. For example, the bank’s existing API management solution was adding anywhere from 100 to 500ms of transaction latency to each call.

The bank had also established CI/CD tooling and frameworks to automate microservices development and deployment. They had a Site Reliability Engineering (SRE) team tasked with overseeing the entire system and needed a solution that could integrate easily with their CI/CD pipeline infrastructure.

Deploying an API management solution that decouples the data and control plane introduced less latency and offered high-performance API traffic mediation for both external service and inter-service communication. Runtime connectivity to the control plane was no longer needed for the data plane to process and route calls, minimizing complexity.
As a result, the bank was able to process API calls up to three times faster. In addition, the new API management solution integrated directly with the bank’s existing microservices deployment and CI/CD pipeline, as well as offering monitoring for SRE teams to help eliminate human error and downtime, and reduce operational complexity.

Connect microservices API traffic - Leading Asian-Pacific telecommunications provider
Objective: Process 600 million API calls per month across at least 800 different internal APIs.

Problem: The existing API infrastructure was too expensive and slow to use for internal API traffic, resulting in performance degradation.

Key takeaway: Processing internal API calls within the corporate firewall instead of routing it through a cloud outside the corporate network improves performance significantly.

Background and challenge
A leading telecom organization from the Asia-Pacific region embraced an API-first development philosophy as part of their digital transformation initiative. This drove an explosion of internal APIs needed to process 600 million API calls per month across 800 different internal APIs. The current infrastructure was not well-equipped to process this new API traffic, causing performance degradation and forcing developer and DevOps teams to invest in non-standard solutions.

The telecom company was already using a solution for API management. However, it was better suited for handling north-south traffic from external APIs rather than east-west traffic between internal services. The company decided to seek a new solution that would allow them to reduce latency, while also providing capabilities to empower their DevOps teams with self-service API management.

How real-time API management helped
The telecom organization’s existing API management solution relied on deployment models where the API management and gateways were hosted in the public cloud, meaning that traffic must be looped out to the cloud first for every interaction. In this case, sending traffic out to the public cloud was not only costly, it was much slower — adding several seconds of latency.

The telecommunications organization opted to segment external and internal API management, implementing a higher-performing API management solution to manage and deploy internal APIs that logically sat “behind” the existing deployment on the enterprise perimeter. This allowed them to process internal API calls within the corporate network, rather than being forced to route them out into the public cloud, resulting in a 70% reduction in latency with API calls being processed at 20ms or less.

DevOps teams were also able to easily create, publish, and monitor APIs, helping to increase application and microservices release velocity by integrating API management tasks directly into their CI/CD pipeline. By automating routine tasks using APIs, such as API definition and gateway configuration, the organization was able to achieve significant savings in time and effort.

Process large volume of credit card transactions in real-time - Large US-based credit card company
Objective: Process billions of API calls in real-time — sub 70ms latency.

Problem: The existing API management solution added 500ms of latency for every API call, resulting in direct revenue loss for the company.

Key takeaway: Performance trumps feature richness in API management solutions when revenue is impacted.

Background and challenge
A leading credit card company was struggling with a transaction latency problem. When paying with a credit card, most point-of-sale (POS) systems will time out after a set limit expires.
Cards will need to be run through again, but transactions will automatically fail to avoid any duplicate transactions being charged to the card.

At the same time, the organization was moving to Open Banking standards, which provide API specifications that enable sharing of customer-permissioned data and analytics with third-party developers and firms to build applications and services—increasing the volume of API calls into the hundreds of billions.

As a result, the company started looking for a solution that would be able to manage and scale to handle billions of API calls as fast as possible — the goal was set to sub-70ms latency per API call. This was deemed the threshold before customer experience would be impacted for POS transactions.

How real-time API management helped
The company’s existing API solution was adding 500ms of latency for every API call, causing a nominal amount of transactions to fail. But even a small fraction of billions of transactions is a significant number, especially when they result in a loss of revenue.

It’s common for customers to try paying again with another credit card when a transaction fails. If the second card they use is not issued by the same company, it could potentially mean millions in lost dollars—all due to timed-out API calls.

The company pioneered a real-time API reference architecture to ensure API calls were processed in as close to real-time as possible. For example, they deployed clusters of two or more high availability API gateways to improve the reliability and resiliency of their APIs. The company also chose to enable dynamic authentication by pre-provisioning authentication information (using API keys and JSON Web Tokens, or JWT), which made authentication almost instantaneous. In addition, they chose to delegate authorization to the business-logic layer of their backend so that their API gateways were only responsible for handling authentication, resulting in faster response times to calls.
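To see why pre-provisioned keys make that check so cheap, here is a minimal, illustrative sketch in Go of verifying an HS256-signed JWT locally at a gateway; the secret, token, and layout are hypothetical stand-ins and not taken from the company’s actual system:

package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// verifySignature checks an HS256-signed JWT against a key that was
// provisioned ahead of time, so no call to an identity provider is needed
// on the request path. Token format and key are illustrative only.
func verifySignature(token string, key []byte) bool {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return false
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(parts[0] + "." + parts[1]))
	expected := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(expected), []byte(parts[2]))
}

func main() {
	key := []byte("pre-provisioned-secret") // hypothetical shared secret
	token := "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJjYXJkaG9sZGVyIn0.signature"
	fmt.Println("signature valid:", verifySignature(token, key))
}

Because the key is already present at the gateway, the check is a local HMAC computation rather than a network round trip to an identity provider.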
By following these best practices, the company was able to achieve response times that were consistently less than 10ms, exceeding the performance requirements by 85%. This delivered tangible, direct savings—helping to not only recover lost transactions but enabling them to process even more transactions than before. The company concluded that these performance gains were more critical to their business outcomes than some of the additional features the incumbent solution had, which included a richer developer portal, API design tools, and API transformation features.

The improved transaction latency and reliability had the added benefit of helping to create and solidify new revenue streams. As part of their open banking efforts, the company is now able to expose their core transactional engine to ISVs and developers and win more business as a result of the speed and reliability advantages they can demonstrate over their competitors.

Securely process billions of transactions using a single API management solution – US-based financial services company
Objective: A lightweight, cost-effective solution that can process and route API calls for REST APIs, SOAP APIs, and externally accessed services, and that meets strict federal financial compliance requirements.

Problem: The company had three existing solutions to manage each of the different types of traffic, which required configuring specific gateways multiple times in order to make changes and quickly achieve scale to serve the needs of internal and external consumers.

Key takeaway: High API transaction throughput (calls per second) is essential for rapid adoption.

Background and challenge
Recognizing that the future of financial services would rely heavily on software development, the financial services company began to focus on transforming into a technology company in 2014 by investing in RESTful APIs. This involved placing heavy emphasis on engineering and development, empowering developers to create software that was easy to consume in order to deliver great applications to end users.

In addition to their own offerings that served millions of customers, the company also created an internal development exchange platform, which was eventually made available externally to allow them to integrate with business partners and third parties through APIs. There was extreme pressure to be able to deliver performant APIs and an infrastructure that could support a high volume of transactions every day.

Over the years, the company had also acquired a lot of technical debt, including a legacy service bus and appliances to handle SOAP/XML APIs. As a result, the company was using at least three different solutions to manage API traffic, which made it hard to adapt and respond quickly to changing environments and meet market demands. Teams were forced to constantly configure specific gateways — making updates three times rather than once every time a change was needed.

The company wanted to consolidate, but they were also looking for a solution that would allow them to scale and handle billions of API calls while protecting the high-speed developer exchange platform that was the main core of their business. To meet their scale, flexibility, and performance requirements, the company decided it couldn’t rely solely on a packaged solution or service. They would need a combination of API infrastructure software and custom-developed API tooling.

How real-time API management helped
The goal for the company, like any modern technology-focused business, was resiliency, high speed, and low overhead for internal customers that want to avoid impacting functionality. However, handling hundreds of thousands of concurrent API calls without performance degradation can be tricky using traditional application design, which relies on a process-per-connection model to handle requests.

Using open source software (OSS) as the main foundation to support their API gateway infrastructure, the company was able to reduce context switching and the load on resources. This allowed them to leverage an asynchronous, event-driven approach that allows multiple requests to be handled by a single worker process — in other words, they were able to scale to support hundreds of thousands of concurrent connections with very little additional overhead.

The company also struggled to manage and configure multiple disparate solutions, which not only consumed resources but also caused a lot of downtime as they had to be restarted during upgrades. The developer and DevOps teams then highly customized the open source beyond its original capability by adding Lua-based modules and scripts, enabling them to standardize on this OSS gateway to handle all types of traffic, helping to eliminate sprawl and complexity. In addition, the company was able to upgrade without downtime or service interruption by starting a new set of worker processes when a new configuration is detected, while continuing to route live traffic to the old processes containing the previous configuration. Once the new configuration is tested and ready, the new gateway immediately starts accepting connections and processing traffic based on the new settings.
The company is now able to handle around 360 billion API calls per month — or about 12 billion calls in a single day, with peak traffic of 2 million API calls per second.

Real-time APIs require a real-time API solution
The case studies above serve to demonstrate some of the most common ways we are helping organizations develop their API programs. We would love to hear from you about your real-time API needs and experiences with delivering APIs in real-time. What API management challenges are you facing? How are you dealing with scaling your API infrastructure to meet your modernization needs?

Please leave comments below or, better yet, use our open source API assessment tool to measure your API performance. Learn more on GitHub.

TL;DR

• API calls now make up 83% of all web traffic. Competitive advantage is no longer won by simply having APIs; the key to gaining ground is based on the performance and the reliability of those APIs.

• Modern systems can consist of multiple internal API calls among microservices; slow-performing API infrastructure can degrade service levels throughout an organisation’s systems.

• Processing internal API calls within the corporate firewall instead of routing it through a cloud outside the corporate network can improve performance significantly.

• Performance trumps feature richness in API management solutions when revenue is correlated with speed. Teams should recognize this and design system applications and infrastructure accordingly.

• Consolidating traffic management solutions within a system can improve the capability to react to changing requirements, rapidly update configuration, and benchmark overall performance.
SPONSORED ARTICLE



A Reference Architecture for Real-Time APIs

As companies seek to compete in the digital era, APIs become a critical IT and business resource. Archi-
tecting the right underlying infrastructure ensures not only that your APIs are stable and secure, but also
that they qualify as real‑time APIs, able to process API calls end-to-end within 30 milliseconds (ms).

API architectures are broadly broken up into two components: the data plane, or API gateway, and the
control plane, which includes policy and developer portal servers. A real‑time API architecture depends
mostly on the API gateway, which acts as a proxy to process API traffic. It’s the critical link in the perfor-
mance chain.

API gateways perform a variety of functions including authenticating API calls, routing requests to the
right backends, applying rate limits to prevent overburdening your systems, and handling errors and ex-
ceptions. Once you have decided to implement real‑time APIs, what are the key characteristics of the API
gateway architecture? How do you deploy API gateways?
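As a rough illustration of those data-plane responsibilities, here is a minimal sketch of a gateway-style proxy in Go; the upstream address, API key, and rate-limit values are placeholders, and this is only a sketch of the concepts, not the implementation of the reference architecture described here:

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"

	"golang.org/x/time/rate"
)

func main() {
	// Hypothetical upstream service that the gateway fronts.
	backend, err := url.Parse("http://localhost:9000")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(backend)

	// Illustrative rate limit: 1000 requests/second with a burst of 100.
	limiter := rate.NewLimiter(rate.Limit(1000), 100)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Authenticate the call (pre-shared API key, purely illustrative).
		if r.Header.Get("X-API-Key") != "expected-key" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		// Apply the rate limit to avoid overburdening the backend.
		if !limiter.Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		// Route the request to the backend.
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", handler))
}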

This article addresses these questions, providing a real‑time API reference architecture based on our
work with NGINX’s largest, most demanding customers. We encompass all aspects of the API manage-
ment solution but go deeper on the API gateway which is responsible for ensuring real‑time performance
thresholds are met.

The Real-Time API Reference Architecture


Our reference architecture for real‑time APIs has six components:

• API gateway. A fast, lightweight data‑plane component that processes API traffic. This is the most
critical component in the real‑time architecture.

• API policy server. A decoupled server that configures API gateways, as well as supplying API lifecycle
management policies.

• API developer portal. A decoupled web server that provides documentation for rapid onboarding for
developers who use the API.

• API security service. A separate web application firewall (WAF) and fraud detection component which
provides security beyond the basic security mechanisms built into the API gateway.

• API identity service. A separate service that sets authentication and authorization policies for identity
and access management and integrates with the API gateway and policy servers.

• DevOps tooling. A separate set of tools to integrate API management into CI/CD and developer
pipelines.

Please read the full-length version of this article here.

Real-Time APIs in the Context of Apache Kafka
by Robin Moffatt, Senior Developer Advocate at Confluent

One of the challenges that we have always faced in building applications, and systems as a whole, is how to exchange information between them efficiently whilst retaining the flexibility to modify the interfaces without undue impact elsewhere. The more specific and streamlined an interface, the greater the likelihood that it is so bespoke that changing it would require a complete rewrite. The inverse also holds; generic integration patterns may be adaptable and widely supported, but at the cost of performance.

Events offer a Goldilocks-style approach in which real-time APIs can be used as the foundation for applications that is flexible yet performant; loosely-coupled yet efficient.

Events can be considered as the building blocks of most other data structures. Generally speaking, they record the fact that something has happened and the point in time at which it occurred. An event can capture this information at various levels of detail: from a simple notification to a rich event describing the full state of what has happened.

From events, we can aggregate up to create state—the kind of state that we know and love from its place in RDBMS and NoSQL stores. As well as being the basis for state, events can also be used to asynchronously trigger actions elsewhere when something happens - the whole basis for event-driven architectures. In this way, we can build consumers to match our requirements—both stateless, and stateful with fault tolerance. Producers can opt to maintain state, but are not required to, since consumers can rebuild this themselves from the events that are received.

If you think about the business domain in which you work, you can probably think of many examples of events. They can be human-generated interactions, and they can be machine-generated. They may contain a rich payload, or they may be in essence a notification alone. For example:

• Event: userLogin
  – Payload: zbeeblebrox logged in at 2020-08-17 16:26:39 BST
• Event: CarParked
  – Payload: Car registration A42 XYZ parked at 2020-08-17 16:36:27 in space X42

• Event: orderPlaced
  – Payload: Robin ordered four tins of baked beans costing a total of £2.25 at 2020-08-17 16:35:41 BST

These events can be used to directly trigger actions elsewhere (such as a service that processes orders in response to them being placed), and they can also be used in aggregate to provide information (such as the current number of occupied spaces in a car park and thus which car parks have availability).

So, if events are the bedrock on which we are going to build our applications and services, we need a technology that supports us in the best way to do this—and this is where Apache Kafka® comes in. Kafka is a scalable event streaming platform that provides:

• Pub/Sub
  – To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.

• Storage
  – To store streams of events durably and reliably for as long as you want.

• Stateful stream processing
  – To process streams of events as they occur or retrospectively.

Kafka is built around the concept of the log. By taking this simple but powerful concept of a distributed, immutable, append-only log, we can capture and store the events that occur in our businesses and systems in a scalable and efficient way. These events can be made available to multiple users on a subscription basis, as well as processed and aggregated further, either for direct use or for storage in other systems such as RDBMS, data lakes, and NoSQL stores.

In the remainder of this article, I will explore the APIs available in Apache Kafka and demonstrate how it can be used in the systems and applications that you are building.

The Producer and Consumer APIs
The great thing about a system like Kafka is that producers and consumers are decoupled, meaning, amongst other things, that we can produce data without needing a consumer in place first (and because of the decoupling we can do so at scale). An event happens, we send it to Kafka—simple as that. All we need to know is the details of the Kafka cluster, and the topic (a way of organising data in Kafka, kind of like tables in an RDBMS) to which we want to send the event.
There are clients available for Kafka in many different languages. Here’s an example of producing an event to Kafka using Go:

package main

import (
	"gopkg.in/confluentinc/confluent-kafka-go.v1/kafka"
)

func main() {

	topic := "test_topic"
	p, _ := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092"})
	defer p.Close()

	p.Produce(&kafka.Message{
		TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: 0},
		Value:          []byte("Hello world")}, nil)
}

Because Kafka stores events durably, it means that they are available as and when we want to consume them, until such time that we age them out (which is configurable per topic).

Having written the event to the Kafka topic, it’s now available for one or more consumers to read. Consumers can behave in a traditional pub/sub manner and receive new events as they arrive, as well as opt to arbitrarily re-consume events from a previous point in time as required by the application. This replay functionality of Kafka, thanks to its durable and scalable storage layer, is a huge advantage for many important use cases in practice such as machine learning and A/B testing, where both historical and live data are needed. It’s also a hard requirement in regulated industries, where data must be retained for many years to meet legal compliance. Traditional messaging systems like RabbitMQ and ActiveMQ cannot support such requirements.

Here is an example of consuming events from Kafka, again using Go:

package main

import (
	"fmt"

	"gopkg.in/confluentinc/confluent-kafka-go.v1/kafka"
)

func main() {

	topic := "test_topic"

	cm := kafka.ConfigMap{
		"bootstrap.servers":        "localhost:9092",
		"go.events.channel.enable": true,
		"group.id":                 "rmoff_01"}

	c, _ := kafka.NewConsumer(&cm)
	defer c.Close()
	c.Subscribe(topic, nil)

	for {
		select {
		case ev := <-c.Events():
			switch ev.(type) {
			case *kafka.Message:
				km := ev.(*kafka.Message)
				fmt.Printf("✅ Message '%v' received from topic '%v'\n",
					string(km.Value), string(*km.TopicPartition.Topic))
			}
		}
	}
}

When a consumer connects to Kafka, it provides a Consumer Group identifier. The consumer group concept enables two pieces of functionality. The first is that Kafka keeps track of the point in the topic to which the consumer has read events, so that when the consumer reconnects it can continue reading from the point that it got to before. The second is that the consuming application may want to scale its reads across multiple instances of itself, forming a Consumer Group that allows for processing of your data in parallel. Kafka will then allocate events to each consumer within the group based on the topic partitions available, and it will actively manage the group should members subsequently leave or join (e.g., in case one consumer instance crashed).
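As a concrete sketch of that scaling, the following code (using the same Go client as the examples above) starts several consumer instances that share one group; the group id, topic name, and instance count are illustrative only:

package main

import (
	"fmt"

	"gopkg.in/confluentinc/confluent-kafka-go.v1/kafka"
)

// Each instance joins the same consumer group, so Kafka divides the topic's
// partitions between them and rebalances if an instance joins or leaves.
func runConsumer(id int, topic string) {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092",
		"group.id":          "carpark_app", // same group for every instance
		"auto.offset.reset": "earliest",
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()
	c.Subscribe(topic, nil)

	for {
		msg, err := c.ReadMessage(-1)
		if err != nil {
			continue
		}
		fmt.Printf("instance %d got %s from partition %d\n",
			id, string(msg.Value), msg.TopicPartition.Partition)
	}
}

func main() {
	for i := 0; i < 3; i++ {
		go runConsumer(i, "carpark")
	}
	select {} // block forever
}

Because all three instances share the same group.id, Kafka splits the topic’s partitions between them and rebalances automatically if one of them stops.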
This means that multiple services can use the same data, without any interdependency between them. The same data can also be routed to datastores elsewhere using the Kafka Connect API, which is discussed below.

The Producer and Consumer APIs are available in libraries for Java, C/C++, Go, Python, Node.js, and many more. But what if your application wants to use HTTP instead of the native Kafka protocol? For this, there is a REST Proxy.

Using a REST API with Apache Kafka
Let’s say we’re writing an application for a device for a smart car park. A payload for the event recording the fact that a car has just occupied a space might look like this:

{
  "name": "NCP Sheffield",
  "space": "A42",
  "occupied": true
}

We could put this event on a Kafka topic, which would also record the time of the event as part of the event’s metadata. Producing data to Kafka using the Confluent REST Proxy is a straightforward REST call:

curl -X POST \
     -H "Content-Type: application/vnd.kafka.json.v2+json" \
     -H "Accept: application/vnd.kafka.v2+json" \
     --data '{"records":[{"value":{ "name": "NCP Sheffield", "space": "A42", "occupied": true }}]}' \
     "http://localhost:8082/topics/carpark"

Any application can consume from this topic, using the native Consumer API that we saw above, or by using a REST call. Just like with the native Consumer API, consumers using the REST API are also members of a Consumer Group, which is termed a subscription. Thus with the REST API you have to declare both your consumer and subscription first:

curl -X POST -H "Content-Type: application/vnd.kafka.v2+json" \
     --data '{"name": "rmoff_consumer", "format": "json", "auto.offset.reset": "earliest"}' \
     http://localhost:8082/consumers/rmoff_consumer

curl -X POST -H "Content-Type: application/vnd.kafka.v2+json" --data '{"topics":["carpark"]}' \
     http://localhost:8082/consumers/rmoff_consumer/instances/rmoff_consumer/subscription

Having done this you can then read the events:

curl -X GET -H "Accept: application/vnd.kafka.json.v2+json" \
     http://localhost:8082/consumers/rmoff_consumer/instances/rmoff_consumer/records
[
  {
    "topic": "carpark",
    "key": null,
    "value": {
      "name": "Sheffield NCP",
      "space": "A42",
      "occupied": true
    },
    "partition": 0,
    "offset": 0
  }
]

If there are multiple events to receive, then you’ll get them within a batch per call, and if your client wants to check for new events, they will need to make the REST call again.
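As a minimal sketch of that polling loop, assuming the consumer instance created above, a Go client could simply repeat the GET on a timer (the interval here is arbitrary):

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// The consumer instance's records URL, as created with the REST calls above.
	url := "http://localhost:8082/consumers/rmoff_consumer/instances/rmoff_consumer/records"

	for {
		req, _ := http.NewRequest("GET", url, nil)
		req.Header.Set("Accept", "application/vnd.kafka.json.v2+json")

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			fmt.Println("request failed:", err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			fmt.Println(string(body)) // a batch of zero or more events
		}

		time.Sleep(5 * time.Second) // poll again for new events
	}
}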
We’ve seen how we can get data in and out of Kafka topics. But a lot of the time we want to do more than just straightforward pub/sub. We want to take a stream of events and look at the bigger picture—of all the cars coming and going, how many spaces are there free right now? Or perhaps we’d like to be able to subscribe to a stream of updates for a particular car park only?

Conditional Notifications, Stream Processing, and Materialised Views
To think of Apache Kafka as pub/sub alone is to think of an iPhone as just a device for making and receiving phone calls. I mean, it’s not wrong to describe that as one of its capabilities…but it does so much more than just that. Apache Kafka includes stream processing capabilities through the Kafka Streams API. This is a feature-rich Java client library for doing stateful stream processing on data in Kafka at scale and across multiple machines. Widely used at companies such as Walmart, Ticketmaster, and Bloomberg, Kafka Streams also provides the foundations for ksqlDB.

ksqlDB is an event streaming database purpose-built for stream processing applications. It provides a SQL-based API for querying and processing data in Kafka. ksqlDB’s many features include filtering, transforming, and joining data from streams and tables in real-time, creating materialised views by aggregating events, and much more.

To work with the data in ksqlDB we first need to declare a schema:

CREATE STREAM CARPARK_EVENTS (NAME     VARCHAR,
                              SPACE    VARCHAR,
                              OCCUPIED BOOLEAN)
                        WITH (KAFKA_TOPIC='carpark',
                              VALUE_FORMAT='JSON');

ksqlDB is deployed as a clustered application, and this initial declaration work can be done at startup, or directly by the client, as required. With this done, any client can now subscribe to a stream of changes from the original topic but with a filter applied. For example, to get a notification when a space is released at a particular car park they could run:

SELECT TIMESTAMPTOSTRING(ROWTIME,'yyyy-MM-dd HH:mm:ss') AS EVENT_TS,
       SPACE
  FROM CARPARK_EVENTS
 WHERE NAME='Sheffield NCP'
   AND OCCUPIED=false
  EMIT CHANGES;

Unlike the SQL queries that you may be used to, this query is a continuous query (denoted by the EMIT CHANGES clause). Continuous queries, known as push queries, will continue to return any new matches to the predicate as the events occur, now and in the future, until they are terminated. ksqlDB also supports pull queries (which we explore below), and these behave just like a query against a regular RDBMS, returning values for a lookup at a point in time. ksqlDB thus supports both the worlds of streaming and static state, which in practice most applications will also need to do based on the actions being performed.

ksqlDB includes a comprehensive REST API, the call against which for the above SQL would look like this using curl:

curl --http2 'http://localhost:8088/query-stream' \
     --data-raw '{"sql":"SELECT TIMESTAMPTOSTRING(ROWTIME,'\''yyyy-MM-dd HH:mm:ss'\'') AS EVENT_TS, SPACE FROM CARPARK_EVENTS WHERE NAME='\''Sheffield NCP'\'' and OCCUPIED=false EMIT CHANGES;"}'

This call results in a streaming response from the server, with a header and then when any matching events from the source topic are received these are sent to the client:
{"queryId":"383894a7-05ee-4ec8-bb3b-c5ad39811539","columnNames":["EVENT_TS","SPACE"],"columnTypes":["STRING","STRING"]}

["2020-08-05 16:02:33","A42"]

["2020-08-05 16:07:31","D72"]

We can use ksqlDB to define and populate new streams of data too. By prepending a SELECT statement
with CREATE STREAM streamname AS we can route the output of the continuous query to a Kafka topic.
Thus we can use ksqlDB to perform transformations, joins, filtering, and more on the events that we send
to Kafka. ksqlDB supports the concept of a table as a first-class object type, and we could use this to
enrich the car park events that we’re receiving with information about the car park itself:

CREATE STREAM CARPARKS AS


SELECT E.NAME AS NAME, E.SPACE,
R.LOCATION, R.CAPACITY,
E.OCCUPIED,
CASE
WHEN OCCUPIED=TRUE THEN 1
ELSE -1
END AS OCCUPIED_IND
FROM CARPARK_EVENTS E
INNER JOIN
CARPARK_REFERENCE R
ON E.NAME = R.NAME;

You’ll notice we’ve also used a CASE statement to apply logic to the data enabling us to create a running
count of available spaces. The above CREATE STREAM populates a Kafka topic that looks like this:

+----------------+-------+----------+----------------------------------+----------+--------------+
|NAME            |SPACE  |OCCUPIED  |LOCATION                          |CAPACITY  |OCCUPIED_IND  |
+----------------+-------+----------+----------------------------------+----------+--------------+
|Sheffield NCP   |E48    |true      |{LAT=53.4265964, LON=-1.8426386}  |1000      |1             |

Finally, let’s see how we can create a stateful aggregation in ksqlDB and query it from a client. To create
the materialised view, you run SQL that includes aggregate functions:

CREATE TABLE CARPARK_SPACES AS


SELECT NAME,
SUM(OCCUPIED_IND) AS OCCUPIED_SPACES
FROM CARPARKS
GROUP BY NAME;

This state is maintained across the distributed ksqlDB nodes and can be queried directly using the REST API:

curl --http2 'http://localhost:8088/query-stream' \
     --data-raw '{"sql":"SELECT OCCUPIED_SPACES FROM CARPARK_SPACES WHERE NAME='\''Birmingham NCP'\'';"}'

Unlike the streaming response that we saw above, queries against the state (known as “pull queries”, as
opposed to “push queries”) return immediately and then exit:

{"queryId":null,"columnNames":["OCCUPIED_SPACES"],"columnTypes":["INTEGER"]}
[30]

If the application wants to get the latest figure, it can reissue the query, and the value may or may not have changed:

curl --http2 'http://localhost:8088/query-stream' \
     --data-raw '{"sql":"SELECT OCCUPIED_SPACES FROM CARPARK_SPACES WHERE NAME='\''Birmingham NCP'\'';"}'
{"queryId":null,"columnNames":["OCCUPIED_SPACES"],"columnTypes":["INTEGER"]}
[29]

There is also a Java client for ksqlDB, and community-authored Python and Go clients.
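If you would rather call the REST API directly from Go without a client library, a minimal sketch might look like the following; it assumes a local ksqlDB server and uses the /query endpoint, which works over plain HTTP/1.1:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Pull query against the materialised view built above; the payload
	// follows the ksqlDB REST API's /query form.
	payload := []byte(`{"ksql":"SELECT OCCUPIED_SPACES FROM CARPARK_SPACES WHERE NAME='Birmingham NCP';","streamsProperties":{}}`)

	resp, err := http.Post("http://localhost:8088/query",
		"application/vnd.ksql.v1+json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // the current value of the materialised view
}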

Integration with other systems


One of the benefits of using Apache Kafka as a highly-scalable and persistent broker for your asynchro-
nous messaging is that the same data you exchange between your applications can also drive stream
processing (as we saw above) and also be fed directly into dependent systems.

Continuing from the example of an application that sends an event every time a car parks or leaves a
space, it’s likely that we’ll want to use this information elsewhere, such as:

• analytics to look at parking behaviours and trends

• machine learning to predict capacity requirements

• data feeds to third-party vendors

Using Apache Kafka’s Connect API you can define streaming integrations with systems both in and out of
Kafka. For example, to stream the data from Kafka to S3 in real-time you could run:

curl -i -X PUT -H "Accept:application/json" \
     -H "Content-Type:application/json" http://localhost:8083/connectors/sink-s3/config \
     -d ' {
          "connector.class": "io.confluent.connect.s3.S3SinkConnector",
          "topics": "carpark",
          "s3.bucket.name": "rmoff-carparks",
          "s3.region": "us-west-2",
          "flush.size": "1024",
          "storage.class": "io.confluent.connect.s3.storage.S3Storage",
          "format.class": "io.confluent.connect.s3.format.json.JsonFormat"
}'

Now the same data that is driving the notifications to your application, and building the state that your application can query directly, is also streaming to S3. Each use is decoupled from the other. If we subsequently want to stream the same data to another target such as Snowflake, we just add another Kafka Connect configuration; the other consumers are entirely unaffected. Kafka Connect can also stream data into Kafka. For example, the CARPARK_REFERENCE table that we use in ksqlDB above could be streamed using change data capture (CDC) from a database that acts as the system of record for this data.

Conclusion
Apache Kafka offers a scalable event streaming platform with which you can build applications around the powerful concept of events. By using events as the basis for connecting your applications and services, you benefit in many ways, including loose coupling, service autonomy, elasticity, flexible evolvability, and resilience.

You can use the APIs of Kafka and its surrounding ecosystem, including ksqlDB, for both subscription-based consumption as well as key/value lookups against materialised views, without the need for additional data stores. The APIs are available as native clients as well as over REST.

To learn more about Apache Kafka visit developer.confluent.io. Confluent Platform is a distribution of Apache Kafka that includes all the components discussed in this article. It’s available on-premises or as a managed service called Confluent Cloud. You can find the code samples for this article and a Docker Compose to run it yourself on GitHub. If you would like to learn more about building event-driven systems around Kafka, then be sure to read Ben Stopford’s excellent book Designing Event-Driven Systems.
Curious about previous issues?

• The InfoQ eMag / Issue #83 / March 2020: Service Mesh Ultimate Guide. This eMag aims to answer pertinent questions for software architects and technical leaders, such as: what is a service mesh?, do I need a service mesh?, and how do I evaluate the different service mesh offerings?

• The InfoQ eMag / Issue #77 / October 2019: Taming Complex Systems in Production. To tame complexity and its effects, organizations need a structured, multi-pronged, human-focused approach that makes operations work sustainable, centers decisions around customer experience, uses continuous testing, and includes chaos engineering and system observability. In this eMag, we cover all of these topics to help you tame the complexity in your system.

• The InfoQ eMag / Issue #81 / January 2020: Microservices: Testing, Observing, and Understanding. This eMag takes a deep dive into the techniques and culture changes required to successfully test, observe, and understand microservices.