Real-Time Machine Learning v1 MEAP
1 Introduction to real-time machine learning
Real-time machine learning (sometimes referred to as online learning) is an approach which uses
real-time data to build predictive systems that adapt to changes in an environment. This is different
from batch (or offline) machine learning, in which historical datasets are carefully
curated for training and evaluation. The fundamental assumption in offline learning is that there is
some ground truth in the input features that remains stable while models are in production. In
reality, the statistical properties of the data, such as probability distributions or relationships
between features, are likely to change over time. This shift, known as data drift, can reduce a
machine learning model’s accuracy because it was originally trained on data with different statistical
characteristics. Offline models must be retrained routinely to avoid this degradation in accuracy.
However, retraining these models is often both expensive and time-consuming since it requires
iterating over large datasets many times. By the time these models are deployed to production they
may be operating on data assumptions that are no longer true. In other words, offline models
cannot adapt to the data changes that occur in real-world environments.
Real-time machine learning models are designed to learn on-the-fly by continuously receiving
data from the environment. Instead of being trained on large datasets, real-time machine learning
models are trained in an incremental fashion as new data becomes available. The idea of online
learning is not new, but it has increasing relevance for modern, data-driven organizations. As the
world becomes more real-time, applications such as fraud detection and user recommendations
will need to incorporate more real-time learning techniques to keep up with quickly changing data
trends.
The goal of this book is to give you the reasoning and tools to start building practical real-time
machine learning systems. We will accomplish this by building a few proof-of-concept online
learning systems from scratch using Python. You will be able to run all the examples in this book on
a single machine.
In this chapter, we will explore what “real-time” data means with respect to machine learning,
compare offline and online machine learning, and introduce some of the most common use cases
for online learning.
An important corollary to this definition is that all data originates as real-time data. As soon as it
is persisted to any form of storage for later retrieval, such as a file system, data lake, or database,
it ceases to be real-time. In this book we refer to this as data at rest or historical data. The
major difference between real-time and historical data is that real-time data has some application
value related to its timeliness.
Consider the real-time application outlined by the following figure. Emails are received by an
email server and used to train an online spam classification model. The online model determines if
emails are spam or not spam in real-time. Spam emails are immediately discarded, and non-spam
emails are persisted to a database. The email client has the ability to process incoming emails as
notifications and retrieve past emails from the database. Both historical and real-time data streams
are present in this application. Incoming emails are treated as real-time data while they are being
processed until they are persisted to the database. When emails are loaded from the database they
are considered historical pieces of data.
Figure 1.1 An email notification system with real-time and historical data streams
In short, historical data is used for offline learning approaches where all the data is known at once.
Real-time data is used for online learning approaches where learning happens incrementally as soon
as new data is available. These terms also describe where the learning happens. Offline learning
happens in development or staging environments. Online learning happens directly in production.
In the next sections we will discuss these two approaches.
Figure 1.2 Training a model offline and using it for predictions in production
Offline learning starts with a data ingestion phase. Data is loaded from one or more data sources
and serialized into a format for persistent storage. Data ingested for offline learning is often
unstructured, meaning that there is not a pre-defined schema. Text is a common form of
unstructured data because it is widely accessible from the internet and proprietary documents and
is easily extractable from file formats like PDF and HTML. Preprocessing steps are usually applied to
transform the data into a more normalized, numerical format for machine learning. For example,
converting text to vector embeddings or parsing numerical or categorical values from strings.
Data ingestion is followed by a feature engineering phase where the goal is to discover the set of
features that correlate with the output labels. Feature engineering is an iterative process and is
really more of an art than a science. It typically involves both statistical and visual analysis and
intuition about the problem to identify features with predictive power.
In the training phase the dataset is split into distinct training, test, and validation sets. These sets
have the same data schema to ensure that the machine learning model can be used for both
training and inference. The number of data points required to fit a model depends on its
complexity. For example, linear models attempt to find a linear relationship between the features
and labels so the search space is directly dependent on the number of features. More complex
models such as artificial neural networks can have a much larger number of trainable parameters.
Increasing the number of features can improve the predictability of the model but it also increases
the number of data points required to sufficiently describe the input space. This is colloquially
known as the curse of dimensionality.
Machine learning is framed as an optimization problem where the goal is to find the optimal set
of parameters p that minimize the loss against the training set. The predicted outputs can be
computed from any parameter set. This means that a loss function L can be defined in terms of p
that represents the total training loss. This is also referred to as the objective function because it sets
the objective for the problem: to minimize the loss. For example, squared error loss is quantified by
adding up the squares of the individual prediction errors as shown in Formula 1.1.
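In symbols, with m data points, y_i the true label of data point i, and ŷ_i(p) the model's prediction under parameter set p, this can be written as

L(p) = \sum_{i=1}^{m} \left( \hat{y}_i(p) - y_i \right)^2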
Formula 1.1 Squared error loss for a parameter set p and number of data points m.
Depending on the complexity of the model and the number of input features, models can have
anywhere from a few parameters to hundreds or more that can be adjusted to minimize the loss. With models that
have more than a few parameters it becomes important to be able to efficiently search the
parameter space. A common method is gradient descent, which computes the vector of partial derivatives of the loss (the
gradient) with respect to the trainable parameters and adjusts the parameters in the
opposite direction of that vector. At each training step the parameters are updated using the
following formula.
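In its most common form, with η denoting the learning rate and ∇_p L(p) the gradient of the loss with respect to the parameters, the update is

p \leftarrow p - \eta \, \nabla_{p} L(p)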
The optimization method is a significant choice because it can affect how quickly the model will
converge to an optimal solution or if a solution will be found. Modern machine learning libraries like
scikit-learn provide useful abstractions for training various classes of models given a training set.
Listing 1.1 provides an example of training a LogisticRegression model on the classic iris sample
dataset using scikit-learn and using the model to predict class labels based on numerical input
features.
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris() #A
X = iris.data #A
y = iris.target #A
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
    random_state=42) #A
model = LogisticRegression(max_iter=200) #B max_iter raised to ensure convergence
model.fit(X_train, y_train) #B
y_pred = model.predict(X_test) #C
print("Predicted values: ", y_pred) #C
print("True values: ", y_test) #C
print("Accuracy: ", accuracy_score(y_test, y_pred)) #C
Predicted values: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
True values: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Accuracy: 1.0
The evaluation phase involves loading a model into a staging environment and extensively testing it.
This is a common failure point in the development of machine learning applications. The model
might perform poorly on out-of-sample data or might produce results that are unexpected or
inadequate to stakeholders. The former is usually a data issue: the model might be overfit, having captured
noise or quirks specific to the training set, or underfit, having failed to capture its underlying structure, and
therefore not generalize well to data points from the real world. The latter issue can stem from a lack of
transparency and communication between engineering teams and stakeholders.
The deployment phase is the technical task of pushing the model into production so that end
users have access to it. Successful deployments include mechanisms to capture key performance
indicators or human feedback from users. Deployment marks the end of the offline learning cycle.
The offline learning cycle is similar to the software development life cycle (SDLC) in that the process
is iterative. Models must be retrained routinely in order to keep up with business objectives.
The offline learning workflow is fully synchronous, where each phase of the workflow requires
the previous phase to be completed. For example, model training cannot occur until all the training
data is ingested because offline models generally require access to all the data points at once. This
limits the speed to deployment for new models. The speed to deployment is also limited by model
training time. At each training phase models are re-trained from scratch. The training time is a
function of model complexity and the training set size. Since one way to improve the accuracy of offline
models is to increase the training set size, the training time typically increases linearly with each
cycle.
While offline learning is a natural starting point for organizations that want to use historical data
to solve current problems, it also requires a persistent effort to maintain these systems. For use
cases where underlying data trends change more rapidly, a reactive learning approach is required
to ensure that trained models remain relevant and useful. In the next section, we introduce the
concept of online learning to address these issues.
Model drift is the phenomenon where machine learning models become less predictive over
time. This is usually because the model is trained on data assumptions that no longer hold true.
When training machine learning models we select a set of features from the population X that is
likely to be predictive of the output labels Y. For example, in the iris flower classification problem,
the set of features X are petal length, petal width, sepal length, and sepal width, and the output
label Y is the species of the iris flower: setosa, virginica, or versicolor. We don't know all the
possible inputs and outputs but we can describe the relationship between them in terms of
probability. If P(X) represents the probability distribution of input features and P(Y) represents the
distribution of output labels then it’s possible to express the probability of a set of features and
labels appearing together as the joint probability P(X,Y) = P(X)P(Y|X). This implies there are two major
components to model drift: the distribution of input features P(X) and the probability of the labels
given the features P(Y|X).
Data drift or feature drift describes the case where the distribution of input features P(X) changes
over time. Consider a spam classifier that is trained to classify historical emails as spam or not spam
based on email length. In the initial training set the length of emails varies between 40-60
characters. However, after the model is deployed to production it starts to observe much longer
emails. Since the model was not trained on many emails above 100 characters it may struggle to
accurately classify these longer emails. This scenario describes a shift in the feature distribution
causing the model to perform worse over time. The feature shift could be a result of failing to
include emails in the training set that properly represent reality. However, it could also be caused
by external factors such as changing marketing strategies. The figure below illustrates this feature
drift.
Figure 1.3 The average length of emails increases significantly after a model is deployed to production, an
indication of feature drift. This may cause the model to produce less accurate predictions because it was
trained on much shorter emails.
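One simple way to surface this kind of feature drift in practice is to compare the distribution of a feature during training against a recent production window, for example with a two-sample Kolmogorov-Smirnov test. The sketch below is illustrative only: the email lengths are synthetic, the threshold is arbitrary, and it assumes scipy is available as an extra dependency.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical email lengths: the training window vs. a recent production window.
train_lengths = rng.normal(loc=50, scale=5, size=1_000)     # roughly 40-60 characters
recent_lengths = rng.normal(loc=120, scale=20, size=1_000)  # much longer emails

# A small p-value suggests the feature distribution has shifted.
statistic, p_value = ks_2samp(train_lengths, recent_lengths)
if p_value < 0.01:
    print(f"Possible feature drift detected (KS statistic={statistic:.2f})")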
Concept drift describes the changing relationship between input features and output labels over
time. In mathematical terms this is how P(Y|X) changes over time. Batch models perform well if
P(Y|X) remains static. However, in the real world the outputs we are trying to predict are subject to
external factors. This can be due to transient events like natural disasters or outages or caused by
long term trends such as consumer behavior and market shifts. An example of this is the 2020
coronavirus pandemic which increased consumer demand for specific products such as hand
sanitizer and immunity vitamins (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9111418/). This
likely affected the accuracy of sales projection models since the relationship between data features
and consumer purchases changed overnight. Models trained before 2020 may assume that hand
sanitizer sales remain fairly constant aside from seasonal trends such as the beginning of the
school year. These models have to be retrained because the learned relationships no longer hold
true.
The difference between feature drift and concept drift is illustrated in the figure below. A
machine learning model is fit to an initial set of data points and learns a decision boundary that
distinguishes between two output classes. Feature drift causes the data points in the feature space to
change which may impact the accuracy of the model. Concept drift changes the decision boundary
itself which has greater potential to invalidate the model. Note that this doesn’t necessarily mean
that the features have drifted. For example, a model that uses the weather to predict retail sales
would be affected by a global lockdown because consumers are staying home even though the
weather is nice.
Figure 1.4 Feature drift changes the distribution of features, while concept drift changes the problem’s
decision boundary
In the previous section we described how an offline model is trained by finding the set of
parameters that minimize the objective function. In online learning, by contrast, the full dataset is not
available at training time, which makes it impossible to compute the total loss. However, if the loss
function is additive, such as the sum of squared errors, it can be decomposed into individual terms.
This means that the gradient correction can be computed for individual data samples using the
adjusted formula below. The difference is that the loss function is computed for only the new
sample instead of the entire dataset.
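For a single new data sample i with per-sample loss L_i, the update becomes

p \leftarrow p - \eta \, \nabla_{p} L_i(p)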
Formula 1.3 Computing incremental gradient loss using a single data sample and updating model parameters
This is the basis of the incremental gradient descent approach. The learning-rate coefficient (η in the
formula above) controls how much the parameters are affected by each update. A high learning rate is
desirable for quickly changing environments because the model will adapt faster to changes in the
data. A low learning rate will approximate the batch version of the model but it will be less useful
for short-term predictions because the model will adapt more slowly.
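To make this concrete, here is a minimal sketch of incremental gradient descent for a one-feature linear model with squared error loss. The feature values, labels, and learning rate are made up for illustration.

# Minimal single-sample gradient descent for y ≈ w * x + b with squared error loss.
w, b = 0.0, 0.0          # trainable parameters
learning_rate = 0.05     # the learning-rate coefficient discussed above

def learn_one(x, y):
    """Update the parameters using only the newly arrived sample (x, y)."""
    global w, b
    y_pred = w * x + b
    error = y_pred - y
    # Gradients of the squared error (y_pred - y)^2 with respect to w and b.
    grad_w = 2 * error * x
    grad_b = 2 * error
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Simulated stream of (feature, label) pairs arriving one at a time.
for x, y in [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]:
    learn_one(x, y)
    print(f"w={w:.3f}, b={b:.3f}")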
Given that online learning is functionally an approximation of offline learning, online models will
generally produce less accurate predictions than offline models that have the benefit of knowing
the entire dataset beforehand. The major benefit of online models is that they are able to react to
short-term trends and produce near-term predictions. Time-sensitive use cases such as inventory
prediction or load balancing may benefit from being able to learn patterns in real-time. Conversely,
batch models are preferable in situations where the distribution of the data is not likely to change
over time.
Figure 1.6 A stationary distribution of visits (left) shows periodic variance but the average number of visitors does
not change. A non-stationary distribution (right) shows a gradual increase in visitors over time.
Machine learning models trained on a historical time series of daily visit counts will be affected by
the general increase in visitors over time. Non-stationarity can be corrected for with data
transformations or by modeling it directly, for example by assuming the increase in the average
visit count is linear and subtracting it out before offline modeling. Online models account for non-
stationarity by learning incrementally. Offline models are better suited for tasks where the data is
stationary with respect to time. For example, an image classifier trained to distinguish dogs from
cats is more easily trained offline because the classes of images will not change over time.
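As a concrete illustration of the "subtract the linear trend" idea, the sketch below fits a straight line to a made-up daily visit series and removes it before any offline modeling. The data is synthetic.

import numpy as np

rng = np.random.default_rng(1)
days = np.arange(365)
# Synthetic non-stationary series: a linear growth trend plus noise.
visits = 1000 + 2.5 * days + rng.normal(scale=50, size=days.size)

# Fit a degree-1 polynomial (linear trend) and subtract it out.
slope, intercept = np.polyfit(days, visits, deg=1)
detrended = visits - (slope * days + intercept)

print(f"Estimated daily growth: {slope:.2f} visits/day")
print(f"Mean of detrended series: {detrended.mean():.2f}")  # close to zero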
Online learning is somewhat analogous to the rolling mean in statistics. The traditional mean is
simply an average of all the samples in a set, but the rolling or moving average is an average over
the last n samples. The moving average intentionally “forgets” older information in order to get a
more recent snapshot of the data. Online models function similarly to this because the model
parameters are adjusted for each new data sample. The forgetting effect is sometimes desired
because it acts like a form of regularization to avoid overfitting to the data. For example, you might
want a news-based model to forget specific knowledge it has learned during transient periods like
election cycles. Some models such as deep neural networks are vulnerable to “catastrophic
forgetting” in which important knowledge like grammar rules that was previously encoded into the
model is lost. This can be mitigated by reducing the number of weights that are updated on each
training run.
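The forgetting behavior of a rolling window is easy to see with a short sketch; the numbers here are arbitrary.

from collections import deque

values = [10, 11, 10, 12, 30, 31, 32, 33]  # the level jumps partway through

window = deque(maxlen=3)   # rolling mean "forgets" samples older than 3 steps
running_sum, count = 0.0, 0
for v in values:
    window.append(v)
    running_sum += v
    count += 1
    overall_mean = running_sum / count
    rolling_mean = sum(window) / len(window)
    print(f"value={v:2d}  overall mean={overall_mean:5.2f}  rolling mean={rolling_mean:5.2f}")

The rolling mean catches up with the jump within a few samples, while the overall mean lags far behind it.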
When assessing whether an online learning approach is appropriate for solving a problem, it’s
important to consider operational factors such as time to deployment, observability, and
maintenance cost. Each phase of the offline learning cycle takes a considerable amount of
engineering time and knowledge across several disciplines. Online models are a lot easier to get
into production since the development workflow is identical to the production workflow. Because
these models make predictions as a part of the learning process they are compatible with staging or
production environments. In contrast, batch models require a separate pipeline to produce
predictions in production.
A unique aspect of online models is that they learn while in production. This makes them mostly
autonomous which removes the need to manually retrain and evaluate them offline. This provides
several advantages over offline models.
There is no need to split the data into training and test sets. The same data is
used first to make a prediction (as test data) and then, once the label is
available, to train the model (as training data), as shown in the sketch after this list
There is no need to worry about data leakage (accidentally using the test dataset
to train a model)
Retraining a model from scratch periodically is no longer necessary since the
model is continually learning
Reduced memory footprint since a real-time model has to learn from only one
example at a time
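The predict-then-train pattern in the first point above can be sketched with scikit-learn's SGDClassifier, whose partial_fit method supports incremental updates. The streamed samples below are synthetic; this is an illustration, not a complete system.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()
classes = np.array([0, 1])

correct, total = 0, 0
for _ in range(500):
    x = rng.normal(size=(1, 2))                  # one sample arrives
    y = np.array([int(x[0, 0] + x[0, 1] > 0)])   # its label becomes available later
    if total > 0:                                # 1. use the sample as "test" data first
        correct += int(model.predict(x)[0] == y[0])
    model.partial_fit(x, y, classes=classes)     # 2. then use it as "training" data
    total += 1

print(f"Prequential accuracy: {correct / (total - 1):.2f}")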
One downside of online models is that they require constant monitoring to ensure they are
producing accurate results. Unlike in offline learning, there is no extensive pre-deployment evaluation, so online
models aren't guaranteed to produce results of the same quality as carefully trained offline models. Traditionally, machine
learning has involved an experiment-oriented, iterative process using batches of data to train a
model. Not only is it easier to conduct multiple tests before deploying a model into production, but a lot
of work is also done to clean up and standardize the data before it is fed to a model. With online
learning, some exploratory data analysis can and should be conducted to understand the
characteristics of the data and to decide on the features to use to train an online model. However,
there is little control in ensuring data quality due to the real-time aspect of data arriving into the
system. As a consequence, a lot more time needs to be spent putting guardrails into the
system and coming up with mitigation strategies should things go wrong. There is a trade-off
between model adaptability and model accuracy. For high-risk use cases, it is preferable to build
models offline to ensure they are properly validated and tested before deploying to production.
Since online learning involves frequent updates, it is a natural fit for human feedback
mechanisms. Batch models that operate on human feedback require specific retraining before
users of the model will see improvements. Because user preferences can change more frequently
than batch models are retrained, online learning is more compatible with recommender systems.
We will discuss this in more detail in the next section.
Table 1.1 presents the pros and cons of online learning. These are points developers must take
into consideration when embarking on a new project that will use online learning.
Pros:
Easier to deploy because of the similarities between the development and production workflows.
Online models learn while in production, which means that separate retraining workflows are not needed.

Cons:
Need to be constantly monitored to ensure the model is producing accurate results.
Require more guardrails and mitigation strategies, since it is not possible to inspect data before it is used to train an online model.
One roadblock preventing widespread adoption of online models is that they require stream-oriented
architectures. Traditionally, extract-transform-load (ETL) is used to describe the process of sourcing
raw data, applying transformations to standardize and normalize data into a retrievable state and
loading it into a centralized store. Such data pipelines make offline learning more consistent and
repeatable. Online learning requires an analogous data methodology called stream-transform-load
(STL). We will discuss its implementation details and unique challenges in chapter 2.
As the examples above illustrate, the biggest challenge with traditional machine learning is that
when a model is trained, its parameters and weights are “frozen”. We have already discussed how
model drift can cause models to lose accuracy over time. Retraining models requires training from
scratch on a whole new batch of data, which can be cost prohibitive, slow, and error-prone.
Shifting towards online modeling approaches can help alleviate these challenges. Unlike their
batch counterparts, real-time machine learning models have the ability to continually learn on new
data, i.e., their weights and parameters are updated with every new instance of data. The result is
that such models are able to adapt more quickly and require a lower memory footprint since they
learn from one record at a time. Let’s now take a look at a few use cases that are better suited for
real-time machine learning.
In cybersecurity, real-time anomaly detection algorithms are better suited for analyzing network
traffic logs or system behavior patterns, because they can adapt their detection capabilities in real-
time. This is critical since threat actors are constantly evolving their tactics and techniques. An
anomaly detection model can continuously analyze millions of data packets to detect deviations
from normal behavior patterns and trigger alerts immediately, mitigating potential damage before
it escalates. In addition to network log monitoring, anomaly detection algorithms can be used to
monitor endpoint behavior, application usage patterns, and system configurations. This can be
taken one step further where these models are incorporated as part of a process to automate
cybersecurity operations. For example, a real-time intrusion detection system can not only send
alerts when suspicious behavior is detected, it can then automatically block suspicious IP addresses
or quarantine compromised devices in real-time, resulting in faster incident response times and
prevention of large scale attacks. In short, real-time machine learning can help level the playing field
between cybersecurity experts and cyber criminals.
In healthcare, anomaly detection algorithms can be used for real-time monitoring and early
detection. Many people now use wearable devices that monitor heart rate, physical activity, blood
pressure, glucose, and other health markers. These devices are generating streams of data in real-
time which can be processed and analyzed by real-time anomaly detection algorithms to detect
anomalies such as irregular vital signs, unexpected changes in patient condition, and deviations
from typical health patterns. These algorithms can continually adapt and evolve as they take in new
information from these devices.
In addition to monitoring health related data, anomaly detection algorithms can be used to
improve operational efficiencies at healthcare facilities by continually monitoring data such as
hospital admissions, bed occupancy rates, and drug use to optimize resource allocation, scheduling,
and patient care.
Real-time anomaly detection algorithms are particularly suited for monitoring IoT systems. With
an increase in the number of interconnected devices and sensors, it is critical that these systems
are functioning properly and optimally. It is possible to monitor metrics such as temperature,
humidity, and energy consumption to detect deviations from normal patterns and detect potential
malfunctions, intrusions, or degradations in performance. The real-time nature of these algorithms
also ensures that corrective action can be taken as quickly as possible to prevent system wide
catastrophic failures, extend equipment lifespan, and reduce downtime, resulting in improved
operational efficiency and minimized maintenance costs.
IoT devices often operate on resource-constrained edge computing environments with limited
processing power and bandwidth. Real-time models can be deployed at the edge to perform
lightweight anomaly detection tasks locally, reducing latency and conserving network bandwidth.
These models can detect anomalies rapidly without the need to connect to a centralized cloud
server. This decentralized approach enhances scalability, responsiveness, and efficiency in IoT
deployments, particularly in applications requiring real-time anomaly detection such as smart cities,
autonomous vehicles, or remote asset monitoring.
Reinforcement learning from human feedback (RLHF) can also be used in other applications. One such application is detecting phishing and
social engineering schemes. Human feedback can be used to identify phishing emails that bypass
spam filters and other controls. Intelligence analysts can provide feedback on the characteristics of
phishing attempts such as suspicious URLs, misleading content, or impersonation tactics used by
cyber criminals. This feedback can be collected through a streaming application and be fed to a
reinforcement learning algorithm to ensure that it has the most up to date information to
effectively detect new phishing and social engineering attacks.
In healthcare, RLHF can streamline and enhance decision making by integrating human feedback
in decision support systems. Healthcare experts can provide feedback on the accuracy of
diagnostics, treatment recommendations, and patient management plans suggested by the model
and the model in turn can use the feedback to provide better recommendations.
In this chapter, you have developed an understanding of what real-time data is and how it can
be used to build online models that can both train and generate inferences on a continual basis as
soon as data arrives in the system. We have provided some use cases that are particularly suited for
real-time machine learning. In the next chapter, you will learn how to ingest real-time data to build
data ingestion pipelines using event-driven architecture.
1.5 Summary
Real-time data is data in motion that has not been persisted into any form of
storage.
Real-time machine learning is an approach which uses real-time data to build
predictive systems that adapt to changes in an environment.
Offline learning involves training a model from a historical batch of data.
Online learning involves incrementally training a model from data as it arrives in
the system.
The offline learning workflow is synchronous: each component of the workflow
is dependent on the one preceding it.
The online learning workflow is asynchronous: the components of the workflow
are decoupled from each other.
Online learning evolved to address use cases where there are rapid changes in
data distributions.
Feature drift or data drift occurs when the distribution of the features used to
train a model changes over time.
Concept drift occurs when the relationship between the input features and the
output labels changes over time.
Some use cases that are suitable for real-time machine learning include
recommender systems, anomaly detection, and reinforcement learning.
[1] The transmission speed depends on the speed of light c and the refractive index n of the
fiber optic cable. Here we assume a single-mode fiber with n=1.46 so the ideal transmission
speed is v = c / n = 300,000 km/s / 1.46 = ~205,000 km/s.
Every machine learning project starts with a problem that can be solved by using a machine
learning model that learns patterns from the data. The data available to you and the data you have
the ability to collect define what is possible for the project. This is no different for real-time
machine learning. In order to start building real-time machine learning applications, you need to
understand what real-time data is available to you and how to use it to produce useful predictions.
In the previous chapter we explored how real-time data instances are used to both generate
real-time inferences and train online models. In order to accomplish this, the data first has to be
ingested and transmitted from its origin to the process where inference is happening. This requires
dedicated data architectures to handle the continuous transfer of data, known as data streams. In
practice, data streams are implemented as message queues or event streams. We discuss message
queues and event streams in section 2.2.
In this chapter you will learn how to ingest data points from a real-time data source. Then, you
will write a publisher to persist this data to an event stream and a subscriber to read both live and
historical data from the stream. Publishers and subscribers will be discussed in section 2.2. You will
also learn how event-driven architectures can solve some of the inherent problems associated with
real-time data streaming such as backpressure and scalability. At the end of this chapter, you will
have a data ingestion mechanism ready for real-time machine learning.
Figure 2.1 illustrates an example of a simple workflow that describes the flow of real-time data
using event-driven architecture. Specifically there is a publisher that ingests data from a real-time
data source and publishes this data to a topic. There is a message broker in the middle that routes
the data from the publisher to a subscriber that has subscribed to the topic.
When selecting a real-time data source for a project, you should consider the following:
Terms of use: Most APIs specify terms and conditions for how the data can be used. For
example, some usage agreements prohibit you from making publicly available copies of the data or
selling an application that uses the data.
Reliability: One challenging aspect of real-time data projects is that they often rely on external
services being available to function properly. When selecting a data source, it’s important to gauge
the reliability of the API by performing some initial due diligence. In particular, we want to select
data sources that are well maintained and used by others. There are specific metrics you can use
for this. For example, reliable APIs often have detailed documentation with examples. It’s a good
sign if the organization or party providing the API has an official programming client and its GitHub
repository is well maintained (e.g. updated within the last month, maintainers are responding to
issues, several hundred stars).
Rate limits: Most APIs impose some kind of rate limiting to protect against spam attacks. When
selecting a data source a necessary step is to decide how much data is needed and how often.
Furthermore, what strategies will you use to deal with the limitations of the real-time data source?
For example, in this chapter we will utilize a polling interval strategy to avoid running into request
rate limits.
Data quality: When using an external data source you are limited by what data is made
available. For example, data that is advertised as “real-time” may only be available in time steps of
30 seconds. It’s important to understand how often new data is required for the use case. An
application that attempts to predict near-future stock price movements may need to ingest updates
on a second-by-second basis, but an application that predicts monthly rainfall may only need to be
updated on a daily basis. In the next chapter we will explore how to assess the viability of real-time
data features with respect to online modeling.
Data formatting: It’s especially common for REST APIs to return unstructured JSON data. One
goal of a data ingestion pipeline is to ensure that this data can be reliably deserialized into a defined
structure. A potential issue with using external APIs is that the data schema can change at any point
which can break downstream applications. For this reason it's important to use official
programming language APIs where possible, and to perform field validation so that any compatibility
issues are easily identified and fixed.
To power our nowcasting model, we wish to ingest data about ongoing flights. For offline
learning we would start by locating a historical flights dataset. In contrast, our model will need to
use very recent data to make predictions. Therefore we will need a real-time data source that
captures flight data. Fortunately, the OpenSky (https://opensky-network.org/) network provides
real-time flight data about ongoing air traffic including commercial flights. At the time of writing, this
is available for non-profit research use.
Note that many APIs will describe themselves as “real-time” APIs but may be designed very
differently. Some are truly real-time in that they utilize long-running, low-latency websockets for
delivering updates in a push-based format. This is most common in the finance domain because
trading volumes are high and knowing information as soon as possible is incredibly valuable. Other APIs
are pull-based so they require the client to make periodic requests for data updates.
OpenSky provides a REST-based API that we can make web requests to. In particular, we are
interested in the GET /states/all route which returns “state vectors” for ongoing flights. We will use
the built-in urllib library in Python to make a single request to the API and print the result in
readable text, as shown in listing 2.1.
from urllib import request

with request.urlopen("https://opensky-network.org/api/states/all") as response:
    print(response.read().decode("utf-8"))
import json
from urllib import request

url = ("https://opensky-network.org/api/states/all"
       "?lamin=45.8389&lomin=5.9962&lamax=47.8229&lomax=10.5226")
with request.urlopen(url) as response:
    print(json.loads(response.read().decode("utf-8")))
def response_to_events(api_response):
    flight_events = []
    for update in api_response["states"]: #A
        flight_events.append(
            {
                "icao24": update[0],
                "origin_country": update[2],
                "time_position": update[3],
                "longitude": update[5],
                "latitude": update[6],
                "velocity": update[9],
                "true_track": update[10],
            }
        )
    return sorted(flight_events, key=lambda x: x["time_position"]) #B
Now we have a function that parses a single API response into a sequence of events. We will need
something to invoke this function repeatedly to produce real-time data for the nowcasting system.
In event-driven system terms this is called a publisher. Publishers are long-running processes that
communicate events to other parts of the real-time system. In our case, we need a publisher that
routinely queries the OpenSky API and produces events that our system can understand.
Note that we’re a bit limited by the constraints of the data API. HTTP/1.1 is a pull mechanism so
the API will not tell us when new updates are available. The only option is to periodically poll the
server in order to retrieve flight updates. When implementing the publisher, we have to decide on
an interval to poll the server. The longer this interval is, the less real-time our system will be
because we will be operating on older data. However, with shorter intervals we risk running into rate
limits imposed by the API's maintainers. The figure below shows the different results we might get if
we choose different polling intervals.
Figure 2.3 Ingesting flight updates with different polling intervals. Shorter intervals produce more data points
with smaller changes. Longer intervals produce fewer data points with more significant changes.
Ideally we would poll as fast as possible in order to retrieve the most real-time data. However, in
practice we usually have to make a compromise because it wastes compute resources to
continuously poll and we will quickly run up against rate limits. So how do we determine the
appropriate polling interval for a particular use case? The first thing to look at is the API’s
documentation, which should state how often the underlying data updates and how many requests
are allowed per day/hour/minute. This usually gives a good starting point for determining an
appropriate interval. For example, at the time of writing the OpenSky documentation states that
unauthenticated users can retrieve new data every 10 seconds and 400 /api/states/all requests over
a 500x500 km area are allowed within a 24-hour period. So, we could poll the system every 10
seconds in order to have the latest data, but if we want to leave the system running for an entire day
the polling interval should be at least 24 * 60 / 400 = 3.6 minutes. In general, a good starting point
is between 30 seconds and 5 minutes. However, this depends on the criticality of the data and how
often the data is likely to change. For time-sensitive use cases such as alerting systems and financial
markets the polling interval should be less than a minute. For more slowly changing or stable
systems such as server health monitoring or weather tracking a suitable interval is closer to 5
minutes.
We will start with an interval of 10 seconds so we can see how the polling works in action. The
code below is an event generator that queries the API at a defined interval and performs the
transformation from response data into an iterable of events.
import json
import time
from urllib import request

def get_events(
    url=("https://opensky-network.org/api/states/all"
         "?lamin=45.8389&lomin=5.9962&lamax=47.8229&lomax=10.5226"),
    interval_sec=10,
):
    for _ in range(3): #A
        with request.urlopen(url) as response:
            yield from response_to_events(json.loads(response.read().decode("utf-8"))) #B
        time.sleep(interval_sec) #C
We can invoke the generator by iterating over it in a Python for loop to produce events.
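For example, a minimal sketch that simply prints each parsed flight event:

for event in get_events():
    print(event)

Wrapping this logic in a class gives us a reusable publisher.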
class FlightPublisherV1:
    def __init__(self, url, interval_sec, file_path):
        self.url = url
        self.interval_sec = interval_sec
        self.file_path = file_path

    # response_to_events is assumed to be the same parsing method shown earlier
    # (see the FlightPublisherV2 listing later in this chapter for the full version).
    def get_events(self): #B
        while True:
            with request.urlopen(self.url) as response:
                yield from self.response_to_events(
                    json.loads(response.read().decode("utf-8"))
                )
            time.sleep(self.interval_sec)

    def run(self): #C
        for event in self.get_events():
            with open(self.file_path, "a") as file:
                file.write(json.dumps(event) + "\n")
Running the following code will periodically dump batches of flight update events to
flight_updates.jsonl.
publisher = FlightPublisherV1(
    url=("https://opensky-network.org/api/states/all"
         "?lamin=45.8389&lomin=5.9962&lamax=47.8229&lomax=10.5226"),
    interval_sec=60,
    file_path="flight_updates.jsonl",
)
publisher.run()
You now have a publisher which extracts data from a real-time data source, performs simple data
filtering, and persists an ordered log of data events. This mirrors the extract-transform-load (ETL)
pipeline that is common in offline learning. However, in online learning we are interested in
processing this event log as close to real-time as possible. The online model will need to parse this
data and produce inferences as a real-time output. In the next section, you will create a streaming
ETL pipeline that processes the ingested flight data in real-time for online modeling.
Figure 2.4 Synchronous data transfer requires careful coordination between two parties. Asynchronous data
transfer requires a buffer or message queue.
With message queues we aim to decouple the different processes in our flight nowcasting system.
For instance, this will allow the FlightPublisher to continue polling for new flight data without waiting
for subscribers to process the data. However, there are a few issues with our file buffer approach.
One issue is that files are traditionally read sequentially from the beginning to the end of the file. As
more events are published the size of the file increases so it takes longer to read the file. In
addition, the most recent events will always be at the end. In order for the reader to ingest only new
events it must seek past the lines it has already read. This requires the subscriber to keep track of its
last read position, otherwise it will receive duplicate events. This behavior is illustrated in the figure
below.
Figure 2.5 The first time a subscriber reads from the event log it reads 3 events. On the second read it must
seek forward 3 events before reading the new events. Without a message queue, the subscriber must keep
track of its last read offset to avoid receiving duplicate events.
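A sketch of what that offset tracking could look like for the flight_updates.jsonl log written by the publisher above (the file name comes from that listing; everything else is illustrative):

import json

class FileSubscriber:
    def __init__(self, file_path):
        self.file_path = file_path
        self.offset = 0  # byte position of the last read, tracked by the subscriber

    def read_new_events(self):
        events = []
        with open(self.file_path) as f:
            f.seek(self.offset)        # skip everything already consumed
            while True:
                line = f.readline()
                if not line:
                    break
                events.append(json.loads(line))
            self.offset = f.tell()     # remember where we stopped
        return events

subscriber = FileSubscriber("flight_updates.jsonl")
print(subscriber.read_new_events())  # returns all events written so far
print(subscriber.read_new_events())  # later calls return only the new events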
We will now upgrade our file buffered approach to use a real message queue. There are many
message queueing solutions such as Apache Kafka, Google PubSub, and RabbitMQ. We will use
RabbitMQ because we can set up the queue locally with very minimal code. As a message broker it’s
responsible for accepting connection requests from both the publisher and subscriber, handling
queue creation, dispatching messages sent from the publisher to the correct queue, and more
importantly persisting the information in the queue even after the publisher closes its connection.
This is critical because it allows the publisher and subscriber to communicate asynchronously. This
decoupling turns out to be valuable for real-time machine learning because it means that data
ingestion can continue while the system is learning and producing inferences.
The first step is to run the RabbitMQ management service. The easiest way to start the service is
to pull and run the Docker image. If you don’t have Docker, you can install it on your system
following the directions here (https://docs.docker.com/get-started/get-docker/). Use the following
command to start the RabbitMQ service.
docker run -it --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:4.0-management
If the service started successfully then you should see a startup confirmation message.
2024-10-18 02:52:36.304715+00:00 [info] <0.651.0> Server startup complete; 4 plugins started.
2024-10-18 02:52:36.304715+00:00 [info] <0.651.0> * rabbitmq_prometheus
2024-10-18 02:52:36.304715+00:00 [info] <0.651.0> * rabbitmq_management
2024-10-18 02:52:36.304715+00:00 [info] <0.651.0> * rabbitmq_management_agent
2024-10-18 02:52:36.304715+00:00 [info] <0.651.0> * rabbitmq_web_dispatch
2024-10-18 02:52:36.409833+00:00 [info] <0.9.0> Time to start RabbitMQ: 12912 ms
To interact with the server you will need a client that speaks a messaging protocol that
RabbitMQ supports. Fortunately, Pika is a package that implements the Advanced Message Queuing
Protocol (AMQP) in pure Python. AMQP is an open standard application layer protocol which defines
message-based communication between network processes. In our case, it allows us to send
messages from Python code to the RabbitMQ server written in Erlang. Run the following command
in a separate terminal window to install Pika.
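pip install pika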
Next we can try to send a message to a queue to see if it works. With Pika we have to declare a
queue before sending anything to it. Running the following code will establish a connection to the
local message service, publish a single message to a flight_updates queue, and close the connection.
import json
import pika
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost")) #A
channel = connection.channel() #A
channel.queue_declare(queue="flight_updates") #B
data = {"icao24": "abc123", "origin_country": "United States"}
channel.basic_publish(exchange="", routing_key="flight_updates", body=json.dumps(data)) #C
print("Sent message:", data)
connection.close()
import json
import pika
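# A minimal sketch of the subscriber body (assumed here), reading from the same
# flight_updates queue declared by the publisher above.
def process_message(channel, method, properties, body):  #A
    print("Received message:", body)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="flight_updates")  #B
channel.basic_consume(
    queue="flight_updates", on_message_callback=process_message, auto_ack=True  #C
)
channel.start_consuming()  #D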
#A Function callback that gets invoked when a new message is received from the queue.
#B Ensure the message queue exists.
#C Configure the consumer to auto-send acknowledgements (fire-and-forget method)
#D Start consuming events from the queue.
You may be wondering why you have to declare the queue again since we already declared it on the
publisher side. The reason is that in a production system the publisher and subscriber are likely
operating on different machines. Without a secondary communication channel, it’s impossible for
either party to know if the other has started yet. Fortunately the queue_declare call is idempotent,
i.e., it can be called several times but only one queue will be created. Therefore, it is simpler to
declare the queue in both places than to check whether it already exists. Assuming you ran the publish code
earlier, this code should output the following message.
Received message: b'{"icao24": "abc123", "origin_country": "United States"}'
This is the same as before so we have successfully transferred a message from the publisher to
the subscriber through the queue. Note that the subscriber process is hanging. This may seem like
a bug but it’s actually intended behavior. The start_consuming function implements a loop that waits
for new messages. This is a common pattern for event-driven systems. Subscribers wait for things
to happen in the system and then perform some operation, such as producing new data or
updating their internal state. You can confirm that the waiting behavior is happening by simply
running the publisher code again; you should see another message printed by the subscriber.
Now you can create a configurable flight subscriber that reads from a queue. Create a new file
called subscribe.py and include the following code.
import pika

class FlightSubscriberV1:
    def __init__(self, queue_name):
        self.queue_name = queue_name
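
    # The remaining methods are a sketch (assumed here): a message callback plus a
    # run loop that mirrors the ad-hoc consumer shown above.
    def process_message(self, channel, method, properties, body):
        print("Received message:", body)

    def run(self):
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(queue=self.queue_name)
        channel.basic_consume(
            queue=self.queue_name,
            on_message_callback=self.process_message,
            auto_ack=True,
        )
        channel.start_consuming()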
import json
import pika
import time
from urllib import request

class FlightPublisherV2:
    def __init__(self, url, interval_sec, queue_name):
        self.url = url
        self.interval_sec = interval_sec
        self.queue_name = queue_name

    def response_to_events(self, api_response): #A
        flight_events = []
        for update in api_response["states"]:
            flight_events.append(
                {
                    "icao24": update[0],
                    "origin_country": update[2],
                    "time_position": update[3],
                    "longitude": update[5],
                    "latitude": update[6],
                    "velocity": update[9],
                    "true_track": update[10],
                }
            )
        return sorted(flight_events, key=lambda x: x["time_position"])

    def get_events(self): #B
        while True:
            with request.urlopen(self.url) as response:
                yield from self.response_to_events(
                    json.loads(response.read().decode("utf-8"))
                )
            time.sleep(self.interval_sec)

    def run(self): #C
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(queue=self.queue_name)
        for event in self.get_events():
            channel.basic_publish(
                exchange="", routing_key=self.queue_name, body=json.dumps(event)
            )
            time.sleep(0.1) #D
        connection.close()
publisher = FlightPublisherV2(
    url=("https://opensky-network.org/api/states/all"
         "?lamin=45.8389&lomin=5.9962&lamax=47.8229&lomax=10.5226"),
    interval_sec=60,
    queue_name="flight_updates",
)
publisher.run() #E
python publisher.py
You should see the same events printed out by both the publisher and subscriber in the same
order. You have now created a basic event-driven pipeline that ingests real-time data and streams it
to a downstream subscriber. This is an improvement over the file buffer mechanism we
implemented earlier because the subscriber does not have to guess when new data is available.
Instead, the subscriber is truly event-driven and can react immediately to new flight events.
Figure 2.6 At least once delivery ensures that every message is delivered at least once but may produce
duplicates. At most once delivery prevents duplicates but may fail to deliver messages.
The ideal semantic for asynchronous messaging is exactly once delivery, although this is impossible
to achieve outside of very specific scenarios. This has to do with the famous Two Generals’ thought
experiment, which was first published in a paper by E. A. Akkoyunlu, K. Ekanadham, and R. V. Huber
in 1975 illustrating the constraints in network communications (https://dl.acm.org/doi/pdf/10.1145/
800213.806523). The scenario demonstrates that it’s impossible for two parties communicating over
an unreliable link to coordinate on an action. Even if the subscriber sends an acknowledgement to
the publisher indicating that a message was properly received, there is no way to ensure that the
acknowledgement message itself is not lost.
Realistically, engineers have to make a choice of whether to handle duplicates or missing data.
For real-time machine learning this depends on the exact use case. In a financial fraud detection
system at least once delivery would be preferred if missing an important transaction would impact
the model’s prediction accuracy. In less critical scenarios such as user recommendations, at most
once delivery is acceptable if the speed at which those recommendations are delivered to users is
more important than losing an occasional data point. For the flight nowcasting system, missing
events is not desirable because it means we have less data for online modeling and we want to
make accurate predictions. Fortunately, detecting duplicates is not too difficult since the aircraft ID
and timestamp together form a unique key. The following function can be added to the FlightSubscriber to
keep a record of current flight positions and perform a lookup to determine if an event has already
been processed.
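This is the same duplicate check used by the FlightSubscriberV2 listing later in this chapter; it assumes the subscriber keeps a flights dictionary keyed by aircraft ID.

def check_duplicate(self, event):
    if (
        event["icao24"] in self.flights
        and event["time_position"] <= self.flights[event["icao24"]]["time_position"]
    ):
        return True
    self.flights[event["icao24"]] = event
    return False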
Since they are meant to be a temporary buffer, message queues have a finite size. If the subscriber
runs more slowly than the publisher then the number of messages in the queue will continue to
increase. This phenomenon is called backpressure, which refers to how the data throughput is
limited by the processing speed of the subscriber. This is analogous to cars on a highway during
rush hour; there are too many cars for the road to handle so the rate of traffic slows.
Figure 2.7 Backpressure in a message queue causing a newly published message to be dropped
There are a few ways to manage backpressure with a message queue. One method is to
deliberately slow down the speed of the publisher. For real-time machine learning, this option may
not be preferable because it leads to ingesting older data that is no longer relevant. Another option
is to drop messages to decrease the pressure on the queue, which leads to data loss. The best
option when dealing with high throughput data ingestion is to increase the number of subscribers.
Of course, this increases the complexity of the system.
There are functional reasons why you would want multiple subscribers reading from the same
queue. For example, inferences produced by an online model could be read by multiple user-
facing applications. However, recall that messages are removed from the queue once consumed.
This makes it impossible for two different subscribers to receive the same event from a queue. In
order to implement this behavior, we need a real-time buffer mechanism that can also be persisted
to a historical log: the event stream.
import json
import pika
import time
from urllib import request

class FlightPublisherV3:
    def __init__(self, url, interval_sec, stream_name):
        self.url = url
        self.interval_sec = interval_sec
        self.stream_name = stream_name

    def response_to_events(self, api_response): #A
        flight_events = []
        for update in api_response["states"]:
            flight_events.append(
                {
                    "icao24": update[0],
                    "origin_country": update[2],
                    "time_position": update[3],
                    "longitude": update[5],
                    "latitude": update[6],
                    "velocity": update[9],
                    "true_track": update[10],
                }
            )
        return sorted(flight_events, key=lambda x: x["time_position"])

    def get_events(self): #B
        while True:
            with request.urlopen(self.url) as response:
                yield from self.response_to_events(
                    json.loads(response.read().decode("utf-8"))
                )
            time.sleep(self.interval_sec)  # wait between polls, as in FlightPublisherV2

    def run(self):
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(
            queue=self.stream_name,
            durable=True, #C
            arguments={"x-queue-type": "stream"}
        )
        for event in self.get_events():
            print("Sending flight update:", event)
            channel.basic_publish( #D
                exchange="", routing_key=self.stream_name, body=json.dumps(event) #D
            ) #D
        connection.close()
publisher = FlightPublisherV3(
    url=("https://opensky-network.org/api/states/all"
         "?lamin=45.8389&lomin=5.9962&lamax=47.8229&lomax=10.5226"),
    interval_sec=60,
    stream_name="flight_events",
)
publisher.run() #E
Running publisher.py again produces similar output as before. However, you should see different
output from the broker terminal which implies that the events are indeed being persisted to disk.
2024-10-19 15:41:28.879707+00:00 [info] <0.4238.0> rabbit_stream_coordinator: started writer
__flight_events_1729352488363061228 on rabbit@6bcd5c37b84b in 1
2024-10-19 15:41:28.879955+00:00 [info] <0.4239.0> Stream: __flight_events_1729352488363061228
will use /var/lib/rabbitmq/mnesia/rabbit@6bcd5c37b84b/stream/__flight_events_1729352488363061228
for osiris log data directory
2024-10-19 15:41:28.910821+00:00 [info] <0.4239.0> osiris_writer:init/1: name:
__flight_events_1729352488363061228 last offset: -1 committed chunk id: -1 epoch: 1
Stop the publisher for now. To verify that the events were published, we will test if a new
subscriber can retrieve any of them. We also have a few options to configure subscriber behavior.
The first is subscriber acknowledgements. In event-driven systems the broker needs to know if
an event has been successfully delivered to a subscriber so it can remove the event from the
outgoing queue. To be a good citizen the subscriber should send an acknowledgement signal as
soon as the event is successfully processed. When working with queues we have previously set
auto_ack=True. This resulted in the highest throughput behavior where the broker would delete the
event as soon as it was delivered to the subscriber (the “fire and forget” method).
However, there are several scenarios where the subscriber would be unable to process an event.
For example, a JSON event may fail to validate against a defined schema (e.g. missing a required
key). An event could also be received by a subscriber that it cannot process due to a failed
precondition (e.g. the subscriber needs to update its local database but the database is not ready
yet). In these cases, the subscriber should send a negative acknowledgement informing the broker
that it should either discard or requeue the event. Therefore, it’s safer to set auto_ack=False and
change the subscriber to send manual acknowledgements.
The other option has to do with throughput. RabbitMQ consumers prefetch events on the
subscriber before processing them to relieve backpressure on the stream. Unlimited prefetch
allows for the highest throughput but comes at the risk of overwhelming the subscriber. If the
subscriber fails to acknowledge an event then the prefetched events will accumulate on the
subscriber.
The prefetch count is normally determined experimentally depending on throughput and data
safety requirements. In our case, we are ingesting fewer than 100 events every 60 seconds so we
don’t need high throughput; we will set this value to the most conservative value of 1. Here is the
updated FlightSubscriber that consumes from the new event stream and also integrates the
duplicate check to ensure data safety.
import json
import pika

class FlightSubscriberV2:
    def __init__(self, stream_name):
        self.stream_name = stream_name
        self.flights = {}

    def check_duplicate(self, event):
        if (
            event["icao24"] in self.flights
            and event["time_position"] <= self.flights[event["icao24"]]["time_position"]
        ):
            return True
        self.flights[event["icao24"]] = event
        return False
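The consumption logic follows the same pattern as the publisher’s run method. Here is a minimal sketch of how the rest of the class might look, with the prefetch count set to 1 as discussed; the exact structure of process_message is an assumption for illustration.

    def process_message(self, ch, method, properties, body):
        event = json.loads(body)
        if not self.check_duplicate(event):
            print("Received flight update:", event)

    def run(self):
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(
            queue=self.stream_name,
            durable=True,
            arguments={"x-queue-type": "stream"}
        )
        channel.basic_qos(prefetch_count=1)  # conservative prefetch, as discussed
        channel.basic_consume(
            queue=self.stream_name,
            on_message_callback=self.process_message
        )
        channel.start_consuming()

subscriber = FlightSubscriberV2(stream_name="flight_events")
subscriber.run()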
Try running the new subscriber. Surprisingly, we don’t receive any of the published events. Now, try
running the publisher again.
Received flight update: {'icao24': '4ba975', 'origin_country': 'Turkey', 'time_position': 1729359358,
'longitude': 6.1143, 'latitude': 46.2395, 'velocity': 8.23, 'true_track': 45}
The subscriber received data this time, but only one event. Try setting the prefetch_count to 2 and
run the subscriber again. Then, restart the publisher to produce some new events.
Received flight update: {'icao24': '4a3121', 'origin_country': 'Romania', 'time_position': 1729359386,
'longitude': 9.803, 'latitude': 47.734, 'velocity': 171.92, 'true_track': 268.11}
Received flight update: {'icao24': '398579', 'origin_country': 'France', 'time_position': 1729359559,
'longitude': 9.7242, 'latitude': 46.6355, 'velocity': 217.26, 'true_track': 125.46}
Now, the subscriber received only two new events. To understand what’s going on, consider the sequence of events.
1. The publisher starts and publishes some events to the broker.
2. The subscriber starts with a prefetch count of 1 but receives no events.
3. The publisher restarts and the subscriber receives one new event.
4. The subscriber restarts with a prefetch count of 2.
5. The publisher restarts and the subscriber receives two new events.
Here, the subscriber only consumes new events when the publisher restarts. This is because, unlike with a queue, messages previously published to the broker are not automatically delivered to a new subscriber. Instead, new subscribers start reading the stream at the next offset, that is, only events published after they subscribe. If we want the subscriber to read older events, we need to set the x-stream-offset option to a different value. For example, we could configure the subscriber to start from the beginning of the stream.
channel.basic_consume(
    queue=self.stream_name,
    on_message_callback=self.process_message,
    arguments={"x-stream-offset": "first"} #A
)
channel.start_consuming()
The second question is why the subscriber received only one event in step 3 and two events in step 5. The answer is that we did not configure the subscriber to acknowledge any of the events, so the prefetch queue fills up. To consume the rest of the stream, the subscriber should acknowledge back to the broker after each event is properly processed, as shown in the listing below.
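Here is a minimal sketch of that change, assuming the process_message callback sketched earlier; the only addition is a basic_ack call after each event has been handled.

    def process_message(self, ch, method, properties, body):
        event = json.loads(body)
        if not self.check_duplicate(event):
            print("Received flight update:", event)
        # Acknowledge after the event is handled so the broker frees the
        # prefetch slot and delivers the next event in the stream.
        ch.basic_ack(delivery_tag=method.delivery_tag)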
After making these updates, restart the subscriber. You should now see it consume all of the events. You can confirm this by checking that the last few events it prints match the last publisher run, and by looking back through the subscriber output to verify that it also consumed the same events as before.
The above highlights an increased need for timeliness in service delivery. This has spillover effects into machine learning, which is often a critical component of any kind of real-time application. There are many more use cases where generating inferences in real time has become a necessity; we covered a few such examples in the previous chapter. Before we dive into the benefits, let’s first talk about how machine learning applications are typically deployed today.
Figure 2.9 Customer review sentiment analysis application using synchronous architecture
The nature of the interaction above is synchronous. The service that submits the request waits until it receives the response before moving on to the next step in the application workflow. Having to wait for a response introduces latency into the application, which may be unacceptable for some use cases.
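As a concrete illustration of this pattern, a synchronous call to a hypothetical sentiment analysis endpoint might look like the following; the URL, payload, and response shape are assumptions rather than part of the application built in this chapter.

import requests

review = {"review_id": 42, "text": "The checkout process kept failing."}

# The caller blocks on this line until the inference service responds
# (or the request times out).
response = requests.post(
    "http://sentiment-service.internal/predict",  # hypothetical endpoint
    json=review,
    timeout=2.0,
)
print("Predicted sentiment:", response.json()["sentiment"])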
Latency can also enter the system in other ways. If the inference engine is bombarded with more requests for predictions than it can handle, backpressure builds up and responses slow down. Some application designers may choose to apply rate limiting to manage the influx of requests, but that means the service will simply reject requests beyond the limit. Neither situation is ideal, and both can result in a poor user experience.
Figure 2.10 Customer review sentiment analysis application using synchronous architecture
The above highlights the tight coupling between the service that sends the features and the service that generates and returns the inference. The load on the machine learning microservice depends on how many requests get sent to it, and the requesting microservice’s performance depends on the responsiveness of the machine learning microservice. The customer review application presented earlier demonstrates a scenario where the application that calls the machine learning microservice receives the inference and passes it to another application, which presents it to customer service personnel so they can triage the reviews. What if the inference is needed in multiple applications? For example, it might feed a dashboard that displays analytics used for business intelligence. The inferences, along with the features, might also be stored in a database for offline analytics. A bottleneck in any part of this workflow can set off a chain reaction that causes the entire system to fail.
There are other issues with this dependency. If the machine learning microservice needs to
make an API change, then all the other microservices that are dependent on it will also have to
change their code. In many organizations, the team that developed the machine learning
microservice is different from the teams that call it. These teams typically have different objectives,
priorities, and release schedules. The end result is a communication challenge that plagues many
organizations and can cause machine learning projects to fail.
All of this highlights a central point: to build a responsive application, quick access to data is essential. In a real-time system, data holds value in much the same way money does. Just as a dollar is worth more today than it will be in the future, the faster data is available, processed, and acted upon in a real-time machine learning application, the more value it provides to an organization. For example, an application that can adjust its recommendations based on user activity in real time is more likely to keep users engaged and coming back than one that cannot adjust because it was trained on the previous day’s data. Similarly, the sooner an application can detect and flag fraudulent financial transactions, the more money the financial institution is likely to save over time.
Figure 2.11 Customer review sentiment analysis application using event-driven architecture
The key difference from the earlier example is that the customer review application that collects customer review data operates independently of the customer review sentiment analysis application. It is simply responsible for publishing the data. It is up to the sentiment analysis application to subscribe to the customer review topic, process the data stored in that topic, and generate and publish the inferences to the customer sentiment topic.
The customer review application does not need to know the internals of the sentiment analysis
application since it is no longer responsible for collecting the inferences and doing all the
downstream work. In fact, this diagram also demonstrates how other applications can be set up to
handle the downstream tasks. In this example, we have two applications that are subscribed to the
customer sentiment topic that the sentiment analysis application publishes data to. There is an
application that drives the customer support portal so that customer support personnel can triage
the reviews. The second application runs analytics on the customer reviews and model output and
creates an analytics portal that can be used for business intelligence purposes. Note that in this
scenario, the other two applications do not talk directly to the sentiment analysis application.
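In code, the sentiment analysis application reduces to a subscriber on one topic and a publisher on another. The following is a rough sketch using pika; the topic names and the predict_sentiment placeholder are assumptions for illustration.

import json
import pika

def predict_sentiment(text):
    # Placeholder model call; a real application would invoke the
    # sentiment analysis model here.
    return "negative" if "fail" in text.lower() else "positive"

def handle_review(ch, method, properties, body):
    review = json.loads(body)
    result = {"review_id": review["review_id"],
              "sentiment": predict_sentiment(review["text"])}
    # Publish the inference to its own topic; the support portal and the
    # analytics application subscribe to this topic independently.
    ch.basic_publish(exchange="", routing_key="customer_sentiment",
                     body=json.dumps(result))
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
for topic in ("customer_reviews", "customer_sentiment"):
    channel.queue_declare(queue=topic, durable=True,
                          arguments={"x-queue-type": "stream"})
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="customer_reviews", on_message_callback=handle_review)
channel.start_consuming()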
Another benefit of this architecture is that subscribers of a topic can consume the data at whatever interval they choose. They can process events in real time if their SLA requires immediate processing, or ingest the events later if there is no need to collect them in real time.
Event-driven architecture also enables better application maintenance. Since these applications
are decoupled, each team can choose its own maintenance cycle and not have to be concerned
about their code changes affecting other teams.
Event-driven applications also solve the problem of data silos by enabling democratization of data access. In the synchronous paradigm, the consuming application has to follow the API specification of the application that produces the data. If applications only need to subscribe to a topic to receive data, and do not need to worry about how to call the data producer, there is far more flexibility in designing the consuming application, which can even be written in a different programming language from the data producer’s. This opens up the possibility for multiple teams within the organization, with different backgrounds, to use the data as they see fit for their applications, unlocking more value streams for the organization.
Event-driven architecture also solves another problem that hinders access to data and slows
down development of machine learning applications: the fear that inexperienced team members
may accidentally delete the data. Unlike traditional relational databases, where data can be
updated or deleted, events are immutable and are simply appended to a topic. It’s akin to providing
read-only access to the data. It’s up to the consuming application to process the events based on
their use case. Furthermore, if certain data elements are sensitive and need to be restricted,
applications can be created that mask or remove sensitive elements and publish the events to
another topic that can be used by anyone within the organization. Achieving the same thing in a relational database, in contrast, would require complex row- or column-based access policies that need continual updating, or would force the data owners to create and maintain multiple copies of the same data source to serve multiple users.
A downstream benefit of this feature is that it becomes a lot simpler to compare different model
versions. Since the data is always available, any analysis that needs to be made after the fact
becomes easier. Examples include comparing model parameters, configurations, evaluation
metrics, data distributions, relationships between the features and the target, feature importances,
etc. We can take this one step further and run comparisons against the training data and the data
used for inferences to troubleshoot issues in production. In fact, if the features used for training
and the features used for inferences are stored within the same topic, the same code can be used
to ingest data for training and inference, which reduces the time it takes to deploy a machine
learning application.
Capturing and analyzing application health metrics such as CPU and memory usage also
becomes easier when they are published as events that are stored in relevant topics. This opens up
the possibility to get real-time alerts when unexpected issues arise.
Another component of machine learning applications is collecting user feedback for model
evaluation and retraining purposes. Figure 2.13 demonstrates an example where a front end
application subscribes to the inferences topic and displays the model’s results. Let’s assume it is an
application that contains all the negative customer reviews that customer service associates need to
triage. Next to each result are buttons labeled “correct” or “incorrect” that the customer service
associate can optionally click to provide feedback. This feedback can then be captured in a “feedback” topic and used both to evaluate model performance and as data for retraining.
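A sketch of such a feedback handler might look like the following; the topic name, event fields, and button wiring are assumptions for illustration.

import json
import time
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="feedback", durable=True,
                      arguments={"x-queue-type": "stream"})

def on_feedback_click(review_id, label):
    # label is "correct" or "incorrect", matching the buttons described above.
    event = {"review_id": review_id, "label": label, "timestamp": int(time.time())}
    channel.basic_publish(exchange="", routing_key="feedback",
                          body=json.dumps(event))

on_feedback_click(review_id=42, label="correct")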
The examples above demonstrate how event-driven architecture makes it easier to build
responsive, low-latency real-time applications that use batch machine learning models.
It is not a stretch to make the case that event-driven architecture is also a critical component of
building applications that use real-time machine learning models. By design, real-time machine
learning models make a prediction first, and then learn when the label is available. In a real-time
context, there is no control over when labels will become available. Consider an example where a machine learning model decides which set of articles to present to a given user. The label for this use case is whether the user found an article relevant, and we determine relevance by whether the user clicks on the article. The user click is an “event” that can fire off a process that stores the click event, along with the article, to a “labels” topic. We can
have a subscriber on the other end that passes this information to the machine learning model. We
can also have another application that monitors the time the article is on the user page and
publishes a “not-relevant” event to the “labels” topic if the user hasn’t taken any action on the article
after a certain amount of time. All of this is presented in Figure 2.14 below.
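A rough sketch of the two label producers described above might look like the following; the topic name, event fields, and timeout trigger are assumptions for illustration.

import json
import time
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="labels", durable=True,
                      arguments={"x-queue-type": "stream"})

def publish_label(article_id, user_id, relevant):
    event = {
        "article_id": article_id,
        "user_id": user_id,
        "label": "relevant" if relevant else "not-relevant",
        "timestamp": int(time.time()),
    }
    channel.basic_publish(exchange="", routing_key="labels",
                          body=json.dumps(event))

# Fired by the click handler when the user opens an article.
def on_article_click(article_id, user_id):
    publish_label(article_id, user_id, relevant=True)

# Fired by a monitoring process if the user takes no action on the
# article within a chosen time window.
def on_article_timeout(article_id, user_id):
    publish_label(article_id, user_id, relevant=False)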
As the examples above demonstrate, the traditional software architecture pattern used to build
machine learning applications involves tight coupling between microservices that is not conducive
to real-time applications. Event-driven architecture, on the other hand, provides the ability to build
low latency, responsive machine learning applications that seamlessly coordinate various tasks such
as data ingestion, inference generation, metrics collection, and user feedback.
In this chapter, you have learned how to ingest data from event streams using message queuing
solutions. You have also learned how to create publishers that produce real-time data and
subscribers that consume real-time data. In Chapter 3, we will focus on how to use the data from
event streams to build a nowcasting model that not only generates inferences in real-time but also
continually learns from data as soon as it is available.
2.4 Summary
Real-time machine learning requires access to both historical and real-time data.
An event is a record that something happened at a specific time.
Polling intervals affect the quantity of real-time data that is captured for
machine learning.
Message queues enable simple asynchronous communication between
different processes.
A publisher or producer is a process that produces events.
A subscriber or consumer is a process that subscribes to events.
Synchronous data transfer requires direct coordination between two parties.
Asynchronous data transfer requires a buffer to store published data.
Backpressure is caused by a subscriber processing events more slowly than a
publisher, which reduces overall data throughput.
At-least-once delivery ensures that every message is delivered at least once, but
duplicate messages can be sent.
At-most-once delivery prevents duplicates, but messages may get dropped.
Event streams are append-only ordered logs of events and are sometimes called
topics.
Event brokers coordinate event delivery between many publishers and
subscribers by persisting data to event streams.