
MEAP Edition

Manning Early Access Program

Real-Time Machine Learning


Version 1

Copyright 2025 Manning Publications

For more information on this and other Manning titles go to manning.com.



welcome
Thank you for purchasing the MEAP for Real-Time Machine Learning.
We now live in a world where we spend far more time online thanks to technological
advancements such as the internet and smartphones. Traditional media such as printed
newspapers and DVDs have been replaced by social media and video streaming services. We make
many more purchases online and participate in the service economy through apps like Uber and
DoorDash.
The above highlights an increased need for timeliness in service delivery. This has spillover
effects on machine learning, which is often a critical component of real-time applications. In a
growing number of use cases, generating accurate inferences in real time has become a
necessity.
This book introduces the concept of real-time machine learning, also known as online learning.
Real-time machine learning uses models that are able to learn incrementally on data as soon as it
arrives in a system without the need to accumulate a batch of data before training a model. These
types of models are better able to adapt to changing characteristics in data to produce accurate
inferences. For example, Netflix used to train its machine learning models daily to serve
recommendations to its users. Netflix switched to using real-time machine learning models, which
resulted in more timely and relevant recommendations and consequently kept users more
engaged.
While real-time machine learning has been around for some time, there are very few resources
available online to help machine learning engineers build solutions using this technology. We
decided to write this book to solve this problem. Our goals for writing this book are to teach you
how to train machine learning models on real-time data and how to build systems that generate
real-time predictions from this data.
We take a practical, hands-on approach with this book. We first help you develop an
understanding of real-time data from a practical perspective. Next, we walk you through how to
ingest real-time data using an event-driven architecture. We will then combine real-time data
ingestion with online model training and inference to build adaptive and timely machine learning
applications. Specifically, you will build a nowcasting model, an anomaly detector, a real-time
recommender system, and a language model that adapts to real-time feedback. Along the way, you
will learn how to measure and monitor the performance of various real-time machine learning
models.



We are very excited to have you read this book and can’t wait to see what you are able to build
using the knowledge that you gain from this book. Please be sure to post any questions, comments,
or suggestions you have about the book in the liveBook discussion forum. We want you to get the
most out of this book and so we appreciate any feedback you are able to provide to us.

—Patrick Deziel and Prema Roman



brief contents
CHAPTERS

1 Introduction to real-time machine learning


2 Data ingestion for real-time machine learning
3 Implementing a nowcasting model
4 Detecting anomalies in real-time data
5 Creating a responsive recommendation system
6 Improving a language model with real-time feedback
7 Maintaining real-time machine learning models


1 Introduction to real-time machine learning

This chapter covers


What is real-time data?
Offline learning vs. online learning
Common use cases for online learning

Real-time machine learning (sometimes referred to as online learning) is an approach which uses
real-time data to build predictive systems that adapt to changes in an environment. This is different
from batch-oriented machine learning (or offline learning), in which historical datasets are carefully
curated for training and evaluation. The fundamental assumption in offline learning is that there is
some ground truth in the input features that remains stable while models are in production. In
reality, the statistical properties of the data, such as probability distributions or relationships
between features, are likely to change over time. This shift, known as data drift, can reduce a
machine learning model’s accuracy because the model was originally trained on data with different
statistical characteristics. Offline models must be retrained routinely to avoid this degradation in accuracy.
However, retraining these models is often both expensive and time consuming since they require
iterating over large datasets many times. By the time these models are deployed to production they
may be operating on data assumptions that are no longer true. In other words, offline models
cannot adapt to the data changes that occur in real-world environments.


Real-time machine learning models are designed to learn on-the-fly by continuously receiving
data from the environment. Instead of being trained on large datasets, real-time machine learning
models are trained in an incremental fashion as new data becomes available. The idea of online
learning is not new, but it has increasing relevance for modern, data-driven organizations. As the
world becomes more real-time, applications such as fraud detection and user recommendations
will need to incorporate more real-time learning techniques to keep up with quickly changing data
trends.
The goal of this book is to give you the reasoning and tools to start building practical real-time
machine learning systems. We will accomplish this by building a few proof-of-concept online
learning systems from scratch using Python. You will be able to run all the examples in this book on
a single machine.
In this chapter, we will explore what “real-time” data means with respect to machine learning,
compare offline and online machine learning, and introduce some of the most common use cases
for online learning.

1.1 What is real-time data?


Think for a minute about all the data you consume on a normal day. Your average day might consist
of scrolling through posts on social media, checking emails, listening to the news on your commute,
hearing bits and pieces of conversation while shopping, and watching TV shows on streaming
services. Of that data, which is historical and which is truly real-time? One way to define the
distinction is in terms of the perceived delay. Received emails are stored in an “inbox” until the
recipient has an opportunity to read them, so most of the time they are historical instances of data.
In contrast, conference calls require low latency interaction so we would generally consider them
real-time.
In reality, video conferencing and other network-enabled applications commonly involve data
transmission and processing delays on the order of several milliseconds. The minimum latency
bound for fiber-optic transmission between opposite sides of the Earth is L = distance / speed =
20,000 km / 205,000 km/s ≈ 98 ms.[1] If you are in Minneapolis, MN and want to send a message to
your friend in Sydney, Australia, it would take 14,480 km / 205,000 km/s ≈ 71 ms, and another ~71 ms
to receive a message back. Practically, this is mitigated with regional servers or content delivery
networks (CDNs). These strategies usually increase the amount of data replication at the cost of
consistency. When designing real-time applications, engineers must determine how much latency is
acceptable for the target use case.
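
As a quick check on the arithmetic, the one-way latency bound can be computed directly. This is a
minimal sketch in Python; the distances and refractive index are the approximate values used above.

# One-way fiber latency bound: time = distance / (c / n)
C_KM_PER_S = 300_000        # speed of light in a vacuum, km/s
REFRACTIVE_INDEX = 1.46     # typical single-mode fiber

def one_way_latency_ms(distance_km: float) -> float:
    fiber_speed_km_per_s = C_KM_PER_S / REFRACTIVE_INDEX  # ~205,000 km/s
    return distance_km / fiber_speed_km_per_s * 1000

print(one_way_latency_ms(20_000))  # antipodal bound: ~97 ms
print(one_way_latency_ms(14_480))  # Minneapolis to Sydney: ~71 ms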
Since there is a non-zero amount of latency in every real-time application, true real-time data is
an ideal that is unachievable in production environments. In this book, we define real-time data as
data in motion. This means that data can be seconds or minutes old and still considered to be real-
time as long as it is still useful to the recipient. Depending on the use case, the recipient can be one
of several entities.

A stream processor that performs some transformation on the data
An inference engine that accepts data instances and produces a prediction
A live dashboard that renders data for humans


An important corollary to this definition is that all data originates as real-time data. As soon as it
is persisted into any form of storage, such as a file system, data lake, or database, for later
retrieval, it ceases to be real-time. In this book we refer to this as data at rest or historical data. The
major difference between real-time and historical data is that real-time data has some application
value related to its timeliness.
Consider the real-time application outlined by the following figure. Emails are received by an
email server and used to train an online spam classification model. The online model determines if
emails are spam or not spam in real-time. Spam emails are immediately discarded, and non-spam
emails are persisted to a database. The email client has the ability to process incoming emails as
notifications and retrieve past emails from the database. Both historical and real-time data streams
are present in this application. Incoming emails are treated as real-time data while they are being
processed until they are persisted to the database. When emails are loaded from the database they
are considered historical pieces of data.

Figure 1.1 An email notification system with real-time and historical data streams

In short, historical data is used for offline learning approaches where all the data is known at once.
Real-time data is used for online learning approaches where learning happens incrementally as soon
as new data is available. These terms also describe where the learning happens. Offline learning
happens in development or staging environments. Online learning happens directly in production.
In the next sections we will discuss these two approaches.

1.2 Offline learning


Traditional machine learning involves collecting a large batch of data and performing a number of
training steps to fit a model to the data. This approach is referred to as offline learning because
model training occurs in an environment completely independent from production and uses
historical data to model the future. The offline machine learning cycle is shown in Figure 1.2.


Figure 1.2 Training a model offline and using it for predictions in production

Offline learning starts with a data ingestion phase. Data is loaded from one or more data sources
and serialized into a format for persistent storage. Data ingested for offline learning is often
unstructured, meaning that there is not a pre-defined schema. Text is a common form of
unstructured data because it is widely accessible from the internet and proprietary documents and
is easily extractable from file formats like PDF and HTML. Preprocessing steps are usually applied to
transform the data into a more normalized, numerical format for machine learning. For example,
converting text to vector embeddings or parsing numerical or categorical values from strings.
Data ingestion is followed by a feature engineering phase where the goal is to discover the set of
features that correlate with the output labels. Feature engineering is an iterative process and is
really more of an art than a science. It typically involves both statistical and visual analysis and
intuition about the problem to identify features with predictive power.
In the training phase the dataset is split into distinct training, test, and validation sets. These sets
have the same data schema to ensure that the machine learning model can be used for both
training and inference. The number of data points required to fit a model depends on its
complexity. For example, linear models attempt to find a linear relationship between the features
and labels so the search space is directly dependent on the number of features. More complex
models such as artificial neural networks can have a much larger number of trainable parameters.
Increasing the number of features can improve the predictability of the model but it also increases
the number of data points required to sufficiently describe the input space. This is colloquially
known as the curse of dimensionality.


Machine learning is framed as an optimization problem where the goal is to find the optimal set
of parameters p that minimize the loss against the training set. The predicted outputs can be
computed from any parameter set. This means that a loss function L can be defined in terms of p
that represents the total training loss. This is also referred to as the objective function because it sets
the objective for the problem: to minimize the loss. For example, squared error loss is quantified by
adding up the squares of the individual prediction errors as shown in Formula 1.1.

L(p) = \sum_{i=1}^{m} (\hat{y}_i(p) - y_i)^2

Formula 1.1 Squared error loss for a parameter set p and number of data points m.

Depending on the complexity of the model and the number of input features, models can have
anywhere from a few to hundreds of parameters that can be adjusted to minimize the loss. With
models that have more than a few parameters, it becomes important to be able to efficiently search
the parameter space. A common methodology is gradient descent, which computes the vector of loss
derivatives (the gradient) with respect to the trainable parameters and adjusts the parameters in the
opposite direction of that vector. At each training step the parameters are updated using the
following formula.

p_{t+1} = p_t - \eta \nabla L(p_t)

Formula 1.2 Updating model parameters using gradient descent, where \eta is the learning rate.
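
To make the update rule concrete, here is a minimal NumPy sketch of one gradient descent step for
the squared error loss in Formula 1.1, applied to a simple linear model. The data and learning rate
are illustrative, not from the book.

import numpy as np

def gradient_descent_step(p, X, y, learning_rate=0.01):
    # One application of Formula 1.2 for a linear model y_hat = X @ p
    y_hat = X @ p                        # predictions from the current parameters
    gradient = 2 * X.T @ (y_hat - y)     # gradient of the squared error loss
    return p - learning_rate * gradient  # step in the opposite direction

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])
p = np.zeros(2)
for _ in range(100):
    p = gradient_descent_step(p, X, y)
print(p)  # approaches the least-squares solution [1.0, 2.0]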

The optimization method is a significant choice because it affects how quickly the model will
converge to an optimal solution, or whether a solution will be found at all. Modern machine learning libraries like
scikit-learn provide useful abstractions for training various classes of models given a training set.
Listing 1.1 provides an example of training a LogisticRegression model on the classic iris sample
dataset using scikit-learn and using the model to predict class labels based on numerical input
features.


Listing 1.1 Supervised Offline Learning in Python

from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris() #A
X = iris.data #A
y = iris.target #A
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
    random_state=42) #A

model = LogisticRegression(random_state=42).fit(X_train, y_train) #B

y_pred = model.predict(X_test) #C
print("Predicted values: ", y_pred) #C
print("True values: ", y_test) #C
print("Accuracy: ", accuracy_score(y_test, y_pred)) #C

#A Loading the sample iris dataset
#B Training a logistic regression model on the training set
#C Evaluating the model against the test set

Here is the output of running this code.

Predicted values: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
True values: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Accuracy: 1.0

The evaluation phase involves loading a model into a staging environment and extensively testing it.
This is a common failure point in the development of machine learning applications. The model
might perform poorly on out-of-sample data or might produce results that are unexpected or
inadequate to stakeholders. The former is usually a data issue. The model might be overfit due to a
high amount of variance in the training set, or underfit due to a low amount of variance and not
generalize well to data points from the real world. The latter issue can stem from a lack of
transparency and communication between engineering teams and stakeholders.
The deployment phase is the technical task of pushing the model into production so that end
users have access to it. Successful deployments include mechanisms to capture key performance
indicators or human feedback from users. Deployment marks the end of the offline learning cycle.
The offline learning cycle is similar to the software development life cycle (SDLC) in that the process
is iterative. Models must be retrained routinely in order to keep up with business objectives.


The offline learning workflow is fully synchronous, where each phase of the workflow requires
the previous phase to be completed. For example, model training cannot occur until all the training
data is ingested because offline models generally require access to all the data points at once. This
limits the speed to deployment for new models. The speed to deployment is also limited by model
training time. At each training phase, models are retrained from scratch. The training time is a
function of model complexity and the training set size. Since one way to improve the accuracy of offline
models is to increase the training set size, the training time typically increases linearly with each
cycle.
While offline learning is a natural starting point for organizations that want to use historical data
to solve current problems, it also requires a persistent effort to maintain these systems. For use
cases where underlying data trends change more rapidly, a reactive learning approach is required
to ensure that trained models remain relevant and useful. In the next section, we introduce the
concept of online learning to address these issues.

1.3 Online learning


Online learning is a real-time learning approach that addresses the rapidly changing state of data.
Instead of learning on large batches of data, as is the case with offline learning, online learning is an
incremental approach where learning happens using mini-batches or single instances of data. While
the offline learning workflow is synchronous, as described earlier, the online learning workflow is
asynchronous. Each component of the workflow, from data ingestion to feature engineering to
training to generating predictions is decoupled from the other components. Online learning is
sometimes referred to as incremental learning or real-time learning to avoid confusion with internet-
based learning that humans participate in.

1.3.1 The model drift problem


While the original online learning systems were built out of necessity, contemporary online learning
approaches are designed to solve a specific data problem known as model drift. In general, model
drift is the result of models becoming out of date. Model drift can be mitigated by collecting data
and retraining models at a faster rate. However, retraining batch models can get expensive as the
dataset grows. Online learning addresses the drift problem by decreasing the retraining time as
much as possible. Instead of collecting large batches of data offline, newly received data points are
used to immediately update the model. This results in a model which is able to more quickly adapt
to shifts in the data distributions and respond to external changes.


Model drift is the phenomenon where machine learning models become less predictive over
time. This is usually because the model is trained on data assumptions that no longer hold true.
When training machine learning models we select a set of features from the population X that is
likely to be predictive of the output labels Y. For example, in the iris flower classification problem,
the set of features X consists of petal length, petal width, sepal length, and sepal width, and the output
label Y is the species of the iris flower: setosa, virginica, or versicolor. We don’t know all the
possible inputs and outputs but we can describe the relationship between them in terms of
probability. If P(X) represents the probability distribution of inputs features and P(Y) represents the
distribution of output labels then it’s possible to express the probability of a set of features and
labels appearing together as the joint probability P(X,Y) = P(X)P(Y|X). This implies there are two major
components to model drift: the distribution of input features P(X) and the probability of the labels
given the features P(Y|X).
Data drift or feature drift describes the case where the distribution of input features P(X) changes
over time. Consider a spam classifier that is trained to classify historical emails as spam or not spam
based on email length. In the initial training set, the length of emails varies between 40 and 60
characters. However, after the model is deployed to production it starts to observe much longer
emails. Since the model was not trained on many emails above 100 characters it may struggle to
accurately classify these longer emails. This scenario describes a shift in the feature distribution
causing the model to perform worse over time. The feature shift could be a result of failing to
include emails in the training set that properly represent reality. However, it could also be caused
by external factors such as changing marketing strategies. The figure below illustrates this feature
drift.

Figure 1.3 The average length of emails increases significantly after a model is deployed to production, an
indication of feature drift. This may cause the model to produce less accurate predictions because it was
trained on much shorter emails.
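
One simple way to flag this kind of feature drift is to compare the production distribution of a
feature against the training distribution, for example with a two-sample Kolmogorov-Smirnov test.
This is a minimal sketch; the email lengths are simulated.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_lengths = rng.normal(50, 5, size=1_000)   # training emails: roughly 40-60 characters
prod_lengths = rng.normal(120, 20, size=1_000)  # production emails: much longer

statistic, p_value = ks_2samp(train_lengths, prod_lengths)
if p_value < 0.01:
    print(f"Feature drift detected (KS statistic = {statistic:.2f})")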


Concept drift describes the changing relationship between input features and output labels over
time. In mathematical terms this is how P(Y|X) changes over time. Batch models perform well if
P(Y|X) remains static. However, in the real world the outputs we are trying to predict are subject to
external factors. This can be due to transient events like natural disasters or outages or caused by
long term trends such as consumer behavior and market shifts. An example of this is the 2020
coronavirus pandemic, which increased consumer demand for specific products such as hand
sanitizer and immunity vitamins (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9111418/). This
likely affected the accuracy of sales projection models since the relationship between data features
and consumer purchases changed overnight. Models trained before 2020 may assume that hand
sanitizer sales remain fairly constant aside from seasonal trends such as the beginning of the
school year. These models have to be retrained because the learned relationships no longer hold
true.
The difference between feature drift and concept drift is illustrated in the figure below. A
machine learning model is fit to an initial set of data points and learns a decision boundary to
distinguish between two output classes. Feature drift causes the data points in the feature space to
change which may impact the accuracy of the model. Concept drift changes the decision boundary
itself which has greater potential to invalidate the model. Note that this doesn’t necessarily mean
that the features have drifted. For example, a model that uses the weather to predict retail sales
would be affected by a global lockdown because consumers are staying home even though the
weather is nice.

Figure 1.4 Feature drift changes the distribution of features, while concept drift changes the problem’s
decision boundary


1.3.2 The online learning cycle


Online learning starts by receiving a new instance of data from a real-time stream. The data
instance is analogous to a single sample in a historical dataset; it contains a set of features observed
from the real world. A model is instantiated with little to no knowledge about the characteristics of
the data. A prediction is computed using this model and forwarded to downstream parts of the
application. When the label arrives in the system, it is fed to the model and the model learns from
the label and updates its parameters. Note at this point there are two versions of the output. One is
the predicted value computed by the online model. The other is the ground truth value observed
from the real world. The idea is to teach the model using the loss computed from the predicted
value and the observed value.
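
This predict-then-learn cycle can be sketched with scikit-learn's partial_fit, which updates a model
incrementally from each new sample. A minimal sketch; the stream of (features, label) pairs stands
in for a real-time source.

import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # partial_fit requires all labels to be declared up front

def predict_then_learn(stream):
    # stream yields one (features, label) pair at a time
    for features, label in stream:
        x = np.asarray(features).reshape(1, -1)
        try:
            print("prediction:", model.predict(x)[0])   # 1. predict on the new instance
        except NotFittedError:
            pass  # no prediction until the model has seen at least one sample
        model.partial_fit(x, [label], classes=classes)  # 2. learn once the label arrives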
Figure 1.5 demonstrates an example of an online learning workflow. It reflects an order of
operations, i.e., the prediction is made first on the sample and then the model is trained on the
same sample after the label is available. Note, however, that unlike the offline machine learning
workflow, these operations are asynchronous due to the nature of working with real-time data. We
do not know ahead of time when we will receive the label, and therefore, these operations will
leverage stream processing engines that are more suited for real-time data processing. We will
explore real-time data processing in Chapter 2.

Figure 1.5 Training a model “online” while using it for predictions

In the previous section we described how an offline model is trained by finding the set of
parameters that minimize the objective function. Unlike in offline learning, the full dataset is not
available at training time, which makes it impossible to compute the total loss. However, if the loss
function is additive, such as the sum of squared errors, it can be decomposed into individual terms.
This means that the gradient correction can be computed for individual data samples using the
adjusted formula below. The difference is that the loss function is computed for only the new
sample instead of the entire dataset.

p_{t+1} = p_t - \eta \nabla L_i(p_t)

Formula 1.3 Computing the incremental gradient from a single data sample i and updating the model parameters


This is the basis of the incremental gradient descent approach. The coefficient η controls how
much the parameters are affected by each new sample, so it is thought of as the learning rate. A high learning rate is
desirable for quickly changing environments because the model will adapt faster to changes in the
data. A low learning rate will approximate the batch version of the model but it will be less useful
for short-term predictions because the model will adapt more slowly.
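
Formula 1.3 translates into a few lines of NumPy. This is a minimal sketch of a single-sample update
for a linear model with squared error loss, where learning_rate plays the role of the coefficient η;
the data is illustrative.

import numpy as np

def incremental_update(p, x, y, learning_rate=0.1):
    # One application of Formula 1.3 using only the newest sample
    y_hat = x @ p                   # prediction for the new sample
    gradient = 2 * (y_hat - y) * x  # gradient of the single-sample squared error
    return p - learning_rate * gradient

p = np.zeros(2)
for x, y in [(np.array([1.0, 2.0]), 5.0), (np.array([2.0, 1.0]), 4.0)]:
    p = incremental_update(p, x, y)  # each new sample nudges the parameters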
Given that online learning is functionally an approximation of offline learning, online models will
generally produce less accurate predictions than offline models that have the benefit of knowing
the entire dataset beforehand. The major benefit of online models is that they are able to react to
short-term trends and produce near-term predictions. Time-sensitive use cases such as inventory
prediction or load balancing may benefit from being able to learn patterns in real-time. Conversely,
batch models are preferable in situations where the distribution of the data is not likely to change
over time.

1.3.3 Offline vs. online learning


As discussed in section 1.3.2, online learning is most effective when our underlying assumptions
about data are likely to change over time. To illustrate this, imagine you are developing a system to
predict daily visitors to your website. The number of daily visitors can be represented by N = μ + v,
where μ represents the average daily visitors and v represents the random daily variation in
visitors. Since μ is dependent on external factors, it is likely to change due to seasonal trends, viral
internet events, and a general increase in popularity. Such a data distribution is referred to as non-
stationary because its statistical properties such as means, variances, and covariances change over
time. Real world examples of non-stationary distributions include the weather and stock markets
because they are subject to both seasonal and long-term trends. The figure below shows the
difference between a stationary and non-stationary distribution.

Figure 1.6 A stationary distribution of visits (left) shows periodic variance but the average visitors does not
change. A non-stationary distribution (right) shows a gradual increase in visitors over time.
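
Series like those in Figure 1.6 can be simulated in a few lines; the numbers here are illustrative.

import numpy as np

rng = np.random.default_rng(0)
days = np.arange(365)
stationary = 1000 + rng.normal(0, 50, size=days.size)                 # mu stays fixed
non_stationary = 1000 + 2 * days + rng.normal(0, 50, size=days.size)  # mu drifts upward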


Machine learning models trained on a historical time series of daily visit counts will be affected by
the general increase in visitors over time. Non-stationarity can be corrected with data
transformations or by modeling it directly, for example by assuming the increase in the average
visit count is linear and subtracting it out before offline modeling. Online models account for non-
stationarity by learning incrementally. Offline models are better suited for tasks where the data is
stationary with respect to time. For example, an image classifier trained to distinguish dogs from
cats is more easily trained offline because the classes of images will not change over time.
Online learning is somewhat analogous to the rolling mean in statistics. The traditional mean is
simply an average of all the samples in a set, but the rolling or moving average is an average over
the last n samples. The moving average intentionally “forgets” older information in order to get a
more recent snapshot of the data. Online models function similarly to this because the model
parameters are adjusted for each new data sample. The forgetting effect is sometimes desired
because it acts like a form of regularization to avoid overfitting to the data. For example, you might
want a news-based model to forget specific knowledge it has learned during transient periods like
election cycles. Some models such as deep neural networks are vulnerable to “catastrophic
forgetting” in which important knowledge like grammar rules that was previously encoded into the
model is lost. This can be mitigated by reducing the number of weights that are updated on each
training run.
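
The rolling-mean analogy can be made concrete. This is a minimal sketch of an exponentially
weighted mean, which keeps a single running value and "forgets" older samples at a rate set by alpha.

def exponential_moving_average(stream, alpha=0.1):
    # Larger alpha forgets old samples faster, adapting more quickly to recent data
    ema = None
    for x in stream:
        ema = x if ema is None else alpha * x + (1 - alpha) * ema
        yield ema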
When assessing whether an online learning approach is appropriate for solving a problem, it’s
important to consider operational factors such as time to deployment, observability, and
maintenance cost. Each phase of the offline learning cycle takes a considerable amount of
engineering time and knowledge across several disciplines. Online models are a lot easier to get
into production since the development workflow is identical to the production workflow. Because
these models make predictions as a part of the learning process they are compatible with staging or
production environments. In contrast, batch models require a separate pipeline to produce
predictions in production.
A unique aspect of online models is that they learn while in production. This makes them mostly
autonomous, which removes the need to manually retrain and evaluate them offline. This provides
several advantages over offline models.

There is no need to split the data into training and test sets. The same data gets
used to first make a prediction (test data) and then once the label is available, is
used to train the model (training data)
There is no need to worry about data leakage (accidentally using the test dataset
to train a model)
Retraining a model from scratch periodically is no longer necessary since the
model is continually learning
Reduced memory footprint since a real-time model has to learn from only one
example at a time


One downside of online models is that they require constant monitoring to ensure they are
producing accurate results. Unlike offline learning, online models aren’t guaranteed to produce
similar quality results compared to extensively trained offline models. Traditionally, machine
learning has involved an experiment-oriented, iterative process using batches of data to train a
model. Not only is it easier to conduct multiple tests before deploying a model into production, but
a lot of work is also done to clean up and standardize the data before it is fed to a model. With online
learning, some exploratory data analysis can and should be conducted to understand the
characteristics of the data and to decide on the features to use to train an online model. However,
there is little control in ensuring data quality due to the real-time nature of data arriving in the
system. As a consequence, much more time needs to be spent putting guardrails into the
system and coming up with mitigation strategies should things go wrong. There is a trade-off
between model adaptability and model accuracy. For high-risk use cases, it is preferable to build
models offline to ensure they are properly validated and tested before deploying to production.
Since online learning involves frequent updates, it is a natural fit for human feedback
mechanisms. Batch models that operate on human feedback require specific retraining before
users of the model will see improvements. Because user preferences can change more frequently
than batch models are retrained, online learning is more compatible with recommender systems.
We will discuss this in more detail in the next section.
Table 1.1 presents the pros and cons of online learning. These are points developers must take
into consideration when embarking on a new project that will use online learning.

Table 1.1 Pros and cons of online learning

Pros:
- Suitable for use cases where assumptions about data are likely to change over time
- Easier to deploy because of the similarities between the development and production workflows
- Online models learn while in production, which means that retraining workflows are not needed

Cons:
- Subject to catastrophic forgetting: a phenomenon where a machine learning model forgets important historical knowledge
- Need to be constantly monitored to ensure the model is producing accurate results
- Require more guardrails and mitigation strategies since it is not possible to inspect data prior to using it to train an online model

One roadblock preventing widespread adoption of online models is that they require stream-oriented
architectures. Traditionally, extract-transform-load (ETL) describes the process of sourcing
raw data, applying transformations to standardize and normalize data into a retrievable state and
loading it into a centralized store. Such data pipelines make offline learning more consistent and
repeatable. Online learning requires an analogous data methodology called stream-transform-load
(STL). We will discuss its implementation details and unique challenges in chapter 2.


1.4 Use cases for real-time machine learning


We are now living in a world where we spend far more time online due to technological
advancements such as the internet, smartphones, and social media. For example, while the bulk of
retail shopping is still done in-store, online shopping as a share of total sales has been rising year
after year. According to the Census Bureau, online sales as a percentage of total retail sales
steadily rose from 0.6 percent in the fourth quarter of 1999 to 15.9 percent in the first quarter
of 2024 (https://www.census.gov/retail/ecommerce.html).
Fewer of us read physical newspapers and instead read or watch news online either through the
traditional media services such as CNN and Fox News or through social media. We also consume
entertainment from services such as Netflix, YouTube, and TikTok.
Because of the shift in our consumption patterns, more and more data is being generated
through our activities online, and companies are trying to make sense of this data in order to sell us
products and services.
There have been technological advances in machine learning, as discussed previously, to
address the explosion of data, namely algorithms that can run on distributed systems like Apache
Spark. However, the volume of data only continues to grow and due to the speed of information
flow on the internet, there are additional complications such as rapid changes in data
characteristics. Here are a few situations where solutions built using traditional machine learning
tend to fail:
When the statistical distribution of the data used to train a machine learning
model differs significantly from the distribution of the data encountered in
production.
When there is insufficient/non-existent data to train a machine learning model,
e.g., new or infrequent visitors to a website.
When the characteristics of the data change frequently and models go stale as
soon as they are deployed, e.g., user video watching preferences.

As the examples above illustrate, the biggest challenge with traditional machine learning is that
when a model is trained, its parameters and weights are “frozen”. We have already discussed how
model drift can cause models to lose accuracy over time. Retraining models requires training from
scratch on a whole new batch of data, which can be cost prohibitive, slow, and error-prone.
Shifting towards online modeling approaches can help alleviate these challenges. Unlike their
batch counterparts, real-time machine learning models have the ability to continually learn on new
data, i.e., their weights and parameters are updated with every new instance of data. The result is
that such models are able to adapt more quickly and require a lower memory footprint since they
learn from one record at a time. Let’s now take a look at a few use cases that are better suited for
real-time machine learning.


1.4.1 Recommender Systems


Recommender systems have been around for a long time. In fact, Amazon popularized the concept
when it used recommender systems to help its customers make purchases through its online store
over two decades ago.
There are many ways to build recommender systems. You can build a collaborative filter that
uses similarities in user profiles to make recommendations or model a user’s behavior over time
using a time series model.
However, consider this: imagine you are a user who likes to watch videos on YouTube. For a
week, you are predominantly interested in watching cat videos, an interest that you developed
because a number of your friends were sharing cute videos on a regular basis. But let’s say that you
just landed your first job as a machine learning engineer and you now want to learn as much as you
can about the field. A model that was built using either a collaborative filter or a time series model
would likely keep recommending cat videos even though you have shifted to watching videos on
machine learning. This is because neither of these models has any knowledge of your shift in
preferences.
Similarly, imagine if you are a new user on YouTube. What videos should YouTube recommend
to you? It has no history about your preferences. It turns out that there are ways to use user
interactions as signals to build recommender systems that are more personalized. Data such as
videos users watched, liked, or shared along with comments users posted are all useful features
that can be fed to a real-time machine learning model to generate a list of recommended videos on
the fly. As users change their behaviors, the models are also able to react accordingly and update
their recommendations. TikTok has become successful in large part due to its ability to
understand user preferences within a few hours of use
(https://newsroom.tiktok.com/en-us/how-tiktok-recommends-videos-for-you). In the case of users who have never visited the website before,
a list of videos based on popularity or other metrics can be used as a starting point and then the
models can learn how to update the list based on user interactions.
In addition, because real-time machine learning models are more responsive, they are better
suited to adapt to sudden changes or spikes in user activity. Imagine a news event that went viral
and everyone is interested in watching content related to this news event. This is a real-time event
that a machine learning model trained on batch data would have no information about and
therefore, content related to this event would not show up at the top of user feeds. However, a real-
time machine learning algorithm can quickly detect this and react accordingly. This ensures that the
recommendations are timely and effective, which in turn results in better user satisfaction and
engagement.
Real-time machine learning models are also better suited to take advantage of real-time
contextual information such as time of day, location, device type, and recent browser history as
features to develop better models. They can also be used by online platforms to conduct multiple
experiments simultaneously and use instantaneous user feedback such as clicks, likes, shares, and
time spent to compare the experiments and update recommendations.


1.4.2 Anomaly Detection


Unlike their batch equivalents, real-time anomaly detection algorithms are able to identify unusual
patterns in data as soon as they arrive. This feature is particularly useful in environments where
data is continuously arriving and evolving, such as network traffic data, IoT systems, financial
transactions, etc.
Let’s take financial transactions as an example. There has been a steady increase in the number
of financial transactions that are being conducted online. Fewer and fewer people deposit checks at
a local bank and pay cash for goods and services. In fact, many goods are being bought online
instead of at physical stores. This opens up all sorts of possibilities for fraudulent activities since
everything is being done behind a computer.
There have been a number of advancements made in this space specifically around using
anomaly detection algorithms to detect unusual transactions. For example, if someone were to
steal your credit card or were able to get access to your credit card information online, they can use
your credit card to make purchases. Anomaly detection algorithms have been quite successful in
noticing the differences in the characteristics of these purchases and flagging them for fraud.
However, patterns of fraud are constantly changing and fraudsters change their tactics as soon
as they find out that a strategy they used in the past is no longer working. Exacerbating all of this is
the sheer volume of data being collected, which strains batch learning models that must be
retrained ever more frequently to keep up. Fortunately, with
the advances in real-time machine learning, there are anomaly detection algorithms that are able to
incrementally learn on data as it arrives and not only have knowledge of user purchase history, but
are able to quickly adapt as the data coming in the stream evolves over time. This means that they
are able to react faster to outliers and can detect them before it is too late. In the case of financial
transactions, speed is key because detecting and addressing fraudulent transactions quickly
ensures that financial institutions are able to minimize losses.
In addition to fraud detection, real-time anomaly detection is also well suited for trading and
investing. The stock market experiences fluctuations in pricing in microseconds and trading models
trained on batches of data struggle to keep up with the changes in market conditions. Online
machine learning models have the capability of ingesting real-time data from multiple sources such
as news, economic indicators, as well as pricing data to optimize pricing and investing strategies
that reflect the latest trends.


In cybersecurity, real-time anomaly detection algorithms are better suited for analyzing network
traffic logs or system behavior patterns, because they can adapt their detection capabilities in real-
time. This is critical since threat actors are constantly evolving their tactics and techniques. An
anomaly detection model can continuously analyze millions of data packets to detect deviations
from normal behavior patterns and trigger alerts immediately, mitigating potential damage before
it escalates. In addition to network log monitoring, anomaly detection algorithms can be used to
monitor endpoint behavior, application usage patterns, and system configurations. This can be
taken one step further where these models are incorporated as part of a process to automate
cybersecurity operations. For example, a real-time intrusion detection system can not only send
alerts when suspicious behavior is detected, it can then automatically block suspicious IP addresses
or quarantine compromised devices in real-time, resulting in faster incident response times and
prevention of large scale attacks. In short, real-time machine learning can help level the playing field
between cybersecurity experts and cyber criminals.
In healthcare, anomaly detection algorithms can be used for real-time monitoring and early
detection. Many people now use wearable devices that monitor heart rate, physical activity, blood
pressure, glucose, and other health markers. These devices are generating streams of data in real-
time which can be processed and analyzed by real-time anomaly detection algorithms to detect
anomalies such as irregular vital signs, unexpected changes in patient condition, and deviations
from typical health patterns. These algorithms can continually adapt and evolve as they take in new
information from these devices.
In addition to monitoring health related data, anomaly detection algorithms can be used to
improve operational efficiencies at healthcare facilities by continually monitoring data such as
hospital admissions, bed occupancy rates, and drug use to optimize resource allocation, scheduling,
and patient care.
Real-time anomaly detection algorithms are particularly suited for monitoring IoT systems. With
an increase in the number of interconnected devices and sensors, it is critical that these systems
are functioning properly and optimally. It is possible to monitor metrics such as temperature,
humidity, and energy consumption to detect deviations from normal patterns and detect potential
malfunctions, intrusions, or degradations in performance. The real-time nature of these algorithms
also ensures that corrective action can be taken as quickly as possible to prevent system wide
catastrophic failures, extend equipment lifespan, and reduce downtime, resulting in improved
operational efficiency and minimized maintenance costs.
IoT devices often operate on resource-constrained edge computing environments with limited
processing power and bandwidth. Real-time models can be deployed at the edge to perform
lightweight anomaly detection tasks locally, reducing latency and conserving network bandwidth.
These models can detect anomalies rapidly without the need to connect to a centralized cloud
server. This decentralized approach enhances scalability, responsiveness, and efficiency in IoT
deployments, particularly in applications requiring real-time anomaly detection such as smart cities,
autonomous vehicles, or remote asset monitoring.


1.4.3 Reinforcement Learning


As mentioned previously, reinforcement learning is a branch of machine learning where a model
learns from its interactions with the environment. It uses the context from the environment to
decide on an action to take, and then uses the reward or loss associated with that action to adapt.
Reinforcement learning is particularly useful for complex decision-making tasks.
Another advantage of reinforcement learning is that there is no need for labeled data, which makes
these models more flexible and adaptable, especially in a real-time context. Their offline
counterparts, on the other hand, do not interact directly with the environment. Instead, they use
historical context and rewards to figure out the action to take. This approach has drawbacks, as
these models are unable to react to changing conditions in the environment, which can result in
errors in decision making.
In the area of finance, online reinforcement learning can be used for market making. In the stock
market, market makers are tasked with maintaining liquidity by quoting bid and ask prices.
Reinforcement learning algorithms can learn optimal market making strategies by modeling supply
and demand, price movements, and order flow. Reinforcement learning algorithms can learn from
the market and adjust bid-ask spreads dynamically to maximize profitability (reward) and to
minimize risk (loss).
Online reinforcement learning can also be used in algorithmic trading and portfolio
management. Reinforcement learning algorithms can learn from historical data as well as current
market conditions to optimize trading strategies. For example, the reward function can be the
cumulative returns of the portfolio over time and these algorithms can make buy and sell decisions
based on the actions that are most likely to maximize cumulative returns. Similarly, it can be used in
risk management where these algorithms can use data such as price fluctuations, economic
conditions such as interest rate changes, and geopolitical events to create optimal strategies to
hedge against potential risk to the portfolio.
Taking it a step further, reinforcement learning algorithms have been combined with human
feedback, a technique known as reinforcement learning from human feedback (RLHF). RLHF has
been around since about 2008, when it was primarily used in robotics. It was popularized recently
by the success of OpenAI's ChatGPT. Language models have gotten better over the years through
successive innovations, and progress accelerated with the invention of the transformer
architecture. However, in spite of the
impressive results, the models were not always able to generate good outputs. RLHF was used as a
technique to leverage human feedback to improve the quality of the outputs. The objective is to get
a model that takes in a text input and returns a scalar reward that represents the human
preference. These reward models can either be a fine-tuned language model or a language model
trained from scratch on human preference data. Now that we have a model that was trained on
text and a second model trained on human preference data, we can then use reinforcement
learning to fine-tune the original language model by using the reward model.


RLHF can also be used in other applications. One such application is detecting phishing and
social engineering schemes. Human feedback can be used to identify phishing emails that bypass
spam filters and other controls. Intelligence analysts can provide feedback on the characteristics of
phishing attempts such as suspicious URLs, misleading content, or impersonation tactics used by
cyber criminals. This feedback can be collected through a streaming application and be fed to a
reinforcement learning algorithm to ensure that it has the most up to date information to
effectively detect new phishing and social engineering attacks.
In healthcare, RLHF can streamline and enhance decision making by integrating human feedback
in decision support systems. Healthcare experts can provide feedback on the accuracy of
diagnostics, treatment recommendations, and patient management plans suggested by the model
and the model in turn can use the feedback to provide better recommendations.
In this chapter, you have developed an understanding of what real-time data is and how it can
be used to build online models that can both train and generate inferences on a continual basis as
soon as data arrives in the system. We have provided some use cases that are particularly suited for
real-time machine learning. In the next chapter, you will learn how to build real-time data
ingestion pipelines using an event-driven architecture.

1.5 Summary
Real-time data is data in motion that has not been persisted into any form of
storage.
Real-time machine learning is an approach which uses real-time data to build
predictive systems that adapt to changes in an environment.
Offline learning involves training a model from a historical batch of data.
Online learning involves incrementally training a model from data as it arrives in
the system.
The offline learning workflow is synchronous: each component of the workflow
is dependent on the one preceding it.
The online learning workflow is asynchronous: the components of the workflow
are decoupled from each other.
Online learning evolved to address use cases where there are rapid changes in
data distributions.
Feature drift or data drift occurs when the distribution of the features used to
train a model changes over time.
Concept drift occurs when the relationship between the input features and the
output labels changes over time.
Some use cases that are suitable for real-time machine learning include
recommender systems, anomaly detection, and reinforcement learning.

[1] The transmission speed depends on the speed of light c and the refractive index n of the
fiber optic cable. Here we assume a single-mode fiber with n=1.46 so the ideal transmission
speed is v = c / n = 300,000 km/s / 1.46 = ~205,000 km/s.


2 Data ingestion for real-time machine learning

This chapter covers


Ingesting real-time event data
Using event-driven architectures to stream and persist real-time data
Advantages of event-driven architectures for real-time machine learning

Every machine learning project starts with a problem that can be solved by using a machine
learning model that learns patterns from the data. The data available to you and the data you have
the ability to collect define what is possible for the project. This is no different for real-time
machine learning. In order to start building real-time machine learning applications, you need to
understand what real-time data is available to you and how to use it to produce useful predictions.
In the previous chapter we explored how real-time data instances are used to both generate
real-time inferences and train online models. In order to accomplish this, the data first has to be
ingested and transmitted from its origin to the process where inference is happening. This requires
dedicated data architectures to handle the continuous transfer of data, known as data streams. In
practice, data streams are implemented as message queues or event streams. We discuss message
queues and event streams in section 2.2.
In this chapter you will learn how to ingest data points from a real-time data source. Then, you
will write a publisher to persist this data to an event stream and a subscriber to read both live and
historical data from the stream. Publishers and subscribers will be discussed in section 2.2. You will
also learn how event-driven architectures can solve some of the inherent problems associated with
real-time data streaming such as backpressure and scalability. At the end of this chapter, you will
have a data ingestion mechanism ready for real-time machine learning.


Figure 2.1 illustrates an example of a simple workflow that describes the flow of real-time data
using event-driven architecture. Specifically there is a publisher that ingests data from a real-time
data source and publishes this data to a topic. There is a message broker in the middle that routes
the data from the publisher to a subscriber that has subscribed to the topic.

Figure 2.1 Simple event-driven workflow

2.1 Ingesting real-time events


Let’s say you want a system to predict flight arrival times for ongoing flights. This could help
travelers who need to schedule pick up rides or drivers who need to arrive at the airport at the best
time to pick up travelers. In order to build this predictive system you need to ingest data from real-
time data sources. The system will need to stream flight data and possibly integrate this with other
real-time data sources such as ongoing weather updates to train an online model. Your system will
also need to stream flight predictions out to end users in order to provide them with true real-time
updates.
In this section we will focus on the data ingestion portion. In its simplest form the application
needs to stream data from APIs and capture important events such as flight positions and
velocities. You will learn how to create data events from scratch and publish those events to a real-
time data stream.
The following figure represents the architecture diagram of the data ingestion application that
we will have developed at the conclusion of the chapter.

Figure 2.2 Real-time flights data ingestion application

2.1.1 Selecting a data source


Our goal is to build a nowcasting model to predict arrival times for ongoing flights. Nowcasting is
similar to forecasting in that both approaches aim to predict the future, except that nowcasting
heavily relies on recent trends to predict near-future values. For example, predicting where a
hurricane will make landfall or which direction a stock will move in the next few hours. Our
nowcasting model will take into account recent flight positions and attempt to predict when the
flight will arrive.


When selecting a real-time data source for a project, you should consider the following:
Terms of use: Most APIs specify terms and conditions for how the data can be used. For
example, some usage agreements prohibit you from making publicly available copies of the data or
selling an application that uses the data.
Reliability: One challenging aspect of real-time data projects is that they often rely on external
services being available to function properly. When selecting a data source, it’s important to gauge
the reliability of the API by performing some initial due diligence. In particular, we want to select
data sources that are well maintained and used by others. There are specific metrics you can use
for this. For example, reliable APIs often have detailed documentation with examples. It’s a good
sign if the organization or party providing the API has an official programming client and its GitHub
repository is well maintained (e.g. updated within the last month, maintainers are responding to
issues, several hundred stars).
Rate limits: Most APIs impose some kind of rate limiting to protect against spam attacks. When
selecting a data source a necessary step is to decide how much data is needed and how often.
Furthermore, what strategies will you use to deal with the limitations of the real-time data source?
For example, in this chapter we will utilize a polling interval strategy to avoid running into request
rate limits.
Data quality: When using an external data source you are limited by what data is made
available. For example, data that is advertised as “real-time” may only be available in time steps of
30 seconds. It’s important to understand how often new data is required for the use case. An
application that attempts to predict near-future stock price movements may need to ingest updates
on a second-by-second basis, but an application that predicts monthly rainfall may only need to be
updated on a daily basis. In the next chapter we will explore how to assess the viability of real-time
data features with respect to online modeling.
Data formatting: It’s especially common for REST APIs to return unstructured JSON data. One
goal of a data ingestion pipeline is to ensure that this data can be reliably deserialized into a defined
structure. A potential issue with using external APIs is that the data schema can change at any point
which can break downstream applications. For this reason it’s important to use official
programming language APIs where possible, and perform field validation so that any compatibility
issues are easily identified and fixed.
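For example, a lightweight validation helper can check that each required field is present and has the expected type before an event is passed downstream. The following is a minimal sketch; validate_event is a hypothetical helper, and the required fields simply mirror the flight attributes we extract later in this chapter.

REQUIRED_FIELDS = {
    "icao24": str,
    "origin_country": str,
    "time_position": int,
    "longitude": float,
    "latitude": float,
}

def validate_event(event):
    # Return a list of problems; an empty list means the event is usable.
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if event.get(field) is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"wrong type for {field}: {type(event[field]).__name__}")
    return problems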
To power our nowcasting model, we wish to ingest data about ongoing flights. For offline
learning we would start by locating a historical flights dataset. In contrast, our model will need to
use very recent data to make predictions. Therefore we will need a real-time data source that
captures flight data. Fortunately, the OpenSky Network (https://opensky-network.org/) provides
real-time flight data about ongoing air traffic, including commercial flights. At the time of writing,
the data is free for non-profit research use.
Note that many APIs will describe themselves as “real-time” APIs but may be designed very
differently. Some are truly real-time in that they utilize long-running, low-latency websockets for
delivering updates in a push-based format. This is most common in the finance domain, where
trading volumes are high and knowing information as soon as possible is incredibly valuable. Other
APIs are pull-based: they require the client to make periodic requests for data updates.
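To make the push-based style concrete, here is a minimal sketch using the third-party websockets package; the wss://example.com/trades endpoint is purely illustrative and not a real feed.

import asyncio
import json

import websockets  # third-party package, installable with pip

async def consume_trades():
    # The server pushes messages to us; no polling loop is needed.
    async with websockets.connect("wss://example.com/trades") as ws:
        async for message in ws:
            print(json.loads(message))

asyncio.run(consume_trades())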


OpenSky provides a REST-based API that we can make web requests to. In particular, we are
interested in the GET /states/all route which returns “state vectors” for ongoing flights. We will use
the built-in urllib library in Python to make a single request to the API and print the result in
readable text, as shown in listing 2.1.

Listing 2.1 Querying current flight positions from OpenSky

from urllib import request

with request.urlopen("https://opensky-network.org/api/states/all") as response:
    print(response.read().decode("utf-8"))

The output of this code is a giant list of flight positions.



["c00734","WJA2199
","Canada",1728763865,1728763865,-70.7597,20.2082,10972.8,false,249.58,350.87,0.33,null,11658.6,nul
l,false,0],
["a41b89","DAL799 ","United
States",1728763866,1728763866,-105.9763,40.8415,10668,false,248.95,62.69,0.33,null,11102.34,null,fals
e,0]
The API has returned a JSON-encoded object representing the current positions of all known
flights. For simplicity we can provide additional parameters to the API to filter the output to a
specific region. The modified code below queries all current flights over Switzerland using latitude
and longitude coordinates and parses the result into a Python object we can work with more easily.

Listing 2.2 Filtering flight positions from OpenSky

import json
from urllib import request

url = "https://opensky-network.org/api/states/all?lamin=45.8389&lomin=5.9962&lamax=47.8229&lomax=10.5226"
with request.urlopen(url) as response:
    print(json.loads(response.read().decode("utf-8")))

Here is the abbreviated output.


{'time': 1728765860, 'states': [['4b1804', 'SWR560A ', 'Switzerland', 1728765859, 1728765859, 9.2725,
47.6737, 3970.02, False, 160.76, 254.41, -6.5, None, 4091.94, '1000', False, 0], ['4b1806', 'SWR2YA ',
'Switzerland', 1728765773, 1728765837, 8.5633, 47.4417, 403.86, True, 0, 64.69, None, None, None,
'2000', False, 0]]}
Now that we have repeatable code to capture data of interest, we will process this data into
individual events for real-time machine learning.


2.1.2 Creating real-time events


An event is a record that something happened. In the context of our nowcasting system events can
refer to specific things that have happened in the world such as planes arriving or departing an
airport or a weather update for a particular geographical region. However, events can also refer to
things happening internally within the application. For example, when a new prediction is produced
it needs to be communicated to another part of the system. Event-driven systems work by
communicating data and control signals in terms of individual events.
The next step in data ingestion is to transform the queried flight positions into individual events.
At a minimum, each event should contain a data payload and a timestamp that indicates when the
event occurred. For time-sensitive systems, the timestamp is important because there will be some
delay between when the event was received and when it is eventually processed by an
online model. Creating an event means defining a data structure that can be interpreted by some
other part of the system. To start, we may wish to capture the position (latitude and longitude),
velocity, direction (true_track), origin country, and unique identification of the aircraft (icao24). The
function below parses a response from the API and returns a list of update objects, one for each
flight in the response. Since the response is not guaranteed to be time-ordered, we will also
manually order the events by the time_position timestamp.

Listing 2.3 Creating events from flight data

def response_to_events(api_response):
    flight_events = []
    for update in api_response["states"]:  #A
        flight_events.append(
            {
                "icao24": update[0],
                "origin_country": update[2],
                "time_position": update[3],
                "longitude": update[5],
                "latitude": update[6],
                "velocity": update[9],
                "true_track": update[10],
            }
        )
    return sorted(flight_events, key=lambda x: x["time_position"])  #B

#A Iterate over all the flight states in the API response.


#B Return a sorted list of flight events from earliest to latest.

Now we have a function that parses a single API response into a sequence of events. We will need
something to invoke this function repeatedly to produce real-time data for the nowcasting system.
In event-driven system terms this is called a publisher. Publishers are long-running processes that
communicate events to other parts of the real-time system. In our case, we need a publisher that
routinely queries the OpenSky API and produces events that our system can understand.


Note that we’re a bit limited by the constraints of the data API. HTTP/1.1 is a pull mechanism so
the API will not tell us when new updates are available. The only option is to periodically poll the
server in order to retrieve flight updates. When implementing the publisher, we have to decide on
an interval to poll the server. The longer this interval is, the less real-time our system will be
because we will be operating on older data. However, with shorter intervals we risk running into rate
limits imposed by the API’s maintainers. The figure below shows the different results we might get if
we choose different polling intervals.

Figure 2.3 Ingesting flight updates with different polling intervals. Shorter intervals produce more data points
with smaller changes. Longer intervals produce fewer data points with more significant changes.


Ideally we would poll as fast as possible in order to retrieve the most real-time data. However, in
practice we usually have to make a compromise because it wastes compute resources to
continuously poll and we will quickly run up against rate limits. So how do we determine the
appropriate polling interval for a particular use case? The first thing to look at is the API’s
documentation, which should state how often the underlying data updates and how many requests
are allowed per day/hour/minute. This usually gives a good starting point for determining an
appropriate interval. For example, at the time of writing the OpenSky documentation states that
unauthenticated users can retrieve new data every 10 seconds and 400 /api/states/all requests over
a 500x500 km area are allowed within a 24-hour period. So, we could poll the system every 10
seconds to have the latest data, but if we want to leave the system running for an entire day
the polling interval should be at least 24 * 60 / 400 = 3.6 minutes. In general, a good starting point
is between 30 seconds and 5 minutes. However, this depends on the criticality of the data and how
often the data is likely to change. For time-sensitive use cases such as alerting systems and financial
markets the polling interval should be less than a minute. For more slowly changing or stable
systems such as server health monitoring or weather tracking a suitable interval is closer to 5
minutes.
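As a quick sanity check, the arithmetic above can be expressed in a couple of lines of Python; the 400-request quota is the figure from the OpenSky documentation cited earlier.

SECONDS_PER_DAY = 24 * 60 * 60
DAILY_REQUEST_LIMIT = 400  # OpenSky's documented quota for unauthenticated users

min_interval_sec = SECONDS_PER_DAY / DAILY_REQUEST_LIMIT
print(f"Poll at most every {min_interval_sec:.0f} seconds (~{min_interval_sec / 60:.1f} minutes)")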
We will start with an interval of 10 seconds so we can see how the polling works in action. The
code below is an event generator that queries the API at a defined interval and performs the
transformation from response data into an iterable of events.

Listing 2.4 A flight update event generator

import json
import time
from urllib import request

def get_events(
    url="https://opensky-network.org/api/states/all?lamin=45.8389&lomin=5.9962&lamax=47.8229&lomax=10.5226",
    interval_sec=10,
):
    for _ in range(3):  #A
        with request.urlopen(url) as response:
            yield from response_to_events(json.loads(response.read().decode("utf-8")))  #B
        time.sleep(interval_sec)  #C

#A Stop execution after three iterations.


#B The yield keyword makes this function a generator that returns events on every iteration of the loop.
#C Sleep for the time interval to avoid spamming the API.

We can invoke the generator by iterating over it in a Python for loop to produce events.

Listing 2.5 Calling the event generator

for event in get_events():
    print(event)

The output of running this code looks something like this.


{'icao24': '4d21ef', 'origin_country': 'Malta', 'time_position': 1729024220, 'longitude': 8.0245, 'latitude':


46.7367, 'velocity': 228.51, 'true_track': 144.81}
{'icao24': '3986e1', 'origin_country': 'France', 'time_position': 1729024220, 'longitude': 6.5155, 'latitude':
47.0584, 'velocity': 209.11, 'true_track': 210.29}
{'icao24': '4d223b', 'origin_country': 'Malta', 'time_position': 1729024220, 'longitude': 7.23, 'latitude':
47.5584, 'velocity': 219.62, 'true_track': 145.3}
Note that the output is not continuous, but it is event-driven. The triggering event is time-based:
every 10 seconds a new batch of events is created when the API is called. After three batches the
generator returns and the process exits. Note that in a real project, the generator will keep querying
indefinitely. This means that the process may eventually run into API rate limits. Long-running
systems should detect these errors and implement a backoff strategy. A backoff strategy helps
automated querying systems like this avoid getting caught in a continuous error loop. We will
discuss how to implement a backoff strategy in Chapter 4 when we build the complete flight
prediction system.
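We won't need the full implementation until then, but the core idea fits in a short sketch: retry the request, doubling the wait after each rate-limit error. The retry budget and delays below are illustrative defaults, not values prescribed by the OpenSky API.

import time
from urllib import request
from urllib.error import HTTPError

def fetch_with_backoff(url, max_retries=5, base_delay_sec=1.0):
    for attempt in range(max_retries):
        try:
            with request.urlopen(url) as response:
                return response.read()
        except HTTPError as err:
            if err.code != 429:  # only back off on rate-limit responses
                raise
            time.sleep(base_delay_sec * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError("request failed after retries")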
The last thing we need to do is put the data somewhere. For offline modeling we would normally
persist the data instances into a queryable storage format such as a database. For now, we will
simulate this by dumping the events into a JSON lines file, where each line represents a single flight
event encoded in JSON format. At this point we’ll also formalize all the functionality into a
FlightPublisher class, as shown in listing 2.6. This will make it easier to spin up publishers by making
them configured objects.

Listing 2.6 Initial flight data publisher

import json
import time
from urllib import request

class FlightPublisherV1:
    def __init__(self, url, interval_sec, file_path):
        self.url = url
        self.interval_sec = interval_sec
        self.file_path = file_path

    def response_to_events(self, api_response):  #A
        flight_events = []
        for update in api_response["states"]:
            flight_events.append(
                {
                    "icao24": update[0],
                    "origin_country": update[2],
                    "time_position": update[3],
                    "longitude": update[5],
                    "latitude": update[6],
                    "velocity": update[9],
                    "true_track": update[10],
                }
            )
        return sorted(flight_events, key=lambda x: x["time_position"])

    def get_events(self):  #B
        while True:
            with request.urlopen(self.url) as response:
                yield from self.response_to_events(
                    json.loads(response.read().decode("utf-8"))
                )
            time.sleep(self.interval_sec)

    def run(self):  #C
        for event in self.get_events():
            with open(self.file_path, "a") as file:
                file.write(json.dumps(event) + "\n")

#A Extract individual flight updates from an API response.


#B A generator that polls an API and yields flight update events.
#C Run the publisher continuously and append events to a file.

Running the following code will periodically dump batches of flight update events to
flight_updates.jsonl.

Listing 2.7 Invoking the flight data publisher

publisher = FlightPublisherV1(
    url="https://opensky-network.org/api/states/all?lamin=45.8389&lomin=5.9962&lamax=47.8229&lomax=10.5226",
    interval_sec=60,
    file_path="flight_updates.jsonl",
)
publisher.run()

You now have a publisher which extracts data from a real-time data source, performs simple data
filtering, and persists an ordered log of data events. This mirrors the extract-transform-load (ETL)
pipeline that is common in offline learning. However, in online learning we are interested in
processing this event log as close to real-time as possible. The online model will need to parse this
data and produce inferences as a real-time output. In the next section, you will create a streaming
ETL pipeline that processes the ingested flight data in real-time for online modeling.

2.2 Processing real-time data


Currently our flight data exists in two states. It exists in true real-time format as the FlightPublisher is
reading and processing updates from the API. Then, we persist the updates as a log of events to
disk and the data becomes historical. The historical format is useful for debugging and for us to
make sense of the data but it is not very useful for online modeling.
To support training and utilizing online models, we need to be able to communicate real-time
data between separate processes. In this section, you will learn the basic components of an event-
driven architecture and finish the real-time ingestion implementation in preparation for online
modeling.


2.2.1 Message queues


Right now our FlightPublisher outputs an ordered sequence of events to a readable file. This is
actually a basic example of a message queue. Message queues are buffers that allow different
processes to communicate asynchronously with each other. Processes that write to message
queues are called publishers or producers, and processes that read from them are called subscribers or
consumers. The benefit of message queues is that they allow publishers and subscribers to operate
independently. For example, imagine you want to purchase a magazine about machine learning.
You could travel to a book store and physically purchase this month’s issue of ML Digest. This
requires both parties to coordinate and complete a transaction. This is similar to making a request
to the OpenSky API and receiving a single response. The other option is to purchase a magazine
subscription. In this case, the magazine distributor is only responsible for dropping each monthly
issue in the mailbox. This is analogous to our FlightPublisher logging events to the file system.

Figure 2.4 Synchronous data transfer requires careful coordination between two parties. Asynchronous data
transfer requires a buffer or message queue.

With message queues we aim to decouple the different processes in our flight nowcasting system.
For instance, this will allow the FlightPublisher to continue polling for new flight data without waiting
for subscribers to process the data. However, there are a few issues with our file buffer approach.
One issue is that files are traditionally read sequentially from the beginning to the end of the file. As
more events are published the size of the file increases so it takes longer to read the file. In
addition, the most recent events will always be at the end. In order for the reader to ingest a new
event it must seek to a recent line in the file. This requires the subscriber to keep track of its last
read position, otherwise it will receive duplicate events. This behavior is illustrated in the figure
below.


Figure 2.5 The first time a subscriber reads from the event log it reads 3 events. On the second read it must
seek forward 3 events before reading the new events. Without a message queue, the subscriber must keep
track of its last read offset to avoid receiving duplicate events.
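To illustrate the bookkeeping the figure describes, here is a hypothetical file-based subscriber that remembers its byte offset between reads; FileSubscriber is not part of our system, it only shows the extra state the file approach pushes onto the subscriber.

import json

class FileSubscriber:
    def __init__(self, file_path):
        self.file_path = file_path
        self.offset = 0  # byte position of the last read

    def poll(self):
        with open(self.file_path, "r") as file:
            file.seek(self.offset)  # skip past events already consumed
            while True:
                line = file.readline()
                if not line:
                    break
                yield json.loads(line)
                self.offset = file.tell()  # remember where we stopped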

We will now upgrade our file buffered approach to use a real message queue. There are many
message queueing solutions such as Apache Kafka, Google Pub/Sub, and RabbitMQ. We will use
RabbitMQ because we can set up the queue locally with very minimal code. As a message broker it’s
responsible for accepting connection requests from both the publisher and subscriber, handling
queue creation, dispatching messages sent from the publisher to the correct queue, and more
importantly persisting the information in the queue even after the publisher closes its connection.
This is critical because it allows the publisher and subscriber to communicate asynchronously. This
decoupling turns out to be valuable for real-time machine learning because it means that data
ingestion can continue while the system is learning and producing inferences.
The first step is to run the RabbitMQ management service. The easiest way to start the service is
to pull and run the Docker image. If you don’t have Docker, you can install it on your system
following the directions at https://docs.docker.com/get-started/get-docker/. Use the following
command to start the RabbitMQ service.
docker run -it --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:4.0-management
If the service started successfully then you should see a startup confirmation message.
2024-10-18 02:52:36.304715+00:00 [info] <0.651.0> Server startup complete; 4 plugins started.
2024-10-18 02:52:36.304715+00:00 [info] <0.651.0> * rabbitmq_prometheus
2024-10-18 02:52:36.304715+00:00 [info] <0.651.0> * rabbitmq_management
2024-10-18 02:52:36.304715+00:00 [info] <0.651.0> * rabbitmq_management_agent
2024-10-18 02:52:36.304715+00:00 [info] <0.651.0> * rabbitmq_web_dispatch
2024-10-18 02:52:36.409833+00:00 [info] <0.9.0> Time to start RabbitMQ: 12912 ms
To interact with the server you will need a client that speaks a messaging protocol that
RabbitMQ supports. Fortunately, Pika is a package that implements the Advanced Message Queuing
Protocol (AMQP) in pure Python. AMQP is an open standard application layer protocol which defines
message-based communication between network processes. In our case, it allows us to send
messages from Python code to the RabbitMQ server written in Erlang. Run the following command
in a separate terminal window to install Pika.
pip install pika
Next we can try to send a message to a queue to see if it works. With Pika we have to declare a
queue before sending anything to it. Running the following code will establish a connection to the
local message service, publish a single message to a flight_updates queue, and close the connection.


Listing 2.8 Publishing data to a queue

import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost")) #A
channel = connection.channel() #A
channel.queue_declare(queue="flight_updates") #B
data = {"icao24": "abc123", "origin_country": "United States"}
channel.basic_publish(exchange="", routing_key="flight_updates", body=json.dumps(data))  #C
print("Sent message:", data)
connection.close()

#A Connect to a locally running message broker.


#B Ensure the message queue exists.
#C Publish a JSON payload to the message queue.

Here is the output of running the code.


Sent message: {'icao24': 'abc123', 'origin_country': 'United States'}
We can also use Pika to consume or subscribe to messages on the queue. The subscriber code is
a bit more complicated because we usually want to do something with the message. For now, we
will just print it out but eventually we will use the incoming data to perform online inference. Here,
we will configure a simple callback function that gets invoked whenever the subscriber receives a
new message.

Listing 2.9 Consuming data from the queue

import json
import pika

def handle_message(channel, method, properties, body):  #A
    print(f"Received message: {body}")  #A

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="flight_updates")  #B
channel.basic_consume(  #C
    queue="flight_updates", on_message_callback=handle_message, auto_ack=True  #C
)  #C
channel.start_consuming()  #D

#A Function callback that gets invoked when a new message is received from the queue.
#B Ensure the message queue exists.
#C Configure the consumer to auto-send acknowledgements (fire-and-forget method)
#D Start consuming events from the queue.


You may be wondering why you have to declare the queue again since we already declared it on the
publisher side. The reason is that in a production system the publisher and subscriber are likely
operating on different machines. Without a secondary communication channel, it’s impossible for
either party to know if the other has started yet. Fortunately the queue_declare call is idempotent,
i.e., it can be called several times but only one queue will be created. Therefore, it's simpler to
declare the queue in both places than to first check whether it already exists. Assuming you ran the publish code
earlier, this code should output the following message.
Received message: b'{"icao24": "abc123", "origin_country": "United States"}'
This is the same as before so we have successfully transferred a message from the publisher to
the subscriber through the queue. Note that the subscriber process is hanging. This may seem like
a bug but it’s actually intended behavior. The start_consuming function implements a loop that waits
for new messages. This is a common pattern for event-driven systems. Subscribers wait for things
to happen in the system and then perform some operation, such as producing new data or
updating their internal state. You can confirm that the waiting behavior is happening by simply
running the publisher code again; you should see another message printed by the subscriber.
Now you can create a configurable flight subscriber that reads from a queue. Create a new file
called subscriber.py and include the following code.


Listing 2.10 Flights subscriber using a message queue

import pika

class FlightSubscriberV1:
    def __init__(self, queue_name):
        self.queue_name = queue_name

    def process_message(self, channel, method, properties, body):  #A
        print(f"Received flight update: {body}")  #A

    def run(self):  #B
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(queue=self.queue_name)
        channel.basic_consume(
            queue=self.queue_name,
            on_message_callback=self.process_message,
            auto_ack=True,
        )
        channel.start_consuming()

subscriber = FlightSubscriberV1(queue_name="flight_updates")
subscriber.run()  #C

#A Function callback invoked when a flight update is received.


#B Blocking function that continuously receives events from a message queue.
#C Run the subscriber.

You can run the subscriber like a standard Python program.


python subscriber.py
You should not see any output yet because the publisher is not running. Here is the updated
version of the FlightPublisher that uses the queue, which you can save as publisher.py.

Listing 2.11 Flights publisher using a message queue

import json
import pika
import time
from urllib import request

class FlightPublisherV2:
    def __init__(self, url, interval_sec, queue_name):
        self.url = url
        self.interval_sec = interval_sec
        self.queue_name = queue_name

    def response_to_events(self, api_response):  #A
        flight_events = []
        for update in api_response["states"]:
            flight_events.append(
                {
                    "icao24": update[0],
                    "origin_country": update[2],
                    "time_position": update[3],
                    "longitude": update[5],
                    "latitude": update[6],
                    "velocity": update[9],
                    "true_track": update[10],
                }
            )
        return sorted(flight_events, key=lambda x: x["time_position"])

    def get_events(self):  #B
        while True:
            with request.urlopen(self.url) as response:
                yield from self.response_to_events(
                    json.loads(response.read().decode("utf-8"))
                )
            time.sleep(self.interval_sec)

    def run(self):  #C
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(queue=self.queue_name)
        for event in self.get_events():
            channel.basic_publish(
                exchange="", routing_key=self.queue_name, body=json.dumps(event)
            )
            time.sleep(0.1)  #D
        connection.close()

publisher = FlightPublisherV2(
    url="https://opensky-network.org/api/states/all?lamin=45.8389&lomin=5.9962&lamax=47.8229&lomax=10.5226",
    interval_sec=60,
    queue_name="flight_updates",
)
publisher.run()  #E

#A Convert API responses to flight update events.


#B Generator that polls the API and yields flight update events.
#C Function that runs the publisher indefinitely and publishes events to the message queue.
#D Small delay for demonstration purposes.
#E Blocking call that runs the publisher.

Now, run the publisher in a second terminal window.


python publisher.py
You should see the same events printed out by both the publisher and subscriber in the same
order. You have now created a basic event-driven pipeline that ingests real-time data and streams it
to a downstream subscriber. This is an improvement over the file buffer mechanism we
implemented earlier because the subscriber does not have to guess when new data is available.
Instead, the subscriber is truly event-driven and can react immediately to new flight events.

2.2.2 Problems with message queues


Message queues are a simple way to implement asynchronous data transfer for real-time machine
learning. However, think for a moment what might happen if you stop the publisher and restart the
subscriber, then try it out.
In this case the restarted subscriber is running but it does not receive any of the older messages.
You might have expected the subscriber to print out all the older messages. In fact, the older
messages have been removed from the queue and it is no longer possible to retrieve them. This is
because the primary goal of the message queue is to enable asynchronous communication
between services, not to persist data. For strictly real-time use cases this is desirable, but there are
situations in real-time machine learning where we want to be able to query historical data. For
example, we may want to replay the data stream to evaluate or debug an online model. With
message queues, it’s not possible to rewind the stream or query older values.
In the examples we have seen so far all messages have been successfully delivered from the
publisher to the subscriber. However, with asynchronous communication there is always the
possibility of a message being dropped. It is possible that the subscriber does not receive a
message sent from the publisher because it loses connection to the broker while attempting to
receive a message. RabbitMQ addresses this by making publishes to the queue transactional and
placing responsibility on the publisher to resend messages that were not delivered to the
subscriber.
This is an example of at least once delivery. At least once delivery ensures that the subscriber will
eventually receive the event, but duplicates are possible. In contrast, at most once delivery ensures
that no duplicates are received, but in this case there is a possibility that messages will be lost. The
difference between these two semantics is illustrated in the figure below.


Figure 2.6 At least once delivery ensures that every message is delivered at least once but may produce
duplicates. At most once delivery prevents duplicates but may fail to deliver messages.

The ideal semantic for asynchronous messaging is exactly once delivery, although this is impossible
to achieve outside of very specific scenarios. This has to do with the famous Two Generals’ thought
experiment, which was first published in a paper by E. A. Akkoyunlu, K. Ekanadham, and R. V. Huber
in 1975 illustrating the constraints in network communications (https://dl.acm.org/doi/pdf/10.1145/800213.806523). The scenario demonstrates that it's impossible for two parties communicating over
an unreliable link to coordinate on an action. Even if the subscriber sends an acknowledgement to
the publisher indicating that a message was properly received, there is no way to ensure that the
acknowledgement message itself is not lost.
Realistically, engineers have to make a choice of whether to handle duplicates or missing data.
For real-time machine learning this depends on the exact use case. In a financial fraud detection
system at least once delivery would be preferred if missing an important transaction would impact
the model’s prediction accuracy. In less critical scenarios such as user recommendations, at most
once delivery is acceptable if the speed at which those recommendations are delivered to users is
more important than losing an occasional data point. For the flight nowcasting system, missing
events is not desirable because it means we have less data for online modeling and we want to
make accurate predictions. Fortunately, detecting duplicates is not too difficult since the aircraft ID
and timestamp forms a unique ID. The following function can be added to the FlightSubscriber to
keep a record of current flight positions and perform a lookup to determine if an event has already
been processed.


Listing 2.12 Detecting duplicate messages

def check_duplicate(self, event):
    if (  #A
        event["icao24"] in self.flights
        and event["time_position"] <= self.flights[event["icao24"]]["time_position"]
    ):
        return True
    self.flights[event["icao24"]] = event  #B
    return False

#A This is a duplicate if this is not a newer event for this aircraft.


#B Cache the latest event for this aircraft.

Since they are meant to be a temporary buffer, message queues have a finite size. If the subscriber
runs more slowly than the publisher then the number of messages in the queue will continue to
increase. This phenomenon is called backpressure, which refers to how the data throughput is
limited by the processing speed of the subscriber. This is analogous to cars on a highway during
rush hour; there are too many cars for the road to handle so the rate of traffic slows.

Figure 2.7 Backpressure in a message queue causing a newly published message to be dropped

There are a few ways to manage backpressure with a message queue. One method is to
deliberately slow down the speed of the publisher. For real-time machine learning, this option may
not be preferable because it leads to ingesting older data that is no longer relevant. Another option
is to drop messages to decrease the pressure on the queue, which leads to data loss. The best
option when dealing with high throughput data ingestion is to increase the number of subscribers.
Of course, this increases the complexity of the system.
There are functional reasons why you would want multiple subscribers reading from the same
queue. For example, inferences produced by an online model could be read from multiple user-
facing applications. However, recall that messages are removed from the queue once consumed.
This makes it impossible for two different subscribers to receive the same event from a queue. In
order to implement this behavior, we need a real-time buffer mechanism that can also be persisted
to a historical log: the event stream.


2.2.3 Event streams


Event streams are designed to be both consumable in real-time and queryable for historical data.
Practically they exist as an ordered, append-only log of events. This makes them useful data
structures for asynchronous communication across multiple publishers and subscribers. In this
way, event streams address a lot of the limitations of queues by persisting the event data to disk.
Event streams require specialized processes called event brokers which manage communicating
events between many different publishers and subscribers.
The RabbitMQ broker already supports persistent data streams. You can define an event stream
by adding the x-queue-type=stream option when creating the queue, and setting durable=True to
ensure that the stream persists across reboots of the broker. Here is the updated publisher that
writes flight updates to a new event stream. Since flight_updates is already defined with
durable=False the broker won’t let us use the same name. Instead, we will create a new stream
called flight_events.

Listing 2.13 Flights publisher using an event stream

import json
import pika
import time
from urllib import request

class FlightPublisherV3:
    def __init__(self, url, interval_sec, stream_name):
        self.url = url
        self.interval_sec = interval_sec
        self.stream_name = stream_name

    def response_to_events(self, api_response):  #A
        flight_events = []
        for update in api_response["states"]:
            flight_events.append(
                {
                    "icao24": update[0],
                    "origin_country": update[2],
                    "time_position": update[3],
                    "longitude": update[5],
                    "latitude": update[6],
                    "velocity": update[9],
                    "true_track": update[10],
                }
            )
        return sorted(flight_events, key=lambda x: x["time_position"])

    def get_events(self):  #B
        while True:
            with request.urlopen(self.url) as response:
                yield from self.response_to_events(
                    json.loads(response.read().decode("utf-8"))
                )
            time.sleep(self.interval_sec)

    def run(self):
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(
            queue=self.stream_name,
            durable=True,  #C
            arguments={"x-queue-type": "stream"},
        )
        for event in self.get_events():
            print("Sending flight update:", event)
            channel.basic_publish(  #D
                exchange="", routing_key=self.stream_name, body=json.dumps(event)
            )
        connection.close()

publisher = FlightPublisherV3(
    url="https://opensky-network.org/api/states/all?lamin=45.8389&lomin=5.9962&lamax=47.8229&lomax=10.5226",
    interval_sec=60,
    stream_name="flight_events",
)
publisher.run()  #E

#A Convert an API response into flight events.


#B Generator that polls an API and yields flight events.
#C Create an event stream that persists across reboots of the broker.
#D Publish JSON events to the event stream.
#E Run the publisher to start publishing events.

Running publisher.py again produces similar output to before. However, you should see different
output from the broker terminal which implies that the events are indeed being persisted to disk.
2024-10-19 15:41:28.879707+00:00 [info] <0.4238.0> rabbit_stream_coordinator: started writer
__flight_events_1729352488363061228 on rabbit@6bcd5c37b84b in 1
2024-10-19 15:41:28.879955+00:00 [info] <0.4239.0> Stream: __flight_events_1729352488363061228
will use /var/lib/rabbitmq/mnesia/rabbit@6bcd5c37b84b/stream/__flight_events_1729352488363061228
for osiris log data directory
2024-10-19 15:41:28.910821+00:00 [info] <0.4239.0> osiris_writer:init/1: name:
__flight_events_1729352488363061228 last offset: -1 committed chunk id: -1 epoch: 1
Stop the publisher for now. To verify that the events were published, we will test if a new
subscriber can retrieve any of them. We also have a few options to configure subscriber behavior.
The first is subscriber acknowledgements. In event-driven systems the broker needs to know if
an event has been successfully delivered to a subscriber so it can remove the event from the
outgoing queue. To be a good citizen the subscriber should send an acknowledgement signal as
soon as the event is successfully processed. When working with queues we have previously set
auto_ack=True. This resulted in the highest throughput behavior where the broker would delete the
event as soon as it was delivered to the subscriber (the “fire and forget” method).


However, there are several scenarios where the subscriber would be unable to process an event.
For example, a JSON event may fail to validate against a defined schema (e.g. missing a required
key). An event could also be received by a subscriber that it cannot process due to a failed
precondition (e.g. the subscriber needs to update its local database but the database is not ready
yet). In these cases, the subscriber should send a negative acknowledgement informing the broker
that it should either discard or requeue the event. Therefore, it’s safer to set auto_ack=False and
change the subscriber to send manual acknowledgements.
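For example, a subscriber callback that receives a malformed payload might send a negative acknowledgement like the minimal sketch below, telling the broker to discard the event rather than requeue it; we will wire up positive acknowledgements shortly.

def process_message(self, channel, method, properties, body):
    try:
        event = json.loads(body)  # may raise if the payload is malformed
    except json.JSONDecodeError:
        # Negative acknowledgement: discard the event since it will never parse.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        return
    print(f"Received flight update: {event}")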
The other option has to do with throughput. RabbitMQ consumers prefetch events on the
subscriber before processing them to relieve backpressure on the stream. Unlimited prefetch
allows for the highest throughput but comes at the risk of overwhelming the subscriber. If the
subscriber fails to acknowledge an event then the prefetched events will accumulate on the
subscriber.

Figure 2.8 Prefetching events on the subscriber to increase data throughput

The prefetch count is normally determined experimentally depending on throughput and data
safety requirements. In our case, we are ingesting fewer than 100 events every 60 seconds so we
don’t need high throughput; we will set this value to the most conservative value of 1. Here is the
updated FlightSubscriber that consumes from the new event stream and also integrates the
duplicate check to ensure data safety.

Listing 2.14 Flights subscriber using an event stream

import json
import pika

class FlightSubscriberV2:
    def __init__(self, stream_name):
        self.stream_name = stream_name
        self.flights = {}

    def check_duplicate(self, event):
        if (
            event["icao24"] in self.flights
            and event["time_position"] <= self.flights[event["icao24"]]["time_position"]
        ):
            return True
        self.flights[event["icao24"]] = event
        return False

    def process_message(self, channel, method, properties, body):  #A
        event = json.loads(body)
        if not self.check_duplicate(event):
            print(f"Received flight update: {event}")

    def run(self):
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(
            queue=self.stream_name,
            durable=True,
            arguments={"x-queue-type": "stream"},
        )
        channel.basic_qos(prefetch_count=1)  #B
        channel.basic_consume(
            queue=self.stream_name,
            on_message_callback=self.process_message,  #C
            auto_ack=False,  #D
        )
        channel.start_consuming()

subscriber = FlightSubscriberV2(stream_name="flight_events")
subscriber.run()  #E

#A Function callback to check for duplicate events.


#B Prefetch at most 1 event when consuming events.
#C Configure the callback to check for duplicates.
#D Disable automatic acknowledgements.
#E Run the subscriber to start consuming events from the stream.

Try running the new subscriber. Surprisingly, we don’t receive any of the published events. Now, try
running the publisher again.
Received flight update: {'icao24': '4ba975', 'origin_country': 'Turkey', 'time_position': 1729359358,
'longitude': 6.1143, 'latitude': 46.2395, 'velocity': 8.23, 'true_track': 45}
The subscriber received data this time, but only one event. Try setting the prefetch_count to 2 and
run the subscriber again. Then, restart the publisher to produce some new events.
Received flight update: {'icao24': '4a3121', 'origin_country': 'Romania', 'time_position': 1729359386,
'longitude': 9.803, 'latitude': 47.734, 'velocity': 171.92, 'true_track': 268.11}
Received flight update: {'icao24': '398579', 'origin_country': 'France', 'time_position': 1729359559,
'longitude': 9.7242, 'latitude': 46.6355, 'velocity': 217.26, 'true_track': 125.46}
Now, the subscriber received only two new events. To understand what’s going on, consider the
sequence of events.
1. The publisher starts and publishes some events to the broker.


2. The subscriber starts with prefetch_count=1 and receives no events.


3. The publisher restarts and publishes more events to the broker. The subscriber
receives one event.
4. The subscriber restarts with prefetch_count=2 and receives no events.
5. The publisher restarts and publishes more events to the broker. The subscriber
receives 2 events.

Here, the subscriber only consumes new events when the publisher restarts. This is because, unlike
with the queue, messages previously published to the broker are not automatically delivered to the
new subscriber. Instead, new subscribers start reading the stream from the last published offset. If
we want the subscriber to read the older events we need to set the x-stream-offset option to a
different value. For example, we could configure the subscriber to start from the beginning of the
stream.

Listing 2.15 Consuming from the beginning of the stream

channel.basic_consume(
    queue=self.stream_name,
    on_message_callback=self.process_message,
    arguments={"x-stream-offset": "first"},  #A
)
channel.start_consuming()

#A Start from the first offset when consuming events.

The second question is why the subscriber only receives one event in step 3 and two events in step
5. The answer is that we did not configure the subscriber to acknowledge any of the events so the
prefetch queue gets filled up. To consume the rest of the stream the subscriber should
acknowledge back to the broker after each event is properly processed, as shown in the listing
below.

Listing 2.16 Sending acknowledgements from the subscriber

def process_message(self, channel, method, properties, body):
    event = json.loads(body)
    if not self.check_duplicate(event):
        print(f"Received flight update: {event}")
    channel.basic_ack(delivery_tag=method.delivery_tag)  #A

#A Acknowledge an event indicating the event was successfully processed.

After making these updates, restart the subscriber. You should now see the subscriber consume all
of the events. You can confirm that all the events were received by looking at the last few events.
They should be the same as the last publisher run. You can also look back in the subscriber output
to verify that it consumes the same events as before.


Received flight update: {'icao24': '4a3121', 'origin_country': 'Romania', 'time_position': 1729359386,


'longitude': 9.803, 'latitude': 47.734, 'velocity': 171.92, 'true_track': 268.11}
Received flight update: {'icao24': '4b18fc', 'origin_country': 'Switzerland', 'time_position': 1729359430,
'longitude': 8.5584, 'latitude': 47.4614, 'velocity': 0, 'true_track': 275.62}
Received flight update: {'icao24': '70c0cd', 'origin_country': 'Oman', 'time_position': 1729359447,
'longitude': 8.5553, 'latitude': 47.4608, 'velocity': 0, 'true_track': 5.62}
Received flight update: {'icao24': '398579', 'origin_country': 'France', 'time_position': 1729359559,
'longitude': 9.7242, 'latitude': 46.6355, 'velocity': 217.26, 'true_track': 125.46}
Part of the power of event streams is that they scale to multiple publishers and subscribers. You
can verify this by opening a new terminal window, starting a new copy of the subscriber, and
restarting the publisher. After 60 seconds the publisher produces a new series of events and each
subscriber prints the same output, indicating that they both consumed the new events. This was
not possible with a basic message queue because events were removed from the queue upon being
consumed.
Event-driven systems are composed of many publishers and subscribers communicating
through various event streams. The event streams themselves are sometimes referred to as topics.
So far we have been dealing with a single topic that captures real-time flight data. However, in
our real-time machine learning system we will have several topics dedicated to specific classes of
data. In the next section we will explore how to organize data workflows in terms of topics in order
to build real-time machine learning systems.

2.3 Benefits of event-driven architecture for machine learning


Now that we have demonstrated how to use event-driven architecture to build a data ingestion
pipeline, let’s briefly talk about why it is specifically suited for building machine learning
applications.
Generally speaking, the approach taken to build machine learning products has been to train a
model on a batch of data and then integrate the model into a production software application
where it is used to make inferences. There are a number of ways in which inferences can be
generated. One way is to pre-compute inferences offline in a batch-wise fashion similar to the
training process, i.e., generate the inferences on a static, finite set of data. In these situations,
timeliness is not usually a factor. Examples include predicting house prices, customer lifetime value,
and demand forecasting of goods at a retail store.
However, as time has progressed, more activity is being conducted online. It was only a couple of
decades ago when we were listening to music on CDs and were renting or buying DVDs of movies
and TV shows. Network and cable TV controlled the schedule of programs and a number of financial
transactions were conducted in person. Nowadays, we listen to music and watch content online. We
control when we watch programming on services like Netflix. We conduct a lot more financial
transactions online, we hail rides through apps like Uber, and we order food through online delivery
services like Doordash.


The above highlights an increased need for timeliness in service delivery. This has spillover
effects into machine learning, which is often a critical component in any kind of real-time
application. There are many more use cases where generating inferences in real-time has become a
necessity. We covered a few such examples in the previous chapter. Before we dive into the
benefits, let's first talk about how machine learning applications are typically deployed today.

2.3.1 Synchronous machine learning


A machine learning application is a specialized type of software application, and therefore, it makes
sense to wrap machine learning into the same framework as a traditional software application. This
typically means that a machine learning model, along with the inference engine (the code used to
generate inferences on new data), is shipped as a component within a microservice. Often this
microservice is part of a REST application: one microservice sends a request containing features to
the machine learning microservice, which loads the model, generates the inference, and returns it
as a response. Figure 2.9 illustrates an example of such an interaction. In this example, there is a
web application where customers can post reviews of a restaurant. The reviews get sent to the
backend service which routes the review to the customer review sentiment predictor, which is a
microservice that loads a model that will generate the sentiment of the review (positive or negative)
and return that to the backend service as a response. After receiving the response, the review and
the sentiment get passed to a different application, the customer support portal, that is used by
customer service personnel to triage customer reviews.


Figure 2.9 Customer review sentiment analysis application using synchronous architecture
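As a concrete illustration of this request-response pattern, a synchronous call to such a microservice might look like the following sketch; the endpoint and payload shape are hypothetical.

import json
from urllib import request

review = {"review_id": 42, "text": "The pasta was excellent!"}
req = request.Request(
    "http://sentiment-service.local/predict",  # hypothetical endpoint
    data=json.dumps(review).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with request.urlopen(req) as response:  # blocks until the model responds
    print(json.loads(response.read().decode("utf-8")))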

The nature of the interaction above is synchronous. The service that submits the request waits until it
receives the response before moving on to the next step in the application workflow. This waiting
introduces latency into the application, which may be unacceptable for some use cases.
Latency can also enter the system in other ways. If the inference engine is bombarded with more
requests for predictions than it can handle, it causes backpressure that will result in slower
responses. Some application designers may choose to apply rate limiting to manage server load, but
that means the service will simply reject excess requests. Neither situation is ideal, and both can
result in poor user experience.


Figure 2.10 Customer review sentiment analysis application using synchronous architecture

This highlights the tight coupling between the service that sends the features and the service
that generates and returns the inference. The load on the machine learning microservice depends
on how many requests are sent to it, and the requesting microservice's performance depends on
the responsiveness of the machine learning microservice.
The customer review application presented earlier demonstrates a scenario where the
application that calls the machine learning microservice receives the inference and passes it to
another application, which presents the inference to customer service personnel to triage the reviews.
What if the inference is needed in multiple applications? For example, the inference might feed
a dashboard that displays analytics for business intelligence purposes. The inferences, along with
the features, might also be stored in a database for offline analytics. A bottleneck in any part of
this workflow can trigger a chain reaction that brings down the entire system.
This dependency creates other issues as well. If the machine learning microservice needs to
make an API change, all the microservices that depend on it will also have to change their code. In
many organizations, the team that developed the machine learning microservice is different from
the teams that call it, and these teams typically have different objectives, priorities, and release
schedules. The result is a communication challenge that plagues many organizations and can cause
machine learning projects to fail.


All of this points to a key principle: building a responsive application requires quick access to
data. In a real-time system, data behaves like money: just as a dollar is worth more today than it is
in the future, the faster data is available, processed, and acted upon in a real-time machine
learning application, the more value it provides to an organization. For example, an application
that can adjust its recommendations based on user activity in real-time is more likely to keep users
engaged and coming back than one that cannot adjust because it was trained on the previous day's
data. Similarly, the sooner an application can detect and flag fraudulent financial transactions, the
more money the financial institution is likely to save over time.

2.3.2 Event-driven machine learning


Earlier in the chapter, we introduced the concept of real-time data flows and event-driven
architecture as solutions to the data freshness challenge. Figure 2.11 illustrates an example of an
application using event-driven architecture. There is a customer review application that publishes
customer review data to a customer review topic. The customer review sentiment predictor
application, which contains the machine learning model, generates an inference as soon as a new
customer review is added to the customer review topic and publishes it to the customer sentiment
topic. We can consider the customer review sentiment predictor application a "handler"
application, one that subscribes to (receives data from) one topic and publishes to a different
topic.

Figure 2.11 Customer review sentiment analysis application using event-driven architecture
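As a sketch of what such a handler might look like in Python, the following uses the kafka-python client; the topic names, the JSON event fields, and the model-loading details are assumptions for illustration, not the book's prescribed implementation.

import json
import joblib
from kafka import KafkaConsumer, KafkaProducer

# Hypothetical pre-trained sentiment pipeline.
model = joblib.load("sentiment_model.joblib")

consumer = KafkaConsumer(
    "customer-reviews",                  # topic the handler subscribes to
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for event in consumer:                   # blocks until a new review arrives
    review = event.value                 # assumed fields: review_id, text
    label = model.predict([review["text"]])[0]
    producer.send("customer-sentiment", {
        "review_id": review["review_id"],
        "sentiment": "positive" if label == 1 else "negative",
    })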


The key difference from the earlier example is that the customer review application, which
collects the review data, operates independently of the sentiment analysis application. It is simply
responsible for publishing the data. The sentiment analysis application, in turn, subscribes to the
customer review topic, processes the data stored in it, and publishes the resulting inferences to
the customer sentiment topic.
The customer review application does not need to know the internals of the sentiment analysis
application because it is no longer responsible for collecting the inferences or doing the
downstream work. In fact, the diagram also demonstrates how other applications can handle the
downstream tasks. In this example, two applications subscribe to the customer sentiment topic
that the sentiment analysis application publishes to: one drives the customer support portal so
that customer support personnel can triage the reviews, and the other runs analytics on the
customer reviews and model output, producing an analytics portal for business intelligence
purposes. Note that neither of these applications talks directly to the sentiment analysis application.
Another benefit of this architecture is that a topic's subscribers can consume the data at
whatever interval they choose: they can receive events in real-time if their SLA requires immediate
processing, or ingest them later if there is no need to collect the events in real-time.
Event-driven architecture also enables better application maintenance. Since these applications
are decoupled, each team can choose its own maintenance cycle without worrying that its code
changes will affect other teams.
Event-driven applications also solve the problem of data silos by democratizing data access. In
the synchronous paradigm, the consuming application has to follow the API specification of the
application that produces the data. If applications only need to subscribe to a topic to receive data,
without worrying about how to call the data producer, there is far more flexibility in application
design, and the consumer can even be written in a different programming language than the
producer. This makes it possible for multiple teams with different backgrounds to use the data as
they see fit, unlocking more value streams for the organization.
Event-driven architecture also solves another problem that hinders access to data and slows
down development of machine learning applications: the fear that inexperienced team members
may accidentally delete data. Unlike records in traditional relational databases, which can be
updated or deleted, events are immutable and are simply appended to a topic. This is akin to
providing read-only access to the data; it is up to each consuming application to process the events
according to its use case. Furthermore, if certain data elements are sensitive and need to be
restricted, an application can mask or remove those elements and publish the sanitized events to
another topic that anyone in the organization can use, as sketched below. Achieving the equivalent
in a relational database would require complex row- and column-based access policies that
continually need to be updated, or would force the data owners to create and maintain multiple
copies of the same data source to serve different users.
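A minimal sketch of such a masking handler, again assuming the kafka-python client and illustrative topic and field names:

import json
from kafka import KafkaConsumer, KafkaProducer

# Assumed sensitive fields to strip before republishing.
SENSITIVE_FIELDS = {"email", "phone_number"}

consumer = KafkaConsumer(
    "customer-reviews",                        # restricted source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for event in consumer:
    # Drop sensitive fields and republish to an organization-wide topic.
    masked = {k: v for k, v in event.value.items() if k not in SENSITIVE_FIELDS}
    producer.send("customer-reviews-public", masked)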


We briefly touched on how event-driven architecture facilitates better application maintenance.
We will now expand on what this means in the context of machine learning specifically. Maintaining
a machine learning application is considerably more complicated than maintaining a traditional
software application. For example, reproducibility of model results is a critical aspect of managing
machine learning models. Data is typically stored in relational databases that keep only the latest
snapshot of the data, so it is not a trivial task to architect a solution that captures and versions the
data used to train machine learning models. In event-streaming platforms, the data contained
within an event is immutable, and events are totally ordered. This means applications simply need
to store the start and end offsets of the data used to train each model. Figure 2.12 below
demonstrates an example where v1 of a model was trained on data from offset 0 to offset 5, and
v2 was trained on data from offset 6 to offset 9.

Figure 2.12 Using offsets in topics to track model training
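To illustrate, here is a sketch of replaying the exact window used to train v1, assuming a single-partition topic and the kafka-python client; the topic name and the offsets mirror Figure 2.12 and are otherwise assumptions.

import json
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
partition = TopicPartition("customer-reviews", 0)  # single assumed partition
consumer.assign([partition])

start_offset, end_offset = 0, 5   # offsets recorded when model v1 was trained
consumer.seek(partition, start_offset)

training_events = []
for record in consumer:           # replay exactly the data v1 saw
    if record.offset > end_offset:
        break
    training_events.append(record.value)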

A downstream benefit of this feature is that it becomes a lot simpler to compare different model
versions. Since the data is always available, any analysis that needs to be made after the fact
becomes easier. Examples include comparing model parameters, configurations, evaluation
metrics, data distributions, relationships between the features and the target, feature importances,
etc. We can take this one step further and run comparisons against the training data and the data
used for inferences to troubleshoot issues in production. In fact, if the features used for training
and the features used for inferences are stored within the same topic, the same code can be used
to ingest data for training and inference, which reduces the time it takes to deploy a machine
learning application.
Capturing and analyzing application health metrics such as CPU and memory usage also
becomes easier when they are published as events that are stored in relevant topics. This opens up
the possibility of real-time alerts when unexpected issues arise.
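For example, a sidecar process could periodically publish host metrics as events. This sketch assumes the psutil library and an illustrative "health-metrics" topic.

import json
import time
import psutil
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Publish a snapshot of host CPU and memory usage as an event.
    producer.send("health-metrics", {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(),
        "memory_percent": psutil.virtual_memory().percent,
    })
    time.sleep(5)  # emit a metrics event every five seconds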


Another component of machine learning applications is collecting user feedback for model
evaluation and retraining. Figure 2.13 demonstrates an example where a front-end application
subscribes to the inferences topic and displays the model's results. Let's assume it is an application
that lists all the negative customer reviews that customer service associates need to triage. Next
to each result are buttons labeled "correct" and "incorrect" that the associate can optionally click
to provide feedback. This feedback is captured in a "feedback" topic and can then be used to
evaluate model performance and as data for retraining.

Figure 2.13 Using event-driven architecture to capture user feedback
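The portal's backend could publish each button click as an event. Here is a minimal sketch, assuming a "feedback" topic and illustrative event fields; the record_feedback helper is hypothetical.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def record_feedback(review_id: str, predicted: str, verdict: str) -> None:
    """Called when an associate clicks a feedback button (hypothetical hook)."""
    producer.send("feedback", {
        "review_id": review_id,
        "predicted_sentiment": predicted,
        "verdict": verdict,  # "correct" or "incorrect"
    })

record_feedback("r-1042", "negative", "correct")
producer.flush()  # make sure the event is delivered before exiting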

The examples above demonstrate how event-driven architecture makes it easier to build
responsive, low-latency real-time applications that use batch machine learning models.
It is not a stretch to argue that event-driven architecture is also a critical component of
applications that use real-time machine learning models. By design, real-time machine learning
models make a prediction first and then learn once the label is available, and in a real-time context
there is no control over when labels arrive. Consider an example where a machine learning model
decides which articles to present to a given user. The label for this use case is whether the user
found an article relevant, and we determine relevance by whether the user clicks on it. The user
click is an "event" that can fire off a process that stores the click, along with the article, to a
"labels" topic. A subscriber on the other end passes this information to the machine learning
model. We can also have another application that monitors how long an article sits on the user's
page and publishes a "not-relevant" event to the "labels" topic if the user has taken no action on
the article after a certain amount of time. All of this is presented in Figure 2.14 below.


Figure 2.14 Real-time machine learning application workflow
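A sketch of the label-generation side of this workflow, assuming a "user-events" topic carrying impression and click events, a "labels" topic, the kafka-python client, and an arbitrary 60-second relevance window; all names and the window length are assumptions.

import json
import time
from kafka import KafkaConsumer, KafkaProducer

TIMEOUT_SECONDS = 60   # illustrative relevance window
shown = {}             # article_id -> time the article was presented

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Wait up to one second for new impression/click events.
    for records in consumer.poll(timeout_ms=1000).values():
        for event in records:
            e = event.value  # assumed fields: type, article_id
            if e["type"] == "impression":
                shown[e["article_id"]] = time.time()
            elif e["type"] == "click":
                shown.pop(e["article_id"], None)
                producer.send("labels",
                              {"article_id": e["article_id"], "label": "relevant"})
    # Any article shown longer than the window with no click is not-relevant.
    now = time.time()
    for article_id, shown_at in list(shown.items()):
        if now - shown_at > TIMEOUT_SECONDS:
            del shown[article_id]
            producer.send("labels",
                          {"article_id": article_id, "label": "not-relevant"})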

As the examples above demonstrate, the traditional software architecture pattern for building
machine learning applications involves tight coupling between microservices that is not conducive
to real-time applications. Event-driven architecture, on the other hand, makes it possible to build
low-latency, responsive machine learning applications that seamlessly coordinate tasks such as
data ingestion, inference generation, metrics collection, and user feedback collection.
In this chapter, you have learned how to ingest data from event streams using message queuing
solutions. You have also learned how to create publishers that produce real-time data and
subscribers that consume real-time data. In Chapter 3, we will focus on how to use the data from
event streams to build a nowcasting model that not only generates inferences in real-time but also
continually learns from data as soon as it is available.

2.4 Summary
- Real-time machine learning requires access to both historical and real-time data.
- An event is a record that something happened at a specific time.
- Polling intervals affect the quantity of real-time data that is captured for machine learning.
- Message queues enable simple asynchronous communication between different processes.
- A publisher or producer is a process that produces events.
- A subscriber or consumer is a process that subscribes to events.
- Synchronous data transfer requires direct coordination between two parties.
- Asynchronous data transfer requires a buffer to store published data.
- Backpressure is caused by a subscriber processing events more slowly than a publisher, which reduces overall data throughput.
- At-least-once delivery ensures that every message is delivered at least once, but duplicate messages can be sent.
- At-most-once delivery prevents duplicates, but messages may get dropped.
- Event streams are append-only ordered logs of events and are sometimes called topics.
- Event brokers coordinate event delivery between many publishers and subscribers by persisting data to event streams.
- A subscriber can pre-fetch events before processing to increase data throughput, but this can also overload the subscriber.
- Event streams can be consumed in real-time or queried for historical data.
- Event-driven architecture enables data ingestion to operate concurrently with other important machine learning processes such as inference and user feedback collection.
