MLOps project — part 4b: Machine Learning Model Monitoring
by Isaac Kargar | Jan 2023 | Published in DevOps.dev
I like to emphasize the importance of the parts of MLOps beyond modeling using the following figure from a paper published by Google:

[Figure: the ML code is only a small box inside a much larger system of supporting infrastructure (source)]
As you see, the machine learning code and model development are just a small part
of having a machine learning-based product.
Machine learning models are at the heart of many modern applications, from
recommending products to customers to identifying objects in images. However,
even the best models can degrade over time, leading to poor performance and
inaccurate predictions. To ensure that your machine learning models continue to
perform well, it’s important to monitor them and identify when they need to be
retrained or replaced.
Let’s start with Vertex AI and see how it can help us with the Model Monitoring task.
Vertex AI
One way to do model monitoring is by using Vertex AI, a fully managed platform
for developing, deploying, and monitoring machine learning models on Google
Cloud. It provides a range of tools and services for building, training, and evaluating
machine learning models, as well as monitoring their performance over time. With
Vertex AI, you can easily create and deploy machine learning models using a variety
of popular frameworks, such as TensorFlow and PyTorch. The platform also
provides tools for optimizing and improving the performance of your models,
including the ability to automatically tune hyperparameters and understand how
your models are making predictions.
In addition to these features, Vertex AI also provides tools for monitoring the
performance of your models over time. This includes data quality checks, confusion
matrices, and feature importance plots, which can help you understand the
accuracy of your models and identify areas for improvement.
Another useful feature of Vertex AI is its AutoML service, which automatically selects and tunes a model for your dataset, including its hyperparameters. This can save time and improve model performance, as it lets you quickly test different configurations and choose the one that performs best.
Also, Vertex AI integrates with other Google Cloud services, such as BigQuery and
Cloud Storage, making it easy to store and access your data, and use it to train and
evaluate your models.
To get started with Vertex AI, you’ll first need to create a project on Google Cloud.
From there, you can use the Vertex AI dashboard to create and deploy your machine
learning models. The dashboard also provides tools and metrics for monitoring the performance of your models over time.
Let’s see how Vertex AI can help us with the model monitoring task in practice.

Model Monitoring checks the input data used for predictions for skew and drift in features, so you can keep your model performing at its best. Skew and drift detection is supported for both categorical and numerical features.
Model Monitoring will send you an email when the skew or drift for a model’s
feature goes above the threshold you choose. Additionally, you may check the
distributions of each feature over time to see whether you need to retrain your
model.
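To make this concrete, here is a minimal sketch of creating such a monitoring job with the google-cloud-aiplatform Python SDK. This is not the exact setup of this project; the project ID, endpoint, feature names, and thresholds below are all placeholders:

from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

# Placeholder project and location -- replace with your own.
aiplatform.init(project="my-project", location="us-central1")

# Email me when a feature's skew or drift crosses its threshold.
alert_config = model_monitoring.EmailAlertConfig(user_emails=["me@example.com"])

# Skew: compare serving traffic against the training data distribution.
skew_config = model_monitoring.SkewDetectionConfig(
    data_source="bq://my-project.my_dataset.training_table",  # placeholder table
    target_field="label",
    skew_thresholds={"age": 0.3, "country": 0.3},
)

# Drift: compare recent serving traffic against earlier serving traffic.
drift_config = model_monitoring.DriftDetectionConfig(
    drift_thresholds={"age": 0.3, "country": 0.3},
)

objective_config = model_monitoring.ObjectiveConfig(skew_config, drift_config)

job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="my-model-monitoring-job",
    endpoint="my-endpoint-id",  # placeholder endpoint
    objective_configs=objective_config,
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.8),
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=1),  # hours
    alert_config=alert_config,
)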
TensorBoard
TensorBoard (TB) is a Google open source project for machine learning experiment visualization. Vertex AI TensorBoard is an enterprise-ready, managed version of TensorBoard.
With Vertex AI TensorBoard, you can track, visualize, and compare ML experiments and
share them with your team.
[Screenshot: Vertex AI TensorBoard experiment tracking (source)]
Using TensorBoard in your code is easy and straightforward. Here is an example (source):
import tensorflow as tf
import datetime

# Load and normalize the MNIST dataset.
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def create_model():
    return tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

model = create_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Write logs to a timestamped directory and attach the TensorBoard callback.
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model.fit(x=x_train,
          y=y_train,
          epochs=5,
          validation_data=(x_test, y_test),
          callbacks=[tensorboard_callback])
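After training, you can launch TensorBoard locally and point it at the log directory to explore the run:

tensorboard --logdir logs/fit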
You can check this link to see more on how to use it on Vertex AI. Also, check here
for the API and how to use TensorBoard in your code.
At the time of writing, Vertex AI TensorBoard is priced at $300 per user per month, which is very expensive in my opinion. If you want to use something cheaper on GCP, you can check out my blog post on how to set up MLflow on GCP.
Resources
- Model Monitoring (Google Cloud documentation)
- vertex-ai-samples/model_monitoring.ipynb, GoogleCloudPlatform/vertex-ai-samples (github.com): sample code and notebooks for Vertex AI, the end-to-end machine learning platform on Google Cloud
- mlops-with-vertex-ai/02-experimentation.ipynb, GoogleCloudPlatform/mlops-with-vertex-ai (github.com)
Amazon SageMaker
Machine learning models are typically trained and evaluated using historical data, but real-world data may not look like the training data, especially as the model ages and the data distribution changes. For example, a feature’s units could change from Fahrenheit to Celsius, or your application could start sending your model null values, either of which has a big effect on the quality of your model. Or perhaps, in a real-world retail scenario, consumers’ purchasing preferences evolve over time. As we’ve discussed before, this gradual divergence between the model and the real world is called “model drift” or “data drift,” and it can have a big effect on prediction quality. Similarly, the model’s performance may decline over time, and this degradation of accuracy affects business outcomes. Continuous monitoring of the model’s performance is vital for proactively addressing this issue.
With this constant monitoring, you can figure out when and how often to retrain your machine learning model. Frequent retraining can be too expensive, but retraining too rarely means the model may not make the best predictions, so acting at the right time is what matters here.
When the rules and thresholds you’ve set are violated, model monitoring sends metrics to Amazon CloudWatch, which lets you set up alarms to audit and retrain models. Data drift and accuracy drift metrics are also kept in S3 buckets, and SageMaker Studio can be used to visualize them.

Let’s quickly get familiar with two AWS services that can be used to log and visualize metrics: SageMaker Studio and CloudWatch.
If you’re running any kind of application on Amazon Web Services (AWS), Amazon
CloudWatch can keep an eye on everything in real time. Metrics, which are
variables for measuring resources and applications, can be gathered and monitored
with CloudWatch.
Metrics for all of your active AWS services are automatically updated on the
CloudWatch homepage. You may also curate your own set of metrics to display on a
dashboard dedicated to your bespoke applications.
If a predetermined threshold is reached, you can set up alarms that will either notify you or trigger immediate action on the resources under observation. You can, for instance, monitor the CPU utilization and disk I/O operations per second of your Amazon EC2 instances to determine whether more instances need to be launched to accommodate an influx of new work. The same information can also be used to reduce costs by terminating underused instances.
CloudWatch lets you monitor the overall health of your system, including its
resource use, application performance, and operational status.
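To make the alarm mechanism concrete, here is a minimal boto3 sketch for the EC2 CPU example above; the instance ID and SNS topic ARN are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average CPU over two consecutive 5-minute periods exceeds 80%.
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-utilization",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts"],  # placeholder topic
)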
[Figure: overview of the SageMaker model monitoring flow (source)]

The flow starts with putting the trained model into use and ends with taking corrective action when drift is found. Here is the end-to-end architecture that corresponds to that flow.
[Figure: end-to-end model monitoring architecture (source)]
Let’s analyze this diagram in more detail. The first step is to deploy the trained
model.
[Figure: step 1, deploying the trained model (source)]
This step starts with the ground truth training data and a training job on SageMaker, which generates a model artifact. The trained model is then made available to consumers via a deployed SageMaker endpoint.

[Figure: the consuming application calling the SageMaker endpoint (source)]

Now, with the endpoint deployed, a consuming application can start sending requests and getting back predictions from the model. The request and response data are then captured and stored in Amazon S3.
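Capturing the endpoint traffic is typically configured at deployment time. Here is a minimal sketch with the SageMaker Python SDK, assuming a sagemaker Model object named model already exists; the bucket and endpoint names are placeholders:

from sagemaker.model_monitor import DataCaptureConfig

# Capture all requests and responses to S3 (placeholder bucket and paths).
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/data-capture",
)

# "model" is assumed to be a sagemaker.model.Model created from the training job.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-endpoint",
    data_capture_config=data_capture_config,
)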
In order to identify if there’s any kind of data drift, we need to have some baseline
data.
[Figure: running a baselining job on the training data (source)]
In the next step, we run a baselining job that generates statistics and constraints from the training data.
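A minimal sketch of such a baselining job with the SageMaker Python SDK; the IAM role and S3 paths are placeholders:

from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/my-sagemaker-role",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Profile the training data and suggest baseline statistics and constraints.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",  # placeholder dataset
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline",  # placeholder output path
    wait=True,
)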
[Figure: comparing captured inference data against the baseline (source)]
So, now that we have both the baseline details and the captured inference requests, we can compare them to identify any kind of drift.
[Figure: scheduled data drift monitoring job (source)]
Then we can create a data drift monitoring job that SageMaker runs periodically on a schedule we set. The job compares the inference requests against the statistics and constraints of the baseline. Each run of the monitoring job produces a violation report, saved in Amazon S3, a statistics report on the data collected during the run, and summary metrics and statistics that are sent to Amazon CloudWatch.
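Continuing the sketch from the baselining step, the scheduled job could be created like this; the schedule name, endpoint, and output path are placeholders:

from sagemaker.model_monitor import CronExpressionGenerator

# "monitor" is the DefaultModelMonitor from the baselining step above.
monitor.create_monitoring_schedule(
    monitor_schedule_name="my-data-drift-schedule",  # placeholder name
    endpoint_input="my-endpoint",  # the endpoint with data capture enabled
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    output_s3_uri="s3://my-bucket/monitoring-reports",  # placeholder
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)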
[Figure: outputs of the data drift monitoring job (source)]
Now we’re able to detect data quality drift. But what happens if the quality of the model itself changes, for example, if the accuracy of the model decreases? Let’s see how to use the Model Monitor capability to detect accuracy drift.
[Figure: model quality (accuracy drift) monitoring (source)]
The process for detecting accuracy drift is almost identical to the data drift
detection process.
[Figure: capturing predictions and ground truth labels (source)]
We collect the predictions made and the ground truth for those predictions, and compare the two. But what does “ground truth” for a prediction mean? That depends on the predictions your model makes and the business use case. Consider that you are observing a movie recommendation model. In this case, a possible ground truth label is whether or not the user actually watched the recommended movie, or whether they simply clicked on it without watching it.
With both the predictions captured and the ground truth provided by the model-consuming application, SageMaker executes a merge job to combine them. Again, the merge job is a recurring task that runs on a set schedule. Once the data is merged, it’s time to monitor the accuracy.
[Figure: scheduled model quality monitoring job (source)]
In this step, we create a model quality monitoring job that is executed periodically according to a schedule. The model quality job generates statistics, violations, and CloudWatch metrics. SageMaker Studio can also be used to view the metrics produced by the two monitoring jobs. For the model quality monitoring job, the generated metrics include accuracy, precision, and recall. SageMaker Model Monitor provides built-in classification and regression metrics, and we can also define our own custom metrics.
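A hedged sketch of a model quality schedule with the SageMaker Python SDK; it mirrors the data drift schedule but also points at the ground truth data, and all names and paths are placeholders:

from sagemaker.model_monitor import (
    CronExpressionGenerator,
    EndpointInput,
    ModelQualityMonitor,
)

quality_monitor = ModelQualityMonitor(
    role="arn:aws:iam::123456789012:role/my-sagemaker-role",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

quality_monitor.create_monitoring_schedule(
    monitor_schedule_name="my-model-quality-schedule",  # placeholder name
    endpoint_input=EndpointInput(
        endpoint_name="my-endpoint",  # placeholder endpoint
        destination="/opt/ml/processing/input_data",
        inference_attribute="prediction",  # field holding the model output
    ),
    ground_truth_input="s3://my-bucket/ground-truth",  # labels from the application
    problem_type="BinaryClassification",
    output_s3_uri="s3://my-bucket/model-quality-reports",  # placeholder
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)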
After detecting both data and model drifts, it’s time to take action on them.
[Figure: CloudWatch alerts driving corrective action (source)]
Both the data drift and the model quality monitoring jobs emit CloudWatch metrics. We can create CloudWatch alarms on these metrics with threshold values; if a threshold is violated, an alert is raised and we can decide what action to take, such as updating the training data or retraining and redeploying the model.
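For instance, an alarm on a per-feature drift metric could look like the boto3 sketch below, using the same put_metric_alarm call shown earlier. The namespace and metric name follow the documented pattern for Model Monitor data quality metrics, but you should verify the exact names your jobs emit; everything else is a placeholder:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the hourly drift distance for the "age" feature exceeds 0.1.
cloudwatch.put_metric_alarm(
    AlarmName="feature-age-baseline-drift",
    Namespace="aws/sagemaker/Endpoints/data-metrics",  # verify in your console
    MetricName="feature_baseline_drift_age",  # pattern: feature_baseline_drift_<name>
    Dimensions=[
        {"Name": "Endpoint", "Value": "my-endpoint"},  # placeholder
        {"Name": "MonitoringSchedule", "Value": "my-data-drift-schedule"},
    ],
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.1,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts"],  # placeholder topic
)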
[Figure: closing the loop, retraining from the ground truth training data (source)]
Now, if we decide to retrain the model, we complete the loop: we go back to the ground truth training data and train the model once more.
To learn more, please check the documentation. You can find some resources in the
following section.
Thank you for taking the time to read my post. If you found it helpful or enjoyable, please
consider giving it a like and sharing it with your friends. Your support means the world to
me and helps me to continue creating valuable content for you.
Resources