
ML Observability

Build vs. Buy


Should you build a solution in house?

Explainable AI

ML Monitoring

Root Cause Analysis

Open Source

Technical Debt


Table of Contents

Executive Summary

Introduction

Buying an ML Observability Platform

How It Works

End-to-End ML Observability

Customization Capabilities

Ease of Integration

Bias & Fairness Capabilities

Investigation Capabilities

Rich, Contextual Data is Needed for Production-Grade Monitoring

Costs

Building ML Monitoring with Open-Source Tools

Implementation

Solution Shortcomings

Lack of Visualization & Root Cause Analysis

No Data Segmentation

No Explainability

Critical Monitoring Limitations

Customization Limitations

Challenges in Building a Solution

Required Expertise

Costs

Summary


Executive Summary

ML Observability is the ability to obtain a deep understanding of your ML

models across their life cycles and their impact on the business. An ML

Observability tool provides data scientists, ML engineers, and business

owners with the capabilities to monitor, visualize, troubleshoot, and explain

machine learning models as they move from the research and training

stage to production.

An effective ML Observability solution empowers organizations to leverage

their ML models with confidence, and includes the following capabilities:

Model Inventory, Model Activity Tracking, Customizable Monitoring,

Prediction Drift, Data Drift Detection, Measuring Performance Metrics in

Production, Explainability & What-if Troubleshooting, Bias & Fairness,

Business Dashboards, Model Versions Comparison, Root Cause Analysis,

Scale & Integrations.

It is possible to build such a solution in-house leveraging open-source tools

such as Prometheus and Grafana. However, to do so requires the

organization to invest significantly in infrastructure, personnel, building

expertise, maintenance, and support. This is a large investment to make

and the end result requires compromising on many areas that are essential for successful and trustworthy ML Observability.

When building an in-house solution leveraging open-source technologies, the following shortcomings should be taken into account:

- Lack of Visualization & Root Cause Analysis: Data scientists require various visualization techniques to find the root cause of a specific problem in a model. These include metrics over time, with comparisons between metrics, often visualized over different data segments to understand the true impact of a behavior change on the model’s output. Similarly, visualizing distributions and comparing them across different model versions, data segments, and environments is another common method that assists with root cause analysis. These visualizations and other critical data visualization methods are not supported in most open-source solutions. The lack of visualization and root-cause analysis tools means that such a system is useless for investigating ML-related issues.

- No Data Segmentation: ML models often seem to be working adequately overall when in reality they are underperforming for certain data segments. For example, a demand forecasting model may be underperforming in a specific region or for a specific brand, and a fraud detection model may incorrectly flag specific browsers, or users buying specific products, as fraudsters. Moreover, issues usually start by affecting a small portion of the data, which grows over time. Without the ability to visualize, monitor, and investigate specific data segments, the effectiveness of the monitoring system is in question.

- No Explainability: Getting an explanation of the model’s predictions is important for various reasons, from providing an explanation to business owners, to staying compliant in regulated industries, to debugging a model’s predictions, and more.

- Critical Monitoring Limitations: From prediction-level limitations to bias and fairness limitations, lacking such detection capabilities may result in the model performing poorly for specific data populations, even when the model appears to be performing accurately overall.

- Customization Limitations: Model monitoring and observability customizations are critical for ensuring trust in your ML models. As each model is unique, the ability to customize monitors, customize metrics, create custom data segments, and customize views is essential for successful model monitoring.

Leveraging Existing Solutions

Today’s MLOps market provides organizations with an alternative option of


choosing a first-class ML Observability platform such as Aporia. This type
of solution provides organizations with all the critical capabilities necessary
for effective ML monitoring, with little to no investment required by the
organizations beyond a financial commitment. In other words, this enables
organizations to focus on building and crafting the best ML models, to
provide the most value for their business.

[Figure: End-to-End ML Observability Platform – a capability matrix organized into five categories: Setup and Management, Visibility, Explainability, Monitoring and Alerting, and Investigation. Listed capabilities include a centralized dashboard, model activity tracking, performance over time, data integrity and drift detection, prediction explainability and what-if analysis, data segment visualization and monitoring, bias monitoring, custom metrics and code-based monitors, alerts to email and MS Teams, and retraining triggering. Legend: Supported / Not Supported / Complex Implementation.]


Introduction

ML Observability is critical for the success of ML models in production and

for building trust in machine learning. By serving as guardrails for models,

ML Observability enables an organization to gain full visibility into their

models, detect and alert on any issues within the models, explain their

models, improve the models, and take action to remediate the risk before it

impacts the business.

An effective ML Observability solution empowers organizations to leverage

their ML models with confidence, and includes the following capabilities:

- Model Inventory: Keeps track of all models with a single pane of glass.

- Model Activity Tracking: Ensures the model is active in production.

- Customizable Monitoring: The ability to tailor monitoring to each unique model, use case, and scenario.

- Prediction Drift: Ensures predictions are trustworthy.

- Data Drift & Concept Drift Detection: Enables early detection of model drift.

- Measuring Performance Metrics in Production: Ensures the model is performing as intended, e.g., as it did in training and research.

- Explainability & What-if Troubleshooting: Enables business owners and data scientists to explain model predictions and simulate what-if scenarios to better understand model behavior.

- Bias & Fairness: Ensures the model is compliant for all data populations.

- Business Dashboards: Gives business owners the ability to understand the model’s impact on business results.

- Model Versions Comparison: Comparing and validating the best model versions in scenarios such as Champion-Challenger, A/B Testing, and more.

- Root Cause Analysis: Once an alert is raised, enables the data scientist to find the root cause ASAP instead of in weeks.

- Scale: The ability to support both current and future model observability needs.

- Integrations: Seamlessly fits into the overall MLOps organization strategy.

This document will discuss the build vs. buy decision for an enterprise ML

observability solution, highlighting pros and cons of each approach.

Buying an ML Observability Platform

How It Works

When purchasing an end-to-end ML observability solution, you should

ensure that your solution has the following capabilities: self-hosted

deployment, seamless integration with any ML infrastructure and workflow,

comprehensive monitoring, visibility, root cause analysis, and explainability

for your machine learning models in production. Furthermore, the most

effective ML observability solution will be highly customizable and easily

tailored to fit any ML use case and model.

As we all know, ML models can experience issues and anomalies, such as

data drift, bias, data integrity issues, and performance degradation. This

wide range of issues usually requires immediate action and can be easily

solved by data scientists and ML teams who leverage a customizable ML

monitoring solution that supports their specific use cases.

[Figure: End-to-End ML Observability Platform – the same capability matrix shown in the Executive Summary, spanning Setup and Management, Visibility, Explainability, Monitoring and Alerting, and Investigation.]

Customization Capabilities
Data scientists spend weeks or longer in research to achieve the best performance from their models during training and to meet their specific needs. As a result, each model inherently has unique components – custom metrics, imbalanced data, sensitive populations within the data, and more.

When deploying new models, data science teams require a range of


production ML monitors tailored to their specific models and use cases in
order to track and detect model activity, data drift, concept drift, statistical
metric change, missing values, and more, to ensure their models are
working as intended. These monitors are applied to production models and/
or production candidates (model versions that are not yet in production).

Additionally, if a certain monitoring capability is required beyond the built-in suite of monitors available, data scientists and ML engineers may prefer to create their own custom Python-code-based monitors to best support their needs.
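For illustration only, such a code-based monitor could be as simple as the following sketch; the function and the metric it checks (the share of low-confidence predictions versus training) are hypothetical assumptions, not a specific product API.

    # Hypothetical custom monitor: alert when the share of low-confidence
    # predictions in production grows well beyond what was seen in training.
    import numpy as np

    def low_confidence_monitor(production_scores: np.ndarray,
                               training_scores: np.ndarray,
                               confidence_threshold: float = 0.6,
                               max_ratio_increase: float = 1.5) -> bool:
        """Return True if the monitor should raise an alert."""
        prod_ratio = float(np.mean(production_scores < confidence_threshold))
        train_ratio = float(np.mean(training_scores < confidence_threshold))
        baseline = max(train_ratio, 1e-6)  # avoid division by zero
        return prod_ratio / baseline > max_ratio_increase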
[Figure: Aporia Full-Stack Observability Drift Alert Capabilities – a model-serving snippet (model_serving.py, below) that logs each prediction to Aporia, shown alongside a Live Monitoring chart of the data drift score over time (Sep 3–Sep 17) against a 0.3 threshold, with drift alerts raised when the score crosses it.]

    # model_serving.py
    @app.post("/my-model/predict")
    def predict(request):
        # Preprocess request
        X = preprocess(request)
        # Perform inference
        y = model.predict(X)
        # Log prediction to Aporia
        aporia.log_prediction(X, y)
        return {"result": y}


Ease of Integration

The idea is to make integrations as easy as possible. You want a monitoring

solution that is built for scale, making it easy to integrate all types of models

– batch or streaming, tabular or NLP, and hundreds or billions of predictions

– with a simple integration process.

Usually this process is streamlined and occurs automatically within the

machine learning platform for any new model.

Bias & Fairness Capabilities

Using your end-to-end ML observability solution, your team should be able

to monitor their ML models in production for bias and fairness on key data

segments by applying monitors to them. Due to a large variety of

compliance use cases, having the ability to customize data segments is

important. Your solution should enable you to democratize protected group

monitoring for business stakeholders by making it easy to configure,

visualize, and monitor data segments in a user-friendly interface.

Defining new segments of interest can be done dynamically by providing a

simple set of criteria for the segment of interest.

[Figure: Aporia Segment Capability – for the data segment city = “gotham” and 25 < age < 40, 7% of predictions are approved and 93% denied, compared with 34% approved and 66% denied for all remaining data.]
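To make the idea concrete, the segment above boils down to a simple filter over the prediction log. A minimal sketch using pandas, where the column and decision names mirror the example and are otherwise assumptions:

    import pandas as pd

    # Prediction log: model inputs plus the decision made for each request
    predictions = pd.DataFrame({
        "city": ["gotham", "metropolis", "gotham"],
        "age": [31, 52, 27],
        "decision": ["denied", "approved", "denied"],
    })

    # Segment of interest: city = "gotham" and 25 < age < 40
    segment = predictions.query('city == "gotham" and age > 25 and age < 40')
    rest = predictions.drop(segment.index)

    # Compare approval rates inside and outside the segment
    print("segment approval rate:", (segment["decision"] == "approved").mean())
    print("rest-of-data approval rate:", (rest["decision"] == "approved").mean())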


Investigation Capabilities

Any ML observability solution that you employ should provide an advanced

customizable alerting system, where alerts can be created as actions to

various ML-specific production monitors. Ideally, it has native integrations

with Slack, Microsoft Teams, JIRA, New Relic, and more, as well as

Webhooks to integrate with existing alerting and production ticketing

systems.

Preferably, you would want a monitoring solution that provides detailed

information within the alert – from the time it started, the relevant features,

relevant metrics such as drift score and missing values and their

corresponding thresholds, to the affected data points (including export to

CSV for further investigation in Jupyter Notebook), and more. This

information should be available both from the ML observability solution’s

user interface and from Webhook for integration with external systems.

Additionally, your monitoring system should provide a large range of

visibility and investigation tools to offer additional information and support

comprehensive root-cause analysis.

[Figure: Aporia Investigation Capabilities – views for distributions, performance, data segments, and data stats.]


Your alerts should be highly customizable to avoid false positives – from

simple constraints (such as not detecting drift if there are fewer than X predictions, or only triggering a maximum of Y alerts per day), up to advanced

anomaly detection algorithms.

Getting an alert for a production ML event is important, yet, it is not where

the story ends. There’s a need to investigate further, get to the root cause

as quickly as possible, and remediate.

In order to drill down and investigate the root cause, look for solutions with a wide range of visualization capabilities, including:

- Data Explorer

- Prediction Explainability and What-if Simulator

- Time-Series Investigation

- Data Metrics & Stats

- Distribution Analysis, and more

These capabilities make it easy to investigate ML-related issues and get to

the root cause as quickly as possible.

[Figure: Aporia Data Point Explainer – a table of individual data points (ID, Age, Driving_License, Annual_Premium, Region_Code, predicted probability that the customer will buy insurance, and actuals) with an Explain action per row, alongside a per-prediction breakdown of each feature’s impact on the prediction (e.g., Previously_Insured = True: +50%, Driving_License = True: +25%, Age = 50: +17%), with a Re-Explain option.]


Rich, Contextual Data is Needed for Production-Grade Monitoring

Tracking model inputs and outputs is important, but it’s not enough to really

understand your data and model behavior. What you need to monitor isn’t a

single model – but an entire AI system.

A few examples:

1. You have a human labeling system, and you’d like to monitor how your model’s output compares to the human labels, to get real performance metrics for your model.

2. Your system contains several models and pipelines, and one model’s output is used as an input feature for a subsequent model. Underperformance in the first model may be the root cause of underperformance in the second model, and your monitoring system should understand this dependency and alert you accordingly.

3. You have actual business results (e.g., whether the ad your recommendation model suggested was actually clicked) – this is a very important metric for measuring your model’s performance and is relevant even if the input features never really changed.

4. You have metadata that you don’t want (or are not allowed, e.g., race/gender) to use as an input feature, but you do want to track it for monitoring, to make sure the model is not discriminating unintentionally on that data field.

By using a dedicated ML observability platform, you can monitor an entire

AI system or an ensemble of models. You can also easily monitor data that

isn’t part of the model features (such as gender), monitor the business

outcome, and much more.
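A rough sketch of what such contextual logging could look like; the helper, file paths, and field names here are hypothetical stand-ins rather than a specific SDK, but they show how prediction-time data and later business outcomes can be tied together by a shared prediction id:

    import uuid, json, datetime

    def log_record(path: str, record: dict) -> None:
        """Append one JSON record to a local file (stand-in for a data lake)."""
        record["logged_at"] = datetime.datetime.utcnow().isoformat()
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    # At prediction time: inputs, prediction, and metadata that is NOT a feature
    prediction_id = str(uuid.uuid4())
    log_record("predictions.jsonl", {
        "prediction_id": prediction_id,
        "features": {"age": 34, "annual_premium": 37140},
        "metadata": {"gender": "F"},  # tracked for fairness, not used as input
        "prediction": 0.82,
    })

    # Days later, when the business outcome (e.g., the ad was clicked) is known
    log_record("actuals.jsonl", {
        "prediction_id": prediction_id,
        "actual": 1,
    })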


Costs

An effective ML monitoring system will not only help organizations monitor their models; it should also help improve them, which in turn improves model results and drives more revenue for the business. Data scientists will be able

to focus more on building and scaling an organization’s AI instead of

worrying about underperformance.

With a dedicated solution in place, the organization will not need to utilize

essential resources in house to build a solution over an extensive period of

time that will require long-term, continued maintenance to support it. One of

the benefits of procuring a built-for-scale ML monitoring platform is quick

and easy deployment, alongside enabling an organization to focus on the

models they create and the value they provide.

Dedicated ML observability solutions should be built for data scientists, ML

engineers, and business stakeholders to support their many ML use cases

and fortify trust in their models to generate critical business predictions.

Building ML Monitoring with Open-Source Tools

The open-source community offers multiple solutions for monitoring

application workloads, like Grafana and Prometheus. These tools are

intended for monitoring application logs and application metrics rather than machine learning models. Nevertheless, we will explore a possible solution that uses Prometheus and Grafana to achieve that goal.

Implementation

The following implementation is a proposal for a basic ML Observability

solution built from scratch, using open-source tools:

Phase 0 - Planning & Design

Before starting this journey of building a monitoring solution for ML models,

you need to write down the list of models that you have, the type of data

that is being used, and who will be using this system.

Then, you need to gather requirements from different data science teams

so you can make sure the monitoring system will fit their needs.

Before beginning the implementation, you need to try out various open-

source tools to explore their advantages and disadvantages and come up

with a proposed architecture for the solution.

Phase 1 - Instrumentation & Data Collection (In-House Development)

The open-source tools for monitoring were built for tracking either logs or

metrics. However, in order to be used for monitoring ML models and their

performance, you need to collect inference data. Therefore, the first step

will be collecting this data from the serving environment and storing it into a

data lake.


First, you’ll need to set up a data lake that will be used to store the data and

decide on the format that will be used for it.

Then, you need to implement a solution that could be instrumented within

the serving environment, and send the data to your data lake. While doing

so, you have to be cautious and make sure your implementation doesn’t

create any delays and does not interfere with the model's functionality.

Moreover, you will need a separate implementation of the instrumentation logic for every type of serving you have:

1. For streaming models (e.g., models served behind a Flask / Django server), this can be done by streaming predictions to an Apache Kafka topic.

2. For batch models (e.g., models that run as Airflow jobs), this can be done by saving the dataframe directly to the data lake.

Lastly, you need to ensure that this instrumentation solution is able to deal

with large amounts of data without causing any issues.
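A minimal sketch of both instrumentation paths, assuming a local Kafka broker (via the kafka-python package) and an S3-style data lake path; the topic name and bucket are placeholders:

    import json
    import pandas as pd
    from kafka import KafkaProducer  # assumes the kafka-python package

    # Streaming models: publish each prediction to a Kafka topic that a
    # downstream job drains into the data lake.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def log_streaming_prediction(features: dict, prediction: float) -> None:
        producer.send("model-inferences", {"features": features,
                                           "prediction": prediction})

    # Batch models: the scoring job already holds a dataframe, so write it
    # directly to the data lake, partitioned by date (path is a placeholder).
    def log_batch_predictions(df: pd.DataFrame, run_date: str) -> None:
        df.to_parquet(f"s3://my-data-lake/inferences/date={run_date}/part.parquet")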

Phase 2 - Metrics Calculation Service (In-House Development)

Now that you have your inference data stored in a data lake, you need to

prepare it for visualization. As mentioned above, Grafana and Prometheus

were not intended to work on raw data. Therefore, before you can start

visualizing it, you have to aggregate this data and derive some metrics

from it.

In this step, you’ll have to implement a service that constantly aggregates

the data from the previous step, and exposes the final metrics in a way that

can be consumed for dashboarding and alerting (e.g Prometheus &

Grafana).


You have to take into account processing data at a large scale to create these metrics. Moreover, you need to decide in advance which metrics you plan to calculate.

Aggregation examples:

1. Statistical properties of each input & output (e.g., count, min, max, average).

2. Drift Score: the statistical distance between the distribution of each input & output and its distribution in a certain baseline (e.g., the training set).

3. Performance metrics such as AUC ROC, F1 Score, etc.

Then, after calculating these metrics, you need to store them in a format

that can be consumed by Prometheus and Grafana.
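As a sketch of what such a metrics service might look like (assuming the inference data and the training set can be loaded as dataframes, and using placeholder paths, model, and feature names), the service below derives a per-feature drift score against the training baseline using a Jensen-Shannon distance over binned histograms and exposes it as a Prometheus gauge via prometheus_client:

    import time
    import numpy as np
    import pandas as pd
    from scipy.spatial.distance import jensenshannon
    from prometheus_client import Gauge, start_http_server

    DRIFT_SCORE = Gauge("ml_feature_drift_score",
                        "Drift vs. training baseline", ["model", "feature"])

    def drift_score(production: pd.Series, baseline: pd.Series, bins: int = 20) -> float:
        """Jensen-Shannon distance between binned distributions of a numerical feature."""
        edges = np.histogram_bin_edges(baseline.dropna(), bins=bins)
        p, _ = np.histogram(production.dropna(), bins=edges, density=True)
        q, _ = np.histogram(baseline.dropna(), bins=edges, density=True)
        return float(jensenshannon(p + 1e-9, q + 1e-9))

    if __name__ == "__main__":
        start_http_server(8000)  # Prometheus scrapes this endpoint
        baseline = pd.read_parquet("training_set.parquet")          # placeholder paths
        while True:
            recent = pd.read_parquet("recent_inferences.parquet")
            for feature in ["age", "annual_premium"]:
                DRIFT_SCORE.labels(model="demo", feature=feature).set(
                    drift_score(recent[feature], baseline[feature]))
            time.sleep(300)  # recompute every 5 minutes

Prometheus then scrapes the metrics endpoint on port 8000 on its regular schedule, and the gauge becomes queryable for dashboards and alerts.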

Phase 3 - Creating a Dashboard (Prometheus + Grafana)

Finally, now that you have some metrics in the right format, you are ready to

visualize them using Grafana. For this, you should set up Prometheus and

Grafana instances.

After you have a running instance, you should connect it with the metrics

provider so you can start creating dashboards.

Once you have metrics flowing into Prometheus, you can start creating the

first dashboard in Grafana to show various metrics. As there are many users

and models, there's a need to create a dashboard for each one, so

everyone has exactly what they need.
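For example, a Grafana panel backed by the Prometheus data source could chart the drift gauge sketched earlier with a query along these lines (the metric and label names follow that sketch and are assumptions):

    # Drift score over time for one feature of one model
    ml_feature_drift_score{model="demo", feature="age"}

    # The same score averaged over the last hour, to smooth out noise
    avg_over_time(ml_feature_drift_score{model="demo", feature="age"}[1h])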

Phase 4 - Monitor Metrics & Alert on Anomalies (Alertmanager)

Using Prometheus’ Alertmanager, you can define alerts that fire when a metric rises above or falls below a specific threshold (e.g., a drop in model activity).
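A rule file for this might look roughly like the following, assuming the metrics service also exports a counter of served predictions (the metric name model_predictions_total is an assumption, not something the earlier sketch defines):

    groups:
      - name: ml-model-monitoring
        rules:
          - alert: ModelActivityDrop
            # Fires if the model served fewer than ~10 predictions per second
            # (averaged over 5 minutes) for 30 consecutive minutes.
            expr: rate(model_predictions_total[5m]) < 10
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Prediction volume dropped below the expected threshold"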


Solution Shortcomings

Lack of Visualization & Root Cause Analysis

Grafana dashboards work well for traditional engineering metrics such as

CPU and RAM. However, it is very hard to create effective ML dashboards

in Grafana. As an example, Grafana doesn’t provide a way to create a

widget that displays two distributions in the same graph (e.g for Data

Scientists who want to compare a feature distribution in training vs.

production).

Data scientists require various visualization techniques to find the root cause of a specific problem in a model. These include metrics over time, with comparisons between metrics, often visualized over different data segments to understand the true impact of a behavior change on the model’s output. Similarly, visualizing distributions and comparing them across different model versions, data segments, and environments is also a common visualization method that helps with root cause analysis. These visualizations and other critical data visualization methods are not supported in Grafana.

The lack of visualization and root-cause analysis tools means that using

such a system is useless for investigating ML-related issues. It might help surface very elementary changes; however, the root-cause analysis process

would still require spending weeks or months trying to understand the

impact of an issue on the model results.

No Data Segmentation

Many times ML models may seem to be working adequately overall, when in

reality they are underperforming for certain data segments. For example, a

demand forecasting model may be underperforming in a specific region, or

for a specific brand. A fraud detection model may identify specific

browsers, or users buying specific products wrongfully as fraudsters.


Moreover, issues usually start by affecting a small portion of the data, which

grows over time. Without the ability to visualize, monitor, and investigate

specific data segments, the effectiveness of the monitoring system is in

question.

Data segmentation is also crucial for investigating production issues. In

order to find the root cause of a problem, a data scientist has to slice and

dice the data, and find the common denominator for the problematic

predictions.

Missing these capabilities limits data scientists’ ability to investigate and improve their models and, moreover, increases MTTR and risk for the organization.

No Explainability

The proposed solution does not provide, and cannot provide, any explainability capabilities.

[Figure: No explainability = black-box outputs – inputs enter the model as a black box and predictions come out with no explanation.]


Getting an explanation of the model’s predictions is important for various reasons:

1. Business stakeholders: The end users of the ML model may question the results they receive. For example, a marketing or sales manager may question the lead scoring they have received from a model. Providing them with an explanation gives them a clear understanding of the reasoning behind the prediction and, more importantly, lets them trust these predictions in the long run.

2. Regulation: In regulated industries, organizations are obligated to explain their decisions to their clients. While this is not an issue for traditional software, the black-box characteristics of AI make it a real challenge. Without explainability capabilities, ML cannot be used for these use cases.

3. Debugging: Debugging production code is not an easy task; debugging production models is even more challenging. Using explainability tools can significantly reduce the time it takes to debug and resolve a problem.

Critical Monitoring Limitations

The proposed solution does not offer monitoring at the prediction level.

Prediction-level monitoring is critical in many cases, for example when a specific prediction is an outlier. Identifying outliers is important for many reasons: an outlier may indicate bad data, or it may indicate a problematic prediction, which can affect the business’ bottom line (e.g., missing an outlier may mean missing an attempted fraud).

Moreover, the proposed solution does not provide detection of bias and

fairness issues, resulting in a lack of compliance with regulations. Ensuring

models do not have bias against a specific population of data is crucial for

business owners and data scientists. The lack of such a detection capability

may result in the model performing poorly for specific data populations,

even when it appears that the model overall is performing accurately.


Customization Limitations

The proposed solution doesn’t support customization of the metrics being

monitored. In the real world, each model has different metrics that need to

be calculated over different time windows and against different baselines.

Model monitoring and observability customizations are critical for ensuring

trust in your ML models. As each model is unique, the ability to customize

monitors, customize metrics, create custom data segments and customize

views, is essential for successful model monitoring.

For each unique model, the model’s success is measured differently. In

training, data scientists craft custom metrics for each model to measure its

performance and ensure its quality. This, in turn, needs to be translated to production monitoring. Lacking these capabilities will result in model monitoring that cannot be trusted to reflect each model’s true performance.

Challenges in Building a Solution

The first step in the proposed solution is to collect inputs & outputs of

models in production. This step can be very challenging on its own, for

multiple reasons:

- Batch vs. Streaming: Some models are batch, while others are streaming. Data collection is usually different in these two modes.

- Model Serving Environment Variety: The model serving environment can differ significantly (e.g., Airflow vs. Flask vs. MLflow vs. RServe).

- Scale: Some models make a huge number of predictions, which makes it very challenging to collect and store this data at scale.

- Input Data Format: The input data of different models can be different. For example, some models use tabular data as inputs, some use text data, and others use image data. For models with text or image data, storing the embedding vector is essential for drift detection. For tabular models with time-series data, storage can require a more efficient time-series database.


In the second step, you aggregate this data and generate ML-specific metrics. This also presents multiple challenges:

- Calculating drift scores can differ for categorical vs. numerical features. For categorical features, a statistical distance such as Hellinger is usually more appropriate; for numerical features, a statistical distance such as Jensen-Shannon can be more relevant. This varies greatly between datasets and column types, so selecting the correct statistical distance for drift detection can be challenging (see the sketch after this list).

- Calculating metrics without a baseline is usually meaningless, so it’s necessary to calculate them against a baseline as well. For example, what action item is derived from the information that 3.7% of a specific column’s values are missing? If the training set also had a similar share of missing values (e.g., 3.4%), then everything might be okay. But what if the training set had 0.002% missing values? That could indicate a more serious issue. Having a baseline for monitoring is therefore critical. Supporting many different baselines is best, but more complicated to achieve. One of the best baselines is the training dataset, so you can aggregate it too and collect metrics for it. There are multiple challenges with that, however: first, training sets can easily become huge (we’ll discuss data scale challenges in a minute); second, you need to make sure their schema can be mapped to the production data’s schema.

- Calculating drift scores at scale is difficult. As of March 2022, there isn’t an effective open-source library for calculating statistical distances using a distributed computation engine such as Spark, so you’ll need to implement these from scratch.

- Choosing the correct thresholds for alerting is difficult. Each column has a different data type (e.g., categorical vs. numerical) or different data behavior, so choosing relevant thresholds that minimize alert false positives and alert fatigue is challenging.
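To make the first two challenges concrete, here is a rough sketch (not a production implementation) of a Hellinger distance for a categorical feature against its training baseline, together with the kind of missing-value comparison described above; feature names and data loading are left to the caller:

    import numpy as np
    import pandas as pd

    def hellinger_distance(production: pd.Series, baseline: pd.Series) -> float:
        """Hellinger distance between category frequencies in production vs. training."""
        categories = sorted(set(baseline.dropna()) | set(production.dropna()))
        p = production.value_counts(normalize=True).reindex(categories, fill_value=0.0)
        q = baseline.value_counts(normalize=True).reindex(categories, fill_value=0.0)
        return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

    def missing_ratio_vs_baseline(production: pd.Series, baseline: pd.Series) -> tuple:
        """A missing-value ratio only becomes actionable next to its training baseline."""
        return float(production.isna().mean()), float(baseline.isna().mean())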


Required Expertise

- To address all of the challenges mentioned above, expertise in building, customizing, scaling, and designing such a solution is required. The solution should be able to adapt to the business’ future plans, which will require maintaining this expertise over time.

- As more models are developed and more use cases are covered by ML models, the solution should be able to support the future needs for monitoring these models. Each model will require specific metrics, and sometimes different drift detection methods, each of which must be coupled to the specific model.

Costs

The cost of building even a very basic ML observability platform for an

enterprise is immensely high, and requires a unique skillset. Building such a

system requires a group of people with experience in Software Engineering,

Software Architecture, Data Engineering, Data Science and Product

Management.

Assuming a qualified team with the skills and experience as mentioned

above is employed, this is a months-long project that requires ongoing

maintenance and support, leading to a multi-year project with a technical

debt that will continue to grow over time. Additionally, in-house ML

monitoring builds tend to be difficult to scale when the time comes, further

plunging you into more technical debt or, in extreme cases, forcing you to scrap the project altogether. Moreover, you’ll find that various mission-critical integrations will be difficult to implement due to limited capabilities, and in

many instances your in-house monitor will lag behind on the technological

side, missing out on cutting-edge advances in the field.

For a business in which ML monitoring is necessary but not its core purpose, taking on such a project is costly, requires a significant investment in resources, and often results in a monitoring solution that is inadequate to support its needs.


Summary

In order for businesses to get the most out of their ML models and to trust

them with making critical business decisions, a robust, end-to-end ML

observability solution needs to be in place. How you get to that monitoring solution is the question.

Today’s MLOps market provides organizations with an alternative option to

taking the in-house build route by choosing to procure a first-class ML

Observability platform. This solution provides organizations with all the

capabilities necessary for ML monitoring and explainability, with little to no

investment required by the organizations beyond a financial commitment. In

other words, this enables organizations to focus on building and

implementing the best ML models, to provide the most value for their

business.

Any solution you use must be able to provide the organization with visibility,

monitoring, root cause analysis, and explainability capabilities that are

required for successful ML monitoring in production. Moreover, the solution

must be scalable, easily usable by various stakeholders, integrate seamlessly

with the organization’s MLOps pipeline, leverage cutting-edge technology,

and more.

It is possible to build such a solution in house leveraging open-source tools

such as Prometheus and Grafana. However, to do so requires the

organization to invest significantly in infrastructure, personnel, building

expertise, maintenance, and support. This is a large investment to make and

the end result requires compromising on many areas that are essential for successful and trustworthy ML Observability.

Another aspect to consider when weighing an in-house build against buying one is that building is often insufficient in providing data scientists and ML teams with efficient visualization, investigation, and

explainability tools to fully comprehend the scope of issues that arise from

ML models in production. It also fails to offer required monitoring for

prediction-level issues (e.g., outliers) or detection of bias and fairness issues, which

are both crucial for business success and to instill trust in your AI.

Additionally, you won’t be able to compare model versions, support various

customizations, or monitor different stages across the data pipeline using

the proposed solution, which impacts model performance in production.

Depending on your needs, open-source tools tend to be a less complicated way to start monitoring your models, with relatively low up-front expenses. However, this can be quite misleading, as most in-house monitoring builds end up costing more in the long run once technical debt and functional limitations are considered.

As you know by now, whether you procure or build your monitor, it is a critical tool to employ in your ML workflow, and both options can be

successful in the short run. However, as your use of ML models increases,

it's vital to remember that in order to trust your models’ predictions, you

should aim for reliability in your ML observability platform.
