Table of Contents
Executive Summary
Introduction
Buying an ML Observability Platform
    How It Works
    End-to-End ML Observability
    Customization Capabilities
    Ease of Integration
    Bias & Fairness Capabilities
    Investigation Capabilities
    Rich, Contextual Data is Needed for Production-Grade Monitoring
    Costs
Building ML Monitoring with Open-Source Tools
    Implementation
    Solution Shortcomings
        Lack of Visualization & Root Cause Analysis
        No Data Segmentation
        No Explainability
        Critical Monitoring Limitations
        Customization Limitations
        Required Expertise
        Costs
Summary
Executive Summary
Organizations need visibility into their ML models across their life cycles and into their impact on the business. An ML observability platform monitors machine learning models as they move from the research and training stage to production. It is possible to build an equivalent system in-house from open-source tools, but the end result requires compromising on many areas, which are essential for production-grade monitoring.
[Capability matrix for a dedicated ML observability platform. All of the following are marked Supported: Centralized Dashboard, Performance Over Time, Explain Predictions, Performance Degradation, Compare Different Time Frames, Supported Model Types, Data Statistics, What-If Analysis, Detailed Alerts, Distribution Analysis, Scale, Anomaly Detection, Export Data Points to CSV, Compare Multiple Versions, A/B Testing, Training Set as Baseline, Champion-Challenger, Bias Monitoring, Code-Based Monitors, Retraining Triggering.]
[The same capability matrix for an ML monitoring system built with open-source tools, rating each of the capabilities above as Supported, Not Supported, or Complex Implementation.]
Introduction
An ML observability platform enables organizations to monitor their models, detect and alert on any issues within the models, explain their models, improve the models, and take action to remediate the risk before it impacts the business. Key capabilities include:

* Model Inventory: keeps track of all models with a single pane of glass
* Data Drift & Concept Drift Detection: enables early detection of model drift
* Bias & Fairness: ensures the model is compliant for all data populations
* and more
* Root Cause Analysis: once an alert is raised, this enables the data scientist to investigate and pinpoint the source of the issue
* Scale: the ability to support both current and future model observability needs

This document will discuss the build vs. buy decision for an enterprise ML observability strategy.
Buying an ML Observability Platform
How It Works
A dedicated platform automatically detects issues such as data drift, bias, data integrity issues, and performance degradation. This wide range of issues usually requires immediate action and can be easily detected with built-in monitors and alerts.
End-to-End ML Observability

[Capability matrix figure showing the platform's end-to-end coverage, including Bias Monitoring, Code-Based Monitors, and Retraining Triggering; see the full matrix in the Executive Summary.]
Customization Capabilities
Data scientists spend weeks or longer in research to achieve the best performance from their models during training and to meet their specific needs. Therefore, each model inherently has unique components – custom metrics, imbalanced data, sensitive populations within the data, and many more.
[Figure: a model serving script logging predictions, alongside a live monitoring chart of the drift score over time (Sep 3 to Sep 17) against a data drift threshold of 0.3, with drift alerts firing as the score crosses the threshold.]

model_serving.py:

    @app.post("/my-model/predict")
    def predict(request):
        # Preprocess request
        X = preprocess(request)

        # Perform inference
        y = model.predict(X)

        # Log prediction to Aporia
        aporia.log_prediction(X, y)

        return {"result": y}
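To make the customization point concrete, here is a minimal sketch of the kind of custom metric a team might track for one specific model. This is plain Python with pandas, not any particular vendor API, and the predictions DataFrame with its score and converted columns is hypothetical:

    import pandas as pd

    def top_decile_conversion_rate(predictions: pd.DataFrame) -> float:
        """Hypothetical custom metric: conversion rate among the 10% of
        predictions with the highest model score."""
        threshold = predictions["score"].quantile(0.9)
        top_decile = predictions[predictions["score"] >= threshold]
        return float(top_decile["converted"].mean())

A generic monitoring stack has no notion of such a metric; a dedicated platform should let you define and monitor it per model.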
Ease of Integration
Look for a solution that is built for scale, making it easy to integrate all types of models across teams and environments.

Bias & Fairness Capabilities

Organizations need to monitor their ML models in production for bias and fairness on key data segments.
[Figure: approval rates by data segment – 34% approved / 66% denied across all data, versus 7% approved / 93% denied within a specific segment.]
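For illustration, a minimal sketch of how segment-level approval rates like those in the figure can be computed with pandas; the gender and approved columns are hypothetical stand-ins for a sensitive attribute and a model decision:

    import pandas as pd

    # Hypothetical prediction log with a sensitive attribute.
    df = pd.DataFrame({
        "gender": ["F", "F", "M", "M", "M", "F"],
        "approved": [0, 1, 1, 1, 0, 0],
    })

    # Overall approval rate vs. approval rate per segment.
    print("All data:", df["approved"].mean())
    print(df.groupby("gender")["approved"].mean())

A bias monitor automates exactly this comparison and alerts when segments diverge.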
Investigation Capabilities
Look for a platform that integrates with Slack, Microsoft Teams, JIRA, New Relic, and more, as well as with your internal systems. Alerts should carry rich information – from the time the issue started, to the relevant features, to relevant metrics such as drift score and missing values and their behavior over time. Alerts should be accessible both from the user interface and from a webhook for integration with external systems.
[Figure: Aporia investigation capabilities, with Distributions and Data Stats views.]
Monitors should also support simple constraints (such as do not detect drift if there are fewer than X predictions). However, an alert is not where the story ends. There's a need to investigate further, get to the root cause of the issue, and fix it. In order to drill down and investigate the root cause, look for solutions that provide:

* Data Explorer
* Time-Series Investigation
[Figure: a prediction-level data explorer listing individual predictions with features such as Previously_Insured, Driving_License, and Age, each with an Explain action showing feature contributions (e.g., +50%, +25%, +17%) and a Re-Explain option.]
Rich, Contextual Data is Needed for Production-Grade Monitoring
Tracking model inputs and outputs is important, but it's not enough to really understand your data and model behavior. What you need to monitor isn't a single model in isolation, but the entire AI system. A few examples:

* You have a human labeling system, and you'd like to monitor how your model's predictions compare to the human labels.
* Your system contains several models and pipelines, and one of the models consumes the output of another.
* You have metadata that you don't want (or even is not allowed, e.g., gender) to include as a model feature, but still need for monitoring.

A production-grade platform lets you monitor the entire AI system or an ensemble of models. You can also easily monitor data that isn't part of the model features (such as gender), and monitor the business metrics the model affects.
Costs
A dedicated solution should not just watch over their models, it should help improve them, which will also improve model results and drive more revenue for the business. Data scientists will be able to spend their time on research rather than on infrastructure. With a dedicated solution in place, the organization will not need to invest engineering time in a system that will require long-term, continued maintenance to support it.
Building ML Monitoring with Open-Source Tools
Open-source monitoring tools are popular for infrastructure and application observability, but these solutions are not intended for monitoring machine learning models, so significant engineering work is required to adapt them.
Implementation
First, you need to write down the list of models that you have, the type of data each one consumes, and how each one is served. Then, you need to gather requirements from different data science teams so you can make sure the monitoring system will fit their needs. Before beginning the implementation, you need to try out various open-source tools and choose the ones that fit your requirements best.
The open-source tools for monitoring were built for tracking either logs or application metrics, not model behavior. To monitor model performance, you need to collect inference data. Therefore, the first step will be collecting this data from the serving environment and storing it in a data lake.
First, you'll need to set up a data lake that will be used to store the data. Then you'll need to instrument the serving environment and send the data to your data lake. While doing so, you have to be cautious and make sure your implementation doesn't create any delays and does not interfere with the model's functionality.
1. For streaming models (e.g., models that are served using a Flask / Django server), this can be done by publishing each prediction to a message queue topic.
2. For batch models (e.g., models that run on Airflow jobs), this can be done by adding a step to the job that writes the predictions to the data lake.

Lastly, you need to ensure that this instrumentation solution is able to deal with the scale of your production traffic (a minimal sketch of the streaming case follows).
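As a minimal sketch of the streaming case, assuming a Flask app and the kafka-python client; the topic name, broker address, and the preprocess and model objects are placeholders, not a prescribed setup:

    import json

    from flask import Flask, request
    from kafka import KafkaProducer  # pip install kafka-python

    app = Flask(__name__)
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    @app.post("/my-model/predict")
    def predict():
        X = preprocess(request.get_json())  # placeholder preprocessing
        y = model.predict(X)                # placeholder model

        # Send asynchronously so logging doesn't delay the response.
        producer.send("inference-logs",
                      {"features": X.tolist(), "prediction": y.tolist()})
        return {"result": y.tolist()}

A consumer on the other side of the topic then writes the records to the data lake.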
Now that you have your inference data stored in a data lake, you need to process it. Generic dashboarding tools were not intended to work on raw data. Therefore, before you can start visualizing it, you have to aggregate this data and derive some metrics from it. This means building a job that reads the data from the previous step and exposes the final metrics in a way that your visualization tool can consume (e.g., Grafana).
You have to take into account processing data at large scale to create these metrics. Moreover, you need to decide in advance which metrics you plan to calculate. Aggregation examples (a minimal sketch follows the list):

1. Statistical properties of each input & output (e.g., count, min, max, average).
2. Drift score: the statistical distance between the distribution of each input & output in training versus production.
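A minimal sketch of such an aggregation, assuming pandas and SciPy; the choice of Wasserstein distance as the drift score is an illustrative assumption, as is every column name:

    import pandas as pd
    from scipy.stats import wasserstein_distance

    def aggregate_feature(train: pd.Series, prod: pd.Series) -> dict:
        """Basic statistics plus a drift score for one numeric feature."""
        return {
            "count": int(prod.count()),
            "min": float(prod.min()),
            "max": float(prod.max()),
            "average": float(prod.mean()),
            "missing_ratio": float(prod.isna().mean()),
            # Statistical distance between training and production values.
            "drift_score": float(wasserstein_distance(train.dropna(),
                                                      prod.dropna())),
        }

Categorical features would need a different distance (e.g., over category frequencies), which is part of the complexity discussed later.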
Then, after calculating these metrics, you need to store them in a format that your metrics backend can read. Finally, now that you have some metrics in the right format, you are ready to visualize them using Grafana. For this, you should set up Prometheus and Grafana instances. After you have a running instance, you should connect it to the metrics storage. Once you have metrics flowing into Prometheus, you can start creating the first dashboard in Grafana to show various metrics. As there are many users with different needs, expect to build and maintain multiple dashboards.
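A minimal sketch of the exposure step, using the official prometheus_client library; the metric name, port, and the compute_drift_scores helper are hypothetical:

    import time

    from prometheus_client import Gauge, start_http_server

    drift_gauge = Gauge(
        "feature_drift_score",
        "Statistical distance between training and production distributions",
        ["model", "feature"],
    )

    # Prometheus scrapes http://<host>:8000/metrics on its own schedule.
    start_http_server(8000)

    while True:
        # compute_drift_scores() stands in for the aggregation job above.
        for feature, score in compute_drift_scores("my-model").items():
            drift_gauge.labels(model="my-model", feature=feature).set(score)
        time.sleep(60)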
Solution Shortcomings
Lack of Visualization & Root Cause Analysis

The proposed solution lacks, for example, a widget that displays two distributions in the same graph (e.g., for data drift between training and production). Visualizations like this are exactly what's needed to help with root cause analysis. These visualizations and other critical data views are simply unavailable. The lack of visualization and root-cause analysis tools means that, using this solution, investigating issues will take far longer and resolving them will be much harder.
No Data Segmentation
Models may appear to perform well on average, while in reality they are underperforming for certain data segments. For example, a model can look healthy overall while failing for a specific population.
Moreover, issues usually start by affecting a small portion of the data, which grows over time. Without the ability to visualize, monitor, and investigate specific segments, such issues go unnoticed until they affect the entire population in question. In order to find the root cause of a problem, a data scientist has to slice and dice the data and find the common denominator for the problematic predictions. Without data segmentation, it is much harder for data scientists to improve their models, and moreover, it increases MTTR and risk for the organization.
No Explainability
The proposed solution does not, and cannot, provide any explainability capabilities.
[Figure: a black box receiving inputs and emitting outputs. No explainability = black box outputs.]
The proposed solution therefore offers no way to answer questions about individual predictions – for example, when a sales manager can question the lead scoring they have received from a model. In regulated industries, companies must also be able to explain their decisions to their clients. While this is not an issue for every business, for those it affects it is a serious problem.
Critical Monitoring Limitations

The proposed solution does not offer monitoring at the prediction level. This means it cannot catch issues with an individual prediction, which can affect the business' bottom line (e.g., missing an outlier prediction).
Moreover, the proposed solution does not provide detection of bias and fairness issues. Ensuring that models do not have bias against a specific population of data is crucial for business owners and data scientists. The lack of such a detection capability may result in the model performing poorly for specific data populations without anyone noticing.
Customization Limitations
The open-source tools assume that every model exposes the same generic signals to be monitored. In the real world, each model has different metrics that need to be tracked. During training, data scientists craft custom metrics for each model to measure its performance against their specific needs.
The first step in the proposed solution is to collect inputs & outputs of models in production. This step can be very challenging on its own, for multiple reasons:

* Batch vs. Streaming: some models are batch, while others are streaming, and the required instrumentation can differ significantly (e.g., Airflow vs. Flask vs. MLFlow vs. RServe).
* Input Data Format: the input data of different models can be different. For example, some models use tabular data as inputs, some use text, and some use images.
* For models with text or image data, storing the embedding vector is often required, which adds significant volume.
* For tabular models with time series data, storing this data can require a great deal of storage and careful schema design.
In the second step, you aggregate this data and generate ML-specific metrics for all features. This varies greatly between different datasets and column types, and so is challenging.

For example, what action item is derived from the information that 3.7% of a feature's values are missing in production? If the training set also had a similar amount of missing values (e.g., 3.4%), then everything might be okay. But what if the training set had 0.002% missing values? To answer such questions, you need a baseline. One of the best baselines is the training dataset, so you can aggregate it too and collect metrics for that. But there are multiple challenges with that: first, training sets can easily become huge (we'll discuss data scale separately). Second, each column has a different data type (e.g., categorical vs. numerical), which calls for different statistics and drift measures. A minimal sketch of the baseline comparison appears below.
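A minimal sketch of that baseline comparison for missing values, in pandas; the threshold logic is an illustrative assumption:

    import pandas as pd

    def missing_value_alerts(train: pd.DataFrame, prod: pd.DataFrame,
                             ratio_threshold: float = 2.0) -> list:
        """Flag features whose production missing-value rate far exceeds
        the training baseline (e.g., 3.7% in production vs. 0.002% in
        training), rather than alerting on an absolute number."""
        alerts = []
        for col in train.columns:
            baseline = train[col].isna().mean()
            current = prod[col].isna().mean()
            if current > max(baseline * ratio_threshold, 0.01):
                alerts.append((col, baseline, current))
        return alerts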
Required Expertise
Building and operating this stack requires ongoing engineering expertise. The solution should be able to adapt to the business' future plans, which will require continued investment:

* As more models are developed and more use cases are covered by ML, the monitoring system will need to be extended and adapted to specific models.
Costs
Beyond the initial implementation, an in-house build carries ongoing costs in engineering time and management. In-house monitoring builds tend to be difficult to scale when the time comes, further plunging you into more technical debt or, in extreme cases, scrapping the project altogether. Moreover, you'll find that various mission-critical features are missing, and in many instances your in-house monitor will lag behind on the technological curve.
Summary
In order for businesses to get the most out of their ML models and to trust their predictions, data scientists must focus on researching and implementing the best ML models, to provide the most value for their business.

Any solution you use must be able to provide the organization with visibility into model behavior, alerting, investigation and root cause analysis, explainability, and more. While it is possible to assemble a solution from open-source tools, the end result requires compromising on many areas, which are essential for production-grade ML observability.
An open-source build lacks the explainability tools needed to fully comprehend the scope of issues that arise from models in production. It also lacks monitoring for individual prediction issues (e.g., outliers) and detection of bias and fairness issues, which are both crucial for business success and to instill trust in your AI.

Open-source tools can look like a less complicated way to start monitoring your models that comes with relatively low expenses up front. However, this can be quite misleading, as most in-house monitoring builds end up costing more in the long run once maintenance and missing capabilities are taken into account. ML monitoring is a critical tool to employ in your ML workflow, and both options can be made to work. But it's vital to remember that in order to trust your models' predictions, you need full visibility into how they behave in production.
Cloud-Native ML Observability
www.aporia.com