
ML Observability

Build vs. Buy


Should you build a solution in house?

Explainable AI

ML Monitoring

Root Cause Analysis

Open Source

Technical Debt


Table of Contents

Executive Summary

Introduction

Buying an ML Observability Platform

How It Works

End-to-End ML Observability

Customization Capabilities

Ease of Integration

Bias & Fairness Capabilities

Investigation Capabilities

Rich, Contextual Data is Needed for Production-Grade Monitoring

Costs

Building ML Monitoring with Open-Source Tools

Implementation

Solution Shortcomings

Lack of Visualization & Root Cause Analysis

No Data Segmentation

No Explainability

Critical Monitoring Limitations

Customization Limitations

Challenges in Building a Solution

Required Expertise

Costs

Summary


Executive Summary

ML Observability is the ability to obtain a deep understanding of your ML

models across their life cycles and their impact on the business. An ML

Observability tool provides data scientists, ML engineers, and business

owners with the capabilities to monitor, visualize, troubleshoot, and explain

machine learning models as they move from the research and training

stage to production.

An effective ML Observability solution empowers organizations to leverage

their ML models with confidence, and includes the following capabilities:

Model Inventory, Model Activity Tracking, Customizable Monitoring,

Prediction Drift, Data Drift Detection, Measuring Performance Metrics in

Production, Explainability & What-if Troubleshooting, Bias & Fairness,

Business Dashboards, Model Versions Comparison, Root Cause Analysis,

Scale & Integrations.

It is possible to build such a solution in-house leveraging open-source tools

such as Prometheus and Grafana. However, to do so requires the

organization to invest significantly in infrastructure, personnel, building

expertise, maintenance, and support. This is a large investment to make

and the end result requires compromising on many areas that are essential for successful and trustworthy ML Observability.

When building an in-house solution leveraging open-source technologies, the following shortcomings should be taken into account:

- Lack of Visualization & Root Cause Analysis: Data scientists require various visualization techniques to find the root cause of a specific problem in a model. These include metrics over time, with comparisons between metrics, often visualized over different data segments to understand the true impact of a behavior change on the model’s output. Similarly, visualizing distributions and comparing them across different model versions, data segments, and environments is another common method that assists with root cause analysis. These visualizations and other critical data visualization methods are not supported in most open-source solutions. The lack of visualization and root-cause analysis tools means that such a system is useless for investigating ML-related issues.

- No Data Segmentation: ML models often seem to be working adequately overall when in reality they are underperforming for certain data segments. For example, a demand forecasting model may be underperforming in a specific region or for a specific brand, and a fraud detection model may incorrectly flag specific browsers, or users buying specific products, as fraudsters. Moreover, issues usually start by affecting a small portion of the data, which grows over time. Without the ability to visualize, monitor, and investigate specific data segments, the effectiveness of the monitoring system is in question.

- No Explainability: Getting an explanation of the model’s predictions is important for various reasons, from providing an explanation to business owners, to staying compliant in regulated industries, to debugging a model’s predictions, and more.

- Critical Monitoring Limitations: From prediction-level limitations to bias and fairness limitations, lacking such detection capabilities may result in the model performing poorly for specific data populations, even when the model appears to be performing accurately overall.

- Customization Limitations: Model monitoring and observability customizations are critical for ensuring trust in your ML models. As each model is unique, the ability to customize monitors, customize metrics, create custom data segments, and customize views is essential for successful model monitoring.

Leveraging Existing Solutions

Today’s MLOps market provides organizations with an alternative option of


choosing a first-class ML Observability platform such as Aporia. This type
of solution provides organizations with all the critical capabilities necessary
for effective ML monitoring, with little to no investment required by the
organizations beyond a financial commitment. In other words, this enables
organizations to focus on building and crafting the best ML models, to
provide the most value for their business.

[Figure: End-to-End ML Observability Platform – a capability matrix organized into five categories: Setup and Management, Visibility, Explainability, Monitoring and Alerting, and Investigation. Listed capabilities include a centralized dashboard, model activity tracking, performance over time, data integrity and drift detection, prediction explainability and what-if analysis, data segment visualization and monitoring, bias monitoring, custom metrics and code-based monitors, alerts to email and MS Teams, and retraining triggering. Legend: Supported / Not Supported / Complex Implementation.]


Introduction

ML Observability is critical for the success of ML models in production and

for building trust in machine learning. By serving as guardrails for models,

ML Observability enables an organization to gain full visibility into their

models, detect and alert on any issues within the models, explain their

models, improve the models, and take action to remediate the risk before it

impacts the business.

An effective ML Observability solution empowers organizations to leverage

their ML models with confidence, and includes the following capabilities:

- Model Inventory: Keeps track of all models with a single pane of glass.

- Model Activity Tracking: Ensures the model is active in production.

- Customizable Monitoring: The ability to tailor monitoring to each unique model, use case, and scenario.

- Prediction Drift: Ensures predictions are trustworthy.

- Data Drift & Concept Drift Detection: Enables early detection of model drift.

- Measuring Performance Metrics in Production: Ensures the model is performing as intended, e.g., as it did in training and research.

- Explainability & What-if Troubleshooting: Enables business owners and data scientists to explain model predictions and simulate what-if scenarios to better understand model behavior.

- Bias & Fairness: Ensures the model is compliant for all data populations.

- Business Dashboards: Gives business owners the ability to understand the model’s impact on business results.

- Model Versions Comparison: Comparing and validating the best model versions in scenarios such as Champion-Challenger, A/B Testing, and more.

- Root Cause Analysis: Once an alert is raised, enables the data scientist to find the root cause ASAP instead of in weeks.

- Scale: The ability to support both current and future model observability needs.

- Integrations: Seamlessly fits into the overall MLOps organization strategy.

This document will discuss the build vs. buy decision for an enterprise ML

observability solution, highlighting pros and cons of each approach.

Buying an ML Observability Platform

How It Works

When purchasing an end-to-end ML observability solution, you should

ensure that your solution has the following capabilities: self-hosted

deployment, seamless integration with any ML infrastructure and workflow,

comprehensive monitoring, visibility, root cause analysis, and explainability

for your machine learning models in production. Furthermore, the most

effective ML observability solution will be highly customizable and easily

tailored to fit any ML use case and model.

As we all know, ML models can experience issues and anomalies, such as

data drift, bias, data integrity issues, and performance degradation. This

wide range of issues usually requires immediate action and can be easily

solved by data scientists and ML teams who leverage a customizable ML

monitoring solution that supports their specific use cases.

[Figure: End-to-End ML Observability Platform – the same capability matrix shown in the Executive Summary, spanning Setup and Management, Visibility, Explainability, Monitoring and Alerting, and Investigation.]

Customization Capabilities
Data scientists spend weeks or longer in research to achieve the best performance from their models during training and to meet their specific needs. As a result, each model inherently has unique components – custom metrics, imbalanced data, sensitive populations within the data, and more.

When deploying new models, data science teams require a range of


production ML monitors tailored to their specific models and use cases in
order to track and detect model activity, data drift, concept drift, statistical
metric change, missing values, and more, to ensure their models are
working as intended. These monitors are applied to production models and/
or production candidates (model versions that are not yet in production).

Additionally, if a certain monitoring capability is required beyond the built-in suite of monitors available, data scientists and ML engineers may prefer to create their own custom Python-code-based monitors to best support their needs.
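For illustration only, such a code-based monitor could be as simple as the following sketch; the function and the metric it checks (the share of low-confidence predictions versus training) are hypothetical assumptions, not a specific product API.

    # Hypothetical custom monitor: alert when the share of low-confidence
    # predictions in production grows well beyond what was seen in training.
    import numpy as np

    def low_confidence_monitor(production_scores: np.ndarray,
                               training_scores: np.ndarray,
                               confidence_threshold: float = 0.6,
                               max_ratio_increase: float = 1.5) -> bool:
        """Return True if the monitor should raise an alert."""
        prod_ratio = float(np.mean(production_scores < confidence_threshold))
        train_ratio = float(np.mean(training_scores < confidence_threshold))
        baseline = max(train_ratio, 1e-6)  # avoid division by zero
        return prod_ratio / baseline > max_ratio_increase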
[Figure: Aporia Full-Stack Observability Drift Alert Capabilities – a model-serving snippet (model_serving.py, below) that logs each prediction to Aporia, shown alongside a Live Monitoring chart of the data drift score over time (Sep 3–Sep 17) against a 0.3 threshold, with drift alerts raised when the score crosses it.]

    # model_serving.py
    @app.post("/my-model/predict")
    def predict(request):
        # Preprocess request
        X = preprocess(request)
        # Perform inference
        y = model.predict(X)
        # Log prediction to Aporia
        aporia.log_prediction(X, y)
        return {"result": y}


Ease of Integration

The idea is to make integrations as easy as possible. You want a monitoring

solution that is built for scale, making it easy to integrate all types of models

– batch or streaming, tabular or NLP, and hundreds or billions of predictions

– with a simple integration process.

Usually this process is streamlined and occurs automatically within the

machine learning platform for any new model.

Bias & Fairness Capabilities

Using your end-to-end ML observability solution, your team should be able

to monitor their ML models in production for bias and fairness on key data

segments by applying monitors to them. Due to a large variety of

compliance use cases, having the ability to customize data segments is

important. Your solution should enable you to democratize protected group

monitoring for business stakeholders by making it easy to configure,

visualize, and monitor data segments in a user-friendly interface.

Defining new segments of interest can be done dynamically by providing a

simple set of criteria for the segment of interest.

[Figure: Aporia Segment Capability – for the data segment city = “gotham” and 25 < age < 40, 7% of predictions are approved and 93% denied, compared with 34% approved and 66% denied for all remaining data.]
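To make the idea concrete, the segment above boils down to a simple filter over the prediction log. A minimal sketch using pandas, where the column and decision names mirror the example and are otherwise assumptions:

    import pandas as pd

    # Prediction log: model inputs plus the decision made for each request
    predictions = pd.DataFrame({
        "city": ["gotham", "metropolis", "gotham"],
        "age": [31, 52, 27],
        "decision": ["denied", "approved", "denied"],
    })

    # Segment of interest: city = "gotham" and 25 < age < 40
    segment = predictions.query('city == "gotham" and age > 25 and age < 40')
    rest = predictions.drop(segment.index)

    # Compare approval rates inside and outside the segment
    print("segment approval rate:", (segment["decision"] == "approved").mean())
    print("rest-of-data approval rate:", (rest["decision"] == "approved").mean())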


Investigation Capabilities

Any ML observability solution that you employ should provide an advanced

customizable alerting system, where alerts can be created as actions to

various ML-specific production monitors. Ideally, it has native integrations

with Slack, Microsoft Teams, JIRA, New Relic, and more, as well as

Webhooks to integrate with existing alerting and production ticketing

systems.

Preferably, you would want a monitoring solution that provides detailed

information within the alert – from the time it started, the relevant features,

relevant metrics such as drift score and missing values and their

corresponding thresholds, to the affected data points (including export to

CSV for further investigation in Jupyter Notebook), and more. This

information should be available both from the ML observability solution’s

user interface and from Webhook for integration with external systems.

Additionally, your monitoring system should provide a large range of

visibility and investigation tools to offer additional information and support

comprehensive root-cause analysis.

[Figure: Aporia Investigation Capabilities – views for distributions, performance, data segments, and data stats.]


Your alerts should be highly customizable to avoid false positives – from

simple constraints (such as not detecting drift if there are fewer than X predictions, or only triggering a maximum of Y alerts per day), up to advanced

anomaly detection algorithms.

Getting an alert for a production ML event is important, yet, it is not where

the story ends. There’s a need to investigate further, get to the root cause

as quickly as possible, and remediate.

In order to drill down and investigate the root cause, look for solutions with a wide range of visualization capabilities, including:

- Data Explorer

- Prediction Explainability and What-if Simulator

- Time-Series Investigation

- Data Metrics & Stats

- Distribution Analysis, and more

These capabilities make it easy to investigate ML-related issues and get to

the root cause as quickly as possible.

[Figure: Aporia Data Point Explainer – a table of individual data points (ID, Age, Driving_License, Annual_Premium, Region_Code, predicted probability that the customer will buy insurance, and actuals) with an Explain action per row, alongside a per-prediction breakdown of each feature’s impact on the prediction (e.g., Previously_Insured = True: +50%, Driving_License = True: +25%, Age = 50: +17%), with a Re-Explain option.]


Rich, Contextual Data is Needed for Production-Grade Monitoring

Tracking model inputs and outputs is important, but it’s not enough to really

understand your data and model behavior. What you need to monitor isn’t a

single model – but an entire AI system.

A few examples:

1. You have a human labeling system, and you’d like to monitor how your model’s output compares to the human labels, to get real performance metrics for your model.

2. Your system contains several models and pipelines, and one model’s output is used as an input feature for a subsequent model. Underperformance in the first model may be the root cause of underperformance in the second model, and your monitoring system should understand this dependency and alert you accordingly.

3. You have actual business results (e.g., whether the ad your recommendation model suggested was actually clicked) – this is a very important metric for measuring your model’s performance and is relevant even if the input features never really changed.

4. You have metadata that you don’t want (or are not allowed, e.g., race/gender) to use as an input feature, but you do want to track it for monitoring, to make sure the model is not discriminating unintentionally on that data field.

By using a dedicated ML observability platform, you can monitor an entire

AI system or an ensemble of models. You can also easily monitor data that

isn’t part of the model features (such as gender), monitor the business

outcome, and much more.
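A rough sketch of what such contextual logging could look like; the helper, file paths, and field names here are hypothetical stand-ins rather than a specific SDK, but they show how prediction-time data and later business outcomes can be tied together by a shared prediction id:

    import uuid, json, datetime

    def log_record(path: str, record: dict) -> None:
        """Append one JSON record to a local file (stand-in for a data lake)."""
        record["logged_at"] = datetime.datetime.utcnow().isoformat()
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    # At prediction time: inputs, prediction, and metadata that is NOT a feature
    prediction_id = str(uuid.uuid4())
    log_record("predictions.jsonl", {
        "prediction_id": prediction_id,
        "features": {"age": 34, "annual_premium": 37140},
        "metadata": {"gender": "F"},  # tracked for fairness, not used as input
        "prediction": 0.82,
    })

    # Days later, when the business outcome (e.g., the ad was clicked) is known
    log_record("actuals.jsonl", {
        "prediction_id": prediction_id,
        "actual": 1,
    })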


Costs

An effective ML monitoring system will not only help organizations monitor their models; it should also help improve them, which in turn improves model results and drives more revenue for the business. Data scientists will be able

to focus more on building and scaling an organization’s AI instead of

worrying about underperformance.

With a dedicated solution in place, the organization will not need to utilize

essential resources in house to build a solution over an extensive period of

time that will require long-term, continued maintenance to support it. One of

the benefits of procuring a built-for-scale ML monitoring platform is quick

and easy deployment, alongside enabling an organization to focus on the

models they create and the value they provide.

Dedicated ML observability solutions should be built for data scientists, ML

engineers, and business stakeholders to support their many ML use cases

and fortify trust in their models to generate critical business predictions.

Building ML Monitoring with Open-Source Tools

The open-source community offers multiple solutions for monitoring

application workloads, like Grafana and Prometheus. These tools are

intended for monitoring application logs and application metrics rather than machine learning models. Nevertheless, we will explore a possible solution that uses Prometheus and Grafana to achieve that goal.

Implementation

The following implementation is a proposal for a basic ML Observability

solution built from scratch, using open-source tools:

Phase 0 - Planning & Design

Before starting this journey of building a monitoring solution for ML models,

you need to write down the list of models that you have, the type of data

that is being used, and who will be using this system.

Then, you need to gather requirements from different data science teams

so you can make sure the monitoring system will fit their needs.

Before beginning the implementation, you need to try out various open-

source tools to explore their advantages and disadvantages and come up

with a proposed architecture for the solution.

Phase 1 - Instrumentation & Data Collection (In-House Development)

The open-source tools for monitoring were built for tracking either logs or

metrics. However, in order to be used for monitoring ML models and their

performance, you need to collect inference data. Therefore, the first step

will be collecting this data from the serving environment and storing it into a

data lake.


First, you’ll need to set up a data lake that will be used to store the data and

decide on the format that will be used for it.

Then, you need to implement a solution that could be instrumented within

the serving environment, and send the data to your data lake. While doing

so, you have to be cautious and make sure your implementation doesn’t

create any delays and does not interfere with the model's functionality.

Moreover, you will need a separate implementation of the instrumentation logic for every type of serving you have:

1. For streaming models (e.g., models served behind a Flask / Django server), this can be done by streaming predictions to an Apache Kafka topic.

2. For batch models (e.g., models that run as Airflow jobs), this can be done by saving the dataframe directly to the data lake.

Lastly, you need to ensure that this instrumentation solution is able to deal

with large amounts of data without causing any issues.
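A minimal sketch of both instrumentation paths, assuming a local Kafka broker (via the kafka-python package) and an S3-style data lake path; the topic name and bucket are placeholders:

    import json
    import pandas as pd
    from kafka import KafkaProducer  # assumes the kafka-python package

    # Streaming models: publish each prediction to a Kafka topic that a
    # downstream job drains into the data lake.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def log_streaming_prediction(features: dict, prediction: float) -> None:
        producer.send("model-inferences", {"features": features,
                                           "prediction": prediction})

    # Batch models: the scoring job already holds a dataframe, so write it
    # directly to the data lake, partitioned by date (path is a placeholder).
    def log_batch_predictions(df: pd.DataFrame, run_date: str) -> None:
        df.to_parquet(f"s3://my-data-lake/inferences/date={run_date}/part.parquet")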

Phase 2 - Metrics Calculation Service (In-House Development)

Now that you have your inference data stored in a data lake, you need to

prepare it for visualization. As mentioned above, Grafana and Prometheus

were not intended to work on raw data. Therefore, before you can start

visualizing it, you have to aggregate this data and derive some metrics

from it.

In this step, you’ll have to implement a service that constantly aggregates

the data from the previous step, and exposes the final metrics in a way that

can be consumed for dashboarding and alerting (e.g Prometheus &

Grafana).


You have to take into account processing data at a large scale to create these metrics. Moreover, you need to decide in advance which metrics you plan to calculate.

Aggregation examples:

1. Statistical properties of each input & output (e.g., count, min, max, average).

2. Drift Score: the statistical distance between the distribution of each input & output and its distribution in a certain baseline (e.g., the training set).

3. Performance metrics such as AUC ROC, F1 Score, etc.

Then, after calculating these metrics, you need to store them in a format

that can be consumed by Prometheus and Grafana.
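As a sketch of what such a metrics service might look like (assuming the inference data and the training set can be loaded as dataframes, and using placeholder paths, model, and feature names), the service below derives a per-feature drift score against the training baseline using a Jensen-Shannon distance over binned histograms and exposes it as a Prometheus gauge via prometheus_client:

    import time
    import numpy as np
    import pandas as pd
    from scipy.spatial.distance import jensenshannon
    from prometheus_client import Gauge, start_http_server

    DRIFT_SCORE = Gauge("ml_feature_drift_score",
                        "Drift vs. training baseline", ["model", "feature"])

    def drift_score(production: pd.Series, baseline: pd.Series, bins: int = 20) -> float:
        """Jensen-Shannon distance between binned distributions of a numerical feature."""
        edges = np.histogram_bin_edges(baseline.dropna(), bins=bins)
        p, _ = np.histogram(production.dropna(), bins=edges, density=True)
        q, _ = np.histogram(baseline.dropna(), bins=edges, density=True)
        return float(jensenshannon(p + 1e-9, q + 1e-9))

    if __name__ == "__main__":
        start_http_server(8000)  # Prometheus scrapes this endpoint
        baseline = pd.read_parquet("training_set.parquet")          # placeholder paths
        while True:
            recent = pd.read_parquet("recent_inferences.parquet")
            for feature in ["age", "annual_premium"]:
                DRIFT_SCORE.labels(model="demo", feature=feature).set(
                    drift_score(recent[feature], baseline[feature]))
            time.sleep(300)  # recompute every 5 minutes

Prometheus then scrapes the metrics endpoint on port 8000 on its regular schedule, and the gauge becomes queryable for dashboards and alerts.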

Phase 3 - Creating a Dashboard (Prometheus + Grafana)

Finally, now that you have some metrics in the right format, you are ready to

visualize them using Grafana. For this, you should set up Prometheus and

Grafana instances.

After you have a running instance, you should connect it with the metrics

provider so you can start creating dashboards.

Once you have metrics flowing into Prometheus, you can start creating the

first dashboard in Grafana to show various metrics. As there are many users

and models, there's a need to create a dashboard for each one, so

everyone has exactly what they need.
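For example, a Grafana panel backed by the Prometheus data source could chart the drift gauge sketched earlier with a query along these lines (the metric and label names follow that sketch and are assumptions):

    # Drift score over time for one feature of one model
    ml_feature_drift_score{model="demo", feature="age"}

    # The same score averaged over the last hour, to smooth out noise
    avg_over_time(ml_feature_drift_score{model="demo", feature="age"}[1h])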

Phase 4 - Monitor Metrics & Alert on Anomalies (Alertmanager)

Using Prometheus’ Alertmanager, you can define alerts that fire when a metric rises above or falls below a specific threshold (e.g., a drop in model activity).
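A rule file for this might look roughly like the following, assuming the metrics service also exports a counter of served predictions (the metric name model_predictions_total is an assumption, not something the earlier sketch defines):

    groups:
      - name: ml-model-monitoring
        rules:
          - alert: ModelActivityDrop
            # Fires if the model served fewer than ~10 predictions per second
            # (averaged over 5 minutes) for 30 consecutive minutes.
            expr: rate(model_predictions_total[5m]) < 10
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Prediction volume dropped below the expected threshold"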


Solution Shortcomings

Lack of Visualization & Root Cause Analysis

Grafana dashboards work well for traditional engineering metrics such as

CPU and RAM. However, it is very hard to create effective ML dashboards

in Grafana. As an example, Grafana doesn’t provide a way to create a

widget that displays two distributions in the same graph (e.g for Data

Scientists who want to compare a feature distribution in training vs.

production).

Data scientists require various visualization techniques to find the root cause of a specific problem in a model. These include metrics over time, with comparisons between metrics, often visualized over different data segments to understand the true impact of a behavior change on the model’s output. Similarly, visualizing distributions and comparing them across different model versions, data segments, and environments is also a common visualization method that helps with root cause analysis. These visualizations and other critical data visualization methods are not supported in Grafana.

The lack of visualization and root-cause analysis tools means that using

such a system is useless for investigating ML-related issues. It might help surface very elementary changes; however, the root-cause analysis process

would still require spending weeks or months trying to understand the

impact of an issue on the model results.

No Data Segmentation

Many times ML models may seem to be working adequately overall, when in

reality they are underperforming for certain data segments. For example, a

demand forecasting model may be underperforming in a specific region, or

for a specific brand. A fraud detection model may identify specific

browsers, or users buying specific products wrongfully as fraudsters.


Moreover, issues usually start by affecting a small portion of the data, which

grows over time. Without the ability to visualize, monitor, and investigate

specific data segments, the effectiveness of the monitoring system is in

question.

Data segmentation is also crucial for investigating production issues. In

order to find the root cause of a problem, a data scientist has to slice and

dice the data, and find the common denominator for the problematic

predictions.

Missing these capabilities limits data scientists’ ability to investigate and improve their models and, moreover, increases MTTR and risk for the organization.

No Explainability

The proposed solution does not provide, and cannot provide, any explainability capabilities.

[Figure: No explainability = black-box outputs – inputs enter the model as a black box and predictions come out with no explanation.]


Getting an explanation of the model’s predictions is important for various reasons:

1. Business stakeholders: The end users of the ML model may question the results they receive. For example, a marketing or sales manager may question the lead scoring they have received from a model. Providing them with an explanation gives them a clear understanding of the reasoning behind the prediction and, more importantly, lets them trust these predictions in the long run.

2. Regulation: In regulated industries, organizations are obligated to explain their decisions to their clients. While this is not an issue for traditional software, the black-box characteristics of AI make it a real challenge. Without explainability capabilities, ML cannot be used for these use cases.

3. Debugging: Debugging production code is not an easy task; debugging production models is even more challenging. Using explainability tools can significantly reduce the time it takes to debug and resolve a problem.

Critical Monitoring Limitations

The proposed solution does not offer monitoring at the prediction level.

Prediction-level monitoring is critical in many cases, for example when a specific prediction is an outlier. Identifying outliers is important for many reasons: an outlier may indicate bad data, or it may indicate a problematic prediction, which can affect the business’ bottom line (e.g., missing an outlier may mean missing an attempted fraud).

Moreover, the proposed solution does not provide detection of bias and

fairness issues, resulting in a lack of compliance with regulations. Ensuring

models do not have bias against a specific population of data is crucial for

business owners and data scientists. The lack of such a detection capability

may result in the model performing poorly for specific data populations,

even when it appears that the model overall is performing accurately.


Customization Limitations

The proposed solution doesn’t support customization of the metrics being

monitored. In the real world, each model has different metrics that need to

be calculated over different time windows and against different baselines.

Model monitoring and observability customizations are critical for ensuring

trust in your ML models. As each model is unique, the ability to customize

monitors, customize metrics, create custom data segments and customize

views, is essential for successful model monitoring.

For each unique model, the model’s success is measured differently. In

training, data scientists craft custom metrics for each model to measure its

performance and ensure its quality. This, in turn, needs to be translated to production monitoring. Lacking these capabilities will result in model monitoring that cannot be trusted to reflect each model’s true performance.

Challenges in Building a Solution

The first step in the proposed solution is to collect inputs & outputs of

models in production. This step can be very challenging on its own, for

multiple reasons:

- Batch vs. Streaming: Some models are batch, while others are streaming. Data collection is usually different in these two modes.

- Model Serving Environment Variety: The model serving environment can differ significantly (e.g., Airflow vs. Flask vs. MLflow vs. RServe).

- Scale: Some models make a huge number of predictions, which makes it very challenging to collect and store this data at scale.

- Input Data Format: The input data of different models can be different. For example, some models use tabular data as inputs, some use text data, and others use image data. For models with text or image data, storing the embedding vector is essential for drift detection. For tabular models with time-series data, storage can require a more efficient time-series database.


In the second step, you aggregate this data and generate ML-specific metrics. This also presents multiple challenges:

- Calculating drift scores can differ for categorical vs. numerical features. For categorical features, a statistical distance such as Hellinger is usually more appropriate; for numerical features, a statistical distance such as Jensen-Shannon can be more relevant. This varies greatly between datasets and column types, so selecting the correct statistical distance for drift detection can be challenging (see the sketch after this list).

- Calculating metrics without a baseline is usually meaningless, so it’s necessary to calculate them against a baseline as well. For example, what action item is derived from the information that 3.7% of a specific column’s values are missing? If the training set also had a similar share of missing values (e.g., 3.4%), then everything might be okay. But what if the training set had 0.002% missing values? That could indicate a more serious issue. Having a baseline for monitoring is therefore critical. Supporting many different baselines is best, but more complicated to achieve. One of the best baselines is the training dataset, so you can aggregate it too and collect metrics for it. There are multiple challenges with that, however: first, training sets can easily become huge (we’ll discuss data scale challenges in a minute); second, you need to make sure their schema can be mapped to the production data’s schema.

- Calculating drift scores at scale is difficult. As of March 2022, there isn’t an effective open-source library for calculating statistical distances using a distributed computation engine such as Spark, so you’ll need to implement these from scratch.

- Choosing the correct thresholds for alerting is difficult. Each column has a different data type (e.g., categorical vs. numerical) or different data behavior, so choosing relevant thresholds that minimize alert false positives and alert fatigue is challenging.
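To make the first two challenges concrete, here is a rough sketch (not a production implementation) of a Hellinger distance for a categorical feature against its training baseline, together with the kind of missing-value comparison described above; feature names and data loading are left to the caller:

    import numpy as np
    import pandas as pd

    def hellinger_distance(production: pd.Series, baseline: pd.Series) -> float:
        """Hellinger distance between category frequencies in production vs. training."""
        categories = sorted(set(baseline.dropna()) | set(production.dropna()))
        p = production.value_counts(normalize=True).reindex(categories, fill_value=0.0)
        q = baseline.value_counts(normalize=True).reindex(categories, fill_value=0.0)
        return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

    def missing_ratio_vs_baseline(production: pd.Series, baseline: pd.Series) -> tuple:
        """A missing-value ratio only becomes actionable next to its training baseline."""
        return float(production.isna().mean()), float(baseline.isna().mean())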


Required Expertise

- To address all of the challenges mentioned above, expertise in building, customizing, scaling, and designing such a solution is required. The solution should be able to adapt to the business’ future plans, which will require maintaining this expertise over time.

- As more models are developed and more use cases are covered by ML models, the solution should be able to support the future needs for monitoring these models. Each model will require specific metrics, and sometimes different drift detection methods, each of which must be coupled to the specific model.

Costs

The cost of building even a very basic ML observability platform for an

enterprise is immensely high, and requires a unique skillset. Building such a

system requires a group of people with experience in Software Engineering,

Software Architecture, Data Engineering, Data Science and Product

Management.

Assuming a qualified team with the skills and experience as mentioned

above is employed, this is a months-long project that requires ongoing

maintenance and support, leading to a multi-year project with a technical

debt that will continue to grow over time. Additionally, in-house ML

monitoring builds tend to be difficult to scale when the time comes, further

plunging you into more technical debt or, in extreme cases, forcing you to scrap the project altogether. Moreover, you’ll find that various mission-critical integrations will be difficult to implement due to limited capabilities, and in

many instances your in-house monitor will lag behind on the technological

side, missing out on cutting-edge advances in the field.

For a business in which ML monitoring is necessary but not its core purpose, taking on such a project is costly, requires a significant investment in resources, and often results in a monitoring solution that is inadequate to support its needs.


Summary

In order for businesses to get the most out of their ML models and to trust

them with making critical business decisions, a robust, end-to-end ML

observability solution needs to be in place. How you get to that monitoring solution is the question.

Today’s MLOps market provides organizations with an alternative option to

taking the in-house build route by choosing to procure a first-class ML

Observability platform. This solution provides organizations with all the

capabilities necessary for ML monitoring and explainability, with little to no

investment required by the organizations beyond a financial commitment. In

other words, this enables organizations to focus on building and

implementing the best ML models, to provide the most value for their

business.

Any solution you use must be able to provide the organization with visibility,

monitoring, root cause analysis, and explainability capabilities that are

required for successful ML monitoring in production. Moreover, the solution

must be scalable, easily usable by various stakeholders, integrate seamlessly

with the organization’s MLOps pipeline, leverage cutting-edge technology,

and more.

It is possible to build such a solution in house leveraging open-source tools

such as Prometheus and Grafana. However, to do so requires the

organization to invest significantly in infrastructure, personnel, building

expertise, maintenance, and support. This is a large investment to make and

the end result requires compromising on many areas that are essential for successful and trustworthy ML Observability.

Another aspect to consider when weighing an in-house build against buying one is that building is often insufficient in providing data scientists and ML teams with efficient visualization, investigation, and

explainability tools to fully comprehend the scope of issues that arise from

ML models in production. It also fails to offer required monitoring for

prediction-level issues (e.g., outliers) or detection of bias and fairness issues, which

are both crucial for business success and to instill trust in your AI.

Additionally, you won’t be able to compare model versions, support various

customizations, or monitor different stages across the data pipeline using

the proposed solution, which impacts model performance in production.

Depending on your needs, open-source tools tend to be a less complicated way to start monitoring your models, with relatively low up-front expenses. However, this can be quite misleading, as most in-house monitoring builds end up costing more in the long run once technical debt and functional limitations are considered.

As you know by now, whether you procure or build your monitor, it is a critical tool to employ in your ML workflow, and both options can be

successful in the short run. However, as your use of ML models increases,

it's vital to remember that in order to trust your models’ predictions, you

should aim for reliability in your ML observability platform.
