
Data Quality

In Machine Learning
Production Systems
Felix Biessmann,
Sebastian Schelter,
Sebastian Jäger
Why Data Quality for Machine Learning?

1. Manager’s Perspective:

Legal requirement

Biased data (gender, ethnicity)

Quality standards (e.g. Autopilots)

2. Scientist’s Perspective:

Better data quality, better predictive performance

3. Engineer’s Perspective:

Data quality problems are the worst hidden technical debt in AI systems

Better Data Quality, Better Predictive Performance

Misconception 2: "parameter tuning is the most important problem".
Data quality is the most important hyperparameter.

Classification Horror vs. Comedy from plot (accuracy):

Data Cleaning             0.91
Python Hyperopt           0.73
TFIDF, PoS, Stop Words    0.695
Scikit-Learn Default      0.69

Krishnan and Wang, "Data Cleaning - a Statistical Perspective",
SIGMOD Tutorial on Data Cleaning, 2016
AI in Academia

Data → Machine Learning → Predictions
AI in Production Systems

Data Collection → Feature Extraction → Machine Learning → Serving

Plus the surrounding infrastructure: Data Verification, Configuration,
Monitoring, Analysis Tools, Machine Resource Management

Adapted from Sculley et al,
"Hidden Technical Debt in ML Systems", NIPS 2015
Automation for Data Quality

• Data Quality is important for responsible usage of ML


• Automated Data Quality Monitoring is difficult
• Lack of automation for Data Quality leads to
• Slow transfer from research to applications
• Poor scalability
• Difficult maintenance

Automation for Data Quality

• Responsible usage of ML requires novel tooling


• We develop solutions for scalable and automated
• Monitoring of data quality
• Improvement of data quality
• Prediction of data quality impact on ML models
• Modelling Data Errors

Automated Monitoring of Data Quality

Research Question:
Can we measure Data Quality problems automatically?

Automated DQ Monitoring: Unit Tests for Data

• Software systems are testable


• Unit tests
• Integration tests

• Machine Learning systems depend on data

• Data quality is essential for ML systems


➡ Unit tests for Data Quality

Automated DQ Monitoring: Unit Tests for Data

val verificationResult = VerificationSuite()
  .onData(data)
  .addCheck(
    Check(CheckLevel.Error, "unit testing my data")
      .hasSize(_ == 5) // we expect 5 rows
      .isComplete("id") // should never be NULL
      .isUnique("id") // should not contain duplicates
      .isComplete("name") // should never be NULL
      // should only contain the values "high" and "low"
      .isContainedIn("priority", Array("high", "low"))
      // should not contain negative values
      .isNonNegative("numViews")
      // at least half of the descriptions should contain a url
      .containsURL("description", _ >= 0.5)
      // half of the items should have less than 10 views
      .hasApproxQuantile("numViews", 0.5, _ <= 10))
  .run()

Schelter et al, VLDB, 2018 https://github.com/awslabs/deequ

Automated DQ Monitoring: Unit Tests for Data

Deequ translates (1) user-defined checks and constraints into (2) declarative
data quality metrics, which are computed by (3) aggregation queries on
Apache Spark and reported as (4) metrics and validation results.

1. Checks and constraints:
Check(Level.Error)
  .isUnique("product_id")
  .isComplete("name", "description")
  .satisfies("price > 0")
  .hasValidValueRange("priority", ("high", "low"))
  .hasCountDistinct("brand", verifyBrandCount(_))
Check(Level.Warning, scope=SERIES)
  .hasNoAnomalies(Size, OnlineNormal(stdDevs=3))
Check(Level.Warning)
  .isPredictable("brand", ("name", "description"), f1=.98)

2. Data quality metrics:
Uniqueness("product_id"), Completeness("name"), Completeness("description"),
Compliance("price > 0"), Compliance("priority IN (...)"), Histogram("brand"),
Predictability("brand", ...), Size

3. Aggregation queries on Apache Spark:
SELECT AVG(NOT_NULL(name)), AVG(NOT_NULL(description)), AVG(price > 0), …
FROM data
SELECT SUM(COUNT(*)) FROM data GROUP BY asin HAVING COUNT(*) = 1

4. Metrics and validation results:
Success("isUnique(product_id)", Uniqueness("product_id") == 1.0)
Failure("isComplete(name)", Completeness("name") == 1.0, 0.987)
Success("isComplete(description)", Completeness("description") == 1.0)
Success("hasCountDistinct(brand, ...)",
        verifyBrandCount(Histogram("brand").numDistinct))

Schelter et al, VLDB, 2018 https://github.com/awslabs/deequ
Automated Improvement of Data Quality

Research Question:
Can we fix Data Quality problems automatically?

Data Quality Problems
• Correctness (Difficult to measure)
• Unknown schemas / data types
• Wrong entries

• Consistency
• Duplicates
• Invalid values

• Statistical properties
• Anomalies
• Are value distributions changing over time?
• Is a column predictable from another column?

• Completeness
• How many missing values?

Automated DQ Improvement: Imputation for Tables

Product Type   Description              Size   Color
Shoe           Ideal for running …      12UK   Black   (training row)
SDCards        Best SDCard ever … 8GB          Blue    (training row)
Dress          This yellow dress …      M      ?       (to-be-imputed row)

Training rows provide the feature columns (Product Type, Description, Size)
and the label column (Color); in the to-be-imputed row the label is missing.

Pipeline: (1) string representation → (2) numerical representation (One-Hot
Encoders for categorical columns, Character Sequence Encoders for free text)
→ (3) embedding (learned embedding or n-gram hashing) → (4) latent
representation via feature concatenation → imputation.
Data handling with Pandas, Dask or Spark; models trained with MXNet on
CPU/GPU.

Biessmann, Salinas et al, CIKM 2018
Biessmann et al., Journal of Machine Learning Research, 2019
Jaeger et al., Frontiers in Big Data, 2021
Automated DQ Improvement: Imputation for Tables

For a column c with sequential string data we consider two different
featurizers φc(xc):

• Character n-gram representation, with n ∈ {1, …, 5}: a hashing function
maps each n-gram in the character sequence xc to a Dc-dimensional vector,
where Dc denotes the number of hash buckets. This featurizer is stateless
and does not require any training.

• Character-based embedding using a Long-Short-Term-Memory (LSTM) recurrent
neural network: each character of xc is represented as a continuous vector
via a character embedding, and an LSTM iterates through the resulting
sequence, yielding states h(c,1), …, h(c,Sc); we take the last state
h(c,Sc) and map it through a fully connected layer. Hyperparameters are the
number of layers, the number of hidden units of the LSTM cell, the dimension
of the character embedding, and the number of hidden units of the final
fully connected output layer.

For categorical data we use a standard linear embedding (as known from word
embeddings) fed into one fully connected layer; a single hyperparameter sets
both the embedding dimensionality and the number of hidden units of the
output layer. Parameters of these featurizers are learned using
backpropagation.

Finally, all feature vectors φc(xc) are concatenated into one feature vector
x̃ ∈ R^D, where D = Σc Dc is the sum over all latent dimensions Dc. We refer
to the numerical representation of the values in the to-be-imputed column as
y ∈ {1, 2, …, Dy}, as in standard supervised learning settings; the symbols
of the target column use the same encoding as the aforementioned categorical
variables.

After the featurization x̃ of the input (feature) columns and the encoding y
of the to-be-imputed column, we can cast imputation as a supervised learning
problem: we model the probability p(y | x̃, θ) over all observed values of y
given an input feature vector x̃ with learned parameters θ,

    p(y | x̃, θ) = softmax[W x̃ + b]    (1)

where θ = (W, z, b) are the parameters to learn, with W ∈ R^(Dy×D),
b ∈ R^(Dy), and z a vector containing all parameters of all learned column
featurizers. softmax(q) denotes the elementwise softmax
exp(qj) / Σj' exp(qj'), where qj is the j-th element of a vector q.

Figure 1: Imputation example on non-numerical data with deep learning.

Biessmann, Salinas et al, CIKM 2018
Biessmann et al., JMLR MLOSS, 2019
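The stateless n-gram hashing featurizer can be sketched in a few lines of Python. This is a toy illustration (the bucket count, hash function and example string below are made up here), not the implementation from the paper or DataWig:

```python
import hashlib

def ngram_hash_features(text, num_buckets=64, max_n=5):
    """Map a string to a fixed-size bag-of-character-n-grams vector
    (n = 1..max_n) by hashing each n-gram into one of num_buckets."""
    vec = [0.0] * num_buckets
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            ngram = text[i:i + n]
            # stable hash across runs (unlike the builtin hash())
            digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % num_buckets] += 1.0
    return vec

features = ngram_hash_features("this yellow dress")
```

The resulting per-column vectors would then be concatenated into x̃ and fed into a softmax output layer as sketched above; no training is needed for the featurizer itself.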
Imputation for Tables

Table 2: F1 scores on held-out data for the product-attribute imputation
task, for mode imputation, string matching, LSTM and character n-gram
featurizers. For single product attributes, we report F1 scores relative to
the result obtained with the best model. We also report the median of
absolute F1 across all product attributes. Compared to baselines, our
imputation approach is better in all cases, achieving on average a 23-fold
increase (compared to mode imputation) and a 3-fold increase (compared to a
rule-based system) in imputation quality.

Dataset            Attribute               Mode    String matching  LSTM    N-gram

dress, English     brand                    0.4%   80.3%            best    99.9%
                   manufacturer             1.0%   22.7%            99.4%   best
                   size                     4.8%    0.1%            best    96.1%
monitor, English   brand                   13.3%   44.5%            best    94.5%
                   display technology      30.6%   13.5%            99.8%   best
                   manufacturer            15.1%   33.6%            best    95.3%
notebook, English  brand                    3.9%   48.5%            best    99.1%
                   cpu model manufacturer  82.7%   88.0%            98.9%   best
                   manufacturer             4.3%   35.9%            99.8%   best
shoes, English     brand                    0.5%   91.5%            99.9%   best
                   manufacturer             0.5%   79.2%            98.8%   best
                   size                     2.2%    0.0%            best    82.7%
                   toe style               13.1%   23.5%            96.5%   best
shoes, Japanese    brand                    2.6%   19.3%            98.8%   best
                   color                   20.4%   58.3%            94.5%   best
                   size                    76.7%    2.6%            best    99.2%
                   style                   61.3%   13.4%            92.6%   best

Median of absolute F1 scores               4.1%   30.1%            92.8%   93.0%

Simple imputation models on par with or better than LSTMs


Table 3: F1 scores on held-out data for the DBpedia imputation task. See
Table 2 for a description of columns. In contrast to the relative results
for single product attributes, all numbers for the DBpedia dataset are
absolute F1 scores.

Dataset           Attribute     Mode   String matching   LSTM    N-gram

DBpedia, English  birth place   0.3%   16.3%             54.1%   60.2%
                  genre         1.5%    6.4%             43.2%   72.4%
                  location      0.7%    7.5%             41.8%   60.0%
Automated Monitoring of Data Quality Impact

Research Questions:

Can we predict performance of ML systems

• In the presence of covariate shift

• Without access to ML system internals

ML Production Systems

Classical ML Production Setting

• Training / Validation: Cloud ML / AutoML Service
• Serving: black box ML model
• Customer: engineers without ML expertise
Predicting ML Performance

• Many Data Quality problems will not be detected / fixed


• Errors in preprocessing
• Covariate shifts
• Adversarial attacks

• Some errors will not have an impact on the downstream ML model


• Only looking at input data leads to false alarms
• Other errors are difficult to detect in the input
• Only looking at input data will miss some errors
Predicting ML Performance

Training Data → Model → Performance Estimate

• Apply typical* perturbations to the training data → Perturbed Datasets
• Evaluate the model on each perturbed dataset → Performance Estimates
• Featurize the model output on each perturbed dataset
• Train a Meta Regression Model: featurized model output → performance
• Production Data: featurize the model output and let the Meta Regression
Model produce a Prod Performance Estimate
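A minimal end-to-end sketch of this idea, assuming a synthetic black-box model and using mean prediction confidence as the only output feature (both are illustrative stand-ins, not the published method):

```python
import random
import statistics

random.seed(0)

def black_box(noise):
    """Toy black-box model: returns (confidence, is_correct);
    prediction quality degrades as input corruption grows."""
    conf = min(0.99, max(0.5, 0.95 - noise + random.gauss(0, 0.02)))
    return conf, random.random() < conf

# 1) apply perturbations of increasing strength, collect
#    (featurized model output, observed accuracy) pairs
mean_confs, accs = [], []
for noise in (0.0, 0.1, 0.2, 0.3, 0.4):
    preds = [black_box(noise) for _ in range(2000)]
    mean_confs.append(statistics.fmean(c for c, _ in preds))
    accs.append(statistics.fmean(1.0 if ok else 0.0 for _, ok in preds))

# 2) meta-regressor: ordinary least squares, accuracy ~ mean confidence
mx, my = statistics.fmean(mean_confs), statistics.fmean(accs)
slope = (sum((x - mx) * (y - my) for x, y in zip(mean_confs, accs))
         / sum((x - mx) ** 2 for x in mean_confs))
intercept = my - slope * mx

# 3) serving time: featurize the model output on production data
#    and estimate production accuracy without any labels
prod_mean_conf = 0.8  # hypothetical statistic observed in production
estimated_prod_accuracy = slope * prod_mean_conf + intercept
```

The papers cited below use richer output statistics and stronger regressors; the sketch only shows the shape of the approach.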

Predicting ML Performance under Covariate Shifts

Simple Method to
estimate ML Model
Performance under
realistic covariate shifts /
data quality issues

Sebastian Schelter, Tammo Rukat, Felix Bießmann, JENGA - A Framework to Study the Impact of Data Errors on
the Predictions of Machine Learning Models, EDBT, 2021

Sebastian Schelter, Tammo Rukat, Felix Bießmann, Learning to Validate the Predictions of Black Box Classifiers
on Unseen Data, SIGMOD, 2020

Works for unknown errors

Sebastian Schelter, Tammo Rukat, Felix Bießmann, JENGA - A Framework to Study the Impact of Data Errors on
the Predictions of Machine Learning Models, EDBT, 2021

Sebastian Schelter, Tammo Rukat, Felix Bießmann, Learning to Validate the Predictions of Black Box Classifiers
on Unseen Data, SIGMOD, 2020

Works with few samples

Figure 4: Sensitivity of our performance predictors to the sample size
|Dtest| for different errors and models on the income and heart datasets.
The performance predictor achieves low prediction errors after having
access to a few hundred examples.

Sebastian Schelter, Tammo Rukat, Felix Bießmann, JENGA - A Framework to Study the Impact of Data Errors on
the Predictions of Machine Learning Models, EDBT, 2021

Sebastian Schelter, Tammo Rukat, Felix Bießmann, Learning to Validate the Predictions of Black Box Classifiers
on Unseen Data, SIGMOD, 2020
Modelling Data Quality Problems

Research Question:

Can we generate realistic errors?

The Importance of Errors in Machine Learning

• Injecting errors in data has many applications


• Regularization
• Differential Privacy
• Adversarial training
• Self-Supervised Training
• But which errors are useful and realistic?

All clean data sets are alike; All dirty data is dirty in its own way
Charles Sutton 2019 (AIDA workshop), after Leo Tolstoi

Jenga - A Framework to Study the Impact of Data Errors on
the Predictions of Machine Learning Models

• Software systems are tested


• ML applications are difficult to test
▪ ML models depend on data
▪ Real test data is limited, models overparametrized
◦ Google's underspecification paper (D'Amour et al., 2020)
◦ Stochastic Parrots paper (Bender et al., 2021)
How to test ML Systems?

• Many data sets with a convenient API (OpenML, Vanschoren et al. 2014)
• Data corruptions (Schelter, Rukat, Biessmann, SIGMOD, 2020)

Jenga leverages both to ensure:

• automation
• reproducibility

ML Pipeline Performance Prediction (SIGMOD, 2020, Portland, OR)

1. Corrupt labeled test examples (e.g. a table with columns age, income,
loan?) with user-defined error generators: missing values, scaling, …
2. Compute predictions and scores for the corrupted examples from the
black box ML model.
3. Learn a performance predictor from statistics of the predictions and
scores on the corrupt examples; the learned predictor then outputs
per-dataset performance estimates (0.67 and 0.52 in the example).

Figure 1: Overview of our proposed approach: (a) We learn a performance
predictor for a pretrained black box classifier, using synthetically
corrupted data to explore how data errors impact the resulting prediction
quality; the predictor uses input data to estimate the prediction quality
of a black box model.
Jenga - Example

Jenga - Core Components

Tasks:
▪ binary/multiclass classification
▪ regression

Corruptions:
▪ Text
▪ Images
▪ Tabular data

Evaluators:
▪ Applies corruptions and tests ML model on task

Jenga - Available Data Errors

• Text: leetspeak
• Images: standard augmentations
• Structured data (tables)
▪ missing data
◦ missing completely at random
◦ missing at random
◦ missing not at random
▪ swapping columns
▪ numerical data
◦ additive Gaussian noise
◦ scaling
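As a toy illustration, a leetspeak text corruption can be as simple as a character translation table (the mapping below is made up, not Jenga's exact table):

```python
def leetspeak(text):
    """Replace selected characters with visually similar digits."""
    return text.translate(str.maketrans("aeiost", "431057"))

corrupted = leetspeak("data quality")
```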

Jenga - Available Error Sampling Schemes

• Inspired by the imputation literature, the error probability is sampled:

• Completely at random
• For each value, we draw the error probability from a uniform distribution

• At random
• Error probability depends on values in other columns

• Not at random
• Error probability depends on the values themselves
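The three schemes can be sketched on a toy table; the column names, thresholds and corruption fractions below are illustrative assumptions, not Jenga's API:

```python
import random

random.seed(42)

# toy table: 1000 rows with two numeric columns
rows = [{"age": random.randint(18, 80), "income": random.randint(20, 200)}
        for _ in range(1000)]

def inject_missing(rows, column, scheme, fraction=0.3):
    """Set `column` to None according to the given sampling scheme."""
    out = [dict(r) for r in rows]  # leave the input table untouched
    for r in out:
        if scheme == "MCAR":    # completely at random: uniform probability
            hit = random.random() < fraction
        elif scheme == "MAR":   # at random: depends on *another* column
            hit = r["age"] > 50 and random.random() < 2 * fraction
        else:                   # MNAR: depends on the value itself
            hit = r[column] > 100 and random.random() < 2 * fraction
        if hit:
            r[column] = None
    return out

mcar = inject_missing(rows, "income", "MCAR")
mnar = inject_missing(rows, "income", "MNAR")
```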

Jenga - Error Example

Summary

• Data Quality is fundamental to responsible usage of ML


• Automation is key to scalable DQ management
• We work on automation of Data Quality
• Monitoring
• Improvement
• Modelling

• Limitations:
• Adoption of DQ tooling remains difficult
• Many DQ problems are difficult to model
• Data on errors is scarce (but see AIAAIC error DB)

Open Source

Measuring Data Quality:
https://github.com/awslabs/deequ

Fixing Data Quality issues:
https://github.com/awslabs/datawig

Predicting Data Quality issues:
https://github.com/schelterlabs/jenga

References

• Felix Bießmann, Jacek Golebiowski, Tammo Rukat, Dustin Lange, Philipp Schmidt, Automated data validation in machine learning
systems, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2021

• Sebastian Schelter, Tammo Rukat, Felix Bießmann, Towards Automated Data Quality Management for Machine Learning,
Proceedings of the 2020 ML Ops workshop at the Conference on Machine Learning and Systems (MLSys), 2020

• Sebastian Schelter, Felix Biessmann, Dustin Lange, Tammo Rukat, Phillipp Schmidt, Stephan Seufert, Pierre Brunelle, Andrey
Taptunov, Unit Testing Data with Deequ, Proceedings of the International Conference on Management of Data, 2019

• Sebastian Schelter, Stefan Grafberger, Philipp Schmidt, Tammo Rukat, Mario Kiessling, Andrey Taptunov, Felix Biessmann, Dustin
Lange, Differential Data Quality Verification on Partitioned Data, IEEE 35th International Conference on Data Engineering (ICDE), 2019

• Stefan Grafberger, Philipp Schmidt, Tammo Rukat, Mario Kiessling, Andrey Taptunov, Felix Bießmann, Dustin Lange, Deequ - Data
Quality Validation for Machine Learning Pipelines, ML Systems Workshop, NeurIPS 2018

• Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Bießmann, Andreas Grafberger, Automating Large-Scale
Data Quality Verification, Very Large Databases (VLDB), 2018

• Felix Biessmann, David Salinas, Sebastian Schelter, Philipp Schmidt, Dustin Lange, "Deep Learning for Missing Value Imputation in
Tables with Non-Numerical Data", CIKM 2018

• Biessmann et al., Journal of Machine Learning Research, 2019

• Jaeger, Allhorn, Biessmann, Frontiers in Big Data, 2021
Fair ML Automated

• Many data sets are biased


• Resulting ML Model predictions are biased
• Building fair ML models requires domain expertise
• Automating the application of fairness constraints is important

Fair ML Automated: Declarative Feature Selection

Neutatz, Biessmann, Abedjan, SIGMOD, 2021
