
Data Quality

In Machine Learning
Production Systems
Felix Biessmann,
Sebastian Schelter,
Sebastian Jäger
Why Data Quality for Machine Learning?

1. Manager’s Perspective:

Legal requirement

Biased data (gender, ethnicity)

Quality standards (e.g. Autopilots)

2. Scientist’s Perspective:

Better data quality, better predictive performance

3. Engineer’s Perspective:

Data quality problems are the worst hidden technical debt in AI systems

Better Data Quality, Better Predictive Performance

Misconception 2: "parameter tuning is the most important problem".
Data quality is the most important hyperparameter.

Classification Horror vs. Comedy from plot (accuracy):

Data Cleaning             0.91
Python Hyperopt           0.73
TFIDF, PoS, Stop Words    0.695
Scikit-Learn Default      0.69

Krishnan and Wang, "Data Cleaning - a Statistical Perspective",
SIGMOD Tutorial on Data Cleaning, 2016
AI in Academia

Data → Machine Learning → Predictions
AI in Production Systems

Data Collection → Feature Extraction → Machine Learning → Serving

Plus the surrounding infrastructure: Data Verification, Configuration,
Monitoring, Analysis Tools, Machine Resource Management

Adapted from Sculley et al,
"Hidden Technical Debt in ML Systems", NIPS 2015
Automation for Data Quality

• Data Quality is important for responsible usage of ML


• Automated Data Quality Monitoring is difficult
• Lack of automation for Data Quality leads to
• Slow transfer from research to applications
• Poor scalability
• Difficult maintenance

Automation for Data Quality

• Responsible usage of ML requires novel tooling


• We develop solutions for scalable and automated
• Monitoring of data quality
• Improvement of data quality
• Prediction of data quality impact on ML models
• Modelling Data Errors

Automated Monitoring of Data Quality

Research Question:
Can we measure Data Quality problems automatically?

Automated DQ Monitoring: Unit Tests for Data

• Software systems are testable


• Unit tests
• Integration tests

• Machine Learning systems depend on data

• Data quality is essential for ML systems


➡ Unit tests for Data Quality

Automated DQ Monitoring: Unit Tests for Data

val verificationResult = VerificationSuite()
  .onData(data)
  .addCheck(
    Check(CheckLevel.Error, "unit testing my data")
      .hasSize(_ == 5) // we expect 5 rows
      .isComplete("id") // should never be NULL
      .isUnique("id") // should not contain duplicates
      .isComplete("name") // should never be NULL
      // should only contain the values "high" and "low"
      .isContainedIn("priority", Array("high", "low"))
      // should not contain negative values
      .isNonNegative("numViews")
      // at least half of the descriptions should contain a url
      .containsURL("description", _ >= 0.5)
      // half of the items should have less than 10 views
      .hasApproxQuantile("numViews", 0.5, _ <= 10))
  .run()

Schelter et al, VLDB, 2018 https://github.com/awslabs/deequ

Automated DQ Monitoring: Unit Tests for Data

Deequ translates (1) user-defined checks and constraints into (2) declarative
data quality metrics, which are computed by (3) aggregation queries on
Apache Spark and reported as (4) metrics and validation results.

1. Checks and constraints:
Check(Level.Error)
  .isUnique("product_id")
  .isComplete("name", "description")
  .satisfies("price > 0")
  .hasValidValueRange("priority", ("high", "low"))
  .hasCountDistinct("brand", verifyBrandCount(_))
Check(Level.Warning, scope=SERIES)
  .hasNoAnomalies(Size, OnlineNormal(stdDevs=3))
Check(Level.Warning)
  .isPredictable("brand", ("name", "description"), f1=.98)

2. Data quality metrics:
Uniqueness("product_id"), Completeness("name"), Completeness("description"),
Compliance("price > 0"), Compliance("priority IN (...)"), Histogram("brand"),
Predictability("brand", ...), Size

3. Aggregation queries on Apache Spark:
SELECT AVG(NOT_NULL(name)), AVG(NOT_NULL(description)), AVG(price > 0), …
FROM data
SELECT SUM(COUNT(*)) FROM data GROUP BY asin HAVING COUNT(*) = 1

4. Metrics and validation results:
Success("isUnique(product_id)", Uniqueness("product_id") == 1.0)
Failure("isComplete(name)", Completeness("name") == 1.0, 0.987)
Success("isComplete(description)", Completeness("description") == 1.0)
Success("hasCountDistinct(brand, ...)",
        verifyBrandCount(Histogram("brand").numDistinct))

Schelter et al, VLDB, 2018 https://github.com/awslabs/deequ
Automated Improvement of Data Quality

Research Question:
Can we fix Data Quality problems automatically?

Data Quality Problems
• Correctness (Difficult to measure)
• Unknown schemas / data types
• Wrong entries

• Consistency
• Duplicates
• Invalid values

• Statistical properties
• Anomalies
• Are value distributions changing over time?
• Is a column predictable from another column?

• Completeness
• How many missing values?

Automated DQ Improvement: Imputation for Tables

Product Type   Description              Size   Color
Shoe           Ideal for running …      12UK   Black   (training row)
SDCards        Best SDCard ever … 8GB          Blue    (training row)
Dress          This yellow dress …      M      ?       (to-be-imputed row)

Training rows provide the feature columns (Product Type, Description, Size)
and the label column (Color); in the to-be-imputed row the label is missing.

Pipeline: (1) string representation → (2) numerical representation (One-Hot
Encoders for categorical columns, Character Sequence Encoders for free text)
→ (3) embedding (learned embedding or n-gram hashing) → (4) latent
representation via feature concatenation → imputation.
Data handling with Pandas, Dask or Spark; models trained with MXNet on
CPU/GPU.

Biessmann, Salinas et al, CIKM 2018
Biessmann et al., Journal of Machine Learning Research, 2019
Jaeger et al., Frontiers in Big Data, 2021
Automated DQ Improvement: Imputation for Tables

For a column c with sequential string data we consider two different
featurizers φc(xc):

• Character n-gram representation, with n ∈ {1, …, 5}: a hashing function
maps each n-gram in the character sequence xc to a Dc-dimensional vector,
where Dc denotes the number of hash buckets. This featurizer is stateless
and does not require any training.

• Character-based embedding using a Long-Short-Term-Memory (LSTM) recurrent
neural network: each character of xc is represented as a continuous vector
via a character embedding, and an LSTM iterates through the resulting
sequence, yielding states h(c,1), …, h(c,Sc); we take the last state
h(c,Sc) and map it through a fully connected layer. Hyperparameters are the
number of layers, the number of hidden units of the LSTM cell, the dimension
of the character embedding, and the number of hidden units of the final
fully connected output layer.

For categorical data we use a standard linear embedding (as known from word
embeddings) fed into one fully connected layer; a single hyperparameter sets
both the embedding dimensionality and the number of hidden units of the
output layer. Parameters of these featurizers are learned using
backpropagation.

Finally, all feature vectors φc(xc) are concatenated into one feature vector
x̃ ∈ R^D, where D = Σc Dc is the sum over all latent dimensions Dc. We refer
to the numerical representation of the values in the to-be-imputed column as
y ∈ {1, 2, …, Dy}, as in standard supervised learning settings; the symbols
of the target column use the same encoding as the aforementioned categorical
variables.

After the featurization x̃ of the input (feature) columns and the encoding y
of the to-be-imputed column, we can cast imputation as a supervised learning
problem: we model the probability p(y | x̃, θ) over all observed values of y
given an input feature vector x̃ with learned parameters θ,

    p(y | x̃, θ) = softmax[W x̃ + b]    (1)

where θ = (W, z, b) are the parameters to learn, with W ∈ R^(Dy×D),
b ∈ R^(Dy), and z a vector containing all parameters of all learned column
featurizers. softmax(q) denotes the elementwise softmax
exp(qj) / Σj' exp(qj'), where qj is the j-th element of a vector q.

Figure 1: Imputation example on non-numerical data with deep learning.

Biessmann, Salinas et al, CIKM 2018
Biessmann et al., JMLR MLOSS, 2019
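The stateless n-gram hashing featurizer can be sketched in a few lines of Python. This is a toy illustration (the bucket count, hash function and example string below are made up here), not the implementation from the paper or DataWig:

```python
import hashlib

def ngram_hash_features(text, num_buckets=64, max_n=5):
    """Map a string to a fixed-size bag-of-character-n-grams vector
    (n = 1..max_n) by hashing each n-gram into one of num_buckets."""
    vec = [0.0] * num_buckets
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            ngram = text[i:i + n]
            # stable hash across runs (unlike the builtin hash())
            digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % num_buckets] += 1.0
    return vec

features = ngram_hash_features("this yellow dress")
```

The resulting per-column vectors would then be concatenated into x̃ and fed into a softmax output layer as sketched above; no training is needed for the featurizer itself.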
Imputation for Tables

Table 2: F1 scores on held-out data for the product-attribute imputation
task, for mode imputation, string matching, LSTM and character n-gram
featurizers. For single product attributes, we report F1 scores relative to
the result obtained with the best model. We also report the median of
absolute F1 across all product attributes. Compared to baselines, our
imputation approach is better in all cases, achieving on average a 23-fold
increase (compared to mode imputation) and a 3-fold increase (compared to a
rule-based system) in imputation quality.

Dataset            Attribute               Mode    String matching  LSTM    N-gram

dress, English     brand                    0.4%   80.3%            best    99.9%
                   manufacturer             1.0%   22.7%            99.4%   best
                   size                     4.8%    0.1%            best    96.1%
monitor, English   brand                   13.3%   44.5%            best    94.5%
                   display technology      30.6%   13.5%            99.8%   best
                   manufacturer            15.1%   33.6%            best    95.3%
notebook, English  brand                    3.9%   48.5%            best    99.1%
                   cpu model manufacturer  82.7%   88.0%            98.9%   best
                   manufacturer             4.3%   35.9%            99.8%   best
shoes, English     brand                    0.5%   91.5%            99.9%   best
                   manufacturer             0.5%   79.2%            98.8%   best
                   size                     2.2%    0.0%            best    82.7%
                   toe style               13.1%   23.5%            96.5%   best
shoes, Japanese    brand                    2.6%   19.3%            98.8%   best
                   color                   20.4%   58.3%            94.5%   best
                   size                    76.7%    2.6%            best    99.2%
                   style                   61.3%   13.4%            92.6%   best

Median of absolute F1 scores               4.1%   30.1%            92.8%   93.0%

Simple imputation models on par with or better than LSTMs


Table 3: F1 scores on held-out data for the DBpedia imputation task. See
Table 2 for a description of columns. In contrast to the relative results
for single product attributes, all numbers for the DBpedia dataset are
absolute F1 scores.

Dataset           Attribute     Mode   String matching   LSTM    N-gram

DBpedia, English  birth place   0.3%   16.3%             54.1%   60.2%
                  genre         1.5%    6.4%             43.2%   72.4%
                  location      0.7%    7.5%             41.8%   60.0%
Automated Monitoring of Data Quality Impact

Research Questions:

Can we predict performance of ML systems

• In the presence of covariate shift

• Without access to ML system internals

ML Production Systems

Classical ML Production Setting

• Training / Validation: Cloud ML / AutoML Service
• Serving: black box ML model
• Customer: engineers without ML expertise
Predicting ML Performance

• Many Data Quality problems will not be detected / fixed


• Errors in preprocessing
• Covariate shifts
• Adversarial attacks

• Some errors will not have an impact on the downstream ML model


• Only looking at input data leads to false alarms
• Other errors are difficult to detect in the input
• Only looking at input data will miss some errors
Predicting ML Performance

Training Data → Model → Performance Estimate

• Apply typical* perturbations to the training data → Perturbed Datasets
• Evaluate the model on each perturbed dataset → Performance Estimates
• Featurize the model output on each perturbed dataset
• Train a Meta Regression Model: featurized model output → performance
• Production Data: featurize the model output and let the Meta Regression
Model produce a Prod Performance Estimate
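A minimal end-to-end sketch of this idea, assuming a synthetic black-box model and using mean prediction confidence as the only output feature (both are illustrative stand-ins, not the published method):

```python
import random
import statistics

random.seed(0)

def black_box(noise):
    """Toy black-box model: returns (confidence, is_correct);
    prediction quality degrades as input corruption grows."""
    conf = min(0.99, max(0.5, 0.95 - noise + random.gauss(0, 0.02)))
    return conf, random.random() < conf

# 1) apply perturbations of increasing strength, collect
#    (featurized model output, observed accuracy) pairs
mean_confs, accs = [], []
for noise in (0.0, 0.1, 0.2, 0.3, 0.4):
    preds = [black_box(noise) for _ in range(2000)]
    mean_confs.append(statistics.fmean(c for c, _ in preds))
    accs.append(statistics.fmean(1.0 if ok else 0.0 for _, ok in preds))

# 2) meta-regressor: ordinary least squares, accuracy ~ mean confidence
mx, my = statistics.fmean(mean_confs), statistics.fmean(accs)
slope = (sum((x - mx) * (y - my) for x, y in zip(mean_confs, accs))
         / sum((x - mx) ** 2 for x in mean_confs))
intercept = my - slope * mx

# 3) serving time: featurize the model output on production data
#    and estimate production accuracy without any labels
prod_mean_conf = 0.8  # hypothetical statistic observed in production
estimated_prod_accuracy = slope * prod_mean_conf + intercept
```

The papers cited below use richer output statistics and stronger regressors; the sketch only shows the shape of the approach.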

Predicting ML Performance under Covariate Shifts

Simple Method to
estimate ML Model
Performance under
realistic covariate shifts /
data quality issues

Sebastian Schelter, Tammo Rukat, Felix Bießmann, JENGA - A Framework to Study the Impact of Data Errors on
the Predictions of Machine Learning Models, EDBT, 2021

Sebastian Schelter, Tammo Rukat, Felix Bießmann, Learning to Validate the Predictions of Black Box Classifiers
on Unseen Data, SIGMOD, 2020

Works for unknown errors

Sebastian Schelter, Tammo Rukat, Felix Bießmann, JENGA - A Framework to Study the Impact of Data Errors on
the Predictions of Machine Learning Models, EDBT, 2021

Sebastian Schelter, Tammo Rukat, Felix Bießmann, Learning to Validate the Predictions of Black Box Classifiers
on Unseen Data, SIGMOD, 2020

Works with few samples

Figure 4: Sensitivity of our performance predictors to the sample size
|Dtest| for different errors and models on the income and heart datasets.
The performance predictor achieves low prediction errors after having
access to a few hundred examples.

Sebastian Schelter, Tammo Rukat, Felix Bießmann, JENGA - A Framework to Study the Impact of Data Errors on
the Predictions of Machine Learning Models, EDBT, 2021

Sebastian Schelter, Tammo Rukat, Felix Bießmann, Learning to Validate the Predictions of Black Box Classifiers
on Unseen Data, SIGMOD, 2020
Modelling Data Quality Problems

Research Question:

Can we generate realistic errors?

The Importance of Errors in Machine Learning

• Injecting errors in data has many applications


• Regularization
• Differential Privacy
• Adversarial training
• Self-Supervised Training
• But which errors are useful and realistic?

All clean data sets are alike; All dirty data is dirty in its own way
Charles Sutton 2019 (AIDA workshop), after Leo Tolstoi

Jenga - A Framework to Study the Impact of Data Errors on
the Predictions of Machine Learning Models

• Software systems are tested


• ML applications are difficult to test
▪ ML models depend on data
▪ Real test data is limited, models overparametrized
◦ Google's underspecification paper (D'Amour et al., 2020)
◦ Stochastic Parrots paper (Bender et al., 2021)
How to test ML Systems?

• Many data sets with a convenient API (OpenML, Vanschoren et al. 2014)
• Data corruptions (Schelter, Rukat, Biessmann, SIGMOD, 2020)

Jenga leverages both to ensure:

• automation
• reproducibility

ML Pipeline Performance Prediction (SIGMOD, 2020, Portland, OR)

1. Corrupt labeled test examples (e.g. a table with columns age, income,
loan?) with user-defined error generators: missing values, scaling, …
2. Compute predictions and scores for the corrupted examples from the
black box ML model.
3. Learn a performance predictor from statistics of the predictions and
scores on the corrupt examples; the learned predictor then outputs
per-dataset performance estimates (0.67 and 0.52 in the example).

Figure 1: Overview of our proposed approach: (a) We learn a performance
predictor for a pretrained black box classifier, using synthetically
corrupted data to explore how data errors impact the resulting prediction
quality; the predictor uses input data to estimate the prediction quality
of a black box model.
Jenga - Example

Jenga - Core Components

Tasks:
▪ binary/multiclass classification
▪ regression

Corruptions:
▪ Text
▪ Images
▪ Tabular data

Evaluators:
▪ Applies corruptions and tests ML model on task

Jenga - Available Data Errors

• Text: leetspeak
• Images: standard augmentations
• Structured data (tables)
▪ missing data
◦ missing completely at random
◦ missing at random
◦ missing not at random
▪ swapping columns
▪ numerical data
◦ additive Gaussian noise
◦ scaling
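As a toy illustration, a leetspeak text corruption can be as simple as a character translation table (the mapping below is made up, not Jenga's exact table):

```python
def leetspeak(text):
    """Replace selected characters with visually similar digits."""
    return text.translate(str.maketrans("aeiost", "431057"))

corrupted = leetspeak("data quality")
```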

Jenga - Available Error Sampling Schemes

• Inspired by the imputation literature, the error probability is sampled:

• Completely at random
• For each value, we draw the error probability from a uniform distribution

• At random
• Error probability depends on values in other columns

• Not at random
• Error probability depends on the values themselves
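The three schemes can be sketched on a toy table; the column names, thresholds and corruption fractions below are illustrative assumptions, not Jenga's API:

```python
import random

random.seed(42)

# toy table: 1000 rows with two numeric columns
rows = [{"age": random.randint(18, 80), "income": random.randint(20, 200)}
        for _ in range(1000)]

def inject_missing(rows, column, scheme, fraction=0.3):
    """Set `column` to None according to the given sampling scheme."""
    out = [dict(r) for r in rows]  # leave the input table untouched
    for r in out:
        if scheme == "MCAR":    # completely at random: uniform probability
            hit = random.random() < fraction
        elif scheme == "MAR":   # at random: depends on *another* column
            hit = r["age"] > 50 and random.random() < 2 * fraction
        else:                   # MNAR: depends on the value itself
            hit = r[column] > 100 and random.random() < 2 * fraction
        if hit:
            r[column] = None
    return out

mcar = inject_missing(rows, "income", "MCAR")
mnar = inject_missing(rows, "income", "MNAR")
```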

Jenga - Error Example

Summary

• Data Quality is fundamental to responsible usage of ML


• Automation is key to scalable DQ management
• We work on automation of Data Quality
• Monitoring
• Improvement
• Modelling

• Limitations:
• Adoption of DQ tooling remains difficult
• Many DQ problems are difficult to model
• Data on errors is scarce (but see AIAAIC error DB)

Open Source

Measuring Data Quality:
https://github.com/awslabs/deequ

Fixing Data Quality issues:
https://github.com/awslabs/datawig

Predicting Data Quality issues:
https://github.com/schelterlabs/jenga

References

• Felix Bießmann, Jacek Golebiowski, Tammo Rukat, Dustin Lange, Philipp Schmidt, Automated data validation in machine learning
systems, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2021

• Sebastian Schelter, Tammo Rukat, Felix Bießmann, Towards Automated Data Quality Management for Machine Learning,
Proceedings of the 2020 ML Ops workshop at the Conference on Machine Learning and Systems (MLSys), 2020

• Sebastian Schelter, Felix Biessmann, Dustin Lange, Tammo Rukat, Phillipp Schmidt, Stephan Seufert, Pierre Brunelle, Andrey
Taptunov, Unit Testing Data with Deequ, Proceedings of the International Conference on Management of Data, 2019

• Sebastian Schelter, Stefan Grafberger, Philipp Schmidt, Tammo Rukat, Mario Kiessling, Andrey Taptunov, Felix Biessmann, Dustin
Lange, Differential Data Quality Verification on Partitioned Data, IEEE 35th International Conference on Data Engineering (ICDE), 2019

• Stefan Grafberger, Philipp Schmidt, Tammo Rukat, Mario Kiessling, Andrey Taptunov, Felix Bießmann, Dustin Lange, Deequ - Data
Quality Validation for Machine Learning Pipelines, ML Systems Workshop, NeurIPS 2018

• Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Bießmann, Andreas Grafberger, Automating Large-Scale
Data Quality Verification, Very Large Databases (VLDB), 2018

• Felix Biessmann, David Salinas, Sebastian Schelter, Philipp Schmidt, Dustin Lange, "Deep Learning for Missing Value Imputation in
Tables with Non-Numerical Data", CIKM 2018

• Biessmann et al., Journal of Machine Learning Research, 2019

• Jaeger, Allhorn, Biessmann, Frontiers in Big Data, 2021
Fair ML Automated

• Many data sets are biased


• Resulting ML Model predictions are biased
• Building fair ML models requires domain expertise
• Automating the application of fairness constraints is important

Fair ML Automated: Declarative Feature Selection

Neutatz, Biessmann, Abedjan, SIGMOD, 2021
