
Chapter 1

DQ is a multidimensional concept. It is the state of qualitative or quantitative pieces of information; data is of high quality if it is fit for its intended uses in operations, decision making and planning. Metadata is "data that provides information about other data."

Understanding data requires knowing: 1) what data is intended to represent (definition of terms and research/business rules); 2) how data effects this representation: conventions, including physical data definition [format, field sizes, data types, etc.]; 3) the system design and system processing; 4) the limits of that representation (what the data does not represent); 5) how data is used, and how it can be used.

Chapter 2

Measuring data quality dimensions serves to understand and analyze data quality problems in order to resolve or minimize them.

The 6 most common dimensions: 1) Completeness → the capability of representing all and only the relevant aspects of the reality of interest; data are present or absent. 2) (In)Accuracy → the quality or state of being free from error in content and form. 3) Consistency → the ability to belong together without contradiction. 4) Validity → the quality of being well-grounded, sound, or correct. 5) Timeliness → the state of being appropriate or adapted to the time or the occasion. 6) Uniqueness → the state of being the only one.

Each data quality dimension captures one measurable aspect of data quality. A measurement is objective when it is based on quantitative metrics: A) task-independent metrics reflect states of the data without contextual knowledge and can be applied to any data set, regardless of the tasks at hand; B) task-dependent metrics are developed in specific application contexts; included are an organization's business rules, company and government regulations, etc.

Completeness is usually measured as a percentage: the proportion of the stored data versus the potential of 100% complete. Questions: Is all the necessary information available? Are critical data values missing in the data set? Are all the data sets recorded? Are all mandatory data items recorded?

Exercise → The data set consists of 6 variables, sample size n = 99. In total 4 missings: 3 for Final and 1 for TakeHome.

Accuracy is usually measured as a percentage: the proportion of the stored data versus the potential of 100% accuracy.

Exercise → Without missing values the total data set is 28; one name value is not accurate, so just 27 values in the data set are accurate.

Accuracy % = (number of accurate values of a data element in the data set, excluding manifestations of missing values (27)) × 100 / (number of values for the data element in the data set, excluding manifestations of missing values (28)) = 27 × 100 / 28 = 96.43%

Inaccuracy % = (number of inaccurate values of a data element in the data set, excluding manifestations of missing values (1)) × 100 / (number of values for the data element in the data set, excluding manifestations of missing values (28)) = 1 × 100 / 28 = 3.57%
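A minimal R sketch of this calculation (the vector values and the marker "correct" are illustrative assumptions, not part of the exercise data):

    # toy data element: 28 non-missing values, one of them inaccurate
    values <- c(rep("correct", 27), "wrong", NA)           # NA = missing value
    n_total    <- sum(!is.na(values))                      # 28 values, missings excluded
    n_accurate <- sum(values == "correct", na.rm = TRUE)   # 27 accurate values
    accuracy_pct   <- n_accurate * 100 / n_total           # 96.43
    inaccuracy_pct <- 100 - accuracy_pct                   # 3.57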
Chapter 3

EDA: Exploratory Data Analysis. What exactly is this thing called data wrangling? It is the ability to take a messy, unrefined source of data and wrangle it into something useful.

Four levels of measurement: Nominal scale: each category is assigned a number (the code is arbitrary). Ordinal scale: each category is associated with an ordered number. Metric scales (interval and ratio): the actual measured value is assigned. *A scale from a higher level can be transformed into a scale on a lower level, but not vice versa.

Key figures: Mean (arithmetic mean): very sensitive to outliers. Median: the middle value of a distribution; not sensitive to outliers.

Graphs: Box plot: a method for graphically depicting groups of numerical data through their quartiles. Histogram: depicts the distribution of numerical data. Scatter plot: depicts two variables for a set of data. Mosaic plot: depicts two or more qualitative variables.

Exercise → Boxplot 1: Both distributions (men and women) seem to be right skewed. The median body weight of men is larger. That the distributions are right skewed is confirmed by the histograms for men and for women.

Boxplot 2: Distributions of random numbers characterized by more or less strong right/left skewness, small or large spread, and outliers. 50% of the data points are larger than the median (0-1).

Cumulative relative frequency shows that 55% of all data points belong to category 0 or 1: in category 0 are 24% of the data points, in category 1 31% of the data points.
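These key figures and graphs can be produced with a few lines of base R (the vector weight is an assumed stand-in for the body-weight data):

    weight <- rnorm(200, mean = 75, sd = 12)   # assumed example data
    summary(weight)   # key figures: minimum, quartiles, median, mean, maximum
    boxplot(weight)   # quartiles and potential outliers
    hist(weight)      # shape of the distribution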

Chapter 4

Classification of missings – types:

MCAR (Missing Completely at Random): missing values are completely randomly distributed across all cases (persons, etc.). Cases with missing values do not differ from cases without missing values. Whether a value is missing from the data set is not related to any of the variables collected; there is no correlation between the occurrence of missing values and other variables.

MAR (Missing at Random): the occurrence of a missing value is conditionally random and can be explained by the values of other variables. Persons with complete data differ from those with incomplete data.

MNAR (Missing Not at Random): values are systematically missing, but no information is available to model their absence. There is no adequate statistical procedure to avoid bias.

Analysis and treatment of missing values:

Deletion: 1) listwise deletion (complete-case analysis): delete all rows that have a missing value; 2) pairwise deletion (available-case analysis): considers all available data of a person and leads to different sample sizes for different variables.

Imputation: 1) single imputation / unit imputation: missing values are replaced by the mean/median of the variable (MCAR setting: no bias; MAR and MNAR settings: bias possible through under- or overestimation), or they are replaced by values derived from a regression analysis.

Exercise → MCAR missing pattern: The last row shows the total number of missing values for each variable and in total: 37 for Ozone, 7 for Solar.R, 44 in total. In 2 cases the pattern is such that there is a missing in both variables Solar.R and Ozone. Sample N = 153. The graph shows the same information, but additionally made visible graphically. Here a chi-squared test is conducted; the test is significant with a p-value of 0.00142, which is less than 0.05.

Imputation – type "mean": missings are replaced by the mean value. Example "TakeHome": the missing value is replaced by the mean of the variable in the original data set.
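A minimal base-R sketch of mean imputation (the data frame df and the variable TakeHome are toy stand-ins for the exercise data):

    df <- data.frame(TakeHome = c(15, 20, NA, 18))   # one missing value
    df$TakeHome[is.na(df$TakeHome)] <- mean(df$TakeHome, na.rm = TRUE)
    df$TakeHome   # the NA is now the mean of the observed values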
Imputation – type "sample": each time you run an imputation of type "sample" you get different results, since "sample" is defined as "random sample from observed values". Example "TakeHome": the missing value is replaced by a randomly drawn value, here 16.91.

Exercise → Missing analysis: the mice package is used to impute MAR values. It is one of the fastest packages and probably a gold standard for imputing values.
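A hedged sketch of multiple imputation with the mice package, applied here to the built-in airquality data set, whose Ozone and Solar.R columns contain NAs (the values of m, method and seed are illustrative choices):

    library(mice)

    md.pattern(airquality)   # inspect the missing pattern
    imp <- mice(airquality, m = 5, method = "pmm", seed = 123)   # predictive mean matching
    completed <- complete(imp, 1)   # extract the first completed data set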
Chapter 5

Point vs. contextual vs. collective outliers: Point outliers (global outliers) deviate extremely from well-defined norms or given concepts of expected behavior. Contextual outliers: a data object is extremely different in a specific context (but not in every context); each data object can be described by two kinds of attributes: 1) contextual attributes (date and location in the temperature example) and 2) behavioral attributes (temperature, humidity and pressure in the temperature example). Collective outliers: a group of data objects falls extremely far from the well-defined norms of a data set or from given concepts of expected behavior; such a collection is known as a collective outlier. Example: 100 delayed orders form a collective outlier.

Anomaly detection with the percentiles method: observations that lie outside the interval formed by the 2.5 and 97.5 percentiles are considered potential outliers.

Parametric models assume a specific family of distributions to describe the normal data. 1) Univariate methods deal with one random variable at a time, so each variable has to be modelled independently using its own distribution function. 2) Multivariate methods allow vectors of random variables to be modelled using the same distribution function.

Grubbs's test → detects whether the highest or the lowest value in a dataset is an outlier. Requires normally distributed data. The test has too little power for sample sizes n ≤ 6 and should then not be executed. The larger the sample, the less the p-value can be used as a measure of validity.

Rosner's test → can detect several outliers at once and solves the problem of masking (masking: an outlier that is close in value to another outlier might remain undetected). Requires normally distributed data. The test is most appropriate for large sample sizes, n ≥ 20.

Exercise → Visualization techniques. Approach: 1) dataset, 2) key figures and graphs, 3) boxplot, 4) percentile method. The dataset contains n = 13908 data points, of these 1389 NA. Mean and median of wtval are comparable; this indicates that the variable is approximately normally distributed. After removing the NA and generating the histogram, potential outliers show up: in fact it is a bimodal/normal distribution, and large values on the right side indicate that there could be outliers. There are 199 outliers according to the definition of the box plot. Calculation with the percentiles method: choosing a 5% proportion of all values to define the outliers gives the limits; the lower limit for outliers is 50 kg, the upper limit is 113.2 kg.

Exercise → Statistical tests. Outlier test according to Grubbs: the null hypothesis is rejected at the 5% significance level → the highest value 184.3 is an outlier. For the lowest value the null hypothesis is kept at the 5% significance level → the lowest value 35.6 is not an outlier. Rosner's test: the input k = 3 corresponds to the decision that there are 3 suspected outliers; simulations were not run for k > 10 or k > floor(n/2). The output "Outlier" shows that at least 10 potential outliers are detected by Rosner's test; the values range from 152.5 to 184.3.
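Both tests are available in R packages; a sketch under the assumption that the vector wt stands in for the weight data (grubbs.test() is from the outliers package, rosnerTest() from EnvStats):

    library(outliers)   # grubbs.test()
    library(EnvStats)   # rosnerTest()

    wt <- c(rnorm(100, mean = 75, sd = 12), 184.3)   # assumed data, one extreme value
    grubbs.test(wt)                    # tests the highest value by default
    grubbs.test(wt, opposite = TRUE)   # tests the lowest value instead
    rosnerTest(wt, k = 3)              # tests up to 3 suspected outliers at once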

Chapter 6

Outliers can be identified as: 1) data points that do not fit well into the clustering of the normal class, or 2) clusters that are far apart from the clusters of the normal class. Clustering-based methods can accordingly be categorized into methods that 1) define a single point as outlying if it does not fit the clustering well (typically measured by the distance from a cluster center) or 2) consider small clusters as outliers.

Cluster-based approach → initially run the k-means clustering algorithm to find k clusters; then calculate the accuracy and the silhouette index of the k-means clustering. Scree plot: the position of the elbow is the indicator of the number of clusters.

Exercise → Cluster-based outlier detection. The dataset contains n = 120 data points; it shows the relationship between learning effort in self-study [hours per week] and success on the final exam [index 0 to 10] in a master's program. Preparation: to perform k-means clustering the variables must be standardized, due to the different units of the two variables; in a second preparation step the distances between the data points are calculated. Running k-means clustering → the number of clusters has to be set in advance: centers = 4; the choice is based on the fact that the scatter plot shows 4 small clusters. Four clusters are created that show approximately equal group sizes. Displaying the clusters: the four clusters correspond to the expected pattern; in this solution, one element assigned to cluster 3 is close to cluster 1. Silhouette plot: it shows a rather "balanced" picture, with s_i between 0.56 and 0.79; the silhouette of cluster 2, the cluster at the bottom right, has the largest value, indicating that this cluster is denser, has a greater distance from the other clusters and is more distinct from them. Dendrogram/tree diagram: it shows a relatively even structure; there is no large subtree that stands out from the others. Result: as a result of the cluster analysis these elements could clearly be labeled as outliers, but because the cluster analysis has formed a larger cluster in the middle field, which can be interpreted well in terms of content, and because the interpretation makes sense as a whole together with the elements on the bottom right side, they are kept as part of the whole.
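A short R sketch of this workflow (the data frame study with columns hours and exam is an assumed stand-in for the exercise data):

    study <- data.frame(hours = runif(120, 0, 20),   # assumed example data
                        exam  = runif(120, 0, 10))
    z <- scale(study)              # standardize: the variables have different units
    set.seed(42)
    km <- kmeans(z, centers = 4)   # number of clusters fixed in advance
    table(km$cluster)              # group sizes of the four clusters
    plot(z, col = km$cluster)      # display the clusters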

Chapter 7

Linear regression model: test the regression model by evaluating every data point against the model.

RANSAC algorithm: Random Sample Consensus (RANSAC) is an iterative procedure for estimating the parameters of a mathematical model from a set of observed data containing outliers. The goal is to ensure that the outliers do not affect the estimation of the model. Inliers: data whose distribution can be explained by some set of model parameters. Outliers: data that do not fit the model.

Input into the function:
data: a set of observed data points
n: minimum number of data points required to fit the model
k: maximum number of iterations allowed in the algorithm
t: threshold value to determine when a data point fits a model
d: number of close data points required to assert that a model fits well to the data
Return: bestfit, the model parameters which best fit the data (or null if no good model is found).

Exercise → RANSAC. The dataset contains n = 120 data points. Run the RANSAC function. *n and d are equal or similar numbers.
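A minimal base-R sketch of the procedure described above (the linear model y ~ x and all parameter values are illustrative assumptions, not the course's reference implementation):

    ransac_lm <- function(data, n, k, t, d) {
      bestfit <- NULL
      besterr <- Inf
      for (i in seq_len(k)) {
        idx  <- sample(nrow(data), n)            # maybe-inliers: random subset of size n
        cand <- lm(y ~ x, data = data[idx, ])    # fit a candidate model
        close <- which(abs(data$y - predict(cand, data)) < t)   # points close to it
        if (length(close) >= d) {                # enough points fit the candidate well
          refit <- lm(y ~ x, data = data[close, ])
          err   <- mean(abs(data$y[close] - predict(refit, data[close, ])))
          if (err < besterr) { besterr <- err; bestfit <- refit }
        }
      }
      bestfit   # NULL if no good model is found
    }

    set.seed(1)
    df <- data.frame(x = 1:120, y = 2 * (1:120) + rnorm(120))
    df$y[1:6] <- df$y[1:6] + 60   # inject a few outliers
    fit <- ransac_lm(df, n = 10, k = 200, t = 3, d = 80)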
Outlier detection with DBSCAN (Density-Based Spatial Clustering of Applications with Noise): the k-means algorithm implicitly assumes a spherical shape for the inliers; with DBSCAN, inliers form areas of high density, and these form the "building blocks" for constructing arbitrarily-shaped areas.

Exercise → DBSCAN. Calculate and plot kNNdist (k-nearest-neighbor distances). General rule for the parameter k → k = 4 for all databases (for 2-dimensional data). A sudden increase of the kNN distance (a knee) indicates that the points to the right of it are most likely outliers. Choose eps where the knee is → 1.3. MinPts is likewise chosen as 4, following the rule "... eliminate the parameter MinPts by setting it to 4 for all databases (for 2-dimensional data)". Result: 0 → 1 noise point; 1 → cluster 1 consisting of 28 data points; 2 → cluster 2 consisting of 31 data points. Show hull plot: the hull plot shows that the groups that appear in the original scatter plot are identified as separate clusters by the DBSCAN algorithm; the group at the bottom right is one of these clusters, and the noise point is found at the bottom left.
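A hedged sketch with the dbscan package (the matrix pts is an assumed stand-in for the exercise data; eps and minPts follow the values chosen above):

    library(dbscan)

    pts <- as.matrix(iris[, 1:2])   # assumed 2-dimensional example data
    kNNdistplot(pts, k = 4)         # look for the knee in the 4-NN distances
    res <- dbscan(pts, eps = 1.3, minPts = 4)
    table(res$cluster)              # 0 = noise, 1..n = clusters
    hullplot(pts, res)              # convex hulls of the found clusters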

Chapter 8

Data → Information → Knowledge → Wisdom

Information Quality (InfoQ) is the potential of a dataset to achieve a specific goal using a given data analysis method. It comprises the quality of the goal definition g, the data X, the analysis f, and the utility measure U. Quality of analysis f: refers to the adequacy of the empirical analysis considering the data and the goal at hand; it reflects the adequacy of the modelling with respect to the data and for answering the question of interest.

Chapter 9

Detecting inherent quality problems in research data.

Hierarchy of the terms: Reproducibility < Replicability < Repeatability. Reproducibility (different team, different experimental setup); Replicability (different team, same experimental setup); Repeatability (same team, same experimental setup).

If two variables originate from populations with the same distribution, the points lie approximately on the angle bisector; the greater the deviation from the bisector, the more likely it can be assumed that the two variables originate from populations with different distributions.

Exercise → Replication success: In most cases the replication estimate r_r is smaller than the corresponding original estimate r_o. Furthermore, a substantial number of the replication estimates do not achieve statistical significance at the one-sided 2.5% level, while almost all original estimates did. It turns out that only in 21 of 73 replications (≙ 29%) is the result of the original study regarding a significant correlation reached. This indicates either that the original studies are not valid or that replication is difficult, also because it is not possible to include or compare all the information of the original study in the replication.
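A comparison like the one described above can be drawn in base R with a quantile-quantile plot (that this is the intended plot type is an assumption; the vectors ro and rr are stand-ins for the original and replication estimates):

    ro <- runif(73, 0.2, 0.8)              # assumed original estimates
    rr <- ro - abs(rnorm(73, 0.15, 0.1))   # assumed, mostly smaller, replication estimates
    qqplot(ro, rr)   # points on the angle bisector = same distribution
    abline(0, 1)     # the angle bisector for reference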
Chapter 10

Data quality aspects in large data sets ("Big Data")

DBSCAN analysis: This sample has 2 variables with 120 observations each (the total number of data points is 240). Both variables are numerical. Neither the medians nor the means are similar. The scatter plot indicates the presence of 4 clusters. The orange line indicates a sudden increase of the kNN distance from 1.3, and also from 1.5; such a knee indicates that the points to the right of it are most likely outliers. Therefore eps is set to 1.3 and minPts to 4. According to this output there are in total 4 clusters and one noise point; across the clusters the number of data points per cluster is very similar. Show hull plot. To make sure that no outliers were missed, the same steps are repeated with eps = 1.5: the hull plot looks the same and the number of data points within each cluster is very similar. Therefore no outliers are detected with DBSCAN.

Clustering analysis: The boxplot looks very compressed because the value ranges of the two variables differ; scaling is therefore advisable. After scaling the boxplot looks much better: for both variables the median is pretty much in the middle, and in the density boxplot the lower whisker is shorter than the upper one, so the distribution might be slightly right skewed. According to the cluster plot, elbow plot and silhouette plot, 3 clusters would be the better choice. In the silhouette plot the top "cluster" is not very close to 1, which indicates that it is not far enough away from the other clusters. The scatter plot with the clustering shows the same: the black dots are very close to the blue ones, so there appears to be one more cluster than needed. Looking closer, the x-axis covers a very short range, so the scatter plot shows 2 clusters (for the blue one) that can pretty much be one when "zoomed out". After re-running with 3 clusters the result looks much better: the scatter plot shows proper clusters, and the silhouette plot shows that all clusters are further apart than before. The hierarchy plot also shows that 3 is the best-suited number of clusters. So, comparing 3 and 4 clusters, I would suggest going with 3.

Missing analysis: We have 6 variables with 99 observations → 594 data points. According to the plot and output there are in total 4 missing data points, which means 0.67% of the data is missing (which is pretty good in the real world). There are 3 cases in which the var_05 value is missing and 1 case in which var_04 is missing. The p-value (0.794) of the chi-squared test is higher than 0.05, so the test is not significant. The missing values are therefore MCAR (missing completely at random).