DQ is a multidimensional concept: the state of qualitative or quantitative pieces of information. Data are of high quality if they are fit for their intended uses in operation, decision making and planning. Metadata is "data that provides information about other data."

Accuracy is usually measured as a percentage: proportion of the stored data versus the potential of 100% accuracy.

Median: middle value of a distribution; not sensitive to outliers.
Chapter 2

Measuring data quality dimensions serves to understand and analyze data quality problems and to resolve or minimize them.

1. what data is intended to represent (definition of terms and research/business rules)
2. how data effects this representation: conventions, including physical data definition (format, field sizes, data types, etc.)
3. the system design and system processing
4. the limits of that representation (what the data does not represent)
5. how data is used, and how it can be used

The 6 most common dimensions:
1. Completeness → the capability of representing all and only the relevant aspects of the reality of interest; data are present or absent.
2. (In)Accuracy → the quality or state of being free from error in content and form.
3. Consistency → the ability to belong together without contradiction.
4. Validity → the quality of being well-grounded, sound, or correct.
5. Timeliness → the state of being appropriate or adapted to the time or the occasion.
6. Uniqueness → the state of being the only one.

Each data quality dimension captures one measurable aspect of data quality. A dimension is objective when it is based on quantitative metrics.
A) Task-independent metrics reflect states of the data without contextual knowledge and can be applied to any data set, regardless of the tasks at hand.
B) Task-dependent metrics are developed in specific application contexts; included are an organization's business rules, company and government regulations, etc.

Exercise → Without missing values the total data set is 28; one name value is not accurate, so just 27 values in the data set are accurate.

Accuracy % = (number of accurate values of a data element in the data set, excluding manifestations of missing values (27)) × 100 / (number of values for the data element in the data set, excluding manifestations of missing values (28)) = 27 × 100 / 28 = 96.43%

Inaccuracy % = (number of inaccurate values of a data element in the data set, excluding manifestations of missing values (1)) × 100 / (number of values for the data element in the data set, excluding manifestations of missing values (28)) = 1 × 100 / 28 = 3.57%

Four levels of measurement:
Nominal scale: each category is assigned a number (the code is arbitrary).
Ordinal scale: each category is associated with an ordered number.
Metric scales: the actual measured value is assigned.
*A scale from a higher level can be transformed into a scale on a lower level, but not vice versa.

Box plot: a method for graphically depicting groups of numerical data through their quartiles.
Histogram: shows the distribution of numerical data.
Scatter plot: shows two variables for a set of data.
Mosaic plot: shows two or more qualitative variables.

Exercise → Boxplot 1: Both distributions (men and women) seem to be right skewed. The median body weight of men is larger. That the distributions are right skewed is confirmed by the histograms for men and for women.

Chapter 3
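The accuracy and inaccuracy percentages from the Chapter 2 exercise (28 non-missing values, 1 inaccurate) can be reproduced in a few lines of Python; the value list and the equality check below are made-up stand-ins for the exercise's name column:

```python
# Accuracy/inaccuracy as a percentage, excluding missing values.
# The data is a made-up stand-in for the 28-value name column:
# 27 accurate values, 1 inaccurate value, plus 2 missing entries.
values = ["ok"] * 27 + ["wrong"] + [None] * 2

non_missing = [v for v in values if v is not None]          # 28 values
accurate = [v for v in non_missing if v == "ok"]            # 27 values

accuracy_pct = len(accurate) * 100 / len(non_missing)
inaccuracy_pct = (len(non_missing) - len(accurate)) * 100 / len(non_missing)
print(round(accuracy_pct, 2), round(inaccuracy_pct, 2))  # → 96.43 3.57
```

Both metrics deliberately exclude missing values from the denominator, exactly as the formulas above do; missingness is measured separately by completeness.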
Completeness

Completeness is usually measured as a percentage: proportion of the stored data versus the potential of 100% complete. Questions: Is all the necessary information available? Are critical data values missing in the data set? Are all the data sets recorded? Are all mandatory data items recorded?

EDA: Exploratory data analysis. What exactly is this thing called data wrangling? It's the ability to take a messy, unrefined source of data and wrangle it into something useful.

Exercise → The data set consists of 6 variables, sample size is n = 99. In total 4 missings → 3 for Final and 1 for TakeHome.

Boxplot 2: Distributions of random numbers characterized by more or less strong right/left skewness; small or large spread; outliers. 50% of the data points are larger than the median (0-1).
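The completeness percentage for this exercise follows the same pattern; a minimal Python check using the numbers stated above (6 variables, n = 99, 4 missings; the resulting 99.33% is implied by, not stated in, the notes):

```python
# Completeness as a percentage: stored values vs. the potential of
# 100% complete. Numbers mirror the exercise: 6 variables, n = 99,
# 4 missing values (3 in Final, 1 in TakeHome).
n_rows, n_vars = 99, 6
n_missing = 4

total_cells = n_rows * n_vars            # 594 potential data points
stored = total_cells - n_missing         # 590 actually present
completeness_pct = stored * 100 / total_cells
print(round(completeness_pct, 2))  # → 99.33
```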
Cumulative relative frequency shows that 55% of all data points belong to category 0 or 1: 24% of the data points are in category 0 and 31% in category 1.
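The cumulative figure can be reproduced with a short Python sketch; the per-category counts below are made up to match the stated percentages (24% in category 0, 31% in category 1):

```python
# Cumulative relative frequency over ordered categories.
# Counts are hypothetical and chosen to reproduce the text's figures.
from itertools import accumulate

counts = {0: 24, 1: 31, 2: 20, 3: 15, 4: 10}  # made up, sums to 100
total = sum(counts.values())

rel = [c / total for c in counts.values()]     # relative frequencies
cum_rel = list(accumulate(rel))                # cumulative relative freq.
print(round(cum_rel[1] * 100))  # → 55  (categories 0 and 1 together)
```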
Removing the NA
Chapter 7
General rule for parameter k → k = 4 for all databases (for 2-dimensional data).

The hull plot shows that the groups that appear in the original scatterplot are identified as separate clusters by the DBSCAN algorithm. The DBSCAN algorithm identifies the group at the bottom right as one of these clusters. The noise point is found at the bottom left.
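These notes choose DBSCAN's Eps at the knee of the sorted k-NN distance curve, with k = MinPts = 4. A minimal pure-Python sketch of that curve, on made-up toy points (not the course data):

```python
# Sorted k-NN distance curve used to pick DBSCAN's Eps: a sudden
# increase (a knee) separates dense points from likely outliers.
# Toy 2-D points are made up; k = 4 follows the "4 for all databases
# (2-dimensional data)" rule quoted in the notes.
from math import dist

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
k = 4

def knn_distance(p, pts, k):
    """Distance from p to its k-th nearest neighbour (excluding p)."""
    return sorted(dist(p, q) for q in pts if q is not p)[k - 1]

curve = sorted(knn_distance(p, points, k) for p in points)
print(curve[-1] > 3 * curve[-2])  # → True: (10, 10) sits after the knee
```

In practice the curve is plotted (e.g. the `kNNdistplot` function of the R `dbscan` package produces it) and Eps is read off just below the knee.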
A sudden increase of the kNN distance (a knee) indicates that the points to the right are most likely outliers. Choose Eps where the knee is → 1.3. Here MinPts is chosen as 4, according to: "... eliminate the parameter MinPts by setting it to 4 for all databases (for 2-dimensional data)".

Chapter 8

Data → Information → Knowledge → Wisdom

Information Quality (InfoQ) is the potential of a dataset to achieve a specific goal using a given data analysis method.

Quality of ... goal definition g, data X, analysis f and utility measure U.

Quality of analysis f: refers to the adequacy of the empirical analysis considering the data and goal at hand. It reflects the adequacy of the modelling with respect to the data and for answering the question of interest.

Chapter 9

Detecting inherent quality problems in research data

In most cases the replication estimate r_r is smaller than the corresponding original estimate r_o. Furthermore, a substantial number of the replication estimates do not achieve statistical significance at the one-sided 2.5% level, while almost all original estimates did. It turns out that in only 21 of 73 replications (≙ 29%) the result of the original study regarding a significant correlation is reproduced. This indicates that either the original studies are not valid or that replication is difficult, also because it is not possible to include or compare all information from the original study in the replication.

Chapter 10

Data quality aspects in large data sets («BigData»)
Therefore, the eps will be set to 1.3 and the minPts to 4. According to this output, there are in total 4 clusters and one noise point. Across the clusters, the number of data points per cluster is very similar. Show hull plot. To make sure that I have not missed any outliers, I will repeat the same steps with eps = 1.5. The hull plot looks the same, and the number of data points within each cluster is very similar. Therefore, I conclude that no further outliers are detected with DBSCAN.
Clustering
Analysis: right-skewed distribution
The boxplot looks very compressed because the value ranges of the two variables are different. Therefore, scaling would be good.

Boxplot: The boxplot looks much better now. For both variables the median is pretty much in the middle. In the density boxplot, the lower whisker is shorter than the upper one, so the distribution might be slightly skewed to the right.
According to the cluster plot, elbow plot and silhouette plot, it looks like 3 clusters would be the better choice. In the silhouette plot you see that the top "cluster" is not very close to 1, which indicates that it is not far enough away from the other clusters. You can see that as well in the scatter plot with the clustering: the black dots are very close to the blue ones. So it looks like there is one more cluster than needed. And by looking closer, you see that the x-axis covers a very short range; therefore the scatter plot shows 2 clusters (for the blue one), but they could pretty much be one if you "zoom" out.

Now it looks much better: the scatter plot shows proper clusters, and the silhouette plot shows that all clusters are further apart than before. The hierarchy plot also shows that 3 is the best-suited number of clusters. So, comparing 3 and 4 clusters, I would suggest going with 3 clusters.
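The silhouette reasoning can be made concrete by computing the silhouette coefficient directly from its definition, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the point's own cluster and b(i) the smallest mean distance to another cluster. The toy 1-D points and labels below are made up, not the course data:

```python
# Silhouette coefficient from its definition: values near 1 mean a
# point sits well inside its own cluster, far from the others.
# Toy 1-D data and cluster labels are made up for illustration.
from math import dist
from statistics import mean

X = [(0.0,), (0.2,), (0.4,), (5.0,), (5.2,), (5.4,)]
labels = [0, 0, 0, 1, 1, 1]

def silhouette(i):
    own = [j for j, l in enumerate(labels) if l == labels[i] and j != i]
    a = mean(dist(X[i], X[j]) for j in own)                 # cohesion
    b = min(                                                # separation
        mean(dist(X[i], X[j]) for j, l in enumerate(labels) if l == other)
        for other in set(labels) - {labels[i]}
    )
    return (b - a) / max(a, b)

avg = mean(silhouette(i) for i in range(len(X)))
print(avg > 0.9)  # → True: well-separated toy clusters score close to 1
```

A cluster whose points score well below 1, as in the plot discussed above, is a hint that it is not clearly separated and that a smaller k may fit better.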
Missing analysis
We have 6 variables with 99 observations → 594 data points. According to this plot and output, you see that there are in total 4 missing data points, which means 0.67% of the data is missing (which is pretty good for the real world). There are 3 cases in which the var_05 value is missing and 1 case in which var_04 is missing. The p-value (0.794) of the chi-squared test is higher than 0.05, which is why the test is not significant. So the missing values are MCAR (missing completely at random).
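The 0.67% figure follows directly from the counts above; a tiny Python check (variable names as in the text):

```python
# Share of missing data points: 6 variables × 99 observations = 594
# cells, 4 of them missing (3 in var_05, 1 in var_04, as in the text).
missing_per_var = {"var_04": 1, "var_05": 3}

total_cells = 6 * 99                       # 594 data points
n_missing = sum(missing_per_var.values())  # 4 missing
print(round(n_missing * 100 / total_cells, 2))  # → 0.67
```

A non-significant test of the missingness pattern, as here, is consistent with MCAR: it gives no evidence that missingness depends on the observed data.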