Data analysis techniques for data quality

3/5/20
Data Analysis
Data Mining
ITERA
Semester II 2019/2020

Data Analysis: What
• Data analysis: analysis to determine how the data can be preprocessed in order to (Han & Kamber, 2011):
  • improve the quality of the data (and, consequently, of the mining results)
  • improve efficiency and ease of mining process
Data Quality: Common Problem
• Common problem in data science: noisy, missing, inconsistent data.
• Another problem:
  • imbalanced dataset
  • Outliers (extreme values). An outlier is a piece of data that is an abnormal distance from other points.

Noisy Data
Noise types: class (label) noise and attribute noise
• Class noise: contradictory examples, mislabeled examples
• Attribute noise: erroneous (at data entry, violation of known data constraints)

Attr1 | Attr2 | Class
0.25 | red | positive
0.25 | red | negative
1.02 | green | positive
0.99 | green | negative

Annotations in the figure: contradictory (rows 1 and 2 share the same attribute values but differ in class), mislabeled, error at data entry.
https://sci2s.ugr.es/noisydata
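A minimal sketch of how contradictory examples (class noise) could be detected, using the four rows of the example table above. The helper name `find_contradictions` is hypothetical; it only groups rows by their attribute values and reports the groups that carry more than one class label. It does not detect mislabeled examples or data-entry errors.

```python
from collections import defaultdict

# Rows copied from the slide's example table: (Attr1, Attr2, Class).
rows = [
    (0.25, "red", "positive"),
    (0.25, "red", "negative"),
    (1.02, "green", "positive"),
    (0.99, "green", "negative"),
]


def find_contradictions(examples):
    """Return attribute tuples that occur with more than one class label."""
    labels_by_attributes = defaultdict(set)
    for *attributes, label in examples:
        labels_by_attributes[tuple(attributes)].add(label)
    return {
        attributes: sorted(labels)
        for attributes, labels in labels_by_attributes.items()
        if len(labels) > 1
    }


contradictions = find_contradictions(rows)
print(contradictions)
```

Only the pair (0.25, red) is reported, because it is the only attribute combination that appears with both classes.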
Exercise: Find Noisy Data

Missing values Data

https://edu.gcfglobal.org/en/excel-tips/a-trick-for-finding-inconsistent-data/1/
https://freecontent.manning.com/pre-processing-data-for-modeling/
Inconsistent Data
• Inconsistent data contain discrepancies in name or code, or discrepancies between duplicate records (from multiple sources)

Age | BirthDate | ...
18 | 30 June 2000 | ...
19 | 30 June 2000 | ...
... | ... | ...

ID | GPA | ...
01 | 3.25 | ...
01 | 3.67 | ...
... | ... | ...

ID | Rating | ...
1 | A | ...
2 | B | ...
3 | 1 | ...
4 | 3.5 | ...

Imbalanced dataset
Class | Frequency
A | 1000 (majority class)
B | 10 (minority class)
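As a rough illustration of the imbalanced-dataset table above, the class frequencies can be counted and compared. This is a sketch using only the standard library; the label list is rebuilt from the slide's frequencies (1000 of class A, 10 of class B), and the name `imbalance_ratio` is an illustrative choice, not a term from the slides.

```python
from collections import Counter

# Rebuild the class column from the frequencies shown on the slide.
labels = ["A"] * 1000 + ["B"] * 10

counts = Counter(labels)
ranked = counts.most_common()          # sorted from most to least frequent
majority_class, majority_count = ranked[0]
minority_class, minority_count = ranked[-1]

# How many majority examples exist for each minority example.
imbalance_ratio = majority_count / minority_count

print("Majority class:", majority_class, majority_count)
print("Minority class:", minority_class, minority_count)
print("Imbalance ratio:", imbalance_ratio)
```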
Identify Outliers
Commonly used rules to identify outliers:
Low outlier < Q1-1.5*IQR
High outlier > Q3+1.5*IQR

Data (19 values):
5, 7, 10, 15, 19, 21, 21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25
Median: 23 ; Q1: 19 ; Q3: 24
IQR = Q3-Q1=24-19=5
Min = 19-7.5=11.5
Max = 24+7.5=31.5
https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule

Exercise: Identify Outliers
Identify columns that have outliers ?
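The 1.5*IQR rule can be checked on the 19-value example with a short sketch. It assumes the quartiles are taken with the "exclusive" method of Python's `statistics.quantiles`, which places Q1, the median and Q3 at positions 5, 10 and 15 of the sorted data and so reproduces the slide's values; other quartile conventions can give slightly different fences. The names `low_fence` and `high_fence` correspond to what the slide calls Min and Max.

```python
import statistics

# The 19 values from the slide.
data = [5, 7, 10, 15, 19, 21, 21, 22, 22, 23,
        23, 23, 23, 23, 24, 24, 24, 24, 25]

# Quartiles with the exclusive method (position k*(n+1)/4 in the sorted data).
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Fences of the 1.5*IQR rule.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

low_outliers = [x for x in data if x < low_fence]
high_outliers = [x for x in data if x > high_fence]

print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)
print("Fences:", low_fence, high_fence)
print("Low outliers:", low_outliers)
print("High outliers:", high_outliers)
```

With these fences the values below 11.5 are flagged as low outliers, and no value exceeds 31.5.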
Data Analysis in IBM DS Methodology
• Input: initial (raw) dataset
• Output: final dataset for modeling

Data Analysis Process
[Figure: Dataset → Import dataset → Data understanding → Data preparation → Export dataset → Dataset for modelling; labelled "Data analysis"]
Dataset
● Dataset is made up of data objects
  ○ Columns represent variables (aka attributes, features, dimensions). They represent characteristics of a data object.
  ○ Rows represent data objects (aka samples, examples, instances, data points, tuples). They represent entities.
  ○ A collection of separate (related) sets of information that is treated (manipulated) as a single unit by a computer (Cambridge / Oxford Dictionary)

Attributes Types
● Nominal: categories/states,
  ○ e.g. drive-wheels (4wd, fwd, rwd)
  ○ Binary: values in 2 states, e.g. fuel-type (diesel, gas), medical test (positive vs. negative)
  ○ Ordinal: values have a meaningful order (ranking), e.g. size (small, medium, large), day (mon, tue, wed, thu, fri, sat, sun)
● Numeric: quantitative (integer or real-valued)
  ○ Interval-scaled: lacks a true zero point, e.g. weather temperature
  ○ Ratio-scaled: has an inherent true zero point, e.g. speed
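One way the difference between nominal and ordinal attributes might be expressed in pandas is with categorical data, sketched below. The sample values are made up for illustration (only the category names come from the slide), and representing ordinal attributes as ordered categoricals is one possible choice, not something the slides prescribe.

```python
import pandas as pd

# Nominal: categories without any order (category names from the slide).
drive_wheels = pd.Series(["fwd", "rwd", "4wd", "fwd"], dtype="category")

# Ordinal: categories with a meaningful order, declared explicitly.
size = pd.Series(
    pd.Categorical(
        ["medium", "small", "large", "small"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

print(drive_wheels.cat.ordered)      # nominal: no order defined
print(size.cat.ordered)              # ordinal: order defined
print(size.min(), size.max())        # order-based operations are allowed
print((size > "small").tolist())     # comparison follows the declared ranking
```

Order-based operations such as `min`, `max` or `>` are only meaningful for the ordinal series; for the nominal one, only equality and counting make sense.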
Exercise: Determine its attribute types
● Gender: M (male), F (female)
● City: bdo, jkt, jog, …
● Economic status: low, medium, high
● Amount of pain: 0-10
● Weather temperature: numeric
● Amount of money: numeric
● Speed: numeric

Structured Dataset Example: Automobile Dataset
205 data objects = 205 rows
26 columns
Predict price based on 25 attributes of automobile data
http://archive.ics.uci.edu/ml/datasets/Automobile
Automobile Dataset: Attributes (categorical, integer, real)
Attribute: Attribute Range:
1. symboling: -3, -2, -1, 0, 1, 2, 3.
2. normalized-losses: continuous from 65 to 256.
3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
4. fuel-type: diesel, gas.
5. aspiration: std, turbo.
6. num-of-doors: four, two.
7. body-style: hardtop, wagon, sedan, hatchback, convertible.
8. drive-wheels: 4wd, fwd, rwd.
9. engine-location: front, rear.
10. wheel-base: continuous from 86.6 to 120.9.
11. length: continuous from 141.1 to 208.1.
12. width: continuous from 60.3 to 72.3.
13. height: continuous from 47.8 to 59.8.
14. curb-weight: continuous from 1488 to 4066.
15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
16. num-of-cylinders: eight, five, four, six, three, twelve, two.
17. engine-size: continuous from 61 to 326.
18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
19. bore: continuous from 2.54 to 3.94.
20. stroke: continuous from 2.07 to 4.17.
21. compression-ratio: continuous from 7 to 23.
22. horsepower: continuous from 48 to 288.
23. peak-rpm: continuous from 4150 to 6600.
24. city-mpg: continuous from 13 to 49.
25. highway-mpg: continuous from 16 to 54.
26. price: continuous from 5118 to 45400.

Automobile Dataset: Data in csv
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,16500
1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.50,171.20,65.50,52.40,2823,ohcv,six,152,mpfi,2.68,3.47,9.00,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.40,176.60,66.40,54.30,2824,ohc,five,136,mpfi,3.19,3.40,8.00,115,5500,18,22,17450
2,?,audi,gas,std,two,sedan,fwd,front,99.80,177.30,66.30,53.10,2507,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,front,105.80,192.70,71.40,55.70,2844,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,17710
1,?,audi,gas,std,four,wagon,fwd,front,105.80,192.70,71.40,55.70,2954,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,18920
1,158,audi,gas,turbo,four,sedan,fwd,front,105.80,192.70,71.40,55.90,3086,ohc,five,131,mpfi,3.13,3.40,8.30,140,5500,17,20,23875
0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,3053,ohc,five,131,mpfi,3.13,3.40,7.00,160,5500,16,22,?
2,192,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2395,ohc,four,108,mpfi,3.50,2.80,8.80,101,5800,23,29,16430
0,192,bmw,gas,std,four,sedan,rwd,front,101.20,176.80,64.80,54.30,2395,ohc,four,108,mpfi,3.50,2.80,8.80,101,5800,23,29,16925
0,188,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2710,ohc,six,164,mpfi,3.31,3.19,9.00,121,4250,21,28,20970
0,188,bmw,gas,std,four,sedan,rwd,front,101.20,176.80,64.80,54.30,2765,ohc,six,164,mpfi,3.31,3.19,9.00,121,4250,21,28,21105
1,?,bmw,gas,std,four,sedan,rwd,front,103.50,189.00,66.90,55.70,3055,ohc,six,164,mpfi,3.31,3.19,9.00,121,4250,20,25,24565
...
Import / Load Dataset
# Download dataset from http://archive.ics.uci.edu/ml/datasets/Automobile
# Import pandas library
import pandas as pd
# Read the downloaded file and assign it to the dataframe variable "df".
# The file has no header row, so header=None labels the columns 0..25.
df = pd.read_csv("imports-85.data", header=None)
# After reading the dataset, we can use the dataframe.head(n) method to check
# the top n rows of the dataframe, where n is an integer.
print("The first 3 rows of the dataframe")
df.head(3)

Unstructured Dataset Example: YouTube Spam
https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection
YouTube Spam Dataset: Attributes & Data (csv)
COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
z12pgdhovmrktzm3i23es5d5junftft3f,lekanaVEVO1,2014-07-22T15:27:50,i love this so much. AND also I Generate Free Leads on Auto Pilot & You Can Too! http://www.MyLeaderGate.com/moretraffic,1
z13yx345uxepetggz04ci5rjcxeohzlrtf4,Pyunghee,2014-07-27T01:57:16,http://www.billboard.com/articles/columns/pop-shop/6174122/fan-army-face-off-round-3 Vote for SONES please....we're against vips....please help us.. >.<,1
...
z12cdlswetvnejcri04cex0jfwy2u3tzj54,Rafi Hossain,2015-06-05T19:55:08,Honestly speaking except taylor swift and adele i don't lile any of the modern day singers. But i must say whenever i hear this song i feel goosebumps. Its quite inspiring!! Thanks miss Perry!,0
z120e5uautvcuper304ccf4bjrjugdpbwrc0k,moaz adnan,2015-06-05T20:01:23,who is going to reach the billion first : katy or taylor ?,0

Exercise: Download and Import YouTube Spam Dataset
https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection
Data Understanding: What
• Goal: understand data content, assess data quality, and discover initial insights into the data.
• Process:
  a. Describing data
  b. Verifying data quality
  c. Exploring data

Describing Data
• There are many ways to describe data, but most descriptions focus on quantity and quality of the data.
• Key characteristics:
  • Dataset size (number of instances and attributes)
    • Show number of rows and column names
  • Surface properties of each attribute
    • attribute types, value range (if numeric) or value set (if category)
    • Understand the meaning of each attribute and attribute value. Are there any names or values that are unknown or unclear?
  • Basic statistics
Data Analysis in Statistics
Descriptive Data Analysis (DDA)
A series of methods that summarize data (e.g. sample mean and standard deviation)
Exploratory Data Analysis (EDA)
A series of methods for generating hypotheses using visualizations
Confirmatory Data Analysis (CDA)
A series of methods for statistical inference, calculation of p-values and interpretation of their implications for proving hypotheses
EDA will be conducted on dataset to understand the data & prepare the hypothesis
http://www.models.kvl.dk/sites/default/files/Data_Analysis.png, cited from Allen et al. (2018)

Descriptive Data Analysis
● Descriptive data analysis helps to describe basic features of a dataset and obtains a short summary about the sample and measures of the data.
● pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python
  ○ Basic statistics of numerical attributes: count, mean, std, min, q1..q3, max, skewness, kurtosis
  ○ Basic statistics of nominal attributes: count, unique, top, freq
Descriptive Statistics: Sample Mean
Sample mean: X_bar = Σxi / n
n=10
Σxi=3+2+3+2+3+4+4+2+3+4=30
X_bar=30/10=3

Descriptive Statistics: Sample Standard Deviation
Sample variance: S^2 = (n*Σxi^2 - (Σxi)^2) / (n*(n-1))
Sample standard deviation: S = √(S^2)
n=10; Σxi=30; X_bar=3
Σxi^2=3*4+4*9+3*16=12+36+48=96
S^2=(10*96-(30)^2)/(10*9)=2/3=0.67 ⇒ S=0.8165
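The worked numbers above can be reproduced with a few lines of Python. This sketch uses the computational form of the sample variance implied by the slide's arithmetic, (n*Σxi^2 - (Σxi)^2) / (n*(n-1)), and cross-checks it against the standard library's `statistics.variance`, which also divides by n-1. The variable names are illustrative.

```python
import math
import statistics

# The ten observations from the slide.
x = [3, 2, 3, 2, 3, 4, 4, 2, 3, 4]

n = len(x)
sum_x = sum(x)                       # Σxi
sum_x2 = sum(v * v for v in x)       # Σxi^2

mean = sum_x / n
variance = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
std_dev = math.sqrt(variance)

print("n =", n, "Σxi =", sum_x, "Σxi^2 =", sum_x2)
print("mean =", mean)
print("variance =", round(variance, 2))
print("standard deviation =", round(std_dev, 4))

# Cross-check with the standard library (sample variance, n-1 denominator).
assert math.isclose(variance, statistics.variance(x))
```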
Descriptive Statistics: Quartile
Quartile for sorted data: Q_k = X_{k(n+1)/4}
For data with 10 elements:
Q_1 = X_{1(11)/4} = X_{2.75}
Q_2 = X_{2(11)/4} = X_{5.5}
Q_3 = X_{3(11)/4} = X_{8.25}
Interpolation:
Midpoint: Q_1 = X_{2.75} = (X_2 + X_3)/2
Linear: Q_1 = X_{2.75} = X_2 + (X_3 - X_2)*0.75
Lower: Q_1 = X_{2.75} = X_2
Higher: Q_1 = X_{2.75} = X_3
Nearest: Q_1 = X_{2.75} = X_3

Quartile: Interpolation
interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
● linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
● lower: i.
● higher: j.
● nearest: i or j whichever is nearest.
● midpoint: (i + j) / 2.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html
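The five interpolation options can be sketched in plain Python on top of the position formula Q_k = X_{k(n+1)/4}. The function `quartile` and the ten sample values are hypothetical and only serve as illustration. Note that pandas' own `quantile` locates the fractional position differently, so its results will not necessarily match this formula; only the meaning of the interpolation options is the same. Breaking ties in `nearest` towards the higher point is an assumption of this sketch.

```python
import math

METHODS = ("linear", "lower", "higher", "midpoint", "nearest")


def quartile(sorted_values, k, interpolation="linear"):
    """k-th quartile at 1-based position k*(n+1)/4 of already sorted data."""
    if interpolation not in METHODS:
        raise ValueError(f"unknown interpolation: {interpolation}")
    n = len(sorted_values)
    # Fractional 1-based position, clamped to the valid range [1, n].
    position = min(max(k * (n + 1) / 4, 1), n)
    lower_index = math.floor(position)
    fraction = position - lower_index
    i = sorted_values[lower_index - 1]            # data point below
    j = sorted_values[min(lower_index, n - 1)]    # data point above
    if fraction == 0 or interpolation == "lower":
        return i
    if interpolation == "higher":
        return j
    if interpolation == "midpoint":
        return (i + j) / 2
    if interpolation == "nearest":
        return i if fraction < 0.5 else j         # ties go to the higher point
    return i + (j - i) * fraction                 # linear


# Hypothetical sorted data with 10 elements, so Q1 sits at position 2.75.
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
for method in METHODS:
    print(method, quartile(data, 1, method))
```

For this data X_2 = 20 and X_3 = 30, so the five options give 27.5, 20, 30, 25.0 and 30 respectively.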
Case Study
Tom wants to sell his car. But the problem is, he doesn't know how much he should sell his car for.
Tom wants to sell his car for as much as he can. But he also wants to set the price reasonably so someone would want to purchase it. So the price he sets should represent the value of the car.

From Problem to Approach
● Problem: estimate reasonable price that represents the value of the car, so someone would want to purchase it.
● Success criteria: minimum error (price difference between estimated price and real price)
● Analytic approach: predictive
From Requirement to Collection
● Assume that automobile dataset is provided from this phase.
● Each row contains car attributes (that represent value of the car) and its reasonable price.

Automobile Dataset to Predict Reasonable Price
205 data objects = 205 rows
26 columns
Predict price based on 25 attributes of automobile data
http://archive.ics.uci.edu/ml/datasets/Automobile
Describing Data: Automobile Dataset
● Format: csv (comma separator)
● Dataset size: 205 instances, 26 attributes (including 1 target att)
● Attribute types: next slide
  ○ Numerical attributes: attribute 0,1,9..13,16,18..25
  ○ Nominal attribute: attribute 2..8,14..15,17
● List of instances
● Basic statistics:
  ○ Basic statistics of numerical attributes: count, mean, std, min, q1..q3, max
  ○ Basic statistics of nominal attributes: count, unique, top, freq
  ○ Data composition of attributes: value and its frequency

Attributes in Dataset
No | Attribute name | Attribute value
0 | symboling | -3, -2, -1, 0, 1, 2, 3.
1 | normalized-losses | continuous: 65..256
2 | make | alfa-romero, audi, ...
3 | fuel-type | diesel, gas
4 | aspiration | std, turbo
5 | num-of-doors | four, two
6 | body-style | hardtop, wagon, ...
7 | drive-wheels | 4wd, fwd, rwd
8 | engine-location | front, rear
9 | wheel-base | continuous: 86.6..120.9
10 | length | continuous: 141.1..208.1
11 | width | continuous: 60.3..72.3
12 | height | continuous: 47.8..59.8
13 | curb-weight | continuous: 1488..4066
14 | engine-type | dohc, dohcv, l, ...
... | ... | ...
24 | highway-mpg | continuous: 16..54
25 | price | continuous: 5118..45400
Describing Data in Python Data Analysis (pandas)
df = pd.read_csv(filename): all data is loaded into dataframe structure
df.shape: show dataset size (rows, attributes)
df.head(n) or df.tail(n): show top or bottom n rows
df.sample(n): show n random rows
df.info(): show dataset size, list of attribute types
df[attribute].describe(): show basic statistics of an attribute
df[attribute].value_counts(): show data composition of an attribute
df[attribute].skew(): show skewness of an attribute
df.describe(): show basic statistics of all numeric attributes

df.shape, df.head(n)
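The calls listed above can be tried on a small sample before loading the full file. The sketch below feeds four rows copied from the "Data in csv" slide through `io.StringIO`, so the printed sizes and statistics describe only this sample, not the 205-row dataset. Columns are labelled 0..25 because `header=None` is used, as in the import slide.

```python
import io

import pandas as pd

# Four rows copied from the "Data in csv" slide (the real file has 205 rows).
csv_text = """\
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.40,176.60,66.40,54.30,2824,ohc,five,136,mpfi,3.19,3.40,8.00,115,5500,18,22,17450
2,192,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2395,ohc,four,108,mpfi,3.50,2.80,8.80,101,5800,23,29,16430
"""
df = pd.read_csv(io.StringIO(csv_text), header=None)

print(df.shape)                  # dataset size: (rows, attributes)
print(df.head(2))                # top 2 rows
print(df.tail(1))                # bottom row
df.info()                        # attribute types as pandas inferred them
print(df[23].describe())         # attribute 23 = city-mpg (numeric)
print(df[2].value_counts())      # attribute 2 = make (nominal)
print(df.describe())             # statistics of all numeric attributes
```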
df.tail(n), df.sample(n)

df.info()
Numerical att: 0,1,9..13,16,18..25
Nominal att: 2..8,14..15,17
Report the differences and why ?

0. symboling: -3, -2, -1, 0, 1, 2, 3.
1. normalized-losses: continuous from 65 to 256.
2. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
3. fuel-type: diesel, gas.
4. aspiration: std, turbo.
5. num-of-doors: four, two.
6. body-style: hardtop, wagon, sedan, hatchback, convertible.
7. drive-wheels: 4wd, fwd, rwd.
8. engine-location: front, rear.
9. wheel-base: continuous from 86.6 to 120.9.
10. length: continuous from 141.1 to 208.1.
11. width: continuous from 60.3 to 72.3.
12. height: continuous from 47.8 to 59.8.
13. curb-weight: continuous from 1488 to 4066.
14. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
15. num-of-cylinders: eight, five, four, six, three, twelve, two.
16. engine-size: continuous from 61 to 326.
17. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
18. bore: continuous from 2.54 to 3.94.
19. stroke: continuous from 2.07 to 4.17.
20. compression-ratio: continuous from 7 to 23.
21. horsepower: continuous from 48 to 288.
22. peak-rpm: continuous from 4150 to 6600.
23. city-mpg: continuous from 13 to 49.
24. highway-mpg: continuous from 16 to 54.
25. price: continuous from 5118 to 45400.
Numeric Attribute: df[att].describe(), df[att].value_counts()
Attribute city-mpg: continuous from 13 to 49.

Nominal Attribute: df[att].describe(), df[att].value_counts()
Attribute make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
Verifying Data Quality
1. Identify incorrectness of data type assignment
2. Identify noise or inconsistent data
3. Identify missing values
4. Identify outliers

Incorrect data type assignment
Attribute: Identify Noise or Inconsistent Data
Is there any attribute noise, erroneous (at data entry, violation of known data constraints) ?
Is there any inconsistent data ?
Attribute 5 num-of-doors: four, two

Attribute: Identify Missing Values
Attribute 1 normalized-losses: continuous from 65 to 256.
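A possible way to count missing values, assuming that the "?" entries visible in the csv rows of the Automobile dataset mark missing data. The sketch passes `na_values="?"` to `pd.read_csv` so that those entries become NaN, then counts them per attribute. Only four rows from the slide are used here, so the counts describe this sample, not the full dataset.

```python
import io

import pandas as pd

# Four rows copied from the "Data in csv" slide; "?" is assumed to mean missing.
csv_text = """\
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,3053,ohc,five,131,mpfi,3.13,3.40,7.00,160,5500,16,22,?
2,192,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2395,ohc,four,108,mpfi,3.50,2.80,8.80,101,5800,23,29,16430
"""

# Treat "?" as a missing-value marker while reading.
df = pd.read_csv(io.StringIO(csv_text), header=None, na_values="?")

# Number of missing values per attribute, keeping only attributes that have any.
missing_per_column = df.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]

print(columns_with_missing)
```

In this sample, attribute 1 (normalized-losses) and attribute 25 (price) contain missing values. Reading "?" as NaN also lets pandas treat those columns as numeric, which is relevant to the data type check in step 1.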
Attribute: Identify Outliers
import pandas as pd

# Assumes df was loaded with header=None, so the columns are labelled 0..25
for col in df.columns:
    # Quartiles are only defined for numeric attributes
    if pd.api.types.is_numeric_dtype(df[col]):
        print('\nAttribute-', col, ':', df[col].dtype)
        q1 = df[col].quantile(0.25)
        print('Q1', q1)
        q3 = df[col].quantile(0.75)
        print('Q3', q3)
        iqr = q3 - q1
        print('IQR', iqr)
        # Fences of the 1.5*IQR rule
        low_fence = q1 - 1.5 * iqr
        high_fence = q3 + 1.5 * iqr
        # Compare the extremes with the fences (without shadowing built-in min/max)
        if df[col].min() < low_fence:
            print('Low outlier is found')
        if df[col].max() > high_fence:
            print('High outlier is found')
Exercise: Data Understanding

1. Load white Wine Quality dataset
(https://archive.ics.uci.edu/ml/datasets/wine+quality)
df = pd.read_csv("winequality-white.csv",sep=';')
2. Describing data:
a. Show dataset size
b. Show surface properties (attribute types, range) of each attribute to understand
the meaning of each attribute and attribute value
3. Verifying data quality: incorrectness of data type assignment, identify
noise, missing value, outliers, and imbalanced dataset.