
3/5/20

Data Analysis

Data Mining
ITERA
Semester II 2019/2020

Data Analysis: What

• Data analysis: analysis to determine how the data can be preprocessed in order to (Han & Kamber, 2011):
    • improve the quality of the data (and, consequently, of the mining results)
    • improve the efficiency and ease of the mining process

Data Quality: Common Problem

• Common problem in data science: noisy, missing, inconsistent data.
• Another problem:
    • Imbalanced dataset
    • Outliers (extreme values). An outlier is a data point that lies an abnormal distance from other points.

Noisy Data

Noise types: class (label) noise and attribute noise
• Class noise: contradictory examples, mislabeled examples
• Attribute noise: erroneous values (errors at data entry, violations of known data constraints)

Attr1   Attr2   Class
0.25    red     positive
0.25    red     negative
1.02    green   positive
0.99    green   negative

Figure annotations: contradictory (the two rows with Attr1 = 0.25 and Attr2 = red carry different classes), mislabeled, error at data entry.

https://sci2s.ugr.es/noisydata
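A minimal sketch of how contradictory examples (one kind of class noise) could be spotted, using the four rows of the Noisy Data table. The helper name `find_contradictions` is an assumption made for illustration; it groups rows by their attribute values and reports the groups that carry more than one class label.

```python
from collections import defaultdict

# Rows from the Noisy Data table: (Attr1, Attr2, Class)
rows = [
    (0.25, "red", "positive"),
    (0.25, "red", "negative"),
    (1.02, "green", "positive"),
    (0.99, "green", "negative"),
]

def find_contradictions(rows):
    """Return {attribute values: set of labels} for groups with conflicting labels."""
    labels_by_attrs = defaultdict(set)
    for *attrs, label in rows:
        labels_by_attrs[tuple(attrs)].add(label)
    return {a: labels for a, labels in labels_by_attrs.items() if len(labels) > 1}

contradictions = find_contradictions(rows)
print(contradictions)
```

This only catches exact duplicates of the attribute values. Mislabeled examples and data-entry errors need domain knowledge or constraint checks and are not covered by this sketch.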

Exercise: Find Noisy Data

https://edu.gcfglobal.org/en/excel-tips/a-trick-for-finding-inconsistent-data/1/

Missing Values Data

https://freecontent.manning.com/pre-processing-data-for-modeling/

Inconsistent Data

• Inconsistent data contain discrepancies in name or code, or discrepancies between duplicate records (from multiple sources)

Age   BirthDate      ...
18    30 June 2000   ...
19    30 June 2000   ...
...   ...            ...

ID    GPA    ...
01    3.25   ...
01    3.67   ...
...   ...    ...

ID    Rating   ...
1     A        ...
2     B        ...
3     1        ...
4     3.5      ...

Imbalanced dataset

Class   Frequency
A       1000        (majority class)
B       10          (minority class)
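The two problems above can be checked with a few lines of standard-library Python. This is a sketch under stated assumptions: the label list is rebuilt from the Class/Frequency table, the records come from the ID/GPA table, and the names `imbalance_ratio` and `conflicting_ids` are illustrative only.

```python
from collections import Counter, defaultdict

# Class frequencies from the imbalanced dataset table
labels = ["A"] * 1000 + ["B"] * 10
counts = Counter(labels)
majority, minority = max(counts.values()), min(counts.values())
imbalance_ratio = majority / minority  # majority class size per minority example
print(counts, imbalance_ratio)

# Duplicate records with conflicting values, from the ID/GPA table
records = [("01", 3.25), ("01", 3.67)]
gpas_by_id = defaultdict(set)
for record_id, gpa in records:
    gpas_by_id[record_id].add(gpa)
conflicting_ids = [i for i, gpas in gpas_by_id.items() if len(gpas) > 1]
print(conflicting_ids)
```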

Identify Outliers

Commonly used rules to identify outliers:
Low outlier < Q1 - 1.5*IQR
High outlier > Q3 + 1.5*IQR

Data (19 values):
5, 7, 10, 15, 19, 21, 21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25

Median: 23; Q1: 19; Q3: 24
IQR = Q3 - Q1 = 24 - 19 = 5
Min (lower bound) = Q1 - 1.5*IQR = 19 - 7.5 = 11.5
Max (upper bound) = Q3 + 1.5*IQR = 24 + 7.5 = 31.5
The values 5, 7, and 10 fall below 11.5, so they are low outliers. No value exceeds 31.5, so there is no high outlier.

https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule

Exercise: Identify Outliers

Identify the columns that have outliers.
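The worked example can be reproduced with the standard library. This sketch assumes the quartile convention that matches the numbers on the slide: Q1 and Q3 are the medians of the lower and upper halves, with the overall median left out when n is odd. The helper name `quartiles_by_halves` is illustrative.

```python
from statistics import median

data = [5, 7, 10, 15, 19, 21, 21, 22, 22, 23, 23, 23, 23, 23,
        24, 24, 24, 24, 25]

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves (median excluded when n is odd)."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return median(lower), median(upper)

q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
print(q1, q3, iqr, low_fence, high_fence, outliers)
```

Other quartile conventions, such as linear interpolation between neighbouring values, may give slightly different Q1 and Q3 for the same data.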

Data Analysis in IBM DS Methodology

Data Analysis Process

• Input: initial (raw) dataset
• Output: final dataset for modeling

Data analysis: Dataset → Import dataset → Data understanding → Data preparation → Export dataset → Dataset for modeling

Dataset

● A dataset is made up of data objects
    ○ Columns represent variables (aka attributes, features, dimensions). They represent characteristics of a data object.
    ○ Rows represent data objects (aka samples, examples, instances, data points, tuples). They represent entities.
    ○ A collection of separate (related) sets of information that is treated (manipulated) as a single unit by a computer (Cambridge / Oxford Dictionary)

Attribute Types

● Nominal: categories/states
    ○ e.g. drive-wheels (4wd, fwd, rwd)
    ○ Binary: values in 2 states, e.g. fuel-type (diesel, gas), medical test (positive vs. negative)
    ○ Ordinal: values have a meaningful order (ranking), e.g. size (small, medium, large), day (mon, tue, wed, thu, fri, sat, sun)
● Numeric: quantitative (integer or real-valued)
    ○ Interval-scaled: lacks a true zero point, e.g. weather temperature
    ○ Ratio-scaled: has an inherent true zero point, e.g. speed

Exercise: Determine its attribute types

● Gender: M (male), F (female)
● City: bdo, jkt, jog, …
● Economic status: low, medium, high
● Amount of pain: 0-10
● Weather temperature: numeric
● Amount of money: numeric
● Speed: numeric

Structured Dataset Example: Automobile Dataset

● 205 data objects = 205 rows
● 26 columns
● Predict price based on 25 attributes of automobile data

http://archive.ics.uci.edu/ml/datasets/Automobile

Automobile Dataset: Attributes (categorical, integer, real)

Attribute: Attribute Range:
1. symboling: -3, -2, -1, 0, 1, 2, 3.
2. normalized-losses: continuous from 65 to 256.
3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
4. fuel-type: diesel, gas.
5. aspiration: std, turbo.
6. num-of-doors: four, two.
7. body-style: hardtop, wagon, sedan, hatchback, convertible.
8. drive-wheels: 4wd, fwd, rwd.
9. engine-location: front, rear.
10. wheel-base: continuous from 86.6 to 120.9.
11. length: continuous from 141.1 to 208.1.
12. width: continuous from 60.3 to 72.3.
13. height: continuous from 47.8 to 59.8.
14. curb-weight: continuous from 1488 to 4066.
15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
16. num-of-cylinders: eight, five, four, six, three, twelve, two.
17. engine-size: continuous from 61 to 326.
18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
19. bore: continuous from 2.54 to 3.94.
20. stroke: continuous from 2.07 to 4.17.
21. compression-ratio: continuous from 7 to 23.
22. horsepower: continuous from 48 to 288.
23. peak-rpm: continuous from 4150 to 6600.
24. city-mpg: continuous from 13 to 49.
25. highway-mpg: continuous from 16 to 54.
26. price: continuous from 5118 to 45400.

Automobile Dataset: Data in csv

3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,16500
1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.50,171.20,65.50,52.40,2823,ohcv,six,152,mpfi,2.68,3.47,9.00,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.40,176.60,66.40,54.30,2824,ohc,five,136,mpfi,3.19,3.40,8.00,115,5500,18,22,17450
2,?,audi,gas,std,two,sedan,fwd,front,99.80,177.30,66.30,53.10,2507,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,front,105.80,192.70,71.40,55.70,2844,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,17710
1,?,audi,gas,std,four,wagon,fwd,front,105.80,192.70,71.40,55.70,2954,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,18920
1,158,audi,gas,turbo,four,sedan,fwd,front,105.80,192.70,71.40,55.90,3086,ohc,five,131,mpfi,3.13,3.40,8.30,140,5500,17,20,23875
0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,3053,ohc,five,131,mpfi,3.13,3.40,7.00,160,5500,16,22,?
2,192,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2395,ohc,four,108,mpfi,3.50,2.80,8.80,101,5800,23,29,16430
0,192,bmw,gas,std,four,sedan,rwd,front,101.20,176.80,64.80,54.30,2395,ohc,four,108,mpfi,3.50,2.80,8.80,101,5800,23,29,16925
0,188,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2710,ohc,six,164,mpfi,3.31,3.19,9.00,121,4250,21,28,20970
0,188,bmw,gas,std,four,sedan,rwd,front,101.20,176.80,64.80,54.30,2765,ohc,six,164,mpfi,3.31,3.19,9.00,121,4250,21,28,21105
1,?,bmw,gas,std,four,sedan,rwd,front,103.50,189.00,66.90,55.70,3055,ohc,six,164,mpfi,3.31,3.19,9.00,121,4250,20,25,24565
...

Import / Load Dataset

# Download the dataset from http://archive.ics.uci.edu/ml/datasets/Automobile
# Import the pandas library
import pandas as pd

# Read the downloaded file and assign it to the dataframe variable "df"
df = pd.read_csv("imports-85.data", header=None)

# After reading the dataset, use the dataframe.head(n) method to check the
# top n rows of the dataframe, where n is an integer.
print("The first 3 rows of the dataframe")
df.head(3)

Unstructured Dataset Example: YouTube Spam

https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection

YouTube Spam Dataset: Attributes & Data (csv)

COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
z12pgdhovmrktzm3i23es5d5junftft3f,lekanaVEVO1,2014-07-22T15:27:50,i love this so much. AND also I Generate Free Leads on Auto Pilot &amp; You Can Too! http://www.MyLeaderGate.com/moretraffic,1
z13yx345uxepetggz04ci5rjcxeohzlrtf4,Pyunghee,2014-07-27T01:57:16,http://www.billboard.com/articles/columns/pop-shop/6174122/fan-army-face-off-round-3 Vote for SONES please....we're against vips....please help us.. &gt;.&lt;,1
...
z12cdlswetvnejcri04cex0jfwy2u3tzj54,Rafi Hossain,2015-06-05T19:55:08,Honestly speaking except taylor swift and adele i don't lile any of the modern day singers. But i must say whenever i hear this song i feel goosebumps. Its quite inspiring!! Thanks miss Perry!,0
z120e5uautvcuper304ccf4bjrjugdpbwrc0k,moaz adnan,2015-06-05T20:01:23,who is going to reach the billion first : katy or taylor ?,0

Exercise: Download and Import YouTube Spam Dataset

https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection

Data Understanding: What

• Goal: understand data content, assess data quality, and discover initial insights into the data.
• Process:
    a. Describing data
    b. Verifying data quality
    c. Exploring data

Describing Data

• There are many ways to describe data, but most descriptions focus on the quantity and quality of the data.
• Key characteristics:
    • Dataset size (number of instances and attributes)
        • Show the number of rows and the column names
    • Surface properties of each attribute
        • Attribute types, value range (if numeric) or value set (if categorical)
        • Understand the meaning of each attribute and attribute value. Are there any names or values that are unknown or unclear?
    • Basic statistics

Data Analysis in Statistics

Descriptive Data Analysis (DDA)
A series of methods that summarize data (e.g. sample mean and standard deviation)

Exploratory Data Analysis (EDA)
A series of methods for generating hypotheses using visualizations

Confirmatory Data Analysis (CDA)
A series of methods for statistical inference, calculation of p-values and interpretation of their implications for proving hypotheses

EDA will be conducted on the dataset to understand the data and prepare the hypothesis.
http://www.models.kvl.dk/sites/default/files/Data_Analysis.png, cited from Allen et al. (2018)

Descriptive Data Analysis

● Descriptive data analysis helps to describe the basic features of a dataset and obtains a short summary of the sample and measures of the data.
● pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python
    ○ Basic statistics of numerical attributes: count, mean, std, min, q1..q3, max, skewness, kurtosis
    ○ Basic statistics of nominal attributes: count, unique, top, freq

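Descriptive Statistics: Sample Mean

Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$

$n = 10$
$\sum x_i = 3+2+3+2+3+4+4+2+3+4 = 30$
$\bar{X} = 30/10 = 3$

Descriptive Statistics: Sample Standard Deviation

Sample variance: $S^2 = \frac{n\sum x_i^2 - \left(\sum x_i\right)^2}{n(n-1)}$
Sample standard deviation: $S = \sqrt{S^2}$

$n = 10$; $\sum x_i = 30$; $\bar{X} = 3$
$\sum x_i^2 = 3 \cdot 4 + 4 \cdot 9 + 3 \cdot 16 = 12 + 36 + 48 = 96$
$S^2 = (10 \cdot 96 - 30^2)/(10 \cdot 9) = 60/90 = 2/3 \approx 0.67$
$S \approx 0.8165$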
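The hand calculation can be checked with Python's `statistics` module. This is a small sketch: `variance` and `stdev` are the sample (n - 1) versions, and `s2_shortcut` is an illustrative name for the computational formula used in the worked example.

```python
from statistics import mean, variance, stdev

x = [3, 2, 3, 2, 3, 4, 4, 2, 3, 4]
n = len(x)

x_bar = mean(x)  # sample mean
# Computational formula: (n * sum of squares - square of sum) / (n * (n - 1))
s2_shortcut = (n * sum(v * v for v in x) - sum(x) ** 2) / (n * (n - 1))
s2 = variance(x)  # sample variance (divides by n - 1)
s = stdev(x)      # sample standard deviation

print(x_bar, s2_shortcut, s2, s)
```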

Descriptive Statistics: Quartile

Quartile for sorted data: $Q_k = X_{k(n+1)/4}$

For data with 10 elements:
$Q_1 = X_{1 \cdot 11/4} = X_{2.75}$
$Q_2 = X_{2 \cdot 11/4} = X_{5.5}$
$Q_3 = X_{3 \cdot 11/4} = X_{8.25}$

Interpolation:
Midpoint: $Q_1 = X_{2.75} = (X_2 + X_3)/2$
Linear: $Q_1 = X_{2.75} = X_2 + (X_3 - X_2) \cdot 0.75$
Lower: $Q_1 = X_{2.75} = X_2$
Higher: $Q_1 = X_{2.75} = X_3$
Nearest: $Q_1 = X_{2.75} = X_3$

Quartile: Interpolation

interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}

This optional parameter specifies the interpolation method to use when the desired quantile lies between two data points i and j:

● linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
● lower: i.
● higher: j.
● nearest: i or j, whichever is nearest.
● midpoint: (i + j) / 2.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html
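A sketch of the five interpolation options applied to the slide's position rule k(n+1)/4. The function `quartile` and the sample data are assumptions made for illustration. This is not how pandas computes quantiles internally: pandas locates the quantile at the zero-based position (n - 1) * q, so its results can differ from the rule used here.

```python
import math

def quartile(sorted_values, k, interpolation="linear"):
    """k-th quartile at the 1-based position k*(n+1)/4 (assumes 1 <= position <= n)."""
    n = len(sorted_values)
    pos = k * (n + 1) / 4
    i = math.floor(pos)
    fraction = pos - i
    lower = sorted_values[i - 1]   # X_i in 1-based notation
    if fraction == 0:
        return lower
    higher = sorted_values[i]      # X_{i+1}
    if interpolation == "linear":
        return lower + (higher - lower) * fraction
    if interpolation == "lower":
        return lower
    if interpolation == "higher":
        return higher
    if interpolation == "midpoint":
        return (lower + higher) / 2
    if interpolation == "nearest":
        # Tie handling at fraction == 0.5 is an arbitrary choice of this sketch
        return higher if fraction >= 0.5 else lower
    raise ValueError(f"unknown interpolation: {interpolation}")

# Hypothetical sorted sample with n = 10, so Q1 sits at position 2.75
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
methods = ("linear", "lower", "higher", "midpoint", "nearest")
results = {m: quartile(data, 1, m) for m in methods}
print(results)
```

With X_2 = 20 and X_3 = 30, the five options reproduce the pattern listed for Q_1 = X_{2.75}.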


Case Study

Tom wants to sell his car. But the problem is, he doesn't know how much he should sell his car for.

Tom wants to sell his car for as much as he can. But he also wants to set the price reasonably so someone would want to purchase it. So the price he sets should represent the value of the car.

From Problem to Approach

● Problem: estimate a reasonable price that represents the value of the car, so someone would want to purchase it.
● Success criteria: minimum error (price difference between estimated price and real price)
● Analytic approach: predictive
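The success criterion can be made concrete with an error measure. The sketch below uses the mean absolute error, which is one possible choice and an assumption here, since the slide does not name a specific metric. The real prices are taken from three rows of the automobile CSV sample; the estimated prices are made up for illustration.

```python
# Real prices from three rows of the automobile CSV sample
real = [13495.0, 16500.0, 17450.0]
# Hypothetical estimated prices, for illustration only
estimated = [13000.0, 16000.0, 17500.0]

# Price difference between estimated and real price, per car
abs_errors = [abs(e - r) for e, r in zip(estimated, real)]
# Mean absolute error: the smaller, the better the estimates
mae = sum(abs_errors) / len(abs_errors)
print(abs_errors, mae)
```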

From Requirement to Collection

● Assume that the automobile dataset is provided from this phase.
● Each row contains car attributes (that represent the value of the car) and its reasonable price.

Automobile Dataset to Predict Reasonable Price

● 205 data objects = 205 rows
● 26 columns
● Predict price based on 25 attributes of automobile data

http://archive.ics.uci.edu/ml/datasets/Automobile

Describing Data: Automobile Dataset

● Format: csv (comma separator)
● Dataset size: 205 instances, 26 attributes (including 1 target attribute)
● Attribute types: see the attribute table below
    ○ Numerical attributes: attribute 0,1,9..13,16,18..25
    ○ Nominal attributes: attribute 2..8,14..15,17
● List of instances
● Basic statistics:
    ○ Basic statistics of numerical attributes: count, mean, std, min, q1..q3, max
    ○ Basic statistics of nominal attributes: count, unique, top, freq
    ○ Data composition of attributes: value and its frequency

Attributes in Dataset

No    Attribute name       Attribute value
0     symboling            -3, -2, -1, 0, 1, 2, 3.
1     normalized-losses    continuous: 65..256
2     make                 alfa-romero, audi, ...
3     fuel-type            diesel, gas
4     aspiration           std, turbo
5     num-of-doors         four, two
6     body-style           hardtop, wagon, ...
7     drive-wheels         4wd, fwd, rwd
8     engine-location      front, rear
9     wheel-base           continuous: 86.6..120.9
10    length               continuous: 141.1..208.1
11    width                continuous: 60.3..72.3
12    height               continuous: 47.8..59.8
13    curb-weight          continuous: 1488..4066
14    engine-type          dohc, dohcv, l, ...
...
24    highway-mpg          continuous: 16..54
25    price                continuous: 5118..45400

Describing Data in Python Data Analysis (pandas)

df = pd.read_csv(filename): load all data into a dataframe structure
df.shape: show the dataset size (rows, attributes)
df.head(n) or df.tail(n): show the top or bottom n rows
df.sample(n): show n random rows
df.info(): show the dataset size and the list of attribute types
df[attribute].describe(): show basic statistics of an attribute
df[attribute].value_counts(): show the data composition of an attribute
df[attribute].skew(): show the skewness of an attribute
df.describe(): show basic statistics of all numeric attributes

df.shape, df.head(n)
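A short sketch of these calls on three rows copied from the automobile CSV sample. Reading from an in-memory string via `io.StringIO` is an assumption made to keep the example self-contained; with the downloaded file, `pd.read_csv("imports-85.data", header=None)` would be used instead. Column numbers follow the 0-based attribute numbering (2 = make, 23 = city-mpg).

```python
import io
import pandas as pd

# Three rows from the automobile CSV sample (the file has no header row)
csv_text = """3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,3053,ohc,five,131,mpfi,3.13,3.40,7.00,160,5500,16,22,?
"""
df = pd.read_csv(io.StringIO(csv_text), header=None)

print(df.shape)               # dataset size (rows, attributes)
print(df.head(2))             # first 2 rows
df.info()                     # attribute types and non-null counts
print(df[23].describe())      # basic statistics of city-mpg
print(df[2].value_counts())   # data composition of make
print(df[23].skew())          # skewness of city-mpg
```

Because of the `?` entries, columns such as normalized-losses (1) and price (25) are not read as numeric here, which is the kind of difference the data-type check is meant to surface.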

df.tail(n), df.sample(n)

df.info()

Numerical att: 0,1,9..13,16,18..25
Nominal att: 2..8,14..15,17
Report the differences and why?

0. symboling: -3, -2, -1, 0, 1, 2, 3.
1. normalized-losses: continuous from 65 to 256.
2. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
3. fuel-type: diesel, gas.
4. aspiration: std, turbo.
5. num-of-doors: four, two.
6. body-style: hardtop, wagon, sedan, hatchback, convertible.
7. drive-wheels: 4wd, fwd, rwd.
8. engine-location: front, rear.
9. wheel-base: continuous from 86.6 to 120.9.
10. length: continuous from 141.1 to 208.1.
11. width: continuous from 60.3 to 72.3.
12. height: continuous from 47.8 to 59.8.
13. curb-weight: continuous from 1488 to 4066.
14. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
15. num-of-cylinders: eight, five, four, six, three, twelve, two.
16. engine-size: continuous from 61 to 326.
17. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
18. bore: continuous from 2.54 to 3.94.
19. stroke: continuous from 2.07 to 4.17.
20. compression-ratio: continuous from 7 to 23.
21. horsepower: continuous from 48 to 288.
22. peak-rpm: continuous from 4150 to 6600.
23. city-mpg: continuous from 13 to 49.
24. highway-mpg: continuous from 16 to 54.
25. price: continuous from 5118 to 45400.

Numeric Attribute: df[att].describe(), df[att].value_counts()

Attribute city-mpg: continuous from 13 to 49.

Nominal Attribute: df[att].describe(), df[att].value_counts()

Attribute make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo

Verifying Data Quality

1. Identify incorrectness of data type assignment
2. Identify noise or inconsistent data
3. Identify missing values
4. Identify outliers

Incorrect data type assignment

(Attribute list 0 to 25: same as on the df.info() slide.)

Attribute: Identify Noise or Inconsistent Data

Is there any attribute noise, i.e. erroneous values (errors at data entry, violations of known data constraints)?
Is there any inconsistent data?
Attribute 5 num-of-doors: four, two

Attribute: Identify Missing Values

Attribute 1 normalized-losses: continuous from 65 to 256.
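A sketch of one way to count missing values per attribute, assuming that `?` marks a missing value as the CSV sample suggests. The three rows are copied from that sample, and the in-memory `io.StringIO` source is used only to keep the example self-contained.

```python
import io
import pandas as pd

# Three rows from the automobile CSV sample (the file has no header row)
csv_text = """3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,3053,ohc,five,131,mpfi,3.13,3.40,7.00,160,5500,16,22,?
"""

# Option 1: count the literal "?" markers in the raw data
raw = pd.read_csv(io.StringIO(csv_text), header=None)
question_marks = (raw == "?").sum()

# Option 2: let pandas treat "?" as missing (NaN) while reading
df = pd.read_csv(io.StringIO(csv_text), header=None, na_values="?")
missing = df.isna().sum()

print(missing[missing > 0])   # attributes that contain missing values
```

Reading with `na_values="?"` also lets normalized-losses (1) and price (25) be parsed as numeric attributes, which matches their documented value ranges.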

Attribute: Identify Outliers

# Assumes df was loaded with pd.read_csv(..., header=None),
# so the columns are labeled 0..25
for i in df.columns:
    # Only numeric attributes are checked
    if str(df[i].dtype) in ('int64', 'float64'):
        print('\nAttribute-', i, ':', df[i].dtype)
        Q1 = df[i].quantile(0.25)
        print('Q1', Q1)
        Q3 = df[i].quantile(0.75)
        print('Q3', Q3)
        IQR = Q3 - Q1
        print('IQR', IQR)
        # Named col_min/col_max to avoid shadowing the built-in min/max
        col_min = df[i].min()
        col_max = df[i].max()
        # Lower and upper bounds of the 1.5*IQR rule
        min_IQR = Q1 - 1.5 * IQR
        max_IQR = Q3 + 1.5 * IQR
        if col_min < min_IQR:
            print('Low outlier is found')
        if col_max > max_IQR:
            print('High outlier is found')

Exercise: Data Understanding


1. Load the white Wine Quality dataset
   (https://archive.ics.uci.edu/ml/datasets/wine+quality)
   df = pd.read_csv("winequality-white.csv", sep=';')
2. Describing data:
   a. Show the dataset size
   b. Show the surface properties (attribute types, range) of each attribute to understand the meaning of each attribute and attribute value
3. Verifying data quality: check for incorrect data type assignment, and identify noise, missing values, outliers, and class imbalance.
