

DATA MINING: DATA PREPROCESSING
Prof. Sherica Lavinia Menezes
Asst. Professor
Computer Engineering Department
Goa College of Engineering

DATA PREPROCESSING TECHNIQUES

- Aggregation
- Sampling
- Dimensionality Reduction
- Feature Subset Selection
- Feature Creation
- Discretization and Binarization
- Variable Transformation


AGENDA

01 Aggregation
02 Sampling
03 Dimensionality Reduction
04 Feature Subset Selection

LEARNING OBJECTIVES

01 Explain aggregation
02 Appreciate different sampling techniques
03 Discuss feature subset selection
04 Differentiate between dimensionality reduction and FSS


AGGREGATION

- Aggregation combines two or more objects into a single object.
- This reduces the number of objects to be analysed, which makes it possible to use more expensive, better-performing algorithms.

EXAMPLE OF AGGREGATION

Replace all transactions of a single store with a single storewide transaction. The Item column is either omitted or summarized as the set of items sold; the Price column is aggregated by taking the sum.

Original transactions:

T Id  Item     Store Location  Date        Price
1     Watch    Margao          09/06/2019  45
2     Battery  Margao          09/06/2019  67
3     Shoes    Panaji          08/05/2019  88
4     Clothes  Panaji          08/05/2019  900
5     Watch    Panaji          08/05/2019  876
6     Shoes    Margao          09/06/2019  89
7     Clothes  Margao          09/06/2019  888

Aggregated transactions:

Store Location  Date        Item                              Price
Margao          09/06/2019  {Watch, Battery, Shoes, Clothes}  45+67+89+888
Panaji          08/05/2019  {Shoes, Clothes, Watch}           88+900+876
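A minimal pandas sketch of this storewide aggregation (column names shortened from the table above):

```python
import pandas as pd

# Transaction table from the example above
df = pd.DataFrame({
    "TId":   [1, 2, 3, 4, 5, 6, 7],
    "Item":  ["Watch", "Battery", "Shoes", "Clothes", "Watch", "Shoes", "Clothes"],
    "Store": ["Margao", "Margao", "Panaji", "Panaji", "Panaji", "Margao", "Margao"],
    "Date":  ["09/06/2019", "09/06/2019", "08/05/2019", "08/05/2019",
              "08/05/2019", "09/06/2019", "09/06/2019"],
    "Price": [45, 67, 88, 900, 876, 89, 888],
})

# One storewide transaction per (store, date): items become a set, prices are summed
storewide = df.groupby(["Store", "Date"], as_index=False).agg(
    Item=("Item", lambda s: set(s)),
    Price=("Price", "sum"),
)
print(storewide)
```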


Motivation for Aggregation

- Smaller data sets resulting from data reduction require less memory and processing time.
- Acts as a change of scope or scale by providing a high-level view of the data instead of a low-level view.
- The behaviour of groups of objects is often more stable than that of individual objects or attributes.
- Disadvantage: potential loss of interesting details.

SAMPLING

- Sampling is commonly used for selecting a subset of the data objects to be analysed.
- It is a statistical approach, so the representativeness of samples will vary: the best we can do is choose a sampling scheme that guarantees a high probability of getting a representative sample.


Sampling

- Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.
- Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming.
- Key principle for effective sampling: using a sample will work almost as well as using the entire data set if the sample is representative.
- A sample is representative if it has approximately the same properties as the original set of data.

Sampling Approaches

Simple Random Sampling

- Sampling without replacement: each item selected is removed from the set of all objects in the population.
- Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once. This scheme is simpler to analyse (see the sketch after this list).
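A quick sketch of the two schemes with Python's standard library (population and sample sizes are illustrative):

```python
import random

population = list(range(100))  # 100 distinct objects

# Sampling without replacement: selected items are effectively removed,
# so no object can appear twice in the sample
without = random.sample(population, k=10)

# Sampling with replacement: the same object can be picked more than once
with_repl = random.choices(population, k=10)
```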



Example

- We have 100 people: 51 are women, P(W) = 0.51, and 49 are men, P(M) = 0.49. If I pick two persons, what is the probability P(W,W) that both are women?
- Sampling with replacement: P(W,W) = 0.51 * 0.51 = 0.2601
- Sampling without replacement: P(W,W) = 51/100 * 50/99 ≈ 0.2576
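A Monte Carlo sketch that checks both answers (the trial count is arbitrary):

```python
import random

trials = 100_000
population = ["W"] * 51 + ["M"] * 49

both_with = sum(random.choices(population, k=2) == ["W", "W"] for _ in range(trials))
both_without = sum(random.sample(population, k=2) == ["W", "W"] for _ in range(trials))

print(both_with / trials)     # ~0.51 * 0.51        = 0.2601
print(both_without / trials)  # ~(51/100) * (50/99) ≈ 0.2576
```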


Sampling Approaches

Stratified Sampling: starts with prespecified groups of objects from which samples are drawn.

- First approach: an equal number of objects is drawn from each group, even if the group sizes differ.
- Second approach: the number of objects drawn is proportional to the size of the group.

Example: I want to understand the differences between legitimate and fraudulent credit card transactions, and 0.1% of transactions are fraudulent. What happens if I select 1000 transactions at random? I get 1 fraudulent transaction (in expectation), which is not enough to draw any conclusions. Solution: sample 1000 legitimate and 1000 fraudulent transactions (see the sketch below).
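A sketch of both stratified approaches in pandas (the table, column names, and sizes are hypothetical):

```python
import pandas as pd

# Hypothetical transaction table with a highly imbalanced class column
# ("fraud" is 0.1% of rows, as in the example above)
df = pd.DataFrame({
    "amount": range(100_000),
    "label": ["fraud" if i % 1000 == 0 else "legit" for i in range(100_000)],
})

# First approach: draw an equal number of objects from each group
equal = df.groupby("label").sample(n=100, random_state=0)

# Second approach: draw a number proportional to each group's size (here 10%)
proportional = df.groupby("label").sample(frac=0.1, random_state=0)
```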


Sampling Process

- Select a sampling technique.
- Choose a sample size:
  - Larger sample size: increases the probability that the sample will be representative, but eliminates the advantage of sampling.
  - Smaller sample size: patterns can be missed, or erroneous patterns can be detected.

Example of Various Sample Sizes

Figure: the same data set shown at 8000 points, 2000 points, and 500 points.


DIMENSIONALITY REDUCTION

- DM algorithms work better if the dimensionality is lower.
- Can eliminate irrelevant features and reduce noise.
- Can lead to a more understandable model.
- Allows data to be more easily visualized.
- The amount of time and memory required by the data mining algorithm is reduced.

Dimensionality Reduction

- The term is reserved for those techniques that reduce dimensionality by creating new attributes that are combinations of the old attributes.
- Curse of dimensionality: data analysis becomes harder as dimensionality increases.
- Linear algebra techniques for dimensionality reduction:
  - PCA finds new attributes that are linear combinations of the original attributes, are orthogonal to each other, and capture the maximum amount of variation (see the sketch below).
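A minimal scikit-learn sketch of PCA (the data shape and component count are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 objects, 10 attributes

pca = PCA(n_components=2)             # keep the 2 directions of maximum variation
X_reduced = pca.fit_transform(X)      # new attributes: orthogonal linear combinations

print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```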



FEATURE SUBSET SELECTION

- A variant of dimensionality reduction.
- Redundant features and irrelevant features reduce classification accuracy.
- Redundant features: duplicate much or all of the information contained in one or more other attributes.
- Irrelevant features: contain no useful information for the data mining task at hand.

3 Approaches to FSS

- Embedded: the DM algorithm itself decides which attributes to use and which to ignore.
- Filter: features are selected before the DM algorithm is run, using an approach that is independent of the DM algorithm (see the sketch below).
- Wrapper: the DM algorithm is used as a black box to find the best subset of attributes.
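As one possible instance of the filter approach, a scikit-learn sketch that scores features with an ANOVA F-test, independently of whatever classifier is run afterwards:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter approach: score features independently of any particular DM algorithm
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())  # boolean mask of the selected attributes
```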


Architecture for FSS

- A measure for evaluating a subset of features
- A search strategy that controls the generation of new subsets of features
- A stopping criterion
- A validation procedure


Flowchart for FSS

The search strategy generates a subset of attributes, which is passed to the evaluation step and checked against the stopping criterion. If the criterion is not met, control loops back to the search strategy to generate a new subset; once it is met, the selected attributes are passed to the validation procedure.


FEATURE CREATION

- A new set of attributes is created from the original attributes that captures the important information more effectively.
- The new set of attributes can be smaller than the original set.

Three general methodologies: feature extraction, mapping the data to a new space, and feature construction.


Feature Extraction

- The creation of a new set of features from the raw data.
- Example: classify photographs as containing human faces or not. The raw data (pixels) is too cumbersome to analyse directly, but the data can be processed to provide higher-level features, such as the presence or absence of edges pertaining to human faces.
- Highly domain specific.


Mapping Data to a New Space

- A different view of the data can reveal interesting patterns that were otherwise hidden.

Figure: a time series of two sine waves plus noise, and its frequency-domain (Fourier) representation, where the two underlying frequencies show up as peaks.
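A sketch of this idea using the Fourier transform (the frequencies 7 Hz and 17 Hz and the noise level are illustrative):

```python
import numpy as np

t = np.linspace(0, 1, 1024, endpoint=False)
# Two sine waves (7 Hz and 17 Hz) plus noise
signal = np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 17 * t)
signal += 0.5 * np.random.default_rng(0).normal(size=t.size)

# Map the data to frequency space: the two frequencies appear as clear peaks
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
print(freqs[np.argsort(spectrum)[-2:]])  # ~7.0 and ~17.0
```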


Feature Construction

- Constructing features so that they are in a form best suited to the respective data mining algorithm.
- Needs domain expertise to construct the features.
- Example: a set of artifacts is to be classified as wooden, gold, or bronze. The known data are mass and volume; however, density (mass/volume) proves to be a better measure for classification (a sketch follows below).
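A minimal sketch of this construction (measurements are hypothetical; the resulting densities roughly match wood, gold, and bronze):

```python
import pandas as pd

# Hypothetical artifact measurements (mass in g, volume in cm^3)
artifacts = pd.DataFrame({"mass": [96, 193, 440], "volume": [120, 10, 50]})

# Constructed feature: density separates wood (~0.8), gold (~19.3), bronze (~8.8)
artifacts["density"] = artifacts["mass"] / artifacts["volume"]
print(artifacts)
```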


DISCRETIZATION AND BINARIZATION

- Some classification algorithms need categorical attributes, and association analysis requires that the data be in the form of binary attributes.
- It may therefore be necessary to transform a continuous attribute into a categorical attribute: discretization.
- Both continuous and discrete attributes may need to be transformed into binary attributes: binarization.
- If a categorical attribute has a large number of values, or some values occur infrequently, it may be better to reduce the number of categories by combining some of the values.


Binarization

1. If there are m categorical values, uniquely assign each original value to an integer in the interval [0, m-1]. If the attribute is ordinal, order must be maintained by the assignment.
2. Convert each of these m integers to a binary number; n = ceil(log2(m)) binary digits are needed to represent the integers.
3. Represent the binary numbers using n binary attributes.


13
21-09-2020

Example of Binarization
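A minimal sketch of the scheme above for a hypothetical ordinal attribute with m = 5 values (the value names are illustrative):

```python
import math
import pandas as pd

# Hypothetical ordinal attribute with m = 5 values (order is preserved)
values = ["awful", "poor", "OK", "good", "great"]
m = len(values)
n = math.ceil(math.log2(m))                    # n = ceil(log2 m) = 3 binary attributes

to_int = {v: i for i, v in enumerate(values)}  # step 1: map values to [0, m-1]

rows = []
for v in values:
    bits = format(to_int[v], f"0{n}b")         # step 2: integer -> n-bit binary number
    rows.append([v, to_int[v]] + [int(b) for b in bits])  # step 3: n binary attributes

print(pd.DataFrame(rows, columns=["value", "integer", "x1", "x2", "x3"]))
```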


Issues in Binarization

- Binarization can create unintended relationships among the transformed attributes.
- Association analysis requires attributes where only non-zero values matter; we therefore introduce one binary attribute for each categorical value.
- It may be necessary to replace a single binary attribute with two asymmetric binary attributes.


Discretization of Continuous Attributes

- Applied mainly to attributes used in classification or association analysis.
- Two subtasks: deciding how many categories to have, and determining how to map attribute values to those categories.
- The given values are sorted and divided into n intervals by specifying n-1 split points: { (x0, x1], (x1, x2], (x2, x3], ..., (xn-1, xn] }.

Unsupervised Discretization

- Class information is not used.
- Equal width approach: divides the range into a user-specified number of intervals.
- Equal frequency approach: tries to put the same number of objects into each interval.
- The K-means clustering algorithm can also be used to create the intervals and map the values (see the sketch below).
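A sketch of all three approaches (synthetic one-dimensional data; pandas and scikit-learn are assumed):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

x = np.random.default_rng(0).normal(size=200)

equal_width = pd.cut(x, bins=4)   # equal width: range divided into 4 equal intervals
equal_freq = pd.qcut(x, q=4)      # equal frequency: ~50 objects per interval

# K-means: intervals induced by 1-D cluster assignments
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(x.reshape(-1, 1))
```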


Discretization Without Using Class Labels

Figure: the data consists of four groups of points and two outliers. The data is one-dimensional, but a random y component has been added to reduce overlap.

Discretization Without Using Class Labels

Figure: equal interval width approach used to obtain 4 values.


Discretization Without Using Class Labels

Figure: equal frequency approach used to obtain 4 values.

Discretization Without Using Class Labels

Figure: K-means approach used to obtain 4 values.


Supervised Discretization

- Class information is available.
- Aims to place the splits in a way that maximizes the purity of the intervals.
- Uses entropy-based techniques (see the sketch below).

Entropy of the i-th interval:

$e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}$

where $p_{ij} = m_{ij} / m_i$ is the fraction of values of class j in interval i, k is the number of classes, $m_i$ is the number of values in the i-th interval, and $m_{ij}$ is the number of values of class j in interval i.
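A sketch of this computation (the class counts are illustrative):

```python
import numpy as np

def interval_entropy(class_counts):
    """Entropy e_i of one interval from its per-class value counts m_ij."""
    m_i = np.sum(class_counts)
    p = np.asarray(class_counts, dtype=float) / m_i  # p_ij = m_ij / m_i
    p = p[p > 0]                                     # 0 * log2(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(interval_entropy([10, 10]))  # 1.0 -> impure interval (50/50 split)
print(interval_entropy([20, 0]))   # 0.0 -> pure interval, what the splits aim for
```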


Categorical Attributes with Too Many Values

- If the attribute is ordinal, we can use the previously mentioned entropy-based technique.
- If the attribute is nominal, we need other approaches.
- Example: Goa University has a large number of departments (Computer Engineering Department, Department of Microbiology, Department of Information Technology, ...). Based on domain knowledge, we can group them into higher-level categories such as Engineering and Life Sciences.
- This is domain specific and needs domain knowledge to group the values together.


VARIABLE TRANSFORMATION

Simple Functions
- A simple mathematical function is applied to each value individually.
- These have to be applied with caution as they can change the nature of the data.

Normalization or Standardization
- The goal is to make an entire set of values have a particular property.
- The mean and standard deviation are strongly affected by outliers, so the mean is often replaced by the median and the standard deviation by the absolute standard deviation (see the sketch below).
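A sketch contrasting the two standardizations (values are illustrative; the outlier shows why the robust statistics help):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # illustrative values with one outlier

# Standard: (x - mean) / std -- both statistics are pulled by the outlier
z = (x - x.mean()) / x.std()

# Robust variant from the slide: median, and the absolute standard deviation
# sigma_A = (1/m) * sum(|x - mu|), with mu taken as the median
mu = np.median(x)
sigma_a = np.mean(np.abs(x - mu))
z_robust = (x - mu) / sigma_a
```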


THANKS


