

DATA MINING: DATA PREPROCESSING
Prof. Sherica Lavinia Menezes
Asst. Professor
Computer Engineering Department
Goa College of Engineering

DATA PREPROCESSING TECHNIQUES

- Aggregation
- Sampling
- Dimensionality Reduction
- Feature Subset Selection
- Feature Creation
- Discretization and Binarization
- Variable Transformation


AGENDA

01 Aggregation
02 Sampling
03 Dimensionality Reduction
04 Feature Subset Selection

LEARNING OBJECTIVES

01 Explain aggregation
02 Appreciate different sampling techniques
03 Discuss feature subset selection
04 Differentiate between dimensionality reduction and FSS


AGGREGATION

- Aggregation combines two or more objects into a single object.
- This reduces the number of objects to be analysed, which makes it possible to use more expensive, better-performing algorithms.

EXAMPLE OF AGGREGATION

Replace all transactions of a single store with a single storewide transaction. The Item column is either omitted or summarized as the set of items sold; the Price column is aggregated by taking the sum.

Original transactions:

T Id  Item     Store Location  Date        Price
1     Watch    Margao          09/06/2019  45
2     Battery  Margao          09/06/2019  67
3     Shoes    Panaji          08/05/2019  88
4     Clothes  Panaji          08/05/2019  900
5     Watch    Panaji          08/05/2019  876
6     Shoes    Margao          09/06/2019  89
7     Clothes  Margao          09/06/2019  888

Aggregated transactions:

Store Location  Date        Item                              Price
Margao          09/06/2019  {Watch, Battery, Shoes, Clothes}  45+67+89+888
Panaji          08/05/2019  {Shoes, Clothes, Watch}           88+900+876
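A minimal pandas sketch of this storewide aggregation (column names shortened from the table above):

```python
import pandas as pd

# Transaction table from the example above
df = pd.DataFrame({
    "TId":   [1, 2, 3, 4, 5, 6, 7],
    "Item":  ["Watch", "Battery", "Shoes", "Clothes", "Watch", "Shoes", "Clothes"],
    "Store": ["Margao", "Margao", "Panaji", "Panaji", "Panaji", "Margao", "Margao"],
    "Date":  ["09/06/2019", "09/06/2019", "08/05/2019", "08/05/2019",
              "08/05/2019", "09/06/2019", "09/06/2019"],
    "Price": [45, 67, 88, 900, 876, 89, 888],
})

# One storewide transaction per (store, date): items become a set, prices are summed
storewide = df.groupby(["Store", "Date"], as_index=False).agg(
    Item=("Item", lambda s: set(s)),
    Price=("Price", "sum"),
)
print(storewide)
```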


Motivation for Aggregation

- Smaller data sets resulting from data reduction require less memory and processing time.
- Acts as a change of scope or scale by providing a high-level view of the data instead of a low-level view.
- The behaviour of groups of objects is often more stable than that of individual objects or attributes.
- Disadvantage: potential loss of interesting details.

SAMPLING

- Sampling is commonly used for selecting a subset of the data objects to be analysed.
- It is a statistical approach, so the representativeness of samples will vary: the best we can do is choose a sampling scheme that guarantees a high probability of getting a representative sample.


Sampling

- Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.
- Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming.
- Key principle for effective sampling: using a sample will work almost as well as using the entire data set if the sample is representative.
- A sample is representative if it has approximately the same properties as the original set of data.

Sampling Approaches

Simple Random Sampling

- Sampling without replacement: each item selected is removed from the set of all objects in the population.
- Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once. This scheme is simpler to analyse (see the sketch after this list).
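A quick sketch of the two schemes with Python's standard library (population and sample sizes are illustrative):

```python
import random

population = list(range(100))  # 100 distinct objects

# Sampling without replacement: selected items are effectively removed,
# so no object can appear twice in the sample
without = random.sample(population, k=10)

# Sampling with replacement: the same object can be picked more than once
with_repl = random.choices(population, k=10)
```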



Example

- We have 100 people: 51 are women, P(W) = 0.51, and 49 are men, P(M) = 0.49. If I pick two persons, what is the probability P(W,W) that both are women?
- Sampling with replacement: P(W,W) = 0.51 * 0.51 = 0.2601
- Sampling without replacement: P(W,W) = 51/100 * 50/99 ≈ 0.2576
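A Monte Carlo sketch that checks both answers (the trial count is arbitrary):

```python
import random

trials = 100_000
population = ["W"] * 51 + ["M"] * 49

both_with = sum(random.choices(population, k=2) == ["W", "W"] for _ in range(trials))
both_without = sum(random.sample(population, k=2) == ["W", "W"] for _ in range(trials))

print(both_with / trials)     # ~0.51 * 0.51        = 0.2601
print(both_without / trials)  # ~(51/100) * (50/99) ≈ 0.2576
```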


Sampling Approaches

Stratified Sampling: starts with prespecified groups of objects from which samples are drawn.

- First approach: an equal number of objects is drawn from each group, even if the group sizes differ.
- Second approach: the number of objects drawn is proportional to the size of the group.

Example: I want to understand the differences between legitimate and fraudulent credit card transactions, and 0.1% of transactions are fraudulent. What happens if I select 1000 transactions at random? I get 1 fraudulent transaction (in expectation), which is not enough to draw any conclusions. Solution: sample 1000 legitimate and 1000 fraudulent transactions (see the sketch below).
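A sketch of both stratified approaches in pandas (the table, column names, and sizes are hypothetical):

```python
import pandas as pd

# Hypothetical transaction table with a highly imbalanced class column
# ("fraud" is 0.1% of rows, as in the example above)
df = pd.DataFrame({
    "amount": range(100_000),
    "label": ["fraud" if i % 1000 == 0 else "legit" for i in range(100_000)],
})

# First approach: draw an equal number of objects from each group
equal = df.groupby("label").sample(n=100, random_state=0)

# Second approach: draw a number proportional to each group's size (here 10%)
proportional = df.groupby("label").sample(frac=0.1, random_state=0)
```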


Sampling Process

- Select a sampling technique.
- Choose a sample size:
  - Larger sample size: increases the probability that the sample will be representative, but eliminates the advantage of sampling.
  - Smaller sample size: patterns can be missed, or erroneous patterns can be detected.

Example of Various Sample Sizes

Figure: the same data set shown at 8000 points, 2000 points, and 500 points.


DIMENSIONALITY REDUCTION

- DM algorithms work better if the dimensionality is lower.
- Can eliminate irrelevant features and reduce noise.
- Can lead to a more understandable model.
- Allows data to be more easily visualized.
- The amount of time and memory required by the data mining algorithm is reduced.

Dimensionality Reduction

- The term is reserved for those techniques that reduce dimensionality by creating new attributes that are combinations of the old attributes.
- Curse of dimensionality: data analysis becomes harder as dimensionality increases.
- Linear algebra techniques for dimensionality reduction:
  - PCA finds new attributes that are linear combinations of the original attributes, are orthogonal to each other, and capture the maximum amount of variation (see the sketch below).
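A minimal scikit-learn sketch of PCA (the data shape and component count are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 objects, 10 attributes

pca = PCA(n_components=2)             # keep the 2 directions of maximum variation
X_reduced = pca.fit_transform(X)      # new attributes: orthogonal linear combinations

print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```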



FEATURE SUBSET SELECTION

- A variant of dimensionality reduction.
- Redundant features and irrelevant features reduce classification accuracy.
- Redundant features: duplicate much or all of the information contained in one or more other attributes.
- Irrelevant features: contain no useful information for the data mining task at hand.

3 Approaches to FSS

- Embedded: the DM algorithm itself decides which attributes to use and which to ignore.
- Filter: features are selected before the DM algorithm is run, using an approach that is independent of the DM algorithm (see the sketch below).
- Wrapper: the DM algorithm is used as a black box to find the best subset of attributes.
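As one possible instance of the filter approach, a scikit-learn sketch that scores features with an ANOVA F-test, independently of whatever classifier is run afterwards:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter approach: score features independently of any particular DM algorithm
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())  # boolean mask of the selected attributes
```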


Architecture for FSS

- A measure for evaluating a subset of features
- A search strategy that controls the generation of new subsets of features
- A stopping criterion
- A validation procedure


Flowchart for FSS

The search strategy generates a subset of attributes, which is passed to the evaluation step and checked against the stopping criterion. If the criterion is not met, control loops back to the search strategy to generate a new subset; once it is met, the selected attributes are passed to the validation procedure.


FEATURE CREATION

- A new set of attributes is created from the original attributes that captures the important information more effectively.
- The new set of attributes can be smaller than the original set.

Three general methodologies: feature extraction, mapping the data to a new space, and feature construction.


Feature Extraction

- The creation of a new set of features from the raw data.
- Example: classify photographs as containing human faces or not. The raw data (pixels) is too cumbersome to analyse directly, but the data can be processed to provide higher-level features, such as the presence or absence of edges pertaining to human faces.
- Highly domain specific.


Mapping Data to a New Space

- A different view of the data can reveal interesting patterns that were otherwise hidden.

Figure: a time series of two sine waves plus noise, and its frequency-domain (Fourier) representation, where the two underlying frequencies show up as peaks.
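A sketch of this idea using the Fourier transform (the frequencies 7 Hz and 17 Hz and the noise level are illustrative):

```python
import numpy as np

t = np.linspace(0, 1, 1024, endpoint=False)
# Two sine waves (7 Hz and 17 Hz) plus noise
signal = np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 17 * t)
signal += 0.5 * np.random.default_rng(0).normal(size=t.size)

# Map the data to frequency space: the two frequencies appear as clear peaks
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
print(freqs[np.argsort(spectrum)[-2:]])  # ~7.0 and ~17.0
```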


Feature Construction

- Constructing features so that they are in a form best suited to the respective data mining algorithm.
- Needs domain expertise to construct the features.
- Example: a set of artifacts is to be classified as wooden, gold, or bronze. The known data are mass and volume; however, density (mass/volume) proves to be a better measure for classification (a sketch follows below).
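A minimal sketch of this construction (measurements are hypothetical; the resulting densities roughly match wood, gold, and bronze):

```python
import pandas as pd

# Hypothetical artifact measurements (mass in g, volume in cm^3)
artifacts = pd.DataFrame({"mass": [96, 193, 440], "volume": [120, 10, 50]})

# Constructed feature: density separates wood (~0.8), gold (~19.3), bronze (~8.8)
artifacts["density"] = artifacts["mass"] / artifacts["volume"]
print(artifacts)
```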


DISCRETIZATION AND BINARIZATION

- Some classification algorithms need categorical attributes, and association analysis requires that the data be in the form of binary attributes.
- It may therefore be necessary to transform a continuous attribute into a categorical attribute: discretization.
- Both continuous and discrete attributes may need to be transformed into binary attributes: binarization.
- If a categorical attribute has a large number of values, or some values occur infrequently, it may be better to reduce the number of categories by combining some of the values.


Binarization

1. If there are m categorical values, uniquely assign each original value to an integer in the interval [0, m-1]. If the attribute is ordinal, order must be maintained by the assignment.
2. Convert each of these m integers to a binary number; n = ceil(log2(m)) binary digits are needed to represent the integers.
3. Represent the binary numbers using n binary attributes.


13
21-09-2020

Example of Binarization
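A minimal sketch of the scheme above for a hypothetical ordinal attribute with m = 5 values (the value names are illustrative):

```python
import math
import pandas as pd

# Hypothetical ordinal attribute with m = 5 values (order is preserved)
values = ["awful", "poor", "OK", "good", "great"]
m = len(values)
n = math.ceil(math.log2(m))                    # n = ceil(log2 m) = 3 binary attributes

to_int = {v: i for i, v in enumerate(values)}  # step 1: map values to [0, m-1]

rows = []
for v in values:
    bits = format(to_int[v], f"0{n}b")         # step 2: integer -> n-bit binary number
    rows.append([v, to_int[v]] + [int(b) for b in bits])  # step 3: n binary attributes

print(pd.DataFrame(rows, columns=["value", "integer", "x1", "x2", "x3"]))
```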


Issues in Binarization

- Binarization can create unintended relationships among the transformed attributes.
- Association analysis requires attributes where only non-zero values matter; we therefore introduce one binary attribute for each categorical value.
- It may be necessary to replace a single binary attribute with two asymmetric binary attributes.


Discretization of Continuous Attributes

- Applied mainly to attributes used in classification or association analysis.
- Two subtasks: deciding how many categories to have, and determining how to map attribute values to those categories.
- The given values are sorted and divided into n intervals by specifying n-1 split points: { (x0, x1], (x1, x2], (x2, x3], ..., (xn-1, xn] }.

Unsupervised Discretization

- Class information is not used.
- Equal width approach: divides the range into a user-specified number of intervals.
- Equal frequency approach: tries to put the same number of objects into each interval.
- The K-means clustering algorithm can also be used to create the intervals and map the values (see the sketch below).
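A sketch of all three approaches (synthetic one-dimensional data; pandas and scikit-learn are assumed):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

x = np.random.default_rng(0).normal(size=200)

equal_width = pd.cut(x, bins=4)   # equal width: range divided into 4 equal intervals
equal_freq = pd.qcut(x, q=4)      # equal frequency: ~50 objects per interval

# K-means: intervals induced by 1-D cluster assignments
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(x.reshape(-1, 1))
```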


Discretization Without Using Class Labels

Figure: the data consists of four groups of points and two outliers. The data is one-dimensional, but a random y component has been added to reduce overlap.

Discretization Without Using Class Labels

Figure: equal interval width approach used to obtain 4 values.


Discretization Without Using Class Labels

Figure: equal frequency approach used to obtain 4 values.

Discretization Without Using Class Labels

Figure: K-means approach used to obtain 4 values.


Supervised Discretization

- Class information is available.
- Aims to place the splits in a way that maximizes the purity of the intervals.
- Uses entropy-based techniques (see the sketch below).

Entropy of the i-th interval:

$e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}$

where $p_{ij} = m_{ij} / m_i$ is the fraction of values of class j in interval i, k is the number of classes, $m_i$ is the number of values in the i-th interval, and $m_{ij}$ is the number of values of class j in interval i.
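A sketch of this computation (the class counts are illustrative):

```python
import numpy as np

def interval_entropy(class_counts):
    """Entropy e_i of one interval from its per-class value counts m_ij."""
    m_i = np.sum(class_counts)
    p = np.asarray(class_counts, dtype=float) / m_i  # p_ij = m_ij / m_i
    p = p[p > 0]                                     # 0 * log2(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(interval_entropy([10, 10]))  # 1.0 -> impure interval (50/50 split)
print(interval_entropy([20, 0]))   # 0.0 -> pure interval, what the splits aim for
```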


Categorical Attributes with Too Many Values

- If the attribute is ordinal, we can use the previously mentioned entropy-based technique.
- If the attribute is nominal, we need other approaches.
- Example: Goa University has a large number of departments (Computer Engineering Department, Department of Microbiology, Department of Information Technology, ...). Based on domain knowledge, we can group them into higher-level categories such as Engineering and Life Sciences.
- This is domain specific and needs domain knowledge to group the values together.


VARIABLE TRANSFORMATION

Simple Functions
- A simple mathematical function is applied to each value individually.
- These have to be applied with caution as they can change the nature of the data.

Normalization or Standardization
- The goal is to make an entire set of values have a particular property.
- The mean and standard deviation are strongly affected by outliers, so the mean is often replaced by the median and the standard deviation by the absolute standard deviation (see the sketch below).
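A sketch contrasting the two standardizations (values are illustrative; the outlier shows why the robust statistics help):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # illustrative values with one outlier

# Standard: (x - mean) / std -- both statistics are pulled by the outlier
z = (x - x.mean()) / x.std()

# Robust variant from the slide: median, and the absolute standard deviation
# sigma_A = (1/m) * sum(|x - mu|), with mu taken as the median
mu = np.median(x)
sigma_a = np.mean(np.abs(x - mu))
z_robust = (x - mu) / sigma_a
```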


THANKS


