
Topics

K-Nearest Neighbor Editing
Density-Based Clustering
Subspace Clustering / Bi-Clustering
Skyline Pattern Mining (Top-K Pattern Mining)
Hidden Markov Models
Moving Objects Mining
Workflow Discovery (Process Mining)
Items Recommendation
Mining Reviews/Sentiment Data
Self-Organizing Maps
MapReduce

Data Preprocessing

Data Quality
Data quality is a major concern for data mining tasks.
Why: almost all data mining algorithms induce knowledge strictly from data,
so the quality of the extracted knowledge depends heavily on the quality of the data.
There are two main problems in data quality:
Missing data: the data is not present
Noisy data: the data is present but not correct

Missing/noisy data sources:

Hardware failure
Data transmission error
Data entry problem
Refusal of respondents to answer certain questions

Effect of Noisy Data on Results Accuracy

From the training data, the data mining step discovers only those rules whose support (frequency) is >= 2:

If age <= 30 and income = high then buys_computer = yes
If age > 40 and income = medium then buys_computer = no

Due to the missing values in the training dataset, the accuracy of prediction on the testing (actual) data is reduced to 66.7%.

Imputation of Missing Values (Basic)

Imputation denotes a procedure that replaces the missing values in a dataset with plausible values, i.e., values obtained by considering the relationships among correlated attributes of the dataset.

Example: if we consider only {attribute#2}, the value "cool" appears in 4 records.
Probability of imputing the value 20 = 75%
Probability of imputing the value 30 = 25%

Imputation of Missing Values (Basic)

For {attribute#4}, the value "true" appears in 3 records:
Probability of imputing the value 20 = 50%
Probability of imputing the value 10 = 50%

For {attribute#2, attribute#3}, the value pair {cool, high} appears in only 2 records:
Probability of imputing the value 20 = 100%
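
These probabilities are simple conditional frequencies: among the records that share the given attribute value(s), count how often each candidate value for the missing attribute occurs. A minimal sketch in Python (the DataFrame, column names, and helper below are hypothetical, since the slide's dataset itself is not shown):

import pandas as pd

# Hypothetical data mirroring the slide: "cool" appears in 4 records,
# with known target values 20, 20, 20, 30 -> probabilities 75% and 25%.
df = pd.DataFrame({
    "attribute2": ["cool", "cool", "cool", "cool", "mild"],
    "target":     [20,     20,     20,     30,     None],
})

def conditional_impute(df, cond_cols, cond_values, target_col):
    # Keep only the records matching the conditioning attribute values.
    mask = (df[cond_cols] == cond_values).all(axis=1)
    known = df.loc[mask, target_col].dropna()
    # Relative frequencies act as imputation probabilities.
    probs = known.value_counts(normalize=True)
    return probs.idxmax(), probs

value, probs = conditional_impute(df, ["attribute2"], ["cool"], "target")
print(value, dict(probs))   # 20.0 {20.0: 0.75, 30.0: 0.25}

In practice one either imputes the most probable value or samples from this distribution.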

Methods for Imputing Missing Values

Ignoring and discarding data: there are two main ways to discard data having missing values:
Discard all records that have missing data (also called discard-case analysis)
Discard only those attributes that have a high level of missing data

Imputation using mean/median or mode: one of the most frequently used methods (a statistical technique); see the sketch below.
Replace missing values of continuous numeric attributes using the mean or median (the median is robust against noise)
Replace missing values of discrete attributes using the mode
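
A minimal pandas sketch of mean/median/mode imputation (the column names and values are illustrative):

import pandas as pd

df = pd.DataFrame({
    "age":    [29, None, 44, 52, None],                       # continuous
    "income": ["high", "medium", None, "medium", "medium"],   # discrete
})

# Continuous attribute: impute with the median (robust against noise).
df["age"] = df["age"].fillna(df["age"].median())

# Discrete attribute: impute with the mode (most frequent value).
df["income"] = df["income"].fillna(df["income"].mode()[0])

print(df)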

Methods for Imputing Missing Values

Replace missing values using a prediction/classification model:

Advantage: it considers the relationships among the known attribute values and the missing values, so the imputation accuracy is higher than that of statistical techniques.

Disadvantage: if no correlation exists between the instances having missing values and the instances without them, the imputation cannot be performed.

Alternative (hybrid) approach: combine a prediction/classification model with mean/mode imputation; first try to impute the missing value using the model, then fall back to the median/mode (see the sketch below).
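
A minimal sketch of the hybrid approach, using a decision tree from scikit-learn as the prediction model with a median fallback (the dataset and column names are hypothetical):

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "income_high": [1, 0, 0, 1, 0],
    "student":     [0, 1, 1, 0, 1],
    "age":         [29.0, 35.0, None, 31.0, None],
})
features = ["income_high", "student"]
known = df[df["age"].notna()]
missing = df[df["age"].isna()]

# Step 1: learn a model on the complete records, predict the missing ones.
model = DecisionTreeRegressor().fit(known[features], known["age"])
df.loc[missing.index, "age"] = model.predict(missing[features])

# Step 2 (fallback): anything still missing gets the median instead.
df["age"] = df["age"].fillna(known["age"].median())
print(df)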

Methods for Imputing Missing Values

K-Nearest Neighbor (k-NN)

k-NN imputes the missing attribute values on the basis of the K nearest neighbors. Neighbors are determined using a distance measure. Once the K neighbors are determined, the missing values are imputed by taking the mean/median or mode of their known values.

[Figure: a record with a missing value among the other dataset records; its nearest neighbors supply the imputed value]

K-Nearest Neighbor (Pseudo-code)

Missing values imputation using k-NN
Input: Dataset (D), size of K
for each record (x) having missing value(s)
    for each data object (y) in D
        compute the distance between (x, y)
        save the distance in a distance array (S)
    sort the array S in ascending order (smallest distance = nearest)
    pick the top K data objects from S
    impute the missing attribute value(s) of x on the basis
    of the known values of these K neighbors (use mean/median or mode)
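
The pseudo-code translates into a short runnable sketch; the helper below assumes a numeric 2-D NumPy array with NaNs marking missing values, and imputes with the mean of the K nearest complete rows:

import numpy as np

def knn_impute(data, k=3):
    data = data.copy()
    complete = data[~np.isnan(data).any(axis=1)]   # rows with no missing values
    for i, row in enumerate(data):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # Euclidean distance over the attributes known in this row.
        dists = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        # Ascending sort: the smallest distances are the nearest neighbors.
        neighbors = complete[np.argsort(dists)[:k]]
        data[i, miss] = neighbors[:, miss].mean(axis=0)  # mean of known values
    return data

X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, np.nan]])
print(knn_impute(X, k=2))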

Noisy Data
Noise: random error; the data is present but not correct
Sources: data transmission errors, data entry problems

Removing noise:
Data smoothing (rounding, averaging within a window)
Clustering/merging and detecting outliers
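
A minimal sketch of both ideas, window averaging for smoothing and a median-deviation rule for flagging outliers (the series is illustrative):

import numpy as np

values = np.array([21.0, 22.4, 80.0, 23.1, 22.8, 21.9])  # 80.0 is noise

# Smoothing: averaging within a sliding window of size 3.
window = 3
smoothed = np.convolve(values, np.ones(window) / window, mode="valid")
print(smoothed)

# Outlier detection: flag values far from the median (MAD rule).
med = np.median(values)
mad = np.median(np.abs(values - med))
print(values[np.abs(values - med) > 3 * mad])   # -> [80.0]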

Effect of Continuous Data on Results Accuracy

Training data:

age    buys_computer
29     no
44     no
55     no
52     no

Data mining on the exact continuous values yields rules such as:

If age = 29 then buys_computer = no
If age = 44 then buys_computer = no
If age = 55 then buys_computer = no
If age = 52 then buys_computer = no

What would be the accuracy of the above rules on the testing (actual) data?

age    buys_computer
31     ?
33     ?
32     ?

None of the exact-value rules fires for ages 31-33, so such rules generalize poorly; this motivates discretizing continuous attributes into intervals.

Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)

where Ent(S1) = −Σi pi log2(pi), and pi is the probability of class i in S1, determined by dividing the number of samples of class i in S1 by the total number of samples in S1 (Ent(S2) is defined analogously).

Entropy-Based Discretization
The boundary T that minimizes the entropy function over all possible boundaries is selected as the binary discretization boundary.
The process is recursively applied to the partitions obtained, until some stopping criterion is met, e.g., stopping when the information gain Ent(S) − E(S, T) falls below a small threshold δ.
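
A minimal sketch of the boundary search, implementing Ent and E(S, T) exactly as defined above (the grade labels at the end are hypothetical, chosen only to match the counts in the example that follows):

import numpy as np

def entropy(labels):
    # Ent(S) = -sum_i p_i * log2(p_i) over the class probabilities.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_boundary(values, labels):
    # Try every midpoint between consecutive distinct sorted values and
    # return the boundary T minimizing E(S, T).
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_e = None, float("inf")
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2
        e = (i * entropy(labels[:i]) +
             (len(values) - i) * entropy(labels[i:])) / len(values)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

ages   = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ["F", "P", "P", "P", "P", "P", "P", "F", "F"]  # hypothetical
print(best_boundary(ages, grades))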

Example (cont.)

The records sorted by Age (each record also has an ID and a Grade of F or P):

Age: 21, 22, 24, 25, 27, 27, 27, 35, 41

Choosing the boundary T between the first and second records, the numbers of elements in S1 and S2 are:

|S1| = 1
|S2| = 8

The entropy of S1 is

Ent(S1) = −P(Grade=F) log2 P(Grade=F) − P(Grade=P) log2 P(Grade=P)
        = −(1) log2(1) − (0) log2(0) = 0

(using the convention 0 · log2(0) = 0).

The entropy of S2 is

Ent(S2) = −P(Grade=F) log2 P(Grade=F) − P(Grade=P) log2 P(Grade=P)
        = −(2/8) log2(2/8) − (6/8) log2(6/8) ≈ 0.811
