
Topics

K-Nearest Neighbor Editing
Density-Based Clustering
Subspace Clustering / Bi-Clustering
Skyline Pattern Mining (Top-K Pattern Mining)
Hidden Markov Models
Moving Objects Mining
Workflow Discovery (Process Mining)
Items Recommendation
Mining Reviews/Sentiment Data
Self-Organizing Maps
MapReduce

Data Preprocessing

Data Quality
Data quality is a major concern for data mining tasks.
Why: almost all data mining algorithms induce knowledge strictly from data,
so the quality of the extracted knowledge depends heavily on the quality of the data.
There are two main problems in data quality:
Missing data: the data is not present
Noisy data: the data is present but not correct

Missing/noisy data sources:

Hardware failure
Data transmission error
Data entry problem
Refusal of respondents to answer certain questions

Effect of Noisy Data on Results Accuracy

From the training data, the data mining step discovers only those rules whose support (frequency) is >= 2:

If age <= 30 and income = high then buys_computer = yes
If age > 40 and income = medium then buys_computer = no

Due to the missing values in the training dataset, the accuracy of prediction on the testing (actual) data is reduced to 66.7%.

Imputation of Missing Values (Basic)

Imputation denotes a procedure that replaces the missing values in a dataset with plausible values, i.e., values obtained by considering the relationships among correlated attributes of the dataset.

Example: if we consider only {attribute#2}, the value "cool" appears in 4 records.
Probability of imputing the value 20 = 75%
Probability of imputing the value 30 = 25%

Imputation of Missing Values (Basic)

For {attribute#4}, the value "true" appears in 3 records:
Probability of imputing the value 20 = 50%
Probability of imputing the value 10 = 50%

For {attribute#2, attribute#3}, the value pair {cool, high} appears in only 2 records:
Probability of imputing the value 20 = 100%
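
These probabilities are simple conditional frequencies: among the records that share the given attribute value(s), count how often each candidate value for the missing attribute occurs. A minimal sketch in Python (the DataFrame, column names, and helper below are hypothetical, since the slide's dataset itself is not shown):

import pandas as pd

# Hypothetical data mirroring the slide: "cool" appears in 4 records,
# with known target values 20, 20, 20, 30 -> probabilities 75% and 25%.
df = pd.DataFrame({
    "attribute2": ["cool", "cool", "cool", "cool", "mild"],
    "target":     [20,     20,     20,     30,     None],
})

def conditional_impute(df, cond_cols, cond_values, target_col):
    # Keep only the records matching the conditioning attribute values.
    mask = (df[cond_cols] == cond_values).all(axis=1)
    known = df.loc[mask, target_col].dropna()
    # Relative frequencies act as imputation probabilities.
    probs = known.value_counts(normalize=True)
    return probs.idxmax(), probs

value, probs = conditional_impute(df, ["attribute2"], ["cool"], "target")
print(value, dict(probs))   # 20.0 {20.0: 0.75, 30.0: 0.25}

In practice one either imputes the most probable value or samples from this distribution.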

Methods for Imputing Missing Values

Ignoring and discarding data: there are two main ways to discard data having missing values:
Discard all records that have missing data (also called discard-case analysis)
Discard only those attributes that have a high level of missing data

Imputation using mean/median or mode: one of the most frequently used methods (a statistical technique); see the sketch below.
Replace missing values of continuous numeric attributes using the mean or median (the median is robust against noise)
Replace missing values of discrete attributes using the mode
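
A minimal pandas sketch of mean/median/mode imputation (the column names and values are illustrative):

import pandas as pd

df = pd.DataFrame({
    "age":    [29, None, 44, 52, None],                       # continuous
    "income": ["high", "medium", None, "medium", "medium"],   # discrete
})

# Continuous attribute: impute with the median (robust against noise).
df["age"] = df["age"].fillna(df["age"].median())

# Discrete attribute: impute with the mode (most frequent value).
df["income"] = df["income"].fillna(df["income"].mode()[0])

print(df)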

Methods for Imputing Missing Values

Replace missing values using a prediction/classification model:

Advantage: it considers the relationships among the known attribute values and the missing values, so the imputation accuracy is higher than that of statistical techniques.

Disadvantage: if no correlation exists between the instances having missing values and the instances without them, the imputation cannot be performed.

Alternative (hybrid) approach: combine a prediction/classification model with mean/mode imputation; first try to impute the missing value using the model, then fall back to the median/mode (see the sketch below).
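
A minimal sketch of the hybrid approach, using a decision tree from scikit-learn as the prediction model with a median fallback (the dataset and column names are hypothetical):

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "income_high": [1, 0, 0, 1, 0],
    "student":     [0, 1, 1, 0, 1],
    "age":         [29.0, 35.0, None, 31.0, None],
})
features = ["income_high", "student"]
known = df[df["age"].notna()]
missing = df[df["age"].isna()]

# Step 1: learn a model on the complete records, predict the missing ones.
model = DecisionTreeRegressor().fit(known[features], known["age"])
df.loc[missing.index, "age"] = model.predict(missing[features])

# Step 2 (fallback): anything still missing gets the median instead.
df["age"] = df["age"].fillna(known["age"].median())
print(df)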

Methods for Imputing Missing Values

K-Nearest Neighbor (k-NN)

k-NN imputes the missing attribute values on the basis of the K nearest neighbors. Neighbors are determined using a distance measure. Once the K neighbors are determined, the missing values are imputed by taking the mean/median or mode of their known values.

[Figure: a record with a missing value among the other dataset records; its nearest neighbors supply the imputed value]

K-Nearest Neighbor (Pseudo-code)

Missing values imputation using k-NN
Input: Dataset (D), size of K
for each record (x) having missing value(s)
    for each data object (y) in D
        compute the distance between (x, y)
        save the distance in a distance array (S)
    sort the array S in ascending order (smallest distance = nearest)
    pick the top K data objects from S
    impute the missing attribute value(s) of x on the basis
    of the known values of these K neighbors (use mean/median or mode)
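
The pseudo-code translates into a short runnable sketch; the helper below assumes a numeric 2-D NumPy array with NaNs marking missing values, and imputes with the mean of the K nearest complete rows:

import numpy as np

def knn_impute(data, k=3):
    data = data.copy()
    complete = data[~np.isnan(data).any(axis=1)]   # rows with no missing values
    for i, row in enumerate(data):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # Euclidean distance over the attributes known in this row.
        dists = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        # Ascending sort: the smallest distances are the nearest neighbors.
        neighbors = complete[np.argsort(dists)[:k]]
        data[i, miss] = neighbors[:, miss].mean(axis=0)  # mean of known values
    return data

X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, np.nan]])
print(knn_impute(X, k=2))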

Noisy Data
Noise: random error; the data is present but not correct
Sources: data transmission errors, data entry problems

Removing noise:
Data smoothing (rounding, averaging within a window)
Clustering/merging and detecting outliers
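
A minimal sketch of both ideas, window averaging for smoothing and a median-deviation rule for flagging outliers (the series is illustrative):

import numpy as np

values = np.array([21.0, 22.4, 80.0, 23.1, 22.8, 21.9])  # 80.0 is noise

# Smoothing: averaging within a sliding window of size 3.
window = 3
smoothed = np.convolve(values, np.ones(window) / window, mode="valid")
print(smoothed)

# Outlier detection: flag values far from the median (MAD rule).
med = np.median(values)
mad = np.median(np.abs(values - med))
print(values[np.abs(values - med) > 3 * mad])   # -> [80.0]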

Effect of Continuous Data on Results Accuracy

Training data:

age    buys_computer
29     no
44     no
55     no
52     no

Data mining on the exact continuous values yields rules such as:

If age = 29 then buys_computer = no
If age = 44 then buys_computer = no
If age = 55 then buys_computer = no
If age = 52 then buys_computer = no

What would be the accuracy of the above rules on the testing (actual) data?

age    buys_computer
31     ?
33     ?
32     ?

None of the exact-value rules fires for ages 31-33, so such rules generalize poorly; this motivates discretizing continuous attributes into intervals.

Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)

where Ent(S1) = −Σi pi log2(pi), and pi is the probability of class i in S1, determined by dividing the number of samples of class i in S1 by the total number of samples in S1 (Ent(S2) is defined analogously).

Entropy-Based Discretization
The boundary T that minimizes the entropy function over all possible boundaries is selected as the binary discretization boundary.
The process is recursively applied to the partitions obtained, until some stopping criterion is met, e.g., stopping when the information gain Ent(S) − E(S, T) falls below a small threshold δ.
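
A minimal sketch of the boundary search, implementing Ent and E(S, T) exactly as defined above (the grade labels at the end are hypothetical, chosen only to match the counts in the example that follows):

import numpy as np

def entropy(labels):
    # Ent(S) = -sum_i p_i * log2(p_i) over the class probabilities.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_boundary(values, labels):
    # Try every midpoint between consecutive distinct sorted values and
    # return the boundary T minimizing E(S, T).
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_e = None, float("inf")
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2
        e = (i * entropy(labels[:i]) +
             (len(values) - i) * entropy(labels[i:])) / len(values)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

ages   = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ["F", "P", "P", "P", "P", "P", "P", "F", "F"]  # hypothetical
print(best_boundary(ages, grades))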

Example (cont.)

The records sorted by Age (each record also has an ID and a Grade of F or P):

Age: 21, 22, 24, 25, 27, 27, 27, 35, 41

Choosing the boundary T between the first and second records, the numbers of elements in S1 and S2 are:

|S1| = 1
|S2| = 8

The entropy of S1 is

Ent(S1) = −P(Grade=F) log2 P(Grade=F) − P(Grade=P) log2 P(Grade=P)
        = −(1) log2(1) − (0) log2(0) = 0

(using the convention 0 · log2(0) = 0).

The entropy of S2 is

Ent(S2) = −P(Grade=F) log2 P(Grade=F) − P(Grade=P) log2 P(Grade=P)
        = −(2/8) log2(2/8) − (6/8) log2(6/8) ≈ 0.811
