Estimating Missing Values of Heterogeneous Datasets by Clustering
Presented by : Ms. Selva Mary. G
M.E. (Computer Engineering)
ARMIET/M.TECH/CS12 SM 15
Semester- III
Roll No. : 15

Data Mining
• The Extraction of hidden predictive information
• Process of identifying frequent patterns from different databases.
• Algorithms rely on a large number of diverse,
heterogeneous, incomplete but interrelated data
sources.

Data Mining Model

Preparation of Data
• Data preprocessing
  – Deciding which data will be used as input

Why Data Preprocessing?
• Data in the real world is dirty
  – Incomplete: (Missing Values)
    • e.g., occupation=“ ”
  – Noisy:
    • e.g., Salary=“-10”
  – Inconsistent:
    • e.g., Age=“42” Birthday=“03/07/1997”
    • e.g., Was rating “1,2,3”, now rating “A, B, C”
    • e.g., discrepancy between duplicate records

What is a missing value?
• A missing value (MV) is an empty cell in the table that represents a dataset
[Figure: dataset table labelled with Attributes and Instances; a missing cell is shown as “?”]

Missing Data
  – Values
  – Attributes
  – Entire records
• Problem: Misleading results, bias.

Handling Missing Values
• Ignore the tuple:
  – not very effective
• Fill in the missing value manually:
  – time consuming and may not be feasible
• Use a global constant to fill in the MV:
  – replace all missing attribute values by the same constant, such as a label like “unknown”

Handling Missing Values
• Use the attribute mean to fill the MV:
  – use the mean for continuous attributes and the mode for discrete attributes
• Use the most probable value (Imputation):
  – determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction
  – e.g., using the other customer attributes in the data set, build a decision tree to predict the missing values
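
The mean/mode fill described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fill_with_mean_or_mode`, the `customers` records, and the use of `None` to mark a missing value are all hypothetical and do not come from the slides.

```python
from statistics import mean, mode

def fill_with_mean_or_mode(rows, column, discrete=False):
    """Return copies of rows with None in `column` replaced by the
    mean (continuous attribute) or mode (discrete attribute) of the
    observed values. None as the missing marker is an assumption."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mode(observed) if discrete else mean(observed)
    return [dict(r, **{column: fill}) if r[column] is None else dict(r)
            for r in rows]

# Hypothetical customer records, two of them with a missing value.
customers = [
    {"age": 30, "occupation": "clerk"},
    {"age": None, "occupation": "clerk"},
    {"age": 50, "occupation": None},
    {"age": 40, "occupation": "engineer"},
]
filled = fill_with_mean_or_mode(customers, "age")
filled = fill_with_mean_or_mode(filled, "occupation", discrete=True)
```

Every missing cell in a column receives the same value here, which is why this approach can distort the distribution of the attribute when many values are missing.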

Types of Imputation
• Single Imputation
  – provides a single estimate for each missing data value
  – kNN method
  – Parametric and nonparametric methods
  – Nonparametric imputation is used if users have no idea of the actual distribution of a dataset
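
As an illustration of single imputation with the kNN method, the sketch below fills a missing value with the average of that attribute over the k nearest complete instances. It assumes numeric attributes, Euclidean distance, `None` as the missing marker, and that only the target column has missing values; the name `knn_impute` and the sample data are hypothetical.

```python
import math

def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean of that column over
    the k nearest complete rows (Euclidean distance on the other columns).
    Assumes all other columns are fully observed."""
    features = [c for c in range(len(rows[0])) if c != target]
    complete = [r for r in rows if r[target] is not None]
    result = []
    for r in rows:
        if r[target] is not None:
            result.append(list(r))
            continue
        here = [r[f] for f in features]
        nearest = sorted(
            complete,
            key=lambda c: math.dist(here, [c[f] for f in features]),
        )[:k]
        value = sum(n[target] for n in nearest) / len(nearest)
        result.append([value if i == target else v for i, v in enumerate(r)])
    return result

# Hypothetical data: the last instance is missing its second attribute.
data = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
imputed = knn_impute(data, target=1, k=2)
```

The two nearest neighbours of the incomplete instance are the first two rows, so the imputed value is their average.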

Types of Imputation
• Multiple Imputation
  – Variables are predicted using other existing variables
  – The predicted values are called imputes
  – Stages
    • missing data are filled in m times, giving m complete data sets
    • the m complete data sets are analyzed using standard procedures
    • the results from the m complete data sets are combined for the inference
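
The three stages of multiple imputation can be sketched as follows. This is a deliberately simplified illustration: drawing each impute at random from the observed values is an assumption made here to keep the example short (the slides do not specify how imputes are drawn), the "analysis" is just the mean, and the name `multiple_imputation_mean` is hypothetical.

```python
import random
from statistics import mean

def multiple_imputation_mean(values, m=5, seed=0):
    """Estimate the mean of `values` (None = missing) with m imputed copies."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    estimates = []
    for _ in range(m):
        # Stage 1: fill the missing data once, giving one complete data set.
        # Assumption: imputes are drawn at random from the observed values.
        completed = [v if v is not None else rng.choice(observed)
                     for v in values]
        # Stage 2: analyze the complete data set with a standard procedure.
        estimates.append(mean(completed))
    # Stage 3: combine the m results into one estimate for the inference.
    return mean(estimates), estimates

pooled, per_dataset = multiple_imputation_mean([2.0, 4.0, None, 6.0], m=5)
```

The spread of the m per-dataset estimates gives a rough indication of the uncertainty caused by the missing data, which a single imputation cannot provide.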

Types of Imputation
• Iterative Imputation
  – Variables are predicted using existing values from other variables
  – First step is single imputation
  – Imputation is applied repeatedly until the algorithm converges
  – Convergence: no further change in the imputed data
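
A minimal sketch of iterative imputation for two numeric attributes is given below. It starts with a mean-based single imputation, then repeatedly re-predicts each missing cell from the other attribute until the imputed values stop changing. The use of simple linear regression as the predictor, the restriction to two columns, and the names `fit_line` and `iterative_impute` are assumptions for illustration only.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope and intercept of ys as a function of xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx

def iterative_impute(rows, tol=1e-6, max_iter=100):
    """Rows have two numeric columns; None marks a missing cell.
    Assumes no row is missing both columns."""
    data = [list(r) for r in rows]
    missing = [(i, j) for i, r in enumerate(rows) for j in (0, 1) if r[j] is None]
    # First step: single imputation with the column mean.
    for i, j in missing:
        data[i][j] = mean(r[j] for r in rows if r[j] is not None)
    for _ in range(max_iter):
        biggest_change = 0.0
        for j in (0, 1):
            other = 1 - j
            # Fit on rows where column j was actually observed.
            train = [r for i, r in enumerate(data) if rows[i][j] is not None]
            slope, intercept = fit_line([r[other] for r in train],
                                        [r[j] for r in train])
            for i, jj in missing:
                if jj == j:
                    new = slope * data[i][other] + intercept
                    biggest_change = max(biggest_change, abs(new - data[i][j]))
                    data[i][j] = new
        # Convergence: the imputed data no longer changes.
        if biggest_change < tol:
            break
    return data

# Hypothetical data in which the second attribute is twice the first.
table = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, None], [None, 10.0]]
completed = iterative_impute(table)
```

On this sample the imputed cells move from their initial column means toward the values implied by the linear relation between the two attributes.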

K-means Clustering Imputation (KMI)
• Clustering: automatic aggregation of instances in groups
• K-means: given a dataset, it identifies k (predefined parameter) groups (clusters) of similar instances
• For each cluster it computes the centroid
  – Artificial representative of the cluster
  – Mean/most frequent value of instances in a cluster
• MV imputation
  – Identify the cluster to which the instance with the MV belongs
  – Take the value of the centroid
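
The KMI steps above can be sketched in plain Python. This is a minimal illustration, assuming numeric attributes, `None` as the missing marker, clustering on complete instances only, and a simple deterministic initialisation (the first k complete instances); the names `kmeans` and `kmi` and the sample data are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; returns the k centroids.
    Assumption: centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def kmi(rows, k):
    """Fill each None with the value of the nearest centroid, where
    'nearest' is measured on the attributes the instance does have."""
    complete = [r for r in rows if None not in r]
    centroids = kmeans(complete, k)
    filled = []
    for r in rows:
        seen = [i for i, v in enumerate(r) if v is not None]
        best = min(centroids,
                   key=lambda c: math.dist([r[i] for i in seen],
                                           [c[i] for i in seen]))
        filled.append([best[i] if v is None else v for i, v in enumerate(r)])
    return filled

# Hypothetical data: two well-separated groups and one incomplete instance.
sample = [[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [9.0, 10.0],
          [2.0, 1.0], [10.0, 9.0], [1.5, None]]
result = kmi(sample, k=2)
```

The incomplete instance falls into the cluster around the small values, so its missing attribute takes that centroid's coordinate.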

K-Means Clustering Algorithm
• Advantages
  – Computationally faster (if K is small)
• Disadvantages
  – Difficulty in comparing the quality of the clusters produced
  – The fixed number of clusters makes it difficult to predict what K should be

Fuzzy C-Means Algorithm
• One piece of data may belong to two or more clusters
• Used in pattern recognition
• Used especially when the clusters are overlapping
• Cannot be used when the data has missing values

Fuzzy C-Means Algorithm
• Advantages
  1) Gives the best result for overlapped data sets
  2) Unlike k-means, a data point may belong to more than one cluster center
• Disadvantages
  1) Slower than K-means
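
To illustrate how a data point can belong to more than one cluster, the sketch below computes fuzzy c-means membership degrees for a single point, given fixed cluster centers. It uses the standard membership formula with fuzzifier m (m = 2 is a common choice, assumed here); the function name `fcm_memberships` and the sample centers are hypothetical, and the center-update step of the full algorithm is omitted.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Membership of `point` in each cluster:
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)),
    where d_i is the distance from the point to center i."""
    distances = [math.dist(point, c) for c in centers]
    if 0.0 in distances:
        # The point sits exactly on a center: full membership there.
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in distances)
            for d_i in distances]

centers = [(0.0, 0.0), (10.0, 0.0)]
halfway = fcm_memberships((5.0, 0.0), centers)   # equally close to both
near_first = fcm_memberships((2.0, 0.0), centers)
```

The memberships of a point always sum to 1, so a point in an overlap region is shared between clusters instead of being forced into one of them.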

Mixture Kernel Estimator
• The nonparametric methods are designed for either continuous attributes or discrete independent attributes.

Mixture Kernel Estimator
1) Pre-imputation
2) Mixture kernel functions
3) Setting confidence intervals
4) Selecting
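
One way to picture a kernel that handles continuous and discrete attributes together is a product kernel: a Gaussian term for each continuous attribute and a match/mismatch term for each discrete attribute. The sketch below uses such a kernel in a weighted-average estimator. It is a rough sketch under assumptions, not the estimator from these slides: the kernel forms, the `bandwidth` and `lam` parameters, and the names `mixed_kernel` and `kernel_impute` are all hypothetical.

```python
import math

def mixed_kernel(a, b, bandwidth=1.0, lam=0.2):
    """Product kernel over mixed attributes (illustrative assumption):
    Gaussian for numbers, 1 on match / lam on mismatch for strings."""
    weight = 1.0
    for u, v in zip(a, b):
        if isinstance(u, str):
            weight *= 1.0 if u == v else lam
        else:
            weight *= math.exp(-0.5 * ((u - v) / bandwidth) ** 2)
    return weight

def kernel_impute(complete_rows, query, **kernel_args):
    """Kernel-weighted average of the target over the complete instances.
    complete_rows is a list of (attributes, target) pairs."""
    weights = [mixed_kernel(attrs, query, **kernel_args)
               for attrs, _ in complete_rows]
    total = sum(w * target for w, (_, target) in zip(weights, complete_rows))
    return total / sum(weights)

# Hypothetical complete instances: (age, occupation) -> income.
known = [((30.0, "clerk"), 100.0), ((30.0, "engineer"), 200.0)]
estimate = kernel_impute(known, (30.0, "clerk"))
```

With `lam` set to 0 only instances whose discrete attributes match exactly contribute; larger values let mismatching instances contribute with a reduced weight.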

Higher Order Kernels
• Advantages
  – Provides high accuracy in prediction.
  – Provides a larger set of functionalities on higher dimensional space.
  – The error rate of the imputed data is much lower.
  – Maximum errors are avoided.
  – Faster and more accurate imputation.
  – Imputes the iterative missing target values effectively and accurately.
  E.g. (Spherical kernel and Bayesian kernel)

Existing System Model (1)

Existing System Model (2)

Proposed System Model

Thank You
Presented by : Ms. Selva Mary. G
M.E. (Computer Engineering)
ARMIET/M.TECH/CS12 SM 15
Semester- III
Roll No. : 15