
Abstract

Most real-world data mining applications are characterized by high
dimensional data in which not all features are important. High dimensional
data sets, that is, data sets with thousands of features, can contain a great
deal of irrelevant and noisy information that may severely degrade the
performance of a data mining process. Even standard data mining algorithms
cannot cope with a large number of weakly relevant and redundant features, a
difficulty usually attributed to the curse of dimensionality. In addition,
decision tree and random forest algorithms become computationally expensive
when the dimensionality is high. Dimensionality reduction techniques are
proposed as a data preprocessing step to address the curse of dimensionality.
This process identifies a suitable low dimensional representation of the
original data. Dimensionality reduction improves the computational efficiency
and accuracy of the data analysis, and it also improves the comprehensibility
of a data mining model. Dimensionality reduction techniques work either by
transforming the existing features into a new, reduced set of features or by
selecting a subset of the existing features. Accordingly, two standard tasks
are associated with producing a reduced feature set: Feature Selection and
Feature Extraction.
Feature Selection is a technique used to reduce the dimensionality of data by
eliminating irrelevant features, thereby improving predictive accuracy. Many
approaches have been proposed for feature selection, and most of them still
suffer from stagnation in local optima and high computational cost, mainly due
to the large search space. An efficient search technique is therefore needed
for feature selection tasks. Firstly, in this work, the Best First Search
based CFS with Naive Bayes (BFSCFS-NB) and Greedy Search based CFS with Naive
Bayes (GSCFS-NB) algorithms are proposed to improve classification accuracy
and minimize classification error. Secondly, the Genetic Search based
CFS-Naive Bayes (GNSCFS-NB) and Particle Swarm Optimization search based
CFS-Naive Bayes (PSOCFSNB) algorithms are proposed with the same objectives.
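The combination described above can be sketched in outline. The following is a minimal, illustrative NumPy implementation of the greedy-search variant of CFS (Correlation-based Feature Selection, using Hall's merit heuristic); the function names and the synthetic data are assumptions for illustration, not the thesis implementation. A Naive Bayes classifier would then be trained on the selected columns only.

```python
# Illustrative sketch: greedy forward search over feature subsets, scored by
# the CFS merit heuristic. Assumed names; not the thesis code.
import numpy as np

def cfs_merit(corr_fc, corr_ff, subset):
    # Hall's merit: k * r_cf / sqrt(k + k*(k-1) * r_ff), where r_cf is the
    # mean feature-class correlation of the subset and r_ff the mean
    # feature-feature correlation (redundancy penalty).
    k = len(subset)
    r_cf = np.mean(corr_fc[subset])
    if k == 1:
        r_ff = 0.0
    else:
        pairs = [corr_ff[i, j] for a, i in enumerate(subset) for j in subset[a + 1:]]
        r_ff = np.mean(pairs)
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y):
    # Forward greedy search: repeatedly add the feature that most improves
    # the merit, and stop when no candidate improves it any further.
    n_feat = X.shape[1]
    corr_fc = np.abs(np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(n_feat)]))
    corr_ff = np.abs(np.corrcoef(X, rowvar=False))
    selected, best = [], -np.inf
    remaining = list(range(n_feat))
    while remaining:
        cand = max(remaining, key=lambda f: cfs_merit(corr_fc, corr_ff, selected + [f]))
        merit = cfs_merit(corr_fc, corr_ff, selected + [cand])
        if merit <= best + 1e-10:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = merit
    return selected

# Tiny demo: one relevant feature, one noise feature, one redundant copy.
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 10)
relevant = y + 0.1 * rng.standard_normal(20)
X = np.column_stack([relevant, rng.standard_normal(20), relevant])
selected = greedy_cfs(X, y)  # the redundant copy (column 2) is rejected
```

The merit's denominator is what distinguishes CFS from simple correlation ranking: a feature that is highly correlated with an already-selected one raises the redundancy term and lowers the subset's score, so the redundant copy above is never added.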
The accuracy, kappa statistic and classification error measures are
calculated and compared with those of other classification algorithms such as
Support Vector Machines, J48, Multilayer Perceptron and Radial Basis Function
networks. Based on the experimental results, it can be concluded that CFS is
the best feature selector for common machine learning algorithms.
Finally, to improve the performance further, a new algorithm called Principal
Component Analysis-Naive Bayes (PCA-NB) is proposed; it projects the high
dimensional data onto a lower dimensional subspace with minimal loss of
information by minimizing the reconstruction error. The prediction accuracy,
kappa statistic and error measures of the proposed work are compared with
those of existing techniques such as Support Vector Machines, J48, Multilayer
Perceptron and Radial Basis Function networks. The performance of the
proposed algorithm is better than that of the existing techniques.
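The PCA-NB idea can be sketched as a two-stage pipeline: project the data onto its top principal components, then fit a Gaussian Naive Bayes classifier in the reduced space. The code below is a minimal, self-contained NumPy illustration under that assumption; the class names and synthetic data are not taken from the thesis.

```python
# Illustrative PCA-NB pipeline sketch in NumPy (assumed names, synthetic data).
import numpy as np

def pca_transform(X, n_components):
    # Center the data and project it onto the top principal axes (the
    # directions of maximum variance), obtained from the SVD.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

class GaussianNB:
    # Per-class feature means and variances, under the naive assumption that
    # features are conditionally independent given the class.
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.theta_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        self.prior_ = np.array([np.mean(y == c) for c in self.classes_])
        return self

    def predict(self, X):
        # Log joint Gaussian density of each sample under each class model.
        ll = -0.5 * (((X[:, None, :] - self.theta_) ** 2) / self.var_
                     + np.log(2 * np.pi * self.var_)).sum(axis=2)
        return self.classes_[np.argmax(ll + np.log(self.prior_), axis=1)]

# Demo: two well-separated 5-D Gaussian classes, reduced to 2 components.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 5)), rng.normal(3.0, 1.0, (50, 5))])
y = np.repeat([0, 1], 50)
Z = pca_transform(X, n_components=2)
acc = np.mean(GaussianNB().fit(Z, y).predict(Z) == y)
```

Because the leading principal components retain most of the data's variance, the classifier in the reduced space sees far fewer, decorrelated inputs, which is also friendlier to Naive Bayes' independence assumption.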
The PIMS data set used in this work is collected from the UCI machine
learning repository, a standard machine learning repository. The experimental
results show that the proposed algorithms outperform existing algorithms such
as SVM, MLP, J48 and RBF in terms of accuracy.
