
International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 2, Issue 10, October - 2015. ISSN 2348 4853, Impact Factor 1.317

Feature Extraction From Big Data

Aarti B. Sahitya*, Dr. M. Vijayalakshmi
*PG Scholar, V.E.S.I.T, Chembur; Professor, V.E.S.I.T, Chembur
Dimensionality reduction, feature extraction, and feature selection are important concepts in data mining when the size of high-dimensional data must be reduced. The main aim of this research is to find an applicable methodology that reduces the ever-increasing volume of data. In this paper we describe various feature extraction and feature selection methods and propose a scheme that selects a subset of the original features based on some evaluation criteria and thereby reduces the volume of a high-dimensional dataset.
Index Terms : Feature Extraction, Feature Selection, Dimensionality Reduction, Big Data, Data
Visualization, High Dimensional Data.



A dataset is a collection of homogeneous objects. An object is an instance in the dataset. A dimension is a property that defines an object. Dimensionality reduction is the process by which, at each step, irrelevant dimensions are removed without substantial loss of information and without affecting the final output. Feature extraction is the process of deriving a new, reduced set of features from the original features by some attribute transformation. Feature selection is a process that chooses an optimal subset of features according to an objective function. Because real-world large datasets contain irrelevant, redundant, and noisy dimensions, dimensionality reduction should be considered a pre-processing step before applying clustering, classification, or regression algorithms to such datasets.
A. High Dimensionality Data Reduction Challenges In Big Data
Big data is relentless: it is continuously generated on a massive scale by online interactions between people and systems and by sensor-enabled devices. It can be related, linked, and integrated to provide highly detailed information; such detail makes it possible for banks, health care, and public safety to provide specific services. It is creating new businesses and transforming traditional markets, and it therefore poses a challenge to the statistical community. Additional information can be obtained from a single large set as opposed to separate smaller sets; this allows correlations to be found, for instance to spot business trends. Big data involves increasing volume (the amount of data), velocity (the speed at which data moves in and out), and variety (the range of data types and sources), which requires new forms of processing for decision making. It produces massive sample sizes that allow us to discover hidden patterns associated with small subsets of a big dataset. High-dimensional big data has special features such as noise accumulation and spurious correlation. Spurious correlation occurs because many uncorrelated random variables may have high sample correlation coefficients in high dimensions; such correlations lead to wrong inferences. High-dimensional data is generated in sectors such as biotech, finance, and satellite imagery, and can be stored in the form of a data matrix: web term-document data, sensor-array data, consumer financial data, etc. It is computationally infeasible to make inferences directly from the raw data. To handle big data from both the statistical and the computational views, dimension reduction is an important step before processing begins. High-dimensional data can be analyzed through classification,
62 | 2015, IJAFRC All Rights Reserved

clustering, and regression, but these methods alone are not sufficient for processing big data. The challenges in reducing high-dimensional data are as follows:
1. No formal mathematical models are available.
2. Even when models are available, proper derivations often are not, which creates confusion among developers about how to reduce the features of a high-dimensional dataset.
B. Need of Dimensionality Reduction
Dimensionality reduction is required because most machine learning and data mining techniques may not be effective for high-dimensional data: query accuracy and efficiency degrade rapidly as the dimension increases. It is also required for
visualization - projection of high-dimensional data onto 2D or 3D.
Data compression - efficient storage and retrieval.
Noise removal - positive effect on query accuracy.
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. By choosing a subset of good features with respect to the target concepts, feature subset selection is an effective way of reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility [2].
Feature selection is a process that chooses a subset of the original feature set according to some criteria. The selected features retain their original physical meaning and provide a better understanding of the data and the learning process. Depending on whether class-label information is required, feature selection can be either unsupervised or supervised. For supervised methods, the correlation of each feature with the class label is computed by distance, information, dependence, or consistency measures [3].
A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector-space or bag-of-words model. Even a moderately sized document collection can lead to a dimensionality in the thousands. This high dimensionality can be a severe obstacle for classification algorithms based on support vector machines, linear discriminant analysis, k-nearest neighbors, etc. The problem is compounded when the documents are arranged in a hierarchy of classes and a full-feature classifier is applied at each node of the hierarchy [4].
One of the primary tasks in mining data is feature extraction. The widespread digitization of information has created a wealth of data that requires novel approaches to feature extraction. Recent advances in computer technology are fueling radical changes in information management. With roots in statistics, machine learning, and information theory, data mining is emerging as a field of study in its own right. Data mining techniques have created an unprecedented opportunity for the development of automatic approaches to tasks hitherto considered intractable [6].
Feature extraction techniques have been used to handle high-dimensional data, but unfortunately very few studies provide concrete evidence on the effectiveness of these techniques, and they largely remain black boxes. Mining high-dimensional data is challenging: not only do data volumes increase significantly as the number of dimensions increases, but the number of features may be close to or greater than the number of available training instances, which makes many statistical and machine learning
procedures unsuitable for high-dimensional data. Feature selection and feature extraction methods have been used to select a subset of the original features based on some evaluation criteria [7].
Data is growing at such speed that it is difficult to handle such large amounts (exabytes). The main difficulty is that the volume is increasing rapidly in comparison to the available computing resources. Big data can be defined by properties such as variety, volume, velocity, variability, value, and complexity. Big data is different from the data stored in traditional warehouses: warehouse data undergoes a process of ETL (extraction, transformation, and loading), but this is not the case with big data, which is not suitable for storage in data warehouses [8].
Big data is also referred to as data-intensive technology, with a long tradition of working with constantly increasing volumes of data in sectors like business, social media, insurance, and health care. Modern industry is therefore trying to develop advanced big data technologies and tools [9].
Data explosion is an inevitable trend as the world becomes more connected than ever. Data are generated faster than ever; to date, about 2.5 quintillion bytes of data are created daily. According to a recent IDC survey, this speed of data generation will continue in the coming years and is expected to increase exponentially. This fact gives birth to the widely circulated concept of big data. Turning big data into insights demands an in-depth extraction of its value, which heavily relies upon, and hence boosts, deployments of massive big data systems [10].
One of the strongest new presences in contemporary life is big data: very large datasets that may be big in volume, velocity, variability, variety, and veracity. High volumes of data are generated in four areas: scientific, governmental, corporate, and personal data. One implication of big data is that humans now have a wholly different concept of, and a new way of relating to, data: where formerly everything was signal, now 99% is noise, which can lead to overwhelm, especially if the information is not adequately filtered [11].
The emerging big data paradigm, owing to its broad impact, has profoundly transformed our society and will continue to attract diverse attention from both technological experts and the general public. For instance, an IDC report predicts that, from 2005 to 2020, the global data volume will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, doubling every two years [12].
Various feature selection and feature extraction methods have been proposed to reduce high dimensionality, but these methods have been used for extracting features from textual data, for reducing the dimensionality of textual data, or for document clustering; they have not been used for extracting features from a high-dimensional dataset that comprises mixed data. Here we propose an approach that takes high-dimensional mixed data (textual data, numerical data, noisy data, etc.) and, based on some evaluation criteria, chooses a subset of features, thereby reducing the dimensionality of the dataset. The feature extraction and feature selection methods also serve as a preprocessing step for high-dimensional data, which is not possible with tools already available in the market such as Weka and RapidMiner.
A. Feature Extraction Techniques

When the input data to an algorithm is too large to be processed and is suspected to be redundant (e.g., the same measurement in both feet and meters), it can be transformed into a reduced set of features (also called a feature vector). This process is called feature extraction. The extracted features are expected to contain the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the complete initial data. Criteria for feature reduction depend on the problem setting: a) unsupervised setting: minimize the information loss; b) supervised setting: maximize the class discrimination [1][4]. The process of feature extraction is as follows:

Loading of dataset - the very first step is to load a dataset into the machine.
Extraction of features - apply an algorithm to extract the relevant features.
Stopping criterion - set a threshold that the features must satisfy.
Provide results - the features that satisfy the criterion come out as the output.

Fig1. Feature Extraction Process

1. Principal Component Analysis
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is carried out so that the first principal component has the largest possible variance and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The principal components are orthogonal because they are the eigenvectors of the symmetric covariance matrix. PCA is the simplest of the true eigenvector-based multivariate analyses. If a multivariate dataset is visualized as a set of coordinates in a high-dimensional data space, PCA can supply the user with a lower-dimensional picture. This is done by using only the first few principal components so that the dimensionality of the transformed data is reduced. The full transformation T = XW maps a data vector x from an original space of p variables to a new space of p variables that are uncorrelated over the dataset. However, not all principal components need be retained: keeping only the first L principal components, produced by using only the first L loading vectors, gives the truncated transformation T_L = X W_L, where the matrix T_L now has n rows but only L columns. Such dimensionality reduction can be a very useful step for visualizing and processing high-dimensional datasets. For example, selecting L = 2 and keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out; if the data contains clusters, these too are spread out and are therefore most visible on a two-dimensional diagram, whereas if two directions through the data are chosen at random,
the clusters may be much less spread apart from each other, and may in fact substantially overlap, which makes them indistinguishable.
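As an illustration (not part of the original paper), the truncated transformation T_L = X W_L can be sketched in a few lines of Python; the function name pca_reduce and the toy data are hypothetical, and NumPy is assumed available:

```python
import numpy as np

def pca_reduce(X, L):
    """Project the rows of X onto the first L principal components.

    A minimal PCA sketch: center the data, eigendecompose the
    covariance matrix, and keep the L leading eigenvectors.
    """
    Xc = X - X.mean(axis=0)                 # center each variable
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # sort descending by variance
    W_L = eigvecs[:, order[:L]]             # first L loading vectors
    return Xc @ W_L                         # T_L = X W_L (n rows, L columns)

# Toy data: 3 correlated variables, reduced to 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # make one variable redundant
T = pca_reduce(X, 2)
print(T.shape)   # (100, 2)
```

Selecting L = 2 here keeps only the two directions of largest variance, as described above, and the resulting columns are uncorrelated over the dataset.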
2. Multifactor Dimensionality Reduction (MDR)
Multifactor dimensionality reduction (MDR) is a data mining approach for detecting and characterizing combinations of attributes or independent variables that interact to influence a dependent or class variable. MDR was designed specifically to identify interactions among discrete variables that influence a binary outcome and is considered a nonparametric alternative to traditional statistical methods such as logistic regression. The basis of the MDR method is a constructive induction algorithm that converts two or more variables or attributes into a single attribute. This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of nonlinear or nonadditive interactions among the attributes, such that prediction of the class variable is improved over that of the original representation of the data. Consider the following simple example using the exclusive OR (XOR) function. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function, Y = X1 XOR X2.
Table 1. XOR Function
X1  X2  Y
0   0   0
0   1   1
1   0   1
1   1   0
Any data mining algorithm applied to the above example would need to approximate the XOR function in order to accurately predict Y. An alternative is to use the MDR constructive induction algorithm, which changes the representation of the data. The MDR algorithm changes the representation by selecting two attributes, here X1 and X2. Each combination of values of X1 and X2 is examined, and the number of times Y = 1 and/or Y = 0 is counted. With MDR, the ratio of these counts is computed and compared to a fixed threshold. Here, for X1 = 0 and X2 = 0, the ratio of counts is 0/1, which is less than the fixed threshold of 1; since 0/1 < 1, the new attribute (Z) is encoded as 0. When the ratio is greater than one, Z is encoded as 1. This process is repeated for all unique combinations of values of X1 and X2.
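The counting-and-threshold step just described can be sketched as follows; the helper name mdr_encode and the toy rows are illustrative, not from the paper:

```python
from collections import Counter

def mdr_encode(rows, threshold=1.0):
    """Constructive-induction step of MDR (a simplified sketch).

    For every (x1, x2) value combination, count how often y == 1
    versus y == 0, and encode a new attribute Z as 1 when the
    ratio exceeds the threshold, else 0.
    """
    counts = {}
    for x1, x2, y in rows:
        c = counts.setdefault((x1, x2), Counter())
        c[y] += 1
    encoding = {}
    for combo, c in counts.items():
        ratio = c[1] / c[0] if c[0] else float("inf")
        encoding[combo] = 1 if ratio > threshold else 0
    return encoding

# XOR dataset from Table 1: Y = X1 XOR X2.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
z = mdr_encode(rows)
print(z)   # {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

The single constructed attribute Z reproduces the XOR relationship, which is exactly the representation change the method aims for.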
3. Independent Component Analysis (ICA)
ICA finds the independent components (also called factors, latent variables, or sources) by maximizing the statistical independence of the estimated components. ICA uses the two broadest definitions of independence: 1) Minimization of mutual information: the minimization-of-mutual-information (MMI) family of ICA algorithms uses measures like Kullback-Leibler divergence and maximum entropy. 2) Maximization of non-Gaussianity: the non-Gaussianity family of ICA algorithms, motivated by the central limit theorem, uses kurtosis and negentropy. Typical algorithms for ICA use centering (subtracting the mean to create a zero-mean signal), whitening (usually with the eigenvalue decomposition), and dimensionality reduction as preprocessing steps in order to simplify and reduce the complexity of the problem for the actual iterative algorithm. Whitening and dimension
reduction can be achieved with principal component analysis or singular value decomposition. Whitening
ensures that all dimensions are treated equally before the algorithm is run. Well-known algorithms for
ICA include infomax, FastICA, and JADE, but there are many others. In the general definition of ICA, the data are represented by the random vector x = (x1, ..., xm)^T and the components by the random vector s = (s1, ..., sn)^T. The task is to transform the observed data x, using a linear transformation W, as s = Wx, into maximally independent components s, measured by some function F(s1, ..., sn) of independence.
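As a hedged illustration (not from the paper), a miniature FastICA-style routine in NumPy shows the preprocessing (centering and whitening) followed by a fixed-point iteration with symmetric decorrelation; the function fast_ica and the toy signals are assumptions for the example, and this is a sketch rather than a production algorithm:

```python
import numpy as np

def fast_ica(X, n_iter=200, seed=0):
    """A minimal symmetric FastICA sketch (tanh nonlinearity).

    Preprocessing: center and whiten the data; then iterate to
    increase the non-Gaussianity of the projected components.
    """
    n, m = X.shape                      # n samples, m mixed signals
    Xc = X - X.mean(axis=0)             # centering
    cov = np.cov(Xc, rowvar=False)
    d, E = np.linalg.eigh(cov)
    Z = Xc @ E / np.sqrt(d)             # whitening: unit covariance
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, m))
    for _ in range(n_iter):
        G = np.tanh(Z @ W.T)            # g(w^T z)
        Gp = 1.0 - G ** 2               # g'(w^T z)
        W_new = (G.T @ Z) / n - np.diag(Gp.mean(axis=0)) @ W
        u, s_vals, vt = np.linalg.svd(W_new)
        W = u @ vt                      # symmetric decorrelation
    return Z @ W.T                      # estimated sources s = W z

# Two independent sources mixed linearly: s = Wx recovers them
# only up to order, sign, and scale.
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
A = np.array([[1.0, 0.5], [0.7, 1.2]])   # mixing matrix
S_est = fast_ica(S @ A.T)
print(S_est.shape)                        # (2000, 2)
```

Because the decorrelation keeps W orthogonal on whitened data, the estimated components remain uncorrelated throughout the iteration.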

4. Neural Network Approach

When the number of hidden units is less than the number of inputs, the hidden layer performs a dimensionality-reduction operation. The hidden units are trained by gradient descent to (locally) minimize the squared output classification/regression error. Each synthesized dimension (each hidden unit) is a logistic function of the inputs, which allows networks with multiple hidden layers. In this approach, a set of feature vectors is given to the neurons for processing. If the output is > 0, the reduced set of features belongs to class 1; otherwise it belongs to class 0.

Fig 2. Learning process of Neural Network
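As a hedged illustration of this idea (not the paper's network), the following NumPy sketch trains a one-hidden-layer autoencoder in which three logistic hidden units compress four inputs, with the weights fit by gradient descent on the squared reconstruction error; the function name, learning rate, and toy data are all hypothetical:

```python
import numpy as np

def train_autoencoder(X, n_hidden, lr=0.1, epochs=1000, seed=0):
    """One-hidden-layer autoencoder sketch.

    With fewer hidden units than inputs, the logistic hidden layer
    learns a reduced representation; both weight matrices are fit
    by gradient descent on the squared reconstruction error.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden))
    W2 = rng.normal(scale=0.1, size=(n_hidden, d))
    for _ in range(epochs):
        H = 1.0 / (1.0 + np.exp(-(X @ W1)))   # logistic hidden units
        Y = H @ W2                            # linear reconstruction
        err = Y - X
        dW2 = H.T @ err / n
        dH = err @ W2.T * H * (1.0 - H)       # back-propagate through logistic
        dW1 = X.T @ dH / n
        W1 -= lr * dW1
        W2 -= lr * dW2
    return W1, W2

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + X[:, 1]                   # 4 inputs, ~3 true dimensions
W1, W2 = train_autoencoder(X, n_hidden=3)
H = 1.0 / (1.0 + np.exp(-(X @ W1)))           # reduced 3-dimensional features
print(H.shape)   # (200, 3)
```

The hidden activations H are the reduced features; the network reconstructs the four inputs from only three synthesized dimensions.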

5. Comparison between Different Feature Extraction Methods

Table 2. Goal of each method
PCA - To minimize the reprojection error.
MDR - To discover interactions among attributes that improve the prediction of the class variable.
ICA - To minimize the statistical dependence between the vectors.
Neural Network - To train the neural network so as to minimize the network error.

Table 3. Description of each method
PCA - Uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components.
MDR - A data mining approach that detects and characterizes the combinations of attributes or independent variables that interact to influence a class variable.
ICA - Finds the latent variables by maximizing the statistical independence of the estimated components.
Neural Network - Uses hidden layers for the reduction operation, where each synthesized dimension is a logistic function of the inputs.

Table 4. Advantages of each method
PCA - Basis vectors are less expensive to compute.
MDR - Changes the representation of the data to accurately predict the class variable.
ICA - Vectors are spatially localized.
Neural Network - Uses a gradient descent method to locally minimize the squared output error.

Table 5. Disadvantages of each method
PCA - Vectors are less spatially localized.
MDR - Mining patterns with MDR from real data is computationally complex.
ICA - Vectors are neither orthogonal nor in order.
Neural Network - Neural networks are difficult to model because a small change in a single input will affect the entire network.
B. Feature Selection Techniques

Feature selection is a process that selects a subset of the original features by rejecting irrelevant and/or redundant features according to certain criteria. The relevancy of a feature is typically measured by its discriminating ability to enhance the predictive accuracy of a classifier, or by cluster goodness for a clustering algorithm. Generally, feature redundancy is defined by correlation: two features are redundant to each other if their values are correlated [2][3][5].
The feature selection process comprises four steps, which can be explained through the diagram below.


Fig 3. Feature Selection process


Generation: select a candidate feature subset.
Evaluation: compute the relevancy value of the subset.
Stopping criterion: determine whether the subset is relevant.
Validation: verify the validity of the subset.

The feature selection methods are described as follows.

1. Filter Model
It separates feature selection from classifier learning. It relies on four types of criteria (information, distance, dependence, and consistency) to evaluate features from any dataset without using any data mining algorithm. The methods of the filter model are as follows:
1. Information Gain (IG)
Information measures typically determine the information gain, or reduction in entropy, when the dataset is split on a feature.
2. Correlation Coefficient (CC)
The correlation coefficient is a numerical way to quantify the relationship between two features.
3. Symmetric Uncertainty (SU)
Features are selected based on the highest symmetric uncertainty values between the feature and the target classes.

Fig 4. Filter model process
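For instance, the information-gain criterion above can be computed directly from label counts; the toy features f1 and f2 below are hypothetical:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum p(y) log2 p(y)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in entropy of the labels when the dataset is split on the feature."""
    n = len(labels)
    split = {}
    for f, y in zip(feature, labels):
        split.setdefault(f, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

# Toy data: the first feature predicts y perfectly, the second is uninformative.
y  = [0, 0, 1, 1]
f1 = ['a', 'a', 'b', 'b']
f2 = ['a', 'b', 'a', 'b']
print(information_gain(f1, y))   # 1.0 (one full bit of information)
print(information_gain(f2, y))   # 0.0
```

A filter model would rank features by this score and keep the top ones, without ever running a classifier.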

2. Wrapper model
Wrapper methods require a predetermined mining algorithm for evaluating the generated subsets of features. They usually give superior performance because they find features better suited to the predetermined mining algorithm. Within the wrapper category, predictive accuracy is used for classification and cluster goodness for clustering. Commonly used wrapper methods are as follows:
1. K-nearest neighbor classifier
This method finds the neighborhood based on Euclidean distance; testing samples are assigned to the class most frequently represented among the k nearest training samples.
2. Linear Discriminant Analysis (LDA)
To guarantee maximal separability, LDA maximizes the ratio of the between-class (inter-class) variance to the within-class (intra-class) variance in any particular dataset.
3. Support Vector Machine (SVM)
SVM is a method for classification of both linear and non-linear data. It uses a non-linear mapping to transform the original training data into a higher dimension, within which it searches for the linear optimal decision boundary separating one class from another.
4. Bayesian Classifier
The Bayesian classifier is a statistical classifier that can predict the probability that a given instance belongs to a particular class. In theory, Bayesian classifiers have the minimum error rate with respect to other classifiers; in practice, however, such classifiers may not be suitable for high-dimensional feature spaces.

Fig 5. Wrapper model process
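To make the wrapper idea concrete, here is a hypothetical sketch (not from the paper) of greedy forward selection wrapped around a k-nearest-neighbor classifier, scoring each candidate subset by its leave-one-out predictive accuracy:

```python
import numpy as np

def knn_loo_accuracy(X, y, k=3):
    """Leave-one-out accuracy of a k-NN classifier (Euclidean distance)."""
    n = len(y)
    correct = 0
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                          # leave the i-th sample out
        nearest = np.argsort(d)[:k]
        votes = np.bincount(y[nearest])
        correct += votes.argmax() == y[i]
    return correct / n

def forward_select(X, y, n_features):
    """Greedy wrapper: add whichever feature most improves k-NN accuracy."""
    chosen = []
    remaining = list(range(X.shape[1]))
    while len(chosen) < n_features:
        best = max(remaining,
                   key=lambda j: knn_loo_accuracy(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = np.c_[y + 0.1 * rng.normal(size=60),      # informative feature
          rng.normal(size=60)]                 # irrelevant feature
print(forward_select(X, y, 1))   # [0]
```

Because the score is the accuracy of the very classifier that will be deployed, the wrapper picks the informative feature and discards the noise, at the cost of retraining for every candidate subset.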

Our proposed approach implements the Symmetric Uncertainty Feature Selection (SUFS) method, which selects the features that have the highest symmetric uncertainty values between the feature and the target classes. The symmetric uncertainty is derived from the mutual information by normalizing it to the entropies of the feature values and the target classes. This method uses information gain and entropy, which is appropriate for high-dimensional data, and it also acts as a preprocessing step before giving the data to any clustering, classification, or regression algorithm for further processing.

The symmetric uncertainty is given by SU(X, Y) = 2 IG(X | Y) / (H(X) + H(Y)), where the entropy of a variable X is found by H(X) = -Σ p(x) log2 p(x).

A. The algorithm can be described as follows

Step 1: Input the dataset, which contains the features and their values.
Step 2: Calculate the relevance value of each feature using the symmetric uncertainty formula.
Step 3: Take the average of all relevance values and call it the threshold value. Check each feature's relevance against the threshold value one by one: if the relevance is greater than the threshold value, insert that feature as a relevant one, else as an irrelevant one.
Fig 6. SUFS Algorithm
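The three steps can be sketched as follows; this is an illustrative reading of the algorithm with hypothetical feature names and toy values, not the authors' implementation:

```python
from collections import Counter
from math import log2

def entropy(values):
    """H(X) = -sum p(x) log2 p(x)."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(feature, target):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), with IG = H(X) - H(X|Y)."""
    n = len(target)
    groups = {}
    for x, y in zip(feature, target):
        groups.setdefault(y, []).append(x)
    h_x_given_y = sum(len(g) / n * entropy(g) for g in groups.values())
    ig = entropy(feature) - h_x_given_y
    denom = entropy(feature) + entropy(target)
    return 2 * ig / denom if denom else 0.0

def sufs(features, target):
    """SUFS sketch: the threshold is the mean relevance (Step 3)."""
    relevance = {name: symmetric_uncertainty(col, target)
                 for name, col in features.items()}
    threshold = sum(relevance.values()) / len(relevance)
    relevant = [f for f, r in relevance.items() if r > threshold]
    irrelevant = [f for f, r in relevance.items() if r <= threshold]
    return relevant, irrelevant

# Hypothetical miniature dataset in the spirit of the insurance example.
target = [1, 1, 0, 0]
features = {"Sum Assured":    ['h', 'h', 'l', 'l'],   # tracks the target
            "Bill Frequency": ['m', 'q', 'm', 'q']}   # unrelated
print(sufs(features, target))   # (['Sum Assured'], ['Bill Frequency'])
```

The feature whose values track the target class receives the higher SU score and survives the mean-relevance threshold.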
A. Dataset Description
The proposed algorithm uses an insurance dataset comprising 69 dimensions. The attributes of the dataset are as follows:
1. Contact No
2. Product Code
3. Prod long des
4. Sum Assured
5. Bill Frequency
6. Premium cessation term
10. Premium_status_description nonmed
Likewise there are more attributes, and they can be visualized through the graphs given below.


Fig 7
2. Agent branch short description

Fig 8
3. Agent Client Id

Fig 9
4. Agent Full Name

Fig 10
The above dataset is given as input to the SUFS algorithm described above, and the output is the reduced set of relevant features that satisfy the criteria.
B. Output


Fig 11.Relevant Features

Fig 12.Irrelevant Features

In conclusion, a large dataset cannot be given directly to any clustering or classification algorithm for further processing; the dataset must first be reduced properly using the feature selection or feature extraction methods described above. Many authors have used chi-squared tests against such large datasets but have been unable to provide satisfactory results. There are many feature selection methods, such as information gain, correlation coefficient, and symmetric uncertainty. In this paper we
presented the Symmetric Uncertainty Feature Selection (SUFS) method, which is suitable for reducing a high-dimensional dataset, and with it we obtained satisfactory results.
As further work, the relevant-features output shown above can be given to any clustering, classification, or regression algorithm for further processing. We are now focusing on implementing a new clustering algorithm for the given dataset.

References
[1] V. Seshadri, Sholom M. Weiss, and Raguram Sasisekharan, KDD-95 Proceedings (1995).
[2] M. Mohamed Musthafa, R. Rokit Kumar, "A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data", IJETS (2014).
[3] Tao Liu, Shengping Liu, Zheng Chen, Wei-Ying Ma, "An Evaluation on Feature Selection for Text Clustering", ICML (2003).
[4] Inderjit S. Dhillon, Subramanyam Mallela, Rahul Kumar, "A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification", Journal of Machine Learning Research 3.
[5] Keerthiram Murugesan, Jun Zhang, "A New Term Weighting Scheme for Document".
[6] Jirada Kuntraruk and William M. Pottenger, "Massively Parallel Distributed Feature Extraction in Textual Data Mining Using HDDI", IEEE (2001).
[7] Jianting Zhang, Le Gruenwald, "Opening the Black Box of Feature Extraction: Incorporating Visualization into High-Dimensional Data Mining Process", IEEE (2006).
[8] Avita Katal, Mohammad Wazid, R. H. Goudar, "Big Data: Issues, Challenges, Tools and Good Practices", IEEE (2013).
[9] Yuri Demchenko, Cees de Laat, Peter Membrey, "Defining Architecture Components of the Big Data Ecosystem", IEEE (2014).
[10] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu, "BigDataBench: a Big Data Benchmark Suite from Internet Services", IEEE (2014).
[11] Melanie Swan, "Philosophy of Big Data: Expanding the Human-Data Relation with Big Data Science Services", IEEE (2015).
[12] Han Hu, Yonggang Wen, Tat-Seng Chua, and Xuelong Li, "Toward Scalable Systems for Big Data Analytics: A Technology Tutorial", IEEE (2014).
