(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, 2011
A PCA Based Feature Extraction Approach for the Qualitative Assessment of Human Spermatozoa
V. S. Abbiramy
Department of Computer Applications, Velammal Engineering College, Chennai, India
vml_rithi@yahoo.co.in

Dr. V. Shanthi
Department of Computer Applications, St. Joseph's Engineering College, Chennai, India
drvshanthi@yahoo.co.in
Abstract—Feature extraction is often applied in machine learning to remove distracting variance from a large and complex dataset, so that downstream classifiers or regression estimators can perform better. The computational expense of subsequent data processing can be reduced with a lower dimensionality. Reducing data to two or three dimensions facilitates visualization and further analysis by domain experts. The data components (parameters) measured in computer assisted semen analysis have complicated correlations and their total number is also large. This paper presents Principal Component Analysis to de-correlate components, reduce dimensions and to extract relevant features for classification. It also compares the computation of Principal Component Analysis using the Covariance method and Correlation method. Covariance based Principal Component Analysis was found to be more efficient in reducing the dimensionality of the feature space.
Keywords-Covariance based PCA; Correlation based PCA; Dimensionality Reduction; Eigen Values; Eigen Vectors; Feature Extraction; Principal Component Analysis.
I. INTRODUCTION
The term data mining refers to the analysis of large datasets. Humans often have difficulty comprehending data in many dimensions, and algorithms that operate on high-dimensional data tend to have very high time complexity. Many machine learning algorithms and data mining techniques struggle with high-dimensional data; this has become known as the curse of dimensionality. Reducing data into fewer dimensions often makes analysis algorithms more efficient and can help machine learning algorithms make more accurate predictions. Data reduction is therefore an important operation in the data preparation step.

Data reduction obtains a reduced representation of the dataset that is much smaller in volume, yet produces the same analytical results. One of the strategies used for data reduction is dimensionality reduction, which reduces the data set size by removing attributes or dimensions that may be irrelevant to the mining task. The best and worst attributes are determined using tests of statistical significance, which assume that the attributes are independent of one another. Dimensionality reduction can be divided into feature selection and feature extraction.

Feature selection approaches try to find a subset of the original variables. Two strategies are filter (e.g. information gain) and wrapper (e.g. search guided by accuracy) approaches. Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. The transformation techniques can be categorized into two groups: linear and non-linear methods. Linear methods use linear transforms (projections) for dimensionality reduction, while non-linear methods use non-linear transforms for the same purpose. The linear techniques include PCA, LDA, 2DPCA, 2DLDA and ICA. The non-linear dimensionality reduction techniques include KPCA and KFD.

II. RELATED WORK

Several linear and non-linear methods have been discussed in the survey of dimension reduction techniques paper [1]. The paper reviews current linear dimensionality reduction techniques, such as PCA and Multidimensional Scaling (MDS), and nonlinear dimensionality reduction techniques, such as Isometric Feature Mapping (Isomap), Locally Linear Embedding (LLE), Hessian Locally Linear Embedding (HLLE) and Local Tangent Space Alignment (LTSA). In order to apply nonlinear dimensionality reduction techniques effectively, the neighborhood, the density and the noise levels need to be taken into account [2]. A brief survey of dimensionality reduction methods for classification, data analysis and interactive visualization was given in [10].

In paper [3], an effort has been made to predict the suitable time period within a year for the mustard plant by considering the total effect of environmental parameters using the methods of factor analysis and principal component analysis. Another paper proposes a mechanism for comparing and evaluating the effectiveness of dimensionality reduction techniques in the visual exploration of text document archives; multivariate visualization techniques and interactive visual exploration are studied [4].
The paper compares four different dimensionality reduction techniques, namely PCA, Independent Component Analysis (ICA), Random Mapping (RM) and a statistical noise reduction algorithm, and their performance is evaluated in the context of text retrieval [5]. Paper [6] examines a rough set feature selection technique which uses the information gathered from both the lower approximation dependency value and a distance metric that considers the number of objects in the boundary region and the distance of those objects from the lower approximation. The use of this measure in rough set feature selection can result
in smaller subset sizes than those obtained using the dependency function alone. The method proposed in paper [7] is a bi-level dimensionality reduction method that integrates a filter method and a feature extraction method with the aim of improving the classification performance of the selected features. In level 1 of dimensionality reduction, features are selected based on mutual correlation, and in level 2 the selected features are used to extract features using PCA or LPP.

Another paper presents an independent component analysis (ICA) approach to DR, called ICA-DR, which uses mutual information as a criterion to measure data statistical independency that exceeds second-order statistics. As a result, ICA-DR can capture information that cannot be retained or preserved by second-order statistics-based DR techniques [8]. In paper [9], KPCA is used as a preprocessing step to extract relevant features for classification and to prevent the Hughes phenomenon. The classification was then done with a backpropagation neural network on real hyperspectral ROSIS data from an urban area, and the results compared favorably to the linear version (PCA).

The author has proposed a model and compared four dimensionality reduction techniques to reduce the feature space into an input space of much lower dimension for a neural network classifier. Among the four dimensionality reduction techniques proposed, Principal Component Analysis was found to be the most effective in reducing the dimensionality of the feature space [11]. In another study, a novel biomarker selection approach is proposed which combines singular value decomposition (SVD) and a Monte Carlo strategy for early ovarian cancer detection. Comparative study and statistical analysis show that the proposed method outperforms the SVM-RFE and T-test methods, which are typical supervised classification and differential expression detection based feature selection methods [12]. The application of three different dimension reduction techniques to the problem of classifying functions in object code form as being cryptographic in nature or not was also compared; it is demonstrated that when discarding 90% of the measured dimensions, accuracy only suffers by 1% for this problem [13].

III. PRINCIPAL COMPONENT ANALYSIS

Often, the variables under study are highly correlated and they are effectively "saying the same thing". It may be useful to transform the original set of variables into a new set of uncorrelated variables called principal components. These new variables are linear combinations of the original variables and are derived in decreasing order of importance, so that the first principal component accounts for as much as possible of the variation in the original data [20]. The goal is to transform a given data set X of dimension M to an alternative data set Y of smaller dimension L. Equivalently, we are seeking to find the matrix Y, where Y is the Karhunen–Loève transform (KLT) of matrix X:

Y = KLT\{X\}    (1)

The algorithm for computing PCA [15] using the covariance method consists of the following steps.

Step 1: Organize the data set. The microscopic examination of the spermatozoa dataset comprises N observations, each described by M variables. The dataset is arranged as a set of N data vectors X_1, ..., X_N, with each X_n representing a single grouped observation of the M variables.

Step 2: Calculate the empirical mean. Find the empirical mean along each dimension m = 1, ..., M and place it into a mean vector u of dimensions M × 1:

u[m] = \frac{1}{N} \sum_{n=1}^{N} X[m,n]    (2)

Step 3: Calculate the deviations from the mean. The input dataset is centered by subtracting the empirical mean vector u from each column of the data matrix X, and the result is stored in the M × N matrix B:

B = X - u \cdot h    (3)

where h is a 1 × N row vector of all 1s.

Step 4: Find the covariance matrix. Calculate the M × M covariance matrix C from the outer product of matrix B with itself:

C = E[B \otimes B] = E[B \cdot B^{*}] = \frac{1}{N} B \cdot B^{*}    (4)

where E is the expected value operator, \otimes is the outer product operator and ^{*} is the conjugate transpose operator.
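As an illustration of Steps 1 through 4, the following NumPy sketch builds the mean vector, the centered matrix B and the covariance matrix C for an M × N data matrix X laid out as in the text (variables as rows, observations as columns). The random data is only a stand-in for the spermatozoa measurements; this is not the paper's code.

import numpy as np

# Illustrative stand-in for the M x N data matrix X of Step 1:
# M = 15 variables (rows), N = 16 observations (columns).
rng = np.random.default_rng(1)
M, N = 15, 16
X = rng.normal(size=(M, N))

# Step 2: empirical mean u along each dimension m = 1,...,M (Eq. 2), an M x 1 vector.
u = X.mean(axis=1, keepdims=True)

# Step 3: deviations from the mean, B = X - u.h (Eq. 3); broadcasting subtracts u
# from every column, playing the role of the all-ones row vector h.
B = X - u

# Step 4: M x M covariance matrix C = (1/N) B.B* (Eq. 4); B is real here,
# so the conjugate transpose is just the transpose.
C = (B @ B.T) / N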
Step 5: Find the eigenvectors and eigenvalues of the covariance matrix. Compute the matrix V of eigenvectors which diagonalizes the covariance matrix C:

V^{-1} C V = D    (5)

where D is the diagonal matrix of eigenvalues of C. Matrix D will take the form of an M × M diagonal matrix, where

D[p,q] = \lambda_{m}   for p = q = m    (6)

and \lambda_{m} is the mth eigenvalue of the covariance matrix.

Step 6: Rearrange the eigenvectors and eigenvalues. Sort the columns of the eigenvector matrix V and the eigenvalue matrix D obtained in the previous step in order of decreasing eigenvalue.
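Steps 5 and 6 amount to an eigendecomposition of C followed by a sort. Continuing the sketch above (C carried over), numpy.linalg.eigh is a natural fit because C is symmetric; it returns eigenvalues in ascending order, so they are reordered to satisfy Step 6.

# Step 5: eigenvalues and eigenvectors of the covariance matrix C (Eqs. 5 and 6).
eigvals, V = np.linalg.eigh(C)      # ascending eigenvalues, orthonormal columns of V

# Step 6: rearrange so eigenvalues (and matching eigenvector columns) decrease.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]            # diagonal of D, largest eigenvalue first
V = V[:, order]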
Step 7: Compute the cumulative energy content for each eigenvector. The cumulative energy content g for the mth eigenvector is the sum of the energy content across all of the eigenvalues from 1 through m:

g[m] = \sum_{q=1}^{m} D[q,q]   for m = 1, ..., M    (7)
Step 8: Select a subset of the eigenvectors as basis vectors. Then save the first L columns of V as the M × L matrix W:

W[p,q] = V[p,q]   for p = 1, ..., M and q = 1, ..., L    (8)

where 1 ≤ L ≤ M. The vector g can be used as a guide in choosing an appropriate value for L, so that the cumulative energy is above a certain threshold, like 90 percent, i.e.

g[m = L] ≥ 90%
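Steps 7 and 8 then reduce to a cumulative sum over the sorted eigenvalues and a threshold test. In the continuation below, eigvals and V are carried over from the previous sketch and the 90 percent figure is the example threshold mentioned in the text.

# Step 7: cumulative energy content g[m] over the sorted eigenvalues (Eq. 7).
g = np.cumsum(eigvals)

# Step 8: pick the smallest L whose cumulative energy reaches 90% of the total,
# then keep the first L eigenvector columns as the M x L basis matrix W (Eq. 8).
L = int(np.argmax(g / g[-1] >= 0.90)) + 1
W = V[:, :L]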
Step 9: Convert the source data to z-scores. Create an M × 1 standard deviation vector s from the square root of each element along the main diagonal of the covariance matrix C:

s = \{ s[m] \} = \sqrt{C[p,q]}   for p = q = m = 1, ..., M    (9)

Also calculate the M × N z-score matrix:

Z = \frac{B}{s \cdot h}   (divide element-by-element)    (10)
Step 10: Project the z-scores of the data onto the new basis. The projected vectors are the columns of the matrix

Y = W^{*} \cdot Z = KLT\{X\}    (11)

where W^{*} is the conjugate transpose of the eigenvector matrix. The columns of matrix Y represent the Karhunen–Loève transforms (KLT) of the data vectors in the columns of matrix X.
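Steps 9 and 10 standardize the centered data and project it onto the retained basis. Continuing the same sketch (B, C and W carried over); since everything here is real-valued, the conjugate transpose W* is simply W.T.

# Step 9: standard deviation vector s from the diagonal of C (Eq. 9),
# and the z-score matrix Z = B / (s.h) computed element-by-element (Eq. 10).
s = np.sqrt(np.diag(C)).reshape(-1, 1)   # M x 1
Z = B / s                                # broadcasts the division across columns

# Step 10: project onto the new basis (Eq. 11); Y is L x N,
# one reduced-dimension column per original observation.
Y = W.T @ Z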
IV. EXPERIMENTAL RESULTS

The input dataset, shown in Table I, is constructed by combining the statistical measurement of morphological parameters of the spermatozoon given in Table 4.1 and the motility statistics given in Table 5 of papers [16] and [17]. This dataset has fifteen features, so every sample is a fifteen-dimensional vector. The covariance matrix and correlation matrix calculated from the dataset for the first five features are given in Table II and Table III.
TABLE II. COVARIANCE MATRIX FOR INPUT DATASET

              Area         Perimeter    HeadLength  HeadWidth  Eccentricity
Area          1,483.97591  1,522.06560  171.88957   55.43896   0.97301
Perimeter     1,522.06560  2,017.41106  219.77496   72.61875   3.16497
HeadLength    171.88957    219.77496    38.35266    12.39077   0.49326
HeadWidth     55.43896     72.61875     12.39077    5.24063    0.18160
Eccentricity  0.97301      3.16497      0.49326     0.18160    0.05923

TABLE III. CORRELATION MATRIX FOR INPUT DATASET

              Area     Perimeter  HeadLength  HeadWidth  Eccentricity
Area          1.00000  0.87968    0.72051     0.62865    0.10378
Perimeter     0.87968  1.00000    0.79010     0.70625    0.28952
HeadLength    0.72051  0.79010    1.00000     0.87399    0.32726
HeadWidth     0.62865  0.70625    0.87399     1.00000    0.32594
Eccentricity  0.10378  0.28952    0.32726     0.32594    1.00000
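The covariance and correlation variants compared in this paper differ only in whether each feature is scaled by its standard deviation before the eigendecomposition: correlation-based PCA is covariance-based PCA applied to standardized variables. A minimal sketch of that relationship, with illustrative random data standing in for the fifteen-feature dataset:

import numpy as np

def pca_eigenvalues(data, use_correlation=False):
    """Eigenvalues (descending) of the covariance or correlation matrix.

    data is an N x M array: N observations (rows), M features (columns).
    """
    centered = data - data.mean(axis=0)
    if use_correlation:
        centered = centered / data.std(axis=0, ddof=1)   # standardize each feature
    matrix = (centered.T @ centered) / (data.shape[0] - 1)
    return np.sort(np.linalg.eigvalsh(matrix))[::-1]

rng = np.random.default_rng(2)
samples = rng.normal(size=(16, 15))                          # stand-in observations
cov_eigs = pca_eigenvalues(samples)                          # covariance-based PCA
corr_eigs = pca_eigenvalues(samples, use_correlation=True)   # correlation-based PCA

With standardization, every feature contributes unit variance, so the correlation-based eigenvalues sum to the number of features, whereas the covariance-based eigenvalues are dominated by large-variance features such as Area and Perimeter in Table II.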
Based on the correlation matrix, eigen values are calculated. In the case of N independent variables, there are N eigen values. For the given dataset there are 15 eigen values. The proportion of total variance in the input dataset explained by the i-th principal component is simply the ratio between the i-th eigen value and the sum of all eigen values. The cumulative proportion of variance is computed by adding the current and previous proportions of variance.
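A short continuation of the earlier sketch makes these two quantities explicit; eigvals is assumed to be the vector of eigenvalues sorted in decreasing order (here they would come from the correlation matrix, as described above).

# Proportion of total variance explained by each principal component,
# and the cumulative proportion obtained as a running sum.
proportion = eigvals / eigvals.sum()
cumulative = np.cumsum(proportion)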
TABLE I. SAMPLE DATA INPUT

Area     Perimeter  HeadLen  HeadWidth  Eccentricity  MidLen   TailLen  Orientation  EquivDiam  MeanDist  MeanVelocity  A     B     C     D
36.0000  46.7032    9.8914   5.3217     0.8429        15.6200  67.5700  78.5457      6.7703     147.7534  49.2500       1.00  0.00  0.00  0.00
158.000  191.2335   23.6182  8.6646     0.9303        19.1100  66.2300  81.9619      13.9116    121.5033  40.5000       1.00  0.00  0.00  0.00
1.0000   3.6280     1.1547   1.1547     0.9600        10.2300  44.6900  0.0000       1.1284     0.0000    0.0000        0.00  0.00  0.00  1.00
37.0000  25.4530    10.1335  5.3444     0.8496        19.1200  35.5400  88.9338      6.8637     17.0721   5.6900        0.00  1.00  0.00  0.00
10.0000  11.8454    4.2583   3.2083     0.6575        14.0000  74.3100  90.0000      3.5682     66.6333   22.2100       1.00  0.00  0.00  0.00
1.0000   3.6280     1.1547   1.1547     0.0000        15.0700  44.1600  0.0000       1.1284     20.0056   6.6685        0.00  1.00  0.00  0.00
7.0000   13.4457    5.7735   1.8145     0.9493        15.3500  38.7700  45.0000      2.9854     17.0721   5.6907        0.00  1.00  0.00  0.00
7.0000   10.1697    3.8791   2.4300     0.7795        15.3500  42.2700  45.0000      2.9854     66.6330   22.2110       1.00  0.00  0.00  0.00
7.6840   53.9280    18.7500  5.0551     0.9538        28.1250  68.7600  -89.9088     81.0472    53.8103   17.9367       0.00  1.00  0.00  0.00
4.5723   35.9100    1.6600   1.0698     0.9600        2.4900   45.5600  89.7594      87.2289    58.4698   19.4899       0.00  1.00  0.00  0.00
6.4139   48.7620    8.4622   5.2565     0.9597        12.6933  32.4500  -89.8078     84.2590    36.6666   12.2222       0.00  1.00  0.00  0.00
5.6677   42.7140    9.5653   5.9086     0.9599        14.3480  65.4700  -89.9634     85.3550    30.9075   10.3025       0.00  1.00  0.00  0.00
7.1918   52.1640    2.4652   1.0893     0.9596        3.6978   45.2600  89.8741      84.4477    7.6775    2.5592        0.00  0.00  1.00  0.00
8.1603   63.2520    8.8570   5.6096     0.9605        13.2855  35.8900  -89.9996     82.8648    110.0000  36.6700       1.00  0.00  0.00  0.00
6.5885   49.8960    9.9260   5.8168     0.9593        14.8890  45.3400  -89.6868     83.5839    46.0652   15.3500       0.00  1.00  0.00  0.00
3.9684   7.6860     7.6593   4.1154     0.9618        11.4890  30.9800  89.3761      89.8604    11.9580   3.9860        0.00  0.00  1.00  0.00