(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010
 
A Comparative Study of Microarray Data Classification with Missing Values Imputation

Kairung Hengpraphrom 1, Sageemas Na Wichian 2 and Phayung Meesad 3

1 Department of Information Technology, Faculty of Information Technology
2 Department of Social and Applied Science, College of Industrial Technology
3 Department of Teacher Training in Electrical Engineering, Faculty of Technical Education
King Mongkut's University of Technology North Bangkok, 1518 Piboolsongkram Rd., Bangsue, Bangkok 10800, Thailand
kairung2004@yahoo.com, sgm@kmutnb.ac.th, pym@kmutnb.ac.th
Abstract—Incomplete data is an important problem in data mining; the consequent downstream analysis becomes less effective. Most algorithms for statistical data analysis need a complete set of data. Microarray data usually consist of a small number of samples with high dimensionality and a number of missing values. Many missing value imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between the missing value imputation method and classification accuracy. In this paper we carry out experiments with the Colon Cancer dataset to evaluate the effectiveness of four methods for dealing with missing value imputation: the Row average method, KNN imputation, KNNFS imputation, and a Multiple Linear Regression imputation procedure. The considered classifier is the Support Vector Machine (SVM).

Keywords: KNN, Regression, Microarray, Imputation, Missing Values
I. INTRODUCTION
Microarray data represent the expression of thousands of genes at the same time. As with many types of experimental data, expression data obtained from microarray experiments are frequently peppered with missing values (MVs) that may occur for a variety of reasons, such as insufficient resolution, image corruption, dust, scratches on the slide, or errors in the experimental process. Many data mining techniques have been proposed to identify regulatory patterns or similarities in expression under similar conditions. For the analysis to be efficient, data mining techniques such as classification [1-3] and clustering [4-5] require that the microarray data be complete, with no missing values [6]. One solution to the missing data problem is to repeat the experiment, but this is time consuming and very expensive [7]. Replacing the missing values by zero or by the average value can be helpful instead of eliminating the missing-value records [8], but these two simple methods are not very effective.

Consequently, many algorithms have been developed to accurately impute MVs in microarray experiments; for example, K-Nearest Neighbor, Singular Value Decomposition, and the Row average method have been proposed to estimate missing values in microarrays. KNN imputation was found to be the best among the three methods [9]. However, there are still some points to improve. Many imputation techniques have been proposed to resolve the missing value problem. For example, Troyanskaya et al. [9] proposed KNN imputation and compared it with the Singular Value Decomposition and Row average methods; the results showed that the KNN imputation method is better than the Row average method. Oba et al. [10] proposed an imputation method called Bayesian Principal Component Analysis (BPCA) and claimed that BPCA can estimate the missing values better than KNN and SVD. Another efficient method was proposed by Zhou et al. [11]; it automatically selects gene parameters for estimation of missing values using linear and nonlinear regression, and its key benefit is quick estimation. Kim et al. [12] proposed local least squares (LLS) imputation, which exploits the similarity of the data structure in a least squares optimization; this method is very robust. Later, Robust Least Squares Estimation with Principal Components (RLSP) was proposed by Yoon et al. [13] to improve the efficiency of the previous methods; the RLSP imputation method showed better performance than KNN, LLS, and BPCA. In these studies the NRMSE is calculated to measure the imputation performance, since the original values are known.

Many missing value imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between the missing value imputation method and classification accuracy. In this paper, we carry out a model-based analysis to investigate how different properties of a dataset influence imputation and classification, and how imputation affects classification performance. We compare four imputation algorithms: the Row average method, KNN
imputation, KNNFS imputation, and the Multiple Linear Regression imputation method, to measure how well the imputed dataset can preserve the discriminative power residing in the original dataset. The Support Vector Machine (SVM) is used as the classifier in this work.

The remainder of this paper is organized as follows. Section II provides theory and related work. The details of the proposed methodology are given in Section III. Section IV illustrates the simulation and comparison results. Finally, concluding remarks are given in Section V.
II. RELATED WORK
A. Microarray Data
Every cell of a living organism contains a full set of chromosomes and identical genes. Only a portion of these genes are turned on, and it is this subset that is expressed, conferring distinctive properties on each cell category.

There are two most important application forms for DNA microarray technology: 1) identification of sequence (gene/gene mutation) and 2) determination of the expression level (abundance) of genes in one sample, or comparison of gene transcription in two or more different kinds of cells. In data preparation, DNA microarrays are small, solid supports onto which the sequences from thousands of different genes are attached at fixed locations. The supports themselves are usually glass microscope slides, the size of two side-by-side small fingers, but can also be silicon chips or nylon membranes. The DNA is printed, spotted, or actually synthesized directly onto the support. With the aid of a computer, the amount of mRNA binding to the spots on the microarray is precisely measured, which generates a profile of gene expression in the cell. The generating process usually produces a lot of missing values, resulting in less efficient downstream computational analysis [14].
B. K-Nearest Neighbor (KNN)
Due to its simplicity, the K-Nearest Neighbor (KNN) method is one of the best-known methods to impute missing values in microarray data. The KNN method imputes missing values by selecting genes with expression values similar to the gene of interest. The steps of KNN imputation are as follows.

Step 1: Choose K genes that are most similar to the gene with the missing value (MV). In order to estimate the missing value $x_{ij}$ of the $i$-th gene in the $j$-th sample, K genes are selected whose expression vectors are similar to the expression of gene $i$ in samples other than $j$.

Step 2: Measure the distance between two expression vectors $x_i$ and $x_j$ by using the Euclidean distance over the observed components in the $j$-th sample. The Euclidean distance between $x_i$ and $x_j$ can be calculated from (1):

$$d_{ij} = dist(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2} \quad (1)$$

where $dist(x_i, x_j)$ is the Euclidean distance between samples $x_i$ and $x_j$; $n$ is the number of features or dimensions of the microarray; and $x_{ik}$ is the $k$-th feature of sample $x_i$.

Step 3: Estimate the missing value as an average of the K nearest neighbors' corresponding entries in the selected K expression vectors by using (2):

$$\hat{x}_{ij} = \frac{1}{K}\sum_{k=1}^{K} X_{kj}, \qquad X_k \in X \mid \{d_1 \le d_2 \le \dots \le d_M\} \quad (2)$$

where $\hat{x}_{ij}$ is the estimated missing value at the $i$-th gene in the $j$-th sample; $k$ is the rank in distance of the neighbor; $X$ is the input matrix containing the $k$-th ranked nearest-neighbor gene expressions; and $M$ is the total number of samples in the training data.
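To make the procedure concrete, the following Python/NumPy sketch implements KNN imputation as described by equations (1) and (2). The function name knn_impute, the default K, and the genes-by-samples orientation are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def knn_impute(data, K=10):
    """KNN imputation following Eqs. (1) and (2): for each missing entry,
    average the values of the K genes (rows) closest in Euclidean distance
    over commonly observed columns. Illustrative sketch, not the authors' code."""
    X = np.array(data, dtype=float)          # genes x samples, NaN marks a missing value
    imputed = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        target = X[i]
        obs = ~np.isnan(target)              # columns observed for the target gene
        obs[j] = False                       # exclude the missing column itself
        dists = []
        for g in range(X.shape[0]):
            if g == i or np.isnan(X[g, j]):
                continue                     # a candidate must have a value at column j
            common = obs & ~np.isnan(X[g])
            if not common.any():
                continue
            d = np.sqrt(np.sum((target[common] - X[g, common]) ** 2))  # Eq. (1)
            dists.append((d, X[g, j]))
        dists.sort(key=lambda t: t[0])
        neighbours = [v for _, v in dists[:K]]
        if neighbours:                       # Eq. (2): average of the K nearest neighbours
            imputed[i, j] = np.mean(neighbours)
    return imputed
```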
C. The Algorithm of KNNFS
The algorithm combining KNN-based feature selection with KNN-based imputation is as follows [15].

Phase 1: Feature Selection
Step 1: Initialize the number of features to select;
Step 2: Calculate the feature distance between each $X_j$, $j = 1, \dots, col$, and $X_{miss}$ (the feature with missing values) by using (1);
Step 3: Sort the feature distances in ascending order;
Step 4: Select the features with the minimum distances;

Phase 2: Imputation of Missing Values
Step 5: Initialize the number of samples to select;
Step 6: Use the selected features to calculate the sample distance between each $R_i$, $i = 1, \dots, row$, and $R_{miss}$ (the row with missing values) by using (1);
Step 7: Sort the sample distances in ascending order;
Step 8: Select the samples with the minimum distances;
Step 9: Use the selected samples to estimate the missing value as the average of the most similar values by using (2).
 
D. Multiple Linear Regression
Multiple linear regression (MLR) is a method used to model the linear relationship between a dependent variable and one or more independent variables. The dependent variable is sometimes also called the predictand, and the independent variables are called the predictors. The model expresses the value of the predictand variable as a linear function of one or more predictor variables and an error term:

$$y_i = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \dots + b_k x_{k,i} + e_i \quad (3)$$

where $x_{j,i}$ is the value of the $j$-th predictor in case $i$; $b_0$ is the regression constant; $b_j$ is the coefficient on the $j$-th predictor; $k$ is the total number of predictors; $y_i$ is the predictand in case $i$; and $e_i$ is the error term.
The model (3) is estimated by least squares, which yields parameter estimates such that the sum of squared errors is minimized. The resulting prediction equation is

$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{1,i} + \hat{b}_2 x_{2,i} + \dots + \hat{b}_k x_{k,i} \quad (4)$$

where the variables are defined as in (3), except that the hat denotes estimated values.
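As a concrete illustration of equations (3) and (4), the sketch below fits the regression by ordinary least squares and evaluates the prediction equation at the sample with the missing value. Treating the gene with the missing entry as the predictand and a set of complete genes as predictors, as well as the name mlr_impute_entry, are our assumptions for illustration rather than details taken from the paper.

```python
import numpy as np

def mlr_impute_entry(y, X, j_missing):
    """Estimate the missing value y[j_missing] by multiple linear regression.
    y : expression vector of the gene with one missing value (length m)
    X : (m x k) matrix of predictor genes with no missing values
    Illustrative sketch of Eqs. (3) and (4), not the authors' implementation."""
    obs = np.arange(len(y)) != j_missing
    # Design matrix with an intercept column, fitted on the observed samples only
    A = np.column_stack([np.ones(obs.sum()), X[obs]])
    # Least-squares estimates b_hat minimizing the sum of squared errors (Eq. 3)
    b_hat, *_ = np.linalg.lstsq(A, y[obs], rcond=None)
    # Prediction equation (Eq. 4) evaluated at the sample with the missing value
    a_miss = np.concatenate([[1.0], X[j_missing]])
    return float(a_miss @ b_hat)

# Example: estimate a missing value in gene y from three predictor genes
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=10)
print(mlr_impute_entry(y, X, j_missing=4))
```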
 
III. THE EXPERIMENTAL DESIGN
To compare the performance of the KNN, Row, Regression, and KNNFS imputation algorithms, NRMSE was used to measure the experimental results. The missing value estimation techniques were tested by randomly removing data values and then computing the estimation error. In the experiments, between 1% and 10% of the values were removed from the dataset randomly. Next, the four imputation algorithms mentioned above were applied separately to estimate the missing values, and the imputed (complete) data were then used for accuracy measurement (NRMSE and classification accuracy with the SVM classifier). The overall process is shown in Fig. 1.
Figure 1. Simulation flow chart.
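One trial of this simulation flow could be sketched in Python as follows. The scikit-learn SVC classifier with a linear kernel and 5-fold cross-validation are assumptions made for illustration; the paper does not state its SVM configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def run_trial(data, labels, impute_fn, miss_rate=0.05, seed=0):
    """One trial of the simulation flow in Fig. 1: artificially delete entries,
    impute them, then measure SVM classification accuracy on the imputed data.
    Illustrative sketch; classifier settings are assumptions."""
    rng = np.random.default_rng(seed)
    X = np.asarray(data, dtype=float)
    mask = rng.random(X.shape) < miss_rate     # entries to treat as missing
    X_missing = X.copy()
    X_missing[mask] = np.nan
    X_imputed = impute_fn(X_missing)           # e.g. knn_impute or knnfs_impute above
    acc = cross_val_score(SVC(kernel="linear"), X_imputed, labels, cv=5).mean()
    return X, X_imputed, mask, acc
```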
 
To test the effectiveness of the different imputation algorithms, the Colon Cancer dataset was used. The data were collected from 62 patients: 40 tumor and 22 normal cases. The dataset has 2,000 selected genes. It is clean and contains no missing values. The effectiveness of missing value imputation was measured by the Normalized Root Mean Squared Error (NRMSE) [12], as shown in equation (5).
$$NRMSE = \frac{\sqrt{mean[(y_{guess} - y_{ans})^2]}}{std[y_{ans}]} \quad (5)$$

where $y_{guess}$ is the estimated value, $y_{ans}$ is the prototype gene's value, and $std[y_{ans}]$ is the standard deviation of the prototype gene.
 
IV. THE EXPERIMENTAL RESULTS
To evaluate the effectiveness of the imputation methods, the NRMSE values were computed using each algorithm as described above. The experiment was repeated 10 times and the average is reported as the result. The experimental results are shown in Table I and Fig. 2. Table I and Fig. 2 show the NRMSE of the estimation error for the Colon Tumor data. The results show that the Regression method has a lower NRMSE compared to the other methods.
TABLE I. NORMALIZED ROOT MEAN SQUARE ERROR OF MISSING-VALUE IMPUTATION FOR COLON CANCER DATA

%Miss   Row     KNN     KNNFS   Regression
1       0.6363  0.5486  0.4990  0.4049
2       0.6121  0.5366  0.4918  0.4103
3       0.6319  0.5606  0.5173  0.4282
4       0.6339  0.5621  0.5169  0.4251
5       0.6301  0.5673  0.5267  0.4410
6       0.6281  0.5634  0.5212  0.4573
7       0.6288  0.5680  0.5254  0.4415
8       0.6382  0.5882  0.5534  0.4548
9       0.6310  0.5858  0.5481  0.4418
10      0.6296  0.5849  0.5483  0.4450
Figure 2. Normalized root mean square error of missing value imputation for Colon Cancer data.
The classification accuracy obtained with the SVM classifier is summarized in Table II and Fig. 2. The experimental results show that the accuracy of the Row average method ranges between 82.10% and 83.39%, while the neighbour-based methods (KNN, KNNFS) gave results between 82.90% and 84.77%, and the Regression method ranges between 82.90% and 84.84%.