
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010

**A Comparative Study of Microarray Data Classification with Missing Values Imputation**

Kairung Hengpraphrom¹, Sageemas Na Wichian² and Phayung Meesad³

¹ Department of Information Technology, Faculty of Information Technology
² Department of Social and Applied Science, College of Industrial Technology
³ Department of Teacher Training in Electrical Engineering, Faculty of Technical Education
King Mongkut's University of Technology North Bangkok, 1518 Piboolsongkram Rd., Bangsue, Bangkok 10800, Thailand
kairung2004@yahoo.com, sgm@kmutnb.ac.th, pym@kmutnb.ac.th

Abstract—Incomplete data is an important problem in data mining: the downstream analysis becomes less effective, and most algorithms for statistical data analysis require a complete data set. Microarray data usually consist of a small number of samples with high dimensionality, and often contain missing values. Many missing-value imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between the imputation method and classification accuracy. In this paper we carry out experiments on the Colon Cancer dataset to evaluate the effectiveness of four methods for missing-value imputation: the Row average method, KNN imputation, KNNFS imputation, and the Multiple Linear Regression imputation procedure. The considered classifier is the Support Vector Machine (SVM).

Keywords—KNN, Regression, Microarray, Imputation, Missing Values

I. INTRODUCTION

Microarray data represent the expression of thousands of genes at the same time. As with many types of experimental data, expression data obtained from microarray experiments are frequently peppered with missing values (MVs) that may occur for a variety of reasons, such as insufficient resolution, image corruption, dust, scratches on the slide, or errors in the experimental process. Many data mining techniques have been proposed for analyzing such data to identify regulatory patterns or similarities in expression under similar conditions. For the analysis to be efficient, data mining techniques such as classification [1-3] and clustering [4-5] require that the microarray data be complete, with no missing values [6]. One solution for the missing-data problem is to repeat the experiment, but this is time consuming and very expensive [7]. Replacing the missing values with zero or with the average value can be helpful instead of eliminating the missing-value records [8], but these two simple methods are not very effective.

Consequently, many algorithms have been developed to accurately impute MVs in microarray experiments; for example, K-Nearest Neighbor, Singular Value Decomposition, and the Row average method have been proposed to estimate missing values in microarrays, and KNN imputation was found to be the best among the three [9]. However, there is still room for improvement. Troyanskaya et al. [9] compared KNN imputation against Singular Value Decomposition and Row average methods; the results showed that KNN imputation is better than the Row average method. Oba et al. [10] proposed an imputation method called Bayesian Principal Component Analysis (BPCA) and reported that BPCA can estimate missing values better than KNN and SVD. Another efficient method was proposed by Zhou et al. [11]: it automatically selects gene parameters for estimating missing values using linear and nonlinear regression, and its key benefit is fast estimation. Kim et al. [12] proposed local least squares (LLS) imputation, which exploits the structural similarity of the data through least-squares optimization and is very robust. Later, Robust Least Squares Estimation with Principal Components (RLSP) was proposed by Yoon et al. [13] to improve on the previous methods; RLSP showed better performance than KNN, LLS, and BPCA. In these studies, the NRMSE is calculated to measure imputation performance, since the original values are known. Many missing-value imputation methods have thus been developed for microarray data, but only a few studies have investigated the relationship between the imputation method and classification accuracy.
In this paper, we carry out a model-based analysis to investigate how different properties of a dataset influence imputation and classification, and how imputation affects classification performance. We compare four imputation algorithms: the Row average method, KNN


http://sites.google.com/site/ijcsis/ ISSN 1947-5500


imputation, KNNFS imputation, and the Multiple Linear Regression imputation method, measuring how well the imputed dataset preserves the discriminative power residing in the original dataset. The Support Vector Machine (SVM) is used as the classifier in this work.

The remainder of this paper is organized as follows. Section II provides theory and related work. The details of the proposed methodology are given in Section III. Section IV illustrates the simulation and comparison results. Finally, concluding remarks are given in Section V.

II. RELATED WORK

A. Microarray Data

Every cell of a living organism contains a full set of chromosomes and identical genes. Only a portion of these genes is turned on, and it is this expressed subset that confers distinctive properties on each cell category. There are two main application forms of DNA microarray technology: 1) identification of sequence (gene/gene mutation) and 2) determination of the expression level (abundance) of genes in one sample, or comparison of gene transcription in two or more different kinds of cells. DNA microarrays are small, solid supports onto which sequences from thousands of different genes are attached at fixed locations. The supports themselves are usually glass microscope slides, roughly the size of two side-by-side small fingers, but can also be silicon chips or nylon membranes. The DNA is printed, spotted, or synthesized directly onto the support. With the aid of a computer, the amount of mRNA binding to the spots on the microarray is precisely measured, which generates a profile of gene expression in the cell. This process usually produces many missing values, which reduces the efficiency of the downstream computational analysis [14].

B. K-Nearest Neighbor (KNN)

Due to its simplicity, the K-Nearest Neighbor (KNN) method is one of the best-known methods for imputing missing values in microarray data. The KNN method imputes missing values by selecting genes with expression values similar to the gene of interest. The steps of KNN imputation are as follows.

Step 1: Choose the K genes that are most similar to the gene with the missing value (MV). To estimate the missing value x_ij of the ith gene in the jth sample, K genes are selected whose expression vectors are similar to the expression of gene i in samples other than j.

Step 2: Measure the distance between two expression vectors x_i and x_j using the Euclidean distance over the observed components. The Euclidean distance between x_i and x_j is calculated from (1):

d_ij = dist(x_i, x_j) = sqrt( Σ_{k=1}^{n} (x_{ik} − x_{jk})² )    (1)

where dist(x_i, x_j) is the Euclidean distance between samples x_i and x_j; n is the number of features (dimensions) of the microarray; and x_{ik} is the kth feature of sample x_i.

Step 3: Estimate the missing value as the average of the corresponding entries in the selected K nearest-neighbor expression vectors, using (2):

x̂_ij = (1/K) Σ_{k=1}^{K} X_k,   X_k = X_i, i = 1, ..., M | d_i ∈ {d_1, d_2, ..., d_K}    (2)

where x̂_ij is the estimated missing value of the ith gene in the jth sample; d_i is the ith rank in neighbor distance; X_k is the input matrix containing the kth-ranked nearest-neighbor gene expressions; and M is the total number of samples in the training data.

C. The Algorithm of KNNFS

The combined KNN-based feature selection and KNN-based imputation algorithm is as follows [15].

Phase 1: Feature Selection
Step 1: Initialize K_F features;
Step 2: Calculate the feature distance between X_j, j = 1, ..., col, and X_miss (the feature with missing values) using (1);
Step 3: Sort the feature distances in ascending order;
Step 4: Select the K_F minimum distances.

Phase 2: Imputation of Missing Values
Step 5: Initialize K_C samples;
Step 6: Using the K_F features, calculate the sample distance between R_i, i = 1, ..., row, and R_miss (the row with missing values) using (1);
Step 7: Sort the sample distances in ascending order;
Step 8: Select the K_C minimum distances;
Step 9: Use the K_C samples to estimate the missing value as the average of the K_C most similar values, using (2).

D. Multiple Linear Regression

Multiple linear regression (MLR) models the linear relationship between a dependent variable and one or more independent variables. The dependent variable is sometimes called the predictand, and the independent variables are called the predictors. The model expresses the value of the predictand as a linear function of one or more predictors plus an error term:

y_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + ... + b_k x_{i,k} + e_i    (3)

where x_{i,k} is the value of the kth predictor in case i; b_0 is the regression constant; b_k is the coefficient of the kth predictor; K is the total number of predictors; y_i is the predictand in case i; and e_i is the error term.
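As a concrete illustration, the KNN imputation steps above can be sketched in Python with NumPy. This is a minimal sketch we wrote for illustration, not the authors' original code; it assumes a genes × samples matrix with NaN marking missing entries, and `knn_impute` is our own name.

```python
import numpy as np

def knn_impute(data, k=10):
    """Sketch of KNN imputation: for each missing entry, average the
    corresponding entries of the k genes (rows) whose expression vectors
    are closest in Euclidean distance over jointly observed values."""
    data = data.astype(float)
    filled = data.copy()
    for i, row in enumerate(data):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        dists = []
        for j, other in enumerate(data):
            if j == i:
                continue
            shared = obs & ~np.isnan(other)
            if not shared.any():
                continue
            # Euclidean distance over jointly observed components, eq. (1)
            d = np.sqrt(np.sum((row[shared] - other[shared]) ** 2))
            dists.append((d, j))
        dists.sort()
        neighbors = [j for _, j in dists[:k]]
        for col in np.where(miss)[0]:
            vals = [data[j, col] for j in neighbors if not np.isnan(data[j, col])]
            if vals:
                filled[i, col] = np.mean(vals)  # neighbor average, eq. (2)
    return filled
```

The KNNFS variant of Section II-C follows the same pattern, except that the distance in the inner loop is computed only over the K_F pre-selected features rather than all observed components.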


The model (3) is estimated by least squares, which yields parameter estimates that minimize the sum of squared errors. The resulting prediction equation is

ŷ_i = b̂_0 + b̂_1 x_{i,1} + b̂_2 x_{i,2} + ... + b̂_k x_{i,k}    (4)

where the variables are defined as in (3), except that "^" denotes estimated values.
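A least-squares sketch of MLR imputation, written by us for illustration (the function name `mlr_impute` is ours): regress the gene with missing values on fully observed predictor genes over the samples where it is observed, then fill the gaps with the prediction equation (4).

```python
import numpy as np

def mlr_impute(target, predictors):
    """Estimate missing entries of `target` (one gene across samples) by
    regressing it on fully observed predictor genes, eqs. (3)-(4).
    `predictors` is a (k, n_samples) array with no missing values."""
    target = target.astype(float).copy()
    obs = ~np.isnan(target)
    # Design matrix: intercept column (b0) plus one column per predictor
    X = np.column_stack([np.ones(predictors.shape[1]), predictors.T])
    # Least-squares fit on observed samples minimizes the sum of squared errors
    coef, *_ = np.linalg.lstsq(X[obs], target[obs], rcond=None)
    target[~obs] = X[~obs] @ coef  # prediction equation (4)
    return target
```

In practice the predictor genes would themselves be chosen for similarity to the target gene, as in the regression-based method of Zhou et al. [11].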

III. THE EXPERIMENTAL DESIGN

To compare the performance of the Row average, KNN, KNNFS, and Regression imputation algorithms, NRMSE was used to measure the experimental results. The missing-value estimation techniques were tested by randomly removing data values and then computing the estimation error. In the experiments, between 1% and 10% of the values were removed from the dataset at random. Next, the four imputation algorithms mentioned above were applied separately to estimate the missing values, and the imputed (complete) data were then used for accuracy measurement (NRMSE and classification accuracy with the SVM classifier). The overall process is shown in Fig. 1.

To test the effectiveness of the different imputation algorithms, the Colon Cancer dataset was used. The data were collected from 62 patients: 40 tumor and 22 normal cases. The dataset has 2,000 selected genes; it is clean and contains no missing values. The effectiveness of missing-value imputation was computed by the Normalized Root Mean Squared Error (NRMSE) [12], as shown in (5):

NRMSE = sqrt( mean[(y_guess − y_ans)²] ) / std[y_ans]    (5)

where y_guess is the estimated value, y_ans is the prototype gene's value, and std[y_ans] is the standard deviation of the prototype gene.

Figure 1. Simulation flow chart: complete data → generate artificial missing values → data with missing values → feature selection and missing-value estimation → imputed data → evaluation (NRMSE and classification accuracy).

IV. THE EXPERIMENTAL RESULTS

To evaluate the effectiveness of the imputation methods, the NRMSE values were computed for each algorithm as described above. Each experiment was repeated 10 times and the average is reported as the result. The experimental results are shown in Table I and Fig. 2, which give the NRMSE of the estimation error for the Colon Tumor data. The results show that the Regression method has a lower NRMSE than the other methods.

TABLE I. NORMALIZED ROOT MEAN SQUARE ERROR OF MISSING-VALUE IMPUTATION FOR COLON CANCER DATA

| % Miss | Row    | KNN    | KNNFS  | Regression |
| ------ | ------ | ------ | ------ | ---------- |
| 1      | 0.6363 | 0.5486 | 0.4990 | 0.4049     |
| 2      | 0.6121 | 0.5366 | 0.4918 | 0.4103     |
| 3      | 0.6319 | 0.5606 | 0.5173 | 0.4282     |
| 4      | 0.6339 | 0.5621 | 0.5169 | 0.4251     |
| 5      | 0.6301 | 0.5673 | 0.5267 | 0.4410     |
| 6      | 0.6281 | 0.5634 | 0.5212 | 0.4573     |
| 7      | 0.6288 | 0.5680 | 0.5254 | 0.4415     |
| 8      | 0.6382 | 0.5882 | 0.5534 | 0.4548     |
| 9      | 0.6310 | 0.5858 | 0.5481 | 0.4418     |
| 10     | 0.6296 | 0.5849 | 0.5483 | 0.4450     |

Figure 2. Normalized root mean square error of missing-value imputation for Colon Cancer data.

The classification accuracy obtained with the SVM classifier is summarized in Table II and Fig. 3. The experimental results show that the accuracy of the Row average method ranges between 82.10% and 83.39%, while the neighbour-based methods (KNN, KNNFS) give results between 82.90% and 84.77%, and the Regression method ranges between 82.90% and 84.84%.
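The evaluation protocol described above (hide known values, impute, score) can be sketched as follows. This is our own illustrative code, not the authors'; `evaluate` is a hypothetical name, and the NRMSE denominator is taken over the hidden originals as an approximation of eq. (5).

```python
import numpy as np

def nrmse(y_guess, y_ans):
    """Normalized root mean squared error, eq. (5)."""
    return np.sqrt(np.mean((y_guess - y_ans) ** 2)) / np.std(y_ans)

def evaluate(data, impute_fn, pct=0.05, seed=0):
    """Hide pct of the known entries at random, impute them with
    impute_fn, and score the imputation with NRMSE against the
    hidden originals."""
    rng = np.random.default_rng(seed)
    flat = np.flatnonzero(~np.isnan(data))
    hide = rng.choice(flat, size=max(1, int(pct * flat.size)), replace=False)
    corrupted = data.copy()
    corrupted.flat[hide] = np.nan
    imputed = impute_fn(corrupted)
    return nrmse(imputed.flat[hide], data.flat[hide])
```

The classification-accuracy half of the protocol would then train the SVM classifier on the imputed matrix and report accuracy alongside the NRMSE score.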


V. CONCLUSION

This research studies the effect of MV imputation methods on classification problems. A model-based approach is employed. Four imputation methods (Row average, KNN, KNNFS, Regression) are compared on classification accuracy, using the Colon Cancer dataset. To evaluate the performance of the imputation methods, we randomly removed between 1% and 10% of the known expression values from the complete matrices, imputed the MVs, and assessed the performance using NRMSE. The results show that the Row average method performs poorly compared with the other methods in terms of NRMSE, and it also gives the lowest classification accuracy with the SVM classifier. Among the other methods, although Regression yields the best performance in terms of NRMSE, its classification accuracy is not noticeably different.

TABLE II. ACCURACY OF SVM CLASSIFIER FOR COLON CANCER DATA CLASSIFICATION

| % Miss | Row   | KNN   | KNNFS | Regression |
| ------ | ----- | ----- | ----- | ---------- |
| 1      | 83.39 | 84.03 | 84.35 | 84.84      |
| 2      | 83.23 | 84.35 | 84.03 | 84.19      |
| 3      | 83.06 | 83.87 | 83.71 | 84.84      |
| 4      | 82.74 | 84.19 | 83.87 | 83.71      |
| 5      | 82.62 | 84.23 | 84.77 | 83.51      |
| 6      | 82.90 | 82.90 | 82.74 | 83.87      |
| 7      | 82.42 | 83.87 | 83.87 | 84.19      |
| 8      | 82.10 | 83.39 | 83.23 | 84.03      |
| 9      | 83.23 | 84.35 | 84.68 | 84.35      |
| 10     | 82.26 | 83.55 | 83.71 | 82.90      |

Figure 3. Accuracy of the SVM classifier for Colon Cancer data.

REFERENCES

[1] M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. J. Ares, D. Haussler, "Knowledge-based analysis of microarray gene expression data by using support vector machines", Proc Natl Acad Sci USA, vol. 97, pp. 262-267, 2000.
[2] X. L. Ji, J. L. Ling, Z. R. Sun, "Mining gene expression data using a novel approach based on hidden Markov models", FEBS Letters, vol. 542, pp. 125-131, 2003.
[3] O. Alter, P. O. Brown, D. Botstein, "Singular value decomposition for genome-wide expression data processing and modeling", Proc Natl Acad Sci USA, vol. 97, pp. 10101-10106, 2000.
[4] M. B. Eisen, P. T. Spellman, P. O. Brown, D. Botstein, "Cluster analysis and display of genome-wide expression patterns", Proc Natl Acad Sci USA, vol. 97, pp. 262-267, 1998.
[5] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, T. R. Golub, "Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation", Proc Natl Acad Sci USA, vol. 96, pp. 2907-2912, 1999.
[6] E. Wit and J. McClure, Statistics for Microarrays: Design, Analysis and Inference, West Sussex: John Wiley and Sons Ltd, pp. 65-69, 2004.
[7] M. S. Sehgal, L. Gondal, L. S. Dooley, "Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data", Bioinformatics, vol. 21, pp. 2417-2423, 2005.
[8] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. J. Hudson, L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, L. M. Staudt, et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", Nature, vol. 403, pp. 503-511, 2000.
[9] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R. B. Altman, "Missing value estimation methods for DNA microarrays", Bioinformatics, vol. 17, pp. 520-525, 2001.
[10] S. Oba, M. A. Sato, I. Takemasa, M. Monden, K. I. Matsubara, S. Ishii, "A Bayesian missing value estimation method for gene expression profile data", Bioinformatics, vol. 19, pp. 2088-2096, 2003.
[11] X. B. Zhou, X. D. Wang, E. R. Dougherty, "Missing-value estimation using linear and non-linear regression with Bayesian gene selection", Bioinformatics, vol. 19, pp. 2302-2307, 2003.
[12] H. Kim, G. H. Golub, H. Park, "Missing value estimation for DNA microarray gene expression data: local least squares imputation", Bioinformatics, vol. 21, pp. 187-198, 2005.
[13] D. Yoon, E. K. Lee, T. Park, "Robust imputation method for missing values in microarray data", BMC Bioinformatics, vol. 8, no. 2:S6, 2007.
[14] J. Quackenbush, "Microarray data normalization and transformation", Nature Genetics Supplement, vol. 32, pp. 496-501, 2002.
[15] P. Meesad and K. Hengpraprohm, "Combination of KNN-Based Feature Selection and KNN-Based Missing-Value Imputation of Microarray Data", 2008 3rd International Conference on Innovative Computing Information and Control, p. 341, 2008.

