A Survey on Data Mining Techniques for Gene Selection and Cancer Classification

Published by: ijcsis on Jun 30, 2010
Dr. S. Santhosh Baboo
Reader, PG and Research Department of Computer Science,
Dwaraka Doss Goverdhan Doss Vaishnav College
Chennai
santhos2001@sify.com

S. Sasikala
Head, Department of Computer Science
Sree Saraswathi Thyagaraja College
Pollachi
sasivenkatesh04@gmail.com
Abstract ─ Cancer research is one of the major research areas in the medical field. Classification is critically important for cancer diagnosis and treatment: accurate prediction of different tumor types has great value in providing better treatment and minimizing toxicity for patients. Previously, cancer classification has always been morphological and clinical based. These conventional cancer classification methods are reported to have several limitations in their diagnostic ability. In order to gain a better insight into the problem of cancer classification, systematic approaches based on global gene expression analysis have been proposed. The recent advent of microarray technology has allowed the simultaneous monitoring of thousands of genes, which has motivated the development of cancer classification using gene expression data. Though the field is still in its early stages of development, the results obtained so far seem promising. This survey presents the most used data mining techniques for gene selection and cancer classification. In particular, it focuses on algorithms proposed in four main emerging fields: neural network based algorithms, machine learning algorithms, genetic algorithms and cluster based algorithms. In addition, it provides a general idea for future improvement in this field.
Keywords ─ Data Mining, Gene Selection, Cancer Classification, Neural Network, Support Vector Machine, Clustering, Genetic Algorithms.
I. INTRODUCTION

Data mining (also known as Knowledge Discovery in Databases, KDD) has been defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.” KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.

Cancer classification through gene expression data analysis has recently emerged as an active area of research. In recent years numerous techniques have been proposed in the literature for gene selection and cancer classification. Data mining and knowledge extraction is an important problem in bioinformatics, and biological data mining is an emerging field of research and development. A large amount of biological data has been produced in recent years, and important knowledge can be extracted from these data by the use of data analysis techniques.

This survey focuses on various data mining and machine learning techniques for proper gene selection, which leads to accurate cancer classification. It discusses various techniques and methods proposed earlier in the literature for biological data analysis. In particular, it focuses on algorithms proposed in four main emerging fields: neural network based algorithms, machine learning algorithms, genetic algorithms and cluster based algorithms. In addition, it provides a general idea for future improvement in this field.

The remainder of this paper is organized as follows. Section II discusses various techniques and methods proposed earlier in the literature for gene selection and cancer classification. Section III provides a brief idea of further directions in this field. Section IV concludes the paper with a brief discussion.

II. RELATED WORK
Cancer classification is a challenging area in the field of bioinformatics. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form which is easily comprehensible to humans. Recent research has demonstrated that gene selection is the preliminary step for cancer classification. This survey focuses on various gene selection and cancer classification methods based on neural network based algorithms, machine learning based algorithms, genetic algorithms and clustering algorithms.
A. Neural Network Based Algorithms

An Artificial Neural Network (ANN), usually called “Neural Network” (NN), is a mathematical or computational model that tries to simulate the structure and/or functional aspects of biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. Neural networks are non-linear statistical data modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.

A gene classification artificial neural system has been developed for rapid annotation of the molecular sequencing data being generated by the Human Genome Project. Cathy H. Wu et al. 1995 [4] designed an ANN system to classify new (unknown) sequences into predefined (known) classes. The system evaluates three neural network sequence classification systems: GenCANS-PIR for PIR superfamily placement of protein sequences, GenCANS_RDP for RDP phylogenetic classification of small subunit rRNA sequences, and GenCANS_Blocks for ProSite/Blocks protein grouping of protein sequences. The design of the neural system can easily be extended to classify other nucleic acid sequences. The sequence classification method holds many advantages, such as speed, sensitivity and automated family assignment.

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 1, April 2010, http://sites.google.com/site/ijcsis/, ISSN 1947-5500

Methods using gene expression profiles are more objective, accurate and reliable than traditional tumor diagnostic methods, which are based mainly on the morphological appearance of the tumor. Lipo Wang et al. 2007 [19] proposed an FNN method to find the smallest set of genes that can ensure highly accurate classification of cancers from microarray data. It includes two steps:

i) Some important genes are chosen using a feature importance ranking scheme.
ii) The classification capability of all simple combinations of those important genes is tested using good classifiers.

The method uses a “divide and conquer” approach, which attains good accuracy while significantly reducing the number of genes required for highly reliable cancer diagnosis. The importance ranking of each gene is computed using feature ranking measures such as the t-test and class separability. After some top genes are selected from the importance ranking list, each selected gene is input into a classifier such as a Fuzzy Neural Network or a Support Vector Machine. If good accuracy is not obtained, 2-gene combinations are tested. This procedure is repeated until good accuracy is obtained. The performance of the classifiers was tested on the lymphoma data set, where 93.85 percent accuracy was obtained, on the SRBCT data set (95 percent), on the liver cancer data set (98.1 percent) and on the GCM data set (81.25 percent).
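The two-step search described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: a Welch t-statistic stands in for the ranking scheme, a nearest-centroid classifier scored by leave-one-out stands in for the fuzzy neural network or SVM, and the names `t_score`, `rank_genes`, `loo_accuracy` and `smallest_gene_set` are hypothetical. It assumes two classes labeled 0 and 1 with at least two samples each.

```python
from itertools import combinations
from statistics import mean, variance


def t_score(values, labels):
    """Absolute Welch t-statistic of one gene between class 0 and class 1."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes(X, labels):
    """Step i: gene indices sorted from most to least discriminative."""
    n_genes = len(X[0])
    scores = [t_score([row[g] for row in X], labels) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)


def loo_accuracy(X, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on `genes`."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for c in set(labels):
            rows = [X[j] for j in range(len(X)) if j != i and labels[j] == c]
            centroids[c] = [mean(r[g] for r in rows) for g in genes]
        dist = {c: sum((X[i][g] - m) ** 2 for g, m in zip(genes, cen))
                for c, cen in centroids.items()}
        correct += min(dist, key=dist.get) == labels[i]
    return correct / len(X)


def smallest_gene_set(X, labels, top_k=5, target=1.0, max_size=3):
    """Step ii: test 1-gene, then 2-gene, ... combinations of the top genes."""
    top = rank_genes(X, labels)[:top_k]
    best = ((), 0.0)
    for size in range(1, max_size + 1):
        for combo in combinations(top, size):
            acc = loo_accuracy(X, labels, combo)
            if acc > best[1]:
                best = (combo, acc)
            if acc >= target:  # stop at the first combination that is good enough
                return combo, acc
    return best
```

Because combinations are tried in order of increasing size, the first set that reaches the target accuracy is also one of the smallest among the top-ranked genes.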
B. Machine Learning Algorithms

1) Support vector machines (SVMs)

SVMs are a set of related supervised learning methods used for classification and regression. In simple words, given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. It is a relatively new learning algorithm proposed by Vapnik et al.
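To make the two-category description concrete, the following is a minimal sketch of a linear SVM. It is an assumption-laden illustration rather than a production solver: the model is fitted by plain batch subgradient descent on the regularized hinge loss, and the function names, the hyperparameters (`lam`, `lr`, `epochs`) and the toy data are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularized hinge loss.

    X is a list of feature vectors, y a list of labels in {-1, +1}.
    Returns the weight vector w and the bias b of the separating hyperplane.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge loss
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Assign x to class +1 or -1 according to the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical toy data: two well-separated groups of samples.
X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

The weight vector `w` returned here is the quantity that SVM-based gene selection methods inspect: a gene with a small weight contributes little to the decision.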
 
Junying Zhang et al. 2003 [13] discussed recent SVM approaches for gene selection, cancer classification and functional gene classification. One of the major challenges of gene expression data is the large number of genes in the data sets. The SVM method used for gene selection was Recursive Feature Elimination (RFE). SVM methods are demonstrated in detail on samples consisting of ovarian cancer tissues, normal ovarian tissues and other normal tissues. For functional gene classification, SVM employs distance functions that operate in extremely high dimensional feature spaces. SVMs work well for the analysis of broad patterns of gene expression. They can easily deal with a large number of features and a small number of training patterns.

Boyang Li et al. 2008 [3] proposed an improved SVM classifier with a soft decision boundary. SVM classifiers have been shown to be an efficient approach to a variety of classification problems, because they are based on margin maximization and statistical algorithms. Gene data differs from other classification data in several ways. One gene may have several different functions, so some genes may have more than one functional label. Since most conventional methods use some kind of hard boundary to classify the data arbitrarily, they are invalid for data with a mutual part between the classes. Another representative problem in gene data is data imbalance, meaning that the size of one class is much larger than that of the other classes, which is the main reason for the excursion of the separation boundary. The system defines a kind of belief degree based on the decision values of the samples, and the resulting boundary is a classification boundary based on the belief degree of the data. Statistical methods and curve fitting algorithms of SVM are used to classify multi-label gene data and also to deal with data imbalance.

Kai-Bo Duan et al. 2005 [16] presented a new gene selection method that uses a backward elimination procedure similar to that of SVM-RFE. Unlike the SVM-RFE method, at each step the proposed MSVM-RFE approach computes the feature ranking scores from a statistical analysis of the weight vectors of multiple linear SVMs trained on subsamples of the original training data. The method was tested on four gene expression datasets for cancer classification. The results show that the proposed feature selection method selects better gene subsets than the original SVM-RFE, improves the classification accuracy and also leads to redundancy reduction. A gene ontology (GO) based similarity assessment indicates that the selected subsets are functionally diverse, further validating the gene selection method: the GO-based similarity values of pairs of genes belonging to subsets selected by MSVM-RFE are significantly low, which may be seen as an indicator of functional diversity. The investigation also suggests that, for gene expression-based cancer classification, the average test error from multiple partitions of training and test sets can be recommended as a reference of performance quality. Gene selection also improves the performance of SVMs and is a necessary step for cancer classification with gene expression data.
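The backward elimination shared by SVM-RFE and MSVM-RFE can be sketched as follows. This is plain SVM-RFE under simplifying assumptions, not the MSVM-RFE method itself: a single linear model is trained per round (MSVM-RFE would aggregate the weights of several SVMs trained on subsamples), the trainer is a small hand-written subgradient routine rather than a real SVM solver, and the names `linear_svm_weights` and `svm_rfe` are hypothetical.

```python
def linear_svm_weights(X, y, lam=0.01, lr=0.1, epochs=200):
    """Weights of a linear SVM fitted by batch subgradient descent (labels +1/-1)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w


def svm_rfe(X, y, n_keep=1):
    """Repeatedly retrain and drop the gene with the smallest squared weight."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        X_sub = [[row[g] for g in remaining] for row in X]
        w = linear_svm_weights(X_sub, y)
        worst = min(range(len(remaining)), key=lambda j: w[j] ** 2)
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Hypothetical toy data: gene 1 carries the class signal, genes 0 and 2 are noise.
X = [[0.1, 2.0, -0.2], [-0.1, 3.0, 0.1], [0.1, 2.5, 0.1],
     [-0.1, -2.0, 0.2], [0.1, -3.0, -0.1], [-0.1, -2.5, -0.1]]
y = [1, 1, 1, -1, -1, -1]
kept, dropped = svm_rfe(X, y, n_keep=1)
print(kept, dropped)
```

Retraining after every removal is what distinguishes this scheme from a one-shot ranking: the weights of the surviving genes are re-estimated once a competitor has been dropped.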
Wei Luo et al. [24] proposed an SVM method for cancer classification that includes two stages. In the first stage, a modified t-test method is used to select discriminatory features. The second stage extracts principal components from the top-ranked genes selected by the modified t-test method. Selecting important features and building an effective classifier are both pivotal processes in cancer classification. The results (Table 2) proved the effectiveness of gene selection methods using SVM.

Kaibo Duan et al. [15] discussed a variant of SVM-RFE for gene selection in cancer classification with expression data. In gene expression-based cancer classification, a large number of genes in conjunction with a small number of samples makes the gene selection problem more important but also more challenging. In this method nested subsets of features are selected in a sequential backward elimination manner, which starts with all the features and each time removes the one feature with the smallest ranking score. At each step, the coefficients of the weight vector w of a linear SVM are used as the feature ranking criterion. For gene expression-based cancer classification, only a few training samples are available. In order to make better use of these valuable training samples, a Leave-One-Out (LOO) procedure is used along with SVM-RFE. LOO-SVM-RFE is comparable to SVM-RFE and performs consistently well on all the gene expression datasets used. A set of the most relevant genes selected by t-statistics may not be optimal for building a good classifier, due to possible redundancy within it. Gene selection also improves the performance of SVM and is a necessary step for cancer classification with expression data.

Yuchun Tang et al. [26] proposed an efficient algorithm which includes two stages. The first stage eliminates most of the irrelevant, redundant and noisy genes. The final selection of the gene subset is then performed at the second stage. Existing gene selection algorithms fall into two groups. Correlation-based feature ranking algorithms work in a forward selection way by ranking genes individually in terms of a correlation-based metric; some top-ranked genes are then selected to form the most informative gene subset [19,20,21]. Backward elimination algorithms work by iteratively removing one “worst” gene at a time until the predefined size of the final gene subset is reached; in each loop, the remaining genes are ranked again. SVM-RFE is a backward elimination algorithm of this kind which achieved notable performance improvement. The instability of the SVM-RFE algorithm may reduce overfitting. To overcome the instability problem, the new two-stage SVM-RFE algorithm is proposed. The system is better than correlation-based methods because it avoids the orthogonality assumption, resulting in a modified gene ranking.
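A correlation-based ranking of the kind mentioned above can be sketched as a simple filter. This is a hedged illustration only: the absolute Pearson correlation between each gene and the class label is one possible correlation-based metric (the text does not name a specific one), and `pearson` and `correlation_filter` are hypothetical helper names.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def correlation_filter(X, labels, keep):
    """Rank each gene individually and return the indices of the top `keep`."""
    n_genes = len(X[0])
    scores = [abs(pearson([row[g] for row in X], labels))
              for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical toy data: genes 1 and 3 track the class label, genes 0 and 2 do not.
X = [[1.0, 0.1, 1.0, 5.0],
     [1.0, 0.2, 2.0, 5.1],
     [1.0, 0.0, 3.0, 4.9],
     [1.0, 2.1, 1.0, 1.0],
     [1.0, 2.0, 2.0, 1.1],
     [1.0, 2.2, 3.0, 0.9]]
labels = [0, 0, 0, 1, 1, 1]
print(correlation_filter(X, labels, keep=2))
```

Because each gene is scored in isolation, a filter like this is cheap but blind to redundancy between genes, which is the weakness that backward elimination is meant to address.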
2) Extreme learning machine (ELM)

Recently, a new learning algorithm for feed-forward neural networks, named the extreme learning machine (ELM), has been proposed by Huang et al. It can give better performance than traditional tuning-based learning methods for feed-forward neural networks in terms of generalization and learning speed.

Runxuan Zhang et al. [21] proposed a fast and efficient classification method based on the ELM algorithm. In ELM one may randomly choose and fix all the hidden node parameters and then analytically determine the output weights. Studies have shown [10] that ELM has good generalization performance and can be implemented easily. Many nonlinear activation functions can be used in ELM, such as sigmoid, sine, hard limit [18], radial basis functions [15] [16], and complex activation functions [6]. In order to evaluate the performance of the ELM algorithm for multicategory cancer diagnosis, three benchmark microarray data sets, namely the GCM, the lung and the lymphoma data sets, are used. For gene selection, the recursive feature elimination method is used. ELM can perform multicategory classification directly without any modification. This algorithm achieves higher classification accuracy than other algorithms such as ANN, SANN and SVM, with less training time and a smaller network structure (Table 3).
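The idea of fixing random hidden node parameters and solving for the output weights analytically can be sketched as below. This is a minimal illustration under assumptions, not the implementation evaluated in [21]: sigmoid hidden nodes are used, the output weights come from ridge-regularized normal equations rather than a Moore-Penrose pseudo-inverse, and the helper names (`solve`, `hidden_layer`, `train_elm`, `predict_elm`) and parameters (`n_hidden`, `ridge`, `seed`) are hypothetical.

```python
import math
import random


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [row[n:] for row in M]


def hidden_layer(X, W, b):
    """Sigmoid activations of the fixed random hidden layer."""
    return [[1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(Wk, row)) + bk)))
             for Wk, bk in zip(W, b)] for row in X]


def train_elm(X, T, n_hidden=10, ridge=1e-3, seed=0):
    """Draw random hidden parameters once, then solve for the output weights."""
    rng = random.Random(seed)
    d = len(X[0])
    W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_hidden)]
    b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    H = hidden_layer(X, W, b)
    # Normal equations: (H^T H + ridge * I) beta = H^T T
    HtH = [[sum(h[i] * h[j] for h in H) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    HtT = [[sum(h[i] * t[k] for h, t in zip(H, T)) for k in range(len(T[0]))]
           for i in range(n_hidden)]
    beta = solve(HtH, HtT)
    return W, b, beta


def predict_elm(model, X):
    """Return the index of the largest output node for each sample."""
    W, b, beta = model
    H = hidden_layer(X, W, b)
    out = [[sum(h[i] * beta[i][k] for i in range(len(beta)))
            for k in range(len(beta[0]))] for h in H]
    return [max(range(len(o)), key=lambda k: o[k]) for o in out]
```

With one-hot target rows in `T`, the same code handles any number of classes, which mirrors the remark that ELM performs multicategory classification without modification. No iterative tuning of the hidden layer takes place, which is where the speed advantage comes from.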
3) Relevance vector machine (RVM)

The relevance vector machine (RVM) is a machine learning technique that uses Bayesian inference to obtain parsimonious solutions for regression and classification. The RVM has an identical functional form to the support vector machine, but provides probabilistic classification. It is actually equivalent to a Gaussian process model with a covariance function defined in terms of the kernel φ.
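The covariance function itself did not survive in this copy of the text. As an assumption drawn from the standard RVM formulation rather than from the source, it is usually written as follows, where the α_j are the hyperparameters governing the prior on the weights and x_1, ..., x_N are the training inputs:

```latex
k(\mathbf{x}, \mathbf{x}') = \sum_{j=1}^{N} \frac{1}{\alpha_j}\,
  \varphi(\mathbf{x}, \mathbf{x}_j)\, \varphi(\mathbf{x}', \mathbf{x}_j)
```

Training inputs whose α_j grows very large contribute nothing to this sum, which is one way to see why the resulting solutions are parsimonious.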
TABLE 1. THE COMPARISON OF DIFFERENT NUMBER OF GENES AND CLASSIFICATION ACCURACY FOR THE SRBCT DATA SET.

S.No   Number of Genes   Classification Accuracy
1      50                100%
2      25                100%
3      12                100%
4      60                98.4127%
5      33                93.6508%

TABLE 2. VALIDATION ACCURACY (%) OF DIFFERENT ALGORITHMS

#Genes   ELM             SVM-OVO         Accuracy
14       74.34   8.5     70.20   8.2     68.75   50
28       78.52   10.7    74.36   7.9     71.53   59.76
42       80.57   9.9     75.05   10.9    72.92   64.58
56       81.95   8.8     75.72   8.9     79.17   70.14
70       83.35   8.5     77.86   11.6    76.4    59.72
80       84.06   9.4     77.86   10.5    80.56   70.83
98       83.40   8.5     79.21   7.8     77.08   72.22