
(IJCSIS) International Journal of Computer Science and Information Security,

Vol. 8, No. 1, April 2010

A Survey on Data Mining Techniques for Gene Selection and Cancer Classification

Dr. S. Santhosh Baboo
Reader, PG and Research department of Computer Science,
Dwaraka Doss Goverdhan Doss Vaishnav College
Chennai
santhos2001@sify.com

S. Sasikala
Head, Department of Computer Science
Sree Saraswathi Thyagaraja College
Pollachi
sasivenkatesh04@gmail.com

Abstract─Cancer research is one of the major research areas in the medical field. Classification is critically important for cancer diagnosis and treatment: accurate prediction of different tumor types is of great value in providing better treatment and minimizing toxicity for patients. Previously, cancer classification has always been morphological and clinical based. These conventional cancer classification methods are reported to have several limitations in their diagnostic ability. In order to gain better insight into the problem of cancer classification, systematic approaches based on global gene expression analysis have been proposed. The recent advent of microarray technology has allowed the simultaneous monitoring of thousands of genes, which has motivated the development of cancer classification using gene expression data. Though the field is still in its early stages of development, the results obtained so far seem promising. This survey presents the most widely used data mining techniques for gene selection and cancer classification. In particular, it focuses on algorithms proposed in four main emerging fields: neural network based algorithms, machine learning algorithms, genetic algorithms and cluster based algorithms. In addition, it provides a general idea for future improvement in this field.

Keywords─Data Mining, Gene Selection, Cancer Classification, Neural Network, Support Vector Machine, Clustering, Genetic Algorithms.

I. INTRODUCTION

Data mining (also known as Knowledge Discovery in Databases, KDD) has been defined as “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.” KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results. Cancer classification through gene expression data analysis has recently emerged as an active area of research, and in recent years numerous techniques have been proposed in the literature for gene selection and cancer classification. Data mining and knowledge extraction is an important problem in bioinformatics, and biological data mining is an emerging field of research and development.

A large amount of biological data has been produced in recent years. Important knowledge can be extracted from these data by the use of data analysis techniques. This survey focuses on various data mining and machine learning techniques for proper gene selection, which leads to accurate cancer classification. It discusses various techniques and methods proposed earlier in the literature for biological data analysis. In particular, this survey focuses on algorithms proposed in four main emerging fields: neural network based algorithms, machine learning algorithms, genetic algorithms and cluster based algorithms. In addition, it provides a general idea for future improvement in this field.

The remainder of this paper is organized as follows. Section II discusses various techniques and methods proposed earlier in the literature for gene selection and cancer classification. Section III outlines directions for further research in this field. Section IV concludes the paper with a brief discussion.

II. RELATED WORK

Cancer classification is a challenging area in the field of bioinformatics. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form which is easily comprehensible to humans. Recent research has demonstrated that gene selection is the pre-step for cancer classification. The survey focuses on various gene selection and cancer classification methods based on neural network based algorithms, machine learning based algorithms, genetic algorithms and clustering algorithms.

A. Neural Network Based Algorithms

An Artificial Neural Network (ANN), usually called a “Neural Network” (NN), is a mathematical or computational model that tries to simulate the structure and/or functional aspects of biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. Neural networks are non-linear statistical data modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.

A gene classification artificial neural system has been developed for rapid annotation of the molecular sequencing data being generated by the Human Genome Project. Cathy H. Wu et al. 1995 [4] designed an ANN system to classify new (unknown) sequences into predefined (known) classes. The system evaluates three neural network sequence classification systems: GenCANS-PIR for PIR superfamily placement of protein sequences, GenCANS_RDP for

216 http://sites.google.com/site/ijcsis/
ISSN 1947-5500

RDP phylogenetic classification of small subunit rRNA sequences, and GenCANS_Blocks for PROSITE/Blocks protein grouping of protein sequences. The design of the neural system can be easily extended to classify other nucleic acid sequences. The sequence classification method offers several advantages, such as speed, sensitivity and automated family assignment.

Methods that use gene expression profiles are more objective, accurate and reliable than traditional tumor diagnostic methods based mainly on the morphological appearance of the tumor. Lipo Wang et al. 2007 [19] proposed a Fuzzy Neural Network (FNN) method to find the smallest set of genes that can ensure highly accurate classification of cancers from microarray data. It includes two steps:

i) Some important genes are chosen using a feature importance ranking scheme.
ii) The classification capability of all simple combinations of those important genes is tested using good classifiers.

The method uses a “divide and conquer” approach that attains good accuracy while significantly reducing the number of genes required for highly reliable cancer diagnosis. The importance ranking of each gene is computed using feature ranking measures such as the t-test and class separability. After some top genes in the importance ranking list are selected, each selected gene is fed into a classifier such as a Fuzzy Neural Network or a Support Vector Machine. If good accuracy is not obtained with individual genes, 2-gene combinations are tested. This procedure is repeated until good accuracy is obtained. The reported accuracies are 93.85% on the lymphoma data set, 95% on the SRBCT data set, 98.1% on the liver cancer data set and 81.25% on the GCM data set.

B. Machine Learning Algorithms

1) Support vector machines (SVMs)

SVMs are a set of related supervised learning methods used for classification and regression. In simple words, given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. It is a relatively new learning algorithm proposed by Vapnik et al.

Junying Zhang et al. 2003 in [13] discussed recent SVM approaches for gene selection, cancer classification and functional gene classification. One of the major challenges of gene expression data is the large number of genes in the data sets. The SVM method used for gene selection was Recursive Feature Elimination (RFE). SVM methods are demonstrated in detail on samples consisting of ovarian cancer tissues, normal ovarian tissues and other normal tissues. For functional gene classification, SVM employs distance functions that operate in extremely high dimensional feature spaces. SVMs work well for the analysis of broad patterns of gene expression. They can easily deal with a large number of features and a small number of training patterns.

Boyang Li et al. 2008 in [3] proposed an improved SVM classifier with a soft decision boundary. SVM classifiers have been shown to be an efficient approach to a variety of classification problems, because they are based on margin maximization and statistical algorithms. Gene data differs from other classification data in several ways. One gene may have several different functions, so some genes may have more than one functional label. Since most conventional methods use some kind of hard boundary to classify the data arbitrarily, they are invalid for data in which the classes overlap. Another representative problem in gene data is data imbalance, meaning that one class is much larger than the others, which is the main reason for the excursion of the separation boundary. The system defines a kind of belief degree based on the decision values of the samples; the soft decision boundary is a classification boundary based on the belief degree of the data. Statistical methods and curve fitting algorithms are used with the SVM to classify multi-label gene data and to deal with data imbalance.

Kai-Bo Duan et al. 2005 in [16] presented a new gene selection method, MSVM-RFE, that uses a backward elimination procedure similar to that of SVM-RFE. Unlike the SVM-RFE method, at each step the proposed approach computes the feature ranking scores from a statistical analysis of the weight vectors of multiple linear SVMs trained on subsamples of the original training data. The method is tested on four gene expression datasets for cancer classification. The results show that MSVM-RFE selects better gene subsets than the original SVM-RFE, improves the classification accuracy and also leads to redundancy reduction. A gene ontology (GO) based similarity assessment indicates that the selected subsets are functionally diverse, further validating the gene selection method: GO-based similarity values of pairs of genes belonging to subsets selected by MSVM-RFE are significantly low, which may be seen as an indicator of functional diversity. The investigation also suggests that, for gene expression-based cancer classification, the average test error from multiple partitions of training and test sets can be recommended as a reference of performance quality. Gene selection also improves the performance of SVMs and is a necessary step for cancer classification with gene expression data. The proposed method is a powerful approach for gene selection and cancer classification.
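The backward elimination loop that SVM-RFE and MSVM-RFE share can be sketched in a few lines. The block below is a minimal illustration of plain SVM-RFE only, not the implementation of any surveyed paper: it trains a single linear SVM per step, whereas MSVM-RFE would aggregate the weight vectors of several SVMs trained on subsamples. The linear SVM is fitted with a simple Pegasos-style subgradient method chosen here for brevity, and all names (`train_linear_svm`, `svm_rfe`, `X_toy`, `y_toy`) and the toy data are assumptions introduced for this example.

```python
# Illustrative sketch only: plain SVM-RFE with a simplistic linear SVM trainer.
# Function names, parameters and toy data are hypothetical, not from the papers.

def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Return the weight vector of a linear SVM (no bias term), fitted by
    Pegasos-style subgradient descent on the L2-regularised hinge loss.
    Samples are visited in a fixed cyclic order, so the result is deterministic."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)                      # decaying step size
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            w = [(1.0 - eta * lam) * wj for wj in w]   # shrink (regulariser)
            if margin < 1.0:                           # hinge loss is active
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w


def svm_rfe(X, y, n_keep=1):
    """Backward elimination: retrain on the surviving genes, score each gene
    by its squared weight, and drop the lowest-scoring gene until n_keep remain.
    Returns (kept gene indices, eliminated gene indices in removal order)."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)
        scores = [wj * wj for wj in w]                 # ranking criterion w_j^2
        worst = min(range(len(scores)), key=scores.__getitem__)
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Toy data: 6 samples x 3 "genes"; only gene 0 separates the two classes.
X_toy = [
    [2.0, 0.1, -0.1],
    [1.5, -0.2, 0.2],
    [2.5, 0.0, 0.1],
    [-2.0, 0.1, 0.0],
    [-1.5, -0.1, -0.2],
    [-2.5, 0.2, 0.1],
]
y_toy = [1, 1, 1, -1, -1, -1]

kept, eliminated = svm_rfe(X_toy, y_toy, n_keep=1)
print("kept genes:", kept)
print("eliminated (first removed first):", eliminated)
```

On real microarray data with thousands of genes, removing one gene per iteration is expensive, so practical variants often remove a fraction of the remaining genes at each step; that choice is left out of the sketch.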


Wei Luo et al. in [24] proposed an SVM method for cancer classification. It includes two stages. In the first stage, a modified t-test method is used to select discriminatory features. The second stage extracts principal components from the top-ranked genes identified by the modified t-test method. Selecting important features and building an effective classifier are both pivotal processes in cancer classification. The results (Table 2) proved the effectiveness of gene selection methods using SVM.

Kaibo Duan et al. in [15] discussed a variant of SVM-RFE for gene selection in cancer classification with expression data. In gene expression-based cancer classification, a large number of genes in conjunction with a small number of samples makes the gene selection problem more important but also more challenging. In SVM-RFE, nested subsets of features are selected in a sequential backward elimination manner, which starts with all the features and each time removes the one feature with the smallest ranking score. At each step, the coefficients of the weight vector w of a linear SVM are used as the feature ranking criterion. For gene expression-based cancer classification, only a few training samples are available. In order to make better use of these valuable training samples, a Leave-One-Out (LOO) procedure is used along with SVM-RFE. LOO-SVM-RFE is comparable with SVM-RFE and performs consistently well on all the gene expression datasets used. A set of relevant genes selected by t-statistics may not be optimal for building a good classifier, due to possible redundancy within the set. Gene selection also improves the performance of SVM and is a necessary step for cancer classification with expression data.

Yuchun Tang et al. in [26] proposed an efficient algorithm which includes two stages. The first stage eliminates most of the irrelevant, redundant and noisy genes. The final gene subset is then selected at the second stage. This is done with gene selection algorithms such as correlation-based feature ranking algorithms, which work in a forward selection way by ranking genes individually in terms of a correlation-based metric and then selecting some top-ranked genes to form the most informative gene subset [19,20,21], and backward elimination algorithms, which work by iteratively removing one “worst” gene at a time until the predefined size of the final gene subset is reached. In each loop the remaining genes are ranked again; elimination algorithms of this kind have achieved notable performance improvement. SVM-RFE may reduce overfitting, but the algorithm is unstable. To overcome the instability problem, the new two-stage SVM-RFE algorithm is proposed. The system is better than correlation-based methods because it avoids the orthogonality assumptions, resulting in a modified gene ranking.

2) Extreme learning machine (ELM)

Recently, a new learning algorithm for feed-forward neural networks, named the extreme learning machine (ELM), which can give better performance than traditional tuning-based learning methods for feed-forward neural networks in terms of generalization and learning speed, has been proposed by Huang et al.

Runxuan Zhang et al. in [21] proposed a fast and efficient classification method based on the ELM algorithm. In ELM one may randomly choose and fix all the hidden node parameters and then analytically determine the output weights. Studies have shown [10] that ELM has good generalization performance and can be implemented easily. Many nonlinear activation functions can be used in ELM, such as sigmoid, sine, hard limit [18], radial basis functions [15] [16], and complex activation functions [6]. In order to evaluate the performance of the ELM algorithm for multicategory cancer diagnosis, three benchmark microarray data sets, namely the GCM, the lung and the lymphoma data sets, are used. For gene selection, the recursive feature elimination method is used. ELM can perform multicategory classification directly without any modification. This algorithm achieves higher classification accuracy than other algorithms such as ANN, SANN and SVM, with less training time and a smaller network structure (Table 3).

TABLE 2
VALIDATION ACCURACY (%) OF DIFFERENT ALGORITHMS

# Genes   ELM            SVM-OVO        Accuracy
14        74.34   8.5    70.20   8.2    68.75   50
28        78.52  10.7    74.36   7.9    71.53   59.76
42        80.57   9.9    75.05  10.9    72.92   64.58
56        81.95   8.8    75.72   8.9    79.17   70.14
70        83.35   8.5    77.86  11.6    76.4    59.72
80        84.06   9.4    77.86  10.5    80.56   70.83
98        83.40   8.5    79.21   7.8    77.08   72.22

TABLE 1
THE COMPARISON OF DIFFERENT NUMBER OF GENES AND CLASSIFICATION ACCURACY FOR THE SRBCT DATA SET

S.No   Number of Genes   Classification Accuracy
1      50                100%
2      25                100%
3      12                100%
4      60                98.4127%
5      33                93.6508%

3) Relevance vector machine (RVM)

Relevance vector machine (RVM) is a machine learning technique that uses Bayesian inference to obtain parsimonious solutions for regression and classification. The RVM has an identical functional form to the support vector machine, but provides probabilistic classification. It is actually equivalent to a Gaussian process model with covariance function

$$k(\mathbf{x}, \mathbf{x}') = \sum_{j=1}^{N} \frac{1}{\alpha_j}\, \varphi(\mathbf{x}, \mathbf{x}_j)\, \varphi(\mathbf{x}', \mathbf{x}_j)$$

where $\alpha_j$ is the precision hyperparameter of the prior on the $j$-th weight and $\varphi$ is the kernel


function (usually Gaussian), and $x_1, \ldots, x_N$ are the input vectors of the training set. The RVM was proposed by Tipping in 2000.

Wen Zhang et al. in [25] proposed a new RVM-RFE method for gene selection. SVM-RFE, which combines the support vector machine with a recursive elimination procedure, is one of the important approaches for gene selection. RVM-RFE performs gene selection by combining RVM and RFE in the same way. As a competitor to SVM, RVM lends itself particularly well to the analysis of broad patterns of gene expression from DNA microarray data. The experimental results on real datasets demonstrate that the method achieves a satisfactory running time compared with SVM-RFE, linear RVM and other methods. The approach improves classification accuracy, which indicates better and more effective disease diagnosis and can ultimately make a significant difference in a patient's chances of recovery. The method leads to comparable accuracy and shorter running time.

C. Genetic algorithms

Genetic Algorithms (GAs) were invented to mimic some of the processes observed in natural evolution. Many people, biologists included, are astonished that life at the level of complexity that we observe could have evolved in the relatively short time suggested by the fossil record. The idea of GA is to use this power of evolution to solve optimization problems. The father of the original Genetic Algorithm was John Holland, who invented it in the early 1970s.

Topon Kumar Paul et al. in [23] present a Majority Voting Genetic Programming Classifier (MVGPC) for the classification of microarray data. They evolve multiple rules with genetic programming (GP) and then apply those rules to test samples to determine their labels with a majority voting technique. Overfitting is a major problem in classification of gene expression data using machine learning techniques. Since the number of available training samples is very small compared to the huge number of genes, and the number of samples per class is not evenly distributed, a single rule or a single set of rules may produce very poor test accuracy. Instead of a single rule, multiple rules can be produced in multiple GP runs and employed to predict the labels of the test samples through majority voting. The method is applied to four microarray data sets covering brain cancer [5], prostate cancer [20], breast cancer [8, 4, and 6] and lung carcinoma [7]. The accuracy obtained is better than the average accuracy of a single rule or set of rules. The method seems to be appropriate for the prediction of the labels of test samples.

Nodal staging has been identified as an independent indicator of prognosis. Cancer of the urinary bladder is a major epidemiological problem that continues to grow every year. Arpit A. Almal et al. in [2] made use of genetic programming as an appropriate learning and hypothesis generation tool in a biological/clinical setting. Overfitting is an important concern in any machine learning task, especially in classification. There are several approaches to counter the overfitting problem: using simple rules, increasing the number of training samples, using a subset of test samples and integrating over different predictors. GP is considered to be a powerful search algorithm with a penchant for overfitting. The difficulty with feature selection when there are many features is that it is an NP-hard problem and, unless some information loss is acceptable, a huge amount of computational effort has to be spent to discover the most significant combination of features. GP can leverage its population-based method to serve as a powerful feature selection tool, with the computational burden alleviated by parallelization. To select features using GP, statistics extracted from multiple runs with different parameters can be used.

Cancer classification based on DNA array data is still a challenging task. Jinn-Yi Yeh et al. in [12] apply a genetic algorithm (GA) with an initial solution provided by t-statistics (t-GA) to select a group of relevant genes from cancer microarray data. A decision tree based classifier is then built on the selected genes. It has the highest accuracy rate, and the accuracy rate remains stable when the number of genes is changed. Gene expression occurs in the process of transcribing a gene’s DNA sequence into RNA. A gene expression level indicates the approximate number of copies of that gene’s RNA produced in a cell, and it is correlated with the amount of the corresponding protein made. Genetic algorithms, t-statistics, correlation-based selection, information gain and decision trees were used as gene selection methods. To evaluate the performance of the cascade of t-statistics, GA and decision tree, the Colon, Leukemia, Lymphoma, Lung and Central Nervous System datasets are used. For the Colon data set, the average accuracy is 89.24% for t-GA, 88.80% for GA, 77.42% for t-statistics, 77.26% for Infogain and 69.35% for GS.

D. Cluster Based Algorithms

A non-hierarchical approach to forming good clusters is to specify a desired number of clusters, say k, and then assign each case (object) to one of the k clusters so as to minimize a measure of dispersion within the clusters. A very common measure is the sum of distances or the sum of squared Euclidean distances from the mean of each cluster. The problem can be set up as an integer programming problem, but because solving integer programs with a large number of variables is time consuming, clusters are often computed using a fast, heuristic method that generally produces good (but not necessarily optimal) solutions. The k-means algorithm is one such method.

Instead of classifying cancers based on microscopic histology and tumor morphology, microarray technology significantly improves the discovery rates of the different types of cancers by monitoring thousands of gene expressions in a parallel, rapid and efficient manner. Gene expression patterns are useful for classification, diagnosis and understanding of diseases [12]. Leung et al. in [18] proposed a new method for gene selection. This method uses pair-wise data comparisons instead of the one-over-the-rest approach. The method involves using the Signal-to-Noise Ratio (SNR) across the pairs of groups one by one. SNR is


computed between any two groups. Genes selected by the pair-wise SNR method give much better differentiation of the samples from different classes. Results are analyzed using hierarchical clustering and k-means clustering. The best accuracy achieved is 95%, while it is only 83% using the one-over-the-rest approach.

Supoj Hengpraprohm et al. in [22] present a method for selecting informative features using k-means clustering and SNR ranking. The performance of the proposed method was tested on cancer classification problems, with genetic programming employed as the classifier. Due to the clustering technique, features which are similarly expressed are grouped into the same cluster, and the best feature in each cluster is selected by SNR ranking. The clustering step assures that the selected genes have low redundancy; hence the classifier can exploit these features to obtain better performance. Each feature selected by SNR ranking provides useful information to the learning algorithm. The experimental results suggest that the proposed system gives higher accuracy than using SNR ranking alone and higher than using all the genes in classification.

Using gene expression data for cancer classification is one of the most popular research topics in bioinformatics. Larry T. H. Yu et al. in [17] proposed an Emerging Pattern based Projected Clustering (EPPC) approach to cope with the cancer detection problem. The proposed EPPC algorithm includes three phases, namely initialization, iteration and refinement. The initialization phase picks the initial cluster seeds for the iteration phase. In the iteration phase, data points are assigned to different clusters and the projected dimensions of the newly formed clusters are evaluated. The iteration phase continues to improve the quality of the clusters until the user-specified number of clusters is obtained. Once the set of best cluster seeds is obtained, the refinement phase starts and all the data points are reassigned to the cluster seeds obtained by the iteration phase to form the final clusters. The method gives comparable accuracy and more readability to the end users.

III. SCOPE FOR FUTURE RESEARCH

Among the algorithms surveyed, machine learning based classification algorithms perform well for cancer classification. As gene selection is the pre-step for cancer classification, fuzzy based gene selection techniques can be incorporated to improve the performance of gene selection. Further, machine learning based methodology can be tuned for higher accuracy and to speed up the process of cancer classification.

IV. CONCLUSION

Cancer classification is an emerging research area in the field of bioinformatics. Several data mining algorithms are used for gene selection and cancer classification. In this survey we discussed various data mining techniques for gene selection and cancer classification. The related work section discusses various neural network based algorithms for gene selection and cancer classification. In gene classification, NNs are used for rapid annotation of the molecular sequencing data being generated by the Human Genome Project. The FNN method is used to find the smallest set of genes that can ensure highly accurate classification of cancers from microarray data: after some top genes in the importance ranking list are selected, the selected genes are fed into a classifier such as a Fuzzy Neural Network or a Support Vector Machine. Machine learning based algorithms for gene selection and cancer classification were discussed in detail in Section II. Three important machine learning algorithms, namely the Support Vector Machine, the Extreme Learning Machine and the Relevance Vector Machine, were discussed. The SVM based algorithms focus on better gene selection and higher classification accuracy. ELM can perform multicategory classification directly without any modification; it achieves higher classification accuracy than other algorithms such as ANN, SANN and SVM, with less training time and a smaller network structure. The RVM approach improves classification accuracy, which indicates better and more effective disease diagnosis and can ultimately make a significant difference in a patient's chances of recovery; it leads to comparable accuracy and shorter running time. The survey also discusses genetic algorithm based methods such as the Majority Voting Genetic Programming Classifier (MVGPC) for the classification of microarray data. These evolve multiple rules with genetic programming (GP) and then apply those rules to test samples to determine their labels with a majority voting technique, which overcomes the problem of overfitting. In the t-GA approach, a decision tree based classifier is built; it has the highest accuracy rate, and the accuracy rate remains stable when the number of genes is changed. Additionally, the survey discusses various cluster based algorithms which focus on a higher accuracy rate for cancer classification. The clustering step assures that the selected genes have low redundancy; hence the classifier can exploit these features to obtain better performance.

REFERENCES
[1] Alper Kucukural, Reyyan Yeniterzi, Suveyda Yeniterzi, O. Ugur Sezerman, “Evolutionary Selection of Minimum Number of Features for Classification of Gene Expression Data Using Genetic Algorithms,” GECCO’07, ACM 978-1-59593-697-4/07/0007, July 2007.
[2] Arpit A. Almal, Anirban P. Mitra, Ram H. Datar, Peter F. Lenehan, David W. Fry, Richard J. Cote, William P. Worzel, “Using genetic programming to classify node positive patients in Bladder Cancer,” ACM, 2006.
[3] Boyang Li, Liangpeng Ma, Jinglu Hu, and Kotaro Hirasawa, “Gene Classification Using An Improved SVM Classifier with Soft Decision Boundary,” SICE Annual Conference, August 2008.
[4] Cathy H. Wu, His-Lien Chen, “Gene Classification Artificial Neural System,” IEEE 0-8186-7116-5/95, 1995.
[5] N. Dhiman, R. Bonilla, D. J. O. Kane and G. A. Poland, “Gene Expression Microarrays: a 21st Century Tool for Directed Vaccine Design,” Vaccine, vol.20, pp.22-30, 2001.
[6] K. Duan and J. C. Rajapakse, “A Variant of SVM-RFE for Gene Selection in Cancer Classification with Expression Data,” Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 49-55, 2004.
[7] T. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler, “Support Vector Machine Classification and Validation of Cancer Tissues Samples Using Microarray Expression Data,” Bioinformatics, vol.16, pp. 906-914, 2000.


[8] G. B. Huang and C. K. Siew, “Extreme Learning Machine: RBF Network Case,” Proceedings of Eighth International Conference on Control, Automation, Robotics, and Vision, Dec. 2004.
[9] G. B. Huang and C. K. Siew, “Extreme Learning Machine with Randomly Assigned RBF Kernels,” International Journal of Information Technology, vol.11, no.1, 2005.
[10] G. B. Huang, P. Saratchandran, and N. Sundararajan, “Fully Complex Extreme Learning Machine,” Neurocomputing, vol.68, pp.306-314, 2005.
[11] G. B. Huang, K. Z. Mao, Q. Y. Zhu, C. K. Siew, P. Saratchandran, and N. Sundararajan, “Can Threshold Network be Trained Directly?” IEEE Transactions on Circuits and Systems II, vol.53, no.3, pp. 187-191, 2006.
[12] Jinn-Yi Yeh, Tai-Shi Wu, Min-Che Wu, Der-ming Chang, “Applying Data Mining Techniques for Cancer Classification from Gene Expression Data,” IEEE, 0-7695-3038-9/07, 2007.
[13] Junying Zhang, Richard Lee, Yue Joseph Wang, “Support Vector Machine Classifications for Microarray Expression Data Set,” IEEE, 0-7695-1957-1/03, 2003.
[14] Jung Yi Lin, “Cancer Classification Using Microarray and Layered Architecture Genetic Programming,” ACM, 978-1-60558-505-5, July 2009.
[15] Kaibo Duan and Jagath C. Rajapakse, “A Variant of SVM_RFE for Gene Selection in Cancer Classification with Expression Data,” IEEE 0-7803-8728-7/04, 2004.
[16] Kai-Bo Duan, Jagath C. Rajapakse, Haiying Wang, Francisco Azuaje, “Multiple SVM-RFE for Gene Selection in Cancer Classification with Expression Data,” IEEE, 1536-1241, 2005.
[17] Larry T. H. Yu, Fu-Lai Chung, Stephen C. F. Chan and Simon M. C. Yuen, “Using Emerging Pattern Based Projected Clustering and Gene Expression Data for Cancer Detection,” Australian Computer Society, Inc, 2004.
[18] Leung, Chang, Hung and Fung, “Gene Selection in Microarray Data Analysis for Brain Cancer Classification,” IEEE, 1-4244-0385-5/06, 2006.
[19] Lipo Wang, Feng Chu, Wei Xie, “Accurate cancer classification using expression of very few genes,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, January 2007.
[20] P. Pavlidis, “Gene Functional Analysis from Heterogeneous Data,” Proceedings on Conference Research in Computational Molecular Biology (RECOMB), pp. 249-255, 2001.
[21] Runxuan Zhang, Guang-Bin Huang, Narasimhan Sundararajan and P. Saratchandran, “Multicategory Classification Using an Extreme Learning Machine for Microarray Gene Expression Cancer Diagnosis,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.4, no.3, July-September 2007.
[22] Supoj Hengpraprohm and Prabhas Chongstitvatana, “Selecting Informative Genes from Microarray Data for Cancer Classification with Genetic Programming Classifier Using K-Means Clustering and SNR Ranking,” IEEE, 0-7695-2999-2, 2007.
[23] Topon Kumar Paul and Hitoshi Iba, “Prediction of Cancer Class with Majority Voting Genetic Programming Classifier using Gene Expression Data,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.6, no.2, April-June 2009.
[24] Wei Luo, Lipo Wang, Jingjing Sun, “Feature Selection for Cancer Classification Based on Support Vector Machine,” IEEE 978-0-7695-3571-5, 2009.
[25] Wen Zhang, Juan Liu, “Gene Selection for Cancer Classification Using Relevance Vector Machine,” IEEE, 1-4244-1120-3/07, 2007.
[26] Yuchun Tang, Yan-Qing Zhang, Zhen Huang, “Development of Two-Stage SVM-RFE Gene Selection Strategy For Microarray Expression Data Analysis,” Computational Biology and Bioinformatics, vol.4, pp.683-705, 2007.

Lt. Dr. S. Santhosh Baboo, aged forty, has around seventeen years of postgraduate teaching experience in Computer Science, which includes six years of administrative experience. He is a member of the board of studies in several autonomous colleges, and designs the curriculum of undergraduate and postgraduate programmes. He is a consultant for starting new courses, setting up computer labs, and recruiting lecturers for many colleges. Equipped with a Masters degree in Computer Science and a Doctorate in Computer Science, he is a visiting faculty to IT companies. It is customary to see him at several national/international conferences and training programmes, both as a participant and as a resource person. He has been keenly involved in organizing training programmes for students and faculty members. His good rapport with the IT companies has been instrumental in on/off campus interviews, and has helped the postgraduate students to get real time projects. He has also guided many such live projects. Lt. Dr. Santhosh Baboo has authored a commendable number of research papers in international/national conferences/journals and also guides research scholars in Computer Science. Currently he is Senior Lecturer in the Postgraduate and Research department of Computer Science at Dwaraka Doss Goverdhan Doss Vaishnav College (accredited at ‘A’ grade by NAAC), one of the premier institutions in Chennai.

Mrs. S. Sasikala completed her under-graduation affiliated to Bharathiar University and her post-graduation and Master of Philosophy at Bharathidasan University. She is currently pursuing her Ph.D. in Computer Science at Dravidian University, Kuppam, Andhra Pradesh. She is working as Head, Department of Computer Science, Sree Saraswathi Thyagaraja College, Pollachi. She has organized various national and state-level seminars, technical symposiums, workshops and intercollegiate meets. She has participated in various conferences and presented papers. She has 9 years of teaching experience. Her research areas include Data Mining, Networking and Soft Computing.
