A Survey on Data Mining Techniques for Gene Selection and Cancer Classification
Dr. S. Santhosh Baboo
Reader, PG and Research Department of Computer Science, Dwaraka Doss Goverdhan Doss Vaishnav College, Chennai
santhos2001@sify.com
S. Sasikala
Head, Department of Computer Science, Sree Saraswathi Thyagaraja College, Pollachi
sasivenkatesh04@gmail.com
Abstract
Cancer research is one of the major research areas in the medical field. Classification is critically important for cancer diagnosis and treatment: accurate prediction of different tumor types has great value in providing better treatment and minimizing toxicity for patients. Previously, cancer classification has been based on morphological and clinical criteria. These conventional cancer classification methods are reported to have several limitations in their diagnostic ability. To gain better insight into the problem of cancer classification, systematic approaches based on global gene expression analysis have been proposed. The recent advent of microarray technology has allowed the simultaneous monitoring of thousands of genes, which has motivated the development of cancer classification using gene expression data. Though the field is still in its early stages of development, the results obtained so far seem promising. This survey presents the most widely used data mining techniques for gene selection and cancer classification. In particular, it focuses on algorithms proposed in four main emerging fields: neural network based algorithms, machine learning algorithms, genetic algorithms, and cluster based algorithms. In addition, it provides a general idea for future improvement in this field.
Keywords
Data Mining, Gene Selection, Cancer Classification, Neural Network, Support Vector Machine, Clustering, Genetic Algorithms.
I. INTRODUCTION
Data mining (also known as Knowledge Discovery in Databases, KDD) has been defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.” KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results. Cancer classification through gene expression data analysis has recently emerged as an active area of research. In recent years, numerous techniques have been proposed in the literature for gene selection and cancer classification. Data mining and knowledge extraction is an important problem in bioinformatics, and biological data mining is an emerging field of research and development. A large amount of biological data has been produced in recent years, and important knowledge can be extracted from these data by the use of data analysis techniques. This survey focuses on various data mining and machine learning techniques for proper gene selection, which leads to accurate cancer classification. It discusses various techniques and methods proposed earlier in the literature for biological data analysis. In particular, this survey focuses on algorithms proposed in four main emerging fields: neural network based algorithms, machine learning algorithms, genetic algorithms, and cluster based algorithms. In addition, it provides a general idea for future improvement in this field.
The remainder of this paper is organized as follows. Section II discusses various techniques and methods proposed earlier in the literature for gene selection and cancer classification. Section III briefly outlines directions for further work in this field. Section IV concludes the paper with a brief discussion.
II. RELATED WORK
Cancer classification is a challenging area in the field of bioinformatics. It uses machine learning, statistical, and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. Recent research has demonstrated that gene selection is a prerequisite step for cancer classification. This survey focuses on gene selection and cancer classification methods based on neural network based algorithms, machine learning based algorithms, genetic algorithms, and clustering algorithms.
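To make the idea of gene selection as a pre-step concrete, the following is a minimal sketch, not a method taken from any of the surveyed papers. It assumes a two-class problem, ranks genes with a simple signal-to-noise score, and classifies with a nearest-centroid rule on the selected genes only. All function names (`signal_to_noise`, `select_genes`, `fit_centroids`, `predict`, `make_toy_data`), the class labels, and the synthetic data are hypothetical and chosen purely for illustration.

```python
import math
import random
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Score one gene: |difference of class means| / (sum of class std devs)."""
    spread = stdev(values_a) + stdev(values_b)
    return abs(mean(values_a) - mean(values_b)) / spread if spread > 0 else 0.0


def select_genes(samples, labels, k):
    """Return the indices of the k highest-scoring genes (two classes assumed)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch assumes exactly two classes")
    group_a = [s for s, y in zip(samples, labels) if y == classes[0]]
    group_b = [s for s, y in zip(samples, labels) if y == classes[1]]
    n_genes = len(samples[0])
    scores = [
        signal_to_noise([s[g] for s in group_a], [s[g] for s in group_b])
        for g in range(n_genes)
    ]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]


def fit_centroids(samples, labels, genes):
    """Compute the mean expression profile of each class over the selected genes."""
    centroids = {}
    for c in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == c]
        centroids[c] = [mean(r[g] for r in rows) for g in genes]
    return centroids


def predict(sample, centroids, genes):
    """Assign the class whose centroid is closest in Euclidean distance."""
    point = [sample[g] for g in genes]
    return min(centroids, key=lambda c: math.dist(point, centroids[c]))


def make_toy_data(n_per_class=10, n_genes=50, informative=(3, 7), seed=0):
    """Synthetic expression matrix: only the 'informative' genes differ by class."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label in ("tumor_A", "tumor_B"):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == "tumor_B":
                for g in informative:
                    row[g] += 6.0  # strong, artificial class difference
            samples.append(row)
            labels.append(label)
    return samples, labels


if __name__ == "__main__":
    samples, labels = make_toy_data()
    genes = select_genes(samples, labels, k=2)
    centroids = fit_centroids(samples, labels, genes)
    print("selected genes:", sorted(genes))
    print("prediction:", predict(samples[0], centroids, genes))
```

Restricting the classifier to a handful of selected genes is what keeps it tractable when an array measures thousands of genes on only a few dozen samples. In practice the score and the classifier would be replaced by the techniques discussed in the subsections that follow.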
A. Neural Network Based Algorithms
An Artificial Neural Network (ANN), usually called a “Neural Network” (NN), is a mathematical or computational model that tries to simulate the structure and/or functional aspects of biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. Neural networks are non-linear statistical data modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.
A gene classification artificial neural system has been developed for rapid annotation of the molecular sequencing data being generated by the Human Genome Project. Cathy H. Wu et al. (1995) [4] designed an ANN system to classify new (unknown) sequences into predefined (known) classes. The work evaluates three neural network sequence classification systems: GenCANS-PIR for PIR superfamily placement of protein sequences, GenCANS_RDP for
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 1, April 2010, p. 216, http://sites.google.com/site/ijcsis/, ISSN 1947-5500