GA-ANN based Dominant Gene Prediction in Microarray Dataset


Published by: ijcsis on Dec 04, 2010
Copyright:Attribution Non-commercial

Manaswini Pradhan
Lecturer, P.G. Department of Information and Communication Technology, Fakir Mohan University, Orissa, India
E-mail: ms.manaswini.pradhan@gmail.com

Dr. B. Mittra
Reader, School of Biotechnology, Fakir Mohan University, Orissa, India

Dr. Sabyasachi Pattnaik
Reader, P.G. Department of Information and Communication Technology, Fakir Mohan University, Orissa, India

Dr. Ranjit Kumar Sahu
Assistant Surgeon, Post Doctoral Department of Plastic and Reconstructive Surgery, S.C.B. Medical College, Cuttack, Orissa, India
E-mail: drsahurk@yahoo.com
Abstract- Genome analysis of a human being offers useful insight into that person's ancestry and helps determine the person's weaknesses and susceptibilities to inherited diseases. The amount of accumulated genome data is increasing at a tremendous rate with the rapid development of genome sequencing technologies, and gene prediction is one of the most challenging tasks in genome analysis. Many tools have been developed for gene prediction, which remains an active research area. Gene prediction involves analyzing the entire genomic data accumulated in the database, so scrutinizing the predicted genes takes too much time. However, the computational time can be reduced and the process made more effective by selecting dominant genes. In this paper, a novel method is presented to predict the dominant genes of ALL/AML cancer. First, to train a feed-forward artificial neural network (FF-ANN), combinational data of the input dataset is generated and its dimensionality is reduced through Probabilistic Principal Component Analysis (PPCA). Then, the classified database of ALL/AML cancer is given as the training dataset to design the FF-ANN. After the FF-ANN is designed, the genetic algorithm is applied to the test input sequence and the fitness function is computed using the designed FF-ANN. After that, the genetic operations of crossover, mutation and selection are carried out. Finally, through analysis, the optimal dominant genes are predicted.

Keywords- gene prediction, microarray gene expression data, Probabilistic PCA (PPCA), dimensionality reduction, Artificial Neural Network (ANN), back propagation (BP), dominant gene, genetic algorithm.
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 8, November 2010, http://sites.google.com/site/ijcsis/, ISSN 1947-5500

I. INTRODUCTION

A huge quantity of genomic and proteomic data is accessible in the public domain. The capability to process this information in ways that are helpful to humankind is becoming more and more significant [1]. A fundamental step in understanding a genome is the computational recognition of genes, which is one of the challenges in the analysis of newly sequenced genomes. Accurate and fast tools are essential for analyzing genomic sequences and interpreting genes [2]. In such circumstances, conventional and modern signal processing techniques play a vital part in these fields [1]. Genomic signal processing (GSP) [11] is a comparatively novel area in bioinformatics. It deals with the use of traditional digital signal processing (DSP) techniques in the representation and analysis of genomic data.

The code for the chemical composition of a particular protein is contained in a gene, which is a segment of DNA. Genes function as templates for proteins and some other products, and mRNA is the main intermediary that translates gene information into genetically encoded molecules [4]. Genomic information is usually represented by sequences of nucleotide symbols in the strands of DNA molecules, by symbolic codons (triplets of nucleotides), or by symbolic sequences of amino acids in the corresponding polypeptide chains [2].

The gene expression microchip, perhaps the most rapidly expanding tool of genome analysis, enables simultaneous monitoring of the expression levels of tens of thousands of genes under diverse experimental conditions. It is a powerful tool for studying the collective response of genes to changes in their environment, and it also offers clues about the structure of the gene networks involved [3]. Nowadays, the expression levels of thousands of genes, possibly all genes in an organism, can be measured simultaneously in a single microarray experiment [4]. Microarray technology has become an essential tool for monitoring genome-wide expression levels [5]. The evaluation of gene expression profiles in a variety of organs using microarray technologies reveals individual genes, gene ensembles, and the metabolic pathways underlying the structural and functional organization of an organ and its physiological function [6]. With microarray technology the diagnostic task can be automated and the precision of conventional diagnostic techniques can be improved, since it allows thousands of gene expressions to be examined simultaneously [7].

Microarray technology, which simultaneously measures the expression levels of tens of thousands of genes, makes it possible to characterize cells efficiently at the molecular level [8]. Gene expression analysis [10] [12] based on microarray technology has broad potential for exploring the biology of cells and organisms [9]. Microarray technology also assists the accurate prediction and diagnosis of diseases. Gene identification is used to predict the complete gene structure, in particular the precise exon-intron structure of a gene in a eukaryotic genomic DNA sequence. After sequencing, finding the genes is one of the first and most significant steps in understanding the genome of a species [13]. Gene finding is the field of computational biology concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This covers not only protein-coding genes but also other functional elements such as RNA genes and regulatory regions [14]. Some of the research works on gene prediction are [15], [16], [17] and [18].

In this paper, we propose an effective gene prediction technique which predicts the dominant genes. Initially, the classified microarray gene dataset (either Acute Myeloid Leukemia (AML) or Acute Lymphoblastic Leukemia (ALL)), which is of high dimension, is reduced through Probabilistic Principal Component Analysis (PPCA) to generate the training dataset for the neural network.
Consequently, the feed-forward ANN is designed using the training data, and the genetic algorithm is then used to predict the dominant genes of ALL/AML cancer. The gene that causes either AML or ALL is thus predicted without analyzing the entire database. The rest of the paper is organized as follows. Section 2 details the genetic algorithm, and Section 3 presents a brief review of some existing works in gene prediction. The proposed effective gene prediction technique is detailed in Section 4. Section 5 describes the results and discussion. The conclusions are summed up in Section 6.

II. GENETIC ALGORITHM

Genetic Algorithms (GAs) are computer programs that simulate the heredity and evolution of living organisms [27]. Because GAs are multi-point search methods, an optimal solution is possible even for multimodal objective functions. Moreover, GAs are applicable to problems with a discrete search space. Hence, the GA is not only very simple to use but also a very powerful optimization tool [28]. The search space of a GA consists of strings, termed chromosomes, each of which represents a candidate solution to the problem. The objective function value of a chromosome is its fitness value. A set of chromosomes together with their associated fitness values is termed a population. The populations produced in successive iterations of the genetic algorithm are termed generations [29].

New generations (offspring) are generated using crossover and mutation. Crossover splits two chromosomes and creates two new chromosomes by combining one split part from each. Mutation changes a single bit of a chromosome. The chromosomes with the best fitness values under a given fitness criterion are retained, while the other chromosomes are removed. The process is repeated until one chromosome has the best fitness value, and that chromosome is selected as the solution to the problem [30].
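The crossover, mutation and selection steps described in this section can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the bit-string encoding, the population size, the fixed number of generations (used here in place of the stopping rule described above), and the `onemax` fitness function (a simple stand-in for an FF-ANN-based fitness) are all assumptions chosen for illustration.

```python
import random


def crossover(a, b, rng):
    """One-point crossover: split both parents at one point and swap tails."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chrom, rng):
    """Flip a single randomly chosen bit of the chromosome."""
    i = rng.randrange(len(chrom))
    return chrom[:i] + [1 - chrom[i]] + chrom[i + 1:]


def run_ga(fitness, length=20, pop_size=30, generations=50, seed=0):
    """Evolve bit-string chromosomes; all default values are illustrative."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: retain the fitter half, remove the rest.
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]
        # Offspring: crossover of two survivors, then single-bit mutation.
        offspring = []
        while len(survivors) + len(offspring) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            c1, c2 = crossover(p1, p2, rng)
            offspring.extend([mutate(c1, rng), mutate(c2, rng)])
        population = survivors + offspring[:pop_size - len(survivors)]
    return max(population, key=fitness)


def onemax(chrom):
    """Hypothetical fitness: number of 1-bits (stand-in for an ANN score)."""
    return sum(chrom)


best = run_ga(onemax)
print(onemax(best), best)
```

Because the fitter half of each population is carried over unchanged, the best fitness value in this sketch never decreases from one generation to the next.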
III. REVIEW ON RELATED RESEARCHES

A handful of recent research works available in the literature are briefly reviewed in this section.

A computational technique for patient outcome prediction was introduced by Huiqing Liu et al. [19]. Two extreme types of patient samples were used in the training phase of this technique:
1) short-term survivors, who had an unfavorable outcome within a short period, and
2) long-term survivors, who maintained a positive outcome after a long follow-up time.
These extreme training samples provided a clear basis for identifying relevant genes whose expression was closely related to the outcome. The selected extreme samples and the important genes were then combined using a support vector machine to construct a prediction model. Using that model, every validation sample is assigned a risk score that falls into one of the pre-defined risk groups. The technique was applied to several public datasets. In several cases, as seen in their Kaplan–Meier curves, patients in the high and low risk groups assigned by the technique have clearly distinguishable outcomes. The authors also established that selecting only extreme patient samples for training is effective for improving prediction accuracy when different gene selection techniques are employed.

miTarget, an SVM classifier for miRNA target gene prediction, was introduced by Kim et al. [20]. It employed a radial basis function kernel as the similarity measure for SVM features, which were categorized as structural, thermodynamic, and position-based. These features were presented for the first time and reflect the mechanism of miRNA binding. Using a biologically relevant data set obtained from the literature, the SVM classifier achieved higher performance than previous tools. Significant functions for human miR-1, miR-124a, and miR-373 were determined using Gene Ontology (GO) analysis, and a feature selection experiment explained the importance of pairing at positions 4, 5, and 6 in the 5' region of a miRNA. The authors also presented a web interface for the program.

Based on the observation that a majority of exon sequences have a 3-base periodicity while intron sequences lack this characteristic, a technique to predict protein coding regions was developed by Changchuan Yin et al. [21]. Using the nucleotide distributions in the three codon positions of the DNA sequences, this technique computed the 3-base periodicity and the background noise of stepwise DNA segments of the target DNA sequences. Exon and intron sequences can be recognized from the trends of the ratio of the 3-base periodicity to the background noise in the DNA sequences. Case studies on genes from diverse organisms illustrated that the proposed technique was an efficient means of exon prediction.

A gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach was proposed by Hoff et al. [22]. First, linear discriminants for monocodon usage, dicodon usage and translation initiation sites were used to extract features from DNA sequences. Second, an artificial neural network combined these features with open reading frame length and fragment GC-content to compute the probability that the open reading frame encodes a protein. This probability was used for classifying and scoring the gene candidates. With large-scale training, the technique produced fast single-fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, the technique can predict translation initiation sites with high reliability and distinguish complete genes from incomplete genes. Large-scale machine learning techniques were found to be well suited for predicting genes in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is very promising and should be considered for incorporation into metagenomic analysis pipelines.
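The 3-base periodicity measure reviewed above for [21] can be illustrated with a short Python sketch. This is an assumption-laden illustration, not the published algorithm of [21]: it counts each nucleotide at the three codon positions, evaluates the Fourier power of the four nucleotide indicator sequences at period 3 from those counts, and divides by the sequence length. For a sequence containing only A, C, G and T, Parseval's theorem gives a mean total spectral power equal to the sequence length, which is used here as the background noise level. The function name is hypothetical.

```python
import cmath


def three_base_periodicity_ratio(seq):
    """Ratio of period-3 spectral power to mean spectral power (sketch)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    # counts[base][k] = occurrences of base at codon position k (0, 1, 2).
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for i, base in enumerate(seq):
        if base in counts:  # symbols other than A, C, G, T are ignored
            counts[base][i % 3] += 1
    # Primitive cube root of unity: the DFT kernel at period 3.
    w = cmath.exp(-2j * cmath.pi / 3)
    power = 0.0
    for position_counts in counts.values():
        component = sum(c * w ** k for k, c in enumerate(position_counts))
        power += abs(component) ** 2
    # Mean total power over all frequencies equals n for A/C/G/T sequences.
    return power / n


print(three_base_periodicity_ratio("ATG" * 30))
print(three_base_periodicity_ratio("A" * 90))
```

A strongly periodic sequence such as a repeated codon gives a large ratio, while a sequence whose nucleotides are spread evenly over the three codon positions gives a ratio near zero. Applying the function to a sliding window would give the kind of stepwise profile described above.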
 
Based on the physicochemical features of codons computed from molecular dynamics (MD) simulations, an ab initio model for gene prediction in prokaryotic genomes was introduced by Poonam Singhal et al. [15]. For every codon, the model requires three computed quantities: the double-helical trinucleotide base pairing energy, the base pair stacking energy, and a codon propensity index for protein-nucleic acid interactions. Fixing these three parameters for each codon enables the computation of the magnitude and direction of a cumulative three-dimensional vector for a DNA sequence of any length in all six genomic reading frames. Analysis of 372 genomes containing 350,000 genes confirmed that the orientations of the gene and non-gene vectors were significantly apart, and that genic and non-genic sequences could be clearly distinguished at a level comparable or superior to currently available knowledge-based models trained on empirical data. This provides strong evidence for the likelihood of a unique and valuable physicochemical classification of DNA sequences from codons to genomes.
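The cumulative-vector computation described above can be sketched in Python as follows. The sketch only shows the bookkeeping over the six reading frames; the per-codon parameter table `TOY_PARAMS` is a hypothetical placeholder with made-up values, not the MD-derived quantities of [15], and the function names are likewise assumptions.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def cumulative_vector(seq, offset, codon_params):
    """Sum the 3-D parameter vectors of the codons read from `offset`."""
    total = [0.0, 0.0, 0.0]
    for i in range(offset, len(seq) - 2, 3):
        params = codon_params.get(seq[i:i + 3])
        if params is None:  # codon missing from the table: skip it
            continue
        for axis in range(3):
            total[axis] += params[axis]
    return tuple(total)


def six_frame_vectors(seq, codon_params):
    """Cumulative vectors for three forward and three reverse frames."""
    seq = seq.upper()
    strands = (("+", seq), ("-", reverse_complement(seq)))
    return {(strand, offset): cumulative_vector(s, offset, codon_params)
            for strand, s in strands for offset in range(3)}


def magnitude(vector):
    """Euclidean length of a 3-D vector."""
    return math.sqrt(sum(c * c for c in vector))


# Hypothetical placeholder values, for illustration only.
TOY_PARAMS = {"ATG": (1.0, 0.0, 0.0), "CAT": (0.0, 1.0, 0.0)}

for frame, vec in six_frame_vectors("ATGATG", TOY_PARAMS).items():
    print(frame, vec, magnitude(vec))
```

With a real parameter table, the direction of each frame's vector would be compared against reference gene and non-gene orientations; that classification step is not shown here.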
 
NetAspGene, a dedicated, publicly available splice site prediction program for the genus Aspergillus, was developed by Kai Wang et al. [23]. Gene sequences from Aspergillus fumigatus, the most common mould pathogen, were used to build and test their model. Aspergillus contains smaller introns than many animals and plants; hence, to cover both the donor and acceptor site information, they applied a larger window size on single local networks for training. NetAspGene was applied to other Aspergilli, including Aspergillus nidulans, Aspergillus oryzae, and Aspergillus niger. Evaluation with independent data sets showed that NetAspGene performed significantly better splice site prediction than the other available tools.

A Bayesian kernel for the Support Vector Machine (SVM) was presented by Alashwal et al. [24] to predict protein-protein interactions. By incorporating the probabilistic characteristics of the existing experimental protein-protein interaction data, gathered from diverse sources, the classifier performance could be improved. In addition, the probabilistic outputs obtained from the Bayesian kernel help biologists direct further research toward the highly ranked interactions. The results showed that the Bayesian kernel improved the accuracy of the classifier compared with the standard SVM kernels, suggesting that protein-protein interactions can be predicted more accurately with the Bayesian kernel.

IV. PROPOSED DOMINANT GENE PREDICTION USING GENETIC ALGORITHM

Generally, using a large gene dataset for disease analysis increases the computation time and degrades the performance of the process. Hence, a technique that requires less computational time to predict dominant genes is essential. Therefore, an efficient technique is proposed