An Extensive Survey on Gene Prediction Methodologies

Published by: ijcsis on Nov 02, 2010
Copyright:Attribution Non-commercial

Manaswini Pradhan
Lecturer, P.G. Department of Information and Communication Technology, Fakir Mohan University, Orissa, India
E-mail: ms.manaswini.pradhan@gmail.com
Abstract- In recent times, Bioinformatics plays an increasingly important role in the study of modern biology. Bioinformatics deals with the management and analysis of biological information stored in databases. The field of genomics is dependent on Bioinformatics, a significant novel tool emerging in biology for finding facts about gene sequences, the interaction of genomes, and the unified working of genes in the formation of the final syndrome or phenotype. The rising popularity of genome sequencing has resulted in the utilization of computational methods for gene finding in DNA sequences. Recently, computer-assisted gene prediction has gained impetus and a tremendous amount of work has been carried out on this subject. An ample range of noteworthy techniques has been proposed by researchers for the prediction of genes. An extensive review of the prevailing literature related to gene prediction is presented, along with a classification by the techniques utilized. In addition, a succinct introduction to gene prediction is presented to acquaint the reader with the vital information on the subject.

Keywords- Genomic Signal Processing (GSP), gene, exon, intron, gene prediction, DNA sequence, RNA, protein, sensitivity, specificity, mRNA.
I. INTRODUCTION
Biology and biotechnology are undergoing a technological revolution that is transforming research into an information-rich enterprise. Bioinformatics is the application of computer technology to the management of biological information [3]. It is a fast-growing area of computer science that deals with the collection, organization, and analysis of DNA and protein sequences. Today it encompasses the construction and development of databases, algorithms, computational and statistical methods, and hypotheses for addressing the recognized and practical problems that arise in the management and analysis of biological data [1]. Arguably, the origin of bioinformatics can be traced back to Mendel's discovery of genetic inheritance in 1865. Bioinformatics research in a real sense, however, began in the late 1960s, as represented by Dayhoff's atlas of protein sequences and the early modeling analysis of protein and RNA structures [3].
Dr. Ranjit Kumar Sahu
Assistant Surgeon, Post Doctoral Department of Plastic and Reconstructive Surgery, S.C.B. Medical College, Cuttack, Orissa, India
E-mail: drsahurk@yahoo.co.in

Due to the availability of a large amount of genomic and proteomic data in the public domain, it is becoming increasingly important to process this information in ways that are valuable to humankind [4]. One of the challenges in the analysis of newly sequenced genomes is the computational recognition of genes, which is the fundamental step in understanding the genome. Evaluating genomic sequences and annotating genes requires precise and fast tools [5]. In this context, both established and recent signal processing techniques have played a significant role [4]. Genomic signal processing (GSP) is a comparatively new field in bioinformatics that deals with digital signal representations of genomic data and their analysis by means of conventional digital signal processing (DSP) techniques [6].

The genetic information of a living organism is stored in its DNA (deoxyribonucleic acid). DNA is a macromolecule in the form of a double helix, with pairs of bases between the two strands of the backbone. There are four bases, adenine, cytosine, guanine, and thymine, abbreviated with the letters A, C, G, and T respectively [1]. A gene is a fragment of DNA that contains the formula for the chemical composition of one individual protein. Genes serve as the blueprints for proteins and a few additional products. mRNA is the initial intermediate in the production of any genetically encoded molecule [8]. Genomic information is frequently represented by the sequences of nucleotide symbols in the strands of DNA molecules, by symbolic codons (triplets of nucleotides), or by the symbolic sequences of amino acids in the resulting polypeptide chains [5]. Genes and intergenic spaces are the two types of regions in a DNA sequence.
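The symbolic and digital representations mentioned above can be illustrated with a short sketch. The text does not say which numerical mapping GSP methods use; the example below assumes one simple choice, four binary indicator sequences (one per base), and the function names `indicator_sequences` and `codons` are hypothetical names chosen for illustration.

```python
def indicator_sequences(dna):
    """Map a DNA string to four binary sequences, one per base.

    Assumption: this binary-indicator mapping is just one possible
    digital signal representation; the survey does not prescribe one.
    """
    dna = dna.upper()
    return {base: [1 if s == base else 0 for s in dna] for base in "ACGT"}


def codons(dna):
    """Split a DNA string into consecutive triplets, dropping any remainder."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


signals = indicator_sequences("ATGA")
print(signals["A"])        # [1, 0, 0, 1]
print(codons("ATGGCCTA"))  # ['ATG', 'GCC']
```

Once a sequence is expressed as numeric signals like these, conventional DSP tools can in principle be applied to it, which is the idea behind GSP as described above.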
Proteins are the building blocks of every organism, and the information for generating proteins is stored in the genes, each of which is responsible for the construction of a distinct protein. Although every cell in an organism contains identical DNA, and hence identical genes, only a subset is expressed in any particular family of cells [1]. The exons and the introns are the two regions in the genes of eukaryotes. The exons, which are the protein-coding regions of a gene, are interspersed with interrupting sequences of introns. The biological significance of introns is still not well known; therefore they are termed non-protein-coding regions. The borders between the introns and the exons are called splice sites [9].

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 7, October 2010, http://sites.google.com/site/ijcsis/, ISSN 1947-5500

When a gene is expressed, it is first transcribed as pre-mRNA. It then goes through a process called splicing, in which the non-coding regions are removed. The mature mRNA, which contains no introns, serves as a template for the synthesis of a protein in translation. In translation, each codon, a group of three adjacent bases in the mRNA, directs the addition of one amino acid to the growing peptide. A protein is therefore a sequence of amino acid residues corresponding to the mRNA sequence of a gene [7]. The process is shown in Fig. 1.
Figure 1: Transcription of RNA, splicing of introns, and translation of protein.
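The three steps of Figure 1 can be mimicked on a toy sequence. The following is a minimal sketch, not a biological simulator: it assumes the DNA coding strand is given, that intron positions are already known as 0-based half-open intervals, and that translation starts at the first base. The toy sequence and the function names are made up for illustration; the codon table is the standard genetic code.

```python
# Standard genetic code, listed in U, C, A, G order ('*' marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand):
    """Pre-mRNA has the coding strand's sequence with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna, introns):
    """Remove introns, given as (start, end) half-open intervals."""
    exons, pos = [], 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[pos:start])
        pos = end
    exons.append(pre_mrna[pos:])
    return "".join(exons)


def translate(mrna):
    """Read codons from position 0 until a stop codon or the end."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


# Toy gene: exon 1, one intron, exon 2 (hypothetical sequence).
dna = "ATGGCC" + "GTAAGTTTTCAG" + "TTTTAA"
mature = splice(transcribe(dna), [(6, 18)])
print(mature)             # AUGGCCUUUUAA
print(translate(mature))  # MAF
```

In real gene prediction the intron coordinates are exactly what is unknown, which is why locating splice sites is central to the methods reviewed in this paper.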
One of the most important objectives of genome sequencing is to recognize all the genes. In eukaryotic genomes, the analysis of a coding region also depends on the accurate identification of exon-intron structures. However, the task is very challenging because of the vast length and structural complexity of sequence data [9]. In recent years, researchers have reported a wide range of gene prediction techniques for analysis, disease prediction, and more. In this paper, we present an extensive review of significant research on gene prediction along with its processing techniques. The prevailing literature on gene prediction is classified and reviewed extensively, and in addition we present a concise description of gene prediction. Section 2 gives a brief description of computational gene prediction. Section 3 provides an extensive review of significant research methods in gene prediction. Section 4 sums up the conclusion.
Figure 2: State diagram of gene structure. The mirror symmetry reflects the fact that DNA is double-stranded and genes appear on both strands. The 3-periodicity in the state diagram corresponds to the translation of nucleotide triplets into amino acids.
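The two properties noted in the caption of Figure 2, genes on both strands and a period of three, can be made concrete with a naive open-reading-frame scan. This is only an illustrative sketch and not one of the surveyed methods: it ignores introns entirely, and the function names and the `min_codons` threshold are assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Scan both strands and all three frames for ATG...stop stretches.

    Returns (strand, frame, start, end) tuples; coordinates are 0-based,
    half-open, and refer to the strand that was scanned.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


print(find_orfs("CCATGGCCTTTTAAGG"))  # [('+', 2, 2, 14)]
```

The outer loop over two strands mirrors the symmetry of the state diagram, and the step of three in the inner loop mirrors its 3-periodicity.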
II. COMPUTATIONAL GENE PREDICTION
Computational gene prediction is becoming increasingly important for the automatic analysis and annotation of large uncharacterized genomic sequences [2]. Gene identification aims to predict the complete gene structure, in particular the accurate exon-intron structure of a gene in a eukaryotic genomic DNA sequence. After sequencing, finding the genes is one of the first and most significant steps in understanding the genome of a species [40]. Gene finding usually refers to the area of computational biology concerned with algorithmically recognizing the stretches of sequence, generally genomic DNA, that are biologically functional. This involves not only protein-coding genes but may also include additional functional elements such as RNA genes and regulatory regions [16].

Genomic sequences constructed today have lengths on the order of many millions of base pairs. These sequences contain groups of genes that are separated from each other by long stretches of intergenic regions [10]. The gene identification problem is the problem of interpreting nucleotide sequences by computer in order to provide tentative annotation on the location, structure, and functional class of protein-coding genes [13]. Improving techniques for identifying genes in DNA sequences, for genome analysis, and for evaluating gene functions is therefore significant [12].

Gene identification efforts began almost 20 years ago and have produced a large number of practically effective systems [11]. Prediction of protein-coding genes includes the identification of correct splice and translation signals in DNA sequences [14]. However, prediction is difficult because of the exon-intron structure of eukaryotic genes. Introns are the non-coding regions that are spliced out at acceptor and donor splice sites [17].

Gene prediction involves the prediction of genes and of the proteins they encode [15]. The accuracy of gene prediction is calculated using the standard measures of sensitivity and specificity. For a feature such as a coding base, an exon, or a gene, the sensitivity is the number of correctly predicted features divided by the number of annotated features. The specificity is defined as the number of correctly predicted features divided by the number of predicted features. A predicted exon is considered correct if both of its splice sites are at the annotated positions of an exon. A predicted gene is considered correct if all of its exons are correctly predicted and there are no additional exons in the annotation. Predicted partial genes were counted as predicted genes [10]. The formulas for sensitivity and specificity are shown below.
Sensitivity: The fraction of identified genes (or bases or exons) which are correctly predicted.

$$S_n = \frac{TP}{\text{all true in reality}} = \frac{TP}{TP + FN}$$

where TP - True Positive, FN - False Negative

Specificity: The fraction of predicted genes (or bases or exons) which correspond to true genes.

$$S_p = \frac{TP}{\text{all true in prediction}} = \frac{TP}{TP + FP}$$

where FP - False Positive
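The two measures can be computed at the exon level as a minimal sketch, under the rule stated in the text that a predicted exon is correct only if both of its boundaries match an annotated exon. The coordinates below are invented for illustration, and `exon_level_accuracy` is a hypothetical helper name.

```python
def exon_level_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) for exons given as (start, end) pairs.

    An exon counts as a true positive only when both boundaries match exactly.
    """
    annotated, predicted = set(annotated), set(predicted)
    tp = len(annotated & predicted)   # predicted and annotated
    fn = len(annotated - predicted)   # annotated but missed
    fp = len(predicted - annotated)   # predicted but not annotated
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


# Invented example: four annotated exons, three predictions, two exact matches.
annotated = [(100, 200), (300, 400), (500, 650), (800, 900)]
predicted = [(100, 200), (300, 400), (500, 600)]
sn, sp = exon_level_accuracy(annotated, predicted)
print(sn)  # 0.5
```

Here TP = 2, FN = 2, and FP = 1, so the sensitivity is 2/4 and the specificity is 2/3. Note that the exon predicted as (500, 600) counts as both a false positive and a missed exon, because only one of its boundaries agrees with the annotation.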
III. EXTENSIVE REVIEW OF SIGNIFICANT RESEARCHES ON GENE PREDICTION
A wide range of research methodologies employed for analysis and prediction is presented in this section. The reviewed gene prediction methods are classified by their underlying mechanisms and detailed in the following subsections.

A. Support Vector Machine
Jiang Qian et al. [70] presented an SVM-based approach for predicting the targets of a transcription factor by recognizing subtle relationships between their expression profiles. In particular, they used SVMs to predict the regulatory targets of 36 transcription factors in the Saccharomyces cerevisiae genome, based on microarray expression data from many different physiological conditions. To incorporate a significant number of both positive and negative examples, they trained and tested their SVM on a data set constructed to address the data imbalance issue directly. This was non-trivial, because nearly all the known experimental information is for positives only. Overall, they found that 63% of their TF–target relationships were confirmed through cross-validation. They further assessed the performance of their regulatory network identifications by comparing them with the results of two recent genome-wide ChIP-chip experiments. They found the agreement between their results and these experiments to be comparable to the agreement (albeit low) between the two experiments. They also found that this network has a delocalized structure with respect to chromosomal positioning, with a given transcription factor having targets spread fairly evenly over the genome.

MicroRNAs (miRNAs) are small non-coding RNAs that play an important role as post-transcriptional regulators. The function of animal miRNAs generally depends on complementarity at their 5' components. Although numerous computational miRNA target-gene prediction techniques have been suggested, they still have limitations in revealing actual target genes. Kim et al. [38] introduced miTarget, an SVM classifier for miRNA target gene prediction. It uses a radial basis function kernel as the similarity measure for SVM features, which are categorized as structural, thermodynamic, and position-based features. It introduced these features for the first time, and they reflect the mechanism of miRNA binding. On a biologically relevant data set obtained from the literature, the SVM classifier achieved high performance compared with earlier tools. Using Gene Ontology (GO) analysis, they predicted significant functions for human miR-1, miR-124a, and miR-373 and, from a feature selection experiment, demonstrated the importance of pairing at positions 4, 5, and 6 in the 5' region of a miRNA. They also provided a web interface for the program.

Zafer Barutcuoglu et al. [67] introduced a Bayesian framework, based on functional taxonomy constraints, for combining multiple classifiers. A hierarchy of SVM classifiers was trained on multiple data types. To obtain the most probable consistent set of predictions, they combined the predictions in the suggested Bayesian framework. Experiments proved that the suggested Bayesian