SURVEY OF ONTOLOGIES IN BIOINFORMATICS by SHAKTHEEVEL.

INTRODUCTION TO ONTOLOGIES

Recent technological advances have produced an onslaught of biological information that is accessible online. In the post-genomic era, a major bottleneck is the coherent integration of all these public, online resources. Online bioinformatics databases are especially difficult to integrate because they are complex, highly heterogeneous, dispersed, and incessantly evolving. Online data are often described only in human-readable formats that are difficult for computers to analyze because they lack standardized structures. A large number of biomedical ontologies and databases are currently available, and more continue to be developed; there is even a site that tracks the publicly available sources. Ontologies have emerged because of the need for a common language that supports effective human and computer communication across scattered, personal sources of data and knowledge.

This survey covers ontologies and databases used in the bioinformatics community. It first considers ontologies concerned with medical and biological terminology and ontologies for organizing other ontologies. It then turns to the main XML-based ontologies for bioinformatics and to some of the many databases that have been developed for biomedical purposes. Each database has its own structure and therefore can be regarded as defining an ontology.

BIO-ONTOLOGIES

Ontologies are a versatile mechanism for understanding concepts and relationships. In this section the concern is with the human communication of biomedical concepts as well as with understanding of the knowledge. Two ontologies are discussed. The first, the Unified Medical Language System, was originally focused on medical terminology but now also includes many other biomedical vocabularies; it has grown to be impressively large but is sometimes incoherent as a result. The second, the Gene Ontology, focuses exclusively on terminology for genomics.

UNIFIED MEDICAL LANGUAGE SYSTEM

Terminology is the common denominator of all biomedical literature resources, including the names of organisms, tissues, cell types, genes, proteins, and diseases. Various controlled vocabularies, such as the Medical Subject Headings (MeSH), are associated with these resources. MeSH was developed by the U.S. National Library of Medicine (NLM). In 1986, NLM began a long-term research and development project to build the Unified Medical Language System (UMLS). The UMLS is a repository of biomedical vocabularies and is the NLM’s biological ontology (Lindberg et al. 1993; Baclawski et al. 2000; Yandell and Majoros 2002). The UMLS is composed of three main components: the Metathesaurus (META), the SPECIALIST lexicon and associated lexical programs, and the Semantic Network (SN) (Denny et al. 2003). The UMLS is a rich source of knowledge in the biomedical domain. It is used for research and development in a range of applications, including natural language processing (Baclawski et al. 2000; McCray et al. 2001).
Ref: www.nlm.nih.gov/research/umls

THE GENE ONTOLOGY

The most prominent ontology for bioinformatics is the Gene Ontology (GO). GO is produced by the GO Consortium, which seeks to provide a structured, controlled vocabulary for describing gene product function, process, and location (GO 2003, 2004). A description of a gene product using the GO terminology is called an annotation. One important use of GO is the prediction of gene function based on patterns of annotation. GO terms are classified into three categories:

a) Molecular function
b) Biological process
c) Cellular component

Many programs have been developed for profiling gene expression based on GO. Some of them are as follows.

DAG-Edit: DAG-Edit is an open source tool written in Java for browsing, searching, and modifying structured controlled vocabularies.
Ref: sourceforge.net/project/showfiles.php?group_id=36855

GenMAPP: This tool visualizes gene expression and other genomic data on maps representing biological pathways and groupings of genes.
Ref: www.GenMAPP.org

GoMiner: This program package organizes lists of “interesting” genes for biological interpretation. GoMiner provides quantitative and statistical output files.
Ref: discover.nci.nih.gov/gominer

NetAffx GO Mining Tool: This tool permits web-based, interactive traversal of the GO graph in the context of microarray data (Chang et al. 2004).
Ref: www.affymetrix.com/analysis/index.affx

FatiGO: This tool extracts GO terms that are significantly over- or under-represented in sets of genes within the context of a genome-scale experiment (Al-Shahrour et al. 2004).
Ref: fatgo.bioinfo.cnio.es

GOAL: The GO Automated Lexicon is a web-based application for the automated identification of functions and processes (Volinia et al. 2004).
Ref: microarrays.unife.it

Onto-Tools: This is a collection of tools for a variety of tasks, all of which involve the use of GO terminology (Draghici et al. 2003).
Ref: vortex.cs.wayne.edu/projects.html

DAVID: The Database for Annotation, Visualization and Integrated Discovery is a web-based tool for annotating gene lists (Dennis, Jr. et al. 2003).
Ref: david.niaid.nih.gov

GOTM: The GO Tree Machine is a web-based platform for interpreting microarray data or other interesting gene sets using GO (Zhang et al. 2004).
Ref: genereg.ornl.gov/gotm

ONTOLOGIES OF BIOINFORMATICS ONTOLOGIES

With the proliferation of biological ontologies and databases, the ontologies themselves need to be organized and classified.

OBO: The Open Biological Ontologies project seeks to collect ontologies for the domains of genomics and proteomics. It requires that these ontologies be open and use either GO or OWL syntax.
Ref: www.obo.sourceforge.net
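A term line in the GO flat-file syntax, one of the two syntaxes that OBO accepts, can be read mechanically. The Python sketch below assumes a simplified form of that syntax: each line starts with a relationship marker ($ for the root, % for is-a, < for part-of), followed by a term name in which special characters are escaped with a backslash, then a semicolon and an identifier. The function names are hypothetical, and real files can carry further fields (synonyms, additional parents) that this sketch does not handle.

```python
import re

# Relationship markers of the GO flat-file syntax (simplified assumption).
RELATIONS = {"$": "root", "%": "is_a", "<": "part_of"}


def unescape(text):
    """Drop the backslash that protects special characters such as ':' and ','."""
    return re.sub(r"\\(.)", r"\1", text)


def parse_term_line(line):
    """Split one flat-file line into relation, term name, and identifier."""
    body = line.strip()
    marker = body[0]
    if marker not in RELATIONS:
        raise ValueError("unknown relationship marker: " + marker)
    # Split on the first ';' that is not escaped with a backslash.
    name, identifier = re.split(r"(?<!\\);", body[1:], maxsplit=1)
    return {
        "relation": RELATIONS[marker],
        "name": unescape(name.strip()),
        "id": identifier.strip(),
    }


# Two ZFIN-style lines used purely as illustration.
sample_lines = [
    r"<001_Zygote\:1-cell\,embryo; ZFIN:0000004",
    r"<002_Cleavage\:2-cell\,embryo; ZFIN:0000017",
]
terms = [parse_term_line(line) for line in sample_lines]
for term in terms:
    print(term["id"], term["relation"], term["name"])
```

Keeping the relation, name, and identifier in a small dictionary makes it easy to load such terms into whatever graph or table structure a later analysis needs.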

Ontology in OBO of zygote development from the one-cell stage to the two-cell stage:

    $ structurers.goff; ZFIN:0000000
    <001_Zygote\:1-cell\,embryo; ZFIN:0000004
    <001_Zygote\:1-cell\,blastomere; ZFIN:0000001
    <001_Zygote\:1-cell\,Yolk; ZFIN:0000012
    <001_Zygote\:1-cell\,extraembryonic; ZFIN:0000005
    <001_Zygote\:1-cell\,chorion; ZFIN:0000002
    <002_Cleavage\:2-cell\,embryo; ZFIN:0000017
    <002_Cleavage\:2-cell\,blastomeres; ZFIN:0000013
    <002_Cleavage\:2-cell\,Yolk; ZFIN:0000025
    <002_Cleavage\:2-cell\,extraembryonic; ZFIN:0000018
    <002_Cleavage\:2-cell\,chorion; ZFIN:0000014

TAMBIS: TAMBIS is a project that aims to help researchers in biological science by building a homogenizing layer on top of various biological information services.
Ref: img.cs.man.ac.uk/tambis

ONTOLOGY LANGUAGES IN BIOINFORMATICS

This section covers the main XML-based ontologies that have been developed for bioinformatics. The number of such ontologies is large and continually increasing, so only some of them are listed below.

BSML: The Bioinformatics Sequence Markup Language (BSML) is a language that encodes biological sequence information and encompasses the graphical representation of biologically meaningful objects such as nucleotide or protein sequences.
Ref: www.bsml.org

BioML: The Biopolymer Markup Language provides an extensible framework for annotating experimental information about molecular entities such as proteins and genes.
Ref: www.rdcormia.com/COIN78/files/XML_Finals/BIOML/Pages/BIOML.htm

SBML: The Systems Biology Markup Language is an XML-based language for storing biochemical models (Hucka et al. 2003).
Ref: www.sbw-sbml.org

MAGE-ML: The MicroArray Gene Expression Markup Language is an XML ontology for microarray data. MAGE-ML aims to create a common data format so that data can be shared easily between projects (Stoeckert, Jr. et al. 2002).
Ref: www.mged.org

CellML: The CellML ontology is being developed by Physiome Sciences Inc. The purpose of CellML is to store and exchange computer-based biological models.
Ref: www.cellml.org

RNAML: RNAML provides a standard syntax that allows for the storage and exchange of information about RNA sequences as well as secondary and tertiary structures.
Ref: www.1bit.iro.umontreal.ca/rnaml

AGAVE: The Architecture for Genomic Annotation, Visualization and Exchange is an XML language created by DoubleTwist, Inc. for representing genomic annotation data.
Ref: www.animorphics.net/lifesci.html

CML: The purpose of the Chemical Markup Language (CML) is to manage chemical information. CML is supported by tools such as the popular Jumbo browser.
Ref: www.xml-cml.org

GAME: GAME is an XML language for the curation of DNA, RNA, or protein sequences. GAME uses an XML DTD to specify the syntactic structure of the content of a GAME document.
Ref: www.fruitfly.org/comparative

NeuroML: The Neural Open Markup Language is an XML language for describing models, methods, and literature for neuroscience (Goddard et al. 2002).
Ref: www.neuroml.org/main.html

TML: The Taxonomical Markup Language is mainly an XML format for representing the topology of a phylogeny, but it also represents statistical metadata describing the phylogeny (Gilmour 2000).

NUCLEOTIDE SEQUENCE DATABASES

GenBank: GenBank is a comprehensive database that contains publicly available DNA sequences for more than 140,000 named organisms. The sequences are primarily obtained through submissions from individual laboratories and batch submissions from large-scale sequencing projects (Benson et al. 2004).
Ref: www.ncbi.nlm.nih.gov/Genbank

EMBL: The EMBL Nucleotide Sequence Database, maintained at the European Bioinformatics Institute (EBI), incorporates, organizes, and distributes nucleotide sequences from public sources (Kulikova et al. 2004). The database is part of an international collaboration with DDBJ and GenBank.
Ref: www.ebi.ac.uk/embl

DDBJ: DDBJ is maintained at the National Institute of Genetics in Japan. It is available in several formats, including FASTA and XML. The XML format is defined by the DTD at ftp://ftp.ddbj.nig.ac.jp/database/ddbj.xml/DDBJXML.dtd.

PROTEIN SEQUENCE DATABASE

SWISS-PROT: SWISS-PROT is the most widely used publicly available protein sequence database. This database aims to be nonredundant, fully annotated, and highly cross-referenced (Jung et al. 2001). The XML format is defined both as a DTD and using XSD.
Ref: www.au.expasy.org/sprot / ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.xml.gz
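Several of the sequence databases in this survey distribute their records in the FASTA format, in which each record is a header line starting with ">" followed by one or more lines of sequence. As a minimal sketch of how such a file can be read, the Python function below groups lines into (header, sequence) pairs. The function name read_fasta and the two sample records are hypothetical and are not taken from any of the databases.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records used only to exercise the parser.
sample = """>seq1 hypothetical example record
ATGGCGTACG
TTAGC
>seq2 another hypothetical record
GGCCTTAA
"""
records = list(read_fasta(sample.splitlines()))
for header, sequence in records:
    print(header, len(sequence))
```

Because the function accepts any iterable of lines, the same code can be applied to an open file object; a full-featured library would be preferable for production work.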
NUCLEOTIDE STRUCTURE DATABASES

NDB: The most prominent nucleotide structure database is the Nucleic Acid Database. NDB was established in 1991 as a resource to assemble and distribute structural information about nucleic acids (Berman et al. 1992).
Ref: www.ndbserver.rutgers.edu

PROTEIN STRUCTURE DATABASE

Protein structure databases deal with progressively ‘higher-order’ types of structure: secondary, tertiary, quaternary, and functional.

Pfam: The Protein Family database is a large collection of protein families and domains (Bateman et al. 2004). The Pfam database is available in the FASTA format.
Ref: www.sanger.ac.uk/software/Pfam

SMART: The Simple Modular Architecture Research Tool is a web tool for the identification and annotation of protein domains, and it provides a platform for the comparative study of complex domain architectures in genes and proteins.
Ref: www.smart.embl.de

PROSITE: PROSITE is a compilation of sites and patterns found in protein sequences (Sigrist et al. 2002; Hulo et al. 2004). The use of protein sequence patterns to determine protein function has become one of the essential tools in sequence analysis. PROSITE is closely related to the SWISS-PROT protein sequence databank.
Ref: www.expasy.org/prosite

BLOCKS: Blocks are defined as ungapped multiple alignments corresponding to the most conserved regions of proteins. Blocks contain “multiple alignment” information, and the use of the BLOCKS database can improve the detection of sequence similarities in searches of sequence databases. The BLOCKS database contains more than 24,294 blocks from nearly 5000 different protein groups (Henikoff et al. 2000).
Ref: www.blocks.fhcrc.org

COG: The database of Clusters of Orthologous Groups of proteins (COGs) attempts to give a phylogenetic classification of the proteins encoded in 21 complete genomes of bacteria, archaea, and eukaryotes (Tatusov et al. 2000).
Ref: www.ncbi.nlm.nih.gov/COG

PRINTS: PRINTS is a compendium of protein fingerprints (Attwood et al. 1999, 2003). It is also available in the FASTA format.
Ref: www.umber.sbs.man.ac.uk/dbbrowser/PRINTS

ProDom: ProDom is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases (Servant et al. 2002).
Ref: http://protein.toulouse.inra.fr/prodom/current/html/home.php

PDB: The Protein Data Bank is the largest source of publicly available biomolecular 3D structures (Bateman et al. 2004). PDB was established at Brookhaven National Laboratory (BNL) in 1971 as an archive for biological macromolecular crystal structures. The PDB database has two non-XML formats, PDB and mmCIF, that are in use by many other structure databases. The current XML schema file is located at www.sit.pdb.org/pdbml/pdbx-vxsd.
Ref: www.rcsb.org/pdb

SCOP: The Structural Classification of Proteins database classifies proteins by domains that have a common ancestor based on sequence, structural, and functional evidence (Murzin et al. 1995).

DIP: The Database of Interacting Proteins is a research tool for studying cellular networks of protein interactions (Salwinski et al. 2004).
Ref: www.dip.doe-mbi.ucla.edu

MINT: The Molecular INTeraction database is a relational database containing interaction data between biological molecules (Zanzoni et al. 2002).
Ref: http://160.80.34.4/mint

HPID: The Human Protein Interaction Database was designed to provide human protein interaction data precomputed from existing structural and experimental data using appropriate methods.
Ref: http://wilab.inha.ac.kr/hpid/

TRANSCRIPTION FACTOR DATABASES

TRANSFAC: The most complete transcription factor database is TRANSFAC (Wingender et al. 1996). This database is concerned with eukaryotic transcription regulation.
Ref: www.transfac.gbf.de

COMPEL: COMPEL is a database of composite regulatory elements, the basic structures of combinatorial regulation. Access to COMPEL requires registration, but it is free for noncommercial use.
Ref: www.comple.bionet.nsc.ru

SPECIES-SPECIFIC DATABASES

SGD: The Saccharomyces Genome Database is a database of the molecular biology and genetics of the budding yeast Saccharomyces cerevisiae.
Ref: www.yeastgenome.org

FlyBase: The fruit fly, Drosophila melanogaster, is one of the most studied eukaryotic organisms and a central model for the Human Genome Project (FlyBase 2002).
Ref: www.flybase.bio.indiana.edu

MGD: The Mouse Genome Database at the Jackson Laboratory in Bar Harbor, Maine, is a resource for mouse genome information.
Ref: www.informatics.jax.org

SPECIALIZED PROTEIN DATABASES

ORDB: The Olfactory Receptor Database is a central repository of olfactory receptor (OR) and olfactory receptor-like gene and protein sequences (Crasto et al. 2002). Humans detect odorants through ORs, which are located on the olfactory sensory neurons in the olfactory epithelium of the nose (Buck and Axel 1991; Buck 200).
Ref: www.senselab.med.yale.edu/senselab/ordb

RiboWeb: RiboWeb is a relational database containing a representation of the primary 3D data relevant to the structure of the prokaryotic 30S ribosomal subunit, which initiates the translation of messenger RNA into protein and is the site of action of numerous antibiotics (Chen et al. 1997).
Ref: www.smi-web.stanford.edu/projects/helix/riboweb.html

TRANSCRIPTOMICS DATABASES

It is useful to study the temporal and spatial patterns of gene expression. Transcriptomics is defined as the use of quantitative mRNA measurements of gene expression to characterize biological processes and elucidate gene transcription mechanisms.

S.NO  NAME OF THE DATABASE                          TOOL FOR SEARCH
1.    NCBI’s dbEST Database                         www.ncbi.nlm.nih.gov/dbEST/
2.    The GeneCards database                        bioinformatics.weizmann.ac.il/cards
3.    Kidney development Gene Expression Database   organogenesis.ucsd.edu
4.    Gene Expression in Tooth                      bite-it.helsinki.fi
5.    Mouse Gene Expression Database                www.informatics.jax.org
6.    The Cardiac Gene Expression Knowledgebase     www.cage.wbmei.jhu.edu
7.    Cancer Gene Expression Omnibus                www.cage.wbmei.jhu.edu
8.    Saccharomyces Genome Database                 www.yeastgenome.org
9.    The Nematode Expression pattern Database      Nematode.lab.nig.ac.jp
10.   NCBI’s Gene Expression Omnibus                www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=geo

PROTEOMIC DATABASE

Proteomics is defined as the use of quantitative protein-level measurements of gene expression to characterize biological processes and elucidate the mechanisms of gene translation. There are generally two steps in proteomics: protein separation and protein identification.

S.NO  NAME OF THE DATABASE                   TOOL FOR SEARCH
1.    HEART-2DPAGE                           userpage.chemie.fuberlin.de/~pleiss/dhzb.html
2.    Heart High-performance 2-DE Database   www.mdc-berlin.de/~emu/heart
3.    SWISS-2DPAGE                           au.expasy.org/ch2d
4.    REPRODUCTION-2DPAGE                    www.reprod.njmu.edu.cn/cgi-bin/2d/2d.cgi
5.    Fishprom                               www.abdn.ac.uk/fishprom/index.shtml

PATHWAY DATABASE

A pathway is a system of molecules (especially proteins) that work together. Pathways are also called molecular interaction networks.

BioPAX: BioPAX is a collaborative effort to create a data exchange format for biological pathway data. The current format is called BioPAX Level 1 and represents metabolic pathway information.
Ref: www.biopax.org/

KEGG: The Kyoto Encyclopedia of Genes and Genomes (Kanehisa and Goto 2000; Kanehisa et al. 2002) is the primary database resource of the Japanese GenomeNet service for understanding higher-order functional meanings and utilities of the cell or the organism from its genome information.
Ref: www.genome.ad.jp/kegg

EcoCyc: EcoCyc is an organism-specific pathway database that describes the metabolic and signal transduction pathways of E. coli K12 MG1655, its enzymes, and its transport proteins (Karp et al. 2002c).
Ref: www.ecocyc.org

References:

Bioinformatics by Kenneth Baclawski and Tianhua Niu.

Benson, D.A., I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler. 2004. GenBank: update. Nucleic Acids Res. 32:D23-D26. Database issue.

Bateman, A., L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L. Sonnhammer, D.J. Studholme, C. Yeats, and S.R. Eddy. 2004. The Pfam protein families database. Nucleic Acids Res. 32:D138-D141. Database issue.

Murzin, A.G., S.E. Brenner, T. Hubbard, and C. Chothia. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:536-540.

http://www.geocities.com/bioinformaticsweb/datalink.html

http://www.clcbio.com/index.php?id=502
