

and proteome annotation using automatically recognized concepts and functional networks

Adrian Bivol, Tobias Wittkop, Darcy Davis, and Sean Mooney
Mooney laboratory, Buck Institute for Research on Aging, Novato, CA
National Center for Biomedical Ontology, Stanford University, Stanford, CA

Gene function/disease prediction

Typically uses Gene Ontology (GO) or disease annotation (e.g. OMIM)
Many tools use a similar set of features/networks, e.g. PPI networks, co-expression networks, sequence similarity, ...
Input: set of genes with known function/disease
Output: ranked list of the remaining genes (closest at the top)

Can these tools be used for annotations other than GO or disease?

Systematic evaluation of automated annotations

1. Annotate all (human) genes to terms from ontologies outside GO and OMIM, e.g. Phenotype Ontology, CHEBI, or Pathway Ontology.
2. For each term (gene set) evaluate predictability, i.e. how well can we predict the genes that are annotated to it using existing gene function prediction methods.

Gene annotations outside of GO

NCBO currently includes over 250 ontologies
Ontologies are structured controlled vocabularies
Gene/protein summaries in Entrez Gene and UniProt are often more up to date than manually curated GO
NCBO provides an annotator service1 that matches text to terms

1C. Jonquet et al. AMIA Summit on Translational Bioinformatics (2009)

Automatic gene annotation pipeline


[Figure: example descriptive text for one protein (NAA30, UniProt Q147X3), combining publication titles with UniProt comment fields such as FUNCTION, CATALYTIC ACTIVITY, SUBUNIT, and SUBCELLULAR LOCATION]

1. Collect genes/proteins from Entrez Gene and UniProt
2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt
3. Annotate text to over 200 ontologies via NCBO Annotator

[Figure: example ontology terms (Biological process, Cytokinetic process, DNA replication initiation, Molecular function, Cellular function) drawn from over 200 biomedical ontologies]


Gene/protein specific text as annotation source

Gene text from Entrez Gene
Protein text from UniProt
Gene/Protein summary
Publication titles
GO annotations
Pathway annotations
GeneRIFs


The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). ; Kinase-selective enrichment enables quantitative phosphoproteomics of the kinome across the cell cycle. ; A quantitative atlas of mitotic phosphorylation. ; A synopsis of eukaryotic Nalpha-terminal acetyltransferases: nomenclature, subunits and substrates. ; Knockdown of human N alpha-terminal acetyltransferase complex C leads to p53-dependent apoptosis and aberrant human Arl8b localization. ; Lysine acetylation targets protein complexes and co-regulates major cellular functions. ; -!- FUNCTION: Catalytic subunit of the N-terminal acetyltransferase C (NatC) complex. Catalyzes acetylation of the N-terminal methionine residues of peptides beginning with Met-Leu-Ala and Met-Leu-Gly. Necessary for the lysosomal localization and function of ARL8B. -!- CATALYTIC ACTIVITY: Acetyl-CoA + peptide = N(alpha)-acetylpeptide + CoA. -!- SUBUNIT: Component of the N-terminal acetyltransferase C (NatC) complex, which is composed of NAA35, LSMD1 and NAA30. -!- SUBCELLULAR LOCATION: Cytoplasm. -!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q147X3-1; Sequence=Displayed; Name=2; IsoId=Q147X3-2; Sequence=VSP_031581; Note=No experimental confirmation available; -!- SIMILARITY: Belongs to the acetyltransferase family. MAK3 subfamily. -!- SIMILARITY: Contains 1 N-acetyltransferase domain. -. KEGG; hsa:122830; -. UCSC; uc001xcx.2; human. CTD; 122830; -. GeneCards; GC14P038022; -. H-InvDB; HIX0011696; -. HGNC; HGNC:19844; NAA30. neXtProt; NX_Q147X3; -. PharmGKB; PA134931315; -. eggNOG; prNOG15463; -. GeneTree; ENSGT00390000005665; -. HOGENOM; HBG282398; -. HOVERGEN; HBG082671; -. InParanoid; Q147X3; -. OMA; AGVHSGE; -. OrthoDB; EOG4KKZ4S; -. PhylomeDB; Q147X3; -. NextBio; 81013; -. ArrayExpress; Q147X3; -. Bgee; Q147X3; -. CleanEx; HS_NAT12; -. Genevestigator; Q147X3; -. GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell. GO; GO:0004596; F:peptide alpha-N-acetyltransferase activity; IEA:EC. InterPro; IPR000182; AcTrfase_GCN5-related_dom. InterPro; IPR016181; Acyl_CoA_acyltransferase. Gene3D; G3DSA:3.40.630.30; Acyl_CoA_acyltransferase; 1. Pfam; PF00583; Acetyltransf_1; 1. SUPFAM; SSF55729; Acyl_CoA_acyltransferase; 1. PROSITE; PS51186; GNAT; 1.


Protein complexes, domains, interactions
We filter out author names, database names, and numbers
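A minimal sketch of such a filtering step, assuming it means dropping tokens that are author names, database names, or bare numbers before annotation. The stop list `DB_NAMES` and the function `filter_text` are hypothetical names for illustration; the slides do not give the actual lists or rules used.

```python
import re

# Hypothetical stop list; the actual list of database names is not given in the slides.
DB_NAMES = {"kegg", "ucsc", "genecards", "hgnc", "pfam", "interpro", "prosite"}

def filter_text(text, author_names=(), db_names=DB_NAMES):
    """Drop tokens that are bare numbers, database names, or known author names."""
    authors = {a.lower() for a in author_names}
    kept = []
    for token in re.findall(r"[A-Za-z0-9_:.\-]+", text):
        bare = token.strip(".:;-").lower()
        if not bare:
            continue  # token was punctuation only
        if re.fullmatch(r"[\d.,:]+", bare):
            continue  # bare number such as a year or an identifier
        if bare in db_names or bare in authors:
            continue  # database name or author name
        kept.append(token.strip(".:;"))
    return " ".join(kept)
```

Tokens that merely contain digits, such as Arl8b or p53-dependent, are kept on purpose, since they often carry biological meaning.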

The NCBO annotator1

Simple string matching using mgrep
Synonyms are annotated
Annotations are propagated to the root
No NLP
Very fast
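The three ideas above (plain string matching, synonym matching, propagation to the root) can be illustrated with a small sketch. This is not mgrep or the NCBO Annotator API; it is a plain-Python stand-in over a toy ontology, and `PARENT`, `SYNONYMS`, and `annotate` are hypothetical names introduced only for this example.

```python
import re

# Toy ontology (hypothetical): each term points to its parent; the root has None.
PARENT = {
    "biological process": None,
    "cell death": "biological process",
    "apoptosis": "cell death",
}
# Hypothetical synonym table: term -> alternative labels.
SYNONYMS = {"apoptosis": ["apoptotic process"]}

def build_dictionary(parent, synonyms):
    """Map every lower-cased label or synonym to its ontology term."""
    lookup = {term: term for term in parent}
    for term, names in synonyms.items():
        for name in names:
            lookup[name.lower()] = term
    return lookup

def ancestors(term, parent):
    """Yield the term itself and all of its ancestors up to the root."""
    while term is not None:
        yield term
        term = parent[term]

def annotate(text, parent, synonyms):
    """Return (direct matches, matches propagated to the root)."""
    lookup = build_dictionary(parent, synonyms)
    lowered = text.lower()
    direct = set()
    for name, term in lookup.items():
        # Whole-word, case-insensitive match; no stemming or other NLP.
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            direct.add(term)
    propagated = set()
    for term in direct:
        propagated.update(ancestors(term, parent))
    return direct, propagated
```

With this toy data, a text mentioning "apoptotic process" is annotated directly to apoptosis through the synonym, and indirectly to cell death and biological process through propagation.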

[Figure: example ontology terms: Biological process, Cytokinetic process, DNA replication initiation, Molecular function, Cellular function]





1C. Jonquet et al. AMIA Summit on Translational Bioinformatics (2009)

Annotation results
683,753,623 annotations of 426,392 genes and proteins to 529,544 terms from 267 ontologies for 7 organisms (human, mouse, rat, fly, worm, yeast, E. coli)
For human:
94,844,772 annotations of 43,823 genes to 436,576 terms
146,221,448 annotations of 68,079 proteins to 373,222 terms

RESTful web service at:
Term enrichment tool STOP1:

1Wittkop et al. BMC Bioinformatics (2013)


Systematic evaluation of automated annotations

Use GeneMANIA1 for gene prioritization:

Combine biological networks with more weight to networks that connect input genes
Find closest genes in genome
Fast, accurate and can be executed locally
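The general idea of weighting networks by how well they connect the input genes, and then ranking the remaining genes by closeness, can be sketched as follows. This is a simplified illustration and not GeneMANIA's actual algorithm: the weighting heuristic, the iterative score propagation, and all names (`network_weights`, `combine`, `prioritize`, `alpha`) are assumptions made for this example.

```python
def network_weights(networks, seeds):
    """Weight each network by the share of its edge weight linking two seed genes."""
    scores = {}
    for name, edges in networks.items():
        total = sum(edges.values())
        within = sum(w for (a, b), w in edges.items() if a in seeds and b in seeds)
        scores[name] = within / total if total else 0.0
    norm = sum(scores.values())
    if norm == 0:  # no network links the seeds: fall back to equal weights
        return {name: 1.0 / len(networks) for name in networks}
    return {name: s / norm for name, s in scores.items()}

def combine(networks, weights):
    """Build one symmetric adjacency map as the weighted sum of all networks."""
    adj = {}
    for name, edges in networks.items():
        for (a, b), w in edges.items():
            for u, v in ((a, b), (b, a)):
                adj.setdefault(u, {})
                adj[u][v] = adj[u].get(v, 0.0) + weights[name] * w
    return adj

def prioritize(networks, seeds, alpha=0.8, iterations=50):
    """Rank non-seed genes by propagated score, closest first."""
    adj = combine(networks, network_weights(networks, seeds))
    genes = sorted(adj)
    degree = {g: sum(adj[g].values()) for g in genes}
    start = {g: 1.0 if g in seeds else 0.0 for g in genes}
    score = dict(start)
    for _ in range(iterations):
        new_score = {}
        for g in genes:
            # Each neighbor passes on its score, split across its own edges.
            inflow = sum(
                w * score[n] / degree[n]
                for n, w in adj[g].items()
                if degree[n] > 0
            )
            new_score[g] = (1 - alpha) * start[g] + alpha * inflow
        score = new_score
    candidates = [g for g in genes if g not in seeds]
    return sorted(candidates, key=lambda g: score[g], reverse=True)
```

Each network is passed as a dictionary mapping gene pairs to edge weights, for example `{"ppi": {("A", "B"): 1.0}}`. Networks in which the input genes are not connected to each other receive zero weight in this heuristic.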

1Mostafavi et al. Genome Biol. (2008)

Systematic evaluation of automated annotations

3-fold cross-validation of all terms (between 5 and 1000 genes) using the gene prioritization tool GeneMANIA
3-fold cross-validation of random controls using two distributions (uniform and gene-annotation-frequency based)
3-fold cross-validation of GO annotations from GOA1 (including/excluding IEA annotations) and from DAVID2
Use AUROC as the quality measure for comparison

1E. Camon et al. NAR (2004), 2D. Huang et al. Genome Biol. (2007)
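The evaluation protocol above can be sketched in a few lines: split the genes of one term into three folds, hide one fold at a time, and compute the AUROC of the hidden genes among all candidates. The function `score_fn` is a placeholder for any prioritization tool (GeneMANIA in the slides), and all names here are hypothetical; the fold assignment and averaging are assumptions, since the slides do not describe them in detail.

```python
import random

def auroc(scores, positives):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def three_fold_auroc(gene_set, all_genes, score_fn, seed=0):
    """Hide each third of the gene set in turn and measure how well it is recovered."""
    genes = sorted(gene_set)
    random.Random(seed).shuffle(genes)
    folds = [genes[i::3] for i in range(3)]
    results = []
    for held_out in folds:
        training = set(genes) - set(held_out)
        # score_fn receives the training genes and returns {gene: score}.
        scores = score_fn(training)
        candidates = {
            g: s for g, s in scores.items()
            if g in all_genes and g not in training
        }
        results.append(auroc(candidates, set(held_out)))
    return sum(results) / len(results)
```

An AUROC of 0.5 corresponds to a random ranking and 1.0 to a ranking that places every hidden gene above every unannotated gene.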

Automated annotations are more predictable than random ...

[Figure legend: Control (uniform); Control (annotation-frequency-based); Automated annotations]

For human genes: 127,000 out of 200,000 analyzed terms are statistically significantly above random
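The two random controls could be generated along the following lines. This sketch assumes that "annotation-frequency-based" means genes are drawn with probability proportional to their number of annotations; the slides do not spell this out, and the function names and the weighted-sampling method are illustrative choices.

```python
import random

def uniform_control(all_genes, size, rng):
    """Random gene set in which every gene is equally likely."""
    return set(rng.sample(sorted(all_genes), size))

def frequency_control(annotation_counts, size, rng):
    """Random gene set in which heavily annotated genes are picked more often."""
    # Weighted sampling without replacement: give each gene the key u ** (1 / weight)
    # with u uniform in [0, 1), then keep the genes with the largest keys.
    keyed = [
        (rng.random() ** (1.0 / count), gene)
        for gene, count in sorted(annotation_counts.items())
        if count > 0  # genes without annotations are never drawn
    ]
    keyed.sort(reverse=True)
    return {gene for _, gene in keyed[:size]}
```

Control gene sets of the same size as a real term can then be run through the same cross-validation to obtain the AUROC distribution expected by chance.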

... and perform comparably to existing annotations

[Figure legend: GOA (EXP); GOA (IEA); DAVID (GO db); Automated annotations]

Note that GOA annotations are sparser and have, on average, smaller gene sets

Differences between ontologies demand further analysis

Ontologies with more than 1000 terms ordered by average AUROC
[Bar chart: average AUROC per ontology; axis ticks 0.78, 0.803, 0.825, 0.848, 0.87]
PRotein Ontology (PRO)
Molecule role (INOH Protein name/family name ontology)
Cell Cycle Ontology
Online Mendelian Inheritance in Man
Gene Ontology Extension
Gene Ontology
Neural-Immune Gene Ontology
Medical Subject Headings
NIFSTD
National Drug File
NCI Thesaurus
CRISP Thesaurus 2006
Logical Observation Identifier Names and Codes
MedDRA
Chemical entities of biological interest
SNOMED Clinical Terms
Experimental Factor Ontology
Suggested Ontology for Pharmacogenomics
Read Codes Clinical Terms Version 3 (CTV3)
Bone Dysplasia Ontology
RadLex
Galen
Human developmental anatomy timed version


ribonuclease P protein subunit p40, Protein Ontology: AUC = 1
GO:0032041 (NAD-dependent histone deacetylase activity), Gene Ontology: AUC = 1
GO:0072599 (establishment of protein localization in endoplasmic reticulum), Gene Ontology: AUC = 0.99
Pancytopenia, OMIM: AUC = 0.96
Severe combined immunodeficiency, OMIM: AUC = 0.97


Existing gene function prediction methods might be applied to other gene annotations

Automated annotations have predictive power
Differences in prediction performance between ontologies/terms exist
Future directions: What are the important features for individual ontologies or subgroups of terms?

Thank you for your attention

Special thanks to ...
Buck Institute for Research on Aging: Adrian Bivol, Darcy Davis, Emily TerAvest, Uday Evani, Ari Berman, Tal Oron Ronnen, Mathew Fleisch, Corey Powell

NIH R01 LM009722 (PI: Mooney), Stanford University National Center for Biomedical Ontology U54 HG004028, and the Buck Trust.

NCBO: Nigam Shah and Trish Wetzel