Genome and proteome annotation using automatically recognized concepts and functional networks

Adrian Bivol, Tobias Wittkop, Darcy Davis, and Sean Mooney
Mooney laboratory, Buck Institute for Research on Aging, Novato, CA
National Center for Biomedical Ontology, Stanford University, Stanford, CA

Gene function/disease prediction

• Typically uses Gene Ontology (GO) or disease annotation (e.g. OMIM)
• Many tools use a similar set of features/networks, e.g. PPI networks, co-expression networks, sequence similarity, ...
• Input: set of genes with known function/disease
• Output: ranked list of the remaining genes (closest at the top)

Can these tools be used for annotations other than GO or disease?

Systematic evaluation of automated annotations

1. Annotate all (human) genes to terms from ontologies outside GO and OMIM, e.g. Phenotype Ontology, CHEBI, or Pathway Ontology.
2. For each term (gene set), evaluate its "predictability", i.e. how well existing gene function prediction methods can predict the genes annotated to it.
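Step 2 treats every term as a gene set. A minimal sketch of that inversion is given here, assuming annotations are available as a gene-to-terms mapping; the function name, the toy identifiers, and the data layout are hypothetical, and the default size bounds of 5 and 1000 genes are the ones this deck states for the cross-validation.

```python
from collections import defaultdict

def term_gene_sets(gene_to_terms, min_size=5, max_size=1000):
    """Invert gene -> terms into term -> set of genes, keeping terms in the size range."""
    sets = defaultdict(set)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            sets[term].add(gene)
    return {t: genes for t, genes in sets.items() if min_size <= len(genes) <= max_size}

# Toy data with hypothetical gene and term identifiers.
annotations = {
    "geneA": {"T1", "T2"},
    "geneB": {"T1"},
    "geneC": {"T1", "T3"},
}
# A small lower bound is used only so the toy example keeps one term.
gene_sets = term_gene_sets(annotations, min_size=2)
print(gene_sets)
```

Each surviving gene set can then be handed to a gene prioritization tool as its input.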

Gene annotations outside of GO

• NCBO currently includes over 250 ontologies
• Ontologies are structured controlled vocabularies
• Gene/protein summaries in Entrez Gene and UniProt are often more up to date than manually curated GO
• NCBO provides an annotator service¹ that matches text to terms

¹ C. Jonquet et al. AMIA Summit on Translational Bioinformatics (2009)

Automatic gene annotation pipeline

1. Collect genes/proteins from Entrez Gene and UniProt
2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt
3. Annotate text to over 200 ontologies via NCBO Annotator

[Figure: descriptive text for one protein (NAA30, UniProt Q147X3) alongside example ontology terms ("Biological process", "Cytokinetic process", "DNA replication initiation", "Molecular function", "Cellular function") from over 200 biomedical ontologies]


Gene/protein specific text as annotation source

• Gene text from Entrez Gene
• Protein text from UniProt
• Gene/protein summary
• Publication titles
• GO annotations
• Pathway annotations
• GeneRIFs


The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). ; Kinase-selective enrichment enables quantitative phosphoproteomics of the kinome across the cell cycle. ; A quantitative atlas of mitotic phosphorylation. ; A synopsis of eukaryotic Nalpha-terminal acetyltransferases: nomenclature, subunits and substrates. ; Knockdown of human N alpha-terminal acetyltransferase complex C leads to p53-dependent apoptosis and aberrant human Arl8b localization. ; Lysine acetylation targets protein complexes and co-regulates major cellular functions. ; -!- FUNCTION: Catalytic subunit of the N-terminal acetyltransferase C (NatC) complex. Catalyzes acetylation of the N-terminal methionine residues of peptides beginning with Met-Leu-Ala and Met-Leu-Gly. Necessary for the lysosomal localization and function of ARL8B. -!- CATALYTIC ACTIVITY: Acetyl-CoA + peptide = N(alpha)-acetylpeptide + CoA. -!- SUBUNIT: Component of the N-terminal acetyltransferase C (NatC) complex, which is composed of NAA35, LSMD1 and NAA30. -!- SUBCELLULAR LOCATION: Cytoplasm. -!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q147X3-1; Sequence=Displayed; Name=2; IsoId=Q147X3-2; Sequence=VSP_031581; Note=No experimental confirmation available; -!- SIMILARITY: Belongs to the acetyltransferase family. MAK3 subfamily. -!- SIMILARITY: Contains 1 N-acetyltransferase domain. -. KEGG; hsa:122830; -. UCSC; uc001xcx.2; human. CTD; 122830; -. GeneCards; GC14P038022; -. H-InvDB; HIX0011696; -. HGNC; HGNC:19844; NAA30. neXtProt; NX_Q147X3; -. PharmGKB; PA134931315; -. eggNOG; prNOG15463; -. GeneTree; ENSGT00390000005665; -. HOGENOM; HBG282398; -. HOVERGEN; HBG082671; -. InParanoid; Q147X3; -. OMA; AGVHSGE; -. OrthoDB; EOG4KKZ4S; -. PhylomeDB; Q147X3; -. NextBio; 81013; -. ArrayExpress; Q147X3; -. Bgee; Q147X3; -. CleanEx; HS_NAT12; -. Genevestigator; Q147X3; -. GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell. GO; GO:0004596; F:peptide alpha-N-acetyltransferase activity; IEA:EC. InterPro; IPR000182; AcTrfase_GCN5-related_dom. InterPro; IPR016181; Acyl_CoA_acyltransferase. Gene3D; G3DSA:3.40.630.30; Acyl_CoA_acyltransferase; 1. Pfam; PF00583; Acetyltransf_1; 1. SUPFAM; SSF55729; Acyl_CoA_acyltransferase; 1. PROSITE; PS51186; GNAT; 1.


• Protein complexes, domains, interactions
• We filter out author names, database names, and numbers
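The filtering step can be pictured with a small sketch. The deck does not give the actual filter lists or rules, so the name sets, the tokenization, and the function `filter_text` below are all assumptions made for illustration.

```python
import string

# Hypothetical filter lists; the lists actually used are not given in the slides.
DB_NAMES = {"kegg", "ucsc", "genecards", "hgnc", "pfam", "interpro", "prosite"}
AUTHOR_NAMES = {"smith", "jones"}

def filter_text(text, db_names=DB_NAMES, author_names=AUTHOR_NAMES):
    """Drop database names, author names, and purely numeric tokens from the text."""
    kept = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        if not token:
            continue
        lowered = token.lower()
        if lowered in db_names or lowered in author_names:
            continue
        # Tokens without any letter (e.g. "122830" or "2009") count as numbers.
        if not any(ch.isalpha() for ch in token):
            continue
        kept.append(token)
    return " ".join(kept)

cleaned = filter_text("Smith 2009; KEGG; 122830; catalytic subunit of the NatC complex.")
print(cleaned)
```

Identifiers that mix letters and digits pass through this sketch unchanged; whether the original pipeline removed them is not stated.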

The NCBO annotator¹

• Simple string matching using mgrep
• Synonyms are annotated
• Annotations are propagated to the root
• No NLP
• Very fast
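The three behaviours listed above (string matching, synonym matching, propagation to the root) can be sketched as follows. This is not mgrep and not the NCBO Annotator API: the toy ontology, its term identifiers, and the regular-expression matching are assumptions chosen to keep the example self-contained.

```python
import re

# Toy ontology (hypothetical ids): term id -> (preferred label, synonyms, parent ids).
ONTOLOGY = {
    "T:process": ("biological process", [], []),
    "T:death": ("cell death", [], ["T:process"]),
    "T:apoptosis": ("apoptosis", ["programmed cell death"], ["T:death"]),
}

def build_dictionary(ontology):
    """Map every label and synonym (lower-cased) to its term id."""
    lookup = {}
    for term_id, (label, synonyms, _parents) in ontology.items():
        for name in [label, *synonyms]:
            lookup[name.lower()] = term_id
    return lookup

def ancestors(term_id, ontology):
    """All terms reachable by following parent links up to the root."""
    seen = set()
    stack = list(ontology[term_id][2])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(ontology[parent][2])
    return seen

def annotate(text, ontology):
    """Return (directly matched terms, matched terms plus all their ancestors)."""
    lowered = text.lower()
    direct = set()
    for name, term_id in build_dictionary(ontology).items():
        # Whole-word, case-insensitive string match; no linguistic processing.
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            direct.add(term_id)
    expanded = set(direct)
    for term_id in direct:
        expanded |= ancestors(term_id, ontology)
    return direct, expanded

direct, expanded = annotate("Knockdown leads to p53-dependent apoptosis.", ONTOLOGY)
print(sorted(direct), sorted(expanded))
```

Because matching is purely lexical, a gene whose text mentions a term in a negated or unrelated context is annotated all the same, which is one reason the resulting annotations need the systematic evaluation described in this deck.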

¹ C. Jonquet et al. AMIA Summit on Translational Bioinformatics (2009)

Annotation results

683,753,623 annotations of 426,392 genes and proteins to 529,544 terms from 267 ontologies for 7 organisms (human, mouse, rat, fly, worm, yeast, E. coli)

For human:
• 94,844,772 annotations of 43,823 genes to 436,576 terms
• 146,221,448 annotations of 68,079 proteins to 373,222 terms

• RESTful web service at:
• Term enrichment tool STOP¹:

¹ Wittkop et al. BMC Bioinformatics (2013)

Systematic evaluation of automated annotations

Use GeneMANIA¹ for gene prioritization:

• Combines biological networks, giving more weight to networks that connect the input genes
• Finds the closest genes in the genome
• Fast, accurate, and can be executed locally
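To make the idea of "more weight to networks that connect input genes" concrete, here is a deliberately simplified sketch. It is not GeneMANIA's algorithm: the weighting rule (share of a network's edge weight between input genes), the one-step neighbor score, and all names and toy data are assumptions for illustration only.

```python
def network_weight(network, seeds):
    """Share of a network's total edge weight that lies between input (seed) genes."""
    total = sum(network.values())
    inside = sum(w for (a, b), w in network.items() if a in seeds and b in seeds)
    return inside / total if total else 0.0

def combine(networks, seeds):
    """Weighted sum of networks; networks are dicts mapping (gene, gene) -> edge weight."""
    weights = [network_weight(net, seeds) for net in networks]
    norm = sum(weights) or 1.0
    combined = {}
    for net, w in zip(networks, weights):
        for edge, value in net.items():
            key = tuple(sorted(edge))
            combined[key] = combined.get(key, 0.0) + (w / norm) * value
    return combined

def rank_candidates(networks, seeds):
    """Rank non-seed genes by their summed combined edge weight to the seed genes."""
    scores = {}
    for (a, b), w in combine(networks, seeds).items():
        for gene, other in ((a, b), (b, a)):
            if gene not in seeds:
                scores.setdefault(gene, 0.0)
                if other in seeds:
                    scores[gene] += w
    return sorted(scores, key=lambda g: (-scores[g], g))

# Toy networks with hypothetical genes.
ppi = {("g1", "g2"): 1.0, ("g2", "g3"): 1.0}
coexpression = {("g1", "g4"): 1.0, ("g3", "g4"): 1.0}
ranking = rank_candidates([ppi, coexpression], seeds={"g1", "g2"})
print(ranking)
```

In this toy case only the PPI network links the two input genes, so it receives all the weight and `g3` ranks above `g4`.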

¹ Mostafavi et al. Genome Biol. (2008)

Systematic evaluation of automated annotations

• 3-fold cross-validation of all terms (between 5 and 1000 genes) using the gene prioritization tool GeneMANIA
• 3-fold cross-validation of random controls using two distributions (uniform and gene-annotation-frequency-based)
• 3-fold cross-validation of GO annotations from GOA¹ (including/excluding IEA annotations) and from DAVID²
• Use AUROC as the quality measure for comparison

¹ E. Camon et al. NAR (2004), ² D. Huang et al. Genome Biol. (2007)
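The evaluation loop described above can be sketched as follows: hold out one third of a term's genes, rank the remaining genome from the other two thirds, and score the ranking with AUROC. The scoring function stands in for GeneMANIA; `toy_score`, the one-dimensional `features`, and the gene names are hypothetical placeholders, and the fold assignment is only one plausible way to split.

```python
import random

def auroc(scores, positives):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def three_fold_splits(genes, seed=0):
    """Yield (training, held_out) gene sets for a shuffled 3-fold split."""
    shuffled = sorted(genes)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::3] for i in range(3)]
    for fold in folds:
        held_out = set(fold)
        yield set(shuffled) - held_out, held_out

def cross_validated_auroc(gene_set, score_fn, seed=0):
    """Mean AUROC over three folds; score_fn maps training genes to gene -> score."""
    values = []
    for training, held_out in three_fold_splits(gene_set, seed):
        scores = score_fn(training)
        # Training genes are the query, so only the remaining genes are ranked.
        candidates = {g: s for g, s in scores.items() if g not in training}
        values.append(auroc(candidates, held_out))
    return sum(values) / len(values)

# Toy stand-in for a prioritization tool: genes close to the training genes score high.
features = {"g1": 0.10, "g2": 0.20, "g3": 0.15, "g4": 0.90, "g5": 0.80,
            "g6": 0.12, "g7": 0.95, "g8": 0.18, "g9": 0.05}

def toy_score(training):
    center = sum(features[g] for g in training) / len(training)
    return {g: -abs(x - center) for g, x in features.items()}

gene_set = {"g1", "g2", "g3", "g6", "g8", "g9"}
mean_auc = cross_validated_auroc(gene_set, toy_score)
print(mean_auc)
```

An AUROC of 0.5 corresponds to a random ranking and 1.0 to every held-out gene ranking above every unannotated gene, which is what makes it convenient for comparing term sets of different sizes.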

Automated annotations are more predictable than random ...

[Figure legend: Control uniform • Control annotation-frequency-based • Automated annotations]

• For human genes: 127,000 out of 200,000 analyzed terms are predicted significantly better than the random controls
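The two random controls differ only in how the random gene sets are drawn. A minimal sketch of both sampling schemes follows, assuming per-gene annotation counts are available; the function names and toy counts are hypothetical, and how the controls were actually generated is not described in the slides.

```python
import random

def uniform_control(all_genes, size, rng):
    """Random gene set in which every gene is equally likely."""
    return set(rng.sample(sorted(all_genes), size))

def frequency_control(annotation_counts, size, rng):
    """Random gene set drawn without replacement, weighted by annotation count."""
    remaining = {g: c for g, c in annotation_counts.items() if c > 0}
    if size > len(remaining):
        raise ValueError("not enough annotated genes to sample from")
    chosen = set()
    while len(chosen) < size:
        genes = sorted(remaining)
        weights = [remaining[g] for g in genes]
        pick = rng.choices(genes, weights=weights, k=1)[0]
        chosen.add(pick)
        del remaining[pick]
    return chosen

# Toy annotation counts for hypothetical genes.
counts = {"a": 5, "b": 1, "c": 0, "d": 3}
rng = random.Random(1)
uniform_set = uniform_control(counts, 2, rng)
weighted_set = frequency_control(counts, 3, rng)
print(uniform_set, weighted_set)
```

The frequency-based control is the stricter baseline: heavily annotated genes tend to be well connected in functional networks, so a random set biased toward them may already look somewhat "predictable".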

... and perform comparably to existing annotations

[Figure legend: GOA (EXP) • GOA (IEA) • DAVID (GO db) • Automated annotations]

• Note that GOA annotations are sparser and have, on average, smaller gene sets

Differences between ontologies demand further analysis

Ontologies with more than 1000 terms, ordered by average AUROC

[Figure: bar chart of average AUROC per ontology, axis from 0.78 to 0.87; ontologies in chart order:]
• PRotein Ontology (PRO)
• Molecule role (INOH Protein name/family name ontology)
• Cell Cycle Ontology
• Online Mendelian Inheritance in Man
• Gene Ontology Extension
• Gene Ontology
• Neural-Immune Gene Ontology
• Medical Subject Headings
• NIFSTD
• National Drug File
• NCI Thesaurus
• CRISP Thesaurus 2006
• Logical Observation Identifier Names and Codes
• MedDRA
• Chemical entities of biological interest
• SNOMED Clinical Terms
• Experimental Factor Ontology
• Suggested Ontology for Pharmacogenomics
• Read Codes Clinical Terms Version 3 (CTV3)
• Bone Dysplasia Ontology
• RadLex
• Galen
• Human developmental anatomy timed version


ribonuclease P protein subunit p40 • Protein Ontology • AUC = 1


GO:0032041 (NAD-dependent histone deacetylase activity) • Gene Ontology • AUC = 1


GO:0072599 (establishment of protein localization in endoplasmic reticulum) • Gene Ontology • AUC = 0.99


Pancytopenia • OMIM • AUC = 0.96


Severe combined immunodeficiency • OMIM • AUC = 0.97


• Existing gene function prediction methods might be applied to other gene annotations
• Automated annotations have predictive power
• Prediction performance differs between ontologies and between terms
• Future directions: What are the important features for individual ontologies or subgroups of terms?

Thank you for your attention

Special thanks to ...

Buck Institute for Research on Aging: Adrian Bivol, Darcy Davis, Emily TerAvest, Uday Evani, Ari Berman, Tal Oron Ronnen, Mathew Fleisch, Corey Powell

NCBO: Nigam Shah and Trish Wetzel

NIH R01 LM009722 (PI: Mooney), Stanford University National Center for Biomedical Ontology U54 HG004028, and the Buck Trust.




