
Genome and proteome annotation using automatically recognized concepts and functional networks

Adrian Bivol, Tobias Wittkop, Darcy Davis, and Sean Mooney
Mooney laboratory, Buck Institute for Research on Aging, Novato, CA
National Center for Biomedical Ontology, Stanford University, Stanford, CA

Gene function/disease prediction

Typically uses Gene Ontology (GO) or disease annotation (e.g. OMIM)
Many tools use a similar set of features/networks, e.g. PPI networks, co-expression networks, sequence similarity, ...
Input: set of genes with known function/disease
Output: ranked list of the remaining genes (closest at the top)

Can these tools be used for annotations other than GO or disease?

Systematic evaluation of automated annotations

1. Annotate all (human) genes to terms from ontologies outside GO and OMIM, e.g. Phenotype Ontology, CHEBI, or Pathway Ontology.
2. For each term (gene set), evaluate predictability, i.e. how well we can predict the genes annotated to it using existing gene function prediction methods.

Gene annotations outside of GO

NCBO currently includes over 250 ontologies
Ontologies are structured controlled vocabularies
Gene/protein summaries in Entrez Gene and UniProt are often more up to date than manually curated GO
NCBO provides an annotator service¹ that matches text to terms

¹ C. Jonquet et al. AMIA Summit on Translational Bioinformatics (2009)

Automatic gene annotation pipeline


1. Collect genes/proteins from Entrez Gene and UniProt
2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt
3. Annotate text to over 200 ontologies via NCBO Annotator

[Figure: pipeline diagram. (1) Genome/proteome. (2) Descriptive text of an example entry, UniProt Q147X3 (human), with the words "apoptosis" and "acetyltransferase" called out. (3) Over 200 biomedical ontologies, e.g. Gene Ontology and Cell Cycle Ontology, containing the matched terms "Apoptosis" and "Acetyltransferase".]

Gene/protein specific text as annotation source


Gene text from Entrez Gene
Protein text from UniProt
Gene/protein summary
Publication titles
GO annotations
Pathway annotations
GeneRIFs

Example: Q147X3 (human), with the matched words "apoptosis" and "acetyltransferase" called out on the slide

The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). ; Kinase-selective enrichment enables quantitative phosphoproteomics of the kinome across the cell cycle. ; A quantitative atlas of mitotic phosphorylation. ; A synopsis of eukaryotic Nalpha-terminal acetyltransferases: nomenclature, subunits and substrates. ; Knockdown of human N alpha-terminal acetyltransferase complex C leads to p53-dependent apoptosis and aberrant human Arl8b localization. ; Lysine acetylation targets protein complexes and co-regulates major cellular functions. ;

-!- FUNCTION: Catalytic subunit of the N-terminal acetyltransferase C (NatC) complex. Catalyzes acetylation of the N-terminal methionine residues of peptides beginning with Met-Leu-Ala and Met-Leu-Gly. Necessary for the lysosomal localization and function of ARL8B.
-!- CATALYTIC ACTIVITY: Acetyl-CoA + peptide = N(alpha)-acetylpeptide + CoA.
-!- SUBUNIT: Component of the N-terminal acetyltransferase C (NatC) complex, which is composed of NAA35, LSMD1 and NAA30.
-!- SUBCELLULAR LOCATION: Cytoplasm.
-!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q147X3-1; Sequence=Displayed; Name=2; IsoId=Q147X3-2; Sequence=VSP_031581; Note=No experimental confirmation available;
-!- SIMILARITY: Belongs to the acetyltransferase family. MAK3 subfamily.
-!- SIMILARITY: Contains 1 N-acetyltransferase domain.

-. KEGG; hsa:122830; -. UCSC; uc001xcx.2; human. CTD; 122830; -. GeneCards; GC14P038022; -. H-InvDB; HIX0011696; -. HGNC; HGNC:19844; NAA30. neXtProt; NX_Q147X3; -. PharmGKB; PA134931315; -. eggNOG; prNOG15463; -. GeneTree; ENSGT00390000005665; -. HOGENOM; HBG282398; -. HOVERGEN; HBG082671; -. InParanoid; Q147X3; -. OMA; AGVHSGE; -. OrthoDB; EOG4KKZ4S; -. PhylomeDB; Q147X3; -. NextBio; 81013; -. ArrayExpress; Q147X3; -. Bgee; Q147X3; -. CleanEx; HS_NAT12; -. Genevestigator; Q147X3; -. GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell. GO; GO:0004596; F:peptide alpha-N-acetyltransferase activity; IEA:EC. InterPro; IPR000182; AcTrfase_GCN5-related_dom. InterPro; IPR016181; Acyl_CoA_acyltransferase. Gene3D; G3DSA:3.40.630.30; Acyl_CoA_acyltransferase; 1. Pfam; PF00583; Acetyltransf_1; 1. SUPFAM; SSF55729; Acyl_CoA_acyltransferase; 1. PROSITE; PS51186; GNAT; 1.

Protein complexes, domains, interactions
We filter for author names, database names, and numbers
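
A minimal sketch of what such a filter could look like. The slides do not say how the filter is implemented, so the function name `filter_text` and the stop lists `DB_NAMES` and `AUTHOR_NAMES` are hypothetical.

```python
import re

# Hypothetical stop lists; the lists actually used in the pipeline are not given.
DB_NAMES = {"kegg", "ucsc", "genecards", "hgnc", "pfam", "interpro"}
AUTHOR_NAMES = {"jonquet", "mostafavi"}


def filter_text(text):
    """Drop tokens that are database names, author names, or contain digits."""
    tokens = re.findall(r"[A-Za-z0-9_:.\-]+", text)
    kept = []
    for token in tokens:
        cleaned = token.strip(".:-_")
        word = cleaned.lower()
        if not word:
            continue
        if any(ch.isdigit() for ch in word):
            continue  # numbers and identifiers such as hsa:122830
        if word in DB_NAMES or word in AUTHOR_NAMES:
            continue
        kept.append(cleaned)
    return " ".join(kept)


print(filter_text("KEGG; hsa:122830; Catalytic subunit of the NatC complex."))
```

Dropping every token that contains a digit is deliberately crude: it would also remove gene names such as p53, so a real filter would need a more careful rule.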

The NCBO Annotator¹

Simple string matching using mgrep
Synonyms are annotated
Annotations are propagated to the root
No NLP
Very fast

[Figure: step 3 of the pipeline. The matched terms "Apoptosis" and "Acetyltransferase" shown within ontologies such as Gene Ontology (biological process, molecular function, cellular function) and Cell Cycle Ontology.]

¹ C. Jonquet et al. AMIA Summit on Translational Bioinformatics (2009)
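
The behaviour listed on this slide (plain string matching, synonym matching, propagation to the root) can be illustrated on a toy ontology. This is a sketch only: it does not use mgrep or the NCBO service, and the ontology content and function names are made up for illustration.

```python
import re

# Toy ontology: term -> (synonyms, parent). Hypothetical content for illustration.
ONTOLOGY = {
    "biological process": ([], None),
    "cell death": ([], "biological process"),
    "apoptosis": (["programmed cell death"], "cell death"),
    "acetyltransferase": (["acetylase"], None),
}


def build_dictionary(ontology):
    """Map every label and synonym (lowercased) to its term."""
    lookup = {}
    for term, (synonyms, _parent) in ontology.items():
        for label in [term] + synonyms:
            lookup[label.lower()] = term
    return lookup


def ancestors(term, ontology):
    """Follow parent links from a term up to the root."""
    result = []
    parent = ontology[term][1]
    while parent is not None:
        result.append(parent)
        parent = ontology[parent][1]
    return result


def annotate(text, ontology):
    """Return directly matched terms plus all of their ancestors."""
    lowered = text.lower()
    direct = set()
    for label, term in build_dictionary(ontology).items():
        # Whole-word match, so "acetylase" does not match "deacetylase".
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            direct.add(term)
    annotated = set(direct)
    for term in direct:
        annotated.update(ancestors(term, ontology))
    return annotated


print(sorted(annotate("Knockdown leads to p53-dependent apoptosis", ONTOLOGY)))
```

Because nothing beyond string matching is involved, negations and context are ignored, which is the trade-off behind "No NLP, very fast".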

Annotation results
683,753,623 annotations of 426,392 genes and proteins to 529,544 terms from 267 ontologies for 7 organisms (human, mouse, rat, fly, worm, yeast, E. coli)

For human:
94,844,772 annotations of 43,823 genes to 436,576 terms
146,221,448 annotations of 68,079 proteins to 373,222 terms

Availability:

RESTful web service at: rest.mooneygroup.org
Term enrichment tool STOP¹: mooneygroup.org/stop


¹ Wittkop et al. BMC Bioinformatics (2013)


Systematic evaluation of automated annotations

Use GeneMANIA¹ for gene prioritization:

Combine biological networks, giving more weight to networks that connect the input genes
Find the closest genes in the genome
Fast, accurate, and can be executed locally

¹ Mostafavi et al. Genome Biol. (2008)
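
The idea on this slide can be sketched as follows. This is a simplified illustration, not GeneMANIA's actual weighting or propagation scheme: here a network's weight is simply the total edge weight between the input genes, and a candidate's score is its weighted connectivity to the input genes. The data layout and function names are assumptions.

```python
def network_weight(network, seeds):
    """Total edge weight between input genes, each undirected edge counted once."""
    total = 0.0
    seed_list = sorted(seeds)
    for i, a in enumerate(seed_list):
        for b in seed_list[i + 1:]:
            total += network.get(a, {}).get(b, 0.0)
    return total


def prioritize(networks, seeds, genes):
    """Rank non-seed genes by weighted connectivity to the seed genes."""
    raw = {name: network_weight(net, seeds) for name, net in networks.items()}
    norm = sum(raw.values())
    if norm == 0:
        # No network connects the seeds: fall back to equal weights.
        weights = {name: 1.0 / len(networks) for name in networks}
    else:
        weights = {name: value / norm for name, value in raw.items()}
    scores = {}
    for gene in genes:
        if gene in seeds:
            continue
        scores[gene] = sum(
            weights[name] * sum(net.get(gene, {}).get(s, 0.0) for s in seeds)
            for name, net in networks.items()
        )
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked, weights


# Networks as symmetric adjacency dicts: gene -> {neighbor: edge weight}.
networks = {
    "ppi": {"A": {"B": 1.0, "C": 1.0}, "B": {"A": 1.0}, "C": {"A": 1.0}},
    "coexp": {"A": {"D": 1.0}, "D": {"A": 1.0}},
}
ranked, weights = prioritize(networks, {"A", "B"}, ["A", "B", "C", "D"])
print(ranked, weights)
```

In this toy input only the PPI network links the two input genes, so it receives all the weight and gene C, its neighbor in that network, is ranked first.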

Systematic evaluation of automated annotations

3-fold cross-validation of all terms (between 5 and 1000 genes) using the gene prioritization tool GeneMANIA
3-fold cross-validation of random controls using two distributions (uniform and gene-annotation-frequency-based)
3-fold cross-validation of GO annotations from GOA¹ (including/excluding IEA annotations) and from DAVID²
Use AUROC as the quality measure to compare predictability

¹ E. Camon et al. NAR (2004); ² D. Huang et al. Genome Biol. (2007)
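
The evaluation protocol above can be sketched in a few lines. This is an illustration under assumptions: `score_fn` is a hypothetical stand-in for a prioritization tool (it takes the training genes and the candidate genes and returns a score per candidate), and the fold assignment shown is one simple choice, not necessarily the one used in the study.

```python
import random


def auroc(scores, positives):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def three_fold_splits(genes, seed=0):
    """Yield (training, held_out) gene sets for 3-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = sorted(genes)
    rng.shuffle(shuffled)
    folds = [shuffled[i::3] for i in range(3)]
    for fold in folds:
        held_out = set(fold)
        yield set(shuffled) - held_out, held_out


def cross_validated_auroc(term_genes, all_genes, score_fn, seed=0):
    """Average AUROC over three folds for one term (gene set)."""
    values = []
    for training, held_out in three_fold_splits(term_genes, seed):
        candidates = [g for g in all_genes if g not in training]
        scores = score_fn(training, candidates)
        values.append(auroc(scores, held_out))
    return sum(values) / len(values)


# Toy check with a scorer that knows the answer (hypothetical, for illustration only).
term = {"g%d" % i for i in range(6)}
genome = ["g%d" % i for i in range(12)]
perfect = lambda training, candidates: {g: 1.0 if g in term else 0.0 for g in candidates}
print(cross_validated_auroc(term, genome, perfect))
```

An AUROC of 0.5 corresponds to a random ranking and 1.0 to all held-out genes ranked above all other candidates.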

Automated annotations are more predictable than random ...


[Figure legend: control (uniform), control (annotation-frequency-based), automated annotations]

For human genes: 127,000 out of 200,000 analyzed terms are statistically significantly above random
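
The two random controls could be drawn roughly as in the sketch below. The slides do not describe the sampling procedure, so this is one plausible reading: a uniform control picks genes with equal probability, and a frequency-based control picks genes with probability proportional to their number of annotations. The function name and arguments are hypothetical.

```python
import random


def random_gene_set(genes, size, annotation_counts=None, seed=0):
    """Sample a random gene set without replacement.

    With annotation_counts=None every gene is equally likely (uniform control).
    Otherwise genes are drawn with probability proportional to their annotation
    count, using random keys u ** (1 / weight) and keeping the largest keys.
    """
    rng = random.Random(seed)
    genes = sorted(genes)
    if annotation_counts is None:
        return set(rng.sample(genes, size))
    keyed = []
    for gene in genes:
        weight = annotation_counts.get(gene, 0)
        if weight > 0:
            keyed.append((rng.random() ** (1.0 / weight), gene))
    if len(keyed) < size:
        raise ValueError("not enough annotated genes to sample from")
    keyed.sort(reverse=True)
    return {gene for _key, gene in keyed[:size]}


genes = ["g%d" % i for i in range(1, 11)]
counts = {"g1": 100, "g2": 50, "g3": 1}  # toy annotation counts
uniform_control = random_gene_set(genes, 4)
weighted_control = random_gene_set(genes, 2, counts)
print(sorted(uniform_control), sorted(weighted_control))
```

Each random set would then go through the same 3-fold cross-validation as a real term, giving a null distribution of AUROC values to compare against.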

... and perform comparable to existing annotations


[Figure legend: GOA (EXP), GOA (IEA), DAVID (GO db), automated annotations]

Note that GOA annotations are sparser and have, on average, smaller gene sets

Differences between ontologies demand further analysis


Ontologies with more than 1000 terms, ordered by average AUROC

[Bar chart; x-axis: average AUROC, ticks from 0.78 to 0.87. Ontologies in the order shown on the slide:]
PRotein Ontology (PRO)
Molecule role (INOH Protein name/family name ontology)
Cell Cycle Ontology
Online Mendelian Inheritance in Man
Gene Ontology Extension
Gene Ontology
Neural-Immune Gene Ontology
Medical Subject Headings
NIFSTD
National Drug File
NCI Thesaurus
CRISP Thesaurus 2006
Logical Observation Identifier Names and Codes
MedDRA
Chemical entities of biological interest
SNOMED Clinical Terms
Experimental Factor Ontology
Suggested Ontology for Pharmacogenomics
Read Codes Clinical Terms Version 3 (CTV3)
Bone Dysplasia Ontology
RadLex
Galen
Human developmental anatomy timed version

Examples

ribonuclease P protein subunit p40; Protein Ontology; AUC = 1
GO:0032041 (NAD-dependent histone deacetylase activity); Gene Ontology; AUC = 1
GO:0072599 (establishment of protein localization in endoplasmic reticulum); Gene Ontology; AUC = 0.99
Pancytopenia; OMIM; AUC = 0.96
Severe combined immunodeficiency; OMIM; AUC = 0.97

Conclusions

Existing gene function prediction methods might be applied to other gene annotations
Automated annotations have predictive power
Differences in prediction performance between ontologies/terms exist
Future directions: What are the important features for individual ontologies or subgroups of terms?

Thank you for your attention

Special thanks to:
Buck Institute for Research on Aging: Adrian Bivol, Darcy Davis, Emily TerAvest, Uday Evani, Ari Berman, Tal Oron Ronnen, Mathew Fleisch, Corey Powell

Funding
NIH R01 LM009722 (PI:Mooney), Stanford University National Center for Biomedical Ontology U54 HG004028, and the Buck Trust.

NCBO: Nigam Shah and Trish Wetzel