2/7/2013

Protein Function Prediction and Human Readable Descriptions:
Integrating Sequence Similarity, Protein Domain Architectures and Lexical Scoring

Heiko Schoof
Crop Bioinformatics, Institut für Nutzpflanzenwissenschaften und Ressourcenschutz (INRES), Universität Bonn
http://www.uni-bonn.de/cropbio
Kathrin Klee, Asis Hallab, Girish Srinivasa Murthy, Mythri Bangalore

Human readable description
• A scientist’s first contact with a protein is often through the description, e.g. in a BLAST search
• Human readable descriptions should
– give a hint at function (be informative)
– be short
– be correct
• Human readable descriptions in non-curated databases
– are often transferred from homologs
– contain specific information (e.g. clone ids)
– are automatically assigned
  • contain “similar to”, “hypothetical”
  • are wrong

Transferring HRD
• Collect HRD from homologs
• Identify structure

Sequences producing significant alignments:                          Score   E
                                                                     (bits)  Value
tr|A6S0W9|A6S0W9_BOTFB Putative uncharacterized protein OS=Botry...  469     e-130
tr|E4ZV13|E4ZV13_LEPMJ Similar to aminoglycoside phosphotransfer...  429     e-118
tr|E3RSU0|E3RSU0_PYRTT Putative uncharacterized protein OS=Pyren...  426     e-117
tr|Q2URL6|Q2URL6_ASPOR Predicted aminoglycoside phosphotransfera...  425     e-117
tr|B2WD91|B2WD91_PYRTR Acyl-CoA dehydrogenase family member 11 O...  424     e-116
tr|A1D4X3|A1D4X3_NEOFI Phosphotransferase enzyme family domain p...  423     e-116
tr|Q5BGQ8|Q5BGQ8_EMENI Putative uncharacterized protein OS=Emeri...  421     e-116
tr|C8VUD2|C8VUD2_EMENI Phosphotransferase enzyme family domain p...  421     e-116
tr|C1GKJ7|C1GKJ7_PARBD Phosphotransferase enzyme family domain-c...  420     e-115
tr|B8MY91|B8MY91_ASPFN Phosphotransferase enzyme family domain p...  417     e-114
tr|C4JVY0|C4JVY0_UNCRE Putative uncharacterized protein OS=Uncin...  417     e-114
tr|C1H4B7|C1H4B7_PARBA Phosphotransferase enzyme family domain-c...  416     e-114
tr|C0SES8|C0SES8_PARBP Aminoglycoside phosphotransferase OS=Para...  416     e-114


Transferring HRD
• Remove non-informative or specific words
• Select highest scoring HRD and transfer


AHRD: Automatic assignment of human readable descriptions
For each protein sequence a description line is selected that:
– comes from a high-scoring BLAST match
– contains words occurring frequently in the descriptions of highest scoring BLAST matches
– does not contain meaningless "fill words"
– contains words also occurring in any GO terms assigned to the query protein

AHRD: Automatic assignment of human readable descriptions
• Input:
– BLAST results against several databases (Swissprot, trEMBL, TAIR)
– InterproScan results
– GeneOntology annotation

AHRD workflow
• Collect HRD candidates from 200 best BLAST results
• Filter using blacklists
• Score and rank HRD candidates, select best-scoring HRD
• Assign best-scoring HRD; append metadata and Interpro domain information


AHRD scoring
• List all words (tokens) occurring in the HRD candidates (BLAST results)
• Assign a score based on
– the frequency in the HRD candidates
– the score of the respective BLAST alignment
– the overlap of the respective BLAST alignment
– a score derived from the “trust” put in the source database: e.g. high for Swissprot, low for trEMBL

$$\mathrm{ts}(t) = b \cdot \frac{\mathrm{bs}(t)}{B} + s \cdot \frac{\mathrm{os}(t)}{O} + w \cdot \frac{\mathrm{ws}(t)}{W}$$

AHRD scoring
• Descriptions receive a score based on the sum of word scores, normalized by the maximum word score and corrected for the proportion of informative words.
• Add a bonus for words also occurring in GO annotation.
• Add the BLAST score of the description
• Add a patternscore based on how often the exact same description appears

$$\mathrm{ds}(d_i) = \mathrm{cf} \cdot \frac{\mathrm{TS}}{\mathrm{mts}} + \mathrm{gs}(d_i) + d \cdot \mathrm{blastscore}(d_i) + a \cdot \mathrm{patternscore}(d_i)$$
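The two scoring formulas can be sketched in Python as follows. This is a minimal illustration, not AHRD's actual implementation: the function names, the argument names and the default weight values are assumptions chosen for readability, and the reading of the symbols (bs, os, ws as BLAST, overlap and database-weight scores of a token with normalizers B, O, W; TS as the summed token scores of a description; mts as the maximum token score; cf as the correction factor for the proportion of informative words; gs as the GO bonus) is inferred from the bullet points on the slides.

```python
def token_score(bs, os_, ws, B, O, W, b=0.5, s=0.3, w=0.2):
    """ts(t) = b*bs(t)/B + s*os(t)/O + w*ws(t)/W.

    bs, os_, ws: BLAST, overlap and database-weight scores of token t.
    B, O, W:     the corresponding normalizers.
    b, s, w:     weights (hypothetical defaults, not taken from the slides).
    """
    return b * bs / B + s * os_ / O + w * ws / W


def description_score(cf, TS, mts, gs, blastscore, patternscore, d=1.0, a=1.0):
    """ds(d_i) = cf*TS/mts + gs + d*blastscore + a*patternscore.

    cf:  correction factor for the proportion of informative words.
    TS:  sum of the token scores of the description.
    mts: maximum token score, used for normalization.
    gs:  bonus for words that also occur in the GO annotation.
    d, a: weights (hypothetical defaults, not taken from the slides).
    """
    return cf * TS / mts + gs + d * blastscore + a * patternscore
```

Keeping the weights as explicit arguments makes it easy to vary them, which is what the parameter optimization later in the talk does.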
Example
• We evaluate based on 1400 expert annotated proteins from the Blumeria graminis genome, which are not yet in Uniprot and have never been available as source for HRD transfer

ACD10 and Acyl-CoA dehydrogenase family members contain a phosphotransferase domain:
[Figure: domain architectures, rows labeled Query and ACD10]

BLAST (Swissprot) results:
Sequences producing significant alignments:                          Score   E       Overlap  Desc
                                                                     (bits)  Value   Score    Score
sp|Q6JQN1|ACD10_HUMAN Acyl-CoA dehydrogenase family member 10 OS...  223     1e-57   0.937    0.105
sp|Q5ZHT1|ACD11_CHICK Acyl-CoA dehydrogenase family member 11 OS...  218     5e-56   0.945    0.103
sp|Q8K370|ACD10_MOUSE Acyl-CoA dehydrogenase family member 10 OS...  213     1e-54   0.937    0.100
sp|Q5R778|ACD11_PONAB Acyl-CoA dehydrogenase family member 11 OS...  212     3e-54   0.872    0.100
sp|Q709F0|ACD11_HUMAN Acyl-CoA dehydrogenase family member 11 OS...  212     4e-54   0.872    0.100
sp|B3DMA2|ACD11_RAT   Acyl-CoA dehydrogenase family member 11 OS...  211     8e-54   0.872    0.100
sp|Q80XL6|ACD11_MOUSE Acyl-CoA dehydrogenase family member 11 OS...  204     6e-52   0.872    0.096

BLAST (TAIR) results:
Sequences producing significant alignments:                          Score   E       Overlap  Desc
                                                                     (bits)  Value   Score    Score
AT3G06810.1 | Symbols: IBR3 | IBR3 (IBA-RESPONSE 3); acyl-CoA de...  210     1e-54   0.937    0.198

BLAST (trEMBL) results (only the first 20 are shown):
Sequences producing significant alignments:                          Score   E       Overlap  Desc
                                                                     (bits)  Value   Score    Score
tr|A6S0W9|A6S0W9_BOTFB Putative uncharacterized protein OS=Botry...  469     e-130
tr|A7ECR7|A7ECR7_SCLS1 Putative uncharacterized protein OS=Scler...  461     e-128
tr|Q0CUW4|Q0CUW4_ASPTN Putative uncharacterized protein OS=Asper...  429     e-118
tr|E4ZV13|E4ZV13_LEPMJ Similar to aminoglycoside phosphotransfer...  429     e-118
tr|E3RSU0|E3RSU0_PYRTT Putative uncharacterized protein OS=Pyren...  426     e-117
tr|Q2URL6|Q2URL6_ASPOR Predicted aminoglycoside phosphotransfera...  425     e-117
tr|B2WD91|B2WD91_PYRTR Acyl-CoA dehydrogenase family member 11 O...  424     e-116   0.997    0.400
tr|A1D4X3|A1D4X3_NEOFI Phosphotransferase enzyme family domain p...  423     e-116   0.964    0.545
tr|Q5BGQ8|Q5BGQ8_EMENI Putative uncharacterized protein OS=Emeri...  421     e-116
tr|C8VUD2|C8VUD2_EMENI Phosphotransferase enzyme family domain p...  421     e-116   0.964    0.290
tr|Q4WKD2|Q4WKD2_ASPFU Phosphotransferase enzyme family domain p...  421     e-115   0.964    0.543
tr|B0XMM4|B0XMM4_ASPFC Phosphotransferase enzyme family domain p...  421     e-115   0.964    0.543
tr|A1CRY2|A1CRY2_ASPCL Phosphotransferase enzyme family domain p...  421     e-115   0.964    0.543
tr|C1GKJ7|C1GKJ7_PARBD Phosphotransferase enzyme family domain-c...  420     e-115   0.964    0.542
tr|B8MY91|B8MY91_ASPFN Phosphotransferase enzyme family domain p...  417     e-114   0.970    0.540
tr|C4JVY0|C4JVY0_UNCRE Putative uncharacterized protein OS=Uncin...  417     e-114
tr|C1H4B7|C1H4B7_PARBA Phosphotransferase enzyme family domain-c...  416     e-114   0.964    0.539
tr|C0SES8|C0SES8_PARBP Aminoglycoside phosphotransferase OS=Para...  416     e-114   0.964    2.657
tr|E9DDR6|E9DDR6_COCPS Putative uncharacterized protein OS=Cocci...  412     e-113
tr|C5P107|C5P107_COCP7 Electron transport oxidoreductase, putati...  412     e-113   0.964    0.389
…
7 x Acyl-CoA dehydrogenase
77 x Aminoglycoside phosphotransferase
…

Legend (highlighting on the original slide):
Rejected descriptions matching any regex of the description blacklist
Deleted parts of the descriptions matching any regex of the filtering lists
Ignored tokens matching any regex of the token blacklist
High scoring tokens
Low scoring tokens
Best BLAST hit
Over.: Overlap score of the blast result: (queryEnd - queryStart + 1) / queryLength
Desc.: Final description scores assigned by AHRD (Desc Score)
Description chosen by AHRD: Aminoglycoside Phosphotransferase
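The three filtering steps and the overlap score named in the legend can be sketched in Python. The regular expressions below are hypothetical examples chosen to match the descriptions shown in the BLAST tables; the lists AHRD actually uses are not given in the slides, and the function names are likewise assumptions.

```python
import re

# Hypothetical patterns for illustration only; the real lists are not shown in the slides.
DESCRIPTION_BLACKLIST = [re.compile(r"(?i)^putative uncharacterized protein")]
FILTER_PATTERNS = [
    re.compile(r"\s+OS=.*$"),                    # trailing organism field
    re.compile(r"(?i)^(similar to|predicted)\s+"),  # uninformative prefixes
]
TOKEN_BLACKLIST = [re.compile(r"(?i)^(protein|family|domain|member|\d+)$")]


def overlap_score(query_start, query_end, query_length):
    """Overlap score of a BLAST result: (queryEnd - queryStart + 1) / queryLength."""
    return (query_end - query_start + 1) / query_length


def clean_description(description):
    """Reject blacklisted descriptions (None), otherwise delete filtered parts."""
    if any(p.search(description) for p in DESCRIPTION_BLACKLIST):
        return None
    for pattern in FILTER_PATTERNS:
        description = pattern.sub("", description)
    return description.strip()


def informative_tokens(description):
    """Split a description into lower-case tokens, ignoring blacklisted tokens."""
    tokens = description.lower().split()
    return [t for t in tokens if not any(p.search(t) for p in TOKEN_BLACKLIST)]
```

For example, `clean_description("Aminoglycoside phosphotransferase OS=Para...")` drops the organism field, while a "Putative uncharacterized protein" description is rejected as a whole.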


Optimize parameters
• In AHRD, there are lots of parameters/weights: BLAST score vs. Overlap vs. Database weight, etc.
• We evaluate based on an F score (weighted harmonic mean of precision and recall) calculated from the number of shared words between predicted and reference description.
• We attempted both manual selection of parameters and automated methods like simulated annealing.
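A word-based F score of this kind can be sketched as follows. This is an assumption-laden illustration: the slides do not say how descriptions are tokenized, whether repeated words count once, or which weighting is used, so the sketch compares sets of lower-cased, whitespace-separated words and exposes the weight as a hypothetical `beta` parameter (`beta=1` gives the plain harmonic mean).

```python
def f_score(predicted, reference, beta=1.0):
    """Weighted harmonic mean of precision and recall over shared words."""
    predicted_words = set(predicted.lower().split())
    reference_words = set(reference.lower().split())
    shared = predicted_words & reference_words
    if not shared:
        return 0.0
    precision = len(shared) / len(predicted_words)  # shared / predicted words
    recall = len(shared) / len(reference_words)     # shared / reference words
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

With this definition, a prediction that shares one of its three words with a two-word reference has precision 1/3 and recall 1/2, giving an F score of 0.4.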

Optimize parameters
• Changes in individual parameters do not affect performance
• Only minimal change in performance (avg. F-score):
– original settings: 0.63
– optimal settings: 0.67
• But: in optimal settings, many parameters are 0, e.g. overlap, and mainly the BLAST score is considered
• Manual inspection shows descriptions are closer to reference, but in one case that is actually wrong

Crossvalidation
• Three datasets:
– 1400 curated proteins from Blumeria graminis
– 1000 curated proteins from tomato
– 1000 proteins recently added to Swissprot
• Train parameters on one dataset, evaluate on another

Settings                Blumeria  Tomato  Swissprot
original                0.63      0.52    0.67
optimized on Blumeria   0.67      0.5     0.75
optimized on Swissprot  0.62      0.48    0.82
optimized on tomato     0.65      0.48    0.68

Integrate protein domains
• If query and hit share protein domains detected by Interproscan, increase the score of that description
• Does not improve average F-score in evaluation
• However, in a single case an error in the curated reference description is uncovered

Ring free for round two…
vs.

Conclusions
• We are able to reproduce the decisions made by curators in an automated tool
• AHRD is robust to changes in parameters
• Optimization leads to overfitting: Curators too often choose the best BLAST hit – we are limited by the quality of the test set
• Domain similarity does not add significant information, except in cases of wrong annotations


Thank you!
• Kathrin Klee
• Asis Hallab
• Girish Srinivasa-Murthy
• Mythri Bangalore

• AHRD is available on Github:

– https://github.com/groupschoof/AHRD

http://www.uni-bonn.de/cropbio
