
Evaluation and integration of multiple datasets using Bayes' theorem

John van Dam


How can we integrate multiple datasets?
• Proteomics data
• Published data
• Genetic data
• Expression data
• Evolutionary data


Thomas Bayes (1701 – 1761)
• Presbyterian minister
• Fellow of the Royal Society

• Published two works:


• A religious essay
• An essay defending the work of Sir Isaac Newton

• His work on what is now called Bayes' theorem was published posthumously by Richard Price in 1763

• Mathematics of probabilities
• A hot topic in science in the early 18th century
• Many people at the time were interested in mathematics, statistics and probability because of gambling!
Bayes’ theorem

$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$

• P(A|B) = Probability of A given observation B

• P(B|A) = Probability of observation of B given A

• P(A) = The a priori probability of A

• P(B) = The probability that B is observed

• Bayes’ theorem deals with “inverse probabilities”


Example:
• A friend tells you he had a nice conversation with someone on the train to Nijmegen
• What is the chance that this other person is a woman?
• Your friend only tells you that this person has long hair.
• Does this change the previous probability?

• Say:
• 75% of women have long hair
• 15% of men have long hair
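
The example can be worked through numerically. The sketch below is a minimal illustration, assuming that women and men are equally likely to be on the train (P(W) = P(M) = 0.5), a prior the slides do not state. The function name `posterior` is illustrative only.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Bayes' theorem for two complementary hypotheses: returns P(A|B)."""
    # P(B) from the law of total probability over A and not-A
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


p_w = 0.5            # ASSUMPTION: prior probability that the person is a woman
p_l_given_w = 0.75   # from the example: 75% of women have long hair
p_l_given_m = 0.15   # from the example: 15% of men have long hair

p_w_given_l = posterior(p_w, p_l_given_w, p_l_given_m)
print(f"P(W|L) = {p_w_given_l:.3f}")  # 0.375 / 0.45, about 0.833
```

Under this assumed prior, learning about the long hair raises the probability that the person is a woman from 0.5 to roughly 0.83.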
Bayes’ theorem
• What if your friend told you that this person was also wearing high heels?
• We can use P(W|L) as the new prior!

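$$P(W \mid L \,\&\, H) = \frac{P(H \mid W) \times P(W \mid L)}{P(H)}$$

• Here P(H) must be evaluated with the updated prior: P(H) = P(H|W)×P(W|L) + P(H|M)×P(M|L)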

• This is called Bayesian updating


• You adjust your ‘belief’ with each new piece of information!

• Bayesian updating assumes no relationship between L and H other than via W!
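
Bayesian updating can be sketched as two successive applications of the same update. This is a minimal sketch, not a definitive implementation: the prior P(W) = 0.5 and the high-heel probabilities (20% of women, 1% of men) are hypothetical values chosen for illustration, and the helper `update` is not from the slides. It also checks that updating step by step gives the same answer as using both observations at once, which holds only because L and H are treated as independent given W (or M).

```python
def update(prior_w, p_e_given_w, p_e_given_m):
    """One Bayesian update of P(W) with evidence E, for hypotheses W vs. M."""
    p_e = p_e_given_w * prior_w + p_e_given_m * (1.0 - prior_w)
    return p_e_given_w * prior_w / p_e


p_w = 0.5                    # ASSUMPTION: prior, not given in the slides
p_l_w, p_l_m = 0.75, 0.15    # long hair, from the example
p_h_w, p_h_m = 0.20, 0.01    # HYPOTHETICAL: high heels given woman / man

# Step 1: update on long hair. Step 2: use P(W|L) as the new prior.
p_w_l = update(p_w, p_l_w, p_l_m)
p_w_lh = update(p_w_l, p_h_w, p_h_m)

# Same result from both observations at once (conditional independence)
joint_w = p_w * p_l_w * p_h_w
joint_m = (1.0 - p_w) * p_l_m * p_h_m
p_w_lh_joint = joint_w / (joint_w + joint_m)

print(f"P(W|L)     = {p_w_l:.3f}")
print(f"P(W|L & H) = {p_w_lh:.3f}")
```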


Bayesian odds
• For convenience we can rewrite Bayes' equation in terms of odds: posterior odds = prior odds × likelihood ratio (the Bayes factor)

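$$\frac{P(W \mid L)}{P(M \mid L)} = \frac{\dfrac{P(L \mid W) \times P(W)}{P(L)}}{\dfrac{P(L \mid M) \times P(M)}{P(L)}} = \frac{P(L \mid W) \times P(W) \times P(L)}{P(L) \times P(L \mid M) \times P(M)} = \frac{P(L \mid W) \times P(W)}{P(L \mid M) \times P(M)} = \frac{P(W)}{P(M)} \times \frac{P(L \mid W)}{P(L \mid M)}$$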
Bayesian odds
• If we now perform Bayesian updating we can simply write

$$\frac{P(W \mid L \,\&\, H)}{P(M \mid L \,\&\, H)} = \frac{P(W)}{P(M)} \times \frac{P(L \mid W)}{P(L \mid M)} \times \frac{P(H \mid W)}{P(H \mid M)}$$
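
In odds form, each observation contributes one multiplicative factor. The sketch below illustrates this, again assuming equal prior odds and hypothetical high-heel probabilities (20% of women, 1% of men) that do not come from the slides. `odds_to_prob` is an illustrative helper.

```python
def odds_to_prob(odds):
    """Convert odds in favour of a hypothesis into a probability."""
    return odds / (1.0 + odds)


prior_odds = 0.5 / 0.5        # ASSUMPTION: P(W) / P(M) = 1
lr_long_hair = 0.75 / 0.15    # P(L|W) / P(L|M), from the example
lr_high_heels = 0.20 / 0.01   # HYPOTHETICAL: P(H|W) / P(H|M)

# Each piece of evidence multiplies the odds by its likelihood ratio
posterior_odds = prior_odds * lr_long_hair * lr_high_heels
p_w_given_lh = odds_to_prob(posterior_odds)

print(f"posterior odds = {posterior_odds:.1f}")
print(f"P(W|L & H)     = {p_w_given_lh:.3f}")
```

Note that P(L) and P(H) never have to be computed: they cancel when the two posteriors are divided.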
Beware of ‘extreme’ cases (or priors)
• “A Bayesian is one who, vaguely expecting a horse, and catching a glimpse
of a donkey, strongly believes he has seen a mule.”
• http://www2.isye.gatech.edu/~brani/isyebayes/jokes.html

• What did we just “probabilistically” describe if the person was actually a man?


Ciliary biology: a relatively young field
Ciliated tissues (some examples)
• Inner ear: cilia function in hearing and balance
• Sperm cells
• Cerebral cavities, bronchia and Fallopian tubes
• Retina: cones and rods
Bayesian integration on SysCilia data
• Tandem Affinity Purifications & SILAC
• Yeast two-hybrid screens
• Ciliary evolutionary co-occurrence
• Gene presence/absence profiles matching ciliary presence/absence
• System co-expression
• Genes with XBOX transcription factor binding sites

• What is the probability that gene X is ciliary given that it is reported by experiments 1, 2, 3, …, and n?

Bayesian integration of multiple observations
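• Because log(a × b) = log(a) + log(b), the product of odds becomes a sum of log odds:

$$\log\frac{P(T \mid f_1, \ldots, f_n)}{P(F \mid f_1, \ldots, f_n)} = \log\frac{P(\text{Cilium})}{P(!\text{Cilium})} + \sum_{i=1}^{n} \log\frac{P(f_i \mid T)}{P(f_i \mid F)}$$

• n is the number of datasets considered
• fi = dataset i
• P(fi|T) = probability that a gene is reported by dataset i given it is a known ciliary gene
• P(fi|F) = probability that a gene is reported by dataset i given it is a known non-ciliary gene
• We take log odds because deviations caused by rounding and measurement errors are not enlarged with each multiplication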
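
The log-odds score for a single gene can be sketched as a prior term plus one term per dataset. This is a minimal illustration under stated assumptions: the prior odds and the per-dataset likelihood ratios below are hypothetical numbers, the natural logarithm is an arbitrary choice (the slides do not name a base), and `log_odds_score` is not a function from the original work.

```python
import math


def log_odds_score(prior_odds, likelihood_ratios):
    """Log posterior odds: log prior odds plus the sum of log likelihood ratios."""
    return math.log(prior_odds) + sum(math.log(lr) for lr in likelihood_ratios)


prior_odds = 0.05 / 0.95            # HYPOTHETICAL: P(Cilium) / P(!Cilium)
likelihood_ratios = [8.0, 3.5, 0.6]  # HYPOTHETICAL: P(fi|T) / P(fi|F) per dataset

score = log_odds_score(prior_odds, likelihood_ratios)

# The sum of logs equals the log of the product of the same factors
direct = math.log(prior_odds * math.prod(likelihood_ratios))

print(f"log-odds score = {score:.4f}")
```

A likelihood ratio below 1 (here 0.6) contributes a negative term, so a dataset can lower a gene's score as well as raise it.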
Can we say something about genes that were not reported?
• In case of yes/no experiments, “No” can also have meaning.

$$\frac{P(T \mid !f_i)}{P(F \mid !f_i)} = \frac{P(\text{Cilium})}{P(!\text{Cilium})} \times \frac{P(!f_i \mid T)}{P(!f_i \mid F)}$$
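• If an experiment yields a numeric value rather than yes/no, we can bin the values into categories. For instance:

$$\frac{P(T \mid f_i \in (0, 0.1))}{P(F \mid f_i \in (0, 0.1))} = \frac{P(\text{Cilium})}{P(!\text{Cilium})} \times \frac{P(f_i \in (0, 0.1) \mid T)}{P(f_i \in (0, 0.1) \mid F)}$$

$$\frac{P(T \mid f_i \in [0.1, 0.2))}{P(F \mid f_i \in [0.1, 0.2))} = \frac{P(\text{Cilium})}{P(!\text{Cilium})} \times \frac{P(f_i \in [0.1, 0.2) \mid T)}{P(f_i \in [0.1, 0.2) \mid F)}$$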

• Each gene falls into one category for each experiment.
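
Both cases can be sketched in a few lines. For a yes/no experiment, the "not reported" likelihood ratio follows from the complements of the "reported" probabilities. For a valued experiment, each gene is mapped to one bin and takes that bin's likelihood ratio. All numbers, bin edges and names below are hypothetical and serve only to illustrate the idea.

```python
from bisect import bisect_right

# Yes/no experiment. HYPOTHETICAL: reported for 40% of known ciliary genes
# and 5% of known non-ciliary genes.
p_f_given_t, p_f_given_f = 0.40, 0.05
lr_reported = p_f_given_t / p_f_given_f
lr_not_reported = (1.0 - p_f_given_t) / (1.0 - p_f_given_f)

# Valued experiment. HYPOTHETICAL bin edges giving the bins
# [0, 0.1), [0.1, 0.2), [0.2, 0.5) and [0.5, inf), each with its own
# likelihood ratio P(bin|T) / P(bin|F).
BIN_EDGES = [0.1, 0.2, 0.5]
BIN_LR = [0.4, 1.5, 4.0, 12.0]


def bin_index(value):
    """Index of the left-closed bin that contains value."""
    return bisect_right(BIN_EDGES, value)


def bin_likelihood_ratio(value):
    """Likelihood ratio of the single bin this value falls into."""
    return BIN_LR[bin_index(value)]


print(f"LR reported     = {lr_reported:.2f}")
print(f"LR not reported = {lr_not_reported:.2f}")
print(f"LR for value 0.15 = {bin_likelihood_ratio(0.15)}")
```

With these hypothetical numbers a "No" gives a ratio below 1, so not being reported counts as mild evidence against the gene being ciliary.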


Evaluating True and False per experiment
• We need a list of known ciliary genes (a Gold Standard)
• We need a list of known non-ciliary genes (a Negative Set)

• $\frac{P(f_i \mid T)}{P(f_i \mid F)}$ then simply becomes

$$\frac{\text{Fraction of GS reported by experiment } i}{\text{Fraction of NS reported by experiment } i}$$
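
Estimating this ratio from the two reference sets can be sketched with plain set operations. The gene identifiers below are placeholders, not real genes, and the function name `likelihood_ratio` is an assumption made for this illustration.

```python
def likelihood_ratio(reported, gold_standard, negative_set):
    """Fraction of GS reported divided by fraction of NS reported."""
    frac_gs = len(reported & gold_standard) / len(gold_standard)
    frac_ns = len(reported & negative_set) / len(negative_set)
    # NOTE: frac_ns == 0 raises ZeroDivisionError; how to treat that case
    # (for example with a pseudocount) is not specified in the slides.
    return frac_gs / frac_ns


# PLACEHOLDER identifiers for illustration only
gold_standard = {"gs1", "gs2", "gs3", "gs4"}
negative_set = {"ns1", "ns2", "ns3", "ns4", "ns5"}
reported = {"gs1", "gs2", "gs3", "ns1", "other1"}

lr = likelihood_ratio(reported, gold_standard, negative_set)
print(f"P(fi|T) / P(fi|F) = {lr:.2f}")  # (3/4) / (1/5)
```

Genes outside both reference sets (here `other1`) do not affect the estimate; they are the candidates whose scores the integration is meant to provide.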
Gold Standard & Negative set
System co-expression
Distinguishing between ciliary vs. non-ciliary genes
Ranking based on Bayesian integration

[Figure: genes ranked by Bayesian integration score; legend: ciliary, predicted, non-ciliary]

The Bayesian integration enriches for known ciliary genes more strongly than the individual datasets do. We can control the false discovery rate.

ROC curve and performance of individual datasets

AUC: 0.86
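
An AUC can be read as the probability that a randomly chosen known ciliary gene scores higher than a randomly chosen known non-ciliary gene. The sketch below computes it that way, by pair counting. The scores are hypothetical and this is not necessarily how the value on the slide was calculated.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve by pair counting; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# HYPOTHETICAL integrated scores for gold-standard and negative-set genes
gold_scores = [6.0, 4.0, 3.0, 0.5]
negative_scores = [1.0, 0.0, -1.0, -2.0]

print(f"AUC = {auc(gold_scores, negative_scores)}")  # 15 of 16 pairs ranked correctly
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to perfect separation of the two sets.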

Application of the Bayesian integration
• Predict causative genes in ciliopathy disease loci or exome data
• Predict which genes are likely involved in ciliary function, and which are not
• Example: BBS5 locus (182 genes):

Ensembl GeneID     Gene Symbol   Rank   Score
ENSG00000123607    TTC21B          65   5.580545431
ENSG00000163093    BBS5            99   4.863543816
ENSG00000154479    CCDC173        157   3.916022407
ENSG00000081479    LRP2           503   0.756945148

Conclusion
• Bayesian integration is a powerful way to predict novel ciliary genes by
objective evaluation and integration of experimental datasets
• New datasets can easily be incorporated

• You can use such a Bayesian integration to


• Predict novel ciliary genes
• Rank target genes from new experiments
• Predict causative genes in patient exome data
Acknowledgements
• Huynen lab, Radboud UMC

• Ueffing lab, Tübingen

• Roepman lab, Radboud UMC

• Oliver Blacque, UCD Dublin
