University of Pittsburgh Intelligent Systems Program
An Efficient Genetic Model Selection Algorithm
to Predict Outcomes from Genomic Data
Acknowledgements
This research was funded by NLM grant T15 LM007059 to the University of Pittsburgh
Biomedical Informatics Training Program.
Model Averaged NB
Usually, only the complete-arc NB model is used
However, there are 2^A possible NB models, representing all
combinations of arcs (structure)
Instead of using just one model, average over all the NB models!
Introduction
Outcome prediction is an important goal of genomic analysis
Use Single Nucleotide Polymorphism (SNP) data to find meaningful
variants, apply them for prognosis
Many possible models of association
Does a SNP affect phenotype?
How to model the effect?
Methods
Data: Four Wellcome Trust Case Control Consortium GWAS SNP datasets
Hypertension (HT), Type 1 Diabetes (T1D), Type 2 Diabetes (T2D), Rheumatoid Arthritis
(RA)
~446k SNPs, ~3000 cases, ~2000 controls for each dataset
Algorithms: Five versions of MANB applied to predict outcome from genomic data (P(T|X))
Four MANB versions assume the same genetic model for every SNP (G, D, R, H)
GMANB selects the model for each SNP under which association is most likely
Results
Each model learns parameters (80% train), computes P(T|X) (20% test)
Predictive performance measured as area under the ROC curve
GMANB performs the best in 3 of 4 datasets, and is not significantly
different from the best in the fourth dataset
GMANB model selection is better than using any single model
Recessive model works well for several datasets
Plausible mechanism for disease SNPs
Heterozygous model performs the worst in general
Biologically uncommon model
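The AUC used to compare the models can be computed directly from predicted scores. The following is a minimal sketch, not the evaluation code behind the poster: the function name `auc` and the pairwise (Mann-Whitney) formulation are assumptions chosen for brevity.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    scores: predicted P(T=1 | X) for each test instance
    labels: 1 for cases, 0 for controls
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]          # scores of cases
    neg = scores[labels == 0]          # scores of controls
    diff = pos[:, None] - neg[None, :]  # every (case, control) pair
    # A correctly ordered pair counts 1, a tie counts 0.5.
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))
```

The pairwise form uses memory proportional to cases × controls, which is acceptable for a 20% test split of a few thousand subjects; a rank-based version would scale better.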
Department of Biomedical Informatics
Background
Naïve Bayes (NB) model computes P(T|X), using Bayes' theorem and
conditionally independent probabilities P(X_1|T), P(X_2|T), ..., P(X_A|T)
Reduces to a counting problem over N instances of A features
For a genomic problem, requires 6 parameters per feature X
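As an illustration of the counting formulation, the sketch below estimates NB parameters from genotype counts and applies Bayes' theorem. The function names and the add-one smoothing (`alpha=1.0`) are assumptions made for this example. Each feature gets a 2 × 3 table, i.e. the 6 parameters per SNP mentioned above.

```python
import numpy as np

def fit_nb(X, t, n_states=3, n_classes=2, alpha=1.0):
    """Estimate NB parameters by counting.

    X: (N, A) array of genotype states in {0, 1, 2}
    t: (N,) array of outcomes in {0, 1}
    alpha: smoothing pseudo-count (an assumption, not from the poster)
    """
    N, A = X.shape
    prior = np.array([(t == j).sum() + alpha for j in range(n_classes)], dtype=float)
    prior /= prior.sum()
    theta = np.empty((A, n_classes, n_states))  # theta[i, j, k] = P(X_i=k | T=j)
    for j in range(n_classes):
        Xj = X[t == j]
        for k in range(n_states):
            theta[:, j, k] = (Xj == k).sum(axis=0) + alpha
    theta /= theta.sum(axis=2, keepdims=True)
    return prior, theta

def predict_proba(x, prior, theta):
    """Return P(T | X=x) for one instance x of length A."""
    A = len(x)
    # Sum log-likelihoods over features, then normalize over T.
    log_post = np.log(prior) + np.log(theta[np.arange(A), :, x]).sum(axis=0)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()
```

Working in log space avoids underflow when A is in the hundreds of thousands, as in the GWAS datasets described here.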
Summary
Genetic Model Averaged Naïve Bayes (GMANB) uses a Bayesian
metric to select genetic models for SNP features
Reduces overfitting by limiting the number of parameters estimated
Shows improved predictive power over single-model techniques
Algorithmic complexity no worse than Naïve Bayes
[Figure: NB network with target node T and feature nodes X_1, X_2, X_3, ..., X_A]
T=0 T=1
X1=0 100 110
X1=1 101 111
X1=2 102 112
T=0 T=1
X2=0 200 210
X2=1 201 211
X2=2 202 212
T=0 T=1
XA=0 A00 A10
XA=1 A01 A11
XA=2 A02 A12
Genetic Model Selection
In SNP data, each variable has 3 feature states (AA, Aa, aa genotype)
SNPs affect phenotype under different models:
Genotypic: each SNP state affects phenotype differently (e.g., additive model)
Dominant: only one copy of the risk allele is required to affect the phenotype
Recessive: two copies of the risk allele are required to affect the phenotype
Heterozygous: both alleles are required to affect phenotype (i.e., Aa genotype)
These models imply functional equivalence between SNP states
Compute MANB under each possible genetic model, and pick the most appropriate one
for each SNP
GMANB selects the most likely model as the one under which the likelihood of there
being an association (arc) is highest
Fewer feature states → fewer parameters in the NB model → less overfitting
Genotypic
Counts          T=0          T=1
Xa=0            N00          N10
Xa=1            N01          N11
Xa=2            N02          N12

Dominant
Counts          T=0          T=1
Xa=0 or Xa=1    N00 + N01    N10 + N11
Xa=2            N02          N12

Heterozygous
Counts          T=0          T=1
Xa=0 or Xa=2    N00 + N02    N10 + N12
Xa=1            N01          N11

Recessive
Counts          T=0          T=1
Xa=0            N00          N10
Xa=1 or Xa=2    N01 + N02    N11 + N12
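The merging of genotype states in the count tables above can be sketched as follows. The dictionary `GENETIC_MODELS` and the function `collapse_counts` are hypothetical names; the state groupings follow the four tables.

```python
import numpy as np

# State groupings per genetic model, following the count tables (hypothetical encoding).
GENETIC_MODELS = {
    "G": [[0], [1], [2]],   # Genotypic: all three states kept
    "D": [[0, 1], [2]],     # Dominant: Xa=0 and Xa=1 merged
    "R": [[0], [1, 2]],     # Recessive: Xa=1 and Xa=2 merged
    "H": [[0, 2], [1]],     # Heterozygous: Xa=0 and Xa=2 merged
}

def collapse_counts(counts, model):
    """Merge genotype columns of a (2, 3) count table under a genetic model.

    counts[j, k] is the number of instances with T=j and Xa=k.
    Returns an array with one column per merged state.
    """
    counts = np.asarray(counts)
    groups = GENETIC_MODELS[model]
    return np.stack([counts[:, g].sum(axis=1) for g in groups], axis=1)
```

For example, `collapse_counts([[10, 20, 30], [5, 15, 25]], "D")` gives the rows `[30, 30]` and `[20, 25]`, so the dominant model has two states per class instead of three.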
Weight each model by posterior probability
of the model given the observed data
Likelihood is a counting problem
Some possible NB models
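Genotypic (G):
$$P(D_i \mid T \rightarrow X_i, G) = \prod_{j=1}^{r_T} \frac{(r_i^{G} - 1)!}{(N_{ij}^{G} + r_i^{G} - 1)!} \prod_{k=1}^{r_i^{G}} (N_{ijk}^{G})!$$
Dominant (D):
$$P(D_i \mid T \rightarrow X_i, D) = \prod_{j=1}^{r_T} \frac{(r_i^{D} - 1)!}{(N_{ij}^{D} + r_i^{D} - 1)!} \prod_{k=1}^{r_i^{D}} (N_{ijk}^{D})!$$
Recessive (R):
$$P(D_i \mid T \rightarrow X_i, R) = \prod_{j=1}^{r_T} \frac{(r_i^{R} - 1)!}{(N_{ij}^{R} + r_i^{R} - 1)!} \prod_{k=1}^{r_i^{R}} (N_{ijk}^{R})!$$
Heterozygous (H):
$$P(D_i \mid T \rightarrow X_i, H) = \prod_{j=1}^{r_T} \frac{(r_i^{H} - 1)!}{(N_{ij}^{H} + r_i^{H} - 1)!} \prod_{k=1}^{r_i^{H}} (N_{ijk}^{H})!$$
The superscript marks the number of states r_i and the counts N_ij, N_ijk of feature X_i after merging states under that genetic model.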
$$P(T \mid X) = \sum_{M} P(T \mid X, M)\, P(M \mid D)$$
Using independence, find one network that
is equivalent to averaging in linear time!
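The equivalence between averaging over all 2^A structures and a single network can be checked numerically on a small example. The sketch below assumes that arc posteriors factor across features and that the average is taken over the joint P(T, X) before normalizing. The names `averaged_joint`, `brute_force_joint`, `theta_arc`, `theta_noarc`, and `p_arc` are hypothetical.

```python
import itertools
import numpy as np

def averaged_joint(x, prior_t, theta_arc, theta_noarc, p_arc):
    """Linear-time form: a single NB whose per-feature likelihood is a mixture.

    theta_arc[i, j, k]  = P(X_i=k | T=j) when the arc T -> X_i is present
    theta_noarc[i, k]   = P(X_i=k) when the arc is absent
    p_arc[i]            = posterior probability of the arc T -> X_i
    """
    idx = np.arange(len(x))
    with_arc = theta_arc[idx, :, x]               # shape (A, n_classes)
    without_arc = theta_noarc[idx, x][:, None]    # shape (A, 1)
    lik = p_arc[:, None] * with_arc + (1 - p_arc)[:, None] * without_arc
    return prior_t * lik.prod(axis=0)

def brute_force_joint(x, prior_t, theta_arc, theta_noarc, p_arc):
    """Explicit sum over all 2^A arc subsets (for checking small A only)."""
    total = np.zeros_like(prior_t, dtype=float)
    for arcs in itertools.product([0, 1], repeat=len(x)):
        weight = 1.0
        joint = prior_t.astype(float).copy()
        for i, has_arc in enumerate(arcs):
            if has_arc:
                weight *= p_arc[i]
                joint = joint * theta_arc[i, :, x[i]]
            else:
                weight *= 1 - p_arc[i]
                joint = joint * theta_noarc[i, x[i]]
        total += weight * joint
    return total
```

Dividing either result by its sum gives P(T | X). The first function does O(A) work per instance, while the second grows as 2^A and is only practical for a handful of features.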
$$P(D_i \mid T \rightarrow X_i) = \prod_{j=1}^{r_T} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!$$
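Because the arc likelihood is a product of factorials of counts, it is usually evaluated in log space. The sketch below is one way to do this with `math.lgamma`, using the identity n! = Γ(n + 1); the function name `log_ml_arc` is hypothetical.

```python
from math import lgamma

def log_ml_arc(counts):
    """log P(D_i | T -> X_i) from a count table.

    counts[j][k] = N_ijk, the number of instances with T=j and X_i=k.
    Each row j contributes (r - 1)! / (N_ij + r - 1)! * prod_k N_ijk!.
    """
    total = 0.0
    for row in counts:
        r = len(row)       # number of states of X_i
        n_ij = sum(row)    # instances with T=j
        total += lgamma(r) - lgamma(n_ij + r)         # log[(r-1)! / (N_ij+r-1)!]
        total += sum(lgamma(n + 1) for n in row)      # log prod_k N_ijk!
    return total
```

For the 2 × 2 table `[[1, 0], [0, 1]]`, each row contributes 1!/2! = 1/2, so the likelihood is 1/4 and the function returns log(0.25).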
Data \ Alg.   MANB Genotypic     MANB Dominant      MANB Recessive      MANB Heterozygous   GMANB
HT            0.7718 ± 0.0092    0.7881 ± 0.0102    0.8026 ± 0.0116     0.7158 ± 0.0098     0.8091 ± 0.0108 *
T1D           0.8691 ± 0.0090    0.8196 ± 0.0099    0.8192 ± 0.0111     0.8002 ± 0.0117     0.8704 ± 0.0092 *
T2D           0.7562 ± 0.0092    0.7997 ± 0.0108    0.8069 ± 0.0117 *   0.7802 ± 0.0125     0.7951 ± 0.0113
RA            0.8252 ± 0.0098    0.8157 ± 0.0110    0.8295 ± 0.0110     0.7846 ± 0.0125     0.8380 ± 0.0104 *
Five-fold cross-validated prediction AUC for 5 MANB models (α = 0.05).
An asterisk (*) marks the best-performing algorithm for each dataset.
[Figure labels: NB Structure; NB Parameters; Select best-fitting model]
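$$P(D_i \mid T \cdots X_i) = \frac{(r_i - 1)!}{(N_i + r_i - 1)!} \prod_{k=1}^{r_i} N_{i \ast k}!$$
(no arc between T and X_i; N_{i*k} is the count of X_i = k summed over all states of T)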
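Combining the arc and no-arc likelihoods gives a posterior probability of association for each SNP, which is the quantity the Genetic Model Selection section uses to pick a genetic model. The sketch below is an illustration under stated assumptions, not the poster's implementation: the arc prior `prior_arc=0.5` is a placeholder, the no-arc likelihood is computed on the same merged states as the arc likelihood, and all function names are hypothetical.

```python
from math import lgamma, log
import numpy as np

# Hypothetical state groupings for the four genetic models.
GROUPS = {"G": [[0], [1], [2]], "D": [[0, 1], [2]],
          "R": [[0], [1, 2]], "H": [[0, 2], [1]]}

def log_ml_arc(counts):
    """log P(D_i | T -> X_i) for counts[j, k] = N_ijk."""
    r = counts.shape[1]
    return sum(lgamma(r) - lgamma(row.sum() + r)
               + sum(lgamma(n + 1) for n in row) for row in counts)

def log_ml_no_arc(counts):
    """log P(D_i | no arc): the same formula on counts pooled over T."""
    return log_ml_arc(counts.sum(axis=0, keepdims=True))

def posterior_arc(counts, prior_arc=0.5):
    """P(arc | D_i) by Bayes' theorem; prior_arc is an assumed value."""
    counts = np.asarray(counts, dtype=float)
    log_a = log_ml_arc(counts) + log(prior_arc)
    log_b = log_ml_no_arc(counts) + log(1 - prior_arc)
    return float(np.exp(log_a - np.logaddexp(log_a, log_b)))

def select_model(counts, prior_arc=0.5):
    """Return the genetic model with the highest arc posterior, plus all scores."""
    counts = np.asarray(counts, dtype=float)
    scores = {}
    for name, groups in GROUPS.items():
        merged = np.stack([counts[:, g].sum(axis=1) for g in groups], axis=1)
        scores[name] = posterior_arc(merged, prior_arc)
    best = max(scores, key=scores.get)
    return best, scores
```

With strongly associated counts several models can reach a posterior that rounds to 1.0 in floating point; comparing the log odds `log_a - log_b` instead of the posterior would break such ties.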