Haplotalk

Introduction Induced Haplotyping Improved FPT algorithm Simple Kernel Conclusion
Extended Islands of Tractability for

Parsimony Haplotyping
Rudolf Fleischer1 , Jiong Guo2 , Rolf Niedermeier3 , Johannes

Uhlmann3 , Yihui Wang1 , Mathias Weller3 , and Xi Wu1
1 Fudan University Shanghai, 2 Universität des Saarlandes,

and 3 Friedrich-Schiller-Universität Jena
January 7, 2015
1 / 26
First, Some Biology...
A G T T A G C G A
A G T C A G C A A
gene gene
approx. 0.1% of human nucleotide sites differ between individuals
2 / 26
A G T T A G C G A
A G T C A G C A A
gene gene
approx. 0.1% of human nucleotide sites differ between individuals
2 / 26
A G T T A G C G A
A G T C A G C A A
gene gene
the sequence of SNPs is called a haplotype
2 / 26
T G
C A
the sequence of SNPs is called a haplotype
2 / 26
humans are diploid ; 2 chromosome sets

haplotype = SNP sequence in one chromosome set
genotype = SNP sequence in the combined chromosome sets
3 / 26
humans are diploid ; 2 chromosome sets

haplotype = SNP sequence in one chromosome set
genotype = SNP sequence in the combined chromosome sets
3 / 26
Motivation
Haplotype Inference
goal: find relation between certain SNPs and genetic diseases
problem: difficult (expensive) to sequence both haplotypes
but: easy (cheap) to sequence the genotype instead
; idea: sequence genotype and computationally infer haplotypes
4 / 26
Motivation
Haplotype Inference
goal: find relation between certain SNPs and genetic diseases
problem: difficult (expensive) to sequence both haplotypes
but: easy (cheap) to sequence the genotype instead
; idea: sequence genotype and computationally infer haplotypes
Problems
impossible to infer haplotypes of just 1 genotype
; sequence and infer groups/populations
which explanation should be preferred if there are multiple?
; parsimony
how to perform the actual computation fast?
4 / 26
Parsimony
Parsimony (Ockham’s razor)

Under many plausible explanations of an observed phenomenon,
the one requiring the fewest assumptions should be preferred.
5 / 26
Parsimony
Parsimony (Ockham’s razor)

Under many plausible explanations of an observed phenomenon,
the one requiring the fewest assumptions should be preferred.
used in...
... Clark’s problem [Clark, Molecular Biology and Evolution ’90]
... pure parsimony haplotyping [Lancia et al., INFORMS ’04]
... minimum perfect phylogeny [Gusfield & Orzack, HBI ’05]
... k-minimum recombination configuration [Li & Jiang, JBCB ’03]

...
5 / 26
Preliminary Definitions
Definition (Haplotype, Genotype, Resolve)

haplotype h = string over {0, 1}
genotype g = string over {0, 1, 2}
h1 , h2 resolve g ⇔
for all i ∈ N, g [i] = h1 [i] = h2 [i] or g [i] = 2 and h1 [i] 6= h2 [i]
multiset H ; res (H)
multiset H resolves multiset G ⇔ G ⊆ res (H)
Example
haplotype1: 0 0 1 0 1 1 1 0 0 0 1
haplotype2: 0 0 1 1 0 1 0 0 1 1 1
genotype: 0 0 1 2 2 1 2 0 2 2 1
6 / 26

Example
haplotype1: 0 0 1 0 1 1 1 0 0 0 1
haplotype2: 0 0 1 1 0 1 0 0 1 1 1
genotype: 0 0 1 2 2 1 2 0 2 2 1
6 / 26

Example
haplotype1: 0 0 1 1 1 1 1 0 0 0 1
haplotype2: 0 0 1 0 0 1 0 0 1 1 1
genotype: 0 0 1 2 2 1 2 0 2 2 1
6 / 26
Definition (Haplotype Graph)

Given H and G ⊆ res (H). haplotype graph of H and G :
|H| vertices (labeled by H)
|G | edges (labeled by G )
haplotypes of each edge resolve its genotype
Example
01001 11111
21221
1
11011 2102
22222
11122
12
21
2
12120
10110 11100
7 / 26
Definition (Haplotype Graph)

Given H and G ⊆ res (H). haplotype graph of H and G :
|H| vertices (labeled by H)
|G | edges (labeled by G )
haplotypes of each edge resolve its genotype
Example
01001 11111
21221
1
11011 2102
22222
11122
12
21
2
12022
10110 11100
7 / 26
Haplotype Inference by Parsimony
Definition (Haplotype Inference by Parsimony)

Input: multiset G of length-m genotypes, integer k ≥ 0
Question: ∃ multiset H of k haplotypes that resolves G ?
Example
genotypes haplotype graph haplotypes
11122 01001
21221 11111 01001
12120 11011 2102
1
10110
12212
22222
11122
11011
21021
12
11100
21
2
21221 12120 11111

22222 10110 11100
8 / 26
Previous Work
Complexity
NP-hard [Halldórsson et al., DMTCS ’03]
APX-hard [Lancia et al., INFORMS ’04]
many special cases in P (e.g. [Lancia & Rizzi, ORL ’06])
Algorithms
O(2|G |·d ) Branch&Bound [Wang & Xu, Bioinformatics ’03]
ILP [Lancia & Serafini, INFORMS ’08]

2
O(mk 2k ) FPT algorithm [Sharan et al., TCBB ’06]
Factor-2d−1 -Approximation [Lancia & Rizzi, ORL ’06]

Constrained Version [Fellows et al., CPM ’09]
k := #haplotypes, m := stringlength, d := max #of 2’s in a genotype

9 / 26
Our Contribution
polynomial-time solvable special case:

“Induced Haplotype Inference”
k 4k · poly (|G |, m) time algorithm
simple O(2k · k 2 )-bit problem kernel
10 / 26
Induced Haplotype Inference
remember: H resolves G ⇔ G ⊆ res (H)

what if G = res (H)? ; haplotype graph is a clique
Definition
G = res (H) ; H induces G
11 / 26
Induced Haplotype Inference
remember: H resolves G ⇔ G ⊆ res (H)

what if G = res (H)? ; haplotype graph is a clique
Definition
G = res (H) ; H induces G
Definition (Induced Haplotype Inference by Parsimony)

Input: multiset G of length-m genotypes
Question: ∃ multiset H of haplotypes that induces G ?
11 / 26
Observations & Algorithm
Observation
G can be “nicely” partitioned into G0 , G1 , and G2 .
12 / 26
Observation
Example
2 0 1 2 1
2 0 2 1 1
2 2 1 1 2 10 genotypes in total ; 5 haplotypes
2 2 1 2 2
1 0 2 2 1
1 2 1 0 2
1 2 1 2 2
1 2 2 1 2
1 2 2 2 2
1 1 1 2 0
12 / 26
Observation
Example
2 0 1 2 1
2 0 2 1 1
2 2 1 2 2 6 genotypes with 1 ; 4 haplotypes with 1
1 0 2 2 1 0 genotypes with 0 ; 1 haplotype with 0
1 2 1 0 2
1 2 1 2 2
1 2 2 1 2
1 2 2 2 2
1 1 1 2 0
12 / 26
Observation
Example
2 0 1 2 1
2 0 2 1 1
1 2 1 2 2
1 2 2 1 2
1 2 2 2 2
1 1 1 2 0
12 / 26
Observation
Example
2 0 1 2 1
2 0 2 1 1
1 2 1 2 2 1 genotype with 1 ; 2 haplotypes with 1
1 2 2 1 2 1 genotype with 0 ; 2 haplotypes with 0
1 2 2 2 2
1 1 1 2 0
12 / 26
Observation
Example
2 0 1 2 1
2 0 2 1 1
1 2 1 2 2
1 2 2 1 2
1 2 2 2 2
1 1 1 2 0
12 / 26
Observation
Observation
G2 6= ∅ but G0 = ∅ or G1 = ∅ ; poly
|G0 | = |G1 | = 1 ; poly (although we may get 2 solutions)
12 / 26
Observation
Observation
G2 6= ∅ but G0 = ∅ or G1 = ∅ ; poly
Strategy: Divide & Conquer

divide-step
base cases
merge-step
12 / 26
Observation
Observation
G2 6= ∅ but G0 = ∅ or G1 = ∅ ; poly

divide-step ; OK
base cases
merge-step
12 / 26
Observation
Observation
G2 6= ∅ but G0 = ∅ or G1 = ∅ ; poly

divide-step ; OK
base cases ; OK
merge-step
12 / 26
Observation
Observation
G2 6= ∅ but G0 = ∅ or G1 = ∅ ; poly

divide-step ; OK
base cases ; OK
merge-step ; problem!
need to find a way to compute H1 for given H0 and G2
12 / 26
Extending a Subclique Solution
Observation
Let H0 induce G0 and let g be a genotype in G2 with the smallest
number of 2’s. ; All h ∈ H0 that are consistent with g are
equal.
13 / 26
Observation
equal.
Proof Idea
H1 H0
h’=1100... g=2120...
h=0110...
13 / 26
Observation
equal.
Proof Idea
H1 H0
h’=1100... g=2120...
h=0110...
g’
h’’=01?0...
13 / 26
Concluding The Induced Problem
Divide & Conquer like algorithm

divide step O(|G |)
base solutions O(m)
extend (merge) O(|G2 | · |Hx | · m)
all in all O(|G | · k · m)
14 / 26
Concluding The Induced Problem
Divide & Conquer like algorithm

divide step O(|G |)
base solutions O(m)
extend (merge) O(|G2 | · |Hx | · m)
all in all O(|G | · k · m)
Theorem
Induced Haplotype Inference by Parsimony can be solved in
O(|G | · k · m) time.
14 / 26
General Haplotype Inference by Parsimony
now: drop the clique-constraint

; arbitrary haplotype graph
; NP-hard
15 / 26
General Haplotype Inference by Parsimony
now: drop the clique-constraint

; arbitrary haplotype graph
; NP-hard
Idea
If we knew only the genotype labels of the correct haplotype graph,
could we reconstruct the correct haplotype labels?
; inference graph
15 / 26
Definitions
Definition (Inference Graph)

inference graph Γ of G = an order-k graph with edges
consistently labeled by the genotypes in G
Example
21221
1
2102
22222
11122
12
21
2
12120
16 / 26
Observations
Observation (non-bipartite components of Γ)

C = cycle in Γ
; for all i, |{g ∈ C | g [i] = 2}| is even
non-bipartite components of Γ ; O(|Γ| · m) time
Example
21221
1
2102
22222
11122
12
21
2
12120
17 / 26
Observations
Observation (bipartite components of Γ)

all g [i] = 2 for some i ⇒ choose arbitrarily
bipartite components of Γ ; O(|Γ| · m) time
Example
21221
22222
11122
12120
18 / 26
Observations
Observation
(G , k) yes-instance with solution H
⇒ ∃ Γ extendable (O(|Γ| · m) time) to a haplotype graph of H and G
19 / 26
Observations
Observation
algorithmic idea: guess Γ
19 / 26
Observations
Observation
algorithmic idea: guess Γ
better idea: guess a “spanning” subgraph of Γ
19 / 26
Solving The Problem
Algorithm
1 guess “spanning” subgraph of Γ
2 infer the haplotype multiset H ; O(k · m) time
3 check whether H resolves G ; O(k 2 · m) time
20 / 26
Solving The Problem
Algorithm
1 guess a size-k genotype subset of G ; O(k 2k ) possibilities
2 for these genotypes, guess 2 (of k) vertices ; O(k 2k ) possibilities
20 / 26
Solving The Problem
Algorithm
1 guess a size-k genotype subset of G ; O(k 2k ) possibilities
2 for these genotypes, guess 2 (of k) vertices ; O(k 2k ) possibilities
Theorem
Haplotype Inference by Parsimony can be solved in O(k 4k+2 · m)
time.
20 / 26
Example
11101 01011
21221
12222
01022
22020
10010 01000
01001 11111
21221
1
11011 2102
22222
11122
12
21
2
12120
10110 11100
21 / 26
Example
21221
12222
01022
22020
21221
1
2102
22222
11122
12
21
2
21 / 26
Example
21221
12222
01022
22020
?10?1
21221
1
2102
22222
11122
12
21
2
21 / 26
Example
11??1 010?1
21221
12222
01022
22020
1?0?0 010?0
?10?1 111?1
21221
1
11011 2102
22222
11122
12
21
2
1??1? 111??
21 / 26
Example
111?1 010?1
21221
12222
01022
22020
100?0 010?0
01001 11111
21221
1
11011 2102
22222
11122
12
21
2
10110 11100
21 / 26
Example
11101 01011
21221
12222
01022
22020
10010 01000
01001 11111
21221
1
11011 2102
22222
11122
12
21
2
10110 11100
21 / 26
Kernelization: A Simple Exponential Kernel
In the following: Matrix Representation MG and MH
22 / 26

Observation
Equal columns in MG ; no new constraints or information
22 / 26

Observation
Idea: remove all but one of these columns
22 / 26

Observation
Idea: remove all but one of these columns

Reduction Rule
For all equal columns i,j of MG , delete column j.
22 / 26
Correctness, Size, and Example
Lemma
MH resolves MG :
columns i,j of MH equal ⇒ columns i,j of MG equal
23 / 26
Lemma
MH resolves MG :
Example
MG MH
1 1 1 2 1 2
0 1 0 0 0 1
1 2 1 2 1 0
1 0 1 1 1 0
1 2 2 1 2 2
1 1 0 1 0 1
2 1 0 2 0 1
1 1 1 0 1 1
2 1 2 2 2 1
1 1 1 1 1 1
2 2 2 2 2 2
23 / 26
Lemma
MH resolves MG :
Example
MG MH
1 1 1 2 1 2
0 1 0 0 0 1
1 2 1 2 1 0
1 0 1 1 1 0
1 2 2 1 2 2
1 1 0 1 0 1
2 1 0 2 0 1
1 1 1 0 1 1
2 1 2 2 2 1
1 1 1 1 1 1
2 2 2 2 2 2
23 / 26
Lemma
MH resolves MG :
Example
MG MH
1 1 1 2 2
0 1 0 0 1
1 2 1 2 0
1 0 1 1 0
1 2 2 1 2
1 1 0 1 1
2 1 0 2 1
1 1 1 0 1
2 1 2 2 1
1 1 1 1 1
2 2 2 2 2
23 / 26
Lemma
MH resolves MG :
Example
MG MH
1 1 1 2 1 2
0 1 0 0 0 1
1 2 1 2 1 0
1 0 1 1 1 0
1 2 2 1 2 2
1 1 0 1 0 1
2 1 0 2 0 1
1 1 1 0 1 1
2 1 2 2 2 1
1 1 1 1 1 1
2 2 2 2 2 2
23 / 26
Kernel Conclusion
Theorem
columns: ≤ 2k , rows: ≤ k2

; overall kernel size: 2k · k2

computation in O(|G | · m log m) time

; previous algorithm: O(k 4k+2 · 2k + k 2 · m log m) time
24 / 26
Conclusion
what we saw. . .
introduced induced variant with O(k 3 · m) time algorithm
2
improved 2O(k log k) time algorithm to 2O(k log k) time
presented O(2k · k 2 )-bit kernel
25 / 26
Conclusion
what we saw. . .
2
also in the paper

results also hold for sets instead of multisets
algorithmic results basically also hold for constrained variant
25 / 26
Conclusion
what we saw. . .
2
also in the paper

results also hold for sets instead of multisets
algorithmic results basically also hold for constrained variant
future work
find polynomial kernel (or prove nonexistence)
distance from triviality measures
find 2O(k) time algorithm
25 / 26
Thank you
26 / 26

Haplotalk

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Haplotalk

Uploaded by

Copyright:

Available Formats

Introduction Induced Haplotyping Improved FPT algorithm Simple Kernel Conclusion

Extended Islands of Tractability for

Rudolf Fleischer1 , Jiong Guo2 , Rolf Niedermeier3 , Johannes

1 Fudan University Shanghai, 2 Universität des Saarlandes,

First, Some Biology...

First, Some Biology...

First, Some Biology...

First, Some Biology...

the sequence of SNPs is called a haplotype

First, Some Biology...

humans are diploid ; 2 chromosome sets

First, Some Biology...

humans are diploid ; 2 chromosome sets

Parsimony (Ockham’s razor)

Parsimony (Ockham’s razor)

... pure parsimony haplotyping [Lancia et al., INFORMS ’04]

... minimum perfect phylogeny [Gusfield & Orzack, HBI ’05]

... k-minimum recombination configuration [Li & Jiang, JBCB ’03]

Definition (Haplotype, Genotype, Resolve)

Definition (Haplotype, Genotype, Resolve)

Definition (Haplotype, Genotype, Resolve)

Definition (Haplotype Graph)

Definition (Haplotype Graph)

Haplotype Inference by Parsimony

Definition (Haplotype Inference by Parsimony)

21221 12120 11111

APX-hard [Lancia et al., INFORMS ’04]

many special cases in P (e.g. [Lancia & Rizzi, ORL ’06])

ILP [Lancia & Serafini, INFORMS ’08]

Factor-2d−1 -Approximation [Lancia & Rizzi, ORL ’06]

k := #haplotypes, m := stringlength, d := max #of 2’s in a genotype

polynomial-time solvable special case:

Induced Haplotype Inference

remember: H resolves G ⇔ G ⊆ res (H)

Induced Haplotype Inference

remember: H resolves G ⇔ G ⊆ res (H)

Definition (Induced Haplotype Inference by Parsimony)

Observations & Algorithm

Observations & Algorithm

Observations & Algorithm

Observations & Algorithm

Observations & Algorithm

Observations & Algorithm

Observations & Algorithm

Observations & Algorithm

Strategy: Divide & Conquer

Observations & Algorithm

Strategy: Divide & Conquer

Observations & Algorithm

Strategy: Divide & Conquer

Observations & Algorithm

Strategy: Divide & Conquer

Extending a Subclique Solution

Extending a Subclique Solution

Extending a Subclique Solution

Concluding The Induced Problem

Divide & Conquer like algorithm

Concluding The Induced Problem

Divide & Conquer like algorithm

General Haplotype Inference by Parsimony

now: drop the clique-constraint

General Haplotype Inference by Parsimony

now: drop the clique-constraint