
Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences

Thomas Schmidt, Jens Stoye


CPM 2004, Istanbul

Overview:
- Introduction
- Formal Model
- Algorithms
- Results

Thomas Schmidt: Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences

Gene Order and Function in Bacteria:

Observations:
- Gene order in bacterial genomes is weakly conserved
- Some genes tend to cluster together even in unrelated species
- Functional association of genes inside a cluster

Are there more clusters?

Task:

Establish a model and search for gene clusters


Formalization of Gene Clusters:


Genomes: permutations π1, π2, ..., πk
Genes: numbers 1, ..., n
Gene cluster: common interval (a subset of numbers occurring contiguously in all permutations)

[Figure: example permutations of the genes 1, ..., 8 with common intervals highlighted]

Algorithms:
- Uno & Yagiura, Algorithmica 2000: Find all common intervals of two permutations in O(n+|output|) time.
- Heber & Stoye, CPM 2001: Find all common intervals of k ≥ 2 permutations in O(kn+|output|) time.
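To make the definition of a common interval concrete, here is a minimal brute-force sketch in Python. It is not either of the cited algorithms: it simply enumerates every contiguous window of each permutation and intersects the resulting gene sets, which takes far more than linear time. The function names and the two toy permutations are assumptions chosen for illustration.

```python
from typing import FrozenSet, Sequence, Set


def contiguous_sets(perm: Sequence[int]) -> Set[FrozenSet[int]]:
    """Gene sets of size >= 2 that occupy contiguous positions in perm."""
    found = set()
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            found.add(frozenset(perm[i:j + 1]))
    return found


def common_intervals(perms: Sequence[Sequence[int]]) -> Set[FrozenSet[int]]:
    """Gene sets that are contiguous in every permutation (brute force)."""
    result = contiguous_sets(perms[0])
    for perm in perms[1:]:
        result &= contiguous_sets(perm)
    return result


# Hypothetical toy genomes, not taken from the slides.
pi1 = [1, 2, 3, 4, 5]
pi2 = [3, 1, 2, 5, 4]
clusters = common_intervals([pi1, pi2])
for cluster in sorted(clusters, key=lambda c: (len(c), sorted(c))):
    print(sorted(cluster))
```

For these two permutations the common intervals are {1,2}, {4,5}, {1,2,3} and the full gene set.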


Modeling multiple copies of a gene (paralogs):


Problem:
- Gene duplication results in multiple copies of a gene inside a genome
- Difficult to assign the correct gene pair

Solution:
- Do not distinguish between paralogous gene copies
- Each paralogous copy of a gene gets the same number

Consequence:
- Genomes are modeled as sequences instead of permutations

[Figure: three example genomes S1, S2, S3 modeled as sequences over the genes 1, ..., 8]

Overview:
- Introduction
  - Comparative genomics
  - Common Intervals and Gene Clusters
- Formal Model
- Algorithms
  - Simple Data Structure: Quadratic Space
  - Saving Space
- Results

Formal Model:
Given: String S over a finite alphabet Σ

Notation:
S[i] = the i-th character of S
S[i,j] = substring of S starting at index i and ending at index j

Definition: The character set CS(S[i,j]) := {S[k] | i ≤ k ≤ j} is the set of all characters occurring in the substring S[i,j].

Example:

S:     3 1 2 3 1 5 2 6
index: 1 2 3 4 5 6 7 8

CS(S[2,5]) = {1,2,3}
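The character set can be computed directly from its definition. The sketch below assumes the 1-based, inclusive indexing used on the slides; the function name `char_set` is a hypothetical helper, not part of the talk.

```python
from typing import Sequence, Set


def char_set(s: Sequence[int], i: int, j: int) -> Set[int]:
    """CS(S[i,j]) for 1-based, inclusive indices i <= j."""
    if not 1 <= i <= j <= len(s):
        raise ValueError("need 1 <= i <= j <= len(s)")
    # Python slices are 0-based and exclude the end, hence i - 1 and j.
    return set(s[i - 1:j])


S = [3, 1, 2, 3, 1, 5, 2, 6]
print(sorted(char_set(S, 2, 5)))  # [1, 2, 3]
```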

Formal Model:
Given: Subset C ⊆ Σ

Definition: (i, j) is a CS-location of C in S if and only if CS(S[i,j]) = C. A CS-location (i, j) is
- left-maximal if S[i-1] ∉ C
- right-maximal if S[j+1] ∉ C
- maximal if it is both left- and right-maximal

Example:

S:     3 1 2 3 1 5 2 6
index: 1 2 3 4 5 6 7 8

The pair (3,5) is a CS-location of the set C = {1,2,3}, because CS(S[3,5]) = {1,2,3}, but it is not left-maximal, since S[2] = 1 ∈ C.
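The maximality conditions can be checked mechanically. The following Python sketch lists every CS-location of a set C together with its left- and right-maximality flags. It assumes 1-based indices and treats a location that touches the start or end of S as maximal on that side; that boundary convention and the name `cs_locations` are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def cs_locations(s: Sequence[int], c: Iterable[int]) -> List[Tuple[int, int, bool, bool]]:
    """All CS-locations of c in s as (i, j, left_maximal, right_maximal)."""
    c = set(c)
    n = len(s)
    out = []
    for i in range(1, n + 1):
        seen = set()
        for j in range(i, n + 1):
            ch = s[j - 1]
            if ch not in c:
                break  # every longer substring would contain a foreign character
            seen.add(ch)
            if seen == c:
                left_max = i == 1 or s[i - 2] not in c   # s[i - 2] is S[i-1]
                right_max = j == n or s[j] not in c      # s[j] is S[j+1]
                out.append((i, j, left_max, right_max))
    return out


S = [3, 1, 2, 3, 1, 5, 2, 6]
locations = cs_locations(S, {1, 2, 3})
maximal = [(i, j) for i, j, lm, rm in locations if lm and rm]
print(locations)
print(maximal)
```

For the example string, (3,5) is reported as right-maximal but not left-maximal, and (1,5) is the only maximal CS-location of {1,2,3}.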

Formal Model:
Given: Collection of k strings S* = (S1,...,Sk) over alphabet Σ

Definition: C is a common CS-factor of S* if and only if C has a CS-location in each Sl, 1 ≤ l ≤ k.

Example:

S1: 3 2 1 3 1 5 1 6
S2: 4 3 5 5 5 1 4 2 2
S3: 7 5 1 5 3 6 5

common CS-factor: {1,3,5} => S1: (3,7) S2: (2,6) S3: (2,5)
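The example can be verified with a short Python sketch. It finds the maximal CS-locations of a set C by scanning for maximal runs of characters from C and keeping the runs that use every character of C. A set has a CS-location in a string exactly when it has a maximal one, so this also decides whether C is a common CS-factor. The function names are hypothetical, and S3 is given as read off the slide (7 5 1 5 3 6 5), which is an assumption because the extracted text is garbled at that point.

```python
from typing import Iterable, List, Sequence, Tuple


def maximal_cs_locations(s: Sequence[int], c: Iterable[int]) -> List[Tuple[int, int]]:
    """Maximal CS-locations (1-based, inclusive) of the set c in s."""
    c = set(c)
    out = []
    start = None
    # The trailing None is a sentinel that closes a run ending at the last position.
    for p, ch in enumerate(list(s) + [None], start=1):
        if ch in c:
            if start is None:
                start = p
        elif start is not None:
            if set(s[start - 1:p - 1]) == c:   # run covers positions start..p-1
                out.append((start, p - 1))
            start = None
    return out


def is_common_cs_factor(c: Iterable[int], strings: Sequence[Sequence[int]]) -> bool:
    """True if c has a CS-location in every string."""
    c = set(c)
    return all(maximal_cs_locations(s, c) for s in strings)


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]   # assumption: reconstructed from the garbled slide text
C = {1, 3, 5}
print(is_common_cs_factor(C, [S1, S2, S3]))
print([maximal_cs_locations(s, C) for s in (S1, S2, S3)])
```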

Problem Formulation:
A common CS-factor of k strings represents a gene cluster that occurs in each of the k genomes.

Given a collection of k strings S*:
Problem 1: Find all common CS-factors in S*.
Problem 2: For each common CS-factor, find all its maximal CS-locations in each of the strings.


Algorithm "Connecting Intervals" (CI)


Algorithm CI solves Problem 1 and Problem 2 for two sequences.
Input: Two sequences of length up to n with characters drawn from Σ = {1,...,m}, m ≤ 2n
Output: Pairs of CS-locations of all common CS-factors
Time & Space complexity: O(n²)


Preprocessing
Compute two tables for S1 = (3,1,2,3,1,5,2,6):

POS[1] = 2,5
POS[2] = 3,7
POS[3] = 1,4
POS[4] = empty
POS[5] = 6
POS[6] = 8

POS[c] holds all positions where character c occurs in S1.

NUM(i,j):

i:        1  2  3  4  5  6  7  8
j = 1:    1
j = 2:    2  1
j = 3:    3  2  1
j = 4:    3  3  2  1
j = 5:    3  3  3  2  1
j = 6:    4  4  4  3  2  1
j = 7:    4  4  4  4  3  2  1
j = 8:    5  5  5  5  4  3  2  1

NUM(i,j) counts the number of different characters in S1[i,j].
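Both tables are straightforward to build. The sketch below is one possible way to do it in Python, assuming 1-based positions as on the slides; `build_pos` and `build_num` are hypothetical names. The NUM table is what makes the space requirement quadratic.

```python
from typing import Dict, List, Sequence


def build_pos(s: Sequence[int]) -> Dict[int, List[int]]:
    """POS[c]: ascending 1-based positions of character c in s."""
    pos: Dict[int, List[int]] = {}
    for p, c in enumerate(s, start=1):
        pos.setdefault(c, []).append(p)
    return pos


def build_num(s: Sequence[int]) -> List[List[int]]:
    """num[i][j]: number of different characters in s[i..j] for 1 <= i <= j <= n."""
    n = len(s)
    num = [[0] * (n + 1) for _ in range(n + 1)]  # row and column 0 stay unused
    for i in range(1, n + 1):
        seen = set()
        for j in range(i, n + 1):
            seen.add(s[j - 1])
            num[i][j] = len(seen)
    return num


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
POS = build_pos(S1)
NUM = build_num(S1)
print(POS.get(3), POS.get(4, []))
print(NUM[1][8], NUM[4][7])
```

Characters that do not occur in S1, such as 4, simply have no entry in the dictionary, which corresponds to the empty list on the slide.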

Algorithm CI

Algorithm: While reading S2, mark each observed character at all of its positions in S1 and track the maximal intervals of marked characters.

S2: 4 3 5 5 5 1 4 2 2

S1:    3 1 2 3 1 5 2 6
index: 1 2 3 4 5 6 7 8

Output for the start positions i = 1 and i = 2 in S2, written as ((i,j) in S2 - location in S1):
((2,2)-(1,1)) ((2,2)-(4,4)) ((2,6)-(4,6))

For i = 2 and j = 7, (i,j) is not left-maximal, because S2[7] = S2[1] = 4.

Time Complexity
Algorithm CI finds all common CS-factors of S1 and S2 in O(n²) time.

for i = 1,...,|S2| do
    j = i
    while j ≤ |S2| and (i,j) is left-maximal do
        if c = S2[j] is seen for the first time then
            for each entry in POS[c] do
                mark and track
            end for
        end if
        j = j + 1
    end while
end for

For a fixed i, each list POS[c] is traversed at most once, so marking costs O(n) per start position and O(n²) in total.

POS[1] = 2,5  POS[2] = 3,7  POS[3] = 1,4  POS[4] = empty  POS[5] = 6  POS[6] = 8

S2: 4 3 5 5 5 1 4 2 2
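The loop structure above can be sketched in Python as follows. This is a simplified reading of the slides, not the authors' implementation: "mark and track" is realised here with two boundary arrays that merge adjacent marked runs in S1, and a run is reported when its NUM value equals the number of characters read so far. All names (`connecting_intervals`, `run_start`, `run_end`, `candidates`) are assumptions, and positions are 1-based.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[Tuple[int, int], Tuple[int, int]]


def connecting_intervals(s1: Sequence[int], s2: Sequence[int]) -> List[Pair]:
    """Pairs ((i, j) in s2, (a, b) in s1) of maximal CS-locations with equal character sets."""
    n1, n2 = len(s1), len(s2)
    pos = {}
    for p, c in enumerate(s1, start=1):
        pos.setdefault(c, []).append(p)
    num = [[0] * (n1 + 1) for _ in range(n1 + 1)]
    for a in range(1, n1 + 1):
        seen = set()
        for b in range(a, n1 + 1):
            seen.add(s1[b - 1])
            num[a][b] = len(seen)

    out = []
    for i in range(1, n2 + 1):
        left_char = s2[i - 2] if i > 1 else None
        marked = [False] * (n1 + 2)   # padded so p - 1 and p + 1 are always valid
        run_start = [0] * (n1 + 2)    # valid at the right end of a marked run
        run_end = [0] * (n1 + 2)      # valid at the left end of a marked run
        chars = set()
        candidates = []               # left ends of runs containing the newest character
        for j in range(i, n2 + 1):
            c = s2[j - 1]
            if c == left_char:
                break                 # (i, j) and every longer interval is not left-maximal
            if c not in chars:
                chars.add(c)
                candidates = []
                for p in pos.get(c, []):
                    marked[p] = True
                    a = run_start[p - 1] if marked[p - 1] else p
                    b = run_end[p + 1] if marked[p + 1] else p
                    run_start[b], run_end[a] = a, b
                    if not candidates or candidates[-1] != a:
                        candidates.append(a)
            if j == n2 or s2[j] not in chars:   # (i, j) is right-maximal
                for a in candidates:
                    b = run_end[a]
                    if num[a][b] == len(chars):
                        out.append(((i, j), (a, b)))
    return out


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
pairs = connecting_intervals(S1, S2)
print(pairs[:3])
```

Because POS lists are ascending, marking a later position can only extend a run to the right, so the left end recorded for each newly marked position stays valid until the next new character arrives. That is why only the runs in `candidates` need to be checked.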

Multiple Genomes
Goal: Find all common CS-factors of a collection S* = (S1,S2,...,Sk)

Algorithm:
1. Apply Algorithm CI to all pairs (S1,Sl), 2 ≤ l ≤ k
2. Output only the common CS-factors detected in all pairs

Time complexity: O(kn²)
Space complexity: O(kn²) with redundant output, O(n²) otherwise

Further extension: Find all common CS-factors appearing in at least k' of the k strings of S*
Time complexity: O(k(1+k-k')n²)
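As a reference for small inputs, the "at least k' of k" variant can be checked by brute force. The Python sketch below does not use Algorithm CI and does not reach the stated time bound: it collects the character sets of all substrings of each string and counts in how many strings each set occurs. The function names are hypothetical, and S3 is an assumed reading of the garbled example string from the formal model slides.

```python
from collections import Counter
from typing import FrozenSet, Sequence, Set


def substring_char_sets(s: Sequence[int]) -> Set[FrozenSet[int]]:
    """Character sets of all substrings of s."""
    sets = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            sets.add(frozenset(seen))
    return sets


def cs_factors_in_at_least(strings: Sequence[Sequence[int]], k_min: int) -> Set[FrozenSet[int]]:
    """Character sets that have a CS-location in at least k_min of the strings."""
    counts = Counter()
    for s in strings:
        counts.update(substring_char_sets(s))  # each set counts once per string
    return {c for c, hits in counts.items() if hits >= k_min}


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]   # assumption: reconstructed example string
in_all = cs_factors_in_at_least([S1, S2, S3], 3)
in_two = cs_factors_in_at_least([S1, S2, S3], 2)
print(frozenset({1, 3, 5}) in in_all)
print(len(in_all), len(in_two))
```

Setting k_min to the number of strings gives the common CS-factors of the whole collection; lowering it admits clusters that are missing from some genomes.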

Saving Space
Because it stores the table NUM, Algorithm CI requires quadratic space.
An algorithm presented by Didier (WABI 2003) detects all common CS-factors of two sequences in O(n² log n) time and linear space. In a modified version that replaces a binary search by a constant-time Range Maximum Query, the time complexity can be reduced to O(n²) while the space remains linear.


Results on real data


Data set:
- 43 bacterial genome sequences from NCBI
- All classified in the "Clusters of Orthologous Groups of Proteins" database (COG)
- Genes are identified by their COG number
- Computation time: approx. 5-10 minutes on a standard PC


Results on real data (k'= 2)


[Figure: results for all 43 genomes and for the data set without closely related genomes (k = 32), with panels for cluster size 2 and cluster size 3]

Teşekkür ederim! (Thank you!)
