© Attribution Non-Commercial (BY-NC)

CPM 2004, Istanbul

Overview:

- Introduction
- Formal Model
- Algorithms
- Results

Thomas Schmidt: Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences

Observations:

- Gene order in bacterial genomes is weakly conserved
- Some genes tend to cluster together even in unrelated species
- Functional association of genes inside a cluster


Task:


Genomes: permutations π1, π2, ..., πk

Genes: numbers 1, ..., n


Gene cluster: common interval (a subset of numbers occurring contiguously in all permutations)

[Figure: example permutations of the numbers 1, ..., 8]


Algorithms:

- Uno & Yagiura, Algorithmica 2000: Find all common intervals of two permutations in O(n + |output|) time.
- Heber & Stoye, CPM 2001: Find all common intervals of k ≥ 2 permutations in O(kn + |output|) time.


Problem:

- Gene duplication results in multiple copies of a gene inside a genome
- Difficult to assign the correct gene pair


Solution:

- Do not distinguish between paralogous gene copies
- Each paralogous copy of a gene gets the same number

Consequence:

- Genomes are modeled as sequences instead of permutations

[Figure: example genomes S1, S2, S3 written as sequences over the numbers 1, ..., 8]


Overview:

- Introduction
  - Comparative genomics
  - Common Intervals and Gene Clusters
- Formal Model
- Results


Formal Model:

Given: String S over a finite alphabet Σ

Notation:

- S[i] = the i-th character of S
- S[i,j] = substring of S starting at index i and ending at index j

Definition: The character set CS(S[i,j]) := {S[k] | i ≤ k ≤ j} is the set of all characters occurring in the substring S[i,j].

Example:

S: 3 1 2 3 1 5 2 6 (indices 1 to 8)

CS(S[2,5]) = {1,2,3}
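The definition can be checked with a minimal Python sketch. The function name `char_set` and the list representation of S are assumptions made for illustration; indices are 1-based and inclusive, as on the slide.

```python
def char_set(s, i, j):
    """Return CS(S[i,j]) for 1-based, inclusive indices with i <= j."""
    # Python slices are 0-based and end-exclusive, hence s[i - 1:j].
    return set(s[i - 1:j])


S = [3, 1, 2, 3, 1, 5, 2, 6]
print(char_set(S, 2, 5))  # {1, 2, 3}
```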


Formal Model:

Given: Subset C ⊆ Σ

Definition: (i, j) is a CS-location of C in S iff CS(S[i,j]) = C. A CS-location is

- left-maximal if S[i-1] ∉ C
- right-maximal if S[j+1] ∉ C
- maximal if it is both left- and right-maximal

Example:

S: 3 1 2 3 1 5 2 6 (indices 1 to 8)

The pair (3,5) is a CS-location of the set C = {1,2,3} because CS(S[3,5]) = {1,2,3}, but it is not left-maximal, since S[2] = 1 ∈ C.
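A small Python sketch of these definitions follows. All function names are hypothetical, and the sketch assumes that a location touching the start or end of S counts as left- or right-maximal there, which the slide does not state explicitly.

```python
def char_set(s, i, j):
    """CS(S[i,j]) with 1-based, inclusive indices."""
    return set(s[i - 1:j])


def is_cs_location(s, i, j, c):
    return char_set(s, i, j) == set(c)


def is_left_maximal(s, i, c):
    # Assumption: the left end of the string counts as left-maximal.
    return i == 1 or s[i - 2] not in c


def is_right_maximal(s, j, c):
    # Assumption: the right end of the string counts as right-maximal.
    return j == len(s) or s[j] not in c


def maximal_cs_locations(s, c):
    """Brute-force list of all maximal CS-locations of c in s."""
    c = set(c)
    n = len(s)
    return [
        (i, j)
        for i in range(1, n + 1)
        for j in range(i, n + 1)
        if is_cs_location(s, i, j, c)
        and is_left_maximal(s, i, c)
        and is_right_maximal(s, j, c)
    ]


S = [3, 1, 2, 3, 1, 5, 2, 6]
C = {1, 2, 3}
print(is_cs_location(S, 3, 5, C), is_left_maximal(S, 3, C))  # True False
print(maximal_cs_locations(S, C))  # [(1, 5)]
```

Extending (3,5) to the left as long as the neighbouring character stays in C leads to (1,5), the only maximal CS-location of C in this example.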


Formal Model:

Given: Collection of k strings S* = (S1,...,Sk) over alphabet Σ

- S1: 3 2 1 3 1 5 1 6 (indices 1 to 8)
- S2: 4 3 5 5 5 1 4 2 2 (indices 1 to 9)
- S3: 7 5 1 5 3 6 5 (indices 1 to 7)

Common CS-factor: {1,3,5} => S1: (3,7), S2: (2,6), S3: (2,5)


Problem Formulation:

A common CS-factor of k strings represents a gene cluster that occurs in each of the k genomes.

Given a collection of k strings S*:

- Problem 1: Find all common CS-factors in S*.
- Problem 2: For each common CS-factor, find all its maximal CS-locations in each of the strings.
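To make the two problems concrete, here is a brute-force Python sketch. It is not the algorithm of the talk, only a direct reading of the definitions; the function names are hypothetical. The third string is an assumption: it is read from the three-string slide example as 7 5 1 5 3 6 5. String ends are assumed to count as maximal.

```python
from collections import defaultdict


def maximal_locations(s):
    """Map each character set to its maximal CS-locations in s (1-based)."""
    n = len(s)
    locs = defaultdict(list)
    for i in range(1, n + 1):
        seen = set()
        for j in range(i, n + 1):
            seen.add(s[j - 1])
            left_ok = i == 1 or s[i - 2] not in seen
            right_ok = j == n or s[j] not in seen
            if left_ok and right_ok:
                locs[frozenset(seen)].append((i, j))
    return locs


def common_cs_factors(strings):
    """Problem 1 and 2: common sets with their maximal locations per string."""
    tables = [maximal_locations(s) for s in strings]
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)
    return {c: [table[c] for table in tables] for c in common}


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]  # assumed reading of the slide example
result = common_cs_factors([S1, S2, S3])
print(result[frozenset({1, 3, 5})])  # [[(3, 7)], [(2, 6)], [(2, 5)]]
```

Every set that has a CS-location also has a maximal one with the same character set, so intersecting the keys of the per-string tables yields exactly the common CS-factors.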


Overview:

- Introduction
- Formal Model
- Algorithms
  - Simple Data Structure: Quadratic Space
  - Saving Space
- Results


Algorithm CI solves Problem 1 and Problem 2 for two sequences

Input: Two sequences of length up to n with characters drawn from Σ = {1,...,m}, m ≤ 2n

Output: Pairs of CS-locations of all common CS-factors

Time & Space complexity: O(n²)


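Preprocessing

Compute two tables for S1 = (3,1,2,3,1,5,2,6):

- POS[1] = 2,5
- POS[2] = 3,7
- POS[3] = 1,4
- POS[4] = empty
- POS[5] = 6
- POS[6] = 8

NUM(i,j):

| j \ i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | | | | | | | |
| 2 | 2 | 1 | | | | | | |
| 3 | 3 | 2 | 1 | | | | | |
| 4 | 3 | 3 | 2 | 1 | | | | |
| 5 | 3 | 3 | 3 | 2 | 1 | | | |
| 6 | 4 | 4 | 4 | 3 | 2 | 1 | | |
| 7 | 4 | 4 | 4 | 4 | 3 | 2 | 1 | |
| 8 | 5 | 5 | 5 | 5 | 4 | 3 | 2 | 1 |

NUM(i,j) counts the number of different characters in S1[i,j].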
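Both tables can be built in quadratic time. The following Python sketch shows one way to do it; the names `build_pos` and `build_num` and the 1-based list-of-lists layout of NUM are assumptions for illustration.

```python
def build_pos(s, m):
    """POS[c] = sorted 1-based positions of character c in s, for c in 1..m."""
    pos = {c: [] for c in range(1, m + 1)}
    for idx, c in enumerate(s, start=1):
        pos[c].append(idx)
    return pos


def build_num(s):
    """num[i][j] = number of different characters in S[i,j] (1-based, i <= j)."""
    n = len(s)
    num = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        seen = set()
        for j in range(i, n + 1):
            seen.add(s[j - 1])      # extend the substring by one character
            num[i][j] = len(seen)   # O(1) per cell, O(n^2) overall
    return num


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
POS = build_pos(S1, 6)
NUM = build_num(S1)
print(POS[1], POS[4])            # [2, 5] []
print(NUM[2][5], NUM[5][8])      # 3 4
```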


Algorithm CI


Algorithm: While reading S2, mark the observed characters in S1 and track maximal intervals of marked characters.

- S2: 4 3 5 5 5 1 4 2 2 (scanned with the indices i and j)
- S1: 3 1 2 3 1 5 2 6 (positions 1 to 8)


Output (each pair is a CS-location in S2 followed by a CS-location in S1): ((2,2)-(1,1)), ((2,2)-(4,4))


Time Complexity

Algorithm CI finds all common CS-factors of S1 and S2 in O(n²) time.

    for i = 1,...,|S2| do
        j = i
        while j < |S2| and (i,j) is maximal do
            if (c = S2[j]) is seen the first time
                for each entry in POS(c) do
                    mark and track
                end for
            end if
            j = j + 1
        end while
    end for

For a fixed i, each character is seen for the first time at most once, so each entry of POS is visited at most once per iteration of the outer loop.

POS[1] = 2,5; POS[2] = 3,7; POS[3] = 1,4; POS[4] = empty; POS[5] = 6; POS[6] = 8

S2: 4 3 5 5 5 1 4 2 2
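The mark-and-track idea can be sketched in Python as follows. This is a simplified illustration, not the bookkeeping of the talk: it rescans all marks of S1 for every right-maximal (i,j) instead of tracking the intervals incrementally, so it does not meet the stated time bound. The function name `algorithm_ci_sketch` is hypothetical, and string ends are assumed to count as maximal.

```python
def algorithm_ci_sketch(s1, s2):
    """Return pairs ((i, j), (a, b)): maximal CS-locations in s2 and s1 (1-based)."""
    n1, n2 = len(s1), len(s2)
    # Preprocessing: POS and NUM for s1.
    pos = {}
    for idx, c in enumerate(s1, start=1):
        pos.setdefault(c, []).append(idx)
    num = [[0] * (n1 + 2) for _ in range(n1 + 2)]
    for a in range(1, n1 + 1):
        seen = set()
        for b in range(a, n1 + 1):
            seen.add(s1[b - 1])
            num[a][b] = len(seen)

    out = []
    for i in range(1, n2 + 1):
        marked = [False] * (n1 + 2)  # 1-based marks with sentinels at both ends
        chars = set()                # character set of S2[i,j]
        for j in range(i, n2 + 1):
            c = s2[j - 1]
            if i > 1 and c == s2[i - 2]:
                break                # (i, j) can no longer be left-maximal
            if c not in chars:       # c is seen the first time
                chars.add(c)
                for p in pos.get(c, []):
                    marked[p] = True
            if j < n2 and s2[j] in chars:
                continue             # (i, j) is not right-maximal yet
            # Naive tracking: scan the maximal runs of marked positions in s1.
            a = 1
            while a <= n1:
                if not marked[a]:
                    a += 1
                    continue
                b = a
                while marked[b + 1]:
                    b += 1
                if num[a][b] == len(chars):  # the run contains every character
                    out.append(((i, j), (a, b)))
                a = b + 1
    return out


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
pairs = algorithm_ci_sketch(S1, S2)
print(pairs[:2])  # [((2, 2), (1, 1)), ((2, 2), (4, 4))]
```

A maximal run of marked positions contains only characters of the current set, so comparing NUM with the size of that set is enough to decide whether the run is a CS-location of it.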


Multiple Genomes

Goal: Find all common CS-factors of a collection S* = (S1,S2,...,Sk)

Algorithm:

1. Apply Algorithm CI to all pairs (S1,Sl), 2 ≤ l ≤ k
2. Output only the common CS-factors detected in all pairs

Time complexity: O(kn²)

Space complexity: O(kn²) with redundant output, O(n²) otherwise

Further extension: Find all common CS-factors appearing in at least k' of the k strings of S*

Time complexity: O(k(1+k-k')n²)
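The pairwise scheme for all k strings can be sketched in Python as below. The pair step is a brute-force stand-in for Algorithm CI, used here only to keep the example short; all function names are hypothetical, and the third example string is an assumed reading of the slide example.

```python
def cs_factors(s):
    """All character sets of substrings of s (brute force)."""
    factors = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            factors.add(frozenset(seen))
    return factors


def common_pair(s_a, s_b):
    """Stand-in for Algorithm CI: common CS-factors of two strings."""
    return cs_factors(s_a) & cs_factors(s_b)


def common_all(strings):
    """Compare the reference string S1 with every other string, keep what survives."""
    reference = strings[0]
    result = cs_factors(reference)
    for other in strings[1:]:
        result &= common_pair(reference, other)
    return result


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]  # assumed reading of the slide example
common = common_all([S1, S2, S3])
print(frozenset({1, 3, 5}) in common)  # True
print(frozenset({2}) in common)        # False: 2 does not occur in S3
```

Using S1 as the fixed reference means only k-1 pairwise runs are needed, which is where the factor k in the time bound comes from.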


Saving Space

Due to the storage of the table NUM, Algorithm CI requires quadratic space.

An algorithm presented by Didier, WABI 2003, detects all common CS-factors of two sequences in O(n² log n) time and linear space. In a modified version, which replaces a binary search with a constant-time Range Maximum Query, the time complexity drops to O(n²) while the space stays linear.


Overview:

- Introduction
  - Comparative genomics
  - Common Intervals and Gene Clusters
- Formal Model
- Results


Data set:

- 43 bacterial genome sequences from NCBI
- All classified in the "Clusters of Orthologous Groups of Proteins" database (COG)
- Genes are identified by their COG number
- Computation time: approx. 5-10 minutes on a standard PC


[Figure: results for all 43 genomes and for the data set without closely related genomes (k = 32), with panels for cluster size 2 and cluster size 3]

Thank you! (Teşekkür ederim!)
