
Tutorial Note 7

Midterm Exam Review (Again!)


The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics

TA: Zhenghao Zhang


30/10/2018
Agenda
• Suggested Solutions for Assignment 2
• Key Points Wrap-up
• Q&A

CSCI3220 Algorithms for Bioinformatics Tutorial Notes | Fall 2018 2


Assignment 2. Q1. a)
• Given DNA sequence s=GTAACTGTAGTG$, build Suffix Trie.
[Figure: suffix trie of s. The root has children $, A, C, G, T, and each of the 13 suffixes ends at a ‘$’ leaf.]

• Suffix Trie <--> Trie of Suffixes


• Insert every suffix (automatically, or carefully by hand).
• Do not forget ‘$’.
• Order children lexicographically: smaller on the left, larger on the right, with ‘$’ < ‘A’-‘Z’.
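
As a rough illustration of "insert every suffix", here is a minimal Python sketch. It stores the trie as nested dictionaries, which is an assumption made for brevity and not necessarily the representation expected in the assignment; the function names are hypothetical.

```python
def build_suffix_trie(s):
    """Build a suffix trie as nested dicts: each node maps a character to a child node."""
    root = {}
    for start in range(len(s)):          # insert every suffix s[start:]
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Count leaves (nodes without children); there is one per suffix."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


s = "GTAACTGTAGTG$"
trie = build_suffix_trie(s)
# '$' sorts before 'A'-'Z' in ASCII, matching the left-to-right order on the slide.
root_children = sorted(trie)
print(root_children, count_leaves(trie))
```

Sorting the keys of a node gives the left-to-right child order, because ‘$’ precedes the letters in ASCII.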



Assignment 2. Q1. b)
• Compress Suffix Trie to Suffix Tree, with position labelling.
Suffix tree (each edge: label [start-end positions in s]; indentation shows depth):
root
  $ [13-13]
  A [3-3]
    ACTGTAGTG$ [4-13]
    CTGTAGTG$ [5-13]
    GTG$ [10-13]
  CTGTAGTG$ [5-13]
  G [1-1]
    $ [13-13]
    T [2-2]
      A [3-3]
        ACTGTAGTG$ [4-13]
        GTG$ [10-13]
      G$ [12-13]
  T [2-2]
    A [3-3]
      ACTGTAGTG$ [4-13]
      GTG$ [10-13]
    G [1-1]
      $ [13-13]
      TAGTG$ [8-13]


• Compress “caterpillars” (both “dangling” ones and those “inside” the tree).
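
The compression step can be sketched in Python as follows. This is only an illustration under assumptions: the trie is stored as nested dictionaries, and each edge is labelled with the first occurrence of its string in s (any occurrence is valid). The function names are hypothetical.

```python
def build_suffix_trie(s):
    """Suffix trie as nested dicts (character -> child node)."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root


def compress(node, s):
    """Merge single-child chains; keys are (label, start, end) with 1-based positions in s."""
    tree = {}
    for ch in sorted(node):
        label, child = ch, node[ch]
        while len(child) == 1:               # follow the "caterpillar"
            (nxt, grandchild), = child.items()
            label += nxt
            child = grandchild
        start = s.find(label) + 1            # assumption: use the first occurrence
        tree[(label, start, start + len(label) - 1)] = compress(child, s)
    return tree


s = "GTAACTGTAGTG$"
suffix_tree = compress(build_suffix_trie(s), s)
root_edges = list(suffix_tree)
print(root_edges)
```

Building the full trie first takes quadratic space, so this is suitable only for small examples such as the one in the assignment.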



Assignment 2. Q1. c)
• Perform DFS on Suffix Tree.
Node labels (parent: children, left to right):
A: B C G H O
C: D E F
H: I J
J: K N
K: L M
O: P S
P: Q R
S: T U


• ABACDCECFCAGAHIHJKLKMKJNJHAOPQPRPOSTSUSOA
• On reaching a leaf, output the starting position of the corresponding suffix (track the current string depth, which equals the suffix length at a leaf; starting position = 13 - length + 1).
• SA = [13,3,4,9,5,12,1,7,10,2,8,11,6]
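
The depth-tracking idea can be sketched in Python. For brevity this sketch runs the DFS on the uncompressed suffix trie (nested dictionaries) instead of the suffix tree, which is an assumption; the leaves are visited in the same order either way. The function name is hypothetical.

```python
def suffix_array_by_dfs(s):
    """DFS over the suffix trie in lexicographic order; returns 1-based suffix starts."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})

    sa = []
    stack = [(root, 0)]                          # (node, depth in characters)
    while stack:
        node, depth = stack.pop()
        if not node:                             # leaf: depth equals the suffix length
            sa.append(len(s) - depth + 1)
        for ch in sorted(node, reverse=True):    # push largest first, so smallest pops first
            stack.append((node[ch], depth + 1))
    return sa


sa = suffix_array_by_dfs("GTAACTGTAGTG$")
print(sa)
```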



Assignment 2. Q1. d)
• Obtain same Suffix Array using an alternative way.
• Recall the naïve way: Sort all Suffixes.
Suffix          Position | Sorted suffix   Position
GTAACTGTAGTG$   1        | $               13
TAACTGTAGTG$    2        | AACTGTAGTG$     3
AACTGTAGTG$     3        | ACTGTAGTG$      4
ACTGTAGTG$      4        | AGTG$           9
CTGTAGTG$       5        | CTGTAGTG$       5
TGTAGTG$        6        | G$              12
GTAGTG$         7        | GTAACTGTAGTG$   1
TAGTG$          8        | GTAGTG$         7
AGTG$           9        | GTG$            10
GTG$            10       | TAACTGTAGTG$    2
TG$             11       | TAGTG$          8
G$              12       | TG$             11
$               13       | TGTAGTG$        6
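
The naive approach fits in a couple of lines of Python. This is a sketch with a hypothetical function name; it relies on ‘$’ sorting before the letters in ASCII and copies every suffix, so it is meant for small inputs only.

```python
def suffix_array_naive(s):
    """Sort the 1-based start positions by the suffix that begins there."""
    return sorted(range(1, len(s) + 1), key=lambda p: s[p - 1:])


sa = suffix_array_naive("GTAACTGTAGTG$")
print(sa)
```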



Assignment 2. Q1. e)
• Build BWT from Suffix Array (SA → BWT).
• BWT[i] = s[SA[i] - 1], wrapping around so that s[0] = s[n] (1-based positions; here n = 13).
 i  Sorted rotation   SA[i]  ?   BWT[i] = s[?]
 1  $GTAACTGTAGTG     13     12  G
 2  AACTGTAGTG$GT     3      2   T
 3  ACTGTAGTG$GTA     4      3   A
 4  AGTG$GTAACTGT     9      8   T
 5  CTGTAGTG$GTAA     5      4   A
 6  G$GTAACTGTAGT     12     11  T
 7  GTAACTGTAGTG$     1      13  $
 8  GTAGTG$GTAACT     7      6   T
 9  GTG$GTAACTGTA     10     9   A
10  TAACTGTAGTG$G     2      1   G
11  TAGTG$GTAACTG     8      7   G
12  TG$GTAACTGTAG     11     10  G
13  TGTAGTG$GTAAC     6      5   C
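
A minimal Python sketch of the SA-to-BWT step, assuming a 1-based suffix array as on the slides and a 0-based Python string. The function name is hypothetical.

```python
def bwt_from_sa(s, sa):
    """BWT[i] is the character just before suffix SA[i], wrapping around at the start."""
    # For 1-based position p, the preceding character is s[p - 2] in 0-based indexing.
    # When p == 1 this is s[-1], which Python resolves to the last character ('$').
    return "".join(s[p - 2] for p in sa)


s = "GTAACTGTAGTG$"
sa = [13, 3, 4, 9, 5, 12, 1, 7, 10, 2, 8, 11, 6]
bwt = bwt_from_sa(s, sa)
print(bwt)
```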



Assignment 2. Q1. f)
• Build BWT directly.
• The original “Rotation” way (table from Wikipedia):
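
The rotation method can be sketched in Python as below (hypothetical function name). It materialises all rotations, so it needs quadratic space and is intended for short strings like the assignment input.

```python
def bwt_by_rotations(s):
    """Sort all rotations of s and read off the last column."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


bwt = bwt_by_rotations("GTAACTGTAGTG$")
print(bwt)
```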



Assignment 2. Q1. g)
• FM Index (Single pattern matching using BWT).
• Pattern: q=“GTA”, BWT: b=“GTATAT$TAGGGC”
• Step 1: Build the “O Table” (cumulative count of each character in the BWT):
i      1  2  3  4  5  6  7  8  9 10 11 12 13
BWT    G  T  A  T  A  T  $  T  A  G  G  G  C
x = A  0  0  1  1  2  2  2  2  3  3  3  3  3
x = C  0  0  0  0  0  0  0  0  0  0  0  0  1
x = G  1  1  1  1  1  1  1  1  1  2  3  4  4
x = T  0  1  1  2  2  3  3  4  4  4  4  4  4
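
A short Python sketch of building the O table. The layout, a dictionary of lists with an extra leading zero so that index i matches the 1-based column i, is an assumption for illustration; the function name is hypothetical.

```python
def build_o_table(bwt, alphabet="ACGT"):
    """table[x][i] = number of occurrences of x in bwt[1..i] (1-based); table[x][0] = 0."""
    table = {x: [0] for x in alphabet}
    for ch in bwt:
        for x in alphabet:
            table[x].append(table[x][-1] + (1 if ch == x else 0))
    return table


o_table = build_o_table("GTATAT$TAGGGC")
print(o_table["G"][1:])
```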



Assignment 2. Q1. g)
• Pattern: q=“GTA”, BWT: b=“GTATAT$TAGGGC”
• Step 2: Build the “F Table” (starting row of each character in the first column of the BWT rotation matrix):
x     A  C  G  T
F(x)  2  5  6  10

Sorted rotation matrix:
 1 $GTAACTGTAGTG
 2 AACTGTAGTG$GT
 3 ACTGTAGTG$GTA
 4 AGTG$GTAACTGT
 5 CTGTAGTG$GTAA
 6 G$GTAACTGTAGT
 7 GTAACTGTAGTG$
 8 GTAGTG$GTAACT
 9 GTG$GTAACTGTA
10 TAACTGTAGTG$G
11 TAGTG$GTAACTG
12 TG$GTAACTGTAG
13 TGTAGTG$GTAAC

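
The F table of Step 2 can be obtained without the rotation matrix, because the first column is just the BWT characters in sorted order. A minimal Python sketch with a hypothetical function name:

```python
def build_f_table(bwt, alphabet="$ACGT"):
    """F[x] = 1-based index of the first row whose rotation starts with x."""
    first_column = sorted(bwt)           # first column of the sorted rotation matrix
    return {x: first_column.index(x) + 1 for x in alphabet if x in first_column}


f_table = build_f_table("GTATAT$TAGGGC")
print(f_table)
```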


Assignment 2. Q1. g)
• Pattern: q=“GTA”, BWT: b=“GTATAT$TAGGGC”
• Step 3: Match backward, from the last character of q to the first, using “O” and “F” to narrow the range of matching rows:
 1 $GTAACTGTAGTG
 2 AACTGTAGTG$GT
 3 ACTGTAGTG$GTA
 4 AGTG$GTAACTGT
 5 CTGTAGTG$GTAA
 6 G$GTAACTGTAGT
 7 GTAACTGTAGTG$
 8 GTAGTG$GTAACT
 9 GTG$GTAACTGTA
10 TAACTGTAGTG$G
11 TAGTG$GTAACTG
12 TG$GTAACTGTAG
13 TGTAGTG$GTAAC
Rows matched after each step:
“A”: rows 2-4
“TA”: rows 10-11
“GTA”: rows 7-8
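
Backward matching can be sketched in Python as follows. To keep the example short, O(x, i) is counted directly from the BWT with `str.count` instead of being read from a precomputed table; that shortcut and the function name are assumptions of this sketch.

```python
def backward_search(bwt, q):
    """Return the 1-based inclusive row range (lo, hi) of rotations starting with q, or None."""
    first_column = sorted(bwt)
    lo, hi = 1, len(bwt)                          # start with every row
    for ch in reversed(q):                        # last character of q first
        if ch not in first_column:
            return None
        f = first_column.index(ch) + 1            # F(ch)
        lo = f + bwt[:lo - 1].count(ch)           # F(ch) + O(ch, lo - 1)
        hi = f + bwt[:hi].count(ch) - 1           # F(ch) + O(ch, hi) - 1
        if lo > hi:
            return None
    return lo, hi


rows = backward_search("GTATAT$TAGGGC", "GTA")
print(rows)
```

For q = “GTA” the range narrows from rows 2-4 (“A”) to rows 10-11 (“TA”) and finally to rows 7-8 (“GTA”).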



Assignment 2. Q1. g)
• Pattern: q=“GTA”, BWT: b=“GTATAT$TAGGGC”
• Step 4: Get the positions where “GTA” occurs by looking up the matched rows in the Suffix Array.
 i  Sorted rotation   SA[i]  ?  (BWT[i] = s[?])
 1  $GTAACTGTAGTG     13     12
 2  AACTGTAGTG$GT     3      2
 3  ACTGTAGTG$GTA     4      3
 4  AGTG$GTAACTGT     9      8
 5  CTGTAGTG$GTAA     5      4
 6  G$GTAACTGTAGT     12     11
 7  GTAACTGTAGTG$     1      13
 8  GTAGTG$GTAACT     7      6
 9  GTG$GTAACTGTA     10     9
10  TAACTGTAGTG$G     2      1
11  TAGTG$GTAACTG     8      7
12  TG$GTAACTGTAG     11     10
13  TGTAGTG$GTAAC     6      5
• Rows 7-8 match q=“GTA”: SA[7] = 1 and SA[8] = 7, so “GTA” occurs at positions 1 and 7 of s.

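
The lookup of Step 4 is a simple slice of the suffix array. A minimal Python sketch, assuming a 1-based inclusive row range such as the rows 7-8 matched for “GTA”; the function name is hypothetical.

```python
def locate(sa, row_range):
    """Map a 1-based inclusive row range to sorted 1-based text positions."""
    lo, hi = row_range
    return sorted(sa[i - 1] for i in range(lo, hi + 1))


s = "GTAACTGTAGTG$"
sa = [13, 3, 4, 9, 5, 12, 1, 7, 10, 2, 8, 11, 6]
positions = locate(sa, (7, 8))
print(positions)
```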


Assignment 2. Q1. h)
• Suppose the given DNA sequence is randomly generated. If we append ‘$’ and perform BWT on it, will ‘$’ have equal probability of appearing in each of the n+1 positions of the BWT output?
• Consider the BWT rotation matrix:

 1 $GTAACTGTAGTG
 2 AACTGTAGTG$GT
 3 ACTGTAGTG$GTA
 4 AGTG$GTAACTGT
 5 CTGTAGTG$GTAA
 6 G$GTAACTGTAGT
 7 GTAACTGTAGTG$
 8 GTAGTG$GTAACT
 9 GTG$GTAACTGTA
10 TAACTGTAGTG$G
11 TAGTG$GTAACTG
12 TG$GTAACTGTAG
13 TGTAGTG$GTAAC

• Observation: ‘$’ always lies in the top-left corner of the BWT rotation matrix, because the rotation starting with ‘$’ sorts first.
• Thus ‘$’ cannot be BWT[1] (unless the DNA sequence is empty), so the probabilities are not equal.

Assignment 2. Q1. h)
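• Suppose the given DNA sequence is randomly generated. If we append ‘$’ and perform BWT on it, will ‘$’ have equal probability of appearing in each of the n+1 positions of the BWT output?
• Or, consider how you calculate the probabilities: assuming each of the 4^n DNA sequences of length n is equally likely, every probability is a multiple of 1/4^n.
• If 4^n cannot be divided by n+1, then it is impossible to have equal probability 1/(n+1).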
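
Both arguments can be checked by brute force for small n. The Python sketch below enumerates every DNA string of length n, assuming each is equally likely, and counts where ‘$’ lands in the BWT; the function name is hypothetical.

```python
from collections import Counter
from itertools import product


def dollar_position_counts(n):
    """For all 4**n DNA strings of length n, count the 1-based BWT index of '$'."""
    counts = Counter()
    for letters in product("ACGT", repeat=n):
        s = "".join(letters) + "$"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        bwt = "".join(rotation[-1] for rotation in rotations)
        counts[bwt.index("$") + 1] += 1
    return counts


counts = dollar_position_counts(2)
print(sorted(counts.items()))
```

Position 1 never receives ‘$’ for any n ≥ 1, and for n = 2 the 16 sequences cannot be split evenly over 3 positions in any case.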



Assignment 2. Q2
• Implement an Eulerian path finder.
• Key points:
– How to determine Eulerian path existence.
– How to determine starting point(s).
– How to build, store and use Graph structure.
– How to get Eulerian path (one path and all paths).
• Minor Issues:
– Python and Java users should handle the end of input (e.g. EOFError in Python, EOFException in Java).
– Allocate sufficient array space (estimate upper bounds on the numbers of nodes and edges), or use dynamic arrays (at some cost in performance), to avoid runtime errors.
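
One possible way to obtain a single Eulerian path in a directed graph is Hierholzer's algorithm, sketched below in Python. This is not the assignment's reference solution: the edge-list input format and the function name are assumptions, and enumerating all paths is not covered.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return one Eulerian path (list of vertices) of a directed multigraph, or None."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)                   # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1

    starts = [v for v, b in balance.items() if b == 1]
    ends = [v for v, b in balance.items() if b == -1]
    if any(abs(b) > 1 for b in balance.values()) or len(starts) > 1 or len(starts) != len(ends):
        return None
    if not edges:
        return []

    start = starts[0] if starts else edges[0][0]
    stack, path = [start], []
    while stack:                                 # iterative Hierholzer
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())     # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: vertex is final
    path.reverse()
    # If some edge was never used, the edges are not all connected.
    return path if len(path) == len(edges) + 1 else None


edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
path = eulerian_path(edges)
print(path)
```

Using Python lists and dictionaries corresponds to the "dynamic arrays" option mentioned above.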



Recap: Eulerian Path
• A path visiting each edge exactly once
• How to determine the existence of an Eulerian path:
– Undirected Graph:
• All vertices are connected
• Scheme 1: Every vertex has even degree
→ Eulerian circuit exists; every vertex is a possible starting point
• Scheme 2: Exactly two vertices have odd degree, and every remaining vertex has even degree
→ Eulerian path exists; the two odd-degree vertices are the possible starting points
– Directed Graph:
• Ignoring edge direction, all vertices are connected
• Scheme 1: For every vertex, in-degree == out-degree
→ Eulerian circuit exists; every vertex is a possible starting point
• Scheme 2: Exactly one vertex s has in-degree + 1 == out-degree, exactly one vertex t has out-degree + 1 == in-degree, and for every remaining vertex, in-degree == out-degree
→ Eulerian path exists; it starts at s and ends at t
• Note:
– A circuit is a special case of a path
– Zero-degree vertices have no impact on Eulerian path existence
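
The undirected conditions above can be sketched as a Python check. The edge-list input, the returned strings, and the function name are assumptions for illustration; vertices that appear in no edge are ignored, in line with the note on zero-degree vertices.

```python
from collections import defaultdict


def undirected_euler_status(edges):
    """Return 'circuit', 'path', or 'none' for an undirected multigraph given as an edge list."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    if not adj:
        return "circuit"                         # convention for a graph with no edges

    # Connectivity check over the vertices that have at least one edge.
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    if len(seen) != len(adj):
        return "none"

    odd = sum(len(neighbours) % 2 for neighbours in adj.values())
    return {0: "circuit", 2: "path"}.get(odd, "none")


status = undirected_euler_status([("a", "b"), ("b", "c"), ("c", "a")])
print(status)
```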
CSCI3220 Algorithms for Bioinformatics Tutorial Notes | Prof. Kevin YIP, Mr. Chenyang Hong, Mr. Zhenghao Zhang| Fall 2018 16
