Tutorial Note 7
Midterm Exam Review (Again!)

The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics
TA: Zhenghao Zhang

30/10/2018
Agenda
• Suggested Solutions for Assignment 2
• Key Points Wrap-up
• Q&A
CSCI3220 Algorithms for Bioinformatics Tutorial Notes | Fall 2018 2

Assignment 2. Q1. a)
• Given DNA sequence s=GTAACTGTAGTG$, build Suffix Trie.
[Figure: suffix trie of s; the root has children $, A, C, G, T]
• Suffix Trie <--> Trie of Suffixes

• Insert every suffix (automatically, or carefully by hand).
• Do not forget ‘$’.
• Left: lexicographically smaller; right: lexicographically larger; ‘$’ < ‘A’-‘Z’.
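The insertion rule above can be sketched in a few lines of Python. This is only an illustration, not the assignment's reference solution: the nested-dict representation and the names `build_suffix_trie` and `count_leaves` are assumptions made for the example.

```python
def build_suffix_trie(s):
    """Insert every suffix of s (which should end with '$') into a nested-dict trie."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            # Reuse the child for ch if it exists, otherwise create it.
            node = node.setdefault(ch, {})
    return root

def count_leaves(node):
    """A leaf is a node without children; each leaf ends one suffix."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())

trie = build_suffix_trie("GTAACTGTAGTG$")
# Sorting the keys gives the left-to-right child order ('$' sorts before letters in ASCII).
print(sorted(trie))
print(count_leaves(trie))
```

Because every suffix ends with the unique ‘$’, no suffix is a prefix of another, so the number of leaves equals the number of suffixes.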

Assignment 2. Q1. b)
• Compress Suffix Trie to Suffix Tree, with position labelling.
[Figure: suffix tree of s; each edge is labelled start-end, a position range in s]
root
  $ 13-13
  A 3-3
    ACTGTAGTG$ 4-13
    CTGTAGTG$ 5-13
    GTG$ 10-13
  CTGTAGTG$ 5-13
  G 1-1
    $ 13-13
    T 2-2
      A 3-3
        ACTGTAGTG$ 4-13
        GTG$ 10-13
      G$ 12-13
  T 2-2
    A 3-3
      ACTGTAGTG$ 4-13
      GTG$ 10-13
    G 1-1
      $ 13-13
      TAGTG$ 8-13
• Compress “caterpillars” (both “dangling” ones and those “inside” the tree).
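A rough sketch of the compression step, assuming the trie is stored as nested dicts: every chain of single-child nodes is merged into one edge, and each edge string is then labelled with the position range of its first occurrence in s. The helper names `compress` and `span` are hypothetical, and picking the first occurrence is an assumption (any occurrence of the edge string would be a valid label).

```python
def build_suffix_trie(s):
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root

def compress(node):
    """Merge each chain of single-child nodes into one edge keyed by its string."""
    tree = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:
            # Unary node: absorb its only outgoing character into the edge label.
            (nxt, grandchild), = child.items()
            label += nxt
            child = grandchild
        tree[label] = compress(child)
    return tree

def span(label, s):
    """1-based (start, end) of the first occurrence of label in s."""
    start = s.find(label) + 1
    return start, start + len(label) - 1

s = "GTAACTGTAGTG$"
tree = compress(build_suffix_trie(s))
for label in sorted(tree):
    print(label, span(label, s))
```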

Assignment 2. Q1. c)
• Perform DFS on Suffix Tree.
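[Figure: the suffix tree with nodes labelled A-U]
A (root): children B, C, G, H, O
C: children D, E, F
H: children I, J
J: children K, N
K: children L, M
O: children P, S
P: children Q, R
S: children T, U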
• ABACDCECFCAGAHIHJKLKMKJNJHAOPQPRPOSTSUSOA
• On reaching a leaf, output the starting position of the corresponding
suffix (track the current “depth” as the suffix length; the starting position is n - length + 1).
• SA = [13,3,4,9,5,12,1,7,10,2,8,11,6]
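The traversal can be sketched as follows. For brevity this version walks the uncompressed suffix trie rather than the suffix tree (the leaves are reached in the same order), so the depth of a leaf equals the suffix length. The function names are illustrative assumptions.

```python
def build_suffix_trie(s):
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root

def suffix_array_by_dfs(s):
    """DFS in lexicographic child order; each leaf yields one suffix start (1-based)."""
    n = len(s)
    sa = []

    def dfs(node, depth):
        if not node:
            # Leaf: the path from the root spells a suffix of length `depth`.
            sa.append(n - depth + 1)
            return
        for ch in sorted(node):
            dfs(node[ch], depth + 1)

    dfs(build_suffix_trie(s), 0)
    return sa

print(suffix_array_by_dfs("GTAACTGTAGTG$"))
```

The recursion depth equals the length of s, so this form only suits short examples like the one in the assignment.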

Assignment 2. Q1. d)
• Obtain same Suffix Array using an alternative way.
• Recall the naïve way: Sort all Suffixes.
Suffix          Position   Sorted Suffix   Position
GTAACTGTAGTG$   1          $               13
TAACTGTAGTG$    2          AACTGTAGTG$     3
AACTGTAGTG$     3          ACTGTAGTG$      4
ACTGTAGTG$      4          AGTG$           9
CTGTAGTG$       5          CTGTAGTG$       5
TGTAGTG$        6          G$              12
GTAGTG$         7          GTAACTGTAGTG$   1
TAGTG$          8          GTAGTG$         7
AGTG$           9          GTG$            10
GTG$            10         TAACTGTAGTG$    2
TG$             11         TAGTG$          8
G$              12         TG$             11
$               13         TGTAGTG$        6
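The naïve method in the table is a one-liner in Python. This is a minimal sketch; `suffix_array_naive` is a name chosen for the example, and it relies on ‘$’ sorting before ‘A’-‘Z’ in ASCII.

```python
def suffix_array_naive(s):
    """Sort the 1-based suffix start positions by the suffixes they begin."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

sa = suffix_array_naive("GTAACTGTAGTG$")
print(sa)
```

Sorting materialises every suffix for comparison, which is fine for a 13-character example but costly for long sequences.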

Assignment 2. Q1. e)
• Build BWT from Suffix Array (SA → BWT).
i    Sorted Suffix in BWT Rotations   SA[i]   BWT[i] = s[?]
1    $GTAACTGTAGTG                    13      12
2    AACTGTAGTG$GT                    3       2
3    ACTGTAGTG$GTA                    4       3
4    AGTG$GTAACTGT                    9       8
5    CTGTAGTG$GTAA                    5       4
6    G$GTAACTGTAGT                    12      11
7    GTAACTGTAGTG$                    1       13
8    GTAGTG$GTAACT                    7       6
9    GTG$GTAACTGTA                    10      9
10   TAACTGTAGTG$G                    2       1
11   TAGTG$GTAACTG                    8       7
12   TG$GTAACTGTAG                    11      10
13   TGTAGTG$GTAAC                    6       5
• BWT[i] = s[SA[i] - 1] (with 1-based positions, s[0] wraps around to s[n])
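A small sketch of the SA → BWT step, assuming a 1-based suffix array as in the table. The function name `bwt_from_sa` is an assumption; the wrap-around for SA[i] = 1 is handled by Python's negative indexing.

```python
def bwt_from_sa(s, sa):
    """BWT[i] is the character just before suffix SA[i], wrapping to the last character.

    sa holds 1-based positions, so the preceding character sits at 0-based index
    sa_i - 2; for sa_i = 1 that index is -1, which Python reads as the last character.
    """
    return "".join(s[i - 2] for i in sa)

s = "GTAACTGTAGTG$"
sa = [13, 3, 4, 9, 5, 12, 1, 7, 10, 2, 8, 11, 6]
print(bwt_from_sa(s, sa))
```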

Assignment 2. Q1. f)
• Build BWT directly.
• The original “Rotation” way (table from Wikipedia):
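As an illustration of the rotation method, the sketch below lists all rotations, sorts them, and reads off the last column. The name `bwt_by_rotations` is an assumption for the example.

```python
def bwt_by_rotations(s):
    """Sort all cyclic rotations of s and concatenate their last characters."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt_by_rotations("GTAACTGTAGTG$"))
```

This builds n strings of length n, so it is meant for checking small examples by hand rather than for long sequences.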

Assignment 2. Q1. g)
• FM Index (Single pattern matching using BWT).
• Pattern: q=“GTA”, BWT: b=“GTATAT$TAGGGC”
• Step 1: Build “O Table” (Cumulative count of each
character in BWT):
i      1  2  3  4  5  6  7  8  9  10 11 12 13
BWT    G  T  A  T  A  T  $  T  A  G  G  G  C
x=A    0  0  1  1  2  2  2  2  3  3  3  3  3
x=C    0  0  0  0  0  0  0  0  0  0  0  0  1
x=G    1  1  1  1  1  1  1  1  1  2  3  4  4
x=T    0  1  1  2  2  3  3  4  4  4  4  4  4
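The O table can be filled in one left-to-right pass. In this sketch the function name `build_o_table` is hypothetical, and an extra column 0 (all zeros) is added as an assumption because it simplifies the matching step later.

```python
def build_o_table(bwt, alphabet="ACGT"):
    """table[x][i] = number of occurrences of x in bwt[1..i] (1-based); table[x][0] = 0."""
    table = {x: [0] * (len(bwt) + 1) for x in alphabet}
    for i, ch in enumerate(bwt, start=1):
        for x in alphabet:
            # Carry the previous count forward, adding 1 when position i holds x.
            table[x][i] = table[x][i - 1] + (1 if ch == x else 0)
    return table

o_table = build_o_table("GTATAT$TAGGGC")
for x in "ACGT":
    print(x, o_table[x][1:])
```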

• Step 2: Build “F Table” (Starting position of each character
in the first column of BWT rotation matrix):
x      A  C  G  T
F(x)   2  5  6  10

1  $GTAACTGTAGTG
2  AACTGTAGTG$GT
3  ACTGTAGTG$GTA
4  AGTG$GTAACTGT
5  CTGTAGTG$GTAA
6  G$GTAACTGTAGT
7  GTAACTGTAGTG$
8  GTAGTG$GTAACT
9  GTG$GTAACTGTA
10 TAACTGTAGTG$G
11 TAGTG$GTAACTG
12 TG$GTAACTGTAG
13 TGTAGTG$GTAAC
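The F table only needs character counts: the first column of the rotation matrix is the BWT characters in sorted order, so each character's starting row is one plus the number of smaller characters. A minimal sketch, with `build_f_table` as an assumed name:

```python
from collections import Counter

def build_f_table(bwt):
    """Map each character to the 1-based row where it first appears in the first column."""
    counts = Counter(bwt)
    f_table, row = {}, 1
    for x in sorted(counts):      # '$' sorts before the letters in ASCII
        f_table[x] = row
        row += counts[x]
    return f_table

f_table = build_f_table("GTATAT$TAGGGC")
print(f_table)
```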


• Step 3: Matching Backwardly. (Use “O” and “F” to
implement the following matching process)
q=“GTA” is matched from its last character to its first; each step narrows a range of rows in the sorted rotations:
– ‘A’: rows 2-4 (AACTG…, ACTGT…, AGTG$…)
– ‘TA’: rows 10-11 (TAACT…, TAGTG…)
– ‘GTA’: rows 7-8 (GTAAC…, GTAGT…)
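One common way to implement the backward matching with “O” and “F” is sketched below. The update rule lo = F(x) + O(x, lo-1), hi = F(x) + O(x, hi) - 1 is the standard FM-index formulation and is assumed here to be what the slide intends; the function name and the choice to return None on a mismatch are illustrative.

```python
from collections import Counter

def backward_search(bwt, q):
    """Return the 1-based row range (lo, hi) whose rotations start with q, or None."""
    counts = Counter(bwt)
    # F table: first row of each character in the sorted first column.
    f_table, row = {}, 1
    for x in sorted(counts):
        f_table[x] = row
        row += counts[x]
    # O table with a leading 0 column: o_table[x][i] counts x in bwt[1..i].
    o_table = {x: [0] for x in counts}
    for ch in bwt:
        for x in counts:
            o_table[x].append(o_table[x][-1] + (1 if ch == x else 0))

    lo, hi = 1, len(bwt)
    for x in reversed(q):          # process the pattern right to left
        if x not in f_table:
            return None
        lo = f_table[x] + o_table[x][lo - 1]
        hi = f_table[x] + o_table[x][hi] - 1
        if lo > hi:
            return None            # the range became empty: no occurrence
    return lo, hi

print(backward_search("GTATAT$TAGGGC", "GTA"))
```

Each pattern character costs two table lookups, so the search time depends on the pattern length rather than on the length of the sequence.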

• Step 4: Get the positions where “GTA” occurs by looking up the matched rows in the Suffix Array.
i    Sorted Suffix in BWT Rotations   SA[i]   BWT[i] = s[?]
1    $GTAACTGTAGTG                    13      12
2    AACTGTAGTG$GT                    3       2
3    ACTGTAGTG$GTA                    4       3
4    AGTG$GTAACTGT                    9       8
5    CTGTAGTG$GTAA                    5       4
6    G$GTAACTGTAGT                    12      11
7    GTAACTGTAGTG$                    1       13    <-- matches q
8    GTAGTG$GTAACT                    7       6     <-- matches q
9    GTG$GTAACTGTA                    10      9
10   TAACTGTAGTG$G                    2       1
11   TAGTG$GTAACTG                    8       7
12   TG$GTAACTGTAG                    11      10
13   TGTAGTG$GTAAC                    6       5
q=“GTA” occurs at positions SA[7] = 1 and SA[8] = 7.
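The lookup itself is a slice of the suffix array. The sketch below assumes the matching step produced the row range 7-8 and uses a hypothetical helper `occurrences`; it also checks each reported position against s.

```python
def occurrences(sa, lo, hi):
    """Text positions for rows lo..hi (1-based, inclusive), in increasing order."""
    return sorted(sa[lo - 1:hi])

s = "GTAACTGTAGTG$"
sa = [13, 3, 4, 9, 5, 12, 1, 7, 10, 2, 8, 11, 6]
hits = occurrences(sa, 7, 8)
print(hits)
# Sanity check: the pattern really starts at each reported 1-based position.
print(all(s[p - 1:p + 2] == "GTA" for p in hits))
```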

Assignment 2. Q1. h)
• Suppose the given DNA sequence is randomly generated. If we
append ‘$’ and perform BWT on it, will ‘$’ have equal
probability of appearing in each of the n+1 positions of the BWT
output?
• Consider the appearance of the BWT rotation matrix:
1  $GTAACTGTAGTG
2  AACTGTAGTG$GT
3  ACTGTAGTG$GTA
4  AGTG$GTAACTGT
5  CTGTAGTG$GTAA
6  G$GTAACTGTAGT
7  GTAACTGTAGTG$
8  GTAGTG$GTAACT
9  GTG$GTAACTGTA
10 TAACTGTAGTG$G
11 TAGTG$GTAACTG
12 TG$GTAACTGTAG
13 TGTAGTG$GTAAC
• Observation: ‘$’ always lies in the top-left corner of the BWT rotation matrix, so row 1 ends with the last character of the DNA sequence.
• Thus, ‘$’ cannot be BWT[1] (unless the DNA sequence is empty).
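The observation can be checked by brute force for small n. The sketch below enumerates every DNA sequence of length n, computes its BWT by sorting rotations, and counts where ‘$’ lands; the function names and the choice n = 3 are assumptions for the example.

```python
from collections import Counter
from itertools import product

def bwt(s):
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def dollar_position_counts(n):
    """For all 4**n DNA sequences of length n, count the 1-based BWT position of '$'."""
    counts = Counter()
    for letters in product("ACGT", repeat=n):
        transformed = bwt("".join(letters) + "$")
        counts[transformed.index("$") + 1] += 1
    return counts

counts = dollar_position_counts(3)
for position in range(1, 5):
    print(position, counts[position])
```

Position 1 never receives ‘$’, so the distribution over the n+1 positions cannot be uniform.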

Assignment 2. Q1. h)
• Or, consider how you calculate probabilities:
• If … cannot be divided by …, then it is impossible to have equal probabilities.

Assignment 2. Q2
• Implement an Eulerian path finder.
• Key points:
– How to determine Eulerian path existence.
– How to determine starting point(s).
– How to build, store and use Graph structure.
– How to get Eulerian path (one path and all paths).
• Minor Issues:
– Python and Java users should handle end of input explicitly (e.g. EOFError in Python, EOFException in Java).
– Allocate sufficient array space (estimate the upper bound of the
number of nodes and edges), or use dynamic arrays (at some cost in
performance) to avoid Runtime Error.
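For the “all paths” key point, one simple (exponential-time) option is backtracking over unused edges. The sketch below is not the assignment's reference solution: it assumes a directed graph given as a list of (u, v) pairs, the name `all_eulerian_paths` is hypothetical, and parallel edges would produce duplicate vertex sequences.

```python
def all_eulerian_paths(edges):
    """Enumerate every Eulerian path of a small directed graph as a vertex list."""
    used = [False] * len(edges)
    results = []

    def extend(path):
        if len(path) == len(edges) + 1:
            results.append(list(path))   # every edge has been used exactly once
            return
        for i, (u, v) in enumerate(edges):
            if not used[i] and u == path[-1]:
                used[i] = True
                path.append(v)
                extend(path)
                path.pop()               # undo the choice and try the next edge
                used[i] = False

    for start in sorted({u for u, _ in edges}):
        extend([start])
    return results

print(all_eulerian_paths([("A", "B"), ("B", "A"), ("A", "C")]))
print(all_eulerian_paths([("A", "B"), ("B", "A")]))
```

Trying every start vertex is wasteful; the degree conditions in the recap narrow the candidates to the valid starting points.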

Recap: Eulerian Path
• A path visiting each edge exactly once
• How to determine the existence of an Eulerian path:
– Undirected Graph:
• All vertices with nonzero degree are connected
• Scheme 1: Every vertex has even degree
=> Eulerian circuit exists; every vertex with nonzero degree is a possible starting point
• Scheme 2: Exactly two vertices have odd degree,
and every remaining vertex has even degree
=> Eulerian path exists; the two odd-degree vertices are the possible starting points
– Directed Graph:
• Ignoring edge direction, all vertices with nonzero degree are connected
• Scheme 1: For every vertex, in-degree == out-degree
=> Eulerian circuit exists; every vertex with nonzero degree is a possible starting point
• Scheme 2: Exactly one vertex s has in-degree+1 == out-degree,
exactly one vertex t has out-degree+1 == in-degree,
and every remaining vertex has in-degree == out-degree
=> Eulerian path exists; it starts at s and ends at t
• Note:
– A circuit is a special case of a path
– Zero-degree vertices have no impact on Eulerian path existence
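The directed-graph conditions can be sketched as a degree check followed by Hierholzer's algorithm (named here explicitly; the slides do not say which path-finding method the assignment expects). The edge-list input and the function names are assumptions. Connectivity is verified indirectly: if the walk does not use every edge, no Eulerian path exists.

```python
from collections import defaultdict

def find_start(edges):
    """Return a valid start vertex according to the degree conditions, or None."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    vertices = set(out_deg) | set(in_deg)
    starts = [v for v in vertices if out_deg[v] - in_deg[v] == 1]
    ends = [v for v in vertices if in_deg[v] - out_deg[v] == 1]
    balanced = [v for v in vertices if in_deg[v] == out_deg[v]]
    if len(starts) + len(ends) + len(balanced) != len(vertices):
        return None                       # some vertex is off by more than one
    if len(starts) == 1 and len(ends) == 1:
        return starts[0]                  # Scheme 2: path from s to t
    if not starts and not ends and edges:
        return edges[0][0]                # Scheme 1: circuit, any non-isolated vertex
    return None

def eulerian_path(edges):
    """Return one Eulerian path as a vertex list, or None if none exists."""
    start = find_start(edges)
    if start is None:
        return None
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if adjacency[u]:
            stack.append(adjacency[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: vertex is finished
    path.reverse()
    # If the graph is disconnected, some edges were never reached.
    return path if len(path) == len(edges) + 1 else None

print(eulerian_path([("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]))
```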
CSCI3220 Algorithms for Bioinformatics Tutorial Notes | Prof. Kevin YIP, Mr. Chenyang Hong, Mr. Zhenghao Zhang| Fall 2018 16
