
Tutorial Note 10

Midterm Solutions & Phylogenetic Tree

The Chinese University of Hong Kong


CSCI3220 Algorithms for Bioinformatics

TA: Chenyang HONG


20/11/2018
Agenda
• Midterm Suggested Solutions
• Introduction to Phylogenetic Tree
• File Format of Phylogenetic Tree (Newick)
• Parsimony

CSCI3220 Algorithms for Bioinformatics Tutorial Notes | Fall 2018 2


Question 1
1. Answer the following short questions:
• (a) Describe the two key ideas behind dynamic
programming.
– Dividing a problem into simpler sub-problems and
solving the original problem by combining the results of
the sub-problems.
– Storing the results of sub-problems such that they can
be reused.



Question 1
• (b) If dynamic programming is used to perform multiple
sequence alignment between k DNA sequences each with
n nucleotides, what is the time complexity of the
algorithm? Explain your answer by an analysis of the
algorithm.
– There are O(n^k) entries in the dynamic programming
table.
– Filling each entry requires the comparison of 2^k - 1
entries.
– Therefore, the algorithm has a time complexity of
O(n^k · 2^k) = O((2n)^k).
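The counting argument above can be checked numerically with a small sketch. The function name `msa_dp_cost` is hypothetical, and the exact table size `(n + 1)^k` (one extra index per sequence for the empty prefix) is an assumption; it has the same order as the O(n^k) on the slide.

```python
def msa_dp_cost(n: int, k: int) -> tuple[int, int, int]:
    """Return (table entries, predecessors per entry, total comparisons)."""
    entries = (n + 1) ** k        # one entry per combination of prefix lengths
    predecessors = 2 ** k - 1     # any non-empty subset of the k sequences may advance
    return entries, predecessors, entries * predecessors


for k in (2, 3, 4):
    print(k, msa_dp_cost(10, k))
```

For k = 2 this reduces to the familiar pairwise case, in which each cell looks at three neighbours.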



Question 1
• (c) If you want to compare a query protein sequence with
a database of DNA sequences, why is it better to convert
the DNA sequences in the database into possible protein
sequences than to convert the query protein sequence
into possible DNA sequences? Give two reasons.
– For every DNA coding sequence, there are only six possible
protein sequences, but for each protein sequence, there is an
exponential number of possible DNA sequences.
– The sequences in the database only need to be converted once
and can be reused as many times as needed.
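The "six possible protein sequences" correspond to the three reading frames on each of the two strands. A minimal sketch of this six-frame translation follows, assuming the standard genetic code and an input containing only A, C, G and T; the function names are illustrative, not from the course material.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> list[str]:
    """Three reading frames on the given strand, then three on the reverse complement."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


print(six_frame_translations("ATGGCCTAA"))
```

Going in the other direction, a protein of length m can be encoded by up to 6^m DNA sequences, because some amino acids have six synonymous codons.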



Question 1
• (d) Why must the suffix tree for a sequence of length n
(including the end-of-sequence symbol, $) have no more
than 2n nodes?
– Each leaf node corresponds to a suffix and there are n suffixes,
and therefore there are n leaf nodes.
– Each internal node has at least two child nodes, thus reducing
the number of disconnected sub-trees by at least one. Therefore,
there are at most n-1 internal nodes.
– Combining these two facts, the total number of nodes is no more
than 2n.
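The bound can be checked on a concrete sequence without building the tree. The brute-force sketch below relies on a standard fact that is not stated on the slide: the internal nodes of a suffix tree (including the root) correspond to the distinct substrings that are followed by at least two different characters. The function name is hypothetical, and the code assumes the sequence ends with a unique `$` and contains at least two distinct characters.

```python
def suffix_tree_node_count(s: str) -> tuple[int, int]:
    """Return (number of leaves, number of internal nodes) of the suffix tree of s."""
    n = len(s)
    followers: dict[str, set[str]] = {}
    for i in range(n):
        for j in range(i, n):
            # s[i:j] is a (possibly empty) substring that is followed by s[j].
            followers.setdefault(s[i:j], set()).add(s[j])
    # A substring is a branching point if at least two different characters follow it.
    internal = sum(1 for chars in followers.values() if len(chars) >= 2)
    return n, internal


leaves, internal = suffix_tree_node_count("AGCCTCGC$")
print(leaves, internal, leaves + internal <= 2 * leaves)
```

This takes far longer than a real suffix tree construction and is only meant to illustrate the counting argument.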



Question 2
2. This question is about optimal global sequence alignment.
In the whole question, we consider a scoring scheme of
having +1 score for a match, -1 score for a mismatch, and -1
score for an indel without applying affine gap penalty.
• (a) Fill in the following dynamic programming table to
perform optimal global alignment between the sequences
r=CAACT and s=AAGAT based on the above scoring
scheme. Draw all arrows that lead to the score of each cell
(i.e., only the “red arrows”). Then report the optimal
alignment score and all alignments with this optimal score.



Question 2

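Dynamic programming table (rows: r, columns: s; Ø marks the end of each sequence, so each cell is the optimal score for aligning the corresponding suffixes):

          s:   A    A    G    A    T    Ø
r:  C          0    0   -1   -1   -3   -5
    A          1    1    0    0   -2   -4
    A         -1    0    0    1   -1   -3
    C         -3   -2   -1    0    0   -2
    T         -3   -2   -1    0    1   -1
    Ø         -5   -4   -3   -2   -1    0

Optimal alignment score: 0

Optimal alignments:
r CA_ACT
s AAGA_T

r CAAC_T
s _AAGAT

r CAA_CT
s _AAGAT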
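The score and the three alignments can be reproduced with a short sketch of global alignment by dynamic programming. It fills the table over prefixes rather than suffixes, unlike the slide, but this gives the same optimal score and the same set of alignments. The function name and return format are illustrative choices.

```python
def global_alignments(r: str, s: str, match=1, mismatch=-1, indel=-1):
    """Return (optimal score, sorted list of all optimal alignments)."""
    n, m = len(r), len(s)
    # V[i][j] = best score for aligning the prefixes r[:i] and s[:j].
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * indel
    for j in range(1, m + 1):
        V[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if r[i - 1] == s[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)

    def trace(i, j):
        """Yield every optimal alignment of r[:i] and s[:j] as a pair of strings."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            sub = match if r[i - 1] == s[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + sub:
                for a, b in trace(i - 1, j - 1):
                    yield a + r[i - 1], b + s[j - 1]
        if i > 0 and V[i][j] == V[i - 1][j] + indel:
            for a, b in trace(i - 1, j):
                yield a + r[i - 1], b + "_"
        if j > 0 and V[i][j] == V[i][j - 1] + indel:
            for a, b in trace(i, j - 1):
                yield a + "_", b + s[j - 1]

    return V[n][m], sorted(trace(n, m))


score, alignments = global_alignments("CAACT", "AAGAT")
print(score)
for top, bottom in alignments:
    print(top, bottom)
```

The `trace` generator follows every arrow that leads to a cell's score, which is why it enumerates all optimal alignments rather than just one.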



Question 2
• (b) Suppose the scoring scheme remains the same as the one above
but we can change r and s to any pair of DNA sequences each with 5
nucleotides. What are the highest and lowest possible optimal
global alignment scores? Give an example of the corresponding
sequences r and s and their alignment in each case.
– Highest possible optimal global alignment score is 5.
– Example:
• r ACGTA
• s ACGTA
– Lowest possible optimal global alignment score is -5.
– Example:
• r AAACC
• s GTGTG



Question 2
• (c) Based on the scoring scheme above, is it possible for two
DNA sequences each with 5 nucleotides to form a global
alignment with score 4? Explain why or why not.
– It is impossible to have a global alignment with score 4 given
the requirements.
– This is because in order to have a global alignment score of 4,
there must be at least 4 matches. For the remaining
nucleotide in r, if it matches a nucleotide in s, then the resulting score of the
alignment must be 5.
– On the other hand, if it does not form a match with a
nucleotide in s, either the alignment would contain a
mismatch or two indels. The corresponding alignment scores
would be 3 and 2 in the two cases, respectively.
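The case analysis can also be done exhaustively. Since both sequences have 5 nucleotides, an alignment with some number of matches and mismatches must pair every remaining nucleotide of r and of s with a gap, so the indels come in pairs. The sketch below enumerates all such combinations under the +1/-1/-1 scheme; the function name is hypothetical.

```python
def achievable_scores(length: int = 5) -> dict[int, list[tuple[int, int, int]]]:
    """Map each achievable score to its (matches, mismatches, gap pairs) combinations."""
    scores: dict[int, list[tuple[int, int, int]]] = {}
    for matches in range(length + 1):
        for mismatches in range(length + 1 - matches):
            # Each remaining nucleotide of r and of s faces a gap,
            # giving 2 * gap_pairs indel columns in total.
            gap_pairs = length - matches - mismatches
            score = matches - mismatches - 2 * gap_pairs
            scores.setdefault(score, []).append((matches, mismatches, gap_pairs))
    return scores


scores = achievable_scores(5)
print(sorted(scores))   # 4 is missing from the achievable scores
print(scores[1])        # the only way to reach a score of 1
```

A score of 1 requires exactly 3 matches, 2 mismatches and no indels, so an alignment with that score has no gaps.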



Question 2
• (d) Based on the scoring scheme above, give an example of
two DNA sequences such that their optimal global
alignment score is 3. Explain why this alignment score must
be optimal for your two sequences without actually
performing dynamic programming.
– An example is r=AAAAC and s=AAAAG.
– Since there is a nucleotide in r that does not exist in s, it is
impossible to have 5 matches, and therefore it is impossible to
have an optimal score of 5.
– Now, by aligning the two sequences without having any gaps, the
alignment score is 4-1=3. Since in Part c it has been proved that it is
impossible to have a global alignment score of 4 between two DNA
sequences each with 5 nucleotides, this score must be optimal.



Question 2
• (e) What is the maximum number of distinct optimal
global alignments between two given DNA sequences if
each of them has 5 nucleotides and their optimal global
alignment score is 1 based on the scoring scheme above?
This type of questions is useful for estimating the output
size in sequence alignment.



Question 2(e)
Answers:
• In order to have an optimal global alignment score of 1, the alignment
must contain at least 1 match.
• The alignment cannot have 5 matches because in that case the optimal alignment
score would be 5.
• The alignment cannot have exactly 4 matches because in that case the
optimal alignment score would be either 2 or 3.
• The alignment cannot have exactly 2, 1 or 0 matches because in
those cases the optimal alignment score would be lower than 1.
• Therefore, the alignment must have exactly 3 matches.
• Correspondingly, in order to have an alignment score of 1, there must be
exactly two mismatches and no indels.
• Therefore, there must be only one optimal global alignment with a score
of 1.



Question 3
3. This question is about Burrows-Wheeler transform.
• (a) Produce the BWT of DNA sequence s=AGCCTCGC$ using any method. Show
your steps in detail.
– The suffixes of s and their corresponding starting positions in s are as
follows:
Original order                     Sorted order
Suffix      Starting position      Suffix      Starting position
AGCCTCGC$   1                      $           9
GCCTCGC$    2                      AGCCTCGC$   1
CCTCGC$     3                      C$          8
CTCGC$      4                      CCTCGC$     3
TCGC$       5                      CGC$        6
CGC$        6                      CTCGC$      4
GC$         7                      GC$         7
C$          8                      GCCTCGC$    2
$           9                      TCGC$       5

– Accordingly, the BWT of s is s[8]s[9]s[7]s[2]s[5]s[3]s[6]s[1]s[4] = C$GGTCCAC
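The same steps can be written as a short sketch: sort the suffixes, then take the character just before each suffix's starting position. It assumes the sequence ends with a unique `$`, which sorts before A, C, G and T in ASCII order. Sorting whole suffixes is simple but not efficient; the function names are illustrative.

```python
def suffix_array(s: str) -> list[int]:
    """1-based starting positions of the suffixes of s, in sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def bwt(s: str) -> str:
    """BWT of s, assuming s ends with a unique '$'."""
    # The character before 1-based position i is s[i - 2] in 0-based indexing;
    # for i = 1 the index -1 wraps around to the final '$'.
    return "".join(s[i - 2] for i in suffix_array(s))


print(suffix_array("AGCCTCGC$"))
print(bwt("AGCCTCGC$"))
```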



Question 3
• (b) Is it possible to have another sequence with the same
BWT as the BWT of s in Part a? Explain why or why not.
– It is impossible.
– The BWT of a sequence is defined as the last column of its sorted
rotation matrix. Based on it, the first column of the sorted
rotation matrix can also be obtained by sorting the set of
characters contained in the BWT.
– Accordingly, the conceptual method discussed during lecture can
be used to reproduce the whole sorted rotation matrix, and
finally the original sequence can be deduced from the row that
ends with $. Since this whole process is deterministic, the
reconstructed sequence must be unique.
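A minimal sketch of this conceptual reconstruction is given below, assuming it matches the method from the lecture: the sorted rotation matrix is rebuilt one column at a time by prepending the BWT to the current rows and sorting again. The function name is hypothetical.

```python
def inverse_bwt_naive(b: str) -> str:
    """Rebuild the sorted rotation matrix column by column, then read off s."""
    rows = [""] * len(b)
    for _ in range(len(b)):
        # Prepend the BWT (last column) to the current rows and re-sort.
        rows = sorted(b[i] + rows[i] for i in range(len(b)))
    # The original sequence is the rotation that ends with '$'.
    return next(row for row in rows if row.endswith("$"))


print(inverse_bwt_naive("C$GGTCCAC"))
```

Every step is a deterministic sort, so a given BWT can only lead to one sequence.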



Question 3
• (c) Suppose sequence s in Part a comes from one strand of DNA,
but the sequencing reads produced could come from either of the
two strands. Therefore, when doing short read alignment, each read
should actually be compared to both s and its reverse complement.
State one advantage and one disadvantage of indexing both s and its
reverse complement, as compared to aligning every read and its
reverse complement to s.
– Advantage: searching is faster, provided that all the data
structures still fit into memory.
– Disadvantage: the index doubles in size and may
no longer fit into memory.



Question 3
• (d) Design a DNA sequence with 10 nucleotides (including
the $ symbol) such that its BWT does not contain the same
nucleotide at any two consecutive positions. Prove that
this is the case for your constructed sequence.
– One possible sequence is TAATCCTGG$.
– By construction, it is easy to see that all the rotations can be
sorted by considering only the first two nucleotides.
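Sorted rotations (first two characters … last character):
$T…G
AA…T
AT…A
CC…T
CT…C
G$…G
GG…T
TA…$
TC…A
TG…C
– Reading the last characters from top to bottom, the BWT is GTATCGT$AC,
in which no two consecutive characters are the same.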



Question 3
• (e) Suppose you are given the BWT b of a DNA sequence s of
length n (including the $ symbol). Propose an algorithm that
can determine the suffix array of s in O(n) time.
– One way is to use the procedure to reconstruct s from its BWT b.
– First, a linear scan of b is performed in O(n) time to do two things:
• Count the number of occurrences of each nucleotide in b.
• Record the “color” of each nucleotide in b, i.e., whether each
nucleotide encountered is the first occurrence of this nucleotide
in b, or the second occurrence, and so on.
– Then we can do the “first column-last column” tracing to reconstruct
s backward. This can be done in O(n) time with the two data
structures above.
– Every time when a first column is traced, the suffix that it represents
is recorded according to the position of the current nucleotide in s
being deduced.
Question 3(e)
• Example:
• b: GTGA$AG
First column   Last column   Suffix array
$              G             7
A              T             5
A              G             2
G              A             6
G              $             1
G              A             3
T              G             4

• s = GAGTAG$
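The tracing procedure can be sketched as follows. The `rank` list plays the role of the "colour" of each character, and `first` records where each character begins in the first column. The names are illustrative, and the code assumes a unique `$` that sorts before the nucleotides. Sorting the alphabet takes constant time for DNA, so the whole procedure is linear in n.

```python
def suffix_array_from_bwt(b: str) -> tuple[str, list[int]]:
    """Return (s, suffix array with 1-based positions) reconstructed from the BWT b."""
    n = len(b)
    counts: dict[str, int] = {}
    rank = []                      # rank[i] = occurrences of b[i] before row i
    for ch in b:
        rank.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1
    first = {}                     # first[ch] = first row whose first column is ch
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    sa = [0] * n
    chars = ["$"]
    row = 0                        # row 0 starts with '$', i.e. the suffix at position n
    sa[row] = n
    for pos in range(n - 1, 0, -1):
        chars.append(b[row])                 # character at position pos of s
        row = first[b[row]] + rank[row]      # jump from last column to first column
        sa[row] = pos
    return "".join(reversed(chars)), sa


print(suffix_array_from_bwt("GTGA$AG"))
```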



Question 4
4. This question is about sequence assembly and heuristic
sequence alignment.
• (a) Why is it not always possible to get back the original
DNA sequence by sequence assembly of short reads
produced from it? Give two reasons.
– The original sequence contains tandem (i.e., consecutive)
repeats longer than the read length, and therefore the number
of copies of the repeating unit cannot be determined.
– Some parts of the original sequence are not covered by any read.
– The reads contain errors, which mislead the assembly algorithm.



Question 4
• (b) Suppose some short reads are produced from a DNA sequence,
such that every read is a sub-sequence of the original sequence with
no errors. Suppose also that there is an algorithm that can always
find the shortest superstring(s) from a set of input strings. If the
short reads described here are provided as inputs to the algorithm,
can its output sequence(s) be longer than the actual original DNA
sequence? Explain why or why not.
– The output sequence(s) cannot be longer than the actual sequence.
– If a sequence is constructed such that it contains all the covered regions of
the original sequence, every read must be a sub-sequence of it and thus it is a
superstring of the reads. This sequence must not be longer than the original
sequence.
– Since the algorithm can always find the shortest superstring, its output must
be not longer than this sequence.



Question 4
• (c) State a complete set of necessary and sufficient conditions for
an undirected graph to have an Eulerian path.
– The graph is connected.
– Either all nodes have an even degree, or exactly two nodes have an odd degree.
• (d) In practice, when performing sequence assembly, it is difficult to
handle the large number of sequencing reads all at the same time.
One common strategy is to first cluster the reads such that each
cluster contains reads that can likely be assembled together, and
then work on each cluster separately. Explain how PSI-BLAST can be
used to perform this clustering (although not very efficiently), and
how sequence profiles should be constructed from a set of reads
when using PSI-BLAST for this purpose.
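For part (c), the two conditions translate directly into a check on an edge list. The sketch below is illustrative only: the function name is hypothetical, nodes are known only through the edges they appear in (so isolated nodes are not represented), and an empty edge list is treated as trivially satisfying the conditions.

```python
from collections import defaultdict


def has_eulerian_path(edges: list[tuple[str, str]]) -> bool:
    """Check the two conditions for an undirected graph given as an edge list."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    if not adjacency:
        return True  # assumption: no edges means nothing to traverse
    # Condition 1: the graph is connected (depth-first search from any node).
    start = next(iter(adjacency))
    seen = {start}
    stack = [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    if len(seen) != len(adjacency):
        return False
    # Condition 2: zero or exactly two nodes have an odd degree.
    odd = sum(1 for node in adjacency if len(adjacency[node]) % 2 == 1)
    return odd in (0, 2)


print(has_eulerian_path([("a", "b"), ("b", "c")]))
print(has_eulerian_path([("a", "b"), ("a", "c"), ("a", "d")]))
```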



Question 4(d)
• What is PSI-BLAST?
• PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search
Tool) derives a position-specific scoring matrix (PSSM) or profile
from the multiple sequence alignment of sequences detected above
a given score threshold using protein–protein BLAST. This PSSM is
used to further search the database for new matches, and is
updated for subsequent iterations with these newly detected
sequences. Thus, PSI-BLAST provides a means of detecting distant
relationships between proteins.

• From https://www.ncbi.nlm.nih.gov/books/NBK2590/



Question 4(d)
• Answers:
• The usage of PSI-BLAST:
– One way is to take a read and find other reads similar to it using BLAST. These
reads can likely be assembled together due to their high sequence similarity.
– The whole set of reads can then be used to find additional similar reads using
BLAST again in an iterative fashion, until no more reads are found or the cluster
is large enough to stop.
– The whole set of reads then forms a cluster. This process can be repeated for
different starting reads, sequentially or in parallel.
• Constructing sequence profiles:
– When more and more reads are included in the cluster, the old reads will
become less useful in retrieving new reads since many reads similar to them
will have already been included in the cluster.
– Then one way to build the profile is to get a consensus sequence by the
alignment of the newest reads, or to incorporate a weight in the scoring
scheme such that alignments involving the new reads carry more weight.



Question 4
• (e) Sequence assembly can also be used at the RNA level. Suppose
there is a type of experiments that can produce sequencing reads
that are reverse complementary to the RNA transcripts. The
assembly task is then to take these RNA reads and assemble them
back to the original RNA transcripts. State two aspects in which this
assembly differs from assembling DNA reads, other than the
trivial difference that two different sets of nucleotides are involved.
– Since RNA is single-stranded while DNA is double-stranded,
when assembling RNA reads, it is not necessary to consider the
possibility of having reads produced from both strands of each
target sequence.
– In RNA sequence assembly, the target set involves many short
transcripts rather than a few long DNA sequences.



Introduction to Phylogenetic Tree
• Given a set of DNA/protein sequences
• Construct a phylogenetic tree that likely reflects
the historical evolutionary events, based on
– Sequences of different species
• Parsimony
• Likelihood
– Distances of different species
• UPGMA
• Neighbour Joining (NJ)
• Understand the file format (Newick) and databases
(TreeFam, UCSC genome browser) of phylogenetic tree



Newick
• Newick, nested brackets with distances
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);
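To make the nesting concrete, here is a minimal sketch of a recursive parser for this subset of Newick (names, branch lengths and parentheses only; quoted labels and comments are not handled, and the input is assumed to be well formed). The `(name, length, children)` tuple layout and the function names are illustrative choices.

```python
def parse_newick(text: str):
    """Parse a Newick string into nested (name, length, children) tuples."""
    text = "".join(text.split())          # drop whitespace and line breaks
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # skip ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]            # empty for unnamed internal nodes
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("Newick string must end with ';'")
    return root


def leaf_names(node) -> list[str]:
    name, _, children = node
    return [name] if not children else [n for c in children for n in leaf_names(c)]


tree = parse_newick("(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);")
print(tree)
print(leaf_names(tree))
```

Each pair of brackets becomes one internal node, and the number after a colon is the length of the branch leading to that node.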

Image credit: wikipedia



Exercise: File Format
• Use Newick to represent the following phylogenetic
tree
Phylogenetic Tree:

• ((((Chimp_Vellerosus:0,
Chimp_Schweinfurthii:0):0.04,
Chimp_Troglodytes:0.04):0.09,
European_Human:0.09):0.13, Chimp_Verus:0.13);
Maximum Parsimony
• Sequence-based reconstruction method

• Beliefs:
– Mutations are rare
– The simplest explanation is likely the correct one

• However, no polynomial time algorithm is known.


• This is because the number of tree topologies with n
nodes is on the order of n!



Small Parsimony (Simple version)
• Upward Phase:
– If S_left-child ∩ S_right-child = {},
then S_current ← S_left-child ∪ S_right-child
– If S_left-child ∩ S_right-child ≠ {},
then S_current ← S_left-child ∩ S_right-child
• Downward Phase:
– If S_parent ∩ S_current = {},
then S_current ← any one from S_current
– If S_parent ∩ S_current ≠ {}, then S_current ← S_parent
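The two phases can be sketched for a binary tree given as nested pairs of leaf sequences, applying the rules independently at each sequence position. The function name and tree encoding are illustrative. Wherever the rules allow "any one" symbol, this sketch arbitrarily takes the alphabetically smallest, so it returns only one of the possibly many optimal assignments. The mutation count uses the standard fact that each union in the upward phase corresponds to one mutation.

```python
def small_parsimony(tree):
    """Simple small parsimony on a binary tree of nested 2-tuples of leaf strings.

    Returns (number of mutations, labelled tree); internal nodes of the labelled
    tree are (sequence, left, right) and leaves are plain strings.
    """
    def upward(node):
        # Returns (candidate sets per position, annotated children, union count).
        if isinstance(node, str):
            return [{ch} for ch in node], None, 0
        left, right = upward(node[0]), upward(node[1])
        sets, unions = [], left[2] + right[2]
        for a, b in zip(left[0], right[0]):
            if a & b:
                sets.append(a & b)
            else:
                sets.append(a | b)
                unions += 1          # each union corresponds to one mutation
        return sets, (left, right), unions

    def downward(annotated, original, parent_seq):
        sets, children, _ = annotated
        if children is None:
            return original
        # Keep the parent's symbol when it is allowed; otherwise pick the smallest.
        seq = "".join(p if p in s else min(s) for p, s in zip(parent_seq, sets))
        return (seq,
                downward(children[0], original[0], seq),
                downward(children[1], original[1], seq))

    annotated = upward(tree)
    root_seq = "".join(min(s) for s in annotated[0])   # root: pick any symbol
    return annotated[2], downward(annotated, tree, root_seq)


print(small_parsimony((("AC", "GC"), "GT")))
```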



Example: Small Parsimony (Simple version)

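Upward Phase (tree topology: ((AC, GC), GT))
– Leaves: AC GC GT
– Parent of AC and GC: [AG]C
– Root: G[CT]

Downward Phase (two possible assignments)
– Root GC, parent of AC and GC: GC
– Root GT, parent of AC and GC: GC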



Exercise: Small Parsimony Method
• Given the structure of the phylogenetic tree, fill in the
sequences such that the number of mutations is
minimized. Using the simple and extended versions
respectively,
– what is the optimal number of mutations?
– how many possible sets of sequences for the internal
nodes?

GCC ACT CCC TCT TCC



Answer: Small Parsimony Method
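• After upward phase (tree topology: ((GCC, ACT), ((CCC, TCT), TCC)))
– Root: [AGT]CC
– Parent of GCC and ACT: [AG]C[CT]
– Parent of CCC and TCT: [CT]C[CT]
– Parent of (CCC, TCT) and TCC: TCC
– Leaves: GCC ACT CCC TCT TCC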



Answer: Small Parsimony Method (Simple)
• After downward phase, there are four possible assignments:

Root   Parent of GCC, ACT   Parent of (CCC, TCT), TCC   Parent of CCC, TCT
ACC    ACC                  TCC                         TCC
TCC    ACC                  TCC                         TCC
GCC    GCC                  TCC                         TCC
TCC    GCC                  TCC                         TCC

Leaves in every case: GCC ACT CCC TCT TCC

Number of mutations: 5
Number of possible sets: 4
A Summary on Parsimony
                      Large Parsimony                    Small Parsimony

Observed sequences    Given                              Given

Ancestral sequences   Have to work out                   Have to work out

Tree topology         Have to work out                   Given

Number of mutations   Minimum in all cases               Minimum subject to the given tree topology

Algorithms            No polynomial time algorithm       Simple version, Extended version

