by University of California San Diego & National Research University Higher School of Economics
Timeline: start 07/13, Weeks 1-4, end 08/16.
Welcome
Instructor's Note
Thanks for signing up for the Algorithms on Strings class! We are excited that the class is beginning and
look forward to interacting with you!
This course is part four of the Algorithms and Data Structures specialization. Although we recommend
signing up for the entire specialization, you can still take this class on its own even if you haven't taken
the previous parts.
In a few years, you may have your genome sequenced in a doctor's office as part of a routine medical
procedure. But how can your doctor find the mutations that make your genome different from other
genomes, and figure out which of them are implicated in diseases? Algorithmically, this problem is not
very different from tasks you face every day: spell-checking your documents or searching the internet. In
this course, you will learn about suffix trees, suffix arrays, and other brilliant algorithmic ideas that help
doctors find differences between genomes and power lightning-fast internet searches.
We look forward to seeing you in this class! We know it will make you a better programmer.
https://www.coursera.org/learn/algorithms-on-strings/lecture/avHa3/welcome
Week 1
Suffix Trees
How would you search for the longest repeat in a string in LINEAR time? In 1973, Peter Weiner
came up with a surprising solution based on suffix trees, the key data structure in pattern
matching. Computer scientists were so impressed with his algorithm that they called it the
Algorithm of the Year. In this lesson, we will explore some key ideas for pattern matching that will,
through a series of trials and errors, bring us to suffix trees.
Key Concepts
Develop a program to build a trie out of a list of strings
Develop a program to search for multiple patterns in a string using a trie
Develop a program to build a suffix tree of a string
Apply suffix trees to find the longest non-shared substring of two strings
From Genome Sequencing to Pattern Matching
Video Lecture: Welcome (4 min)
Video Lecture: Suffix Trees (4 min)
Practice Quiz (5 questions)
Reading: FAQ (10 min)
Programming Assignment (10 min)
Reading: FAQ on Programming Assignments (10 min)
String algorithms are everywhere. They underlie both things that you use every day and some exciting
cutting-edge research. One of the research areas that makes heavy use of string algorithms is
bioinformatics. It studies the DNA and genes of humans, animals, and primitive organisms. Genes
determine how an organism will develop and which genetic diseases are likely to happen. DNA is a
complex three-dimensional structure, and you can see an example of part of it on the left. You see that
it consists of two intertwined strands. We can extract many of the properties of the DNA if we take
just one of those strands and record the sequence of the molecules in it. If we consider the few
distinct types of those molecules, called nucleotides, and denote each of them by a Latin letter,
then we can convert a strand into a string of Latin letters. And what we can do next is study those
strings. For example, we can put together a string for a human DNA, for a chimp DNA, and
for a mouse DNA. And then we can find the parts which are the same and the parts which differ.
This way, we can observe how life on Earth evolved and what changed in the genes. Another
thing we can do is put the genome of a person with a rare genetic disease next to the genome
of a healthy person, find the differences, and then make hypotheses about which mutations in the genes
cause this genetic disease, which can later help to cure it. You will hear more about this later in the
course from Pavel Pevzner, who has been working in the field of bioinformatics for decades.
Moving on to everyday things, depending on where you are now, you probably use one of these
search engines every day: Google in most of the world, Yandex in Russia, Baidu in China, or Naver in
South Korea. Search engines crawl the Internet and download petabytes of data that they find there.
Then, when you type in your search query, they mostly use the text of those documents on the
Internet to match to your query. Those texts are represented as strings, and lots of string algorithms
are used in the process.
Another example is your favorite text editor: when you run spell-checking or just try to find
something in your text, string algorithms are at work. A less obvious example is the software which
protects our computers and computer networks. Anti-virus software looks for suspicious patterns
in the code of the programs that you want to launch on your computer, and network intrusion
detection systems look for suspicious patterns in the network traffic. So string algorithms are a great
tool for pattern matching; both exact and approximate matching algorithms are used in this software.
Last but not least, software engineers are needed to implement all of the stuff that I've shown you,
and they need their own tools. If you've ever participated in a programming project within a team,
then you've probably reviewed some changes made by one of your teammates. You would use some
version control system and a diff tool built into that system, like the one in the example from
Wikipedia. And to correctly determine what was changed in the code or in the Wikipedia article,
string algorithms are used for aligning and matching parts of the text. In this course, we will start
with applications of string algorithms to bioinformatics. You will hear about that from Pavel Pevzner,
who also has a whole separate specialization on Coursera about bioinformatics. You will see how
basic string algorithms can be applied to solve problems which arise in bioinformatics. Then, in the
second part of the course, you will meet me again, and we will work through some algorithmic
challenges around how to make those algorithms run really fast. Fast enough, for example, to apply
them to genomes which consist of millions or even billions of characters, or to text on the Internet,
in your text editor, and so on. Also, at the end of the specialization, you will have a capstone project
called Genome Assembly. There, you will apply string algorithms, graph algorithms, and other
algorithms to build a genome from a million pieces. See you later in this course.
Hello. I haven't seen you for a long time, since we were working on the change problem. But today
we'll work on a completely different topic called string algorithms. String algorithms are everywhere.
Every time you spell check your documents, or Google something, you execute sophisticated string
algorithms. But today, we will talk about a very different application of string algorithms. Sam Berns gave
a fantastic pep talk when he was 16. He was talking about his life, and a year later, he died.
Sam was suffering from a rare genetic disease called progeria. Children with progeria often have
above-average intelligence, look old already at the age of ten, and usually die in their teen years.
For many years, biologists had no clue what causes progeria. But in 2003, they figured out that
progeria is caused by a single mutation on chromosome 1. To understand what pattern matching
has to do with progeria, we need to learn something about genome sequencing.
When my children were young, that's how I was explaining to them how genome sequencing works. I was
using the example of the newspaper problem. Take many identical copies of the New York Times,
then set them on a pile of dynamite.
Don't do this at home. Then wait until the explosion is over and collect the remaining pieces. Of course,
many pieces will burn in the explosion, but some pieces will remain. Your goal is to reconstruct
the content of the New York Times. A natural way to solve the newspaper problem is to consider it as
an overlapping puzzle: look at different pieces and try to match them together like this.
Slowly but surely, hopefully you'll be able to assemble the whole newspaper. And that's roughly
how the human genome was assembled in 2000. Here, Bill Clinton is congratulating Craig Venter,
one of the leaders of the Human Genome Project, on the completion of this $3 billion mega-science
project. We don't need to know much about genomes for the rest of these talks. The only thing we
need to know is that a genome is simply a long string over the A, C, G, T alphabet.
Let me try to explain how the newspaper problem translates into genome sequencing. Biologists start from
millions of identical copies of a genome.
Then they break the genome at random positions using molecular scissors. (These molecular scissors
don't look quite like the one shown in this picture.) Then they generate short substrings of the genome,
called reads, using modern sequencing machines. Of course, during this generation some reads are
lost.
And the only thing left is to assemble the genome from millions, or maybe even billions, of tiny
pieces: the largest jigsaw puzzle humans ever attempted to solve. Today, I won't be able to tell you
about algorithms for genome assembly. But if you are interested in learning about these algorithms, you
can attend our Bioinformatics specialization on Coursera, or read the book Bioinformatics Algorithms.
Assembling the human genome was a challenging $3 billion project, and afterwards the era of genome
sequencing began. But in the first ten years after sequencing the human genome, biologists were able to
sequence only about ten other mammalian genomes, because it was still difficult.
However, five or six years ago, the so-called next-generation sequencing revolution happened, and today
biologists sequence thousands of genomes every year.
Why do biologists sequence thousands of species? There are many applications. For example, the
next big science sequencing project after the human genome was the mouse genome, because we can
learn a lot about human biology and diseases from mouse genes. There are also important applications in
agriculture: for example, by sequencing the rice genome, biologists are able to develop new high-yield
crops of different plants like rice. And there are hundreds and hundreds of other applications.
Recently, in addition to sequencing many species, there has also been much effort on sequencing
millions of personal genomes. The things that make us different are mutations. There are surprisingly
few mutations that distinguish my genome from your genome: roughly one mutation per thousand
nucleotides. However, these mutations make a big difference. They account for differences in height,
and they account for more than 7,000 known genetic diseases.
Five years ago, the era of personalized genomics started, and Nicholas Volker is a poster child of
personalized genomics. He was so sick that he went through many surgeries, but doctors still
were not able to figure out what was wrong with this kid. However, sequencing his genome
revealed a mutation in a gene linked to a defect in the immune system. Doctors applied immunotherapy,
and Nicholas Volker is now a healthy child. However, sequencing personal genomes from scratch
still remains difficult even today. What biologists do today instead is so-called reference-based
human genome sequencing. Let's take the Craig Venter genome assembled in 2000 and call it the reference
genome. Then let's start sequencing my genome by generating reads from my genome. Some of
these reads perfectly match the reference genome, but some of them don't. And based on the reads
that do not match, we will be able to figure out what my genome is. For example, we can find a
mutation of T into C and a deletion of T in my genome as compared to the reference genome.
This brings us to a number of computational problems. The easiest one is the Exact Pattern Matching
Problem: given a string Pattern and a string Text, find all positions in Text where Pattern appears
as a substring.
But our goal is to find mutations, and therefore we also want to solve the Approximate Pattern Matching
Problem, where the input is a string Pattern, a string Text, and an integer d, and we want to find all
positions in Text where Pattern appears as a substring with at most d mismatches.
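Both problem statements above can be sketched with a straightforward brute-force scan. This is only a minimal illustration, not the fast algorithms developed later in the course; setting d = 0 recovers exact pattern matching:

```python
def approximate_matches(text, pattern, d):
    """Find all positions in text where pattern occurs with at most d mismatches."""
    positions = []
    for i in range(len(text) - len(pattern) + 1):
        # count mismatching characters in the length-|Pattern| window starting at i
        mismatches = sum(1 for a, b in zip(pattern, text[i:i + len(pattern)]) if a != b)
        if mismatches <= d:
            positions.append(i)
    return positions

print(approximate_matches("panamabananas", "ana", 0))  # exact matching: [1, 7, 9]
print(approximate_matches("panamabananas", "ana", 1))  # allows one mismatch per window
```

This scan takes O(|Text| * |Pattern|) time, which is exactly the inefficiency the rest of the lesson sets out to remove.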
I think you already have some good ideas on how to solve this rather simple problem. But think about
this: even if you have fast algorithms for solving this problem, would you be able to solve the next
one? To answer the question of where the billions of reads generated from my genome match the
reference genome, we arrive at the Multiple Pattern Matching Problem: given a set of strings Patterns
and a string Text, find all positions in Text where a string from Patterns appears as a substring.
What will be the number of nodes (including the root node) in the trie constructed from
patterns ATAGAATAGA, ATCATC, and GATGAT?
A brute force approach to pattern matching is slow when we try to match billions of patterns.
Let's try to figure out why. What is happening is that we put each pattern in its own car, and then
drive the first car along the text, then the second car, then the third car, the next car and the next
car, and that's why it takes a lot of time. Here's a new idea: let's pack all patterns into a bus, and
let's drive this bus along the text. But how do we construct this bus?
Let me show you how we can construct the bus from multiple patterns. Let's start from the first
pattern and represent it as a path in a tree. Continue with the next pattern, the next pattern, and
the next pattern. So far it was easy and not interesting: we have four patterns, and we constructed
four paths from the root of the tree. Let's go to the next one, antenna. Now, the first letter of
antenna already appears on a path from the root; it is right here. The second letter also appears on
a path from the root. And then, we need to branch the previous path into two paths to construct the
path for antenna. Now let's do bandana. For bandana, we proceed further, and now we again have
to branch the path. Continue with ananas, again branching. And finally, continue with nana,
branching again. What we've constructed is actually our bus, which is called the trie of the patterns.
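The bus construction described above can be sketched in a few lines of Python. This is a minimal version in which each node is a dictionary mapping a letter to a child node; the pattern list below is assumed for illustration, echoing the lecture's examples:

```python
def build_trie(patterns):
    """Build a trie: each node is a dict mapping a symbol to a child node."""
    root = {}
    for pattern in patterns:
        node = root
        for symbol in pattern:
            # follow an existing edge if there is one, or branch off a new edge
            node = node.setdefault(symbol, {})
    return root

trie = build_trie(["pan", "antenna", "bandana", "ananas", "nana"])
print(sorted(trie))  # edges out of the root: ['a', 'b', 'n', 'p']
```

Patterns that share a prefix (like antenna and ananas) share the initial part of their path and branch only where they first differ, which is exactly the "branching" step in the walkthrough.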
How do we use this bus? After constructing it, how would we drive it along the text?
Well, we'll use TrieMatching: we'll drive this whole trie along the text, and at each position of
the text, we will walk down the trie by spelling symbols of the text.
A pattern from the set Patterns matches the text each time we reach a leaf. Let me show you how it
works. Our bus is now looking at the first letter of the text, p, so we walk along the edge labeled
by p; the next letter is a, we walk along this edge; the next letter is n, we walk along this edge; and we
found that the pattern pan appears in panamabananas. Next, we move to the next letter of the text and
we start matching again: a, n, a, and now there is no match, so at this position there is no match
between the patterns and the text. Continue further: n, a.
Once again, there is no match. A: once again, there is no match. For m, from the very beginning, there
is no match. We continue further. A: once again, there is no match. Let's try here: b, a, n, a, n, a, we
came to a leaf, so we found the pattern, but we have to continue further. Further. Further. Further. Found
the pattern again. Continue further.
Again we found the pattern, and now there are no more matches. So, in a single run of our bus,
we found all matches of the patterns against the text.
Actually, I haven't finished yet. We also have to match n, a, s. No. A? No. Now we are done.
Our bus is very fast. Recall that the runtime of our brute force approach was O(|Text| * |Patterns|),
where |Patterns| is the total length of all patterns. That's why it was slow: the total length of all
patterns is huge in the case when we try to match reads against the genome. But the runtime of
TrieMatching is only O(|Text| * |LongestPattern|), and typically in modern sequencing the reads have
length of only 200 to 300 nucleotides.
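The drive of the bus along the text can be sketched as follows. This illustrative version assumes, as in the walkthrough above, that no pattern is a prefix of another pattern, so reaching a leaf (an empty node) signals a complete match:

```python
def build_trie(patterns):
    """Trie as nested dicts: each node maps a symbol to a child node."""
    root = {}
    for pattern in patterns:
        node = root
        for symbol in pattern:
            node = node.setdefault(symbol, {})
    return root

def prefix_trie_matching(text, trie):
    """Does some pattern in the trie spell out a prefix of text?"""
    node = trie
    for symbol in text:
        if not node:              # reached a leaf: a whole pattern was spelled out
            return True
        if symbol not in node:
            return False
        node = node[symbol]
    return not node               # a pattern may end exactly at the end of text

def trie_matching(text, patterns):
    """All starting positions in text where some pattern appears."""
    trie = build_trie(patterns)
    return [i for i in range(len(text)) if prefix_trie_matching(text[i:], trie)]

print(trie_matching("panamabananas", ["ban", "nana", "pan"]))  # [0, 6, 8]
```

Each of the |Text| starting positions walks down the trie at most |LongestPattern| steps, matching the O(|Text| * |LongestPattern|) bound above.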
So it looks like finally we are done. Should we go home? We are not ready to go home just yet.
Note that the trie we constructed has 30 edges, but in general, the number of edges in a trie is O of
the total length of the patterns. And for human genomes, the total length of the patterns will be in
the trillions. So unfortunately, our algorithm will be impractical for read mapping. Should we give up?
Let's start by adding a dollar sign to the end of panamabananas; I will explain later why I add this
strange dollar sign to the end of my string.
We start from the longest suffix of panamabananas$ and build the corresponding path in the trie.
Continue, continue, continue further, continue, continue; so far there is no branching. Continue,
continue, and now the first branch in our suffix trie appears. Continue further; new branches show
up, and then finally we have constructed something that we call the suffix trie of the text. How can
we use this for pattern matching? Well, let's take our pattern once again, put it in the car, and let's
drive along the branches of our suffix trie.
First we match the first symbol of the pattern, then the next symbol, and the next symbol. And finally,
we found a match of the pattern to one of the suffixes of the text, which means we found a match of
the pattern in the text.
If we use banana, it will go like this. We found banana. With nab, it goes like this: unfortunately,
we didn't find it, because there is no continuation for b. For antenna, we go this way and finally have
to stop, because there is no match for t. So it looks like the suffix trie idea worked for us. But there is
one important question we forgot to answer: where are the matches? How do we find where our
patterns match the text? There is no information in the suffix trie yet that allows us to answer this
question. Here's an idea: let's try to add some information to the leaves of our trie. But what
information? For every leaf, let's add the starting position of the suffix that generated this leaf.
For example, for bananas$, we will write position six, because bananas$ starts at position six of the text.
Let's see how it works. For panamabananas$, we will add zero, because this suffix starts at
position zero. Then for anamabananas$, we will add the corresponding position. Continue, continue,
continue, and so on. Finally, the leaves of our trie get decorated with the positions of the suffixes in
the text, and when we have processed all suffixes, our trie has all the information we need to figure
out where the patterns occur. That is actually what is called the suffix trie: the trie I described
earlier, with its leaves decorated by the positions of the suffixes in the text.
However, attaching the positions of suffixes to the leaves of the suffix trie doesn't yet tell us where
the string banana appears in the text. What we want to do, once we find a match like banana, is to
walk down to the leaf, or maybe leaves, in order to find the starting position of the match. Let's see
how it works.
For banana, we ended in the middle of the trie, but we continue walking, continue walking, and
finally we find that banana starts at position six.
For ana, we continue walking, but there are three ways to continue towards the leaves: this is the
first one, this is the next one, and this is another one. So in this case we find that ana actually
appears three times in the text, and the three positions are shown at the top. It looks like we finally
solved the problem of finding the positions of patterns in the text, which means we now have a fast
algorithm for solving the problem.
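The decorated suffix trie above can be sketched as follows. This is an illustrative, memory-hungry version (its quadratic footprint is discussed next); the leaf-marker key "pos" is an implementation choice of this sketch, safe here only because all edge labels are single characters:

```python
def build_suffix_trie(text):
    """Suffix trie of text + '$'; each leaf stores its suffix's starting position."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
        node["pos"] = start           # decorate the leaf with the suffix position
    return root

def find_positions(trie, pattern):
    """Spell pattern from the root, then collect all leaf positions below."""
    node = trie
    for symbol in pattern:
        if symbol not in node:
            return []                 # the pattern does not occur in the text
        node = node[symbol]
    stack, positions = [node], []
    while stack:                      # walk down to every reachable leaf
        current = stack.pop()
        for key, child in current.items():
            if key == "pos":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)

trie = build_suffix_trie("panamabananas")
print(find_positions(trie, "ana"))     # [1, 7, 9]
print(find_positions(trie, "banana"))  # [6]
```

Matching a pattern takes time proportional to the pattern's length, plus the walk down to the leaves, which is what makes the suffix trie fast once it is built.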
Given only the suffix trie of some Text in the picture below, can you tell which of the following
strings appears the largest number of times in this Text, and exactly how many times it
appears?
We saw that the suffix trie results in a fast algorithm for pattern matching, but let's take a look at the
memory footprint of the suffix trie. The suffix trie is formed from the |Text| suffixes of the text. The
average length of these suffixes is roughly |Text|/2, and therefore their total length is
|Text| * (|Text| - 1) / 2.
For the human genome, this results in a huge, impractical memory footprint.
Should we give up?
Suffix Trees
We saw that pattern matching with suffix tries is fast, but impractical with respect to the memory
footprint. How about this idea: we saw that bananas$ takes up many edges in our suffix trie. Can we
compress all these edges into a single edge?
That's very easy to do. Let's simply do this, and do the same with every non-branching path in our
suffix trie. Very quickly, our tree gets much smaller. We're almost done: continue, continue, and
finally we construct something that is called SuffixTree(Text). Since each suffix adds one leaf and
at most one internal vertex to the suffix tree, the number of vertices in the suffix tree is less than
2|Text|, and therefore the memory footprint of the suffix tree, which is proportional to the number
of edges in the suffix tree, is simply O(|Text|).
This sounds like cheating, because we haven't answered the question: how do we store all the edge
labels? They would take the same quadratic space that all the labels in the suffix trie took.
However, let's try the following. Instead of storing the whole string bananas$ as a label of our
edge, notice that bananas$ starts at position six of the text and has length eight. Therefore, instead
of storing bananas$ on the edge, we will only store two numbers: 6, the starting position of
bananas$, and 8, the length of bananas$. That is sufficient to reconstruct the entire label. We will
do this for all edge labels, and as a result, the suffix tree is indeed a very memory-efficient way to
encode all information about the suffixes of the text.
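The two-number trick can be checked in one line: slicing the text at the stored (starting position, length) pair reconstructs the edge label whenever it is needed. A small sketch using the lecture's numbers:

```python
text = "panamabananas$"

# instead of storing the label "bananas$" on the edge, store two numbers:
edge = (6, 8)            # starting position 6, length 8
start, length = edge

# the label can be reconstructed from the text whenever it is needed
print(text[start:start + length])  # bananas$
```

Since every edge now costs two integers instead of a substring of the text, the whole tree fits in O(|Text|) memory, as claimed above.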
You may be wondering why we added this silly dollar sign to panamabananas. I added it because I
wanted to make sure that each suffix corresponds to a leaf. But why do we want each suffix to
correspond to a leaf? I suggest you construct the suffix tree for papa without adding the dollar sign,
compare it with the suffix tree for papa$, and you will see why the dollar sign is important.
To summarize, suffix trees are a fast and memory-efficient way to do pattern matching. However, the
construction of suffix trees is not for the faint-hearted, because we need to combine all suffixes of
the text into the suffix tree, and the naive algorithm for doing this takes quadratic, O(|Text|^2), time.
However, there is an ingenious linear-time algorithm for constructing suffix trees. It was developed
over 40 years ago, and this algorithm amazingly needs only time linear in |Text| to construct the
suffix tree. So it looks like we are done, finally, after all this effort.
And now, I want to tell you about the big secret of the big-O notation, something that Sasha, Daniel,
Misha, and Neil forgot to tell you about. Indeed, suffix trees enable fast exact multiple pattern
matching: the runtime is O(|Text| + |Patterns|), with O(|Text|) memory; that's the best we can hope for.
However, big-O notation hides constants, and the best known implementations of suffix trees have a
large memory footprint of about 20|Text|, which results in very large memory requirements for long
genomes like the human genome. But even more importantly, we want to find mutations, and it is
not clear how to develop a fast Approximate Multiple Pattern Matching algorithm using suffix trees.
So once again, we are facing an open problem that we have to solve.
Please see this link, section "Coursera week 1" for some of the frequently asked questions and
answers about this week's material.
Suffix-Trees-Reduced.pdf (PDF file)
References
See Chapter 9: How Do We Locate Disease-Causing Mutations? (Combinatorial Pattern Matching)
in [CP15] Phillip Compeau, Pavel Pevzner. Bioinformatics Algorithms: An Active Learning
Approach, 2nd Ed., Vol. 1. Active Learning Publishers, 2015.
Also see the course "Finding Mutations in DNA and Proteins" of the Bioinformatics Specialization.
If you want to learn how to assemble genomes, also see Chapter 3: How Do We Assemble
Genomes? (Graph Algorithms) in [CP15].
Programming assignment
Week 2
Although EXACT pattern matching with suffix trees is fast, it is not clear how to use suffix trees for
APPROXIMATE pattern matching. In 1994, Michael Burrows and David Wheeler invented an
ingenious algorithm for text compression that is now known as the Burrows-Wheeler Transform. They
knew nothing about genomics, and they could not have imagined that 15 years later their
algorithm would become the workhorse of biologists searching for genomic mutations. But what does
text compression have to do with pattern matching? In this lesson you will learn that the fate of an
algorithm is often hard to predict: its applications may appear in a field that has nothing to do with
the original plan of its inventors.
Key Concepts
Explain how the Burrows-Wheeler Transform allows one to reduce the memory needed to store a
genome and to search for patterns in a genome efficiently
Develop a program to compute the Burrows-Wheeler Transform of a string
Develop a program to invert the Burrows-Wheeler Transform of a string
Develop a program to search in a string given as its Burrows-Wheeler Transform
Develop a program to build a suffix array of a string
Explain how a suffix array can be used to search for patterns in a string given as its
Burrows-Wheeler Transform
Explain how a partial suffix array can be used to reduce the memory needed for a suffix array and
still be able to search for patterns in a string
Burrows-Wheeler Transform
Video Lecture: Burrows-Wheeler Transform (4 min)
Suffix Arrays
Video Lecture: Suffix Arrays (5 min)
Reading: FAQ (10 min)
Programming Assignment (4 questions)
Burrows-Wheeler Transform
The previous lecture ended with a rather difficult algorithmic challenge that we will try to solve using
the Burrows-Wheeler Transform and the suffix array. Let's start with the Burrows-Wheeler Transform.
Allow me to slightly change the focus: instead of pattern matching, we'll talk about text compression.
Run-length encoding is the simplest way to compress text: a run of a single symbol is substituted by
the number of times the symbol appears in this run, followed by the symbol itself.
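Run-length encoding as just described fits in a couple of lines; a minimal sketch using the count-then-symbol convention from the definition above:

```python
from itertools import groupby

def run_length_encode(text):
    """Replace each run of a repeated symbol by its count followed by the symbol."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(text))

print(run_length_encode("GGGGCCCAAT"))  # 4G3C2A1T
```

The encoding only pays off when the text actually contains long runs, which is exactly the point made next about genomes.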
You may be wondering why we would want to do run-length encoding for genomes, because genomes
don't have many runs. But they do have many repeats. For example, more than half of the human
genome is formed by repetitive DNA, and the lion's share of many plant genomes is also formed by
various types of repeats. So here's an idea: let's convert the text into something else, so that its
repeats are converted into runs. We'll start from the genome, turn it into a ConvertedGenome, and
then apply run-length encoding to the ConvertedGenome, because our hope is that the
ConvertedGenome will have many runs.
Let me show how we can accomplish this.
So let's consider all cyclic rotations of our favorite string, panamabananas$.
Play video starting at 1 minute 41 seconds and follow transcript1:41
We start from this one, and this one, and this one, and continue to form all cyclic rotations of this
string.
Play video starting at 1 minute 50 seconds and follow transcript1:50
After generating all the cyclic rotations, let's sort them. The dollar sign is viewed as the first
letter of the alphabet, even before A, so we'll start with $panamabananas and continue until we have
a sorted list of all cyclic rotations of the text.
You might be wondering why we are doing this, but look at this curious thing. The last column of the
resulting matrix is called the Burrows-Wheeler transform of the text. Notice that although our
original string, panamabananas$, did not have many runs, the Burrows-Wheeler transform of this
string has many runs. For example, here is a run of five A's in the Burrows-Wheeler transform of
our original text.
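For a small string, the transform can be computed directly from this definition (a naive sketch; practical implementations avoid materializing all rotations):

```python
def bwt(text):
    """Sort all cyclic rotations of text (which must end with the
    sentinel '$') and read off the last column of the sorted matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("panamabananas$"))  # smnpbnnaaaaa$a -- note the run of five a's
```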
How have we achieved this? Let me explain using an example from the famous double helix paper by
Watson and Crick, where they first presented the structure of DNA. These are just some consecutive
rows in the Burrows-Wheeler matrix of this text, and you can see many runs of A in its
Burrows-Wheeler transform. Why so many runs of A? Well, one of the most common words in English is
"and", and every time "and" appears in the text, it is likely to contribute to a run of A in the
Burrows-Wheeler transform, as you see in this example. So our goal now is to start from the genome,
apply the Burrows-Wheeler transform to it, and then, hopefully, compress the Burrows-Wheeler
transform of the genome. After applying this compression, we will greatly reduce the memory for
storing our genome. But this only makes sense if we can invert the transformation. From the
compressed version we can easily go back to the Burrows-Wheeler transform itself. But can we go back
from the Burrows-Wheeler transform of the genome to the genome itself? Is it even possible?
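It is possible. One way to invert the transform, sketched here as a preview under the assumption that the text ends with a unique sentinel '$' (helper names are mine), is to repeatedly map each symbol of the last column back to its row in the sorted first column:

```python
def inverse_bwt(last_column):
    """Rebuild the text from its Burrows-Wheeler transform. The k-th
    occurrence of a symbol in the last column and the k-th occurrence of
    that symbol in the first column correspond to the same text position."""
    def ranked(column):
        counts, pairs = {}, []
        for ch in column:
            k = counts.get(ch, 0)
            pairs.append((ch, k))  # (symbol, occurrence number)
            counts[ch] = k + 1
        return pairs
    first_rows = {pair: row for row, pair in enumerate(ranked(sorted(last_column)))}
    last_to_first = [first_rows[pair] for pair in ranked(last_column)]
    row, out = 0, []  # row 0 is the rotation starting with '$'
    for _ in range(len(last_column)):
        out.append(last_column[row])
        row = last_to_first[row]
    text = "".join(reversed(out))  # '$' followed by the rest of the text
    return text[1:] + text[0]      # rotate the sentinel back to the end

print(inverse_bwt("smnpbnnaaaaa$a"))  # panamabananas$
```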
Okay, let's learn how to do pattern matching with the Burrows-Wheeler transform. Let me first
summarize what we learned about pattern matching with the suffix tree. The runtime is
O(|Text| + total length of all patterns). The memory, in the best implementation known today, is
about 20 * |Text| bytes, which is high for long strings like the human genome. So the question we
will try to address in this lesson is: can we use the Burrows-Wheeler transform to design a more
memory-efficient linear-time algorithm for multiple pattern matching?
So let's see how we can do this. Let's search for ana in panamabananas$. We start by noticing that
there are six rows that start with the letter a. Please note that we are matching the pattern from
its last symbol to its first; this will be important. So there are six rows starting with a, but
only three of them end in n, which is what we need, because we are now matching the last two symbols
of ana, which is na. Let's turn our attention to these three symbols: using the first-last property,
we can figure out where these three n's hide in the first column of our Burrows-Wheeler matrix,
here, here, and here.
After we found where they appear in the first column, we know where na appears: three matches of na,
the last two symbols of ana, can be found in the string. Let's now try to match the first symbol of
ana, and we know where to look for it: in the last column of these three rows. After we found a in
these three rows, then, using the first-last property again, we find where these three occurrences
of ana appear at the beginning of our cyclic rotations.
As a result, we found three matches of ana. Let me specify some details of the algorithm that we
just discussed. We will use two pointers, top and bottom, that specify the range of rows in the
Burrows-Wheeler matrix that we are interested in. In the beginning, top equals 0 and bottom equals
13, covering all positions in the text. In the next iteration, the range of rows we are interested
in is narrowed to all rows where a appears in the first column. What do we do afterwards? We look
for the next symbol, which is n in ana, and we find the first occurrence of this symbol in the last
column among the rows from top to bottom; likewise, we find the last occurrence of the symbol. As
soon as we have found the first and last occurrences of this symbol, the first-last property tells
us where these n's, and all n's in between, are hiding in the first column. As a result, the
pointers top and bottom change from 1 and 6 to 9 and 11, narrowing the search. Then we continue
further, and that's how we find the positions of ana in the text.
The algorithm that I just described translates into the following BWMatching pseudocode. The lines
in green describe what we have been doing with the top and bottom pointers. Note that we are using
the last-to-first array: given a symbol at position index in lastColumn, LastToFirst(index) gives
the position of that symbol in the first column. It implements the first-last property.
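A direct Python rendering of this pseudocode might look as follows (the helper `last_to_first_map` is my own; the lecture assumes the LastToFirst array is already given):

```python
def last_to_first_map(last_column):
    """LastToFirst(i): row where the symbol occurrence last_column[i]
    appears in the sorted first column of the Burrows-Wheeler matrix."""
    def ranked(column):
        counts, pairs = {}, []
        for ch in column:
            k = counts.get(ch, 0)
            pairs.append((ch, k))
            counts[ch] = k + 1
        return pairs
    first_rows = {pair: row for row, pair in enumerate(ranked(sorted(last_column)))}
    return [first_rows[pair] for pair in ranked(last_column)]

def bw_matching(last_column, pattern, last_to_first):
    """Count occurrences of pattern by matching it from its last symbol
    to its first, narrowing the [top, bottom] range of matrix rows."""
    top, bottom = 0, len(last_column) - 1
    while top <= bottom:
        if pattern:
            symbol, pattern = pattern[-1], pattern[:-1]
            # scan the last column between top and bottom for the symbol
            hits = [i for i in range(top, bottom + 1) if last_column[i] == symbol]
            if not hits:
                return 0
            top, bottom = last_to_first[hits[0]], last_to_first[hits[-1]]
        else:
            return bottom - top + 1
    return 0

bwt = "smnpbnnaaaaa$a"  # BWT of panamabananas$
print(bw_matching(bwt, "ana", last_to_first_map(bwt)))  # 3
```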
It looks like now, finally, we are done. We have our very first pattern matching algorithm based on
the Burrows-Wheeler transform, and it has a good memory footprint.
The only problem, though, is that BWMatching is very slow. It analyzes every symbol from top to
bottom in the last column in each step.
What should we do? The trick here is to introduce the Count array: Count_symbol(i) is the number of
occurrences of a given symbol in the first i positions of the last column. This slide shows the
Count array. We can design a better version of BWMatching by substituting the four green lines that
we discussed before with two green lines that use the Count array. As you can see, we no longer need
to explore every symbol between the top and bottom indices of the last column. If you are wondering
about the details of the transformation from the previous four lines into the two lines using the
Count array, check our Coursera course or our book, which describes this transformation. So it looks
like finally, after all these complications, we are done. But there is still one question left:
where are the matches that we found? Where do they appear in the text?
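A sketch of BetterBWMatching in Python, with the Count arrays precomputed (variable names are mine, not the lecture's):

```python
def better_bw_matching(last_column, pattern):
    """Count occurrences of pattern. count[ch][i] holds the number of
    occurrences of ch in last_column[0:i], so the two pointer updates
    replace the scan of the last column between top and bottom."""
    alphabet = sorted(set(last_column))
    count = {ch: [0] for ch in alphabet}
    for c in last_column:
        for ch in alphabet:
            count[ch].append(count[ch][-1] + (ch == c))
    first_occurrence = {}
    for row, ch in enumerate(sorted(last_column)):
        first_occurrence.setdefault(ch, row)  # first row of ch in the first column
    top, bottom = 0, len(last_column) - 1
    while top <= bottom:
        if pattern:
            symbol, pattern = pattern[-1], pattern[:-1]
            if count[symbol][bottom + 1] == count[symbol][top]:
                return 0  # symbol does not occur between top and bottom
            top = first_occurrence[symbol] + count[symbol][top]
            bottom = first_occurrence[symbol] + count[symbol][bottom + 1] - 1
        else:
            return bottom - top + 1
    return 0

print(better_bw_matching("smnpbnnaaaaa$a", "ana"))  # 3
```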
You may find it useful before implementing some of the problems in the Programming Assignment
to look closer at the pseudocode for the algorithms discussed in the lectures.
Here is the pseudocode for BWMatching algorithm from the lecture:
bwmatching.pdfPDF File
Here is the pseudocode for BetterBWMatching from the lecture:
better_bwmatching.pdfPDF File
Alternatively, you can see this interactive text, which has more details about using BWT for pattern
matching (this link leads to the Finding Mutations in DNA and Proteins course of the Bioinformatics
specialization). Note that you don't need to pass the code challenge at the end of the interactive
text, as it won't affect your Coursera grade for this course: we have prepared a separate
Programming Assignment for you.
Suffix Arrays
At the end of the last lecture, we faced the challenge of finding the positions of the pattern in
the text when we tried to develop pattern matching with the Burrows-Wheeler transform. Now I will
explain how we can use suffix arrays to solve this problem. The suffix array simply holds the
starting positions of the suffixes in sorted order. For example, the first suffix starts at position
13, the second suffix starts at position 5, the next suffix starts at position 3, and we can fill in
the rest of the suffix array.
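The naive construction is a one-liner in Python (fine for examples, far too slow for genomes):

```python
def build_suffix_array(text):
    """Sort the suffix start positions by the suffixes they index.
    Naive: O(|Text|^2 log |Text|) worst case, fine for small strings."""
    return sorted(range(len(text)), key=lambda i: text[i:])

print(build_suffix_array("panamabananas$"))
# [13, 5, 3, 1, 7, 9, 11, 6, 4, 2, 8, 10, 0, 12]
```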
Here it is. Once the suffix array is constructed, we can very quickly answer where the occurrences
of the pattern are. In the case of ana, our pattern appears at positions 1, 7, and 9.
The challenge, however, is how to construct the suffix array quickly, because the naive algorithm
for constructing the suffix array, based on sorting all suffixes of the text, requires
O(|Text| log |Text|) comparisons, and each comparison may itself take O(|Text|) time.
There is a way to construct a suffix array if you have already constructed a suffix tree, because,
as you can see from this example, a suffix array is simply a depth-first traversal of the suffix
tree. Indeed, you start from leaf 5, continue to leaf 3, then 1, 7, 9, and continue further; by
simply traversing all leaves of the suffix tree in order, you reconstruct the suffix array. To
summarize, if we construct the suffix array by a depth-first traversal of the suffix tree, it takes
O(|Text|) time and roughly 20 * |Text| space.
Misha will also explain later how to quickly construct the suffix array without relying on the
suffix tree. In fact, Manber and Myers in 1990 gave the first direct suffix array construction
algorithm, which requires O(|Text| log |Text|) time and about 4 * |Text| space. However, for
genomics applications, even this reduced space of four times the length of the text is still large.
What can we do to reduce it?
Here is a trick for reducing the memory for the suffix array. Let me first ask a question: can we
store only a fraction of the suffix array but still do fast pattern matching? For example, can we
store only the elements of the suffix array that are multiples of some integer K? As shown here, if
K equals 5, then we only store the elements 5, 10, and 0 of the suffix array; we have no access to
the other elements. How can this be useful? Let me show you how to use the partial suffix array to
find the positions of matches. When we have the complete suffix array, it is trivial: we simply look
up the appropriate elements of the suffix array. But what do we do when we only have the partial
suffix array? Where do these occurrences of ana appear in the text? Well, we do not know, because
there is no corresponding element of the partial suffix array for these rows. How do we find where
they appear?
Indeed, we don't know yet where a given occurrence appears, but we can use the first-last property
to step one position back in the text: from the row of this occurrence, the last-to-first mapping
leads us to the row of the occurrence extended by one preceding symbol. Once again, the partial
suffix array may have no entry for this row either, but we can repeat the step. Eventually we reach
a row for which the element of the partial suffix array is present. So we know where this
backward-extended occurrence appears, and the only thing left is to figure out where ana itself
appears. That is easy: if the extended occurrence starts at position 5 after two backward steps,
then ana appears at position 5 + 2 = 7. So we have figured out how to use the partial suffix array
for fast pattern matching. Of course, the time to search for a pattern is multiplied by a factor of
K, because we may make up to K backward steps before finding a filled element of the partial suffix
array, but K is a constant in this algorithm.
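The walk-back procedure can be sketched as follows (names are mine; the full suffix array is built here only to simulate which entries a partial array with K = 5 would store):

```python
def last_to_first_map(last_column):
    """Row of the first column holding the same symbol occurrence as
    each row of the last column (the first-last property)."""
    def ranked(column):
        counts, pairs = {}, []
        for ch in column:
            k = counts.get(ch, 0)
            pairs.append((ch, k))
            counts[ch] = k + 1
        return pairs
    first_rows = {pair: row for row, pair in enumerate(ranked(sorted(last_column)))}
    return [first_rows[pair] for pair in ranked(last_column)]

def locate(row, partial_sa, last_to_first):
    """Text position for a row of the Burrows-Wheeler matrix: walk back
    until a stored entry is reached, counting the steps taken."""
    steps = 0
    while row not in partial_sa:
        row = last_to_first[row]
        steps += 1
    return partial_sa[row] + steps

text = "panamabananas$"
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
bwt = "".join(text[i - 1] for i in suffix_array)
# keep only suffix array values that are multiples of K = 5: here 5, 10, 0
partial_sa = {row: p for row, p in enumerate(suffix_array) if p % 5 == 0}
lf = last_to_first_map(bwt)
print([locate(row, partial_sa, lf) for row in (3, 4, 5)])  # rows of "ana" -> [1, 7, 9]
```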
You may find it useful before implementing some of the problems in the Programming Assignment
to look closer at the pseudocode for the algorithms discussed in the lectures.
suffix_array_matching.pdf
FAQ
Please see this link, sections "Coursera week 2" and "Coursera week 3" for some of the frequently
asked questions and answers about this week's material.
BWT-Suffix-Arrays-Reduced.pdfPDF File
References
See Chapter 9: How Do We Locate Disease-Causing Mutations (Combinatorial Pattern Matching)
in [CP15] Phillip Compeau, Pavel Pevzner. Bioinformatics Algorithms: An Active Learning
Approach, 2nd Ed. Vol. 1. Active Learning Publishers. 2015.
Also see the course "Finding Mutations in DNA and Proteins" of the Bioinformatics Specialization.
If you want to learn how to assemble genomes, also see Chapter 3: How Do We Assemble
Genomes (Graph Algorithms) in [CP15] Phillip Compeau, Pavel Pevzner. Bioinformatics
Algorithms: An Active Learning Approach, 2nd Ed. Vol. 1. Active Learning Publishers. 2015.
Welcome to your second programming assignment of the Algorithms on Strings class! In this
programming assignment, you will practice implementing the Burrows-Wheeler transform and
suffix arrays.
Recall that starting from this programming assignment, the grader will show you only the first few
tests (please review the FAQ section for a more detailed explanation of this behavior of the
grader).
Knuth–Morris–Pratt Algorithm
Congratulations, you have now learned the key pattern matching concepts: tries, suffix trees,
suffix arrays and even the Burrows-Wheeler transform! However, some of the results Pavel
mentioned remain mysterious: e.g., how can we perform exact pattern matching in O(|Text|) time
rather than in O(|Text|*|Pattern|) time as in the naïve brute force algorithm? How can it be that
matching a 1000-nucleotide pattern against the human genome is nearly as fast as matching a 3-
nucleotide pattern??? Also, even though Pavel showed how to quickly construct the suffix array
given the suffix tree, he has not revealed the magic behind the fast algorithms for the suffix tree
construction! In this module, Michael will address some algorithmic challenges that Pavel tried to
hide from you :) such as the Knuth-Morris-Pratt algorithm for exact pattern matching and more
efficient algorithms for suffix tree and suffix array construction.
Key Concepts
Explain what a prefix function is
Explain how to compute the prefix function at each step of the Knuth-Morris-Pratt algorithm
Apply amortized analysis to explain why the prefix function is computed in linear time in
the Knuth-Morris-Pratt algorithm
Develop a program to find a pattern in a text using the Knuth-Morris-Pratt algorithm (first problem
of the last programming assignment, which is in the next week)
Knuth-Morris-Pratt Algorithm
9 min
Video: Safe Shift
3 min
Video: Prefix Function
7 min
Video: Knuth-Morris-Pratt Algorithm
5 min
Knuth-Morris-Pratt Algorithm
Exact Pattern Matching
You have learned the brute force algorithm for this problem, which basically slides the pattern
along the text; the running time of that algorithm is the product of the length of the text and the
length of the pattern. What we are going to do in this lesson is improve this to the sum of the
length of the text and the length of the pattern. But first, let's recall how the brute force
algorithm works. It first aligns the pattern with the text so that the pattern starts at position
zero of the text, and tries to match the pattern by comparing it character by character to the
corresponding characters of the text.
In this example, we find the pattern right away at position 0, so we add position 0 to the output,
and then we slide the pattern one position to the right. We compare the first symbols, and they
don't match, so we slide the pattern again, and again, no match. We keep comparing symbols and
sliding the pattern to the right whenever they don't match. In the last possible position, we again
find an occurrence of the pattern in the text. So our output is the list of positions 0 and 7, and
those are all the positions where the pattern occurs in the text. The question now is: can we skip
some of the positions where we tried to align the pattern with the text, alignments that actually
made no sense given what we already knew from the previous comparisons?
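The brute force algorithm just described can be sketched as:

```python
def brute_force_match(text, pattern):
    """Slide the pattern along the text and compare character by
    character: O(|Text| * |Pattern|) in the worst case."""
    positions = []
    for start in range(len(text) - len(pattern) + 1):
        if text[start:start + len(pattern)] == pattern:
            positions.append(start)
    return positions

print(brute_force_match("abrabarabra", "abra"))  # [0, 7]
```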
And the answer is yes, we can. In this particular example, we've already found the pattern in the
text starting at position 0, and that means that when we slide the pattern to the right and try to
align it with position 1 of the text, we are going to compare the prefix of the pattern without its
last character against the suffix of the same pattern without its first character. They don't match,
so there is no occurrence of the pattern starting at position 1. So if we somehow preprocessed the
pattern and knew that the prefix without the last character is not equal to the suffix without the
first character, then we could skip the alignment with position 1 of the text altogether. The same
is true for the next position of the pattern: if we knew that the prefix of the pattern of length
two is not equal to the suffix of the pattern of length two, then we could also skip aligning the
pattern against position 2 of the text. When we slide one position to the right again, we compare
the prefix of length one, which is the letter a, with the suffix of length one, which is also the
letter a. These are equal, so we cannot skip this position according to this rule. But it means that
instead of comparing the pattern to the text at positions 0, 1, 2, and 3, we can safely move the
pattern from position 0 to position 3, skipping positions 1 and 2. In the more general case, we
could skip even more positions, depending on the length of the pattern and how much of its prefixes
coincide with the corresponding suffixes.
Another example: even when we don't find the whole pattern in the text, we can still skip some of
the positions. In this example, the longest prefix which is common to the text and the pattern
consists of six characters, and the pattern is longer. So we cannot simply compare prefixes of the
whole pattern with its suffixes and then decide which positions of the text we don't need to check.
Instead, we need to do the same thing with the string marked in green: compare prefixes of this
string with its suffixes. We can notice that the first position where a prefix of this string
coincides with the corresponding suffix is position four, so we can move the whole pattern from
position 0 to position 4 in the text, then compare the pattern with the text, and we find an
occurrence. We couldn't find an occurrence earlier, because no longer prefix of the string abcdab
coincides with the corresponding suffix.
Another example: again we find the longest common prefix of the pattern and the text. It has length
6 and is the string ababab. For this string, the longest prefix which coincides with the
corresponding suffix is abab, of length four. That means we can move the pattern two positions to
the right, skipping the alignment at position 1 of the text.
Now we again find the longest common prefix of the pattern and the suffix of the text starting at
position 2. It again has length six, which means there is no occurrence of the whole pattern in the
text at position 2. But we consider the string ababab, which is the longest common prefix, and again
compare its prefixes with its suffixes. We already know that the longest prefix which coincides with
a suffix is abab, of length four. So we can again move the pattern to the right so that the prefix
and the corresponding suffix match, and now we find an occurrence of the pattern in the text.
To make an algorithm from these observations, we need the definition of a border. A border of a
string is a prefix of the string which is equal to a suffix of the string of the same length. For
example, for the string arba, a is a border, because the prefix a is equal to the suffix a. And ab
is a border of the string abcdab, which we saw in the second example, and abab is a border of
ababab.
Do you notice that the prefix abab intersects with the suffix abab? That is okay, and we just mark
the fact that they intersect with orange color. Also notice that not just ab but also a longer
string of length four, abab, is a border of ababab. However, the string ab is not a border of the
string ab itself, because we require that a border does not coincide with the whole
string. We count only those prefixes and suffixes which are shorter than the string itself.
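The definition is easy to state in code (a small illustration of my own):

```python
def is_border(s, candidate):
    """A border is a prefix of s, shorter than s, that is also a suffix."""
    return (0 < len(candidate) < len(s)
            and s.startswith(candidate) and s.endswith(candidate))

print(is_border("arba", "a"))                                      # True
print([b for b in ("a", "ab", "abab") if is_border("ababab", b)])  # ['ab', 'abab']
```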
Now let's consider shifting the pattern along the text in the general situation. The first thing we
do is find u, the longest common prefix of the pattern and the suffix of the text with which we have
just aligned our pattern. Then we find w, the longest border of u, so that there is a w at the
beginning of u and one at the end of u, and we also mark both w's in the text T. Now I suggest
moving the pattern P to the right in such a way that the first w in P coincides with the second w
in T.
That is the way I suggest skipping some of the positions where we don't need to align the pattern
with the text. Now you know that it is possible to avoid some of the comparisons that the brute
force algorithm makes, and I've suggested a specific general way to do that. But we don't want to
miss any of the occurrences of the pattern in the text. So the question is: is it really safe to
move the pattern in the suggested way? You will learn that in the next video.
Safe Shift
Hi. In this video, you will see that the method I suggested in the last video for shifting the
pattern along the text is safe, in the sense that we won't miss any occurrences of the pattern in
the text by shifting it that way. But first we need suffix notation: we denote by S with index k the
suffix of string S which starts at position k. For example, for S = abcd, the suffix starting at
position 2 is S2, and it is cd. And for the string T = abc, the suffix starting at position 0 is T0,
and it is abc. Note that, again, we use indices starting from 0 for all strings.
Suppose the pattern is aligned with position k in the text.
Let's denote by u the longest common prefix of the pattern and the suffix Tk. Then select the
longest border w of the string u. In the last video, I suggested moving the pattern to the right in
such a way that the left w in the pattern coincides with the right w in the text. Now we'll prove
that there cannot be any occurrence of the pattern in the text in the red area between the current
position k and the start of the right w in the text. This will prove that the shift suggested in the
last video is safe: we won't miss any occurrences by shifting the pattern this way.
Suppose the pattern occurs in the text at some position i between k and the start of the right w.
Let's move the pattern to align with that position i.
Then we can notice that there is a prefix, v, of the pattern that is also a suffix of u in the text.
This string v, which is a prefix of P that is also a suffix of u, is actually both a prefix and a
suffix of u (it is a prefix of u because u is itself a prefix of P). And this v is longer than w,
because it starts before w in the text and ends at the same position as the right w in the text. So
v is a border of u, because it is both a prefix and a suffix, and it is a border longer than w; but
w was the longest border of u. So we get a contradiction with the assumption that the pattern P
occurs somewhere between the current position k and the start of the right w in the text. Now you
know that it is actually possible to avoid many of the comparisons that the brute force algorithm
makes, by shifting the pattern along the text and skipping some of the positions in which the brute
force algorithm tries to align the pattern with the text. But how do we actually determine the best
shifts of the pattern? How do we compute those longest borders and common prefixes? That you will
learn in the next videos.
Prefix Function
Hi. In the previous video, you learned that we need to quickly compute the longest borders of
different prefixes of the pattern. In this lecture you will learn the notion of the prefix function,
which does exactly that, and we will study some properties of the prefix function that allow us to
compute it fast. The prefix function of a pattern P is the function that, for each position i in the
pattern, returns the length of the longest border of the prefix of the pattern ending at position i.
Let's consider an example. Here is a string P, and we consider its first prefix, a. This prefix has
no border, so the prefix function is 0. For the prefix ab, ending at position 1, there is also no
border, because a is not equal to b, so the prefix function is again 0. For the prefix aba, ending
at position 2, the longest border has length 1: it is the border a. For the string abab, the longest
border is ab, of length two, and for the next prefix the longest border already has length 3.
Although the prefix aba and the suffix aba intersect, this is still a valid border. For the next
string, the longest border is abab, of length 4. Then we meet the character c, and the prefix
function drops to zero again, because there is no border of the string abababc. For the next string,
the longest border is a; for the next one, again a; and for the last one, ab. So here is the prefix
function.
Now we will prove a useful property of the prefix function: the prefix ending at position i has a
border of length s(i+1) - 1. To see this, first consider the longest border w of the next prefix,
ending at position i+1; w has length exactly s(i+1), by the definition of the prefix function. Now
consider the last character of w and cut it out. What's left, denote it w', is a border of the
prefix ending at position i, and its length is exactly s(i+1) - 1. So we've just proved the
property: the prefix ending at i has a border of length s(i+1) - 1, which means that the longest
border of the prefix ending at position i has length at least s(i+1) - 1.
From this we get an immediate corollary: the prefix function cannot grow fast when moving from one
position to the next. In particular, it cannot increase by more than one. As we saw in the example,
it can of course decrease or stay the same, but it cannot increase by more than one from one
position to the next. In the algorithms to follow, we will need to efficiently go through all the
borders of a prefix of the pattern, and this lemma helps us with that. It says that all the borders
of some prefix of the pattern, except for the longest one, are borders of this longest border in
turn. To see this, look at the longest border and at some border u of the prefix which is shorter
than the longest border.
We can see that this border u is both a prefix and a suffix of the longest border, and it is also
shorter than the longest border, so u is indeed a border of P[0..s(i) - 1]. The useful corollary
from this is that all the borders of P[0..i] can be enumerated in a simple way. First we take the
longest of the borders, then we find the longest border of that string, then the longest border of
that string, and so on, until we get the empty string as a border. By the end, we have gone through
all the borders of the initial prefix of the pattern. And to go from any prefix to its longest
border, we just need to apply the prefix function, and then the prefix function again for that
prefix of P, and so on. So if we know the prefix function of the pattern, we can go through all the
borders of any prefix of the pattern efficiently, visiting only the borders and no other positions
in the pattern.
Now let's think about how to compute the prefix function. We know that s(0) = 0, because the prefix
ending at position 0 has length 1 and has no non-empty borders. Now, to compute s(i+1) when we
already know the values of the prefix function for all previous positions, consider the longest
border of the prefix ending at position i, and look at the character at position i+1 and the
character right after the prefix of length s(i). If those characters are equal, then s(i+1) is at
least s(i) + 1, because we can just extend the border; and that means s(i+1) is exactly s(i) + 1,
because we've learned that the prefix function cannot grow by more than one from one position to the
next. If those characters are different, then everything is a bit more complex. We know that there
is a border of the prefix ending at position i that has length exactly s(i+1) - 1, so if we find
that border, the next character after it will be the same as the character at position i+1. So what
we need to do is go through all the borders of the prefix ending at position i in order of
decreasing length, and as soon as we find some border such that the next character after it is the
same as the character at position i+1, we can compute s(i+1) as the length of that border plus one.
So now we basically have the algorithm for computing all the values of the prefix function. We start
by initializing s(0) with zero, and then we compute each next value of s. If the character at
position i+1 is equal to the character right after the previous border, then we just increase the
value of the prefix function by one and go ahead. If those characters are different, we go to the
next longest border of the prefix ending at position i using the prefix function, and look at the
next character after it. If it coincides with the character at position i+1, we have found the
answer; otherwise we again go to the next longest border using the prefix function and look at the
next character after it, and so on. At some point we may come to the situation where the longest
border is empty; then we need to compare the character at position i+1 with the character at
position 0. Either they are the same, and then the prefix function is 1, or they are different, and
then the prefix function has the value 0. Now you know how the prefix function is computed, but to
see that the computation is fast, we need to count the total number of operations.
So, all in all, the border length can increase at most the length of the pattern times.
And it is decreased by at least 1 on each iteration of the while loop.
And the border length is also always non-negative, which means it can be decreased at most the length of
the pattern times as well. So there is at most a linear number of while loop iterations in total. Now you know
know how to compute prefix function efficiently in linear time in the length of the pattern. But how to
actually solve the initial problem? How to find the pattern in the text? That you will finally learn in the
next video.
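The prefix-function computation just described can be sketched in Python (a minimal sketch; the function name is my own, not from the course):

```python
def compute_prefix_function(pattern):
    """Compute s[i] = length of the longest border of pattern[0..i]."""
    n = len(pattern)
    s = [0] * n  # s[0] = 0: a single character has only the empty border
    for i in range(1, n):
        border = s[i - 1]  # longest border of the prefix ending at i - 1
        # Shrink to shorter and shorter borders until the next character matches
        while border > 0 and pattern[i] != pattern[border]:
            border = s[border - 1]
        if pattern[i] == pattern[border]:
            border += 1
        else:
            border = 0
        s[i] = border
    return s
```

For example, compute_prefix_function("abababcaab") returns [0, 0, 1, 2, 3, 4, 0, 1, 1, 2]. The amortized argument above is exactly why this runs in linear time: `border` grows by at most one per iteration of the outer loop and shrinks by at least one per iteration of the while loop.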
Knuth-Morris-Pratt Algorithm
Hi, in this video you will learn the Knuth-Morris-Pratt algorithm, which allows you to find all the
occurrences of a pattern in the text in time linear in terms of the length of the pattern and the length
of the text. So instead of the product of those lengths, as in the brute force algorithm, we will need
just the sum of those lengths to find all the occurrences.
The algorithm goes as follows. First we create a new long string S which consists of the pattern, the
text, and between them we insert a special character called dollar, which is basically any character
that is absent from both the pattern and the text. It doesn't have to be specifically the dollar
character; it is just a placeholder for some character that is absent from both the pattern and the text.
After we've assembled this long string S, we need to compute its prefix function. And after we've
computed its prefix function, we need to look at the positions in the string S which are inside the text
part of it. So we'll look at all positions i such that i is greater than the length of the pattern, so after the
pattern and after the dollar. And if the prefix function for such a position is equal to the length of the
pattern, then we know that there is an occurrence of the pattern in the text ending in that position. For
example, here we have a prefix function value of four, and we have an occurrence of the pattern
ending in the corresponding position.
We need to find all the positions where the pattern starts in the text. So from the position where it
ends we need to compute the position where it starts, and to do that we need to subtract the length of
the pattern minus 1; but that would be the position in S. To compute the position in the text, we also
need to subtract the length of the pattern and one more for the dollar. So in total, we need to subtract
twice the length of the pattern from the position in the string S to find the starting position of the pattern
in the initial text. And there is another place where the prefix function of string S is equal to the length of
the pattern, and again there is an occurrence of the pattern ending in that position. But why does this
algorithm even work?
First, we need to notice that the prefix function for this string S is always less than or equal to the
length of the pattern. Because the dollar sign occurs right after the end of the pattern, if the border
were bigger, we would need another occurrence of dollar in the string S. But dollar occurs only
between the pattern and the text, and is absent from both of them. So the prefix function cannot be
bigger than the length of the pattern.
If we look at a position i which is to the right of the dollar, and the prefix function there is equal to the
length of the pattern, that means that the pattern is a border of the corresponding prefix of S, and so
it ends in position i. We only need to determine the position in which it starts in the text T. And to do
that, we need to do a few computations, and we will see that this position is i minus twice the length
of the pattern.
However, if the prefix function in some position i is strictly less than the length of the pattern, then it
means that the pattern doesn't end in that position in the string S. And that means that it doesn't end
in the corresponding position in the text. So we've found all the positions in which the pattern ends in
the text, and therefore we've also found all the positions where it starts in the text, by subtracting
twice the length of the pattern from each ending position.
So the code for this algorithm is pretty simple. We take as input the pattern P and the text T; we
assemble the string S by concatenating the pattern, the special symbol dollar, and the text; then we
compute the prefix function of this long string S. We initialize the resulting list of positions where the
pattern occurs in the text. And we go through all the positions i in S which are to the right of the dollar sign.
And if at some position we see that the value of the prefix function is the same as the length of the
pattern, we just append i minus twice the length of the pattern to the result. And we return this
resulting list of positions in the end.
This algorithm is already pretty simple, and it works in time proportional to the sum of the lengths of
the pattern and the text.
To prove that: we know that string S can be built in time proportional to the sum of the lengths of
strings P and T. Computing the prefix function is done in time proportional to the same sum. And the
for loop runs through part of the string, so it also runs in time proportional to the sum of the lengths of
the pattern and the text. In conclusion, you now know the Knuth-Morris-Pratt algorithm for exact
pattern matching. You can find all occurrences of a pattern in a text in time linear in terms of the
length of the pattern and the length of the text. You can also compute the prefix function of any string
in linear time. And you can go through all the borders of any string in order of decreasing length using
the prefix function. In the next lessons, we will learn how to build the suffix array and the suffix tree,
which will allow you to find many different patterns in the same text even faster than if you use
algorithms like Knuth-Morris-Pratt.
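Putting it together, the whole search can be sketched in Python (a sketch; the function names are mine, and '$' is assumed to occur in neither the pattern nor the text, as the lecture requires):

```python
def compute_prefix_function(s):
    """Prefix function: s[i] = length of the longest border of s[0..i]."""
    prefix = [0] * len(s)
    for i in range(1, len(s)):
        k = prefix[i - 1]
        while k > 0 and s[i] != s[k]:
            k = prefix[k - 1]
        if s[i] == s[k]:
            k += 1
        prefix[i] = k
    return prefix

def find_all_occurrences(pattern, text):
    """Return the starting positions of all occurrences of pattern in text."""
    s = pattern + "$" + text          # '$' must occur in neither string
    prefix = compute_prefix_function(s)
    result = []
    p = len(pattern)
    for i in range(p + 1, len(s)):    # only positions inside the text part of S
        if prefix[i] == p:            # the pattern ends at position i of S
            result.append(i - 2 * p)  # convert to a start position in the text
    return result
```

For example, find_all_occurrences("aba", "abacababa") returns [0, 4, 6], the three (overlapping) start positions of "aba".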
We intend for you to solve the Programming Assignment 3 during this week and the next week; you
can already access it now. You should be ready now to solve the first problem of
the Programming Assignment 3, "Find all Occurrences of a Pattern in a String". To solve the other
problems, you should first go through the lectures and readings of the next module (and please
have a look at the pseudocode provided in the readings before starting to work on the
Programming Assignment).
References
See chapters 1, 2.3, 3.3 in [G97] Dan Gusfield. Algorithms on Strings, Trees and Sequences:
Computer Science and Computational Biology (1st Edition). Cambridge University Press. 1997.
Week 4
5 hours to complete
Constructing Suffix Arrays and Suffix Trees
In this module we continue studying the algorithmic challenges of string algorithms. You will learn
an O(n log n) algorithm for suffix array construction and a linear time algorithm for construction of a
suffix tree from a suffix array. You will also implement these algorithms and the Knuth-Morris-Pratt
algorithm in the last Programming Assignment in this course.
Suffix Array
Hi, in this lesson you will learn how to build a suffix array of a string in O(n log n) time.
The suffix array is a useful data structure that you already used in the previous modules, but now you
will learn how to build it really fast. First, let's recall what a suffix array is. The problem of constructing
a suffix array is very simple: you're given a string, and you need to sort all of its suffixes in
lexicographic order. However, as we will soon see, you won't need to actually compute all the
suffixes, sort them, and output all of them, because that would use too much time and memory. You
will just need to know in which order those suffixes go. The suffixes themselves, sorted in
lexicographic order, are only in our head; they're not stored anywhere in the program. We assume
that the alphabet from which our strings are built is ordered, so that for any two characters we can
say which one of them is smaller. For example, in English we can order all the characters from a to z;
in a binary alphabet we just have zero and one, and zero is less than one.
By definition, a string S is smaller than a different string T if either S is a prefix of T, or S and T coincide
from the beginning up to some position and then the next character in S is smaller than the
corresponding character in T. For example, if S is ab and T is bc, they don't coincide anywhere: already
the first characters are different, the character in S is a, the character in T is b, and a is less than b, so
S is less than T. In the second example, S and T coincide in the first two characters, and then the
third character c is less than the character d, so S is smaller than T. And in the third case, S is a prefix
of T but is different from T, so S is smaller than T.
And here is an example of a suffix array. We have a string S, and all its suffixes ordered in lexicographic
order are a, aa, and so on. There are exactly six suffixes here, because the length of string S is six, and so
we have six different suffixes. We want to avoid the case when S is a prefix of T and S is less than T for
that reason, because this case is different from all the others. Usually you just compare S and T from the
first character and go to the right until they differ, and then see which character is smaller. It is a
corner case when you go up to the end of S and see that there is nothing there, and that
is why S is smaller. To avoid using that rule at all, we will append a special character called
dollar to the end of the string for which we'll build the suffix array. So all the suffixes will have this
dollar at the end. And now, if initially some suffix was a prefix of another suffix, it is just smaller by the
usual rule: as soon as it ends, while still coinciding with a prefix of the bigger suffix, the
next character in the smaller one is dollar, which is smaller than all other characters. And so we can
determine by the usual rule that the smaller suffix is actually smaller.
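The effect of appending the dollar can be checked on a small string with a naive construction (a sketch for illustration only; the function name is mine, '$' is assumed to have a smaller code than every letter, and sorting the suffixes directly like this is far too slow for the algorithm developed in the rest of this lesson):

```python
def build_suffix_array_naive(text):
    """Sort the starting positions of all suffixes of text + '$'.

    '$' (ASCII 36) compares smaller than any letter, so a suffix that is
    a prefix of another suffix now compares smaller by the usual rule.
    """
    s = text + "$"
    return sorted(range(len(s)), key=lambda i: s[i:])
```

For text = "ababaa", this yields [6, 5, 4, 2, 0, 3, 1]: the suffixes in order are $, a$, aa$, abaa$, ababaa$, baa$, babaa$.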
And it works in time proportional to the length of the string plus the size of the alphabet, because we
know that this is the running time of counting sort for length-of-S items, each of which can take only
size-of-the-alphabet different values. And I need to note here that typically the size of the alphabet is
small: for example, four letters for strings over the genome alphabet, or 26 characters if we are only
working with English words, or maybe alphanumeric characters, in which case there will be 26 small
letters, 26 capital letters, and 10 digits. But sometimes the alphabet can be very big, such as
Unicode. In that case counting sort might not be appropriate: if your string, for example, has only
1000 characters, but those are all Unicode, and the alphabet size is a few million characters, then
maybe you should sort the characters of this string in a more efficient way.
Apart from sorting the characters, we will also need additional information to make the following
steps of the algorithm more efficient. To do that, we introduce equivalence classes of the partial
cyclic shifts. We denote by Ci the partial cyclic shift of length L starting in position i, where L is the
current length of the cyclic shifts which we have already sorted. Initially we have sorted single
characters, so L is equal to 1; then, in the further phases of the algorithm, L will increase from one to
two, to four, and so on, doubling in each phase. Some of the cyclic shifts can be equal to cyclic shifts
starting in different positions: Ci can be equal to Cj, and then they should be in the same equivalence
class. To assign equivalence classes, we define the array class: class[i] is equal to the number of
different cyclic shifts of length L that are strictly smaller than the cyclic shift starting at position i.
For two cyclic shifts which are equal, the values class[i] and class[j] will be the same, because the
same other cyclic shifts are smaller than these two equal cyclic shifts.
We'll need to compute this array class to speed up the next phase. Before computing the array class,
we assume that we have already sorted all the cyclic shifts of the current length L.
So, how do we actually compute the classes of the cyclic shifts when we already know their order? Let's
look at the example of the sorted characters of the string. We know already that the characters are
sorted, and their order is 6, 0, 2, 4, 5, 1, and 3. Now let's assign classes. We want to assign class 0 to the
smallest of the cyclic shifts of the current length, which is dollar, which is in position six. So, we write
0 in position six of the array class. And we initially set up the array class to be of length equal to the
length of the string, of course. The next smallest cyclic shift is the letter a, and it is different from the
previous smallest one, which is dollar. So we need a new equivalence class for a, and we assign 1 to
the equivalence class of a, which is in position 0 in the initial string. So we assign 1 to class[0]. The next
one is also a, this time in position two. But it is equal to the previous one, so we assign the same
equivalence class to it: we write down 1 as the value of class[2]. The next one is also a, again
equal to the previous one, so we assign 1 to class[4]. And we do the same with class[5].
The next one is b, which is different from the previous one, so we assign a new class which
is bigger by one, which is two. We find b in position 1, so we assign the value 2 to class[1]. And the last
one is also b; it is equal to the previous one, so again we assign 2 to class[3]. And now we know the
classes of all the single-character cyclic shifts. We know that the smallest one is dollar, and it is the
only one whose class equals 0. We know that the four a's are in equivalence class 1, and that the two
b's are in equivalence class 2.
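The class-assignment pass from this example can be sketched in Python (a sketch; the function and variable names are mine, and order is assumed to hold the positions of the already-sorted single characters):

```python
def compute_char_classes(s, order):
    """Assign equivalence classes to the single-character cyclic shifts of s."""
    cls = [0] * len(s)
    cls[order[0]] = 0  # the smallest character gets class 0
    for i in range(1, len(s)):
        if s[order[i]] != s[order[i - 1]]:
            # a strictly bigger character starts a new equivalence class
            cls[order[i]] = cls[order[i - 1]] + 1
        else:
            # equal characters share the same class
            cls[order[i]] = cls[order[i - 1]]
    return cls
```

On the running example, compute_char_classes("ababaa$", [6, 0, 2, 4, 5, 1, 3]) returns [1, 2, 1, 2, 1, 1, 0]: class 0 for the dollar in position 6, class 1 for the a's in positions 0, 2, 4, 5, and class 2 for the b's in positions 1 and 3.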
Counting Sort
Note that we can sort not only integers using Counting Sort. We can sort any objects which can be
numbered, as long as there are not too many different objects. For example, if we want to sort
characters, and we know that the characters have integer codes from 0 to 255, such that smaller
characters have smaller integer codes, then we can sort them using Counting Sort on the integer
codes instead of the characters themselves. The number of possible values for a character differs
between programming languages, so find out the range of integer codes for characters in your
programming language of choice before using this in a Programming Assignment!
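A counting sort of the characters along these lines might look like the following sketch (the function name is mine, and one-byte character codes 0..255 are assumed, per the note above):

```python
def sort_characters(s):
    """Return the positions of s sorted by character, via counting sort."""
    alphabet_size = 256                  # assumption: one-byte character codes
    count = [0] * alphabet_size
    for ch in s:
        count[ord(ch)] += 1
    for code in range(1, alphabet_size):
        count[code] += count[code - 1]   # partial sums: end of each bucket
    order = [0] * len(s)
    for i in range(len(s) - 1, -1, -1):  # reverse pass keeps the sort stable
        code = ord(s[i])
        count[code] -= 1
        order[count[code]] = i
    return order
```

For example, sort_characters("ababaa$") returns [6, 0, 2, 4, 5, 1, 3], matching the order used in the class-assignment example above.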
Sort Doubled Cyclic Shifts
Hi, in this video you will learn how to implement the transition phase of the suffix array construction
algorithm. In the transition phase, you assume that you have already sorted the cyclic shifts of some
length L, and you know not only their order but also their equivalence classes; based on that, you
need to sort the cyclic shifts of length 2L. The main idea is the following. Let's denote by Ci the cyclic
shift of length L starting in position i, and by Ci prime the doubled cyclic shift starting in i, that is, the
cyclic shift of length 2L starting in position i. Then Ci prime is equal to Ci concatenated with Ci + L: we
just take string Ci, take string Ci + L, and put it after string Ci, and the total string of length 2L is equal
to string Ci prime. So to compare Ci prime with Cj prime, it is sufficient to separately compare Ci with
Cj, and Ci + L with Cj + L. And we already know the order of the cyclic shifts of length L. So instead of
comparing them directly, we can just look in the array of their order and determine which one comes
before the other; that one is smaller than or equal to the other one. We also have the array with
equivalence classes, so we can determine whether two cyclic shifts of length L are really equal or
different by looking in the array of equivalence classes and comparing their equivalence classes. So
basically we can compare two cyclic shifts of length L in constant time, and that is why we can sort
the doubled cyclic shifts faster.
For example, suppose S is our initial string, ababaa$, the current length L is 2, and position i is also 2.
Then Ci is C2, the cyclic shift starting in position 2, which is ab. Ci + L is C2 + 2, which is C4, which is aa.
And Ci prime is equal to abaa. So this is the connection between cyclic shifts of length L and 2L, and
how we combine C2 and C4 to get C2 prime. Now we have to think about the following problem.
Basically, we need to sort pairs of numbers, because each cyclic shift of length L corresponds to its
position in the order of all cyclic shifts of length L. We first sort by the second element of the pair, and
then we stably sort by the first element of the pair. If we do these two steps, our pairs will be sorted:
they will be sorted by the first element, and within equal first elements they will be sorted by the
second element, because they were initially sorted by the second element and the sort is stable, so
we didn't break the order of the second elements in the case when the first elements are the same.
So this is the idea for sorting pairs of objects.
Let's look at this example. Suppose our current length L is 2, and we have already sorted all the
cyclic shifts of length 2; they are shown on the right in sorted order.
Now, for each of the cyclic shifts of length 2, let's look at the cyclic shift of length 4 which ends in this
cyclic shift of length 2. So we take the two previous characters and add them to the left. So C4 prime
ends in C6; and we also look at C5: we take the two previous characters, and C3 prime ends in C5,
and so on. So we go two characters to the left from each of the cyclic shifts of length 2, and we get
a set of cyclic shifts of length 4. Now we have highlighted in yellow the first elements of the pairs,
which are also cyclic shifts of length 2. Those are not sorted, but we know their starting positions and
we know what the correct starting positions are in the sorted order. So we can reorder this list of
cyclic shifts of length 4 by the order of the first halves of the elements in this list using the known
order. And we will need to do so in a stable-sort fashion, so that if, for example, C2 prime is before C0
prime, and the first halves of C2 prime and C0 prime are the same, they need to stay in the same order
in the final sort.
And the same goes for C3 prime and C1 prime: they both start with ba, so when we sort by the first
half, C3 prime has to stay before C1 prime. That's our requirement. So suppose we managed to sort the
first halves in such a way, and we sorted the whole cyclic shifts of length 4 accordingly. What do we
get?
We actually get the sorted list of cyclic shifts of length 4. For those which differ in the first half, it's
obvious that they compare in the correct order. But for those which are the same in the first half,
their second half is also sorted, because it was sorted initially in the second column and we
implemented a stable sort. So C2 prime is still before C0 prime, and C3 prime is still before C1 prime.
So this is the idea.
SortDoubled Implementation
Now let's consider the pseudocode for the procedure SortDoubled, which will sort the doubled cyclic
shifts of length 2L, given the string S, the current length L, the order of the current shifts of length L,
and their equivalence classes in the array class. We start by initializing the array count with a zero
array of size equal to the length of the string. This is the standard array for counting sort, but as
opposed to the SortCharacters procedure, it will sort not characters but equivalence classes of cyclic
shifts of length L. And there are at most length-of-S different equivalence classes; that's why we
initialize the array with size equal to the length of the string, as opposed to SortCharacters, where we
initialized it with the size of the alphabet.
We'll also need another array, newOrder, which will store our answer: it will be the order of the
sorted doubled cyclic shifts. We initialize it with an array of size length of S. The next two for loops are
the standard for loops of counting sort, where we first count the number of occurrences of each
equivalence class of the single cyclic shifts, and then compute the partial sums of that counting array.
The last for loop of the counting sort needs to go through the array we're going to sort from the
end to the beginning, and that is important for the sort to be stable. So we need to go through the
array of doubled cyclic shifts, which are initially sorted by their second half, in reverse order.
But we don't want to actually build this array of doubled cyclic shifts and then go through it in
reverse order. We want to build this array only in our head, and in the code we just go through it in
reverse order. So how do we do that? Remember that we have the array order, and if we go in the
direct order of this array, we go through all the cyclic shifts of length L in increasing order.
What we need instead is, first, to go not through the cyclic shifts of length L, but through the cyclic
shifts of length 2L which start exactly L positions counterclockwise from those.
And that is why we decrease order[i] by L, add the length of the string, and take the result modulo
the length of S, just because we're going around a circle.
And we need to go downwards from the last i to the first i, because we need to go in reverse
order. So these two lines, the for loop for i from length of S minus 1 down to 0, and the line which
assigns the variable start to (order[i] - L + |S|) mod |S|, together go with the variable start in reverse
order through the array of doubled cyclic shifts sorted by their second half.
So start goes through the starting positions of those doubled cyclic shifts in reverse order. Everything
else that happens in this for loop is just regular counting sort. We take the class of the start position,
which is the class of the first half of the corresponding doubled shift, by which we want to sort.
Then we go and decrease the partial sum corresponding to that equivalence class in our counting
array.
And then we just put our start in the position which the counting sort prescribes to it.
So these three lines, getting the class of the start position, decreasing the partial sum, and assigning
start to the position count[class], are the three standard lines of counting sort. The difference here is
that start goes in reverse order through the array of doubled cyclic shifts sorted by their second half,
and that instead of comparing characters or something else, we compare equivalence classes of the
single cyclic shifts.
So this is what this last for loop does. And in the end, what we have is the array newOrder, which
contains the doubled cyclic shifts which were initially sorted by their second half and then sorted, by
counting sort, by their first half.
And so now they are sorted by the first half, and the counting sort was stable, so in case their first
half is the same, they're also sorted by the second half, because they were sorted by the second half
initially. So newOrder finally contains all the doubled cyclic shifts in the correct, sorted order.
So this is the function that sorts all the doubled cyclic shifts. The running time of this procedure is
linear, because this is basically the regular counting sort. Although it sorts very complex objects, in
practice, in the code, it just sorts integers, the equivalence classes of the single cyclic shifts, and it
does so in the running time of counting sort, which is proportional to the number of items plus the
number of different values. The number of items is equal to the length of the string, and the number
of different values of the classes is also at most the length of the string. So, all in all, those three for
loops run in linear time. In the next video, we will talk about how to update the classes of those
doubled cyclic shifts after they are sorted, and how to finally build the suffix array from scratch.
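Following the walkthrough above, SortDoubled might be sketched in Python like this (a sketch whose names mirror the pseudocode described in the transcript, not the course's official code):

```python
def sort_doubled(s, L, order, cls):
    """Sort cyclic shifts of length 2L, given order/classes of length-L shifts."""
    n = len(s)
    count = [0] * n                        # counting array over equivalence classes
    new_order = [0] * n
    for i in range(n):
        count[cls[i]] += 1
    for j in range(1, n):
        count[j] += count[j - 1]           # partial sums
    for i in range(n - 1, -1, -1):         # reverse order keeps the sort stable
        # start of the doubled shift whose second half is the shift order[i]
        start = (order[i] - L + n) % n
        count[cls[start]] -= 1
        new_order[count[cls[start]]] = start
    return new_order
```

On the running example, sort_doubled("ababaa$", 1, [6, 0, 2, 4, 5, 1, 3], [1, 2, 1, 2, 1, 1, 0]) returns [6, 5, 4, 0, 2, 1, 3], the order of the length-2 shifts from $a to ba used in the next video's example.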
Updating Classes
Hi, in this video you will learn how to update the equivalence classes of the doubled cyclic shifts after
sorting them. That will be the last step before we can actually present the whole algorithm for
building the suffix array. To update the classes, we need to compare the pairs of single shifts which
constitute the doubled cyclic shifts which we have just sorted. We have already sorted the pairs, so
we just need to go through them in order and compare each pair to the previous pair. If it's the same,
then we assign it to the same class; if it's bigger, then we create a new class and assign it to this pair.
To compare the pairs, we can compare them separately by first element and then by second element.
Of course, the elements of the pairs are cyclic shifts, and we don't want to compare them directly
character by character. But we already know the equivalence classes of the single cyclic shifts, and we
can just compare the equivalence classes instead of the cyclic shifts themselves. So we can compare
any two pairs of single cyclic shifts in constant time.
Let's look at an example.
S is our initial string and suppose we've already sorted the doubled cyclic shifts of length 2, and our
initial cyclic shifts were of length 1.
So we have our array class of the equivalence classes of the cyclic shifts of length 1, which are
basically letters. Remember that this array has one element equal to 0, which corresponds to the
dollar, and it is in position six. We have four elements equal to 1, which correspond to the letters a in
positions 0, 2, 4, and 5. And we have two elements equal to 2, which correspond to the letters b in
positions 1 and 3. So these are the equivalence classes of the single cyclic shifts. Now, for the doubled
cyclic shifts, we can write them down in order, because we've already sorted them. And we know the
newOrder, which is the order of the doubled cyclic shifts: they go 6, 5, 4, 0, 2, 1, 3, from $a to ba.
And along with each doubled cyclic shift, we'll also write down the pair of the equivalence classes of
its halves. For example, for $a, the equivalence class for dollar is 0 and the equivalence class for a is 1.
So it corresponds to pair 0, 1. And for ab, for example, the equivalence class of a is 1, and equivalence
class of b is 2. So we write down the pair 1, 2.
These are the pairs of the equivalence classes of the single cyclic shifts. And now we need to compute
the equivalence classes of the doubled cyclic shifts and write them down into the array newClass. To
do that, we go through the doubled cyclic shifts in sorted order using the array newOrder. We start
from the first one, which is $a, and we write down the value 0 for its class in position 6, because it is
in position 6, as we see from the array newOrder.
Then we proceed to the next doubled cyclic shift. To assign a class to it, we need to compare it to the
previous one. Of course, in this picture we could compare these doubled cyclic shifts directly and
determine that this one is different, but in practice, in the general case, we don't want to do that.
Instead of comparing the cyclic shifts directly, we compare the pairs of numbers written to the right
of them, and we see that the pair 1, 0 is different from the pair 0, 1.
And we do this comparison just by two comparisons of numbers instead of comparing full cyclic shifts.
Since this doubled cyclic shift is different, we need a new class for it, so we assign it class 1 and
write it into position 5, because this is the position of this doubled cyclic shift, as we see from the
array newOrder. Now we proceed to the next one, which is aa. We again compare it with the previous
one by pairs: 1, 1 and 1, 0 are different pairs, so we write down a new class again, class 2, in position
4, as given in the array newOrder.
Then we proceed to ab. It is again different: 1, 2 is different from 1, 1, so we create a new class 3 and
put it in position 0, as given by the array newOrder. Then we look at ab again, and it is the same as
the previous ab, as we see from the pairs 1, 2 and 1, 2, which are equal. So we don't need to create a
new class; we write down the same class 3 into position 2, as given by the array newOrder. Now look
at ba: it is different from ab, so we create a new class 4. And the second ba is of course equal to the
previous ba, so we write down 4 in position 3, as given by the newOrder array. So this is how the
updating of the classes works. Now let's look at the code.
The procedure UpdateClasses does exactly what we did in the example. It takes as input the array
newOrder, the order of the doubled cyclic shifts; the classes of the single cyclic shifts; and the length
L of the single cyclic shifts. It will return the array with the equivalence classes of the doubled cyclic
shifts as the result.
First we initialize the variable n with the size of newOrder. Basically, n will be equal to the length of the
string, but we don't have the string as an input, so we need a variable for its length. And we
initialize the array newClass of size n.
First we assign class 0 to the smallest doubled cyclic shift, which is given by newOrder[0]. Then
we go through all the doubled cyclic shifts from position 1 to n − 1, and we need to compare the
doubled cyclic shift number i with the doubled cyclic shift number i − 1. To do that, we first compute their
starting positions: cur is the starting position of the doubled cyclic shift number i, and prev is the
starting position of the previous one. We also need to compute the positions where their second halves
start, because we compare them half by half. So cur and prev are the
starting positions of the doubled cyclic shifts, and mid and midPrev are the starting positions of their
second halves. To compute them, we just move L positions clockwise: we add L
and take everything modulo n, which is the length of the string.
And now we do just what we did in the example. We compare the classes of the current position and
the previous position, and the classes of the starting positions of the second halves. If at least one of
the halves is different, it means that the pair is different from the previous one, and we need to create
a new class: we increase the current class by 1 and assign it to the current position. Otherwise, the pair is
the same as the previous one, and we don't need to create a new class; we just assign the same class
to the current position. And then we return the array with the new classes of the doubled cyclic shifts.
We state that the running time of this algorithm UpdateClasses is linear.
And that's easy to prove because, basically, we only have one for loop, with a linear number of
iterations and constant-time operations happening inside.
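The procedure just described can be sketched in Python. The lecture presents it as pseudocode; the function name update_classes and the example string ababaa$ are assumptions here, chosen to match the worked example above:

```python
def update_classes(new_order, cls, L):
    """Given the sorted order of the doubled cyclic shifts (new_order) and
    the equivalence classes of the single shifts of length L (cls), compute
    the equivalence classes of the doubled shifts of length 2L."""
    n = len(new_order)
    new_cls = [0] * n
    new_cls[new_order[0]] = 0  # the smallest doubled shift gets class 0
    for i in range(1, n):
        cur, prev = new_order[i], new_order[i - 1]
        # starting positions of the second halves, L positions clockwise
        mid, mid_prev = (cur + L) % n, (prev + L) % n
        if cls[cur] != cls[prev] or cls[mid] != cls[mid_prev]:
            # the pair of half-classes differs: open a new class
            new_cls[cur] = new_cls[prev] + 1
        else:
            # same pair of half-classes: reuse the previous class
            new_cls[cur] = new_cls[prev]
    return new_cls
```

For the string ababaa$ with L = 1, this reproduces the classes from the example: position 5 (a$) gets class 1, position 4 (aa) gets class 2, positions 0 and 2 (ab) share class 3, and positions 1 and 3 (ba) share class 4.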
Full Algorithm
Now to the full algorithm for building the suffix array, finally.
So procedure BuildSuffixArray takes in only string S and
returns the order of the cyclic shifts or of the suffixes of this string.
We assume that S already has $ at the end, and
$ is smaller than all the other characters in the string.
We start by sorting the characters,
the single-character cyclic shifts of S, and save the result in the array order.
We also compute the equivalence classes of those characters and
save the result in the array class.
And we initialize the current length as one.
And then we have the main loop, which proceeds while the current length is still
less, strictly less than the length of the string.
If it is, then we first need to sort the doubled cyclic shifts of length 2L.
And then we also need to update their equivalence classes so
that the next iteration can use them to again sort the doubled cyclic shifts.
And then we just multiply L by 2, and go on in our while loop
until we get to the situation when L is greater than or equal to the length of S.
And by that time, the array order will contain the correct order of all the full
cyclic shifts of the string S, which is the same as the correct
order of all the suffixes of the string S if it has a $ at the end.
And the running time of the BuildSuffixArray procedure is the length of S times the logarithm of that, plus the size of
the alphabet. The size of the alphabet appears because of the counting sort of the characters in the
beginning: we sort them in time proportional to the length of the string plus the size of the
alphabet. If we wanted, we could instead sort the characters in time S log S without using counting sort,
and then we could actually remove the alphabet term from the BuildSuffixArray
asymptotics. In practice, though, the alphabet is usually very small, so we don't need to do that, and
using counting sort is better than actually sorting the characters in S log S.
We also compute the classes of the characters in linear time after that.
In each while loop iteration, we do both the sorting of the doubled cyclic shifts and the update of their
classes in linear time.
And we have only a logarithmic number of iterations, because L is doubled every iteration, and as soon
as it reaches the length of S or more, we stop. So there is only a logarithmic number of iterations, and all in all
the while loop runs in time S log S. Adding the initialization cost to that, we get S log S plus the size of the
alphabet.
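Putting the pieces together, here is a runnable Python sketch of the whole doubling algorithm. Function names are assumptions, and the initial character sort is done with a plain sort rather than counting sort, which only affects the alphabet term in the running time; UpdateClasses is repeated here so that the sketch is self-contained:

```python
def sort_doubled(s, L, order, cls):
    """Stable counting sort of the doubled shifts of length 2L, given the
    order of the single shifts of length L (their sorted second halves)."""
    n = len(s)
    count = [0] * n
    new_order = [0] * n
    for c in cls:
        count[c] += 1
    for j in range(1, n):
        count[j] += count[j - 1]
    for i in range(n - 1, -1, -1):
        # the doubled shift starting L positions counter-clockwise has the
        # current shift as its already-sorted second half
        start = (order[i] - L) % n
        c = cls[start]
        count[c] -= 1
        new_order[count[c]] = start
    return new_order

def update_classes(new_order, cls, L):
    """Equivalence classes of the doubled shifts, comparing half-classes."""
    n = len(new_order)
    new_cls = [0] * n
    new_cls[new_order[0]] = 0
    for i in range(1, n):
        cur, prev = new_order[i], new_order[i - 1]
        mid, mid_prev = (cur + L) % n, (prev + L) % n
        if cls[cur] != cls[prev] or cls[mid] != cls[mid_prev]:
            new_cls[cur] = new_cls[prev] + 1
        else:
            new_cls[cur] = new_cls[prev]
    return new_cls

def build_suffix_array(s):
    """Order of the cyclic shifts of s; with a unique smallest '$' at the
    end, this is exactly the order of the suffixes of s."""
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i])  # sort single characters
    cls = [0] * n
    for k in range(1, n):
        cls[order[k]] = cls[order[k - 1]] + (s[order[k]] != s[order[k - 1]])
    L = 1
    while L < n:
        order = sort_doubled(s, L, order, cls)
        cls = update_classes(order, cls, L)
        L *= 2
    return order
```

For the lesson's example string, build_suffix_array("ababaa$") returns [6, 5, 4, 2, 0, 3, 1], matching the lexicographic order of its suffixes.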
Slides: 14_algorithmic_challenges_2_suffix_array.pdf (PDF file)
References
See chapter 4 in [CHL01] Maxime Crochemore, Christophe Hancart, Thierry Lecroq. Algorithms on
Strings, Cambridge University Press, 2001.
Review the lecture on Counting Sort. Also see this answer for an example of the difference
between stable and non-stable sorting algorithms. Counting Sort is a stable sort.
Hi, in this lesson you will learn how to build the suffix tree of a string from its suffix array in linear time.
At first we'll explore some connections between the suffix array and the suffix tree, and then we'll learn to
compute some additional information on top of the suffix array. And then, finally, we will use the suffix array and
this additional information, called the LCP array, to build the suffix tree.
First, recall the problem. It's very simple: you're given a string S, and you need to compute its suffix
tree. You actually already know how to do that, but the algorithm you know works in quadratic
time, so it will work only for short strings, maybe up to 1,000 or 10,000 characters. If you
want to build the suffix tree for strings of length in the millions or billions, you will need a much faster
algorithm. And after this lesson, you will know how to build the suffix tree in time proportional to the length of the
string times the logarithm of that length, because you can build the suffix array in this time and then
construct the suffix tree from the suffix array in linear time.
So the general plan is to construct the suffix array in time S log S, then compute some additional
information called the LCP array in linear time, and then, given both the suffix array and this additional
information, construct the suffix tree in linear time. First, let's explore how the suffix array and the suffix tree
are connected. Here we have a string S, ababaa$. Again, we append $ at the end, which is smaller
than any other character of the string, both to build the suffix array and then to build the suffix tree from it.
And on the left we have, in a column, all the suffixes of the string S sorted in lexicographic order.
That is basically the suffix array.
And on the right, we have the fully built suffix tree of the string, which is already compressed, so that
on the edges we have not single letters but whole substrings of the string S. By the way, an
interesting question is how we store the suffix tree. We shouldn't, of course, store the substrings that
are written on the edges directly, because that could lead to quadratic memory usage, and we want
linear memory usage. So instead of storing the substrings themselves, we just store the start index
and the end index of the corresponding substring. So for each edge, we store two
indexes: where that edge's label starts in the string and where it ends. And to store the nodes,
we just store, for example, an array of pointers to the children nodes, and that array is indexed by the
first character of the edge outgoing from this node into the child.
And we can store the information about the edge itself in the node this edge leads to from
its parent.
This is one of the ways to store everything but you may organize everything in another way.
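As a concrete illustration of this storage scheme, here is a minimal Python sketch. The class layout is just one possible choice, not the lecture's definitive representation:

```python
class SuffixTreeNode:
    """A node of the suffix tree. The label of the edge coming into this
    node from its parent is stored as a pair of indexes into the string,
    not as the substring itself, so each edge takes O(1) memory."""
    def __init__(self, edge_start=-1, edge_end=-1):
        self.edge_start = edge_start  # where the edge label starts in s
        self.edge_end = edge_end      # one past where it ends in s
        self.children = {}            # first character of edge -> child node

def edge_label(s, node):
    """Recover an edge label on demand from the stored indexes."""
    return s[node.edge_start:node.edge_end]
```

For example, for s = "ababaa$", a node with edge_start = 5 and edge_end = 7 represents an edge labeled a$ without ever copying that substring.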
The important thing is that you shouldn't store edges as substrings. So what corresponds in the suffix
tree to the suffix array elements?
Let's take the first element of the suffix array. It corresponds to a suffix of the string S, which in turn
corresponds to a leaf in the suffix tree, and also to the path from the root vertex to that
leaf. So the first element of the suffix array corresponds to the path
highlighted in blue. If we go to the next element of the suffix array, we get another path,
from the root vertex to leaf number 1. And the next element gives the path from the
root vertex to leaf number 2. Note that the indexes of the leaves and the indexes of the
suffixes are just in sorted order; those are not positions in the string S, but the numbers of the
suffixes in increasing order, from 0 to the number of suffixes minus 1. So each
element of the suffix array corresponds to some path from root to leaf in the suffix tree — that much
we know. Unfortunately, that is not yet sufficient to build the tree from the suffix array, because there
are many ways to create paths from the root to different nodes corresponding to the suffixes of
the suffix array. So we will need an additional property.
And the additional property we will need is called the longest common prefix, often abbreviated as lcp.
So the lcp of two strings S and T is the longest string u that is both a prefix of S and a prefix of T. And we
denote by LCP(S, T) the function which returns the length of the lcp of the strings S and T. For
example, LCP("ababc", "abc") is 2, because their longest common prefix is ab, and its length is 2.
And LCP("a", "b") = 0, because their longest common prefix is empty.
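In code, the function LCP is just a direct character-by-character scan. A minimal sketch, with the name lcp_length as an assumption:

```python
def lcp_length(s, t):
    """Length of the longest common prefix of strings s and t."""
    k = 0
    # advance while both strings still have a matching character
    while k < len(s) and k < len(t) and s[k] == t[k]:
        k += 1
    return k
```

This reproduces the two examples above: lcp_length("ababc", "abc") is 2 and lcp_length("a", "b") is 0.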
Now let's look again at the suffix array and suffix tree and also take into account LCP between the
neighboring elements of the suffix array. So when we look at the first element of the suffix array, we
just have an edge corresponding to it in the suffix tree. And when we have the next element, we have
a path from root to another vertex. But if we compute the longest common prefix of this element of
the suffix array with the previous element of the suffix array, we'll see that this longest common
prefix is empty. And that corresponds to empty intersection, between the previous path and the new
path. The only common node is the root node, and they don't have any edges in the intersection.
However, if we proceed to the next suffix,
it has a common prefix of length 1 with the previous suffix. And it corresponds to the common path in
the tree highlighted in yellow, starting in the root node and going through edge a to another node
which is still a common node for the current suffix and the previous one. So this is how LCP
corresponds to the tree. If you go to the next suffix, it again has the same longest common prefix with
the previous suffix. So we have the same common path from root to the next node by edge a and
then the part of the path is different from the current suffix and from the previous one.
If we go to the next suffix, their longest common prefix with the previous one is even longer, and so
the common part of the path is now
consisting of three nodes and two edges. A root node, next node by edge a and next node by edge ba.
And the rest of the path is unique to the current suffix. If we go to the next suffix, it again doesn't
have any common prefix with the previous one, so the only common part of the path is the root node.
And the next suffix has a longest common prefix of ba, which is why we see the path from the root to another
node via edge ba; this is the common part of the path for the current suffix and the previous one.
So we see that basically all the nodes but the leaves correspond to the longest common prefixes
of neighboring suffixes in the suffix array. And this is how we can actually build the suffix tree: by
first computing the longest common prefixes of the neighboring elements in the suffix array, and then
building those internal nodes. Along the way, we will also build the leaves as the ending
points of the paths corresponding to the suffixes from the suffix array. So this is the plan of what we'll
do. But first, we need to compute those longest common prefixes for the elements of the suffix
array.
LCP Array
So let's define the LCP array. Consider the suffix array A of a string S in the raw form: A[0] is a
suffix, A[1] is a suffix, and so on up to A[|S| − 1]; all those are the suffixes of S in lexicographic order. Then the LCP
array of the string S is the array lcp of size |S| − 1. It contains one fewer element than the suffix
array and than the string itself. Each element lcp[i] is just equal to the length of the longest common
prefix of A[i] and A[i + 1]. So it's the longest common prefix of two neighboring elements
in the suffix array, and what we want is to compute the values of this array.
For example, if we have our string ababaa$, then we first compute the longest common prefix of $
and a$, which is 0. Then we compute the longest common prefix of a$ and aa$, which is a, of length 1.
Then it's again a, of length 1. Then it's aba, of length 3. Then it's empty. And then it's ba, of length 2. So
the LCP array for this string is 0, 1, 1, 3, 0, 2.
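These example values can be checked with a direct, quadratic-time computation over the sorted suffixes. This naive sketch (the name lcp_array_naive is an assumption) is for checking only; the rest of the lesson develops a linear-time method:

```python
def lcp_array_naive(s):
    """Quadratic-time LCP array: compare each pair of neighboring
    suffixes in the suffix array character by character."""
    suffixes = sorted(s[i:] for i in range(len(s)))  # the suffix array, as strings
    lcp = []
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        lcp.append(k)
    return lcp
```

For ababaa$ this yields [0, 1, 1, 3, 0, 2], as computed above.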
And the central LCP array property, which will enable us to compute it fast, is that for any indices i
and j in the suffix array with i < j, the longest common prefix of A[i] and A[j], which
are far from each other, is not bigger than lcp[i], which is basically the longest common prefix of
A[i] and the next element. So what this Lemma says is that the LCP of two neighboring
elements is always at least as big as the LCP of the first of them with any of the later elements.
And the same goes the other way: the LCP of two neighboring elements is at least as big as the LCP of
the second of them with any of the earlier elements.
And to see that, let's look at a hypothetical example: we have some long suffix array, and
elements i and i + 1 are here. There is also some element j farther in the suffix array. And we see
that the common prefix of suffixes i and i + 1 is indeed pretty long.
It's not so long with suffix j — it's only of length two. But this example doesn't yet prove anything.
Maybe for some other suffix number i + 1, it could be that the common prefix of
i and j is bigger than the common prefix of i and i + 1. So let's suppose that. We don't know what
suffix i + 1 is, so we just fill it with many x's, where x is an unknown letter. We know that the LCP of i and j
is equal to 2. So let's consider k, the length of the longest common prefix of A[i] and A[i + 1],
and suppose that it is smaller than 2 in this case.
So how can that be? One variant is that A[i + 1] is shorter than 2 characters, and then A[i + 1] is actually a prefix of
A[i]. But in this case, A[i + 1] is smaller than A[i], which contradicts the property of the suffix array
that the suffixes are sorted.
And if suffix i + 1 is sufficiently long, then it follows that its kth character is different from the kth
character of both the ith suffix and the jth suffix. And then there are again two cases. The first case is
that this character in suffix i + 1 is bigger than the corresponding one in suffixes i and j. But from this it
immediately follows that suffix i + 1 is bigger than suffix j, which contradicts the suffix array properties,
so it is impossible.
And the other case is that this character is less than the corresponding character in both suffixes i and j.
But in this case, it immediately follows that A[i] is bigger than A[i + 1], which again contradicts the suffix
array property. So in all cases we found a contradiction, and so it is not possible that the longest
common prefix of i and j is bigger than the longest common prefix of i and i + 1. And this proves the LCP
array property, because for the symmetric case the proof is analogous.
Now, how do we compute the LCP array? One variant is, for each i, to compare A[i] and A[i + 1]
character by character and compute the LCP directly.
But this works in linear time for each i, and in total it will be quadratic in the length of the string. We
want to compute everything in linear time. So how do we do this faster? You will learn that in the next
video.
Hi, in this video you will learn how to compute the LCP array in linear time.
And the main idea is the following.
We'll start by computing LCP of the first two smallest suffixes directly by
comparing them character by character.
But then, on each next iteration,
instead of going to the next pair of suffixes in the suffix array,
we move the smaller suffix one position to the right in the string and
then compute its LCP with the next suffix in the suffix array.
So we won't go in the usual order through the suffix array; we will go in some strange order.
But this order is good.
It turns out that if we go in this order through the smaller suffixes,
then the LCP of the smaller suffix and the next suffix decreases by,
at most, one on each iteration.
And so, we will know that we have already compared most of the characters of the two new suffixes,
and we don't need to compare them again.
We'll start from there and compare the next character, and
the next one, directly. The LCP itself will be very easy to compute,
because we will still compute it by direct comparison of characters with characters.
We will just avoid some of the comparisons,
because we will know from the previous iterations that the common prefix has
at least a certain length, so we don't need to compare the first that many characters.
And in the end, it turns out this will work in linear time.
So, for each suffix A[i] of the suffix array, we will consider the suffix starting at the next position in the string
after A[i]. The next one in the suffix array would be A[i + 1], but that is not the one we mean —
we mean the suffix which starts in the string one position to the right.
So here's an example. We have the string ababdabc, and the smallest suffix is ababdabc — the
whole string is actually the smallest suffix — and the next one in the suffix array is abc. Their longest
common prefix is ab, and here we see it. And we compute this longest common prefix, of length
two, directly.
And then we know that if we move to the next two suffixes in the string — the next one after A[0]
and the next one after A[1] — those will both start with the letter b. So the length of the common prefix
decreased by at most one, because both suffixes just moved one position to the right, and we cut away
only one position of the longest common prefix. Of course, in the general situation these two suffixes are
probably not neighboring suffixes in the suffix array; there might be some suffix
between them. But because of the LCP array property, the longest common prefix of the
smaller suffix with the next suffix in the suffix array will be at least as big as its
longest common prefix with the next suffix in the string, because the next suffix in the string is bigger
than the smaller one, and by the LCP array property, the common prefix with the next element is
the same or bigger than the common prefix with an element farther away in the suffix array. So
now we can move to the next position from the smaller suffix, take the next one to it in the
suffix array, and compute their LCP directly, remembering that we don't need to compare the first
several characters — exactly those which are in the LCP of the previous pair. So this is basically the
algorithm.
We compute LCP(A[0], A[1]) directly and save its value in the variable lcp. Then, on each
iteration, the first suffix in the pair, which is smaller, moves to the next position in the string; then we find
which suffix is next to it in the suffix array order, and we compute their longest common prefix, knowing that
we don't need to compare the first lcp − 1 characters.
And then, on each comparison, if it's successful, we increase lcp and go to the next comparison,
and we repeat that until we fill the whole LCP array. The idea is that each successful
comparison increases lcp, and when we move to the next pair, we decrease lcp by at most one.
And this is why we cannot do too many comparisons.
So the Lemma states that this algorithm computes LCP array in linear time.
And that is now easy to prove, because consider each comparison we make between one suffix and another.
It either finishes the iteration — and the number of such comparisons is at most the number of iterations,
which is at most the length of the string — or it is a successful comparison, and then it increases the
current value of the variable lcp.
And the variable lcp can never be bigger than the length of the string, and at each
iteration, lcp decreases by at most one. So if we start from zero, cannot go higher than the length of S,
and decrease by at most one per iteration, we cannot increase lcp more than a linear number of times,
and so we cannot do more than a linear number of comparisons. And this is why this algorithm
works in linear time. And now that we can compute the LCP array, we can proceed in the next
video to construct the suffix tree given the suffix array and the LCP array.
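The whole procedure can be sketched in Python. This is essentially the algorithm just described — known in the literature as Kasai's algorithm — with variable names chosen here as assumptions:

```python
def compute_lcp_array(s, order):
    """Linear-time LCP array from the string and its suffix array.
    lcp[i] = length of the longest common prefix of the suffixes
    starting at order[i] and order[i + 1]."""
    n = len(s)
    lcp = [0] * (n - 1)
    pos = [0] * n                  # pos[start] = rank of that suffix
    for rank, start in enumerate(order):
        pos[start] = rank
    k = 0                          # LCP carried over from the previous pair
    for i in range(n):             # move the smaller suffix right in the string
        if pos[i] == n - 1:        # the largest suffix has no next neighbor
            k = 0
            continue
        j = order[pos[i] + 1]      # next suffix in the suffix array
        # the first k characters are known to match, so skip them
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        lcp[pos[i]] = k
        k = max(k - 1, 0)          # moving right loses at most one position
    return lcp
```

For ababaa$ with suffix array [6, 5, 4, 2, 0, 3, 1], this returns [0, 1, 1, 3, 0, 2], the LCP array computed earlier in the lesson.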
We encourage you to review the slides attached here, as they contain an additional example
regarding LCP array construction and the pseudocode in the section "LCP Array Computation".
We encourage you to review the slides attached here, as they contain the pseudocode for suffix
tree construction from the suffix array and the LCP array in the section "Constructing Suffix Tree".
References
[CP15] Phillip Compeau, Pavel Pevzner. Bioinformatics Algorithms: An Active Learning Approach,
2nd Ed. Vol. 1. Active Learning Publishers. 2015.
Welcome to your third programming assignment of the Algorithms on Strings class! In this
programming assignment, you will be practicing implementing very efficient string algorithms.
How to submit
When you're ready to submit, you can upload files for each part of the assignment on the "My
submission" tab.