
Algorithms on Strings

by University of California San Diego & National Research University Higher School of Economics

Timeline

START 07/13

Week 1: Programming Assignment 1, due Jul 19
Week 2: Programming Assignment 2, due Jul 26
Week 3: Quiz: Exact Pattern Matching, due Aug 2
Week 4: Quiz: Suffix Array Construction and Programming Assignment 3, due Aug 9

END 08/16


Instructor's Note
Thanks for signing up for the Algorithms on Strings class! We are excited that the class is beginning and look forward to interacting with you!

This course is part four of the Algorithms and Data Structures specialization. Although we recommend signing up for the entire specialization, you can still take this class separately even if you haven't taken the previous parts.

In a few years, you may have your genome sequenced in a doctor's office as part of a routine medical procedure. But how can your doctor find the mutations that make your genome different from other genomes and figure out which of them are implicated in diseases? Algorithmically, this problem is not very different from tasks you face every day: spell-checking your documents or searching the internet. In this course, you will learn about suffix trees, suffix arrays, and other brilliant algorithmic ideas that help doctors find differences between genomes and power lightning-fast internet searches.

We look forward to seeing you in this class! We know it will make you a better programmer.
https://www.coursera.org/learn/algorithms-on-strings/lecture/avHa3/welcome

Week 1

Suffix Trees

How would you search for the longest repeat in a string in LINEAR time? In 1973, Peter Weiner came up with a surprising solution based on suffix trees, the key data structure in pattern matching. Computer scientists were so impressed with his algorithm that they called it the Algorithm of the Year. In this lesson, we will explore some key ideas for pattern matching that will, through a series of trials and errors, bring us to suffix trees.

Key Concepts
- Develop a program to build a trie out of a list of strings
- Develop a program to search for multiple patterns in a string using a trie
- Develop a program to build the suffix tree of a string
- Apply suffix trees to find the longest non-shared substring of two strings
From Genome Sequencing to Pattern Matching

Video: Welcome (4 min)
Video: From Genome Sequencing to Pattern Matching (8 min)
Video: Brute Force Approach to Pattern Matching (2 min)
Video: Herding Patterns into Trie (5 min)
Reading: Trie Construction - Pseudocode (10 min)
Video: Herding Text into Suffix Trie (6 min)
Video: Suffix Trees (4 min)
Practice Quiz: Tries and Suffix Trees (5 questions)
Reading: FAQ (10 min)
Reading: Slides and External References (10 min)

Programming Assignment

Reading: Available Programming Languages (10 min)
Reading: FAQ on Programming Assignments (10 min)
Programming Assignment: Programming Assignment 1

From Genome Sequencing to Pattern Matching


Welcome

String algorithms are everywhere. They underlie both things that you use every day and some exciting cutting-edge research. One of the research areas that makes heavy use of string algorithms is bioinformatics. It studies the DNA and genes of humans, animals, and primitive organisms. Genes determine how an organism will develop and which genetic diseases are likely to happen. DNA is a complex three-dimensional structure, and you can see an example of part of it on the left: it consists of two intertwined strands. We can extract many of the properties of the DNA if we take just one of those strands and record the sequence of the molecules in it. If we consider the types of those molecules, called nucleotides, and denote each of them by a Latin letter, then we can convert a strand into a string of Latin letters. What we can do next is study those strings. For example, we can put together a string for a human DNA, for a chimp DNA, and for a mouse DNA, and then find the parts which are the same and the parts which differ. This way, we can observe how life on Earth evolved and what changed in the genes. Another thing we can do is put the genome of a person with a rare genetic disease next to the genome of a healthy person, find the differences, and then make hypotheses about which mutations in the genes cause this genetic disease, which can help to cure it later. You will hear more about this later in the course from Pavel Pevzner, who has been working in the field of bioinformatics for decades.

Moving on to everyday things: depending on where you are now, you probably use one of these search engines every day: Google in most of the world, Yandex in Russia, Baidu in China, or Naver in South Korea. Search engines crawl the Internet and download petabytes of data that they find there. Then, when you type in your search query, they mostly use the text of those documents to match it against your query. Those texts are represented as strings, and lots of string algorithms are used in the process.

Another example is your favorite text editor: when you launch spell-checking or just try to find something in your text, string algorithms are working. A less obvious example is the software which protects our computers and computer networks. Anti-virus software looks for suspicious patterns in the code of the programs that you want to launch on your computer, and network intrusion detection systems look for suspicious patterns in the network traffic. So string algorithms are a great tool for pattern matching; both exact and approximate string algorithms are used in this software.

Last, but not least, software engineers are needed to implement all of the things I've shown you, and they need their own tools. If you've ever participated in a programming project within a team, then you've probably reviewed some changes made by one of your teammates, using a version control system and the diff tool built into that system, like the one in the example with Wikipedia. To figure out what was changed in the code or in the Wikipedia article without being told explicitly, string algorithms are used for aligning and matching parts of the text.

In this course, we will start with applications of string algorithms to bioinformatics. You will hear about that from Pavel Pevzner, who also has a whole separate Bioinformatics specialization on Coursera. You will see how basic string algorithms can be applied to solve problems which arise in bioinformatics. Then, in the second part of the course, you will meet me again, and we will work through some algorithmic challenges around how to make those algorithms run really fast: fast enough, for example, to apply them to genomes which consist of millions or even billions of characters, or to text on the Internet, in your text editor, and so on. Also, at the end of the specialization, you will have a capstone project called Genome Assembly, where you will apply string algorithms, graph algorithms, and other algorithms to build a genome from a million pieces. See you later in this course.

From Genome Sequencing to Pattern Matching



Hello. I haven't seen you for a long time, not since we were working on the change problem. But today we'll work on a completely different topic: string algorithms. String algorithms are everywhere. Every time you spell-check your documents or Google something, you execute sophisticated string algorithms. But today we'll talk about a very different application of string algorithms. Sam Berns gave a fantastic talk when he was 16. He was talking about his life, and a year later he died.

Sam was suffering from a rare genetic disease called progeria. Children with progeria often have above-average intelligence, look old already at the age of ten, and usually die in their teenage years.

For many years biologists had no clue what causes progeria. But in 2003, they figured out that progeria is caused by a single mutation on chromosome 1. To understand what pattern matching has to do with progeria, we need to learn something about genome sequencing.
When my children were young, this is how I explained to them how genome sequencing works, using the example of the newspaper problem. Take many copies of the same issue of the New York Times and set them on a pile of dynamite.

Don't do this at home. Then wait until the explosion is over and collect the remaining pieces. Of course, many pieces will burn in the explosion, but some pieces will remain, and your goal is to reconstruct the content of the New York Times. A natural way to solve the newspaper problem is to treat it as an overlapping puzzle: look at different pieces and try to stitch them together.

And then, slowly but surely, you will hopefully be able to assemble the whole newspaper. That's roughly how the human genome was assembled in 2000. Here, Bill Clinton is congratulating Craig Venter, one of the leaders of the Human Genome Project, on the completion of this $3 billion mega science project. We don't need to know much about genomes for the rest of these lectures. The only thing we need to know is that a genome is simply a long string over the A, C, G, T alphabet.
Let me explain how the newspaper problem translates into genome sequencing. We start from millions of identical copies of a genome. Then we break the genome at random positions using molecular scissors (which don't look quite like the ones shown in this picture). Then we generate short substrings of the genome, called reads, using modern sequencing machines. Of course, during this generation some reads are lost. The only thing left is to assemble the genome from millions, or maybe even billions, of tiny pieces: the largest jigsaw puzzle humans have ever attempted to solve. Today I won't be able to tell you about algorithms for genome assembly, but if you are interested in learning about them, you can take our Bioinformatics specialization on Coursera or read the book Bioinformatics Algorithms.

Assembling the human genome was a challenging $3 billion project, and afterwards the era of genome sequencing began. But in the first ten years after sequencing the human genome, biologists were able to sequence only about ten other mammalian genomes, because it was still difficult.
However, five or six years ago the so-called next-generation sequencing revolution happened, and today biologists sequence thousands of genomes every year.

Why do biologists sequence thousands of species? There are many applications. For example, the next big sequencing project after the human genome was the mouse genome, because we can learn a lot about human biology and diseases from mouse genes. Another important application is in agriculture: by sequencing the rice genome, biologists were able to develop new high-yield crops of plants like rice. And there are hundreds and hundreds of other applications.
But recently, in addition to sequencing many species, there has also been much effort on sequencing millions of personal genomes. The things that make us different are mutations. There are surprisingly few mutations that distinguish my genome from your genome, roughly one mutation per thousand nucleotides. However, these mutations make a big difference: they account for height, and they account for more than 7,000 known genetic diseases.

Five years ago, the era of personalized genomics started, and Nicholas Volker is a poster child of personalized genomics. He was so sick that he went through many surgeries, but doctors still were not able to figure out what was wrong with this kid. However, after his genome was sequenced, it revealed a mutation in a gene linked to a defect in the immune system. Doctors applied immunotherapy, and Nicholas Volker is now a healthy child. However, sequencing personal genomes from scratch still remains difficult even today. What biologists do today instead is so-called reference-based human genome sequencing. We start from the Craig Venter genome assembled in 2000 and call it the reference genome. Then we start sequencing my genome by generating all reads from my genome. Some of these reads match the reference genome perfectly, but some of them don't. And based on the reads that do not match, we can figure out what my genome is. For example, we can find a mutation of T into C and a deletion of T in my genome as compared to the reference genome.
This brings us to a number of computational problems. The easiest one is exact pattern matching: given a string Pattern and a string Text, find all positions in Text where Pattern appears as a substring.

But our goal is to find mutations, and therefore we also want to solve the Approximate Pattern Matching Problem, where the input is a string Pattern, a string Text, and an integer d, and we want to find all positions in Text where Pattern appears as a substring with at most d mismatches.

I think you already have some good ideas on how to solve this rather simple problem. But think about this: even if you have a fast algorithm for solving this problem, would you be able to solve the next one? To answer where the billions of reads generated from my genome match the reference genome, we arrive at the Multiple Pattern Matching Problem: given a set of strings Patterns and a string Text, find all positions in Text where a string from Patterns appears as a substring.

Brute Force Approach to Pattern Matching


So let's develop the first, brute force approach to pattern matching. First, let's put our pattern into a car and drive this car along the text. While we are driving, each time we see our pattern appear within the text, we report that there is an occurrence of the pattern in the text. How does it work? Let's try to drive nana along panamabananas. The first letter doesn't match, so we need to drive further. The first letter again doesn't match, so we drive further. Now n matches n, a matches a, but m doesn't match n, so we drive further. We continue, continue, continue, and now there is again a match: n, a, n, a. We found the pattern, and then we continue further. Problem solved. This approach is actually quite fast: it takes only O(|Text| · |Pattern|) time, because at every position along the text it may take up to |Pattern| symbol comparisons to figure out whether the pattern matches the text at that position. And Misha will tell you later how the Knuth-Morris-Pratt algorithm speeds up this naive brute force algorithm, running in O(|Text|) time, independently of the length of the pattern. So it looks like we succeeded. Can we go home?
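For concreteness, here is a minimal Python sketch of this brute force (sliding window) approach; the function name and the example call are mine, not part of the course materials.

    def brute_force_match(text, pattern):
        """Return all starting positions where pattern occurs in text."""
        positions = []
        for start in range(len(text) - len(pattern) + 1):
            # Compare the pattern against the text window starting here;
            # this comparison may cost up to |Pattern| symbol comparisons.
            if text[start:start + len(pattern)] == pattern:
                positions.append(start)
        return positions

    print(brute_force_match("panamabananas", "nana"))   # [8]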
But wait a second. Let's see how this algorithm would work for billions of patterns. It turns out that for billions of patterns it will take O(|Text| · |Patterns|) time, where |Patterns| is the total length of all patterns. If we apply it to my genome, then the text will be 3 billion nucleotides long, and the total length of the patterns may be as large as 10^12. So neither the naive algorithm nor even the Knuth-Morris-Pratt algorithm will work.
Should we give up?
Herding Patterns into Trie

What will be the number of nodes (including the root node) in the trie constructed from the patterns ATAGAATAGA, ATCATC, and GATGAT?

The brute force approach to pattern matching is slow when we try to match billions of patterns. Let's try to figure out why. If we put each pattern in its own car, then we first drive the first car, then the second car, then the third car, the next car, and the next car, and that's why it takes a lot of time. Here's a new idea: let's pack all patterns into a bus, and let's drive this bus along the text. But how do we construct this bus?

Let me show you how we can construct the bus from multiple patterns. Let's start from the first pattern and represent it as a path in a tree. Continue to the next pattern, continue to the next pattern, and continue to the next pattern. So far it was easy and not interesting: we have four patterns, and we constructed four paths from the root of the tree. Let's go to the next one, antenna. Now, the first letter of antenna already appears on a path from the root, right here. The second letter also appears on a path from the root. And then we need to branch the previous path into two paths to construct the path for antenna. Now let's do bandana. For bandana we proceed further, and now we again have to branch the path. Continue with ananas, again branching. And finally continue with nana, branching again. What we've constructed is actually our bus, which is called the trie of the patterns.

How do we use this bus? How would we drive it along the text?

Well, we'll use TrieMatching. We will drive this whole trie along the text, and at each position of the text we will walk down the trie by spelling symbols of the text. A pattern from the set Patterns matches the text each time we reach a leaf. Let me show you how it works. Our bus is now looking at the first letter of the text, p, so we walk along the edge labeled by p; the next letter is a, we walk along this edge; the next letter is n, we walk along this edge, and we have found that the pattern pan appears in panamabananas. Next, we move to the next letter of the text and start matching again: a, n, a, and now there is no match, so at this position there is no match between the patterns and the text. Continue further: n, a. Once again, there is no match. a: once again, there is no match. For m, there is no match from the very beginning. We continue further. a: once again, there is no match. Let's try here: b, a, n, a, n, a, we came to a leaf, so we found the pattern, but we have to continue further. Further. Further. Further. Found the pattern again. Continue further. Again found the pattern. And now there are no more matches. So in a single run of our bus we found all matches of the patterns against the text.

Actually, I haven't finished yet. We also have to try matching n, a, s. No. a? No. Now we are done.

Our bus is very fast. Recall that the runtime of our brute force approach was O(|Text| · |Patterns|), where |Patterns| is the total length of all patterns. That's why it was slow: the total length of all patterns is huge in the case when we try to match reads against the genome. But the runtime of TrieMatching is only O(|Text| · |LongestPattern|), and in modern sequencing the reads typically have length of only 200 to 300 nucleotides.

So it looks like we are finally done. Should we go home?

We are not ready to go home just yet. Note that the trie we constructed has 30 edges, but in general the number of edges in a trie is O(total length of the patterns). For the human genome, the total length of the patterns will be in the trillions. So unfortunately, our algorithm will be impractical for read matching.

Should we give up?

Trie Construction - Pseudocode


You may find it useful before implementing some of the problems in the Programming Assignment
to look closer at the pseudocode for the algorithms discussed in the lectures.

Here is the pseudocode for constructing a trie from a collection of patterns:
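The pseudocode itself did not survive the export of this page, so here is a Python sketch of the same idea, together with the TrieMatching routine from the lecture. The dict-based trie representation (node id mapped to its outgoing edges) and the function names are assumptions of this sketch, not the course's reference pseudocode, and the matcher assumes no pattern is a prefix of another.

    def build_trie(patterns):
        """Build a trie from a collection of patterns.

        The trie is a dict: node id -> {symbol: child node id}. Node 0 is the root.
        """
        trie = {0: {}}
        next_node = 1
        for pattern in patterns:
            current = 0
            for symbol in pattern:
                if symbol in trie[current]:
                    current = trie[current][symbol]        # follow an existing edge
                else:
                    trie[current][symbol] = next_node      # create a new edge and node
                    trie[next_node] = {}
                    current = next_node
                    next_node += 1
        return trie

    def prefix_trie_matching(text, trie):
        """Check whether some pattern stored in the trie is a prefix of text."""
        node = 0
        for symbol in text:
            if not trie[node]:                             # reached a leaf: a whole pattern was spelled out
                return True
            if symbol in trie[node]:
                node = trie[node][symbol]
            else:
                return False
        return not trie[node]                              # text ended exactly at a leaf

    def trie_matching(text, patterns):
        """Return all positions in text where some pattern from patterns occurs."""
        trie = build_trie(patterns)
        return [i for i in range(len(text)) if prefix_trie_matching(text[i:], trie)]

    print(trie_matching("panamabananas", ["pan", "nana", "banana"]))   # [0, 6, 8]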


Herding Text into Suffix Trie

We saw that using tries dramatically improves on the brute force approach with respect to speed, but becomes impractical with respect to memory. What should we do? Let's try a different crazy idea. Instead of packing the patterns into a bus, let's pack the text into the bus. Let me explain how we can do this. We will generate all suffixes of the text and form a trie out of all those suffixes: a rather large suffix trie. Then, for each pattern, we can check whether it can be spelled out from the root downward in the suffix trie. So we are building a very large bus this time. Let's see how this idea works for panamabananas.

Let's start by adding a dollar sign to the end of panamabananas; I will explain later why I add this strange dollar sign to the end of my string.

We start from the longest suffix of panamabananas$ and build the corresponding path in the trie. Continue, continue, continue further, continue, continue; so far there is no branching. Continue, continue, and now the first branch in our suffix trie appears. Continue further, new branches show up, and finally we have constructed something that we call SuffixTrie(Text). How can we use it for pattern matching? Let's take our pattern once again, put it in the car, and drive it along the branches of our suffix trie.

First we match the first symbol of the pattern, then the next symbol, then the next symbol. And finally, we have found a match of the pattern to one of the suffixes of the text, which means we have found a match of the pattern in the text.

If we use banana, it goes like this: we found banana. With nab, it goes like this: unfortunately, we don't find it, because there is no continuation for b. For antenna, we go this way and finally have to stop, because there is no match for t. So it looks like this suffix trie idea worked for us. But there is one important question we forgot to answer: where are the matches? How do we find where our patterns match the text? There is no information in the suffix trie yet that allows us to answer this question. Here's an idea: let's add some information to the leaves of our tree. But what information? For every leaf, let's add the starting position of the suffix that generated this leaf. For example, for bananas$, we will add position 6, because bananas$ starts at position 6 of the text.

Let's see how it works. For panamabananas$, we add 0, because this suffix starts at position 0. Then for anamabananas$ we add the corresponding position. Continue, continue, continue, continue, continue, continue, continue, continue, continue. Finally, the leaves of our tree get decorated with the positions of the suffixes in the text, and after we have looked at all suffixes, our tree has all the information we need to figure out where the positions of the patterns are. This is what is actually called the suffix trie: the structure described earlier, decorated with the positions of all the leaves in the text.

However, attaching the positions of suffixes to the leaves of the suffix trie doesn't yet help us figure out where the string bananas appears in the text. So what we want to do, once we find a match like the match of bananas, is to walk down to the leaf, or maybe leaves, in order to find the starting position of the match. Let's see how it works.

For banana, we ended in the middle of the trie, but we continue walking, continue walking, and finally we find that banana starts at position 6.

For ana, we continue walking, but there are three ways to continue walking towards the leaves: this is the first one, this is the next one, and this is another one. So in this case we find that ana actually appears three times in the text, and the three positions are shown at the top. So it looks like we have finally solved the problem of finding the positions of patterns in the text, which means we now have a fast algorithm for solving the problem.
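A compact Python sketch of this decorated suffix trie is below. The nested-dict representation, the "position" leaf label, and the helper names are my own assumptions for illustration, not the lecture's code.

    def build_suffix_trie(text):
        """Suffix trie of text (text must end with '$').

        Each node is a dict mapping a symbol to a child node; every leaf
        additionally stores the starting position of the suffix ending at it.
        """
        root = {}
        for start in range(len(text)):
            node = root
            for symbol in text[start:]:
                node = node.setdefault(symbol, {})
            node["position"] = start          # decorate the leaf with the suffix start

        return root

    def find_pattern(trie, pattern):
        """Starting positions of pattern: walk down the trie, then collect leaf labels."""
        node = trie
        for symbol in pattern:
            if symbol not in node:
                return []                     # the pattern cannot be spelled from the root
            node = node[symbol]
        # The pattern was spelled out; every leaf below it holds one occurrence.
        positions, stack = [], [node]
        while stack:
            current = stack.pop()
            for key, child in current.items():
                if key == "position":
                    positions.append(child)
                else:
                    stack.append(child)
        return sorted(positions)

    trie = build_suffix_trie("panamabananas$")
    print(find_pattern(trie, "ana"))          # [1, 7, 9]
    print(find_pattern(trie, "nab"))          # []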
Given only the suffix trie of some Text in the picture below, can you tell which of the following strings appears the greatest number of times in this Text, and how many times exactly it appears?

We saw that the suffix trie gives a fast algorithm for the pattern matching problem I mentioned, but let's take a look at the memory footprint of the suffix trie. The suffix trie is formed from the |Text| suffixes of the text. The average length of these suffixes is roughly |Text| / 2, and therefore their total length is |Text| · (|Text| + 1) / 2, which is on the order of |Text|².

For the human genome, that is a huge, impractical memory footprint.

Should we give up?

Suffix Trees

We saw that pattern matching with suffix tries is fast, but impractical with respect to the memory footprint. How about this idea? We saw that bananas takes up a lot of edges in our suffix trie. Can we compress all these edges into a single edge?

That's very easy to do. Let's simply do this, and do the same with every non-branching path in our suffix trie. Very quickly our tree gets much smaller. Soon we're almost done; continue, continue, and finally we construct something that is called SuffixTree(Text). Since each suffix adds one leaf and at most one internal vertex to the suffix tree, the number of vertices in the suffix tree is less than 2 · |Text|, and therefore the memory footprint of the suffix tree, which is proportional to the number of edges in the suffix tree, is simply O(|Text|).

This sounds like cheating! Because we haven't answered the question of how we store all the edge labels: stored explicitly, they would take as much space as all the labels in the suffix trie took.

However, let's try the following. Instead of storing the whole string bananas$ as the label of our edge, notice that bananas$ starts at position 6 of the text and has length 8. Therefore, instead of storing bananas$ on the edge, we will store only two numbers: 6, the starting position, and 8, the length. That is sufficient to reconstruct the entire label. We will do this for all edge labels, and as a result, you see that the suffix tree is indeed a very memory-efficient way to encode all information about the suffixes of the text.
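A tiny sketch of this storage trick, using the panamabananas$ example from the lecture; the (start, length) tuple encoding shown here is my own rendering of the idea, not the course's reference code.

    text = "panamabananas$"

    # Instead of storing the substring itself on a suffix tree edge,
    # store only the pair (starting position, length) of the label in the text.
    edge = (6, 8)                                 # stands for the label "bananas$"

    def edge_label(text, edge):
        start, length = edge
        return text[start:start + length]

    print(edge_label(text, edge))                 # bananas$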
You may be wondering why we added this silly dollar sign to panamabananas.

I added it because I wanted to make sure that each suffix corresponds to a leaf.

But why do we want each suffix to correspond to a leaf? I suggest you try to construct the suffix tree for papa without adding the dollar sign and compare it with the suffix tree for papa$, and you will see why the dollar sign is important.

To summarize, suffix trees are a fast and memory-efficient way to do pattern matching. However, the construction of suffix trees is not for the faint-hearted, because we need to combine all suffixes of the text into the suffix tree, and the naive algorithm for doing this takes quadratic, O(|Text|²), time.

However, there is an ingenious linear-time algorithm for constructing suffix trees, developed over 40 years ago, which amazingly builds the suffix tree in time linear in |Text|. So it looks like we are done, finally, after all this effort.

And now I want to tell you about the big secret of big-O notation, something that Sasha, Daniel, Misha, and Neil forgot to tell you about. Indeed, suffix trees enable fast exact multiple pattern matching, with runtime O(|Text| + |Patterns|) and memory O(|Text|); that's the best we can hope for.

However, big-O notation hides constants, and the best known implementations of suffix trees have a large memory footprint of about 20 · |Text|, which leads to very large memory requirements for long genomes like the human genome. Even more importantly, we want to find mutations, and it is not clear how to develop a fast approximate multiple pattern matching algorithm using suffix trees. So once again, we are facing an open problem that we have to solve.

PRACTICE QUIZ • 10 MIN

Tries and Suffix Trees


FAQ

Please see this link, section "Coursera week 1" for some of the frequently asked questions and
answers about this week's material.

Slides and External References


Download the slides for Suffix Trees:

Suffix-Trees-Reduced.pdf (PDF file)

References
See Chapter 9: How Do We Locate Disease-Causing Mutations (Combinatorial Pattern Matching)
in [CP15] Phillip Compeau, Pavel Pevzner. Bioinformatics Algorithms: An Active Learning
Approach, 2nd Ed. Vol. 1. Active Learning Publishers. 2015.

Also see the course "Finding Mutations in DNA and Proteins" of the Bioinformatics Specialization.

If you want to learn how to assemble genomes, also see Chapter 3: How Do We Assemble
Genomes (Graph Algorithms) in [CP15] Phillip Compeau, Pavel Pevzner. Bioinformatics
Algorithms: An Active Learning Approach, 2nd Ed. Vol. 1. Active Learning Publishers. 2015.

Week 2

Burrows-Wheeler Transform and Suffix Arrays

Although EXACT pattern matching with suffix trees is fast, it is not clear how to use suffix trees for APPROXIMATE pattern matching. In 1994, Michael Burrows and David Wheeler invented an ingenious algorithm for text compression that is now known as the Burrows-Wheeler Transform. They knew nothing about genomics, and they could not have imagined that 15 years later their algorithm would become the workhorse of biologists searching for genomic mutations. But what does text compression have to do with pattern matching? In this lesson you will learn that the fate of an algorithm is often hard to predict: its applications may appear in a field that has nothing to do with the original plans of its inventors.

Key Concepts
- Explain how the Burrows-Wheeler Transform allows us to reduce the memory needed to store a genome and to search for patterns in the genome efficiently
- Develop a program to compute the Burrows-Wheeler Transform of a string
- Develop a program to invert the Burrows-Wheeler Transform of a string
- Develop a program to search in a string given as its Burrows-Wheeler Transform
- Develop a program to build the suffix array of a string
- Explain how a suffix array can be used to search for patterns in a string given as its Burrows-Wheeler Transform
- Explain how a partial suffix array can be used to reduce the memory needed for the suffix array and still be able to search for patterns in a string
Burrows-Wheeler Transform

Video: Burrows-Wheeler Transform (4 min)
Video: Inverting Burrows-Wheeler Transform (7 min)
Video: Using BWT for Pattern Matching (6 min)
Reading: Using BWT for Pattern Matching (10 min)

Suffix Arrays

Video: Suffix Arrays (5 min)
Reading: Pattern Matching with Suffix Array (10 min)

Approximate Pattern Matching and Mutations of the Genome

Video: Approximate Pattern Matching (6 min)

Slides and External References

Reading: FAQ (10 min)
Reading: Slides and External References (10 min)

Programming Assignment

Practice Quiz: Burrows-Wheeler Transform and Suffix Arrays (4 questions)
Programming Assignment: Programming Assignment 2

Burrows-Wheeler Transform

The previous lecture ended with a rather difficult algorithmic challenge that we will try to solve using the Burrows-Wheeler Transform and the suffix array. Let's start with the Burrows-Wheeler Transform. Allow me to slightly change the focus: instead of pattern matching, we'll talk about text compression. Run-length encoding is the simplest way to compress text: a run of a single symbol is replaced by the number of times the symbol appears in the run, followed by the symbol itself.
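As a quick illustration (the function name and example string are mine, not from the lecture), run-length encoding can be sketched in a few lines of Python:

    def run_length_encode(text):
        """Replace each run of a repeated symbol by its length followed by the symbol."""
        encoded = []
        i = 0
        while i < len(text):
            run_start = i
            while i < len(text) and text[i] == text[run_start]:
                i += 1
            encoded.append(str(i - run_start) + text[run_start])
        return "".join(encoded)

    print(run_length_encode("GGGGCCCAAATT"))   # 4G3C3A2T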
You may be wondering why we would want to use run-length encoding for genomes, because genomes don't have many runs. But they do have many repeats. For example, more than half of the human genome is formed by repetitive DNA, and the lion's share of many plant genomes is also formed by various types of repeats. So here's an idea: let's convert the text into something else, so that its repeats are converted into runs. We'll start from the genome, turn it into a ConvertedGenome, and then apply run-length encoding to the ConvertedGenome, because our hope is that the ConvertedGenome will have many runs. Let me show how we can accomplish this.

Let's consider all cyclic rotations of our favorite string, panamabananas$.

We start from this one, then this one, and this one, and continue to form all cyclic rotations of this string.

Then, after we have generated all the cyclic rotations, let's sort them. The dollar sign is viewed as the first letter of the alphabet, even before A, so we'll start with $panamabananas. Continue, continue, continue, continue, and finally we have a sorted list of all cyclic rotations of the text.

You might be wondering why we are doing this, but look at this strange thing. The last column of the resulting matrix is called the Burrows-Wheeler transform of the text. You will notice that although our original string panamabananas$ did not have many runs, the Burrows-Wheeler transform of this string actually has many runs. For example, here is a run of five a's in the Burrows-Wheeler transform of our original text.
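Here is a short Python sketch of this construction. It follows the definition directly (form all cyclic rotations, sort them, read off the last column), so it is quadratic in memory and meant only as an illustration; the function name is mine.

    def burrows_wheeler_transform(text):
        """Burrows-Wheeler transform of text; text is assumed to end with '$'."""
        # Form all cyclic rotations of text, sort them, and take the last column.
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rotation[-1] for rotation in rotations)

    print(burrows_wheeler_transform("panamabananas$"))   # smnpbnnaaaaa$a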
How have we achieved this? Let me explain it using an example from the famous double helix paper by Watson and Crick, where they first presented the structure of DNA. These are just some consecutive rows in the Burrows-Wheeler transform of this text, and you see that there are many runs of A in it. Why so many runs of A? Well, one of the most common words in English is "and", and every occurrence of "and" in the text is likely to contribute to a run of A in the Burrows-Wheeler transform, as you see in this example. So our goal now is to start from the genome, apply the Burrows-Wheeler transform to the genome, and then, hopefully, compress the Burrows-Wheeler transform of the genome. After we apply this compression, we will greatly reduce the memory for storing our genome. But this only makes sense if we can invert the transformation. From the compressed version of the Burrows-Wheeler transform, we can easily go back to the Burrows-Wheeler transform itself. But can we go back from the Burrows-Wheeler transform of the genome to the genome itself? Is it even possible?

Inverting Burrows-Wheeler Transform


We saw in the previous lecture that the Burrows-Wheeler transform idea will not work unless we figure out how to invert the Burrows-Wheeler transform. Let's try to reconstruct the text banana from its Burrows-Wheeler transform, annb$aa. We know the last column of the Burrows-Wheeler matrix. We also know the first column, because the first column is simply the sorted elements of the Burrows-Wheeler transform. Therefore, we know all 2-mers in the cyclic rotations of our string, and if we sort them, we get the first two elements of each row of our Burrows-Wheeler matrix. Now that we know the sorted 2-mers, we once again return to the original Burrows-Wheeler matrix.

For each such 2-mer, we actually know the symbol that precedes it, and therefore we know all 3-mers. We sort them, and now they appear in the same order as in the Burrows-Wheeler matrix. We once again return to the original Burrows-Wheeler matrix, where we again know the symbol that precedes every 3-mer, and we continue: we generate all 4-mers, sort them again, and repeat the same procedure. This way we generate all 5-mers and sort them once more, and with the final step we generate the full rows, sort them, and we now know our string: banana.

So we now know the entire matrix, and the symbols in the first row of this matrix after the dollar sign spell banana, which is exactly what we need. We saw how to invert the Burrows-Wheeler transform, but it was a very memory-intensive algorithm: to reconstruct Text from its Burrows-Wheeler transform, we needed to store |Text| cyclic rotations of the string Text.

Can we invert the Burrows-Wheeler transform with less space and without |Text| rounds of sorting?
To develop a faster and more memory-efficient algorithm for inverting the Burrows-Wheeler transform, we start from an interesting observation. Let's take a look at all occurrences of a in the first column and in the last column, and let's ask: where is the first a of the first column hiding along the circle that represents Text?

You can see that it is hiding right after panam, shown in green.

The next question: where is the first a of the last column hiding along the circle? Maybe it is just a coincidence, but strangely, it is hiding in exactly the same place.

Let's ask the same question about the second a. The second a is hiding once again in the same position along the circle, both for the first column and for the last column, right after pan.

The next question: where is the third a hiding? It is once again hiding in the same position. The same here, the same here, and the same here. So it looks like the i-th occurrence of a in the first column hides at the same position along the circle as the i-th occurrence of a in the last column. And if we look at the occurrences of n, you can check that the same rule applies. Is it true in general? Let's try to answer this question. Let's number all occurrences of a in the first column, and then chop off the first a from each of these rows. The six sorted strings that appeared in the Burrows-Wheeler matrix remain sorted, because we simply removed the same first symbol from all of them. Now let's add the chopped symbol to the end of each of the strings. We add them, and of course the strings remain sorted.

But these are exactly the six strings that end in a in our Burrows-Wheeler matrix, which means that they appear in our matrix in the same order as the order we started from. This results in the so-called first-last property of the Burrows-Wheeler transform: the k-th occurrence of a symbol in the FirstColumn and the k-th occurrence of the same symbol in the LastColumn correspond to the appearance of this symbol at the same position of the text, as shown on the slide. Armed with the first-last property, let's try to invert the Burrows-Wheeler transform again.

Let's start with the dollar sign located in the first column of the first row. It corresponds to s1 in the last column. We know where s1 is located in the first column, so let's move there.

And s1 in the first column corresponds to a6 in the last column, so let's move there. We know where a6 is located in the first column, so let's move to the position of a6.

a6 in the first column corresponds to n3 in the last column, and we know where n3 is located in the first column, so let's move there. We continue moving through the matrix according to the first-last property, and slowly but surely we spell out our original Text. When we finish, we notice that the memory we used is only 2 · |Text|, and the time is simply that of following the pointers defined by the first-last property, which is also linear in the length of the text. So we are done.
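A Python sketch of this first-last walk is given below; the occurrence-rank bookkeeping is one possible way (an assumption of this sketch, with my own function names) to implement the first-last property described above.

    def inverse_bwt(bwt):
        """Reconstruct the text (ending in '$') from its Burrows-Wheeler transform."""
        def ranked(column):
            # Pair each symbol with its occurrence rank, so that equal symbols
            # can be told apart (this is what the first-last property talks about).
            seen, result = {}, []
            for symbol in column:
                seen[symbol] = seen.get(symbol, 0) + 1
                result.append((symbol, seen[symbol]))
            return result

        last = ranked(bwt)                        # last column with occurrence ranks
        first_index = {pair: i for i, pair in enumerate(ranked(sorted(bwt)))}

        # Start at the row whose first symbol is '$' (row 0) and repeatedly jump
        # from the last column to the first column, spelling the text backwards.
        row, text = 0, ["$"]
        for _ in range(len(bwt) - 1):
            symbol, rank = last[row]
            text.append(symbol)
            row = first_index[(symbol, rank)]
        return "".join(reversed(text))

    print(inverse_bwt("smnpbnnaaaaa$a"))          # panamabananas$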
The only question left: where is pattern matching in the Burrows-Wheeler transform?

Using BWT for Pattern Matching

Okay, let's learn how to do pattern matching with the Burrows-Wheeler Transform. Let me first summarize what we learned about pattern matching with the suffix tree. The runtime is O(|Text| + total length of all Patterns). The memory, in the best implementations known today, is about 20 · |Text|, which is too high for long strings like the human genome. So the question we will try to address in this lesson is: can we use the Burrows-Wheeler Transform to design a more memory-efficient linear-time algorithm for multiple pattern matching?

Let's see how we can do this. Let's search for ana in panamabananas. We'll start by noticing that there are six rows that start with the letter a. Please also notice that we are currently matching the last symbol of ana rather than the first one; this will be important. So there are six rows starting with a, but only three of them end in n, which is what we need, because we are now matching the last two symbols of ana, that is, na. Let's pay attention to these three rows: using the first-last property, we can figure out where these three n's hide in the first column of our Burrows-Wheeler matrix: here, here, and here.

After we have found where they appear in the first column, we know where na appears: there are actually three matches of na, the last two symbols of ana, in the string. Let's now try to match the first symbol of ana, and we know where to look for it: in the last column of these three rows. After we have found a in these three rows, then, using the first-last property again, we find where these three occurrences of ana appear at the beginning of cyclic rotations.

As a result, we found three matches of ana. Let me specify some details of the algorithm we just discussed. We will use two pointers, top and bottom, that specify the range of rows of the Burrows-Wheeler matrix that we are interested in. In the beginning, top equals 0 and bottom equals 13, to cover all positions in the text. In the next iteration, the range of positions we are interested in is narrowed to all positions where a appears in the first column. What do we do afterwards? We are looking for the next symbol, which is n in ana, and we look for its first occurrence in the last column among the rows from top to bottom. Likewise, we look for its last occurrence. As soon as we have found the first and last occurrences of this symbol, the first-last property tells us where these n's, and all n's in between, are hiding in the first column. As a result, the pointers top and bottom change from 1 and 6 to 9 and 11; they narrow the search. We then continue further, and that's how we find the positions of ana in the text.

The algorithm I just described translates into the following BWMatching pseudocode. The lines in green describe what we have been doing with the top and bottom pointers. Note that we use the LastToFirst array: given a symbol at position index of the LastColumn, LastToFirst(index) gives the position of that symbol in the FirstColumn. So it implements the first-last property.

It looks like now, finally, we are done: we have our very first pattern matching algorithm based on the Burrows-Wheeler Transform, and it has a good memory footprint.

The only problem, though, is that BWMatching is very slow: it analyzes every symbol from top to bottom in the last column in each step.

What should we do? The trick is to introduce the Count array: Count(symbol, i) is the number of occurrences of the given symbol among the first i positions of the last column. This slide shows the Count array. We can design a better version of BWMatching by replacing the four green lines we discussed before with two green lines that use the Count array. As you can see, we no longer need to examine every symbol between the top and bottom indices in the last column. If you are wondering about the details of the transformation from the previous four lines into the two lines using the Count array, check our Bioinformatics course on Coursera or our book, which describe this transformation. So it looks like, finally, after all these complications, we are done. But there is still one question left unanswered: where are the matches that we found? Where do they appear in the text?
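Below is a Python sketch along the lines of BetterBWMatching, using the FirstOccurrence and Count ideas just described. The dict-based Count table, the function name, and the decision to return occurrence counts are assumptions of this sketch rather than the course's reference pseudocode.

    def better_bw_matching(bwt, patterns):
        """For each pattern, count its occurrences in the text whose BWT is bwt."""
        # first_occurrence[c]: first row of the first column where symbol c appears.
        first_occurrence = {}
        for i, symbol in enumerate(sorted(bwt)):
            first_occurrence.setdefault(symbol, i)

        # count[c][i]: occurrences of c among the first i symbols of the last column.
        count = {c: [0] * (len(bwt) + 1) for c in set(bwt)}
        for i, symbol in enumerate(bwt):
            for c in count:
                count[c][i + 1] = count[c][i] + (1 if c == symbol else 0)

        results = []
        for pattern in patterns:
            top, bottom = 0, len(bwt) - 1
            while True:
                if not pattern:
                    results.append(bottom - top + 1)        # every row in the range matches
                    break
                pattern, symbol = pattern[:-1], pattern[-1]  # process the pattern from its end
                if symbol not in count or count[symbol][bottom + 1] == count[symbol][top]:
                    results.append(0)                        # symbol absent from the current range
                    break
                top = first_occurrence[symbol] + count[symbol][top]
                bottom = first_occurrence[symbol] + count[symbol][bottom + 1] - 1
        return results

    bwt = "smnpbnnaaaaa$a"                                   # BWT of panamabananas$
    print(better_bw_matching(bwt, ["ana", "banana", "pan", "nab"]))   # [3, 1, 1, 0]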

Using BWT for Pattern Matching

You may find it useful before implementing some of the problems in the Programming Assignment
to look closer at the pseudocode for the algorithms discussed in the lectures.
Here is the pseudocode for BWMatching algorithm from the lecture:

bwmatching.pdf (PDF file)
Here is the pseudocode for BetterBWMatching from the lecture:

better_bwmatching.pdf (PDF file)
Alternatively, you can see this interactive text, which has more details about using BWT for pattern matching (this link leads to the Finding Mutations in DNA and Proteins course of the Bioinformatics specialization). Note that you don't need to pass the code challenge at the end of the interactive text, as it won't affect your Coursera grade for this course: we have prepared a separate Programming Assignment for you.

Suffix Arrays

At the end of the last lecture, we faced the challenge of finding the positions of the pattern in the text when we tried to develop pattern matching with the Burrows-Wheeler Transform. Now I will explain how we will use suffix arrays to solve this problem. The suffix array simply holds the starting position of each suffix, in the sorted order of the suffixes. For example, the first suffix starts at position 13, the second suffix starts at position 5, the next suffix starts at position 3, and we can fill in the rest of the suffix array.

Here it is. Once the suffix array is constructed, we can very quickly answer the question of where the occurrences of the pattern are. In the case of ana, our pattern appears at positions 1, 7, and 9.

The challenge, however, is how to construct the suffix array quickly, because the naive algorithm for constructing the suffix array, based on sorting all suffixes of the text, requires O(|Text| log |Text|) comparisons, and each comparison may itself take up to O(|Text|) time.
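For reference, the naive construction just mentioned fits in a few lines of Python (the '$'-terminated example string is the one from the lecture; the function name is mine):

    def build_suffix_array(text):
        """Naive suffix array: sort the suffix starting positions by the suffixes themselves."""
        # This performs O(|Text| log |Text|) comparisons, and each comparison
        # of two suffixes may itself take up to O(|Text|) time.
        return sorted(range(len(text)), key=lambda i: text[i:])

    print(build_suffix_array("panamabananas$"))
    # [13, 5, 3, 1, 7, 9, 11, 6, 4, 2, 8, 10, 0, 12]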

There is a way to construct the suffix array if you have already constructed the suffix tree, because, as you can see from this example, the suffix array is simply a depth-first traversal of the suffix tree. Indeed, you start from leaf 5, continue to leaf 3, then 1, 7, 9, and so on; by simply traversing all leaves of the suffix tree in order, you reconstruct the suffix array. To summarize, if we construct the suffix array by a depth-first traversal of the suffix tree, it takes O(|Text|) time and roughly 20 · |Text| space.
Misha will also explain later how to quickly construct the suffix array without relying on the suffix tree. In fact, Manber and Myers, who introduced suffix arrays in 1990, gave a construction that requires roughly 4 · |Text| space. However, for genomics applications even this reduced space of 4 · |Text| is still large. What can we do to reduce it?

Here is a trick for reducing the memory for the suffix array. Let me first ask a question: can we store only a fraction of the suffix array and still do fast pattern matching? For example, can we store only the elements of the suffix array that are multiples of some integer K? As shown here, if K is equal to 5, then we only store the elements 5, 10, and 0 of the suffix array; we have no access to the other elements. How can this be useful? Let me show you how to use this partial suffix array to find the positions of matches. When we have the complete suffix array, it is trivial: we simply look up the corresponding elements of the suffix array. But what do we do when we only have the partial suffix array? Where do these occurrences of ana appear in the text?

Well, we do not know, because there is no corresponding element of the partial suffix array for, say, a2na. How do we find where it appears?

Indeed, we don't know yet where it appears, but we can also ask where b1ana appears. Once again, we don't know how to answer this question, because there is no corresponding element in the partial suffix array. But using the first-last property, we can ask a different question: where does the string a1bana appear? Now we can answer this question, because in this case the element of the partial suffix array is present: it is 5. So we know that a1bana appears at position 5, and the only thing left is to figure out where ana appears. That's easy to do: if a1bana appears at position 5, then b1ana appears at position 6, and ana appears at position 7. So we have figured out how to use the partial suffix array for fast pattern matching. Of course, the time to search for a pattern is multiplied by a factor of up to K, because we may have to walk through up to K positions before we find a stored element of the partial suffix array, but K is a constant in this algorithm.

Pattern Matching with Suffix Array

You may find it useful before implementing some of the problems in the Programming Assignment
to look closer at the pseudocode for the algorithms discussed in the lectures.

Here is the pseudocode for pattern matching with suffix array:

suffix_array_matching.pdf
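As a companion to the linked pseudocode, here is a Python sketch of one standard way to search with a suffix array, using two binary searches over the sorted suffixes. It is an illustration under my own naming and is not necessarily the exact algorithm given in the PDF.

    def pattern_matches(text, pattern, suffix_array):
        """Starting positions of pattern in text, via binary search over the suffix array."""
        n = len(text)
        # Find the first suffix that is lexicographically >= pattern.
        low, high = 0, n
        while low < high:
            mid = (low + high) // 2
            if text[suffix_array[mid]:] < pattern:
                low = mid + 1
            else:
                high = mid
        start = low
        # Find the end of the contiguous block of suffixes that begin with pattern.
        low, high = start, n
        while low < high:
            mid = (low + high) // 2
            if text[suffix_array[mid]:suffix_array[mid] + len(pattern)] == pattern:
                low = mid + 1
            else:
                high = mid
        return sorted(suffix_array[start:low])

    text = "panamabananas$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    print(pattern_matches(text, "ana", sa))   # [1, 7, 9]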

Approximate Pattern Matching and Mutations of the Genome


Approximate Pattern Matching
So far we have been focusing on exact matches, which help us answer the question: at which positions are my genome and the reference genome identical? But of course we are most interested in approximate pattern matching, because we want to find the positions where my genome differs from the reference genome.

The Approximate Pattern Matching Problem is: given a string Pattern, a string Text, and an integer d, find all positions in Text where Pattern appears as a substring with at most d mismatches.
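For clarity, here is a naive Python sketch of this problem statement (checking every position directly); it is only a reference implementation under assumed names, not the BWT-based approach developed next.

    def approximate_matches(text, pattern, d):
        """All positions where pattern matches text with at most d mismatches."""
        positions = []
        for start in range(len(text) - len(pattern) + 1):
            # Count mismatching symbols between the pattern and this text window.
            mismatches = sum(1 for a, b in zip(pattern, text[start:start + len(pattern)])
                             if a != b)
            if mismatches <= d:
                positions.append(start)
        return positions

    print(approximate_matches("panamabananas", "ana", 1))   # [1, 3, 5, 7, 9]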



I'm sure you can come up with a fast algorithm for solving this problem, but I'm really interested in a
more difficult problem. Multiple approximate pattern matching, a set of strings Patterns, a string Text,
and an integer d is an input. And we want to find all positions in Text where a string from Patterns
appears as a substring with at most d mismatches. Let's try to use the Burrows-Wheeler Transform
again to find approximate matches of ana in panamabananas, and we will allow up to one mutation in
ana. We will start again with finding all rows in the Burrows-Wheeler matrix that start with all a, here
they are. And amongst them, we want to find rows that contains na.
Play video starting at 1 minute 39 seconds and follow transcript1:39
And among six rows that start with a, only three of them actually end with n.
Play video starting at 1 minute 48 seconds and follow transcript1:48
Here are these rows, and we are interested in them. They form exact matching of the last two
symbols of ana to our text. But that's not the only thing we are interested in. In the past, it was the
only thing, but now, they're actually interested in all six rows starting starting from a, because we are
interested in approximate matches as well. And to find approximate matches, we need to retain all
the six rows. And specify the number of mismatches for each of these rows, here they are.
After we have found all the rows we are interested in, we use the first-last property again to find where the symbols in the last column of these rows appear in the first column. And they will be appearing here. Once again, we check which of these six appearances will match ana with at most one mutation. And it turns out that one of them actually doesn't satisfy this property, because there is a match, but it's a match with two mutations, which is beyond our maximum allowable number of mutations. And that's why we are not interested in this row anymore.
And then, by applying the last-to-first property again, we find all occurrences of ana with up to one mutation in the text.
How do we find where all these five approximate occurrences appear in the text? Well, you can use the suffix array, or more precisely, the partial suffix array. And you can figure out how to use the partial suffix array to find approximate occurrences as well.
I tried to hide some details of approximate pattern matching with the Burrows-Wheeler Transform to make it a little bit easier for you to understand how it works. In reality, it is a bit more complex. If you want to learn about the details of approximate pattern matching with the Burrows-Wheeler Transform, you can find them in our Bioinformatics Algorithms course on Coursera or in our book.
Sam Berns had a very rare genetic disease. In fact, there are fewer than 1,000 people on Earth with this disease, progeria.
However, there are over 7,000 such rare genetic diseases, and as a result about 10% of the human population has a rare genetic disease. We have now learned how pattern matching will help the doctors of the future to learn about mutations in our genomes and will allow them to diagnose many of these mutations. However, even if a child is diagnosed with a disease-causing mutation, in the case of progeria there is no cure.
And this is the next challenge for personalized genomics: moving from diagnostics to new drugs aimed at specific diseases. To finish this lecture, I will tell you about just one case of a very successful drug that biologists developed based on exact knowledge of a specific mutation implicated in a disease. I will talk about a more complex type of mutation. So far, we have talked about point mutations, in which one nucleotide is changed into another nucleotide. But there are more complex mutations that work like an earthquake operating on the genome. In this particular case, I'm talking about the so-called Philadelphia chromosome, which is formed from two normal human chromosomes. Pieces of these two normal chromosomes exchange positions. As a result, two chimeric chromosomes are formed, as shown here. Biologists figured out how to detect this event, and it turns out that it is a biomarker for chronic myeloid leukemia.
And based on exact knowledge of this biological mechanism (admittedly, a more complex event than the point mutations we started with, but once again a mutation in the human genome), biologists were able to develop a miracle drug called Gleevec that is very effective against chronic myeloid leukemia.

FAQ

Please see this link, sections "Coursera week 2" and "Coursera week 3" for some of the frequently
asked questions and answers about this week's material.

Slides and External References

Download slides for Burrows-Wheeler Transform and Suffix Arrays:

BWT-Suffix-Arrays-Reduced.pdf (PDF)

References
See Chapter 9: How Do We Locate Disease-Causing Mutations (Combinatorial Pattern Matching)
in [CP15] Phillip Compeau, Pavel Pevzner. Bioinformatics Algorithms: An Active Learning
Approach, 2nd Ed. Vol. 1. Active Learning Publishers. 2015.

Also see the course "Finding Mutations in DNA and Proteins" of the Bioinformatics Specialization.

If you want to learn how to assemble genomes, also see Chapter 3: How Do We Assemble
Genomes (Graph Algorithms) in [CP15] Phillip Compeau, Pavel Pevzner. Bioinformatics
Algorithms: An Active Learning Approach, 2nd Ed. Vol. 1. Active Learning Publishers. 2015.

PRACTICE QUIZ • 8 MIN

Burrows-Wheeler Transform and Suffix Arrays


Programming Assignment: Programming Assignment 2
You must earn 2/4 points to pass.

Deadline: Pass this assignment by Jul 26, 11:59 PM PDT

Welcome to your second programming assignment of the Algorithms on Strings class! In this programming assignment, you will practice implementing the Burrows-Wheeler transform and suffix arrays.

Recall that starting from this programming assignment, the grader will show you only the first few
tests (please review the FAQ section for a more detailed explanation of this behavior of the
grader).

Download instructions and starter files:

Programming Assignment 2.pdf (PDF)

Programming Assignment 2.zip


Week 3
Algorithms on Strings
Week 3
Discuss and ask questions about Week 3.

3 threads · Last post 2 months ago


Go to forum

Knuth–Morris–Pratt Algorithm

Congratulations, you have now learned the key pattern matching concepts: tries, suffix trees,
suffix arrays and even the Burrows-Wheeler transform! However, some of the results Pavel
mentioned remain mysterious: e.g., how can we perform exact pattern matching in O(|Text|) time
rather than in O(|Text|*|Pattern|) time as in the naïve brute force algorithm? How can it be that
matching a 1000-nucleotide pattern against the human genome is nearly as fast as matching a
3-nucleotide pattern? Also, even though Pavel showed how to quickly construct the suffix array
given the suffix tree, he has not revealed the magic behind the fast algorithms for suffix tree
construction! In this module, Michael will address some algorithmic challenges that Pavel tried to
hide from you :) such as the Knuth-Morris-Pratt algorithm for exact pattern matching and more
efficient algorithms for suffix tree and suffix array construction.
Less
Key Concepts
 Explain what a prefix function is
 Explain how to compute the prefix function at each step of the Knuth-Morris-Pratt algorithm
 Apply amortized analysis to explain why the prefix function is computed in linear time in the Knuth-Morris-Pratt algorithm
 Develop a program to find a pattern in a text using the Knuth-Morris-Pratt algorithm (the first problem of the last programming assignment, which is in the next week)

Less
Knuth-Morris-Pratt Algorithm

Video: LectureExact Pattern Matching

9 min


Video: LectureSafe Shift

3 min

Video: LecturePrefix Function

7 min

Video: LectureComputing Prefix Function

9 min

Video: LectureKnuth-Morris-Pratt Algorithm
5 min


Quiz: Exact Pattern Matching

4 questions

Due Aug 2, 11:59 PM PDT

Reading: Programming Assignment 3 lasts for two weeks

2h

Reading: Slides and External References

10 min

Knuth-Morris-Pratt Algorithm
Exact Pattern Matching

Hi, in this module, Algorithmic Challenges, you will learn some of the more challenging algorithms on strings. We will start with the Knuth-Morris-Pratt algorithm for exact pattern matching. It allows us to find all occurrences of a pattern in the text in time proportional to the sum of the lengths of the pattern and the text.
Then we'll proceed to learn the algorithm for building the suffix array of a string in time O(n log n). And after that, you will learn how to build a suffix tree, given the suffix array of the string, in linear time. Of course, you already know how to build a suffix tree of a string, but the algorithm you know is quadratic, and that doesn't allow you to take really long strings of millions or billions of characters and build suffix trees and arrays for them in reasonable time. The O(n log n) algorithm will allow you to tackle that. So first, let's recall what exact pattern matching means.
The problem is very simply formulated: you are given a long text and a pattern that you need to find in it, and you need to find all the positions where the pattern starts in the text. (We number positions from zero throughout this module.)

You learned a brute force algorithm for that, which basically slides the pattern down the text, and the running time of that algorithm is the product of the length of the text and the length of the pattern. What we are going to do in this lesson is improve this time to the sum of the length of the text and the length of the pattern. But first, let's recall how the brute force algorithm works. It first aligns the pattern and the text such that the pattern starts from position zero in the text, and tries to match the pattern by comparing it character by character to the corresponding characters of the text.
And in this example, we find the pattern right away at position 0. So we add position 0 to the output, and then we slide the pattern one position to the right. We compare the first symbols and they don't match, so we slide the pattern again, and again, no match. And again, and then we just keep sliding the pattern: we compare the symbols, and if they don't match, we slide the pattern to the right. And then in the last possible position, we again find an occurrence of the pattern in the text. So our output is the list of positions 0 and 7, and those are all the positions where the pattern occurs in the text. So the question is now: can we skip some of the positions where we tried to align the pattern with the text, because aligning there doesn't actually make sense given what we already know from the previous comparisons?
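
For reference, the brute force algorithm just described fits in a few lines (the example call is illustrative, not the exact strings from the slide):

def brute_force_matching(text, pattern):
    # Slide the pattern along the text and compare character by character.
    result = []
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            result.append(i)
    return result

# e.g. brute_force_matching("abracadabra", "abra") returns [0, 7]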
And the answer is yes, we can. In this particular example, we've already found the pattern in the text starting from position 0, and that means that when we slide the pattern to the right and try to align it with the first position in the text, we're going to compare the prefix of the pattern without the last character with the suffix of the same pattern without the first character. And they don't match, so there is no occurrence of the pattern starting from the first position. So, if we somehow pre-processed the pattern and knew that the prefix without the last character is not equal to the suffix without the first character, then we could just skip this alignment with the first position in the text altogether. And the same is true about the next position of the pattern: if we knew that the prefix of the pattern of length two is not equal to the suffix of the pattern of length two, then we could also skip positioning the pattern against the second position of the text. Then, when we slide one position to the right again, we compare the prefix of length one, which is the letter a, with the suffix of length one, which is also the letter a. These are equal, so we cannot skip this position according to this rule. But it means that instead of comparing the pattern to the text at positions zero, one, two and three, we could just safely move the pattern from position zero to position three, skipping positions one and two. And in the more general case, we could skip even more positions, depending on the length of the pattern and how much of its prefixes coincide with the corresponding suffixes.
Another example is when we don't even find the whole pattern in the text. We still can skip some of the positions. So in this example, the longest prefix which is common to the text and the pattern consists of six characters, and the pattern is longer.
So we cannot compare prefixes of the pattern with suffixes of the same pattern and then decide that we don't need to check some of the positions at which to align the pattern with the text.
Instead we need to do the same thing with the string marked in green: we need to compare prefixes of this string with suffixes of this string. And we can notice that the first position where a prefix of this string can coincide with the corresponding suffix is position number four. So we can just move the whole pattern from position zero to position four in the text, and then try to compare the pattern with the text, and we find an occurrence. We couldn't find an occurrence earlier, because no longer prefix of the string a, b, c, d, a, b coincides with the corresponding suffix.
And another example: again we find the longest common prefix of the pattern and the text. It has length 6, and it is the string a, b, a, b, a, b. And for this string, the longest prefix which coincides with the corresponding suffix is a, b, a, b of length four. And that means that we can move the pattern two positions to the right, and skip the alignment at position one of the text.
Now we again find the longest common prefix of the pattern, and the suffix of the text starting in
position two. It again has length six, which means that there is no occurrence of the whole pattern in
the text in position two. But we need to consider the string a, b, a, b, a, b, which is the longest
common prefix. And again, compare the prefixes of this string with the suffixes. And we already know
that the longest prefix which coincides with the suffix is a, b, a, b, of length four. So we can again
move the pattern to the right, so that the prefix and the corresponding suffix match. And now we find
the occurrence of the pattern in the text.

To make an algorithm from these observations, we will need the definition of a border. A border of a string is a prefix of the string which is equal to the suffix of the string of the same length. For example, for the string arba, a is a border, because the prefix a is equal to the suffix a. And ab is a border of the string a, b, c, d, a, b, which we saw in the second example. And a, b, a, b is a border of a, b, a, b, a, b.
Do you notice that the prefix a, b, a, b intersects with the suffix a, b, a, b? That is okay, and we just mark the fact that they intersect with an orange color. We also notice that not just a, b is a border, but a longer string of length four, a, b, a, b, is also a border. However, the string a, b is not a border of the same string a, b, because we require that a border does not coincide with the whole string: we only allow prefixes and suffixes which are shorter than the initial string.
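
As a quick sanity check of this definition, here is a tiny sketch that lists all borders of a string by brute force (the efficient way, via the prefix function, comes in the next videos):

def borders(s):
    # All proper prefixes of s that are also suffixes of s, shortest first.
    return [s[:k] for k in range(1, len(s)) if s[:k] == s[-k:]]

# borders("arba")   == ["a"]
# borders("ababab") == ["ab", "abab"]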
Now, let's consider shifting the pattern along the text in the general situation. The first thing we do is find u, the longest common prefix of the pattern and the suffix of the text to which we've just aligned our pattern. Then we find w, the longest border of u, so that there is a w at the beginning of u and there is one at the end of u, and we also mark both w's in the text T. Now, I suggest moving the pattern P to the right in such a way that the first w in P coincides with the second w in T.
And that is the way I suggest to skip some of the positions where we don't need to align the pattern with the text. Now you know that it is possible to avoid some of the comparisons that the brute force algorithm does, and I've suggested a specific general way to do that. But we don't want to miss any of the occurrences of the pattern in the text. So the question is: is it really safe to move the pattern in the suggested way? You will learn that in the next video.
Safe Shift

Hi. In this video, you will see that the method I suggested in the last video for shifting the pattern along the text is safe, in the sense that we won't miss any of the occurrences of the pattern in the text by shifting it that way. But first we need suffix notation. We denote by S with index k the suffix of string S which starts at position k. For example, for S = abcd, the suffix starting at position 2 is S2, and it is cd. And for the string T = abc, the suffix starting at position 0 is T0, and it is abc. Note that, again, we use indices starting from 0 for all the strings. Suppose the pattern is aligned with position k in the text.
Let's denote by u the longest common prefix of the pattern and the suffix Tk. Then select the longest border w of the string u. In the last video, I suggested moving the pattern to the right in such a way that the left w in the pattern coincides with the right w in the text. And now we'll prove that there cannot be any occurrences of the pattern in the text in the red area between the current position k and the start of the right w in the text. This will prove that the shift suggested in the last video is safe: we won't miss any occurrences by shifting the pattern this way.
Suppose the pattern occurs in the text at some position i between k and the start of the right w.
Let's move the pattern to align with that position i.
Then we can notice that there is a prefix, v, of the pattern that is also a suffix of u in the text.
And this string v, which is a prefix of P and a suffix of u, is actually both a prefix and a suffix of u, since u itself is a prefix of P. Also, this v is longer than w, because it started before w in the text and ended at the same position as the right w in the text. And so v is a border of u, because it is both a prefix and a suffix, and it is also a border which is longer than w; but w was the longest border of u. So we get a contradiction with the assumption that our pattern P occurs somewhere between the current position k and the start of the right w in the text. Now you know that it is actually possible to avoid many of the comparisons that the brute force algorithm does, by shifting the pattern along the text and skipping some of the positions at which the brute force algorithm tries to align the pattern with the text. But how to actually determine the best shifts of the pattern, how to compute those longest borders and the common prefixes, that you will learn in the next videos.

Prefix Function
Hi, in the previous video, you learned that we need to quickly compute the longest borders of different prefixes of the pattern. In this lecture you will learn the notion of the prefix function, which helps to do exactly that, and we will study some of the properties of the prefix function that allow us to compute it fast. The prefix function of a pattern P is the function that, for each position i in the pattern, returns the length of the longest border of the prefix of the pattern ending at this position i. Let's consider an example. Here's a string P, and we consider the first prefix of this string, a. For this prefix there is no border, so the prefix function is 0. Now for the prefix a, b, ending at position 1, there is also no border, because a is not equal to b, and so the prefix function is again zero. For the prefix a, b, a, ending at position 2, the longest border has length 1, and it is the border a. For the string a, b, a, b, the longest border is a, b of length two, and for the next prefix the longest border is already of length 3. And although the prefix aba and the suffix aba intersect, this is still a valid border. For the next string, the longest border is a, b, a, b of length 4. Then we meet the character c, and the prefix function drops to zero again, because there is no border for the string a, b, a, b, a, b, c. For the next string, the longest border is a; for the next one, again, a; and for the last one, ab. So here is the prefix function.

Now we will prove a useful property of the prefix function: the prefix ending at position i has a border of length s(i+1) - 1. To see that, let's first consider the longest border w of the next prefix, the one ending at position i+1. By definition of the prefix function, w has length exactly s(i+1). Let's consider the last character of w and cut it out. What's left, denoted by w prime, is a border of the prefix ending at position i, and its length is exactly s(i+1) - 1. So we've just proved the property: the prefix ending at i has a border of length s(i+1) - 1, and it means that the longest border of the prefix ending at position i is at least s(i+1) - 1, that is, s(i) >= s(i+1) - 1.

From this we get an immediate corollary: the prefix function cannot grow very fast when moving from one position to the next. In particular, it cannot increase by more than one. As we saw in the example, it can of course decrease, or stay the same, but it cannot increase by more than one from one position to the next. In the algorithms to follow, we will need to efficiently go through all the borders of a prefix of the pattern, and this lemma helps us with that. It says that all the borders of some prefix of the pattern, except for the longest one, are borders of this longest border in turn. To see that, let's look at the longest border and some border u of the prefix which is shorter than the longest border.
Then we can see that this border u is both a prefix and a suffix of the longest border, and it is also shorter than the longest border. So u is indeed a border of P[0..s(i) - 1]. And the useful corollary from that is that all the borders of P[0..i] can be enumerated in a simple way: first we take the longest of the borders, then we find the longest border of that string, then the longest border of that string, and so on, until we get an empty string as a border. By the end, we have gone through all the borders of the initial prefix of the pattern. And to go from any prefix to its longest border, we just need to use the prefix function; and then again the prefix function for that prefix of P, and then the prefix function for that prefix of P, and so on. So if we know the prefix function of the pattern, we can go through all the borders of any prefix of the pattern in an efficient way, going only through the borders and not encountering any other positions in the pattern.
Now let's think how to compute the prefix function. We know that s(0) is 0, because the prefix ending at position 0 has length 1 and has no non-empty borders. Now, to compute s(i+1) when we already know the values of the prefix function for all the previous positions, let's consider the longest border of the prefix ending at position i, and let's look at the character at position i+1 and the character right after the prefix of length s(i). If those characters are equal, then s(i+1) is at least s(i)+1, because we can just extend the border. And that means that s(i+1) is exactly equal to s(i)+1, because we've learned that the prefix function cannot grow by more than one from one position to the next. But if those characters are different, then everything is a bit more complex. We know that there is a border of the prefix ending at position i that has length exactly s(i+1)-1. So if we find that border, then the next (green) character after it will be the same as the character at position i+1; it will be the same x. So what we need to do is go through all the borders of the prefix ending at position i in order of decreasing length, and as soon as we find some border such that the next character after it is the same as the character at position i+1, we can compute s(i+1) as the length of that border plus one. So now we basically have the algorithm for computing all the values of the prefix function. We start with initializing s(0) with zero, and then we compute each next value of s. If the character at position i+1 is equal to the character right after the previous border, then we just increase the value of the prefix function by one and go ahead. If those characters are different, we go to the next longest border of the prefix ending at position i using the prefix function and look at the next character after it. If it coincides with the character at position i+1, then we have found the answer. Otherwise we again go to the next longest border using the prefix function and look at the next character after it, and so on. At some point, we may come to the situation where the longest border is empty. Then we will need to compare the character at position i+1 with the character at position 0. Either they are the same, and then the prefix function is 1, or they are different, and then the prefix function has the value 0. Now you know a lot of useful properties of the prefix function, but we still don't know exactly how to compute all of its values, and you will learn that in the next video.

Computing Prefix Function


Hi, in this video, you will learn how to compute all the values of the prefix function in an efficient way. We'll start with an example. We're given a pattern, P, and we want to compute its prefix function, s. We'll start with position 0 and fill in 0: s(0) is always 0, because the prefix of the pattern ending at position 0 has length 1 and has no non-empty borders. Now we move on to the character at position one, which is b, and we compare it with the next character after the end of the previous border. The end of the previous border was before the pattern P starts, so the next character is a. We compare a with b: they are different, and we cannot take a border of the empty border, so the value of the prefix function is again zero.
Now we'll look at the next character, a, and we'll look at the next character after the end of the previous border, which is again the a in position 0. And we see that these characters are the same, so we increase the previous value of s by one. And our value of the prefix function for position 2 is 1, and now our current border is of length 1 and it contains just the letter a in position 0. And we look at the next character, which is b, and we need to compare it with the character right after the end of the current border, which is the b in position one. Those characters are the same, so we increase the length of the previous border by one, and we write down that s of three is equal to two. Our current border is of length two, and it is ab. Now we'll look at the next character, a. We need to compare it with the next character after the current border, which is a in position two, and they're the same. So again we increase the value of our prefix function by one, and we increase the length of the current border, and it becomes aba.
The next character is b, we need to compare it with the next character after the end of our border
and they are again, the same. So, we increase the value of our prefix function by one and it becomes
four. And our border is abab. Now we'll look at the character c, we need to compare it with the next
character after the end of the current border. And they are different. So what we need to do is to take
the longest border of our current border, and we'll look at position three, and s of 3 is two. And so the
longest border of abab is just ab.
And now, we need to compare our current character c with the next character after the end of that
border, which is a. And they're again different, so we need to take the longest border of our current
border, and that has length 0. So, that will be empty string. And so now, we need to compare our
current character with the first character of the pattern, which is a. And they're again different. So
we'll write down that our prefix function is 0. We move on to the next character a. We compare it
with the next character after the end of the border which is the first character of pattern. And they
are the same so we write down 1 as the value of our prefix function. Now the border has length 1 it is
just string a.
We look at the next character, a, and we compare it with the character right after the end of the current border. They're different, so we need to take the longest border of our current border, which is the empty string. So we need to compare the current character with the first character of the pattern. They're the same, so we write down 1 as the value of our prefix function, and the current border has length 1.
And finally, we go to the symbol b at the end of the pattern, and we need to compare it with the b in position 1. They are the same, so we increase the value of the prefix function by 1 and write down 2 as the value of the prefix function for the last position. This is how the computation of the prefix function values works on an example.
And now let's look at the pseudo code which can compute these values for any string P.
It will return an array of values s, which has the same length as the string, and which will contain the values of the prefix function for positions from 0 to the length of the string minus 1. We start by initializing this array, and we write down s of 0 equal to 0, as I explained. We also need an additional variable, border, which will contain the length of the current border at all times. And then the for loop, where position i goes from one to the last position in the string, will actually compute s of i. First we need to look at our current character P of i, and we need to compare it with the character right after the end of our current border.
For example if the length of our current border is 2 then the border itself contains characters in
positions 0 and 1. And so the next character right after the end of the current border is in position
two. This is why we compare character in position i, P of i, with the character in position border. P of
border.
So we compare them and as soon as they are the same, we know what to do. But, if they're different
we need to take the longest border of our current border. How to do that? Our current border ends in
position border minus 1. We know that the length of its longest border is contained in the
corresponding value of the prefix function. So we assign value s of border minus 1 to our variable
border.
And we only end this while loop if either current character p of i is the same as the character P of
border or our border became empty string.
After that, we compare our current character again with the character in position border. Either they are the same, and then we increase our current border by one, and that is the value we need to assign to our prefix function; or they are still different after border became zero, and then our current prefix function value should be zero. At the end of the for loop iteration, we just assign the value to the prefix function.
And that will return the array with values of the prefix function.
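
For reference, here is a Python rendering of this pseudocode; the example value in the comment is consistent with the walkthrough above.

def compute_prefix_function(P):
    # s[i] = length of the longest border of the prefix P[0..i]
    s = [0] * len(P)
    border = 0                      # length of the current border
    for i in range(1, len(P)):
        # Shrink the border until it can be extended by P[i] or becomes empty.
        while border > 0 and P[i] != P[border]:
            border = s[border - 1]
        if P[i] == P[border]:
            border += 1
        else:
            border = 0
        s[i] = border
    return s

# compute_prefix_function("abababcaab") == [0, 0, 1, 2, 3, 4, 0, 1, 1, 2]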
And we state that this algorithm runs in linear time. Why is that? Well, let's first forget for a moment about the inner while loop. Then what we are left with is the initialization, which obviously works in linear time, and the for loop with a linear number of iterations. And everything inside the for loop, except for the while loop, for which we don't account now, runs in constant time. So everything but the inner while loop works in linear time. Now, on each iteration of the for loop, the while loop could in theory take as many as a linear number of iterations, and if we summed all that up, that would be quadratic in the length of the pattern. So now we will bound the total number of while loop iterations by big O of the length of the pattern.
To do that, we will consider the graph of the values of the prefix function. Here, the horizontal axis contains positions in the string, and the vertical axis contains values of the prefix function. The graph always starts at the point (0, 0), because s of 0 is 0.
And it can stay the same, it can increase, it can decrease, but it never becomes less than zero because
prefix function is always non-negative by definition. And also, it can never increase by more than one
on each iteration. And our variable border in the code behaves the same way.
It can be increased by at most one on each for loop iteration. But after each successful iteration of the inner while loop it gets decreased by at least one, because we consider the current border. We
look at the next character and if it's different from the character in position i we take the longest
border of our current border which is strictly shorter. So our border value decreases.

And so all in all, border can increase at most length of the pattern times.
And it is decreased by at least 1 on each iteration of the while loop.
And border is also always non-negative. It means that it can be decreased at most O(length of the pattern) times, and so there is at most a linear number of while loop iterations in total. Now you know how to compute the prefix function efficiently, in time linear in the length of the pattern. But how to actually solve the initial problem, how to find the pattern in the text? That you will finally learn in the next video.

Knuth-Morris-Pratt Algorithm
Hi, in this video, you will learn the Knuth-Morris-Pratt algorithm, which allows us to find all the occurrences of a pattern in the text in time linear in the length of the pattern and the length of the text. So instead of the product of those lengths, as in the brute force algorithm, we will need just the sum of those lengths to find all the occurrences.
The algorithm goes as follows. First we create a new long string S which consists of the pattern, then the text, and between them we insert a special character called dollar, which is basically any character that is absent from both the pattern and the text. It doesn't have to be the literal dollar character; it is just a placeholder for some character that is absent from both the pattern and the text.
After we've assembled this long string S, we need to compute its prefix function. And after we've computed its prefix function, we need to look at the positions in the string S which are inside the text part of it. So we'll look at all positions i such that i is greater than the length of the pattern, so after the pattern and after the dollar; and if the prefix function for that position is equal to the length of the pattern, then we know that there is an occurrence of the pattern in the text ending at that position. For example, here we have a prefix function value of four, and we have an occurrence of the pattern ending at the corresponding position.
We need to find all the positions where the pattern starts in the text. So from the position where it ends, we need to compute the position where it starts, and to do that we need to subtract the length of the pattern minus 1; but that would be the position in S. To compute the position in the text, we also need to subtract the length of the pattern and one more for the dollar. So in total, we need to subtract twice the length of the pattern from the position in the string S to find the starting position of the pattern in the initial text. And there is another place where the prefix function of the string S is equal to the length of the pattern, and again there is an occurrence of the pattern ending at that position. But why does this algorithm even work?

First, we need to notice that the prefix function of this string S is always less than or equal to the length of the pattern. That is because the dollar sign occurs right after the end of the pattern, so if a border were longer, we would need another occurrence of dollar in the string S. But dollar appears only between the pattern and the text and is absent from both. So the prefix function cannot be bigger than the length of the pattern.
Now look at a position i which is to the right of the dollar and where the prefix function is equal to the length of the pattern. That means that the pattern is a border of the corresponding prefix of S, and so it ends at position i. We only need to determine the position at which it starts in the text T, and to do that we need to do a few computations, and we will see that this position is i minus twice the length of the pattern.
However, if the prefix function at some position i is strictly less than the length of the pattern, then it means that the pattern doesn't end at that position in the string S, and that means that it doesn't end at the corresponding position in the text. So we have found all the positions at which the pattern ends in the text, and of course we have also found all the positions where it starts in the text, by subtracting twice the length of the pattern from each such position.

So the code for this algorithm is already pretty simple. We take as input a pattern P and a text T; we assemble the string S by concatenating the pattern, the special symbol dollar, and the text; then we compute the prefix function of this long string S. We initialize the resulting list of positions where the pattern occurs in the text, and we go through all the positions i in S which are to the right of the dollar sign.
And if at some position we see that the value of the prefix function is the same as the length of the pattern, we just append i minus twice the length of the pattern to the result. And we return this resulting list of positions in the end.
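
Here is a Python rendering of the same algorithm, with the prefix function computation inlined so the sketch is self-contained:

def find_pattern(pattern, text):
    # Knuth-Morris-Pratt: all starting positions of pattern in text.
    S = pattern + '$' + text                  # '$' must occur in neither string
    s = [0] * len(S)                          # prefix function of S
    border = 0
    for i in range(1, len(S)):
        while border > 0 and S[i] != S[border]:
            border = s[border - 1]
        border = border + 1 if S[i] == S[border] else 0
        s[i] = border
    result = []
    for i in range(len(pattern) + 1, len(S)):
        if s[i] == len(pattern):
            result.append(i - 2 * len(pattern))
    return result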
This algorithm is already pretty simple, and it works in time proportional to the sum of the lengths of the pattern and the text.
To prove that: the string S can be built in time proportional to the sum of the lengths of the strings P and T; computing the prefix function is done in time proportional to the length of S; and the for loop runs through part of the string, so it also runs in time proportional to the sum of the lengths of the pattern and the text. In conclusion, you now know the Knuth-Morris-Pratt algorithm for exact pattern matching. You can find all occurrences of a pattern in a text in time linear in the lengths of the pattern and the text. You can also compute the prefix function of any string in linear time, and you can go through all the borders of any string in order of decreasing length using the prefix function. In the next lessons, we will learn how to build the suffix array and the suffix tree in time that will allow you to find many different patterns in the same text even faster than with algorithms like Knuth-Morris-Pratt.

QUIZ • 30 MIN

Exact Pattern Matching


Programming Assignment 3 lasts for two weeks

We intend for you to solve Programming Assignment 3 during this week and the next week; you can already access it now. You should be ready now to solve the first problem of Programming Assignment 3, "Find all Occurrences of a Pattern in a String". To solve the other problems, you should first go through the lectures and readings of the next module (and please have a look at the pseudocode provided in the readings before starting to work on the Programming Assignment).

Slides and External References

Download the slides on Knuth-Morris-Pratt algorithm:

14_algorithmic_challenges_1_knuth_morris_pratt.pdf (PDF)

References
See chapters 1, 2.3, 3.3 in [G97] Dan Gusfield. Algorithms on Strings, Trees and Sequences:
Computer Science and Computational Biology (1st Edition). Cambridge University Press. 1997.

See this visualization/tracing of the Knuth-Morris-Pratt algorithm.

WEEK 4
5 hours to complete
Constructing Suffix Arrays and Suffix Trees

In this module we continue studying algorithmic challenges of the string algorithms. You will learn
an O(n log n) algorithm for suffix array construction and a linear time algorithm for construction of
suffix tree from a suffix array. You will also implement these algorithms and the Knuth-Morris-Pratt
algorithm in the last Programming Assignment in this course.

11 videos (Total 76 min), 5 readings, 2 quizzes


11 videos
Suffix Array · 6m
General Strategy · 6m
Initialization · 9m
Sort Doubled Cyclic Shifts · 8m
SortDouble Implementation · 6m
Updating Classes · 8m
Full Algorithm · 3m
Suffix Array and Suffix Tree · 8m
LCP Array · 5m
Computing the LCP Array · 6m
Construct Suffix Tree from Suffix Array and LCP Array · 6m
5 readings
Counting Sort · 10m
Slides and External References · 2m
Computing the LCP Array - Additional Slides · 10m
Suffix Tree Construction - Pseudocode · 10m
Slides and External References · 2m
1 practice exercise
Suffix Array Construction · 12m

Suffix Array Construction

Suffix Array
Hi, in this lesson you will learn how to build the suffix array of a string in time O(n log n).
Suffix arrays are a useful data structure that you already used in the previous modules, but now you will learn how to build them really fast. First, we'll recall what a suffix array is. The problem of constructing a suffix array is very simple: you're given a string, and you need to sort all of its suffixes in lexicographic order. However, as we will soon see, you won't need to actually write out all the suffixes, sort them, and output all of them, because that would use too much time and memory. You will just need to know in which order those suffixes go. The suffixes themselves, sorted in lexicographical order, are only in our head; they are not stored anywhere in the problem. We assume that the alphabet from which our strings are built is ordered, so that for any two characters we can say which one of them is smaller. For example, in English we can order all the characters from a to z; in a binary alphabet we just have zero and one, and zero is less than one.
By definition a string S is smaller than a different string T if either S is a prefix of T or S and T coincide
from beginning up to some character and then the next character in S is smaller than the
corresponding character in T. For example, if s is ab, and t is bc. Then they don't coincide. But the first
character is already different. And the character in s is a. And the character in t is b. A is less than b, so
s is less than t. And in the second example, s and t coincide for the first two characters. And then the
third character c is less than character d. So s is smaller than t. And in the third case, s is a prefix of t,
but it is different from t, so s is smaller than t.
And here is an example of a suffix array. We have a string s, and all suffixes ordered in lexicographic order are a, aa, and so on. There are exactly six suffixes, because the length of the string S is six, and so we have six different suffixes. We want to avoid the case when S is a prefix of T and that is why S is less than T, because this case is different from all others: usually you just compare S and T from the first character and go to the right until they differ, and then see which character is smaller. This is a corner case where you get to the end of S, and then you see that there is nothing there, and that is why S is smaller. To avoid using that rule at all, we will append a special character called dollar to the end of the string for which we'll build the suffix array. So all the suffixes will have this dollar at the end. And now, if initially some suffix was a prefix of another suffix, it is just smaller by the usual rule, because as soon as it ends, while still coinciding with a prefix of the bigger suffix, the next character in the smaller one is the dollar, which is smaller than all other characters. And so we can determine by the usual rule that the smaller suffix is actually smaller.

So how does the suffix array change in this case? We have the initial string S, ababaa, and we append dollar to the end and get S prime. Now all the suffixes of string S prime in lexicographic order are $, a$, aa$ and so on. And if we just remove the dollar from the end of each of these strings, we will have the suffix array of the initial string s, preceded by an empty string.
So building a suffix array for S prime gives us a suffix array for the string S right after removing the empty string from it.
What about storing the suffix array? Suppose we have some algorithm to compute it; how are we going to store it? We want our suffix array to be stored in linear memory, but the total length of all suffixes is the sum of an arithmetic progression from one to the length of the string, which is quadratic. So it would take too much memory to store the suffixes themselves, even if we could compute them fast. Instead, we need to store only the order of the suffixes, not the suffixes themselves. And the order is just a permutation of numbers, and the number of those numbers is the length of the string, so that will take only linear memory. That is what we mean by suffix array: this order. It will be just an array of positions, where all the positions are from 0 to the length of the string minus 1, and the array has length equal to the length of the string. Now let's look at an example of such an order.
So we have the initial string S, which is ababaa$, and we number all the suffixes by their starting positions. For example, ababaa$ is 0 and abaa$ is 2. We will store the order of the suffixes in an array called order. The smallest suffix is just $, because $ is smaller than any other character, so the first suffix in the order is suffix number 6. The next one is a$, which is number 5, then aa$, which is number 4, then abaa$, which is 2, then ababaa$, which is 0, then baa$, which is 3, and then babaa$, which is 1. So this is the kind of array which we call the suffix array: it is the order of all the suffixes of the initial string, and we don't store the suffixes themselves. However, if we need to look at, for example, the third character of the second suffix in the order, we can first go into the array order, find out which suffix is number two, and that will be the first position of that suffix in the string. If we then add two to that, we'll get the character at position two in that suffix. So in theory we can look at any character of any suffix really efficiently, although we don't store those suffixes directly. Okay, now you know how to store the suffix array and how to manipulate it efficiently. But you probably wonder how to actually construct it, and you will learn that in the next few videos.
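
Before moving on to the fast construction, a quadratic-time sketch makes the definition concrete (fine for short examples, far too slow and memory-hungry for long strings):

def naive_suffix_array(s):
    # Sort suffix starting positions by comparing the suffixes themselves.
    return sorted(range(len(s)), key=lambda i: s[i:])

# naive_suffix_array("ababaa$") == [6, 5, 4, 2, 0, 3, 1]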
General Strategy
In this video you will learn, in general, how to efficiently construct a suffix array: what the steps and the substeps are. We will work out the details in the next few videos.
But first, we need to go from suffixes to cyclic shifts. A cyclic shift of a string is a string we get if we write our initial string around a circle and then start from any position and go through the whole circle. So for the initial string ababaa$, we can write down all the seven characters around the circle, and then we will have the following cyclic shifts: the initial string; the string babaa$a, which we get if we start from the character b; and so on, moving around the circle. So these are the seven cyclic shifts of the initial string.
Now, let's see what happens if we sort cyclic shifts instead of suffixes. These are all the cyclic shifts; suppose we somehow manage to sort them in lexicographic order, and then we remove all the characters after the first occurrence of $ in each of those cyclic shifts. What we get is actually the suffix array we wanted: all the strings in the third column are suffixes of S in sorted order.
This is actually always true. We have a lemma: after adding to the end of a string S a character $ which is smaller than all other characters in that string, sorting the cyclic shifts of the string and sorting the suffixes of the string are equivalent. We won't prove this lemma; we'll leave it as an exercise and just use it in what follows.
Apart from full cyclic shifts, which go the full circle starting from some position of the circular string S, we'll also need partial cyclic shifts, and those are basically substrings of the cyclic string S. You can take any position in the cyclic string and go a few characters clockwise from there, and you will get a partial cyclic shift. For example, if we take the same initial string and want to build all cyclic shifts of length 4, you can start from the first character a and you will get abab, or you can start from b and get baba, and so on. We will again have seven different cyclic shifts of length four, ending with $aba. Note that we only go in the clockwise order, because this is the order in which our cyclic string S is written around the circle.
So the general strategy for constructing a suffix array of the string S is as follows. We start with a simple task: we sort all the single characters of the string S, and those single characters are actually partial cyclic shifts of length one. So we assign L = 1, and now we have our base: we have sorted all the cyclic shifts of length L. Then we do iterations while L is still less than the length of the string: we use our order of cyclic shifts of length L to sort the cyclic shifts of twice that length, and we do that efficiently using the order of the current cyclic shifts of length L. Thus we double the length of the sorted shifts, and then we double it again, and so on, and at some point L will be greater than or equal to the length of the initial string S. Then we will have sorted the cyclic shifts of length greater than or equal to the length of the initial string, and the order of those cyclic shifts is the same as the order of the full cyclic shifts of the initial string. So this is the general strategy for building the suffix array of the string.
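
To make this strategy concrete before diving into the details, here is a simplified sketch of the doubling idea. It deliberately uses Python's built-in comparison sort instead of the counting-sort-based routines developed in the next videos, so it runs in O(n log^2 n) rather than O(n log n), but the overall structure (sort shifts of length L, then double L) is the same.

def build_suffix_array_doubling(s):
    # Simplified doubling construction of the suffix array of s (s ends in '$').
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i])       # shifts of length 1
    class_ = [0] * n
    for k in range(1, n):
        class_[order[k]] = class_[order[k - 1]] + (s[order[k]] != s[order[k - 1]])
    L = 1
    while L < n:
        # Rank a shift of length 2L by the classes of its two halves.
        def key(i):
            return (class_[i], class_[(i + L) % n])
        order = sorted(range(n), key=key)
        new_class = [0] * n
        for k in range(1, n):
            new_class[order[k]] = new_class[order[k - 1]] + (key(order[k]) != key(order[k - 1]))
        class_ = new_class
        L *= 2
    return order

# build_suffix_array_doubling("ababaa$") == [6, 5, 4, 2, 0, 3, 1]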
Now let's look at an example of applying this general strategy in practice. We use the same string in all the examples. So this is our string, and we start with sorting the partial cyclic shifts of length 1, which are just the single characters of the string. The smallest is $, then come four instances of the letter a and then two instances of the letter b. And 6 is the position of the dollar in the string, 0, 2, 4 and 5 are the positions of the letters a, and 1 and 3 are the positions of the letters b. In this case, the order of the partial cyclic shifts is 6, 0, 2, 4, 5, 1 and 3.
On the next step, we go from cyclic shifts of length 1 to cyclic shifts of length 2, and their order changes. First goes $a, because $ is the smallest character, and then goes a$, and then aa, and then two instances of ab, and then two instances of ba.
And the order in this case is already 6, 5, 4, 0, 2, 1, 3, and that's because dollar is in position 6. And
then a, after which there is a $, is in position 5. And then a, after which there is an a, is in position 4,
and so on.
On the next step we go from cyclic shifts of length 2 to cyclic shifts of length 4, and we get $aba and so on up to baba. The order changes again; it is 6, 5, 4, 2, 0, 3, 1. On the last step we go from these partial cyclic shifts of length 4 to partial cyclic shifts of length 8, which are already longer than the initial string. We start from $ababaa$ and end with babaa$ab, and the order actually didn't change in this case. Now we have the order of the cyclic shifts of length 8, which is the same as the order of the full cyclic shifts of the string s, which is in turn the same as the order of the suffixes of this string. So if we remove everything after the first occurrence of $ in all those partial cyclic shifts, we'll get the order of the suffixes of the initial string s, and so order is now our suffix array. In the next video we'll start working through the details of this general strategy.
Initialization
Hi, in this video you'll learn the algorithms used in the initialization phase of the suffix array construction, and those are sorting the single characters of the initial string and also computing the equivalence classes of those characters.
First, we assume the alphabet is finite, and we can thus use counting sort to compute the order of the characters. You probably remember counting sort from the first course, Algorithmic Toolbox. If you don't, I encourage you to go through those lectures once again, because we will use counting sort twice in the construction of the suffix array: in the initialization phase and in a later phase. Here is the pseudocode for the procedure SortCharacters, which takes a string as input and returns the order of the characters of that string as output. We start with initializing the order as an array of size equal to the length of the string. We'll also need another array, count, used in the counting sort, which we initialize with zeroes and which has size equal to the size of the alphabet (not the size of the string, but the size of the alphabet).
Then, what you have in the following two for loops is just the familiar code for the precomputation phase of the counting sort: we count the number of occurrences of each of the characters in the string, and then we also compute the partial sums of that array. In the end, we go from right to left in our string S. We look at the character, and we know that the partial sums array contains the position right after the position where this character should go in the order. So we decrease the counter by one and save our character's position in the corresponding cell of the array order, and in the end we just return the order of the characters. So this is just an implementation of counting sort as applied to the characters of the string S.
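
A Python rendering of SortCharacters follows; passing the alphabet in explicitly is an illustrative choice (the lecture's pseudocode assumes a fixed, known alphabet).

def sort_characters(s, alphabet="$abcdefghijklmnopqrstuvwxyz"):
    # Counting sort of the characters of s; order[k] is the position in s
    # of the k-th smallest character.
    order = [0] * len(s)
    count = [0] * len(alphabet)
    pos = {c: j for j, c in enumerate(alphabet)}
    for c in s:
        count[pos[c]] += 1
    for j in range(1, len(alphabet)):
        count[j] += count[j - 1]            # partial sums
    for i in range(len(s) - 1, -1, -1):     # right to left keeps the sort stable
        c = pos[s[i]]
        count[c] -= 1
        order[count[c]] = i
    return order

# sort_characters("ababaa$") == [6, 0, 2, 4, 5, 1, 3]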

And it works in time proportional to the length of the string plus the size of the alphabet, because we know that this is the running time of counting sort for |S| items, each of which can take only as many different values as the size of the alphabet. I need to note here that typically the size of the alphabet is small: for example, four letters for strings in a genome, or 26 characters if we are only working with English words, or maybe alphanumeric characters, in which case there will be 26 small letters, 26 capital letters, and 10 digits. But sometimes the alphabet can be very, very big, such as Unicode, and in that case counting sort might not be appropriate. If your string, for example, has only 1000 characters, but those are all Unicode, and the alphabet size is a few million characters, then maybe you could sort the characters of this string in a more efficient way.
Play video starting at 3 minutes 13 seconds and follow transcript3:13
Apart from sorting the characters, we will also need additional information to make the following steps of the algorithm more efficient. And to do that, we introduce equivalence classes of the partial cyclic shifts. So we denote by C with index i the partial cyclic shift of length L starting at position i, where L is the current length of the cyclic shifts which we already have sorted. Initially, we have sorted single characters, so L is equal to 1. Then, in the further phases of the algorithm, L will increase from one to two, to four, and so on, doubling in each iteration. Now, some of the cyclic shifts can be equal to cyclic shifts starting at different positions: Ci can be equal to Cj, and then they should be in the same equivalence class. So to assign equivalence classes, we define the array class, and class[i] is equal to the number of different cyclic shifts of length L that are strictly smaller than the cyclic shift starting at position i.
Play video starting at 4 minutes 25 seconds and follow transcript4:25
So for two cyclic shifts which are equal, the values of class[i] and class[j] will be the same, because the same other cyclic shifts are smaller than these two equal cyclic shifts.
Play video starting at 4 minutes 40 seconds and follow transcript4:40
And we'll need to compute this array class to increase the speed of the next phase. And before
computing this array class, we assume that we have already sorted all the cyclic shifts of the current
length L.
PPT Slides
So, how do we actually compute the classes of the cyclic shifts when we already know their order? Let's look at the example of the sorted characters of the string. We know already that the characters are sorted, and their order is 6, 0, 2, 4, 5, 1, and 3. Now let's assign classes. We want to assign class 0 to the smallest of the cyclic shifts of the current length, which is the dollar sign, which is at position six. So we write 0 at position six of the class array, which we initially set up to be of length equal to the length of the string, of course. The next smallest cyclic shift is the letter a, and it is different from the previous smallest one, which is the dollar sign. So we need a new equivalence class for a, and we assign 1 to the equivalence class of this a, which is at position 0 in the initial string. So we assign 1 to class[0]. The next one is also a, which is at position two, but it is equal to the previous one, so we assign the same equivalence class to it and write down 1 as the value of class[2]. The next one is also a; it is also equal to the previous one, so we assign 1 to class[4]. And we do the same with class[5].
Play video starting at 6 minutes 27 seconds and follow transcript6:27
And the next one is b, which is different again from the previous one, and so we assign a new class which is bigger by one, which is two. We find this b at position 1, so we assign 2 to class[1]. The last one is also b; it is equal to the previous one, so again we assign 2 to class[3]. And now we know the classes of all the single-character cyclic shifts. We know that the smallest one is the dollar sign, and it is the only one whose class equals 0. We know that the four a's are in equivalence class 1, and we know that the two b's are in equivalence class 2.

Play video starting at 7 minutes 11 seconds and follow transcript7:11


Here is the pseudocode for the algorithm ComputeCharClasses, which takes as input the string S and the order of its characters, and computes the equivalence classes of the single-character cyclic shifts of the string S, given their order. So we initialize class as an array of size equal to the length of the string S, and that will be our return value. We also initialize the first value of this class array. But we don't initialize class[0]; we initialize class[order[0]], because order[0] is the position at which the smallest character of the string occurs, and we initialize this character with class 0. So we assign 0 to class[order[0]], saying that the character at position order[0] has equivalence class 0. Then, starting from the second character in the order and up to the end, we go through the characters of the string in order and assign classes. To assign a class to a new character, we compare it with the previous one in the order. If it's different from the previous one, it means it's bigger, because we go through them in order, and so we need a new class. So we just take the class of the previous character, increase it by one, and assign it to the class of this character. Otherwise, if this character is the same as the previous one, we don't need to create a new class; we just assign it the same class as the class of the previous character.
Play video starting at 8 minutes 44 seconds and follow transcript8:44
And in the end, we return the array with the classes.
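Here is a matching Python sketch of ComputeCharClasses; the function and variable names are mine, but the logic follows the pseudocode just described.

def compute_char_classes(s, order):
    # Equivalence classes of the single-character cyclic shifts, given their order.
    cls = [0] * len(s)
    cls[order[0]] = 0                      # the smallest character gets class 0
    for i in range(1, len(s)):
        if s[order[i]] != s[order[i - 1]]:
            cls[order[i]] = cls[order[i - 1]] + 1   # strictly bigger character: new class
        else:
            cls[order[i]] = cls[order[i - 1]]       # equal character: same class
    return cls

For ababaa$ with the order above, this gives class 0 to the dollar at position 6, class 1 to the a's at positions 0, 2, 4 and 5, and class 2 to the b's at positions 1 and 3.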
And we state that the running time of this algorithm is linear, which is obvious, because we only have the initialization of the array and then a for loop, which runs for a linear number of iterations with a constant number of actions performed in each iteration. And that's all for the initialization phase of the suffix array construction. In the next video, we'll learn the transition phase from the current length to twice the length of the cyclic shifts.

Counting Sort

Do you remember the Counting Sort algorithm from the Algorithmic Toolbox class? Here is its pseudocode with comments:
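As a hedged stand-in for that pseudocode, here is one way to write a stable counting sort in Python: it sorts items by an integer key from a small range, which is exactly the form used later in this module. The function name and parameters are of my choosing.

def counting_sort_by_key(items, key, num_values):
    # Stable counting sort: sorts items by key(item), where the keys are
    # integers in the range 0..num_values-1. Equal keys keep their input order.
    count = [0] * num_values
    for it in items:
        count[key(it)] += 1                # count occurrences of each key
    for v in range(1, num_values):
        count[v] += count[v - 1]           # partial sums
    result = [None] * len(items)
    for it in reversed(items):             # the right-to-left pass makes the sort stable
        k = key(it)
        count[k] -= 1
        result[count[k]] = it
    return result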
A simple but crucial observation: it is stable, which means that it keeps the order of equal elements. Of course, it doesn't matter for the sorting algorithm itself in what order it puts equal elements: they can go in any order in the sorted array. But for some algorithms that use sorting it is important, as we will see in the following lecture. If you sort an array which has equal elements using Counting Sort, and one of two equal elements was before the other one initially in the array, it will still go first after sorting. Also see this answer for an example of the difference between a stable and a non-stable sorting algorithm.

Note that we can sort not only integers using Counting Sort. We can sort any objects which can be numbered, if there are not many different objects. For example, if we want to sort characters, and we know that the characters have integer codes from 0 to 255, such that smaller characters have smaller integer codes, then we can sort them using Counting Sort by sorting the integer codes instead of the characters themselves. The number of possible values for a character is different in different programming languages, so find out what the range of integer codes for characters is in your programming language of choice before using this in a Programming Assignment!
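For example, in Python the built-in ord() returns a character's integer code, so (reusing the counting_sort_by_key sketch above, and assuming plain ASCII input, for which 128 code values are enough) one could sort the characters of a string like this:

def sort_ascii_chars(s):
    # Stable sort of the characters of an ASCII string by their integer codes.
    return "".join(counting_sort_by_key(list(s), key=ord, num_values=128))

print(sort_ascii_chars("banana$"))   # "$aaabnn", since '$' has a smaller code than the letters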
Sort Doubled Cyclic Shifts

Hi, in this video you will learn how to implement the transition phase of the suffix array construction algorithm. In the transition phase, you assume that you have already sorted the cyclic shifts of some length L, and you know not only their order but also their equivalence classes, and you need to sort, based on that, the cyclic shifts of length 2L. The main idea is the following. Let's denote by Ci the cyclic shift of length L starting at position i, and by Ci prime the doubled cyclic shift starting at i, that is, the cyclic shift of length 2L starting at position i. Then Ci prime is equal to Ci concatenated with Ci+L: we just take string Ci, we take string Ci+L, and put it after string Ci, and the total string of length 2L is equal to the string Ci prime. So to compare Ci prime with Cj prime, it's sufficient to separately compare Ci with Cj, and Ci+L with Cj+L. And we already know the order of the cyclic shifts of length L. So instead of comparing them directly, we can just look in the array with their order and determine which one comes before which, and that one is going to be smaller than or equal to the other one. We also have the array with equivalence classes, and so we can determine whether two cyclic shifts of length L are really equal, or whether they're different, by looking in the array of equivalence classes and comparing their equivalence classes. So basically we can compare two cyclic shifts of length L in constant time, and that is why we can sort the doubled cyclic shifts faster.
Play video starting at 1 minute 58 seconds and follow transcript1:58
For example, suppose S is our initial string, ababaa$, the current length L is 2, and the position i is also 2. Then Ci is C2, the cyclic shift starting at position 2, which is ab. Ci+L is C2+2, which is C4, which is aa.
Play video starting at 2 minutes 20 seconds and follow transcript2:20
And Ci prime is equal to abaa. So this is the connection between cyclic shifts of length L and 2L, and how we combine C2 and C4 to get C2 prime. So now we have to think about the following problem. We basically need to sort pairs of numbers, because each cyclic shift of length L corresponds to its position number in the order of all cyclic shifts of length L. We first need to sort by the second element of the pair, and then stable sort by the first element of the pair. If we do these two steps, then our pairs will be sorted, because they will be sorted by the first element, and inside groups of equal first elements they will be sorted by the second element, because they were initially sorted by the second element and the sort is stable, so we didn't break the order of the second elements in the case when the first elements are the same. So this is the idea for sorting pairs of objects.
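A tiny illustration of this two-pass idea in Python: the built-in sorted() is guaranteed to be stable, so it can stand in for the stable sort here (in the actual algorithm we will use counting sort instead).

pairs = [(2, 1), (1, 2), (2, 0), (1, 0)]

by_second = sorted(pairs, key=lambda p: p[1])        # sort by the second element
fully_sorted = sorted(by_second, key=lambda p: p[0]) # stable sort by the first element

print(fully_sorted)   # [(1, 0), (1, 2), (2, 0), (2, 1)]: the pairs are now fully sorted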
PPT Slides
Play video starting at 3 minutes 36 seconds and follow transcript3:36
And let's look at this example. So let's suppose our current length is 2, and we already sorted all the
cyclic shifts of length 2, and they are to the right in the sorted order.
Play video starting at 3 minutes 49 seconds and follow transcript3:49
Now for each of the cyclic shifts of length 2, let's look at the cyclic shift of length 4 which ends with this cyclic shift of length 2. So we take the two previous characters and add them to the left: C4 prime ends with C6, and similarly, for C5, we take the two previous characters, and C3 prime ends with C5, and so on. So we go two characters to the left from each of the cyclic shifts of length 2, and we get
a set of cyclic shifts of length 4. Now we have highlighted in yellow the first elements of the pairs,
which are also cyclic shifts of length 2. Those are not sorted, but we know their starting positions and
we know what are the correct starting positions in the sorted order. So we can reorder this list of
cyclic shifts of length 4 by the order of the first halves of the elements in this list, using the known order. And we will need to do so in a stable-sort fashion, so that if, for example, C2 prime is before C0 prime, and the first halves of C2 prime and C0 prime are the same, they stay in the same order in the final sort.
Play video starting at 5 minutes 23 seconds and follow transcript5:23
And the same goes for C3 prime and C1 prime. They both start with ba. So when we sort by the first half, C3 prime has to stay before C1 prime. That's our requirement. So suppose we managed to sort the first halves in such a way, and we sorted the whole cyclic shifts of length 4 accordingly. What do we get?
We actually get the sorted list of cyclic shifts of length 4. And of course for those which differ in the
first half, it's obvious that they compare in the correct order. But for those which are the same in the
first half, their second half is also sorted because it was sorted initially in the second column and we
implemented a stable sort. So C2 prime is still before C0 prime, and C3 prime is still before C1 prime.
So this is the idea.

Play video starting at 6 minutes 25 seconds and follow transcript6:25


For sorting doubled cyclic shifts we take Ci prime, which is the doubled cyclic shift starting at position i, and we know that it is a pair of Ci and Ci+L. And we already know that the single cyclic shifts are sorted: C order[0] is the smallest one, and C order[|S| - 1] is the biggest one. Now let's take the doubled cyclic shifts starting exactly L to the left, counterclockwise, from those. Then C prime order[0] - L, C prime order[1] - L, and so on are sorted by the second element of the pair of single cyclic shifts, which is already sorted. And when I decrease order[0] by L, I mean decrease modulo the length of the string, because this is a cyclic string, so we need to do a cyclic subtraction. So we get these C prime order[0] - L and so on sorted by the second element of the pair, and we only need a stable sort by the first elements of the pairs. But we know that counting sort is stable, and we know the equivalence classes of the single shifts for the counting sort. And we know that there are not many different single shifts: the number of different single shifts is at most the length of the string. So we can again use counting sort to sort them, using the
Play video starting at 7 minutes 57 seconds and follow transcript7:57
equivalence classes of the single shifts.

SortDouble Implementation

What will be the order of values assigned to the variable start in this example?

[4,3,2,5,0,1,6]
[6,5,4,3,2,1,0]
[1,3,0,2,4,5,6]
[6,1,0,5,2,3,4]

The option [4,3,2,5,0,1,6] should not be selected: the variable start goes through the starts of the doubled cyclic shifts in the reversed order of their sorted second halves, not in the direct order.

Now let's consider the pseudocode for the procedure SortDoubled, which sorts the doubled cyclic shifts of length 2L, given the string S, the current length L, the order of the current shifts of length L, and their equivalence classes in the array class. We start with initializing the array count with a zero array of size equal to the length of the string. This is the standard array for counting sort, but as opposed to the SortCharacters procedure, it will sort not characters but equivalence classes of the cyclic shifts of length L. And there are at most as many different equivalence classes as the length of S, and that's why we initialize the array with size equal to the length of the string, as opposed to SortCharacters, where we initialized it with the size of the alphabet.
Play video starting at 54 seconds and follow transcript0:54
We'll also need another array, newOrder, which will store our answer: it will be the order of the sorted doubled cyclic shifts. We initialize it as an array of size equal to the length of S. The next two for loops are the standard for loops of the counting sort, where we first count the number of occurrences of each equivalence class of the single cyclic shifts, and then we compute the partial sums of that counting array. The last for loop of the counting sort needs to go through the array we are going to sort from the end to the beginning, and that is important for the sort to be stable. So we need to go through the array of doubled cyclic shifts, which are initially sorted by their second half, in the reverse order.
Play video starting at 1 minute 45 seconds and follow transcript1:45
But we don't want to actually build this array of doubled cyclic shifts and then go through it in reverse order. We want to only build this array in our heads, and in the code we just want to go through this array in the reverse order. So how do we do that? Remember that we have the array order,
and if we go in the direct order of this array, we'll go through all the cyclic shifts of length L in
increasing order.
Play video starting at 2 minutes 13 seconds and follow transcript2:13
What we need instead is, first, to go not through the cyclic shifts of length L, but through the cyclic shifts of length 2L which start exactly L counterclockwise from those.
Play video starting at 2 minutes 28 seconds and follow transcript2:28
And that is why we decrease order[i] by L, add the length of the string, and take the result modulo the length of S, just because we're going around a circle.
Play video starting at 2 minutes 40 seconds and follow transcript2:40
And we need to go downwards from the last i to the first i, because we need to go in the reverse order. So these two lines, for i from the length of S minus 1 down to 0, and the last line, which assigns the variable start to (order[i] - L + |S|) mod |S|, basically go with the variable start in the reverse order through the array of doubled cyclic shifts sorted by their second half.
Play video starting at 3 minutes 16 seconds and follow transcript3:16
So start goes through the starts of those double cyclic shifts in the reverse order. Now, everything else
that happens in this for loop is just regular counting sort. We take the class of this start position,
which is the class of the first half of the corresponding doubled shift by which we want to sort.
Play video starting at 3 minutes 44 seconds and follow transcript3:44
Then we go and decrease the partial sum corresponding to that equivalence class in our counting
array.
Play video starting at 3 minutes 52 seconds and follow transcript3:52
And then we just put our start in the position which the counting sort prescribes to it.
Play video starting at 4 minutes 1 second and follow transcript4:01
So these three lines, getting the class of the start position, decreasing the partial sum, and assigning start to the position given by count[class], are the three standard lines of the counting sort. The subtlety here is that start goes in the reverse order through the array of doubled cyclic shifts sorted by their second half, and that instead of comparing characters or something else, we compare equivalence classes of the single cyclic shifts.
Play video starting at 4 minutes 44 seconds and follow transcript4:44
So this is what this last for loop does. And in the end what we have is the array newOrder, which contains the doubled cyclic shifts which were initially sorted by their second half and which we then sorted, with counting sort, by their first half.
Play video starting at 5 minutes 6 seconds and follow transcript5:06
And so now they are sorted by the first half, and the counting sort was stable, so in the case when their first halves are the same, they're also sorted by the second half, because they were sorted by the second half initially. So newOrder finally contains all the doubled cyclic shifts in the correct, sorted order.
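Here is a minimal Python sketch of SortDoubled as just described; the names are mine, and it reuses the conventions of the earlier sketches (order and cls for the order and classes of the length-L shifts).

def sort_doubled(s, L, order, cls):
    # Order of the cyclic shifts of length 2L, obtained by one stable counting
    # sort by the class of the first half; the second halves are already sorted.
    n = len(s)
    count = [0] * n                        # at most n different equivalence classes
    new_order = [0] * n
    for i in range(n):
        count[cls[i]] += 1
    for j in range(1, n):
        count[j] += count[j - 1]           # partial sums
    for i in range(n - 1, -1, -1):         # reverse order keeps the sort stable
        start = (order[i] - L + n) % n     # doubled shift whose second half is C_order[i]
        c = cls[start]                     # class of its first half
        count[c] -= 1
        new_order[count[c]] = start
    return new_order

On the example from this lesson, with s = "ababaa$", L = 1, order = [6, 0, 2, 4, 5, 1, 3] and the single-character classes, this returns [6, 5, 4, 0, 2, 1, 3].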
Play video starting at 5 minutes 31 seconds and follow transcript5:31
So this is the function that sorts all the doubled cyclic shifts. And the running time of this procedure is linear, because this is basically the regular counting sort. Although it sorts very complex objects, in practice, in the code, it just sorts integers, the equivalence classes of the single cyclic shifts, and it does so in the running time of the counting sort, which is proportional to the number of items plus the number of different values. The number of items is equal to the length of the string, and the number of different values of the classes is also at most the length of the string. So all in all, those three for loops run in linear time. In the next video, we will talk about how to update the classes of those doubled cyclic shifts after they are sorted, and how to finally build the suffix array from scratch.
Updating Classes
Hi, in this video you will learn how to update the equivalence classes of the double cyclic shifts after
sorting them. And that will be the last step before we can actually present the whole algorithm for
building the suffix array. So to update classes, we need to compare the pairs of single shifts which
constitute the double cyclic shifts which we have just sorted. We have already sorted the pairs. So, we
just need to go through them in order and compare each pair to the previous pair. If it's the same,
then we need to assign it to the same class. If it's bigger, then we need to create a new class and
assign it to this pair. To compare the pairs, we can compare them separately by first element and then
by second element. Of course the elements of the pairs are cyclic shifts and we don't want to
compare them directly, character by character. But we already know the equivalence classes of the single cyclic shifts, and we can just compare the equivalence classes instead of the cyclic shifts themselves. So we can compare any two pairs of single cyclic shifts in constant time.
Play video starting at 1 minute 13 seconds and follow transcript1:13
Let's look at an example.
PPT Slides
Play video starting at 1 minute 15 seconds and follow transcript1:15
S is our initial string and suppose we've already sorted the doubled cyclic shifts of length 2, and our
initial cyclic shifts were of length 1.
Play video starting at 1 minute 30 seconds and follow transcript1:30
So we have our array class of the equivalence classes of the cyclic shifts of length 1, which is basically
letters. And remember that this array has one element which is equal to 0 which corresponds to the
dollar, and it is in position six. We have four elements which are equal to 1 which correspond to
letters a in positions 0, 2, 4, and 5. And we have two elements which are equal to 2 which correspond
to letters b in positions 1 and 3. So these are the equivalence classes of the single cyclic shifts. Now for
the double cyclic shifts we can write them down in the order because we've already sorted them. And
we know the new order which is the order of the double cyclic shifts. They go 6, 5, 4, 0, 2, 1, 3. From
$a to ba.
Play video starting at 2 minutes 27 seconds and follow transcript2:27
And along with each doubled cyclic shift, we'll also write down the pair of the equivalence classes of
its halves. For example, for $a, the equivalence class for dollar is 0 and the equivalence class for a is 1.
So it corresponds to pair 0, 1. And for ab, for example, the equivalence class of a is 1, and equivalence
class of b is 2. So we write down the pair 1, 2.
Play video starting at 2 minutes 54 seconds and follow transcript2:54
These are the pairs of the equivalents classes of the single cyclic shifts. And now we need to compute
the equivalence classes of the doubled cyclic shifts. And write them down into the array newClass. To
do that, we go through the double cyclic shifts in the sorted order using array newOrder. And we start
from the first one, which is $a. And we write down value 0 for its class in position 6 because it is in
position 6 as we see from the array newOrder.
Play video starting at 3 minutes 27 seconds and follow transcript3:27
Then we proceed to the next doubled cyclic shift. And to assign a class to it, we need to compare it to the previous one. Of course, in this picture we could directly compare this doubled cyclic shift with the previous one and determine that it's different. But in practice, in the general case, we don't want to do that. Instead of comparing the cyclic shifts directly, we compare the pairs of numbers written to the right of them, and we see that the pair 1, 0 is different from the pair 0, 1.
Play video starting at 3 minutes 58 seconds and follow transcript3:58
And we do this comparison just by two comparisons of numbers instead of comparing full cyclic shifts.
Play video starting at 4 minutes 6 seconds and follow transcript4:06
Since this doubled cyclic shift is different, we need a new class for it, and we assign it class 1 and write it into position 5, because this is the position of this doubled cyclic shift as we see from the array newOrder. Now we proceed to the next one, which is aa. We again compare it with the previous one, by pairs: 1, 1 and 1, 0 are different pairs, so we write down a new class again, class 2, in position 4, as
given in the array newOrder.
Play video starting at 4 minutes 35 seconds and follow transcript4:35
Then we proceed to ab. It is again different: 1, 2 is different from 1, 1. So we create a new class 3 and put it in position 0, as given by the array newOrder. Then we look at ab again, and it is the same as the previous ab, as we see from the pairs 1, 2 and 1, 2, which are equal. So we don't need to create a new class; we write down the same class 3 into position 2, as given by the array newOrder. Now look at ba: it is different from ab, so we create a new class 4. And the second ba is of course equal to the previous ba, so we write down 4 in position 3, as given by the newOrder array. So this is how updating of the classes works. Now let's look at the code.

We take modulo n before assigning to midPrev. Do we need to also take modulo n before assigning to mid?

Yes
No

The correct answer is Yes. This is because mid is a position in the string, so it should be between 0 and n - 1, and we need to go by L positions clockwise in terms of the cyclic string.
Play video starting at 5 minutes 22 seconds and follow transcript5:22
So the procedure UpdateClasses does exactly what we did in the example. It takes as input the array newOrder, the order of the doubled cyclic shifts; it also takes the classes of the single cyclic shifts; and it takes the length of the single cyclic shifts as input. And it will return the array with the equivalence classes of the doubled cyclic shifts as a result.
Play video starting at 5 minutes 50 seconds and follow transcript5:50
First we initialize the variable n with the size of newOrder. Basically, n will be equal to the length of the string, but we don't have the string as an input, so we need a variable for its length. And we initialize the array newClass as an array of size n.
Play video starting at 6 minutes 7 seconds and follow transcript6:07
And first we assign class 0 to the smallest doubled cyclic shift, which is given by newOrder[0]. Then we go through all the doubled cyclic shifts from position 1 to n - 1, and we need to compare the doubled cyclic shift number i with the doubled cyclic shift number i - 1. To do that, we first compute their starting positions: cur is the starting position of the doubled cyclic shift number i, and prev is the position of the previous one. We also need to compute the positions where their second halves start, because we need to compare them half by half. So cur and prev are the starting positions of the doubled cyclic shifts, and mid and midPrev are the starting positions of their second halves. To compute them, we just take the position L to the right, clockwise: we add L and take everything modulo n, which is the length of the string.
Play video starting at 7 minutes 12 seconds and follow transcript7:12
And now we do just what we did in the example. We compare the classes of the current position and the previous position, and the classes of the starting positions of their second halves. If at least one of the halves is different, it means that the pair is different from the previous one, and we need to create a new class: we increase the current class by 1 and assign it to the current position. Otherwise, the pair is the same as the previous one, and we don't need to create a new class; we just assign the same class to the current position. And then we return the array with the new classes of the doubled cyclic shifts.
Play video starting at 7 minutes 50 seconds and follow transcript7:50
We state that the running time of this algorithm UpdateClasses is linear.
Play video starting at 7 minutes 55 seconds and follow transcript7:55
And that's easy to prove because, basically, we only have one for loop with a linear number of iterations and constant-time operations happening inside.
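Here is a minimal Python sketch of UpdateClasses matching the walkthrough above; again, the names are mine.

def update_classes(new_order, cls, L):
    # Equivalence classes of the doubled (length-2L) cyclic shifts, computed by
    # comparing neighbouring shifts as pairs of classes of their halves.
    n = len(new_order)
    new_cls = [0] * n
    new_cls[new_order[0]] = 0
    for i in range(1, n):
        cur, prev = new_order[i], new_order[i - 1]
        mid, mid_prev = (cur + L) % n, (prev + L) % n
        if cls[cur] != cls[prev] or cls[mid] != cls[mid_prev]:
            new_cls[cur] = new_cls[prev] + 1    # the pair differs: new class
        else:
            new_cls[cur] = new_cls[prev]        # equal pair: same class
    return new_cls

On the example above (newOrder = [6, 5, 4, 0, 2, 1, 3], the single-character classes, L = 1) it assigns classes 0, 1, 2 to positions 6, 5, 4, class 3 to positions 0 and 2, and class 4 to positions 1 and 3, matching the walkthrough.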

Full Algorithm

Now to the full algorithm for building the suffix array, finally.
Play video starting at 6 seconds and follow transcript0:06
So the procedure BuildSuffixArray takes only the string S and returns the order of the cyclic shifts, or equivalently of the suffixes, of this string. We assume that S already has $ at the end, and that $ is smaller than all the other characters in the string.
Play video starting at 27 seconds and follow transcript0:27
We start with sorting the characters, the single-character cyclic shifts of S, and save the result in the array order. We also compute the equivalence classes of those characters and save the result in the array class. And we initialize the current length L as one. Then we have the main loop, which proceeds while the current length is still strictly less than the length of the string. If it is, then we first need to sort the doubled cyclic shifts of length 2L.
Play video starting at 58 seconds and follow transcript0:58
And then we also need to update their equivalence classes so 
that the next iteration can use them to again sort the doubled cyclic shifts.
Play video starting at 1 minute 7 seconds and follow transcript1:07
And then we just multiply L by 2 and go on in our while loop until we reach the point where L is greater than or equal to the length of S.
Play video starting at 1 minute 18 seconds and follow transcript1:18
And by that time the array order will contain the correct order of all the full cyclic shifts of the string S, which is the same as the correct order of all the suffixes of the string S, since it has a $ at the end.
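Putting the earlier sketches together (sort_characters, compute_char_classes, sort_doubled and update_classes), a minimal Python version of BuildSuffixArray could look like this:

def build_suffix_array(s):
    # s must end with a sentinel (here "$") smaller than every other character.
    order = sort_characters(s)
    cls = compute_char_classes(s, order)
    L = 1
    while L < len(s):
        order = sort_doubled(s, L, order, cls)
        cls = update_classes(order, cls, L)
        L *= 2
    return order

print(build_suffix_array("ababaa$"))   # [6, 5, 4, 2, 0, 3, 1]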

And the running time of the BuildSuffixArray procedure is the length of S times its logarithm, plus the size of the alphabet. The size of the alphabet appears because of the counting sort of the characters in the beginning: we sorted them in time proportional to the length of the string plus the size of the alphabet. But if we wanted, we could just sort them in time S log S without using counting sort, so we could actually remove the plus-alphabet term from the BuildSuffixArray asymptotics. In practice, though, the alphabet is usually very small, so we don't need to do that, and using counting sort is better than actually sorting the characters in S log S time.
Play video starting at 2 minutes 20 seconds and follow transcript2:20
We also compute the classes of the characters in linear time after that.
Play video starting at 2 minutes 25 seconds and follow transcript2:25
In each while loop iteration, we both sort the doubled cyclic shifts and update their classes in linear time.
Play video starting at 2 minutes 33 seconds and follow transcript2:33
And we have only a logarithmic number of iterations, because L is doubled in every iteration, and as soon as it becomes at least the length of S, we stop. So there is only a logarithmic number of iterations, and all in all the while loop runs in time S log S. Adding to that the initialization cost, we get S log S plus the size of the alphabet.

Play video starting at 2 minutes 56 seconds and follow transcript2:56


So now you finally can build the suffix array of a string S in time S log S using a linear amount of memory. And you can do not only that: you can also sort all cyclic shifts of a string in the same time. And you know that a suffix array enables many fast operations with the string. Also, in the next lesson you will learn to build the suffix tree of the string from its suffix array in linear time. Combined, that will give you an algorithm to build a suffix tree in time S log S. Of course, you already knew how to build a suffix tree in quadratic time, but S log S is much, much better than that. So you will learn that in the next lesson.

Slides and External References

Download the slides on suffix array:

14_algorithmic_challenges_2_suffix_array.pdfPDF File

References
See chapter 4 in [CHL01] Maxime Crochemore, Cristophe Hancart, Thierry Lecroq. Algorithms on
Strings, Cambridge University Press, 2001.

Review the lecture on the Counting Sort. Also see this answer for an example of difference
between stable sorting and a non-stable sorting algorithms. Counting Sort is a stable sort.

QUIZ • 12 MIN

Suffix Array Construction


From Suffix Array to Suffix Tree
Suffix Array and Suffix Tree

Hi, in this lesson you will learn how to build the suffix tree of a string, given its suffix array, in linear time. At first we'll explore some connections between the suffix array and the suffix tree, then we'll learn to compute some additional information along with the suffix array, and finally we will use the suffix array and this additional information, called the LCP array, to build the suffix tree.
Play video starting at 22 seconds and follow transcript0:22
First, recall the problem. It's very simple: you're given a string S and you need to compute its suffix tree. You already know how to do that, actually, but the algorithm you know works in quadratic time, and so it will work only for short strings, maybe up to 1,000 or 10,000 characters. If you want to build the suffix tree for strings of length in the millions or billions, you will need a much faster algorithm. After this lesson, you will know how to build the suffix tree in time equal to the length of the string times the logarithm of this length, because you can build the suffix array in this time and then construct the suffix tree from the suffix array in linear time.
So the general plan is to construct the suffix array in time S log S, then compute some additional information called the LCP array in linear time, and then, given both the suffix array and this additional information, construct the suffix tree in linear time. First, let's explore how the suffix array and the suffix tree are connected. Here we have a string S, ababaa$. Again, we append $ at the end, which is smaller than any of the characters of the string, both to build the suffix array and then to build the suffix tree from it. And on the left we have, in a column, all the suffixes of the string S sorted in lexicographic order, so that is basically the suffix array.
Play video starting at 1 minute 54 seconds and follow transcript1:54
And on the right, we have the fully built suffix tree of the string, which is already compressed so that
you see that on the edges we have not single letters but whole sub strings of string S. And by the way,
interesting question is how do we store suffix tree? We shouldn't, of course, store the sub strings that
are written on the edges directly because that could lead to quadratic memory usage and we want
linear memory usage. So instead of storing the substrings themselves, we just store the start index and the end index of the corresponding substring. So for each edge, we store two indices: where that edge's label starts in the string and where it ends. And to store the nodes, we just store, for example, an array of pointers to the children nodes, and that array is indexed by the first character of the edge going from this node into the child.
Play video starting at 2 minutes 55 seconds and follow transcript2:55
And we can store the information about the edge itself in the node into which this edge goes from its parent.
Play video starting at 3 minutes 2 seconds and follow transcript3:02
This is one of the ways to store everything but you may organize everything in another way.
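As one hedged illustration of such a layout (certainly not the only possible one), a node could look like this in Python:

class SuffixTreeNode:
    # The edge leading from the parent into this node is stored as a pair of
    # indices into the string, never as the substring itself, so memory stays linear.
    def __init__(self, parent=None, depth=0, edge_start=-1, edge_end=-1):
        self.parent = parent
        self.depth = depth            # number of characters on the path from the root
        self.edge_start = edge_start  # edge label is s[edge_start : edge_end + 1]
        self.edge_end = edge_end
        self.children = {}            # first character of an outgoing edge -> child node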
Play video starting at 3 minutes 10 seconds and follow transcript3:10
The important thing is that you shouldn't store edges as substrings. So what corresponds in the suffix
tree to the suffix array elements?
PPT Slides
Let's take the first element of the suffix array. It corresponds to a suffix of the string S, and that corresponds to a leaf in the suffix tree, and also to the path from the root vertex to the corresponding leaf vertex in the tree. So the first element of the suffix array corresponds to the path highlighted in blue. Then, if we go to the next element of the suffix array, we get another path, from the root vertex to the leaf number 1. And then, if we go to the next element, we get the path from the root vertex to the leaf number 2. Note that the indices of the leaves and the indices of the suffixes are just in the sorted order: those are not positions in the string S, those are numbers of the suffixes in increasing order, from 0 to the number of suffixes minus 1. So each of the elements of the suffix array corresponds to some path from the root to a leaf in the suffix tree; that is what we know. Unfortunately, that is not yet sufficient to build the tree from the suffix array, because there are many ways to create paths from the root to different nodes which correspond to the suffixes of the suffix array. So we will need some additional property.
And this additional property we will need is called the longest common prefix, often abbreviated as LCP.
Play video starting at 5 minutes 5 seconds and follow transcript5:05
So the LCP of two strings S and T is the longest string u such that u is both a prefix of S and a prefix of T. And we denote by LCP(S, T) the function which returns the length of the LCP of strings S and T. For example, LCP("ababc", "abc") is 2, because their longest common prefix is ab, and its length is 2. And LCP("a", "b") = 0, because their longest common prefix is empty.
PPT Slides
Play video starting at 5 minutes 39 seconds and follow transcript5:39
Now let's look again at the suffix array and suffix tree and also take into account LCP between the
neighboring elements of the suffix array. So when we look at the first element of the suffix array, we
just have an edge corresponding to it in the suffix tree. And when we have the next element, we have
a path from root to another vertex. But if we compute the longest common prefix of this element of
the suffix array with the previous element of the suffix array, we'll see that this longest common
prefix is empty. And that corresponds to empty intersection, between the previous path and the new
path. The only common node is the root node, and they don't have any edges in the intersection.
However, if we proceed to the next suffix,
Play video starting at 6 minutes 35 seconds and follow transcript6:35
it has a common prefix of length 1 with the previous suffix. And it corresponds to the common path in
the tree highlighted in yellow, starting in the root node and going through edge a to another node
which is still a common node for the current suffix and the previous one. So this is how LCP
corresponds to the tree. If you go to the next suffix, it again has the same longest common prefix with
the previous suffix. So we have the same common path from root to the next node by edge a and
then the part of the path is different from the current suffix and from the previous one.
Play video starting at 7 minutes 17 seconds and follow transcript7:17
If we go to the next suffix, their longest common prefix with the previous one is even longer, and so
the common part of the path is now
Play video starting at 7 minutes 28 seconds and follow transcript7:28
consisting of three nodes and two edges. A root node, next node by edge a and next node by edge ba.
And the rest of the path is unique to the current suffix. If we go to the next suffix, it again doesn't
have any common prefix with the previous one so the only common in the path is the root node. And
the next suffix has longest common prefix of ba, and that's why we see this path from root to another
node via edge ba. And this is the common part of the path for the current suffix and the previous one.
So we see that basically all the nodes but the leaves correspond to the longest common prefixes of neighboring suffixes in the suffix array. And this is how we can actually build the suffix tree: by first computing the longest common prefixes of the neighboring elements in the suffix array and then building those internal nodes. Along the way, we will also build the leaves as the ending points of the paths corresponding to the suffixes from the suffix array. So this is the plan of what we'll do. But first, we'll need to compute those longest common prefixes for the elements of the suffix array.

LCP Array

So we define the LCP array. Let's consider the suffix array A of string S in its raw form, that is, A[0] is a suffix, A[1] is a suffix, and so on up to A[|S| - 1]; all those are suffixes of S in lexicographic order. Then the LCP array of string S is the array lcp of size equal to the length of S minus one; it contains one element fewer than the suffix array and the string itself. Each element lcp[i] is equal to the length of the longest common prefix of A[i] and A[i+1]. So it's the longest common prefix of two neighboring elements in the suffix array, and what we want is to compute the values of this array.
PPT Slides
Play video starting at 51 seconds and follow transcript0:51
For example, if we have our string ababaa$, then we first compute the longest common prefix of $ and a$, which is 0. Then we compute the longest common prefix of a$ and aa$, which is a, of length 1. Then it's again a, of length 1. Then it's aba, of length 3. Then it's empty. And then it's ba, of length 2. So the LCP array for this string is 0, 1, 1, 3, 0, 2.

And the central LCP array property, which will enable us to compute it fast, is that for any indices i and j in the suffix array, where i is less than j, the longest common prefix of A[i] and A[j], which are far from each other, is not bigger than lcp[i], which is the longest common prefix of A[i] and the next element. So what I'm saying with this lemma is that the LCP of two neighboring elements is always at least as big as the LCP of the first of them with any of the later elements. And the same goes the other way: the LCP of two neighboring elements is at least as big as the LCP of the second of them with any of the earlier elements.
Play video starting at 2 minutes 15 seconds and follow transcript2:15
And to see that let's look at some hypothetical example that we have some long suffix array and
elements i and i+1 are here. And also there is some element j farther in the suffix array. And we see
that really the common prefix of suffixes i and i+1 is pretty long.
Play video starting at 2 minutes 38 seconds and follow transcript2:38
It's not so long with suffix j: it's only of length two. But this example doesn't yet prove anything. Maybe for some other suffix number i+1, it could be so that the common prefix of i and j would be bigger than the common prefix of i and i+1. So let's suppose that. We don't know what suffix i+1 is, so we just replace it with many x's, where x is an unknown letter. We know that the LCP of i and j is equal to 2. So let's consider k, which is the length of the longest common prefix of A[i] and A[i+1], and we suppose that it is smaller than 2 in this case.
Play video starting at 3 minutes 26 seconds and follow transcript3:26
So how can that be? One variant is that A[i+1] is shorter than 2, and then A[i+1] is actually a prefix of A[i]. But in this case A[i+1] is smaller than A[i], which contradicts the property of the suffix array that the suffixes are sorted. And if suffix i+1 is sufficiently long, then it follows that its kth character is different from the kth character of both the ith suffix and the jth suffix. And in this case there are again two cases. The first case is that this character in suffix i+1 is bigger than the corresponding one in strings i and j. But from this it immediately follows that suffix i+1 is bigger than suffix j, which contradicts the suffix array properties, so it is impossible.
Play video starting at 4 minutes 19 seconds and follow transcript4:19
And the other case is that this character is less than the corresponding character in both strings i and j. But in this case it immediately follows that A[i] is bigger than A[i+1], which again contradicts the suffix array property. So in all cases we found a contradiction, and so it is not possible that the longest common prefix of i and j is bigger than the longest common prefix of i and i+1. And we have proved the LCP array property, because for the symmetric case the proof is analogous.

Now, how do we compute the LCP array? One variant is to go, for each i, compare A[i] and A[i+1] character by character, and compute the LCP directly.
Play video starting at 5 minutes 7 seconds and follow transcript5:07
But this will take linear time for each i, and in total it will be quadratic in the length of the string, while we want to compute everything in linear time. So how do we do this faster? You will learn that in the next video.

Computing the LCP Array

Hi, in this video you will learn how to compute the LCP array in linear time. The main idea is the following. We'll start by computing the LCP of the two smallest suffixes directly, by comparing them character by character. But then, on each next iteration, instead of going to the next pair of suffixes in the suffix array, we move the smaller suffix one position to the right in the string and then compute its LCP with the next suffix in the suffix array. So we won't go in the usual order through the suffix array; we will go in some strange order. But this order is good: we will show that if we go in this order through the smaller suffixes, then the LCP of the smaller suffix and the next suffix decreases by at most one on each iteration. And so we will know that we have already compared many of the characters of the two new suffixes, and we don't need to compare them again. We'll start from there and compare the following characters directly. The LCP itself will still be very easy to compute, because we will still do that by direct comparison of characters; we will just avoid some of the comparisons, because from the previous iterations we will know that the common prefix has at least a certain length, and we don't need to compare that many of the first characters. And in the end, it turns out this works in linear time.

So, for the suffix A[i], we will consider the suffix starting at the next position in the string, one position to the right of where A[i] starts. Note that this is not the next suffix in the suffix array, which would be A[i+1], but the suffix that starts one position to the right in the string.
Play video starting at 1 minute 54 seconds and follow transcript1:54
So here's an example. We have the string ababdabc, and the smallest suffix is ababdabc: the whole string is actually the smallest suffix, and the next one in the suffix array is abc. Their longest common prefix is ab, and here we see it. And we compute this longest common prefix, which has length two, directly.
Play video starting at 2 minutes 19 seconds and follow transcript2:19
And then we know that if we move to the next two suffixes in the string, the ones right after A[0] and right after A[1], those will both start with the letter b. So the length of the common prefix decreased by at most one, because both suffixes just moved one position to the right, and we cut away only one position of the longest common prefix. Of course, in the general situation these two suffixes are probably not neighboring suffixes in the suffix array; it might be that there is some suffix between them. But because of the property of the LCP array, the longest common prefix of the smaller suffix with the next suffix in the suffix array will be even bigger than, or at least the same as, its longest common prefix with the next suffix in the string, because the next suffix in the string is bigger than the smaller one, and by the property of the LCP array, the common prefix with the next element in the suffix array is the same or bigger than the common prefix with some element farther away in the suffix array. So now we can move to the next position from the smaller suffix, then take the suffix next to it in the suffix array, and compute their LCP directly, remembering that we don't need to compare the first several characters, exactly those which are in the LCP of the previous pair. So this is basically the algorithm.
We compute LCP(A[0], A[1]) directly and save its value in the variable lcp. Then, on each iteration, the first suffix in the pair, which is the smaller one, moves to the next position in the string; then we find which suffix is next to it in the suffix array order, and we compute their longest common prefix, knowing that we don't need to compare the first lcp - 1 characters.
Play video starting at 4 minutes 26 seconds and follow transcript4:26
And then, on each comparison, if it's successful, we increase lcp and go to the next comparison, and we repeat that until we fill the whole LCP array. The idea is that whenever we make a successful comparison, we increase lcp, and when we move to the next pair, we decrease lcp by at most one. And this is why we cannot do too many character comparisons overall.
So the Lemma states that this algorithm computes LCP array in linear time.
Play video starting at 4 minutes 55 seconds and follow transcript4:55
And the proof is now easy. Consider each character comparison we make between one suffix and another.
Play video starting at 5 minutes 2 seconds and follow transcript5:02
Either it is an unsuccessful comparison that finishes the iteration, and the number of such comparisons is at most the number of iterations, which is at most the length of the string; or it is a successful comparison, and then it increases the current value of the variable lcp.
Play video starting at 5 minutes 18 seconds and follow transcript5:18
And the variable lcp can never be bigger than the length of the string, and at each iteration it decreases by at most one. So, since we start from zero, cannot go higher than the length of S, and decrease by at most one per iteration, we cannot make more than a linear number of increases of lcp, and so we cannot make more than a linear number of successful comparisons. And this is why this algorithm works in linear time. Now that we can compute the LCP array, we can proceed in the next video to construct the suffix tree given the suffix array and the LCP array.
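One possible Python rendering of this idea (essentially the algorithm of Kasai et al.; the course's own pseudocode is in the attached slides, and the helper names here are illustrative):

def invert_suffix_array(order):
    # pos[start of suffix] = its rank in the suffix array
    pos = [0] * len(order)
    for rank, start in enumerate(order):
        pos[start] = rank
    return pos

def lcp_of_suffixes(s, i, j, skip):
    # Compare suffixes s[i:] and s[j:] character by character, skipping the
    # first max(skip, 0) characters, which are already known to be equal.
    lcp = max(skip, 0)
    while i + lcp < len(s) and j + lcp < len(s) and s[i + lcp] == s[j + lcp]:
        lcp += 1
    return lcp

def compute_lcp_array(s, order):
    n = len(s)
    lcp_array = [0] * (n - 1)
    pos_in_order = invert_suffix_array(order)
    lcp = 0
    suffix = order[0]                      # start from the smallest suffix
    for _ in range(n):
        order_index = pos_in_order[suffix]
        if order_index == n - 1:           # the largest suffix has no right neighbour
            lcp = 0
        else:
            next_suffix = order[order_index + 1]
            lcp = lcp_of_suffixes(s, suffix, next_suffix, lcp - 1)
            lcp_array[order_index] = lcp
        suffix = (suffix + 1) % n          # move one position to the right in the string
    return lcp_array

print(compute_lcp_array("ababaa$", [6, 5, 4, 2, 0, 3, 1]))   # [0, 1, 1, 3, 0, 2]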

Computing the LCP Array - Additional Slides

We encourage you to review the slides attached here, as they contain an additional example
regarding LCP array construction and the pseudocode in the section "LCP Array Computation".

Construct Suffix Tree from Suffix Array and LCP Array


Hi. In this lecture you will finally learn how to build suffix tree of a string, given its suffix array and the
LCP array. And we will do everything on this example. We have our string, we have our sorted suffixes
in the order, and we start building the tree from just the root vertex. We consider the first suffix, and
we create a leaf in the tree corresponding to the suffix. And we connect it with edge to the root node.
Play video starting at 27 seconds and follow transcript0:27
When we go to the next suffix, we see that the longest common prefix of this suffix with the previous one is empty; we know that from the array. And so the only common part of the path from the root to the new leaf and the path to the previous leaf is the root node. So we don't need to create any new nodes other than the leaf node for the new suffix, and we create an edge from the root directly to this new node, and we write down the corresponding suffix on this edge. When we go to the next suffix,
however, there is already a common prefix of length one, which is a, so we'll need to create a new
node in the middle of this last created edge. And we'll divide it into edges with letter a and with letter
$. And this is what happens. So now we have a new node which is connected with letter a to the root
and from which, there are two outgoing edges, one with the letter $, and another with string a$,
corresponding to the last considered suffix. Now, when we consider the next suffix, its longest common prefix with the previous one is again just a, so we don't need to create any new node other than the leaf node. We create a leaf node for the new suffix abaa$ and connect it to the yellow node, which corresponds to the longest common prefix with the previous suffix, with an edge on which we write baa$, everything that is left of the suffix.
Play video starting at 2 minutes 6 seconds and follow transcript2:06
When we consider the next suffix, its longest common prefix with the previous one is of length 3, so we'll need to subdivide an edge again. And this is what we get. The yellow node is the node
corresponding to the longest common prefix aba of this suffix and the previous one. And then we
create a new leaf node for the suffix ababaa$ and the node with number 4 corresponding to position
4 in the suffix array. Then we can consider the suffix baa$, it doesn't have any common prefix with the
previous ones, so it starts from the root and goes into the new node. And then the next one has a
common prefix with it, ba. So we create a new node after ba, subdivide the edge, and create a new
leaf node for the last suffix with edge baa$. So this is basically what is going to happen. How do we
implement this creation of new nodes and subdividing of edges?
Play video starting at 3 minutes 8 seconds and follow transcript3:08
The following way. When we build an edge to the leaf for some suffix, we go and sit in that leaf node
and when we consider the next suffix, we go up from that node using the pointer to the parent node
in the tree, until we are high enough that the longest common prefix is below us. As soon as we have jumped to the longest common prefix or higher, we stop. How do we know whether we're higher than the longest common prefix or not? We need to also store the depth in the nodes, and the depth is the
number of characters on the path from the root to this node. This is easy to keep during the suffix
tree construction, so I'll just assume we have it. So we go up from the leaf until we are in the longest
common prefix or above it. If we're exactly in the longest common prefix with the previous suffix, we
don't need to build any new nodes. We just build a leaf for the new suffix, and connect it with the
current node. However, if we are higher than the longest common prefix, then we'll need to create a
new node in the middle of the current edge, going down from our node in the direction of the longest
common prefix. So we divide this edge in the middle: we create a new node which corresponds to the longest common prefix with the previous suffix, and we create a new leaf node, as usual, for the suffix, and connect it to this new node. So this is the whole algorithm.

Play video starting at 4 minutes 39 seconds and follow transcript4:39


To repeat: to build the suffix tree from scratch, we first build the suffix array and then build the LCP array from it. We start building the tree from just the root vertex. We grow the first edge, for the first suffix, from the root. Then, for each next suffix, sitting in the leaf we just built for the previous suffix, we go up from that leaf until we are at the depth of the LCP with the previous suffix or higher. Then we build a new edge and a new leaf for the new suffix, and maybe, depending on where we are in the tree, we first need to subdivide the current edge.
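A hedged Python sketch of this construction, reusing the SuffixTreeNode layout sketched earlier in this lesson; the helper names create_leaf and break_edge are mine.

def suffix_array_to_suffix_tree(s, order, lcp_array):
    n = len(s)
    root = SuffixTreeNode(parent=None, depth=0)
    cur_node = root
    lcp_prev = 0
    for i in range(n):
        suffix = order[i]
        while cur_node.depth > lcp_prev:       # climb until the LCP node is at or below us
            cur_node = cur_node.parent
        if cur_node.depth == lcp_prev:
            cur_node = create_leaf(cur_node, s, suffix)
        else:
            # We are above the LCP: split the edge going down towards the
            # previous leaf, then hang the new leaf from the new middle node.
            edge_start = order[i - 1] + cur_node.depth
            mid = break_edge(cur_node, s, edge_start, lcp_prev - cur_node.depth)
            cur_node = create_leaf(mid, s, suffix)
        if i < n - 1:
            lcp_prev = lcp_array[i]
    return root

def create_leaf(node, s, suffix):
    # Leaf for the suffix s[suffix:]; its edge label is the part of the suffix below node.
    n = len(s)
    leaf = SuffixTreeNode(parent=node, depth=n - suffix,
                          edge_start=suffix + node.depth, edge_end=n - 1)
    node.children[s[leaf.edge_start]] = leaf
    return leaf

def break_edge(node, s, start, offset):
    # Split the child edge of node whose label begins with s[start]; the new
    # middle node keeps the first offset characters of that edge label.
    start_char, mid_char = s[start], s[start + offset]
    child = node.children[start_char]
    mid = SuffixTreeNode(parent=node, depth=node.depth + offset,
                         edge_start=child.edge_start,
                         edge_end=child.edge_start + offset - 1)
    child.parent = mid
    child.edge_start += offset
    mid.children[mid_char] = child
    node.children[start_char] = mid
    return mid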
I state that this algorithm runs in linear time, and that is pretty easy to see. We know from the previous modules that the total number of edges in the suffix tree is linear. During this process, for each edge we go down through it at most once, when we build it, and then we go up through it at most once, when we go up to find the node corresponding to the LCP; after that, when we go down, we don't go through the same edge again, because we've already been there. So for each edge we go through it at most twice, disregarding maybe one additional step per iteration, and we have only a linear number of iterations of the whole algorithm. And the time to create a new edge or a new leaf, or to subdivide an existing edge, is constant, so in total this algorithm works in linear time.
Play video starting at 6 minutes 10 seconds and follow transcript6:10
So now you are fully equipped to build suffix structures, such as the suffix array and the suffix tree, and you can do that pretty fast, in time S log S, where S is the length of the string. That's cool, because you can solve very complex problems using these data structures which are basically impossible to solve without them. And some of them will be in the programming assignment. So see you there.

Suffix Tree Construction - Pseudocode

We encourage you to review the slides attached here, as they contain the pseudocode for suffix tree construction from the suffix array and the LCP array in the section "Constructing Suffix Tree".

Slides and External References


Download the slides on suffix tree:

14_algorithmic_challenges_3_from_suffix_array_to_suffix_tree.pdf PDF File

References
[CP15] Phillip Compeau, Pavel Pevzner. Bioinformatics Algorithms: An Active Learning Approach,
2nd Ed. Vol. 1. Active Learning Publishers. 2015.

Programming Assignment: Programming Assignment 3


You have not submitted. You must earn 2/4 points to pass.

Deadline: Pass this assignment by Aug 9, 11:59 PM PDT

1. Instructions
2. My submission
3. Discussions
Welcome to your third programming assignment of the Algorithms on Strings class! In this
programming assignment, you will be practicing implementing very efficient string algorithms.

Download instructions and starter files:

Programming Assignment 3.pdfPDF File

Programming Assignment 3.zip

How to submit
When you're ready to submit, you can upload files for each part of the assignment on the "My
submission" tab.
